Fast forward to 2023...
Most of what I wrote below is correct. But unless you have lots of time on your hands (or work on very difficult recordings), I would strongly suggest that you use Whisper, the OpenAI solution for automatic transcription. It requires basic command line skills, but nothing that you can't learn in a 10-minute video tutorial. Your machine specs will also make a difference: my 2015 MBP (2.8 GHz quad-core Intel Core i7, 16 GB of RAM) takes about 1.5 to 2 times the length of the soundtrack to produce a very good transcription (of Japanese interviews) that I then have to edit into something usable.
The GitHub page is here: openai/whisper. It includes plenty of links that you might want to read.
About 4 years ago I got a job where I needed to transcribe about 40 hours of interviews. I wrote most of this article then and let it rest until now. The solution I propose is a workable way to transcribe audio/video and can also be used as a practical introduction to AppleScript. I just tested everything in High Sierra, with the current versions of all the mentioned applications.
This article also demonstrates how a few macOS technologies can be put together to create a very robust and integrated solution in a number of very easy steps. The idea is:
- Find a process that you need to automate
- Use AppleScript to code the automation
- Use Automator to create a system-wide service to access the automation
- Use System Preferences to assign a shortcut to the service, either available system-wide, or only in a given application
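To make the AppleScript and Automator steps concrete, here is a minimal sketch of what the body of an Automator "Run AppleScript" action could look like. The `on run {input, parameters}` wrapper is the template Automator provides; the play/pause toggle inside it is only an illustration, and it assumes that a movie or audio file is already open in QuickTime Player.

```applescript
-- Sketch of an Automator "Run AppleScript" action (illustrative only).
-- Assumes a file is already open in QuickTime Player.
on run {input, parameters}
	tell application "QuickTime Player"
		-- Nothing open: do nothing and hand the input back unchanged
		if (count of documents) is 0 then return input
		if playing of document 1 then
			pause document 1
		else
			play document 1
		end if
	end tell
	-- A service that does not produce text simply passes its input through
	return input
end run
```

Saved as a service, an action like this shows up in the Services menu, which is what lets System Preferences attach a keyboard shortcut to it.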
Update (the day after...): a comment on Reddit says that the title is not accurate because I do not propose speech recognition. For people not familiar with transcription: plain speech recognition is not a solution, because it requires two things, good sound quality and software that has been trained on the speaker's voice. Most transcription jobs offer neither. It is possible, however, to dictate the audio that you are listening to. In that case you need the same tools as described here, plus macOS dictation if you want to stick to software bundled with macOS.
Homemade transcription software...
- QuickTime Player
- AppleScript / Script Editor
- System Preferences
QuickTime Player and AppleScript
System Preferences and Services shortcuts
After "Play": "Pause" and "Rewind"
But what about inserting time codes in your document?
The resulting string that is put into the clipboard would be something like:
A strong alternative to QuickTime Player: VLC
There is an alternative to QuickTime Player that has none of the problems we just described. The software is VLC. It is Free Software and is available directly from the development site. You can make donations to contribute to the development too.
There are a lot of areas where the above code can be improved, but the solutions we have work well enough and can be the basis for a lot of other relatively simple developments.
I've changed the code a bit after testing on a real job (in VLC).
First, the time code:
- a line break → move to the next line
- a time code → insert the time code
- another line break → move to the next line
so that I can start to type right away.
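The line break / time code / line break sequence can be sketched in plain AppleScript as follows. This is not the code I used on the job, just a minimal illustration: the handler names (`twoDigits`, `formatTimeCode`, `timeCodeBlock`) and the HH:MM:SS layout are assumptions, and so is the use of `linefeed` rather than a carriage return as the line break.

```applescript
-- Hypothetical helper: left-pad a number to two digits ("5" -> "05")
on twoDigits(n)
	return text -2 thru -1 of ("0" & n)
end twoDigits

-- Hypothetical helper: turn a number of seconds into HH:MM:SS
-- (hours above 99 would be truncated to their last two digits)
on formatTimeCode(totalSeconds)
	set totalSeconds to totalSeconds as integer
	set h to totalSeconds div 3600
	set m to (totalSeconds mod 3600) div 60
	set s to totalSeconds mod 60
	return twoDigits(h) & ":" & twoDigits(m) & ":" & twoDigits(s)
end formatTimeCode

-- Line break, time code, line break: the cursor ends up on a fresh line
on timeCodeBlock(totalSeconds)
	return linefeed & formatTimeCode(totalSeconds) & linefeed
end timeCodeBlock

-- Example call with an arbitrary position of 3725 seconds
timeCodeBlock(3725)
```

Keeping the formatting in its own handler makes it easy to switch to another layout (brackets, MM:SS only, and so on) without touching the part that talks to the player.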
I have also changed the way the Automator service works. Instead of feeding the clipboard and pasting the time code myself, I have the code return the string (return TCstring), and I checked the [Output replaces selected text] box at the top of the Automator actions list.
That way the returned value is automatically pasted where I have the cursor.
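Here is a minimal sketch of that "return instead of clipboard" pattern, under a few assumptions: VLC is running with a file loaded, its `current time` property gives the position in seconds, and the time code is left as raw seconds to keep the focus on the return mechanism. Only the variable name `TCstring` comes from my own code; the rest is illustrative.

```applescript
-- Sketch of an Automator "Run AppleScript" action whose result is pasted
-- at the cursor when [Output replaces selected text] is checked.
on run {input, parameters}
	tell application "VLC"
		-- Assumption: position of the current item, in seconds
		set elapsedSeconds to current time
	end tell
	-- Raw seconds wrapped in line breaks; real formatting is left out here
	set TCstring to linefeed & (elapsedSeconds as text) & linefeed
	-- No clipboard involved: the service output is the returned value
	return TCstring
end run
```

Because nothing goes through the clipboard, whatever you had copied before calling the service is still there afterwards.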
Another, minor, modification, in the Step Backward function: I changed the delay from 2 to 1, which is in fact more than enough when you just need to clarify a sound. A delay of 2 requires that you wait too long before you can resume typing.
Now, you must be really careful about the shortcuts so that they don't interfere with normal navigation in the text.
I chose the following:
Control + ] for "VLC Play+Pause"
Control + [ for "VLC Rewind"
Control + ↓ for "VLC Time Code"
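For reference, a service like "VLC Play+Pause" can be very short. The sketch below is an illustration rather than my exact code, and it relies on one assumption about VLC's scripting dictionary: that the `play` command starts playback and pauses it when the item is already playing, so that a single shortcut can toggle between the two.

```applescript
-- Sketch of a "VLC Play+Pause" service body (illustrative only).
-- Assumes VLC's `play` command toggles between play and pause.
on run {input, parameters}
	tell application "VLC" to play
	-- Nothing to paste: pass the input through unchanged
	return input
end run
```

For a service like this one, leave [Output replaces selected text] unchecked, since it is not supposed to touch the text you are typing.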
I've just finished transcribing a short 6-minute interview with this setup and everything worked like a charm.