Subtitles using ffmpeg

New in ffmpeg 8.0 is the inclusion of Whisper (OpenAI’s speech-recognition model, integrated via the whisper.cpp library), which lets you transcribe the audio in video files to subtitle files. But how does it work?

First we should note that your flavor of Linux might not have the latest version of ffmpeg in its repository. Or you might even be using Windows *shivers*. So the easiest way is to download a static build for your OS of choice: https://www.ffmpeg.org/download.html

wget https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz
tar xJf ffmpeg-master-latest-linux64-gpl.tar.xz --strip-components=2 ffmpeg-master-latest-linux64-gpl/bin/ffmpeg
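Before going further it’s worth checking that the build you downloaded actually has the whisper filter compiled in (ffmpeg has to be built with whisper support for this to work; the line below prints nothing if it’s missing):

```shell
# Quick sanity check: list ffmpeg's filters and look for 'whisper'.
# No output here means this build was compiled without whisper support.
./ffmpeg -hide_banner -filters | grep whisper
```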

Next we need to download the language model we’re going to use. It’s basically the ‘brains’ of the operation. We’ll go with the default one for now. You can view a list of them here: https://huggingface.co/ggerganov/whisper.cpp/tree/main

wget 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin'
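If you want to trade speed for accuracy later, the other models on that page follow the same URL pattern. As a rule of thumb, the `.en` models are English-only and the ones without `.en` are multilingual. A hedged sketch (the model name here is just an example, swap in any name from the list):

```shell
# Pick a model; larger models are slower but generally more accurate.
MODEL=ggml-small.bin   # example name; any model from the whisper.cpp repo works
wget "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/${MODEL}"
```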

Now that we have all we need, pick a media file and we’ll start:

./ffmpeg -i movie.mp4 -vn -af \
  "whisper=model=ggml-base.en.bin:language=en:queue=3:destination=movie.srt:format=srt" \
  -f null -

This tells ffmpeg to ignore the video and run a filter on the audio. If your media is in a language other than English, be sure to change the ‘language’ option.
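For example, for German audio you might run something like the sketch below. Note the assumptions: whisper.cpp takes ISO 639-1 codes like `de`, and you’d need one of the multilingual models (the English-only `ggml-base.en.bin` won’t do here); the file names are illustrative.

```shell
# Sketch: transcribe German audio with a multilingual model.
lang=de
model=ggml-base.bin   # multilingual variant, not the .en one
./ffmpeg -i film.mp4 -vn -af \
  "whisper=model=${model}:language=${lang}:queue=3:destination=film.srt:format=srt" \
  -f null -
```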

After some time (it really does take a lot of CPU power) you’ll end up with an SRT file that most modern media players can use to show you subtitles. If the results are not great you can try again with another model, but keep in mind that we are at the beginning of the AI age, so results will improve over time.
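If you’d rather not carry a separate .srt file around, you can mux the subtitles into the MP4 as a selectable track. This is plain ffmpeg muxing, nothing whisper-specific; `mov_text` is the subtitle codec MP4 containers use, and nothing gets re-encoded:

```shell
# Embed movie.srt into movie.mp4 as a soft subtitle track (no re-encoding).
./ffmpeg -i movie.mp4 -i movie.srt \
  -map 0 -map 1 -c copy -c:s mov_text \
  -metadata:s:s:0 language=eng \
  movie-subbed.mp4
```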

If you decide to correct the text and chop it up into better chunks, you can look into a ‘forced aligner’ like Aeneas (https://github.com/readbeyond/aeneas) to map the text back to the correct timestamps.
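A rough sketch of what that looks like with Aeneas (assuming it’s installed via pip; the file names are illustrative, and `corrected.txt` is your cleaned-up transcript with one chunk per line):

```shell
# Sketch: re-align a corrected plain-text transcript to the audio,
# producing a new SRT with timestamps recomputed by Aeneas.
python -m aeneas.tools.execute_task \
  movie.mp4 corrected.txt \
  "task_language=eng|is_text_type=plain|os_task_file_format=srt" \
  movie-aligned.srt
```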