Autosubsync
Automatically synchronize subtitles with audio using machine learning
Automatic subtitle synchronization tool
Did you know that hundreds of movies, especially from the 1950s and '60s, are now in public domain and available online? Great! Let's download Plan 9 from Outer Space. As a non-native English speaker, I prefer watching movies with subtitles, which can also be found online for free. However, sometimes there is a problem: the subtitles are not in sync with the movie.
But fear not. This tool can resynchronize the subtitles without any human input. A correction for both shift and playing speed can be found automatically... using "AI & machine learning".
Installation
macOS / OSX
Prerequisites: Install Homebrew and pip. Then install FFmpeg and this package:
brew install ffmpeg
pip install autosubsync
Linux (Debian & Ubuntu)
Make sure you have Pip, e.g., sudo apt-get install python-pip.
Then install FFmpeg and this package
sudo apt install ffmpeg
sudo apt install libsndfile1 # sometimes optional
sudo pip install autosubsync
The libsndfile1 package is sometimes, but not always, needed because of https://github.com/bastibe/python-soundfile/issues/258.
Usage
autosubsync [input movie] [input subtitles] [output subs]
# for example
autosubsync plan-9-from-outer-space.avi \
plan-9-out-of-sync-subs.srt \
plan-9-subtitles-synced.srt
See autosubsync --help for more details.
Features
- Automatic speed and shift correction
- Typical synchronization accuracy ~0.15 seconds (see performance)
- Wide video format support through ffmpeg
- Supports all reasonably encoded SRT files in any language
- Should work with any language in the audio (only tested with a few though)
- Quality-of-fit metric for checking sync success
- Python API. Example (save as batch_sync.py):

    "Batch synchronize video files in a folder: python batch_sync.py /path/to/folder"

    import autosubsync
    import glob, os, sys

    if __name__ == '__main__':
        for video_file in glob.glob(os.path.join(sys.argv[1], '*.mp4')):
            base = video_file.rpartition('.')[0]
            srt_file = base + '.srt'
            synced_srt_file = base + '_synced.srt'

            # see help(autosubsync.synchronize) for more details
            autosubsync.synchronize(video_file, srt_file, synced_srt_file)
Development
Training the model
- Collect a bunch of well-synchronized video and subtitle files and put them in a file called training/sources.csv (see training/sources.csv.example)
- Run (and see) train_and_test.sh. This
  - populates the training/data folder
  - creates trained-model.bin
  - runs cross-validation
Synchronization (predict)
Assumes trained model is available as trained-model.bin
python3 autosubsync/main.py input-video-file input-subs.srt synced-subs.srt
Build and distribution
- Create virtualenv: python3 -m venv venvs/test-python3
- Activate venv: source venvs/test-python3/bin/activate
- pip install -e .
- pip install wheel
- python setup.py bdist_wheel
Methods
The basic idea is to first detect speech on the audio track, that is, for each point in time, t, in the film, to estimate if speech is heard. The method described below produces this estimate as a probability of speech p(t). Another input to the program is the unsynchronized subtitle file containing the timestamps of the actual subtitle intervals.
Synchronization is done by finding a time transformation t → f(t) that makes the synchronized subtitles, s(f(t)), best match the detected speech, p(t). Here s(t) is the (unsynchronized) subtitle indicator function whose value is 1 if any subtitles are visible at time t and 0 otherwise.
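To make the notation concrete, here is a minimal sketch of the indicator function s(t) and of the composition s(f(t)). It is not the code used in this package: the helper names `make_indicator` and `resynchronized`, the half-open interval convention, and the example intervals are all assumptions made for illustration.

```python
from bisect import bisect_right


def make_indicator(intervals):
    """Build s(t) from (start, end) subtitle intervals given in seconds."""
    # Merge overlapping intervals so one lookup answers "is anything visible?"
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [start for start, _ in merged]

    def s(t):
        i = bisect_right(starts, t) - 1  # last interval starting at or before t
        # Assumption: intervals are half-open, [start, end)
        return 1 if i >= 0 and t < merged[i][1] else 0

    return s


def resynchronized(s, f):
    """Return the function t -> s(f(t))."""
    return lambda t: s(f(t))


# Hypothetical subtitle intervals, not taken from any real file
s = make_indicator([(1.0, 2.5), (2.0, 3.0), (5.0, 6.0)])
s_sync = resynchronized(s, lambda t: t + 0.5)  # example f: a pure 0.5 s shift
```

With this example f, a subtitle that originally starts at 1.0 s becomes visible at t = 0.5 s, because s_sync(0.5) evaluates s(1.0).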
Speech detection (VAD)
Speech detection is done by first computing a spectrogram of the audio, that is, a matrix of features, where each column corresponds to a frame of duration Δt and each row a certain frequency band. Additional features are engineered by computing a rolling maximum of the spectrogram with a few different periods.
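The feature construction can be sketched in plain Python as follows. This is only a toy illustration under several assumptions: non-overlapping frames, a naive DFT instead of a real FFT library, a trailing (not centered) rolling window, and made-up frame length and periods. The function names are hypothetical and the actual feature code in this package may differ.

```python
import cmath
import math


def spectrogram(samples, frame_len):
    """Magnitude spectrum of each non-overlapping frame.

    Returns a list of columns; column i holds the magnitudes of frame i
    for frequency bins 0..frame_len // 2 (naive DFT, toy sizes only).
    """
    n_bins = frame_len // 2 + 1
    columns = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        column = []
        for k in range(n_bins):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            column.append(abs(acc))
        columns.append(column)
    return columns


def rolling_max(columns, period):
    """Per-band maximum over the current and previous period - 1 frames."""
    out = []
    for i in range(len(columns)):
        window = columns[max(0, i - period + 1):i + 1]
        out.append([max(band) for band in zip(*window)])
    return out


def build_features(samples, frame_len=16, periods=(2, 4)):
    """Stack the raw spectrogram with one rolling maximum per period."""
    spec = spectrogram(samples, frame_len)
    stacked = [list(col) for col in spec]
    for period in periods:
        for col, extra in zip(stacked, rolling_max(spec, period)):
            col.extend(extra)
    return stacked


# Toy signal: four frames of a pure tone followed by four frames of silence
signal = [math.sin(2 * math.pi * 2 * n / 16) for n in range(64)] + [0.0] * 64
features = build_features(signal)
```

Each feature column here has 27 entries: 9 raw frequency bins plus 9 per rolling period. The rolling maxima let a silent frame "remember" speech-like energy from the frames just before it.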
Using a collection of correctly synchronized media files, one can create a training data set where each feature column is associated with a correct label. This allows training a machine learning model to predict the labels, that is, to detect speech, on any previously unseen audio track. The prediction is the probability of speech p(iΔt) on frame number i.
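Labeling can be sketched like this: frame i gets label 1 if a correctly synchronized subtitle is visible at time iΔt. The function name `frame_labels`, the choice of sampling at the start of each frame, and the example values are assumptions for illustration only.

```python
def frame_labels(intervals, n_frames, dt):
    """Label frame i with 1 if any (start, end) interval contains i * dt."""
    labels = []
    for i in range(n_frames):
        t = i * dt  # assumption: the frame is sampled at its start time
        visible = any(start <= t < end for start, end in intervals)
        labels.append(1 if visible else 0)
    return labels


# Hypothetical data: one subtitle shown from 1.0 s to 2.0 s, frames of 0.5 s
labels = frame_labels([(1.0, 2.0)], n_frames=6, dt=0.5)
```

The resulting label vector lines up one-to-one with the feature columns, which is what a classifier needs for training.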
The weapon of choice in this project is logistic regression, a common baseline method in machine learning, which is simple to implement. The accuracy of speech detection achieved with this model is not very good, only around 72% (AUROC). However, the speech detection results are not the final output of this program but just an input to the synchronization parameter search. As mentioned in the performance section, the overall synchronization accuracy is quite fine even though the speech detection is not.
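For readers unfamiliar with logistic regression, here is a minimal from-scratch sketch trained with plain gradient descent. It is not the training code of this project: the single toy feature, the learning rate, and the number of epochs are arbitrary assumptions, and the data is made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, bias, x):
    """Probability of the positive class (speech) for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + bias)


def train_logistic(X, y, lr=1.0, epochs=1000):
    """Fit weights by batch gradient descent on the mean log loss."""
    w = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, bias, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        bias -= lr * grad_b / len(X)
    return w, bias


# Made-up training set: one feature per frame, label 1 = subtitle visible
X = [[0.1], [0.2], [0.3], [0.9], [1.0], [1.2]]
y = [0, 0, 0, 1, 1, 1]
w, bias = train_logistic(X, y)

p_quiet = predict_proba(w, bias, [0.15])
p_loud = predict_proba(w, bias, [1.1])
```

The model outputs a probability rather than a hard decision, which is exactly the p(iΔt) consumed by the parameter search.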
Synchronization parameter search
This program only searches for linear transformations of the form f(t) = a t + b, where b is shift and a is speed correction. The optimization method is brute force grid search where b is limited to a certain range and a is one of the common skew factors. The parameters minimizing the loss function are selected.
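The search loop itself is simple enough to sketch in a few lines. In this illustration `loss_fn` is a hypothetical callable standing in for the real loss, and the shift range, step size and skew list (including 1.0 for "no speed correction") are assumptions, not the values used by this package.

```python
def grid_search(loss_fn, skews, shift_min, shift_max, shift_step):
    """Try every (a, b) on the grid and return (loss, a, b) of the best one."""
    n_steps = int(round((shift_max - shift_min) / shift_step))
    best = None
    for a in skews:
        for k in range(n_steps + 1):
            # Compute b from the index to avoid accumulating float error
            b = shift_min + k * shift_step
            value = loss_fn(a, b)
            if best is None or value < best[0]:
                best = (value, a, b)
    return best


# Assumed candidate speed corrections, for illustration only
skews = [1.0, 25 / 24, 24 / 25]


def toy_loss(a, b):
    """Stand-in loss with a known minimum at a = 25/24, b = 1.5."""
    return abs(a - 25 / 24) + abs(b - 1.5)


best_loss, best_a, best_b = grid_search(toy_loss, skews, -3.0, 3.0, 0.5)
```

Because the set of allowed speed corrections is tiny, the cost of brute force is dominated by the number of shift values tried.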
Loss function
The data produced by the speech detection phase is a vector representing the speech probabilities in frames of duration Δt. The metric used for evaluating match quality is expected linear loss:
loss(f) = Σ<sub>i</sub> s(f<sub>i</sub>) (1 - p<sub>i</sub>) + (1 - s(f<sub>i</sub>)) p<sub>i</sub>,
where p<sub>i</sub> = p(iΔt) is the probability of speech and s(f<sub>i</sub>) = s(f(iΔt)) = s(a iΔt + b) is the subtitle indicator resynchronized using the transformation f at frame number i.
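The formula translates almost directly into code. The sketch below assumes that `p` is a list of per-frame speech probabilities and `s` is any callable returning 0 or 1. The function name and the toy inputs are hypothetical.

```python
def expected_linear_loss(p, s, a, b, dt):
    """Sum over frames of s(f_i) * (1 - p_i) + (1 - s(f_i)) * p_i."""
    total = 0.0
    for i, p_i in enumerate(p):
        s_i = s(a * i * dt + b)  # subtitle indicator after f(t) = a t + b
        total += s_i * (1 - p_i) + (1 - s_i) * p_i
    return total


def s(t):
    """Toy indicator: a single subtitle visible from 1.0 s to 2.0 s."""
    return 1 if 1.0 <= t < 2.0 else 0


# Made-up speech probabilities for frames at t = 0, 0.5, ..., 2.5 seconds
p = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1]

loss_aligned = expected_linear_loss(p, s, a=1.0, b=0.0, dt=0.5)
loss_shifted = expected_linear_loss(p, s, a=1.0, b=1.0, dt=0.5)
```

In this toy case the aligned subtitles give a loss of 0.6 and the subtitles shifted by one second give 3.8, so the search would prefer b = 0.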
Speed correction
Speed/skew detection is based on the assumption that an error in playing speed is not an arbitrary number but is caused by a frame rate mismatch, which constrains the possible playing speed multiplier to be a ratio of two common frame rates that is sufficiently close to one. In particular, it must be one of the following values:
- 24/23.976 = 30/29.97 = 60/59.94 = 1001/1000
- 25/24
- 25/23.976
or the reciprocal (1/x).
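The candidate list can be enumerated exactly with fractions. This sketch treats 23.976 as shorthand for the NTSC film rate 24000/1001, which is what makes 24/23.976 equal to 1001/1000 exactly. Whether this package uses the exact or the rounded value, and whether it includes 1 (no correction) as a candidate, are assumptions here.

```python
from fractions import Fraction

# Assumption: "23.976" stands for the exact NTSC film rate 24000/1001
NTSC_FILM = Fraction(24000, 1001)

base_ratios = [
    24 / NTSC_FILM,    # = 1001/1000, same as 30/29.97 and 60/59.94
    Fraction(25, 24),
    25 / NTSC_FILM,    # = 1001/960
]

# Add the reciprocals, plus 1 for "no speed correction" (an assumption)
candidates = sorted({Fraction(1)}
                    | set(base_ratios)
                    | {1 / r for r in base_ratios})

for ratio in candidates:
    print(f"{str(ratio):>10}  {float(ratio):.6f}")
```

All seven candidates lie within about 4.3% of one, which is why such a speed error is easy to miss by ear.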
The reasoning behind this is that if the frame rate of (digital) video footage needs to be changed and the target and source frame rates are close enough, the conversion is often done by skipping any re-sampling and just changing the nominal frame rate. This effectively changes the playing speed of the video and the pitch of the audio by a small factor which is the ratio of these frame rates.
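A quick bit of arithmetic shows why even these small factors matter. The helper `drift_seconds` and the 90-minute runtime are assumptions chosen for illustration.

```python
from fractions import Fraction


def drift_seconds(ratio, t):
    """How far a timestamp at t seconds moves when played at `ratio` speed."""
    return t * ratio - t


runtime = 90 * 60  # a hypothetical 90-minute film, in seconds

drift_25_24 = drift_seconds(Fraction(25, 24), runtime)
drift_1001_1000 = drift_seconds(Fraction(1001, 1000), runtime)

print(float(drift_25_24), float(drift_1001_1000))
```

At the end of such a film a 25/24 mismatch puts the subtitles 225 seconds off, and even the 1001/1000 mismatch accumulates 5.4 seconds, far more than a constant shift could fix.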
Performance
Based on somewhat limited testing, the typical shift error in auto-synchronization seems to be around 0.15 seconds (cross-validation RMSE) and generally below 0.5 seconds. In other words, it seems to work well enough in most cases but could be better. Speed correction errors did not occur.
Auto-syncing a full-length movie currently takes about 3 minutes and utilizes around 1.5 GB of RAM.
References
I first checked Google to see whether someone had already tried to solve the same problem and found this great blog post whose author had implemented a solution using more or less the same approach that I had in mind. The post also included good points that I had not realized, such as using correctly synchronized subtitles as training data for speech detection.
Instead of starting from the code linked in that blog post, I decided to implement my own version from scratch, since this might have been a good application for trying out RNNs. That turned out to be unnecessary, but this was a nice project nevertheless.
Other similar projects
- https://github.com/tympanix/subsync Apparently based on the blog post above, looks good
- https://github.com/smacke/subsync Newer project, uses WebRTC VAD (instead of DIY machine learning) for speech detection
- https://github.com/Koenkk/PyAMC/blob/master/autosubsync.py
- https://github.com/pulasthi7/AutoSubSync-old & https://github.com/pulasthi7/AutoSubSync (looks inactive)
