Autosubsync

Automatically synchronize subtitles with audio using machine learning


Automatic subtitle synchronization tool

Did you know that hundreds of movies, especially from the 1950s and '60s, are now in the public domain and available online? Great! Let's download Plan 9 from Outer Space. As a non-native English speaker, I prefer watching movies with subtitles, which can also be found online for free. However, sometimes there is a problem: the subtitles are not in sync with the movie.

But fear not. This tool can resynchronize the subtitles without any human input. A correction for both shift and playing speed can be found automatically... using "AI & machine learning".

Installation

macOS / OSX

Prerequisites: Install Homebrew and pip. Then install FFmpeg and this package:

brew install ffmpeg
pip install autosubsync

Linux (Debian & Ubuntu)

Make sure you have pip, e.g., sudo apt-get install python-pip. Then install FFmpeg and this package:

sudo apt install ffmpeg
sudo apt install libsndfile1 # sometimes optional
sudo pip install autosubsync

The libsndfile1 package is sometimes, but not always, needed because of https://github.com/bastibe/python-soundfile/issues/258.

Usage

autosubsync [input movie] [input subtitles] [output subs]

# for example
autosubsync plan-9-from-outer-space.avi \
  plan-9-out-of-sync-subs.srt \
  plan-9-subtitles-synced.srt

See autosubsync --help for more details.

Features

Automatic speed and shift correction
Typical synchronization accuracy ~0.15 seconds (see performance)
Wide video format support through ffmpeg
Supports all reasonably encoded SRT files in any language
Should work with any language in the audio (only tested with a few though)
Quality-of-fit metric for checking sync success

Python API. Example (save as batch_sync.py):

"Batch synchronize video files in a folder: python batch_sync.py /path/to/folder"

import autosubsync
import glob, os, sys

if __name__ == '__main__':
    for video_file in glob.glob(os.path.join(sys.argv[1], '*.mp4')):
        base = video_file.rpartition('.')[0]
        srt_file = base + '.srt'
        synced_srt_file = base + '_synced.srt'

        # see help(autosubsync.synchronize) for more details
        autosubsync.synchronize(video_file, srt_file, synced_srt_file)

Development

Training the model

Collect a bunch of well-synchronized video and subtitle files and put them in a file called training/sources.csv (see training/sources.csv.example)
Run (and see) train_and_test.sh. This
- populates the training/data folder
- creates trained-model.bin
- runs cross-validation

Synchronization (predict)

Assumes trained model is available as trained-model.bin

python3 autosubsync/main.py input-video-file input-subs.srt synced-subs.srt

Build and distribution

Create virtualenv: python3 -m venv venvs/test-python3
Activate venv: source venvs/test-python3/bin/activate
pip install -e .
pip install wheel
python setup.py bdist_wheel

Methods

The basic idea is to first detect speech on the audio track, that is, for each point in time, t, in the film, to estimate if speech is heard. The method described below produces this estimate as a probability of speech p(t). Another input to the program is the unsynchronized subtitle file containing the timestamps of the actual subtitle intervals.

Synchronization is done by finding a time transformation t → f(t) that makes the synchronized subtitles, s(f(t)), best match the detected speech, p(t). Here s(t) is the (unsynchronized) subtitle indicator function whose value is 1 if any subtitles are visible at time t and 0 otherwise.
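As a minimal sketch of these definitions (not the project's actual code), the indicator s(t) can be built from a list of subtitle intervals and composed with a transformation f. The helper names make_indicator and resync are hypothetical, and the intervals are assumed to be sorted and non-overlapping.

```python
from bisect import bisect_right

def make_indicator(intervals):
    """Return s(t): 1 if some subtitle interval [start, end) covers t, else 0.

    `intervals` is a list of (start, end) pairs in seconds, assumed to be
    sorted by start time and non-overlapping.
    """
    starts = [start for start, _ in intervals]

    def s(t):
        k = bisect_right(starts, t) - 1  # last interval starting at or before t
        return 1 if k >= 0 and t < intervals[k][1] else 0

    return s

def resync(s, f):
    """Compose the indicator with a time transformation: t -> s(f(t))."""
    return lambda t: s(f(t))

# Two subtitles, visible during [1, 3) and [5, 6) seconds.
s = make_indicator([(1.0, 3.0), (5.0, 6.0)])

# Made-up transformation f(t) = t + 0.5; in s(f(t)) every subtitle
# becomes visible 0.5 s earlier than in s(t).
s_sync = resync(s, lambda t: t + 0.5)
```

With this toy input, s(0.5) is 0 while s_sync(0.5) is 1, because the first subtitle now starts at 0.5 s instead of 1.0 s.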

Speech detection (VAD)

Speech detection is done by first computing a spectrogram of the audio, that is, a matrix of features, where each column corresponds to a frame of duration Δt and each row to a certain frequency band. Additional features are engineered by computing a rolling maximum of the spectrogram with a few different periods.

Using a collection of correctly synchronized media files, one can create a training data set, where each feature column is associated with a correct label. This allows training a machine learning model to predict the labels, that is, detect speech, on any previously unseen audio track, as the probability of speech p(iΔt) on frame number i.
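The feature construction described above could look roughly like the following NumPy sketch. It is an illustration only: the frame length, the Hann window, the trailing (rather than centered) rolling window, and the periods (2, 4) are all assumptions, and the function names are hypothetical rather than taken from the project.

```python
import numpy as np

def spectrogram(signal, frame_len):
    """Magnitude spectrogram: rows are frequency bands, columns are frames."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    window = np.hanning(frame_len)  # assumed window choice
    return np.abs(np.fft.rfft(frames * window, axis=1)).T

def rolling_max(spec, period):
    """Maximum over the current and the previous period - 1 frames, per band."""
    out = spec.copy()
    for lag in range(1, period):
        out[:, lag:] = np.maximum(out[:, lag:], spec[:, :-lag])
    return out

def build_features(signal, frame_len, periods=(2, 4)):
    """Stack the spectrogram with its rolling maxima along the feature axis."""
    spec = spectrogram(signal, frame_len)
    return np.vstack([spec] + [rolling_max(spec, p) for p in periods])

# One second of a 440 Hz tone at an assumed 8 kHz sample rate.
rate = 8000
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 440 * t)
features = build_features(signal, frame_len=400)  # 20 frames of 50 ms each
```

Each column of features is then one training example for the speech detector, with the original spectrogram rows followed by one block of rows per rolling-maximum period.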

The weapon of choice in this project is logistic regression, a common baseline method in machine learning, which is simple to implement. The accuracy of speech detection achieved with this model is not very good, only around 72% (AUROC). However, the speech detection results are not the final output of this program but just an input to the synchronization parameter search. As mentioned in the performance section, the overall synchronization accuracy is quite good even though the speech detection is not.
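For readers unfamiliar with logistic regression, here is a toy, dependency-free sketch trained by plain gradient descent. It is not the project's training code: the single "energy" feature, the data, the learning rate, and the epoch count are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class (speech) for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Made-up data: one feature per frame, label 1 means "speech".
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train(X, y)

p_quiet = predict_proba(weights, bias, [0.15])  # low-energy frame
p_loud = predict_proba(weights, bias, [0.85])   # high-energy frame
```

The model outputs a probability per frame rather than a hard label, which is what the synchronization step consumes as p(iΔt).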

Synchronization parameter search

This program only searches for linear transformations of the form f(t) = a t + b, where b is shift and a is speed correction. The optimization method is brute force grid search where b is limited to a certain range and a is one of the common skew factors. The parameters minimizing the loss function are selected.
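A brute-force grid search of this kind can be sketched in a few lines. The shift range, the step size, the reduced list of speed factors, and the stand-in toy_loss below are assumptions made for illustration; the real program evaluates its own loss function on the detected speech.

```python
import itertools

def grid_search(loss, speeds, shifts):
    """Return the (a, b) pair that minimizes loss(a, b) over the whole grid."""
    return min(itertools.product(speeds, shifts), key=lambda ab: loss(*ab))

# Hypothetical grid: three speed factors, shifts from -10 s to 10 s in 0.5 s steps.
speeds = [1.0, 25 / 24, 24 / 25]
shifts = [k * 0.5 for k in range(-20, 21)]

def toy_loss(a, b):
    # Stand-in for the real loss: distance to a made-up "true" correction.
    return abs(a - 25 / 24) + abs(b - 2.5)

best_a, best_b = grid_search(toy_loss, speeds, shifts)
```

Because the set of speed factors is small and discrete, the cost of the search is dominated by the number of shift values tried per speed factor.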

Loss function

The data produced by the speech detection phase is a vector representing the speech probabilities in frames of duration Δt. The metric used for evaluating match quality is expected linear loss:

loss(f) = Σ_i [ s(f_i) (1 - p_i) + (1 - s(f_i)) p_i ],

where p_i = p(iΔt) is the probability of speech and s(f_i) = s(f(iΔt)) = s(a iΔt + b) is the subtitle indicator resynchronized using the transformation f at frame number i.
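The loss above translates almost directly into code. The sketch below assumes the speech probabilities are given as a plain list and the indicator s as a Python function; the name linear_loss and the toy data are hypothetical.

```python
def linear_loss(p, s, a, b, dt):
    """Expected linear loss of the transformation f(t) = a*t + b.

    p  -- speech probability for each frame i (at time i*dt)
    s  -- subtitle indicator function, s(t) in {0, 1}
    """
    total = 0.0
    for i, p_i in enumerate(p):
        s_i = s(a * i * dt + b)
        # Penalize subtitles shown without speech, and speech without subtitles.
        total += s_i * (1 - p_i) + (1 - s_i) * p_i
    return total

# Toy data: speech is likely in frames 1 and 2 (dt = 1 s), but the
# unsynchronized subtitle is visible during [2, 4) seconds.
p = [0.1, 0.9, 0.8, 0.2]
s = lambda t: 1 if 2.0 <= t < 4.0 else 0

loss_unshifted = linear_loss(p, s, a=1.0, b=0.0, dt=1.0)  # about 2.0
loss_shifted = linear_loss(p, s, a=1.0, b=1.0, dt=1.0)    # about 0.6
```

In this example the shift b = 1 lines the subtitle up with the frames where speech is likely, so it yields the lower loss.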

Speed correction

Speed/skew detection is based on the assumption that an error in playing speed is not an arbitrary number but caused by frame rate mismatch, which constrains the possible playing speed multiplier to be a ratio of two common frame rates sufficiently close to one. In particular, it must be one of the following values

24/23.976 = 30/29.97 = 60/59.94 = 1001/1000
25/24
25/23.976

or the reciprocal (1/x).
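The resulting candidate set is small enough to enumerate explicitly. In the sketch below, including 1.0 (no speed correction) as a candidate is an assumption, and the variable names are illustrative.

```python
# The three ratios listed above, using the nominal frame rates.
base_factors = [1001 / 1000, 25 / 24, 25 / 23.976]

# Add the reciprocals, plus 1.0 for "no speed correction" (an assumption).
speed_candidates = sorted([1.0] + base_factors + [1 / x for x in base_factors])

for a in speed_candidates:
    print(f"{a:.5f}")
```

All seven candidates lie within about 5% of 1, which is why such a speed error is easy to miss at the start of a film and obvious by the end.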

The reasoning behind this is that if the frame rate of (digital) video footage needs to be changed and the target and source frame rates are close enough, the conversion is often done by skipping any re-sampling and just changing the nominal frame rate. This effectively changes the playing speed of the video and the pitch of the audio by a small factor which is the ratio of these frame rates.
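To get a feel for the size of the effect, the following sketch computes how far a subtitle timestamp moves when its time axis is scaled by one of these factors. The 90-minute position is a made-up example and drift_seconds is a hypothetical helper.

```python
def drift_seconds(timestamp_s, speed):
    """Offset introduced at a given timestamp when time is scaled by `speed`."""
    return timestamp_s * speed - timestamp_s

late_in_film = 90 * 60  # a subtitle 90 minutes (5400 s) into the film

drift_pal = drift_seconds(late_in_film, 25 / 24)        # about 225 s
drift_ntsc = drift_seconds(late_in_film, 1001 / 1000)   # about 5.4 s
```

Even the smallest factor, 1001/1000, produces an offset of several seconds late in a feature-length film, so a shift correction alone cannot fix such subtitles.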

Performance

Based on somewhat limited testing, the typical shift error in auto-synchronization seems to be around 0.15 seconds (cross-validation RMSE) and generally below 0.5 seconds. In other words, it seems to work well enough in most cases but could be better. Speed correction errors did not occur.

Auto-syncing a full-length movie currently takes about 3 minutes and utilizes around 1.5 GB of RAM.

References

I first checked on Google whether someone had already tried to solve the same problem and found this great blog post whose author had implemented a solution using more or less the same approach that I had in mind. The post also included good points that I had not realized, such as using correctly synchronized subtitles as training data for speech detection.

Instead of starting from the code linked in that blog post, I decided to implement my own version from scratch, since this might have been a good application for trying out RNNs. That turned out to be unnecessary, but this was a nice project nevertheless.

Other similar projects

https://github.com/tympanix/subsync Apparently based on the blog post above, looks good
https://github.com/smacke/subsync Newer project, uses WebRTC VAD (instead of DIY machine learning) for speech detection
https://github.com/Koenkk/PyAMC/blob/master/autosubsync.py
https://github.com/pulasthi7/AutoSubSync-old & https://github.com/pulasthi7/AutoSubSync (looks inactive)



oseiskar/autosubsync
