# vid2cleantxt

Python API & command-line tool to easily transcribe speech-based video files into clean text.

vid2cleantxt: a transformers-based pipeline for turning heavily speech-based video files into clean, readable text from the audio. Robust speech transcription is now possible like never before with OpenAI's Whisper model.

**TL;DR:** check out this Colab notebook, which transcribes and extracts keywords from a speech by John F. Kennedy simply by running all cells.
## Table of Contents

<!-- TOC -->

- Motivation
- Overview
- Quickstart
- Notebooks on Colab
- Details & Application
  - ScatterText example use case
- Design Choices & Troubleshooting
  - What python package dependencies does this repo have?
  - My computer crashes once it starts running the wav2vec2 model
  - The transcription is not perfect, and therefore I am mad
  - How can I improve the performance of the model from a word-error-rate perspective?
  - Why use transformer models instead of SpeechRecognition or other transcription methods?
- Errors
- Examples
- Future Work, Collaboration, & Citations
## Motivation
Video, specifically audio, is inefficient in conveying dense or technical information. The viewer has to sit through the whole thing, while only part of the video may be relevant to them. If you don't understand a statement or concept, you must search through the video or re-watch it. This project attempts to help solve that problem by converting long video files into text that can be easily searched and summarized.
## Overview

### Example Output
Example output text of a video transcription of JFK's speech on going to the moon:
vid2cleantxt output:
Now look into space to the moon and to the planets beyond and we have vowed that we shall not see it governed by a hostile flag of conquest but by a banner of freedom and peace we have vowed that we shall not see space filled with weapons of mass destruction but with instruments of knowledge and understanding yet the vow. In short our leadership in science and industry our hopes for peace and security our obligations to ourselves as well as others all require a. To solve these mysteries to solve them for the good of all men and to become the worlds leading space faring nation we set sail on this new sea because there is new knowledge to be gained and new rights to be won and they must be won and used for the progress of all people for space science like nuclear science and all technology. Has no conscience of its own whether it will become a force for good or ill depends on man and only if the united states occupies a position of preeminence can we help decide whether this new ocean will be a sea of peace
Model: `openai/whisper-medium.en`
See the demo notebook for the full-text output.
### Pipeline Intro

- The `transcribe.py` script uses `audio2text_functions.py` to convert video files to `.wav` format audio chunks of duration X* seconds
- Transcribe all X audio chunks through a pretrained transformer model
- Write all list results into a text file, store various runtime metrics into a separate text list, and delete the `.wav` audio chunk directory after using them
- (Optional) create two new text files: one with all transcriptions appended and one with all metadata appended
- FOR each transcription text file:
  - Passes the 'base' transcription text through a spell checker (Neuspell) and auto-corrects spelling. Saves as a new text file.
  - Uses pySBD to infer sentence boundaries on the spell-corrected text and add periods to delineate sentences. Saves as a new file.
  - Runs essential keyword extraction (via YAKE) on the spell-corrected file. All keywords per file are stored in one dataframe for comparison and exported to the `.xlsx` format

\* (where X is some duration that does not overload your computer/runtime)
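The chunking step above is straightforward arithmetic: given the total audio duration and a chunk length X, compute the `(start, end)` boundaries of each chunk. The helper below is a minimal illustrative sketch, not part of the vid2cleantxt package:

```python
def chunk_boundaries(total_seconds: float, chunk_seconds: float) -> list:
    """Split [0, total_seconds) into consecutive (start, end) windows of at most chunk_seconds."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# e.g. a 75-second clip with 30-second chunks:
# chunk_boundaries(75.0, 30.0) -> [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

Note that the final chunk is simply shorter than X, rather than padded or dropped.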
Given `INPUT_DIRECTORY`:

- final transcriptions in `.txt` will be in `INPUT_DIRECTORY/v2clntxt_transcriptions/results_SC_pipeline/`
- metadata about the transcription process will be in `INPUT_DIRECTORY/v2clntxt_transc_metadata`
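Based on the layout above, collecting the final transcripts after a run can be done with a few lines of `pathlib` (a hypothetical helper, assuming the default output folder names):

```python
from pathlib import Path


def find_transcripts(input_dir: str) -> list:
    """Collect final .txt transcripts from the expected vid2cleantxt output subfolder."""
    results_dir = Path(input_dir) / "v2clntxt_transcriptions" / "results_SC_pipeline"
    return sorted(results_dir.glob("*.txt"))
```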
## Quickstart
Install, then you can use vid2cleantxt in two ways:

- CLI: via the `transcribe.py` script from the command line (`python vid2cleantxt/transcribe.py --input-dir "path/to/video/files" --output-dir "path/to/output/dir"`)
- As a Python package: import `vid2cleantxt` and use the `transcribe` module to transcribe videos (`vid2cleantxt.transcribe.transcribe_dir()`)
Don't want to use it locally, or don't have a GPU? You may be interested in the demo notebook on Google Colab.
## Installation

### As a Python package
- (recommended) Create a new virtual environment with `python3 -m venv venv`
- Activate the virtual environment with `source venv/bin/activate`
- Install the repo with pip:

```bash
pip install git+https://github.com/pszemraj/vid2cleantxt.git
```
The library is now installed and ready to use in your Python scripts.
```python
import vid2cleantxt

text_output_dir, metadata_output_dir = vid2cleantxt.transcribe.transcribe_dir(
    input_dir="path/to/video/files",
    model_id="openai/whisper-base.en",
    chunk_length=30,
)
# do things with text files in text_output_dir
```
See below for more details on the `transcribe_dir` function.
Install from source
git clone https://github.com/pszemraj/vid2cleantxt.git- use the
--depth=1switch to clone only the latest master (faster)
- use the
cd vid2cleantxt/pip install -e .
As a shell block:
git clone https://github.com/pszemraj/vid2cleantxt.git --depth=1
cd vid2cleantxt/
pip install -e .
### Install details & gotchas
- This should be completed automatically upon installation/import, but a spaCy model may need to be downloaded for post-processing transcribed audio. This can be done with `spacy download en_core_web_sm`.
- `FFMPEG` is required as a base system dependency to do anything with video/audio. It should already be installed on your system; otherwise, see the FFmpeg site.
- We've added an implementation for whisper to the repo. Until further tests are completed, it's recommended to stick with the default 30s chunk length for these models (plus, they are fairly compute-efficient for the resulting quality).
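Since FFmpeg is a system dependency rather than a Python package, it can fail silently if missing. A quick sanity check before running the pipeline (a minimal sketch using only the standard library; not part of the package):

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if the ffmpeg binary is on PATH (required for audio extraction)."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found - install it from https://ffmpeg.org/ before transcribing")
```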
### Example usage
CLI example: transcribe a directory of example videos in `./examples/` with the `whisper-small` model (a multilingual model, not English-only) and print the transcriptions with the `cat` command:
```bash
python examples/TEST_folder_edition/dl_src_videos.py
python vid2cleantxt/transcribe.py -i ./examples/TEST_folder_edition/ -m openai/whisper-small
find ./examples/TEST_folder_edition/v2clntxt_transcriptions/results_SC_pipeline -name "*.txt" -exec cat {} +
```
Run `python vid2cleantxt/transcribe.py --help` for more details on the CLI.
Python API example: transcribe an input directory of user-specified videos using `whisper-tiny.en`, a smaller, faster model than the default.
```python
import vid2cleantxt

_my_input_dir = "path/to/video/files"

text_output_dir, metadata_output_dir = vid2cleantxt.transcribe.transcribe_dir(
    input_dir=_my_input_dir,
    model_id="openai/whisper-tiny.en",
    chunk_length=30,
)
```
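One concrete payoff of the searchable-text motivation above: once the `.txt` transcripts exist, a plain keyword search replaces scrubbing through the video. A minimal sketch, assuming the transcripts are plain `.txt` files in the returned `text_output_dir` (the helper name is hypothetical):

```python
from pathlib import Path


def search_transcripts(text_output_dir: str, query: str) -> dict:
    """Map each transcript filename to the line numbers containing the query (case-insensitive)."""
    hits = {}
    for txt in sorted(Path(text_output_dir).glob("*.txt")):
        matches = [
            i
            for i, line in enumerate(txt.read_text(encoding="utf-8").splitlines(), start=1)
            if query.lower() in line.lower()
        ]
        if matches:
            hits[txt.name] = matches
    return hits
```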
Transcribed files can then be interacted with for whatever purpose (see [Visualizati
