Song2Vec — Bass Pattern Similarity Detection

Join the Discord Server for discussion: https://discord.gg/AqMSZ3b3xM
Bass pattern recognition and similarity detection
Compare musical similarity between two songs by analyzing how their bass patterns evolve over time.
What This Project Does
Song2Vec is a bass pattern recognition system that detects whether the bass sequences of two songs are similar, in the spirit of Shazam-style audio fingerprinting.
Simple Explanation
When you listen to music, you notice bass patterns—the low-frequency rhythm bumps. Song2Vec:
- Extracts the bass from two songs
- Compares if their bass patterns move in similar ways over time
- Tells you how similar they are (as a percentage)
- Shows you where the matches are using interactive graphs
How It Works
The Audio Processing Pipeline
Step 1: Load & Normalize
- Load audio file with librosa at 22050 Hz
- Normalize amplitude (peak or RMS normalization)
Step 2: Extract Bass Spectrogram
- Apply Short-Time Fourier Transform (STFT) with n_fft=4096, hop_length=512
- Isolate bass frequency band (20–250 Hz)
- Create spectrogram showing "how loud each frequency at each moment in time"
- Result: 43 frequency bins × N time frames
Step 3: Compute Energy Envelope
- Sum bass energy across all frequencies for each time frame
- Convert to log scale for perceptual alignment
- Get single curve: "bass energy level at each moment"
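Steps 2 and 3 can be sketched in a few lines of numpy. This is an illustrative stand-in, not the project's actual isolate_frequency_band()/bass_energy(), and it assumes you already have a magnitude spectrogram S_mag plus the center frequency of each bin:

```python
import numpy as np

def bass_energy_envelope(S_mag, freqs, f_lo=20.0, f_hi=250.0):
    """Sum bass-band energy per frame and convert to log scale.

    S_mag: magnitude spectrogram, shape (n_freq_bins, n_frames)
    freqs: center frequency of each bin, shape (n_freq_bins,)
    """
    band = (freqs >= f_lo) & (freqs <= f_hi)  # boolean mask selecting 20-250 Hz bins
    energy = (S_mag[band] ** 2).sum(axis=0)   # per-frame bass energy
    return np.log1p(energy)                   # log scale for perceptual alignment

# toy spectrogram: 10 bins spanning 0-500 Hz, 4 time frames of uniform magnitude
freqs = np.linspace(0, 500, 10)
S = np.ones((10, 4))
env = bass_energy_envelope(S, freqs)  # one energy value per frame
```

The result is the single "bass energy at each moment" curve that the later matching steps consume.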
Step 4: Temporal Pattern Matching (Core Innovation)
The system compares sequences over time using three approaches:
- Dynamic Time Warping (DTW)
  - Solves: "Two songs have the same bass pattern but at different speeds"
  - Flexibly aligns sequences despite tempo differences
  - Returns DTW distance → converted to 0-100% similarity
- Cross-Correlation
  - Finds repeating patterns and shifted versions
  - Shows where in the song patterns align
  - Peaks indicate strong alignment points
- Frame-by-Frame Similarity
  - For each moment, compare bass levels: how similar is the energy?
  - Shows WHERE in the track patterns match (temporal localization)
  - Average across all frames = overall similarity
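The DTW idea above can be illustrated with a minimal, unoptimized implementation. The project's dtw_distance()/fast_dtw() are more elaborate, and the distance-to-similarity mapping shown here is just one plausible choice:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW between two 1-D energy envelopes."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# the same bass bump played at two different speeds: DTW distance is still 0
slow = np.array([0, 1, 1, 2, 2, 1, 1, 0], dtype=float)
fast = np.array([0, 1, 2, 1, 0], dtype=float)
d = dtw_distance(slow, fast)
similarity = 1.0 / (1.0 + d)  # one possible mapping of distance to (0, 1]
```

A plain frame-by-frame comparison of `slow` and `fast` would fail outright (different lengths), which is exactly the tempo problem DTW solves.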
Step 5: Visualize Results
- Side-by-side spectrograms (what frequencies are loud when)
- Frame similarity graph (shows matching moments over time)
- Detected matched segments with timestamps and match percentages
- Overall similarity score (0-100%)
Installation
Install dependencies:
uv sync
Notes:
- librosa requires soundfile/audioread. For MP3 support, install ffmpeg:
  - Ubuntu: sudo apt-get install ffmpeg
  - macOS: brew install ffmpeg
Quick Start
Web UI
bash run.sh
Then open http://localhost:5000 in your browser and upload two songs via drag-and-drop to compare them.
Python API
```python
from core import (
    load_audio, normalize_waveform,
    compute_stft_magnitude, isolate_frequency_band,
    match_bass_patterns,
)

# Load and process
audio_a = load_audio("song1.mp3", sr=22050)
audio_b = load_audio("song2.mp3", sr=22050)
y_a = normalize_waveform(audio_a.y, method="peak")
y_b = normalize_waveform(audio_b.y, method="peak")

# Get spectrograms
S_mag_a, freqs, times = compute_stft_magnitude(y_a, 22050, n_fft=4096, hop_length=512)
S_mag_b, _, _ = compute_stft_magnitude(y_b, 22050, n_fft=4096, hop_length=512)

# Extract bass band (20-250 Hz)
S_bass_a, bass_freqs = isolate_frequency_band(S_mag_a, freqs, 20, 250)
S_bass_b, _ = isolate_frequency_band(S_mag_b, freqs, 20, 250)

# Match patterns
result = match_bass_patterns(S_bass_a, S_bass_b)

# Results
print(f"Similarity: {result.overall_similarity:.2%}")
print(f"Matched segments: {result.matched_segments}")
print(f"Frame scores: {result.frame_similarity}")
```
Project Structure
Song2Vec/
├── app.py              ← Flask web app entry point
├── run.sh              ← Start web server
├── uv.lock
├── README.md
│
├── benchmarks/
│   └── profiler.py
│
├── core/               ← All audio processing logic
│   ├── __init__.py         (Public API exports)
│   ├── audio.py            (Load, normalize, resample)
│   ├── dtw.py              (Dynamic Time Warping implementation)
│   ├── features.py         (STFT, bass extraction)
│   ├── pattern_matching.py (DTW, correlation, matching)
│   └── similarity.py       (Cosine, euclidean metrics)
│
├── web/                ← Flask-specific code
│   ├── __init__.py
│   └── api.py              (REST endpoints)
│
├── templates/          ← HTML UI
│   └── index.html          (Plotly spectrograms, interface)
│
└── data/
    ├── raw/                (Sample audio files)
    ├── uploads/            (Temporary uploads)
    └── images/             (Logo, documentation)
Core Modules
core/audio.py — Audio loading and preprocessing
- load_audio() — Load audio with librosa
- normalize_waveform() — Peak/RMS normalization
- resample_audio() — Change sample rate
core/features.py — Bass feature extraction
- compute_stft_magnitude() — STFT spectrogram
- isolate_frequency_band() — Extract frequency range
- bass_energy() — Per-frame energy calculation
core/dtw.py — Structural DTW and section-level comparison
- fast_dtw() — Fast approximate DTW in linear time/space
- compare_song_structures() — Structural similarity using self-similarity matrices
- batch_compare_structures() — Candidate ranking with LB-Keogh pruning
core/pattern_matching.py — Temporal pattern matching (core)
- dtw_distance() — Dynamic Time Warping alignment
- cross_correlate_patterns() — Pattern shift detection
- frame_wise_similarity() — Per-frame comparison
- detect_pattern_matches() — Find matching segments
- match_bass_patterns() — Main comparison function
core/similarity.py — Classical similarity metrics
- cosine_similarity() — Cosine similarity
- euclidean_distance() — L2 distance
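For reference, the two classical metrics reduce to a couple of numpy lines. These are illustrative versions, not necessarily the module's exact code:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Plain L2 distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
cos = cosine_similarity(a, b)   # direction matches, so close to 1.0
dist = euclidean_distance(a, b) # positive, because the magnitudes differ
```

Cosine similarity ignores overall loudness (only direction matters), while euclidean distance is sensitive to it; that difference is why both are kept around.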
web/api.py — REST API endpoints
- /api/compare — File upload and comparison
API Response Notes
The /api/compare response includes:
- similarity.overall_similarity: final score in [0, 1]
- similarity.frame_similarity: per-frame alignment similarity on Song 1's timeline
- similarity.matched_segments: contiguous high-similarity regions with:
  - start_frame, end_frame, length_frames, mean_similarity
  - start_time_s, end_time_s (already converted on the backend)

Important:
- start_frame/end_frame are indexed on Song 1's frame axis.
- UI timing should use start_time_s and end_time_s instead of re-deriving time from downsampled arrays.
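Put together, a response might look like the sketch below. All numbers are hypothetical and the exact nesting may differ from the real payload; the time fields follow from hop_length=512 at 22050 Hz, where one frame ≈ 23 ms:

```python
# Hypothetical /api/compare payload shape; values are illustrative only.
example_response = {
    "similarity": {
        "overall_similarity": 0.4125,                 # final score in [0, 1]
        "frame_similarity": [0.31, 0.42, 0.88, 0.86, 0.35],
        "matched_segments": [
            {
                "start_frame": 2,                     # indexed on Song 1's frame axis
                "end_frame": 3,
                "length_frames": 2,
                "mean_similarity": 0.87,
                "start_time_s": 2 * 512 / 22050,      # converted on the backend
                "end_time_s": 3 * 512 / 22050,
            }
        ],
    }
}
```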
Understanding Results
Example Results
A score of 41.25% with matched segments means:
- The bass is somewhat different overall (41%)
- But one section matches strongly (86%)
- This suggests the songs share a bass phrase/pattern but differ overall
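Because the overall score is the average of the per-frame scores (Step 4 above), a strong local match can coexist with a modest overall number. A toy illustration with made-up frame scores:

```python
import numpy as np

# mostly-dissimilar track with one strongly matching stretch (frames 3-5)
frame_scores = np.array([0.2, 0.25, 0.3, 0.86, 0.88, 0.84, 0.3, 0.25, 0.2, 0.05])
overall = frame_scores.mean()  # the 0.86-0.88 stretch only lifts the mean a little
```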
Visualizations Explained
- Bass Spectrogram: Heatmap showing frequency loudness over time (brighter = more energy)
- Alignment Graph: Shows which moments have similar bass. Shaded area highlights matched regions.
- Matched Segments: List of time ranges where bass patterns align strongly
Design Decisions
Why DTW instead of sliding window?
- Bass patterns can stretch/compress (different tempos)
- DTW handles elastic alignment gracefully
Why only bass (20–250 Hz)?
- Most distinctive and robust for pattern recognition
- Less affected by high-frequency noise
- Can extend to other frequency bands later
Why frame-by-frame similarity?
- Shows WHERE patterns match (temporal localization)
- More informative than a single number
- Helps identify specific matching sections
How do we prevent false "100% full-song" matches?
- Frame matching is constrained to tempo-aligned local neighborhoods (not global any-to-any frame search).
- Local windows use mean-centered correlation to compare shape changes and avoid saturation on positive-only envelopes.
- Segment post-filtering suppresses suspicious near-perfect full-song segments when global evidence does not support them.
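The mean-centering trick can be sketched as follows; this is an illustrative stand-in for the project's actual windowed comparison:

```python
import numpy as np

def centered_correlation(win_a, win_b):
    """Pearson-style correlation of two local windows.

    Mean-centering compares the *shape* of the energy change, so two
    positive-only envelopes don't trivially score near 1.0 just for both
    being loud.
    """
    a = win_a - win_a.mean()
    b = win_b - win_b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # a flat window carries no shape information
    return float(np.dot(a, b) / denom)

flat = np.array([5.0, 5.0, 5.0, 5.0])    # loud but shapeless
rising = np.array([1.0, 2.0, 3.0, 4.0])  # clear upward shape
```

Without centering, `flat` and `rising` would correlate strongly (both all-positive); with centering, the flat window scores 0 and only matching shapes score high.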
Troubleshooting
If results look suspicious:
- matched_segments = 0 but a high similarity curve:
  - Hard-refresh the browser (to avoid stale cached JS), then rerun.
  - Verify the backend was restarted after code changes (bash run.sh).
- Similarity appears near 1.0 everywhere:
  - Use songs with clearly different rhythm structures as a sanity check.
  - Confirm you're on the latest code, where frame similarity uses mean-centered windows.
- Processing is too slow:
  - Reduce audio duration (e.g. first 30-60 seconds).
  - Lower the FFT size to n_fft=2048 for faster processing.
Performance Tips
Faster processing:
- Reduce sample rate: sr=11025 (instead of 22050)
- Reduce duration: duration=15 (instead of the full track)
- Smaller FFT: n_fft=2048 (instead of 4096)

Better accuracy:
- Increase sample rate: sr=44100
- Increase FFT: n_fft=8192
- Process the full track duration
References
- McFee et al., librosa: Audio and music signal analysis in Python (SciPy 2015)
  - https://librosa.org/
- Bogdanov et al., Essentia (open-source MIR feature extraction)
  - https://essentia.upf.edu/
- Müller, Fundamentals of Music Processing (MIR reference textbook)
- Lerch, An Introduction to Audio Content Analysis
  - https://www.audiocontentanalysis.org/
- ISMIR community (music information retrieval conference)
  - https://ismir.net/
- Choi et al., Automatic tagging using deep convolutional neural networks (ISMIR 2016)
  - https://arxiv.org/abs/1606.00298
- Wyse, Audio Spectrogram Representations for Processing with CNNs (2017)
  - https://arxiv.org/abs/1706.09559
Research Papers on Core Techniques
Dynamic Time Warping (DTW)
- Sakoe, H., & Chiba, S., Dynamic programming algorithm optimization for spoken word recognition (IEEE TASSP, 1978)
  - https://doi.org/10.1109/TASSP.1978.1163055
  - Foundational DTW algorithm for pattern matching and temporal alignment
- Müller, M.,
