# SongGeneration
The official code repository for *LeVo: High-Quality Song Generation with Multi-Preference Alignment*.
<p align="center"><img src="img/logo.jpg" width="40%"></p>

## SongGeneration 2
🚀 We introduce LeVo 2 (SongGeneration 2), an open-source music foundation model designed to shatter the ceiling of open-source AI music by achieving true commercial-grade generation.
Through a large-scale, rigorous expert evaluation (20 industry professionals, 6 core dimensions, 100 songs per model), LeVo 2 (SongGeneration 2) has proven its superiority:
- 🏆 Commercial-Grade Musicality: Comprehensively outperforms all open-source baselines across Overall Quality, Melody, Arrangement, Sound Quality, and Structure. Its subjective generation quality successfully rivals top-tier closed-source commercial systems (e.g., MiniMax 2.5).
- 🎯 Precise Lyric Accuracy: Achieves an outstanding Phoneme Error Rate (PER) of 8.55%, effectively solving the lyrical hallucination problem. This remarkable accuracy significantly outperforms top commercial models like Suno v5 (12.4%) and Mureka v8 (9.96%).
- 🎛️ Exceptional Controllability: Highly responsive to multi-modal instructions, including text descriptions and audio prompts, allowing for precise control over the generated music.
📊 For detailed experimental setups and comprehensive metrics, please refer to the Evaluation Performance section below or our upcoming technical report.
📢 All the experimental results above are based on the latest checkpoint released on March 9th. If you downloaded the weights before March 9th, please re-download the latest checkpoint.
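The Phoneme Error Rate cited above is the phoneme-level edit distance between the intended lyrics and what is actually sung, divided by the number of reference phonemes. A minimal sketch of the metric (illustrative only; it assumes both sides have already been transcribed into phoneme sequences, and the function name is ours, not part of this repo):

```python
def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + insertions + deletions) / len(ref),
    computed as Levenshtein distance over phoneme sequences."""
    # d[j] holds the edit distance between ref[:i] and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # dist(ref[:i-1], hyp[:j-1]) for the inner loop
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,           # deletion of r
                       d[j - 1] + 1,       # insertion of h
                       prev + (r != h))    # substitution (or match)
            prev = cur
    return d[-1] / len(ref)
```

For example, one wrong phoneme out of four yields a PER of 0.25 (25%).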
## News and Updates
- 2026.03.01 🚀: We proudly introduce SongGeneration 2! We have officially open-sourced the SongGeneration-v2-large (4B parameters) model. It achieves commercial-grade music generation with an outstanding PER of 8.55% and supports multi-lingual lyrics. Please update to the newest code to ensure optimal performance and user experience. We also launch the SongGeneration-v2-Fast version on Hugging Face Space! You can now generate a complete song in under 1 minute, trading a slight loss in musicality for significantly faster generation speed.
- 2025.10.16 🔥: Our Demo webpage now supports full-length song generation (up to 4m30s)! 🎶 Experience end-to-end music generation with vocals and accompaniment — try it out now!
- 2025.10.15 🔥: We have updated the codebase to improve inference speed and generation quality, and adapted it to the latest model version. Please update to the newest code to ensure the best performance and user experience.
- 2025.10.14 🔥: We have released the large model (SongGeneration-large).
- 2025.10.13 🔥: We have released the full-length model (SongGeneration-base-full) and evaluation performance.
- 2025.10.12 🔥: We have released the English-enhanced model (SongGeneration-base-new).
- 2025.09.23 🔥: We have released the Data Processing Pipeline, which is capable of analyzing the structure and lyrics of entire songs and providing precise timestamps without the need for additional source separation. On the human-annotated test set SSLD-200, the model’s performance outperforms mainstream models including Gemini-2.5, Seed-ASR, and Qwen3-ASR.
- 2025.07.25 🔥: SongGeneration can now run with as little as 10GB of GPU memory.
- 2025.07.18 🔥: SongGeneration now supports generation of pure music, pure vocals, and dual-track (vocals + accompaniment separately) outputs.
- 2025.06.16 🔥: We have released the SongGeneration series.
## TODOs 📋
- [ ] Release the Automated Music Aesthetic Evaluation Framework.
- [ ] Release finetuning scripts.
- [ ] Release Music Codec and VAE.
- [ ] Release SongGeneration-v2-fast.
- [ ] Release SongGeneration-v2-medium.
- [x] Release SongGeneration-v2-large.
- [x] Release large model.
- [x] Release full-length model.
- [x] Release English enhanced model.
- [x] Release data processing pipeline.
- [x] Update Low memory usage model.
- [x] Support single vocal/bgm track generation.
## Model Versions
| Model | Max Length | Language | GPU Memory | RTF (H20) | Download Link |
| ------------------------ | :--------: | :------------------: | :--------: | :------: | ------------- |
| SongGeneration-base | 2m30s | zh | 10G/16G | 0.67 | Huggingface |
| SongGeneration-base-new | 2m30s | zh, en | 10G/16G | 0.67 | Huggingface |
| SongGeneration-base-full | 4m30s | zh, en | 12G/18G | 0.69 | Huggingface |
| SongGeneration-large | 4m30s | zh, en | 22G/28G | 0.82 | Huggingface |
| SongGeneration-v2-large | 4m30s | zh, en, es, ja, etc. | 22G/28G | 0.82 | Huggingface |
| SongGeneration-v2-medium | 4m30s | zh, en, es, ja, etc. | 12G/18G | 0.69 | Coming soon |
| SongGeneration-v2-fast | 4m30s | zh, en, es, ja, etc. | - | - | Coming soon |
💡 Notes:
- GPU Memory — “X / Y” means X: no prompt audio; Y: with prompt audio.
- RTF — Real Time Factor (pure inference, excluding model loading).
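The RTF column relates pure inference time to the duration of the generated audio; values below 1 mean the model generates faster than real time. A trivial sketch (the helper name is ours, not part of this codebase):

```python
def real_time_factor(inference_seconds, audio_seconds):
    """RTF = wall-clock inference time / generated audio duration.
    RTF < 1 means generation is faster than real time."""
    return inference_seconds / audio_seconds

# e.g. at RTF 0.82 (SongGeneration-large on H20), a 4m30s (270 s) song
# takes roughly 270 * 0.82 = 221.4 s of pure inference.
```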
## Overview
<img src="img/over.jpg" alt="img" style="zoom:100%;" />

To shatter the ceiling of open-source AI music and achieve commercial-grade generation, SongGeneration 2 introduces a paradigm shift in both its underlying architecture and training strategy.
- **Model Architecture: Hybrid LLM-Diffusion Architecture & Hierarchical Language Model**

  SongGeneration 2 adopts a hybrid LLM-Diffusion architecture to balance musicality and sound quality:
  - LeLM (the "Composer Brain"): a language model that manages the global musical structure and performance details.
  - Diffusion (the "Hi-Fi Renderer"): guided by the language model, it synthesizes complex acoustic details for high-fidelity audio.
  - Hierarchical Language Model: a hierarchical language model jointly models Mixed Tokens (capturing high-level semantics such as melody and structure) and Dual-Track Tokens (modeling the vocal and accompaniment tracks in parallel for fine-grained acoustic detail).

- **Training Strategy: Automated Aesthetic Evaluation & Multi-stage Progressive Post-Training**

  To resolve lyrical hallucinations and stiff musicality, we use a highly structured training pipeline:
  - Automated Aesthetic Evaluation Framework: a fine-grained evaluation framework trained on a large expert-annotated dataset to provide the model with musicality priors.
  - Multi-stage Progressive Post-training: a 3-stage alignment process:
    - Stage 1 - SFT: narrows the data distribution using high-quality songs to build a solid generation baseline.
    - Stage 2 - Large-scale Offline DPO: uses ~200k strict positive/negative pairs to eliminate lyrical hallucinations and stabilize controllability.
    - Stage 3 - Semi-online DPO: periodically updates the model based on aesthetic scores to push musicality to its limits.
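The DPO stages above optimize a preference objective over positive/negative song pairs. A minimal sketch of the standard DPO loss for one pair (illustrative only; the actual training code is not released here, and the `beta` value and function signature are assumptions):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_* : policy log-probabilities of the chosen/rejected samples
    ref_*  : frozen reference-model log-probabilities of the same samples
    beta   : temperature on the implicit reward (assumed value)
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference model) prefers the chosen sample over the rejected one.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Negative log-sigmoid of the margin; minimized as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a zero margin the loss is log 2; widening the preference margin drives it toward zero, which is what pushes the model away from hallucinated lyrics in the negative samples.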
## Installation

### Start from scratch
You can install the necessary dependencies using the requirements files with Python >= 3.8.12 and CUDA >= 11.8:

```bash
pip install -r requirements.txt
pip install -r requirements_nodeps.txt --no-deps
```
(Optional) Then install flash-attention from its GitHub releases. For example, if you're using Python 3.10, PyTorch 2.6, and CUDA 12.x:

```bash
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
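The wheel filename encodes the environment it targets (CUDA major version, torch version, C++ ABI, and Python ABI tag). A small illustrative helper, not part of this repo, that assembles the expected filename from those tags so you can pick the matching release asset:

```python
def flash_attn_wheel_name(fa_version, cuda_major, torch_ver, py_tag, abi=False):
    """Assemble the expected flash-attn Linux x86_64 wheel filename.

    fa_version : flash-attn release, e.g. "2.7.4.post1"
    cuda_major : CUDA major version torch was built with, e.g. 12
    torch_ver  : torch major.minor, e.g. "2.6"
    py_tag     : CPython ABI tag, e.g. "cp310" for Python 3.10
    abi        : whether torch was built with the C++11 ABI
    """
    abi_tag = "TRUE" if abi else "FALSE"
    return (f"flash_attn-{fa_version}+cu{cuda_major}torch{torch_ver}"
            f"cxx11abi{abi_tag}-{py_tag}-{py_tag}-linux_x86_64.whl")
```

Always check the flash-attention release page for the files actually published; not every tag combination exists.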
### Start with docker

```bash
docker pull juhayna/song-generation-levo:hf0613
docker run -it --gpus all --network=host juhayna/song-generation-levo:hf0613 /bin/bash
```
## Inference

To ensure the model runs correctly, please download all the required folders from the original source at Hugging Face.

- Download the `ckpt` and `third_party` folders from Hugging Face 1 or Hugging Face 2, and move them into the root directory of the project. You can also download the models using `huggingface-cli`:

  ```bash
  huggingface-cli download lglg666/SongGeneration-Runtime --local-dir ./runtime
  mv runtime/ckpt
  ```