# SongGeneration
The official code repository for *LeVo: High-Quality Song Generation with Multi-Preference Alignment*.
<p align="center"><img src="img/logo.jpg" width="40%"></p>

## SongGeneration 2
🚀 We introduce LeVo 2 (SongGeneration 2), an open-source music foundation model designed to shatter the ceiling of open-source AI music by achieving true commercial-grade generation.
Through a large-scale, rigorous expert evaluation (20 industry professionals, 6 core dimensions, 100 songs per model), LeVo 2 (SongGeneration 2) has proven its superiority:
- 🏆 Commercial-Grade Musicality: Comprehensively outperforms all open-source baselines across Overall Quality, Melody, Arrangement, Sound Quality, and Structure. Its subjective generation quality successfully rivals top-tier closed-source commercial systems (e.g., MiniMax 2.5).
- 🎯 Precise Lyric Accuracy: Achieves an outstanding Phoneme Error Rate (PER) of 8.55%, effectively solving the lyrical hallucination problem. This remarkable accuracy significantly outperforms top commercial models like Suno v5 (12.4%) and Mureka v8 (9.96%).
- 🎛️ Exceptional Controllability: Highly responsive to multi-modal instructions, including text descriptions and audio prompts, allowing for precise control over the generated music.
📊 For detailed experimental setups and comprehensive metrics, please refer to the Evaluation Performance section below or our upcoming technical report.
📢 All the experimental results above are based on the latest checkpoint released on March 9th. If you downloaded the weights before March 9th, please re-download the latest checkpoint.
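The Phoneme Error Rate cited above is the phoneme-level edit distance between the intended lyrics and what is actually sung, divided by the number of reference phonemes. A minimal sketch of the metric (illustrative only; it assumes both sides have already been transcribed into phoneme sequences, and the function name is ours, not part of this repo):

```python
def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + insertions + deletions) / len(ref),
    computed as Levenshtein distance over phoneme sequences."""
    # d[j] holds the edit distance between ref[:i] and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # dist(ref[:i-1], hyp[:j-1]) for the inner loop
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,           # deletion of r
                       d[j - 1] + 1,       # insertion of h
                       prev + (r != h))    # substitution (or match)
            prev = cur
    return d[-1] / len(ref)
```

For example, one wrong phoneme out of four yields a PER of 0.25 (25%).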
## News and Updates
- 2026.03.01 🚀: We proudly introduce SongGeneration 2! We have officially open-sourced the SongGeneration-v2-large (4B parameters) model. It achieves commercial-grade music generation with an outstanding PER of 8.55% and supports multi-lingual lyrics. Please update to the newest code to ensure optimal performance and user experience. We also launch the SongGeneration-v2-Fast version on Hugging Face Space! You can now generate a complete song in under 1 minute, trading a slight loss in musicality for significantly faster generation speed.
- 2025.10.16 🔥: Our Demo webpage now supports full-length song generation (up to 4m30s)! 🎶 Experience end-to-end music generation with vocals and accompaniment — try it out now!
- 2025.10.15 🔥: We have updated the codebase to improve inference speed and generation quality, and adapted it to the latest model version. Please update to the newest code to ensure the best performance and user experience.
- 2025.10.14 🔥: We have released the large model (SongGeneration-large).
- 2025.10.13 🔥: We have released the full-length model (SongGeneration-base-full) and evaluation performance.
- 2025.10.12 🔥: We have released the English-enhanced model (SongGeneration-base-new).
- 2025.09.23 🔥: We have released the Data Processing Pipeline, which is capable of analyzing the structure and lyrics of entire songs and providing precise timestamps without the need for additional source separation. On the human-annotated test set SSLD-200, the model’s performance outperforms mainstream models including Gemini-2.5, Seed-ASR, and Qwen3-ASR.
- 2025.07.25 🔥: SongGeneration can now run with as little as 10GB of GPU memory.
- 2025.07.18 🔥: SongGeneration now supports generation of pure music, pure vocals, and dual-track (vocals + accompaniment separately) outputs.
- 2025.06.16 🔥: We have released the SongGeneration series.
## TODOs 📋
- [ ] Release the Automated Music Aesthetic Evaluation Framework.
- [ ] Release finetuning scripts.
- [ ] Release Music Codec and VAE.
- [ ] Release SongGeneration-v2-fast.
- [ ] Release SongGeneration-v2-medium.
- [x] Release SongGeneration-v2-large.
- [x] Release large model.
- [x] Release full-length model.
- [x] Release English enhanced model.
- [x] Release data processing pipeline.
- [x] Update Low memory usage model.
- [x] Support single vocal/bgm track generation.
## Model Versions
| Model | Max Length | Language | GPU Memory | RTF (H20) | Download Link |
| ------------------------ | :--------: | :------------------: | :--------: | :------: | ------------- |
| SongGeneration-base | 2m30s | zh | 10G/16G | 0.67 | Huggingface |
| SongGeneration-base-new | 2m30s | zh, en | 10G/16G | 0.67 | Huggingface |
| SongGeneration-base-full | 4m30s | zh, en | 12G/18G | 0.69 | Huggingface |
| SongGeneration-large | 4m30s | zh, en | 22G/28G | 0.82 | Huggingface |
| SongGeneration-v2-large | 4m30s | zh, en, es, ja, etc. | 22G/28G | 0.82 | Huggingface |
| SongGeneration-v2-medium | 4m30s | zh, en, es, ja, etc. | 12G/18G | 0.69 | Coming soon |
| SongGeneration-v2-fast | 4m30s | zh, en, es, ja, etc. | - | - | Coming soon |
💡 Notes:
- GPU Memory — “X / Y” means X: no prompt audio; Y: with prompt audio.
- RTF — Real Time Factor (pure inference, excluding model loading).
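The RTF column relates pure inference time to the duration of the generated audio; values below 1 mean the model generates faster than real time. A trivial sketch (the helper name is ours, not part of this codebase):

```python
def real_time_factor(inference_seconds, audio_seconds):
    """RTF = wall-clock inference time / generated audio duration.
    RTF < 1 means generation is faster than real time."""
    return inference_seconds / audio_seconds

# e.g. at RTF 0.82 (SongGeneration-large on H20), a 4m30s (270 s) song
# takes roughly 270 * 0.82 = 221.4 s of pure inference.
```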
## Overview
<img src="img/over.jpg" alt="img" style="zoom:100%;" />

To shatter the ceiling of open-source AI music and achieve commercial-grade generation, SongGeneration 2 introduces a paradigm shift in both its underlying architecture and training strategy.
- **Model Architecture: Hybrid LLM-Diffusion Architecture & Hierarchical Language Model**

  SongGeneration 2 adopts a hybrid LLM-Diffusion architecture to balance musicality and sound quality:
  - LeLM (the "Composer Brain"): a language model that manages the global musical structure and performance details.
  - Diffusion (the "Hi-Fi Renderer"): guided by the language model, it synthesizes complex acoustic details for high-fidelity audio.
  - Hierarchical Language Model: a hierarchical language model jointly models Mixed Tokens (capturing high-level semantics such as melody and structure) and Dual-Track Tokens (modeling the vocal and accompaniment tracks in parallel for fine-grained acoustic detail).

- **Training Strategy: Automated Aesthetic Evaluation & Multi-stage Progressive Post-Training**

  To resolve lyrical hallucinations and stiff musicality, we use a highly structured training pipeline:
  - Automated Aesthetic Evaluation Framework: a fine-grained evaluation framework trained on a large expert-annotated dataset to provide the model with musicality priors.
  - Multi-stage Progressive Post-training: a 3-stage alignment process:
    - Stage 1 - SFT: narrows the data distribution using high-quality songs to build a solid generation baseline.
    - Stage 2 - Large-scale Offline DPO: uses ~200k strict positive/negative pairs to eliminate lyrical hallucinations and stabilize controllability.
    - Stage 3 - Semi-online DPO: periodically updates the model based on aesthetic scores to push musicality to its limits.
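The DPO stages above optimize a preference objective over positive/negative song pairs. A minimal sketch of the standard DPO loss for one pair (illustrative only; the actual training code is not released here, and the `beta` value and function signature are assumptions):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_* : policy log-probabilities of the chosen/rejected samples
    ref_*  : frozen reference-model log-probabilities of the same samples
    beta   : temperature on the implicit reward (assumed value)
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference model) prefers the chosen sample over the rejected one.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Negative log-sigmoid of the margin; minimized as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a zero margin the loss is log 2; widening the preference margin drives it toward zero, which is what pushes the model away from hallucinated lyrics in the negative samples.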
## Installation

### Start from scratch
You can install the necessary dependencies using the requirements files with Python >= 3.8.12 and CUDA >= 11.8:

```bash
pip install -r requirements.txt
pip install -r requirements_nodeps.txt --no-deps
```
(Optional) Then install flash-attention from its GitHub releases. For example, if you're using Python 3.10, PyTorch 2.6, and CUDA 12.x:

```bash
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
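The wheel filename encodes the environment it targets (CUDA major version, torch version, C++ ABI, and Python ABI tag). A small illustrative helper, not part of this repo, that assembles the expected filename from those tags so you can pick the matching release asset:

```python
def flash_attn_wheel_name(fa_version, cuda_major, torch_ver, py_tag, abi=False):
    """Assemble the expected flash-attn Linux x86_64 wheel filename.

    fa_version : flash-attn release, e.g. "2.7.4.post1"
    cuda_major : CUDA major version torch was built with, e.g. 12
    torch_ver  : torch major.minor, e.g. "2.6"
    py_tag     : CPython ABI tag, e.g. "cp310" for Python 3.10
    abi        : whether torch was built with the C++11 ABI
    """
    abi_tag = "TRUE" if abi else "FALSE"
    return (f"flash_attn-{fa_version}+cu{cuda_major}torch{torch_ver}"
            f"cxx11abi{abi_tag}-{py_tag}-{py_tag}-linux_x86_64.whl")
```

Always check the flash-attention release page for the files actually published; not every tag combination exists.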
### Start with docker

```bash
docker pull juhayna/song-generation-levo:hf0613
docker run -it --gpus all --network=host juhayna/song-generation-levo:hf0613 /bin/bash
```
## Inference

To ensure the model runs correctly, please download all the required folders from the original source at Hugging Face.

- Download the `ckpt` and `third_party` folders from Hugging Face 1 or Hugging Face 2, and move them into the root directory of the project. You can also download the models using `huggingface-cli`:

  ```bash
  huggingface-cli download lglg666/SongGeneration-Runtime --local-dir ./runtime
  mv runtime/ckpt
  ```