
<div align="center"> <img src="https://declare-lab.github.io/jamify-logo-new.png" width="200"/> <br/> <h1>JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment</h1> <br/>


</div>

JAM is a rectified flow-based model for lyrics-to-song generation that addresses the lack of fine-grained word-level controllability in existing lyrics-to-song models. Built on a compact 530M-parameter architecture with 16 LLaMA-style Transformer layers as the Diffusion Transformer (DiT) backbone, JAM enables precise vocal control that musicians desire in their workflows. Unlike previous models, JAM provides word and phoneme-level timing control, allowing musicians to specify the exact placement of each vocal sound for improved rhythmic flexibility and expressive timing.

News

📣 05/08/25: Training code has been released! You can now train your own JAM models from scratch.

📣 29/07/25: We have released JAM-0.5, the first version of the AI song generator from Project Jamify!

Features

cover-photo

  • Fine-grained Word and Phoneme-level Timing Control: The first model to provide word-level timing and duration control in song generation, enabling precise prosody control for musicians
  • Compact 530M Parameter Architecture: Less than half the size of existing models, enabling faster inference with reduced resource requirements
  • Enhanced Lyric Fidelity: Achieves over 3× reduction in Word Error Rate (WER) and Phoneme Error Rate (PER) compared to prior work through precise phoneme boundary attention
  • Global Duration Control: Controllable duration up to 3 minutes and 50 seconds.
  • Aesthetic Alignment through Direct Preference Optimization: Iterative refinement using synthetic preference datasets to better align with human aesthetic preferences, eliminating manual annotation requirements

The Pipeline

cover-photo

JAM Samples

Check out the example generated music in the generated_examples/ folder to hear what JAM can produce:

  • Hybrid Minds, Brodie - Heroin.mp3 - Electronic music with synthesized beats and electronic elements
  • Jade Bird - Avalanche.mp3 - Country music with acoustic guitar and folk influences
  • Rizzle Kicks, Rachel Chinouriri - Follow Excitement!.mp3 - Rap music with rhythmic beats and hip-hop style

These samples demonstrate JAM's ability to generate high-quality music across different genres while maintaining vocal intelligibility, style consistency, and musical coherence.

Requirements

  • Python 3.10 or higher
  • CUDA-compatible GPU with sufficient VRAM (8GB+ recommended)

Installation

1. Clone the Repository

git clone https://github.com/declare-lab/jamify
cd jamify

2. Run Installation Script

The project includes an automated installation script; run it inside your own virtual environment:

bash install.sh

This script will:

  • Initialize and update git submodules (DeepPhonemizer)
  • Install Python dependencies from requirements.txt
  • Install the JAM package in editable mode
  • Install the DeepPhonemizer external dependency

3. Manual Installation (Alternative)

If you prefer manual installation:

# Initialize submodules
git submodule update --init --recursive

# Install dependencies
pip install -r requirements.txt

# Install JAM package
pip install -e .

# Install DeepPhonemizer
pip install -e externals/DeepPhonemizer

Quick Start

Simple Inference

The easiest way to run inference is using the provided inference.py script:

python inference.py

This script will:

  1. Download the pre-trained JAM-0.5 model from Hugging Face
  2. Run inference with default settings
  3. Save generated audio to the outputs directory

Input Format

Create an input file at inputs/input.json with your songs:

[
  {
    "id": "my_song",
    "audio_path": "inputs/reference_audio.mp3",
    "lrc_path": "inputs/lyrics.json", 
    "duration": 180.0,
    "prompt_path": "inputs/style_prompt.txt"
  }
]

Required files:

  • Audio file: Reference audio for style extraction
  • Lyrics file: JSON with timestamped lyrics
  • Prompt file: Text description of the desired style/genre. The text prompt is ignored in the default setting, where the audio reference is used instead.
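As a concrete illustration, the manifest and its companion files can be generated programmatically. The sketch below is not part of the JAM codebase; the paths and field names follow the examples in this README, while the lyric timings, prompt text, and the helper name `write_song_inputs` are placeholder assumptions:

```python
import json
from pathlib import Path

# Hypothetical helper: writes the input manifest plus the lyrics and
# style-prompt files that it references. Paths mirror the README examples.
def write_song_inputs(root: str, song_id: str, duration: float) -> dict:
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)

    # Placeholder word-level lyrics; start/end are timestamps in seconds.
    lyrics = [
        {"start": 2.2, "end": 2.5, "word": "First"},
        {"start": 2.5, "end": 3.7, "word": "word"},
    ]
    lrc_path = base / "lyrics.json"
    lrc_path.write_text(json.dumps(lyrics, indent=2))

    # Text description of the desired style (ignored when an audio
    # reference is supplied, per the note above).
    prompt_path = base / "style_prompt.txt"
    prompt_path.write_text("Electronic dance music with heavy bass\n")

    entry = {
        "id": song_id,
        "audio_path": str(base / "reference_audio.mp3"),
        "lrc_path": str(lrc_path),
        "duration": duration,
        "prompt_path": str(prompt_path),
    }
    (base / "input.json").write_text(json.dumps([entry], indent=2))
    return entry

entry = write_song_inputs("inputs", "my_song", 180.0)
```

Note that `audio_path` points at a reference recording you must supply yourself; the helper only writes the manifest entry for it.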

Inference

Using python -m jam.infer

For more control over the generation process:

# Basic usage with custom checkpoint
python -m jam.infer evaluation.checkpoint_path=path/to/model.safetensors

# With custom output directory
python -m jam.infer evaluation.checkpoint_path=path/to/model.safetensors evaluation.output_dir=my_outputs

# With custom configuration file
python -m jam.infer config=configs/my_config.yaml evaluation.checkpoint_path=path/to/model.safetensors

Multi-GPU Inference

Use Accelerate for distributed inference:

# Basic usage with the default config
accelerate launch -m jam.infer config=configs/jam_infer.yaml

# With a custom checkpoint
accelerate launch -m jam.infer config=configs/jam_infer.yaml evaluation.checkpoint_path=path/to/model.safetensors

Configuration Options

Evaluation Settings

  • evaluation.checkpoint_path: Path to model checkpoint (required)
  • evaluation.output_dir: Output directory (default: "outputs")
  • evaluation.test_set_path: Input JSON file (default: "inputs/input.json")
  • evaluation.batch_size: Batch size for inference (default: 1)
  • evaluation.num_samples: Only generate first n samples in test_set_path (null = all)
  • evaluation.vae_type: VAE model type ("diffrhythm" or "stable_audio")
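The dotted `key=value` overrides above suggest an OmegaConf/Hydra-style configuration; assuming that, an equivalent standalone config file might look like the fragment below (key names come from the list above, values are illustrative, not defaults):

```yaml
# Illustrative config fragment; values are examples only.
evaluation:
  checkpoint_path: path/to/model.safetensors
  output_dir: my_outputs
  test_set_path: inputs/input.json
  batch_size: 1
  num_samples: null        # null = generate all samples
  vae_type: diffrhythm     # or stable_audio
```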

Style Control

  • evaluation.ignore_style: Ignore style prompts (default: false)
  • evaluation.use_prompt_style: Use text prompts for style (default: false)
  • evaluation.num_style_secs: Style audio duration in seconds (default: 30)
  • evaluation.random_crop_style: Randomly crop style audio (default: false)

Input File Formats

Lyrics File (*.json)

[
    {"start": 2.2, "end": 2.5, "word": "First"},
    {"start": 2.5, "end": 3.7, "word": "word"},
    ...
]
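Because these timestamps drive word-level alignment, it can be worth sanity-checking a lyrics file before inference. The validator below is a minimal sketch, not part of the JAM codebase; it assumes each entry carries the `start`, `end`, and `word` keys shown above:

```python
import json

def validate_lyrics(entries):
    """Check that word timings are well-formed and non-overlapping."""
    for i, e in enumerate(entries):
        assert {"start", "end", "word"} <= e.keys(), f"entry {i}: missing keys"
        assert e["start"] < e["end"], f"entry {i}: start must precede end"
        if i > 0:
            # Consecutive words may touch but must not overlap.
            assert entries[i - 1]["end"] <= e["start"], f"entry {i}: overlap"
    return True

lyrics = json.loads('[{"start": 2.2, "end": 2.5, "word": "First"},'
                    ' {"start": 2.5, "end": 3.7, "word": "word"}]')
ok = validate_lyrics(lyrics)
```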

Style Prompt File (*.txt)

Electronic dance music with heavy bass and synthesizers

Input Manifest (input.json)

[
  {
    "id": "unique_song_id",
    "audio_path": "path/to/reference.mp3",
    "lrc_path": "path/to/lyrics.json",
    "duration": 180.0,
    "prompt_path": "path/to/style.txt"
  }
]

Output Structure

Generated files are saved to the output directory:

outputs/
├── generated/          # Final trimmed audio files
├── generated_orig/     # Original generated audio
├── cfm_latents/       # Intermediate latent representations
├── local_files/       # Process-specific metadata
└── generation_config.yaml  # Configuration used for generation

Training

JAM supports three training stages: pretraining, supervised fine-tuning (SFT), and direct preference optimization (DPO). Each stage requires specific data formats and training commands.

Data Formats

WebDataset Format (Pretrain & SFT)

For pretraining and SFT, JAM expects data in WebDataset format - tar files containing:

your_dataset-000000.tar
├── song_id_1/
│   ├── latent.pt      # Audio latent representation (torch tensor)
│   ├── style.pt       # MuQ style embedding (torch tensor, 512-dim or n*512-dim)  
│   └── json           # Metadata with phoneme information
├── song_id_2/
│   ├── latent.pt
│   ├── style.pt
│   └── json
└── ...
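The layout above can be produced with the standard-library tarfile module. In the real pipeline `latent.pt` and `style.pt` are torch-serialized tensors; the sketch below substitutes placeholder bytes (a labeled assumption) so it stays dependency-free, and the metadata follows the JSON structure described in this section:

```python
import io
import json
import tarfile

def add_bytes(tar, name, payload: bytes):
    """Add an in-memory payload to the tar under the given member name."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Placeholder payloads: in practice these are torch.save()-serialized tensors.
meta = {"word": [{"start": 2.2, "end": 2.5, "phoneme": "ˈfɜrst"}]}
with tarfile.open("your_dataset-000000.tar", "w") as tar:
    add_bytes(tar, "song_id_1/latent.pt", b"placeholder latent tensor")
    add_bytes(tar, "song_id_1/style.pt", b"placeholder style embedding")
    add_bytes(tar, "song_id_1/json", json.dumps(meta).encode("utf-8"))

# Re-open the shard to confirm the expected member layout.
with tarfile.open("your_dataset-000000.tar") as tar:
    names = tar.getnames()
```

Sharded naming (`-000000`, `-000001`, ...) follows the WebDataset convention of splitting a dataset across many tar files for streaming.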

JSON Structure:

{
  "word": [
    {"start": 2.2, "end": 2.5, "phoneme": "ˈfɜrst"},
    {"start": 2.5, "end": 3.7, "phoneme": "wɜrd"},
    ...
  ]
}

ID List JSONL (Optional): When provided via id_list_jsonl, this enables advanced filtering and sampling.
