<div align="center"> <h2>Daifuku: A Sweet Way to Serve Multiple Text-to-Video Models</h2> </div> <div align="center"> <p></p> <img src="https://github.com/user-attachments/assets/ef233ca4-275a-4817-9042-60e53045821e" width="250" height="250" alt="Daifuku Logo" /> <p></p> <pre> ✅ Multi-model T2V ✅ GPU offload & BF16 ✅ Parallel batch processing ✅ Prometheus metrics ✅ Docker-based deployment ✅ Pydantic-based config ✅ S3 integration for MP4s ✅ Minimal code, easy to extend </pre> </div>

Table of Contents

  • Introduction
  • Quick Start
  • Usage Examples
  • Features
  • Prompt Engineering

Introduction

Daifuku is a versatile framework designed to serve multiple Text-to-Video (T2V) models (e.g., Mochi, LTX, and more). It streamlines T2V model deployment by providing:

  • A unified API for multiple models
  • Parallel batch processing
  • GPU optimizations for efficiency
  • Easy Docker-based deployment
  • Integrated monitoring, logging, and metrics

Inspired by daifuku mochi, a sweet stuffed treat, this framework is “stuffed” with multiple T2V capabilities and aims to make your video generation as sweet and satisfying as possible.


Quick Start

Installation

git clone https://github.com/VikramxD/Daifuku.git
cd Daifuku

# Create a virtual environment
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e . --no-build-isolation

Optional: Download Mochi weights for faster first use:

python scripts/download_weights.py

Note: LTX weights download automatically on first usage.

Running the Servers

Daifuku can serve models individually or combine them behind a single endpoint:

<details> <summary><strong>Mochi Server</strong></summary>
python api/mochi_serve.py
# Endpoint: http://127.0.0.1:8000/api/v1/video/mochi
</details> <details> <summary><strong>LTX Server</strong></summary>
python api/ltx_serve.py
# Endpoint: http://127.0.0.1:8000/api/v1/video/ltx
</details> <details> <summary><strong>Allegro Server</strong></summary>
python api/allegro_serve.py  # assuming the same naming pattern as the other server scripts
# Endpoint: http://127.0.0.1:8000/api/v1/video/allegro
</details> <details> <summary><strong>Combined Server</strong></summary>
python api/serve.py
# Endpoint: http://127.0.0.1:8000/predict
# Must supply "model_name" in the request payload.
</details>
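As a sketch of how a single request to the combined server might look from Python, the helper below assembles a payload with the required `model_name` field and posts it to the `/predict` endpoint noted above. The `build_payload`/`predict` helper names are illustrative, not part of Daifuku's API:

```python
import requests

def build_payload(model_name: str, prompt: str, **params) -> dict:
    """Assemble a combined-server payload; model_name selects the backend."""
    return {"model_name": model_name, "prompt": prompt, **params}

def predict(payload: dict) -> dict:
    """POST to the combined /predict endpoint (requires a running server)."""
    response = requests.post("http://127.0.0.1:8000/predict", json=payload)
    response.raise_for_status()
    return response.json()

payload = build_payload("mochi", "A calm ocean scene, sunrise, realistic",
                        num_inference_steps=40)
# result = predict(payload)  # uncomment with the combined server running
```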

Usage Examples

Single Prompt Requests

Mochi Model Example

import requests

url = "http://127.0.0.1:8000/api/v1/video/mochi"
payload = {
    "prompt": "A serene beach at dusk, gentle waves, dreamy pastel colors",
    "num_inference_steps": 40,
    "guidance_scale": 4.0,
    "height": 480,
    "width": 848,
    "num_frames": 120,
    "fps": 10
}

response = requests.post(url, json=payload)
print(response.json())

LTX Model Example

import requests

url = "http://127.0.0.1:8000/api/v1/video/ltx"
payload = {
    "prompt": "A cinematic scene of autumn leaves swirling around the forest floor",
    "negative_prompt": "blurry, worst quality",
    "num_inference_steps": 40,
    "guidance_scale": 3.0,
    "height": 480,
    "width": 704,
    "num_frames": 121,
    "frame_rate": 25
}

response = requests.post(url, json=payload)
print(response.json())

Allegro Model Example

import requests

url = "http://127.0.0.1:8000/api/v1/video/allegro"
payload = {
    "prompt": "A lively jazz band performing on a dimly lit stage, audience clapping",
    "num_inference_steps": 45,
    "guidance_scale": 4.5,
    "height": 720,
    "width": 1280,
    "num_frames": 150,
    "fps": 24
}

response = requests.post(url, json=payload)
print(response.json())

Batch Requests

Process multiple requests simultaneously with Daifuku’s parallel capabilities:

curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "batch": [
      {
        "model_name": "mochi",
        "prompt": "A calm ocean scene, sunrise, realistic",
        "num_inference_steps": 40
      },
      {
        "model_name": "ltx",
        "prompt": "A vintage film style shot of the Eiffel Tower",
        "height": 480,
        "width": 704
      }
    ]
  }'
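The same batch call can be made from Python. This mirrors the curl payload above, assuming the combined server is running on port 8000:

```python
import requests

# One entry per request; each entry names its target model.
batch_payload = {
    "batch": [
        {
            "model_name": "mochi",
            "prompt": "A calm ocean scene, sunrise, realistic",
            "num_inference_steps": 40,
        },
        {
            "model_name": "ltx",
            "prompt": "A vintage film style shot of the Eiffel Tower",
            "height": 480,
            "width": 704,
        },
    ]
}

# response = requests.post("http://127.0.0.1:8000/predict", json=batch_payload)
# print(response.json())
```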

Features

  1. Multi-Model T2V
    Serve each model individually, or unify them under one endpoint.

  2. Parallel Batch Processing
    Handle multiple requests concurrently for high throughput.

  3. GPU Optimizations
    BF16 precision, attention slicing, VAE tiling, CPU offload, etc.

  4. Prometheus Metrics
    Monitor request latency, GPU usage, and more.

  5. S3 Integration
    Automatically upload .mp4 files to Amazon S3 and return signed URLs.

  6. Advanced Logging
    Uses Loguru for detailed and structured logging.


Prompt Engineering

Mochi 1 Prompt Engineering Guide

Daifuku currently ships with Genmo’s Mochi model as one of the primary text-to-video generation options. Crafting effective prompts is crucial to producing high-quality, consistent, and predictable results. Below is a product-management-style guide with detailed tips and illustrative examples:

1. Goal-Oriented Prompting

Ask yourself: What is the end experience or visual story you want to convey?

  • Example: “I want a short clip showing a hand gently picking up a lemon and rotating it in mid-air before placing it back.”
  • Pro Tip: Write prompts with the final user experience in mind—like describing a scene for a storyboard.

2. Technical Guidelines

  1. Precise Descriptions

    • Include motion verbs and descriptors (e.g., “gently tosses,” “rotating,” “smooth texture”).
    • Use specifics for objects (e.g., “a bright yellow lemon in a wooden bowl”).
  2. Scene Parameters

    • Define environment details: lighting (soft sunlight, tungsten glow), camera position (top-down, eye-level), and any background elements.
    • Focus on how these details interact (e.g., “shadows cast by the overhead lamp moving across the marble table”).
  3. Motion Control

    • Specify movement timing or speed (e.g., “the camera pans at 0.3m/s left to right,” “the object rotates 90° every second”).
    • For multi-step actions, break them down into time-coded events (e.g., “t=1.0s: the hand appears, t=2.0s: the hand gently tosses the lemon...”).
  4. Technical Parameters

    • Provide explicit numeric values for lighting conditions or camera angles (e.g., “5600K color temperature,” “f/2.8 aperture,” “ISO 400”).
    • If controlling atmospheric or environmental effects (e.g., fog density, volumetric lighting), add them as key-value pairs for clarity.
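The guidelines above can be folded into a small helper that assembles a structured prompt from a scene description, time-coded motion steps, and technical key-value pairs. This is an illustrative sketch; the helper and its field names are hypothetical, not part of Daifuku's API:

```python
def build_prompt(scene: str, motion_steps: list[tuple[float, float, str]],
                 **tech_params) -> str:
    """Combine a scene description, a time-coded motion sequence, and
    explicit technical parameters into one prompt string."""
    lines = [scene, "Motion Sequence:"]
    for start, end, action in motion_steps:
        lines.append(f"- t={start:.1f} to {end:.1f}s: {action}")
    # Append numeric parameters (e.g. color temperature, aperture) as key-value pairs.
    lines += [f"{key.replace('_', ' ')}: {value}" for key, value in tech_params.items()]
    return "\n".join(lines)

prompt = build_prompt(
    "A hand picks up a bright yellow lemon from a wooden bowl.",
    [(0.0, 0.5, "Hand enters from left"), (1.0, 1.2, "Lemon toss in slow motion")],
    color_temperature="5600K", aperture="f/2.8",
)
print(prompt)
```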

3. Reference Prompts

Below are extended examples showing how you can move from a simple directive to a fully descriptive, technical prompt.

<details> <summary><strong>Example 1: Controlled Motion Sequence</strong></summary>
  • Simple Prompt:

    PRECISE OBJECT MANIPULATION
    
  • Detailed Prompt:

    A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and fresh mint sprigs against a peach-colored background.
    The hand gently tosses the lemon up and catches it mid-air, highlighting its smooth texture.
    A beige string bag rests beside the bowl, adding a rustic touch.
    Additional lemons, including one halved, are scattered around the bowl’s base.
    Even, diffused lighting accentuates vibrant colors, creating a fresh, inviting atmosphere.
    Motion Sequence:
    - t=0.0 to 0.5s: Hand enters from left
    - t=1.0 to 1.2s: Lemon toss in slow motion
    - t=1.2 to 2.0s: Hand exits, camera remains static
    

Why It Works

  • Provides both visual (color, environment) and temporal (timing, motion) details.
  • Mentions lighting explicitly for consistent results.
  • The final action is clearly staged with micro-timings.
</details> <details> <summary><strong>Example 2: Technical Scene Setup</strong></summary>
  • Simple Prompt:

    ARCHITECTURAL VISUALIZATION
    
  • Detailed Prompt:

    Modern interior space with precise lighting control.
    The camera tracks laterally at 0.5m/s, maintaining a 1.6m elevation from the floor.
    Natural light at 5600K color temperature casts dynamic shadows across polished surfaces,
    while secondary overhead lighting at 3200K adds a warm glow.
    The scene uses soft ambient occlusion for depth,
    and focus remains fixed on the primary subject: a minimalist white sofa placed near full-height windows.
    

Why It Works

  • Encourages a photo-realistic interior shot.
  • Combines color temperature specifics and motion parameters for consistent lighting and camera movement.
</details> <details> <summary><strong>Example 3: Environmental Control</strong></summary>
  • Simple Prompt:

    ATMOSPHERIC DYNAMICS
    
  • Detailed Prompt:

    Volumetric lighting with carefully controlled particle density.
    The camera moves upward at 0.3m/s, starting at ground level and ending at 2.0m elevation.
    Light scatter coefficient: 0.7, atmospheric transmission: 85%.
    Particles glisten under a single overhead spotlight, forming dynamic l…
    
</details>