Daifuku
Serve Text-to-Video Models in Production
Table of Contents
- Introduction
- Quick Start
- Usage Examples
- Features
- Prompt Engineering
- Docker Support
- Monitoring
- License
Introduction
Daifuku is a versatile framework designed to serve multiple Text-to-Video (T2V) models (e.g., Mochi, LTX, and more). It streamlines T2V model deployment by providing:
- A unified API for multiple models
- Parallel batch processing
- GPU optimizations for efficiency
- Easy Docker-based deployment
- Integrated monitoring, logging, and metrics
Inspired by daifuku mochi, a sweet stuffed treat, this framework is "stuffed" with multiple T2V capabilities and aims to make your video generation as sweet and satisfying as possible.
Quick Start
Installation
git clone https://github.com/VikramxD/Daifuku.git
cd Daifuku
# Create a virtual environment
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e . --no-build-isolation
Optional: Download Mochi weights for faster first use:
python scripts/download_weights.py
Note: LTX weights download automatically on first usage.
Running the Servers
Daifuku can serve models individually or combine them behind a single endpoint:
<details>
<summary><strong>Mochi Server</strong></summary>
python api/mochi_serve.py
# Endpoint: http://127.0.0.1:8000/api/v1/video/mochi
</details>
<details>
<summary><strong>LTX Server</strong></summary>
python api/ltx_serve.py
# Endpoint: http://127.0.0.1:8000/api/v1/video/ltx
</details>
<details>
<summary><strong>Allegro Server</strong></summary>
python api/allegro_serve.py
# Endpoint: http://127.0.0.1:8000/api/v1/video/allegro
</details>
<details>
<summary><strong>Combined Server</strong></summary>
python api/serve.py
# Endpoint: http://127.0.0.1:8000/predict
# Must supply "model_name" in the request payload.
</details>
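A single request to the combined endpoint looks like the per-model examples below, plus the required "model_name" field. The sketch wraps payload construction in a small helper; the exact response schema depends on your deployment:

```python
import requests

# Build a payload for the combined /predict endpoint.
# "model_name" selects which backend model handles the request;
# remaining keyword arguments are that model's generation parameters.
def build_predict_payload(model_name: str, prompt: str, **params) -> dict:
    payload = {"model_name": model_name, "prompt": prompt}
    payload.update(params)
    return payload

if __name__ == "__main__":
    payload = build_predict_payload(
        "mochi",
        "A calm ocean scene, sunrise, realistic",
        num_inference_steps=40,
    )
    response = requests.post("http://127.0.0.1:8000/predict", json=payload)
    print(response.json())
```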
Usage Examples
Single Prompt Requests
Mochi Model Example
import requests
url = "http://127.0.0.1:8000/api/v1/video/mochi"
payload = {
"prompt": "A serene beach at dusk, gentle waves, dreamy pastel colors",
"num_inference_steps": 40,
"guidance_scale": 4.0,
"height": 480,
"width": 848,
"num_frames": 120,
"fps": 10
}
response = requests.post(url, json=payload)
print(response.json())
LTX Model Example
import requests
url = "http://127.0.0.1:8000/api/v1/video/ltx"
payload = {
"prompt": "A cinematic scene of autumn leaves swirling around the forest floor",
"negative_prompt": "blurry, worst quality",
"num_inference_steps": 40,
"guidance_scale": 3.0,
"height": 480,
"width": 704,
"num_frames": 121,
"frame_rate": 25
}
response = requests.post(url, json=payload)
print(response.json())
Allegro Model Example
import requests
url = "http://127.0.0.1:8000/api/v1/video/allegro"
payload = {
"prompt": "A lively jazz band performing on a dimly lit stage, audience clapping",
"num_inference_steps": 45,
"guidance_scale": 4.5,
"height": 720,
"width": 1280,
"num_frames": 150,
"fps": 24
}
response = requests.post(url, json=payload)
print(response.json())
Batch Requests
Process multiple requests simultaneously with Daifuku’s parallel capabilities:
curl -X POST http://127.0.0.1:8000/predict \
-H "Content-Type: application/json" \
-d '{
"batch": [
{
"model_name": "mochi",
"prompt": "A calm ocean scene, sunrise, realistic",
"num_inference_steps": 40
},
{
"model_name": "ltx",
"prompt": "A vintage film style shot of the Eiffel Tower",
"height": 480,
"width": 704
}
]
}'
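The same batch request can be issued from Python. This sketch mirrors the curl payload above, with a small validation helper around the payload construction:

```python
import requests

# Assemble a batch payload for /predict: each entry names its target
# model plus that model's generation parameters.
def build_batch_payload(jobs: list[dict]) -> dict:
    for job in jobs:
        if "model_name" not in job or "prompt" not in job:
            raise ValueError("each batch entry needs 'model_name' and 'prompt'")
    return {"batch": jobs}

if __name__ == "__main__":
    payload = build_batch_payload([
        {"model_name": "mochi",
         "prompt": "A calm ocean scene, sunrise, realistic",
         "num_inference_steps": 40},
        {"model_name": "ltx",
         "prompt": "A vintage film style shot of the Eiffel Tower",
         "height": 480, "width": 704},
    ])
    response = requests.post("http://127.0.0.1:8000/predict", json=payload)
    print(response.json())
```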
Features
- Multi-Model T2V: serve each model individually, or unify them under one endpoint.
- Parallel Batch Processing: handle multiple requests concurrently for high throughput.
- GPU Optimizations: BF16 precision, attention slicing, VAE tiling, CPU offload, etc.
- Prometheus Metrics: monitor request latency, GPU usage, and more.
- S3 Integration: automatically upload .mp4 files to Amazon S3 and return signed URLs.
- Advanced Logging: uses Loguru for detailed and structured logging.
Prompt Engineering
Mochi 1 Prompt Engineering Guide
Daifuku currently ships with Genmo’s Mochi model as one of the primary text-to-video generation options. Crafting effective prompts is crucial to producing high-quality, consistent, and predictable results. Below is a product-management-style guide with detailed tips and illustrative examples:
1. Goal-Oriented Prompting
Ask yourself: What is the end experience or visual story you want to convey?
- Example: “I want a short clip showing a hand gently picking up a lemon and rotating it in mid-air before placing it back.”
- Pro Tip: Write prompts with the final user experience in mind—like describing a scene for a storyboard.
2. Technical Guidelines
- Precise Descriptions
  - Include motion verbs and descriptors (e.g., “gently tosses,” “rotating,” “smooth texture”).
  - Use specifics for objects (e.g., “a bright yellow lemon in a wooden bowl”).
- Scene Parameters
  - Define environment details: lighting (soft sunlight, tungsten glow), camera position (top-down, eye-level), and any background elements.
  - Focus on how these details interact (e.g., “shadows cast by the overhead lamp moving across the marble table”).
- Motion Control
  - Specify movement timing or speed (e.g., “the camera pans at 0.3 m/s left to right,” “the object rotates 90° every second”).
  - For multi-step actions, break them down into time-coded events (e.g., “t=1.0s: the hand appears, t=2.0s: the hand gently tosses the lemon...”).
- Technical Parameters
  - Provide explicit numeric values for lighting conditions or camera angles (e.g., “5600K color temperature,” “f/2.8 aperture,” “ISO 400”).
  - If controlling atmospheric or environmental effects (e.g., fog density, volumetric lighting), add them as key-value pairs for clarity.
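These guidelines can be mechanized with a small helper that assembles a prompt from a scene description, technical key-value parameters, and time-coded motion events. The helper below is illustrative only, not part of Daifuku's API:

```python
# Illustrative helper (not part of Daifuku): compose a structured prompt
# from a scene description, technical key-value parameters, and
# time-coded motion events, following the guidelines above.
def build_prompt(scene: str, params: dict,
                 events: list[tuple[float, float, str]]) -> str:
    parts = [scene.strip()]
    if params:
        # Render technical parameters as explicit key-value pairs.
        parts.append(", ".join(f"{k}: {v}" for k, v in params.items()))
    if events:
        lines = ["Motion Sequence:"]
        for start, end, action in events:
            lines.append(f"- t={start:.1f} to {end:.1f}s: {action}")
        parts.append("\n".join(lines))
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_prompt(
        "A hand picks up a bright yellow lemon from a wooden bowl.",
        {"color temperature": "5600K", "aperture": "f/2.8"},
        [(0.0, 0.5, "Hand enters from left"),
         (1.0, 1.2, "Lemon toss in slow motion")],
    )
    print(prompt)
```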
3. Reference Prompts
Below are extended examples showing how you can move from a simple directive to a fully descriptive, technical prompt.
<details>
<summary><strong>Example 1: Controlled Motion Sequence</strong></summary>

- Simple Prompt:
  PRECISE OBJECT MANIPULATION
- Detailed Prompt:
  A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and fresh mint sprigs against a peach-colored background. The hand gently tosses the lemon up and catches it mid-air, highlighting its smooth texture. A beige string bag rests beside the bowl, adding a rustic touch. Additional lemons, including one halved, are scattered around the bowl’s base. Even, diffused lighting accentuates vibrant colors, creating a fresh, inviting atmosphere.
  Motion Sequence:
  - t=0.0 to 0.5s: Hand enters from left
  - t=1.0 to 1.2s: Lemon toss in slow motion
  - t=1.2 to 2.0s: Hand exits, camera remains static
Why It Works
- Provides both visual (color, environment) and temporal (timing, motion) details.
- Mentions lighting explicitly for consistent results.
- The final action is clearly staged with micro-timings.
- Simple Prompt:
  ARCHITECTURAL VISUALIZATION
- Detailed Prompt:
Modern interior space with precise lighting control. The camera tracks laterally at 0.5m/s, maintaining a 1.6m elevation from the floor. Natural light at 5600K color temperature casts dynamic shadows across polished surfaces, while secondary overhead lighting at 3200K adds a warm glow. The scene uses soft ambient occlusion for depth, and focus remains fixed on the primary subject: a minimalist white sofa placed near full-height windows.
Why It Works
- Encourages a photo-realistic interior shot.
- Combines color temperature specifics and motion parameters for consistent lighting and camera movement.
- Simple Prompt:
  ATMOSPHERIC DYNAMICS
- Detailed Prompt:
Volumetric lighting with carefully controlled particle density. The camera moves upward at 0.3m/s, starting at ground level and ending at 2.0m elevation. Light scatter coefficient: 0.7, atmospheric transmission: 85%. Particles glisten under a single overhead spotlight, forming dynamic l
