Mochi

The best OSS video generation models, created by Genmo

Mochi 1

Blog | Direct Download | Hugging Face | Playground | Careers

A state-of-the-art video generation model by Genmo.

https://github.com/user-attachments/assets/4d268d02-906d-4cb0-87cc-f467f1497108

News

⭐ November 26, 2024: Added support for LoRA fine-tuning
⭐ November 5, 2024: Consumer-GPU support for Mochi natively in ComfyUI

Overview

Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. We’re releasing the model under a permissive Apache 2.0 license. Try this model for free on our playground.

Installation

Install using uv:

git clone https://github.com/genmoai/mochi
cd mochi
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation

If you want to install flash attention, you can use:

uv pip install -e ".[flash]" --no-build-isolation

You will also need to install FFmpeg to turn your outputs into videos.
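Before generating anything, it can help to confirm that the ffmpeg binary is reachable from your environment. The following is a minimal sketch using only the Python standard library; the helper name `find_ffmpeg` is hypothetical and not part of this repository.

```python
import shutil


def find_ffmpeg(binary: str = "ffmpeg"):
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH; install it before generating videos")
else:
    print(f"ffmpeg found at {path}")
```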

Download Weights

Use download_weights.py to download the model and VAE to a local directory:

python3 ./scripts/download_weights.py weights/

Alternatively, download the weights directly from Hugging Face, or via the magnet link magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce, to a folder on your computer.
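Whichever download route you use, a quick check that the directory contains the files you expect can save a failed launch. This sketch assumes the two file names that appear in the API example of this README (`dit.safetensors` and `decoder.safetensors`); the download may contain other files that this check does not know about, and `missing_weights` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the API example in this README (an assumption about
# the directory layout, not an exhaustive list of what gets downloaded).
EXPECTED_FILES = ("dit.safetensors", "decoder.safetensors")


def missing_weights(model_dir, expected=EXPECTED_FILES):
    """Return the expected weight files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_weights("weights/")
if missing:
    print("missing weight files:", ", ".join(missing))
else:
    print("all expected weight files are present")
```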

Running

Start the Gradio UI with:

python3 ./demos/gradio_ui.py --model_dir weights/ --cpu_offload

Or generate videos directly from the CLI with:

python3 ./demos/cli.py --model_dir weights/ --cpu_offload

If you have a fine-tuned LoRA in the safetensors format, you can add --lora_path <path/to/my_mochi_lora.safetensors> to either gradio_ui.py or cli.py.
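If you launch the demos from another script, it may be convenient to assemble the command in one place. The sketch below uses only the flags shown above (`--model_dir`, `--cpu_offload`, `--lora_path`); the function `build_cli_command` and the example LoRA file name are hypothetical.

```python
import shlex


def build_cli_command(model_dir, cpu_offload=True, lora_path=None,
                      script="./demos/cli.py", python="python3"):
    """Assemble the argument list for a demo script using the flags shown above."""
    cmd = [python, script, "--model_dir", model_dir]
    if cpu_offload:
        cmd.append("--cpu_offload")
    if lora_path is not None:
        cmd += ["--lora_path", lora_path]
    return cmd


# Print the command instead of running it, so it can be inspected first.
cmd = build_cli_command("weights/", lora_path="my_mochi_lora.safetensors")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root; pointing `script` at `./demos/gradio_ui.py` gives the equivalent command for the UI.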

API

This repository comes with a simple, composable API so you can call the model programmatically. You can find a full example here; roughly, it looks like this:

from genmo.mochi_preview.pipelines import (
    DecoderModelFactory,
    DitModelFactory,
    MochiSingleGPUPipeline,
    T5ModelFactory,
    linear_quadratic_schedule,
)

pipeline = MochiSingleGPUPipeline(
    text_encoder_factory=T5ModelFactory(),
    dit_factory=DitModelFactory(
        model_path="weights/dit.safetensors", model_dtype="bf16"
    ),
    decoder_factory=DecoderModelFactory(
        model_path="weights/decoder.safetensors",
    ),
    cpu_offload=True,
    decode_type="tiled_spatial",
)

video = pipeline(
    height=480,
    width=848,
    num_frames=31,
    num_inference_steps=64,
    sigma_schedule=linear_quadratic_schedule(64, 0.025),
    cfg_schedule=[6.0] * 64,
    batch_cfg=False,
    prompt="your favorite prompt here ...",
    negative_prompt="",
    seed=12345,
)
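In the call above, the step count appears three times: in `num_inference_steps`, in the first argument to `linear_quadratic_schedule`, and in the length of `cfg_schedule`. One way to keep them in sync is to derive all three from a single value. This is a sketch, not part of the repository: `make_sampling_kwargs` is a hypothetical helper, and the second schedule argument is called `schedule_param` because this README does not name it.

```python
def make_sampling_kwargs(num_inference_steps, cfg_scale, sigma_schedule_fn,
                         schedule_param=0.025):
    """Build the step-dependent keyword arguments from one step count.

    sigma_schedule_fn is called the same way as in the example above:
    sigma_schedule_fn(num_inference_steps, schedule_param).
    """
    if num_inference_steps <= 0:
        raise ValueError("num_inference_steps must be positive")
    return {
        "num_inference_steps": num_inference_steps,
        "sigma_schedule": sigma_schedule_fn(num_inference_steps, schedule_param),
        # One guidance value per step, mirroring cfg_schedule=[6.0] * 64.
        "cfg_schedule": [cfg_scale] * num_inference_steps,
    }


# Intended use with the real pipeline (requires the genmo package):
#   video = pipeline(height=480, width=848, num_frames=31, batch_cfg=False,
#                    prompt="...", negative_prompt="", seed=12345,
#                    **make_sampling_kwargs(64, 6.0, linear_quadratic_schedule))
```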

Fine-tuning with LoRA

We provide an easy-to-use trainer that allows you to build LoRA fine-tunes of Mochi on your own videos. The model can be fine-tuned on one H100 or A100 80GB GPU.

Model Architecture

Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture. Additionally, we are releasing an inference harness that includes an efficient context parallel implementation.

Alongside Mochi, we are open-sourcing our video AsymmVAE. We use an asymmetric encoder-decoder structure to build an efficient, high-quality compression model. Our AsymmVAE causally compresses videos to a 128x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.

AsymmVAE Model Specs

| Params Count | Enc Base Channels | Dec Base Channels | Latent Dim | Spatial Compression | Temporal Compression |
|:--:|:--:|:--:|:--:|:--:|:--:|
| 362M | 64 | 128 | 12 | 8x8 | 6x |
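To make the compression factors concrete, the sketch below estimates the latent shape for a given clip. It rests on an assumption that this README does not spell out: that causal temporal compression keeps the first frame and folds each following group of six frames into one latent frame, so a clip of `6k + 1` frames becomes `k + 1` latent frames. The function name `latent_shape` is hypothetical.

```python
def latent_shape(num_frames, height, width,
                 spatial=8, temporal=6, latent_channels=12):
    """Estimate the latent shape as (channels, frames, height, width).

    ASSUMPTION: latent frames = (num_frames - 1) // temporal + 1, which is
    one common reading of "causal" compression. The README states only the
    8x8 spatial factor, the 6x temporal factor, and the 12 channels.
    """
    if (num_frames - 1) % temporal != 0:
        raise ValueError("this sketch assumes num_frames = temporal * k + 1")
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    return (
        latent_channels,
        (num_frames - 1) // temporal + 1,
        height // spatial,
        width // spatial,
    )


# The 31-frame, 480x848 clip from the API example:
print(latent_shape(31, 480, 848))  # (12, 6, 60, 106)
```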

An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements. Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
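The role of the non-square projections can be illustrated with a toy NumPy version of joint attention. This is a sketch under stated assumptions, not Genmo's implementation: it assumes both streams are projected to the visual width for attention, uses tiny dimensions (8 and 4 rather than 3072 and 1536), and omits normalization, positional encoding, and the per-modality MLPs. All names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(visual, text, params, num_heads):
    """Joint self-attention over visual (Nv, Dv) and text (Nt, Dt) tokens, Dt < Dv.

    ASSUMPTION: attention runs at the visual width Dv for both modalities.
    """
    nv, dv = visual.shape
    head_dim = dv // num_heads
    # Square QKV for visual (Dv -> 3*Dv); non-square QKV for text (Dt -> 3*Dv).
    qkv = np.concatenate(
        [visual @ params["qkv_visual"], text @ params["qkv_text"]], axis=0
    )
    q, k, v = np.split(qkv, 3, axis=1)  # each (Nv + Nt, Dv)

    def split_heads(x):  # (N, Dv) -> (heads, N, head_dim)
        return x.reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, N, N)
    out = softmax(scores) @ v  # every token attends to both modalities
    out = out.transpose(1, 0, 2).reshape(-1, dv)  # (N, Dv)
    # Non-square output projection returns text tokens to their own width.
    return out[:nv] @ params["out_visual"], out[nv:] @ params["out_text"]


rng = np.random.default_rng(0)
dv, dt, n_heads = 8, 4, 2
params = {
    "qkv_visual": rng.normal(size=(dv, 3 * dv)),
    "qkv_text": rng.normal(size=(dt, 3 * dv)),
    "out_visual": rng.normal(size=(dv, dv)),
    "out_text": rng.normal(size=(dv, dt)),
}
vis_out, txt_out = joint_attention(
    rng.normal(size=(5, dv)), rng.normal(size=(3, dt)), params, n_heads
)
print(vis_out.shape, txt_out.shape)  # (5, 8) (3, 4)
```

Because a square layer's parameter count grows with the square of its width, a visual stream twice as wide as the text stream (3072 versus 1536) is consistent with the roughly 4x parameter ratio described above.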

AsymmDiT Model Specs

| Params Count | Num Layers | Num Heads | Visual Dim | Text Dim | Visual Tokens | Text Tokens |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 10B | 48 | 24 | 3072 | 1536 | 44520 | 256 |
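The Visual Tokens figure in the table can be reproduced with a few lines of arithmetic, under assumptions that this README does not state: a 163-frame clip at 480x848, 2x2 patchification of the latent, and a causal temporal layout in which `6k + 1` frames become `k + 1` latent frames. Treat these as hypotheses that happen to match the table, not as documented settings; `visual_token_count` is a hypothetical helper.

```python
def visual_token_count(num_frames, height, width,
                       spatial=8, temporal=6, patch=2):
    """Count visual tokens after VAE compression and patchification.

    ASSUMPTIONS (not stated in the README): causal temporal compression as
    (num_frames - 1) // temporal + 1, and patch x patch spatial patchification.
    """
    latent_frames = (num_frames - 1) // temporal + 1
    latent_h = height // spatial
    latent_w = width // spatial
    return latent_frames * (latent_h // patch) * (latent_w // patch)


# 28 latent frames * 30 * 53 patches per latent frame
print(visual_token_count(163, 480, 848))  # 44520
```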

Hardware Requirements

The repository supports both multi-GPU operation (splitting the model across multiple graphics cards) and single-GPU operation, though it requires approximately 60GB of VRAM when running on a single GPU. While ComfyUI can optimize Mochi to run on less than 20GB of VRAM, this implementation prioritizes flexibility over memory efficiency. When using this repository, we recommend using at least one H100 GPU.
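To see whether a machine meets the single-GPU figure quoted above, you can read the total memory of each GPU from `nvidia-smi`. The sketch below is a rough check only: the 60GB number is approximate, `nvidia-smi` reports MiB, and the helper names are hypothetical.

```python
import shutil
import subprocess

SINGLE_GPU_VRAM_GB = 60  # approximate single-GPU figure quoted above


def parse_vram_mib(smi_output):
    """Parse one integer (MiB) per line from nvidia-smi's CSV output."""
    return [int(line) for line in smi_output.splitlines() if line.strip()]


def gpus_meeting(required_gb, vram_mib):
    """Return the indices of GPUs whose total memory is at least required_gb."""
    return [i for i, mib in enumerate(vram_mib) if mib / 1024 >= required_gb]


def query_vram_mib():
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_vram_mib(result.stdout)
```

Calling `gpus_meeting(SINGLE_GPU_VRAM_GB, query_vram_mib())` then lists the GPUs that are large enough for single-GPU operation; an empty list suggests using multiple GPUs or the ComfyUI route.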

Safety

Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and exercise careful consideration before deploying these model weights in any commercial services or products.

Limitations

Under the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles, so it does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.

Related Work

ComfyUI-MochiWrapper adds ComfyUI support for Mochi. The integration of PyTorch's SDPA attention was based on their repository.
ComfyUI-MochiEdit adds ComfyUI nodes for video editing, such as object insertion and restyling.
mochi-xdit is a fork of this repository that improves parallel inference speed with xDiT.
Modal script for fine-tuning Mochi on Modal GPUs.

BibTeX

@misc{genmo2024mochi,
      title={Mochi 1},
      author={Genmo Team},
      year={2024},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished={\url{https://github.com/genmoai/models}}
}
