CogVideo & CogVideoX

Read in Chinese (中文阅读)

Read in Japanese (日本語で読む)

<div align="center"> <img src=resources/logo.svg width="50%"/> </div> <p align="center"> Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> or <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a> </p> <p align="center"> 📚 View the <a href="https://arxiv.org/abs/2408.06072" target="_blank">paper</a> and <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh" target="_blank">user guide</a> </p> <p align="center"> 👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/dCGfUsagrD" target="_blank">Discord</a> </p> <p align="center"> 📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models. </p>

Project Updates

  • 🔥🔥 News: 2025/03/24: We have launched CogKit, a fine-tuning and inference framework for the CogView4 and CogVideoX series. This toolkit allows you to fully explore and utilize our multimodal generation models.
  • 🔥 News: 2025/02/28: DDIM Inverse is now supported in CogVideoX-5B and CogVideoX1.5-5B. Check here.
  • 🔥 News: 2025/01/08: We have updated the code for Lora fine-tuning based on the diffusers version model, which uses less GPU memory. For more details, please see here.
  • 🔥 News: 2024/11/15: We released the CogVideoX1.5 model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
  • 🔥 News: 2024/11/08: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX. The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution. The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code here.
  • 🔥 News: 2024/10/13: A more cost-effective fine-tuning framework for CogVideoX-5B that works with a single 4090 GPU, cogvideox-factory, has been released. It supports fine-tuning with multiple resolutions. Feel free to use it!
  • 🔥 News: 2024/10/10: We have updated our technical report. Please click here to view it. More training details and a demo have been added; to see the demo, click here.
  • 🔥 News: 2024/10/09: We have publicly released the technical documentation for CogVideoX fine-tuning on Feishu, further increasing distribution flexibility. All examples in the public documentation can be fully reproduced.
  • 🔥 News: 2024/9/19: We have open-sourced the CogVideoX series image-to-video model CogVideoX-5B-I2V. This model can take an image as a background input and generate a video combined with prompt words, offering greater controllability. With this, the CogVideoX series models now support three tasks: text-to-video generation, video continuation, and image-to-video generation. Welcome to try it online at Experience.
  • 🔥 2024/9/19: The Caption model CogVLM2-Caption, used in the training process of CogVideoX to convert video data into text descriptions, has been open-sourced. Welcome to download and use it.
  • 🔥 2024/8/27: We have open-sourced a larger model in the CogVideoX series, CogVideoX-5B. We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run CogVideoX-2B on older GPUs like GTX 1080TI, and CogVideoX-5B on desktop GPUs like RTX 3060. Please strictly follow the requirements to update and install dependencies, and refer to cli_demo for inference code. Additionally, the open-source license for the CogVideoX-2B model has been changed to the Apache 2.0 License.
  • 🔥 2024/8/6: We have open-sourced 3D Causal VAE, used for CogVideoX-2B, which can reconstruct videos with almost no loss.
  • 🔥 2024/8/6: We have open-sourced the first model of the CogVideoX series of video generation models, **CogVideoX-2B**.
  • 🌱 Source: 2022/5/19: We have open-sourced the CogVideo video generation model (now you can see it in the CogVideo branch). This is the first open-source large Transformer-based text-to-video generation model. You can access the ICLR'23 paper for technical details.

Table of Contents

Jump to a specific section:

Quick Start

Prompt Optimization

Before running the model, please refer to this guide to see how we use large language models such as GLM-4 (or comparable products, such as GPT-4) to optimize the prompt. This is crucial: the model is trained on long, detailed prompts, so a good prompt directly impacts the quality of the generated video.
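The idea can be sketched as follows: wrap the user's short prompt in a rewriting instruction and send it to an LLM via a chat-completion API. The instruction text below is an illustrative assumption, not the official template from the guide.

```python
# Illustrative sketch of prompt optimization for text-to-video generation.
# The system instruction below is an assumption, not the official template.

SYSTEM_INSTRUCTION = (
    "You are a prompt engineer for a text-to-video model. Rewrite the user's "
    "short prompt into a single detailed English paragraph describing the "
    "subject, setting, lighting, camera movement, and visual style."
)

def build_rewrite_messages(user_prompt: str) -> list:
    """Build a chat-completion payload asking an LLM to expand a short prompt."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": f"Short prompt: {user_prompt}"},
    ]

messages = build_rewrite_messages("a cat playing piano")
```

The returned `messages` list can be passed to any OpenAI-style chat endpoint (GLM-4, GPT-4, etc.); the LLM's reply then becomes the prompt fed to CogVideoX.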

SAT

Please make sure your Python version is between 3.10 and 3.12, inclusive.

Follow the instructions in sat_demo, which contains the inference and fine-tuning code for the SAT weights. We recommend building on the CogVideoX model structure; researchers use this code for rapid prototyping and development.

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive.

```shell
pip install -r requirements.txt
```

Then follow diffusers_demo, which gives a more detailed explanation of the inference code, including the significance of common parameters.
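A minimal inference run with the diffusers pipeline can be sketched as follows. The generation settings (`num_frames=49`, `fps=8`, `guidance_scale=6.0`) are illustrative defaults rather than official recommendations; see diffusers_demo for the authoritative values.

```python
# A minimal sketch of text-to-video inference with the diffusers pipeline.
# The generation settings below are illustrative, not tuned values.
GEN_KWARGS = dict(
    num_inference_steps=50,  # denoising steps
    guidance_scale=6.0,      # classifier-free guidance strength
    num_frames=49,           # roughly 6 seconds at 8 fps
)

def generate_video(prompt: str, model_id: str = "THUDM/CogVideoX-5b",
                   out_path: str = "output.mp4") -> str:
    """Run a CogVideoX pipeline and save the result as an mp4."""
    # Heavy imports live inside the function so the sketch stays importable
    # on machines without diffusers or a GPU installed.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # stream weights to the GPU block by block
    pipe.vae.enable_tiling()         # decode the video in tiles to cut peak memory
    frames = pipe(prompt=prompt, **GEN_KWARGS).frames[0]
    export_to_video(frames, out_path, fps=8)
    return out_path
```

The CPU offload and VAE tiling calls are what make inference feasible on desktop GPUs such as the RTX 3060 mentioned above.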

For more details on quantized inference, please refer to diffusers-torchao. With Diffusers and TorchAO, quantized inference is possible, yielding memory-efficient inference and, in some cases, a speedup when compiled. A full list of memory and time benchmarks for various settings on A100 and H100 GPUs has been published at diffusers-torchao.
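As a rough sketch of what quantized inference looks like: the int8 weight-only recipe below is one of several options benchmarked in diffusers-torchao, and the exact calls are an assumption against current torchao/diffusers versions.

```python
def load_quantized_pipeline(model_id: str = "THUDM/CogVideoX-5b"):
    """Load a CogVideoX pipeline with int8 weight-only quantization (a sketch)."""
    # Heavy imports kept inside the function so the sketch stays importable
    # without torch, diffusers, or torchao present.
    import torch
    from diffusers import CogVideoXPipeline
    from torchao.quantization import quantize_, int8_weight_only

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    quantize_(pipe.transformer, int8_weight_only())  # quantize only the DiT weights
    pipe.enable_model_cpu_offload()
    return pipe
```

Quantizing only the transformer keeps the text encoder and VAE at full precision, which is the trade-off most of the published benchmarks use; consult diffusers-torchao for the measured memory and latency numbers.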

Gallery

CogVideoX-5B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video> </td> </tr> </table>

CogVideoX-2B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video> </td> </tr> </table>

To view the corresponding prompt words for the gallery, please click here

Model Introduction

CogVideoX is an open-source version of the video generation model originating from QingYing. The table below lists the video generation models we currently offer, along with their basic information.
