<div id="top" align="center">Colossal-AI: Making large AI models cheaper, faster, and more accessible
<h3> <a href="https://arxiv.org/abs/2110.14883"> Paper </a> | <a href="https://www.colossalai.org/"> Documentation </a> | <a href="https://github.com/hpcaitech/ColossalAI/tree/main/examples"> Examples </a> | <a href="https://github.com/hpcaitech/ColossalAI/discussions"> Forum </a> | <a href="https://colossalai.org/zh-Hans/docs/get_started/bonus/">GPU Cloud Playground </a> | <a href="https://hpc-ai.com/blog"> Blog </a></h3> </div>

Instantly Run Colossal-AI on Enterprise-Grade GPUs
Skip the setup. Access a powerful, pre-configured Colossal-AI environment on HPC-AI Cloud.
Train your models and scale your AI workload in one click!
- NVIDIA Blackwell B200s: Experience the next generation of AI performance (See Benchmarks). Now available on cloud from $2.47/hr.
- Cost-Effective H200 Cluster: Get premier performance with on-demand rental from just $1.99/hr.
Get Started Now & Claim Your Free Credits →
<div align="center"> <a href="https://hpc-ai.com/?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai"> <img src="https://github.com/hpcaitech/public_assets/blob/main/colossalai/img/2-3.png" width="850" /> </a> </div>

Colossal-AI Benchmark
To see how these performance gains translate to real-world applications, we conducted a large language model training benchmark using Colossal-AI on Llama-like models. The tests were run on both 8-card and 16-card configurations for 7B and 70B models, respectively.
| GPU | GPUs | Model Size | Parallelism | Batch Size per DP | Seqlen | Throughput | TFLOPS/GPU | Peak Mem (MiB) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| H200 | 8 | 7B | zero2(dp8) | 36 | 4096 | 17.13 samp/s | 534.18 | 119040.02 |
| H200 | 16 | 70B | zero2 | 48 | 4096 | 3.27 samp/s | 469.1 | 150032.23 |
| B200 | 8 | 7B | zero1(dp2)+tp2+pp4 | 128 | 4096 | 25.83 samp/s | 805.69 | 100119.77 |
| B200 | 16 | 70B | zero1(dp2)+tp2+pp4 | 128 | 4096 | 5.66 samp/s | 811.79 | 100072.02 |
The results from the Colossal-AI benchmark provide the most practical insight. For the 7B model on 8 cards, the B200 achieved roughly 50% higher throughput (25.83 vs. 17.13 samples/s) and a comparable increase in TFLOPS per GPU. For the 70B model on 16 cards, the B200 again demonstrated a clear advantage, with over 70% higher throughput and TFLOPS per GPU. These numbers show that the B200's performance gains translate directly into faster training times for large-scale models.
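As a quick sanity check, the relative gains quoted above can be recomputed directly from the throughput and TFLOPS/GPU figures in the benchmark table. This is a minimal sketch using only the numbers reported there; the `speedup` helper is just illustrative arithmetic, not part of the Colossal-AI API.

```python
def speedup(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction (0.5 == 50% faster)."""
    return new / old - 1.0

# 7B model on 8 GPUs: B200 vs. H200 throughput (samples/s from the table)
throughput_7b = speedup(25.83, 17.13)   # ~0.51, i.e. "roughly 50% higher throughput"

# 70B model on 16 GPUs: B200 vs. H200 throughput
throughput_70b = speedup(5.66, 3.27)    # ~0.73, i.e. "over 70% higher"

# TFLOPS per GPU for the same two comparisons
tflops_7b = speedup(805.69, 534.18)     # ~0.51
tflops_70b = speedup(811.79, 469.1)     # ~0.73

print(f"7B  throughput gain: {throughput_7b:.0%}")
print(f"70B throughput gain: {throughput_70b:.0%}")
```

Note also that the parallelism label encodes how GPUs are partitioned: for `zero1(dp2)+tp2+pp4`, the product of data-, tensor-, and pipeline-parallel degrees (2 × 2 × 4) gives the total group size the strategy spans.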
Latest News
- [2025/02] DeepSeek 671B Fine-Tuning Guide Revealed: Unlock the Upgraded DeepSeek Suite with One Click
- [2024/12] Development costs for video generation models cut by 50%! Open-source solutions are now available, with H200 GPU vouchers [code] [vouchers]
- [2024/10] How to build a low-cost Sora-like app? Solutions for you
- [2024/09] Singapore Startup HPC-AI Tech Secures 50 Million USD in Series A Funding to Build the Video Generation AI Model and GPU Platform
- [2024/09] Reduce Large AI Model Training Costs by 30% with Just a Single Line of Code, via FP8 Mixed-Precision Training Upgrades
- [2024/06] Open-Sora Continues Open Source: Generate Any 16-Second 720p HD Video with One Click, Model Weights Ready to Use
- [2024/05] Large AI Models Inference Speed Doubled, Colossal-Inference Open Source Release
- [2024/04] Open-Sora Unveils Major Upgrade: Embracing Open Source with Single-Shot 16-Second Video Generation and 720p Resolution
- [2024/04] Most cost-effective solutions for inference, fine-tuning and pretraining, tailored to LLaMA3 series
Table of Contents
<ul>
  <li><a href="#Why-Colossal-AI">Why Colossal-AI</a></li>
  <li><a href="#Features">Features</a></li>
  <li><a href="#Colossal-AI-in-the-Real-World">Colossal-AI for Real World Applications</a>
    <ul>
      <li><a href="#Open-Sora">Open-Sora: Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models</a></li>
      <li><a href="#Colossal-LLaMA-2">Colossal-LLaMA-2: One Half-Day of Training Using a Few Hundred Dollars Yields Similar Results to Mainstream Large Models, an Open-Source and Commercial-Free Domain-Specific LLM Solution</a></li>
      <li><a href="#ColossalChat">ColossalChat: An Open-Source Solution for Cloning ChatGPT With a Complete RLHF Pipeline</a></li>
      <li><a href="#AIGC">AIGC: Acceleration of Stable Diffusion</a></li>
      <li><a href="#Biomedicine">Biomedicine: Acceleration of AlphaFold Protein Structure Prediction</a></li>
    </ul>
  </li>
  <li><a href="#Parallel-Training-Demo">Parallel Training Demo</a>
    <ul>
      <li><a href="#LLaMA3">LLaMA 1/2/3</a></li>
      <li><a href="#MoE">MoE</a></li>
      <li><a href="#GPT-3">GPT-3</a></li>
      <li><a href="#GPT-2">GPT-2</a></li>
      <li><a href="#BERT">BERT</a></li>
      <li><a href="#PaLM">PaLM</a></li>
      <li><a href="#OPT">OPT</a></li>
      <li><a href="#ViT">ViT</a></li>
      <li><a href="#Recommendation-System-Models">Recommendation System Models</a></li>
    </ul>
  </li>
  <li><a href="#Single-GPU-Training-Demo">Single GPU Training Demo</a>
    <ul>
      <li><a href="#GPT-2-Single">GPT-2</a></li>
      <li><a href="#PaLM-Single">PaLM</a></li>
    </ul>
  </li>
  <li><a href="#Inference">Inference</a>
    <ul>
      <li><a href="#Colossal-Inference">Colossal-Inference: Large AI Model Inference Speed Doubled</a></li>
      <li><a href="#Grok-1">Grok-1: 314B Model PyTorch + HuggingFace Inference</a></li>
      <li><a href="#SwiftInfer">SwiftInfer: Breaks the Length Limit of LLM for Multi-Round Conversations with 46% Acceleration</a></li>
    </ul>
  </li>
  <li><a href="#Installation">Installation</a>
    <ul>
      <li><a href="#PyPI">PyPI</a></li>
      <li><a href="#Install-From-Source">Install From Source</a></li>
    </ul>
  </li>
  <li><a href="#Use-Docker">Use Docker</a></li>
  <li><a href="#Community">Community</a></li>
  <li><a href="#Contributing">Contributing</a></li>
  <li><a href="#Cite-Us">Cite Us</a></li>
</ul>

Why Colossal-AI
<div align="center"> <a href="https://youtu.be/KnXSfjqkKN0"> <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/JamesDemmel_Colossal-AI.png" width="600" /> </a>Prof. James Demmel (UC Berkeley): Colossal-AI makes training AI models efficient, easy, and scalable.
</div> <p align="right">(<a href="#top">back to top</a>)</p>

Features
Colossal-AI provides a collection of parallel components for you. We aim to support you to write your distributed deep learning models just like how you write your model on your laptop.

