<div id="top" align="center">Colossal-AI: Making large AI models cheaper, faster, and more accessible
<h3> <a href="https://arxiv.org/abs/2110.14883"> Paper </a> | <a href="https://www.colossalai.org/"> Documentation </a> | <a href="https://github.com/hpcaitech/ColossalAI/tree/main/examples"> Examples </a> | <a href="https://github.com/hpcaitech/ColossalAI/discussions"> Forum </a> | <a href="https://colossalai.org/zh-Hans/docs/get_started/bonus/">GPU Cloud Playground </a> | <a href="https://hpc-ai.com/blog"> Blog </a></h3> </div>

Instantly Run Colossal-AI on Enterprise-Grade GPUs
Skip the setup. Access a powerful, pre-configured Colossal-AI environment on HPC-AI Cloud.
Train your models and scale your AI workload in one click!
- NVIDIA Blackwell B200s: Experience the next generation of AI performance (See Benchmarks). Now available on cloud from $2.47/hr.
- Cost-Effective H200 Cluster: Get premier performance with on-demand rental from just $1.99/hr.
Get Started Now & Claim Your Free Credits →
<div align="center"> <a href="https://hpc-ai.com/?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai"> <img src="https://github.com/hpcaitech/public_assets/blob/main/colossalai/img/2-3.png" width="850" /> </a> </div>

Colossal-AI Benchmark
To see how these performance gains translate to real-world applications, we conducted a large language model training benchmark using Colossal-AI on Llama-like models. The tests were run on both 8-card and 16-card configurations for 7B and 70B models, respectively.
| GPU | GPUs | Model Size | Parallelism | Batch Size per DP | Seqlen | Throughput | TFLOPS/GPU | Peak Mem (MiB) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| H200 | 8 | 7B | zero2(dp8) | 36 | 4096 | 17.13 samp/s | 534.18 | 119040.02 |
| H200 | 16 | 70B | zero2 | 48 | 4096 | 3.27 samp/s | 469.1 | 150032.23 |
| B200 | 8 | 7B | zero1(dp2)+tp2+pp4 | 128 | 4096 | 25.83 samp/s | 805.69 | 100119.77 |
| B200 | 16 | 70B | zero1(dp2)+tp2+pp4 | 128 | 4096 | 5.66 samp/s | 811.79 | 100072.02 |
The results from the Colossal-AI benchmark provide the most practical insight. For the 7B model on 8 cards, the B200 achieved roughly 50% higher throughput (25.83 vs. 17.13 samples/s) and a comparable increase in TFLOPS per GPU. For the 70B model on 16 cards, the B200 again demonstrated a clear advantage, with over 70% higher throughput and TFLOPS per GPU. These numbers show that the B200's performance gains translate directly into faster training times for large-scale models.
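As a quick sanity check, the relative gains quoted above can be recomputed directly from the throughput and TFLOPS/GPU figures in the benchmark table. This is a minimal sketch using only the numbers reported there; the `speedup` helper is just illustrative arithmetic, not part of the Colossal-AI API.

```python
def speedup(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction (0.5 == 50% faster)."""
    return new / old - 1.0

# 7B model on 8 GPUs: B200 vs. H200 throughput (samples/s from the table)
throughput_7b = speedup(25.83, 17.13)   # ~0.51, i.e. "roughly 50% higher throughput"

# 70B model on 16 GPUs: B200 vs. H200 throughput
throughput_70b = speedup(5.66, 3.27)    # ~0.73, i.e. "over 70% higher"

# TFLOPS per GPU for the same two comparisons
tflops_7b = speedup(805.69, 534.18)     # ~0.51
tflops_70b = speedup(811.79, 469.1)     # ~0.73

print(f"7B  throughput gain: {throughput_7b:.0%}")
print(f"70B throughput gain: {throughput_70b:.0%}")
```

Note also that the parallelism label encodes how GPUs are partitioned: for `zero1(dp2)+tp2+pp4`, the product of data-, tensor-, and pipeline-parallel degrees (2 × 2 × 4) gives the total group size the strategy spans.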
Latest News
- [2025/02] DeepSeek 671B Fine-Tuning Guide Revealed: Unlock the Upgraded DeepSeek Suite with One Click
- [2024/12] Development costs for video generation models cut by 50%! Open-source solutions are now available, with H200 GPU vouchers [code] [vouchers]
- [2024/10] How to build a low-cost Sora-like app? Solutions for you
- [2024/09] Singapore Startup HPC-AI Tech Secures 50 Million USD in Series A Funding to Build the Video Generation AI Model and GPU Platform
- [2024/09] Reduce Large AI Model Training Costs by 30% with Just a Single Line of Code, via FP8 Mixed-Precision Training Upgrades
- [2024/06] Open-Sora Continues Open Source: Generate Any 16-Second 720p HD Video with One Click, Model Weights Ready to Use
- [2024/05] Large AI Models Inference Speed Doubled, Colossal-Inference Open Source Release
- [2024/04] Open-Sora Unveils Major Upgrade: Embracing Open Source with Single-Shot 16-Second Video Generation and 720p Resolution
- [2024/04] Most cost-effective solutions for inference, fine-tuning and pretraining, tailored to LLaMA3 series
Table of Contents
<ul>
  <li><a href="#Why-Colossal-AI">Why Colossal-AI</a></li>
  <li><a href="#Features">Features</a></li>
  <li><a href="#Colossal-AI-in-the-Real-World">Colossal-AI for Real World Applications</a>
    <ul>
      <li><a href="#Open-Sora">Open-Sora: Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models</a></li>
      <li><a href="#Colossal-LLaMA-2">Colossal-LLaMA-2: One Half-Day of Training Using a Few Hundred Dollars Yields Similar Results to Mainstream Large Models, an Open-Source and Commercial-Free Domain-Specific LLM Solution</a></li>
      <li><a href="#ColossalChat">ColossalChat: An Open-Source Solution for Cloning ChatGPT With a Complete RLHF Pipeline</a></li>
      <li><a href="#AIGC">AIGC: Acceleration of Stable Diffusion</a></li>
      <li><a href="#Biomedicine">Biomedicine: Acceleration of AlphaFold Protein Structure Prediction</a></li>
    </ul>
  </li>
  <li><a href="#Parallel-Training-Demo">Parallel Training Demo</a>
    <ul>
      <li><a href="#LLaMA3">LLaMA 1/2/3</a></li>
      <li><a href="#MoE">MoE</a></li>
      <li><a href="#GPT-3">GPT-3</a></li>
      <li><a href="#GPT-2">GPT-2</a></li>
      <li><a href="#BERT">BERT</a></li>
      <li><a href="#PaLM">PaLM</a></li>
      <li><a href="#OPT">OPT</a></li>
      <li><a href="#ViT">ViT</a></li>
      <li><a href="#Recommendation-System-Models">Recommendation System Models</a></li>
    </ul>
  </li>
  <li><a href="#Single-GPU-Training-Demo">Single GPU Training Demo</a>
    <ul>
      <li><a href="#GPT-2-Single">GPT-2</a></li>
      <li><a href="#PaLM-Single">PaLM</a></li>
    </ul>
  </li>
  <li><a href="#Inference">Inference</a>
    <ul>
      <li><a href="#Colossal-Inference">Colossal-Inference: Large AI Model Inference Speed Doubled</a></li>
      <li><a href="#Grok-1">Grok-1: 314B Model PyTorch + HuggingFace Inference</a></li>
      <li><a href="#SwiftInfer">SwiftInfer: Breaks the Length Limit of LLM for Multi-Round Conversations with 46% Acceleration</a></li>
    </ul>
  </li>
  <li><a href="#Installation">Installation</a>
    <ul>
      <li><a href="#PyPI">PyPI</a></li>
      <li><a href="#Install-From-Source">Install From Source</a></li>
    </ul>
  </li>
  <li><a href="#Use-Docker">Use Docker</a></li>
  <li><a href="#Community">Community</a></li>
  <li><a href="#Contributing">Contributing</a></li>
  <li><a href="#Cite-Us">Cite Us</a></li>
</ul>

Why Colossal-AI
<div align="center"> <a href="https://youtu.be/KnXSfjqkKN0"> <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/JamesDemmel_Colossal-AI.png" width="600" /> </a>Prof. James Demmel (UC Berkeley): Colossal-AI makes training AI models efficient, easy, and scalable.
</div> <p align="right">(<a href="#top">back to top</a>)</p>

Features
Colossal-AI provides a collection of parallel components for you. We aim to support you to write your distributed deep learning models just like how you write your model on your laptop.

