CM3Leon: Autoregressive Multi-Modal Model for Text and Image Generation (wip)

CM3Leon is a transformer-based autoregressive model designed for multi-modal tasks, specifically text and image generation. It is trained in two stages: retrieval-augmented pretraining on a large, diverse multimodal dataset, followed by supervised fine-tuning. It also implements contrastive decoding to improve the quality of generated samples.

Paper: "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning"

  • Please help with this open-source implementation in the Agora Discord.
  • This implementation is a work in progress.

Install

pip3 install cm3


Usage & Example

To get started with CM3Leon in a PyTorch environment:

import torch
from cm3.model import CM3

# A batch of one 256x256 RGB image and one caption of 1024 token ids
# drawn from a 20,000-token vocabulary.
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

model = CM3()

# Per-position scores over the vocabulary for the caption sequence.
output = model(img, caption)
print(output.shape)  # (1, 1024, 20000)


This repository hosts the open-source implementation of CM3Leon, a state-of-the-art autoregressive multi-modal model for text and image generation. The model is introduced in the paper "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning".


Overview

Key Features of CM3Leon:

  • Retrieval-augmented pretraining on a large, diverse multimodal dataset.
  • Two-stage training: pretraining and supervised fine-tuning.
  • Contrastive decoding for enhanced sample quality.

CM3Leon sets a new benchmark in text-to-image generation, outperforming comparable models while using 5x less training compute.

Getting Started

The following sections provide a detailed analysis of the model architecture, the necessary resources, and the steps needed to replicate the CM3Leon model.

Requirements

Replicating CM3Leon involves several critical components and requires proficiency in the following areas:

  • Large-scale distributed training of transformer models using a significant number of GPUs/TPUs.
  • Efficient data loading and preprocessing to handle extensive multimodal datasets.
  • Memory optimization techniques to accommodate large models within GPU memory (a minimal recipe is sketched after this list).
  • Custom tokenizer implementation for both text and image modalities.
  • Setting up a retrieval infrastructure for dense retrieval during pretraining.
  • Developing a fine-tuning framework to handle mixed text-image tasks.
  • Inference optimizations such as compiler-accelerated decoders, lower precision computing, and batching.
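
For the memory-optimization point above, a common PyTorch recipe combines mixed-precision autocast with activation (gradient) checkpointing. The following is a generic sketch using a toy layer stack, not this repository's training code:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A toy layer stack standing in for a large transformer.
device = "cuda" if torch.cuda.is_available() else "cpu"
layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(24)]).to(device)
opt = torch.optim.AdamW(layers.parameters(), lr=6e-4)
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")

x = torch.randn(8, 512, device=device)
target = torch.randn(8, 512, device=device)

# Autocast halves activation memory; checkpointing recomputes activations
# in the backward pass instead of storing them.
with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    h = x
    for layer in layers:
        h = checkpoint(layer, h, use_reentrant=False)
    loss = nn.functional.mse_loss(h, target)

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()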

System Architecture

The CM3Leon implementation comprises:

  • A distributed training framework, such as PyTorch or TensorFlow.
  • High-performance compute infrastructure (HPC cluster with GPUs/TPUs).
  • A retrieval index and dense retriever module for augmentation.
  • Data pipelines for efficient preprocessing and loading.
  • Custom code for tokenizers and the CM3 model architecture.
  • Fine-tuning framework and relevant task datasets.
  • Serving infrastructure for low-latency inference.

Implementing these components involves challenges such as efficient utilization of large compute clusters, minimizing data loading and preprocessing bottlenecks, optimizing memory usage during training and inference, and ensuring low latency serving.
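
As a concrete picture of the retrieval index and dense retriever module, the sketch below ranks a memory bank of precomputed embeddings against a query embedding by cosine similarity. The random tensors stand in for CLIP outputs; this is an illustration, not the repository's retrieval code:

import torch
import torch.nn.functional as F

def retrieve(query_emb, bank_embs, k=2):
    """Return the indices of the k memory-bank entries closest to the query.

    query_emb: (d,) embedding of the query text or image.
    bank_embs: (n, d) precomputed embeddings of the memory bank.
    """
    query = F.normalize(query_emb, dim=-1)
    bank = F.normalize(bank_embs, dim=-1)
    scores = bank @ query  # cosine similarity, shape (n,)
    return scores.topk(k).indices

# Random embeddings standing in for CLIP outputs.
bank = torch.randn(1000, 512)
query = torch.randn(512)
print(retrieve(query, bank, k=2))  # indices of the two nearest entries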

Model Architecture

The architecture of CM3Leon includes:

  • Text and Image Tokenizers: a custom text tokenizer trained on CommonCrawl data, and an image tokenizer that encodes 256x256 images into 1024 discrete tokens.
  • Special Tokens: a <break> token marks transitions between modalities.
  • Retrieval Augmentation: a CLIP-based bi-encoder retrieves relevant text and images from the memory bank.
  • Autoregressive Decoder-only Transformer: a standard transformer architecture similar to GPT models.
  • Two-Stage Training: retrieval-augmented pretraining followed by supervised finetuning on text-image tasks via instruction tuning.
  • Contrastive Decoding: a modified contrastive decoding procedure for better sample quality.

The model size ranges from 350M to 7B parameters.
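
To make the <break> token and retrieval augmentation concrete, here is a minimal sketch of how one training sequence could be assembled from the pieces above. The special-token id, vocabulary sizes, and helper name are assumptions for illustration, not the repository's actual preprocessing:

import torch

# Assumed special-token id; the real tokenizer defines its own.
BREAK = 20000  # id for the <break> modality separator

def build_sequence(retrieved_tokens, image_tokens, caption_tokens):
    """Concatenate retrieved context, image tokens, and the caption into a
    single autoregressive training sequence, separated by <break> tokens."""
    sep = torch.tensor([BREAK])
    return torch.cat([retrieved_tokens, sep, image_tokens, sep, caption_tokens])

# Dummy token ids with the shapes described above; in a real model the
# image-codebook ids would be offset into the joint vocabulary.
seq = build_sequence(
    retrieved_tokens=torch.randint(0, 20000, (128,)),  # retrieved documents
    image_tokens=torch.randint(0, 8192, (1024,)),      # 256x256 image -> 1024 tokens
    caption_tokens=torch.randint(0, 20000, (64,)),     # caption text
)
print(seq.shape)  # torch.Size([1218])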

Data

The datasets used in the paper, with additional metadata and source links:

| Dataset | Domain | Size | Source |
|-|-|-|-|
| Shutterstock | Images and captions | 3 billion text tokens, licensed image data | Proprietary dataset, described in paper |
| MS-COCO | Image captioning | 591K image-caption pairs | Microsoft COCO Captions |
| Flickr30k | Image captioning | 144K image-caption pairs | Flickr30k Entities |
| Image Paragraph | Dense image captioning | 14K images with paragraph captions | Image Paragraph dataset |
| Localized Narratives | Image paragraph captioning | 164K images with localized narratives | Localized Narratives |
| VQA2 | Visual question answering | 1.3M images with question-answer pairs | VQA2 dataset |
| VizWiz | Visual question answering for blind users | 92K images with question-answer pairs | VizWiz dataset |
| OKVQA | Knowledge-based VQA | 26K images with question-answer pairs | OK-VQA dataset |
| ScienceQA | Scientific visual QA | 6K images with multi-choice QA pairs | ScienceQA |

The model was trained and evaluated on several datasets, including MS-COCO [...] (Chen et al., 2015) and Flickr30k [...] (Young et al., 2014).

For successful implementation, CM3Leon requires:

  • A large (100M+ examples) diverse multimodal dataset like Shutterstock for pretraining.
  • A mixture of text and image tasks with accompanying datasets for finetuning.
  • Efficient and scalable data loading that does not bottleneck model training.
  • Preprocessing steps like resizing images to 256x256 pixels and text tokenization (sketched after this list).
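
A minimal sketch of the preprocessing bullet above, assuming torchvision for image transforms and a generic tokenizer with an encode method; the project's custom tokenizer and pad id are not shown here:

import torch
from PIL import Image
from torchvision import transforms

# Resize and convert an image to the 256x256 tensor the model expects.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),  # (3, 256, 256), values in [0, 1]
])

def prepare_example(image_path, caption, tokenizer, max_len=1024, pad_id=0):
    img = preprocess(Image.open(image_path).convert("RGB"))
    ids = tokenizer.encode(caption)[:max_len]    # truncate long captions
    ids = ids + [pad_id] * (max_len - len(ids))  # pad short ones
    return img, torch.tensor(ids)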

Training

CM3Leon's training process involves:

  • Pretraining with retrieval augmentation and the CM3 objective (a minimal training step is sketched after this list).
  • Supervised finetuning on text-image tasks.
  • Efficient distributed training infrastructure for large-scale model training.
  • Hyperparameter tuning for learning rates, batch sizes, optimizers, etc.
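
At its core, a pretraining step is next-token prediction with cross-entropy over the assembled multimodal sequence; the actual CM3 objective adds causally-masked infilling on top. A minimal sketch of the plain next-token step, assuming a generic decoder-only model that returns per-position logits:

import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    """One next-token-prediction step; tokens is a (batch, seq_len) id tensor."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice this step would run under the distributed training and memory-optimization machinery described earlier.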

Inference

For efficient inference, consider:

  • Using compiler-accelerated decoders like FasterTransformer.
  • Other optimizations like lower precision (FP16/INT8) and batching.
  • Efficient implementation of contrastive decoding (a vanilla version is sketched after this list).
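
Standard contrastive decoding (Li et al., 2022) scores each candidate token by the difference between a large "expert" model's log-probability and a small "amateur" model's, restricted to tokens the expert finds plausible. The paper uses a modified variant, so treat the sketch below as the vanilla algorithm, not CM3Leon's exact procedure:

import torch
import torch.nn.functional as F

def contrastive_decode_step(expert_logits, amateur_logits, alpha=0.1):
    """Choose the next token by expert-minus-amateur log-probability.

    expert_logits, amateur_logits: (vocab,) next-token logits.
    alpha: plausibility cutoff relative to the expert's most likely token.
    """
    expert_logp = F.log_softmax(expert_logits, dim=-1)
    amateur_logp = F.log_softmax(amateur_logits, dim=-1)

    # Only tokens with p_expert >= alpha * max(p_expert) are candidates.
    cutoff = expert_logp.max() + torch.log(torch.tensor(alpha))
    scores = expert_logp - amateur_logp
    scores[expert_logp < cutoff] = float("-inf")
    return scores.argmax().item()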

Hyperparameters

| Parameters | Layers | d_model | Sequence length | Batch size (tokens) | Learning rate |
|-|-|-|-|-|-|
| 350M | 24 | 1024 | 4096 | 8M | 6e-04 |