CrossFlow
This is a PyTorch-based reimplementation of CrossFlow, as proposed in
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
CVPR 2025 Highlight
Qihao Liu | Xi Yin | Alan Yuille | Andrew Brown | Mannat Singh
[project page] | [huggingface demo] | [paper] | [arxiv]

This repository provides a PyTorch-based reimplementation of CrossFlow for the text-to-image generation task, with the following differences compared to the original paper:
- Model Architecture: The original paper utilizes DiMR as the model architecture. In contrast, this codebase supports training and inference with both DiT (ICCV 2023, a widely adopted architecture) and DiMR (NeurIPS 2024, a state-of-the-art architecture).
- Dataset: The original model was trained on a proprietary 350M dataset. In this implementation, the models are trained on open-source datasets, including LAION-400M and JourneyDB (4M).
- LLMs: The original 1B model only supports CLIP as the language model, whereas this implementation includes 1B models with CLIP and T5-XXL.
TODO
- [x] ~~Release inference code and 512px CLIP DiMR-based model.~~
- [x] ~~Release training code and a detailed training tutorial (ETA: Dec 20).~~
- [x] ~~Release inference code for linear interpolation and arithmetic.~~
- [x] ~~Release all pretrained checkpoints, including: (ETA: Dec 23)~~
- [x] ~~Update pretrained checkpoints (ETA: Dec 28)~~
- [x] ~~Provide a demo via Hugging Face Space and Colab.~~
Setup
Environment
The code has been tested with PyTorch 2.1.2 and CUDA 12.1.
An example set of installation commands is provided below:

```bash
git clone git@github.com:qihao067/CrossFlow.git
cd CrossFlow
pip3 install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip3 install -U --pre triton
pip3 install -r requirements.txt
```
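If you want to confirm the install matches the tested versions before going further, a quick sanity check (a minimal sketch; the exact version strings depend on your build):

```python
# sanity-check the tested environment: PyTorch 2.1.2 + CUDA 12.1
import torch

print(torch.__version__)          # expect something like 2.1.2+cu121
print(torch.version.cuda)         # expect 12.1
print(torch.cuda.is_available())  # expect True on a GPU machine
```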
Model Preparation
To train or test the model, you will also need to download the VAE model from Stable Diffusion, and the reference statistics for zero-shot FID on the MSCOCO validation set. For your convenience, you can directly download all the models from here.
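After downloading, it can help to verify that everything is where your configs expect it. A minimal sketch, assuming a hypothetical layout (the file names below are placeholders, not the repo's required names; match them to wherever you extract the download):

```python
# check that the required assets exist before launching training/inference
from pathlib import Path

# hypothetical layout -- adjust to where you actually put the files
assets = [
    Path("assets/stable_diffusion/vae.pth"),        # SD VAE weights (placeholder name)
    Path("assets/fid_stats/mscoco_val_stats.npz"),  # MSCOCO FID reference stats (placeholder name)
]
for p in assets:
    status = "found" if p.exists() else "MISSING"
    print(f"{status}: {p}")
```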
Pretrained Models
| Architecture | Resolution | LM | Download | Details |
| :----------- | :--------- | :----- | :------- | :------ |
| DiMR | 256x256 | CLIP | [t2i_256px_clip_dimr.pth] | Trained from scratch on LAION-400M for 1 epoch, then fine-tuned on JourneyDB for 10 epochs. |
| DiMR | 256x256 | T5-XXL | [t2i_256px_t5_dimr.pth] | Initialized with [t2i_256px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. |
| DiMR | 512x512 | CLIP | [t2i_512px_clip_dimr.pth] | Initialized with [t2i_256px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. (Model with the best T-I alignment*) |
| DiMR | 512x512 | T5-XXL | [t2i_512px_t5_dimr.pth] | Initialized with [t2i_512px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. |
| DiT | 512x512 | T5-XXL | [t2i_512px_t5_dit.pth] | Initialized with [t2i_512px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. |
*To save training time, all T5-XXL-based models are initialized with a CLIP-based model and fine-tuned on JourneyDB (4M) for ten epochs. As a result, these models may occasionally exhibit very minor text-image misalignments, which are not observed in the original paper's T5 models since they are trained from scratch.
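Before pointing the configs at a checkpoint, you can peek inside it to confirm the download is intact. A minimal sketch, assuming the `.pth` file is a (possibly nested) state dict:

```python
# peek inside a downloaded checkpoint without building the model
import torch

ckpt = torch.load("path/to/t2i_512px_clip_dimr.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)  # unwrap if the weights are nested
print(f"{len(state)} entries")
for name in list(state)[:5]:  # first few parameter names and shapes
    print(name, tuple(state[name].shape))
```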
Sampling
T2I generation
You can sample from the pre-trained CrossFlow model with `demo_t2i.py`. Before running the script, download the appropriate checkpoint and configure hyperparameters such as the classifier-free guidance scale, random seed, and mini-batch size in the corresponding configuration file.

To accelerate sampling, the script supports multi-GPU sampling. For example, to sample from the 512px CLIP DiMR-based CrossFlow model with `N` GPUs, use the following command, which generates `N x mini-batch size` images per run:

```bash
# if only sampling with one GPU:
# accelerate launch --num_processes 1 --mixed_precision bf16 demo_t2i.py \
accelerate launch --multi_gpu --num_processes N --mixed_precision bf16 demo_t2i.py \
    --config=configs/t2i_512px_clip_dimr.py \
    --nnet_path=path/to/t2i_512px_clip_dimr.pth \
    --img_save_path=temp_saved_images \
    --prompt='your prompt'
```
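If you want to sweep several prompts, one option is a small driver around the same CLI. This is a sketch, not part of the repo; it assumes `demo_t2i.py` accepts exactly the flags shown above:

```python
# hypothetical driver: run demo_t2i.py once per prompt via the CLI above
import subprocess

prompts = [
    "A dog cooking dinner in the kitchen",
    "An orange cat wearing sunglasses on a ship",
]
for i, prompt in enumerate(prompts):
    subprocess.run(
        [
            "accelerate", "launch", "--num_processes", "1", "--mixed_precision", "bf16",
            "demo_t2i.py",
            "--config=configs/t2i_512px_clip_dimr.py",
            "--nnet_path=path/to/t2i_512px_clip_dimr.pth",
            f"--img_save_path=temp_saved_images/run_{i}",  # one folder per prompt
            f"--prompt={prompt}",
        ],
        check=True,  # stop on the first failed run
    )
```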
Linear Interpolation in Latent Space
Our model produces visually smooth interpolations in the latent space. Using the `demo_t2i_arith.py` script, images can be generated through linear interpolation between two input prompts with the following command:

```bash
accelerate launch --num_processes 1 --mixed_precision bf16 demo_t2i_arith.py \
    --config=configs/t2i_512px_clip_dimr.py \
    --nnet_path=path/to/t2i_512px_clip_dimr.pth \
    --img_save_path=temp_saved_images \
    --test_type=interpolation \
    --prompt_1='A dog cooking dinner in the kitchen' \
    --prompt_2='An orange cat wearing sunglasses on a ship' \
    --num_of_interpolation=40 \
    --save_gpu_memory
```

This script supports sampling on a single GPU only. For linear interpolation, adjust the `num_of_interpolation` parameter, which controls the number of interpolated images generated. The script requires a minimum of `5` images, but we recommend setting it to `40` for smoother interpolations. Additionally, you can enable the `save_gpu_memory` option to reduce GPU VRAM usage, though this requires extra time.

Finally, the command generates `num_of_interpolation` images in the specified `img_save_path`. Using the provided random seed (`1234`), the resulting images will appear as follows.
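Under the hood, the interpolation is a convex combination of the two prompts' encoded latents. A toy sketch of the idea (random tensors and stand-in shapes, not the repo's internal code; the real latents come from the Text Variational Encoder):

```python
# toy illustration of linear interpolation between two text latents
import torch

z1 = torch.randn(1, 77, 768)  # stand-in for the encoded latent of prompt_1
z2 = torch.randn(1, 77, 768)  # stand-in for the encoded latent of prompt_2

num_of_interpolation = 40
latents = []
for i in range(num_of_interpolation):
    t = i / (num_of_interpolation - 1)     # sweeps 0.0 -> 1.0
    latents.append((1 - t) * z1 + t * z2)  # linear interpolation
# each entry would then be mapped to an image by the flow model
```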
Arithmetic Operations in Latent Space
Our model supports arithmetic operations in the text latent space. Using the Text Variational Encoder, we first encode the input text into the latent space. Arithmetic operations are then applied within this latent space, and the resulting latent representation is used to generate the corresponding image. An example can be run with the following command:
```bash
accelerate launch --num_processes 1 --mixed_precision bf16 demo_t2i_arith.py \
    --config=configs/t2i_512px_clip_dimr.py \
    --nnet_path=path/to/t2i_512px_clip_dimr.pth \
    --img_save_path=temp_saved_images \
    --test_type=arithmetic \
    --prompt_ori='A corgi wearing a red hat in the park' \
    --prompt_a='book' \
    --prompt_s='hat'
```

The images generated in the folder `img_save_path` include images of the input prompts, followed by the resulting image after the arithmetic operation (`prompt_ori + prompt_a - prompt_s`):

<p align="center" style="display: flex; align-items: center;"> <img src="https://github.com/qihao067/CrossFlow/blob/main/imgs/0.png" alt="Figure 1" width="200"/> <img src="https://github.com/qihao067/CrossFlow/blob/main/imgs/1.png" alt="Figure 2" width="200"/> <img src="https://github.com/qihao067/CrossFlow/blob/main/imgs/2.png" alt="Figure 3" width="200"/> <img src="https://github.com/qihao067/CrossFlow/blob/main/imgs/3.png" alt="Figure 4" width="200"/> </p>
We also support single arithmetic operations: you can perform addition by providing only `prompt_ori` and `prompt_a`, or subtraction by providing only `prompt_ori` and `prompt_s`.
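The arithmetic itself is plain vector addition and subtraction on the encoded latents. A toy sketch of the combination the script performs (random tensors and stand-in shapes, not the repo's code):

```python
# toy illustration of latent arithmetic: ori + a - s
import torch

z_ori = torch.randn(1, 77, 768)  # stand-in latent of 'A corgi wearing a red hat in the park'
z_a   = torch.randn(1, 77, 768)  # stand-in latent of 'book'
z_s   = torch.randn(1, 77, 768)  # stand-in latent of 'hat'

z = z_ori + z_a - z_s  # corgi with a book instead of a hat
# addition only:    z = z_ori + z_a
# subtraction only: z = z_ori - z_s
```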
Training CrossFlow for T2I
Prepare training data
To train the CrossFlow model, you need a dataset consisting of image-text pairs. We provide a demo dataset (download here) containing 100 images sourced from JourneyDB. The dataset includes an image folder and a `.jsonl` file that specifies the image paths and their corresponding captions.

To accelerate training, you can cache the image latents (from a VAE) and text embeddings (from a language model such as CLIP or T5-XXL) beforehand. We offer preprocessing scripts to simplify this step.
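For reference, a `.jsonl` file holds one JSON object per line. A sketch of reading such a manifest (the path and field names here are assumptions; check the demo dataset for the exact schema):

```python
# read a JSONL manifest of image-text pairs (hypothetical path and keys)
import json

with open("path/to/demo_dataset/data.jsonl") as f:
    for line in f:
        item = json.loads(line)
        print(item["image_path"], "->", item["caption"])  # assumed field names
```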
Specifically, you can use the [`scripts/extract_train_feature.py`](https://github.com
