
Paris

Paris – the world's first decentralized-trained open-weight diffusion model

Install / Use

/learn @bageldotcom/Paris
About this skill

Quality Score

0/100

Supported Platforms

Zed

README

<img src="images/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 20px;"/>
<h1 style="font-size: 28px; margin-bottom: 20px;">Paris: A Decentralized Trained Open-Weight Diffusion Model</h1>
<a href="https://huggingface.co/bageldotcom/paris" target="_blank">
  <img src="https://img.shields.io/badge/🤗_DOWNLOAD_MODEL_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Model Weights" height="40">
</a>
<a href="https://github.com/bageldotcom/paris" target="_blank">
  <img src="https://img.shields.io/badge/⭐_STAR_ON_GITHUB-100000?style=for-the-badge&logo=github&logoColor=white" alt="Star on GitHub" height="40">
</a>
<a href="https://arxiv.org/abs/2510.03434" target="_blank">
  <img src="https://img.shields.io/badge/📄_READ_PAPER-B31B1B?style=for-the-badge&logo=arxiv&logoColor=white" alt="Read Paper on arXiv" height="40">
</a>
<div style="margin-top: 20px;"></div>

The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M–605M parameters each, depending on scale) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization. This yields higher parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. Read our paper on arXiv to learn more.

Key Characteristics

  • 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
  • No gradient synchronization, parameter sharing, or activation exchange among nodes during training
  • Lightweight transformer router (~129M parameters) for dynamic expert selection
  • 11M LAION-Aesthetic images across 120 A40 GPU-days
  • 14× less training data than prior decentralized baselines
  • 16× less compute than prior decentralized baselines
  • Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
  • Open weights for research and commercial use under MIT license
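
The weights are hosted on Hugging Face (badge above). A minimal sketch for fetching them with `huggingface_hub`; the internal file layout is not documented here, so check the model card for the actual files:

```python
# Minimal sketch: download the Paris weights from the Hugging Face repo
# linked above. snapshot_download fetches every file in the repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bageldotcom/paris")
print(f"Weights available under {local_dir}")
```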

Examples

*Figure (Paris generation examples): text-conditioned image generation samples using Paris across diverse prompts and visual styles.*


Architecture Details

| Component | Specification |
|-----------|---------------|
| Model Scale | DiT-XL/2 |
| Parameters per Expert | 605M |
| Total Expert Parameters | 4.84B (8 experts) |
| Router Parameters | ~129M |
| Hidden Dimension | 1152 |
| Transformer Layers | 28 |
| Attention Heads | 16 |
| Patch Size | 2×2 (latent space) |
| Latent Resolution | 32×32×4 |
| Image Resolution | 256×256 |
| Text Conditioning | CLIP ViT-L/14 |
| VAE | sd-vae-ft-mse (8× downsampling) |
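
As a quick reference, the table maps onto a configuration object roughly like the following sketch (field names are illustrative, not the actual classes in the Paris codebase):

```python
from dataclasses import dataclass

# Illustrative config mirroring the table above; field names are
# hypothetical and do not come from the Paris repository.
@dataclass(frozen=True)
class ParisXLConfig:
    num_experts: int = 8
    expert_params: str = "605M"          # per expert; 4.84B total
    router_params: str = "~129M"
    hidden_dim: int = 1152
    num_layers: int = 28
    num_heads: int = 16
    patch_size: int = 2                  # 2x2 patches in latent space
    latent_shape: tuple = (4, 32, 32)    # channels, height, width
    image_resolution: int = 256
    text_encoder: str = "CLIP ViT-L/14"
    vae: str = "sd-vae-ft-mse"           # 8x spatial downsampling
```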


Training Approach

Paris implements fully decentralized training where:

  • Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
  • No gradient synchronization, parameter sharing, or activation exchange between experts during training
  • Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
  • Router trained post-hoc on full dataset for expert selection during inference
  • Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)
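
As a rough sketch of the partitioning step, one can embed each image with DINOv2 and cluster the embeddings into eight semantic groups, one per expert. Loading, preprocessing, and streaming details below are assumptions, not the project's actual pipeline:

```python
import torch
from sklearn.cluster import MiniBatchKMeans

# Sketch of the partitioning step: embed images with DINOv2, cluster
# the embeddings into 8 semantic groups, and assign each cluster to
# one expert.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def embed(images: torch.Tensor):      # (B, 3, 224, 224), ImageNet-normalized
    return dinov2(images).cpu().numpy()

kmeans = MiniBatchKMeans(n_clusters=8, random_state=0)
# for batch in laion_loader:                    # stream LAION-Aesthetic
#     kmeans.partial_fit(embed(batch))
# expert_id = kmeans.predict(embed(batch))      # cluster id == expert id
```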

Training Architecture

*Figure: Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (data, pipeline, or model parallelism), Paris requires zero communication during training.*

This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.

Comparison with Traditional Parallelization

| Strategy | Synchronization | Straggler Impact | Topology Requirements |
|----------|-----------------|------------------|------------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| Paris | No synchronization | No blocking | Arbitrary |


Routing Strategies

  • top-1 (default): Single best expert per step. Fastest inference, competitive quality.
  • top-2: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost.
  • full-ensemble: All 8 experts weighted by router. Highest compute (8× cost).
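
The three strategies amount to different weightings of the per-expert noise predictions. A minimal sketch, with assumed shapes and signatures (the real pipeline would only evaluate the experts the router actually selects):

```python
import torch
import torch.nn.functional as F

def combine_eps(router_logits, expert_eps, strategy="top-1"):
    """Combine per-expert noise predictions for one denoising step.

    router_logits: (B, 8) router scores; expert_eps: (B, 8, C, H, W).
    Shapes and signatures are illustrative, not the repo's actual API.
    """
    B = router_logits.size(0)
    weights = F.softmax(router_logits, dim=-1)            # (B, 8)
    if strategy == "top-1":                               # fastest
        idx = weights.argmax(dim=-1)
        return expert_eps[torch.arange(B), idx]
    if strategy == "top-2":                               # often best quality
        w, idx = weights.topk(2, dim=-1)
        w = w / w.sum(dim=-1, keepdim=True)               # renormalize top-2
        picked = expert_eps[torch.arange(B)[:, None], idx]
        return (w[..., None, None, None] * picked).sum(dim=1)
    # "full-ensemble": all 8 experts weighted by the router
    return (weights[..., None, None, None] * expert_eps).sum(dim=1)
```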

Paris Inference Pipeline

*Figure: multi-expert inference pipeline showing router-based expert selection and the three routing strategies: top-1 (fastest), top-2 (best quality), and full ensemble (highest compute).*


Performance Metrics

Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)

| Inference Strategy | FID-50K ↓ |
|--------------------|-----------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| Paris Top-2 | 22.60 |
| Paris Full Ensemble | 47.89 |

Top-2 routing achieves a 7.04-point FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.


Training Details

Hyperparameters (DiT-XL/2)

| Parameter | Value |
|-----------|-------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
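
A sketch of how these hyperparameters map onto a standard PyTorch loop; the tiny `Conv2d` stands in for a DiT-XL/2 expert, and the MSE on random latents is a placeholder for the real epsilon-prediction objective:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Sketch wiring the hyperparameters above into a standard loop.
device = "cuda" if torch.cuda.is_available() else "cpu"
expert = torch.nn.Conv2d(4, 4, 3, padding=1).to(device)  # stand-in model
opt = torch.optim.AdamW(expert.parameters(), lr=2e-5)    # no LR schedule
scaler = GradScaler(enabled=device == "cuda")            # FP16 loss scaling
ema = {k: v.detach().clone() for k, v in expert.state_dict().items()}

ACCUM = 2                                                # 16 x 2 -> effective 32
for step in range(100):
    noisy = torch.randn(16, 4, 32, 32, device=device)    # 32x32x4 latents
    target = torch.randn_like(noisy)                     # stand-in noise target
    with autocast(enabled=device == "cuda"):
        loss = torch.nn.functional.mse_loss(expert(noisy), target)
    scaler.scale(loss / ACCUM).backward()
    if (step + 1) % ACCUM == 0:
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
        with torch.no_grad():                            # EMA decay 0.9999
            for k, v in expert.state_dict().items():
                ema[k].lerp_(v, 1 - 0.9999)
```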

Router Training

| Parameter | Value |
|-----------|-------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
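
Correspondingly, a hedged sketch of the post-hoc router loop; the `Linear` stands in for the DiT-B router, and the random labels stand in for the k-means cluster assignments:

```python
import torch
import torch.nn.functional as F

# Sketch of post-hoc router training: classify each latent into one of
# the 8 expert clusters with cross-entropy.
device = "cuda" if torch.cuda.is_available() else "cpu"
router = torch.nn.Linear(4 * 32 * 32, 8).to(device)     # stand-in for DiT-B
opt = torch.optim.AdamW(router.parameters(), lr=5e-5)

EPOCHS, STEPS, ACCUM = 25, 100, 4                       # 64 x 4 -> effective 256
sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=EPOCHS * STEPS // ACCUM)                 # cosine over 25 epochs

for epoch in range(EPOCHS):
    for step in range(STEPS):
        latents = torch.randn(64, 4, 32, 32, device=device)
        labels = torch.randint(0, 8, (64,), device=device)
        loss = F.cross_entropy(router(latents.flatten(1)), labels) / ACCUM
        loss.backward()
        if (step + 1) % ACCUM == 0:
            opt.step()
            opt.zero_grad()
            sched.step()
```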


Citation

If you use Paris in your research, please cite our arXiv paper:

```bibtex
@misc{jiang2025paris,
  title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
  author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  year={2025},
  eprint={2510.03434},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2510.03434}
}
```

License

MIT License – Open for research and commercial use.

Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a>

GitHub Stars: 54
Category: Education
Updated: 22 days ago
Forks: 4

Security Score

100/100

Audited on Mar 14, 2026

No findings