Paris
Paris: the world's first decentralized-trained open-weight diffusion model
The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M–605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate activation synchronization. This yields higher parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. Read our paper on arXiv to learn more.
Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- Trained on 11M LAION-Aesthetic images in 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
Examples

Text-conditioned image generation samples using Paris across diverse prompts and visual styles
Architecture Details
| Component | Specification |
|-----------|---------------|
| Model Scale | DiT-XL/2 |
| Parameters per Expert | 605M |
| Total Expert Parameters | 4.84B (8 experts) |
| Router Parameters | ~129M |
| Hidden Dimension | 1152 |
| Transformer Layers | 28 |
| Attention Heads | 16 |
| Patch Size | 2×2 (latent space) |
| Latent Resolution | 32×32×4 |
| Image Resolution | 256×256 |
| Text Conditioning | CLIP ViT-L/14 |
| VAE | sd-vae-ft-mse (8× downsampling) |
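As a sanity check on the shapes in the table, a quick calculation (a sketch from the listed specs; the function name is mine, not from the released code) of how a 256×256 image becomes transformer tokens:

```python
# Shape arithmetic implied by the table above: 8x VAE downsampling,
# then 2x2 patchification of the latent into transformer tokens.

def dit_token_count(image_res=256, vae_downsample=8, patch_size=2):
    """Number of tokens the DiT backbone sees for one image."""
    latent_res = image_res // vae_downsample       # 256 // 8 = 32
    tokens_per_side = latent_res // patch_size     # 32 // 2 = 16
    return tokens_per_side ** 2

print(dit_token_count())  # 256 tokens per 256x256 image
```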
Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)

Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.
This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.
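The partitioning step can be sketched as k-means over image embeddings. The embeddings below are random placeholders for DINOv2 features, and the loop is a generic Lloyd's iteration, not the paper's code:

```python
import numpy as np

# Cluster image embeddings into 8 semantically coherent partitions,
# one per expert. Each expert then trains only on its own partition,
# with no gradient or parameter exchange with the others.

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))    # placeholder for DINOv2 features
k = 8                                        # one cluster per expert

centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
for _ in range(10):                          # a few Lloyd's iterations
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    assign = dists.argmin(axis=1)            # expert id for every image
    centroids = np.stack([
        embeddings[assign == j].mean(axis=0) if (assign == j).any() else centroids[j]
        for j in range(k)
    ])

partitions = [np.flatnonzero(assign == j) for j in range(k)]  # per-expert data
```

Because the partitions are disjoint, each expert's training job can run on any machine at any speed, which is what removes the interconnect requirement.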
Comparison with Traditional Parallelization
| Strategy | Synchronization | Straggler Impact | Topology Requirements |
|----------|-----------------|------------------|-----------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| Paris | No synchronization | No blocking | Arbitrary |
Routing Strategies
- `top-1` (default): Single best expert per step. Fastest inference, competitive quality.
- `top-2`: Weighted ensemble of the top 2 experts. Often best quality, 2× inference cost.
- `full-ensemble`: All 8 experts weighted by the router. Highest compute (8× cost).

Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).
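A minimal sketch of the three strategies (the function, dummy tensors, and logits are illustrative, not the released inference code): the router scores all 8 experts, and the denoising prediction is a softmax-weighted sum over the selected subset.

```python
import numpy as np

def route(router_logits, expert_outputs, k):
    """Blend the top-k experts' noise predictions, softmax-weighted."""
    top = np.argsort(router_logits)[-k:]            # indices of the k best experts
    w = np.exp(router_logits[top] - router_logits[top].max())
    w /= w.sum()                                    # renormalize over the subset
    return sum(wi * expert_outputs[i] for wi, i in zip(w, top))

logits = np.array([2.0, 0.1, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
outputs = [np.full((4, 4), float(i)) for i in range(8)]  # dummy predictions

top1 = route(logits, outputs, k=1)   # single best expert, 1x cost
top2 = route(logits, outputs, k=2)   # weighted pair, 2x cost
full = route(logits, outputs, k=8)   # full ensemble, 8x cost
```

With `k=1` the weights collapse to a single 1.0, so `top-1` is just a hard dispatch to the router's argmax expert.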
Performance Metrics
Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)
| Inference Strategy | FID-50K ↓ |
|--------------------|-----------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| Paris Top-2 | 22.60 |
| Paris Full Ensemble | 47.89 |
Top-2 routing achieves a 7.04 FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.
Training Details
Hyperparameters (DiT-XL/2)
| Parameter | Value |
|-----------|-------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
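The batch-size row works because gradients are averaged within each micro-batch and summed across accumulation steps. A toy scalar model (my own illustration, not the training code) makes the arithmetic concrete:

```python
# Gradient accumulation as in the table: micro-batches of 16, one
# optimizer step every 2 micro-batches, so each update reflects 32 samples.
# The "model" is a single scalar w fit to data by SGD.

def train(data, micro_batch=16, accum_steps=2, lr=2e-5):
    w, grad_acc, pending = 0.0, 0.0, 0
    for i in range(0, len(data), micro_batch):
        batch = data[i:i + micro_batch]
        # gradient of 0.5*(w - x)^2 averaged over the micro-batch, scaled
        # by 1/accum_steps so the summed update matches a 32-sample batch
        grad_acc += sum(w - x for x in batch) / len(batch) / accum_steps
        pending += 1
        if pending == accum_steps:       # one optimizer step per 32 samples
            w -= lr * grad_acc
            grad_acc, pending = 0.0, 0
    return w
```

The same trick lets each expert keep its per-GPU memory footprint at a batch of 16 while matching the update statistics of a batch of 32.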
Router Training
| Parameter | Value |
|-----------|-------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
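The routing objective reduces to 8-way classification: predict an image's cluster id, trained with cross-entropy. A linear-model sketch of that objective (synthetic features and labels, and a larger toy learning rate than the table's 5e-5; the real router is a DiT-B backbone):

```python
import numpy as np

# Router training objective: cross-entropy against each image's cluster
# assignment. A linear classifier stands in for the DiT-B router.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))           # placeholder image features
y = X[:, :8].argmax(axis=1)              # synthetic "cluster id" targets
W = np.zeros((32, 8))                    # linear router weights

def cross_entropy(W):
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(log_z - logits[np.arange(len(y)), y]))

for _ in range(100):                     # plain full-batch gradient descent
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)    # softmax probabilities
    p[np.arange(len(y)), y] -= 1.0       # d(cross-entropy)/d(logits)
    W -= 0.5 * (X.T @ p) / len(y)        # toy learning rate
```

At inference the trained router's softmax output doubles as the ensemble weights for the top-k routing strategies above.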
Citation
If you use Paris in your research, please cite our arXiv paper:
```bibtex
@misc{jiang2025paris,
  title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
  author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  year={2025},
  eprint={2510.03434},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2510.03434}
}
```
License
MIT License – Open for research and commercial use.
Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a>