Paris
Paris: the world's first decentralized-trained open-weight diffusion model
The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M–605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate activation synchronization. This yields higher parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. Read our paper on arXiv to learn more.
Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- Trained on 11M LAION-Aesthetic images in 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
Examples

Text-conditioned image generation samples using Paris across diverse prompts and visual styles
Architecture Details
| Component | Specification |
|-----------|---------------|
| Model Scale | DiT-XL/2 |
| Parameters per Expert | 605M |
| Total Expert Parameters | 4.84B (8 experts) |
| Router Parameters | ~129M |
| Hidden Dimension | 1152 |
| Transformer Layers | 28 |
| Attention Heads | 16 |
| Patch Size | 2×2 (latent space) |
| Latent Resolution | 32×32×4 |
| Image Resolution | 256×256 |
| Text Conditioning | CLIP ViT-L/14 |
| VAE | sd-vae-ft-mse (8× downsampling) |
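As a sanity check on the shapes in the table, a quick calculation (a sketch from the listed specs; the function name is mine, not from the released code) of how a 256×256 image becomes transformer tokens:

```python
# Shape arithmetic implied by the table above: 8x VAE downsampling,
# then 2x2 patchification of the latent into transformer tokens.

def dit_token_count(image_res=256, vae_downsample=8, patch_size=2):
    """Number of tokens the DiT backbone sees for one image."""
    latent_res = image_res // vae_downsample       # 256 // 8 = 32
    tokens_per_side = latent_res // patch_size     # 32 // 2 = 16
    return tokens_per_side ** 2

print(dit_token_count())  # 256 tokens per 256x256 image
```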
Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)

Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.
This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.
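The partitioning step can be sketched as k-means over image embeddings. The embeddings below are random placeholders for DINOv2 features, and the loop is a generic Lloyd's iteration, not the paper's code:

```python
import numpy as np

# Cluster image embeddings into 8 semantically coherent partitions,
# one per expert. Each expert then trains only on its own partition,
# with no gradient or parameter exchange with the others.

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))    # placeholder for DINOv2 features
k = 8                                        # one cluster per expert

centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
for _ in range(10):                          # a few Lloyd's iterations
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    assign = dists.argmin(axis=1)            # expert id for every image
    centroids = np.stack([
        embeddings[assign == j].mean(axis=0) if (assign == j).any() else centroids[j]
        for j in range(k)
    ])

partitions = [np.flatnonzero(assign == j) for j in range(k)]  # per-expert data
```

Because the partitions are disjoint, each expert's training job can run on any machine at any speed, which is what removes the interconnect requirement.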
Comparison with Traditional Parallelization
| Strategy | Synchronization | Straggler Impact | Topology Requirements |
|----------|-----------------|------------------|-----------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| Paris | No synchronization | No blocking | Arbitrary |
Routing Strategies
- `top-1` (default): Single best expert per step. Fastest inference, competitive quality.
- `top-2`: Weighted ensemble of the top 2 experts. Often best quality, 2× inference cost.
- `full-ensemble`: All 8 experts weighted by the router. Highest compute (8× cost).

Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).
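A minimal sketch of the three strategies (the function, dummy tensors, and logits are illustrative, not the released inference code): the router scores all 8 experts, and the denoising prediction is a softmax-weighted sum over the selected subset.

```python
import numpy as np

def route(router_logits, expert_outputs, k):
    """Blend the top-k experts' noise predictions, softmax-weighted."""
    top = np.argsort(router_logits)[-k:]            # indices of the k best experts
    w = np.exp(router_logits[top] - router_logits[top].max())
    w /= w.sum()                                    # renormalize over the subset
    return sum(wi * expert_outputs[i] for wi, i in zip(w, top))

logits = np.array([2.0, 0.1, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
outputs = [np.full((4, 4), float(i)) for i in range(8)]  # dummy predictions

top1 = route(logits, outputs, k=1)   # single best expert, 1x cost
top2 = route(logits, outputs, k=2)   # weighted pair, 2x cost
full = route(logits, outputs, k=8)   # full ensemble, 8x cost
```

With `k=1` the weights collapse to a single 1.0, so `top-1` is just a hard dispatch to the router's argmax expert.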
Performance Metrics
Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)
| Inference Strategy | FID-50K ↓ |
|--------------------|-----------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| Paris Top-2 | 22.60 |
| Paris Full Ensemble | 47.89 |
Top-2 routing achieves a 7.04 FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.
Training Details
Hyperparameters (DiT-XL/2)
| Parameter | Value |
|-----------|-------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
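The batch-size row works because gradients are averaged within each micro-batch and summed across accumulation steps. A toy scalar model (my own illustration, not the training code) makes the arithmetic concrete:

```python
# Gradient accumulation as in the table: micro-batches of 16, one
# optimizer step every 2 micro-batches, so each update reflects 32 samples.
# The "model" is a single scalar w fit to data by SGD.

def train(data, micro_batch=16, accum_steps=2, lr=2e-5):
    w, grad_acc, pending = 0.0, 0.0, 0
    for i in range(0, len(data), micro_batch):
        batch = data[i:i + micro_batch]
        # gradient of 0.5*(w - x)^2 averaged over the micro-batch, scaled
        # by 1/accum_steps so the summed update matches a 32-sample batch
        grad_acc += sum(w - x for x in batch) / len(batch) / accum_steps
        pending += 1
        if pending == accum_steps:       # one optimizer step per 32 samples
            w -= lr * grad_acc
            grad_acc, pending = 0.0, 0
    return w
```

The same trick lets each expert keep its per-GPU memory footprint at a batch of 16 while matching the update statistics of a batch of 32.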
Router Training
| Parameter | Value |
|-----------|-------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
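The routing objective reduces to 8-way classification: predict an image's cluster id, trained with cross-entropy. A linear-model sketch of that objective (synthetic features and labels, and a larger toy learning rate than the table's 5e-5; the real router is a DiT-B backbone):

```python
import numpy as np

# Router training objective: cross-entropy against each image's cluster
# assignment. A linear classifier stands in for the DiT-B router.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))           # placeholder image features
y = X[:, :8].argmax(axis=1)              # synthetic "cluster id" targets
W = np.zeros((32, 8))                    # linear router weights

def cross_entropy(W):
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(log_z - logits[np.arange(len(y)), y]))

for _ in range(100):                     # plain full-batch gradient descent
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)    # softmax probabilities
    p[np.arange(len(y)), y] -= 1.0       # d(cross-entropy)/d(logits)
    W -= 0.5 * (X.T @ p) / len(y)        # toy learning rate
```

At inference the trained router's softmax output doubles as the ensemble weights for the top-k routing strategies above.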
Citation
If you use Paris in your research, please cite our arXiv paper:
```bibtex
@misc{jiang2025paris,
  title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
  author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  year={2025},
  eprint={2510.03434},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2510.03434}
}
```
License
MIT License – Open for research and commercial use.
Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a>