M4
[TBD] "m4: A Learned Flow-level Network Simulator" by Chenning Li, Anton A. Zabreyko, Om Chabra, Arash Nasr-Esfahany, Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas Anderson.
m4: A Learned Flow-level Network Simulator
This repository provides scripts and instructions to replicate the experiments from our paper, m4: A Learned Flow-level Network Simulator. It includes all necessary tools to reproduce the experimental results documented in Sections 5.2 to 5.6 of the paper.
Contents
- Repository Structure
- Quick Reproduction
- Setup and Installation
- Running Experiments from Scratch
- Training Your Own Model
- Citation
- Acknowledgments
- Contact
Repository Structure
├── checkpoints/ # Pre-trained model checkpoints
├── config/ # Configuration files for training and testing m4
├── figs/ # Generated figures and plots from experiments
├── High-Precision-Congestion-Control/ # HPCC repository for data generation
├── inference/ # C++ inference engine for m4
├── parsimon-eval/ # Scripts to reproduce m4 experiments and comparisons
├── results/ # Experimental results and outputs
├── results_train/ # Training results and outputs
├── testbed/ # Testbed integration with ns-3, FlowSim, and m4 backends
│ ├── backends/ # Backend implementations
│ │ ├── m4/ # M4 ML-based simulator
│ │ ├── flowsim/ # FlowSim flow-level simulator
│ │ └── UNISON/ # NS3 packet-level simulator (UNISON)
│ ├── eval_test/ # Test scenarios and results
│ │ ├── testbed/ # Real hardware ground truth (24 scenarios)
│ │ ├── m4/ # M4 simulation results
│ │ ├── flowsim/ # FlowSim simulation results
│ │ └── ns3/ # NS3 simulation results
│ ├── eval_train/ # Training data generation
│ ├── results/ # Generated plots and accuracy summaries
│ ├── results_train/ # Training results and outputs
│ ├── run.py # Main runner script for simulations
│ ├── analyze.py # Results analysis and visualization
│ └── build.sh # Build script for all backends
├── SimAI/ # SimAI integration with UNISON, flowSim, and m4 backends
│ ├── astra-sim-alibabacloud/ # Core simulation framework
│ │ ├── astra-sim/ # AstraSim system layer
│ │ │ ├── network_frontend/ # Network backend implementations
│ │ │ │ ├── ns3/ # UNISON (ns-3) packet-level simulator
│ │ │ │ ├── flowsim/ # flowSim analytical simulator
│ │ │ │ └── m4/ # m4 ML-based simulator
│ │ │ └── system/ # System components (routing, collective ops)
│ │ ├── extern/ # ns-3 source code
│ │ └── build.sh # Build script for all backends
│ ├── example/ # Example workloads and topologies
│ │ ├── gray_failures/ # 105 pre-generated gray failure topology files
│ │ │ └── gray_topo_N{2-16}_R{4-10}.txt # Topology files for N degraded GPUs, R reduction factor
│ │ ├── microAllReduce.txt # AllReduce collective workload
│ │ └── SimAI.conf # ns-3 configuration
│ ├── scripts/ # Build and run scripts
│ ├── results_gray_failures/ # Pre-computed gray failure results (315 simulations)
│ │ └── n_{N}_r_{R}_{backend}/ # Individual scenario results (ns3/flowsim/m4)
│ ├── gray_failure_run_sweep.py # Gray failure sweep runner
│ ├── gray_failure_plot_results.py # Generate evaluation plots (6 figures)
│ └── gray_failure_topo_viz.py # Topology visualization tool
├── util/ # Utility functions for m4, including data loaders and ML model implementations
├── main_train.py # Main script for training and testing m4
└── plot_results.ipynb # Jupyter notebook for visualizing results
Quick Reproduction
To quickly reproduce the results in the paper, follow these steps:
1. Clone the repository and initialize submodules:
git clone https://github.com/netiken/m4.git
cd m4
git submodule update --init --recursive
2. Set up Python environment:
- Install uv (a fast Python package manager): Follow the installation guide at https://docs.astral.sh/uv/getting-started/installation/
- Set up Python environment:
uv sync
source .venv/bin/activate # Activate the virtual environment!
3. Reproduce paper results:
- Section 5.2 (Testbed Integration): Run cd testbed && python analyze.py to generate testbed comparison plots from pre-computed results
- Section 5.3 (SimAI Integration): Check pre-computed results in SimAI/results_gray_failures/ and run python SimAI/gray_failure_plot_results.py to generate paper figures
- Sections 5.4-5.6 (m4 Evaluation): Run the notebook plot_results.ipynb to generate paper figures
Setup and Installation
- Always activate the python environment before running any commands:
uv sync
source .venv/bin/activate # Activate the virtual environment!
- Install Rust and Cargo:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install nightly
rustup default nightly
- Install gcc-9:
sudo apt-get install gcc-9 g++-9
- Set up ns-3 (for training dataset with packet traces) and UNISON (for fast simulation) for data generation:
cd High-Precision-Congestion-Control/UNISON-for-ns-3
./configure.sh
./ns3 run 'scratch/third mix/config_test.txt'
cd ../ns-3.39
./configure.sh
./ns3 run 'scratch/third mix/config_test.txt'
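Before building, it can help to confirm that the tools installed in the steps above are actually on PATH. The following is a minimal sketch and not part of the repository: the helper name find_missing_tools and the tool list are assumptions derived from the setup steps, and the binary names on your system may differ.

```python
import shutil

# Tool names taken from the setup steps above (assumed binary names).
REQUIRED_TOOLS = ["uv", "rustup", "cargo", "gcc-9", "g++-9"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that the executables exist. It does not verify versions, the Rust nightly toolchain, or a CUDA installation.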
Running Experiments from Scratch
This section shows how to reproduce the experimental results from the paper using pre-trained models. The pre-trained checkpoints are available in the checkpoints/ directory.
Section 5.2: Testbed Integration
The testbed/ directory contains an integrated evaluation framework comparing three network simulation backends (m4, FlowSim, NS3) against real hardware measurements from a 12-node testbed running HERD, a key-value store application.
Build Backends
Build all three backends (requires GCC-9 and CUDA for M4):
cd testbed
# Build all backends
./build.sh all
# Or build individual backends
./build.sh m4 # M4 ML-based simulator (requires CUDA)
./build.sh flowsim # FlowSim flow-level simulator
./build.sh ns3 # NS3 packet-level simulator (UNISON)
Run Simulations
Run simulations using the pre-existing testbed ground truth data:
# Run all backends (recommended)
python run.py all
# Or run individual backends
python run.py m4 # M4 ML-based simulator
python run.py flowsim # FlowSim flow-level simulator
python run.py ns3 # NS3 packet-level simulator
# Use --process-only to skip simulation and only process existing results
python run.py all --process-only
Test Scenarios: 24 scenarios covering RDMA sizes (250KB-1000KB) × window sizes (1, 2, 4)
Results are saved in:
- eval_test/testbed/: Real hardware ground truth (24 scenarios)
- eval_test/m4/: M4 simulation outputs
- eval_test/flowsim/: FlowSim simulation outputs
- eval_test/ns3/: NS3 simulation outputs
Analyze Results
Generate evaluation plots and accuracy summaries:
python analyze.py
This produces:
- results/m4-testbed-perflow.png: Per-flow FCT error CDF
- results/m4-testbed-overall-window2.png: Application completion time comparison
- results/accuracy_summary.txt: Summary statistics
Evaluation Metrics:
- Per-flow FCT error: Absolute relative error for individual UD and RDMA flows
- Application completion time error: End-to-end execution time accuracy
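As an illustration of the per-flow metric, the sketch below computes the absolute relative error |simulated - measured| / measured for paired flow completion times. It is an assumption about how such a metric is typically computed, not a copy of analyze.py; the function name and the sample values are hypothetical.

```python
from statistics import median


def relative_fct_errors(simulated, measured):
    """Absolute relative error per flow: |sim - real| / real.

    Both inputs are sequences of flow completion times in the same unit,
    paired by position. Measured values must be positive.
    """
    if len(simulated) != len(measured):
        raise ValueError("simulated and measured must have the same length")
    if any(m <= 0 for m in measured):
        raise ValueError("measured FCTs must be positive")
    return [abs(s - m) / m for s, m in zip(simulated, measured)]


if __name__ == "__main__":
    # Hypothetical FCTs in microseconds, for illustration only.
    sim = [110.0, 90.0, 200.0]
    real = [100.0, 100.0, 200.0]
    errors = relative_fct_errors(sim, real)
    print("per-flow errors:", errors)
    print("median error:", median(errors))
```

The same formula applied to a single pair of end-to-end execution times gives an application completion time error.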
Section 5.3: SimAI Integration Experiments
The SimAI/ directory contains an integrated evaluation framework with three network simulation backends: UNISON (ns-3), flowSim, and m4.
Build Backends
Build all three backends (requires GCC-9):
cd SimAI
./scripts/build.sh -c ns3 # Build UNISON (ns-3) backend
./scripts/build.sh -c flowsim # Build flowSim backend
./scripts/build.sh -c m4 # Build m4 backend (requires CUDA)
Gray Failure Evaluation
We evaluate all three backends under gray failure conditions—scenarios where network components experience partial performance degradation rather than complete failures. This mimics real-world datacenter issues like cable aging, thermal throttling, or partial switch failures.
Gray Failure Topologies:
The repository includes 105 pre-generated topologies in example/gray_failures/ covering a comprehensive parameter sweep:
- N ∈ {2, 3, ..., 16}: Number of degraded GPUs (6%-50% of 32-GPU cluster)
- R ∈ {4, 5, ..., 10}: Bandwidth reduction factor (degraded links operate at 1/R capacity, i.e., 75%-90% bandwidth loss)
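The arithmetic behind the sweep can be checked with a short sketch. It enumerates the (N, R) grid, derives the bandwidth loss 1 - 1/R and the degraded fraction N/32, and builds names following the patterns shown in the repository structure (gray_topo_N{N}_R{R}.txt and n_{N}_r_{R}_{backend}). The exact file names on disk are an assumption based on those patterns, and the helper names are hypothetical.

```python
from itertools import product

N_VALUES = range(2, 17)  # number of degraded GPUs: 2..16
R_VALUES = range(4, 11)  # bandwidth reduction factor: 4..10
BACKENDS = ("ns3", "flowsim", "m4")
CLUSTER_SIZE = 32  # GPUs in the cluster


def topology_files():
    """Topology file names, one per (N, R) pair (assumed naming)."""
    return [f"gray_topo_N{n}_R{r}.txt" for n, r in product(N_VALUES, R_VALUES)]


def result_dirs():
    """Result directory names, one per (N, R, backend) triple (assumed naming)."""
    return [
        f"n_{n}_r_{r}_{backend}"
        for n, r in product(N_VALUES, R_VALUES)
        for backend in BACKENDS
    ]


def bandwidth_loss(r):
    """Fraction of bandwidth lost when a link runs at 1/r capacity."""
    return 1 - 1 / r


def degraded_fraction(n):
    """Fraction of the cluster's GPUs that are degraded."""
    return n / CLUSTER_SIZE


if __name__ == "__main__":
    print(len(topology_files()), "topologies")
    print(len(result_dirs()), "simulations")
    print(f"loss range: {bandwidth_loss(4):.0%} to {bandwidth_loss(10):.0%}")
    print(f"degraded range: {degraded_fraction(2):.0%} to {degraded_fraction(16):.0%}")
```

The grid has 15 values of N and 7 values of R, which accounts for the 105 topologies and, across three backends, 315 simulations.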
Run Gray Failure Sweep:
Note: Pre-computed results for all 315 simulations (3 backends × 105 scenarios) are available in results_gray_failures/. Running the sweep script will overwrite the pre-computed results.
# Run all scenarios for a specific backend
python gray_failure_run_sweep.py ns3 # UNISON (packet-level ground truth)
python gray_failure_run_sweep.py flowsim # flowSim (anal
