M4

[TBD] "m4: A Learned Flow-level Network Simulator" by Chenning Li, Anton A. Zabreyko, Om Chabra, Arash Nasr-Esfahany, Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas Anderson.


m4: A Learned Flow-level Network Simulator

This repository provides scripts and instructions to replicate the experiments from our paper, m4: A Learned Flow-level Network Simulator. It includes all necessary tools to reproduce the experimental results documented in Sections 5.2 to 5.6 of the paper.

Repository Structure
Quick Reproduction
Setup and Installation
Running Experiments from Scratch
Training Your Own Model
Citation
Acknowledgments
Contact

Repository Structure

├── checkpoints/                    # Pre-trained model checkpoints
├── config/                         # Configuration files for training and testing m4
├── figs/                          # Generated figures and plots from experiments
├── High-Precision-Congestion-Control/ # HPCC repository for data generation
├── inference/                     # C++ inference engine for m4
├── parsimon-eval/                 # Scripts to reproduce m4 experiments and comparisons
├── results/                       # Experimental results and outputs
├── results_train/                 # Training results and outputs
├── testbed/                       # Testbed integration with ns-3, FlowSim, and m4 backends
│   ├── backends/                  # Backend implementations
│   │   ├── m4/                    # M4 ML-based simulator
│   │   ├── flowsim/               # FlowSim flow-level simulator
│   │   └── UNISON/                # NS3 packet-level simulator (UNISON)
│   ├── eval_test/                 # Test scenarios and results
│   │   ├── testbed/               # Real hardware ground truth (24 scenarios)
│   │   ├── m4/                    # M4 simulation results
│   │   ├── flowsim/               # FlowSim simulation results
│   │   └── ns3/                   # NS3 simulation results
│   ├── eval_train/                # Training data generation
│   ├── results/                   # Generated plots and accuracy summaries
│   ├── results_train/             # Training results and outputs
│   ├── run.py                     # Main runner script for simulations
│   ├── analyze.py                 # Results analysis and visualization
│   └── build.sh                   # Build script for all backends
├── SimAI/                         # SimAI integration with UNISON, flowSim, and m4 backends
│   ├── astra-sim-alibabacloud/    # Core simulation framework
│   │   ├── astra-sim/             # AstraSim system layer
│   │   │   ├── network_frontend/  # Network backend implementations
│   │   │   │   ├── ns3/           # UNISON (ns-3) packet-level simulator
│   │   │   │   ├── flowsim/       # flowSim analytical simulator
│   │   │   │   └── m4/            # m4 ML-based simulator
│   │   │   └── system/            # System components (routing, collective ops)
│   │   ├── extern/                # ns-3 source code
│   │   └── build.sh               # Build script for all backends
│   ├── example/                   # Example workloads and topologies
│   │   ├── gray_failures/         # 105 pre-generated gray failure topology files
│   │   │   └── gray_topo_N{2-16}_R{4-10}.txt  # Topology files for N degraded GPUs, R reduction factor
│   │   ├── microAllReduce.txt     # AllReduce collective workload
│   │   └── SimAI.conf             # ns-3 configuration
│   ├── scripts/                   # Build and run scripts
│   ├── results_gray_failures/     # Pre-computed gray failure results (315 simulations)
│   │   └── n_{N}_r_{R}_{backend}/ # Individual scenario results (ns3/flowsim/m4)
│   ├── gray_failure_run_sweep.py  # Gray failure sweep runner
│   ├── gray_failure_plot_results.py # Generate evaluation plots (6 figures)
│   └── gray_failure_topo_viz.py   # Topology visualization tool
├── util/                          # Utility functions for m4, including data loaders and ML model implementations
├── main_train.py                  # Main script for training and testing m4
└── plot_results.ipynb            # Jupyter notebook for visualizing results

Quick Reproduction

To quickly reproduce the results in the paper, follow these steps:

1. Clone the repository and initialize submodules:

git clone https://github.com/netiken/m4.git
cd m4
git submodule update --init --recursive

2. Set up Python environment:

Install uv (a fast Python package manager) by following the installation guide at https://docs.astral.sh/uv/getting-started/installation/

Then create and activate the virtual environment:

uv sync
source .venv/bin/activate  # Activate the virtual environment!

3. Reproduce paper results:

Section 5.2 (Testbed Integration): Run cd testbed && python analyze.py to generate testbed comparison plots from pre-computed results
Section 5.3 (SimAI Integration): Check pre-computed results in SimAI/results_gray_failures/ and run python SimAI/gray_failure_plot_results.py to generate paper figures
Sections 5.4-5.6 (m4 Evaluation): Run the notebook plot_results.ipynb to generate paper figures

Setup and Installation

Sync the Python environment (if you have not already) and activate it before running any commands:

uv sync
source .venv/bin/activate  # Activate the virtual environment!

Install Rust and Cargo:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install nightly
rustup default nightly

Install gcc-9:
```
sudo apt-get install gcc-9 g++-9
```

For data generation, set up ns-3 (used to produce the training dataset with packet traces) and UNISON (used for fast simulation):

cd High-Precision-Congestion-Control/UNISON-for-ns-3
./configure.sh
./ns3 run 'scratch/third mix/config_test.txt'
cd ../ns-3.39
./configure.sh
./ns3 run 'scratch/third mix/config_test.txt'
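Before building, it can help to confirm that the tools installed in the steps above are actually on your PATH. The following is a minimal sketch, not part of the repository: the helper name `find_missing_tools` is hypothetical, and the binary names (for example `gcc-9` rather than `gcc`) are assumptions based on the install commands above.

```python
import shutil

# Tool names taken from the setup steps above. The exact binary names on
# your system are an assumption; adjust the list as needed.
REQUIRED_TOOLS = ["uv", "cargo", "rustup", "gcc-9", "g++-9"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that each executable can be located; it does not verify versions or that CUDA is available.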

Running Experiments from Scratch

This section shows how to reproduce the experimental results from the paper using pre-trained models. The pre-trained checkpoints are available in the checkpoints/ directory.

Section 5.2: Testbed Integration

The testbed/ directory contains an integrated evaluation framework comparing three network simulation backends (m4, FlowSim, NS3) against real hardware measurements from a 12-node testbed running HERD, a key-value store application.

Build Backends

Build all three backends (requires GCC-9; the M4 backend additionally requires CUDA):

cd testbed

# Build all backends
./build.sh all

# Or build individual backends
./build.sh m4       # M4 ML-based simulator (requires CUDA)
./build.sh flowsim  # FlowSim flow-level simulator
./build.sh ns3      # NS3 packet-level simulator (UNISON)

Run Simulations

Run simulations using the pre-existing testbed ground truth data:

# Run all backends (recommended)
python run.py all

# Or run individual backends
python run.py m4       # M4 ML-based simulator
python run.py flowsim  # FlowSim flow-level simulator
python run.py ns3      # NS3 packet-level simulator

# Use --process-only to skip simulation and only process existing results
python run.py all --process-only

Test Scenarios: 24 scenarios covering RDMA sizes (250KB-1000KB) × window sizes (1, 2, 4)

Results are saved in:

eval_test/testbed/ — Real hardware ground truth (24 scenarios)
eval_test/m4/ — M4 simulation outputs
eval_test/flowsim/ — FlowSim simulation outputs
eval_test/ns3/ — NS3 simulation outputs

Analyze Results

Generate evaluation plots and accuracy summaries:

python analyze.py

This produces:

results/m4-testbed-perflow.png — Per-flow FCT error CDF
results/m4-testbed-overall-window2.png — Application completion time comparison
results/accuracy_summary.txt — Summary statistics

Evaluation Metrics:

Per-flow FCT error: Absolute relative error for individual UD and RDMA flows
Application completion time error: End-to-end execution time accuracy
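To make the two metrics concrete, here is a minimal sketch of an absolute relative error computation. It assumes the usual definition |simulated - measured| / measured; the function names and the sample values are hypothetical, and `analyze.py` may compute or aggregate the errors differently.

```python
import math


def relative_errors(simulated, measured):
    """Absolute relative error per item: |simulated - measured| / measured."""
    if len(simulated) != len(measured):
        raise ValueError("simulated and measured must have the same length")
    return [abs(s - m) / m for s, m in zip(simulated, measured)]


def percentile(values, q):
    """Nearest-rank percentile of `values` for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical flow completion times in microseconds (not real data).
measured_fct = [100.0, 200.0, 400.0]
simulated_fct = [110.0, 180.0, 400.0]

per_flow = relative_errors(simulated_fct, measured_fct)
print("per-flow errors:", per_flow)
print("median error:", percentile(per_flow, 50))

# The application completion time error uses the same formula on one value.
app_error = relative_errors([950.0], [1000.0])[0]
print("application completion time error:", app_error)
```

Sorting the per-flow errors and plotting them against their rank fraction gives an error CDF of the kind listed above.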

Section 5.3: SimAI Integration Experiments

The SimAI/ directory contains an integrated evaluation framework with three network simulation backends: UNISON (ns-3), flowSim, and m4.

Build Backends

Build all three backends (requires GCC-9):

cd SimAI
./scripts/build.sh -c ns3      # Build UNISON (ns-3) backend
./scripts/build.sh -c flowsim  # Build flowSim backend
./scripts/build.sh -c m4       # Build m4 backend (requires CUDA)

Gray Failure Evaluation

We evaluate all three backends under gray failure conditions—scenarios where network components experience partial performance degradation rather than complete failures. This mimics real-world datacenter issues like cable aging, thermal throttling, or partial switch failures.

Gray Failure Topologies:

The repository includes 105 pre-generated topologies in example/gray_failures/ covering a comprehensive parameter sweep:

N ∈ {2, 3, ..., 16}: Number of degraded GPUs (6%-50% of 32-GPU cluster)
R ∈ {4, 5, ..., 10}: Bandwidth reduction factor (degraded links operate at 1/R capacity, i.e., 75%-90% bandwidth loss)
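The arithmetic behind the sweep can be checked with a short sketch. It enumerates the (N, R) grid described above and derives the quoted percentages from the 32-GPU cluster size and the 1/R capacity rule. The function and constant names are hypothetical and are not taken from the repository.

```python
from itertools import product

CLUSTER_GPUS = 32                  # cluster size stated above
DEGRADED_GPU_COUNTS = range(2, 17)  # N = 2, 3, ..., 16
REDUCTION_FACTORS = range(4, 11)    # R = 4, 5, ..., 10


def gray_failure_grid():
    """All (N, R) combinations in the parameter sweep."""
    return list(product(DEGRADED_GPU_COUNTS, REDUCTION_FACTORS))


def degraded_fraction(n):
    """Fraction of the cluster's GPUs that are degraded."""
    return n / CLUSTER_GPUS


def bandwidth_loss(r):
    """Fraction of bandwidth lost when a link runs at 1/R capacity."""
    return 1 - 1 / r


grid = gray_failure_grid()
print(len(grid))  # 15 values of N x 7 values of R = 105
print(f"degraded GPUs: {degraded_fraction(2):.2%} to {degraded_fraction(16):.2%}")
print(f"bandwidth loss: {bandwidth_loss(4):.0%} to {bandwidth_loss(10):.0%}")
```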

Run Gray Failure Sweep:

Note: Pre-computed results for all 315 simulations (3 backends × 105 scenarios) are available in results_gray_failures/. Running the sweep script will overwrite the pre-computed results.
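If you want to confirm that a results directory is complete before plotting or re-running the sweep, a sketch along these lines may help. It assumes the directory naming `n_{N}_r_{R}_{backend}` shown in the repository structure expands literally to names such as `n_2_r_4_ns3`; that expansion and the helper names are assumptions, not confirmed behavior of the sweep script.

```python
from pathlib import Path

BACKENDS = ("ns3", "flowsim", "m4")


def expected_result_dirs():
    """Directory names for every (N, R, backend) combination in the sweep."""
    return [
        f"n_{n}_r_{r}_{backend}"
        for n in range(2, 17)   # N = 2..16
        for r in range(4, 11)   # R = 4..10
        for backend in BACKENDS
    ]


def missing_result_dirs(results_root):
    """Expected scenario directories that do not exist under results_root."""
    root = Path(results_root)
    return [name for name in expected_result_dirs() if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_result_dirs("results_gray_failures")
    print(f"{len(expected_result_dirs()) - len(missing)} of "
          f"{len(expected_result_dirs())} scenario directories present")
```

Run it from the SimAI/ directory so that the relative path resolves.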

# Run all scenarios for a specific backend
python gray_failure_run_sweep.py ns3      # UNISON (packet-level ground truth)
python gray_failure_run_sweep.py flowsim  # flowSim (anal
