[TBD] "m4: A Learned Flow-level Network Simulator" by Chenning Li, Anton A. Zabreyko, Om Chabra, Arash Nasr-Esfahany, Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas Anderson.

m4: A Learned Flow-level Network Simulator

This repository provides scripts and instructions to replicate the experiments from our paper, m4: A Learned Flow-level Network Simulator. It includes all necessary tools to reproduce the experimental results documented in Sections 5.2 to 5.6 of the paper.

Repository Structure

├── checkpoints/                    # Pre-trained model checkpoints
├── config/                         # Configuration files for training and testing m4
├── figs/                          # Generated figures and plots from experiments
├── High-Precision-Congestion-Control/ # HPCC repository for data generation
├── inference/                     # C++ inference engine for m4
├── parsimon-eval/                 # Scripts to reproduce m4 experiments and comparisons
├── results/                       # Experimental results and outputs
├── results_train/                 # Training results and outputs
├── testbed/                       # Testbed integration with ns-3, FlowSim, and m4 backends
│   ├── backends/                  # Backend implementations
│   │   ├── m4/                    # M4 ML-based simulator
│   │   ├── flowsim/               # FlowSim flow-level simulator
│   │   └── UNISON/                # NS3 packet-level simulator (UNISON)
│   ├── eval_test/                 # Test scenarios and results
│   │   ├── testbed/               # Real hardware ground truth (24 scenarios)
│   │   ├── m4/                    # M4 simulation results
│   │   ├── flowsim/               # FlowSim simulation results
│   │   └── ns3/                   # NS3 simulation results
│   ├── eval_train/                # Training data generation
│   ├── results/                   # Generated plots and accuracy summaries
│   ├── results_train/             # Training results and outputs
│   ├── run.py                     # Main runner script for simulations
│   ├── analyze.py                 # Results analysis and visualization
│   └── build.sh                   # Build script for all backends
├── SimAI/                         # SimAI integration with UNISON, flowSim, and m4 backends
│   ├── astra-sim-alibabacloud/    # Core simulation framework
│   │   ├── astra-sim/             # AstraSim system layer
│   │   │   ├── network_frontend/  # Network backend implementations
│   │   │   │   ├── ns3/           # UNISON (ns-3) packet-level simulator
│   │   │   │   ├── flowsim/       # flowSim analytical simulator
│   │   │   │   └── m4/            # m4 ML-based simulator
│   │   │   └── system/            # System components (routing, collective ops)
│   │   ├── extern/                # ns-3 source code
│   │   └── build.sh               # Build script for all backends
│   ├── example/                   # Example workloads and topologies
│   │   ├── gray_failures/         # 105 pre-generated gray failure topology files
│   │   │   └── gray_topo_N{2-16}_R{4-10}.txt  # Topology files for N degraded GPUs, R reduction factor
│   │   ├── microAllReduce.txt     # AllReduce collective workload
│   │   └── SimAI.conf             # ns-3 configuration
│   ├── scripts/                   # Build and run scripts
│   ├── results_gray_failures/     # Pre-computed gray failure results (315 simulations)
│   │   └── n_{N}_r_{R}_{backend}/ # Individual scenario results (ns3/flowsim/m4)
│   ├── gray_failure_run_sweep.py  # Gray failure sweep runner
│   ├── gray_failure_plot_results.py # Generate evaluation plots (6 figures)
│   └── gray_failure_topo_viz.py   # Topology visualization tool
├── util/                          # Utility functions for m4, including data loaders and ML model implementations
├── main_train.py                  # Main script for training and testing m4
└── plot_results.ipynb            # Jupyter notebook for visualizing results

Quick Reproduction

To quickly reproduce the results in the paper, follow these steps:

1. Clone the repository and initialize submodules:

git clone https://github.com/netiken/m4.git
cd m4
git submodule update --init --recursive

2. Set up the Python environment (see step 1 of Setup and Installation).

3. Reproduce paper results:

  • Section 5.2 (Testbed Integration): Run cd testbed && python analyze.py to generate testbed comparison plots from pre-computed results
  • Section 5.3 (SimAI Integration): Check pre-computed results in SimAI/results_gray_failures/ and run python SimAI/gray_failure_plot_results.py to generate paper figures
  • Sections 5.4-5.6 (m4 Evaluation): Run the notebook plot_results.ipynb to generate paper figures

Setup and Installation

  1. Create the Python environment with uv, and always activate it before running any other commands:

    uv sync
    source .venv/bin/activate  # Activate the virtual environment!
    
  2. Install Rust and Cargo:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    rustup install nightly
    rustup default nightly
    
  3. Install gcc-9:

    sudo apt-get install gcc-9 g++-9
    
  4. Set up the two simulators used for data generation: UNISON (for fast simulation) and ns-3 (for the training dataset with packet traces):

    cd High-Precision-Congestion-Control/UNISON-for-ns-3
    ./configure.sh
    ./ns3 run 'scratch/third mix/config_test.txt'
    cd ../ns-3.39
    ./configure.sh
    ./ns3 run 'scratch/third mix/config_test.txt'
    

Running Experiments from Scratch

This section shows how to reproduce the experimental results from the paper using pre-trained models. The pre-trained checkpoints are available in the checkpoints/ directory.

Section 5.2: Testbed Integration

The testbed/ directory contains an integrated evaluation framework comparing three network simulation backends (m4, FlowSim, NS3) against real hardware measurements from a 12-node testbed running HERD, a key-value store application.

Build Backends

Build all three backends (requires GCC-9 and CUDA for M4):

cd testbed

# Build all backends
./build.sh all

# Or build individual backends
./build.sh m4       # M4 ML-based simulator (requires CUDA)
./build.sh flowsim  # FlowSim flow-level simulator
./build.sh ns3      # NS3 packet-level simulator (UNISON)

Run Simulations

Run simulations using the pre-existing testbed ground truth data:

# Run all backends (recommended)
python run.py all

# Or run individual backends
python run.py m4       # M4 ML-based simulator
python run.py flowsim  # FlowSim flow-level simulator
python run.py ns3      # NS3 packet-level simulator

# Use --process-only to skip simulation and only process existing results
python run.py all --process-only

Test Scenarios: 24 scenarios covering RDMA sizes (250KB-1000KB) × window sizes (1, 2, 4)

Results are saved in:

  • eval_test/testbed/ — Real hardware ground truth (24 scenarios)
  • eval_test/m4/ — M4 simulation outputs
  • eval_test/flowsim/ — FlowSim simulation outputs
  • eval_test/ns3/ — NS3 simulation outputs

Analyze Results

Generate evaluation plots and accuracy summaries:

python analyze.py

This produces:

  • results/m4-testbed-perflow.png — Per-flow FCT error CDF
  • results/m4-testbed-overall-window2.png — Application completion time comparison
  • results/accuracy_summary.txt — Summary statistics

Evaluation Metrics:

  • Per-flow FCT error: Absolute relative error for individual UD and RDMA flows
  • Application completion time error: End-to-end execution time accuracy
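
As a rough illustration of the per-flow metric, the sketch below computes the absolute relative error |simulated - measured| / measured for each flow and summarizes it with the median. The function name and the sample FCT values are hypothetical and are not taken from analyze.py; the script in this repository may compute and aggregate the errors differently.

```python
from statistics import median


def relative_errors(simulated, measured):
    """Absolute relative error |sim - real| / real for each flow.

    Both inputs are flow completion times in the same unit and the same
    flow order. This helper is illustrative, not part of analyze.py.
    """
    if len(simulated) != len(measured):
        raise ValueError("flow lists must have the same length")
    return [abs(s - m) / m for s, m in zip(simulated, measured)]


# Hypothetical FCTs (microseconds) for four flows.
measured_fct = [100.0, 200.0, 400.0, 800.0]
simulated_fct = [110.0, 180.0, 400.0, 1000.0]

errors = relative_errors(simulated_fct, measured_fct)
print(errors)          # per-flow errors, e.g. input to an error CDF
print(median(errors))  # one possible summary statistic
```

The same formula applied to a single pair of end-to-end execution times gives the application completion time error.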

Section 5.3: SimAI Integration Experiments

The SimAI/ directory contains an integrated evaluation framework with three network simulation backends: UNISON (ns-3), flowSim, and m4.

Build Backends

Build all three backends (requires GCC-9):

cd SimAI
./scripts/build.sh -c ns3      # Build UNISON (ns-3) backend
./scripts/build.sh -c flowsim  # Build flowSim backend
./scripts/build.sh -c m4       # Build m4 backend (requires CUDA)

Gray Failure Evaluation

We evaluate all three backends under gray failure conditions—scenarios where network components experience partial performance degradation rather than complete failures. This mimics real-world datacenter issues like cable aging, thermal throttling, or partial switch failures.

Gray Failure Topologies:

The repository includes 105 pre-generated topologies in example/gray_failures/ covering a comprehensive parameter sweep:

  • N ∈ {2, 3, ..., 16}: Number of degraded GPUs (6%-50% of 32-GPU cluster)
  • R ∈ {4, 5, ..., 10}: Bandwidth reduction factor (degraded links operate at 1/R capacity, i.e., 75%-90% bandwidth loss)
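
The arithmetic behind the sweep can be checked with a short sketch: 15 values of N times 7 values of R gives 105 scenarios, N/32 gives the degraded fraction, and 1 - 1/R gives the bandwidth loss. The concrete file names (for example gray_topo_N2_R4.txt) are an assumption inferred from the gray_topo_N{2-16}_R{4-10}.txt pattern in the repository tree, and the helper itself is not part of the repository.

```python
from itertools import product

NUM_GPUS = 32                     # cluster size stated in the README
DEGRADED_GPUS = range(2, 17)      # N = 2, 3, ..., 16
REDUCTION_FACTORS = range(4, 11)  # R = 4, 5, ..., 10


def sweep():
    """Yield one record per (N, R) pair in the gray failure sweep."""
    for n, r in product(DEGRADED_GPUS, REDUCTION_FACTORS):
        yield {
            # Assumed naming, inferred from the documented pattern.
            "topology": f"gray_topo_N{n}_R{r}.txt",
            "degraded_fraction": n / NUM_GPUS,
            # A degraded link runs at 1/R capacity, so it loses 1 - 1/R.
            "bandwidth_loss": 1 - 1 / r,
        }


scenarios = list(sweep())
print(len(scenarios))  # 15 * 7 = 105
```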

Run Gray Failure Sweep:

Note: Pre-computed results for all 315 simulations (3 backends × 105 scenarios) are available in results_gray_failures/. Running the sweep script will overwrite the pre-computed results.
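
Because the sweep overwrites the pre-computed results, one option is to copy results_gray_failures/ aside before re-running. The sketch below is a hypothetical helper, not part of the repository; the backup naming scheme is an assumption, and it should be run from the SimAI/ directory.

```python
import shutil
import time
from pathlib import Path


def backup_results(results_dir="results_gray_failures", stamp=None):
    """Copy results_dir to a timestamped sibling directory.

    Returns the backup path, or None if results_dir does not exist.
    Illustrative helper; the ".backup-<stamp>" suffix is an assumption.
    """
    src = Path(results_dir)
    if not src.is_dir():
        return None
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.backup-{stamp}")
    shutil.copytree(src, dst)  # raises FileExistsError if dst already exists
    return dst
```

Calling backup_results() once before gray_failure_run_sweep.py keeps the shipped results available for comparison with a fresh run.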

# Run all scenarios for a specific backend
python gray_failure_run_sweep.py ns3      # UNISON (packet-level ground truth)
python gray_failure_run_sweep.py flowsim  # flowSim (anal
