NeuSim: An Open-source Simulator Framework for NPUs

NeuSim is a simulator framework for modeling the performance and power behaviors of neural processing units (NPUs) when running machine learning workloads.

📌 Neural Processing Unit 101

NPU Architecture

As shown in the figure above, an NPU chip consists of systolic arrays (SAs) for matrix multiplications and SIMD vector units (VUs) for generic vector operations. Each chip has an off-chip high-bandwidth memory (HBM) that stores the ML model weights and input/output data, and an on-chip SRAM that exploits data locality and hides HBM access latency. A direct memory access (DMA) engine performs asynchronous memory copies between the HBM and the SRAM. Multiple NPU chips can be connected via high-speed inter-chip interconnect (ICI) links to form an NPU pod. A pod is typically arranged as a 2D/3D torus, a topology optimized for AllReduce bandwidth. The DMA engine also performs remote DMA (RDMA) operations to access another chip's HBM or SRAM.
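To make the torus arrangement concrete, the following minimal sketch computes which chips a given chip is directly linked to in a 2D torus. The function name and the (x, y) coordinate scheme are illustrative assumptions and are not taken from the NeuSim codebase.

```python
def torus_neighbors(x: int, y: int, width: int, height: int) -> list[tuple[int, int]]:
    """Return the four directly linked neighbors of chip (x, y) in a width x height 2D torus.

    Hypothetical helper for illustration; coordinates wrap around at the edges,
    which is what distinguishes a torus from a plain mesh.
    """
    return [
        ((x - 1) % width, y),   # left neighbor (wraps to the last column)
        ((x + 1) % width, y),   # right neighbor (wraps to the first column)
        (x, (y - 1) % height),  # neighbor above (wraps to the last row)
        (x, (y + 1) % height),  # neighbor below (wraps to the first row)
    ]


# A corner chip in a 4x4 pod still has four neighbors because of the wraparound links.
print(torus_neighbors(0, 0, 4, 4))
```

Because every chip has the same number of links regardless of its position, ring-style collectives such as AllReduce can run along each dimension without edge chips becoming a special case.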

🚀 Key Features of NeuSim

NeuSim Design

NeuSim features:

- Detailed performance modeling: NeuSim models each component (e.g., systolic array, vector unit, on-chip SRAM, HBM, ICI) on an NPU chip and reports rich statistics for each tensor operator (e.g., execution time, FLOPS, memory traffic). It helps chip architects and system designers identify microarchitectural bottlenecks (e.g., SA-bound, VU-bound, HBM-bound).
- Power, energy, and carbon modeling: NeuSim models the static and dynamic power and the energy consumption of each component on an NPU chip. It also models the embodied and operational carbon emissions.
- Flexibility: NeuSim can be invoked at different levels of granularity, including single-operator simulation, end-to-end DNN model simulation, and batch simulation for design space exploration.
- Support for popular DNN models: NeuSim takes the model graph definition as an input. It supports various popular DNN architectures, including LLMs (e.g., Llama, DeepSeek), recommendation models (e.g., DLRM), and diffusion models (e.g., DiT-XL, GLIGEN).
- Multi-chip simulation: NeuSim supports simulating multi-chip systems with different parallelism strategies (e.g., tensor parallelism, pipeline parallelism, data parallelism, expert parallelism).
- Scalability: A typical use case of NeuSim is design space exploration: sweeping over millions of NPU hardware configurations (e.g., number of chips) and software parameters (e.g., batch size, parallelism config) to find the "optimal" setting. NeuSim automatically parallelizes simulation jobs across multiple machines using Ray to speed up large-scale design space explorations.
- Advanced features: NeuSim models advanced architectural features such as power gating and dynamic voltage and frequency scaling (DVFS) to help chip architects explore the trade-offs between performance, power, and energy efficiency.
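As a rough illustration of what labels such as SA-bound, VU-bound, and HBM-bound mean, the sketch below classifies an operator by whichever resource would take the longest if all three ran fully overlapped. This is a simplified roofline-style estimate under stated assumptions, not NeuSim's actual performance model; the class names, field names, and the numbers in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OpCost:
    """Work done by one tensor operator (hypothetical fields)."""
    sa_flops: float   # matrix-multiply FLOPs executed on the systolic arrays
    vu_flops: float   # vector FLOPs executed on the vector units
    hbm_bytes: float  # bytes moved between HBM and on-chip SRAM


@dataclass
class ChipPeaks:
    """Peak throughput of one chip (hypothetical fields)."""
    sa_flops_per_s: float
    vu_flops_per_s: float
    hbm_bytes_per_s: float


def classify(op: OpCost, chip: ChipPeaks) -> tuple[str, float]:
    """Return (bottleneck label, estimated time in seconds) for one operator.

    Assumes SA compute, VU compute, and HBM traffic overlap perfectly, so the
    slowest of the three determines the operator's execution time.
    """
    times = {
        "SA-bound": op.sa_flops / chip.sa_flops_per_s,
        "VU-bound": op.vu_flops / chip.vu_flops_per_s,
        "HBM-bound": op.hbm_bytes / chip.hbm_bytes_per_s,
    }
    label = max(times, key=times.get)
    return label, times[label]


# Made-up numbers: an operator that moves a lot of data relative to its compute.
chip = ChipPeaks(sa_flops_per_s=100e12, vu_flops_per_s=2e12, hbm_bytes_per_s=1e12)
op = OpCost(sa_flops=2e9, vu_flops=1e6, hbm_bytes=1e9)
print(classify(op, chip))
```

A detailed simulator additionally has to account for effects this estimate ignores, such as imperfect overlap, SRAM capacity, and inter-chip traffic.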

👉 Installation

Install Miniconda (skip this if you already have conda installed).
NeuSim is installed as a Python package. From the root of the NeuSim repository, create a conda environment and install NeuSim with pip:
```
conda create --name neusim python=3.12.2
conda activate neusim
pip install -e .
```
If you want to run unit tests or contribute to the codebase, you may also install the optional development dependencies:
```
pip install -e ".[dev]"
```

🏄 Running NeuSim

NeuSim can be launched in different ways depending on the use case: single-operator simulations, single-model simulations, and batch simulations for design space exploration. The neusim/run_scripts/ directory contains several example simulation scripts.

Quick Start

To get started immediately, we provide an automated example script (neusim/run_scripts/example_npusim.sh) that demonstrates the full NeuSim pipeline. It sweeps through various hardware and model configurations to determine the most cost-efficient NPU design that meets specific performance targets.

Start the Ray server:
```
ray start --head --port=6379
```
Run the example script:
```
cd neusim/run_scripts
./example_npusim.sh
```
You can view the progress of the runs in the Ray dashboard (at http://127.0.0.1:8265/ by default). If you are connected to a remote machine over SSH, you may need to set up port forwarding to reach it.

After the script finishes with no errors, all jobs under the "Jobs" tab in the Ray dashboard should have the "Status" column set to "SUCCEEDED". An output directory named results should have been created, containing the following folders:
- raw/: contains the performance simulation results. This is the output of the script run_sim.py.
- raw_None/: contains the power simulation results. This is the output of the script energy_operator_analysis_main.py.
- carbon_NoPG/dvfs_None/CI0.0624/UTIL0.6/: contains the results of the carbon emission analysis without power gating and DVFS, with carbon intensity 0.0624 kgCO2e/kWh and NPU chip duty cycle 60%. This is the output of the script carbon_analysis_main.py.
- slo/: contains the SLO analysis results. This is the output of the script slo_analysis_main.py.
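A quick way to confirm that a run produced everything listed above is to check for the expected folders. The sketch below is a hypothetical helper, not part of NeuSim; the folder names are copied from the list above, and the location of the results directory is an assumption you may need to adjust.

```python
from pathlib import Path

# Folder names taken from the list above, relative to the output directory.
EXPECTED_SUBDIRS = [
    "raw",
    "raw_None",
    "carbon_NoPG/dvfs_None/CI0.0624/UTIL0.6",
    "slo",
]


def missing_outputs(results_dir: str) -> list[str]:
    """Return the expected subdirectories that do not exist under results_dir."""
    root = Path(results_dir)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


# Assumes the output directory is ./results relative to the current directory.
missing = missing_outputs("results")
print("all outputs present" if not missing else f"missing: {missing}")
```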

The example_npusim.sh script invokes the core components of NeuSim to simulate different DNN models running on various NPU hardware configurations, and analyze the output statistics to find the most cost-efficient NPU configuration that meets the target performance SLOs:

- First, it invokes run_sim.py for performance simulations. This script is the main entry point for running a batch of performance simulations. It sweeps over all possible numbers of chips, batch sizes, NPU versions, and parallelism configurations for the given DNN models, and launches multiple Ray tasks to parallelize the simulation jobs. It writes the per-operator performance statistics for each configuration to CSV files, and dumps the end-to-end statistics and the simulation configuration to a JSON file. The Operator class contains the descriptions of all the statistics in the CSV files.
- Next, it invokes energy_operator_analysis_main.py to run power simulations. This script reads the performance statistics generated by run_sim.py and computes the power and energy consumption of each operator based on the NPU hardware configuration, power gating, and DVFS settings. (Note: the power simulation could be integrated into run_sim.py, but the two are kept separate for modularity and flexibility.)
- After that, it invokes carbon_analysis_main.py to run the carbon footprint analysis and further aggregate the simulation statistics. This script reads the power and energy statistics generated by energy_operator_analysis_main.py and computes the carbon emissions based on the datacenter carbon intensity and the NPU chip duty cycle.
- Finally, it invokes slo_analysis_main.py. This script analyzes the output of the previous steps to find the optimal NPU configurations that meet the target SLOs (e.g., request latency for inference workloads).
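The sweep in the first step can be pictured as enumerating a Cartesian product of parameters. The sketch below is an illustrative simplification and does not reproduce the logic of run_sim.py: it only considers tensor and data parallelism, it assumes the parallelism degrees must multiply to the number of chips, and the function names, dictionary keys, and example values are hypothetical.

```python
from itertools import product


def parallelism_configs(num_chips: int) -> list[tuple[int, int]]:
    """All (tensor_parallel, data_parallel) degree pairs whose product equals num_chips."""
    return [(tp, num_chips // tp) for tp in range(1, num_chips + 1) if num_chips % tp == 0]


def sweep(models, versions, chip_counts, batch_sizes):
    """Yield one dict per simulation configuration in the sweep."""
    for model, version, chips, batch in product(models, versions, chip_counts, batch_sizes):
        for tp, dp in parallelism_configs(chips):
            yield {
                "model": model,
                "version": version,
                "num_chips": chips,
                "batch_size": batch,
                "tensor_parallel": tp,
                "data_parallel": dp,
            }


# Hypothetical sweep values; each resulting dict would correspond to one simulation job.
configs = list(sweep(["llama"], ["5"], [1, 2, 4], [1, 8]))
print(len(configs), configs[0])
```

Since the jobs are independent of one another, a grid like this is straightforward to distribute across workers, which is the role Ray plays in the real pipeline.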

A more comprehensive experiment script, run_power_gating.sh, demonstrates how to run simulations with different power gating strategies. It has the same structure as example_npusim.sh, but includes more models, NPU versions, and various power gating configurations.

Customizing Simulation Parameters

Output Directory

Most scripts under neusim/run_scripts/ accept an --output_dir argument that sets the directory where results are written.

Performance Simulation Parameters

The user can specify the NPU hardware configuration and the model architecture of the simulation by creating new configuration files under configs/. We provide a set of pre-defined configurations in the configs directory:

- configs/chips/: contains the NPU chip parameters, such as the number of SAs and VUs, core frequency, HBM bandwidth, and on-chip SRAM size.
- configs/models/: contains the model architecture parameters as well as the parallelism configurations. We currently support LLMs (Llama and DeepSeek), DLRM, DiT-XL, and GLIGEN. See Defining New DNN Model Architectures for more details on how to add support for new models.
- configs/systems/: contains the system-level parameters, including the datacenter power usage effectiveness (PUE) and carbon intensity used for carbon emission analysis.
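To show how system-level parameters such as PUE and carbon intensity typically enter a carbon estimate, here is a minimal sketch of the standard operational-emissions calculation: IT energy scaled by PUE, multiplied by the grid's carbon intensity. It is not taken from carbon_analysis_main.py, it ignores embodied carbon and duty cycle, and the function name and the PUE value in the example are assumptions.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def operational_carbon_kg(energy_joules: float, pue: float, carbon_intensity: float) -> float:
    """Operational emissions in kgCO2e.

    energy_joules:    energy consumed by the NPU chips themselves
    pue:              datacenter power usage effectiveness (total facility energy / IT energy)
    carbon_intensity: grid carbon intensity in kgCO2e per kWh
    """
    energy_kwh = energy_joules / JOULES_PER_KWH
    return energy_kwh * pue * carbon_intensity


# 1 kWh of chip energy at a hypothetical PUE of 1.1 and a carbon intensity of 0.0624 kgCO2e/kWh.
print(operational_carbon_kg(3.6e6, pue=1.1, carbon_intensity=0.0624))
```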

The script neusim/run_scripts/run_sim.py automatically supports new configuration files added to these directories, as long as the file names follow the existing naming conventions:

- --models: specifies the model names. For example, if the user adds a new model configuration file configs/models/llama4-17b.json, the user can specify --models="llama4-17b" to run simulations for this model.
- --versions: specifies the NPU chip versions. For example, if the user adds a new chip configuration file configs/chips/tpuv7.json, the user can specify --versions="7" to run simulations for this NPU version.
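The two examples above suggest a naming convention in which a model name is the file stem and a chip version is the stem with a tpuv prefix removed. The sketch below lists the values you could pass to --models and --versions under that inferred convention. It is an assumption based only on these two examples, and the helper names are hypothetical; how run_sim.py actually resolves file names may differ.

```python
from pathlib import Path


def model_names(models_dir: str) -> list[str]:
    """Assumed convention: llama4-17b.json -> "llama4-17b"."""
    return sorted(p.stem for p in Path(models_dir).glob("*.json"))


def chip_versions(chips_dir: str, prefix: str = "tpuv") -> list[str]:
    """Assumed convention: tpuv7.json -> "7" (the stem with the prefix stripped)."""
    return sorted(p.stem[len(prefix):] for p in Path(chips_dir).glob(f"{prefix}*.json"))
```

For example, calling model_names("configs/models") and chip_versions("configs/chips") from the directory that contains configs/ would list the candidate values for the two arguments.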

Power Simulation Parameters

The power gating parameters are defined in neusim/configs/power_gating/PowerGatingConfig.py. The user can modify the get_power_gating_config() function to add new power gating configurations, including power gating wake-up cycles and power gating policies for each component.

The scripts neusim/run_scripts/energy_operator_analysis_main.py and neusim/run_scripts/carbon_analysis_main.py can
