# YiRage - Yield Revolutionary AGile Engine

<div align="center">

**Multi-Backend LLM Inference Optimization**

Based on Mirage by CMU

</div>

## 🎯 About YiRage
YiRage (Yield Revolutionary AGile Engine) extends Mirage with comprehensive multi-backend support, enabling LLM inference optimization across diverse hardware platforms.
YiRage = Mirage + Multi-Backend Architecture
- Original Mirage (CMU): Superoptimizer framework for tensor programs
- YiRage Extensions (Chen Xingqiang, 2025): Multi-backend support with hardware-aware optimizations
## 🏗️ Architecture

### Five-Layer Complete Backend Architecture
```
┌────────────────────────────────────────────────────────────────────────────────────┐
│                        YiRage Complete Backend Architecture                        │
├────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                             Layer 1: Python API                              │  │
│  │  yirage.new_kernel_graph() → UnifiedCompiler → CoreBridge → superoptimize()  │  │
│  │    HardwareRegistry.instance() → ChipArchitecture → detect_current_chip()    │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                        Layer 2: Backend Manager (C++)                        │  │
│  │       BackendRegistry (thread-safe) ← BackendFactory ← StrategyFactory       │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                          Layer 3: Search & Strategy                          │  │
│  │   Hardware-aware Search │ Fingerprint Verification │ Performance Profiling   │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                       Layer 4: Threadblock Operations                        │  │
│  │    MatMul │ Attention │ RMSNorm │ SwiGLU │ Softmax │ Reduce │ Elementwise    │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                      Layer 5: Persistent Kernel Runtime                      │  │
│  │    Memory Management │ Kernel Launch │ Synchronization │ JIT Compilation     │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                                Hardware Layer                                │  │
│  │ ┌──────┐ ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌──────┐ ┌───────┐│  │
│  │ │ CUDA │ │ROCm │ │ MPS │ │Ascend│ │MACA │ │ TPU  │ │ XPU │ │ FPGA │ │  CPU  ││  │
│  │ │NVIDIA│ │ AMD │ │Apple│ │Huawei│ │MetaX│ │Google│ │Intel│ │Xilinx│ │x86/ARM││  │
│  │ └──────┘ └─────┘ └─────┘ └──────┘ └─────┘ └──────┘ └─────┘ └──────┘ └───────┘│  │
│  │ ┌───────┐ ┌─────┐ ┌──────┐                                                   │  │
│  │ │Triton │ │ NKI │ │ MLIR │  ← Compiler Backends                              │  │
│  │ │OpenAI │ │ AWS │ │ LLVM │                                                   │  │
│  │ └───────┘ └─────┘ └──────┘                                                   │  │
│  └──────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                    │
└────────────────────────────────────────────────────────────────────────────────────┘
```
### Complete Backend Support Matrix (12 Backends × 5 Layers)

| Backend | Hardware      | Backend API | Strategy | Kernel | Threadblock | PK Runtime |
|---------|---------------|-------------|----------|--------|-------------|------------|
| CUDA    | NVIDIA GPU    | ✅ | ✅ | ✅ | ✅ | ✅ |
| ROCm    | AMD GPU       | ✅ | ✅ | ✅ | ✅ | ✅ |
| CPU     | x86/ARM       | ✅ | ✅ | ✅ | ✅ | ✅ |
| MPS     | Apple Silicon | ✅ | ✅ | ✅ | ✅ | ✅ |
| Ascend  | Huawei NPU    | ✅ | ✅ | ✅ | ✅ | ✅ |
| MACA    | MetaX GPU     | ✅ | ✅ | ✅ | ✅ | ✅ |
| TPU     | Google Cloud  | ✅ | ✅ | ✅ | ✅ | ✅ |
| XPU     | Intel GPU     | ✅ | ✅ | ✅ | ✅ | ✅ |
| FPGA    | Intel/Xilinx  | ✅ | ✅ | ✅ | ✅ | ✅ |
| Triton  | Compiler      | ✅ | ✅ | ✅ | ✅ | ✅ |
| NKI     | AWS Neuron    | ✅ | ✅ | ✅ | ✅ | ✅ |
| MLIR    | Multi-target  | ✅ | ✅ | ✅ | ✅ | ✅ |
### Five-Layer Design

#### Layer 1: Python API

- Backend query and selection (`get_available_backends()`)
- Hardware Device Registry (`HardwareRegistry` — register/query chip architectures at runtime)
- Unified compiler interface (`UnifiedCompiler`)
- Core bridge to C++ (`CoreBridge`)
- Hardware-specific optimizers
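The runtime register/query pattern behind `HardwareRegistry` can be sketched in plain Python. This is an illustrative miniature, not the actual YiRage API: the field names on `ChipArchitecture` and the `register`/`lookup` methods are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChipArchitecture:
    """Illustrative chip descriptor; the real ChipArchitecture carries more fields."""
    name: str
    vendor: str
    backend: str     # e.g. "cuda", "mps", "cpu"
    warp_size: int   # threads per warp / wavefront / SIMD group

class HardwareRegistry:
    """Minimal singleton registry: register chips once, query them anywhere at runtime."""
    _instance = None

    def __init__(self):
        self._chips = {}

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register(self, chip: ChipArchitecture):
        self._chips[chip.name] = chip

    def lookup(self, name: str) -> ChipArchitecture:
        return self._chips[name]

# Register a chip at startup, query it later from anywhere
reg = HardwareRegistry.instance()
reg.register(ChipArchitecture("A100", "NVIDIA", "cuda", warp_size=32))
print(reg.lookup("A100").backend)  # -> cuda
```

The singleton keeps chip metadata in one place, so optimizer code can ask "what hardware am I on?" without threading a config object through every call.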
#### Layer 2: Backend Manager (C++)
- BackendRegistry (singleton, thread-safe)
- Factory patterns for backends and strategies
- Automatic initialization on import
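The thread-safe singleton plus factory combination described above can be sketched in Python (the real layer is C++; class and method names here mirror the diagram but the bodies are illustrative assumptions):

```python
import threading

class Backend:
    """Stand-in for a concrete backend implementation (hypothetical)."""
    def __init__(self, name):
        self.name = name

class BackendRegistry:
    """Thread-safe singleton holding backend factories, mirroring the C++ design."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self._factories = {}          # backend name -> zero-arg factory callable
        self._guard = threading.Lock()

    @classmethod
    def instance(cls):
        # Double-checked locking: concurrent first callers still get one instance
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def register_factory(self, name, factory):
        with self._guard:
            self._factories[name] = factory

    def create(self, name) -> Backend:
        with self._guard:
            return self._factories[name]()

# "Automatic initialization on import" amounts to running registrations at module load
registry = BackendRegistry.instance()
registry.register_factory("cuda", lambda: Backend("cuda"))
registry.register_factory("cpu", lambda: Backend("cpu"))
print(registry.create("cpu").name)  # -> cpu
```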
#### Layer 3: Search & Strategy
- Hardware-aware kernel search
- Fingerprint-based verification
- Performance profiling and modeling
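Fingerprint-based verification (inherited from Mirage's approach) checks candidate kernels probabilistically: evaluate each candidate on random inputs over a finite field and compare the resulting hashes. A minimal stand-alone sketch — the prime, the hash, and the toy "kernels" are all illustrative choices, not YiRage's actual scheme:

```python
import random

P = 2_147_483_647  # a large prime; all arithmetic is done mod P

def fingerprint(kernel, n, seed=0):
    """Hash a kernel by evaluating it on seeded random inputs over Z_p."""
    rng = random.Random(seed)
    a = [rng.randrange(P) for _ in range(n)]
    b = [rng.randrange(P) for _ in range(n)]
    fp = 0
    for v in kernel(a, b):
        fp = (fp * 31 + v) % P
    return fp

# Two algebraically equivalent elementwise kernels: 2*(a+b) vs 2*a + 2*b
def k_ref(a, b):
    return [(2 * (x + y)) % P for x, y in zip(a, b)]

def k_cand(a, b):
    return [(2 * x + 2 * y) % P for x, y in zip(a, b)]

# Equal fingerprints -> equivalent with overwhelming probability
assert fingerprint(k_ref, 64) == fingerprint(k_cand, 64)
```

Matching fingerprints do not *prove* equivalence, but the collision probability over a random point in a large field is negligible, which is what makes a superoptimizer's exhaustive candidate search tractable.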
#### Layer 4: Threadblock Operations
- Optimized LLM operators (MatMul, Attention, RMSNorm, SwiGLU)
- Hardware-specific implementations
- Code generation for Triton/NKI/MLIR
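To make the operator list concrete, here is a reference-semantics RMSNorm in plain Python — the mathematical definition the hardware-specific implementations must match, not any backend's optimized code:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: y_i = x_i * w_i / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

out = rmsnorm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
# mean(x^2) = 3, so each element is scaled by ~1/sqrt(3)
```

A backend implementation (CUDA warp reductions, Metal threadgroup reductions, AVX-512 lanes, ...) is correct exactly when it agrees with this definition up to floating-point tolerance.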
#### Layer 5: Persistent Kernel Runtime
- Device memory management
- Kernel launch and synchronization
- JIT compilation support
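Device memory management in a persistent-kernel runtime typically means carving allocations out of one pre-reserved arena instead of calling the driver per tensor. A toy bump allocator with free-list reuse illustrates the idea (the class and its policy are assumptions for this sketch, not YiRage's allocator):

```python
class DeviceMemoryPool:
    """Toy arena allocator: bump-allocate offsets, reuse freed blocks by size."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0
        self.free_blocks = {}   # size -> offsets available for reuse

    def alloc(self, size):
        # Prefer reusing a freed block of the same size
        blocks = self.free_blocks.get(size)
        if blocks:
            return blocks.pop()
        if self.offset + size > self.capacity:
            raise MemoryError("pool exhausted")
        off = self.offset
        self.offset += size
        return off

    def free(self, offset, size):
        self.free_blocks.setdefault(size, []).append(offset)

pool = DeviceMemoryPool(1 << 20)   # 1 MiB arena
a = pool.alloc(256)                # -> offset 0
b = pool.alloc(256)                # -> offset 256
pool.free(a, 256)
c = pool.alloc(256)                # reuses offset 0
```

Real runtimes add alignment, splitting/coalescing, and stream-ordered lifetimes, but the payoff is the same: allocation becomes pointer arithmetic, cheap enough to sit inside a persistent kernel's steady-state loop.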
## ✨ Key Features

### 🚀 12 Complete Backend Implementations (All 5 Layers)
| Backend | Hardware | Key Features | Architecture |
|---------|----------|--------------|--------------|
| CUDA | NVIDIA GPU | Tensor Core, 32-thread Warp, cuBLAS | SM, Shared Memory |
| ROCm | AMD GPU | Matrix Core, 64-thread Wavefront, rocBLAS | GCN/CDNA, LDS |
| CPU | x86/ARM | AVX512/NEON SIMD, Cache Blocking, OpenMP | Multi-core, L1/L2/L3 |
| MPS | Apple Silicon | Metal, Threadgroup, Unified Memory | M1/M2/M3/M4 |
| Ascend | Huawei NPU | Cube Unit 16×16, AI Core, L1 Buffer | Ascend 910/310 |
| MACA | MetaX GPU | 64-thread Warp, CUDA-compat, Tensor Core | C500 Series |
| TPU | Google Cloud | MXU 128×128, BF16 Native, PJRT | TPU v2/v3/v4/v5 |
| XPU | Intel GPU | XMX 8×8, SYCL/oneAPI, SLM | Arc/Max/Gaudi |
| FPGA | Intel/Xilinx | DSP Blocks, Pipeline, BRAM/HBM | OpenCL Kernel |
| Triton | Compiler | Auto-tuning, Tile Fusion, MMA | PTX/HSACO |
| NKI | AWS Neuron | Tensor Engine 128×128, SBUF 24MB | Trainium/Inferentia |
| MLIR | Multi-target | JIT, Linalg, Pass Pipeline | LLVM/NVVM/SPIRV |
### 🔧 Hardware Architecture Differences
```
┌────────────────────────────────────────────────────────────────────────────────────┐
│                          Hardware Architecture Comparison                          │
├────────────┬─────────────────┬─────────────────┬───────────────────────────────────┤
│ Backend    │ Thread Model    │ Matrix Unit     │ Memory Hierarchy                  │
├────────────┼─────────────────┼─────────────────┼───────────────────────────────────┤
│ CUDA       │ 32-thread Warp  │ Tensor Core     │ Registers → Shared → L2 → HBM     │
│ ROCm       │ 64-thread Wave  │ Matrix Core     │ VGPR → LDS → L2 → HBM             │
│ MPS        │ SIMD Group      │ Apple GPU       │ Threadgroup → Device → Unified    │
│ Ascend     │ AI Core         │ Cube 16×16      │ L0 → L1 → L2 → HBM                │
│ MACA       │ 64-thread Warp  │ Tensor Core     │ Shared → L2 → HBM                 │
│ TPU        │ MXU Systolic    │ MXU 128×128     │ VMEM → HBM                        │
│ XPU        │ Xe Subgroup     │ XMX 8×8         │ SLM → L3 → HBM                    │
│ FPGA       │ Pipeline        │ DSP Block       │ BRAM/URAM → DDR/HBM               │
└────────────┴─────────────────┴─────────────────┴───────────────────────────────────┘
```
### 🎯 Hardware-Aware Kernel Optimizers
- 60+ Optimization Methods across all 12 backends
- Automatic Configuration based on hardware capabilities
- Performance Modeling for each backend
- Code Generation for Triton/NKI/MLIR
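Performance modeling for a kernel candidate often starts from a roofline bound: execution time is limited by either compute throughput or memory bandwidth, whichever dominates. A minimal sketch with made-up hardware numbers (not YiRage's actual model):

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline estimate: time is bounded below by compute time and memory time."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)

# 1024^3 GEMM in fp16 on a hypothetical 100 TFLOP/s, 1 TB/s accelerator
flops = 2 * 1024**3                 # 2*M*N*K multiply-adds
bytes_moved = 3 * 1024 * 1024 * 2   # read A and B, write C, 2 bytes/element
t = roofline_time(flops, bytes_moved, peak_flops=100e12, peak_bw=1e12)
```

Per-backend models refine this with occupancy, cache reuse, and launch overhead, but even the crude bound is enough to prune clearly memory-bound or compute-bound candidates during search.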
#### Example: CUDA Optimizer

```python
from yirage.kernel.cuda import CUDAOptimizer, CUDAKernelConfig

config = CUDAKernelConfig()
CUDAOptimizer.optimize_grid_block_dims(
    1024, 1024, 1024,       # M, N, K
    compute_capability=80,  # SM 8.0 (Ampere)
    config=config,
)
# Auto-configured: Tensor Core, Warps, Shared Memory, Occupancy
```
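At its core, grid/block selection for a GEMM is tiling arithmetic: one thread block per output tile, a fixed number of warps per block. A simplified stand-alone sketch — the tile sizes and warp count are illustrative defaults, not YiRage's actual heuristics:

```python
import math

def grid_block_dims(M, N, K, tile_m=128, tile_n=128, warps_per_block=4):
    """Derive CUDA launch dimensions for an M×N×K GEMM, one output tile per block."""
    grid = (math.ceil(M / tile_m), math.ceil(N / tile_n), 1)
    block = (32 * warps_per_block, 1, 1)   # 32 threads per warp on NVIDIA GPUs
    return grid, block

grid, block = grid_block_dims(1024, 1024, 1024)
print(grid, block)  # -> (8, 8, 1) (128, 1, 1)
```

A real optimizer additionally weighs shared-memory capacity and register pressure against occupancy when picking `tile_m`/`tile_n`, which is what the auto-configuration step above automates.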
#### Example: MPS Optimizer (Apple Silicon)

```python
from yirage.kernel.mps import MPSOptimizer, MPSKernelConfig

config = MPSKernelConfig()
MPSOptimizer.optimize_for_apple_silicon(1024, 1024, 1024, config)
# Auto-detects: M1/M2/M3, GPU cores
```
