# YiRage - Yield Revolutionary AGile Engine

<div align="center">

**Multi-Backend LLM Inference Optimization**

Based on Mirage by CMU

</div>

## 🎯 About YiRage
YiRage (Yield Revolutionary AGile Engine) extends Mirage with comprehensive multi-backend support, enabling LLM inference optimization across diverse hardware platforms.
YiRage = Mirage + Multi-Backend Architecture
- Original Mirage (CMU): Superoptimizer framework for tensor programs
- YiRage Extensions (Chen Xingqiang, 2025): Multi-backend support with hardware-aware optimizations
## 🏗️ Architecture

### Five-Layer Complete Backend Architecture
```
┌────────────────────────────────────────────────────────────────────────────────────┐
│                        YiRage Complete Backend Architecture                        │
├────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                             Layer 1: Python API                              │  │
│  │  yirage.new_kernel_graph() → UnifiedCompiler → CoreBridge → superoptimize()  │  │
│  │    HardwareRegistry.instance() → ChipArchitecture → detect_current_chip()    │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                        Layer 2: Backend Manager (C++)                        │  │
│  │       BackendRegistry (thread-safe) ← BackendFactory ← StrategyFactory       │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                          Layer 3: Search & Strategy                          │  │
│  │   Hardware-aware Search │ Fingerprint Verification │ Performance Profiling   │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                       Layer 4: Threadblock Operations                        │  │
│  │    MatMul │ Attention │ RMSNorm │ SwiGLU │ Softmax │ Reduce │ Elementwise    │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                      Layer 5: Persistent Kernel Runtime                      │  │
│  │    Memory Management │ Kernel Launch │ Synchronization │ JIT Compilation     │  │
│  └───────────────────────────────────────┬──────────────────────────────────────┘  │
│                                          │                                         │
│  ┌───────────────────────────────────────▼──────────────────────────────────────┐  │
│  │                                Hardware Layer                                │  │
│  │ ┌──────┐ ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌──────┐ ┌─────┐ ┌──────┐ ┌───────┐│  │
│  │ │ CUDA │ │ROCm │ │ MPS │ │Ascend│ │MACA │ │ TPU  │ │ XPU │ │ FPGA │ │  CPU  ││  │
│  │ │NVIDIA│ │ AMD │ │Apple│ │Huawei│ │MetaX│ │Google│ │Intel│ │Xilinx│ │x86/ARM││  │
│  │ └──────┘ └─────┘ └─────┘ └──────┘ └─────┘ └──────┘ └─────┘ └──────┘ └───────┘│  │
│  │ ┌───────┐ ┌─────┐ ┌──────┐                                                   │  │
│  │ │Triton │ │ NKI │ │ MLIR │  ← Compiler Backends                              │  │
│  │ │OpenAI │ │ AWS │ │ LLVM │                                                   │  │
│  │ └───────┘ └─────┘ └──────┘                                                   │  │
│  └──────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                    │
└────────────────────────────────────────────────────────────────────────────────────┘
```
### Complete Backend Support Matrix (12 Backends × 5 Layers)

| Backend | Hardware      | Backend API | Strategy | Kernel | Threadblock | PK Runtime |
|---------|---------------|-------------|----------|--------|-------------|------------|
| CUDA    | NVIDIA GPU    | ✅ | ✅ | ✅ | ✅ | ✅ |
| ROCm    | AMD GPU       | ✅ | ✅ | ✅ | ✅ | ✅ |
| CPU     | x86/ARM       | ✅ | ✅ | ✅ | ✅ | ✅ |
| MPS     | Apple Silicon | ✅ | ✅ | ✅ | ✅ | ✅ |
| Ascend  | Huawei NPU    | ✅ | ✅ | ✅ | ✅ | ✅ |
| MACA    | MetaX GPU     | ✅ | ✅ | ✅ | ✅ | ✅ |
| TPU     | Google Cloud  | ✅ | ✅ | ✅ | ✅ | ✅ |
| XPU     | Intel GPU     | ✅ | ✅ | ✅ | ✅ | ✅ |
| FPGA    | Intel/Xilinx  | ✅ | ✅ | ✅ | ✅ | ✅ |
| Triton  | Compiler      | ✅ | ✅ | ✅ | ✅ | ✅ |
| NKI     | AWS Neuron    | ✅ | ✅ | ✅ | ✅ | ✅ |
| MLIR    | Multi-target  | ✅ | ✅ | ✅ | ✅ | ✅ |
### Five-Layer Design

#### Layer 1: Python API

- Backend query and selection (`get_available_backends()`)
- Hardware Device Registry (`HardwareRegistry` — register/query chip architectures at runtime)
- Unified compiler interface (`UnifiedCompiler`)
- Core bridge to C++ (`CoreBridge`)
- Hardware-specific optimizers
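The runtime register/query pattern behind `HardwareRegistry` can be sketched in plain Python. This is an illustrative miniature, not the actual YiRage API: the field names on `ChipArchitecture` and the `register`/`lookup` methods are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChipArchitecture:
    """Illustrative chip descriptor; the real ChipArchitecture carries more fields."""
    name: str
    vendor: str
    backend: str     # e.g. "cuda", "mps", "cpu"
    warp_size: int   # threads per warp / wavefront / SIMD group

class HardwareRegistry:
    """Minimal singleton registry: register chips once, query them anywhere at runtime."""
    _instance = None

    def __init__(self):
        self._chips = {}

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register(self, chip: ChipArchitecture):
        self._chips[chip.name] = chip

    def lookup(self, name: str) -> ChipArchitecture:
        return self._chips[name]

# Register a chip at startup, query it later from anywhere
reg = HardwareRegistry.instance()
reg.register(ChipArchitecture("A100", "NVIDIA", "cuda", warp_size=32))
print(reg.lookup("A100").backend)  # -> cuda
```

The singleton keeps chip metadata in one place, so optimizer code can ask "what hardware am I on?" without threading a config object through every call.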
#### Layer 2: Backend Manager (C++)
- BackendRegistry (singleton, thread-safe)
- Factory patterns for backends and strategies
- Automatic initialization on import
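The thread-safe singleton plus factory combination described above can be sketched in Python (the real layer is C++; class and method names here mirror the diagram but the bodies are illustrative assumptions):

```python
import threading

class Backend:
    """Stand-in for a concrete backend implementation (hypothetical)."""
    def __init__(self, name):
        self.name = name

class BackendRegistry:
    """Thread-safe singleton holding backend factories, mirroring the C++ design."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self._factories = {}          # backend name -> zero-arg factory callable
        self._guard = threading.Lock()

    @classmethod
    def instance(cls):
        # Double-checked locking: concurrent first callers still get one instance
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def register_factory(self, name, factory):
        with self._guard:
            self._factories[name] = factory

    def create(self, name) -> Backend:
        with self._guard:
            return self._factories[name]()

# "Automatic initialization on import" amounts to running registrations at module load
registry = BackendRegistry.instance()
registry.register_factory("cuda", lambda: Backend("cuda"))
registry.register_factory("cpu", lambda: Backend("cpu"))
print(registry.create("cpu").name)  # -> cpu
```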
#### Layer 3: Search & Strategy
- Hardware-aware kernel search
- Fingerprint-based verification
- Performance profiling and modeling
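Fingerprint-based verification (inherited from Mirage's approach) checks candidate kernels probabilistically: evaluate each candidate on random inputs over a finite field and compare the resulting hashes. A minimal stand-alone sketch — the prime, the hash, and the toy "kernels" are all illustrative choices, not YiRage's actual scheme:

```python
import random

P = 2_147_483_647  # a large prime; all arithmetic is done mod P

def fingerprint(kernel, n, seed=0):
    """Hash a kernel by evaluating it on seeded random inputs over Z_p."""
    rng = random.Random(seed)
    a = [rng.randrange(P) for _ in range(n)]
    b = [rng.randrange(P) for _ in range(n)]
    fp = 0
    for v in kernel(a, b):
        fp = (fp * 31 + v) % P
    return fp

# Two algebraically equivalent elementwise kernels: 2*(a+b) vs 2*a + 2*b
def k_ref(a, b):
    return [(2 * (x + y)) % P for x, y in zip(a, b)]

def k_cand(a, b):
    return [(2 * x + 2 * y) % P for x, y in zip(a, b)]

# Equal fingerprints -> equivalent with overwhelming probability
assert fingerprint(k_ref, 64) == fingerprint(k_cand, 64)
```

Matching fingerprints do not *prove* equivalence, but the collision probability over a random point in a large field is negligible, which is what makes a superoptimizer's exhaustive candidate search tractable.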
#### Layer 4: Threadblock Operations
- Optimized LLM operators (MatMul, Attention, RMSNorm, SwiGLU)
- Hardware-specific implementations
- Code generation for Triton/NKI/MLIR
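To make the operator list concrete, here is a reference-semantics RMSNorm in plain Python — the mathematical definition the hardware-specific implementations must match, not any backend's optimized code:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: y_i = x_i * w_i / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

out = rmsnorm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
# mean(x^2) = 3, so each element is scaled by ~1/sqrt(3)
```

A backend implementation (CUDA warp reductions, Metal threadgroup reductions, AVX-512 lanes, ...) is correct exactly when it agrees with this definition up to floating-point tolerance.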
#### Layer 5: Persistent Kernel Runtime
- Device memory management
- Kernel launch and synchronization
- JIT compilation support
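Device memory management in a persistent-kernel runtime typically means carving allocations out of one pre-reserved arena instead of calling the driver per tensor. A toy bump allocator with free-list reuse illustrates the idea (the class and its policy are assumptions for this sketch, not YiRage's allocator):

```python
class DeviceMemoryPool:
    """Toy arena allocator: bump-allocate offsets, reuse freed blocks by size."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0
        self.free_blocks = {}   # size -> offsets available for reuse

    def alloc(self, size):
        # Prefer reusing a freed block of the same size
        blocks = self.free_blocks.get(size)
        if blocks:
            return blocks.pop()
        if self.offset + size > self.capacity:
            raise MemoryError("pool exhausted")
        off = self.offset
        self.offset += size
        return off

    def free(self, offset, size):
        self.free_blocks.setdefault(size, []).append(offset)

pool = DeviceMemoryPool(1 << 20)   # 1 MiB arena
a = pool.alloc(256)                # -> offset 0
b = pool.alloc(256)                # -> offset 256
pool.free(a, 256)
c = pool.alloc(256)                # reuses offset 0
```

Real runtimes add alignment, splitting/coalescing, and stream-ordered lifetimes, but the payoff is the same: allocation becomes pointer arithmetic, cheap enough to sit inside a persistent kernel's steady-state loop.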
## ✨ Key Features

### 🚀 12 Complete Backend Implementations (All 5 Layers)
| Backend | Hardware | Key Features | Architecture |
|---------|----------|--------------|--------------|
| CUDA | NVIDIA GPU | Tensor Core, 32-thread Warp, cuBLAS | SM, Shared Memory |
| ROCm | AMD GPU | Matrix Core, 64-thread Wavefront, rocBLAS | GCN/CDNA, LDS |
| CPU | x86/ARM | AVX512/NEON SIMD, Cache Blocking, OpenMP | Multi-core, L1/L2/L3 |
| MPS | Apple Silicon | Metal, Threadgroup, Unified Memory | M1/M2/M3/M4 |
| Ascend | Huawei NPU | Cube Unit 16×16, AI Core, L1 Buffer | Ascend 910/310 |
| MACA | MetaX GPU | 64-thread Warp, CUDA-compat, Tensor Core | C500 Series |
| TPU | Google Cloud | MXU 128×128, BF16 Native, PJRT | TPU v2/v3/v4/v5 |
| XPU | Intel GPU | XMX 8×8, SYCL/oneAPI, SLM | Arc/Max/Gaudi |
| FPGA | Intel/Xilinx | DSP Blocks, Pipeline, BRAM/HBM | OpenCL Kernel |
| Triton | Compiler | Auto-tuning, Tile Fusion, MMA | PTX/HSACO |
| NKI | AWS Neuron | Tensor Engine 128×128, SBUF 24MB | Trainium/Inferentia |
| MLIR | Multi-target | JIT, Linalg, Pass Pipeline | LLVM/NVVM/SPIRV |
### 🔧 Hardware Architecture Differences
```
┌────────────────────────────────────────────────────────────────────────────────────┐
│                          Hardware Architecture Comparison                          │
├────────────┬─────────────────┬─────────────────┬───────────────────────────────────┤
│ Backend    │ Thread Model    │ Matrix Unit     │ Memory Hierarchy                  │
├────────────┼─────────────────┼─────────────────┼───────────────────────────────────┤
│ CUDA       │ 32-thread Warp  │ Tensor Core     │ Registers → Shared → L2 → HBM     │
│ ROCm       │ 64-thread Wave  │ Matrix Core     │ VGPR → LDS → L2 → HBM             │
│ MPS        │ SIMD Group      │ Apple GPU       │ Threadgroup → Device → Unified    │
│ Ascend     │ AI Core         │ Cube 16×16      │ L0 → L1 → L2 → HBM                │
│ MACA       │ 64-thread Warp  │ Tensor Core     │ Shared → L2 → HBM                 │
│ TPU        │ MXU Systolic    │ MXU 128×128     │ VMEM → HBM                        │
│ XPU        │ Xe Subgroup     │ XMX 8×8         │ SLM → L3 → HBM                    │
│ FPGA       │ Pipeline        │ DSP Block       │ BRAM/URAM → DDR/HBM               │
└────────────┴─────────────────┴─────────────────┴───────────────────────────────────┘
```
### 🎯 Hardware-Aware Kernel Optimizers
- 60+ Optimization Methods across all 12 backends
- Automatic Configuration based on hardware capabilities
- Performance Modeling for each backend
- Code Generation for Triton/NKI/MLIR
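Performance modeling for a kernel candidate often starts from a roofline bound: execution time is limited by either compute throughput or memory bandwidth, whichever dominates. A minimal sketch with made-up hardware numbers (not YiRage's actual model):

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline estimate: time is bounded below by compute time and memory time."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)

# 1024^3 GEMM in fp16 on a hypothetical 100 TFLOP/s, 1 TB/s accelerator
flops = 2 * 1024**3                 # 2*M*N*K multiply-adds
bytes_moved = 3 * 1024 * 1024 * 2   # read A and B, write C, 2 bytes/element
t = roofline_time(flops, bytes_moved, peak_flops=100e12, peak_bw=1e12)
```

Per-backend models refine this with occupancy, cache reuse, and launch overhead, but even the crude bound is enough to prune clearly memory-bound or compute-bound candidates during search.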
#### Example: CUDA Optimizer

```python
from yirage.kernel.cuda import CUDAOptimizer, CUDAKernelConfig

config = CUDAKernelConfig()
CUDAOptimizer.optimize_grid_block_dims(
    1024, 1024, 1024,       # M, N, K
    compute_capability=80,  # SM 8.0 (Ampere)
    config=config,
)
# Auto-configured: Tensor Core, Warps, Shared Memory, Occupancy
```
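At its core, grid/block selection for a GEMM is tiling arithmetic: one thread block per output tile, a fixed number of warps per block. A simplified stand-alone sketch — the tile sizes and warp count are illustrative defaults, not YiRage's actual heuristics:

```python
import math

def grid_block_dims(M, N, K, tile_m=128, tile_n=128, warps_per_block=4):
    """Derive CUDA launch dimensions for an M×N×K GEMM, one output tile per block."""
    grid = (math.ceil(M / tile_m), math.ceil(N / tile_n), 1)
    block = (32 * warps_per_block, 1, 1)   # 32 threads per warp on NVIDIA GPUs
    return grid, block

grid, block = grid_block_dims(1024, 1024, 1024)
print(grid, block)  # -> (8, 8, 1) (128, 1, 1)
```

A real optimizer additionally weighs shared-memory capacity and register pressure against occupancy when picking `tile_m`/`tile_n`, which is what the auto-configuration step above automates.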
#### Example: MPS Optimizer (Apple Silicon)

```python
from yirage.kernel.mps import MPSOptimizer, MPSKernelConfig

config = MPSKernelConfig()
MPSOptimizer.optimize_for_apple_silicon(1024, 1024, 1024, config)
# Auto-detects: M1/M2/M3, GPU cores
```
