# Iris: First-Class Multi-GPU Programming Experience in Triton
<p align="center"> <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a> <a href="https://github.com/ROCm/iris/blob/main/.github/workflows/lint.yml"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff"></a> <a href="https://github.com/ROCm/iris/actions/workflows/iris-tests.yml"><img src="https://github.com/ROCm/iris/actions/workflows/iris-tests.yml/badge.svg" alt="Iris Tests"></a> <a href="https://doi.org/10.5281/zenodo.17382307"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.17382307.svg" alt="DOI"></a> <a href="https://doi.org/10.48550/arXiv.2511.12500"><img src="https://img.shields.io/badge/cs.DC%2C%20cs.LG-arXiv%3A2511.12500-B31B1B.svg" alt="arXiv"></a> </p>

Iris is a Triton-based framework for Remote Memory Access (RMA) operations developed by AMD's Research and Advanced Development team. Iris provides SHMEM-like APIs within Triton for multi-GPU programming. Iris' goal is to make multi-GPU programming a first-class citizen in Triton while retaining Triton's programmability and performance.
## Latest with Iris 🔥
- [16/11/2025] Paper released: Iris: First-Class Multi-GPU Programming Experience in Triton
- [04/11/2025] Paper released: Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs
- [02/10/2025] Iris + Gluon Released
- [18/09/2025] FlashDecode with Iris
- [16/09/2025] Iris was presented in Chinese for participants of the AMD Distributed Inference Kernel Contest
- [12/09/2025] Presented Iris at GPUMode [talk] | [slides]
- [27/08/2025] AMD's GPU Mode Competition Announced
- [14/08/2025] Iris All-Scatter Taxonomy Released [documentation] | [video]
- [25/06/2025] Iris Released
## Key Features
- SHMEM-like RMA: Iris provides SHMEM-like RMA support in Triton (see the sketch after this list).
- Simple and Intuitive API: Iris provides simple and intuitive RMA APIs. Writing multi-GPU programs is as easy as writing single-GPU programs.
- Triton-based: Iris is built on top of Triton and inherits Triton's performance and capabilities.
- Triton Gluon-based backend (Experimental): Includes an optional backend built on Triton’s Gluon language, a lower-level GPU programming model that exposes explicit control over layouts, memory, and data movement—ideal for users seeking maximal performance and hardware-level optimization.
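To make the SHMEM-like claim concrete, here is a minimal sketch (referenced from the list above) that reuses the `iris.store` primitive from the API example below to have rank 0 push the same value into every rank's copy of a symmetric-heap buffer. It is illustrative only, not code from this repository; in particular, it assumes `iris.store` accepts a loop-generated target rank and that `world_size` is passed as a compile-time constant.

```python
import triton
import triton.language as tl
import iris

# Illustrative sketch only: rank 0 "broadcasts" a value to every rank's buffer
# using the same iris.store primitive shown in the API example below.
@triton.jit
def broadcast_kernel(buffer, n_elements: tl.constexpr, block_size: tl.constexpr,
                     world_size: tl.constexpr, heap_bases_ptr):
    pid = tl.program_id(0)
    offsets = pid * block_size + tl.arange(0, block_size)
    mask = offsets < n_elements

    source_rank = 0
    # Unrolled at compile time; each iteration targets a different rank's
    # slice of the symmetric heap.
    for target_rank in tl.static_range(world_size):
        iris.store(buffer + offsets, 42,
                   source_rank, target_rank,
                   heap_bases_ptr, mask=mask)
```

The point is that targeting another GPU's memory is just an extra pair of rank arguments on an otherwise ordinary Triton store.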
## Documentation
- Setup Alternatives
- Examples
- Programming Model
- Taxonomy of Multi-GPU Programming Patterns
- Fine-grained GEMM & Communication Overlap
- API Reference
## API Example
Here's a simple example showing how to perform remote memory operations between GPUs using Iris:
```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

import triton
import triton.language as tl

import iris


# Device-side APIs
@triton.jit
def kernel(buffer, buffer_size: tl.constexpr, block_size: tl.constexpr, heap_bases_ptr):
    # Compute start index of this block
    pid = tl.program_id(0)
    block_start = pid * block_size
    offsets = block_start + tl.arange(0, block_size)

    # Guard for out-of-bounds accesses
    mask = offsets < buffer_size

    # Store 1 in the target buffer at each offset
    source_rank = 0
    target_rank = 1
    iris.store(buffer + offsets, 1,
               source_rank, target_rank,
               heap_bases_ptr, mask=mask)


def _worker(rank, world_size):
    # Torch distributed initialization
    device_id = rank % torch.cuda.device_count()
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
        device_id=torch.device(f"cuda:{device_id}"),
    )

    # Iris initialization
    heap_size = 2**30  # 1GiB symmetric heap for inter-GPU communication
    iris_ctx = iris.iris(heap_size)
    cur_rank = iris_ctx.get_rank()

    # Iris tensor allocation
    buffer_size = 4096  # 4K elements buffer
    buffer = iris_ctx.zeros(buffer_size, device="cuda", dtype=torch.float32)

    # Launch the kernel on rank 0
    block_size = 1024
    grid = lambda meta: (triton.cdiv(buffer_size, meta["block_size"]),)
    source_rank = 0
    if cur_rank == source_rank:
        kernel[grid](
            buffer,
            buffer_size,
            block_size,
            iris_ctx.get_heap_bases(),
        )

    # Synchronize all ranks
    iris_ctx.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # Using two ranks
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```
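As a sanity check on the semantics, the following lines could be appended to `_worker` after `iris_ctx.barrier()`; this is an illustrative sketch rather than part of the example above. Because rank 0 stored 1s into rank 1's slice of the symmetric heap, rank 1 should observe all ones in its local `buffer`, while rank 0's own copy remains zeros.

```python
# Sketch: place after iris_ctx.barrier() inside _worker to verify the remote store.
target_rank = 1
if cur_rank == target_rank:
    # Rank 0 wrote into this rank's buffer, so every element should now be 1.
    assert torch.all(buffer == 1), "remote store from rank 0 did not arrive"
elif cur_rank == source_rank:
    # Rank 0 never wrote to its own buffer, so it should still be all zeros.
    assert torch.all(buffer == 0)
```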
## Gluon-style API (Experimental)

Iris also provides an experimental, cleaner API built on Triton's Gluon using the `@gluon.jit` decorator:

> [!NOTE]
> Requirements for the Gluon backend: ROCm 7.0+ and Triton commit aafec417bded34db6308f5b3d6023daefae43905 or later are required to use the experimental Gluon APIs.
```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

from triton.experimental import gluon
from triton.experimental.gluon import language as gl

import iris.experimental.iris_gluon as iris_gl


# Device-side APIs - context encapsulates heap_bases
@gluon.jit
def kernel(IrisDeviceCtx: gl.constexpr, context_tensor,
           buffer, buffer_size: gl.constexpr, block_size: gl.constexpr):
    # Initialize device context from tensor
    ctx = IrisDeviceCtx.initialize(context_tensor)

    pid = gl.program_id(0)
    block_start = pid * block_size
    layout: gl.constexpr = gl.BlockedLayout([1], [64], [1], [0])
    offsets = block_start + gl.arange(0, block_size, layout=layout)
    mask = offsets < buffer_size

    # Store 1 in the target buffer - no need to pass heap_bases separately!
    target_rank = 1
    ctx.store(buffer + offsets, 1, target_rank, mask=mask)


def _worker(rank, world_size):
    # Torch distributed initialization
    device_id = rank % torch.cuda.device_count()
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
        device_id=torch.device(f"cuda:{device_id}"),
    )

    # Iris initialization
    heap_size = 2**30  # 1GiB symmetric heap
    iris_ctx = iris_gl.iris(heap_size)
    context_tensor = iris_ctx.get_device_context()  # Get encoded context
    cur_rank = iris_ctx.get_rank()

    # Iris tensor allocation
    buffer_size = 4096  # 4K elements buffer
    buffer = iris_ctx.zeros(buffer_size, device="cuda", dtype=torch.float32)

    # Launch the kernel on rank 0
    block_size = 1024
    grid = (buffer_size + block_size - 1) // block_size
    source_rank = 0
    if cur_rank == source_rank:
        kernel[(grid,)](iris_gl.IrisDeviceCtx, context_tensor,
                        buffer, buffer_size, block_size, num_warps=1)

    # Synchronize all ranks
    iris_ctx.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # Using two ranks
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```
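Because the Gluon backend needs a newer Triton than the baseline requirements (see the note above), a script that wants to support both paths might probe for the experimental module before importing the Gluon flavor of Iris. This guard is a sketch, not part of the Iris API:

```python
# Sketch: pick the Gluon-backed module when the installed Triton provides it.
try:
    from triton.experimental import gluon  # noqa: F401 - needs a recent Triton build
    import iris.experimental.iris_gluon as iris_backend
    HAVE_GLUON = True
except ImportError:
    import iris as iris_backend  # fall back to the standard Triton backend
    HAVE_GLUON = False

# Both modules expose iris(heap_size) for host-side setup, as in the examples above;
# the @triton.jit and @gluon.jit kernels themselves are not interchangeable.
```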
## Quick Start Guide

### Quick Installation

> [!NOTE]
> Requirements: Python 3.10+, PyTorch 2.0+ (ROCm version), ROCm 6.3.1+ HIP runtime, Triton, and setuptools>=61

For a quick installation directly from the repository:

```bash
pip install git+https://github.com/ROCm/iris.git
```
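A quick way to confirm the installation is a plain import check (no particular version attribute is assumed here):

```bash
# The import should succeed without errors if Iris is installed correctly.
python -c "import iris; print('iris imported successfully')"
```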
### Docker Compose (Recommended for Development)

The recommended way to get started is to use Docker Compose, which provides a development environment with the Iris directory mounted inside the container. This allows you to make changes to the code outside the container and see them reflected inside.

```bash
# Start the development container
docker compose up --build -d
# or, depending on your Docker version
docker-compose up --build -d

# Attach to the running container
docker attach iris-dev

# Install Iris in development mode
cd iris && pip install -e .
```
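From inside the container you can then exercise the editable install, for example by running the test suite. The `tests/` path is an assumption about the repository layout (the CI badge above indicates a test workflow exists), so adjust it if needed:

```bash
# Assumes the tests live under tests/ at the repository root.
pytest tests/
```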
For bare-metal, Docker, or Apptainer setup, see Installation.
### Next Steps
Check out our examples directory for ready-to-run scripts and usage patterns, including peer-to-peer communication and GEMM benchmarks.
## Supported GPUs
Iris currently supports:
- MI300X, MI350X & MI355X
> [!NOTE]
> Iris may work on other ROCm-compatible AMD GPUs.
## Roadmap
We plan to extend Iris with the following features:
- Extended GPU Support: Testing and optimization for other AMD GPUs.
- RDMA Support: Multi-node support using Remote Direct Memory Access (RDMA) for distributed computing across multiple machines.
- End-to-End Integration: Comprehensive examples covering various end-to-end workloads.