CUDApple: A CUDA to Metal Translator
Welcome to CUDApple, a compiler project designed to automatically translate CUDA kernels into Metal shaders, enabling high-performance computation on Apple Silicon devices. This project demonstrates the feasibility of running complex CUDA-based machine learning workloads on Mac hardware through source-to-source compilation.

Core Achievements
This project has successfully implemented the translation of a wide range of computational kernels, providing the fundamental building blocks for modern machine learning models. The examples in `src/examples` showcase a complete, end-to-end training step for a neural network.
What's Working
- Neural Network Layers: Both forward and backward passes for essential layers have been translated and verified:
  - Linear (Fully Connected): `linear.cu`, `linear_backward.cu`, `linear_backward_weights.cu`, `linear_backward_bias.cu`
  - 2D Convolution: `conv2d.cu`, `conv2d_backward_weights.cu`, `conv2d_backward_bias.cu`
  - 2D Max Pooling: `maxpool2d.cu`
- Activation Functions:
  - ReLU: `simple_relu.cu`
  - Softmax: `softmax.cu`, `softmax_backward.cu`
  - Tanh (and others): `activation_functions.cu`
- Loss Functions & Optimizers:
  - Cross-Entropy Loss: `cross_entropy_loss.cu`
  - Stochastic Gradient Descent (SGD): `sgd_optimizer.cu`
- End-to-End Training:
  - A complete, fused kernel for a training step of a Multi-Layer Perceptron (MLP) on the MNIST dataset is working (`training_loop.cu`). This includes the forward pass, backpropagation, and weight updates in a single, efficient kernel.
Key Learnings & Challenges
Developing a source-to-source translator between two distinct GPU programming models has yielded several insights:
- Parallelism Model Mapping: A core challenge is mapping CUDA's execution model (grids, blocks, threads) to Metal's model (grids, threadgroups, threads). The project successfully maps concepts like `blockIdx` and `threadIdx` to their Metal equivalents, which is fundamental for correctness.
- Fused vs. Individual Kernels: The examples demonstrate both small, individual kernels (e.g., `vector_add.cu`) and large, "fused" kernels (`training_loop.cu`). Translating fused kernels, which minimize memory transfers by keeping intermediate data in registers, presents a unique challenge and a significant opportunity for performance optimization.
- Atomic Operations: Correctly translating CUDA's `atomicAdd` to its Metal counterpart is critical for algorithms like gradient accumulation, where multiple threads update the same memory location. This was a key focus to ensure numerical correctness.
- Compiler Architecture: Building this translator required a robust pipeline, including a parser to generate an Abstract Syntax Tree (AST) from CUDA source and a generator to convert the AST into Metal Shading Language (MSL).
How It Works
The translation process follows a standard compiler pipeline:
- Parsing: CUDA source code (`.cu`) is parsed into an Abstract Syntax Tree (AST) that represents its structure and semantics. The AST definition can be found in `parser/unified_ast.rs`.
- Code Generation: The AST is traversed, and corresponding Metal Shading Language (MSL) code is generated. This involves mapping data types, functions, and kernel launch syntax. The generator logic is located in `metal/mod.rs` and `metal/host.rs`.
Getting Started
To try out the project, clone the repository and follow these steps:
- Compile the Project: Use `cargo build` to compile the Rust code.
- Run a Kernel: Use `cd src` and `cargo run -- -i examples/vector_add.cu -d output/ --run -v` to translate and execute a CUDA kernel.
Thank you for checking out CUDApple! Don't hesitate to reach out on X if you have any questions or suggestions :)