Kompute
General-purpose GPU compute framework built on Vulkan to support thousands of cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing use cases. Backed by the Linux Foundation.
💬 Join the Discord & Community Calls 🔋 Documentation 💻 Blog Post ⌨ Examples 💾
<hr>Kompute is backed by the Linux Foundation as a <a href="https://lfaidata.foundation/blog/2021/08/26/kompute-joins-lf-ai-data-as-new-sandbox-project/">hosted project</a> by the LF AI & Data Foundation.
<table> <tr> <td> <a href="https://www.linuxfoundation.org/projects/"> <img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Linux_Foundation_logo.png"> </a> </td> <td> <a href="https://lfaidata.foundation/projects/"> <img src="https://raw.githubusercontent.com/lfai/artwork/main/lfaidata-assets/lfaidata/horizontal/color/lfaidata-horizontal-color.png"> </a> </td> </tr> </table>

Principles & Features
- Flexible Python module with C++ SDK for optimizations
- Asynchronous & parallel processing support through GPU family queues
- Mobile enabled with examples via Android NDK across several architectures
- BYOV: Bring-your-own-Vulkan design to play nice with existing Vulkan applications
- Explicit relationships for GPU and host memory ownership and memory management
- Robust codebase with 90% unit test code coverage
- Advanced use-cases on machine learning 🤖, mobile development 📱 and game development 🎮.
- Active community with monthly calls, Discord chat and more

Projects using Kompute ❤️ 🤖
- GPT4ALL
  - An ecosystem of open-source on-edge large language models that run locally on your CPU and nearly any GPU.
- llama.cpp
  - Port of Facebook's LLaMA model in C/C++ (now decommissioned).
- tpoisonooo/how-to-optimize-gemm
  - Row-major matmul optimization.
- vkJAX
  - JAX interpreter for Vulkan.
Getting Started
Below you can find a GPU multiplication example using the C++ and Python Kompute interfaces.
You can join the Discord for questions and discussion, open a GitHub issue, or read the documentation.
Your First Kompute (C++)
The C++ interface provides low-level access to the native components of Kompute, enabling advanced optimizations as well as extension of components.
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <kompute/Kompute.hpp>

void kompute(const std::string& shader) {

    // 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    kp::Manager mgr;

    // 2. Create and initialise Kompute Tensors through manager

    // Default tensor constructor simplifies creation of float values
    auto tensorInA = mgr.tensor({ 2., 2., 2. });
    auto tensorInB = mgr.tensor({ 1., 2., 3. });
    // Explicit type constructor supports uint32, int32, double, float and bool
    auto tensorOutA = mgr.tensorT<uint32_t>({ 0, 0, 0 });
    auto tensorOutB = mgr.tensorT<uint32_t>({ 0, 0, 0 });

    std::vector<std::shared_ptr<kp::Memory>> params = {tensorInA, tensorInB, tensorOutA, tensorOutB};

    // 3. Create algorithm based on shader (supports buffers & push/spec constants)
    kp::Workgroup workgroup({3, 1, 1});
    std::vector<float> specConsts({ 2 });
    std::vector<float> pushConstsA({ 2.0 });
    std::vector<float> pushConstsB({ 3.0 });

    auto algorithm = mgr.algorithm(params,
                                   // See documentation shader section for compileSource
                                   compileSource(shader),
                                   workgroup,
                                   specConsts,
                                   pushConstsA);

    // 4. Run operation synchronously using sequence
    mgr.sequence()
        ->record<kp::OpSyncDevice>(params)
        ->record<kp::OpAlgoDispatch>(algorithm) // Binds default push consts
        ->eval() // Evaluates the two recorded operations
        ->record<kp::OpAlgoDispatch>(algorithm, pushConstsB) // Overrides push consts
        ->eval(); // Evaluates only last recorded operation

    // 5. Sync results from the GPU asynchronously
    auto sq = mgr.sequence();
    sq->evalAsync<kp::OpSyncLocal>(params);

    // ... Do other work asynchronously whilst GPU finishes

    sq->evalAwait();

    // Prints the first output which is: { 4, 8, 12 }
    // (the output tensors hold uint32_t values, so iterate with that type)
    for (const uint32_t& elem : tensorOutA->vector()) std::cout << elem << " ";
    // Prints the second output which is: { 10, 10, 10 }
    for (const uint32_t& elem : tensorOutB->vector()) std::cout << elem << " ";

} // Manages / releases all CPU and GPU memory resources

int main() {

    // Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    // files). This shader shows some of the main components including constants, buffers, etc
    std::string shader = (R"(
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initialization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    )");

    // Run the function declared above with our raw string shader
    kompute(shader);
}
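
To make the expected output concrete, the arithmetic of the two dispatches can be traced on the CPU. The following is a minimal plain-Python sketch, not Kompute code: `dispatch` is a hypothetical helper that mimics one run of the shader above, assuming one invocation per global index (3 workgroups with `local_size_x = 1`) and that `uint(...)` behaves like `int(...)` for these non-negative values.

```python
def dispatch(in_a, in_b, out_a, out_b, spec_const, push_const,
             workgroups=3, local_size_x=1):
    """CPU stand-in for one dispatch of the example shader (hypothetical helper)."""
    # One shader invocation per global index: workgroups * local_size_x in total.
    for index in range(workgroups * local_size_x):
        out_a[index] += int(in_a[index] * in_b[index])
        out_b[index] += int(spec_const * push_const)


in_a = [2.0, 2.0, 2.0]
in_b = [1.0, 2.0, 3.0]
out_a = [0, 0, 0]
out_b = [0, 0, 0]

# The first dispatch uses the default push constant (2.0),
# the second dispatch overrides it (3.0). The spec constant stays at 2.0.
for push_const in (2.0, 3.0):
    dispatch(in_a, in_b, out_a, out_b, spec_const=2.0, push_const=push_const)

print(out_a)  # [4, 8, 12]
print(out_b)  # [10, 10, 10]
```

Because both outputs are accumulated with `+=`, the results reflect the sum over both dispatches: `out_a` is twice the element-wise product, and `out_b` is `2 * 2 + 2 * 3`.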
Your First Kompute (Python)
The Python package provides a high-level interactive interface that enables experimentation while ensuring high performance and fast development workflows.
import kp
import numpy as np

from .utils import compile_source # using util function from python/test/utils

def kompute(shader):
    # 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    mgr = kp.Manager()

    # 2. Create and initialise Kompute Tensors through manager

    # Default tensor constructor simplifies creation of float values
    tensor_in_a = mgr.tensor([2, 2, 2])
    tensor_in_b = mgr.tensor([1, 2, 3])
    # Explicit type constructor supports uint32, int32, double, float and bool
    tensor_out_a = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    tensor_out_b = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    # The explicitly typed tensors report an unsigned integer data type
    assert tensor_out_a.data_type() == kp.DataTypes.uint

    params = [tensor_in_a, tensor_in_b, tensor_out_a, tensor_out_b]

    # 3. Create algorithm based on shader (supports buffers & push/spec constants)
    workgroup = (3, 1, 1)
    spec_consts = [2]
    push_consts_a = [2]
    push_consts_b = [3]

    # See documentation shader section for compile_source
    spirv = compile_source(shader)

    algo = mgr.algorithm(params, spirv, workgroup, spec_consts, push_consts_a)

    # 4. Run operation synchronously using sequence
    (mgr.sequence()
        .record(kp.OpTensorSyncDevice(params))
        .record(kp.OpAlgoDispatch(algo)) # Binds default push consts provided
        .eval() # evaluates the two recorded ops
        .record(kp.OpAlgoDispatch(algo, push_consts_b)) # Overrides push consts
        .eval()) # evaluates only the last recorded op

    # 5. Sync results from the GPU asynchronously
    sq = mgr.sequence()
    sq.eval_async(kp.OpTensorSyncLocal(params
