NNTile
A neural network training framework within a task-based parallel programming paradigm
General purpose
NNTile is a framework for training large neural networks. It relies on a task-based parallel programming paradigm that dynamically distributes computations across all available hardware resources and transmits data asynchronously. For this purpose, NNTile uses the StarPU library.
Preliminary experimental results
Experiments with custom 4-layer and 8-layer GPT models with up to 50B parameters showed both good performance and the ability to train models 4 times larger than PyTorch FSDP can handle on the same hardware (a single server with 8 x Nvidia A100 80GB SXM).

Authors
NNTile is developed by specialists from
- Skolkovo Institute of Science and Technology (Skoltech)
- Artificial Intelligence Research Institute (AIRI)
Main contributors are:
- Aleksandr Mikhalev
- Aleksandr Katrutsa
- Konstantin Sozykin
- Gleb Karpov
- Daniel Bershatsky
Acknowledgement
The authors of NNTile would like to thank Ivan Oseledets for bringing the idea of this project to life.
The work was generously supported by the Center in the field of Artificial Intelligence in the direction of optimizing management decisions to reduce the carbon footprint on the basis of the Skolkovo Institute of Science and Technology under Contract No. 70-2021-00145/10841 dated 02.11.2021 (items 2.3.1, 2.3.3, 3.3.2 and 3.3.4) and Contract No. 10825/3978620 dated 26.08.2021.
This work was supported by FASIE (fasie.ru).
Assembly
NNTile comes with a Dockerfile to construct a Docker image with NNTile
and all prerequisites. The image uses Miniforge (Conda) for dependencies and
provides two build stages:
| Stage | Description |
|-------|-------------|
| sandbox | Prerequisites only (FXT, StarPU, Python, PyTorch, etc.). Use for NNTile development when building from local sources. |
| nntile | Full image with NNTile compiled from the repository. Default target. |
Building the image
Build the full image (sandbox + NNTile compiled):
docker build . -t nntile:latest
Build only the sandbox (prerequisites, no NNTile):
docker build . -t nntile_sandbox:latest --target sandbox
Build arguments (all optional):
| Argument | Default | Description |
|----------|---------|-------------|
| CUDA_VERSION | 12.9.1 | CUDA version for the base image |
| BASE_OS | ubuntu22.04 | Base OS for the CUDA image |
| BASE_IMAGE | nvidia/cuda:${CUDA_VERSION}-base-${BASE_OS} | Override the full base image |
| MAKE_JOBS | 4 | Parallelism for compiling FXT, StarPU, and NNTile |
| CUDA_ARCHS | 70;75;80;86;89;90;100;120 | Target CUDA architectures (semicolon-separated) |
| PYTHON_VERSION | 3.12 | Python version in the conda environment |
| PYTORCH_VERSION | 2.9.1 | PyTorch version |
Example with custom options:
docker build . -t nntile:latest \
--build-arg MAKE_JOBS=8 \
--build-arg CUDA_ARCHS="70;75;80;86;89;90" \
--build-arg CUDA_VERSION=12.6.1
Running the container
Start an interactive shell (default):
docker run -it --gpus all nntile:latest
The container uses the nntile conda environment by default. The working directory
is /workspace/nntile, and PYTHONPATH is set for the Python wrappers.
Minimal requirements
NNTile supports only CUDA devices of compute capability 8.0 or higher.
Jupyter notebook examples
Several examples (GPT2, LLaMa) can be found in the notebooks directory. With a built
Docker image, launch Jupyter Lab with:
docker run -it --gpus all -p 8888:8888 nntile:latest jupyter lab --notebook-dir=/workspace/nntile --ip='*' --port=8888 --no-browser --allow-root
For the classic Jupyter Notebook interface:
docker run -it --gpus all -p 8888:8888 nntile:latest jupyter notebook --notebook-dir=/workspace/nntile --ip='*' --port=8888 --no-browser --allow-root
TensorBoard is available on port 6006. Expose it with -p 6006:6006 when running the container.
Minimal working GPT example
To train your own GPT model with NNTile, there is a minimal working example, gpt2_custom_training.py. It works either with the WikiText-103 dataset or with a dataset stored in the train.bin format, which contains a stream of uint16 values, just as NanoGPT produces with the help of its special prepare.py script for OpenWebText.
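The train.bin layout mentioned above is just a flat binary stream of uint16 token IDs. The following sketch (not part of NNTile; the file name and token values are illustrative) shows how such a file can be written and read back, assuming tokenization has already produced integer IDs that fit in 16 bits (the GPT2 vocabulary of 50257 tokens does):

```python
import numpy as np


def write_token_stream(token_ids, path):
    """Append token IDs to a binary file as native-endian uint16 values,
    the same flat layout that NanoGPT's prepare.py produces."""
    arr = np.asarray(token_ids, dtype=np.uint16)
    with open(path, "ab") as f:
        arr.tofile(f)


def read_token_stream(path):
    """Load the whole stream back as a 1-D uint16 array."""
    return np.fromfile(path, dtype=np.uint16)
```

A tokenized corpus can then be dumped chunk by chunk via repeated `write_token_stream` calls, and the resulting file passed to the example script as the training dataset.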
To try the example, build the Docker image (see Assembly above) and run a container. Once inside the container, run:
CUDA_VISIBLE_DEVICES=0 STARPU_NCPU=2 python /workspace/nntile/wrappers/python/examples/gpt2_custom_training.py --config-path=/workspace/nntile/wrappers/python/examples/gpt2_default_config.json --tokenizer=gpt2 --tokenizer-path=data --batch=1024 --minibatch=4 --minibatch-tile=4 --seq-tile=1024 --embd-tile=768 --inner-tile=3072 --head-tile=12 --restrict=cuda --flashattention --nforward=10 --nforward-warmup=10 --nbackward=10 --nbackward-warmup=10 --dataset=WikiText-103 --dataset-path=data --dataset-select=40000 --optimizer=fusedadamw --optimizer-eps=1e-8 --weight-decay=0.1 --loss-reduction=mean --lr=3e-4 --start-lr=0 --full-lr-iter=10 --nepochs=1 --nepochs-warmup=1
- Environment variable `CUDA_VISIBLE_DEVICES` limits visibility of GPUs to StarPU. If this variable is not set, StarPU will use all GPUs.
- Environment variable `STARPU_NCPU=2` limits how many CPU cores will be used. If the variable is unset, all CPU cores will be occupied.
- `/workspace/nntile/wrappers/python/examples/gpt2_custom_training.py` is the location of the example script.
- `--config-path` points to a JSON GPT2 configuration file. The example uses the default one, located at `/workspace/nntile/wrappers/python/examples/gpt2_default_config.json`.
- `--tokenizer=gpt2` selects the `gpt2` tokenizer from the HuggingFace `transformers` Python library.
- `--tokenizer-path=data` is the path to which the chosen tokenizer is downloaded.
- `--batch=1024` sets how many sequences form a batch. The optimizer step happens after gradients for the entire batch are collected.
- `--minibatch=4` defines how many sequences are processed by StarPU at once on the entire node. This value determines the maximum memory allocated by StarPU buffers. `batch` must be divisible by `minibatch`.
- `--minibatch-tile=4` defines how many sequences are processed by StarPU at once on a single computing unit (a CPU core or an entire GPU). `minibatch` must be divisible by `minibatch-tile`.
- `--seq-tile=1024` defines how many tokens of a sequence are processed by StarPU at once on a single computing unit. The sequence length, defined in the GPT2 config JSON file, must be divisible by this value.
- `--embd-tile=768` defines the size of the embedding slice processed by StarPU at once on a single computing unit. The embedding size defined in the GPT2 config JSON file does not restrict this value.
- `--inner-tile=3072` defines the size of the inner embedding inside the MLP of the GPT model processed by StarPU at once on a single computing unit. The inner embedding size of GPT2, which is 4 times the ordinary embedding size, does not restrict this value.
- `--head-tile=12` defines the number of attention heads processed by StarPU at once on a single computing unit. The number of heads, defined in the GPT2 config JSON file, must be divisible by this value.
- `--restrict=cuda` limits execution of low-level kernels to CUDA where applicable. Only CUDA devices will be used for computations, while certain auxiliary low-level kernels that are not implemented in CUDA will be executed on the CPU.
- `--flashattention` enables Flash Attention logic. Flash Attention itself is not yet implemented in NNTile, but turning this logic on reduces peak memory usage, which may improve performance considerably.
- `--nforward=10` sets the number of forward operations used to estimate performance.
- `--nforward-warmup=10` sets the number of warmup forward operations run before estimating performance.
- `--nbackward=10` sets the number of backward operations used to estimate performance.
- `--nbackward-warmup=10` sets the number of warmup backward operations run before estimating performance.
- `--dataset=WikiText-103` selects WikiText-103 as the training set. It will be downloaded automatically.
- `--dataset-path=data` is the path to which WikiText-103 is downloaded. If it is already there, it will not be downloaded again.
- `--dataset-select=40000` limits the entire WikiText-103 to only its first 40000 texts. This is enough to form 2 input batches of 1024 sequences of 1024 tokens each.
- `--optimizer=fusedadamw` selects the optimizer.
- `--optimizer-eps=1e-8` defines the optimizer regularization parameter.
- `--weight-decay=0.1` defines the weight decay for the chosen AdamW optimizer.
- `--loss-reduction=mean` prints the loss as an average over all tokens in an input batch.
- `--lr=3e-4` is the learning rate.
- `--start-lr=0` is the learning rate for the first input batch.
- `--full-lr-iter=10` defines which AdamW iteration first uses the full learning rate given by `--lr`. If this value is 1, all AdamW iterations use the full learning rate.
- `--nepochs=1` is the number of epochs.
- `--nepochs-warmup=1` is the number of warmup epochs.