NNTile


General purpose

NNTile is a framework for training large neural networks. It relies on a task-based parallel programming paradigm, which dynamically distributes computations across all available hardware resources and transmits data asynchronously. For this purpose NNTile utilizes the StarPU library.
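
To give a feel for the task-based paradigm, here is a toy analogy (not NNTile's actual API, which is C++/StarPU based) using Python's standard `concurrent.futures`: each tile operation is submitted as a task, and a scheduler dispatches tasks to free workers as their inputs become ready.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy analogy of a task-based runtime: each tile operation is a task;
# the scheduler (here a thread pool) dispatches tasks to whichever
# workers are free, and results flow between tasks asynchronously.
def scale(tile, alpha):
    return [alpha * x for x in tile]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

tiles = [[1.0, 2.0], [3.0, 4.0]]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Submitting builds an implicit task graph: 'add' waits only on
    # the two 'scale' tasks it depends on, not on a global barrier.
    scaled = [pool.submit(scale, t, 2.0) for t in tiles]
    total = pool.submit(add, scaled[0].result(), scaled[1].result())
    print(total.result())  # [8.0, 12.0]
```

In the real runtime, StarPU additionally chooses between CPU and GPU implementations of each task and overlaps data transfers with computation.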

Preliminary experimental results

Experiments with custom 4-layer and 8-layer GPT models with up to 50B parameters showed both good performance and the ability to train models 4 times larger than PyTorch FSDP allows on the same hardware (a single server with 8 x Nvidia A100 80GB SXM).

Figure: custom 4-layer model on 4 GPUs.

Authors

NNTile is developed by specialists from

  • Skolkovo Institute of Science and Technology (Skoltech)
  • Artificial Intelligence Research Institute (AIRI)

Main contributors are:

  • Aleksandr Mikhalev
  • Aleksandr Katrutsa
  • Konstantin Sozykin
  • Gleb Karpov
  • Daniel Bershatsky

Acknowledgement

The authors of NNTile would like to thank Ivan Oseledets for bringing the idea of this project to life.

The work was generously supported by the Center in the field of Artificial Intelligence in the direction of optimizing management decisions to reduce the carbon footprint on the basis of the Skolkovo Institute of Science and Technology under Contract No. 70-2021-00145/10841 dated 02.11.2021 (items 2.3.1, 2.3.3, 3.3.2 and 3.3.4) and Contract No. 10825/3978620 dated 26.08.2021.

This work was supported by FASIE (fasie.ru).

Assembly

NNTile comes with a Dockerfile to construct a Docker image with NNTile and all prerequisites. The image uses Miniforge (Conda) for dependencies and provides two build stages:

| Stage | Description |
|-------|-------------|
| sandbox | Prerequisites only (FXT, StarPU, Python, PyTorch, etc.). Use for NNTile development when building from local sources. |
| nntile | Full image with NNTile compiled from the repository. Default target. |

Building the image

Build the full image (sandbox + NNTile compiled):

docker build . -t nntile:latest

Build only the sandbox (prerequisites, no NNTile):

docker build . -t nntile_sandbox:latest --target sandbox

Build arguments (all optional):

| Argument | Default | Description |
|----------|---------|-------------|
| CUDA_VERSION | 12.9.1 | CUDA version for the base image |
| BASE_OS | ubuntu22.04 | Base OS for the CUDA image |
| BASE_IMAGE | nvidia/cuda:${CUDA_VERSION}-base-${BASE_OS} | Override the full base image |
| MAKE_JOBS | 4 | Parallelism for compiling FXT, StarPU, and NNTile |
| CUDA_ARCHS | 70;75;80;86;89;90;100;120 | Target CUDA architectures (semicolon-separated) |
| PYTHON_VERSION | 3.12 | Python version in the conda environment |
| PYTORCH_VERSION | 2.9.1 | PyTorch version |

Example with custom options:

docker build . -t nntile:latest \
    --build-arg MAKE_JOBS=8 \
    --build-arg CUDA_ARCHS="70;75;80;86;89;90" \
    --build-arg CUDA_VERSION=12.6.1

Running the container

Start an interactive shell (default):

docker run -it --gpus all nntile:latest

The container uses the nntile conda environment by default. The working directory is /workspace/nntile, and PYTHONPATH is set for the Python wrappers.

Minimal requirements

NNTile supports only CUDA devices of compute capability 8.0 or higher.

Jupyter notebook examples

Several examples (GPT2, LLaMa) can be found in the notebooks directory. With a built Docker image, launch Jupyter Lab with:

docker run -it --gpus all -p 8888:8888 nntile:latest jupyter lab --notebook-dir=/workspace/nntile --ip='*' --port=8888 --no-browser --allow-root

For the classic Jupyter Notebook interface:

docker run -it --gpus all -p 8888:8888 nntile:latest jupyter notebook --notebook-dir=/workspace/nntile --ip='*' --port=8888 --no-browser --allow-root

TensorBoard is available on port 6006. Expose it with -p 6006:6006 when running the container.

Minimal working GPT example

To train your own GPT model with NNTile, there is a minimal working example, gpt2_custom_training.py. It works either with the WikiText-103 dataset or with a dataset stored in train.bin format, which contains a flat stream of uint16 token ids, just as NanoGPT produces with its prepare.py script for OpenWebText.
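
A minimal sketch of writing and reading such a uint16 token stream with NumPy (the token ids below are illustrative; uint16 covers vocabularies up to 65535 tokens, which includes GPT-2's 50257-token vocabulary):

```python
import numpy as np

# Hypothetical sketch of preparing a NanoGPT-style train.bin: token ids
# are stored as a flat stream of uint16 values.
tokens = [15496, 995, 0]  # illustrative token ids
arr = np.array(tokens, dtype=np.uint16)
arr.tofile("train.bin")

# Reading the stream back is a plain uint16 load.
data = np.fromfile("train.bin", dtype=np.uint16)
print(data.tolist())  # [15496, 995, 0]
```

In a real preparation script, the token list would come from running the chosen tokenizer over the raw text corpus.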

To try the example, build the Docker image (see Assembly above) and run a container. Once inside the container, run:

CUDA_VISIBLE_DEVICES=0 STARPU_NCPU=2 python /workspace/nntile/wrappers/python/examples/gpt2_custom_training.py \
    --config-path=/workspace/nntile/wrappers/python/examples/gpt2_default_config.json \
    --tokenizer=gpt2 --tokenizer-path=data \
    --batch=1024 --minibatch=4 --minibatch-tile=4 \
    --seq-tile=1024 --embd-tile=768 --inner-tile=3072 --head-tile=12 \
    --restrict=cuda --flashattention \
    --nforward=10 --nforward-warmup=10 --nbackward=10 --nbackward-warmup=10 \
    --dataset=WikiText-103 --dataset-path=data --dataset-select=40000 \
    --optimizer=fusedadamw --optimizer-eps=1e-8 --weight-decay=0.1 \
    --loss-reduction=mean --lr=3e-4 --start-lr=0 --full-lr-iter=10 \
    --nepochs=1 --nepochs-warmup=1
  • Environment variable CUDA_VISIBLE_DEVICES limits visibility of GPUs to StarPU. If this variable is not set, StarPU will use all the GPUs.
  • Environment variable STARPU_NCPU=2 limits how many CPU cores will be used. If the variable is unset, all the CPU cores will be occupied.
  • /workspace/nntile/wrappers/python/examples/gpt2_custom_training.py is the location of the example script.
  • --config-path points to a JSON GPT2 configuration file. The example uses the default one, located at /workspace/nntile/wrappers/python/examples/gpt2_default_config.json.
  • --tokenizer=gpt2 selects the gpt2 tokenizer from the HuggingFace transformers Python library.
  • --tokenizer-path=data sets the path where the chosen tokenizer is downloaded.
  • --batch=1024 sets how many sequences form a batch. An optimizer step happens after gradients for the entire batch are accumulated.
  • --minibatch=4 defines how many sequences are processed by StarPU at once on the entire node. This value determines the maximum memory allocated for StarPU buffers. batch must be divisible by minibatch.
  • --minibatch-tile=4 defines how many sequences are processed by StarPU at once on a single computing unit (a CPU core or an entire GPU). minibatch must be divisible by minibatch-tile.
  • --seq-tile=1024 defines how many tokens of a sequence are processed by StarPU at once on a single computing unit. The sequence length, defined in the GPT2 config JSON file, must be divisible by this value.
  • --embd-tile=768 defines the size of the embedding slice processed by StarPU at once on a single computing unit. The embedding size, defined in the GPT2 config JSON file, does not restrict this value.
  • --inner-tile=3072 defines the size of the inner embedding of the GPT MLP processed by StarPU at once on a single computing unit. The inner embedding size of GPT2, which is 4 times the ordinary embedding size, does not restrict this value.
  • --head-tile=12 defines how many attention heads are processed by StarPU at once on a single computing unit. The number of heads, defined in the GPT2 config JSON file, must be divisible by this value.
  • --restrict=cuda limits execution of low-level kernels to CUDA where applicable. Only CUDA devices are used for computations, while certain auxiliary low-level kernels that are not implemented in CUDA run on the CPU.
  • --flashattention enables the Flash Attention logic. Flash Attention itself is not yet implemented in NNTile, but turning this logic on reduces peak memory usage, which can improve performance significantly.
  • --nforward=10 sets the number of forward passes used to estimate performance to 10.
  • --nforward-warmup=10 sets the number of warmup forward passes before the performance estimation to 10.
  • --nbackward=10 sets the number of backward passes used to estimate performance to 10.
  • --nbackward-warmup=10 sets the number of warmup backward passes before the performance estimation to 10.
  • --dataset=WikiText-103 sets WikiText-103 as the training set. It is downloaded automatically.
  • --dataset-path=data sets the path where WikiText-103 is downloaded. If it is already there, it is not downloaded again.
  • --dataset-select=40000 limits WikiText-103 to its first 40000 texts, which is enough for 2 input batches of 1024 sequences of 1024 tokens each.
  • --optimizer=fusedadamw selects the optimizer.
  • --optimizer-eps=1e-8 defines the epsilon regularization parameter of the optimizer.
  • --weight-decay=0.1 defines the weight decay for the chosen AdamW optimizer.
  • --loss-reduction=mean reports the loss as the average over all tokens in an input batch.
  • --lr=3e-4 is the learning rate.
  • --start-lr=0 is the learning rate for the first input batch.
  • --full-lr-iter=10 defines at which AdamW iteration the full learning rate, given by the --lr flag, is reached. If this value is 1, all AdamW iterations use the full learning rate.
  • --nepochs=1 is the number of epochs.
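
The divisibility constraints among these flags can be checked up front. Here is a small sketch (not NNTile's actual API; the function name and the config-derived values of sequence length 1024 and 12 heads are assumptions for illustration) that validates the tiling parameters from the command above and derives the number of gradient-accumulation steps per optimizer update:

```python
# Sketch of the tiling constraints stated above: batch must divide by
# minibatch, minibatch by minibatch-tile, the sequence length by
# seq-tile, and the number of attention heads by head-tile.
def check_tiling(batch, minibatch, minibatch_tile,
                 seq_len, seq_tile, n_head, head_tile):
    assert batch % minibatch == 0, "batch must be divisible by minibatch"
    assert minibatch % minibatch_tile == 0, \
        "minibatch must be divisible by minibatch-tile"
    assert seq_len % seq_tile == 0, \
        "sequence length must be divisible by seq-tile"
    assert n_head % head_tile == 0, \
        "head count must be divisible by head-tile"
    # Gradient-accumulation steps before each optimizer update.
    return batch // minibatch

# Values from the example command; seq_len and n_head come from the
# GPT2 config (assumed here: 1024 tokens, 12 heads).
print(check_tiling(1024, 4, 4, 1024, 1024, 12, 12))  # 256
```

With these settings, StarPU processes 4 sequences at a time and accumulates gradients over 256 such minibatches before each AdamW step.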
