NNTile


General purpose

NNTile is a framework for training large neural networks. It relies on a task-based parallel programming paradigm, which dynamically distributes computations across all available hardware resources and transmits data asynchronously. For this purpose NNTile utilizes the StarPU library.
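
To give a feel for the task-based paradigm, here is a toy analogy (not NNTile's actual API, which is C++/StarPU based) using Python's standard `concurrent.futures`: each tile operation is submitted as a task, and a scheduler dispatches tasks to free workers as their inputs become ready.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy analogy of a task-based runtime: each tile operation is a task;
# the scheduler (here a thread pool) dispatches tasks to whichever
# workers are free, and results flow between tasks asynchronously.
def scale(tile, alpha):
    return [alpha * x for x in tile]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

tiles = [[1.0, 2.0], [3.0, 4.0]]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Submitting builds an implicit task graph: 'add' waits only on
    # the two 'scale' tasks it depends on, not on a global barrier.
    scaled = [pool.submit(scale, t, 2.0) for t in tiles]
    total = pool.submit(add, scaled[0].result(), scaled[1].result())
    print(total.result())  # [8.0, 12.0]
```

In the real runtime, StarPU additionally chooses between CPU and GPU implementations of each task and overlaps data transfers with computation.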

Preliminary experimental results

Experiments with custom 4-layer and 8-layer GPT models with up to 50B parameters showed both good performance and the ability to train models 4 times larger than PyTorch FSDP allows on the same hardware (a single server with 8 x Nvidia A100 80GB SXM).

Figure: custom 4-layer model on 4 GPUs.

Authors

NNTile is developed by specialists from

  • Skolkovo Institute of Science and Technology (Skoltech)
  • Artificial Intelligence Research Institute (AIRI)

Main contributors are:

  • Aleksandr Mikhalev
  • Aleksandr Katrutsa
  • Konstantin Sozykin
  • Gleb Karpov
  • Daniel Bershatsky

Acknowledgement

The authors of NNTile would like to thank Ivan Oseledets for bringing the idea of this project to life.

The work was generously supported by the Center in the field of Artificial Intelligence in the direction of optimizing management decisions to reduce the carbon footprint on the basis of the Skolkovo Institute of Science and Technology under Contract No. 70-2021-00145/10841 dated 02.11.2021 (items 2.3.1, 2.3.3, 3.3.2 and 3.3.4) and Contract No. 10825/3978620 dated 26.08.2021.

This work was supported by FASIE (fasie.ru).

Assembly

NNTile comes with a Dockerfile to construct a Docker image with NNTile and all prerequisites. The image uses Miniforge (Conda) for dependencies and provides two build stages:

| Stage | Description |
|-------|-------------|
| sandbox | Prerequisites only (FXT, StarPU, Python, PyTorch, etc.). Use for NNTile development when building from local sources. |
| nntile | Full image with NNTile compiled from the repository. Default target. |

Building the image

Build the full image (sandbox + NNTile compiled):

docker build . -t nntile:latest

Build only the sandbox (prerequisites, no NNTile):

docker build . -t nntile_sandbox:latest --target sandbox

Build arguments (all optional):

| Argument | Default | Description |
|----------|---------|-------------|
| CUDA_VERSION | 12.9.1 | CUDA version for the base image |
| BASE_OS | ubuntu22.04 | Base OS for the CUDA image |
| BASE_IMAGE | nvidia/cuda:${CUDA_VERSION}-base-${BASE_OS} | Override the full base image |
| MAKE_JOBS | 4 | Parallelism for compiling FXT, StarPU, and NNTile |
| CUDA_ARCHS | 70;75;80;86;89;90;100;120 | Target CUDA architectures (semicolon-separated) |
| PYTHON_VERSION | 3.12 | Python version in the conda environment |
| PYTORCH_VERSION | 2.9.1 | PyTorch version |

Example with custom options:

docker build . -t nntile:latest \
    --build-arg MAKE_JOBS=8 \
    --build-arg CUDA_ARCHS="70;75;80;86;89;90" \
    --build-arg CUDA_VERSION=12.6.1

Running the container

Start an interactive shell (default):

docker run -it --gpus all nntile:latest

The container uses the nntile conda environment by default. The working directory is /workspace/nntile, and PYTHONPATH is set for the Python wrappers.

Minimal requirements

NNTile supports only CUDA devices of compute capability 8.0 or higher.

Jupyter notebook examples

Several examples (GPT2, LLaMa) can be found in the notebooks directory. With a built Docker image, launch Jupyter Lab with:

docker run -it --gpus all -p 8888:8888 nntile:latest jupyter lab --notebook-dir=/workspace/nntile --ip='*' --port=8888 --no-browser --allow-root

For the classic Jupyter Notebook interface:

docker run -it --gpus all -p 8888:8888 nntile:latest jupyter notebook --notebook-dir=/workspace/nntile --ip='*' --port=8888 --no-browser --allow-root

TensorBoard is available on port 6006. Expose it with -p 6006:6006 when running the container.

Minimal working GPT example

To train your own GPT model with NNTile, there is a minimal working example, gpt2_custom_training.py. It works either with the WikiText-103 dataset or with a dataset stored in train.bin format, which contains a flat stream of uint16 token ids, just as NanoGPT produces with its prepare.py script for OpenWebText.
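
A minimal sketch of writing and reading such a uint16 token stream with NumPy (the token ids below are illustrative; uint16 covers vocabularies up to 65535 tokens, which includes GPT-2's 50257-token vocabulary):

```python
import numpy as np

# Hypothetical sketch of preparing a NanoGPT-style train.bin: token ids
# are stored as a flat stream of uint16 values.
tokens = [15496, 995, 0]  # illustrative token ids
arr = np.array(tokens, dtype=np.uint16)
arr.tofile("train.bin")

# Reading the stream back is a plain uint16 load.
data = np.fromfile("train.bin", dtype=np.uint16)
print(data.tolist())  # [15496, 995, 0]
```

In a real preparation script, the token list would come from running the chosen tokenizer over the raw text corpus.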

To try the example, build the Docker image (see Assembly above) and run a container. Once inside the container, run:

CUDA_VISIBLE_DEVICES=0 STARPU_NCPU=2 python /workspace/nntile/wrappers/python/examples/gpt2_custom_training.py \
    --config-path=/workspace/nntile/wrappers/python/examples/gpt2_default_config.json \
    --tokenizer=gpt2 --tokenizer-path=data \
    --batch=1024 --minibatch=4 --minibatch-tile=4 \
    --seq-tile=1024 --embd-tile=768 --inner-tile=3072 --head-tile=12 \
    --restrict=cuda --flashattention \
    --nforward=10 --nforward-warmup=10 --nbackward=10 --nbackward-warmup=10 \
    --dataset=WikiText-103 --dataset-path=data --dataset-select=40000 \
    --optimizer=fusedadamw --optimizer-eps=1e-8 --weight-decay=0.1 \
    --loss-reduction=mean --lr=3e-4 --start-lr=0 --full-lr-iter=10 \
    --nepochs=1 --nepochs-warmup=1
  • Environment variable CUDA_VISIBLE_DEVICES limits visibility of GPUs to StarPU. If this variable is not set, StarPU will use all the GPUs.
  • Environment variable STARPU_NCPU=2 limits how many CPU cores will be used. If the variable is unset, all the CPU cores will be occupied.
  • /workspace/nntile/wrappers/python/examples/gpt2_custom_training.py is the location of the example script.
  • --config-path points to a JSON GPT2 configuration file. The example uses the default one, located at /workspace/nntile/wrappers/python/examples/gpt2_default_config.json.
  • --tokenizer=gpt2 selects the gpt2 tokenizer from the HuggingFace transformers Python library.
  • --tokenizer-path=data sets the path where the chosen tokenizer is downloaded.
  • --batch=1024 sets how many sequences form a batch. An optimizer step happens after gradients for the entire batch are accumulated.
  • --minibatch=4 defines how many sequences are processed by StarPU at once on the entire node. This value determines the maximum memory allocated for StarPU buffers. batch must be divisible by minibatch.
  • --minibatch-tile=4 defines how many sequences are processed by StarPU at once on a single computing unit (a CPU core or an entire GPU). minibatch must be divisible by minibatch-tile.
  • --seq-tile=1024 defines how many tokens of a sequence are processed by StarPU at once on a single computing unit. The sequence length, defined in the GPT2 config JSON file, must be divisible by this value.
  • --embd-tile=768 defines the size of the embedding slice processed by StarPU at once on a single computing unit. The embedding size, defined in the GPT2 config JSON file, does not restrict this value.
  • --inner-tile=3072 defines the size of the inner embedding of the GPT MLP processed by StarPU at once on a single computing unit. The inner embedding size of GPT2, which is 4 times the ordinary embedding size, does not restrict this value.
  • --head-tile=12 defines how many attention heads are processed by StarPU at once on a single computing unit. The number of heads, defined in the GPT2 config JSON file, must be divisible by this value.
  • --restrict=cuda limits execution of low-level kernels to CUDA where applicable. Only CUDA devices are used for computations, while certain auxiliary low-level kernels that are not implemented in CUDA run on the CPU.
  • --flashattention enables the Flash Attention logic. Flash Attention itself is not yet implemented in NNTile, but turning this logic on reduces peak memory usage, which can improve performance significantly.
  • --nforward=10 sets the number of forward passes used to estimate performance to 10.
  • --nforward-warmup=10 sets the number of warmup forward passes before the performance estimation to 10.
  • --nbackward=10 sets the number of backward passes used to estimate performance to 10.
  • --nbackward-warmup=10 sets the number of warmup backward passes before the performance estimation to 10.
  • --dataset=WikiText-103 sets WikiText-103 as the training set. It is downloaded automatically.
  • --dataset-path=data sets the path where WikiText-103 is downloaded. If it is already there, it is not downloaded again.
  • --dataset-select=40000 limits WikiText-103 to its first 40000 texts, which is enough for 2 input batches of 1024 sequences of 1024 tokens each.
  • --optimizer=fusedadamw selects the optimizer.
  • --optimizer-eps=1e-8 defines the epsilon regularization parameter of the optimizer.
  • --weight-decay=0.1 defines the weight decay for the chosen AdamW optimizer.
  • --loss-reduction=mean reports the loss as the average over all tokens in an input batch.
  • --lr=3e-4 is the learning rate.
  • --start-lr=0 is the learning rate for the first input batch.
  • --full-lr-iter=10 defines at which AdamW iteration the full learning rate, given by the --lr flag, is reached. If this value is 1, all AdamW iterations use the full learning rate.
  • --nepochs=1 is the number of epochs.
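
The divisibility constraints among these flags can be checked up front. Here is a small sketch (not NNTile's actual API; the function name and the config-derived values of sequence length 1024 and 12 heads are assumptions for illustration) that validates the tiling parameters from the command above and derives the number of gradient-accumulation steps per optimizer update:

```python
# Sketch of the tiling constraints stated above: batch must divide by
# minibatch, minibatch by minibatch-tile, the sequence length by
# seq-tile, and the number of attention heads by head-tile.
def check_tiling(batch, minibatch, minibatch_tile,
                 seq_len, seq_tile, n_head, head_tile):
    assert batch % minibatch == 0, "batch must be divisible by minibatch"
    assert minibatch % minibatch_tile == 0, \
        "minibatch must be divisible by minibatch-tile"
    assert seq_len % seq_tile == 0, \
        "sequence length must be divisible by seq-tile"
    assert n_head % head_tile == 0, \
        "head count must be divisible by head-tile"
    # Gradient-accumulation steps before each optimizer update.
    return batch // minibatch

# Values from the example command; seq_len and n_head come from the
# GPT2 config (assumed here: 1024 tokens, 12 heads).
print(check_tiling(1024, 4, 4, 1024, 1024, 12, 12))  # 256
```

With these settings, StarPU processes 4 sequences at a time and accumulates gradients over 256 such minibatches before each AdamW step.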
