nvshare: Practical GPU Sharing Without Memory Size Constraints

nvshare is a GPU sharing mechanism that allows multiple processes (or containers running on Kubernetes) to securely run on the same physical GPU concurrently, each having the whole GPU memory available.

You can watch a quick explanation plus a demonstration at https://www.youtube.com/watch?v=9n-5sc5AICY.

To achieve this, it transparently enables GPU page faults using the system RAM as swap space. To avoid thrashing, it uses nvshare-scheduler, which manages the GPU and gives exclusive GPU access to a single process for a given time quantum (TQ), which has a default duration of 30 seconds.

This functionality depends solely on the Unified Memory API provided by the NVIDIA kernel driver. It is highly unlikely that an update to NVIDIA's kernel drivers would interfere with the viability of this project, as that would require disabling Unified Memory.

The de facto way of handling GPUs on Kubernetes (NVIDIA's device plugin) is to assign them to containers in a 1-1 manner. This is especially inefficient for applications that only use a GPU in bursts throughout their execution, such as long-running interactive development jobs like Jupyter notebooks.

I've written a Medium article on the challenges of GPU sharing on Kubernetes; it's worth a read.

Indicative Use Cases

  • Run 2+ processes/containers with infrequent GPU bursts on the same GPU (e.g., interactive apps, ML inference)
  • Run 2+ non-interactive workloads (e.g., ML training) on the same GPU to minimize their total completion time and reduce queueing

<a name="features"/>

Features

  • Single GPU sharing among multiple processes/containers
  • Memory and fault isolation are guaranteed because co-located processes use different CUDA contexts, unlike other approaches such as NVIDIA MPS
  • Completely transparent to applications, no code changes needed
  • Each process/container has whole GPU memory available
    • Uses Unified Memory to swap GPU memory to system RAM
    • Scheduler optionally serializes overlapping GPU work to avoid thrashing (assigns exclusive access to one app for TQ seconds at a time)
    • Apps release GPU if done with work before TQ elapses
  • Device plugin for Kubernetes
<a name="key_idea"/>

Key Idea

  1. With cudaMalloc(), the sum of memory allocations from CUDA apps cannot exceed the physical GPU memory size (Σ(mem_allocs) <= GPU_mem_size).
  2. Hooking and replacing all cudaMalloc() calls in an application with cudaMallocManaged(), i.e., transparently forcing the use of CUDA's Unified Memory API, does not affect correctness and only leads to a ~1% slowdown.
  3. If we apply (2), constraint (1) no longer holds for an application written using cudaMalloc().
  4. When we oversubscribe GPU memory (Σ(mem_allocs) > GPU_mem_size), we must take care to avoid thrashing when the working sets of co-located apps (i.e., the data they are actively using) don't fit in GPU memory (Σ(wss) > GPU_mem_size). We use nvshare-scheduler to serialize work on the GPU to avoid thrashing. If we don't serialize the work, the frequent (every few ms) context switches of NVIDIA's black-box scheduler between the co-located apps will cause thrashing.
  5. If we know that Σ(wss) <= GPU_mem_size, we can disable nvshare-scheduler's anti-thrashing mode.
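
The conditions in the list above can be sketched as a small classifier. This is an illustrative model in plain C, not code from nvshare: the enum and function names are hypothetical, and the allocation and working-set sizes are assumed to be supplied by the user.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the situations described in the list above. */
enum sharing_state {
    FITS_IN_GPU,         /* sum(mem_allocs) <= GPU_mem_size */
    OVERSUBSCRIBED_SAFE, /* sum(mem_allocs) > GPU_mem_size, sum(wss) <= GPU_mem_size */
    NEEDS_SERIALIZATION  /* sum(wss) > GPU_mem_size: keep anti-thrashing mode on */
};

/* Add up n sizes given in bytes. */
static uint64_t sum_bytes(const uint64_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Classify a set of co-located apps from their allocations and working sets. */
enum sharing_state classify_sharing(const uint64_t *mem_allocs,
                                    const uint64_t *wss, size_t n_apps,
                                    uint64_t gpu_mem_size)
{
    assert(gpu_mem_size > 0);
    if (sum_bytes(mem_allocs, n_apps) <= gpu_mem_size)
        return FITS_IN_GPU;
    if (sum_bytes(wss, n_apps) <= gpu_mem_size)
        return OVERSUBSCRIBED_SAFE;
    return NEEDS_SERIALIZATION;
}
```

For example, two apps that each allocate 10 GiB on a 16 GiB GPU oversubscribe it, but if their working sets are 4 GiB and 6 GiB the result is OVERSUBSCRIBED_SAFE, which corresponds to case (5).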
<a name="supported_gpus"/>

Supported GPUs

nvshare relies on Unified Memory's dynamic page fault handling mechanism introduced in the Pascal microarchitecture.

It supports any Pascal (2016) or newer Nvidia GPU.

It has only been tested on Linux systems.

<a name="overview"/>

Overview

<a name="components"/>

nvshare components

  • nvshare-scheduler, which is responsible for managing a single Nvidia GPU. It schedules the GPU "lock" among co-located clients that want to submit work on the GPU. It assigns exclusive access to the GPU to clients in an FCFS manner, for TQ seconds at a time.
  • libnvshare.so, which we inject into CUDA applications through LD_PRELOAD and which:
    • Interposes (hooks) the application's calls to the CUDA API, converting normal memory allocation calls to their Unified Memory counterparts
    • Implements the client side of nvshare, which communicates with the nvshare-scheduler instance to gain exclusive access to the GPU each time the application wants to do computations on the GPU.
  • nvsharectl, which is a command-line tool used to configure the status of an nvshare-scheduler instance.
<a name="details_scheduler"/>

Some Details on nvshare-scheduler

IMPORTANT: nvshare currently supports only one GPU per node, as nvshare-scheduler is hardcoded to use the Nvidia GPU with ID 0.

nvshare-scheduler's job is to prevent thrashing. It assigns exclusive usage of the whole GPU and its physical memory to a single application at a time, handling requests from applications in an FCFS manner. Each app uses the GPU for at most TQ seconds. If the app is idle, it releases the GPU early. When it wants to compute something on the GPU at a later point, it again requests GPU access from the scheduler. When the scheduler gives it access to the GPU, the app gradually fetches its data to the GPU via page faults.

If the combined GPU memory usage of the co-located applications fits in the available GPU memory, they can seamlessly run in parallel.

However, when the combined memory usage exceeds the total GPU memory, nvshare-scheduler must serialize GPU work from different processes in order to avoid thrashing.

The anti-thrashing mode of nvshare-scheduler is enabled by default. You can configure this using nvsharectl. We currently have no way of automatically detecting thrashing, so the anti-thrashing mode must be toggled manually.
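
To make the FCFS and TQ behaviour concrete, here is a minimal simulation in plain C. It is a sketch under stated assumptions, not the scheduler's actual implementation: all clients request the GPU at time 0, a client whose quantum expires with work left is assumed to re-request the lock and join the back of the queue, and the cost of faulting pages back in is ignored. The names simulate_fcfs and MAX_CLIENTS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLIENTS 16

/*
 * Simulate FCFS granting of the GPU lock with a time quantum.
 * remaining[i] is client i's pending GPU work in seconds (modified in place);
 * finish[i] receives the time at which client i completes.
 * Returns the time at which the GPU becomes idle.
 */
long simulate_fcfs(long *remaining, long *finish, size_t n, long tq)
{
    size_t queue[MAX_CLIENTS]; /* circular FIFO of client indices */
    size_t head = 0, count = 0;
    long now = 0;

    assert(n <= MAX_CLIENTS && tq > 0);

    /* Every client with work requests the lock at time 0, in index order. */
    for (size_t i = 0; i < n; i++) {
        finish[i] = 0;
        if (remaining[i] > 0) {
            queue[(head + count) % MAX_CLIENTS] = i;
            count++;
        }
    }

    while (count > 0) {
        size_t c = queue[head];
        head = (head + 1) % MAX_CLIENTS;
        count--;

        /* Hold the GPU for at most TQ seconds; release early if done. */
        long held = remaining[c] < tq ? remaining[c] : tq;
        now += held;
        remaining[c] -= held;

        if (remaining[c] > 0) {
            /* Assumption: unfinished clients re-request and go to the tail. */
            queue[(head + count) % MAX_CLIENTS] = c;
            count++;
        } else {
            finish[c] = now;
        }
    }
    return now;
}
```

With a TQ of 30 seconds and three clients needing 45, 10, and 70 seconds of GPU time, the short job finishes at 40 seconds because it releases the GPU early, while the longer jobs alternate in 30-second turns.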

<a name="single_oversub"/>

Memory Oversubscription For a Single Process

nvshare allows each co-located process to use the whole physical GPU memory. By default, it doesn't allow a single process to allocate more memory than the GPU can hold, as this can lead to internal thrashing for the process, regardless of the existence of other processes on the same GPU.

If you get a CUDA_ERROR_OUT_OF_MEMORY error, your application tried to allocate more memory than the total capacity of the GPU.

You can set the NVSHARE_ENABLE_SINGLE_OVERSUB=1 environment variable to enable a single process to use more memory than is physically available on the GPU. This can lead to degraded performance.
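
The admission rule described above can be sketched as follows. This is an illustration in plain C rather than nvshare's code: the function names are hypothetical, and treating only the exact value "1" as enabling the flag is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: only the exact string "1" enables single-process oversubscription. */
int oversub_flag_is_set(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Read the flag from the environment variable named in the text above. */
int single_oversub_enabled(void)
{
    return oversub_flag_is_set(getenv("NVSHARE_ENABLE_SINGLE_OVERSUB"));
}

/*
 * Decide whether a process may make a new allocation.
 * Returns 0 to admit it, -1 to reject it (the caller would then report
 * an out-of-memory error to the application). Sizes are in bytes.
 */
int admit_allocation(uint64_t already_allocated, uint64_t requested,
                     uint64_t gpu_mem_size, int allow_oversub)
{
    assert(gpu_mem_size > 0);
    if (allow_oversub)
        return 0;
    /* Overflow-safe form of: already_allocated + requested <= gpu_mem_size */
    if (requested <= gpu_mem_size &&
        already_allocated <= gpu_mem_size - requested)
        return 0;
    return -1;
}
```

A hook would call admit_allocation(..., single_oversub_enabled()) before forwarding the request, so the default behaviour is to reject any allocation that pushes a single process past the GPU's capacity.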

<a name="scheduler_tq"/>

The Scheduler's Time Quantum (TQ)

The TQ has effect only when the scheduler's anti-thrashing mode is enabled.

The scheduler's TQ dictates how long the scheduler assigns the GPU to a client at a time. A larger time quantum sacrifices interactivity (latency) in favor of throughput (utilization), and vice versa.

You shouldn't set the time quantum to a very small value (< 10 seconds): fetching the pages of the app that just acquired the GPU lock takes a few seconds, which leaves too little of the quantum for actual computation.
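
A rough way to see the effect of a small TQ is to compute the share of each quantum left for computation. The sketch below is a coarse model in plain C, not a measurement: it treats page fetching as a fixed up-front cost per turn, whereas pages are actually fetched gradually via page faults, and the function name is hypothetical.

```c
#include <assert.h>

/*
 * Fraction of a time quantum left for computation after the app has
 * faulted its pages back onto the GPU. Both arguments are in seconds.
 */
double useful_fraction(double tq_seconds, double fetch_seconds)
{
    assert(tq_seconds > 0.0 && fetch_seconds >= 0.0);
    if (fetch_seconds >= tq_seconds)
        return 0.0; /* the whole turn is spent fetching pages */
    return (tq_seconds - fetch_seconds) / tq_seconds;
}
```

Assuming a hypothetical fetch cost of 3 seconds, the default 30-second TQ leaves about 90% of each turn for computation, while a 5-second TQ would leave only 40%.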

To minimize the overall completion time of a set of sequential (batch) jobs, you can set the TQ to a very large value.

Without nvshare, you would run out of memory and have to run one job after another.

With nvshare:

  • Only the GPU portions of the jobs will run serialized on the GPU; the CPU parts will run in parallel
  • Each application will hold the GPU only while it runs code on it (due to the early release mechanism)
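
The difference between the two cases can be estimated with a simple model. The sketch below, in plain C, assumes each job consists of one CPU phase followed by one GPU phase, that all jobs start at time 0, and that the TQ is large enough for a job to finish its GPU phase in a single turn. These are simplifying assumptions for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_JOBS 32

/* Without sharing: jobs run one after another. */
long sequential_makespan(const long *cpu, const long *gpu, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += cpu[i] + gpu[i];
    return total;
}

/* With sharing: CPU phases overlap, GPU phases are served FCFS. */
long shared_makespan(const long *cpu, const long *gpu, size_t n)
{
    int done[MAX_JOBS] = {0};
    long gpu_free_at = 0;

    assert(n <= MAX_JOBS);

    for (size_t served = 0; served < n; served++) {
        /* The next requester is the unserved job whose CPU phase ends first. */
        size_t next = n;
        for (size_t i = 0; i < n; i++)
            if (!done[i] && (next == n || cpu[i] < cpu[next]))
                next = i;

        /* It starts on the GPU once both it and the GPU are ready. */
        long start = cpu[next] > gpu_free_at ? cpu[next] : gpu_free_at;
        gpu_free_at = start + gpu[next];
        done[next] = 1;
    }
    return gpu_free_at;
}
```

For three jobs with CPU phases of 20, 5, and 10 seconds and GPU phases of 10 seconds each, this model gives 65 seconds when run back to back and 35 seconds when the jobs share the GPU.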
<a name="further_reading"/>

Further Reading

Our ICSE 2024 paper is available at https://grgalex.gr/assets/pdf/nvshare_icse24.pdf.

nvshare is based on my diploma thesis titled "Dynamic memory management for the efficient utilization of graphics processing units in interactive machine learning development", published in July 2021 and available at http://dx.doi.org/10.26240/heal.ntua.21988.

Thesis:

The title and first part are in Greek, but the second part is the full thesis in English. You can also find it at grgalex-thesis.pdf in the root of this repo.


<a name="deploy_local"/>

Deploy on a Local System

<a name="installation_local"/>

Installation (Local)

For compatibility reasons, it is best to build nvshare from source for your system before installing.

  1. (Optional) Download the latest release tarball from the Releases tab or through the command-line:

    wget https://gith
    