
NVIDIA HierarchicalKV (Beta)


About HierarchicalKV

HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements.

The key capability of HierarchicalKV is to store key-value pairs (feature embeddings) on the high-bandwidth memory (HBM) of GPUs and in host memory.

You can also use the library for generic key-value storage.

Benefits

When building large recommender systems, machine learning (ML) engineers face the following challenges:

  • GPUs are needed, but HBM on a single GPU is too small for the large DLRMs that scale to several terabytes.
  • Improving communication performance is getting more difficult in larger and larger CPU clusters.
  • It is difficult to efficiently control consumption growth of limited HBM with customized strategies.
  • Most generic key-value libraries make poor use of HBM and host memory.

HierarchicalKV alleviates these challenges and helps machine learning engineers in RecSys with the following benefits:

  • Supports training large RecSys models on HBM and host memory at the same time.
  • Provides better performance by fully bypassing the CPU and reducing the communication workload.
  • Implements table-size restraint strategies, based on LRU or customized policies, as CUDA kernels.
  • Operates at a high working-status load factor that is close to 1.0.

Key ideas

  • Buckets are locally ordered
  • Keys and values are stored separately
  • All keys are stored in HBM
  • Built-in and customizable eviction strategies

HierarchicalKV makes NVIDIA GPUs more suitable for training large and super-large models for search, recommendations, and advertising. The library simplifies the common challenges of building, evaluating, and serving sophisticated recommender models.

API Documentation

The main classes and structs are listed below, but reading the comments in the source code is recommended.

For the full API documentation, please refer to the API Docs.

API Maturity Matrix

industry-validated means the API has been well-tested and verified in at least one real-world scenario.

| Name | Description | Function |
|:---------------------|:---------------------------------------------------------------------------------------------------------------------------|:-------------------|
| insert_or_assign | Insert or assign for the specified keys. <br>Overwrite the key with the minimum score when a bucket is full. | industry-validated |
| insert_and_evict | Insert new keys, and evict the keys with the minimum score when a bucket is full. | industry-validated |
| find_or_insert | Search for the specified keys, and insert them when missed. | well-tested |
| assign | Update for each key, and bypass when missed. | well-tested |
| accum_or_assign | Search and update for each key. If found, add the value as a delta to the original value. <br>If missed, update it directly. | well-tested |
| find_or_insert* | Search for the specified keys and return the pointers of the values. Insert them first when missing. | well-tested |
| find | Search for the specified keys. | industry-validated |
| find* | Search and return the pointers of the values; thread-unsafe but with high performance. | well-tested |
| export_batch | Export a certain number of the key-value-score tuples. | industry-validated |
| export_batch_if | Export a certain number of the key-value-score tuples that match specific conditions. | industry-validated |
| warmup | Move the hot key-values from HMEM to HBM. | June 15, 2023 |
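As a rough usage sketch of the core APIs above, the snippet below is modeled on the example in the repository's README; the header name, template parameters, and call signatures should be verified against `merlin_hashtable.cuh` in the source, since the API is still in beta:

```cpp
#include "merlin_hashtable.cuh"  // from the HierarchicalKV repository

using K = uint64_t;  // key type
using V = float;     // value (embedding element) type
using S = uint64_t;  // score type; must be uint64_t
using EvictStrategy = nv::merlin::EvictStrategy;
using HKVTable = nv::merlin::HashTable<K, V, S, EvictStrategy::kLru>;

// d_keys, d_vectors, and d_founds are assumed to be device buffers
// allocated and populated by the caller.
void sketch(size_t n, K* d_keys, V* d_vectors, bool* d_founds,
            cudaStream_t stream) {
  nv::merlin::HashTableOptions options;
  options.init_capacity = 1024 * 1024;
  options.max_capacity = 1024 * 1024;
  options.dim = 64;

  HKVTable table;
  table.init(options);

  // Insert or overwrite n key/vector pairs; with Lru the scores
  // argument can be left null and the device clock is used.
  table.insert_or_assign(n, d_keys, d_vectors, /*scores=*/nullptr, stream);

  // Look up the same keys; d_founds[i] reports whether d_keys[i] was hit.
  table.find(n, d_keys, d_vectors, d_founds, /*scores=*/nullptr, stream);
}
```

This sketch requires a CUDA-capable GPU and the HierarchicalKV headers; it is illustrative rather than a complete program.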

Evict Strategy

The score defines the importance of each key: the larger the score, the more important the key and the less likely it is to be evicted. Eviction happens only when a bucket is full. The score_type must be uint64_t. For more detail, please refer to class EvictStrategy.

| Name | Definition of Score |
|:---------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Lru | Device clock in nanoseconds, which could differ slightly from the host clock. |
| Lfu | Frequency increment provided by the caller via the scores input parameter of the insert-like APIs. |
| EpochLru | The high 32 bits are the global epoch provided via the global_epoch input parameter; <br>the low 32 bits are equal to (device_clock >> 20) & 0xffffffff, with granularity close to 1 ms. |
| EpochLfu | The high 32 bits are the global epoch provided via the global_epoch input parameter; <br>the low 32 bits are the frequency, <br>which stays constant after reaching the maximum value of 0xffffffff. |
| Customized | Fully provided by the caller via the scores input parameter of the insert-like APIs. |

  • Note:
    • The insert-like APIs are insert_or_assign, insert_and_evict, find_or_insert, accum_or_assign, and find_or_insert*.
    • The global_epoch should be maintained by the caller and passed as an input parameter to the insert-like APIs.

Configuration Options

It's recommended to keep the default configuration for the options ending with *.

| Name | Type | Default | Description |
|:---------------------------|:-------|:--------|:-------------------------------------------------------|
| init_capacity | size_t | 0 | The initial capacity of the hash table. |
| max_capacity | size_t | 0 | The maximum capacity of the hash table. |
| max_hbm_for_vectors | size_t | 0 | The maximum HBM for vectors, in bytes. |
| dim | size_t | 64 | The dimension of the value vectors. |
| max_bucket_size* | size_t | 128 | The length of each bucket. |
| max_load_factor* | float | 0.5f | The max load factor before rehashing. |
| block_size* | int | 128 | The default block size for CUDA kernels. |
| io_block_size* | int | 1024 | The block size for IO CUDA kernels. |
| device_id* | int | -1 | The ID of the device. Managed internally when set to -1. |
| io_by_cpu* | bool | false | The flag indicating whether the CPU handles IO. |
| reserved_key_start_bit | int | 0 | The start bit offset of the reserved keys in the 64-bit key space. |
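A configuration sketch using the options above; the field names follow the table, while `nv::merlin::HashTableOptions` and the `GB` helper are taken from the repository's examples and should be verified against the source:

```cpp
#include "merlin_hashtable.cuh"  // from the HierarchicalKV repository

nv::merlin::HashTableOptions options;
options.init_capacity = 64 * 1024 * 1024;          // starting number of key slots
options.max_capacity = 64 * 1024 * 1024;           // hard upper bound on table size
options.max_hbm_for_vectors = nv::merlin::GB(16);  // vectors beyond this spill to host memory
options.dim = 64;                                  // embedding dimension
// The options marked with * (max_bucket_size, max_load_factor, block_size,
// io_block_size, device_id, io_by_cpu) are left at their defaults, as recommended.
```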

Reserved Keys

  • By default, the keys 0xFFFFFFFFFFFFFFFD, 0xFFFFFFFFFFFFFFFE, and 0xFFFFFFFFFFFFFFFF are reserved for internal use. Change options.reserved_key_start_bit if you need to use these keys yourself. reserved_key_start_bit has a valid range from 0 to 62; the default value of 0 yields the reserved keys above. When reserved_key_start_bit is set to any value other than 0, the least significant bit (bit 0) is always 0 for every reserved key.

  • Setting reserved_key_start_bit = 1:

    • This setting reserves bits 1 and 2, the two lowest usable bits, for the reserved keys.
    • In binary, the last four bits of a reserved key range from 1000 to 1110: the least significant bit (bit 0) is always 0, and bits 3 through 63 are all set to 1.
    • The new reserved keys in hexadecimal repr