kmcuda
Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA
"Yinyang" K-means and K-nn using NVIDIA CUDA
K-means implementation is based on "Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup". While it introduces some overhead and many conditional clauses which are bad for CUDA, it still shows 1.6-2x speedup against the Lloyd algorithm. K-nearest neighbors employ the same triangle inequality idea and require precalculated centroids and cluster assignments, similar to the flattened ball tree.
| Benchmarks | sklearn KMeans | KMeansRex | KMeansRex OpenMP | Serban | kmcuda | kmcuda 2 GPUs |
|------------|----------------|-----------|------------------|--------|--------|---------------|
| speed      | 1x             | 4.5x      | 8.2x             | 15.5x  | 17.8x  | 29.8x         |
| memory     | 1x             | 2x        | 2x               | 0.6x   | 0.6x   | 0.6x          |
Technically, this project is a shared library which exports the two functions
defined in `kmcuda.h`: `kmeans_cuda` and `knn_cuda`.
It has built-in Python3 and R native extension support, so you can
`from libKMCUDA import kmeans_cuda` or `dyn.load("libKMCUDA.so")`.
Table of contents
- K-means
- K-nn
- Notes
- Building
- Testing
- Benchmarks
- Python examples
- Python API
- R examples
- R API
- C examples
- C API
- License
K-means
The major difference between this project and others is that kmcuda is optimized for low memory consumption and a large number of clusters. E.g., kmcuda can sort 4M samples in 480 dimensions into 40000 clusters (if you have several days and 12 GB of GPU memory); 300K samples are grouped into 5000 clusters in 4½ minutes on NVIDIA Titan X (15 iterations); 3M samples and 1000 clusters take 20 minutes (33 iterations). Yinyang can be turned off to save GPU memory, but the slower Lloyd algorithm will be used instead. Four centroid initialization schemes are supported: random, k-means++, AFKMC2 and import. Two distance metrics are supported: L2 (the usual one) and angular (arccos of the scalar product); L1 is in development. 16-bit float support delivers 2x memory compression. If you have several GPUs, they can be utilized together, which gives the corresponding linear speedup for either Lloyd or Yinyang.
The code has been thoroughly tested to yield bit-to-bit identical results from Yinyang and Lloyd. "Fast and Provably Good Seedings for k-Means" was adapted from the reference code.
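For reference, a single iteration of the Lloyd algorithm that both backends ultimately compute can be sketched in plain numpy. This is an illustrative sketch, not kmcuda's CUDA code; the helper name is made up:

```python
import numpy as np

def lloyd_iteration(samples, centroids):
    # Assignment step: index of the nearest centroid by squared L2 distance.
    dists = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned samples;
    # empty clusters keep their previous position here (kmcuda reports
    # empty clusters with NaN features, as the Notes section explains).
    new_centroids = centroids.copy()
    for c in range(len(centroids)):
        members = samples[assignments == c]
        if len(members):
            new_centroids[c] = members.mean(axis=0)
    return new_centroids, assignments

rng = np.random.RandomState(0)
samples = rng.rand(1000, 8).astype(np.float32)
centroids = samples[:4].copy()            # "import"-style initialization
for _ in range(5):
    centroids, assignments = lloyd_iteration(samples, centroids)
```

Yinyang obtains its speedup by maintaining per-group distance bounds that let it skip most of the assignment-step distance computations above.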
K-nn
The centroid distance matrix C<sub>ij</sub> is calculated together with the clusters' radii R<sub>i</sub> (the maximum distance from the centroid to the corresponding cluster's members). Given a sample S in cluster A, we avoid calculating the distances from S to another cluster B's members if C<sub>AB</sub> - SA - R<sub>B</sub> is greater than the current maximum K-nn distance. This resembles the ball tree algorithm.
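The pruning rule follows from two applications of the triangle inequality: every member m of cluster B satisfies dist(S, m) ≥ C<sub>AB</sub> - SA - R<sub>B</sub>. It can be checked numerically; the variable names below are illustrative, not kmcuda's internals:

```python
import numpy as np

rng = np.random.RandomState(7)
centroid_a = rng.rand(16)
centroid_b = rng.rand(16) + 3.0                    # a well-separated cluster B
members_b = centroid_b + 0.1 * rng.randn(32, 16)
r_b = np.linalg.norm(members_b - centroid_b, axis=1).max()  # cluster radius
s = centroid_a + 0.05 * rng.randn(16)              # sample S assigned to A

c_ab = np.linalg.norm(centroid_a - centroid_b)     # centroid distance C_AB
sa = np.linalg.norm(s - centroid_a)                # distance SA
lower_bound = c_ab - sa - r_b                      # triangle-inequality bound

# No member of B can be closer to S than the bound, so when the bound
# exceeds the current k-th neighbor distance, cluster B is skipped whole.
true_min = np.linalg.norm(members_b - s, axis=1).min()
assert true_min >= lower_bound
```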
The implemented algorithm is tolerant to NaNs. There are two variants depending on whether k is small enough to fit the sample's neighbors into CUDA shared memory. Internally, the neighbors list is a binary heap, which reduces the complexity multiplier from O(k) to O(log k).
The implementation yields identical results to `sklearn.neighbors.NearestNeighbors`
except in cases where adjacent distances are equal and the order is undefined.
That is, the returned indices are sorted in increasing order of the
corresponding distances.
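The bounded neighbor list mentioned above can be modeled with Python's `heapq`. This is a sketch of the idea, not the CUDA code:

```python
import heapq

def k_nearest(distances, k):
    # heapq is a min-heap, so distances are negated: the root is then the
    # current worst (farthest) of the k best neighbors, evictable in O(log k)
    # instead of the O(k) scan a flat array would need.
    heap = []
    for idx, dist in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-dist, idx))
        elif -heap[0][0] > dist:
            heapq.heapreplace(heap, (-dist, idx))
    # Return indices sorted by increasing distance, as the library does.
    return [idx for _, idx in sorted((-nd, idx) for nd, idx in heap)]

print(k_nearest([5.0, 1.0, 4.0, 2.0, 3.0], k=3))  # [1, 3, 4]
```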
Notes
Lloyd is tolerant to samples with NaN features, while Yinyang is not. It may happen that some of the resulting clusters contain zero elements; in such cases, their features are set to NaN.
The angular (cosine) distance metric effectively results in Spherical K-Means behavior. The samples must be normalized to unit L2 norm before clustering; this is not done automatically. For unit vectors a and b, the distance is arccos(a·b), the arccos of their scalar product.

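A minimal numpy sketch of the required preprocessing and of the angular distance itself (the helper names are illustrative, not part of the library's API):

```python
import numpy as np

def normalize_l2(samples):
    # Project every sample onto the unit sphere; required before
    # clustering with the angular metric.
    norms = np.linalg.norm(samples, axis=1, keepdims=True)
    return samples / norms

def angular_distance(a, b):
    # arccos of the scalar product of two unit vectors; the clip guards
    # against tiny floating-point excursions outside [-1, 1].
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

unit = normalize_l2(np.array([[3.0, 4.0], [0.0, 2.0]]))
print(round(angular_distance(unit[0], unit[0]), 6))  # 0.0
```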
If you get an out-of-memory error with the default parameters, set `yinyang_t` to 0, which
forces Lloyd. `verbosity` 2 will print the memory allocation statistics
(all GPU allocation happens at startup).
The data type is either 32- or 16-bit float. The number of samples is limited to 2^32, the number of clusters to 2^32, and the number of features to 2^16 (2^17 for fp16). Besides, the product of the number of clusters and the number of features may not exceed 2^32.
In the case of 16-bit floats, the reduced precision often leads to a slightly increased number of iterations; Yinyang is especially sensitive to that. In some cases there may be overflows, and the clustering may fail completely.
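Both failure modes are easy to reproduce with numpy's `float16`:

```python
import numpy as np

# fp16 caps out at 65504, so a squared L2 distance overflows long
# before fp32 would:
d = np.float16(300.0)
print(d * d)          # inf: 300**2 = 90000 exceeds the fp16 range

# Precision is about 3 decimal digits, so small updates vanish entirely:
print(np.float16(1.0) + np.float16(0.0001) == np.float16(1.0))  # True
```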
Building
```
git clone https://github.com/src-d/kmcuda
cd kmcuda/src
cmake -DCMAKE_BUILD_TYPE=Release . && make
```
It requires cudart 8.0 / Pascal and an OpenMP 4.0 capable compiler. The build has
been tested primarily on Linux, but it works on macOS too with some bells and whistles
(see the "macOS" subsection).
If you do not want to build the Python native module, add `-D DISABLE_PYTHON=y`.
If you do not want to build the R native module, add `-D DISABLE_R=y`.
If CUDA is not automatically found, add `-D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0`
(change the path to the actual one). By default, CUDA kernels are compiled for
architecture 60 (Pascal). It is possible to override it via `-D CUDA_ARCH=52`,
but fp16 support will be disabled then.
Python users:

```
CUDA_ARCH=61 pip install libKMCUDA
```

Or install it from source:

```
CUDA_ARCH=61 pip install git+https://github.com/src-d/kmcuda.git#subdirectory=src
```

In both cases, replace 61 with your device's architecture version.
Binary Python packages are quite hard to provide because they depend on CUDA and device architecture versions. PRs welcome!
macOS
The macOS build is tricky, but possible. The instructions below reflect the state of things about a year ago and may be outdated. Please help with updates!
Install Homebrew and the Command Line Developer Tools
which are compatible with your CUDA installation. E.g., CUDA 8.0 does not support
the latest 8.x releases of those tools and works with 7.3.1 and below. Install clang with OpenMP support
and Python with numpy:

```
brew install llvm --with-clang
brew install python3
pip3 install numpy
```
Then configure with this magic command and run `make` afterwards:

```
CC=/usr/local/opt/llvm/bin/clang CXX=/usr/local/opt/llvm/bin/clang++ LDFLAGS=-L/usr/local/opt/llvm/lib/ cmake -DCMAKE_BUILD_TYPE=Release .
```

Finally, rename `*.dylib` to `*.so` so that Python is able to import the native extension:

```
mv libKMCUDA.{dylib,so}
```
Testing
`test.py` contains the unit tests based on `unittest`.
They require either `cuda4py` or `pycuda`, and
`scikit-learn`.
`test.R` contains the R integration tests and should be run with `Rscript`.
Benchmarks
100000x256@1024
|            | sklearn KMeans | KMeansRex | KMeansRex OpenMP | Serban | kmcuda | kmcuda 2 GPUs |
|------------|----------------|-----------|------------------|--------|--------|---------------|
| time, s    | 164            | 36        | 20               | 10.6   | 9.2    | 5.5           |
| memory, GB | 1              | 2         | 2                | 0.6    | 0.6    | 0.6           |
Configuration
- 16-core (32 threads) Intel Xeon E5-2620 v4 @ 2.10GHz
- 256 GB RAM Samsung M393A2K40BB1
- Nvidia Titan X 2016
Contestants
- `sklearn.cluster.KMeans`@0.18.1; `KMeans(n_clusters=1024, init="random", max_iter=15, random_state=0, n_jobs=1, n_init=1)`.
- KMeansRex@288c40a with `-march=native` and Eigen 3.3; `KMeansRex.RunKMeans(data, 1024, Niter=15, initname=b"random")`.
- KMeansRex with additional `-fopenmp`.
- Serban KMeans@83e76bf built for arch 6.1; `./cuda_main -b -i serban.bin -n 1024 -t 0.0028 -o`.
- kmcuda v6.1 built for arch 6.1; `libKMCUDA.kmeans_cuda(dataset, 1024, tolerance=0.002, seed=777, init="random", verbosity=2, yinyang_t=0, device=0)`.
- kmcuda running on 2 GPUs.
Data
100000 random samples uniformly distributed between 0 and 1 in 256 dimensions.
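Such a dataset can be reproduced with numpy; the seed below is arbitrary and not the one used in the benchmark:

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
# 100000 samples uniformly distributed in [0, 1), 256 dimensions, fp32.
dataset = rng.rand(100000, 256).astype(np.float32)
```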
Notes
100000 samples is the maximum dataset size that Serban KMeans can handle.

