Cocos (Core Computational System) - Scientific GPU Computing in Python

Overview

Cocos is a package for numeric and scientific computing on GPUs for Python with a NumPy-like API. It supports both CUDA and OpenCL on Windows, Mac OS, and Linux. Internally, it relies on the ArrayFire C/C++ library. Cocos offers a multi-GPU map-reduce framework. In addition to its numeric functionality, it allows parallel computation of SymPy expressions on the GPU.

Highlights

  • Fast vectorized computation on GPUs with a NumPy-like API.
  • Multi-GPU support via map-reduce.
  • High-performance random number generators for beta, chi-square, exponential, gamma, logistic, lognormal, normal, uniform, and Wald distributions. Antithetic random numbers for uniform and normal distributions.
  • Provides a GPU equivalent to SymPy's lambdify, which enables numeric evaluation of symbolic SymPy (multi-dimensional array) expressions on the GPU for vectors of input parameters in parallel.
  • Adaptation of SciPy's gaussian_kde to the GPU

Table of Contents

  1. Installation
  2. Getting Started
  3. Multi-GPU Computing
  4. Memory Limitations on the GPU Device
  5. Examples
    5.1. Estimating Pi via Monte Carlo
    5.2. Option Pricing in a Stochastic Volatility Model via Monte Carlo
    5.3. Numeric evaluation of SymPy array expressions on the GPU
    5.4. Kernel Density Estimation
  6. Benchmark
  7. Functionality
  8. Limitations and Differences with NumPy
  9. A Note on Hardware Configurations for Multi-GPU Computing
  10. License

Installation

0. Prerequisites

NVidia CUDA must be installed on the system.

1. Download and Install Arrayfire

Windows

ArrayFire 3.8.0

Linux via Installer

ArrayFire 3.8.0

Ubuntu Linux 20.04 and Derivatives via APT

  • Follow the instructions here: https://github.com/arrayfire/arrayfire/wiki/Install-ArrayFire-From-Linux-Package-Managers, concretely
    • sudo apt-key adv --fetch-key https://repo.arrayfire.com/GPG-PUB-KEY-ARRAYFIRE-2020.PUB
    • Register the ArrayFire repo as a software source for apt-get via echo "deb [arch=amd64] https://repo.arrayfire.com/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/arrayfire.list
      (if your distribution is based on a different version of Ubuntu, you must replace focal with the code name obtained via lsb_release -c)
    • Update software sources and install ArrayFire via sudo apt-get update && sudo apt-get install arrayfire

Ubuntu Linux 20.04 and Derivatives via Docker

Docker must be installed on your system. See here for instructions: https://docs.docker.com/engine/install/ubuntu/.

  • Set up the nvidia-docker plugin as follows:
    • Add nvidia's cryptographic key to apt-get via curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
    • Add nvidia's repo to the software sources apt-get is able to access via curl -s -L https://nvidia.github.io/nvidia-container-runtime/ubuntu20.04/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    • Update the software sources and install the nvidia container runtime via sudo apt-get update && sudo apt-get install nvidia-container-runtime
    • Stop the Docker daemon via sudo systemctl stop docker
    • Restart the Docker daemon via sudo systemctl start docker
  • Build the Docker image from the Dockerfile via sudo docker build --tag cocos . (this only needs to be done the first time)
  • Run a Docker container based on the image created in the previous step via sudo docker run -it --gpus all cocos
  • To test the installation:
    • Navigate to the Monte-Carlo example via cd examples/monte_carlo_pi
    • Run Monte-Carlo example via python3 -m monte_carlo_pi

MacOS up until High Sierra (version 10.13.6)

ArrayFire 3.5.1

2. Make sure that your System is able to locate ArrayFire's libraries

ArrayFire's functionality is contained in dynamic libraries: dynamic link libraries (.dll) on Windows and shared objects (.so) on Unix.

This step ensures that these library files can be located on your system. On Windows, this can be done by adding %AF_PATH%\lib to the PATH environment variable. On Linux and MacOS, one can either install (or copy) the ArrayFire libraries and their dependencies to /usr/local/lib or modify the environment variable LD_LIBRARY_PATH (Linux) or DYLD_LIBRARY_PATH (MacOS) to include the ArrayFire library directory.
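On Linux, for example, the linker search path can be extended for the current shell session as follows (the directory /opt/arrayfire/lib64 below is a placeholder, not a path prescribed by ArrayFire; substitute the actual location of the installed libraries):

```shell
# Prepend ArrayFire's library directory to the dynamic linker search path.
# /opt/arrayfire/lib64 is a placeholder; use the directory where the
# ArrayFire installer placed its .so files.
export LD_LIBRARY_PATH=/opt/arrayfire/lib64:$LD_LIBRARY_PATH
```

To make the change permanent, add the line to ~/.bashrc or an equivalent shell startup file.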

3. Install Cocos via PIP:

<pre> pip install cocos </pre>

or

<pre> pip3 install cocos </pre>

if not using Anaconda.

To get the latest version, clone the repository from github, open a terminal/command prompt, navigate to the root folder and install via

<pre> pip install . </pre>

or

<pre> pip3 install . </pre>

if not using Anaconda.

Getting Started

Platform Information:

Print available devices

<pre>
import cocos.device as cd

cd.info()
</pre>

Select a device

<pre> cd.ComputeDeviceManager.set_compute_device(0) </pre>

First Steps:

<pre>
# begin by importing the numerics package
import cocos.numerics as cn

# create two arrays from lists
a = cn.array([[1.0, 2.0],
              [3.0, 4.0]])

b = cn.array([[5.0],
              [6.0]])

# print their contents
print(a)
print(b)

# compute the matrix product of a and b
c = a @ b
print(c)

# create an array of normally distributed random numbers
d = cn.random.randn(2, 2)
print(d)
</pre>

Multi-GPU Computing:

Cocos provides map-reduce as well as the related map-combine as multi-GPU programming models. The computations are separated into 'batches' and then distributed across GPU devices in a pool. Cocos implements multi-GPU support via process-based parallelism.

To run the function my_gpu_function over separate batches of input data on multiple GPUs in parallel, first create a ComputeDevicePool:

<pre> compute_device_pool = cocos.multi_processing.device_pool.ComputeDevicePool() </pre>

To construct the batches, separate the arguments of the function into

  • a list of args lists (one list per batch) and
  • a list of kwargs dictionaries (one dictionary per batch)
<pre>
args_list = [args_list_1, args_list_2, ..., args_list_n]
kwargs_list = [kwargs_dict_1, kwargs_dict_2, ..., kwargs_dict_n]
</pre>

Run the function in separate batches via map_reduce

<pre>
result = \
    compute_device_pool.map_reduce(f=my_gpu_function,
                                   reduction=my_reduction_function,
                                   initial_value=...,
                                   host_to_device_transfer_function=...,
                                   device_to_host_transfer_function=...,
                                   args_list=args_list,
                                   kwargs_list=kwargs_list)
</pre>

The reduction function iteratively aggregates two results from the list of results generated by my_gpu_function from left to right, beginning at initial_value (i.e. it first reduces initial_value with the result of my_gpu_function for the first batch). The list of results is in the same order as the lists of args and kwargs.
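This left-to-right fold can be sketched in plain Python without a GPU; my_gpu_function and my_reduction_function below are illustrative stand-ins, not Cocos code:

```python
from functools import reduce

# Illustrative stand-in for the per-batch GPU computation.
def my_gpu_function(x, y):
    return x + y

# One result per batch, in the same order as args_list/kwargs_list.
batch_results = [my_gpu_function(1, 2),
                 my_gpu_function(3, 4),
                 my_gpu_function(5, 6)]

# Illustrative stand-in for the reduction: aggregate two results at a time.
def my_reduction_function(accumulated, next_result):
    return accumulated + next_result

# The fold starts at initial_value and proceeds from left to right.
initial_value = 0
result = reduce(my_reduction_function, batch_results, initial_value)
print(result)  # 21
```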

If the function requires an input array on the GPU, the array must be provided to map_reduce as a NumPy array. The data is then sent to the process managing the GPU assigned to that batch, where it is moved to the GPU device by a host_to_device_transfer_function. This function needs to be implemented by the user.

Likewise, results that involve GPU arrays are transferred to the host via a user-supplied device_to_host_transfer_function and are then sent back to the main process before reduction takes place.

map_combine is a variation of map_reduce, in which a combination function aggregates the list of results in a single step.
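The difference can be sketched in plain Python (my_combination_function is an illustrative stand-in, not the Cocos API): a combination function receives the whole list of per-batch results at once, rather than folding them pairwise:

```python
# Illustrative stand-in for a map_combine combination function:
# it aggregates the entire list of per-batch results in one step.
def my_combination_function(results):
    return sum(results)

batch_results = [3.0, 7.0, 11.0]
combined = my_combination_function(batch_results)
print(combined)  # 21.0
```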

Please refer to the documentation of cocos.multi_processing.device_pool.ComputeDevicePool.map_reduce as well as cocos.multi_processing.device_pool.ComputeDevicePool.map_combine for further details. See 'examples/heston_pricing_multi_gpu_example.py' for a fully worked example.

Memory Limitations on the GPU Device

It is common for modern standard desktop computers to support up to 128GB of RAM. Video cards, by contrast, feature only a small fraction of that amount as VRAM. The consequence is that algorithms that work well on a CPU can run into memory limitations when run on a GPU device.

In some cases this problem can be resolved by running the computation in batches or chunks and transferring results from the GPU to the host after each batch has been processed.
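The chunked pattern can be sketched in plain Python; process_chunk below is a hypothetical stand-in for the GPU computation on a single chunk followed by a device-to-host transfer:

```python
# CPU-only sketch of batch/chunk processing: run the computation on one
# fixed-size chunk at a time and accumulate the per-chunk results, as one
# would after each device-to-host transfer.
def process_chunk(chunk):
    # Hypothetical stand-in for work done on the GPU device.
    return sum(x * x for x in chunk)

data = list(range(10))
chunk_size = 4  # only chunk_size elements are "on the device" at a time
total = 0
for start in range(0, len(data), chunk_size):
    total += process_chunk(data[start:start + chunk_size])

print(total)  # 285
```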

Using map_reduce_single_device and map_combine_single_device found in cocos.multi_processing.single_device_batch_processing, computations on a single GPU can be split into chunks and run sequentially. The interface is modeled after the multi GPU functionality described in the previous section.

Calls to map_reduce_single_device and map_combine_single_device can be nested in a multi GPU computation, which is how multi GPU evaluation of kernel density estimates is realized in Cocos (see cocos.scientific.kde.gaussian_kde.evaluate_in_batches_on_multiple_gpus).

Packaged examples:

  1. Estimating Pi via Monte Carlo
  2. Option Pricing in a Stochastic Volatility Model via Monte Carlo
  3. Numeric evaluation of SymPy array expressions on the GPU
  4. Kernel Density Estimation