μ-cuDNN
Accelerating DNN Convolutional Layers with Micro-batches
<img src="misc/ucudnn.png" width="480px">

μ-cuDNN is a transparent wrapper for the NVIDIA cuDNN library that splits a mini-batch into micro-batches to speed up computation. μ-cuDNN is intended to be combined with deep learning frameworks written in C++, such as Caffe and TensorFlow.
Reference
This repository contains the code used in
Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka, μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, arXiv e-prints, 2018. [URL]
Please cite as:
```
@article{ucudnn,
  author        = {Yosuke Oyama and Tal Ben-Nun and Torsten Hoefler and Satoshi Matsuoka},
  title         = {{{\(\mu\)}-cuDNN}: Accelerating Deep Learning Frameworks with Micro-Batching},
  journal       = {CoRR},
  volume        = {abs/1804.04806},
  year          = {2018},
  url           = {http://arxiv.org/abs/1804.04806},
  archivePrefix = {arXiv},
  eprint        = {1804.04806},
}
```
Requirements
- GCC >= 4.8.5 (should support `-std=c++11`)
- CMake >= 3.9.2
- CUDA >= 8
- cuDNN >= 6
- GLPK >= 4.63 (optional)
- SQLite >= 3.21 (optional)
Performance
DeepBench
<img src="misc/deepbench_3x3_5x5.png" width="960px">

This figure shows the relative speedups of DeepBench's 3x3 and 5x5 convolution layers on an NVIDIA Tesla P100-SXM2 GPU, with a mini-batch size of 256 and workspace limits of 128, 256, and 512 MiB. μ-cuDNN achieves speedups of up to 2.31x for 3x3 layers and 3.85x for 5x5 layers.
CIFAR-10 Training
<img src="misc/cifar10.png" width="720px">

This figure shows learning curves of a CIFAR-10 CNN defined in Caffe with three different micro-batch policies, using a mini-batch size of 1024 and a workspace limit of 64 MiB. The CNN achieves ~80% test accuracy, similar to the official Caffe result.
Installation
1. Compile μ-cuDNN with CMake:

   ```
   mkdir build && cd build
   cmake .. -DCMAKE_INSTALL_PREFIX:PATH="/path/to/ucudnn"
   make
   make install
   ```

2. Add μ-cuDNN to your search paths:

   ```
   export CPLUS_INCLUDE_PATH=/path/to/ucudnn/include:$CPLUS_INCLUDE_PATH
   export LD_LIBRARY_PATH=/path/to/ucudnn/lib:$LD_LIBRARY_PATH
   ```

3. Modify your deep learning framework:
   - Add `#include <ucudnn/ucudnn.h>` to the `*.cpp`, `*.cu`, and `*.h` files that contain `cudnnHandle_t`.
   - Replace `cudnnHandle_t` with `UcudnnHandle_t`.

4. Compile the framework:
   - You need to link `libucudnn.so` to the framework explicitly by adding the `-lucudnn` flag.
   - In some frameworks you also need to specify the `-std=c++11` flag.
   - For example, the following CMake flags are needed to compile μ-cuDNN-enabled Caffe:
     `-DCMAKE_SHARED_LINKER_FLAGS="-lucudnn" -DCMAKE_EXE_LINKER_FLAGS="-lucudnn" -DCMAKE_CXX_FLAGS="-std=c++11"`

Here you can find μ-cuDNN-enabled forks of Caffe and TensorFlow, which let you skip steps 3 and 4.
CMake options
| Option | Default | Description |
|--------|---------|-------------|
| UCUDNN_USE_GLPK | OFF | Use GNU Linear Programming Kit (GLPK) to run ILP-based optimization. |
| UCUDNN_USE_SQLITE | OFF | Use SQLite to cache benchmark result in file systems. |
| UCUDNN_DEBUG_GLPK_PRINT | OFF | Output ILP information after solving (as glpk_{sol,mip,prob}_(UNIX time)) to the current directory. The output formats are based on the glp_print_sol, glp_print_mip, glp_write_prob functions of GLPK respectively. |
| UCUDNN_DEBUG_OUTPUT | ON | Output optimization results to stderr. |
| UCUDNN_DEBUG_OUTPUT_ENV | OFF | Output used environment variables to stderr. |
| UCUDNN_DEBUG_EQUIVALENCE_TEST | OFF | Compute normalized L2-distance between output tensors of cuDNN and μ-cuDNN to check whether these convolutions are equivalent. |
| UCUDNN_TEST | ON | Build tests. |
Notes on `UCUDNN_DEBUG_EQUIVALENCE_TEST`:
- The normalized distance may be larger than zero due to numerical error and the nondeterministic behavior of some algorithms.
- In practice the normalized distance should be less than 1e-6.
- Since this option computes the distance on every call, it considerably slows down computation. Turn it off for practical training or inference.
Runtime options (environment variables)
| Variable | Acceptable values | Default | Description |
|----------|-------------------|---------|-------------|
| UCUDNN_BATCH_SIZE_POLICY | one of `undivided`, `powerOfTwo`, `all` | `powerOfTwo` | μ-batch size policy. This can be set via `UcudnnHandle_t::setOptimizerBatchSizePolicy` as well. |
| UCUDNN_BENCHMARK_DEVICES | comma-separated integers (e.g. 0,1,2,3) or all | The current device of the process | GPU device ID(s) that are used for benchmarking in parallel. Note that optimization will be incorrect if different kinds of GPUs are used at the same time. |
| UCUDNN_DEFAULT_WORKSPACE_LIMIT | integer | 67108864 (i.e. 64 MiB) | The default workspace limit in bytes for convolutional layers. This is used only if the framework tries to get the workspace size before the workspace limit is provided. |
| UCUDNN_TOTAL_WORKSPACE_SIZE | integer | N/A | If this is set, μ-cuDNN will create a static workspace size with the specified size in bytes for ILP-based optimization. Workspace limit passed via conventional cuDNN functions will be ignored. |
| UCUDNN_DATABASE | path | N/A | If this is set, μ-cuDNN will use a SQLite 3 database at the specified path to cache the benchmark results. |
