
Anakin

A high-performance, cross-platform inference engine. You can run Anakin on x86 CPU, ARM, NVIDIA GPU, AMD GPU, Bitmain, and Cambricon devices.


Anakin2.0


Welcome to the Anakin GitHub.

Anakin is a cross-platform, high-performance inference engine, originally developed by Baidu engineers and deployed at large scale in industrial products.

Please refer to our release announcement to track the latest features of Anakin.

Features

  • Flexibility

    Anakin supports a wide range of neural network architectures and hardware platforms. It is easy to run Anakin on GPU, x86, and ARM platforms.

    Anakin is integrated with NVIDIA TensorRT, and the integration API is open source. Developers can call the API directly or modify it as needed, which gives more flexibility during development.

  • High performance

    To make full use of the hardware, we optimized forward prediction at several levels.

    • Automatic graph fusion. The goal of all performance optimizations under a given algorithm is to make the ALU as busy as possible. Operator fusion can effectively reduce memory access and keep the ALU busy.

    • Memory reuse. Forward prediction is a one-way calculation. We reuse the memory between the input and output of different operators, thus reducing the overall memory overhead.

    • Assembly level optimization. Saber, the underlying DNN library for Anakin, is deeply optimized at the assembly level.
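The memory reuse idea can be illustrated with a minimal sketch. It assumes a strictly sequential operator chain, where operator i reads tensor i and writes tensor i + 1, so two alternating buffers are enough. The function name `plan_buffers` and the tensor sizes are hypothetical; this is not Anakin's actual allocator, and graphs with branches would need a proper liveness analysis.

```python
def plan_buffers(tensor_sizes):
    """Assign each tensor in a linear operator chain to one of two shared buffers.

    tensor_sizes[i] is the size in bytes of the i-th tensor in the chain
    (index 0 is the network input, the last index is the final output).
    Operator i reads tensor i and writes tensor i + 1, so tensor i is dead
    once operator i has finished and its buffer can hold tensor i + 2.
    """
    buffer_sizes = [0, 0]
    assignment = []
    for index, size in enumerate(tensor_sizes):
        slot = index % 2  # alternate so an operator's input and output never share a buffer
        buffer_sizes[slot] = max(buffer_sizes[slot], size)
        assignment.append(slot)
    return assignment, buffer_sizes


# Hypothetical activation sizes (bytes) for a five-tensor chain.
sizes = [600, 800, 400, 400, 100]
assignment, buffers = plan_buffers(sizes)

naive_total = sum(sizes)     # one private buffer per tensor
reused_total = sum(buffers)  # two buffers shared by all tensors
print(assignment, buffers, naive_total, reused_total)
```

With these illustrative sizes, the two shared buffers need 1400 bytes in total, against 2300 bytes when every tensor gets its own allocation.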

NV GPU Benchmark

Machine And Environment

CPU: Intel(R) Xeon(R) CPU 5117 @ 2.0GHz
GPU: Tesla P4
cuda: CUDA8
cuDNN: v7

  • Time: 10 warm-up runs, then the average over 1000 timed runs
  • Latency (ms) and memory (MB) for different batch sizes
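The timing method described above (untimed warm-up runs followed by an average over many timed runs) can be sketched as follows. The helper `average_latency_ms` and the placeholder workload `fake_forward` are assumptions made for illustration and are not part of Anakin's benchmark code. On a GPU, the device would also have to be synchronized before each clock read, since kernel launches are asynchronous.

```python
import time


def average_latency_ms(run_once, warmup=10, iterations=1000):
    """Return the mean wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):
        run_once()  # untimed: lets caches and allocators settle

    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iterations


# Placeholder workload standing in for one forward pass.
def fake_forward():
    sum(i * i for i in range(1000))


latency = average_latency_ms(fake_forward, warmup=10, iterations=1000)
print(f"average latency: {latency:.4f} ms")
```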

Anakin is compared against NVIDIA TensorRT 5, a widely recognized high-performance inference engine. For models that TensorRT 5 does not support, we use custom plugins.

<span id = '1'> VGG16 </span>

| Batch_Size | RT Latency FP32 (ms) | Anakin2 Latency FP32 (ms) | RT Memory (MB) | Anakin2 Memory (MB) |
|------------|----------------------|---------------------------|----------------|---------------------|
| 1 | 8.52532 | 8.2387 | 1090.89 | 702 |
| 2 | 14.1209 | 13.8772 | 1056.02 | 768.76 |
| 4 | 24.4529 | 24.3391 | 1002.17 | 840.54 |
| 8 | 46.7956 | 46.3309 | 1098.98 | 935.61 |

<span id = '2'> Resnet50 </span>

| Batch_Size | RT Latency FP32 (ms) | Anakin2 Latency FP32 (ms) | RT Latency INT8 (ms) | Anakin2 Latency INT8 (ms) | RT Memory FP32 (MB) | Anakin2 Memory FP32 (MB) |
|------------|----------------------|---------------------------|----------------------|---------------------------|---------------------|--------------------------|
| 1 | 4.6447 | 3.0863 | 1.78892 | 1.61537 | 1134.88 | 311.25 |
| 2 | 6.69187 | 5.13995 | 2.71136 | 2.70022 | 1108.86 | 382 |
| 4 | 11.1943 | 9.20513 | 4.16771 | 4.77145 | 885.96 | 406.86 |
| 8 | 19.8769 | 17.1976 | 6.2798 | 8.68197 | 813.84 | 532.61 |
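To read the ResNet-50 latencies as relative speedups, the ratio of TensorRT latency to Anakin2 latency can be computed per batch size. The sketch below copies the latency columns from the table above; the name `speedups` is illustrative only.

```python
# Rows copied from the ResNet-50 table above:
# (batch size, TensorRT FP32 ms, Anakin2 FP32 ms, TensorRT INT8 ms, Anakin2 INT8 ms)
RESNET50 = [
    (1, 4.6447, 3.0863, 1.78892, 1.61537),
    (2, 6.69187, 5.13995, 2.71136, 2.70022),
    (4, 11.1943, 9.20513, 4.16771, 4.77145),
    (8, 19.8769, 17.1976, 6.2798, 8.68197),
]


def speedups(rows):
    """Map batch size to (FP32 ratio, INT8 ratio); a ratio above 1 means Anakin2 is faster."""
    return {
        batch: (rt_fp32 / an_fp32, rt_int8 / an_int8)
        for batch, rt_fp32, an_fp32, rt_int8, an_int8 in rows
    }


ratios = speedups(RESNET50)
for batch, (fp32, int8) in ratios.items():
    print(f"batch {batch}: FP32 {fp32:.2f}x, INT8 {int8:.2f}x")
```

On these numbers Anakin2 is faster in FP32 at every batch size, while in INT8 it is faster at batch sizes 1 and 2 and slower than TensorRT at batch sizes 4 and 8.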

<span id = '3'> Resnet101 </span>

| Batch_Size | RT Latency (ms) | Anakin2 Latency (ms) | RT Latency INT8 (ms) | Anakin2 Latency INT8 (ms) | RT Memory (MB) | Anakin2 Memory (MB) |
|------------|-----------------|----------------------|----------------------|---------------------------|----------------|---------------------|
| 1 | 9.98695 | 5.44947 | 2.81031 | 2.74399 | 1159.16 | 500.5 |
| 2 | 17.3489 | 8.85699 | 4.8641 | 4.69473 | 1158.73 | 492 |
| 4 | 20.6198 | 16.8214 | 7.11608 | 8.45324 | 1021.68 | 541.08 |
| 8 | 31.9653 | 33.5015 | 11.2403 | 15.4336 | 914.49 | 611.54 |

X86 CPU Benchmark

Machine And Environment

CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz with HT, for FP32 test
CPU: Intel(R) Xeon(R) Gold 6271 CPU @ 2.60GHz with HT, for INT8 test
System: CentOS 6.3 with GCC 4.8.2, for benchmark between Anakin and Intel Caffe

  • All tests run with 8 threads in parallel
  • Time: 10 warm-up runs, then the average over 200 timed runs

Anakin is compared against Intel Caffe (1.1.6) with MKLML.

| Net_Name | Batch_Size | Anakin2 Latency (2650v4) fp32 (ms) | Caffe Latency (2650v4) fp32 (ms) | Anakin2 Latency int8 (6271) (ms) |
|-------------|----|-------------------------------------|-----------------------------------|---------------------------------|
| resnet50 | 1 | 20.6201 | 24.1369 | 3.20866 |
| resnet50 | 2 | 39.2286 | 43.1096 | 5.44311 |
| resnet50 | 4 | 77.1392 | 81.8814 | 9.93424 |
| resnet50 | 8 | 152.941 | 158.321 | 19.5618 |
| vgg16 | 1 | 55.6132 | 70.532 | 15.3181 |
| vgg16 | 2 | 96.5034 | 131.451 | 22.5082 |
| vgg16 | 4 | 180.479 | 247.926 | 37.2974 |
| vgg16 | 8 | 346.619 | 485.44 | 67.6682 |
| mobilenetv1 | 1 | 3.98104 | 5.42775 | 0.926546 |
| mobilenetv1 | 2 | 7.27079 | 9.16058 | 1.35007 |
| mobilenetv1 | 4 | 14.4029 | 16.2505 | 2.37271 |
| mobilenetv1 | 8 | 29.1651 | 29.8381 | 3.75992 |
| vgg16_ssd | 1 | 125.948 | 143.412 | |
| vgg16_ssd | 2 | 247.242 | 266.22 | |
| vgg16_ssd | 4 | 488.377 | 510.978 | |
| vgg16_ssd | 8 | 972.762 | 995.407 | |
| mobilenetv2 | 1 | 3.78504 | 23.0066 | |
| mobilenetv2 | 2 | 7.24622 | 65.9301 | |
| mobilenetv2 | 4 | 13.7638 | 85.3893 | |
| mobilenetv2 | 8 | 28.4093 | 131.669 | |

ARM CPU Benchmark

Machine And Environment

CPU: Kirin 980
CPU: Snapdragon 652
CPU: Snapdragon 855
CPU: RK3399

  • Build environment: cross-compiled with the Android NDK, GCC 4.9, NEON enabled
  • Time: 10 warm-up runs, then the average over 10 timed runs
  • Note: the shufflenetv2 int8 model adds a swish operator

Anakin is compared against ncnn (20190320). ARMv7 and ARMv8 are tested separately.

ARMv8 TEST

  • ABI: arm64-v8a
  • Latency (ms) of one batch

| Kirin 980 | Anakin fp32 | | | Anakin int8 | | | NCNN fp32 | | | NCNN int8 | | |
|---------------|-------------|----------|----------|-------------|----------|----------|-----------|----------|----------|-----------|----------|----------|
| | 1 thread | 2 thread | 4 thread | 1 thread | 2 thread | 4 thread | 1 thread | 2 thread | 4 thread | 1 thread | 2 thread | 4 thread |
| mobilenet_v1 | 34.172 | 19.369 | 12.723 | 37.588 | 20.692 | 13.280 | 45.420 | 24.220 | 16.730 | 50.560 | 27.820 | 20.010 |
| mobilenet_v2 | 30.489 | 17.784 | 12.327 | 29.581 | 17.208 | 15.307 | 30.390 | 17.310 | 12.900 | | | |
| mobilenet_ssd | 71.609 | 37.477 | 28.952 | | | | 88.220 | 70.070 | 66.430 | 103.700 | 85.160 | 85.320 |
| resnet50 | 255.748 | 137.842 | 104.628 | | | | 1299.480 | 695.830 | 498.010 | 243.360 | 131.100 | 89.800 |
| shufflenetv1 | 11.544 | 8.931 | 7.027 | | | | 12.810 | 9.390 | 8.030 | | | |
| shufflenetv2
