SkillAgentSearch skills...

Tinyengine

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory

Install / Use

/learn @mit-han-lab/Tinyengine

README

TinyEngine

This is the official implementation of TinyEngine, a memory-efficient and high-performance neural network library for Microcontrollers. TinyEngine is a part of MCUNet, which also consists of TinyNAS. MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. TinyEngine and TinyNAS are co-designed to fit the tight memory budgets.

The MCUNet and TinyNAS repo is here.

TinyML Project Website | MCUNetV1 | MCUNetV2 | MCUNetV3

Demo (Inference)

demo

Demo (Training)

demo_v3

News

If you are interested in getting updates, please sign up here to get notified!

Overview

Microcontrollers are low-cost, low-power hardware. They are widely deployed and have wide applications, but the tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.

MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. It consists of TinyNAS and TinyEngine. They are co-designed to fit the tight memory budgets. With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.

overview

Specifically, TinyEngine is a memory-efficient inference library. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing memory usage and accelerating the inference. It outperforms existing inference libraries such as TF-Lite Micro from Google, CMSIS-NN from Arm, and X-CUBE-AI from STMicroelectronics.

TinyEngine adopts the following optimization techniques to accelerate inference speed and minimize memory footprint.

  • In-place depth-wise convolution: A unique data placement technique for depth-wise convolution that overwrites input data by intermediate/output data to reduce peak SRAM memory.
  • Patch-based inference: A generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory.
  • Operator fusion: A method that improves performance by merging one operator into a different operator so that they are executed together without requiring a roundtrip to memory.
  • SIMD (Single instruction, multiple data) programming: A computing method that performs the same operation on multiple data points simultaneously.
  • HWC to CHW weight format transformation: A weight format transformation technique that increases cache hit ratio for in-place depth-wise convolution.
  • Image to Column (Im2col) convolution: An implementation technique of computing convolution operation using general matrix multiplication (GEMM) operations.
  • Loop reordering: A loop transformation technique that attempts to optimize a program's execution speed by reordering/interchanging the sequence of loops.
  • Loop unrolling: A loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space-time tradeoff.
  • Loop tiling: A loop transformation technique that attempts to reduce memory access latency by partitioning a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cache until it is reused.

inplace_depthwise

By adopting the abovementioned optimization techniques, TinyEngine can not only enhance inference speed but also reduce peak memory, as shown in the figures below.

MAC/s improvement breakdown: mac_result

Peak memory reduction: peakmem_result

To sum up, our TinyEngine inference engine could be a useful infrastructure for MCU-based AI applications. It significantly improves the inference speed and reduces the memory usage compared to existing libraries like TF-Lite Micro, CMSIS-NN, X-CUBE-AI, etc. It improves the inference speed by 1.1-18.6x, and reduces the peak memory by 1.3-3.6x.

measured_result

Save Memory with Patch-based Inference: We can dramastically reduce the inference peak memory by using patch-based inference for the memory-intensive stage of CNNs. measured_result

For MobileNetV2, using patch-based inference allows us to reduce the peak memory by 8x. measured_result

With patch-based infernece, tinyengine achieves higher accuracy at the same memory budgets. measured_result

Code Structure

code_generator contains a python library that is used to compile neural networks into low-level source code (C/C++).

TinyEngine contains a C/C++ library that implements operators and performs inference on Microcontrollers.

examples contains the examples of transforming TFLite models into our TinyEngine models.

tutorial contains the demo tutorial (of inference and training) of deploying a visual wake words (VWW) model onto microcontrollers.

assets contains misc assets.

Requirement

  • Python 3.6+
  • STM32CubeIDE 1.5+

Setup for Users

First, clone this repository:

git clone --re
View on GitHub
GitHub Stars932
CategoryEducation
Updated4d ago
Forks155

Languages

C

Security Score

100/100

Audited on Mar 23, 2026

No findings