# TinyEngine
This is the official implementation of TinyEngine, a memory-efficient and high-performance neural network library for microcontrollers.
TinyEngine is a part of MCUNet, which also includes TinyNAS. MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. TinyEngine and TinyNAS are co-designed to fit tight memory budgets.
**The MCUNet and TinyNAS repo is [here](https://github.com/mit-han-lab/mcunet).**
### [MCUNetV1](https://mcunet.mit.edu/#mcunetv1) | [MCUNetV2](https://mcunet.mit.edu/#mcunetv2) | [MCUNetV3](https://mcunet.mit.edu/#mcunetv3)
### [Demo (Inference)](https://www.youtube.com/watch?v=YvioBgtec4U)

### [Demo (Training)](https://www.youtube.com/watch?v=XaDCO8YtmBw)

## News
**If you are interested in getting updates, please sign up [here](https://forms.gle/UW1uUmnfk1k6UJPPA) to get notified!**
- **(2023/02)** We release the source code of the [person detection demo](examples/openmv_person_detection), [face mask detection demo](examples/openmv_face_mask_detection), and [on-device training demo](examples/openmv_training_sparse) on OpenMV Cam H7.
- **(2022/12)** We update the [measured results](README.md#measured-results) on STM32H743 with the new versions of the inference libraries.
- **(2022/12)** We release the source code for patch-based inference and update the [tutorial of our inference demo](tutorial/inference/README.md) to provide an option that generates patch-based inference code for the visual wake words (VWW) demo.
- **(2022/11)** We release the source code of Tiny Training Engine, and include the [tutorial of our training demo](tutorial/training) for training a visual wake words (VWW) model on microcontrollers.
- **(2022/11)** We release the source code of the algorithm and compilation parts of MCUNetV3 in [this repo](https://github.com/mit-han-lab/tiny-training). Please take a look!
- **(2022/10)** Our new work [On-Device Training Under 256KB Memory](https://arxiv.org/abs/2206.15472) is highlighted on the [MIT homepage](http://web.mit.edu/spotlight/learning-edge/)!
- **(2022/09)** Our new work [On-Device Training Under 256KB Memory](https://arxiv.org/abs/2206.15472) is accepted to NeurIPS 2022! It enables tiny on-device training for IoT devices \[[demo](https://www.youtube.com/watch?v=XaDCO8YtmBw)\].
- **(2022/08)** Our **New Course on TinyML and Efficient Deep Learning** will be released in September 2022: [efficientml.ai](https://efficientml.ai/).
- **(2022/08)** We include the [tutorial of our inference demo](tutorial/inference) for deploying a visual wake words (VWW) model onto microcontrollers.
- **(2022/08)** We open-source the TinyEngine repo.
- **(2022/07)** We include the person detection model used in the video demo above in the [MCUNet repo](https://github.com/mit-han-lab/mcunet).
- **(2022/06)** We refactor the [MCUNet repo](https://github.com/mit-han-lab/mcunet) as a standalone repo (previous repo: https://github.com/mit-han-lab/tinyml).
- **(2021/10)** **MCUNetV2** is accepted to NeurIPS 2021: https://arxiv.org/abs/2110.15352 !
- **(2020/10)** **MCUNet** is accepted to NeurIPS 2020 as **spotlight**: https://arxiv.org/abs/2007.10319 !
- Our projects are covered by: [MIT Spotlight News (v3)](http://web.mit.edu/spotlight/learning-edge/), [MIT News (v2)](https://news.mit.edu/2021/tiny-machine-learning-design-alleviates-bottleneck-memory-usage-iot-devices-1208), [MIT News (v1)](https://news.mit.edu/2020/iot-deep-learning-1113), [WIRED](https://www.wired.com/story/ai-algorithms-slimming-fit-fridge/), [Morning Brew](https://www.morningbrew.com/emerging-tech/stories/2020/12/07/researchers-figured-fit-ai-ever-onto-internet-things-microchips), [Stacey on IoT](https://staceyoniot.com/researchers-take-a-3-pronged-approach-to-edge-ai/), [Analytics Insight](https://www.analyticsinsight.net/amalgamating-ml-and-iot-in-smart-home-devices/), [Techable](https://techable.jp/archives/142462), etc.
## Overview
Microcontrollers are low-cost, low-power hardware. They are widely deployed across a broad range of applications, but their tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.
MCUNet is a **system-algorithm co-design** framework for tiny deep learning on microcontrollers. It consists of **TinyNAS** and **TinyEngine**. They are co-designed to fit the tight memory budgets. With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.

Specifically, TinyEngine is a memory-efficient inference library. TinyEngine adapts its memory scheduling to the overall network topology rather than optimizing layer by layer, which reduces memory usage and accelerates inference. It outperforms existing inference libraries such as [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers) from Google, [CMSIS-NN](https://arxiv.org/abs/1801.06601) from Arm, and [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html) from STMicroelectronics.
TinyEngine adopts the following optimization techniques to accelerate inference speed and minimize memory footprint.
- [**In-place depth-wise convolution**](https://mcunet.mit.edu/#mcunetv1): A unique data placement technique for depth-wise convolution that overwrites input data with intermediate/output data to reduce peak SRAM memory (see the sketch after this list).
- [**Patch-based inference**](https://mcunet.mit.edu/#mcunetv2): A generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory.
- [**Operator fusion**](https://docs.microsoft.com/en-us/windows/ai/directml/dml-fused-activations): A method that improves performance by merging one operator into a different operator so that they are executed together without requiring a roundtrip to memory.
- [**SIMD (Single instruction, multiple data) programming**](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data): A computing method that performs the same operation on multiple data points simultaneously.
- [**HWC to CHW weight format transformation**](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html): A weight format transformation technique that increases cache hit ratio for in-place depth-wise convolution.
- [**Image to Column (Im2col) convolution**](https://iq.opengenus.org/im2col/): An implementation technique of computing convolution operation using general matrix multiplication (GEMM) operations.
- [**Loop reordering**](https://xilinx.github.io/Vitis_Accel_Examples/2019.2/html/loop_reorder.html): A loop transformation technique that attempts to optimize a program's execution speed by reordering/interchanging the sequence of loops.
- [**Loop unrolling**](https://en.wikipedia.org/wiki/Loop_unrolling): A loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, an approach known as the space-time tradeoff (see the dot-product sketch after the summary below).
- [**Loop tiling**](https://en.wikipedia.org/wiki/Loop_nest_optimization): A loop transformation technique that attempts to reduce memory access latency by partitioning a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cache until it is reused.
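
To make the in-place technique concrete, here is a minimal C sketch under simplifying assumptions (int8 tensor in CHW layout, 3x3 kernel, stride 1, zero padding, bias and requantization omitted). The function and parameter names are illustrative; this is a sketch of the idea, not TinyEngine's actual kernel. Because each output channel of a depth-wise convolution depends only on the corresponding input channel, the result can be written back over the input, so a single channel-sized scratch buffer replaces a full-size output tensor.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of in-place depth-wise convolution (not TinyEngine's
 * actual kernel).  data: int8 tensor in CHW layout; weights: one 3x3 kernel
 * per channel; scratch: h*w bytes, reused for every channel. */
void depthwise_conv3x3_inplace(int8_t *data, int channels, int h, int w,
                               const int8_t *weights, int8_t *scratch)
{
    for (int c = 0; c < channels; c++) {
        int8_t *plane = data + c * h * w;       /* channel c is a contiguous plane */
        const int8_t *k = weights + c * 9;      /* 3x3 kernel for channel c        */

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int32_t acc = 0;
                for (int ky = -1; ky <= 1; ky++) {          /* zero padding of 1 */
                    for (int kx = -1; kx <= 1; kx++) {
                        int iy = y + ky, ix = x + kx;
                        if (iy < 0 || iy >= h || ix < 0 || ix >= w)
                            continue;
                        acc += plane[iy * w + ix] * k[(ky + 1) * 3 + (kx + 1)];
                    }
                }
                /* Bias/requantization omitted; clamp to the int8 range. */
                if (acc > 127) acc = 127;
                if (acc < -128) acc = -128;
                scratch[y * w + x] = (int8_t)acc;
            }
        }
        /* Overwrite the input channel with its output: the "in-place" step. */
        memcpy(plane, scratch, (size_t)h * w);
    }
}
```

The CHW layout assumed here is also why the HWC-to-CHW transformation in the list above helps: each channel becomes a contiguous plane, so both the convolution and the in-place write-back walk memory sequentially.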

By adopting the optimization techniques above, TinyEngine not only enhances inference speed but also reduces peak memory, as shown in the figures below.
**MAC/s improvement breakdown:**

**Peak memory reduction:**

To sum up, our **TinyEngine** inference engine could be a useful infrastructure for MCU-based AI applications. It significantly **improves the inference speed and reduces the memory usage** compared to existing libraries like [TF-Lite Micro](https://www.tensorflow.org/lite/microcontrollers), [CMSIS-NN](https://arxiv.org/abs/1801.06601), [X-CUBE-AI](https://www.st.com/en/embedded-software/x-cube-ai.html), etc. It improves the inference speed by **1.1-18.6x**, and reduces the peak memory by **1.3-3.6x**.
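
As an illustration of the Im2col and loop-unrolling entries in the list above, the following hypothetical C sketch shows the kind of inner kernel an im2col-based convolution reduces to: once input patches are laid out as columns, each output element is a dot product between a weight row and a column, and unrolling the accumulation by four cuts loop overhead and exposes independent multiply-accumulates that the compiler can map to MAC/SIMD instructions. This is a sketch of the general technique, not TinyEngine's actual code.

```c
#include <stdint.h>

/* Hypothetical inner kernel of an im2col-based convolution: one output
 * element is the dot product of a weight row and an im2col column.
 * Unrolling by 4 keeps four independent accumulators in flight. */
int32_t dot_product_unrolled(const int8_t *weights, const int8_t *column, int len)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;

    /* Main loop, unrolled by a factor of 4. */
    for (; i + 4 <= len; i += 4) {
        acc0 += weights[i]     * column[i];
        acc1 += weights[i + 1] * column[i + 1];
        acc2 += weights[i + 2] * column[i + 2];
        acc3 += weights[i + 3] * column[i + 3];
    }
    /* Tail loop for lengths that are not a multiple of 4. */
    for (; i < len; i++)
        acc0 += weights[i] * column[i];

    return acc0 + acc1 + acc2 + acc3;
}
```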

**Save Memory with Patch-based Inference:**
We can dramatically reduce the inference peak memory by using patch-based inference for the memory-intensive stage of CNNs.

For MobileNetV2, using patch-based inference allows us to reduce the peak memory by 8x.

With patch-based inference, TinyEngine achieves higher accuracy under the same memory budget.
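
To see where the savings come from, the small self-contained C example below compares the activation memory needed when the whole feature map must stay resident with the buffer needed when only one halo-extended patch is alive at a time. The tensor and patch sizes are illustrative assumptions, not measured MobileNetV2 numbers.

```c
#include <stdio.h>

/* Illustrative arithmetic only: compare the resident activation buffer for
 * per-layer inference with the per-patch buffer of patch-based inference.
 * The sizes below are made-up examples, not measured MobileNetV2 numbers. */
int main(void)
{
    /* Hypothetical early-stage feature map: 112 x 112 pixels, 16 channels, int8. */
    int h = 112, w = 112, c = 16;
    long full = (long)h * w * c;                      /* whole map resident: 200704 B */

    /* Split spatially into 4 x 4 patches; a 2-pixel halo per side keeps the
     * receptive field of the per-patch stage intact. */
    int patches = 4, halo = 2;
    long patch = (long)(h / patches + 2 * halo) * (w / patches + 2 * halo) * c;

    printf("full feature map : %ld bytes\n", full);
    printf("one patch buffer : %ld bytes\n", patch);  /* 32 x 32 x 16 = 16384 B */
    printf("reduction        : %.1fx\n", (double)full / (double)patch);
    return 0;
}
```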

## Code Structure
`code_generator` contains a Python library that compiles neural networks into low-level source code (C/C++).
`TinyEngine` contains a C/C++ library that implements operators and performs inference on microcontrollers.
`examples` contains examples of transforming TFLite models into our TinyEngine models.
`tutorial` contains the demo tutorial (of inference and training) of deploying a visual wake words (VWW) model onto microcontrollers.
`assets` contains misc assets.
## Requirements
- Python 3.6+
- STM32CubeIDE 1.5+
## Setup for Users
First, clone this repository:
```bash
git clone --recursive https://github.com/mit-han-lab/tinyengine.git
```
(Optional) Using a virtual environment with `conda` is recommended.
```bash
conda create -n tinyengine python=3.6 pip
conda activate tinyengine
```
Install dependencies:
```bash
pip install -r requirements.txt
```
## Setup for Developers
Install pre-commit hooks to automatically format changes in your code.
```bash
pre-commit install
```
## Deployment Example
Please see the [tutorial](tutorial) to learn how to deploy a visual wake words (VWW) model onto microcontrollers with TinyEngine. The tutorial includes both [the inference demo](tutorial/inference) and [the training demo](tutorial/training); please take a look!
## Measured Results
- All the TFLite models are from the [Model Zoo in the MCUNet repo](https://github.com/mit-han-lab/mcunet#model-zoo). Please see the MCUNet repo for how to build the pre-trained int8 quantized models in TF-Lite format.
- All the **latency**, **peak memory (SRAM)** and **Flash memory usage** results are profiled on STM32H743 with the limitations of 512 KB peak memory and 2 MB storage.
- Note that we measure newer versions of the libraries in this repo, so the results here may differ from those in the MCUNet papers.
- For each inference library, we use the git commit ID to indicate the version.
- All the TFLite models are compiled with the `-Ofast` optimization level in STM32CubeIDE.
- OOM denotes Out Of Memory.
- Measurements for X-CUBE-AI v7.3.0 were conducted with the default compilation setting (balanced mode).

The **latency** results:

| net_id | TF-Lite Micro [@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN [@ 011bf32](https://github.com/ARM-software/CMSIS-NN/tree/011bf3228a64cd70ba6bfac91ac6840a88b829ee) | X-CUBE-AI v7.3.0 | TinyEngine [@ 0363956](https://github.com/mit-han-lab/tinyengine/tree/03639563ebf6538fff557515e31667fca6448cd3) |
| ---------------------------- | ----------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | |
| mcunet-vww0 | 587ms | 53ms | 32ms | 27ms |
| mcunet-vww1 | 1120ms | 97ms | 57ms | 51ms |
| mcunet-vww2 | 5310ms | 478ms | 269ms | 234ms |
| *# mcunet models (ImageNet)* | | | | |
| mcunet-in0 | 586ms | 51ms | 35ms | 25ms |
| mcunet-in1 | 1227ms | 103ms | 63ms | 56ms |
| mcunet-in2 | 6463ms | 642ms | 351ms | 280ms |
| mcunet-in3 | 7821ms | 770ms | 414ms | 336ms |
| mcunet-in4 | OOM | OOM | 516ms | 463ms |
| *# baseline models* | | | | |
| mbv2-w0.35 | OOM | OOM | 118ms | 124ms |
| proxyless-w0.3 | 3801ms | 380ms | 205ms | 176ms |

The **peak memory (SRAM)** results:

| net_id | TF-Lite Micro [@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN [@ 011bf32](https://github.com/ARM-software/CMSIS-NN/tree/011bf3228a64cd70ba6bfac91ac6840a88b829ee) | X-CUBE-AI v7.3.0 | TinyEngine [@ 0363956](https://github.com/mit-han-lab/tinyengine/tree/03639563ebf6538fff557515e31667fca6448cd3) |
| ---------------------------- | ----------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | |
| mcunet-vww0 | 163kB | 163kB | 88kB | 59kB |
| mcunet-vww1 | 220kB | 220kB | 113kB | 92kB |
| mcunet-vww2 | 385kB | 390kB | 201kB | 174kB |
| *# mcunet models (ImageNet)* | | | | |
| mcunet-in0 | 161kB | 161kB | 69kB | 49kB |
| mcunet-in1 | 219kB | 219kB | 106kB | 96kB |
| mcunet-in2 | 460kB | 469kB | 238kB | 215kB |
| mcunet-in3 | 493kB | 493kB | 243kB | 260kB |
| mcunet-in4 | OOM | OOM | 342kB | 416kB |
| *# baseline models* | | | | |
| mbv2-w0.35 | OOM | OOM | 296kB | 295kB |
| proxyless-w0.3 | 453kB | 453kB | 221kB | 259kB |

The **Flash memory usage** results:

| net_id | TF-Lite Micro [@ 713b6ed](https://github.com/tensorflow/tflite-micro/tree/713b6ed6bd81d8d6906d885e14f444aaf9c154f6) | CMSIS-NN [@ 011bf32](https://github.com/ARM-software/CMSIS-NN/tree/011bf3228a64cd70ba6bfac91ac6840a88b829ee) | X-CUBE-AI v7.3.0 | TinyEngine [@ 0363956](https://github.com/mit-han-lab/tinyengine/tree/03639563ebf6538fff557515e31667fca6448cd3) |
| ---------------------------- | ----------------------- | ------------------ | --------- | ---------- |
| *# mcunet models (VWW)* | | | | |
| mcunet-vww0 | 627kB | 646kB | 463kB | 453kB |
| mcunet-vww1 | 718kB | 736kB | 534kB | 521kB |
| mcunet-vww2 | 1016kB | 1034kB | 774kB | 741kB |
| *# mcunet models (ImageNet)* | | | | |
| mcunet-in0 | 1072kB | 1090kB | 856kB | 842kB |
| mcunet-in1 | 937kB | 956kB | 737kB | 727kB |
| mcunet-in2 | 1084kB | 1102kB | 849kB | 830kB |
| mcunet-in3 | 1091kB | 1106kB | 867kB | 835kB |
| mcunet-in4 | OOM | OOM | 1843kB | 1825kB |
| *# baseline models* | | | | |
| mbv2-w0.35 | OOM | OOM | 857kB | 839kB |
| proxyless-w0.3 | 1065kB | 1075kB | 865kB | 842kB |
## Citation
If you find the project helpful, please consider citing our paper:
```bibtex
@article{lin2020mcunet,
  title={MCUNet: Tiny Deep Learning on IoT Devices},
  author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

@inproceedings{lin2021mcunetv2,
  title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
  author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

@inproceedings{lin2022ondevice,
  title={On-Device Training Under 256KB Memory},
  author={Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
```
## Related Projects
[MCUNet: Tiny Deep Learning on IoT Devices](https://mcunet.mit.edu/#mcunetv1) (NeurIPS'20)
[MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning](https://mcunet.mit.edu/#mcunetv2) (NeurIPS'21)
[MCUNetV3: On-Device Training Under 256KB Memory](https://mcunet.mit.edu/#mcunetv3) (NeurIPS'22)