SimpleTPU

An FPGA-based CNN accelerator, following Google's TPU v1.


A Tensor Processing Unit (TPU) is designed to accelerate matrix multiplication, especially for multilayer perceptrons and convolutional neural networks.
This implementation mainly follows Google TPU version 1, whose architecture is described in https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf.

Implementing a TPU in a hardware description language (such as VHDL or Verilog HDL) could take a lot of time, even with a simplified design, so I use the Xilinx HLS toolkit instead.

The plan is divided into three phases.

Phase 1: Complete the main computing modules, including
- Lab1: Systolic Array
- Lab2: Relu, Normalization & Pooling
Phase 2: Finish the full design of SimpleTPU.
Phase 3: Test SimpleTPU on real networks, such as an MLP and a CNN.
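To give a feel for what the systolic array in Lab1 computes, here is a minimal cycle-by-cycle sketch in NumPy. It assumes an output-stationary dataflow (each processing element keeps one output value), which is a simplification chosen for readability; TPU v1 keeps the weights stationary, and the HLS code in this repository may be organized differently. The function name `systolic_matmul` is hypothetical.

```python
import numpy as np

def systolic_matmul(a, b):
    """Simulate C = A @ B on an m x n grid of processing elements (PEs).

    Assumption: output-stationary dataflow. Rows of A enter from the left
    edge, columns of B enter from the top edge, both skewed by one cycle
    per row/column, and every PE accumulates its own C[i, j] in int32.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"

    acc = np.zeros((m, n), dtype=np.int32)    # int32 accumulators
    a_reg = np.zeros((m, n), dtype=np.int32)  # A operand held by each PE
    b_reg = np.zeros((m, n), dtype=np.int32)  # B operand held by each PE

    for t in range(m + n + k - 2):            # cycles until the array drains
        # Operands move one PE to the right (A) and one PE down (B).
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Inject skewed inputs at the left and top edges (zero when idle).
        for i in range(m):
            s = t - i
            a_reg[i, 0] = a[i, s] if 0 <= s < k else 0
        for j in range(n):
            s = t - j
            b_reg[0, j] = b[s, j] if 0 <= s < k else 0
        # Every PE performs one multiply-accumulate per cycle.
        acc += a_reg * b_reg
    return acc
```

In this sketch PE (i, j) sees `a[i, s]` and `b[s, j]` at cycle `t = s + i + j`, so after `m + n + k - 2` cycles each accumulator holds the full dot product.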

Key Features

The key features of SimpleTPU include:

- Int8 multipliers & Int32 accumulators
- VLIW-based instruction parallelism
- Vector-architecture-based data parallelism
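The pairing of Int8 multipliers with Int32 accumulators can be illustrated with a small NumPy sketch. It is not taken from the SimpleTPU sources; the helper name `int8_matmul` is hypothetical, and it only shows why products of 8-bit operands are summed in a wider type.

```python
import numpy as np

def int8_matmul(a, b):
    """Multiply two int8 matrices, accumulating in int32 (illustrative only)."""
    assert a.dtype == np.int8 and b.dtype == np.int8
    # Widen before multiplying so neither products nor sums wrap around.
    return a.astype(np.int32) @ b.astype(np.int32)

# The largest-magnitude int8 product is (-128) * (-128) = 16384, which already
# exceeds the int8 and int16 ranges once a few terms are summed.
worst_product = (-128) * (-128)

# Number of worst-case products an int32 accumulator can sum without overflow.
max_safe_terms = np.iinfo(np.int32).max // worst_product
```

With these numbers an int32 accumulator can absorb 131071 worst-case products, far more than the 784 terms of the widest dot product in the MLP example in this README.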

The operations that SimpleTPU can support are listed below.

| Operation | Support |
|-|-|
| Conv3d | in_channels: resource constrained; out_channels: resource constrained; kernel_size: supported; stride: supported; padding: supported; dilation: supported; groups: architecture constrained; bias: supported |
| ConvTranspose3d | The same as above |
| Maxpool2d | kernel_size: supported; stride: supported; padding: supported |
| Avgpool2d | The same as above |
| Relu | Relu is the only supported nonlinear function |
| BatchNorm2d | Merged with Conv or Pool at inference |
| Linear | Resource constrained |
| UpscalingNearest2D | Supported (by calling Avgpool2d multiple times) |
| UpscalingBilinear2D | Supported (by calling Avgpool2d multiple times) |
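The BatchNorm2d row says the layer is merged into the preceding Conv or Pool at inference. A minimal sketch of the usual folding arithmetic is shown below, assuming per-output-channel statistics and floating-point parameters before quantization. The function name `fold_batchnorm` and its arguments are hypothetical, and the exact procedure used for SimpleTPU is not described in this README.

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (z - mean) / sqrt(var + eps) + beta into z = W x + b.

    weight has the output channel as its first axis, e.g. (out, in) for a
    linear layer or (out, in, kh, kw) for a convolution. All other arguments
    are 1-D arrays with one entry per output channel.
    """
    scale = gamma / np.sqrt(var + eps)
    # Broadcast the per-channel scale over the remaining weight axes.
    w_folded = weight * scale.reshape(-1, *([1] * (weight.ndim - 1)))
    b_folded = (bias - mean) * scale + beta
    return w_folded, b_folded
```

After folding, the accelerator only has to execute the convolution (or linear layer) with the adjusted weights and bias, so no separate normalization step is needed at inference time.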

Performance

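The MAC array in SimpleTPU is 32×32 and the clock frequency is 500 MHz (timing closure achieved on a Xilinx UltraScale+ FPGA, speed grade -2). Counting each multiply-accumulate as two operations, the peak throughput is:
$$32 \times 32 \times 500\ \text{MHz} \times 2 = 1.024\ \text{TOPS} \approx 1\ \text{TOPS (int8)}$$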
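The peak figure can be reproduced with a few lines of Python. The helper `peak_tops` is a hypothetical name used only for this calculation; the factor of 2 assumes that each MAC counts as one multiply plus one add.

```python
def peak_tops(rows, cols, clock_hz, ops_per_mac=2):
    """Peak throughput in tera-operations per second for a rows x cols MAC array."""
    return rows * cols * clock_hz * ops_per_mac / 1e12

# 32 x 32 MACs at 500 MHz, as stated above.
peak = peak_tops(32, 32, 500e6)
print(f"{peak:.3f} TOPS")
```

This is an upper bound that assumes every MAC is busy on every cycle; real workloads reach only a fraction of it.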

Installation

Environment:

Vivado HLS 2018.2

Run:

step1: vivado_hls -f run_hls.tcl
step2: launch Vivado HLS and open the project
step3: run C synthesis, C/RTL cosimulation, etc.

Synthesis Result:
[figure: synthesis result]
Simulation Result:
[figure: simulation result]

Examples

1. MLP

The network structure of the MLP is defined as follows.

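import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        # 784 = 28 * 28 flattened MNIST pixels, 64 hidden units
        self.hidden = nn.Linear(784, 64)
        # 10 output classes (digits 0-9)
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        # Flatten each image into a 784-element vector
        x = x.view(-1, 784)
        x = self.hidden(x)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)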

The work efficiency of SimpleTPU on this network is about 84%.

| LOC | Layers | Nonlinear function | Weights | Batch Size | % of Deployed |
|---|---|---|---|---|---|
| 10 | 2 FC | Relu | 5M | 512 | 16% |

Classification result on MNIST:

[figure: classification result]

2. CNN

Because there is no compiler to generate instructions, this part of the plan is suspended. If you want to know how to compute a convolution using SimpleTPU, lab1 provides a simple example.
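For readers who want intuition before opening lab1, one common way to run a convolution on a matrix-multiply engine is to unroll input patches into columns (im2col) and then perform a single matrix multiplication. The sketch below assumes stride 1, no padding, and a single input image. The names `im2col` and `conv2d_as_matmul` are hypothetical, and lab1 may map convolutions onto the array differently.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll x of shape (C, H, W) into a (C*kh*kw, out_h*out_w) matrix."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # All input pixels that meet kernel tap (ci, i, j).
                cols[row] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv2d_as_matmul(x, weight):
    """Convolve int8 x (C, H, W) with int8 weight (O, C, kh, kw) via one matmul."""
    o, c, kh, kw = weight.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    cols = im2col(x, kh, kw).astype(np.int32)      # widen for int32 accumulation
    w_mat = weight.reshape(o, -1).astype(np.int32) # one row per output channel
    return (w_mat @ cols).reshape(o, out_h, out_w)
```

The row order produced by `im2col` matches the C-order flattening of the weight tensor, so each output channel becomes one dot product per output pixel, which is the kind of work a MAC array is built for.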

Related Links

https://www.cnblogs.com/sea-wind/p/10993958.html
