Team

Principal Investigator: Prof. Peipei Zhou, https://peipeizhou-eecs.github.io/

Ph.D. Students: Jinming Zhuang (Student Lead), Zhuoping Yang, Shixin Ji

Faculty Collaborators: Professors Jingtong Hu (Pitt), Alex Jones (Syracuse), Deming Chen (UIUC), Yiyu Shi (Notre Dame), Yanzhi Wang (Northeastern) and Jason Cong (UCLA)

Student Collaborators: Jason Lau (UCLA) and Hanchen Ye (UIUC)

AMD Collaborators: Stephen Neuendorffer, Jack Lo, and Kristof Denolf

🚀 Thank You for Using CHARM!

Your support and growing engagement inspire us to continually improve and enhance the project.

Total Views since 02/13/2025: 11340
Total Downloads since 02/13/2025: 1665

<img src="./plot/CHARM_traffic_plot.png" width="600" />

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture (FPGA'23)

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives (DAC'23)

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration (FPGA'24)

CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture (ACM TRETS'24)

ESWEEK 2023 and DAC 2023 Video Demos: https://drive.google.com/file/d/1wWn7R_l-Sfbg818Hmvg2448l9WvGy-YK/view?usp=sharing

ACM/IEEE Reference Format

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou. 2023. CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 12 pages.

ACM PDF: https://doi.org/10.1145/3543622.3573210 Author Version PDF: https://peipeizhou-eecs.github.io/publication/fpga23/

Jinming Zhuang, Zhuoping Yang, Peipei Zhou. 2023. High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives. In Proceedings of the 60th ACM/IEEE Design Automation Conference (DAC ’23), July 9–13, 2023, San Francisco, CA, USA. https://doi.org/10.1109/DAC56929.2023.10247981

IEEE PDF: https://ieeexplore.ieee.org/document/10247981 Author Version PDF: https://arxiv.org/pdf/2305.18698.pdf

Jinming Zhuang, Zhuoping Yang, Shixin Ji, Heng Huang, Alex K. Jones, Jingtong Hu, Yiyu Shi, Peipei Zhou. 2024. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration (FPGA'24)

ACM PDF: https://doi.org/10.1145/3626202.3637569

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Shixin Ji, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Yiyu Shi, Deming Chen, Jason Cong, Peipei Zhou. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture (ACM TRETS'24)

ACM PDF: https://dl.acm.org/doi/10.1145/3686163

Dong, Peiyan, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Yanyu Li, Dongkuan Xu, Heng Huang et al. "EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, no. 11 (2024): 3949-3960.

IEEE PDF: https://doi.org/10.1109/TCAD.2024.3443692 Author Version PDF: https://peipeizhou-eecs.github.io/publication/2024_esweek_eqvit/2024_esweek_eqvit.pdf

New Release! 2023.05.29

PyACAP 1.0: Python Based Automatic Code Generation for Versal ACAP:

What's new: This release provides a complete Python interface for matrix multiply with the 32-bit floating-point (FP32) data type on the Versal ACAP VCK190 and VCK5000 platforms.
Overall Compilation Flow: <img src="https://github.com/arc-research-lab/CHARM/assets/77606152/c001c316-e310-49ec-9495-7bb01a718656" width="800" height="300">
Python Interface Introduction:
Quick Start: Running project_setup.py

python project_setup.py

import numpy as np
from charm import *

# Define the left-hand-side (A) and right-hand-side (B) operands
A = np.random.rand(4096, 4096).astype(np.float32)
B = np.random.rand(4096, 4096).astype(np.float32)

# Create the object of the class charm
# prj_dir is the project directory path (not defined in this snippet)
automm = charm(prj_dir)

# Launch charm DSE to find an optimized hardware configuration
Versal_config = automm.cdse(A, B)

# Launch the charm automatic code generator to emit the code for AIE, PL and host CPU
device = 'vck190'  # Supported devices are vck190 and vck5000
automm.cacg(Versal_config, device)

# Run the Vitis compilation flow
automm.build()

Overview

In this repo, we use general-purpose Matrix-Matrix Multiplication (GEMM) as an example and provide a detailed description of how to build a system-level design on the AMD Versal VCK190 platform. By going through this repo, users can learn:

How to design a highly efficient single AIE kernel by leveraging the 7-way very long instruction words (VLIW)?
How to sustain 400 AIEs with the limited I/O interfaces between AIE and PL by using a broadcast-packet mechanism?
How to transfer data from PL/AIE to AIE/PL by using a bubble-free pipeline strategy?

We provide an automatic code generation and compilation flow, so users can build the system on Versal step by step by changing the configuration files.

Dependencies

To play with the Charming Accelerators, the following software and hardware dependencies are required:

Linux System with "tar" installed
AMD/Xilinx Vitis 2021.1 (Version 2021.1 guarantees the designs in the example folder to be compiled correctly)
AMD/Xilinx XRT Library
AMD/Xilinx Versal VCK190 (Vitis 2021.1)
AMD/Xilinx Versal VCK5000 (xilinx_vck5000_gen3x16_xdma_1_202120_1, Vitis 2021.2)
AMD/Xilinx Versal VCK5000 (xilinx_vck5000_gen4x8_qdma_2_202220_1, Vitis 2022.2-2023.1)

Environment Setup

1. To quickly boot and run experiments on the board instead of building the platform and Linux from scratch, users can download the platform package (VCK190 Base 2021.1) and the PetaLinux common image (Versal common image) from the following link:

https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/embedded-platforms/2021-1.html

2. Install the platform and Petalinux

unzip xilinx_vck190_base_202110_1.zip

tar -xf xilinx-versal-common-v2021.1.tar.gz
cd xilinx-versal-common-v2021.1
sh sdk.sh

3. VCK190 Base 2021.1: It contains the pre-built Versal extensible embedded platform. During compilation, users need to specify the platform path in the following format.

PLATFORM=${PATH}/xilinx_vck190_base_202110_1/xilinx_vck190_base_202110_1.xpfm

4. Versal common image: It includes the PetaLinux system boot files and the cross-compilation environment needed for the ARM CPU. During compilation, users need to set SYSROOT and EDGE_COMMON_SW to the following paths.

SYSROOT=${PATH}/sysroots/cortexa72-cortexa53-xilinx-linux
EDGE_COMMON_SW=${PATH}/xilinx-versal-common-v2021.1

5. Vitis and Cross-compilation Environment Setup

source /opt/tools/xilinx/Vitis/2021.1/settings64.sh
source /opt/xilinx/xrt/setup.sh
unset LD_LIBRARY_PATH  # if needed
source ${PATH}/environment-setup-cortexa72-cortexa53-xilinx-linux

6. Project Setup and Compilation

Users can generate the customized project by setting up the configuration file and directly running the following command:

./project_setup.sh ./config_files/input.cfg ${Project_DIR}
cd ${Project_DIR}
make all PLATFORM=${PATH} EDGE_COMMON_SW_PATH=${PATH} SYSROOT_PATH=${PATH}
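Before invoking make, it can help to confirm that the three paths actually exist. The sketch below is a hypothetical helper and not part of CHARM: the function name `build_make_command` and its `check_exists` flag are assumptions made for illustration. It only assembles the same `make all` command shown above as an argument list.

```python
from pathlib import Path


def build_make_command(platform, edge_common_sw, sysroot, check_exists=True):
    """Assemble the `make all` invocation shown above as an argument list.

    Hypothetical helper, not part of CHARM. The three arguments are the
    paths passed as PLATFORM, EDGE_COMMON_SW_PATH and SYSROOT_PATH.
    """
    variables = {
        "PLATFORM": Path(platform),
        "EDGE_COMMON_SW_PATH": Path(edge_common_sw),
        "SYSROOT_PATH": Path(sysroot),
    }
    if check_exists:
        # Fail early with the names of all variables whose path is missing.
        missing = [name for name, path in variables.items() if not path.exists()]
        if missing:
            raise FileNotFoundError("paths not found for: " + ", ".join(missing))
    return ["make", "all"] + [f"{name}={path}" for name, path in variables.items()]
```

The returned list can be passed to `subprocess.run(..., cwd=project_dir, check=True)`, where `project_dir` stands for the generated project directory.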

7. On Board Execution for MM with Arbitrary Sizes

After copying the SD card image to a micro SD card and booting the system, run the following commands to get the execution results. {M}, {K}, and {N} are the dimensions of the MM. To reduce the effect of API call overhead when running the kernel, users can specify the number of iterations ({iteration}) for which the MM is run; the program then reports the average throughput. To verify the correctness of the MM kernel, set {verify} to 1; otherwise set it to 0. For example, to run a 1024*1024*1024 MM for 100 iterations without verifying the result: ./hostexe mm_hw.xclbin 1024 1024 1024 100 0

cd /mnt/sd-mmcblk0p1
./hostexe mm_hw.xclbin {M} {K} {N} {iteration} {verify}
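As a rough illustration of how an average throughput figure relates to {M}, {K}, {N} and {iteration}, the sketch below uses the conventional count of 2*M*K*N floating-point operations per matrix multiply. The function name `gemm_gflops` is hypothetical, and how `hostexe` itself measures time is an assumption not confirmed here.

```python
def gemm_gflops(m, k, n, iterations, total_seconds):
    """Average GEMM throughput in GFLOP/s over `iterations` runs.

    An (m x k) by (k x n) product performs m*k*n multiply-accumulates,
    counted here as 2*m*k*n floating-point operations.
    """
    if iterations <= 0 or total_seconds <= 0:
        raise ValueError("iterations and total_seconds must be positive")
    flops_per_run = 2 * m * k * n
    return flops_per_run * iterations / total_seconds / 1e9


# Illustrative numbers only (not measured): 100 runs of 1024*1024*1024 in 0.2 s.
print(f"{gemm_gflops(1024, 1024, 1024, 100, 0.2):.2f} GFLOP/s")
```

Averaging over many iterations amortizes the fixed per-call overhead, which is why a larger {iteration} value tends to give a more stable number.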

Targeting VCK5000

By default, CHARM targets the xilinx_vck5000_gen3x16_xdma_1_202120_1 platform for VCK5000. To target the xilinx_vck5000_gen4x8_qdma_2_202220_1 platform, we require Vitis 2022.2-2023.1 and the PLATFORM_NAME variable to be defined for the build process, i.e.:

make all PLATFORM_NAME=xilinx_vck5000_gen4x8_qdma_2_202220_1

Note: with higher total off-chip bandwidth and AIE clock frequency, the VCK5000 can show better QoR than the VCK190, e.g. for single square MM kernels:

| Size | Paper (GFlop/s) | Observed (GFlop/s) |
|:----:|:---------------:|:------------------:|
| 1024 | 1103.46 | 1598.24 |
| 4096 | 2718.42 | 4081.14 |
| 6144 | 3277.99 | 4518.02 |
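To make the gap concrete, the ratio of observed to paper throughput can be computed directly from the table above. This is plain arithmetic on the listed values; the variable names are illustrative.

```python
# (size, paper GFlop/s, observed GFlop/s), copied from the table above.
results = [
    (1024, 1103.46, 1598.24),
    (4096, 2718.42, 4081.14),
    (6144, 3277.99, 4518.02),
]

# Ratio of observed to paper throughput for each square MM size.
speedups = {size: observed / paper for size, paper, observed in results}
for size, ratio in speedups.items():
    print(f"{size}: {ratio:.2f}x")
```

For these three sizes the ratios come out to roughly 1.45x, 1.50x and 1.38x.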

Step-by-Step Tutorial

In this part, we first introduce the overall MM tiling strategy, which includes four levels of tiling. In the later parts, we illustrate how we handle each of these levels.

Overall MM Tiling Strategy:

Given a large Matrix Multiplication (MM) of size (M*K) * (K*N), referred to as M*K*N, the listing below shows four levels of tiling used to handle this MM (from innermost to outermost):

Line 16-20: MM calculated on a single AIE core.
Line 12-14: The spatial distribution unrolled across different AIE cores in AIE Array.
Line
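The two levels named above (computation on a single AIE core, and spatial distribution across cores) can be pictured with a plain loop nest. The sketch below is an illustrative assumption, not the listing referenced above and not CHARM's generated code: the function name `tiled_matmul` and the tile sizes are hypothetical, and everything runs sequentially in pure Python.

```python
def tiled_matmul(a, b, tile_m=2, tile_k=2, tile_n=2):
    """Two-level tiled matrix multiply on nested lists.

    Outer loops select one (tile_m x tile_n) output tile and one K slice;
    inner loops compute that tile, i.e. the work one core would own.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    c = [[0.0] * n for _ in range(m)]
    # Outer level: output tiles are independent of each other, so on
    # hardware they could be spread spatially across cores.
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            # K is reduced tile by tile into the same output tile.
            for p0 in range(0, k, tile_k):
                # Inner level: the small MM a single core would compute.
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b))
```

The `min(...)` bounds let the sketch handle matrix sizes that are not multiples of the tile sizes; a hardware design would more typically pad the operands instead.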
