Chameleon
Chameleon: A Multiplier-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data
Chameleon is the first chip carrying out end-to-end FSL and CL with temporal data (no off-chip embedder, all weights on-chip), with three key innovations all validated in silicon:
- Chameleon builds on a reformulation of Prototypical Networks to integrate FSL/CL directly as part of the inference process, with only 0.5% logic area and negligible latency/power overheads. Simple yet powerful: we show the first 250-class demonstration of CL with Omniglot, also exceeding the accuracy of previous FSL-only demonstrations.
- Chameleon uses temporal convolutional networks (TCNs) as an efficient embedder for temporal data. This choice simultaneously enables (i) accuracy/power performance on par with or exceeding state-of-the-art inference-only accelerators for keyword spotting (93.3% at 3.1μW real-time power on Google Speech Commands), and (ii) efficiently dealing with >10^4 context lengths, demonstrating competitive performance also when directly processing 16kHz raw audio (demonstrated for the first time end-to-end on-chip, to the best of our knowledge).
- Small or large on-chip TCN? Learning- or inference-focused execution? Emphasis on throughput or low static power? Chameleon can efficiently support all of these application-dependent choices, thanks to a reconfigurable PE array.
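As a rough illustration of why TCNs scale to long contexts (second point above), the receptive field of a dilated TCN grows exponentially with depth. A sketch assuming the common residual-block design with two dilated convolutions per block and dilation doubling each block (tcn_lib's exact structure may differ):

```python
def tcn_receptive_field(kernel_size: int, num_blocks: int) -> int:
    """Receptive field of a TCN with two dilated convs per residual block
    and dilation doubling per block: 1 + 2 * (k - 1) * (2^B - 1)."""
    return 1 + 2 * (kernel_size - 1) * (2 ** num_blocks - 1)

# An 8-block, kernel-size-7 TCN already sees ~3k timesteps...
print(tcn_receptive_field(7, 8))   # 3061
# ...and at 12 blocks the receptive field exceeds 10^4 timesteps
print(tcn_receptive_field(7, 12))  # 49141
```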
If you use the Chameleon source code for academic or commercial purposes, we would appreciate it if you let us know; feedback is welcome.
Note: the documentation of Chameleon is currently work in progress! Feel free to open an issue if there is something that is unclear.
Directory structure
- `src/`: RTL (Verilog) source code
- `scripts/`: Scripts for Verilog formatting and configs for various waveviewers
- `chameleon/core/`: Core Python code to load and run trained nets and to communicate with the ASIC
- `fpga_bridge/`: Python code that runs on the FPGA to interface with the ASIC
- `sim/`: Simulation code for the ASIC
- `nets/`: Pre-trained networks in quantized state dict format used in the experiments
Installation
Verilog
Run `bender update` to install the necessary Verilog dependencies. Always run `make` to generate all required Verilog code before synthesizing or simulating the hardware.
Python
To use only the core Python code from Chameleon, simply run `pip install .` from the project root. If you also want to use the FPGA bridge code, run `pip install ".[fpga_bridge]"`. Finally, if you want to run the simulation code, run `pip install ".[sim]"`.
Testing
Chameleon is tested using the cocotb framework. To run all full system tests, go into the test/ directory and run:
```shell
python test_chameleon.py
```
To run the tests for individual components, run:
```shell
python test_pe.py
python test_pe_array.py
python test_argmax_tree.py
```
From PyTorch training to silicon deployment (or simulation)
1. Train a TCN
Install the required package for TCN networks from here.
```python
from tcn_lib import TCN

# Model for sequential MNIST task (1 input, 10 outputs, 8 layers of 32 channels, kernel size 7)
model = TCN(1, 10, [32] * 8, 7)

# Perform training in any way you want
dataset = ...

for epoch in range(10):
    for batch in dataset:
        ...
        loss = ...
        loss.backward()
        ...
        optimizer.step()
        ...
```
All experiments described in our paper were conducted using our fully open-source AutoLightning framework, which builds on PyTorch Lightning while adding support for detailed hyperparameter tracking, sweeping, and specialized training recipes, such as prototypical learning and quantization-aware training. However, you can of course use any framework you like to train your TCN model.
2. Quantize the network and export it
Install the required package for configurable quantization-aware-training from here.
```python
from brevitas_utils import (QuantConfig, create_qat_ready_model,
                            get_quant_state_dict, save_quant_state_dict)

WEIGHT_BIT_WIDTH = 4
ACT_BIT_WIDTH = 4
BIAS_BIT_WIDTH = 15
ACCUMULATION_BIT_WIDTH = 18

# Define quantization configurations (see for more details:
# https://xilinx.github.io/brevitas/tutorials/tvmcon2021.html#Inheriting-from-a-quantizer)
weight_quant_cfg = QuantConfig(
    base_classes=["Int8WeightPerTensorPowerOfTwo"],
    kwargs={"bit_width": WEIGHT_BIT_WIDTH, "narrow_range": False}
)
act_quant_cfg = QuantConfig(
    base_classes=["ShiftedParamFromPercentileUintQuant"],
    kwargs={"bit_width": ACT_BIT_WIDTH, "collect_stats_steps": 1500}
)
bias_quant_cfg = QuantConfig(
    base_classes=["Int16Bias"],
    kwargs={"bit_width": BIAS_BIT_WIDTH}
)
# Make sure to define output quantization in case of prototypical training
output_quant_cfg = None

load_float_weights_into_model = True  # Reuse weights from the floating-point model
# Do not calibrate the model (see:
# https://xilinx.github.io/brevitas/tutorials/tvmcon2021.html#Calibration-based-post-training-quantization).
# Optionally configure for better model performance and/or faster convergence
calibration_setup = None
skip_modules = []  # Quantize all modules

# Create a QAT-ready model
qat_ready_model = create_qat_ready_model(model,
                                         weight_quant_cfg,
                                         act_quant_cfg,
                                         bias_quant_cfg,
                                         load_float_weights_into_model=load_float_weights_into_model,
                                         calibration_setup=calibration_setup,
                                         skip_modules=skip_modules)

# Retrain for a few more epochs with a smaller learning rate to recover accuracy

# Get quantized weights and biases
quant_state_dict = get_quant_state_dict(qat_ready_model, (1, 1, 100))

# Export model
save_quant_state_dict(quant_state_dict, "quant_model.pkl")
```
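The brief retraining pass mentioned in the comments above can be sketched as a standard PyTorch loop with a reduced learning rate. The function below is illustrative only: the optimizer choice, learning rate, and epoch count are our assumptions, not values prescribed by the paper.

```python
import torch

def finetune(model, train_loader, epochs=3, lr=1e-4):
    """Hypothetical QAT fine-tuning sketch: briefly retrain the QAT-ready
    model with a small learning rate to recover accuracy lost to quantization."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model
```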
3. Simulate the network
3.1 Simulate network with NumPy
We use NumPy primitives to simulate the network instead of PyTorch, since the Python code of this project also has to run on the ARM core of the FPGA, for which PyTorch is likely too heavy.
```python
from chameleon.core.net_transfer_utils import get_quant_state_dict_and_layers

path = "quant_model.pkl"  # Path to the exported quantized model
in_quant, quant_layers = get_quant_state_dict_and_layers(path)

dataset = ...  # Load your dataset here

accuracy, preds_and_targets = infer(path, dataset)
print(f"Accuracy: {accuracy}")
```
Note that when simulating the network like this, no limits are enforced on the number of layers and the width of the layers. Only the accumulation width is enforced.
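Since the accumulation width is the one constraint the NumPy simulation does enforce, it may help to see what that constraint looks like. A minimal sketch of saturating accumulator values to the 18-bit signed range used in the quantization step above (the helper name is ours, not part of the `chameleon` package):

```python
import numpy as np

ACC_BITS = 18  # matches ACCUMULATION_BIT_WIDTH from the quantization step

def clamp_to_acc_width(x, bits=ACC_BITS):
    """Saturate values to the signed range of a `bits`-wide accumulator."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(x, lo, hi)

# Partial sums that overflow 18 bits saturate instead of wrapping around
acc = np.array([200_000, -200_000, 1234])
print(clamp_to_acc_width(acc).tolist())  # [131071, -131072, 1234]
```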
3.2 Simulate network on-chip
# TO DO!
4. Deploy the network
FPGA acting as host
Setup at Delft University of Technology
To configure the power supply and the current measurement unit correctly, run the following commands after booting the FPGA:
```shell
sudo ip addr add xxx.yyy.z.p/qq dev eth1
ifconfig eth1 up
```
Then navigate to the location of this repository on the FPGA and run:
```shell
python -m asyncio
```

to start a Python session for interacting with Chameleon.
Configuration
All memories are accessible for both read and write operations via the system’s 32-bit SPI bus (1 R/W-bit, 4-bit memory index, 16-bit start address, 11-bit transaction count).
This same SPI bus is also used to program the SoC’s configuration registers, such as the number of layers and kernel size per layer. These configuration values serve, among other things, as inputs to the network address generator, which generates the necessary control signals and memory addresses during processing.
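As an illustration of the 32-bit SPI transaction layout described above (1 R/W bit + 4-bit memory index + 16-bit start address + 11-bit transaction count). The field ordering and bit positions here are our assumption for illustration, not taken from the RTL:

```python
def pack_spi_word(rw: int, mem_idx: int, start_addr: int, count: int) -> int:
    """Pack an SPI transaction into a 32-bit word.

    Assumed layout (MSB to LSB): [rw:1 | mem_idx:4 | start_addr:16 | count:11].
    """
    assert rw in (0, 1)
    assert 0 <= mem_idx < (1 << 4)
    assert 0 <= start_addr < (1 << 16)
    assert 0 <= count < (1 << 11)
    return (rw << 31) | (mem_idx << 27) | (start_addr << 11) | count

# Write (rw=1) to memory 2, starting at address 0x0100, 64 words
print(f"0x{pack_spi_word(1, 2, 0x0100, 64):08X}")  # 0x90080040
```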
A full list of configuration registers is provided below. If you want to change the configuration of the chip, edit the config_memory.json file in the source code directory and rerun make to regenerate the associated Verilog code.
| Name | Width | Vector Size | Async Reset | Description |
|------|-------|-------------|-------------|-------------|
| continuous_processing | 1 | – | ✅ | Enables classification/regression processing every few inputs for continuously streaming inputs. |
| classification | 1 | – | ✅ | Enables classification mode. |
| power_down_memories_while_running | 1 | – | ✅ | For very high-dimensional inputs, or when data is transferred into Chameleon very slowly, this flag powers down all memories except the input memory while input data is being received. As soon as the input data needed for a new set of computations has arrived, all memories are powered up again and the computation starts. |
| enable_clock_divider | 1 | – | | |