# TinyGPUs: the DMC-1
Making graphics hardware like it's 1990, explained. Renders Doom, Comanche, and Quake levels!
This is a work in progress. Please stay tuned if you'd like to know when additional explanations come in. Comments welcome!
The tinyGPUs project started with the following question: "What would graphics hardware dedicated to our beloved retro games of the early '90s, such as Doom (1993) and Comanche (1992), have looked like?" This led me to create the DMC-1 GPU, the first (and currently only!) tiny GPU in this repository.
`DMC` stands for Doom Meets Comanche... also, it sounds cool (any resemblance to a car name is of course pure coincidence).
However, the true objective is to explore and explain basic graphics hardware design. Don't expect to learn anything about modern GPUs, but rather expect to learn about fundamental graphics algorithms, their elegant simplicity, and how to turn these algorithms into hardware on FPGAs.
The DMC-1 powers my doomchip "onice" demo, about which I gave a talk at rC3 NOWHERE in December 2021. You can watch it here and browse the slides here.
There is a plan to do another tiny GPU, hence the s in tinyGPUs, exploring different design tradeoffs. But that will come later.
The tinyGPUs are written in Silice, with bits and pieces of Verilog.
## Running the demos
To build the DMC-1 demos, Silice has to be installed and in your PATH; please refer to the Silice repository.
Note: The build process automatically downloads files, including data files from external sources. See the download scripts here and here.
There are several demos: terrain, tetrahedron, doomchip-onice, interleaved, triangles and q5k (quake viewer!).
All can be simulated, and all currently run on the MCH2022 badge and on the icebreaker with an SPI screen plugged into the PMOD 1A connector (details below).
The demos are running both on the icebreaker board and the MCH2022 badge.
### In simulation
All demos run in simulation (Verilator). Note that it takes a little while before rendering starts, as the full boot process (including loading code from SPI flash) is being simulated. During boot the screen remains black (on real hardware this delay is imperceptible).
For the rotating tetrahedron demo:
```
cd demos
make simulation DEMO=tetrahedron
```
For the terrain demo:
```
cd demos
make simulation DEMO=terrain
```
For the doomchip-onice demo:
```
cd demos
make simulation DEMO=doomchip-onice
```
For the quake-up5k (q5k) demo:
```
cd demos
make simulation DEMO=q5k
```
### On the MCH2022 badge
The badge scripts require python dependencies:
```
pip install pyserial pyusb
```
Plug in the board and type:
```
cd demos
make BOARD=mch2022 DEMO=q5k program_all MCH2022_PORT=/dev/ttyACM1
```
The `program_all` target takes time as it uploads the texture pack. Once done, use `program_code` to upload only the compiled code, and `program_design` for the design only (as long as there is power to the badge).
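Putting this together, a typical iteration loop on the badge might look like the sketch below (assuming the badge stays powered between uploads and enumerates as `/dev/ttyACM1`; the port may differ on your machine):

```shell
cd demos
# first time only: upload design, code and texture pack (slow)
make BOARD=mch2022 DEMO=q5k program_all MCH2022_PORT=/dev/ttyACM1
# subsequent iterations: upload only the recompiled code (fast)
make BOARD=mch2022 DEMO=q5k program_code MCH2022_PORT=/dev/ttyACM1
```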
When switching between the q5k (Quake) demo and the other demos, run `make clean` before building, to ensure the correct palette is used next.
### On the icebreaker
A 240x320 SPI screen with an ST7789 driver has to be hooked to PMOD 1A, following this pinout:
| PMOD1A       | SPI screen |
|--------------|------------|
| pin 1        | rst        |
| pin 2        | dc         |
| pin 3        | clk        |
| pin 4        | din        |
| pin 5 (GND)  | screen GND |
| pin 6 (VCC)  | screen VCC |
| pin 11 (GND) | cs         |
| pin 12 (VCC) | bl         |
For the rotating tetrahedron demo:
```
cd demos
make BOARD=icebreaker DEMO=tetrahedron program_all
```
For the terrain demo:
```
cd demos
make BOARD=icebreaker DEMO=terrain program_all
```
`program_all` takes a long time as it transfers the texture data onto the board. After doing it once, replace `program_all` with `program_code` to test the other demos.
When switching between the q5k (Quake) demo and the other demos, run `make clean` before building, to ensure the correct palette is used next.
## The DMC-1 design
### Context
I started the DMC-1 after my initial doomchip experiments. The original doomchip, which is available in the Silice repository, pushed the idea of running the Doom render loop without any CPU. This means the entire rendering algorithm was turned into specialized logic -- a Doom-dedicated chip that could do nothing but render E1M1!
I went on to design several versions, including one using SDRAM to store levels and textures. This was a fun experiment, but of course the resulting design is very large, because every part of the algorithm becomes a piece of a complex dedicated circuit. That might be desirable under specific circumstances, but is otherwise highly unusual. You see, normally one seeks to produce compact hardware designs, minimizing resource usage, and in particular the number of logic gates (or their LUT equivalents on FPGAs).
When targeting boards with (relatively) large FPGAs like the de10-nano or ULX3S 85F I was using, this is not a top concern, because the design still fits easily. But there's a lot to be said for trying to be parsimonious and making the best of the resources you have. I come from an age where computers were not powerful beyond imagination as they are today, and where optimizing was not only done for fun, it was essential. And when I say optimizing, I mean it in all possible ways: better algorithms (lower complexity), clever tricks, and informed low-level CPU optimizations, usually directly in assembly code. Of course, we could not go past the hardware. But now we can! Thanks to FPGAs and the emergence of open source toolchains, we are empowered to design and quickly test our own hardware!
This got me wondering. The doomchip is one extreme: putting everything in hardware and redoing the entire render loop from scratch. The other extreme -- also very interesting -- is to take the Doom source code and run it on a custom SoC ('system on a chip': the CPU, RAM, and the other pieces of hardware around them). There, the skill lies in designing a very efficient hardware SoC with a good CPU and a well-thought-out memory layout and cache. Some source ports require in-depth, careful engine optimizations to fit the target hardware.
There are so many excellent ports, I am just linking to a few for context here.
So the questions I asked myself were: could we design a GPU for Doom and other games of this era? What would its architecture look like? Could it fit on a small FPGA?
The DMC-1 is my take on this. Beyond Doom, I thought I should also support Comanche (1992) style terrains. There are several reasons for that:
- Comanche was a sight to behold back in the day!
- Adding a terrain to Doom sounds like a huge thing.
- If I was to create a GPU for Doom, it had better come with a killer feature.
And while I was at it, I later added support for Quake level rendering!
In terms of resources, I decided to primarily target the Lattice iCE40 UP5K. First, this is the platform used by the incredible source port on a custom SoC by Sylvain Munaut. Targeting anything bigger would have seemed too easy. Second, the UP5K is fairly slow (validating timing at 25 MHz is good, anything above is very good), and has 'only' 5K LUTs (that's not so small though: 1K LUTs can run a full dual-core 32-bit RISC-V processor!). So this makes for a good challenge. On the upside, the UP5K has 8 DSPs (great for fast multipliers!) and 128KB of SPRAM, a fast one-cycle read/write memory. So this gives hope that something can actually be achieved. Plus, an SPI flash memory is typically hooked up alongside the FPGA for its configuration. The SPI flash is to be considered read-only for our purpose (because writing is very, very slow), and even though a random-access read takes multiple cycles to initialize, performance is far from terrible. And that's great, because SPI flash memories are typically a few MB in size, and we need to put our large textures somewhere!
At this point, you might want to watch [
