================================================
Overview
================================================

This repository is a RISC-V SoftCPU Contest entry. It implements three SoCs:

- Igloo2Perf : Performant Microsemi IGLOO®2 implementation
- Up5kPerf : Performant Lattice iCE40 UltraPlus™ implementation
- Up5kArea : Small Lattice iCE40 UltraPlus™ implementation

Each of these SoCs was ported to a physical board:

- Igloo2PerfCreative : https://www.microsemi.com/existing-parts/parts/143948
- Up5kPerfEvn : https://www.latticesemi.com/Products/DevelopmentBoardsAndKits/iCE40UltraPlusBreakoutBoard
- Up5kAreaEvn : https://www.latticesemi.com/Products/DevelopmentBoardsAndKits/iCE40UltraPlusBreakoutBoard

General information:

- Hardware description written in SpinalHDL/Scala
- The CPU used in the SoCs is VexRiscv
- Netlist exported in Verilog
- Simulations run with Verilator
- Passes all RV32I compliance tests
- Zephyr OS ready

Information about VexRiscv (the CPU used):

- Hosted on https://github.com/SpinalHDL/VexRiscv
- Implements RV32I[M][C]
- Optimized for FPGA
- Pipelined CPU with a configurable number of stages : (Fetch -> 0 .. x stages) Decode Execute [Memory] [Writeback]
- Highly configurable through a plugin system and a dataflow hardware description layer implemented on top of SpinalHDL/Scala

================================================
Requirements
================================================

This repository was developed on a Linux VM. Only the Libero synthesis and the serial port interaction were done on Windows.

This repository has some git submodules (Zephyr, compliance, VexRiscv), so you need to clone it recursively:

.. code-block:: sh

    git clone https://github.com/SpinalHDL/VexRiscvSoftcoreContest2018.git --recursive

Requirements:

- Verilator
- Icestorm
- Libero 11.8 SP2 (11.8.2.4)
- Icecube 2 2017.08.27940
- Zephyr SDK
- Java JDK 8, to regenerate the netlist
- SBT, to regenerate the netlist
- Python 2.7, to generate the Igloo2 SPI flash programming files

================================================
Repository structure
================================================

- makefile : Contains most of the commands you will need
- hardware

  - scala : Contains the SpinalHDL/Scala hardware description of the SoCs
  - netlist : Contains the Verilog files generated by SpinalHDL
  - synthesis : Contains the synthesis projects (Icestorm, Icecube2, Libero)

- software

  - bootloader
  - dhrystone

- test : Verilator testbenches
- ext : External dependencies (Zephyr, compliance)
- doc
- scripts
- project : Scala SBT project folder

================================================
Up5kPerf / Igloo2Perf
================================================

These two SoCs (Up5kPerf and Igloo2Perf) aim for the highest possible Dhrystone score. They are very similar and differ only in their memory architecture.

Characteristics of the VexRiscv configuration used:

- RV32IM
- 6 stages : 2xFetch, Decode, Execute, Memory, Writeback
- Bypassed register file
- Branch condition/target processing in the Execute stage, jump in the Memory stage
- 1-way branch target predictor
- 1-cycle barrel shifter (result in the Memory stage)
- 1-cycle multiplication using FPGA DSP blocks (result in the Writeback stage)
- 34-cycle iterative division, with a lookup table that performs divisions with small arguments in a single cycle

  - The lookup table saves about 33 cycles per Dhrystone iteration.
  - The lookup table can be disabled by setting dhrystoneOpt of the hardware generation to false.
  - The only purpose of the lookup table is to boost the Dhrystone result.
  - The lookup table is a 16x16 table of 4 bits.
  - Whether the lookup table optimisation is fair or unfair is open to debate :)

- Uncached fetch/load/store buses
- Load command emitted in the Memory stage
- Load result in the Writeback stage
- No emulation
- The CPU configuration isn't set to get the maximal DMIPS/MHz but the maximal Dhrystones/s
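
The lookup-table divider listed above can be modelled behaviourally as follows. This is a minimal Python sketch, not the VexRiscv implementation: the names ``QUOTIENT_LUT`` and ``divide_unsigned`` are hypothetical, the table is assumed to be indexed by two 4-bit operands and to hold 4-bit quotients, and sending division by zero through the slow path is also an assumption.

```python
# Behavioural model only; names and table indexing are assumptions,
# not taken from the VexRiscv source.
ITERATIVE_CYCLES = 34  # latency of the iterative divider, from the list above

# 16x16 table of 4-bit quotients, indexed as [numerator][denominator].
# Entries for denominator 0 are unused placeholders in this model.
QUOTIENT_LUT = [[(n // d) if d else 0 for d in range(16)] for n in range(16)]


def divide_unsigned(numerator, denominator):
    """Return (quotient, cycles) for an unsigned 32-bit division."""
    if denominator == 0:
        # RISC-V defines the unsigned quotient of a division by zero as all ones.
        return 0xFFFFFFFF, ITERATIVE_CYCLES
    if numerator < 16 and denominator < 16:
        # Both operands fit in 4 bits: single-cycle table lookup.
        return QUOTIENT_LUT[numerator][denominator], 1
    # Everything else goes through the iterative divider.
    return numerator // denominator, ITERATIVE_CYCLES
```

In this model, a division such as 15 / 4 completes in one cycle, while 100 / 7 takes the full 34 cycles, which is where the saving of about 33 cycles per small division comes from.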

Comments about the design:

- In both SoCs, the CPU boots from the SPI flash
- On-chip RAM organisation :

  - For the Up5k, there are two 64 KB RAMs based on SPRAM blocks: one for the instruction bus, one for the data bus
  - For the Igloo2, there is one 32 KB true dual-port RAM with one port for the instruction bus and one port for the data bus

- No caches were used, for the following reasons :

  - There was enough on-chip RAM to host the instructions and the data
  - The contest requirements initially asked for support of the fence-i instruction, which isn't supported by the VexRiscv caches (line management is done another way)
  - Even though an instruction cache and a data cache allow better decoupling between the CPU and the memory system, they didn't provide a frequency gain in the implemented SoCs

- The SPI flash contains the following partitions :

  - [0x00000 => FPGA bitstream for the Up5k]
  - 0x20000 => CPU bootloader, which copies the 0x30000 partition into the instruction RAM
  - 0x30000 => Application that the CPU should run

- The VexRiscv is configured with 2 fetch stages instead of 1 for the following reasons :

  - It relaxes the branch prediction path
  - It relaxes the instruction bus to RAM path
  - The performance/MHz degradation is mostly absorbed by the branch predictor

- The load commands are emitted in the Memory stage instead of the Execute stage to relax the address calculation timings
- The data RAM was mapped on purpose at address 0x00000 for the following reasons :

  - The Dhrystone benchmark uses many global variables, and with the RAM mapped this way they can be accessed at any time via an x0-relative load/store
  - The RISC-V compiler provided by the Zephyr SDK doesn't use the 'gp' register to access global variables

- The SPI flash is programmed in the following way :

  - Up5k -> by using the FTDI and iceprog
  - Igloo2 -> by using the FTDI to Up5k serial link
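
To make the x0-relative addressing point concrete: RV32I loads and stores encode a 12-bit signed immediate, so with ``x0`` (which always reads as zero) as the base register, only addresses within that immediate range can be reached in a single instruction. The sketch below checks this; the function name ``reachable_from_x0`` is hypothetical and is here only for illustration.

```python
def reachable_from_x0(address):
    """True if a 32-bit address can be encoded as imm(x0) in one load/store."""
    address &= 0xFFFFFFFF
    # Reinterpret the address as a signed 32-bit offset from x0 (always 0).
    offset = address - (1 << 32) if address & 0x80000000 else address
    # RV32I load/store immediates are 12-bit signed values.
    return -2048 <= offset <= 2047
```

Under this reading, the benefit applies to globals placed in the first 2 KB of the data RAM mapped at 0x00000; variables located further away still need a separate address computation.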

Block diagrams of the SoCs' memory systems:

.. |up5kPerfDiagram| image:: doc/assets/up5kPerfDiagram.png
   :width: 400

.. |igloo2PerfDiagram| image:: doc/assets/igloo2PerfDiagram.png
   :width: 400

+--------------------+-----------------------+
| Up5kPerf           | Igloo2Perf            |
+====================+=======================+
| |up5kPerfDiagram|  | |igloo2PerfDiagram|   |
+--------------------+-----------------------+

Block diagram of the CPU produced by the VexRiscv configuration used in both Up5kPerf and Igloo2Perf:

.. image:: doc/assets/xPerfCpuDiagram.png
   :width: 800

Note that the purpose of the double PC register between the pc-gen and the Fetch1 stage is to produce a consistent iBus_cmd:

- A transaction on iBus_cmd always stays unchanged until it completes
- One of these PC registers keeps the iBus_cmd address consistent
- The other PC register stores the value of a branch request while the Fetch1 stage is blocked

Claimed spec:

+--------------+--------------------+------------+
|              | Up5kPerf           | Igloo2Perf |
+==============+====================+============+
| Dhrystones/s | 65532              | 276695     |
+--------------+--------------------+------------+
| DMIPS/MHz    | 1.38               | 1.38       |
+--------------+--------------------+------------+
| Frequency    | 27 MHz             | 114 MHz    |
+--------------+--------------------+------------+

Note that without the lookup table divider optimisation, the performance of both SoCs drops to 1.27 DMIPS/MHz.
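
The relation between the Dhrystones/s, frequency and DMIPS/MHz figures claimed for these SoCs can be checked with the usual conversion, in which 1 DMIPS corresponds to 1757 Dhrystones/s (the VAX 11/780 reference score). The function name below is hypothetical; the input numbers are the claimed ones.

```python
# Dhrystones/s of the VAX 11/780, the conventional 1 DMIPS reference.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips_per_mhz(dhrystones_per_second, frequency_mhz):
    """Convert a raw Dhrystone score and a clock frequency to DMIPS/MHz."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND / frequency_mhz


print(round(dmips_per_mhz(65532, 27), 2))    # Up5kPerf   -> 1.38
print(round(dmips_per_mhz(276695, 114), 2))  # Igloo2Perf -> 1.38
```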

Notes about the synthesis/place/route of the Igloo2PerfCreative:

- The maximal frequency can easily vary between 107 MHz and 121 MHz from one synthesis to another with a different seed
- The critical combinatorial paths are dominated by routing delays (85% routing delay vs 15% cell delay)
- The synthesis was done without retiming, as it didn't provide a visible frequency gain

Notes about the synthesis/place/route of the Up5kPerfEvn:

- Stressing the synthesis tool with unrealistically tight timing requirements really helps to get better final timings.

================================================
Up5kArea
================================================

This SoC tries to use as few logic cells (LCs) as possible.

Characteristics of the VexRiscv configuration used:

- RV32I
- 2 stages : (Fetch_Decode), Execute
- Hazard resolution choices :

  - Single instruction scheduling (smallest)
  - Interlocked
  - Bypassed (faster)

- No branch prediction
- Iterative shifter, up to 31 cycles
- Uncached fetch/load/store buses
- No emulation
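
As an illustration of the iterative shifter entry, the sketch below models a left shift that moves one bit per cycle, so that the cycle count equals the shift amount (at most 31). The one-bit-per-cycle behaviour and the function name ``shift_left_iterative`` are assumptions made for this example; they are not taken from the VexRiscv source.

```python
def shift_left_iterative(value, amount):
    """Return (result, cycles) for a 32-bit logical left shift, one bit per cycle."""
    amount &= 0x1F               # RV32I only uses the low 5 bits of the shift amount
    result = value & 0xFFFFFFFF
    cycles = 0
    while cycles < amount:
        result = (result << 1) & 0xFFFFFFFF  # shift by a single bit each cycle
        cycles += 1
    return result, cycles
```

Compared with the barrel shifter of the Perf SoCs, such a design trades shift latency for a much smaller datapath, which fits the area goal of this SoC.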

Comments about the design:

- It does not try to reach the absolute minimal LC usage, as it still keeps a traditional pipelined approach.
- This design mainly tries to expand the usage scope of VexRiscv by reducing its LC usage.
- It gives the occupancy of a regular 2-stage pipelined RISC-V, which can serve as a baseline: reducing the area further would require "major" architecture changes.
- VexRiscv was designed as a 5-stage CPU, but thanks to its dataflow hardware description paradigm, it was quite easy to retarget it into a 2-stage CPU.
- The CPU boots from the SPI flash.
- The instruction bus and the data bus share the same memory (64 KB SPRAM).
- This SPRAM memory is only used for the software application.
- The SPI flash contains the following partitions :

  - 0x00000 => FPGA bitstream
  - 0x20000 => CPU bootloader, which copies the 0x30000 partition into the SPRAM
  - 0x30000 => Application that the CPU should run

- The SPI flash is programmed by using the FTDI and iceprog.
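
The partition layout above can be pictured as a single flash image. The following is a minimal sketch, assuming the three binaries are available as byte strings; ``build_flash_image`` and the constant names are hypothetical, and this is not the programming flow used by the repository's makefile.

```python
# Partition offsets from the list above; the bitstream starts at 0x00000.
BOOTLOADER_OFFSET = 0x20000
APPLICATION_OFFSET = 0x30000


def build_flash_image(bitstream, bootloader, application):
    """Place the three binaries at their partition offsets, padding with 0xFF."""
    if len(bitstream) > BOOTLOADER_OFFSET:
        raise ValueError("bitstream overlaps the bootloader partition")
    if len(bootloader) > APPLICATION_OFFSET - BOOTLOADER_OFFSET:
        raise ValueError("bootloader overlaps the application partition")
    # 0xFF is the value of erased NOR flash cells, so gaps look unprogrammed.
    image = bytearray(b"\xff" * (APPLICATION_OFFSET + len(application)))
    image[0:len(bitstream)] = bitstream
    image[BOOTLOADER_OFFSET:BOOTLOADER_OFFSET + len(bootloader)] = bootloader
    image[APPLICATION_OFFSET:] = application
    return bytes(image)
```

At boot, the bootloader at 0x20000 then only has to copy the bytes starting at 0x30000 into the SPRAM and jump to them.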

Block diagram of the memory system:

.. image:: doc/assets/up5kAreaDiagram.png
   :width: 400

Block diagram of the CPU produced by the VexRiscv configuration used (No args):

.. image:: doc/assets/up5kAreaCpuDiagram.png
   :width: 400

Claimed spec of the Up5kArea :

+------------------+-----------------------------+------------------------+-------------------------
