VexRiscvSoftcoreContest2018
================================================
Overview
================================================

This repository is a RISC-V SoftCPU Contest entry. It implements three SoCs :
- Igloo2Perf : Performant Microsemi IGLOO®2 implementation
- Up5kPerf : Performant Lattice iCE40 UltraPlus™ implementation
- Up5kArea : Small Lattice iCE40 UltraPlus™ implementation
Each of these SoCs was ported to a physical board :
- Igloo2PerfCreative : https://www.microsemi.com/existing-parts/parts/143948
- Up5kPerfEvn : https://www.latticesemi.com/Products/DevelopmentBoardsAndKits/iCE40UltraPlusBreakoutBoard
- Up5kAreaEvn : https://www.latticesemi.com/Products/DevelopmentBoardsAndKits/iCE40UltraPlusBreakoutBoard
General information :
- Hardware description made with SpinalHDL/Scala
- CPU used in SoCs is VexRiscv
- Netlist exported in Verilog
- Simulations made with Verilator
- Pass all RV32I compliance tests
- Zephyr OS ready
Information about VexRiscv (the CPU used) :
- Hosted on https://github.com/SpinalHDL/VexRiscv
- Implements RV32I[M][C]
- Optimized for FPGA
- Pipelined CPU with a configurable number of stages : (Fetch -> 0 .. x stages) Decode Execute [Memory] [Writeback]
- Highly configurable through a system of plugins and a dataflow hardware description layer implemented on top of SpinalHDL/Scala
================================================
Requirements
================================================

This repository was developed on a Linux VM. Only the Libero synthesis and the serial port interactions were done on Windows.
This repository has some git submodules (Zephyr, compliance, VexRiscv), so you need to clone it recursively:

.. code-block:: sh

    git clone https://github.com/SpinalHDL/VexRiscvSoftcoreContest2018.git --recursive
Requirements :
- Verilator
- Icestorm
- Libero 11.8 SP2 (11.8.2.4)
- Icecube 2 2017.08.27940
- Zephyr SDK
- Java JDK 8, to regenerate the netlist
- SBT, to regenerate the netlist
- Python 2.7, to generate the Igloo2 SPI flash programming files
================================================
Repository structure
================================================

- makefile : Contains most of the commands you will need
- hardware

  - scala : Contains the SpinalHDL/Scala hardware description of the SoCs
  - netlist : Contains the Verilog files generated by SpinalHDL
  - synthesis : Contains the synthesis projects (Icestorm, Icecube2, Libero)

- software

  - bootloader
  - dhrystone

- test : Verilator testbenches
- ext : External dependencies (Zephyr, compliance)
- doc
- scripts
- project : Scala SBT project folder
================================================
Up5kPerf / Igloo2Perf
================================================

These two SoCs (Up5kPerf and Igloo2Perf) aim for the maximal Dhrystone score. They are very similar and differ only in their memory architecture.
Characteristics of the VexRiscv configuration used :

- RV32IM
- 6 stages : 2xFetch, Decode, Execute, Memory, Writeback
- Bypassed register file
- Branch condition/target processing in the Execute stage, jump in the Memory stage
- 1 way branch target predictor
- 1 cycle barrel shifter (result in the Memory stage)
- 1 cycle multiplication using FPGA DSP blocks (result in the Writeback stage)
- 34 cycles iterative division, with a lookup table that handles divisions with small arguments in a single cycle

  - The lookup table saves about 33 cycles per Dhrystone iteration
  - The lookup table can be disabled by setting dhrystoneOpt of the hardware generation to false
  - The only purpose of the lookup table is to boost the Dhrystone result
  - The lookup table is a 16x16 table of 4-bit entries
  - Whether the lookup table optimisation is fair or unfair is open to debate :)

- Uncached fetch/load/store buses
- Load command emitted in the Memory stage
- Load result in the Writeback stage
- No emulation
- The CPU configuration is tuned for maximal Dhrystones/s rather than maximal DMIPS/MHz
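The divider behaviour described above can be modelled in a few lines of Python. This is only a behavioural sketch, not the SpinalHDL implementation: the function name, the restriction to unsigned quotients, and the handling of a zero divisor on the iterative path are assumptions made for illustration. Only the table size (16x16 entries of 4 bits) and the cycle counts (34 and 1) come from the list above.

```python
# Behavioural sketch of a divider with a small-operand lookup table.
# Names and the unsigned-only scope are assumptions, not the actual hardware.
ITERATIVE_CYCLES = 34  # iterative divider latency, from the list above
LUT_CYCLES = 1         # single-cycle path through the lookup table

# 16x16 table of 4-bit quotients. Entries for divisor 0 are never read here.
QUOTIENT_LUT = [[(a // b) if b else 0 for b in range(16)] for a in range(16)]


def divide_unsigned(a, b):
    """Return (quotient, cycles) for an unsigned 32-bit division."""
    if a < 16 and 0 < b < 16:
        # Both operands fit in 4 bits: answer straight from the table.
        return QUOTIENT_LUT[a][b], LUT_CYCLES
    if b == 0:
        # RISC-V defines DIVU by zero as all ones. Routing this case
        # through the iterative path is an assumption of this sketch.
        return 0xFFFFFFFF, ITERATIVE_CYCLES
    return a // b, ITERATIVE_CYCLES
```

In this model each table hit saves 34 - 1 = 33 cycles, which would match the quoted saving if one division per Dhrystone iteration hits the table (an inference, not something the source states).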
Comments about the design :

- In both SoCs, the CPU boots from the SPI flash
- On-chip RAM organisation :

  - For the Up5k, there are two 64 KB RAMs based on SPRAM blocks : one for the instruction bus, one for the data bus
  - For the Igloo2, there is one 32 KB true dual port RAM with one port for the instruction bus and one port for the data bus

- No caches were used, for the following reasons :

  - There was enough on-chip RAM to host the instructions and the data
  - The contest requirements initially asked for support of the fence-i instruction, which the VexRiscv caches do not support (cache line management is done another way)
  - Even though an instruction cache and a data cache give a better decoupling between the CPU and the memory system, they did not provide any frequency gain in the implemented SoCs

- The SPI flash contains the following partitions :

  - [0x00000 => FPGA bitstream for the Up5k]
  - 0x20000 => CPU bootloader, which copies the 0x30000 partition into the instruction RAM
  - 0x30000 => Application that the CPU should run

- The VexRiscv is configured with 2 fetch stages instead of 1 for the following reasons :

  - It relaxes the branch prediction path
  - It relaxes the instruction bus to RAM path
  - The performance/MHz degradation is mostly absorbed by the branch predictor

- The load commands are emitted in the Memory stage instead of the Execute stage to relax the address calculation timings
- The data RAM was mapped on purpose at address 0x00000 for the following reasons :

  - The Dhrystone benchmark uses many global variables, and by mapping the RAM this way, they can be accessed at any time via an x0 relative load/store
  - The RISC-V compiler provided with Zephyr doesn't use the 'gp' register to access global variables

- The SPI flash is programmed in the following way :

  - Up5k -> by using the FTDI and iceprog
  - Igloo2 -> by using the FTDI to Up5k serial link
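To illustrate why mapping the data RAM at address 0x00000 helps, the sketch below checks whether an address can be reached by a single x0 relative load/store. It relies on the RV32I encoding, where loads and stores carry a 12-bit signed immediate (-2048 to 2047). The function name is hypothetical, and how many of the Dhrystone globals actually land in that window is not stated in the source.

```python
# RV32I loads/stores add a 12-bit signed immediate to the base register.
IMM_MIN, IMM_MAX = -2048, 2047


def x0_relative_offset(address):
    """Return the immediate that reaches `address` from x0, or None.

    With x0 (always zero) as base, the effective address is simply the
    sign-extended immediate, wrapped to 32 bits.
    """
    address &= 0xFFFFFFFF
    # Reinterpret the 32-bit address as a signed value.
    signed = address - (1 << 32) if address >= 0x80000000 else address
    return signed if IMM_MIN <= signed <= IMM_MAX else None
```

For example, a global at address 0x7FC is reachable with a single `lw rd, 0x7FC(x0)`, whereas a global at 0x800 needs an extra instruction to build the base address.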
The following block diagrams show the memory system of each SoC :

.. |up5kPerfDiagram| image:: doc/assets/up5kPerfDiagram.png
   :width: 400
.. |igloo2PerfDiagram| image:: doc/assets/igloo2PerfDiagram.png
   :width: 400

+--------------------+-----------------------+
| Up5kPerf           | Igloo2Perf            |
+====================+=======================+
| |up5kPerfDiagram|  | |igloo2PerfDiagram|   |
+--------------------+-----------------------+
The following block diagram shows the CPU produced by the VexRiscv configuration used in both Up5kPerf and Igloo2Perf :

.. image:: doc/assets/xPerfCpuDiagram.png
   :width: 800
Note that the double PC register between the pc-gen and the Fetch1 stage exists to produce a consistent iBus_cmd :

- A transaction on the iBus_cmd always stays unchanged until its completion
- One of these PC registers keeps the iBus_cmd address consistent
- The other PC register stores the value of a branch request while the Fetch1 stage is blocked
Claimed spec :

+--------------+--------------------+------------+
|              | Up5kPerf           | Igloo2Perf |
+==============+====================+============+
| Dhrystones/s | 65532              | 276695     |
+--------------+--------------------+------------+
| DMIPS/MHz    | 1.38               | 1.38       |
+--------------+--------------------+------------+
| Frequency    | 27 MHz             | 114 MHz    |
+--------------+--------------------+------------+
Note that without the lookup table divider optimisation, the performance of both SoCs drops to 1.27 DMIPS/MHz.
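The claimed figures can be cross-checked with a short calculation. The sketch below uses the conventional reference of 1757 Dhrystones/s per DMIPS (the VAX 11/780 score); the function names are illustrative only.

```python
DHRYSTONES_PER_DMIPS = 1757  # VAX 11/780 reference score


def dmips_per_mhz(dhrystones_per_second, frequency_mhz):
    """Convert a raw Dhrystone score into DMIPS/MHz."""
    return dhrystones_per_second / DHRYSTONES_PER_DMIPS / frequency_mhz


def cycles_per_iteration(dhrystones_per_second, frequency_mhz):
    """Average number of clock cycles spent per Dhrystone iteration."""
    return frequency_mhz * 1e6 / dhrystones_per_second


for name, score, mhz in [("Up5kPerf", 65532, 27), ("Igloo2Perf", 276695, 114)]:
    print(name,
          round(dmips_per_mhz(score, mhz), 2),
          round(cycles_per_iteration(score, mhz)))
# Up5kPerf 1.38 412
# Igloo2Perf 1.38 412
```

Both SoCs come out at the same DMIPS/MHz, which is expected since they share the same CPU configuration and differ only in clock frequency and memory architecture.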
Notes about the synthesis/place/route of the Igloo2PerfCreative :
- The maximal frequency can easily vary from 107 MHz to 121 MHz between two synthesis runs that differ only by their seed
- The critical combinatorial paths are dominated by routing delays (85% routing delay vs 15% cell delay)
- The synthesis was done without retiming, as it did not provide a visible frequency gain
Notes about the synthesis/place/route of the Up5kPerfEvn :
- Stressing the synthesis tool with deliberately over-constrained timing requirements really helps to get better final timings
================================================
Up5kArea
================================================

This SoC tries to use as few LCs as possible.
Characteristics of the VexRiscv configuration used :

- RV32I
- 2 stages : (Fetch_Decode), Execute
- Hazard resolution choices :

  - Single instruction scheduling (smallest)
  - Interlocked
  - Bypassed (faster)

- No branch prediction
- Iterative shifter, up to 31 cycles
- Uncached fetch/load/store buses
- No emulation
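The cost of the iterative shifter can be pictured with the following behavioural sketch. It assumes the shifter moves one bit per cycle, which is consistent with the 31-cycle maximum quoted above but is not confirmed by the source; the function name and the `kind` parameter are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def iterative_shift(value, amount, kind):
    """Shift a 32-bit value one bit per cycle.

    kind is 'sll', 'srl' or 'sra'. Returns (result, cycles).
    """
    value &= MASK32
    amount &= 31  # RV32I shifts use only the low 5 bits of the amount
    cycles = 0
    for _ in range(amount):
        if kind == "sll":
            value = (value << 1) & MASK32
        elif kind == "srl":
            value >>= 1
        elif kind == "sra":
            # Arithmetic shift: replicate the sign bit on each step.
            value = (value >> 1) | (value & 0x80000000)
        else:
            raise ValueError("unknown shift kind: " + kind)
        cycles += 1
    return value, cycles
```

Compared with the single-cycle barrel shifter of the Perf SoCs, this trades latency for area: only a one-bit shift and a counter are needed instead of a full 32-bit barrel structure.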
Comments about the design :

- It does not try to reach the absolute minimal LC usage, as it still keeps a traditional pipelined approach
- This design mainly tries to expand the usage scope of VexRiscv by reducing its LC usage
- It shows the LC occupancy of a regular 2 stages pipelined RISC-V CPU, which can serve as a baseline : reducing the area further would require "major" architecture changes
- VexRiscv was designed as a 5 stages CPU, but thanks to its dataflow hardware description paradigm, it was quite easy to retarget it into a 2 stages CPU
- The CPU boots from the SPI flash
- The instruction bus and the data bus share the same memory (64 KB SPRAM)
- This SPRAM memory is only used for the software application
- The SPI flash contains the following partitions :

  - 0x00000 => FPGA bitstream
  - 0x20000 => CPU bootloader, which copies the 0x30000 partition into the SPRAM
  - 0x30000 => Application that the CPU should run

- The SPI flash is programmed by using the FTDI and iceprog
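The flash layout and the boot copy described above can be summarised as follows. This is a rough model rather than the actual bootloader: the partition offsets and the 64 KB SPRAM size come from the text, while the function names and the `app_size` parameter are hypothetical, since the source does not say how the bootloader determines how many bytes to copy.

```python
# Partition start offsets in the SPI flash, from the list above.
FLASH_PARTITIONS = {
    "bitstream": 0x00000,
    "bootloader": 0x20000,
    "application": 0x30000,
}
SPRAM_SIZE = 64 * 1024  # 64 KB SPRAM shared by instructions and data


def partition_of(offset):
    """Return the name of the partition that contains flash `offset`."""
    name = None
    for candidate, start in sorted(FLASH_PARTITIONS.items(), key=lambda kv: kv[1]):
        if offset >= start:
            name = candidate
    return name


def boot_copy(flash, app_size):
    """Model the bootloader: copy the application partition into SPRAM.

    `app_size` is a hypothetical parameter of this sketch.
    """
    if app_size > SPRAM_SIZE:
        raise ValueError("application does not fit in the SPRAM")
    start = FLASH_PARTITIONS["application"]
    spram = bytearray(SPRAM_SIZE)
    spram[:app_size] = flash[start:start + app_size]
    return spram
```

A quick usage example: with `flash = bytearray(0x40000)` and a few application bytes written at offset 0x30000, `boot_copy(flash, 4)` returns a 64 KB image whose first four bytes are those application bytes.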
The following block diagram shows the memory system :

.. image:: doc/assets/up5kAreaDiagram.png
   :width: 400
The following block diagram shows the CPU produced by the VexRiscv configuration used (No args) :

.. image:: doc/assets/up5kAreaCpuDiagram.png
   :width: 400
Claimed spec of the Up5kArea :
+------------------+-----------------------------+------------------------+-------------------------
