Vectorforth
SIMD vectorized Forth compiler with CPU based shader application
Vectorforth is a Forth compiler that generates executable machine code in which every instruction is vectorized. For instance, compiling 1 2 + yields 3, but with AVX-512 you actually get 16 copies of the result 3, one per SIMD lane. The main toy application that I am interested in here is to see how fast your CPU can compute pixel shaders a la ShaderToy, so the code relies as much as possible on SIMD (i.e. vectorized) instructions and multithreading. Of course the CPU will not beat the GPU, but it was fun to test how large the gap would be.
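To make the "16 times the result" idea concrete, here is a minimal Python sketch of a lane-wise postfix evaluator. It is only a model of the behaviour described above, not vectorforth's implementation: the function name run and the constant LANES are hypothetical, and the lane count of 16 assumes AVX-512 with 32-bit floats.

```python
# Toy model: every stack entry is a whole SIMD register (one value per lane).
LANES = 16  # assumption: AVX-512 with 32-bit floats; AVX2 would give 8 lanes


def run(tokens, lanes=LANES):
    """Evaluate a tiny postfix program where each value is a list of `lanes` floats."""
    stack = []
    for tok in tokens.split():
        if tok == "+":
            b = stack.pop()
            a = stack.pop()
            # The addition is applied to all lanes at once.
            stack.append([x + y for x, y in zip(a, b)])
        else:
            # A literal is broadcast to every lane.
            stack.append([float(tok)] * lanes)
    return stack


result = run("1 2 +")
print(result[-1])  # one list holding the sum for every lane
```

In a shader, the lanes would not all hold the same number: each lane would typically carry the coordinates of a different pixel, so one instruction advances several pixels at once.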
As the compiler generates machine instructions directly, this will only work on Intel (x86-64) architectures, and not on ARM architectures.
Building
Vectorforth has been tested on Windows 10 using Visual Studio 2017/2019, on Ubuntu 18.04.4 with gcc 7.5.0, and on macOS 10.15.6 with Xcode 11.7. The recommended way to build is to use CMake to generate a solution file, a makefile, or an Xcode project.
Vectorforth uses two CMake variables: VECTORFORTH_AVX512 and VECTORFORTH_SINCOS_METHOD.
- If your computer supports it, switch on VECTORFORTH_AVX512, as it will perform faster than AVX2. If your computer does not support AVX-512 and you switched VECTORFORTH_AVX512 on, the application will crash. In that case, simply rebuild with VECTORFORTH_AVX512 switched off.
- Computation of sine and cosine can be (as good as) exact or approximate. This behaviour is controlled with the variable VECTORFORTH_SINCOS_METHOD. If you want to see the shader examples, approximate is a good choice as it is faster, and the loss of accuracy is not visible in the shader images. If you want the highest accuracy for sine and cosine, then choose vectorclass.
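Since enabling VECTORFORTH_AVX512 on an unsupported CPU leads to a crash, it can help to check for AVX-512 before building. The following is a small sketch for Linux only, assuming the CPU flags are listed in /proc/cpuinfo and that the avx512f flag (AVX-512 Foundation) is a reasonable indicator. The helper name has_avx512f is hypothetical and not part of vectorforth.

```python
from pathlib import Path


def has_avx512f(cpuinfo_text):
    """Return True if a 'flags' line of /proc/cpuinfo-style text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "avx512f" in flags:
                return True
    return False


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; absent on Windows and macOS
    if path.exists():
        print("AVX-512F available:", has_avx512f(path.read_text()))
    else:
        print("No /proc/cpuinfo found; check your CPU specification instead.")
```

If the check reports False, leave VECTORFORTH_AVX512 switched off and build for AVX2.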
Also of importance is your choice related to multithreading. This is controlled with the CMake variable JTK_THREADING. There are several choices here, but the best (fastest) choice is to use Intel's TBB library. This requires some work though, as you will need to install TBB yourself first.
On Windows you can download TBB's binaries from its website, and install them, preferably, in
folder C:\Program Files\TBB. Another folder is also possible, but then you'll need to
change the CMake variables TBB_INCLUDE_DIR and TBB_LIBRARIES and make them point to the correct location.
On Ubuntu you can simply run
sudo apt install libtbb-dev
to install TBB. On MacOS you can run
brew install tbb
If this fails with an error along the lines of Cannot write to /usr/local/Cellar, you can probably solve it by updating your write privileges on this folder with the command sudo chmod a+w /usr/local/Cellar, and then running brew again.
If you don't want to use/install TBB, you can set JTK_THREADING to std, which will use std::thread instead.
Other dependencies, which are delivered with the code, are
- SDL2 (https://www.libsdl.org/)
- Agner Fog's vectorclass (https://github.com/vectorclass/version2)
- ocornut's dear imgui (https://github.com/ocornut/imgui)
- nlohmann's json (https://github.com/nlohmann/json)
The shader
I've mainly used vectorforth as a test to see how a CPU performs when you let it do embarrassingly parallel GPU tasks, such as rendering a shader (watch here). The most basic shader application is sf, which only depends on TBB. The other application is shaderforth which, apart from TBB, also depends on SDL2 and imgui. Both sf and shaderforth have the same functionality, but sf is a command line tool, and shaderforth has a GUI.
To run a shader you have to provide the shader code in vectorforth. There are some example vectorforth shaders in subfolder shaderforth/examples. One of the examples that is good to start with (iq_tutorial.4th) is based on a tutorial by Inigo Quilez (see https://www.youtube.com/watch?v=0ifChJ0nJfM). The vectorforth code looks like:
: qx u 0.33 - ;
: qy v 0.7 - ;
: dist-to-center qx dup * qy dup * + sqrt ;
: r 0.2 0.1 qy qx atan2 10 * qx 20 * + 1 + cos * + ;
: factor r 0.01 r + dist-to-center smoothstep ;
: r2 0.01 120 qy * cos 0.002 * + -40 v * exp + ;
: factor2 1
1 r2 0.01 r2 + qx 0.25 2 qy * sin * - abs smoothstep -
1 0 0.1 qy smoothstep -
* - ;
factor factor2 *
dup dup
0.4 0.8 v sqrt mix * swap
0.1 0.3 v sqrt mix *
The shader generates the following image.
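For readers less used to postfix notation, here is an infix reading of the first few words (qx, qy, dist-to-center, r and factor) as a scalar Python sketch for a single pixel. It rests on assumptions that the source does not spell out: that u and v are normalized pixel coordinates, that smoothstep takes its arguments in the order edge0, edge1, x and behaves like the GLSL function, and that qy qx atan2 corresponds to atan2(qy, qx).

```python
import math


def smoothstep(edge0, edge1, x):
    """GLSL-style smoothstep; assumed to match vectorforth's smoothstep word."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def factor(u, v):
    """Infix reading of the words qx, qy, dist-to-center, r and factor."""
    qx = u - 0.33                       # : qx u 0.33 - ;
    qy = v - 0.7                        # : qy v 0.7 - ;
    dist_to_center = math.sqrt(qx * qx + qy * qy)
    # : r 0.2 0.1 qy qx atan2 10 * qx 20 * + 1 + cos * + ;
    r = 0.2 + 0.1 * math.cos(math.atan2(qy, qx) * 10 + qx * 20 + 1)
    # : factor r 0.01 r + dist-to-center smoothstep ;
    return smoothstep(r, r + 0.01, dist_to_center)


print(factor(0.33, 0.7))  # pixel at the centre of the shape
print(factor(1.0, 0.0))   # pixel far away from the shape
```

Under these assumptions, factor is 0 inside the wobbly circle around (0.33, 0.7) and 1 outside it, with a thin smooth transition of width 0.01 in between. In vectorforth the same computation runs for a whole register of pixels at once.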

I've also ported the tutorial by Inigo Quilez on the Happy Jumper shader (https://www.youtube.com/watch?v=Cfe5UQ-1L9Q) to vectorforth. Currently I've processed the first three steps in the tutorial, see scripts sphere.4th, guy.4th, and guy2.4th with corresponding images below. I'll probably not work through steps 4 and 5, as step 3 currently runs at 10 FPS on my pc, so steps 4 and 5 will be too slow on the CPU.

You can also watch the video on YouTube.
The other examples in subfolder shaderforth/examples are mainly taken from the website https://forthsalon.appspot.com/. The author and link to the original shader are always mentioned in the 4th script.
Memory
The memory used by vectorforth is subdivided into 4 batches of memory that each have their own name:
- the stack
- the return stack
- the binding space
- the data space or heap
The return stack corresponds to the C stack, which, in assembly, is controlled by the rsp register. The other memories are dynamically aligned on 32 bytes and they are sequential, like this:
[ ... <== stack | binding space | data space ==> ... ]
The stack is the workspace for vectorforth. Most operators pop or push values on this stack. The top of the stack resides at the end of the memory batch, as in the scheme above. Adding items on the stack moves the stack pointer to the left, i.e., the addresses have lower values in bytes.
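The downward growth of the stack can be pictured with a small Python model. This is a sketch under assumptions, not the actual runtime: the class DownwardStack, the example addresses, and the slot size of 32 bytes (taken from the 32-byte alignment mentioned above, i.e. one AVX2 register) are all chosen for illustration.

```python
SLOT_BYTES = 32  # assumption: one stack entry is one 32-byte (AVX2) register


class DownwardStack:
    """Toy model: the stack pointer starts at the end of the region and moves down."""

    def __init__(self, base_address, size_bytes):
        self.low = base_address
        self.sp = base_address + size_bytes  # empty stack: pointer at the end
        self.slots = {}

    def push(self, value):
        if self.sp - SLOT_BYTES < self.low:
            raise OverflowError("stack region exhausted")
        self.sp -= SLOT_BYTES  # pushing moves to a lower address
        self.slots[self.sp] = value

    def pop(self):
        value = self.slots.pop(self.sp)
        self.sp += SLOT_BYTES  # popping moves back to a higher address
        return value


stack = DownwardStack(base_address=0x1000, size_bytes=256)
stack.push("a")
stack.push("b")
print(hex(stack.sp))  # two slots below the end of the region
```

Growing the stack to the left while the data space grows to the right lets the two regions sit on either side of the fixed-size space between them.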
The data space is similar to the heap in C. With "allot", data space memory can be allocated, and values can be stored there for later reference.
The binding space is used for binding words to data space locations. This memory is controlled by the compiler. With the "create <name>" keyword, the "variable <name>" keyword, or the "value <name>" keyword, the compiler will bind a given name to a data space address. Internally a dictionary is kept that binds <name> to a location in the binding space. This binding space location will point to the address in the data space memory where the memory was allocated. This means that the number of named variables/words is restricted by the size of the binding space. For instance, if the binding space has a size of 2048 bytes, then 64 variables can be named (2048 / 32 = 64).
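The two-level lookup (name to binding slot, binding slot to data space address) and the resulting limit on named variables can be sketched in Python as follows. The class BindingSpace and its methods are hypothetical names for illustration only; the slot size of 32 bytes is taken from the example above.

```python
SLOT_BYTES = 32  # one binding slot, as in the 2048 / 32 example


class BindingSpace:
    """Toy model of the compiler-side dictionary plus the binding space slots."""

    def __init__(self, size_bytes):
        self.capacity = size_bytes // SLOT_BYTES
        self.names = {}   # compiler dictionary: name -> binding slot index
        self.slots = []   # binding slot -> address in the data space

    def bind(self, name, data_address):
        if len(self.slots) >= self.capacity:
            raise MemoryError("binding space is full")
        self.names[name] = len(self.slots)
        self.slots.append(data_address)

    def lookup(self, name):
        return self.slots[self.names[name]]


space = BindingSpace(size_bytes=2048)
space.bind("position", data_address=0x2000)
print(space.capacity, hex(space.lookup("position")))
```

Binding a 65th name in this model raises an error, which mirrors the restriction that the size of the binding space caps the number of named variables.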
Glossary
Core vectorforth
: ; Define a new word with the syntax : <word> <definition ...> ;. Defined words are always inlined.
@ ( #a -- v ) Read the value v at 64-bit memory address #a and put it on the stack.
! ( v #a -- ) Store value v in 64-bit memory address #a.
+! ( v #a -- ) Add value v to the value in 64-bit memory address #a and store the result in #a.
-! ( v #a -- ) Subtract value v from the value in 64-bit memory address #a and store the result in #a.
*! ( v #a -- ) Multiply the value in 64-bit memory address #a by v and store the result in #a.
/! ( v #a -- ) Divide the value in 64-bit memory address #a by v and store the result in #a.
( ( -- ) A multiline comment until the corresponding ).
\ ( -- ) A comment until the end of the line.
+ ( a b -- a+b ) Pop the top two values from the stack, and push their sum to the stack.
- ( a b -- a-b ) Pop the top two values from the stack, and push their difference to the stack.
* ( a b -- a*b ) Pop the top two values from the stack, and push their product to the stack.
/ ( a b -- a/b ) Pop the top two values from the stack, and push their quotient to the stack.
= ( a b -- v ) 0xffffffff if the top elements are equal, else 0x00000000.
<> ( a b -- v ) 0xffffffff if the top elements are not equal, else 0x00000000.
< ( a b -- v ) 0xffffffff if a < b, else 0x00000000.
> ( a b -- v ) 0xffffffff if a > b, else 0x00000000.
<= ( a b -- v ) 0xffffffff if a <= b, else 0x00000000.
>= ( a b -- v ) 0xffffffff if a >= b, else 0x00000000.
, ( v -- ) Moves the top stack element to the address pointed to by the data space pointer, and updates the data space pointer so that it points to the next available location.
abs ( a -- v ) Pop the top element of the stack, and put its absolute value on the stack.
add2 ( #a #b #r -- ) Adds two vec2 objects whose addresses are given by #a and #b. The result is saved at memory location #r.
add3 ( #a #b #r -- ) Adds two vec3 objects whose addresses are given by #a and #b. The result is saved at memory location #r.
allot ( #u -- #a ) Allocates #u bytes of memory on the data space (heap). #u should always be a multiple of 32 for correct alignment with the simd addresses. The address of the memory allocated is pushed on the stack.
and ( a b -- v ) bitwise and operator on a and b.
atan2 ( a b -- v ) Pop the top two elements of the stack and push the arctangent of a/b on the stack.
begin <test> while <loop> repeat While loop. The loop is repeated until all vectorized values in the test return false. So it is possible that a vectorized value already returns false in the test but still goes through the loop, because other vectorized values are still returning true.
ceil ( v -- v ) Round the top element of the stack upward.
cells ( #n -- 32*#
