# tensorForth
Forth does tensors, in CUDA.
tensorForth - lives in the GPU, does linear algebra and machine learning
- a Forth VM that supports tensor calculus and Convolutional Neural Networks, with dynamic parallelism in CUDA
## Status
|version|feature|stage|description|conceptual comparable|
|---|---|---|---|---|
|1.0|float|production|extended eForth with F32 float|Python|
|2.0|matrix|production|+ vector and matrix objects|NumPy|
|2.2|lapack|production|+ linear algebra methods|SciPy|
|3.0|CNN|beta|+ Machine Learning with autograd|Torch|
|3.2|GAN|alpha|+ Generative Adversarial Net|PyTorch.GAN|
|4.0|Transformer|developing|add Transformer ops|PyTorch.Transformer|
|4.2|Retentive|analyzing|add RetNet ops|PyTorch.RetNet|
## Why?
Compiled programs run fast on Linux, and the command-line interface and shell scripting tie them together in operation. With interactive development, small tools get built along the way, and productivity usually grows over time, especially in the hands of researchers.
> Niklaus Wirth: Algorithms + Data Structures = Programs
- Too much emphasis on algorithms - most modern languages, e.g. OOP, abstraction, templates, ...
- Too focused on data structures - APL, SQL, ...
NumPy solves both reasonably well, so for AI projects today we mostly use Python. However, once the GPU gets involved, enabling processing on a CUDA device, say with Numba or the like, usually means a behind-the-scenes 'just-in-time' transcoding to C/C++, followed by compilation, load, and run. In a sense, your Python code behaves like a Makefile, requiring compilers and a linker on the host box. The usual code-compile-run-debug cycle is especially counter-productive with ML's extra-long run stage.
The Forth language encourages an incremental build-test cycle. Having a 'shell' that resides in the GPU and can interactively, incrementally develop and run each AI layer/node as a small 'subroutine', without dropping back to the host system, might better support building a rapid and accurate system. Some might argue that this kind of CUDA kernel will kill the GPU with branch divergence. Yes indeed, and there is always room for improvement. However, the performance of the 'shell scripts' themselves is not really the point of discussion. So, here we are!
tensor + Forth = tensorForth!
## What?
More details to come, but here are some samples of tensorForth in action:
### Benchmarks (on MNIST)
|Different Neural Network Models|Different Gradient Descent Methods|
|---|---|
|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_model_cmp.png" width="600px" height="400px">|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_gradient_cmp.png" width="600px" height="400px">|

|2D Convolution vs Linear+BatchNorm|Effectiveness of Different Activations|
|---|---|
|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_cnv_vs_bn.png" width="600px" height="400px">|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_act_cmp.png" width="600px" height="400px">|

|Generative Adversarial Network (MNIST)|Generator & Discriminator Losses|
|---|---|
|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_l7_progress2.png" width="880px" height="400px">|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_l7_loss.png" width="300px" height="300px">|
## How?
The GPU behaves like a co-processor or a DSP chip. It has no OS, no string support, and manages its own memory. Most of the available libraries are built for the host rather than the device, i.e. to initiate calls from the CPU into the GPU but not the other way around. So, to be interactive, a memory manager, I/O, and syncing with the CPU all need to be built. It is pretty much like creating a Forth from scratch for a new processor, as in the old days.
Since GPUs have good compiler support nowadays, and I have ported the latest eForth to a lambda-based implementation in C++, pretty much all words can be carried over straightforwardly. However, with FP32 (float32) as my basic data unit, chosen so that it can later morph into FP16 or even fixed-point, some small details such as addressing and logic ops require attention.
The codebase will be in C, for my own understanding of the multi-trip data flows. In the future, the class/method implementations can move back into Forth in the form of loadable blocks, so that maintainability and extensibility can be exploited as in other self-hosting systems. It would be amusing to find someone brave enough to work NVVM IR or even PTX assembly into a Forth that resides on GPU micro-cores in the fashion of GreenArrays, or to forge an FPGA doing a similar kind of thing.
In the end, languages don't really matter; it's the problems they solve. Having an interactive Forth in the GPU does not mean a lot by itself. However, by adding vector, matrix, and linear-algebra support with a breath of APL's massive parallelism from GPUs, plus Neural Network tensor ops with backprop following the path from NumPy to PyTorch, and the cleanness of Forth, it can be useful one day, hopefully!
