# tensorForth
Forth does tensors, in CUDA.
tensorForth - lives in the GPU, does linear algebra and machine learning
- a Forth VM that supports tensor calculus and Convolutional Neural Networks, with dynamic parallelism in CUDA
## Status
|version|feature|stage|description|conceptual comparable|
|---|---|---|---|---|
|1.0|float|production|extended eForth with F32 float|Python|
|2.0|matrix|production|+ vector and matrix objects|NumPy|
|2.2|lapack|production|+ linear algebra methods|SciPy|
|3.0|CNN|beta|+ Machine Learning with autograd|Torch|
|3.2|GAN|alpha|+ Generative Adversarial Net|PyTorch.GAN|
|4.0|Transformer|developing|add Transformer ops|PyTorch.Transformer|
|4.2|Retentive|analyzing|add RetNet ops|PyTorch.RetNet|
## Why?
Compiled programs run fast on Linux, and the command-line interface and shell scripting tie them together in operation. With interactive development, small tools get built along the way, and productivity usually grows over time, especially in the hands of researchers.
> Niklaus Wirth: Algorithms + Data Structures = Programs
- Too much emphasis on algorithms - most modern languages, e.g. OOP, abstraction, templates, ...
- Too focused on data structures - APL, SQL, ...
NumPy solves both reasonably well, so for AI projects today we mostly use Python. However, once the GPU gets involved, enabling processing on a CUDA device, say with Numba or the like, usually means a behind-the-scenes 'just-in-time' transcoding to C/C++, followed by compilation, load, and run. In a sense, your Python code behaves like a Makefile, requiring compilers and a linker on the host box. The usual code-compile-run-debug cycle is especially counter-productive with ML's extra-long run stage.
The Forth language encourages an incremental build-test cycle. Having a 'shell' that resides in the GPU and can interactively, incrementally develop and run each AI layer/node as a small 'subroutine', without dropping back to the host system, might better support building a rapid and accurate system. Some might argue that this kind of CUDA kernel will kill the GPU with branch divergence. Yes indeed, and there is always room for improvement. However, the performance of the 'shell scripts' themselves is not really the point of discussion. So, here we are!
tensor + Forth = tensorForth!
## What?
More details to come, but here are some samples of tensorForth in action:
### Benchmarks (on MNIST)
|Different Neural Network Models|Different Gradient Descent Methods|
|---|---|
|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_model_cmp.png" width="600px" height="400px">|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_gradient_cmp.png" width="600px" height="400px">|

|2D Convolution vs Linear+BatchNorm|Effectiveness of Different Activations|
|---|---|
|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_cnv_vs_bn.png" width="600px" height="400px">|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_act_cmp.png" width="600px" height="400px">|

|Generative Adversarial Network (MNIST)|Generator & Discriminator Losses|
|---|---|
|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_l7_progress2.png" width="880px" height="400px">|<img src="https://raw.githubusercontent.com/chochain/tensorForth/master/docs/img/ten4_l7_loss.png" width="300px" height="300px">|
## How?
The GPU behaves like a co-processor or a DSP chip. It has no OS, no string support, and manages its own memory. Most of the available libraries are built for the host rather than the device, i.e. to initiate calls from the CPU into the GPU but not the other way around. So, to be interactive, a memory manager, I/O, and syncing with the CPU all need to be built. It is pretty much like creating a Forth from scratch for a new processor, as in the old days.
Since GPUs have good compiler support nowadays, and I have ported the latest eForth to a lambda-based implementation in C++, pretty much all words can be carried over straightforwardly. However, with FP32 (float32) as my basic data unit, chosen so that it can later morph into FP16 or even fixed-point, some small details such as addressing and logic ops require attention.
The codebase will be in C, for my own understanding of the multi-trip data flows. In the future, the class/method implementations can move back into Forth in the form of loadable blocks, so that maintainability and extensibility can be exploited as in other self-hosting systems. It would be amusing to find someone brave enough to work NVVM IR or even PTX assembly into a Forth that resides on GPU micro-cores in the fashion of GreenArrays, or to forge an FPGA doing a similar kind of thing.
In the end, languages don't really matter; it's the problems they solve. Having an interactive Forth in the GPU does not mean a lot by itself. However, by adding vector, matrix, and linear-algebra support with a breath of APL's massive parallelism from GPUs, plus Neural Network tensor ops with backprop following the path from NumPy to PyTorch, and the cleanness of Forth, it can be useful one day, hopefully!
