Squint
Squint: A peephole optimizer for stack VM compilers
Introduction (Updated June 2025)
Short summary: This repo contains a highly functional, but not quite complete, C compiler (mc.c) that supports JIT execution, ELF executable generation, dynamic linking, and peephole optimization. mc.c is a follow-on to the AMaCC compiler. See the AMaCC documentation referenced below for more information.
This compiler was developed on a Cortex-A72 Raspberry Pi computer running 32-bit Buster Linux, but it will also run in an aarch64 OS environment with an arm-linux-gnueabihf-gcc (cross) compiler installed. Finally, see the "Discussion" tab above for instructions on using this compiler with an ARM Chromebook.
A debugger has recently been added, as outlined below.
[!IMPORTANT]
The mc compiler (extended version of AMaCC compiler) is relatively bug free. That said, the optimizer that lives inside the compiler (squint) can be buggy, so I recommend running non-optimized, or using a commercial compiler, to compare.
Quick-Start
// bench.c
#include <stdio.h>

void bench(int *s1, int *s2)
{
    int sum1 = 0, sum2 = 0;
    for (int i=1000000000; i>0; --i) {
        sum1 += i % 3;
        sum2 += i / 333333333;
    }
    *s1 = sum1;
    *s2 = sum2;
}

int main()
{
    int sum1, sum2;
    bench(&sum1, &sum2);
    printf("%d %d\n", sum1, sum2);
    return 0;
}
% gcc -o mc mc.c -ldl # (Optional) Creates non-optimized compiler
% make mc-so # Creates optimizing compiler
%
% ./mc-so -o bench bench.c # Optimized compilation
% time ./bench
1000000000 1000000005
real 0m2.672s # <------- My compiler almost twice as fast as Gcc
user 0m2.672s
sys 0m0.001s
%
% gcc -mtune=cortex-a72 -O3 bench.c # Processor used for benchmarking
% time ./a.out
1000000000 1000000005
real 0m4.673s # <------- Optimized Gcc !!!
user 0m4.673s
sys 0m0.001s
%
Further Introduction
This compiler supports the following features beyond AMaCC:
- Float data type (AMaCC is an integer-only based compiler).
- Array declarations and initializers. (e.g. float foo[4][4] = { { 0.0, ... }, { 0.0 ... }, ... };)
- The Squint peephole optimizer that roughly halves the number of executable instructions in compiled code. The tests/sieve.c benchmark provides an optimization example, and runs roughly 6x faster after peephole optimization.
- Greatly improved type checking, error checking, and IR code generation (try -si option).
The MC C compiler found in this repository is a subset of the full MC HPC compiler which is being developed offline. That said, results from the full MC HPC compiler are shown below.
Source code size (public code version, this repository):
- mc C compiler -- 4150 SLOC
- squint optimizer -- 5300 SLOC
The original AMaCC compiler is based on the phenomenal work of the team at https://github.com/jserv/amacc, and I strongly suggest you visit that site to see the history of AMaCC compiler development as stored in that repository. It shows how small changes can be added to a base compiler, step-by-step, to create a truly marvelous educational compiler. The README there expands upon this README and is where you will learn the most about the AMaCC compiler that was used as a starting point for this work.
Compilation time and features
The following time command uses the mc compiler to JIT compile an optimized version of the mc compiler, using an external optimizer linked as a shared object library. That optimized JIT compiler then JIT compiles an optimized version of the Sieve of Eratosthenes benchmark (again using the shared object optimizer), and the sieve benchmark is JIT executed. The time shown covers everything in this paragraph, including sieving 8388608 values using three different algorithms applied to bit arrays.
$ make mc-so # use gcc to build an enhanced version of the mc compiler
CC+LD mc-so
$ time ./mc-so -DSQUINT_SO=1 mc.c tests/sieve.c
real 0m0.370s
user 0m0.369s
sys 0m0.001s
Performance vs Gcc
The gcc compiler options "-mfloat-abi=hard -mtune=cortex-a72 -lm" were included for all compiles.
Time to run executable
| Benchmark | Mc runtime | Mc+Squint | Gcc | Gcc -O1 | Gcc -O3 | Description |
| --- | --- | --- | --- | --- | --- | --- |
| sieve (int) | 1.697s | 0.266s | 0.655s | 0.266s | 0.272s | Eratosthenes |
| i2a2i (int) | 21.386s | 1.631s | 8.842s | 2.230s | 2.654s | int -> string -> int |
| sort_p (float/int) | 61.628s | 3.357s | 31.479s | 5.071s | 3.012s | sort positive float array |
| shock_struct_p (float) | 52.541s | 4.075s | 13.331s | 5.197s | 4.700s | shock tube |
| fannkuch_p 11 (int) | 89.227s | 7.379s | 19.249s | 6.551s | 8.612s | well known benchmark |

Notes: Tests in the repo were "scaled up" to run longer. Each time shown is the third best of 20 runs, to eliminate outliers.
The 4.075s time listed in the table for the shock benchmark is not a typo/transposition.
Time to compile the mc compiler
| Benchmark | Mc compile time | Mc+Squint time (optimized compiler) | Gcc -O3 time |
| --- | --- | --- | --- |
| mc.c | 0.034s | 0.239s | 7.057s |
[!TIP]
Mc+squint time found by "time ( ELF/mc-opt -Op -o mc mc.c && scripts/peep mc )"
Size of .text section, executed subset
| Benchmark | Mc .text size | Mc+Squint .text (optimized) | Gcc -O3 .text | Notes |
| --- | --- | --- | --- | --- |
| bezier.c | 3376 | 1016 | 768 | recursive |
| duff.c | 2972 | 504 | 412 | unusual |
| maze.c | 6568 | 2444 | 1752 | misc |
| shock.c | 7824 | 2096 | 2200 | floating point |
| mc.c | 203888 | 88176 | 62892 | full compiler |
Assembly language quality
Below is a comparison of assembly language quality of three compilers on the Raspberry Pi 4B when compiling tests/shock.c.
- GCC, specifically: "gcc -mfloat-abi=hard -mtune=cortex-a72 -O3 tests/shock.c -lm"
- The Squint compiler uses the compiler in this repository with the -Op option, followed by the Squint optimizer.
- The MC compiler is a non-public HPC version of Squint that I am working on offline.
For floating point, my HPC compiler is currently always faster than gcc with the above compiler options, by a minimum of 3%.
That said, make no mistake, my current optimizations are all crap** and yet I am still beating gcc. I am only one person with no resources, so I pick the path I see as most interesting and plod along at a snail's pace when I am not watching TV, playing video games, or reading/commenting on news articles.
If I had the resources to hire a small team and treat this as a real full time project, I could do much better! If you might be interested, let's talk.
** No common subexpression elimination, register renaming to reduce stalls, code motion to reduce stalls, or register coloring to reduce register pressure, etc. None of these things are hard to do, but all consume time to implement.
Below is a C function followed by the assembly language listing for the loop body, for all three compilers. The MC compiler creates only 3.05 assembly language instructions to represent each (complex) line of high level C code, on average:
void ComputeFaceInfo(int numFace, float *mass, float *momentum, float *energy,
                     float *f0, float *f1, float *f2)
{
    int i;
    int contributor;
    float ev;
    float cLocal;

    for (i = 0; i < numFace; ++i)
    {
        /* each face has an upwind and downwind element. */
        int upWind = i;       /* upwind element */
        int downWind = i + 1; /* downwind element */

        /* calculate face centered quantities */
        float massf = 0.5 * (mass[upWind] + mass[downWind]);
        float momentumf = 0.5 * (momentum[upWind] + momentum[downWind]);
        float energyf = 0.5 * (energy[upWind] + energy[downWind]);
        float pressuref = (gammaa - 1.0) *
                          (energyf - 0.5*momentumf*momentumf/massf);
        float c = sqrtf(gammaa*pressuref/massf);
        float v = momentumf/massf;

        /* Now that we have the wave speeds, we might want to */
        /* look for the max wave speed here, and update dt */
        /* appropriately right before leaving this function. */
        /* ... */

        /* OK, calculate face quantities */
        contributor = ((v >= 0.0) ? upWind : downWind);
        massf = mass[contributor];
        momentumf = momentum[contributor];
        energyf = energy[contributor];
        pressuref = energyf - 0.5*momentumf*momentumf/massf;
        ev = v*(gammaa - 1.0);
        f0[i] = ev*massf;
        f1[i] = ev*momentumf;
        f2[i] = ev*(energyf - pressuref);

        contributor = ((v + c >= 0.0) ? upWind : downWind);
        massf = mass[contributor];
        momentumf = momentum[contributor];
        energyf = energy[contributor];
        pressuref = (gammaa - 1.0)*(energyf - 0.5*momentumf*momentumf/massf);
        ev = 0.5*(v + c);
        cLocal = sqrtf(gammaa*pressuref/massf);
        f0[i] += ev*massf;
        f1[i] += ev*(momentumf + massf*cLocal);
        f2[i] += ev*(energyf + pressuref + momentumf*cLocal);

        contributor = ((v - c >= 0.0) ? upWind : downWind);
        massf = mass[contributor];
        momentumf = momentum[contributor];
        energyf = energy[contributor];
        pressuref = (gammaa - 1.0)*(energyf - 0.5*momentumf*momentumf/massf);
        ev = 0.5*(v - c);
        cLocal = sqrtf(gammaa*pressuref/massf);
        f0[i] += ev*massf;
        f1[i] += ev*(momentumf - massf*cLocal);
        f2[i] += ev*(energyf + pressuref - momentumf*cLocal);
    }
}
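The 3.05 figure appears to come from dividing the instruction count by the number of C statements in the loop body. That counting rule is an assumption on my part, not something stated in the source. Counting one "line" per statement (so the two-line `pressuref` initializer counts once), the loop body above has 8 + 9 + 10 + 10 = 37 statements. Using the per-iteration instruction counts listed in the table for MC (113) and Squint (119), this gives:

```latex
\frac{113\ \text{instructions}}{37\ \text{statements}} \approx 3.05
\qquad
\frac{119\ \text{instructions}}{37\ \text{statements}} \approx 3.22
```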
| gcc | Squint (this repo) | MC (private repo HPC compiler) |
| --- | --- | --- |
| 164++ instructions | 119 instructions | 113 instructions |
| ??? instructions/iter | 119 instructions/iter | 113 instructions/iter |
| 10d00: b 10e88 | 640: add r0, r5, r3, lsl #2 | 5f0: mov r0, #12 |
| 10d04: mov r0, r8 | 644: vldr s1, [r0] | 5f4: mla r0, r3, r0, r4 |
| 10d08:
