Ypsilon
R7RS/R6RS Scheme Implementation
- Conforms to R7RS/R6RS
- Mostly concurrent garbage collector optimized for multi-core processors
- Incremental native code generation with background compilation threads
- On-the-fly FFI with native stub code generation
- Full features of R6RS and the R6RS standard libraries, including:
  - arbitrary-precision integer arithmetic
  - rational numbers
  - exact and inexact complex numbers
  - top-level programs
  - proper tail recursion
  - call/cc and dynamic-wind
  - Unicode
  - bytevectors
  - records
  - exceptions and conditions
  - I/O
  - syntax-case
  - hashtables
  - enumerations
- Full features of R7RS-small and the following R7RS-large libraries:
  - (scheme comparator)
  - (scheme hash-table)
  - (scheme list)
  - (scheme sort)
See LICENSE for terms and conditions of use.
Performance
- Ypsilon 2.0.8 outperforms Guile 3.0.8 in the r7rs-benchmarks for 31 of 55 programs (WSL2, Ubuntu 20.04, Ryzen 3700X 3.6GHz). https://fujita-y.github.io/benchmarks/ubuntu-llvm12-ryzen3700x-ypsilon-208-guile-308.pdf
Documents
Requirements
- LLVM 10, 11, 12, 13, 14, 15 or 19
Installation
```
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
cmake --install .
```
- On macOS, you may want to use Homebrew to install LLVM 15.

```
brew update
brew install llvm
echo 'export PATH="/opt/homebrew/opt/llvm/bin:$PATH"' >> ~/.zshrc
```
Optional
- You may want to use rlwrap to make the REPL easier to use.

```
# Ubuntu 21.04
apt install rlwrap
# Homebrew on macOS
brew install rlwrap
```
Run
- To run the R7RS sample script, try the following from the project root:

```
ypsilon --r7rs test/r7rs-sample.scm
```

- To run the FFI demo programs, try the following from the project root:

```
ypsilon --r6rs --top-level-program demo/glut-demo.scm       # OpenGL 1.x, GLUT
ypsilon --r6rs --top-level-program demo/glfw-demo.scm       # OpenGL 1.x, GLFW
ypsilon --r6rs --top-level-program demo/glcorearb-demo.scm  # OpenGL Core Profile, GLFW
ypsilon --r6rs --top-level-program demo/freetype-demo.scm   # OpenGL Core Profile, GLFW, freetype
ypsilon --r6rs --top-level-program demo/widget-demo.scm     # OpenGL Core Profile, GLFW, freetype
```
Docker
- A container image is available on Docker Hub for amd64 and arm64.

```
$ docker run --rm -it fujitay/ypsilon:latest ypsilon --r7rs
ypsilon-2.0.8 (r7rs)
> (magnitude 3/22+2/11i)
5/22
> (exit)
```
Packages
Notes
- The REPL starts with `(import (scheme base) (scheme process-context))` or `(import (rnrs base) (rnrs programs))` depending on the command-line options specified.
- Without the `--top-level-program` option, the contents of the specified script file are interpreted as if they had been entered into the REPL.
Rebuild heap files
1. Open `src/core.h` and set `UNBOUND_GLOC_RETURN_UNSPEC` to `1`
2. `make`
3. Edit the .scm files in the `heap/boot` directory
4. `cd heap; make; cd ..; make; cd heap; make; cd ..`
5. Open `src/core.h` and set `UNBOUND_GLOC_RETURN_UNSPEC` to `0`
6. `make`
Versions
- Ypsilon version 2 has major design changes. The former version is available at version-0.9.
- This project was migrated from code.google.com/p/ypsilon
Behind the scenes: Concurrent Garbage Collector
Ypsilon implements a Concurrent Mark-Sweep (CMS) garbage collector designed for low-latency, real-time performance. It minimizes "stop-the-world" (STW) pauses by performing most of the marking and sweeping work concurrently with the mutator (the Scheme program).
Heap Architecture
The heap is organized into a hierarchy of structures to support efficient allocation and concurrent GC:
1. Concurrent Pool (concurrent_pool_t)
The top-level memory manager that allocates large chunks of memory from the OS. It manages the overall heap layout and provides memory for slabs and large objects.
2. Tag Directory
The first few slabs of the pool are reserved for a "tag directory." Each byte in this directory stores the attributes (tags) of a corresponding slab in the pool.
- `PTAG_FREE`: Slab is available for allocation.
- `PTAG_USED`: Slab is in use.
- `PTAG_SLAB`: Slab is managed by a `concurrent_slab_t`.
- `PTAG_GC`: Slab contains collectible objects managed by the GC.
- `PTAG_EXTENT`: Used for multi-slab (large) objects.
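The tag directory makes pointer-to-metadata lookups cheap: a slab's index in the pool doubles as its byte offset in the directory. Below is a minimal sketch, assuming a 4 KB slab size; the type and constant names mirror the description above, but the layout is illustrative, not Ypsilon's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative tag values mirroring the PTAG_* attributes described above.
enum ptag_t : uint8_t { PTAG_FREE, PTAG_USED, PTAG_SLAB, PTAG_GC, PTAG_EXTENT };

constexpr size_t SLAB_SIZE = 4096;  // assumption: 4 KB slabs

// Simplified stand-in for concurrent_pool_t: the first slab(s) hold one tag
// byte per slab, so byte i of the pool describes slab i.
struct concurrent_pool {
    std::vector<uint8_t> mem;
    explicit concurrent_pool(size_t n_slabs) : mem(n_slabs * SLAB_SIZE, PTAG_FREE) {
        mem[0] = PTAG_USED;  // slab 0 holds the tag directory itself
    }
    // O(1) pointer -> slab index, via pointer arithmetic on the pool base.
    size_t slab_index(const void* p) const {
        return size_t(static_cast<const uint8_t*>(p) - mem.data()) / SLAB_SIZE;
    }
    uint8_t tag(size_t i) const { return mem[i]; }
    void set_tag(size_t i, uint8_t t) { mem[i] = t; }
};
```

Given any interior pointer, the collector can classify the containing slab with one subtraction, one division, and one byte load.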
3. Concurrent Slab (concurrent_slab_t)
Small objects are allocated from "slabs" (typically 4KB or 8KB). Each slab contains multiple objects of a fixed size.
- Slab Traits: Metadata located at the end of each slab, containing free lists and cache references.
- Mark Bitmap: A bitmap stored within the slab (just before the traits) that tracks the mark state of each object. One bit per object: `0` for unmarked (white), `1` for marked (gray/black).
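The one-bit-per-object scheme can be sketched as a standalone bitmap (simplified; the real bitmap lives inside the slab just before the traits):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a per-slab mark bitmap: one bit per object slot.
// 0 = unmarked (white), 1 = marked (gray/black).
struct mark_bitmap {
    std::vector<uint64_t> words;
    explicit mark_bitmap(size_t n_objects) : words((n_objects + 63) / 64, 0) {}
    bool marked(size_t i) const { return (words[i >> 6] >> (i & 63)) & 1; }
    void mark(size_t i)        { words[i >> 6] |= uint64_t(1) << (i & 63); }
    // Clearing all bits returns every object to white at the start of a cycle.
    void clear() { std::fill(words.begin(), words.end(), 0); }
};
```

A slab of N fixed-size objects thus needs only N bits of mark metadata, which keeps the bitmap cache-friendly during sweeping.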
4. Object Layout
Ypsilon uses tagged pointers to distinguish between immediate values (like fixnums) and heap-allocated cells. The GC only tracks "collectible" pointers.
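As an illustration of pointer tagging (the bit assignment here is an assumption, not Ypsilon's actual encoding), a low bit of 1 can mark an immediate fixnum while aligned heap cells keep a low bit of 0:

```cpp
#include <cstdint>

using scm_obj = uintptr_t;  // hypothetical object representation

// Assumption: low bit 1 => immediate fixnum stored in the upper bits,
// low bit 0 => pointer to a heap-allocated cell.
inline scm_obj make_fixnum(intptr_t n) { return (uintptr_t(n) << 1) | 1; }
inline bool is_fixnum(scm_obj o) { return (o & 1) != 0; }
inline intptr_t fixnum_value(scm_obj o) { return intptr_t(o >> 1); }

// The GC traces only collectible (heap) pointers; immediates are skipped.
inline bool is_collectible(scm_obj o) { return (o & 1) == 0 && o != 0; }
```

Because immediates carry their value in the word itself, the marker can reject them with a single bit test before touching any heap metadata.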
GC Cycle and Phases
A typical GC cycle consists of the following phases:
1. Initiation
The GC is triggered when its allocation trip-byte threshold is reached. A dedicated collector thread manages the cycle.
2. Root Snapshotting (STW)
To start marking safely, the GC must capture the "roots" (global variables, thread stacks, etc.).
- It uses a "stop-the-world" pause to snapshot global roots.
- It resumes the mutator and performs concurrent marking.
- Later, it pauses again to snapshot stack-based local roots.
- This staggered approach minimizes the duration of each individual pause.
3. Concurrent Marking
The collector thread performs a Depth-First Search (DFS) of the object graph using a mark stack.
- Tri-color Marking:
  - White: Unmarked objects (bit `0`).
  - Gray: Marked (bit `1`), but children have not yet been traced. These objects are currently in the `mark_stack` or `shade_queue`.
  - Black: Marked (bit `1`) and children have been traced.
- Write Barrier (Dijkstra-style): When the mutator stores a pointer to an object `B` into object `A` while the GC is in the marking phase, a write barrier is triggered. If `B` is not yet marked, it is "shaded" (added to a `shade_queue`) to ensure it is not missed during concurrent marking.
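The barrier's effect can be sketched in a few lines (single-threaded and simplified; in Ypsilon the shade queue is a concurrent structure shared with the collector thread):

```cpp
#include <deque>

// Illustrative object with one pointer field and a mark flag.
struct obj { bool marked = false; obj* field = nullptr; };

struct gc_state {
    bool marking = false;          // true during the concurrent marking phase
    std::deque<obj*> shade_queue;  // gray objects handed to the collector
};

// Dijkstra-style write barrier: on `a.field = b`, shade the *new* target so
// the collector cannot miss it even if every other path to b is overwritten.
void write_barrier(gc_state& gc, obj& a, obj* b) {
    a.field = b;
    if (gc.marking && b != nullptr && !b->marked) {
        b->marked = true;             // white -> gray
        gc.shade_queue.push_back(b);  // collector traces b's children later
    }
}
```

This preserves the invariant that no black object ever points to a white one, which is what makes concurrent marking sound.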
4. Final Mark (STW)
A final STW phase ensures that all objects reachable via roots and the shade_queue are fully traced. Ypsilon includes "Ensured Real-time" logic: if the final mark takes too long, it can resume the mutator and return to concurrent marking to avoid long pauses.
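The budgeted-pause idea can be sketched as follows (the budget, batch size, and `mark_result` names are illustrative, and popping an entry stands in for tracing one object):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

enum class mark_result { finished, budget_exceeded };

// Drain the mark stack in small batches, but bail out back to concurrent
// marking if the STW pause budget is exhausted before the stack is empty.
mark_result final_mark(std::vector<int>& mark_stack,
                       std::chrono::microseconds budget, size_t batch = 64) {
    auto deadline = std::chrono::steady_clock::now() + budget;
    while (!mark_stack.empty()) {
        for (size_t i = 0; i < batch && !mark_stack.empty(); i++)
            mark_stack.pop_back();  // stand-in for tracing one object
        if (std::chrono::steady_clock::now() >= deadline)
            return mark_stack.empty() ? mark_result::finished
                                      : mark_result::budget_exceeded;
    }
    return mark_result::finished;
}
```

Checking the clock once per batch, rather than per object, keeps the bookkeeping overhead of the deadline negligible.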
5. Concurrent Sweeping
Once marking is complete, the GC begins reclaiming memory.
- Sweep Wavefront: The collector iterates through slabs, resetting free lists for unmarked objects. It maintains a "wavefront" pointer to track progress through the heap.
- Alloc Barrier: If the mutator allocates a new object ahead of the sweep wavefront, the object is immediately marked. This prevents it from being incorrectly reclaimed by the ongoing sweep.
- Read Barrier: Used for global sets (like the symbol table). If the mutator retrieves an object from such a set during the sweep phase, it is marked to ensure visibility.
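The interplay of wavefront and alloc barrier can be sketched with slot indices standing in for heap addresses (names and granularity are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Slots at or beyond the sweep wavefront have not been swept yet, so freshly
// allocated objects there must be marked immediately; slots behind it were
// already swept and need no protection.
struct sweeping_heap {
    std::vector<bool> marked;
    size_t wavefront = 0;  // next slot the sweeper will visit
    explicit sweeping_heap(size_t n) : marked(n, false) {}

    void allocate(size_t slot) {
        if (slot >= wavefront) marked[slot] = true;  // alloc barrier
    }
    void sweep_step() {  // reclaim one unmarked slot, advance the wavefront
        if (wavefront < marked.size()) {
            if (!marked[wavefront]) { /* return slot to the free list */ }
            wavefront++;
        }
    }
};
```

Without the barrier, an object allocated just ahead of the sweeper would still carry a white mark bit and be reclaimed moments after birth.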
Special Features
Weak References
Objects of type TC_WEAKMAPPING are handled specially. The collector only marks the value if the key is already marked. If the key is not marked by the end of the marking phase, the mapping is broken.
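A single-pass sketch of that rule (simplified: a real collector must iterate to a fixpoint, since marking one value can make another mapping's key reachable):

```cpp
#include <vector>

struct obj { bool marked = false; };

// Stand-in for a TC_WEAKMAPPING cell: the value is kept alive only through a
// live key; value == nullptr means the mapping has been broken.
struct weakmapping {
    obj* key;
    obj* value;
};

void process_weakmappings(std::vector<weakmapping>& maps) {
    for (auto& wm : maps) {
        if (wm.key != nullptr && wm.key->marked) {
            if (wm.value) wm.value->marked = true;  // key live -> keep value
        } else {
            wm.value = nullptr;                     // key dead -> break mapping
        }
    }
}
```

This is what lets weak tables drop entries automatically once nothing else references the key.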
Finalization
The GC supports object finalization. Objects requiring finalization are traced and, if found unreachable, custom finalizers are called.
Compaction
While the primary GC is mark-sweep, Ypsilon also includes a compactor (object_heap_compact.cpp) that can relocate objects to reduce fragmentation. This is used as needed during heap expansion or as a fallback when fragmentation becomes high.
Concurrency Control
The GC and mutator interact via highly optimized primitives:
- Shade Queue: A concurrent queue used by the write barrier to communicate new roots found by the mutator to the collector.
- Slab Locks: Each `concurrent_slab_t` has its own mutex to synchronize allocation and sweeping.
- Condition Variables: Used for coordination between the collector thread and mutator threads during STW phases.
Behind the scenes: Native Code Generation (JIT)
Ypsilon includes a Just-In-Time (JIT) compiler named Digamma that translates Scheme bytecode into high-performance native machine code using the LLVM backend.
Overview
The JIT compiler works by translating the Virtual Machine's instruction sequence (bytecode) of a closure into LLVM Intermediate Representation (IR). This IR is then optimized and compiled into native code by LLVM's ORCJIT execution engine.
Compilation Strategy
1. Concurrent Compilation
Native code generation is decoupled from the main VM execution thread.
- Codegen Thread: A dedicated thread (`codegen_thread` in `digamma.cpp`) waits for compilation requests.
- Queueing: When the VM encounters a closure that is a candidate for compilation (e.g., during on-demand execution or AOT pre-scanning), it pushes the closure onto a `codegen_queue`.
- Background Work: The codegen thread picks up closures from the queue, generates LLVM IR, applies optimizations, and registers the resulting native thunk with the closure.
2. Translation (VM to LLVM IR)
The `digamma_t::transform` method is the heart of the compiler. It performs a single-pass translation of VM instructions:
- Instruction Mapping: Each VM
