SkillAgentSearch skills...

SimpleHaskellStackVM

Simple demonstration stack based virtual machine written in Haskell

Install / Use

/learn @andrevdm/SimpleHaskellStackVM
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Creating a Stack Machine with Haskell

About the Project

This is a small demonstration project showing how a simple byte code interpreting stack machine (virtual machine) can be built with Haskell. It is not a production VM nor of any particular practical use but is rather a simple demonstration of how a stack machine can be built.

I built this for mainly as a project for learning Haskell, i.e. something a little bigger to work on. So NB this is probably not idiomatic Haskell, and may have some newbie mistakes. Hopefully it is interesting enough despite this...

Stack Machines

A stack machine is a real or emulated computer that uses a pushdown stack rather than individual machine registers to evaluate each sub-expression in the program. A stack computer is programmed with a reverse Polish notation instruction set. - wikipedia [1]

Stack machines are simpler to implement than register machines but are still practical even for production VMs.

Writing the VM in Haskell

Virtual machines are typically written in a low level language like C for maximum efficiency. They also are typically written using mutable data structures. Here I'm using Haskell and pure functional data structures. I have absolutely no performance goals, so there are no constraints I need to worry about.

A few design decisions

  1. I'm using a Data.Sequence rather than a list. While I'm not concerned about performance using a linked list to do random access is still a bad idea

  2. Try to avoid bottom (⊥), so use "safe" versions whenever ⊥ could otherwise be returned (toEnum, index, head etc)

  3. I'm using Stephen Diehl's Protolude [2] and removing the default prelude

Using the immutable data structures like Data.Sequence had worked out nicely and means that as the VM runs you can keep a full history of the VM's state. So you have everything (apart from time) you need to build a nice visualiser / 'time travelling' debugger.

The VM

Opcodes & byte code

The opcodes are the operations that the virtual CPU can perform. These are simple operations like push, pop, add and jump. The opcodes are encoded as bytes together with the opcode parameters (e.g. the value to push) this forms the byte code.

Here is the current set of opcodes understood by the CPU

data Operation = Nop       -- No operation
               | Halt      -- Stop CPU execution
               | Push      -- Push a value onto the stack
               | Pop       -- Pop the most recent value from the stack
               | PopPrev   -- Pop n values before the most recent value from the stack
               | Add       -- Add the top two items on the stack
               | Inc       -- Increment the top item on the stack
               | Dup       -- Duplicate the most recent item on the stack
               | Jmp       -- Jump unconditionally to a location
               | Bne       -- Pop the top two items, compare and branch if the values are not equal
               | Beq       -- Pop the top two items, compare and branch if the values are equal
               | Bgt       -- Pop the top two items, compare and branch if value 1 is greater than value 2
               | Bgte      -- Pop the top two items, compare and branch if value 1 is greater or equal to value 2
               | Blt       -- Pop the top two items, compare and branch if value 1 is less than value 2
               | Blte      -- Pop the top two items, compare and branch if value 1 is less than or equal to value 2
               | Call      -- Call a function
               | Ret       -- Return from a function
               | LdArg     -- Push local value n onto the stack
               deriving (Show, Eq, Ord, Enum, Bounded)

The virtual CPU

The data structure representing the CPU is defined below

data Cpu = Cpu { ip :: Int               -- Instruction pointer
               , fp :: Int               -- Frame pointer
               , cpuStack :: S.Seq Int   -- The stack
               , cpuGlobals :: S.Seq Int -- Gloal variables
               , ranOp :: Int            -- The last opcode that was executed
               , state :: Text           -- Debugging message
               , debug :: Bool           -- Enable/disable debugging
               , panic :: Bool           -- Is the CPU in a faulted state
               }
         deriving (Show, Eq)

-- | Default empty/initial CPU state
emptyCpu :: Cpu
emptyCpu = Cpu { ip = -1
               , fp = -1
               , cpuStack = S.empty
               , cpuGlobals = S.empty
               , state = ""
               , debug = True
               , ranOp = 0
               , panic = False 
               }

This is fairly minimal but it’s more than enough for a simple VM like this.

Some things to note

  • The stack (cpuStack) is part of the CPU. This makes sense for a stack machine since a stack is core to everything it does. It also means that as the CPU runs you get a full history of each stack state along with the CPU flags at the time each opcode was run
  • There is no need for a stack pointer since the stack is a 'dynamically' growing Data.Sequence. I.e. sp always points to the head of cpuStack.
  • The instruction pointer (ip) points to the next instruction to run.
  • In this implementation the byte stream is fixed (no self-modifying code), so there is no need to copy it on each CPU operation
  • The frame pointer (fp) is discussed below in the section about function calls (Call & Ret)

The byte code assembler

Rather than writing the byte code in hex a very simple assembler is used. Later on additional assemblers and compliers can be layer on top of this low level assembler which does little more than convert opcode mnemonics to byte code.

An operation can take parameters from the byte code stream. For instance a push instruction takes a parameter that is the value to push onto the stack. The type that represents this is the inventively named OpAndParam

-- | A single CPU operator and its parameters
type OpAndParam = (Operation, [Int])

Having a type that defines the number of parameters an op takes and how many values it pops off the stack allows the assembler to perform some basic checks. The interpreter can also use this definition when running the code.

-- | Configuration for an operation
-- | opParamCount = number of paramaters taken from the code stream
-- | opPopsCount = number of values this op pops from the stack
-- | opSimple = determines if the op needs full access to cpu state to change things like the fp and ip
-- |            note that 'complex' instructions do not need to honour opParamCount and opPopsCount
-- |            e.g. a 'ret' instruction pops a variable number of parameters
data Instruction = Instruction { opCode :: Operation
                               , opPopsCount :: Int
                               , opParamCount :: Int
                               , opSimple :: Bool
                               }
                 deriving (Show, Eq)

The instructions can then be setup and a map created from opcode to Instruction

-- | Config for the op codes
instructions :: [Instruction]
instructions = [ Instruction { opCode = Nop, opParamCount = 0, opPopsCount = 0, opSimple = True }
                -- ...
               , Instruction { opCode = Call, opParamCount = 1, opPopsCount = 0, opSimple = False }
               ]

-- | Instructions indexed by opcode
instrByOp :: Map.Map Operation Instruction
instrByOp = Map.fromList $ map (\i -> (opCode i, i)) instructions

The assembler then does nothing more than converting the opcode enum to a byte (an Int in the code but it would be serialised as a byte) checking the number of parameters for each opcode. It is small enough to be pasted in full here

-- | A single assembler error
data AssemblerError = AssemblerError Integer Operation Text deriving (Show, Eq)

-- | Compiles the list to byte code
-- | Returns as many errors as possible rather than just first error
assembleByteCode :: [(Operation, [Int])] -> Either [AssemblerError] [Int]	
assembleByteCode code =
  let res = foldl assemble [] code in
  case lefts res of
    [] -> Right $ concat $ rights res
    errors -> Left errors
  
  where
    assemble :: [Either AssemblerError [Int]] -> OpAndParam -> [Either AssemblerError [Int]]
    assemble res (op, prms) =
      res ++ case Map.lookup op instrByOp of
               Nothing -> [Left $ AssemblerError (toInteger $ length res) op "unknown op code"]
               Just i ->
                 if opParamCount i == length prms
                 then [Right $ fromEnum (opCode i) : prms]
                 else [Left $ AssemblerError (toInteger $ length res) op "incorrect number of parameters"]

The interpreter

With all of that in place the interpreter can finally be written.

-- | Interpreter for the byte code
-- | Given a byte code stream will 'run' the code
-- | If debug is enabled then the full history (all states) will be returned. 
interpretByteCode :: S.Seq Int -> [Cpu]

The interpreter takes the output of the assembler or bytes loaded from a file and runs it producing CPU state along the way. In debug mode the interpreter stores all the states as it interprets, in "non-debug" mode only the last state is kept. Note that currently the interpreter is always in debug mode

Before looking at the implementation of the interpreter its worth going over a few examples of how it should operate

Push & Pop
                                                  +--------+
                          +--------+              |  123   |
   +--------+             |  123   |              |  456   |
   +--------+    push 123 +--------+    push 456  +--------+

In the first example the value 123 is pushed

Related Skills

View on GitHub
GitHub Stars42
CategoryDevelopment
Updated1y ago
Forks4

Languages

Haskell

Security Score

75/100

Audited on Aug 27, 2024

No findings