Introduction
Inspired by virtual machines (VMs) like WebAssembly and the Ethereum Virtual Machine, I set out to challenge myself by crafting a simplified version. In this post, I share the process of implementing a VM in Go that operates on compiled bytecode and handles basic arithmetic operations.
Architectural Design
Stack Machine or Register Machine?
When exploring the core architecture of VMs, two primary models stand out: stack machine and register machine.
A stack machine, such as WebAssembly, operates on a last-in, first-out (LIFO) basis. Instructions in this model primarily manipulate data at the top of the stack. This model is often used in implementation of VM due to its simplicity and ease of implementation.
A register machine akin to modern CPU designs, which use a set of registers (backed by hardware) that instructions can directly manipulate. This model can potentially enhance execution speed by reducing the overhead of repetitive data movement and often allows for more sophisticated optimizations during code compilation.
I’ve chosen to go with a stack machine architecture. This decision was driven by the model’s inherent simplicity and the straightforward nature of its execution flow, which significantly eases both the development and debugging processes.
Word Size
A word represents a fixed-sized data unit that the processor’s instruction set or hardware handles collectively.
Choosing a larger word size, such as 64-bit, allows the VM to process more data per instruction. This capability enables more efficient management of complex data types and operations with fewer instructions. Although these benefits are clear, it’s crucial to note that the actual performance improvements can vary, depending on the underlying system’s architecture and the VM’s ability to utilize it effectively.
I’ve chosen a 64-bit architecture for the VM, which means that each instruction can process data up to 8 bytes in size.
Memory
A byte-addressable memory model has been chosen for the VM. This model allows each byte of memory to be individually addressed, enhancing the system’s flexibility when managing different data size.
Implementation
Opcode
Opcodes (operation codes) are fundamental elements that define the set of operations the VM can perform. Each opcode corresponds to a specific operation, ranging from simple data manipulation to complex arithmetic functions.
Importantly, opcodes operate on raw bytes, without any awareness of higher-level data types such as strings or signed integers. This approach highlights the VM’s role as a flexible, low-level execution environment, handling data purely at the byte level.
Below are the opcodes supported by the VM.
const (
// POP is used to discard the top value from the stack and decrement the stack pointer.
POP Opcode = iota
// PUSH1 is used to place a 1 byte data onto the stack.
PUSH1
// PUSH8 is used to place a 8 byte (a word) data onto the stack.
PUSH8
// LOAD is used to load 8 byte (a word) data from memory into the stack at a specified offset using the top value of the stack
LOAD8
// ADD is used to add two operands.
ADD
// SUB subtracts the top value of the stack from the second top value of the stack.
SUB
// MUL multiplies the top two values on the stack.
MUL
// DIV divides the top value of the stack by the second top value of the stack, errors on division by zero.
DIV
// STORE1 stores a 1-byte value into memory at a specified offset, using the top value of the stack as the offset and the second top value as the data to store
STORE1
// STORE8 stores an 8-byte value (a word) into memory at a specified offset, using the top value of the stack as the offset and the second top value as the data to store
STORE8
// RETURN is used to exit the execution and return a block of memory.
// It pops the size and offset from the stack and returns the specified block from memory.
RETURN
)
This definition limits each opcode to one byte, allowing the VM to support up to 256 distinct opcodes.
Opcode Functions
Each opcode has a corresponding function of type opFunc
, which operates on the VM’s state to perform its designated task. Below are examples for PUSH8
and ADD
:
// standardize type for opcode function in vm
type (
opFunc func(v *vm) ([]byte, error)
)
func opPush8(v *vm) ([]byte, error) {
value := binary.BigEndian.Uint64(v.bytecode[v.pc : v.pc+8])
v.stack[v.sp] = value
v.pc += 8
v.sp++
if err := v.checkStackOverflow(); err != nil {
return nil, err
}
return nil, nil
}
func opAdd(v *vm) ([]byte, error) {
if err := v.checkStackUnderflow(uint64(1)); err != nil {
return nil, err
}
a := v.stack[v.sp-1]
b := v.stack[v.sp-2]
v.sp -= 2
v.stack[v.sp] = a + b
v.sp++
if err := v.checkStackOverflow(); err != nil {
return nil, err
}
return nil, nil
}
Virtual Machine
Next, we will define our VM as a Go’s struct
type.
type JumpTable [1 << 16]opFunc
type vm struct {
stack []uint64 // stack
pc uint64 // program counter
sp uint64 // stack pointer
bytecode []byte // compiled bytecode
memory []byte // memory
jumpTable JumpTable
}
stack []uint64
- Represents the stack used by the VM
- Each element in the stack is 64 bits in size (
uint64
)
pc uint64
- Stands for the program counter
- Indicates the current position of the VM within the bytecode
sp uint64
- Represents the stack pointer
- Points to the top of the stack, indicating the next available location for pushing data
- Changes dynamically as the VM executes instructions that manipulate the stack
bytecode []byte
- Holds the compiled bytecode of the program
- Bytecode consists of low-level instructions that the VM executes
memory []byte
- Represents the memory space available to the VM
- Used for storing data required during program execution, such as variables and constants
jumpTable JumpTable
- A mapping table for each opcode to its corresponding opcode function (
opFunc
) - During execution, the VM uses this table to efficiently look up the operation function associated with each opcode in the bytecode
- A mapping table for each opcode to its corresponding opcode function (
Stack Operation
The crucial component for stack operations is the stack itself and its pointer, commonly referred to as the Stack Pointer (SP). The stack pointer is an internal pointer that keeps track of the top of the stack, which is the location for the next item to be placed on.
Each time an item is pushed onto the stack, the stack pointer increments; conversely, it decrements every time an item is popped off.
For Example: Imagine a scenario in a VM where the stack starts empty and the following operations (opcodes) occur:
- PUSH8 0x3 - places the value 3 at the current stack pointer location and increments the pointer.
- PUSH8 0x8 - places the value 8 at the new stack pointer location, incrementing the pointer again.
- ADD - pops the top two values, adds them, and pushes the result back onto the stack, adjusting the stack pointer appropriately.
Here’s how the stack and the stack pointer change with each operation:
- Initial State: Stack is empty, SP = 0
- After PUSH8 0x3: Stack = [3], SP = 1
- After PUSH8 0x8: Stack = [3, 8], SP = 2
- After ADD: Stack = [11], SP = 1
Compiling Instruction into Byte Code
Instructions for the VM will be specified in the following string format
// This is comment line
PUSH8 0x48656C6C6F20576F
PUSH1 0x00
STORE8
PUSH8 0x726C642100000000
PUSH1 0x08
STORE8
PUSH1 0x0C
PUSH1 0x00
RETURN
Each line represents a single instruction using the defined opcode
. Only PUSH1
and PUSH8
require an operand for the data to be pushed into the stack in hexadecimal representation.
Any line that starts with //
or is empty will be ignored by the compiler.
Following the the compiled bytecode in hexadecimal representation.
02 48 65 6C 6C 6F 20 57 6F 01 00 09 02 72 6C 64 21 00 00 00 00 01 08 09 01 0C 01 00 0A
Which can be mapped into the instruction as followed
// This is comment line
PUSH8 0x48656C6C6F20576F - 02 48 65 6C 6C 6F 20 57 6F
PUSH1 0x00 - 01 00
STORE8 - 09
PUSH8 0x726C642100000000 - 02 72 6C 64 21 00 00 00 00
PUSH1 0x08 - 01 08
STORE8 - 09
PUSH1 0x0C - 01 0C
PUSH1 0x00 - 01 00
RETURN - 00 0A
Execution of Byte Code
Program Counter (PC) is used to track the execution of bytecode instructions. The PC points to the address of the current instruction that the VM is executing.
As each instruction is processed, the PC is incremented to point to the next instruction, ensuring that operations are carried out in the correct sequence.
02 48 65 6C 6C 6F 20 57 6F 01 00 09 02 72 6C 64 21 00 00 00 00 01 08 09 01 0C 01 00 0A
Here’s how the Program Counter (PC) operates through the sequence of bytecode from earlier example:
- PC = 0:
- Reads opcode
02
(PUSH8), indicating the operation to push 8 bytes into the stack. - Increment PC by 1 to start reading data for this opcode.
- Data read:
48 65 6C 6C 6F 20 57 6F
from PC=1 to PC=8. - Update PC by 8 after reading 8 bytes of data, now PC = 9.
- Reads opcode
- PC = 9:
- Reads opcode
01
(PUSH1), indicating the operation to push 1 byte into the stack. - Increment PC by 1 to read the data for this opcode.
- Data read:
00
at PC=10. - Update PC by 1 after reading the byte, now PC = 11.
- Reads opcode
- PC = 11:
- Reads opcode
09
(STORE8), indicating the operation to pop items from the stack and store them. - Increment PC by 1 to move to the next part of the instruction or the next opcode.
- Reads opcode
- Continues through the sequence: As each opcode is processed, the PC is incremented by 1 to read the opcode, and then further incremented as per the number of bytes the opcode processes. This ensures each instruction is executed in the correct order.
Memory Operation
In the VM, the memory []byte
field is a byte array that represents the VM’s memory space, designed to store data throughout the lifecycle of the VM’s operation.
Using the following example, let’s visualize the changes to the VM’s memory:
PUSH8 0x48656C6C6F20576F // Push "Hello Wo" onto the stack
PUSH1 0x08 // Push the memory offset 8 onto the stack
STORE8 // Store 8 bytes at memory offset 8
PUSH8 0x48656C6C6F20576F
- This instruction pushes the 8-byte value corresponding to the ASCII string “Hello Wo” onto the stack. The hexadecimal
0x48656C6C6F20576F
directly translates to the string “Hello Wo”. - Effect on Stack:
- Stack before operation:
[]
- Stack after operation:
[0x48656C6C6F20576F]
- The stack now contains one item, the 8-byte string “Hello Wo”.
- Stack before operation:
- This instruction pushes the 8-byte value corresponding to the ASCII string “Hello Wo” onto the stack. The hexadecimal
PUSH1 0x08
- This instruction pushes a 1-byte value
0x08
onto the stack. In this context,0x08
represents the memory offset where the previously pushed data (“Hello Wo”) will be stored. - Effect on Stack:
- Stack before operation:
[0x48656C6C6F20576F]
- Stack after operation:
[0x48656C6C6F20576F, 0x08]
- The stack now has two items: the “Hello Wo” data and the memory offset
0x08
.
- Stack before operation:
- This instruction pushes a 1-byte value
STORE8
- This instruction takes the two items from the top of the stack: the 8-byte data (“Hello Wo”) and the 1-byte memory offset (
0x08
). It stores the 8-byte data at the specified memory offset in the VM’s memory. - Effect on Memory and Stack:
- Memory before operation:
[00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...]
- Memory after operation:
[00 00 00 00 00 00 00 00 48 65 6C 6C 6F 20 57 6F ...]
- The bytes
48 65 6C 6C 6F 20 57 6F
(representing “Hello Wo”) are stored starting at memory offset0x08
. - Stack after operation:
[]
- The stack is cleared of both items used by the
STORE8
operation.
- Memory before operation:
- This instruction takes the two items from the top of the stack: the 8-byte data (“Hello Wo”) and the 1-byte memory offset (
Program Output
The opcode RETURN
provides a mechanism for returning data from the VM’s memory to the calling code . It reads the specified memory offset and size from the stack and returns the data as a byte array.
Let’s consider the sequence in the provided example, culminating in the use of the RETURN
opcode:
PUSH8 0x48656C6C6F20576F
PUSH1 0x00
STORE8
PUSH8 0x726C642100000000
PUSH1 0x08
STORE8
// Memory at this point
// [48 65 6C 6C 6F 20 57 6F 72 6C 64 21 00 00 00 00 00 ...]
PUSH1 0x0C // Pushes the size of the data to return
PUSH1 0x00 // Pushes the starting memory offset for the return
RETURN
PUSH1 0x0C
- This pushes
0x0C
onto the stack, which represents the size of the data (12 bytes) to be returned. This size indicates how many bytes following the specified offset should be returned.
- This pushes
PUSH1 0x00
- This pushes
0x00
onto the stack, indicating the memory offset from which the return should begin.
- This pushes
RETURN
- This opcode functions by retrieving the offset (
0x00
) and size (0x0C
) from the stack. It then accesses the VM’s memory starting from this offset and reads the specified number of bytes to form the returned byte array. - The returned data would consist of the first 12 bytes of memory, corresponding to the sequence
[48 65 6C 6C 6F 20 57 6F 72 6C 64 21]
, which is the ASCII representation of “Hello World!”.
- This opcode functions by retrieving the offset (
Running the Virtual Machine
The source code for the virtual machine (VM) is available at GitHub - PhakornKiong/go-vm.
To operate the VM and execute instructions, use the command go run main.go -i examples/addition
to run an example instruction set from the examples
folder.
For output customization, the -t
flag allows you to specify the output data type, determining how the results are formatted and displayed.