WASM-PVM: WebAssembly to PolkaVM Recompiler

A Rust compiler that translates WebAssembly (WASM) bytecode into PolkaVM (PVM) bytecode for execution on the JAM (Join-Accumulate Machine) protocol. Write your JAM programs in AssemblyScript (TypeScript-like), hand-written WAT, or any language that compiles to WASM — and run them on PVM.

WASM  ──►  LLVM IR  ──►  PVM bytecode  ──►  JAM program (.jam)
      inkwell    mem2reg       Rust backend

Key Features

  • Multi-language input: AssemblyScript, hand-written WAT, or any WASM-targeting language
  • LLVM-powered: Uses inkwell (LLVM 18 bindings) for IR generation and optimization
  • No unsafe code: deny(unsafe_code) enforced at workspace level
  • Toggleable optimizations: Every non-trivial optimization can be individually disabled via CLI flags
  • Comprehensive test suite: 800+ tests across unit, integration, differential, and PVM-in-PVM layers

Supported WASM Features

Category                Operations
Arithmetic (i32 & i64)  add, sub, mul, div_u/s, rem_u/s, all comparisons, clz, ctz, popcnt, rotl, rotr, bitwise ops
Control flow            block, loop, if/else, br, br_if, br_table, return, unreachable, block results
Memory                  load/store (all widths), memory.size, memory.grow, memory.fill, memory.copy, globals, data sections
Functions               call, call_indirect (with signature validation), recursion, stack overflow detection
Type conversions        wrap, extend_s/u, sign extensions (i32/i64 extend8/16/32_s)
Imports                 Text-based import maps and WAT adapter files

Not supported: floating point (by design — PVM has no FP instructions).

Project Structure

crates/
  wasm-pvm/              # Core compiler library
    src/
      llvm_frontend/     # WASM → LLVM IR translation
      llvm_backend/      # LLVM IR → PVM bytecode lowering
      translate/         # Compilation orchestration & SPI assembly
      pvm/               # PVM instruction definitions & peephole optimizer
  wasm-pvm-cli/          # Command-line interface
tests/                   # Integration tests (TypeScript/Bun)
  fixtures/
    wat/                 # WAT test programs
    assembly/            # AssemblyScript examples
    imports/             # Import maps & adapter files
vendor/
  anan-as/               # PVM interpreter (submodule)

Resources

  • PVM Debugger — upload .jam files for disassembly, step-by-step execution, and register/gas inspection
  • PVM Decompiler — decompile PVM bytecode back to human-readable form
  • ananas (anan-as) — PVM interpreter written in AssemblyScript, compiled to PVM itself for PVM-in-PVM execution
  • as-lan — example AssemblyScript project compiled from WASM to PVM
  • JAM Gray Paper — the JAM protocol specification (PVM is defined in Appendix A)
  • AssemblyScript — TypeScript-like language that compiles to WASM

Getting Started

Prerequisites

  • Rust (stable, edition 2024)
  • LLVM 18 — the compiler uses inkwell (LLVM 18 bindings)
    • macOS: brew install llvm@18 then export LLVM_SYS_181_PREFIX=/opt/homebrew/opt/llvm@18
    • Ubuntu: apt install llvm-18-dev
  • Bun (for running integration tests and the JAM runner) — bun.sh

Build

git clone https://github.com/tomusdrw/wasm-pvm.git
cd wasm-pvm
cargo build --release

Hello World: Compile & Run

Create a simple WAT program that adds two numbers:

;; add.wat
(module
  (memory 1)
  (func (export "main") (param $args_ptr i32) (param $args_len i32) (result i64)
    ;; Read two i32 args, add them, write result to memory
    (i32.store (i32.const 0)
      (i32.add
        (i32.load (local.get $args_ptr))
        (i32.load (i32.add (local.get $args_ptr) (i32.const 4)))))
    (i64.const 17179869184)))  ;; packed ptr=0, len=4

Compile it to a JAM blob and run it:

# Compile WAT → JAM
cargo run -p wasm-pvm-cli -- compile add.wat -o add.jam

# Run with two u32 arguments: 5 and 7 (little-endian hex)
npx @fluffylabs/anan-as run add.jam 0500000007000000
# Output: 0c000000  (12 in little-endian)
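The argument and result hex strings are plain little-endian byte dumps. A quick sketch (plain Rust, not part of the toolchain) of that encoding:

```rust
// Each u32 becomes 4 little-endian bytes, printed as lowercase hex.
fn to_le_hex(values: &[u32]) -> String {
    values
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .map(|b| format!("{:02x}", b))
        .collect()
}

fn main() {
    assert_eq!(to_le_hex(&[5, 7]), "0500000007000000"); // the input arguments
    assert_eq!(to_le_hex(&[12]), "0c000000");           // 5 + 7 = 12, the output
    println!("{}", to_le_hex(&[5, 7]));
}
```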

Inspect the Output

Upload the resulting .jam file to the PVM Debugger for step-by-step execution, disassembly, register inspection, and gas metering visualization.

AssemblyScript Example

You can also write programs in AssemblyScript:

// fibonacci.ts
export function main(args_ptr: i32, args_len: i32): i64 {
  const buf = heap.alloc(256);
  let n = load<i32>(args_ptr);
  let a: i32 = 0;
  let b: i32 = 1;

  while (n > 0) {
    b = a + b;
    a = b - a;
    n = n - 1;
  }

  store<i32>(buf, a);
  return (buf as i64) | ((4 as i64) << 32);  // packed ptr + len
}

Compile via the AssemblyScript compiler to WASM, then use wasm-pvm-cli to produce a JAM blob. See the tests/fixtures/assembly/ directory for more examples.

Using as a Library

You can use wasm-pvm as a Rust dependency in two modes:

Full compiler (default)

Requires LLVM 18 installed on the system.

[dependencies]
wasm-pvm = "0.5.2"

This gives you access to the full compiler pipeline (compile(), compile_with_options()) plus all PVM types.

PVM types only

No LLVM dependency — compiles to any target including wasm32-unknown-unknown.

[dependencies]
wasm-pvm = { version = "0.5.2", default-features = false }

Available types: Instruction, Opcode, ProgramBlob, SpiProgram, abi::*, memory_layout::*, and Error. This is useful for PVM interpreters, debuggers, and bytecode analyzers that don’t need the WASM compiler.

Entry Function ABI

All entry functions must use the signature main(args_ptr: i32, args_len: i32) -> i64. The i64 return value packs a result pointer (lower 32 bits) and result length (upper 32 bits). The compiler unpacks this into PVM’s SPI convention (r7 = start address, r8 = end address).

For WAT programs, the common “return 4 bytes at address 0” constant is (i64.const 17179869184) (= 4 << 32).

For AssemblyScript, use: return (ptr as i64) | ((len as i64) << 32).
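A minimal sketch of the packing arithmetic (plain Rust, mirroring the WAT and AssemblyScript expressions above):

```rust
// Pack a result pointer (lower 32 bits) and length (upper 32 bits) into one i64.
fn pack(ptr: u32, len: u32) -> i64 {
    (ptr as i64) | ((len as i64) << 32)
}

fn main() {
    // "return 4 bytes at address 0" — the WAT constant quoted above:
    assert_eq!(pack(0, 4), 17_179_869_184); // 4 << 32
    // A non-zero pointer lands in the lower 32 bits:
    assert_eq!(pack(0x100, 8) as u64, (8u64 << 32) | 0x100);
    println!("{}", pack(0, 4));
}
```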

CLI Usage

# Compile WAT or WASM to JAM
wasm-pvm compile input.wat -o output.jam
wasm-pvm compile input.wasm -o output.jam

# With import resolution
wasm-pvm compile input.wasm -o output.jam \
  --imports imports.txt \
  --adapter adapter.wat

# Disable specific optimizations
wasm-pvm compile input.wasm -o output.jam --no-inline --no-peephole

# Disable all optimizations
wasm-pvm compile input.wasm -o output.jam \
  --no-llvm-passes --no-peephole --no-register-cache \
  --no-icmp-fusion --no-shrink-wrap --no-dead-store-elim \
  --no-const-prop --no-inline --no-cross-block-cache \
  --no-register-alloc --no-dead-function-elim \
  --no-fallthrough-jumps

Optimization Flags

All non-trivial optimizations are enabled by default. Each can be individually disabled:

Flag                     What it controls
--no-llvm-passes         LLVM optimization passes (mem2reg, instcombine, etc.)
--no-peephole            Post-codegen peephole optimizer
--no-register-cache      Per-block store-load forwarding
--no-icmp-fusion         Fuse ICmp+Branch into single PVM branch
--no-shrink-wrap         Only save/restore used callee-saved regs
--no-dead-store-elim     Remove SP-relative stores never loaded from
--no-const-prop          Skip redundant LoadImm when register already holds the constant
--no-inline              LLVM function inlining for small callees
--no-cross-block-cache   Propagate register cache across single-predecessor block boundaries
--no-register-alloc      Linear-scan register allocation for loop values
--no-dead-function-elim  Remove unreachable functions from output
--no-fallthrough-jumps   Skip redundant Jump when target is next block

See the Optimizations chapter for details on each.

Import Handling

WASM modules that import external functions need those imports resolved before compilation. Two mechanisms are available, and they can be combined.

Import Map (--imports)

A text file mapping import names to simple actions:

# my-imports.txt
abort = trap        # emit unreachable (panic)
console.log = nop   # do nothing, return zero

Adapter WAT (--adapter)

A WAT module whose exported functions replace matching WASM imports, enabling arbitrary logic for import resolution (pointer conversion, memory reads, host calls). Adapters are function-only overlays — tables, memories, globals, and data sections from the adapter are not merged:

(module
  (import "env" "host_call_5" (func $host_call_5 (param i64 i64 i64 i64 i64 i64) (result i64)))
  (import "env" "pvm_ptr" (func $pvm_ptr (param i64) (result i64)))

  (func (export "console.log") (param i32)
    (drop (call $host_call_5
      (i64.const 100)                                    ;; ecalli index
      (i64.const 3)                                      ;; log level
      (i64.const 0) (i64.const 0)                        ;; target ptr/len
      (call $pvm_ptr (i64.extend_i32_u (local.get 0)))   ;; message ptr
      (i64.extend_i32_u (i32.load offset=0
        (i32.sub (local.get 0) (i32.const 4)))))))     ;; message len
)

When both --imports and --adapter are provided, the adapter runs first, then the import map handles remaining unresolved imports. All imports must be resolved or compilation fails.

Host Call Imports

A family of typed host_call_N imports (N = 0..6) maps to PVM ecalli instructions, where N is the number of data registers (r7..r7+N-1) to set. See the ABI & Calling Conventions chapter for the full reference table and examples.

Variants with a b suffix (e.g. host_call_2b) also capture r8 to a stack slot, retrievable via host_call_r8() -> i64.

The pvm_ptr(wasm_addr) -> pvm_addr import converts a WASM-space address to a PVM-space address.

Compiler Pipeline

The compiler translates WebAssembly to PVM bytecode in five stages:

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ Adapter  │     │  WASM →  │     │   LLVM   │     │ LLVM IR  │     │   SPI    │
  │  Merge   │────►│  LLVM IR │────►│  Passes  │────►│  → PVM   │────►│ Assembly │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘     └──────────┘
   (optional)       inkwell          mem2reg,etc      Rust backend     JAM blob

Stage 1: Adapter Merge (Optional)

File: crates/wasm-pvm/src/translate/adapter_merge.rs

When a WAT adapter module is provided (--adapter), it is merged into the main WASM binary. Adapter exports replace matching WASM imports, enabling complex import resolution logic (pointer conversion, memory reads, host calls). Uses wasm-encoder to build the merged binary.

Stage 2: WASM → LLVM IR

File: crates/wasm-pvm/src/llvm_frontend/function_builder.rs (~1350 lines)

Each WASM function is translated to LLVM IR using inkwell (LLVM 18 bindings). PVM-specific intrinsics (@__pvm_load_i32, @__pvm_store_i32, etc.) are used for memory operations instead of direct pointer arithmetic, avoiding unsafe GEP/inttoptr patterns.

All values are treated as i64 (matching PVM’s 64-bit registers).

Stage 3: LLVM Optimization Passes

File: crates/wasm-pvm/src/llvm_frontend/function_builder.rs

Three optimization phases run sequentially:

  1. Pre-inline cleanup: mem2reg (SSA promotion), instcombine, simplifycfg
  2. Inlining (optional): cgscc(inline) — function inlining for small callees
  3. Post-inline cleanup: instcombine<max-iterations=2>, simplifycfg, gvn (redundancy elimination), simplifycfg, dce (dead code removal)

Stage 4: LLVM IR → PVM Bytecode

Files: crates/wasm-pvm/src/llvm_backend/ (7 modules)

A custom Rust backend reads LLVM IR and emits PVM instructions:

Module           Responsibility
emitter.rs       Core emitter, value slot management, register cache
alu.rs           Arithmetic, logic, comparisons, conversions, fused bitwise
memory.rs        Load/store, memory intrinsics, word-sized bulk ops
control_flow.rs  Branches, phi nodes, switch, return
calls.rs         Direct/indirect calls, import stubs
intrinsics.rs    PVM + LLVM intrinsic lowering
regalloc.rs      Linear-scan register allocator

Key optimizations at this stage:

  • Per-block register cache: eliminates redundant loads (~50% gas reduction)
  • Cross-block cache propagation: for single-predecessor blocks
  • ICmp+Branch fusion: combines compare and branch into one PVM instruction
  • Linear-scan register allocation: assigns loop values to callee-saved registers
  • Peephole optimizer: fuses immediate chains, eliminates dead stores

Stage 5: SPI Assembly

File: crates/wasm-pvm/src/translate/mod.rs

Packages everything into a JAM/SPI program blob:

  1. Build entry header (jump to main function, optional secondary entry)
  2. Build dispatch table (for call_indirect) → ro_data
  3. Build globals + WASM memory initial data → rw_data (with trailing zero trim)
  4. Encode PVM program blob (jump table + bytecode + instruction mask)
  5. Write SPI header (ro_data_len, rw_data_len, heap_pages, stack_size)

ABI & Calling Conventions

Register assignments, calling convention, stack frame layout, memory layout, and the SPI/JAM program format used by the WASM-to-PVM recompiler.

The canonical source for constants lives in crates/wasm-pvm/src/abi.rs and crates/wasm-pvm/src/memory_layout.rs.


Register Assignments

PVM provides 13 general-purpose 64-bit registers (r0–r12). The compiler assigns them as follows:

Register  Alias  Purpose                            Saved by
r0        ra     Return address (jump table index)  Callee
r1        sp     Stack pointer (grows downward)     Callee
r2        t0     Temp: load operand 1 / immediates  Caller
r3        t1     Temp: load operand 2               Caller
r4        t2     Temp: ALU result                   Caller
r5        s0     Scratch                            Caller
r6        s1     Scratch                            Caller
r7        a0     Return value / SPI args_ptr        Caller
r8        a1     SPI args_len / second result       Caller
r9        l0     Local 0 / param 0                  Callee
r10       l1     Local 1 / param 1                  Callee
r11       l2     Local 2 / param 2                  Callee
r12       l3     Local 3 / param 3                  Callee

Callee-saved (r0, r1, r9–r12): the callee must preserve these across calls. Caller-saved (r2–r8): the caller must assume these are clobbered by any call.


Stack Frame Layout

Every function allocates a stack frame. The stack grows downward (SP decreases).

                Higher addresses
          ┌─────────────────────────┐
          │   caller's frame ...    │
old SP →  ├─────────────────────────┤
          │  Saved r0  (ra)    +0   │  8 bytes
          │  Saved r9  (l0)    +8   │  8 bytes
          │  Saved r10 (l1)   +16   │  8 bytes
          │  Saved r11 (l2)   +24   │  8 bytes
          │  Saved r12 (l3)   +32   │  8 bytes
          ├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤  FRAME_HEADER_SIZE = 40
          │  SSA value slot 0  +40  │  8 bytes
          │  SSA value slot 1  +48  │  8 bytes
          │  ...                    │  8 bytes per SSA value
new SP →  ├─────────────────────────┤
          │  (operand spill area)   │  SP - 0x100 .. SP
          └─────────────────────────┘
                Lower addresses

Frame size = FRAME_HEADER_SIZE (40) + num_ssa_values * 8

The operand spill area at SP + OPERAND_SPILL_BASE (i.e. SP - 0x100) is used for temporary storage during phi-node copies and indirect calls. The frame grows upward from SP (toward higher addresses), while the spill area is below SP, so the two regions never overlap regardless of frame size. However, a callee’s frame allocation must not reach into the caller’s spill area — this is protected by the stack overflow check which ensures SP - frame_size >= stack_limit.
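The frame-size formula and the prologue's overflow check can be sketched in plain Rust (the stack_limit value here is illustrative, not a compiler constant):

```rust
// Frame layout from the diagram: 40-byte header (saved r0, r9..r12),
// then one 8-byte slot per SSA value.
const FRAME_HEADER_SIZE: u64 = 40;

fn frame_size(num_ssa_values: u64) -> u64 {
    FRAME_HEADER_SIZE + num_ssa_values * 8
}

fn main() {
    assert_eq!(frame_size(12), 136);
    let sp: u64 = 0xFEFE_0000;           // initial SP (STACK_SEGMENT_END)
    let stack_limit: u64 = sp - 0x10000; // illustrative 64 KiB stack
    // Prologue check: trap unless SP - frame_size >= stack_limit (unsigned).
    assert!(sp - frame_size(12) >= stack_limit);
    println!("{}", frame_size(12));
}
```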

Stack-Slot Approach with Register Allocation

Every LLVM SSA value gets a dedicated 8-byte stack slot. The baseline instruction sequence is:

  1. Load operands from stack slots into temp registers (t0, t1)
  2. Execute ALU operation, result in t2
  3. Store t2 back to the result’s stack slot

A linear-scan register allocator (regalloc.rs) improves on this when a function contains loop back-edges; loop-free functions skip allocation entirely. Candidate intervals are built from use-def live-interval analysis and filtered by a minimum-use threshold (MIN_USES_FOR_ALLOCATION, currently 3), rather than requiring each value to span a loop. The allocator assigns eligible values to the callee-saved registers (r9–r12) that are not already used for the function's incoming parameters. In non-leaf functions, any r9+ registers needed for outgoing call arguments are also excluded from allocation. Clobber handling at call sites is done by the emitter, which reloads values after calls, rather than by explicit invalidation logic inside regalloc itself. Combined with the register cache, this eliminates most redundant memory traffic.

Per-Block Register Cache (Store-Load Forwarding)

PvmEmitter maintains a per-basic-block register cache (slot_cache: HashMap<i32, u8>, reg_to_slot: [Option<i32>; 13]) that tracks which stack slot values are currently live in registers. This eliminates redundant LoadIndU64 instructions:

  • Cache hit, same register: Skip entirely (0 instructions emitted)
  • Cache hit, different register: Emit AddImm64 dst, cached_reg, 0 (register copy)
  • Cache miss: Emit normal LoadIndU64, then record in cache

The cache is invalidated:

  • When a register is overwritten (auto-detected via Instruction::dest_reg())
  • At block boundaries (define_label() clears the entire cache)
  • After function calls (clear_reg_cache() after Fallthrough return points)
  • After ecalli host calls (clear_reg_cache() after Ecalli)

Impact: ~50% gas reduction, ~15-40% code size reduction across benchmarks.


Calling Convention

Parameter Passing

Parameter  Location
1st–4th    r9–r12
5th+       PARAM_OVERFLOW_BASE (0x32000 + (i-4)*8) in global memory

Return value: r7 (single i64).

Caller Sequence

1. Load arguments into r9–r12 (first 4)
2. Store overflow arguments to PARAM_OVERFLOW_BASE
3. LoadImm64  r0, <return_jump_table_index>
4. Jump       <callee_code_offset>
   ── callee executes ──
5. (fallthrough) Store r7 to result slot if function returns a value

Callee Prologue

1. Stack overflow check (skipped for entry function):
     LoadImm64  t1, stack_limit        ; unsigned comparison!
     AddImm64   t2, sp, -frame_size
     BranchGeU  t1, t2, continue
     Trap                              ; stack overflow → panic
2. Allocate frame:
     AddImm64   sp, sp, -frame_size
3. Save callee-saved registers:
     StoreIndU64  [sp+0],  r0
     StoreIndU64  [sp+8],  r9
     StoreIndU64  [sp+16], r10
     StoreIndU64  [sp+24], r11
     StoreIndU64  [sp+32], r12
4. Copy parameters to SSA value slots:
     - First 4 from r9–r12
     - 5th+ loaded from PARAM_OVERFLOW_BASE

Callee Epilogue (return)

1. Load return value into r7 (if returning a value)
2. Restore callee-saved registers:
     LoadIndU64  r9,  [sp+8]
     LoadIndU64  r10, [sp+16]
     LoadIndU64  r11, [sp+24]
     LoadIndU64  r12, [sp+32]
3. Restore return address:
     LoadIndU64  r0, [sp+0]
4. Deallocate frame:
     AddImm64   sp, sp, +frame_size
5. Return:
     JumpInd    r0, 0

Jump Table & Return Addresses

PVM’s JUMP_IND instruction uses a jump table — it is not a direct address jump:

JUMP_IND rA, offset
  target_address = jumpTable[(rA + offset) / 2 - 1]

Return addresses stored in r0 are therefore jump-table indices, not code offsets:

r0 = (jump_table_index + 1) * 2

The jump table is laid out as:

[ return_addr_0, return_addr_1, ...,   // for call return sites
  func_0_entry,  func_1_entry,  ... ]  // for indirect calls

Each entry is a 4-byte code offset (u32). Jump table entries for call_indirect encode function entry points used by the dispatch table.
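The index arithmetic above can be sketched in plain Rust (the jump-table contents here are illustrative):

```rust
// Return-address encoding: r0 = (jump_table_index + 1) * 2.
fn r0_for(jump_table_index: u64) -> u64 {
    (jump_table_index + 1) * 2
}

// JUMP_IND: target_address = jumpTable[(rA + offset) / 2 - 1].
fn jump_ind_target(jump_table: &[u32], ra: u64, offset: u64) -> u32 {
    jump_table[((ra + offset) / 2 - 1) as usize]
}

fn main() {
    let jump_table = [0x100u32, 0x180, 0x220]; // illustrative code offsets
    // A call that was assigned jump-table index 1 stores 4 in r0,
    // and JUMP_IND r0, 0 resolves back to jump_table[1]:
    let ra = r0_for(1);
    assert_eq!(ra, 4);
    assert_eq!(jump_ind_target(&jump_table, ra, 0), 0x180);
    println!("ra = {}, target = 0x{:x}", ra, jump_ind_target(&jump_table, ra, 0));
}
```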


Indirect Calls (call_indirect)

A dispatch table at RO_DATA_BASE (0x10000) maps WASM table indices to function entry points:

Dispatch table entry (8 bytes each):
  [0–3]  Jump address (u32, byte offset → jump table index)
  [4–7]  Type signature index (u32)

The indirect call sequence:

 1. Compute dispatch_addr = RO_DATA_BASE + (table_index << 3)
 2. Load type_idx from [dispatch_addr + 4]
 3. Compare type_idx with expected_type_idx
 4. Trap if mismatch (signature validation)
 5. Load jump_addr from [dispatch_addr + 0]
 6. LoadImmJumpInd  jump_addr, r0, <return_jump_table_index>, 0
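The address computation in steps 1–2 is simple arithmetic; a plain-Rust sketch (entry values illustrative):

```rust
// Dispatch table entries are 8 bytes: [0-3] jump address, [4-7] type index.
const RO_DATA_BASE: u64 = 0x10000;

fn dispatch_addr(table_index: u64) -> u64 {
    RO_DATA_BASE + (table_index << 3)
}

fn main() {
    // Entry i starts at RO_DATA_BASE + 8*i; its type index lives 4 bytes in.
    assert_eq!(dispatch_addr(0), 0x10000);
    assert_eq!(dispatch_addr(3), 0x10018);
    assert_eq!(dispatch_addr(3) + 4, 0x1001C); // type signature index of entry 3
    println!("entry 3 at 0x{:x}", dispatch_addr(3));
}
```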

Import Calls

host_call_N(ecalli_index, v1, ..., vN) -> i64 — lowered to ecalli

A family of typed host call imports where N (0–6) indicates the number of data arguments loaded into r7–r12. The first argument must be a compile-time constant (the ecalli index). All variants return r7 as an i64.

Import       Params                         Registers set
host_call_0  (i64)                          none
host_call_1  (i64 i64)                      r7
host_call_2  (i64 i64 i64)                  r7–r8
host_call_3  (i64 i64 i64 i64)              r7–r9
host_call_4  (i64 i64 i64 i64 i64)          r7–r10
host_call_5  (i64 i64 i64 i64 i64 i64)      r7–r11
host_call_6  (i64 i64 i64 i64 i64 i64 i64)  r7–r12

Example — JIP-1 log call with 5 register args:

(import "env" "host_call_5" (func $host_call_5 (param i64 i64 i64 i64 i64 i64) (result i64)))
(import "env" "pvm_ptr" (func $pvm_ptr (param i64) (result i64)))

;; ecalli 100 = log; r7=level, r8=target_ptr, r9=target_len, r10=msg_ptr, r11=msg_len
(drop (call $host_call_5
  (i64.const 100)                                  ;; ecalli index
  (i64.const 3)                                    ;; r7: log level
  (call $pvm_ptr (i64.const 0))                    ;; r8: target PVM pointer
  (i64.const 8)                                    ;; r9: target length
  (call $pvm_ptr (i64.const 8))                    ;; r10: message PVM pointer
  (i64.const 15)))                                 ;; r11: message length

host_call_Nb — two-register output variants

Same as host_call_N but also captures r8 after the ecalli to a dedicated stack slot (R8_CAPTURE_SLOT_OFFSET relative to SP). Use the companion import host_call_r8() -> i64 (no arguments) to retrieve the captured value. The host_call_r8 call must be in the same function as the preceding host_call_Nb.

All *b variants (host_call_0b through host_call_6b) are supported.

Example:

(import "env" "host_call_2b" (func $host_call_2b (param i64 i64 i64) (result i64)))
(import "env" "host_call_r8" (func $host_call_r8 (result i64)))

;; Call ecalli 10, passing r7=100 and r8=200.
;; Store r7 return value, then retrieve r8.
(local $r7 i64)
(local $r8 i64)
(local.set $r7 (call $host_call_2b (i64.const 10) (i64.const 100) (i64.const 200)))
(local.set $r8 (call $host_call_r8))

pvm_ptr(wasm_addr) -> pvm_addr

Converts a WASM-space address to a PVM-space address by zero-extending to 64 bits and adding wasm_memory_base.

Other imports

The abort import emits Trap (unrecoverable error). All other unresolved imports cause a compilation error — they must be resolved via --imports or --adapter before compilation succeeds.


Memory Layout

PVM Address Space:
  0x00000 - 0x0FFFF   Reserved / guard (fault on access)
  0x10000 - 0x1FFFF   Read-only data (RO_DATA_BASE) — dispatch tables
  0x20000 - 0x2FFFF   Gap zone (unmapped, guard between RO and RW)
  0x30000 - 0x31FFF   Globals window (GLOBAL_MEMORY_BASE, 8KB cap; actual bytes used = globals_region_size(...))
  0x32000 - 0x320FF   Parameter overflow area (5th+ function arguments)
  0x32100+            Spilled locals (per-function metadata, typically unused)
  0x33000+            WASM linear memory (4KB-aligned, computed dynamically via compute_wasm_memory_base)
  ...                 (unmapped gap until stack)
  0xFEFE0000          STACK_SEGMENT_END (initial SP)
  0xFEFF0000          Arguments segment (input data, read-only)
  0xFFFF0000          EXIT_ADDRESS (jump here → HALT)

Key formulas (see memory_layout.rs):

  • Global address: 0x30000 + global_index * 4
  • Memory size global: 0x30000 + num_globals * 4
  • Spilled local: 0x32100 + func_idx * SPILLED_LOCALS_PER_FUNC + local_offset
  • WASM memory base: align_up(max(SPILLED_LOCALS_BASE + num_funcs * SPILLED_LOCALS_PER_FUNC, GLOBAL_MEMORY_BASE + globals_region_size(num_globals, num_passive_segments)), 4KB) — the heap starts immediately after the globals/passive-length region, aligned to PVM page size (4KB). This is typically 0x33000 for programs with few globals.
  • Stack limit: 0xFEFE0000 - stack_size
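A plain-Rust sketch of these formulas (constants as quoted from memory_layout.rs; the globals/spill sizes here are illustrative):

```rust
const GLOBAL_MEMORY_BASE: u64 = 0x30000;
const STACK_SEGMENT_END: u64 = 0xFEFE_0000;

// Round addr up to the next multiple of align (align must be a power of two).
fn align_up(addr: u64, align: u64) -> u64 {
    (addr + align - 1) & !(align - 1)
}

fn main() {
    // Global address: 0x30000 + global_index * 4
    assert_eq!(GLOBAL_MEMORY_BASE + 5 * 4, 0x30014);
    // WASM memory base: larger of globals-end / spills-end, 4 KiB-aligned
    let globals_end = GLOBAL_MEMORY_BASE + 64; // illustrative
    let spills_end = 0x32100 + 0x40;           // illustrative
    assert_eq!(align_up(globals_end.max(spills_end), 0x1000), 0x33000);
    // Stack limit: 0xFEFE0000 - stack_size (illustrative 64 KiB stack)
    assert_eq!(STACK_SEGMENT_END - 0x10000, 0xFEFD_0000);
    println!("0x{:x}", align_up(spills_end, 0x1000));
}
```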

RW data layout

SPI rw_data is defined as a contiguous dump of every byte from GLOBAL_MEMORY_BASE up to the last initialized byte of the WASM heap; the loader copies this region to 0x30000, so there is no sparse encoding and no per-segment offsets inside the blob. This is why any zero stretch between the globals window and the first non-zero heap byte is encoded verbatim rather than skipped.

build_rw_data() trims trailing zero bytes before SPI encoding. Heap pages are zero-initialized, so omitted trailing zeros are semantically equivalent.
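A minimal sketch of that trim (plain Rust, not the crate's implementation):

```rust
// Drop trailing zero bytes; since heap pages are zero-initialized,
// the loader reconstructs the omitted bytes implicitly.
fn trim_trailing_zeros(data: &[u8]) -> &[u8] {
    let end = data.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
    &data[..end]
}

fn main() {
    let rw = [1u8, 2, 0, 0, 3, 0, 0, 0];
    // Interior zeros are kept (the dump is contiguous); only the tail goes.
    assert_eq!(trim_trailing_zeros(&rw), &[1, 2, 0, 0, 3]);
    assert_eq!(trim_trailing_zeros(&[0u8; 4]), &[] as &[u8]);
    println!("trimmed len = {}", trim_trailing_zeros(&rw).len());
}
```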


Entry Function (SPI Convention)

The entry function is special — it follows SPI conventions rather than the normal calling convention.

Initial register state (set by the PVM runtime):

Register       Value        Purpose
r0             0xFFFF0000   EXIT address — jump here to HALT
r1             0xFEFE0000   Stack pointer (STACK_SEGMENT_END)
r7             0xFEFF0000   Arguments pointer (PVM address)
r8             args.length  Arguments length in bytes
r2–r6, r9–r12  0            Available

Entry prologue differences from a normal function:

  1. No stack overflow check (main function starts with full stack)
  2. Allocates frame and stores SSA slots
  3. No callee-saved register saves (no caller to return to)
  4. Adjusts args_ptr: r7 = r7 - wasm_memory_base (convert PVM address to WASM address)
  5. Stores r7 and r8 to parameter slots

Entry return — unified packed i64 convention:

The entry function must return a single i64 value encoding a pointer and length:

  • Lower 32 bits = WASM pointer to result data
  • Upper 32 bits = result length in bytes
  • PVM output: r7 = (ret & 0xFFFFFFFF) + wasm_memory_base, r8 = r7 + (ret >> 32)

All entry functions end by jumping to EXIT_ADDRESS (0xFFFF0000).
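The unpacking above, including the WASM-to-PVM address translation, can be sketched in plain Rust (the wasm_memory_base value here is illustrative):

```rust
// Unpack the entry function's i64 return into the SPI output registers:
// r7 = start address, r8 = end address (both in PVM address space).
fn unpack_entry_return(ret: u64, wasm_memory_base: u64) -> (u64, u64) {
    let r7 = (ret & 0xFFFF_FFFF) + wasm_memory_base;
    let r8 = r7 + (ret >> 32);
    (r7, r8)
}

fn main() {
    // The "ptr=0, len=4" constant from the Hello World example:
    let (r7, r8) = unpack_entry_return(17_179_869_184, 0x33000);
    assert_eq!(r7, 0x33000);
    assert_eq!(r8, 0x33004);
    println!("r7=0x{:x}, r8=0x{:x}", r7, r8);
}
```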

Start Function

If a WASM start function exists, the entry function calls it before processing arguments. r7/r8 are saved to the stack, the start function is called (no arguments), then r7/r8 are restored.


SPI/JAM Program Format

The compiled output is a JAM file in the SPI (Standard Program Interface) format:

Offset  Size    Field
──────  ──────  ─────────────────────
0       3       ro_data_len (u24 LE)
3       3       rw_data_len (u24 LE)
6       2       heap_pages  (u16 LE)
8       3       stack_size  (u24 LE)
11      N       ro_data     (dispatch table)
11+N    M       rw_data     (globals + WASM memory initial data)
11+N+M  4       code_len    (u32 LE)
15+N+M  K       code        (PVM program blob)

heap_pages is computed from the WASM module’s initial_pages (not max_pages). It represents the number of 4KB PVM pages pre-allocated as zero-initialized writable memory at program start. Additional memory beyond this is allocated on demand via sbrk/memory.grow. Programs declaring (memory 0) get a minimum of 16 WASM pages (1MB) to accommodate AssemblyScript runtime memory accesses.
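A simplified stand-in for that computation (the real calculate_heap_pages() may adjust further for layout; the 16× factor is just 64 KiB WASM pages over 4 KiB PVM pages):

```rust
// WASM pages are 64 KiB; PVM pages are 4 KiB; (memory 0) is bumped
// to a 16-WASM-page (1 MiB) minimum, per the text above.
fn heap_pages(initial_wasm_pages: u32) -> u32 {
    let wasm_pages = if initial_wasm_pages == 0 { 16 } else { initial_wasm_pages };
    wasm_pages * 16 // one 64 KiB WASM page spans sixteen 4 KiB PVM pages
}

fn main() {
    assert_eq!(heap_pages(0), 256); // (memory 0) → 1 MiB minimum
    assert_eq!(heap_pages(1), 16);  // (memory 1) → 64 KiB
    println!("{}", heap_pages(1));
}
```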

PVM Code Blob

Inside the code section, the PVM blob format is:

- jump_table_len  (varint u32)
- item_len        (u8, always 4)
- code_len        (varint u32)
- jump_table      (4 bytes per entry, code offsets)
- instructions    (PVM bytecode)
- mask            (bit-packed instruction start markers)

Entry Header

The first 10 bytes of code are the entry header:

[0–4]   Jump  <main_function_offset>        (5 bytes)
[5–9]   Jump  <secondary_entry_offset>      (5 bytes, or Trap + padding)

The secondary entry is for future use (e.g. is_authorized). If unused, it emits Trap followed by 4 Fallthrough instructions as padding.


Phi Node Handling

Phi nodes (SSA merge points) use a two-pass approach to avoid clobbering:

  1. Load pass: Load all incoming phi values into temp registers (t0, t1, t2, s0, s1)
  2. Store pass: Store all temps to their destination phi result slots

This supports up to 5 simultaneous phi values. The two-pass design prevents cycles where storing one phi value would overwrite a source needed by another phi.
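Why two passes matter is easiest to see with a swap; a plain-Rust sketch:

```rust
// A naive sequential copy of (a, b) <- (b, a) would clobber a before b
// reads it. Loading everything into temps first (pass 1), then storing
// (pass 2), makes the copies behave as if they happened in parallel.
fn main() {
    let mut slots = [1i64, 2]; // slot 0 = a, slot 1 = b
    // Pass 1: load all incoming phi values into temp registers.
    let temps = [slots[1], slots[0]]; // (b, a)
    // Pass 2: store all temps to the phi result slots.
    slots[0] = temps[0];
    slots[1] = temps[1];
    assert_eq!(slots, [2, 1]); // swap succeeded, nothing clobbered
    println!("{:?}", slots);
}
```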


Design Trade-offs

Decision                                Rationale
Stack-slot for every SSA value          Correctness-first baseline; the linear-scan register allocator (for loop-containing functions) assigns high-use values to available callee-saved regs (r9–r12 when not used for incoming parameters), and the per-block register cache eliminates most remaining redundant loads
Spill area below SP                     Frame grows up from SP, spill area grows down — no overlap
Global PARAM_OVERFLOW_BASE              Avoids stack frame complexity for overflow params
Jump-table indices as return addresses  Required by PVM’s JUMP_IND semantics
Entry function has no stack check       Starts with full stack, nothing to overflow into
Unsigned stack limit comparison         LoadImm64 avoids sign-extension bugs with large addresses
unsafe forbidden                        Workspace-level deny(unsafe_code) lint

References

  • crates/wasm-pvm/src/abi.rs — Register and frame constants
  • crates/wasm-pvm/src/memory_layout.rs — Memory address constants
  • crates/wasm-pvm/src/llvm_backend/emitter.rs — PvmEmitter and value management
  • crates/wasm-pvm/src/llvm_backend/calls.rs — Calling convention implementation
  • crates/wasm-pvm/src/llvm_backend/control_flow.rs — Prologue/epilogue/return
  • crates/wasm-pvm/src/spi.rs — JAM/SPI format encoder
  • Technical Reference — Technical reference and debugging journal
  • Gray Paper — JAM/PVM specification

Translation Module

The translation module orchestrates the end-to-end WASM → LLVM IR → PVM lowering and assembles the final SPI/JAM output.

Source: crates/wasm-pvm/src/translate/

Files

File              Role
mod.rs            Pipeline dispatch, SPI assembly, entry header + data sections
wasm_module.rs    WASM section parsing into WasmModule
memory_layout.rs  Memory address constants and helper functions

Pipeline

  1. Parse module sections in wasm_module.rs (WasmModule::parse()).
  2. Translate WASM operators to LLVM IR in llvm_frontend/function_builder.rs.
  3. Run LLVM optimization pipeline (mem2reg, instcombine, simplifycfg, optional inlining, cleanup passes).
  4. Lower LLVM IR to PVM instructions in llvm_backend/mod.rs.
  5. Build SPI sections in mod.rs:
    • Entry header and dispatch tables
    • ro_data (jump table refs + passive data)
    • rw_data (globals + active data segments), with trailing zero trim
    • Encoded PVM blob + metadata

Key Behaviors

  • calculate_heap_pages() uses WASM initial_pages (not max), with a minimum of 16 WASM pages for (memory 0).
  • compute_wasm_memory_base() compares SPILLED_LOCALS_BASE + num_funcs * SPILLED_LOCALS_PER_FUNC with GLOBAL_MEMORY_BASE + globals_region_size(num_globals, num_passive_segments), then rounds the larger address up to the next 4KB (PVM page) boundary. This typically gives 0x33000.
  • build_rw_data() copies globals and active segments into a contiguous image, then trims trailing zero bytes before SPI encoding.
  • Call return addresses are pre-assigned as jump-table refs ((idx + 1) * 2) at emission time; fixup resolution accepts direct (LoadImmJump) and indirect (LoadImm / LoadImmJumpInd) return-address carriers.
  • Export parsing tracks exported_wasm_func_indices in WASM global index space for dead-function-elimination roots; entry resolution prefers canonical names (main, main2) over aliases (refine*, accumulate*) regardless of export order.
  • Entry exports (main/main2 and aliases) must target local (non-imported) functions; imported targets are rejected during parse with Error::Internal to avoid index-underflow panics.

Current Memory Layout

Address   Purpose
0x10000   Read-only data
0x30000   Globals window (8KB cap; actual bytes = globals_region_size(num_globals, num_passive_segments)); the heap starts at compute_wasm_memory_base(), the 4KB-aligned address after max(globals_end, spills_end)
0x32000   Parameter overflow area
0x32100+  Spilled-locals base (spills are stack-based; base kept for layout/alignment)
0x33000+  WASM linear memory (4KB-aligned, computed dynamically)

Anti-Patterns

  1. Don’t change layout constants without validating pvm-in-pvm tests.
  2. Don’t bypass Result error handling with panics in library code.
  3. Don’t assume rw_data must include trailing zero bytes.

PVM Instruction Module

PolkaVM instruction definitions, opcodes, encoding/decoding, and the peephole optimizer.

Source: crates/wasm-pvm/src/pvm/

Files

File            Lines  Role
instruction.rs  ~700   Instruction enum, encoding/decoding logic
opcode.rs       ~130   Opcode constants (~100 opcodes)
blob.rs         143    Program blob format with jump table
peephole.rs     ~400   Post-codegen peephole optimizer (Fallthroughs, truncation NOPs, dead stores, immediate chain fusion, self-move elimination)

Key Patterns

Instruction Encoding

pub enum Instruction {
    Add32 { dst: u8, src1: u8, src2: u8 },
    LoadIndU32 { dst: u8, base: u8, offset: i32 },
    MoveReg { dst: u8, src: u8 },
    BranchLtUImm { reg: u8, value: i32, offset: i32 },
    BranchEq { reg1: u8, reg2: u8, offset: i32 },
    CmovIzImm { dst: u8, cond: u8, value: i32 },  // TwoRegOneImm encoding
    StoreImmU32 { address: i32, value: i32 },  // TwoImm encoding
    StoreImmIndU32 { base: u8, offset: i32, value: i32 },  // OneRegTwoImm encoding
    AndImm { dst: u8, src: u8, value: i32 },
    ShloLImm32 { dst: u8, src: u8, value: i32 },
    NegAddImm32 { dst: u8, src: u8, value: i32 },
    SetGtUImm { dst: u8, src: u8, value: i32 },
    // ... ~100 variants total
}

Encoding Helpers

  • encode_three_reg(opcode, dst, src1, src2) - ALU ops (3 regs)
  • encode_two_reg(opcode, dst, src) - Moves/conversions (2 regs)
  • encode_two_reg_one_imm(opcode, dst, src, value) - ALU immediate ops (2 regs + imm)
  • encode_two_imm(opcode, imm1, imm2) - TwoImm format (StoreImm*)
  • encode_one_reg_one_imm_one_off(opcode, reg, imm, offset) - Branch-immediate ops
  • encode_one_reg_two_imm(opcode, base, offset, value) - Store immediate indirect
  • encode_two_reg_one_off(opcode, reg1, reg2, offset) - Branch-register ops
  • encode_two_reg_two_imm(opcode, reg1, reg2, imm1, imm2) - Compound indirect jump (LoadImmJumpInd)
  • encode_imm(value) - Variable-length signed immediate (0-4 bytes)
  • encode_uimm(value) - Variable-length unsigned immediate (0-4 bytes)
  • encode_var_u32(value) - LEB128-style variable int
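The variable-length immediate helpers can be illustrated with a minimal sketch: emit the fewest little-endian bytes (0-4) from which sign-extension recovers the original value, where an empty immediate decodes as 0. This is an illustrative model, not the crate's actual implementation.

```rust
// Decode model: sign-extend from the top bit of the last byte present;
// an empty immediate decodes as 0.
fn decode_imm(bytes: &[u8]) -> i32 {
    if bytes.is_empty() {
        return 0;
    }
    let fill = if bytes[bytes.len() - 1] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 4];
    buf[..bytes.len()].copy_from_slice(bytes);
    i32::from_le_bytes(buf)
}

// Encode: pick the shortest prefix that round-trips through decode_imm.
fn encode_imm(value: i32) -> Vec<u8> {
    let bytes = value.to_le_bytes();
    let len = (0..=4).find(|&n| decode_imm(&bytes[..n]) == value).unwrap();
    bytes[..len].to_vec()
}
```

For example, 0 encodes to zero bytes, small positive values to one byte, and 128 needs two bytes (a one-byte 0x80 would sign-extend to -128).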

Decoding Helpers

  • Instruction::decode(bytes) dispatches by opcode and returns (instruction, consumed_bytes)
  • Opcode::from_u8 / Opcode::try_from provide explicit opcode-byte to enum conversion
  • decode_imm_signed / decode_imm_unsigned handle 0-4 byte immediate expansion
  • decode_offset_at reads fixed 4-byte branch/jump offsets
  • For formats where the trailing immediate has no explicit length (OneImm, OneRegOneImm, TwoRegOneImm, TwoImm, OneRegTwoImm, TwoRegTwoImm), decode consumes the remaining bytes as that immediate

Terminating Instructions

Instructions that end a basic block:

pub fn is_terminating(&self) -> bool {
    matches!(self,
        Trap | Fallthrough | Jump {..} | LoadImmJump {..} | JumpInd {..} | LoadImmJumpInd {..} |
        BranchNeImm {..} | BranchEqImm {..} | ...)
}

Destination Register Query

Used by the register cache in emitter.rs to auto-invalidate stale cache entries:

pub fn dest_reg(&self) -> Option<u8> {
    // Returns Some(reg) for instructions that write to a register
    // Returns None for stores, branches, traps, ecalli
}

Peephole Notes

  • Dead-code elimination runs only when a function has no labels (single-block code). Multi-block functions skip DCE to avoid incorrect liveness across control flow.
  • DCE must track side-effects for all store variants: StoreIndU8/U16/U32/U64, StoreImmIndU8/U16/U32/U64, StoreImmU8/U16/U32/U64, StoreU8/U16/U32/U64
  • DCE must track memory loads (can-trap, track dst) for all load variants: LoadIndU8/I8/U16/I16/U32/I32/U64, LoadU8/I8/U16/I16/U32/I32/U64
  • Address-folding for AddImm* chains is width-aware: AddImm32 relations only fold into later AddImm32, and AddImm64 relations only fold into later AddImm64 (no cross-width fusion).

Where to Look

| Task | Location |
|---|---|
| Add new PVM instruction | opcode.rs (add enum variant) + instruction.rs (encoding + decoding) |
| Change instruction encoding | instruction.rs: impl Instruction |
| Check whether an opcode exists | opcode.rs (~100 opcodes defined) |
| Build program blob | blob.rs: ProgramBlob::with_jump_table() |
| Variable int encoding | blob.rs: encode_var_u32() |

Branch Operand Convention (Important!)

Two-register branch instructions use reversed operand order: Branch_op { reg1: a, reg2: b } branches when reg2 op reg1 (i.e., b op a).

For example, BranchLtU { reg1: 3, reg2: 2 } branches when reg[2] < reg[3], NOT reg[3] < reg[2].

This matches the PVM spec where branch_lt_u(rA, rB) branches when ω_rB < ω_rA. In the binary encoding, reg1 = high nibble (rA), reg2 = low nibble (rB).

Immediate-form branches are straightforward: BranchLtUImm { reg, value } branches when reg < value.
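The reversed convention is easy to get wrong, so a sketch of the taken-branch predicate may help. The register-file array here is hypothetical; only the operand order matches the convention described above.

```rust
// BranchLtU { reg1, reg2 } branches when reg[reg2] < reg[reg1] — note that
// reg2 is the LEFT-hand side of the comparison (the reversed convention).
fn branch_lt_u_taken(regs: &[u64; 13], reg1: u8, reg2: u8) -> bool {
    regs[reg2 as usize] < regs[reg1 as usize]
}
```

With reg[2] = 5 and reg[3] = 9, BranchLtU { reg1: 3, reg2: 2 } is taken (5 < 9), while BranchLtU { reg1: 2, reg2: 3 } is not.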

Anti-Patterns

  1. Don’t change opcode numbers - Would break existing JAM files
  2. Preserve register field order - (dst, src1, src2) convention
  3. Keep encoding compact - Variable-length immediates save space

Testing

Unit tests in same files under #[cfg(test)]:

  • instruction.rs: Encoding tests plus decode(encode(x)) roundtrip coverage for all variants
  • blob.rs: Tests mask packing, varint encoding

Gray Paper Reference

See gp-0.7.2.md Appendix A for PVM spec:

  • Gas costs per instruction (ϱ∆)
  • Semantics for each opcode
  • This module implements the encoding, not semantics

Optimizations

All non-trivial optimizations can be individually toggled via OptimizationFlags (in translate/mod.rs, re-exported from lib.rs). Each defaults to enabled; CLI exposes --no-* flags.

LLVM Passes (--no-llvm-passes)

Three-phase optimization pipeline:

  1. mem2reg, instcombine, simplifycfg (pre-inline cleanup)
  2. cgscc(inline) (optional, see --no-inline)
  3. instcombine<max-iterations=2>, simplifycfg, gvn, simplifycfg, dce

Function Inlining (--no-inline)

LLVM CGSCC inline pass for small callees. After inlining, instcombine may introduce new LLVM intrinsics (llvm.abs, llvm.smax, etc.) that the backend must handle.

Peephole Optimizer (--no-peephole)

Post-codegen patterns in pvm/peephole.rs:

  • Fallthrough elimination: remove redundant Fallthrough before jump/branch
  • Truncation NOP removal: [32-bit-producer] → AddImm32(x,x,0) eliminated
  • Dead store elimination: SP-relative stores never loaded from are removed
  • Immediate chain fusion: LoadImm + AddImm → single LoadImm; chained AddImm → fused
  • Self-move elimination: MoveReg r, r removed
  • Address calculation folding: AddImm offsets folded into subsequent load/store offsets

Register Cache (--no-register-cache)

Per-basic-block store-load forwarding. Tracks which stack slots are live in registers:

  • Cache hit, same register: skip entirely (0 instructions)
  • Cache hit, different register: emit register copy (1 instruction)
  • Cache miss: normal load + record in cache

Impact: ~50% gas reduction, ~15-40% code size reduction.

Invalidated at block boundaries, after function calls, and after ecalli.
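The three cases above can be sketched as a small slot-to-register map. The structure and method names are illustrative, not the emitter's real API; the return value models the instruction cost of each case.

```rust
use std::collections::HashMap;

// Minimal sketch of per-block store-load forwarding: track which stack
// slot offsets are currently live in which registers.
struct RegCache {
    slot_to_reg: HashMap<i32, u8>, // stack slot offset -> register holding it
}

impl RegCache {
    fn new() -> Self {
        Self { slot_to_reg: HashMap::new() }
    }

    // Returns how many instructions loading `slot` into `want_reg` costs.
    fn load(&mut self, slot: i32, want_reg: u8) -> u32 {
        match self.slot_to_reg.get(&slot) {
            Some(&r) if r == want_reg => 0, // hit, same register: skip entirely
            Some(_) => 1,                   // hit, other register: one MoveReg
            None => {
                // miss: one LoadIndU64, then record the slot in the cache
                self.slot_to_reg.insert(slot, want_reg);
                1
            }
        }
    }

    // Called at block boundaries, after function calls, and after ecalli.
    fn clear(&mut self) {
        self.slot_to_reg.clear();
    }
}
```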

Cross-Block Cache (--no-cross-block-cache)

When a block has exactly one predecessor and no phi nodes, the predecessor’s cache snapshot is propagated instead of clearing. The snapshot is taken before the terminator instruction.

ICmp+Branch Fusion (--no-icmp-fusion)

Combines an LLVM icmp + br pair into a single PVM branch instruction (e.g., BranchLtU), saving one instruction per conditional branch.

Shrink Wrapping (--no-shrink-wrap)

For non-entry functions, only callee-saved registers (r9-r12) that are actually used are saved/restored in prologue/epilogue. Reduces frame header size from fixed 40 bytes to 8 + 8 * num_used_callee_regs.
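The frame-size formula above works out as a simple helper; this sketch just restates the arithmetic (8 bytes for the return address plus 8 per saved callee register), with a hypothetical function name.

```rust
// 8 bytes (return address) + 8 bytes per callee-saved register actually used.
fn frame_header_size(num_used_callee_regs: u32) -> u32 {
    8 + 8 * num_used_callee_regs
}
```

Saving all four of r9-r12 reproduces the old fixed 40-byte header; a function using none of them needs only 8 bytes.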

Dead Store Elimination (--no-dead-store-elim)

Removes StoreIndU64 instructions to SP-relative offsets that are never loaded from. Runs as part of the peephole optimizer.

Constant Propagation (--no-const-prop)

Skips LoadImm/LoadImm64 when the target register already holds the required constant value.

Register Allocation (--no-register-alloc)

Linear-scan allocator assigns SSA values to physical registers, reducing LoadIndU64 memory traffic. Allocates in all functions (looped and straight-line, leaf and non-leaf). Eviction uses a spill-weight model (use_count × 10^loop_depth) to keep loop-hot values in registers. In non-leaf functions, the existing call lowering (spill_allocated_regs + clear_reg_cache + lazy reload) handles spill/reload around calls automatically, and per-call-site arity-aware invalidation only clobbers registers used by each specific call. See the Register Allocation chapter for details.
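The eviction spill-weight model named above (use_count × 10^loop_depth) can be sketched directly; the function name is illustrative.

```rust
// Loop-hot values get exponentially higher weight, so they are kept in
// registers and cold values are evicted first.
fn spill_weight(use_count: u32, loop_depth: u32) -> f64 {
    use_count as f64 * 10f64.powi(loop_depth as i32)
}
```

Under this model a value used twice inside a doubly nested loop (weight 200) outranks one used 50 times at top level (weight 50).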

Aggressive Register Allocation (--no-aggressive-regalloc)

Lowers the minimum-use threshold for register allocation candidates from 2 to 1, capturing more values when a register is free. Enabled by default.

Scratch Register Allocation (--no-scratch-reg-alloc)

Adds r5/r6 (abi::SCRATCH1/SCRATCH2) to the allocatable set in all functions that don’t clobber them (no bulk memory ops, no funnel shifts). Per-function LLVM IR scan detects clobbering operations. In non-leaf functions, r5/r6 are spilled before calls via spill_allocated_regs and lazily reloaded on next access. Doubles allocation capacity in the common case (e.g., 2-param function: 2 → 4 allocatable regs).

Caller-Saved Register Allocation (--no-caller-saved-alloc)

Adds r7/r8 (RETURN_VALUE_REG/ARGS_LEN_REG) to the allocatable set in leaf functions. These registers are idle after the prologue and are never clobbered by calls in leaf functions. In non-leaf functions, r7/r8 are not allocated because every call clobbers r7 (return value) and r8 (scratch), making the constant invalidation/reload overhead a net negative. Combined with r5/r6, gives up to 4 extra registers (r5, r6, r7, r8) beyond callee-saved r9-r12 in leaf functions. The full register convention: r0=return address, r1=SP, r2-r4=temps, r5-r6=scratch, r7=return value/args ptr, r8=args len, r9-r12=callee-saved locals.

Dead Function Elimination (--no-dead-function-elim)

Removes functions not reachable from exports or the function table. Reduces code size for programs with unused library functions.

Fallthrough Jump Elimination (--no-fallthrough-jumps)

When a block ends with an unconditional jump to the next block in layout order, the Jump is skipped — execution falls through naturally.

Lazy Spill (--no-lazy-spill)

Eliminates write-through stack stores for register-allocated values. When a value is stored to a slot that has an allocated register, the value goes only into the register (marked “dirty”) and the StoreIndU64 to the stack is skipped. Values are flushed to the stack only when required:

  • When the register is about to be clobbered by another instruction (auto-spill in invalidate_reg)
  • Before function calls and ecalli (via spill_allocated_regs())
  • Before the function epilogue (return)
  • Before terminators at block boundaries
  • After prologue parameter stores

With register-aware phi resolution (Phase 5), phi copies between blocks use direct register-to-register moves when both the incoming value and the phi destination are in allocated registers, avoiding stack round-trips. The target block restores alloc_reg_slot for phi destinations after define_label, so subsequent reads use the register directly. For mixed cases (some values allocated, some not), a two-pass approach loads all incoming values into temp registers, then stores to destinations (registers or stack). This handles all dependency cases including cycles without needing a separate parallel move resolver.

Requires register_allocation to be effective.

Store-Side Coalescing (Phase 7)

When a value has an allocated register, result_reg() / result_reg_or() helpers in emitter.rs return that register directly. ALU, memory load, and intrinsic lowering paths use the result register as their output destination instead of TEMP_RESULT (r4), so store_to_slot no longer needs to emit a MoveReg to copy from TEMP_RESULT into the allocated register.

This is a codegen-only optimization (no new flag) — it is always active when register allocation is enabled.

Not coalesced (store-side correctness constraints):

  • lower_select: loading the default value into the allocated register corrupts register cache state needed by subsequent operand loads (load-side coalescing IS applied — see Phase 9 below)
  • emit_pvm_memory_grow: TEMP_RESULT is used across control flow (branch between grow success/failure)
  • lower_abs intrinsic: TEMP_RESULT is used across control flow (branch between positive/negative paths)

result_reg_or() variant: Some lowering paths (zext, sext, trunc) need TEMP1 as the fallback register instead of TEMP_RESULT to preserve register cache behavior in non-allocated paths. result_reg_or(fallback) returns the allocated register when available, or the specified fallback otherwise.

Impact (anan-as compiler): 54% reduction in store_moves (2720 → 1262), 4% reduction in total instructions (37,225 → 35,744), 2.9% reduction in JAM size (169,853 → 164,902 bytes).

Load-Side Coalescing (Phase 8)

When a value is live in its allocated register, operand_reg() returns that register directly instead of requiring load_operand() to copy it into a temp register (TEMP1/TEMP2). The instruction’s source operand fields use the allocated register, eliminating the MoveReg that load_operand() would otherwise emit.

Applied across all lowering modules:

  • alu.rs: Binary arithmetic (register-register and immediate-folding paths), comparisons, zext/sext/trunc
  • memory.rs: PVM load/store address and value operands, global store values
  • control_flow.rs: Branch conditions, fused ICmp+Branch operands, switch values
  • intrinsics.rs: min/max, bswap, ctlz/cttz/ctpop, rotation operands

Not coalesced (complexity/safety constraints):

  • Div/rem operations: intermediate trap code may clobber scratch registers
  • Non-rotation funnel shifts: use SCRATCH1/SCRATCH2 after spill_allocated_regs
  • lower_abs: control flow between positive/negative paths
  • Call argument setup: already loaded into specific registers
  • Phi resolution: already uses register-aware moves

Dst-conflict safety: When an operand’s allocated register matches the destination register (result_reg), the operand falls back to the temp register to avoid invalidation hazards from emit() → invalidate_reg(dst).

This is a codegen-only optimization — always active when register allocation is enabled.

Phase 9: Select Coalescing, Spill Weight Refinement & Call Return Hints

Three allocator improvements added in Phase 9:

Select Coalescing (load-side)

lower_select now uses operand_reg() for all Cmov operands (default value, condition, and source). Values already in their allocated registers are used directly as CmovNz/CmovIz/CmovNzImm/CmovIzImm operands without MoveReg copies. Store-side coalescing (using result_reg() for the Cmov dst) remains deferred due to the invalidate_reg cache corruption issue documented in Phase 7.

Spill Weight Refinement

Values whose live ranges span real call instructions receive a penalty to their spill weight. Each spanning call costs CALL_SPANNING_PENALTY (2.0) weight, representing the spill+reload pair required when a register is allocated across a call boundary. The formula:

effective_weight = base_weight - (num_spanning_calls × 2.0)

Values spanning many calls get lower weights, making them more likely to be evicted in favor of values with fewer spanning calls. This improves allocation decisions in call-heavy functions. Call positions are collected during linearization using the same is_real_call() check from emitter.rs; counting uses binary search for efficiency.
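The penalty and the binary-search counting can be sketched as below, assuming call positions are kept as a sorted list of instruction indices and a live range is a half-open-ish [start, end] span; names and the exact boundary treatment are illustrative.

```rust
const CALL_SPANNING_PENALTY: f64 = 2.0;

// Count calls strictly inside the live range [start, end] via binary search
// over the sorted call-position list.
fn count_spanning_calls(call_positions: &[u32], start: u32, end: u32) -> u32 {
    let lo = call_positions.partition_point(|&p| p <= start);
    let hi = call_positions.partition_point(|&p| p < end);
    (hi - lo) as u32
}

// effective_weight = base_weight - (num_spanning_calls × 2.0)
fn effective_weight(base_weight: f64, call_positions: &[u32], start: u32, end: u32) -> f64 {
    base_weight - count_spanning_calls(call_positions, start, end) as f64 * CALL_SPANNING_PENALTY
}
```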

Call Return Value Coalescing (register hints)

When a value is defined by a real call instruction, the linear scan allocator prefers assigning r7 (RETURN_VALUE_REG). Since call return values are already in r7, this eliminates the MoveReg from r7 to the allocated register in store_to_slot. The hint is best-effort — if r7 is not free, a different register is used.

All three are codegen-only optimizations — always active when register allocation is enabled.

Phase 10: Loop Phi Early Interval Expiration

Eliminates phi MoveReg instructions in loop headers by modifying the linear scan to expire loop phi destination intervals at their actual last use (before loop extension) instead of the loop-extended end. This frees the phi’s register earlier, allowing the incoming back-edge value to naturally reuse it via the free register pool. When both values share the same register, the phi copy at the back-edge becomes a no-op (skipped entirely in emit_phi_copies_regaware).

Three coordinated changes:

  1. regalloc.rs: LiveInterval.expiration field — for loop phi destinations where pre_extension_end < end, expires early. Pressure guard disables when intervals > 2× registers.
  2. control_flow.rs: Phi copy no-op filter — when incoming_reg == phi_reg and is_alloc_reg_valid(src_reg, incoming_slot), skips data movement.
  3. emitter.rs: store_to_slot safety — spills dirty values before overwriting alloc_reg_slot with a different slot.

Impact: fib(20) -15.7% gas / -7.2% code, factorial -5.6% gas. No regressions.

This is a codegen-only optimization — always active when register allocation is enabled.

Phase 11: Cross-Block Alloc State Propagation

Improves register allocation state propagation at block boundaries, particularly at loop headers with back-edges. Previously, blocks with unprocessed predecessors (back-edges) cleared alloc_reg_slot entirely, forcing reloads from the stack at every loop iteration start. Phase 11 instead propagates the dominator predecessor’s alloc state, filtered by safety:

  • Non-leaf functions: Only callee-saved registers beyond max_call_args are propagated (these are never clobbered by calls). Caller-saved registers (r5-r8) are excluded because they may be invalidated after calls on other paths.
  • Leaf functions with lazy spill: All registers are propagated (no calls to clobber them).
  • Multi-predecessor blocks (leaf+lazy_spill): The existing intersection logic (keep only entries where all processed predecessors agree) is now also applied to leaf functions with lazy spill, not just non-leaf functions.

New emitter method set_alloc_reg_slot_filtered() selectively propagates alloc entries based on a register filter predicate, enabling the per-register-class filtering described above.

The predecessor map (pred_map) is now built for both non-leaf functions AND leaf functions with lazy spill (condition: has_regalloc && (!is_leaf || lazy_spill_enabled)).

Impact: fib(20) -5.1% gas, factorial(10) -7.1% gas, is_prime(25) -4.6% gas, PiP aslan-fib -0.52% gas.

This is a codegen-only optimization — always active when register allocation and lazy spill are enabled.

Phase 12: Callee-Saved Preference for Call-Spanning Intervals

In non-leaf functions, the linear scan allocator now applies register class preferences based on whether an interval spans call instructions:

  • Call-spanning intervals (live range contains at least one real call) prefer callee-saved registers (r9-r12 beyond max_call_args). These registers survive calls without invalidation, eliminating post-call reload traffic.
  • Non-call-spanning intervals prefer caller-saved registers (r5-r8), leaving callee-saved registers available for call-spanning values.
  • Leaf functions use the default pop() behavior — all registers are equal since there are no calls.

The preferred_reg hint (e.g., r7 for call return values) takes priority over the class preference.

Implementation: LiveInterval.spans_calls field set during interval construction based on count_spanning_calls() > 0. The linear_scan() function receives is_leaf and applies class-aware register selection.

Impact: Primarily benefits non-leaf functions with call-spanning values. anan-as PVM interpreter -0.2% code size (106,820→106,577 bytes).

This is a codegen-only optimization — always active when register allocation is enabled.

Adding a New Optimization

  1. Add a field to OptimizationFlags in translate/mod.rs
  2. Thread it through LoweringContext → EmitterConfig
  3. Guard the optimization code with e.config.<flag>
  4. Add a --no-* CLI flag in wasm-pvm-cli/src/main.rs

Benchmarks

All optimizations enabled (default):

| Benchmark | WASM size | JAM size | Code size | Gas Used |
|---|---|---|---|---|
| add(5,7) | 68 B | 165 B | - | 28 |
| fib(20) | 110 B | 247 B | - | 511 |
| factorial(10) | 102 B | 209 B | - | 178 |
| is_prime(25) | 162 B | 293 B | - | 65 |
| AS fib(10) | 235 B | 640 B | - | 247 |
| AS factorial(7) | 234 B | 625 B | - | 209 |
| AS gcd(2017,200) | 229 B | 649 B | - | 176 |
| AS decoder | 1.5 KB | 20.8 KB | - | 637 |
| AS array | 1.4 KB | 20.0 KB | - | 557 |
| regalloc two loops | - | 595 B | - | 16,776 |
| aslan-fib accumulate | - | 38.5 KB | - | 11,089 |
| anan-as PVM interpreter | 54.6 KB | 158.9 KB | - | - |

Register Allocation

The compiler uses a linear-scan register allocator to assign frequently-used SSA values to physical callee-saved registers (r9-r12), reducing memory traffic.

Overview

Every LLVM SSA value gets a dedicated 8-byte stack slot (the baseline). The register allocator improves on this by keeping hot values in registers across block boundaries and loop iterations.

Eligibility

  • Only functions with loop back-edges are considered (loop-free functions skip allocation)
  • Values must have ≥3 uses (MIN_USES_FOR_ALLOCATION)
  • Live intervals are computed from use-def analysis with loop extension

Available Registers

Callee-saved registers r9-r12, minus those used for incoming parameters:

  • A function with 2 parameters uses r9-r10 → r11-r12 are available for allocation
  • In non-leaf functions, registers needed for outgoing call arguments are also reserved

Allocation Strategy

  1. Build candidate intervals from use-def live-interval analysis
  2. Filter by minimum-use threshold
  3. Run linear scan: assign to available callee-saved registers, evict lower-priority intervals when needed
  4. Naturally expired intervals remain in the mapping (earlier uses still benefit)
  5. Evicted intervals are removed entirely (whole-interval mapping invalid after eviction)

Runtime Integration

  • load_operand checks regalloc before stack: uses MoveReg from allocated reg instead of LoadIndU64
  • store_to_slot uses write-through: copies to allocated reg AND stores to stack
  • Dead store elimination removes the stack store if never loaded
  • After calls in non-leaf functions, allocated register mappings are invalidated and lazily reloaded

Cross-Block Propagation

  • Leaf functions: alloc_reg_slot is preserved across all block boundaries (allocated registers are never clobbered by calls)
  • Non-leaf functions: Predecessor exit snapshots are intersected at multi-predecessor blocks — only entries where ALL predecessors agree are kept
  • Back-edges (unprocessed predecessors) are treated conservatively

Debugging

Enable allocator logs with RUST_LOG=wasm_pvm::regalloc=debug:

  • regalloc::run() prints candidate/assignment stats
  • lower_function() prints per-function usage counters (alloc_load_hits, alloc_store_hits, etc.)

Quick triage:

  • allocatable_regs=0 → no allocation will happen
  • Non-zero allocated_values with near-zero load/store hits → move/reload overhead dominates

For the full development journey, see Regalloc Cross-Block Journey.

Technical Reference

Accumulated technical knowledge from development — LLVM pass behavior, PVM instruction semantics, code generation patterns, and optimization details.


Entry Function ABI — Unified Packed i64 Convention

All entry functions (both WAT and AssemblyScript) must use main(args_ptr: i32, args_len: i32) -> i64. The i64 return value packs a WASM pointer and length: (ptr as u64) | ((len as u64) << 32). The PVM epilogue unpacks: r7 = (ret & 0xFFFFFFFF) + wasm_memory_base, r8 = r7 + (ret >> 32).

Common constant: ptr=0, len=4 → i64.const 17179869184 (= 4 << 32).
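The packing and epilogue unpacking above can be sketched as follows; wasm_memory_base is a placeholder parameter here (the real value is computed per program).

```rust
// Pack a WASM pointer and length into the i64 entry return value:
// (ptr as u64) | ((len as u64) << 32).
fn pack_result(ptr: u32, len: u32) -> i64 {
    (ptr as u64 | ((len as u64) << 32)) as i64
}

// PVM epilogue unpacking: r7 = result start, r8 = one past the end.
fn unpack(ret: i64, wasm_memory_base: u64) -> (u64, u64) {
    let ret = ret as u64;
    let r7 = (ret & 0xFFFF_FFFF) + wasm_memory_base;
    let r8 = r7 + (ret >> 32);
    (r7, r8)
}
```

ptr=0, len=4 reproduces the common constant 17179869184 (= 4 << 32).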

Previous conventions (globals-based, multi-value (result i32 i32), simple scalar) were removed. AssemblyScript uses a writeResult(val: i32): i64 helper that stores the value and returns packResult(ptr, len).


LLVM New Pass Manager (inkwell 0.8.0 / LLVM 18)

Pass Pipeline Syntax

  • Module::run_passes() accepts a pipeline string parsed as a module-level pipeline
  • Function passes (like mem2reg, instcombine) auto-wrap as module(function(...))
  • CGSCC passes (like inline) cannot be mixed with function passes in a single string
  • To run the inliner: use a separate run_passes("cgscc(inline)") call
  • Pass parameters use angle brackets: instcombine<max-iterations=2>

instcombine Convergence

  • instcombine defaults to max-iterations=1, which can cause LLVM ERROR: Instruction Combining did not reach a fixpoint on complex IR (e.g., after aggressive inlining)
  • Fix: use instcombine<max-iterations=2> to give it a second iteration
  • Running instcombine,simplifycfg before inlining also helps by simplifying the IR first

Inlining Creates New LLVM Intrinsics

  • After inlining, instcombine may transform patterns into LLVM intrinsics that weren’t present before:
    • if x < 0 then -x else x becomes llvm.abs.i64
    • Similar patterns may produce llvm.smax, llvm.smin, llvm.umax, llvm.umin
  • The PVM backend must handle these intrinsics (see llvm_backend/intrinsics.rs)

PassBuilderOptions

  • set_inliner_threshold() is on PassManagerBuilder, NOT on PassBuilderOptions
  • PassBuilderOptions has no direct way to set the inline threshold
  • The inline pass uses LLVM’s default threshold (225) when invoked via cgscc(inline)

PVM Branch Operand Convention

Two-register branch instructions use reversed operand order: Branch_op { reg1: a, reg2: b } branches when reg2 op reg1 (i.e., b op a). For example, BranchLtU { reg1: 3, reg2: 2 } branches when reg[2] < reg[3]. This matches the Gray Paper where branch_lt_u(rA, rB) branches when ω_rB < ω_rA. In the encoding, reg1 = high nibble (rA), reg2 = low nibble (rB). Immediate-form branches are straightforward: BranchLtUImm { reg, value } branches when reg < value.

PVM Memory Layout Optimization

  • Globals only occupy the bytes they actually need: the compiler now tracks globals_region_size = (num_globals + 1 + num_passive_segments) * 4 bytes and places the heap immediately after that region instead of reserving a full 64KB block. This keeps the RW data blob limited to real globals/passive-length fields plus active data segments.
  • Dynamic heap base calculation: compute_wasm_memory_base(num_funcs, num_globals, num_passive_segments) compares the spill area (SPILLED_LOCALS_BASE + num_funcs * SPILLED_LOCALS_PER_FUNC) with the globals region end (GLOBAL_MEMORY_BASE + globals_region_size(...)) before rounding up to the next 4KB (PVM page) boundary. This typically gives 0x33000 instead of the old 0x40000, saving ~52KB per program.
  • 4KB alignment is sufficient: The SPI spec only requires page-aligned (4KB) rw_data length. The 64KB WASM page size governs memory.grow granularity, not the base address. The anan-as interpreter uses alignToPageSize(rwLength) (4KB) not segment alignment for the heap zeros start. Evidence: vendor/anan-as/assembly/spi.ts line 41: heapZerosStart = heapStart + alignToPageSize(rwLength).
  • heap_pages headroom for rw_data trimming: SPI heap_pages means “zero pages after rw_data”, but build_rw_data() trims trailing zeros. With the tighter 4KB alignment, both rw_data and heap_pages shrink, reducing total writable memory. A 16-page (64KB) headroom is added to calculate_heap_pages() to compensate. This doesn’t affect JAM file size (heap_pages is a 2-byte header field), it only tells the runtime to allocate more zero pages. Without this headroom, PVM-in-PVM tests fail for programs at the memory edge (e.g. as-tests-structs inside the anan-as interpreter).

Code Generation

  • Leaf Functions: Functions that make no calls don’t need to save/restore the return address (ra/r0) because it’s invariant. This optimization saves 2 instructions per leaf function.
  • Address Calculation: Fusing AddImm into subsequent LoadInd/StoreInd offsets reduces instruction count.
  • Dead Code Elimination: Basic DCE for ALU operations removes unused computations (e.g. from macro expansions).

StoreImm (TwoImm Encoding)

  • Opcodes 30-33: StoreImmU8/U16/U32/U64
  • TwoImm encoding: [opcode, addr_len & 0x0F, address_bytes..., value_bytes...]
  • Both address and value are variable-length signed immediates (0-4 bytes each)
  • Semantics: mem[address] = value (no registers involved)
  • Used for: data.drop (store 0 to segment length addr), global.set with constants
  • Savings: 3 instructions (LoadImm + LoadImm + StoreInd) → 1 instruction
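The TwoImm byte layout can be sketched like this. The encode_imm here is a simplified minimal-byte signed encoder (it omits the 3-byte case the real helper supports); only the layout [opcode, addr_len & 0x0F, address_bytes..., value_bytes...] matches the description above.

```rust
// Simplified variable-length signed immediate: 0, 1, 2, or 4 little-endian
// bytes (the real encoder also emits 3-byte forms).
fn encode_imm(value: i32) -> Vec<u8> {
    if value == 0 {
        vec![]
    } else if (i8::MIN as i32..=i8::MAX as i32).contains(&value) {
        vec![value as u8]
    } else if (i16::MIN as i32..=i16::MAX as i32).contains(&value) {
        (value as i16).to_le_bytes().to_vec()
    } else {
        value.to_le_bytes().to_vec()
    }
}

// TwoImm layout: opcode, then the address length in the low nibble,
// then the address bytes, then the value bytes.
fn encode_two_imm(opcode: u8, address: i32, value: i32) -> Vec<u8> {
    let addr = encode_imm(address);
    let val = encode_imm(value);
    let mut out = vec![opcode, (addr.len() as u8) & 0x0F];
    out.extend_from_slice(&addr);
    out.extend_from_slice(&val);
    out
}
```

For example, StoreImmU32 (opcode 32 per the list above) storing 7 at address 100 encodes to four bytes total.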

StoreImmInd (Store Immediate Indirect)

Encoding (OneRegTwoImm)

  • Format: [opcode, (offset_len << 4) | (base & 0x0F), offset_bytes..., value_bytes...]
  • Both offset and value use variable-length signed encoding (encode_imm)
  • Opcodes: StoreImmIndU8=70, StoreImmIndU16=71, StoreImmIndU32=72, StoreImmIndU64=73
  • Semantics: mem[reg[base] + sign_extend(offset)] = value (truncated/sign-extended per width)
  • For U64: value is sign-extended from i32 to i64

Optimization Triggers

  • emit_pvm_store: When WASM store value is a compile-time constant fitting i32
  • Saves 1 instruction (LoadImm) per constant store to WASM linear memory

ALU Immediate Opcode Folding

Immediate folding for binary operations

  • When one operand of a binary ALU op is a constant that fits in i32, use the *Imm variant (e.g., And + const → AndImm)
  • Saves 1 gas per folded instruction (no separate LoadImm/LoadImm64 needed) + code size reduction
  • Available for: Add, Mul, And, Or, Xor, ShloL, ShloR, SharR (both 32-bit and 64-bit)
  • Sub with const RHS → AddImm with negated value; Sub with const LHS → NegAddImm
  • ICmp UGT/SGT with const RHS → SetGtUImm/SetGtSImm (avoids swap trick)
  • LLVM often constant-folds before reaching the PVM backend, so benefits are most visible in complex programs

Instruction Decoder (Instruction::decode)

  • instruction.rs now has Instruction::decode(&[u8]) -> Result<(Instruction, usize)> so roundtrip tests and disassembly-style tooling can share one decode path.
  • Opcode::from_u8 / TryFrom<u8> are now the canonical byte→opcode conversion helpers for code and tests.
  • Fixed-width formats (Zero, ThreeReg, TwoReg, OneOff, TwoRegOneOff, OneRegOneExtImm, OneRegOneImmOneOff) return exact consumed length.
  • Formats with trailing variable-length immediates but no explicit terminal length marker (OneImm, OneRegOneImm, TwoRegOneImm, TwoImm, OneRegTwoImm, TwoRegTwoImm) are decoded by consuming the remaining bytes for that trailing immediate.
  • Unknown opcode passthrough is explicit: decode returns Instruction::Unknown { opcode, raw_bytes } with original bytes preserved.

Conditional Move (CmovIz/CmovNz)

Branchless select lowering

  • select i1 %cond, %true_val, %false_val now uses CmovNz instead of a branch
  • Old: load false_val, branch on cond==0, load true_val, define label (5-6 instructions)
  • New: load false_val, load true_val, load cond, CmovNz (4 instructions, branchless)
  • CmovIz/CmovNz are ThreeReg encoded: [opcode, (cond<<4)|src, dst]
  • Semantics: if reg[cond] == 0 (CmovIz) / != 0 (CmovNz) then reg[dst] = reg[src]
  • Note: CmovNz conditionally writes dst — the register cache must invalidate dst after CmovNz/CmovIz since the write is conditional
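The conditional-write semantics can be stated as a tiny interpreter sketch; the register-file array is hypothetical.

```rust
// CmovNz: if reg[cond] != 0 then reg[dst] = reg[src]; otherwise dst is
// left untouched — which is exactly why the register cache must
// invalidate dst after emitting it.
fn cmov_nz(regs: &mut [u64; 13], dst: u8, src: u8, cond: u8) {
    if regs[cond as usize] != 0 {
        regs[dst as usize] = regs[src as usize];
    }
}
```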

CmovIzImm / CmovNzImm (TwoRegOneImm Encoding)

  • Opcodes 147-148: Conditional move with immediate value
  • TwoRegOneImm encoding: [opcode, (cond << 4) | dst, imm_bytes...]
  • CmovIzImm: if reg[cond] == 0 then reg[dst] = sign_extend(imm)
  • CmovNzImm: if reg[cond] != 0 then reg[dst] = sign_extend(imm)
  • Now used: optimize select when one operand is a compile-time constant that fits in i32

LoadImmJumpInd (Opcode 180) — Implemented

  • TwoRegTwoImm encoding: fuses LoadImm + JumpInd into one instruction.
  • Semantics: reg[dst] = sign_extend(value); jump to reg[base] + sign_extend(offset).
  • call_indirect now emits LoadImmJumpInd { base: r8, dst: r0, value: preassigned_return_addr, offset: 0 }.
  • Dispatch table address math for indirect calls can use ShloLImm32(..., value=3) instead of three Add32 doublings (idx*8), reducing one hot-path sequence from 3 instructions to 1 with equivalent 32-bit wrap/sign-extension semantics.
  • Fixups remain stable by:
    • pre-assigning return jump-table slots at emission time, and
    • recording return_addr_instr == jump_ind_instr for this fused call instruction.
  • return_addr_jump_table_idx() accepts LoadImmJump, LoadImm, and LoadImmJumpInd, so mixed old/new patterns still resolve safely.
  • Important semantic pitfall: do not assume base == dst is safe for absolute jumps. Using LoadImmJumpInd for the main epilogue (EXIT_ADDRESS) caused global failures because jump target evaluation does not behave like a guaranteed “write dst first, then read base” in practice.

PVM Intrinsic Lowering

llvm.abs (absolute value)

  • Signature: llvm.abs.i32(x, is_int_min_poison) / llvm.abs.i64(x, is_int_min_poison)
  • Lowered as: if x >= 0 then x else 0 - x
  • For i32: must sign-extend first (zero-extension from load_operand makes negatives look positive in i64 comparisons)
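
A sketch of that lowering, including the i32 sign-extension pitfall (function names are illustrative):

```rust
// llvm.abs lowering as described: abs(x) = if x >= 0 { x } else { 0 - x },
// with i32 operands sign-extended to 64 bits first — a zero-extended
// negative i32 would otherwise compare as a large positive i64.
fn sign_extend_i32(x: u64) -> i64 {
    x as u32 as i32 as i64
}

fn lower_abs_i32(x: u64) -> i64 {
    let v = sign_extend_i32(x); // must happen before the signed comparison
    if v >= 0 { v } else { 0 - v }
}

fn main() {
    // 0xFFFF_FFFF is -1 as i32; zero-extended it would look like 4294967295.
    assert_eq!(lower_abs_i32(0xFFFF_FFFF), 1);
    assert_eq!(lower_abs_i32(5), 5);
}
```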

LoadImmJump for Direct Calls

Combined Instruction Replaces LoadImm64 + Jump

  • Direct function calls previously used two instructions: LoadImm64 { reg: r0, value } (10 bytes) + Jump { offset } (5 bytes) = 15 bytes, 2 gas
  • LoadImmJump { reg: r0, value, offset } (opcode 80) combines both into a single instruction: 6-10 bytes, 1 gas
  • Uses encode_one_reg_one_imm_one_off encoding: opcode(1) + (imm_len|reg)(1) + imm(0-4) + offset(4)
  • For typical call return addresses (small positive integers like 2, 4, 6), the imm field is 1 byte, so total is 7 bytes
  • LoadImmJump does not read any source registers; treat it like LoadImm/LoadImm64 in Instruction::src_regs for DCE
  • PVM-in-PVM args are passed via a temp binary file; use a unique temp dir + random filename to avoid collisions under concurrent bun test workers. Debug knobs: PVM_IN_PVM_DEBUG=1 for extra logging, PVM_IN_PVM_KEEP_ARGS=1 to retain the temp args file on disk.
  • DCE src_regs: Imm ALU ops read only src; StoreImm* reads no regs; StoreImmInd* reads base only.
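
The 6-10 byte size range above follows from the variable-length immediate. A hedged sketch (the exact immediate-length boundaries here are an assumption for illustration):

```rust
// Size of encode_one_reg_one_imm_one_off per the layout above:
// opcode(1) + (imm_len|reg)(1) + imm(0-4) + offset(4).
// imm_byte_len is an assumed minimal signed-immediate length rule.
fn imm_byte_len(value: i32) -> usize {
    if value == 0 { 0 }
    else if (-128..128).contains(&value) { 1 }
    else if (-32_768..32_768).contains(&value) { 2 }
    else if (-8_388_608..8_388_608).contains(&value) { 3 }
    else { 4 }
}

fn load_imm_jump_size(value: i32) -> usize {
    1 + 1 + imm_byte_len(value) + 4
}

fn main() {
    assert_eq!(load_imm_jump_size(0), 6);         // minimum
    assert_eq!(load_imm_jump_size(2), 7);         // typical return address
    assert_eq!(load_imm_jump_size(i32::MAX), 10); // maximum
}
```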

Pre-Assignment of Jump Table Addresses

  • Same challenge as LoadImm for return addresses: LoadImmJump has variable-size encoding, so the value must be known at emission time
  • Solution: Thread a next_call_return_idx counter through the compilation pipeline, pre-computing (index + 1) * 2 at emission time
  • During resolve_call_fixups, only the offset field is patched (always 4 bytes, size-stable)
  • The value field is verified via debug_assert! to match the actual jump table index

Bonus: Peephole Fallthrough Elimination

  • Since LoadImmJump is a terminating instruction, the peephole optimizer can remove a preceding Fallthrough
  • This saves an additional 1 byte per call site where a basic block boundary precedes the call
  • Total savings per call: 8 bytes (fused instruction) + 1 byte (Fallthrough removal) + 1 gas

Call Return Address Encoding

LoadImm vs LoadImm64 for Call Return Addresses

  • Call return addresses are jump table addresses: (jump_table_index + 1) * 2
  • These are always small positive integers (2, 4, 6, …) that fit in LoadImm (3-6 bytes)
  • Previously used LoadImm64 (10 bytes) with placeholder value 0, patched during fixup resolution
  • Problem with late patching: LoadImm has variable encoding size (2 bytes for value 0, 3 bytes for value 2), so changing the value after branch fixups are resolved corrupts relative offsets
  • Solution: Pre-assign jump table indices at emission time by threading a next_call_return_idx counter through the compilation pipeline. This way LoadImm values are known during emission, ensuring correct byte_offset tracking for branch fixup resolution
  • For direct calls, LoadImmJump combines return address load + jump into one instruction, using the same pre-assigned index
  • For indirect calls (call_indirect), LoadImmJumpInd is used to combine return-address setup and the indirect jump
  • Impact: Saves 7 bytes per indirect call site (LoadImm vs LoadImm64). Direct calls save even more via LoadImmJump fusion.
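
The return-address formula above is simple enough to state directly:

```rust
// Call return addresses are jump table addresses: (jump_table_index + 1) * 2,
// pre-assigned at emission time via the next_call_return_idx counter.
fn return_address(jump_table_index: u64) -> u64 {
    (jump_table_index + 1) * 2
}

fn main() {
    // Indices 0, 1, 2 yield the small even addresses 2, 4, 6 mentioned above,
    // all of which fit in the short LoadImm encoding.
    assert_eq!(return_address(0), 2);
    assert_eq!(return_address(1), 4);
    assert_eq!(return_address(2), 6);
}
```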

Why LoadImm64 was originally needed

  • LoadImm64 has fixed 10-byte encoding regardless of value, so placeholder patching was safe
  • LoadImm with value 0 encodes to 2 bytes, but after patching to value 2 becomes 3 bytes
  • This size change would break branch fixups already resolved with the old instruction sizes

PVM 32-bit Instruction Semantics

Sign Extension

  • All PVM 32-bit arithmetic/shift instructions produce u32SignExtend(result) — the lower 32 bits are computed, then sign-extended to fill the full 64-bit register
  • This means AddImm32(x, x, 0) after a 32-bit producer is a NOP (both sign-extend identically)
  • Confirmed in anan-as reference: add_32, sub_32, mul_32, div_u_32, rem_u_32, shlo_l_32, etc. all call u32SignExtend()
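
A sketch of the u32SignExtend rule, including why the trailing AddImm32 is a NOP:

```rust
// 32-bit PVM ops compute the low 32 bits, then sign-extend into the full
// 64-bit register (u32SignExtend in the anan-as reference).
fn u32_sign_extend(x: u32) -> u64 {
    x as i32 as i64 as u64
}

fn add_32(a: u64, b: u64) -> u64 {
    u32_sign_extend((a as u32).wrapping_add(b as u32))
}

fn add_imm_32(src: u64, imm: u32) -> u64 {
    u32_sign_extend((src as u32).wrapping_add(imm))
}

fn main() {
    // A 32-bit add that wraps into the sign bit fills the upper 32 bits.
    assert_eq!(add_32(0x7FFF_FFFF, 1), 0xFFFF_FFFF_8000_0000);
    // AddImm32(x, x, 0) after a 32-bit producer changes nothing:
    // both sign-extend identically.
    let x = add_32(0x7FFF_FFFF, 1);
    assert_eq!(add_imm_32(x, 0), x);
}
```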

Peephole Truncation Pattern

  • The pattern [32-bit-producer] → [AddImm32(x, x, 0)] is eliminated by peephole when directly adjacent
  • In practice with LLVM passes enabled, instcombine already eliminates trunc(32-bit-op) at the LLVM IR level, so this peephole pattern fires rarely
  • The peephole is still valuable for --no-llvm-passes mode and as defense-in-depth
  • Known limitation: the pattern only matches directly adjacent instructions; a StoreIndU64 between producer and truncation breaks the match

Peephole AddImm Width Safety

  • optimize_address_calculation() must not fold address relations across AddImm32/AddImm64 width boundaries.
  • Track AddImm relation width alongside (base, offset) and only fold when widths match (32→32, 64→64), while still allowing width-agnostic MoveReg alias folding.

Cross-Block Register Cache

Approach

  • Pre-scan computes block_single_pred map by scanning terminator successors
  • For each block with exactly 1 predecessor and no phi nodes, restore the predecessor’s cache snapshot instead of clearing
  • Snapshot is taken before the terminator instruction to avoid capturing path-specific phi copies

Key Pitfall: Terminator Phi Copies

  • lower_switch emits phi copies for the default path inline (not in a trampoline)
  • These phi copies modify the register cache (storing values to phi slots)
  • If the exit cache includes these entries, they are WRONG for case targets (which don’t take the default path)
  • Fix: snapshot before the terminator and invalidate TEMP1/TEMP2 (registers the terminator clobbers for operand loads)
  • Same issue can occur with conditional branches when one path has phis and the other doesn’t (trampoline case)

Specialized PVM Instructions for Common Patterns

Absolute Address Load/Store (LoadU32/StoreU32)

  • LoadU32 { dst, address } replaces LoadImm { reg, value: addr } + LoadIndU32 { dst, base: reg, offset: 0 } for known-address loads (globals)
  • StoreU32 { src, address } similarly replaces the store pattern
  • OneRegOneImm encoding: [opcode, reg & 0x0F, encode_imm(address)...]
  • PVM-in-PVM layout sensitivity: Replacing multi-instruction sequences with single instructions changes bytecode layout (code size, jump offsets). Test each significant code generation change with the full PVM-in-PVM suite.
  • LoadU32 is used for lower_wasm_global_load. StoreU32 is used for lower_wasm_global_store. Both absolute-address variants are now emitted everywhere applicable.

LoadIndI32 (Sign-Extending Indirect Load)

  • Replaces LoadIndU32 { dst, base, offset } + AddImm32 { dst, src: dst, value: 0 } for signed i32 loads
  • Single instruction: LoadIndI32 { dst, base, offset } (sign-extends result to 64 bits)
  • Safe for PVM-in-PVM (small layout change)

Min/Max/MinU/MaxU (Single-Instruction Min/Max)

  • Replaces SetLt + branch + stores + jump pattern (~5-8 instructions) with Min/Max/MinU/MaxU (1 instruction)
  • For i32 signed variants, must keep AddImm32 { value: 0 } sign-extension before the instruction (PVM compares full 64-bit values)

ReverseBytes (Byte Swap)

  • llvm.bswap intrinsic lowered as ReverseBytes { dst, src } instead of byte-by-byte extraction
  • For sub-64-bit types: add ShloRImm64 to align bytes (48 for i16, 32 for i32)
  • Savings: i16: ~10→2 instructions, i32: ~20→2, i64: ~40→1
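
The realignment shift can be checked with a small sketch (using Rust's `swap_bytes` as a stand-in for ReverseBytes):

```rust
// ReverseBytes operates on the full 64-bit register, so sub-64-bit types
// need a logical right shift to realign: 48 for i16, 32 for i32.
fn reverse_bytes(x: u64) -> u64 {
    x.swap_bytes()
}

fn bswap_i32(x: u32) -> u32 {
    // ReverseBytes, then ShloRImm64(32)
    (reverse_bytes(x as u64) >> 32) as u32
}

fn bswap_i16(x: u16) -> u16 {
    // ReverseBytes, then ShloRImm64(48)
    (reverse_bytes(x as u64) >> 48) as u16
}

fn main() {
    assert_eq!(bswap_i32(0x1234_5678), 0x7856_3412);
    assert_eq!(bswap_i16(0x1234), 0x3412);
}
```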

CmovIzImm/CmovNzImm (Conditional Move with Immediate)

  • For select with one constant operand: CmovNzImm { dst, cond, value } or CmovIzImm { dst, cond, value }
  • Load non-constant operand as default, then conditionally overwrite with immediate
  • Note: LLVM may invert conditions, so select(cond, true_const, false_runtime) may emit CmovIzImm instead of CmovNzImm

RotL/RotR (Rotate Instructions)

  • llvm.fshl(a, b, amt) / llvm.fshr(a, b, amt) when a == b (same SSA value) → rotation
  • Detected via val_key_basic(a) == val_key_basic(b) identity check
  • fshl with same operands → RotL32/RotL64, fshr → RotR32/RotR64
  • Falls back to existing shift+or sequence when operands differ
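
The identity `fshl(a, a, s) == rotl(a, s)` that this detection relies on:

```rust
// Funnel shift left: shift a left by s, filling the vacated low bits with
// the top bits of b. When a == b, this is exactly a left rotation.
fn fshl64(a: u64, b: u64, amt: u32) -> u64 {
    let s = amt % 64;
    if s == 0 { a } else { (a << s) | (b >> (64 - s)) }
}

fn main() {
    let x = 0x0123_4567_89AB_CDEFu64;
    assert_eq!(fshl64(x, x, 8), x.rotate_left(8));
    assert_eq!(fshl64(x, x, 0), x);
}
```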

Linear-Scan Register Allocation

  • Allocates SSA values to physical registers using spill-weight eviction (spill weight = sum of 10^loop_depth over all uses).
  • Operates on LLVM IR before PVM lowering; produces ValKey → physical register mapping
  • load_operand checks regalloc before slot lookup: uses MoveReg from allocated reg instead of LoadIndU64 from stack
  • store_to_slot uses write-through: copies to allocated reg AND stores to stack; DSE removes the stack store if never loaded
  • r5/r6 allocatable in safe leaf functions (no bulk memory ops or funnel shifts); detected by scratch_regs_safe() LLVM IR scan
  • r7/r8 allocatable in all leaf functions; lowering paths that use them as scratch trigger invalidate_reg via emit()
  • Clobbered allocated scratch regs (when present) are handled with lazy invalidation/reload instead of eager spill+reload
  • Allocates in all functions (looped and straight-line), not just loop-heavy code
  • MIN_USES default=2 (aggressive=1); values with fewer uses are skipped
  • Loop extension: back-edges detected by successor having lower block index; live ranges extended to cover the back-edge source
  • Eviction uses spill weight (sum of 10^loop_depth per use) instead of furthest-end heuristic
  • linear_scan must track active assignments separately from final assignments:
    • naturally expired intervals should remain in the final val_to_reg/slot_to_reg maps (their earlier uses still benefit),
    • evicted intervals must be removed from final mapping (whole-interval mapping is no longer valid after eviction).
  • Unit tests cover both interval outcomes (non-overlapping reuse and eviction dropping).
  • Targeted benchmark fixture: tests/fixtures/wat/regalloc-two-loops.jam.wat (regalloc two loops(500) row).
  • Regalloc instrumentation:
    • regalloc::run() logs candidate/assignment stats at target wasm_pvm::regalloc (enable via RUST_LOG=wasm_pvm::regalloc=debug).
    • lower_function() logs per-function summary including allocation usage counters (alloc_load_hits, alloc_store_hits).
  • Instrumentation root cause and fix:
    • Root cause was allocatable_regs=0 in non-leaf functions because only leaf functions exposed r9-r12 to regalloc.
    • Fix: expose available r9-r12 registers in both leaf and non-leaf functions; reserve outgoing argument registers (r9..r9+max_call_args-1) from non-leaf allocation and invalidate local-register mappings after calls.
    • Example (regalloc-two-loops): allocatable_regs=2, allocated_values=4, alloc_load_hits=11, alloc_store_hits=8.
  • Non-leaf stabilization:
    • Reserve outgoing call-argument registers (r9.. by max call arity) from the non-leaf allocatable set.
    • Initially, alloc_reg_valid was reset at label boundaries (define_label / define_label_preserving_cache) because that validity state was not path-sensitive and CacheSnapshot did not yet snapshot alloc_reg_slot during cross-block cache propagation.
    • Without boundary reset, large workloads (notably anan-as-compiler.jam) can miscompile under pvm-in-pvm despite direct tests passing.
  • Follow-up stabilization:
    • Corrective follow-up: CacheSnapshot now includes allocated-register slot ownership (alloc_reg_slot), which replaced the earlier label-boundary alloc_reg_valid reset approach by restoring allocation state path-sensitively across propagated edges.
    • alloc_reg_valid was removed; slot identity (alloc_reg_slot == Some(slot)) is sufficient to decide whether a lazy reload is needed.
    • Non-leaf gate: skip when no allocatable registers remain (all r9-r12 used by params/call args). Previously skipped at <2 regs and <24 SSA values, but these conservative gates were removed in Phase 2 (#165).
  • Post-fix benchmark shape: consistent JAM size reductions from regalloc, but gas/time gains are workload-dependent and often near-noise on current microbenchmarks.
  • Leaf detection fix: PVM intrinsics (__pvm_load_i32, __pvm_store_i32, etc.) are LLVM Call instructions but are NOT real function calls — they’re lowered inline using temp registers only. The is_real_call() function in emitter.rs distinguishes real calls (wasm_func_*, __pvm_call_indirect) from intrinsics (__pvm_*, llvm.*). Before this fix, ALL functions with memory access were classified as non-leaf, causing unnecessary callee-save prologue/epilogue overhead.
  • Cross-block alloc_reg_slot propagation: In leaf functions (no real calls), alloc_reg_slot is preserved across all block boundaries because allocated registers are never clobbered. In non-leaf functions with multi-predecessor blocks, predecessor exit snapshots are intersected — only entries where ALL processed predecessors agree are kept. Back-edges (unprocessed predecessors) are treated conservatively.
  • Phi node allocation is a gas regression in PVM: Allocating phi nodes at loop headers adds +1 MoveReg per iteration per phi (write-through to allocated reg) with 0 gas savings (MoveReg replaces LoadIndU64, both cost 1 gas). Net: +1 gas per iteration per allocated phi. Only beneficial when loads are cheaper than stores, when allocated regs can be used directly by instructions (avoiding MoveReg to temps), or when code size matters more than gas.
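
The spill-weight heuristic mentioned above (sum of 10^loop_depth per use) can be sketched as:

```rust
// Each use contributes 10^loop_depth to an interval's spill weight, so a
// single use inside a loop outweighs several straight-line uses. The
// function shape here is illustrative, not the compiler's real API.
fn spill_weight(use_loop_depths: &[u32]) -> u64 {
    use_loop_depths.iter().map(|&d| 10u64.pow(d)).sum()
}

fn main() {
    // Three straight-line uses: weight 3.
    assert_eq!(spill_weight(&[0, 0, 0]), 3);
    // One use at loop depth 1 already outweighs them: weight 10.
    assert_eq!(spill_weight(&[1]), 10);
    // A depth-2 use dominates everything else: 100 + 1.
    assert_eq!(spill_weight(&[2, 0]), 101);
}
```

Eviction then prefers to spill the interval with the lowest weight, rather than the one whose interval ends furthest away.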

Fused Inverted Bitwise (AndInv / OrInv / Xnor)

  • and(a, xor(b, -1)) → AndInv(a, b) (bit clear): saves 1 instruction (eliminates separate Xor for NOT)
  • or(a, xor(b, -1)) → OrInv(a, b) (or-not): same pattern
  • xor(a, xor(b, -1)) → Xnor(a, b) (equivalence): note that LLVM instcombine may reassociate xor(a, xor(b, -1)) to xor(xor(a, b), -1), which makes Xnor fire less often in practice
  • Detection is commutative: checks both LHS and RHS for the NOT pattern
  • All three use ThreeReg encoding: [opcode, (src2<<4)|src1, dst]
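
As plain expressions, the three fused ops are:

```rust
// Semantics of the fused inverted bitwise ops; xor with -1 is bitwise NOT.
fn and_inv(a: u64, b: u64) -> u64 { a & !b } // and(a, xor(b, -1)): bit clear
fn or_inv(a: u64, b: u64) -> u64 { a | !b }  // or(a, xor(b, -1)): or-not
fn xnor(a: u64, b: u64) -> u64 { !(a ^ b) }  // xor(a, xor(b, -1)): equivalence

fn main() {
    let (a, b) = (0b1100u64, 0b1010u64);
    assert_eq!(and_inv(a, b), a & (b ^ u64::MAX));
    assert_eq!(or_inv(a, b), a | (b ^ u64::MAX));
    assert_eq!(xnor(a, b), a ^ (b ^ u64::MAX));
}
```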

CmovIz Register Form for Inverted Select

  • select(!cond, true_val, false_val) now uses CmovIz instead of computing the inversion + CmovNz
  • Detected patterns: xor(cond, 1) (boolean flip) and icmp eq cond, 0 (i32.eqz)
  • Saves 2-3 instructions by avoiding the boolean inversion sequence
  • Note: LLVM instcombine often folds select(icmp eq x, 0, tv, fv) → select(x, fv, tv), so the pattern fires mainly in edge cases or with specific IR shapes

Intentionally Not Emitted Opcodes

  • MulUpperSS/UU/SU (213-215): No WASM operator produces 128-bit multiply upper halves
  • Alt shift immediates (reversed): dst = imm OP src form — no WASM pattern generates this (LLVM canonicalizes register on LHS)
  • Absolute address non-32-bit sizes: All WASM globals use 4-byte (i32) slots; no need for U8/U16/U64 absolute address variants

RW Data Trimming

  • translate::build_rw_data() now trims trailing zero bytes before SPI encoding.
  • Semantics remain correct because heap pages are zero-initialized, so the trimmed trailing zero bytes read back as zero anyway.
  • This is a low-risk blob-size optimization and does not materially affect gas.
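
A minimal sketch of the trimming step, assuming a plain byte vector (build_rw_data's real signature may differ):

```rust
// Drop the zero tail from rw_data before SPI encoding; interior zeros
// must be preserved, only the trailing run is removable.
fn trim_trailing_zeros(data: &mut Vec<u8>) {
    while data.last() == Some(&0) {
        data.pop();
    }
}

fn main() {
    let mut rw = vec![1, 2, 0, 3, 0, 0, 0];
    trim_trailing_zeros(&mut rw);
    assert_eq!(rw, vec![1, 2, 0, 3]); // interior zeros kept, tail removed
}
```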

Fallthrough Jump Elimination

  • When LLVM block N ends with an unconditional branch to block N+1 (next in layout order), the Jump can be skipped — execution falls through naturally.
  • Controlled by fallthrough_jumps optimization flag (--no-fallthrough-jumps to disable).
  • Implementation: PvmEmitter.next_block_label tracks the label of the next block. emit_jump_to_label() skips the Jump when the target matches next_block_label.
  • Critical pitfall — phi node trampolines: When conditional branches target blocks with phi nodes, the codegen emits per-edge trampoline code (phi copies + Jump) between blocks. The emit_jump_to_label() in trampoline code must NOT be eliminated, because the jump is not the last instruction before the next block’s define_label. Fix: lower_br and lower_switch temporarily clear next_block_label during trampoline emission.
  • Entry header shrunk from 10 to 6 bytes when there is no secondary entry (removed the 4 Fallthrough padding instructions after Trap).
  • Main function emitted first (right after entry header) to minimize Jump distance.

Memory Layout Sensitivity (PVM-in-PVM)

  • Moving the globals/overflow/spill region around directly affects the base address that the interpreter loads as the WASM heap, so every change still requires a full pvm-in-pvm validation. Direct/unit runs may look fine, but the outer interpreter can panic if the linear memory isn’t page-aligned or overlaps reserved slots.
  • Critical: The parameter overflow area must be >= GLOBAL_MEMORY_BASE (0x30000) because the SPI rw_data zone starts at 0x30000. The gap zone (0x20000-0x2FFFF) between ro_data and rw_data is unmapped. Placing constants in the gap zone causes PVM panics.
  • The compact layout places the parameter overflow area dynamically right after globals (no fixed address), and SPILLED_LOCALS_BASE/SPILLED_LOCALS_PER_FUNC have been removed. This reduces the gap between globals and WASM linear memory, saving ~8KB RW data for typical programs (WASM memory base moves from ~0x33000 to ~0x31000 for a program with 5 globals).

Benchmark Comparison Parsing

  • tests/utils/benchmark.sh emits two different result tables:
    • Direct: Benchmark | WASM Size | JAM Size | Gas Used | Time
    • PVM-in-PVM: Benchmark | JAM Size | Outer Gas Used | Time
  • Branch comparison must parse JAM size and gas from the correct columns per table header (direct rows use columns 3/4; PiP rows use 2/3).
  • With set -u, EXIT trap handlers must not depend on function-local variables at exit time; expand local values when installing the trap.

Peephole Immediate Chain Fusion (2026-03)

  • LoadImm + AddImm fusion: LoadImm r1, A; AddImm r1, r1, B → LoadImm r1, A+B
    • Saves 1 instruction when loading a value then adjusting it
    • Only applies when combined result fits in i32
  • Chained AddImm fusion: AddImm r1, r1, A; AddImm r1, r1, B → AddImm r1, r1, A+B
    • Collapses sequences of incremental adjustments
    • Common in address calculations and loop induction variables
  • MoveReg self-elimination: MoveReg r1, r1 → removed entirely (no-op)
    • Can appear after register allocation or phi lowering
  • Implementation in peephole.rs::optimize_immediate_chains()
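
The three patterns above can be sketched over a tiny illustrative instruction type (not the compiler's real Instruction enum or its optimize_immediate_chains signature):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Inst {
    LoadImm { reg: u8, value: i64 },
    AddImm { dst: u8, src: u8, value: i64 },
    MoveReg { dst: u8, src: u8 },
}

fn fits_i32(v: i64) -> bool {
    i32::try_from(v).is_ok()
}

fn fuse(insts: &[Inst]) -> Vec<Inst> {
    let mut out: Vec<Inst> = Vec::new();
    for &inst in insts {
        // MoveReg r, r is a no-op: drop it entirely.
        if let Inst::MoveReg { dst, src } = inst {
            if dst == src { continue; }
        }
        if let Inst::AddImm { dst, src, value: b } = inst {
            if dst == src {
                match out.last().copied() {
                    // LoadImm r, A ; AddImm r, r, B  ->  LoadImm r, A+B
                    Some(Inst::LoadImm { reg, value: a })
                        if reg == dst && fits_i32(a.wrapping_add(b)) => {
                        out.pop();
                        out.push(Inst::LoadImm { reg, value: a.wrapping_add(b) });
                        continue;
                    }
                    // AddImm r, r, A ; AddImm r, r, B  ->  AddImm r, r, A+B
                    Some(Inst::AddImm { dst: d, src: s, value: a })
                        if d == s && d == dst && fits_i32(a.wrapping_add(b)) => {
                        out.pop();
                        out.push(Inst::AddImm { dst, src, value: a.wrapping_add(b) });
                        continue;
                    }
                    _ => {}
                }
            }
        }
        out.push(inst);
    }
    out
}

fn main() {
    let program = vec![
        Inst::LoadImm { reg: 1, value: 100 },
        Inst::AddImm { dst: 1, src: 1, value: 8 },
        Inst::MoveReg { dst: 2, src: 2 },
    ];
    assert_eq!(fuse(&program), vec![Inst::LoadImm { reg: 1, value: 108 }]);
}
```

Note the fits-in-i32 guard mirrors the constraint above: fusion only applies when the combined immediate still encodes.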

Comparison Code Size Optimizations (2026-03)

  • NE comparison optimization was reverted for correctness in PVM-in-PVM: Xor + SetGtUImm(0) looked equivalent to Xor + LoadImm(0) + SetLtU, but it regressed as-decoder-subarray-test in layer5 (inner run returned empty Result: [0x]). Keep the conservative LoadImm(0) + SetLtU lowering for icmp ne.
  • i1→i64 sign-extension: LoadImm(0) + Sub64 → NegAddImm64(0)
    • Original: 2 instructions to compute 0 - val (negating the boolean yields 0/-1)
    • Optimized: 1 instruction using NegAddImm64, which computes dst = imm - src
    • NegAddImm64(dst, src, 0) gives dst = 0 - src = -src
    • Saves 1 instruction per boolean sign-extension

PVM-in-PVM Ecalli Forwarding (2026-03)

  • Dynamic ecalli index is not supported by PVM: The ecalli instruction takes a static u32 immediate. To forward inner program ecalli with dynamic indices, either use a per-ecalli dispatch table in the adapter or use a fixed “proxy” ecalli with a data buffer protocol.
  • Adapter import resolution against main exports: adapter_merge.rs resolves adapter imports matching main export names internally. Key use case: adapter importing host_read_memory / host_write_memory (exported by the compiler module) to access inner PVM memory during ecalli handling.
  • Scratch buffer protocol for trace replay: The replay adapter allocates a single WASM memory page (memory.grow(1)) on the first ecalli call and caches the address at a sentinel location (0xFFFF0) for reuse on subsequent calls. The outer handler writes the ecalli response ([8:new_r7][8:new_r8][4:num_memwrites][8:new_gas][memwrites...]) to the buffer at the PVM address obtained via pvm_ptr. The adapter reads the response, applies memwrites via host_write_memory, and returns the new register values.
  • Adapter globals not supported: adapter_merge only merges function-related sections (types, imports, functions, code) from the adapter. Globals, data sections, and memory declarations from the adapter are NOT included in the merged module. Workaround: use main module memory with fixed addresses or memory.grow.
  • host_call_N requires compile-time constant ecalli index: The first argument to host_call_N imports must be a compile-time constant because it becomes the immediate operand of the PVM ecalli instruction. Runtime ecalli indices (e.g., forwarded from inner programs) cause compilation failure.

Register-Aware Phi Resolution (Phase 5, 2026-03)

  • Ordering dependencies between reg→reg and reg→stack phi copies: When phi copies include both register-to-register copies and copies involving stack, they must be treated as a single set of parallel moves. An initial implementation separated them into two independent phases, but this caused incorrect results when a reg→reg copy clobbered a source register that a reg→stack copy also needed. The fix: use a unified two-pass approach (load ALL incoming values into temp registers first, then store all to destinations).
  • Phi destinations must be restored after define_label: After define_label clears all alloc state at a block boundary, blocks with phi nodes must call restore_phi_alloc_reg_slots to re-establish alloc_reg_slot for phi destinations. Without this, load_operand falls back to stack loads, missing the values that the phi copy placed in registers.
  • Dirty phi values and block exit: After restore_phi_alloc_reg_slots marks phi destinations as dirty, the before-terminator spill_all_dirty_regs() writes them to the stack. This is essential: non-phi successor blocks (like loop exit blocks) clear alloc state and read from the stack. Without the spill, exit paths read stale stack values. This limits the code-size benefit of lazy spill — each iteration still writes phi values to the stack once via the before-terminator spill.
  • alloc_reg_slot shared between phi destination and incoming value: The same SSA value can be both a phi destination (in the header) and an incoming value (from the body). After mem2reg, phi incoming values from the loop body ARE the phi results from the current iteration. The regalloc may assign them the same physical register. When phi_reg == incoming_reg, the phi copy is a no-op (the value is already in the right register).
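
The unified two-pass parallel-move scheme can be sketched with illustrative string-keyed locations (the real implementation works on registers and stack slots):

```rust
use std::collections::HashMap;

// Treat all phi copies as ONE parallel-move set: read every source into a
// temp first, then write every destination. No copy can clobber a source
// that another copy still needs.
fn parallel_phi_copies(
    state: &mut HashMap<&'static str, u64>,
    copies: &[(&'static str, &'static str)], // (src, dst) pairs
) {
    // Pass 1: load all incoming values before any destination is written.
    let temps: Vec<u64> = copies.iter().map(|&(src, _)| state[src]).collect();
    // Pass 2: store all temps to their destinations.
    for (&(_, dst), tmp) in copies.iter().zip(temps) {
        state.insert(dst, tmp);
    }
}

fn main() {
    let mut state = HashMap::from([("r1", 10u64), ("r2", 20), ("stack0", 0)]);
    // r1 -> r2 and r2 -> stack0: a sequential order would clobber r2
    // before the r2 -> stack0 copy reads it.
    parallel_phi_copies(&mut state, &[("r1", "r2"), ("r2", "stack0")]);
    assert_eq!(state["r2"], 10);
    assert_eq!(state["stack0"], 20);
}
```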

Load-Side Coalescing (Phase 8, 2026-03)

  • Eliminating MoveReg by reading directly from allocated registers: operand_reg() checks if a value is currently live in its allocated register and returns that register directly. Lowering code uses the allocated register as the instruction’s source operand instead of loading into TEMP1/TEMP2, eliminating the MoveReg that load_operand() would have emitted. This complements store-side coalescing — together they eliminate moves on both sides of instructions.
  • Dst-conflict safety: When an operand’s allocated register equals the instruction’s destination register (result_reg), the operand must fall back to a temp register. Otherwise, emit() → invalidate_reg(dst) auto-spills the old value and clears alloc tracking before the instruction reads the operand. While the PVM instruction itself would execute correctly (read-before-write at hardware level), the conservative approach avoids subtle alloc-state corruption in edge cases.
  • Div/rem excluded from coalescing: Signed division/remainder trap code (emit_wasm_signed_overflow_trap) uses SCRATCH1 (r5) as scratch for sign-extending 32-bit operands. If the LHS operand is in r5, the trap code clobbers it before the div instruction can read it. Rather than adding per-operation conflict checks, div/rem operations always load into TEMP1/TEMP2.
  • Immediate-folding paths coalesced: The commutative_imm_instruction helper was parameterized to accept a src register instead of hardcoding TEMP1. This allows immediate-folding paths (the most common for LLVM-optimized code) to use the allocated register directly. Shift/sub immediate paths were similarly updated.
  • Store instructions have no dst conflict: PVM store instructions (StoreIndU8, etc.) write to memory, not to a register, so they have no destination register. Both address and value operands can freely use allocated registers without conflict checks.
  • Impact: The fib(20) benchmark dropped from 613 to 511 gas (17%), regalloc two loops from 23,334 to 16,776 gas (28%), and the anan-as PVM interpreter JAM size from 164.9 KB to 158.9 KB (3.6%).

Rematerialization — Why It Doesn’t Work Here (Phase 8 investigation, 2026-03)

Rematerialization (reloading values with LoadImm instead of LoadIndU64 from the stack) was investigated and found to have zero practical impact in this architecture. Three approaches were tried and all failed for the same fundamental reason:

Approach 1: LLVM IR constant detection — Evaluate LLVM instructions with all-constant operands (e.g., add(3, 5)8). Why it fails: LLVM’s IRBuilder constant-folds at instruction creation time, before any passes run. LLVMBuildAdd(3, 5) produces the constant 8 directly — the add instruction is never created. This applies to ALL pure computations (binary ops, casts). Even with --no-llvm-passes, no instruction with all-constant operands survives construction.

Approach 2: PVM emitter constant tracking — Capture reg_to_const[src_reg] at store_to_slot time. Why it fails: reg_to_const is only set by LoadImm/LoadImm64 instructions. All compute instructions (Add, Sub, etc.) clear it via emit() → invalidate_reg(dst). So at store_to_slot time, reg_to_const[alloc_reg] is Some only when the last instruction was LoadImm — which means the original value IS an LLVM constant. But LLVM constants are caught by get_sign_extended_constant() at the top of load_operand(), before the regalloc path is entered. On reload, the same check fires again and emits LoadImm directly. The regalloc reload path is never reached.

Approach 3: Regalloc-level constant map — Track val_constants: HashMap<ValKey, u64> and check it during reload. Same root cause: no non-constant LLVM value produces a compile-time-known PVM result.

Root cause summary: Every value that enters the regalloc reload path is a non-constant instruction result (parameter, ALU result, memory load, phi, call return). Constants are intercepted by get_sign_extended_constant() before reaching the alloc code path. There is no gap between “LLVM knows it’s constant” and “the emitter needs to reload it.”

What WOULD make rematerialization useful: Extending PVM-level constant propagation beyond LoadImm/LoadImm64 — e.g., tracking that AddImm32 { dst, src, value: 0 } where reg_to_const[src] is known means reg_to_const[dst] is computable. This is a significant feature (PVM-level constant folding across all instruction types) with uncertain ROI.

Store-Side Coalescing (Phase 7, 2026-03)

  • Avoiding MoveReg by computing directly into allocated registers: result_reg() returns the allocated register for the current instruction’s result slot, allowing ALU/memory-load/intrinsic lowering to use it as the output destination. This eliminates the MoveReg that store_to_slot would otherwise emit to copy from TEMP_RESULT into the allocated register. On the anan-as compiler, this reduced store_moves by 54% (2720 to 1262) and total instructions by 4%.
  • lower_select store-side coalescing cannot be used: Loading the default value into the allocated register via load_operand(val, alloc_reg) triggers invalidate_reg(alloc_reg) in emit(), which corrupts register cache state for subsequent operand loads. However, load-side coalescing works (Phase 9): operand_reg() is used for all Cmov operands so values already in their allocated registers are used directly without MoveReg copies. This is safe because all select operands are simultaneously live (the allocator guarantees different registers) and the Cmov instruction’s dst register is only invalidated by emit(), not by load_operand() on the other operands.
  • result_reg_or() needed for zext/sext/trunc: These lowering paths use TEMP1 (not TEMP_RESULT) as the working register in the non-allocated case, because the source operand is already in TEMP1 and the in-place truncation/extension writes back to the same register. Using TEMP_RESULT would require an extra MoveReg. result_reg_or(TEMP1) returns the allocated register when available, or TEMP1 as fallback, preserving the existing efficient non-allocated codepath.
  • Control-flow-spanning TEMP_RESULT uses cannot be coalesced: emit_pvm_memory_grow and lower_abs both use TEMP_RESULT across branches (grow success/failure, positive/negative paths). Computing into the allocated register would corrupt it if the branch takes the alternative path. These remain uncoalesced.

Spill Weight Refinement and Call Return Hints (Phase 9, 2026-03)

  • Spill weight call penalty: Values whose live ranges span real call instructions receive a penalty of 2.0 per spanning call to their spill weight. This represents the cost of the spill+reload pair required when a register is allocated across a call boundary. Binary search on sorted call positions enables efficient counting. Trade-off: a tiny regression in very small functions with a single call (e.g., host-call-log: +3 gas) for consistent improvements in larger functions (e.g., AS fib: -2 gas, aslan-fib: -28 gas).
  • Call return value register hints: The linear scan allocator accepts preferred_reg hints on live intervals. Values defined by real call instructions get a hint for r7 (RETURN_VALUE_REG), since the return value is already in r7 after a call. If r7 is free, it’s used; otherwise, a different register is allocated. This eliminates the MoveReg from r7 to the allocated register in store_to_slot.
  • is_real_call() made pub(super): The function distinguishing real calls from PVM/LLVM intrinsics was made module-visible so regalloc.rs can use it for call position collection without code duplication.
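
The binary-search counting of spanning calls reduces to two `partition_point` lookups over the sorted call positions (a sketch; the interval representation is illustrative):

```rust
// Count real-call positions inside a live range [start, end), given
// call_positions sorted ascending.
fn spanning_calls(call_positions: &[usize], start: usize, end: usize) -> usize {
    let lo = call_positions.partition_point(|&p| p < start);
    let hi = call_positions.partition_point(|&p| p < end);
    hi - lo
}

fn main() {
    let calls = [3, 10, 25];
    assert_eq!(spanning_calls(&calls, 0, 5), 1);  // spans the call at 3
    assert_eq!(spanning_calls(&calls, 4, 30), 2); // spans 10 and 25
    // Spill-weight penalty of 2.0 per spanning call, as described above.
    let penalty = 2.0 * spanning_calls(&calls, 4, 30) as f64;
    assert_eq!(penalty, 4.0);
}
```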

Loop Phi Early Interval Expiration (Phase 10, 2026-03)

  • Post-allocation coalescing doesn’t work: Three approaches were tried and all failed due to the emitter’s per-register alloc_reg_slot tracking disagreeing with the allocator’s per-value liveness model. See git history for details.
  • Early interval expiration works: Modifying the linear scan to expire loop phi destination intervals at their actual last use (before loop extension) frees the register earlier. The incoming back-edge value naturally gets the freed register via the free pool. Since the linear scan’s slot_to_reg maps reflect both assignments from the start, the emitter handles transitions correctly.
  • Pressure guard: When intervals.len() > allocatable_regs.len() * 2, early expiration is disabled. Under high pressure, freed phi registers get taken by unrelated values, causing reload traffic that outweighs the MoveReg savings.
  • Phi copy no-op: When incoming_reg == phi_reg AND the register currently holds the incoming value (verified by is_alloc_reg_valid), the phi copy is skipped — just update alloc_reg_slot. The is_alloc_reg_valid check is critical: without it, a third value that overwrote the register between the incoming’s store and the phi copy would cause silent data corruption.
  • store_to_slot safety: When storing to a slot whose allocated register currently holds a DIFFERENT dirty slot, spill the dirty value first. Prevents data loss when multiple slots share a register via early expiration.
  • Impact: fib(20) -15.7% gas / -7.2% code, factorial -5.6% gas. No regressions.

Cross-Block Alloc State Propagation (Phase 11, 2026-03)

  • Back-edge dominator propagation instead of clearing: At loop headers with unprocessed predecessor back-edges, instead of clearing all alloc_reg_slot entries, the dominator predecessor’s alloc state is propagated through set_alloc_reg_slot_filtered(). This avoids unnecessary reloads at loop entry for values that remain valid across the back-edge.
  • Register class filtering for safety: Non-leaf functions only propagate callee-saved registers beyond max_call_args — these are the only registers guaranteed safe across all paths (never clobbered by calls). Caller-saved registers (r5-r8) are excluded because other paths may invalidate them. Leaf functions with lazy spill propagate all registers since no calls exist.
  • Leaf+lazy_spill intersection: Multi-predecessor blocks in leaf functions with lazy spill now use the same intersection logic as non-leaf functions. Previously, leaf+lazy_spill blocks used define_label (clear all) at every block boundary. With the pred_map now available, the intersection approach keeps entries that all processed predecessors agree on.
  • pred_map condition expanded: The predecessor map was previously built only for non-leaf functions. It is now built whenever has_regalloc && (!is_leaf || lazy_spill_enabled), enabling alloc state propagation for leaf functions with lazy spill.
  • Impact: fib(20) -5.1% gas, factorial(10) -7.1% gas, is_prime(25) -4.6% gas, PiP aslan-fib -0.52% gas.
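The filtering rule can be sketched as follows. The `(register, slot)` pair representation, the register numbers, and the argument-window arithmetic (arguments starting at r9, callee-saved r9-r12) are assumptions for illustration, not the real emitter code.

```rust
// Which register→slot entries may be propagated across a back-edge into a
// loop header: in non-leaf functions only callee-saved registers beyond the
// call-argument window survive every path; leaf functions with lazy spill
// can propagate everything, since no calls exist.
fn survives_back_edge(reg: u8, is_leaf_with_lazy_spill: bool, max_call_args: u8) -> bool {
    if is_leaf_with_lazy_spill {
        return true;
    }
    reg >= 9 + max_call_args && reg <= 12
}

fn propagate_filtered(
    dominator_exit: &[(u8, u32)], // (register, slot) pairs at the dominator's exit
    is_leaf_with_lazy_spill: bool,
    max_call_args: u8,
) -> Vec<(u8, u32)> {
    dominator_exit
        .iter()
        .copied()
        .filter(|&(reg, _)| survives_back_edge(reg, is_leaf_with_lazy_spill, max_call_args))
        .collect()
}
```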

Callee-Saved Preference for Call-Spanning Intervals (Phase 12, 2026-03)

  • Problem: The linear scan’s default free_regs.pop() behavior assigns callee-saved registers (added last to allocatable_regs) to the FIRST intervals processed. Call-spanning intervals, penalized by CALL_SPANNING_PENALTY, sort later and get caller-saved registers that are invalidated after every call — the opposite of what’s optimal.
  • Solution: LiveInterval.spans_calls flag marks intervals whose live range contains at least one real call. In non-leaf functions, call-spanning intervals explicitly prefer callee-saved registers (r9-r12 beyond max_call_args), while non-call-spanning intervals prefer caller-saved (r5-r8). In leaf functions, all registers are equal (no preference applied). The preferred_reg hint (e.g., r7 for call return values) takes priority over the class preference.
  • Impact: Modest — primarily benefits non-leaf functions with call-spanning values. anan-as PVM interpreter -0.2% code size. Most benchmarks are leaf-dominated.
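A hedged sketch of the class preference, assuming a free-list of register numbers with caller-saved r5-r8 and callee-saved r9+ (the representation and helper are invented for this example):

```rust
// Pick a register for an interval, honoring (1) an explicit hint,
// (2) the call-spanning class preference in non-leaf functions,
// (3) plain free-list pop otherwise.
fn pick_register(
    free: &mut Vec<u8>,
    spans_calls: bool,
    is_leaf: bool,
    preferred: Option<u8>,
) -> Option<u8> {
    // An explicit hint (e.g. r7 for call return values) wins outright.
    if let Some(p) = preferred {
        if let Some(i) = free.iter().position(|&r| r == p) {
            return Some(free.remove(i));
        }
    }
    if !is_leaf {
        // Call-spanning intervals prefer callee-saved (r9+); others prefer
        // caller-saved (r5-r8), which calls will clobber anyway.
        if let Some(i) = free.iter().position(|&r| (r >= 9) == spans_calls) {
            return Some(free.remove(i));
        }
    }
    free.pop() // leaf functions: every register is as good as any other
}
```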

Non-Leaf r5-r8 Allocation and load_operand Reload Bug (Phase 6, 2026-03)

  • Removing the leaf-only restriction for r5-r8: Previously r5/r6 (allocate_scratch_regs) and r7/r8 (allocate_caller_saved_regs) were only available in leaf functions. Phase 6 makes them available in all functions. The existing non-leaf call lowering infrastructure (spill_allocated_regs before calls, clear_reg_cache after calls, lazy reload on next access) handles caller-saved register spill/reload automatically, so no new mechanism was needed.
  • Removing the calls_in_loops gate: Previously, non-leaf functions with calls inside loop bodies were skipped entirely by the register allocator (the theory being that reload traffic outweighs savings). Phase 6 removes this restriction. The lazy spill + per-call-site arity-aware invalidation makes allocation beneficial even with calls in loops, since only registers actually clobbered by a specific call’s arity are invalidated rather than all registers.
  • load_operand reload-into-allocated-register bug: When an allocated register is invalidated (e.g., after a call) and load_operand is asked to reload the value into a different target register (e.g., TEMP1 for a binary operation), the original code would reload into the allocated register first, then copy to the target. This is incorrect when the allocated register is being used for call argument setup – writing to the allocated register corrupts the argument being prepared. The fix: when the allocated register is invalidated and the target register differs, load directly from the stack into the target register, bypassing the allocated register entirely. This prevents corruption during call argument setup sequences where multiple allocated values are being moved into argument registers (r9, r10, etc.).
  • r7/r8 invalidation after calls: The reload_allocated_regs_after_call_with_arity predicate was extended to also invalidate r7/r8 after calls (not just r9-r12), since r7/r8 are now allocatable in non-leaf functions and are always clobbered by call return values.
  • Impact: 79 non-leaf functions now receive allocation in the anan-as compiler (up from 0), bringing the total to 205 out of 210 functions allocated.
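The load_operand fix boils down to a three-way decision. This is pseudocode-as-Rust; the enum and names are invented for the example, not the real emitter API.

```rust
#[derive(Debug, PartialEq)]
enum Reload {
    CopyFromAllocated,   // register still valid: plain MoveReg to the target
    ReloadIntoAllocated, // target IS the allocated register: reload in place
    DirectToTarget,      // invalidated + different target: bypass alloc_reg
}

fn plan_reload(alloc_reg_valid: bool, target_is_alloc_reg: bool) -> Reload {
    match (alloc_reg_valid, target_is_alloc_reg) {
        (true, _) => Reload::CopyFromAllocated,
        (false, true) => Reload::ReloadIntoAllocated,
        // Pre-fix, this case reloaded into the allocated register first and
        // then copied, corrupting call arguments being staged in r9, r10, ...
        (false, false) => Reload::DirectToTarget,
    }
}
```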

Callee-Saved State Preservation After Calls — Investigated, Not Feasible (2026-03)

  • Idea: After direct/indirect calls, preserve alloc_reg_slot entries for callee-saved registers (r9-r12) beyond the call’s argument count, since the callee-save convention guarantees these registers survive calls. This would eliminate LoadIndU64 reloads of callee-saved values after calls.
  • Approaches tried: (1) Selective invalidation in clear_reg_cache — only clear clobbered registers. (2) Snapshot/restore — take alloc_reg_slot snapshot before the call, restore after clear_reg_cache. (3) Guarding operand_reg to exclude restored entries (preventing direct register use in address computations).
  • Root cause of failure: The operand_reg function (load-side coalescing) returns the allocated register directly as a source operand for instructions. When this register is used in memory address computations (e.g., as the base for StoreIndU32), the memory lowering code may use the same register as both source AND destination for the address calculation (adding wasm_memory_base), clobbering the preserved value. Even guarding operand_reg is insufficient — the load_operand path also causes issues through the interaction between the preserved alloc state and the general register cache, producing incorrect address offsets. The fundamental problem is that the emitter’s state model has too many interacting subsystems (alloc_reg_slot, slot_cache, reg_to_slot, operand_reg, result_reg) that make preserving alloc state across calls error-prone.
  • Evidence: Only 1 test failure out of 442 (as-array-push-test), but the failure is deterministic: the generated code uses a wrong base register (r7 instead of slot-296’s value) with a shifted offset (+12) for memory accesses after calls, producing result = 0 instead of 28.
  • Conclusion: Not feasible with the current emitter architecture. Would require a significant refactor of how alloc state interacts with the general cache and memory lowering. The expected 50-80% reduction in post-call reloads is not worth the complexity and correctness risk. See PHASE13_PROMPT.md “Idea 1” for the original design.

Per-Phi Early Expiration Guard — Investigated, Not Feasible (2026-03)

  • Idea: Replace the blanket pressure guard (intervals.len() > allocatable_regs.len() * 2) that disables ALL loop phi early expiration with a per-phi check. Only set early expiration if the incoming back-edge value’s interval starts at or after the phi’s pre_extension_end, ensuring the freed register is taken by the intended value.
  • Approaches tried: (1) Per-phi guard only (no blanket guard) — caused 6+ test failures including timeouts, because even with the incoming-start check, freed registers were stolen by unrelated intervals. (2) Per-phi guard + blanket pressure guard fallback — under high pressure, only allow early expiration for “safe” phis (incoming starts after pre_extension_end). Still caused 9 failures in regalloc-two-loops and regalloc-nested-loops because the “safe” condition is necessary but not sufficient: another interval starting between pre_extension_end and the incoming value can steal the freed register. (3) Per-phi guard under low pressure caused fib(20) gas regression (+19.6%) because the guard is more conservative than the original blanket approach (it disables early expiration for phis whose incoming value is defined within the loop body, before pre_extension_end).
  • Root cause: The early expiration + register reuse mechanism depends on the linear scan’s allocation ORDER, which can’t be predicted during interval computation. Even when the incoming value starts “after” the phi’s expiration, intervening intervals may steal the freed register. The blanket pressure threshold is a crude but effective proxy for this condition.
  • Conclusion: The blanket pressure guard is the right tool for this. A correct per-phi guard would require lookahead into the linear scan’s allocation decisions, which defeats the purpose of computing it upfront. Alternative approaches (preferred_reg hints, register reservation in linear_scan) are possible but add significant complexity for marginal benefit. See PHASE13_PROMPT.md “Idea 2” for the original design.

Non-Leaf r7/r8 Allocation — Investigated, Not Feasible (2026-03)

  • Idea: Allow r7/r8 to be allocated by the linear scan in non-leaf functions. After calls, r7/r8 are invalidated (since they hold return values), but between calls they could hold allocated values, reducing register pressure.
  • Root cause of failure: Same as “Callee-Saved State Preservation After Calls” above. The operand_reg() function (load-side coalescing) returns the allocated register directly as a source operand for memory lowering. When this register is used in address computations (e.g., adding wasm_memory_base), the lowering code may use it as both source and destination, clobbering the value. This is the fundamental operand_reg() hazard: any register that participates in address calculation can be corrupted when the emitter uses in-place arithmetic on the base register.
  • Conclusion: Not feasible without reworking how memory address calculations interact with allocated registers. The operand_reg() function would need to distinguish between “use as data operand” (safe) and “use as address base” (unsafe, may be clobbered by in-place add of wasm_memory_base). This is the same architectural limitation that blocks callee-saved state preservation after calls.
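The hazard itself fits in a few lines. This is an illustrative model with an assumed register file, not the real lowering code: once the lowering adds wasm_memory_base into the base register in place, the allocated value is gone.

```rust
// r0-r12 modeled as an array; `base` holds an allocated value that the
// memory lowering reuses in place as an address base.
fn lower_address_in_place(regs: &mut [u64; 13], base: usize, wasm_memory_base: u64) -> u64 {
    // In-place add: computes the effective address but clobbers the
    // allocated value that lived in regs[base].
    regs[base] = regs[base].wrapping_add(wasm_memory_base);
    regs[base]
}
```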

PVM-in-PVM Execution

The compiler can compile the anan-as PVM interpreter (written in AssemblyScript) to PVM bytecode, then run PVM programs inside this PVM interpreter that is itself running on PVM. This serves as a comprehensive integration test and stress test of the compiler.


Goal

Run PVM programs (trap.jam, add.jam) through the anan-as PVM interpreter that is itself compiled to PVM bytecode and running on PVM.

Pipeline: inner.wat → inner.jam + compiler.wasm → compiler.jam → feed inner.jam as args to compiler.jam → outer anan-as CLI runs it all.

Bugs Found & Fixed

Bug 1: HasMetadata.Yes in anan-as entry point

File: vendor/anan-as/assembly/index-compiler.ts:91

The anan-as compiler entry point was calling:

prepareProgram(InputKind.SPI, HasMetadata.Yes, spiProgram, [], [], [], innerArgs);

With HasMetadata.Yes, the SPI parser first calls extractCodeAndMetadata() which reads a varint-encoded metadata length from the start of the data. Since inner JAM programs don’t have metadata, this read garbage values (e.g., the ro_data_length field), corrupting all subsequent parsing.
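To see why this corrupts everything downstream, consider a standard LEB128-style varint read (a sketch; anan-as's actual varint encoding may differ). Pointed at the start of a metadata-less SPI blob, it dutifully decodes whatever header bytes are there as a "length":

```rust
// Minimal LEB128-style u32 varint decoder: 7 data bits per byte, high bit
// set means "more bytes follow". Returns (value, bytes consumed).
fn read_varint_u32(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value = 0u32;
    for (i, &b) in bytes.iter().take(5).enumerate() {
        value |= u32::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // unterminated varint
}
```

Fed the first bytes of an SPI header instead of an actual metadata length, the decoder returns an arbitrary multi-million value, which is exactly the kind of bogus "Need: N bytes" figure seen in the symptom below.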

Symptom: Native WASM test failed with "Not enough bytes left. Need: 7561472, left: 56377" — the parser was reading the first SPI header bytes as a metadata length.

Fix: Changed to HasMetadata.No and rebuilt the vendor with npm run asbuild:compiler.

Bug 2: Unknown WASM imports compiled to TRAP

File: crates/wasm-pvm/src/llvm_backend/calls.rs:137-138

The wasm-pvm compiler mapped all unknown WASM imports (anything not host_call or pvm_ptr) to PVM TRAP instructions. The anan-as compiler.wasm imports two functions:

  • env.abort — called on unrecoverable AS runtime errors
  • env.console.log — called during normal execution for debug logging

Since console.log is called in the normal success path (confirmed by native WASM test showing console.log: 11952), the TRAP instruction killed the PVM program before it could complete.

Symptom: PVM execution panicked at PC 100640 (a TRAP instruction corresponding to the console.log import call). The outer anan-as interpreter reported "Unhandled host call: ecalli 0".

Fix: Changed unknown imports to be no-ops (silently skip) instead of TRAPs. The abort import specifically remains a TRAP since it indicates unrecoverable errors and should terminate execution.

// Before: all unknown imports → TRAP
e.emit(Instruction::Trap);

// After: only abort → TRAP, others are no-ops
let is_abort = import_name == Some("abort");
if is_abort {
    e.emit(Instruction::Trap);
}

Debugging Journey

  1. Initial state: compiler.jam panicked at PC 150403 after ~95K instructions
  2. First hypothesis (from subagent): Jump table corruption — turned out to be incorrect; the verify-jam tool’s VarU32 decoder has an endianness bug that displayed wrong values
  3. Key insight: Ran compiler.wasm natively with the same args — it also failed! This proved the issue was in the input format, not wasm-pvm compilation
  4. Native error: "Not enough bytes left. Need: 7561472" pointed to SPI parsing reading garbage lengths
  5. Found Bug 1: HasMetadata.Yes → fixed to HasMetadata.No, rebuilt vendor
  6. After fix 1: Native WASM worked perfectly (trap.jam → PANIC, add.jam → result 12), but PVM version still failed with ecalli 0 at PC 100640
  7. Traced PVM execution: Confirmed PC 100640 contains opcode 0x00 (TRAP), which is the compiled console.log import
  8. Confirmed: Native WASM calls console.log during normal execution → in PVM this becomes TRAP → panic
  9. Found Bug 2: Fixed import handling to make non-abort imports no-ops
  10. Both tests pass: trap.jam returns inner PANIC, add.jam returns inner result 12

Performance Notes

PVM-in-PVM tests are inherently slow (~85 seconds each) because:

  • The outer anan-as interpreter executes ~525M PVM instructions
  • Most of this is the inner interpreter’s initialization (AS runtime setup, SPI parsing, memory page allocation)
  • The actual inner program execution is tiny (~46-65K gas)
  • The JS-based anan-as interpreter processes ~6M instructions/second

Tests have 180-second timeouts to accommodate this.
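A back-of-envelope check on those numbers:

```rust
// ~525M instructions at ~6M instructions/second is roughly 87 seconds of
// interpretation, which matches the observed ~85s per test.
fn estimated_runtime_secs(instructions: u64, instructions_per_sec: u64) -> u64 {
    instructions / instructions_per_sec
}
```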

PVM-in-PVM Benchmarks

| Benchmark | JAM Size | Code Size | Outer Gas | Direct Gas | Overhead |
|---|---|---|---|---|---|
| TRAP (interpreter overhead) | 21 B | 1 B | 80,577 | - | - |
| add(5,7) | 201 B | 130 B | 1,238,302 | 393 | 1,751x |
| AS fib(10) | 708 B | 572 B | 1,753,546 | 324 | 5,412x |
| JAM-SDK fib(10)* | 25.4 KB | 16.2 KB | 7,230,603 | 42 | 172,157x |
| Jambrains fib(10)* | 61.1 KB | - | 6,373,683 | 1 | 6,373,683x |
| JADE fib(10)* | 67.3 KB | 45.7 KB | 19,555,955 | 504 | 38,801x |
| aslan-fib accumulate* | 37.1 KB | 17.6 KB | 10,511,413 | 15,968 | 658x |

*These programs exit on unhandled host calls (ecalli). Gas cost reflects parsing/loading plus partial execution up to the first unhandled ecalli.
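The Overhead column is simply the outer interpreter's gas divided by the gas of running the same program directly on PVM. As a sanity check against the AS fib(10) row:

```rust
// Overhead factor = outer (interpreted) gas / direct gas, truncated.
fn overhead_factor(outer_gas: u64, direct_gas: u64) -> u64 {
    outer_gas / direct_gas
}
```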

Regalloc Cross-Block Propagation Journey

A detailed account of implementing cross-block register allocation propagation — including failed approaches, debugging discoveries, and final results.


Issue: #127
Branch: feature/regalloc-cross-block-propagation
Goal: Propagate allocated-register state across block boundaries to avoid unnecessary reloads, especially at loop headers.

Current State (Baseline)

The register allocator assigns loop-carried values to callee-saved registers (r9-r12). The runtime tracking (alloc_reg_slot) is cleared at every block boundary that doesn’t qualify for single-predecessor cross-block cache propagation. This means loop headers (which have 2+ predecessors: preheader + back-edge) always start cold, requiring a reload on first use of each allocated value per loop iteration.

Attempt 1: Blanket alloc_reg_slot persistence (FAILED)

Change: Remove clear_allocated_reg_state() from clear_reg_cache() so alloc_reg_slot is never cleared at block boundaries.

Result: Layers 1-3 (422 tests) pass. PVM-in-PVM fails on as-decoder-subarray-test (2 failures). Direct execution of the same tests passes.

Root cause analysis: Multi-predecessor blocks (merge points) are unsafe because different predecessors may leave allocated registers in different states:

  • Block B has a call → r9 is clobbered at runtime, alloc_reg_slot[r9] = None
  • Block C has no call → alloc_reg_slot[r9] = Some(S) at compile time
  • Block D (successor of both B and C) inherits C’s state (last processed)
  • At runtime via B: r9 holds garbage but compile-time state says Some(S) → skip reload

The write-through argument only holds when NO instruction clobbers the register between the last write-through and the block entry. Calls clobber r9-r12.

Attempt 2: Leaf-function-only + predecessor intersection (IMPLEMENTED)

Key insight: In leaf functions (no calls), allocated registers (r9-r12) are ONLY written by store_to_slot (write-through) and load_operand (reload). Both correctly update alloc_reg_slot. So alloc_reg_slot is ALWAYS accurate in leaf functions.

For non-leaf functions: Use predecessor exit snapshot intersection. At multi-predecessor blocks, only keep alloc_reg_slot entries where ALL processed predecessors agree. For back-edges (unprocessed predecessors), be conservative.
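The intersection rule for merge points can be sketched as follows, assuming the exit snapshots are register→slot maps (the representation is an assumption for illustration):

```rust
use std::collections::HashMap;

// Keep only the register→slot entries that every processed predecessor's
// exit snapshot agrees on; everything else starts cold at the merge point.
fn intersect_snapshots(snapshots: &[HashMap<u8, u32>]) -> HashMap<u8, u32> {
    let Some((first, rest)) = snapshots.split_first() else {
        return HashMap::new();
    };
    first
        .iter()
        .filter(|(reg, slot)| rest.iter().all(|s| s.get(reg) == Some(slot)))
        .map(|(&r, &s)| (r, s))
        .collect()
}
```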

Discovery: Leaf detection was broken (THE MAIN WIN)

Critical finding: ALL functions with memory access were classified as non-leaf because PVM intrinsics (__pvm_load_i32, __pvm_store_i32, etc.) are LLVM Call instructions. These are NOT real function calls — they’re lowered inline using temp registers and never use the calling convention.

Fix: Added is_real_call() to distinguish real calls (wasm_func_*, __pvm_call_indirect) from intrinsics (__pvm_*, llvm.*).
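A hedged reconstruction of that predicate (the name prefixes come from the text above; the exact signature is assumed):

```rust
// Real calls use the calling convention and clobber caller-saved registers;
// __pvm_* intrinsics (except __pvm_call_indirect) and llvm.* intrinsics are
// lowered inline with temp registers and must not flip a function to non-leaf.
fn is_real_call(callee: &str) -> bool {
    callee.starts_with("wasm_func_") || callee == "__pvm_call_indirect"
}
```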

Impact: Significant improvements because leaf functions get smaller stack frames (no callee-save prologue/epilogue):

| Benchmark | Code Change | Gas Change |
|---|---|---|
| AS decoder | -2.9% | -4.0% |
| AS array | -3.2% | -3.7% |
| PiP TRAP | 0 | -3.3% |
| PiP add | 0 | -1.0% |
| PiP Jambrains | 0 | -1.9% |
| is_prime | +0.4% | +2.6% (tiny: +2 gas absolute) |

Attempt: Phi node allocation (REVERTED)

Hypothesis: Phi nodes at loop headers represent loop-carried variables (induction variables, accumulators). Allow them to be register-allocated.

Result: All tests pass, but gas regressions on key benchmarks:

  • is_prime: +6.4% gas
  • AS factorial: +8.2% gas
  • regalloc two loops: +8.8% gas

Root cause: In PVM, all basic instructions cost 1 gas. Write-through adds 1 MoveReg per phi copy per iteration, while the “saved” load merely becomes a MoveReg of identical cost (LoadIndU64 → MoveReg). Net: +1 gas per iteration per allocated phi node, so under PVM’s flat gas model phi node allocation is a net regression.

Learning: Register allocation for phi nodes only makes sense when:

  • Loads are cheaper than stores (not the case in PVM: both cost 1 gas)
  • OR the allocated register can be used directly without MoveReg to temp (not the case: allocated regs are r9-r12, temps are r2-r4)
  • OR code size matters more than gas (MoveReg is 2 bytes vs LoadIndU64’s 5 bytes)
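The accounting from the bullets above, made explicit (assuming the flat 1-gas-per-instruction model the text describes):

```rust
// Per loop iteration, per allocated phi node.
fn phi_alloc_gas_delta() -> i64 {
    let load_saved = 1; // the LoadIndU64 we no longer emit
    let move_to_temp = 1; // but consumers still need MoveReg r9-r12 -> temp
    let phi_copy = 1; // and write-through adds a MoveReg phi copy
    (move_to_temp + phi_copy) - load_saved // net +1 gas
}
```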

Final Results (Leaf Detection + Cross-Block Propagation)

| Benchmark | JAM Size | Code Size | Gas Change |
|---|---|---|---|
| AS decoder | -1.1% | -2.9% | -4.0% |
| AS array | -1.1% | -3.2% | -3.7% |
| anan-as PVM interpreter | -0.6% | -0.8% | - |
| PiP TRAP | 0 | 0 | -3.3% |
| PiP Jambrains | 0 | 0 | -1.9% |
| PiP JADE | 0 | 0 | -0.8% |
| is_prime | +0.3% | +0.4% | +2.6% |

Log

Step 1: Add targeted tests (DONE) — commit e0bfda7

  • regalloc-nested-loops.jam.wat — nested loops with multiple carried values
  • regalloc-loop-with-call.jam.wat — loop calling a function (non-leaf)

Step 2: Blanket alloc_reg_slot persistence (FAILED)

  • PVM-in-PVM: 2 failures in as-decoder-subarray-test
  • Root cause: multi-predecessor blocks with inconsistent predecessor states

Step 3: Leaf-only propagation + predecessor intersection (DONE) — commit e8694cd

  • All 695 tests pass, zero benchmark impact (regalloc rarely activates)

Step 4: Fix leaf detection (DONE) — commit 6960512

  • Distinguish PVM intrinsics from real calls
  • Up to -4% gas, -3.2% code size on real workloads

Step 5: Phi node allocation (REVERTED) — commit 6af12fa → reverted 3445375

  • Gas regression due to write-through MoveReg overhead

Future Opportunities

  1. Direct phi-to-register allocation: Instead of write-through to stack + MoveReg to allocated reg, emit phi copies directly to the allocated register and skip the stack store entirely (DSE would need to remove the dead store). This would make phi allocation gas-neutral and code-size-positive.

  2. Load-from-allocated-register without MoveReg: When the consumer of an allocated value can use r9-r12 directly (instead of requiring TEMP1/TEMP2), avoid the MoveReg. This requires instruction selection awareness of allocated registers.

  3. Non-leaf loop-safe propagation: For non-leaf functions, propagate alloc_reg_slot at loop headers where the loop body has no calls (requires loop-body analysis).

Contributing

Contributions are welcome! This page covers coding conventions, project structure, and where to look for different tasks.

Code Style

  • rustfmt defaults, clippy warnings treated as errors
  • unsafe_code = "deny" at workspace level
  • thiserror for error types, tracing for logging
  • Unit tests inline under #[cfg(test)]

Naming Conventions

  • Types: PascalCase
  • Functions: snake_case
  • Constants: SCREAMING_SNAKE_CASE
  • Indicate WASM vs PVM context in names where relevant

Where to Look

| Task | Location |
|---|---|
| Add WASM operator | crates/wasm-pvm/src/llvm_frontend/function_builder.rs |
| Add PVM lowering (arithmetic) | crates/wasm-pvm/src/llvm_backend/alu.rs |
| Add PVM lowering (memory) | crates/wasm-pvm/src/llvm_backend/memory.rs |
| Add PVM lowering (control flow) | crates/wasm-pvm/src/llvm_backend/control_flow.rs |
| Add PVM lowering (calls) | crates/wasm-pvm/src/llvm_backend/calls.rs |
| Add PVM lowering (intrinsics) | crates/wasm-pvm/src/llvm_backend/intrinsics.rs |
| Modify emitter core | crates/wasm-pvm/src/llvm_backend/emitter.rs |
| Add PVM instruction | crates/wasm-pvm/src/pvm/opcode.rs + crates/wasm-pvm/src/pvm/instruction.rs |
| Modify register allocator | crates/wasm-pvm/src/llvm_backend/regalloc.rs |
| Modify peephole optimizer | crates/wasm-pvm/src/pvm/peephole.rs |
| Fix WASM parsing | crates/wasm-pvm/src/translate/wasm_module.rs |
| Fix compilation pipeline | crates/wasm-pvm/src/translate/mod.rs |
| Fix adapter merge | crates/wasm-pvm/src/translate/adapter_merge.rs |
| Add integration test | tests/layer{1,2,3}/*.test.ts |

Anti-Patterns (Forbidden)

  1. No unsafe code — strictly forbidden by workspace lint
  2. No panics in library code — use Result<> with Error::Internal
  3. No floating point — PVM lacks FP support; reject WASM floats
  4. Don’t break register conventions — hardcoded in multiple files
  5. Don’t change opcode numbers — would break existing JAM files

Building & Testing

See the Getting Started and Testing chapters.

Documentation Policy

After every task or commit, update relevant documentation:

  • AGENTS.md — new modules, build process changes, conventions
  • learnings.md — technical discoveries and debugging insights
  • architecture.md — ABI or calling convention changes
  • internals/ — module-specific implementation details
  • SUMMARY.md — when adding new documentation pages

Testing

The project has a comprehensive multi-layer test suite covering unit tests, integration tests, differential tests, and PVM-in-PVM execution tests.

Quick Reference

# Rust unit tests
cargo test

# Lint
cargo clippy -- -D warnings

# Full integration tests (builds artifacts first)
cd tests && bun run test

# Quick validation (Layer 1 only — requires build first)
cd tests && bun build.ts && bun test layer1/

# PVM-in-PVM tests (requires build first)
cd tests && bun build.ts && bun test layer4/ layer5/ --test-name-pattern "pvm-in-pvm"

# Differential tests (PVM vs native WASM)
cd tests && bun run test:differential

Important: Always use bun run test (not bun test) from the tests/ directory — it runs bun build.ts first to compile fixtures.

Test Layers

| Layer | Tests | Purpose | Speed |
|---|---|---|---|
| Layer 1 | ~50 | Core/smoke tests | Fast — use for development |
| Layer 2 | ~100 | Feature tests | Medium |
| Layer 3 | ~220 | Regression/edge cases | Medium |
| Layer 4 | 3 | PVM-in-PVM smoke tests | Slow (~85s each) |
| Layer 5 | ~270 | Comprehensive PVM-in-PVM | Slow |
| Differential | ~142 | PVM vs native WASM comparison | Medium |

Test Organization

  • Integration tests: tests/layer{1,2,3}/*.test.ts — each file calls defineSuite() with hex args (little-endian)
  • Rust integration tests: crates/wasm-pvm/tests/ — operator coverage, emitter units, stack spill, property tests (true unit tests live inline under #[cfg(test)] in source files)
  • Differential tests: tests/differential/differential.test.ts — verifies PVM output matches Bun’s WebAssembly engine
  • PVM-in-PVM tests: Layers 4-5 — the anan-as PVM interpreter compiled to PVM, running test programs inside
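For reference, the little-endian hex encoding those hex args use, as a standalone sketch (the defineSuite helper itself lives in the TypeScript harness):

```rust
// Encode a u32 as the little-endian hex string the test suites pass as args.
fn le_hex_u32(v: u32) -> String {
    v.to_le_bytes().iter().map(|b| format!("{b:02x}")).collect()
}
```

So an expected result of 12 appears in a test file as "0c000000", least-significant byte first.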

CI Structure

CI runs in stages:

  1. Rust: lint, clippy, unit tests, release build
  2. Integration: layers 1-3
  3. Differential: PVM vs native WASM
  4. PVM-in-PVM: layers 4-5 (only if integration passes)

Fixtures

Test programs live in tests/fixtures/:

  • wat/ — hand-written WAT programs
  • assembly/ — AssemblyScript programs
  • imports/ — import maps (.imports) and adapter files (.adapter.wat)

Build Process

tests/build.ts orchestrates three phases:

  1. Compile AssemblyScript .ts → .wasm (skipped if .wasm exists)
  2. Compile .wat/.wasm → .jam files
  3. Compile anan-as compiler.wasm → compiler.jam (for PVM-in-PVM)

Important: Delete cached WASM files before working on fixtures:

rm -f tests/build/wasm/*.wasm
cd tests && bun build.ts

Benchmarks

Run ./tests/utils/benchmark.sh for performance data. For branch comparisons:

./tests/utils/benchmark.sh --base main --current <branch>

Every PR must include benchmark results in its description.