WASM-PVM: WebAssembly to PolkaVM Recompiler
A Rust compiler that translates WebAssembly (WASM) bytecode into PolkaVM (PVM) bytecode for execution on the JAM (Join-Accumulate Machine) protocol. Write your JAM programs in AssemblyScript (TypeScript-like), hand-written WAT, or any language that compiles to WASM — and run them on PVM.
WASM ──► LLVM IR ──► PVM bytecode ──► JAM program (.jam)
inkwell mem2reg Rust backend
Key Features
- Multi-language input: AssemblyScript, hand-written WAT, or any WASM-targeting language
- LLVM-powered: Uses inkwell (LLVM 18 bindings) for IR generation and optimization
- No unsafe code: deny(unsafe_code) enforced at workspace level
- Toggleable optimizations: Every non-trivial optimization can be individually disabled via CLI flags
- Comprehensive test suite: 800+ tests across unit, integration, differential, and PVM-in-PVM layers
Supported WASM Features
| Category | Operations |
|---|---|
| Arithmetic (i32 & i64) | add, sub, mul, div_u/s, rem_u/s, all comparisons, clz, ctz, popcnt, rotl, rotr, bitwise ops |
| Control flow | block, loop, if/else, br, br_if, br_table, return, unreachable, block results |
| Memory | load/store (all widths), memory.size, memory.grow, memory.fill, memory.copy, globals, data sections |
| Functions | call, call_indirect (with signature validation), recursion, stack overflow detection |
| Type conversions | wrap, extend_s/u, sign extensions (i32/i64 extend8/16/32_s) |
| Imports | Text-based import maps and WAT adapter files |
Not supported: floating point (by design — PVM has no FP instructions).
Project Structure
crates/
wasm-pvm/ # Core compiler library
src/
llvm_frontend/ # WASM → LLVM IR translation
llvm_backend/ # LLVM IR → PVM bytecode lowering
translate/ # Compilation orchestration & SPI assembly
pvm/ # PVM instruction definitions & peephole optimizer
wasm-pvm-cli/ # Command-line interface
tests/ # Integration tests (TypeScript/Bun)
fixtures/
wat/ # WAT test programs
assembly/ # AssemblyScript examples
imports/ # Import maps & adapter files
vendor/
anan-as/ # PVM interpreter (submodule)
Resources
- PVM Debugger: upload .jam files for disassembly, step-by-step execution, and register/gas inspection
- PVM Decompiler: decompile PVM bytecode back to human-readable form
- ananas (anan-as) — PVM interpreter written in AssemblyScript, compiled to PVM itself for PVM-in-PVM execution
- as-lan — example AssemblyScript project compiled from WASM to PVM
- JAM Gray Paper — the JAM protocol specification (PVM is defined in Appendix A)
- AssemblyScript — TypeScript-like language that compiles to WASM
Getting Started
Prerequisites
- Rust (stable, edition 2024)
- LLVM 18 — the compiler uses inkwell (LLVM 18 bindings)
  - macOS: brew install llvm@18 then export LLVM_SYS_181_PREFIX=/opt/homebrew/opt/llvm@18
  - Ubuntu: apt install llvm-18-dev
- Bun (for running integration tests and the JAM runner) — bun.sh
Build
git clone https://github.com/tomusdrw/wasm-pvm.git
cd wasm-pvm
cargo build --release
Hello World: Compile & Run
Create a simple WAT program that adds two numbers:
;; add.wat
(module
(memory 1)
(func (export "main") (param $args_ptr i32) (param $args_len i32) (result i64)
;; Read two i32 args, add them, write result to memory
(i32.store (i32.const 0)
(i32.add
(i32.load (local.get $args_ptr))
(i32.load (i32.add (local.get $args_ptr) (i32.const 4)))))
(i64.const 17179869184))) ;; packed ptr=0, len=4
Compile it to a JAM blob and run it:
# Compile WAT → JAM
cargo run -p wasm-pvm-cli -- compile add.wat -o add.jam
# Run with two u32 arguments: 5 and 7 (little-endian hex)
npx @fluffylabs/anan-as run add.jam 0500000007000000
# Output: 0c000000 (12 in little-endian)
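The argument string given to the runner is the two u32 values laid out back to back in little-endian byte order, then hex-encoded. A minimal Rust sketch of that encoding follows; the helper name encode_u32_args_hex is hypothetical and is not part of wasm-pvm or anan-as.

```rust
/// Hex-encode a list of u32 arguments as consecutive little-endian words.
/// Hypothetical helper for illustration only.
fn encode_u32_args_hex(args: &[u32]) -> String {
    args.iter()
        .flat_map(|a| a.to_le_bytes()) // each u32 becomes 4 bytes, lowest byte first
        .map(|b| format!("{b:02x}"))   // two lowercase hex digits per byte
        .collect()
}
```

Calling encode_u32_args_hex(&[5, 7]) yields the 0500000007000000 string used in the command above.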
Inspect the Output
Upload the resulting .jam file to the PVM Debugger for step-by-step execution, disassembly, register inspection, and gas metering visualization.
AssemblyScript Example
You can also write programs in AssemblyScript:
// fibonacci.ts
export function main(args_ptr: i32, args_len: i32): i64 {
const buf = heap.alloc(256);
let n = load<i32>(args_ptr);
let a: i32 = 0;
let b: i32 = 1;
while (n > 0) {
b = a + b;
a = b - a;
n = n - 1;
}
store<i32>(buf, a);
return (buf as i64) | ((4 as i64) << 32); // packed ptr + len
}
Compile via the AssemblyScript compiler to WASM, then use wasm-pvm-cli to produce a JAM blob. See the tests/fixtures/assembly/ directory for more examples.
Using as a Library
You can use wasm-pvm as a Rust dependency in two modes:
Full compiler (default)
Requires LLVM 18 installed on the system.
[dependencies]
wasm-pvm = "0.5.2"
This gives you access to the full compiler pipeline (compile(), compile_with_options()) plus all PVM types.
PVM types only
No LLVM dependency — compiles to any target including wasm32-unknown-unknown.
[dependencies]
wasm-pvm = { version = "0.5.2", default-features = false }
Available types: Instruction, Opcode, ProgramBlob, SpiProgram, abi::*, memory_layout::*, and Error. This is useful for PVM interpreters, debuggers, and bytecode analyzers that don’t need the WASM compiler.
Entry Function ABI
All entry functions must use the signature main(args_ptr: i32, args_len: i32) -> i64. The i64 return value packs a result pointer (lower 32 bits) and result length (upper 32 bits). The compiler unpacks this into PVM’s SPI convention (r7 = start address, r8 = end address).
For WAT programs, the common “return 4 bytes at address 0” constant is (i64.const 17179869184) (= 4 << 32).
For AssemblyScript, use: return (ptr as i64) | ((len as i64) << 32).
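The packing rule can be checked with a few lines of Rust. This is only a sketch of the arithmetic described above; pack_result and unpack_result are hypothetical names, not functions exported by wasm-pvm.

```rust
/// Pack a result pointer (lower 32 bits) and length (upper 32 bits)
/// into the i64 that the entry function returns.
fn pack_result(ptr: u32, len: u32) -> i64 {
    // Go through u64 so a pointer with its top bit set is not sign-extended.
    ((ptr as u64) | ((len as u64) << 32)) as i64
}

/// Split a packed return value back into (pointer, length).
fn unpack_result(ret: i64) -> (u32, u32) {
    let bits = ret as u64;
    (bits as u32, (bits >> 32) as u32)
}
```

pack_result(0, 4) produces 17179869184, the constant used in the WAT example.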
CLI Usage
# Compile WAT or WASM to JAM
wasm-pvm compile input.wat -o output.jam
wasm-pvm compile input.wasm -o output.jam
# With import resolution
wasm-pvm compile input.wasm -o output.jam \
--imports imports.txt \
--adapter adapter.wat
# Disable specific optimizations
wasm-pvm compile input.wasm -o output.jam --no-inline --no-peephole
# Disable all optimizations
wasm-pvm compile input.wasm -o output.jam \
--no-peephole --no-register-cache \
--no-icmp-fusion --no-shrink-wrap --no-dead-store-elim \
--no-const-prop --no-inline --no-cross-block-cache \
--no-register-alloc --no-fallthrough-jumps
--debug-skip-llvm-passes is not included above: it disables mem2reg and therefore breaks PVM lowering on any non-trivial input. See Diagnostic & Triage Flags.
Optimization Flags
All non-trivial optimizations are enabled by default. Each can be individually disabled:
| Flag | What it controls |
|---|---|
| --no-peephole | Post-codegen peephole optimizer |
| --no-register-cache | Per-block store-load forwarding |
| --no-icmp-fusion | Fuse ICmp+Branch into single PVM branch |
| --no-shrink-wrap | Only save/restore used callee-saved regs |
| --no-dead-store-elim | Remove SP-relative stores never loaded from |
| --no-const-prop | Skip redundant LoadImm when register already holds the constant |
| --no-inline | LLVM function inlining for small callees |
| --no-cross-block-cache | Propagate register cache across single-predecessor block boundaries |
| --no-register-alloc | Linear-scan register allocation for loop values |
| --no-fallthrough-jumps | Skip redundant Jump when target is next block |
See the Optimizations chapter for details on each.
Diagnostic & Triage Flags
These flags affect what the compiler accepts or how it reports failures. They are not optimizations.
| Flag | What it does |
|---|---|
| --trap-floats | Replace every f32/f64 operator with a runtime trap instead of failing compilation. See Trap Floats Mode. |
| --debug-skip-llvm-passes | Debug only. Skip the entire LLVM pass pipeline (including mem2reg). The PVM backend cannot lower the resulting alloca / unpromoted SSA, so non-trivial WASM will fail to compile. Use only to inspect raw frontend IR. |
When compilation fails on an unsupported operator, the error message includes
the function index, the function’s display name (from the WASM name custom
section, falling back to the export name, then wasm_func_<idx>), and the
operator’s byte offset within the function body. Example:
Error: Compilation failed
Caused by:
Unsupported WASM feature: F64Load { memarg: ... } (in function #42 'compute_score' at byte offset 0x1a3)
This makes it possible to grep into the WASM disassembly (wasm-tools dump)
or anan-as source to find the offending site without bisecting the module.
Import Handling
WASM modules that import external functions need those imports resolved before compilation. Two mechanisms are available, and they can be combined.
Import Map (--imports)
A text file mapping import names to simple actions:
# my-imports.txt
abort = trap # emit unreachable (panic)
console.log = nop # do nothing, return zero
Adapter WAT (--adapter)
A WAT module whose exported functions replace matching WASM imports, enabling arbitrary logic for import resolution (pointer conversion, memory reads, host calls). Adapters are function-only overlays — tables, memories, globals, and data sections from the adapter are not merged:
(module
(import "env" "host_call_5" (func $host_call_5 (param i64 i64 i64 i64 i64 i64) (result i64)))
(import "env" "pvm_ptr" (func $pvm_ptr (param i64) (result i64)))
(func (export "console.log") (param i32)
(drop (call $host_call_5
(i64.const 100) ;; ecalli index
(i64.const 3) ;; log level
(i64.const 0) (i64.const 0) ;; target ptr/len
(call $pvm_ptr (i64.extend_i32_u (local.get 0))) ;; message ptr
(i64.extend_i32_u (i32.load offset=0
(i32.sub (local.get 0) (i32.const 4))))))) ;; message len
)
When both --imports and --adapter are provided, the adapter runs first, then the import map handles remaining unresolved imports. All imports must be resolved or compilation fails.
Host Call Imports
A family of typed host_call_N imports (N=0..6) map to PVM ecalli instructions, where N is the number of data registers (r7..r7+N-1) to set. See the ABI & Calling Conventions chapter for the full reference table and examples.
Variants with a b suffix (e.g. host_call_2b) also capture r8 to a stack slot, retrievable via host_call_r8() -> i64.
The pvm_ptr(wasm_addr) -> pvm_addr import converts a WASM-space address to a PVM-space address.
Trap Floats Mode
PVM has no floating-point instructions. By default, the compiler rejects any
f32/f64 operator with a FloatNotSupported or Unsupported(...) error,
making it impossible to compile any WASM module that touches floats — even if
the float code path is never exercised at runtime.
The --trap-floats flag (or CompileOptions::trap_floats = true in the
library API) changes this behavior: every f32/f64 operator is replaced with a
runtime PVM trap instruction. Compilation completes; if execution ever
reaches one of those operators, the program traps deterministically.
When to use it
- Triage: a real-world WASM module fails on its first float op. Use --trap-floats to push past the wall and discover what other unsupported features the module uses (data segments, exotic SIMD ops, etc.). The diagnostic upgrade in the same release prints the failing function and op offset for any remaining errors, so a single re-compile usually pinpoints every blocker.
- Compiling integer-only entry paths in float-heavy modules: if the float code is dead under your inputs (e.g. error-formatting helpers that you’ll never trigger), --trap-floats makes the rest of the module shippable.
When not to use it
- Production builds where any float computation is reachable. The trap is silent at compile time and only fires at runtime. If you’re not certain the float code is dead, you’ll ship a JAM that traps on real input.
- Soft-float emulation. --trap-floats does not emulate IEEE 754 arithmetic. There is currently no plan to add soft-float support; if your module needs working floats, PVM is the wrong target.
How it works
The frontend has a small table mapping each f32/f64 operator to its
(pop_count, push_count) stack effect. When trap_floats is enabled and a
float operator is encountered:
- An @llvm.trap() intrinsic call is emitted, followed by an LLVM unreachable terminator so the basic block is well-formed. The PVM backend’s lower_llvm_intrinsic lowers @llvm.trap() to Instruction::Trap. Crucially we cannot use a bare unreachable here: simplifycfg treats unreachable as undefined behaviour and will fold away conditional branches whose only path leads to it, silently deleting float-only if-arms (see learnings.md “Trap-Floats Lowering” for the investigation that caught this).
- A fresh basic block is created and the IR builder is positioned there. The block has no predecessor edge, so subsequent operators translate into provably-dead code that LLVM’s dce pass removes.
- The translator pops pop_count entries from the operand stack and pushes push_count zero placeholders, keeping the operand stack shape consistent with the WASM validator’s view of the rest of the function.
The translator does not set its unreachable flag. That flag is reserved
for WASM-level dead-code skipping (driven by unreachable/return/br); a
float trap is structurally still “live code” from the WASM operand-stack
perspective — the placeholders flow into subsequent ops normally, even though
LLVM will optimise them away.
This approach handles the tricky corner cases:
- A float op inside one arm of an if traps that arm; the merge block’s phi still receives an incoming edge from the after-trap block (with a placeholder zero), keeping the IR valid.
- A function that returns f64 still produces a function-end phi with at least one incoming branch (the placeholder zero pushed after the trap).
- Calls between functions with float signatures keep working because the i64-uniform calling convention treats every parameter and return value as i64 anyway — both caller and callee just pass placeholders that nobody reads before the trap fires.
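The operand-stack bookkeeping from "How it works" can be modelled in isolation. The sketch below is an assumption-laden toy, not the frontend's code: it lists only a handful of operators, represents each SSA value as a plain i64, and leaves out the @llvm.trap() emission and the fresh basic block. The names float_stack_effect and trap_float_op are hypothetical.

```rust
/// Stack effect of a float operator as (values popped, values pushed).
/// Only a few operators are listed here; the text says the real table
/// covers every MVP float operator.
fn float_stack_effect(op: &str) -> Option<(usize, usize)> {
    match op {
        "f64.const" => Some((0, 1)),
        "f64.load" | "f64.neg" | "f64.sqrt" => Some((1, 1)),
        "f64.add" | "f64.mul" | "f64.lt" => Some((2, 1)),
        "f64.store" => Some((2, 0)),
        _ => None,
    }
}

/// Apply the trap-floats bookkeeping to a model operand stack.
/// Returns false when `op` is not a float operator known to this sketch.
fn trap_float_op(stack: &mut Vec<i64>, op: &str) -> bool {
    let Some((pops, pushes)) = float_stack_effect(op) else {
        return false;
    };
    // A validated module always has enough operands; saturate defensively.
    let keep = stack.len().saturating_sub(pops);
    stack.truncate(keep);
    // Zero placeholders stand in for results that are never really computed.
    stack.extend(std::iter::repeat(0).take(pushes));
    true
}
```

Starting from a stack of [10, 20, 30], an f64.add leaves [10, 0]: two operands are consumed and one placeholder takes their place, so later operators still see the stack depth the validator expects.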
Float operators covered
All MVP f32/f64 operators (≈60 ops) are covered:
- Constants: f32.const, f64.const
- Loads / stores: f{32,64}.{load,store}
- Unary: abs, neg, sqrt, ceil, floor, trunc, nearest
- Binary: add, sub, mul, div, min, max, copysign
- Comparisons: eq, ne, lt, gt, le, ge (return i32)
- Conversions: every variant of i{32,64}.trunc[_sat]_f{32,64}_{s,u}, f{32,64}.convert_i{32,64}_{s,u}, f32.demote_f64, f64.promote_f32, {i,f}{32,64}.reinterpret_{f,i}{32,64}
SIMD float operators (f32x4.*, f64x2.*) are not in this set; modules
using SIMD will still fail with the SIMD operator’s own unsupported error.
Example
# Default: compilation fails on the first float op.
$ wasm-pvm compile runtime.wasm -o runtime.jam
Error: Compilation failed
Caused by:
Unsupported WASM feature: F64Load { memarg: ... } (in function #42 'compute_score' at byte offset 0x1a3)
# With --trap-floats: compiles, traps at runtime if compute_score is called.
$ wasm-pvm compile runtime.wasm -o runtime.jam --trap-floats
wasm-pvm v0.8.0
...
Compiled in 312ms
Compiler Pipeline
The compiler translates WebAssembly to PVM bytecode in five stages:
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Adapter │ │ WASM → │ │ LLVM │ │ LLVM IR │ │ SPI │
│ Merge │────►│ LLVM IR │────►│ Passes │────►│ → PVM │────►│ Assembly │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
(optional) inkwell mem2reg,etc Rust backend JAM blob
Stage 1: Adapter Merge (Optional)
File: crates/wasm-pvm/src/translate/adapter_merge.rs
When a WAT adapter module is provided (--adapter), it is merged into the main WASM binary. Adapter exports replace matching WASM imports, enabling complex import resolution logic (pointer conversion, memory reads, host calls). Uses wasm-encoder to build the merged binary.
Stage 2: WASM → LLVM IR
File: crates/wasm-pvm/src/llvm_frontend/function_builder.rs (~1350 lines)
Each WASM function is translated to LLVM IR using inkwell (LLVM 18 bindings). PVM-specific intrinsics (@__pvm_load_i32, @__pvm_store_i32, etc.) are used for memory operations instead of direct pointer arithmetic, avoiding unsafe GEP/inttoptr patterns.
All values are treated as i64 (matching PVM’s 64-bit registers).
Stage 3: LLVM Optimization Passes
File: crates/wasm-pvm/src/llvm_frontend/function_builder.rs
Three optimization phases run sequentially:
- Pre-inline cleanup: mem2reg (SSA promotion), instcombine, simplifycfg
- Inlining (optional): cgscc(inline) (function inlining for small callees)
- Post-inline cleanup: instcombine<max-iterations=2>, simplifycfg, gvn (redundancy elimination), simplifycfg, dce (dead code removal)
Stage 4: LLVM IR → PVM Bytecode
Files: crates/wasm-pvm/src/llvm_backend/ (7 modules)
A custom Rust backend reads LLVM IR and emits PVM instructions:
| Module | Responsibility |
|---|---|
| emitter.rs | Core emitter, value slot management, register cache |
| alu.rs | Arithmetic, logic, comparisons, conversions, fused bitwise |
| memory.rs | Load/store, memory intrinsics, word-sized bulk ops |
| control_flow.rs | Branches, phi nodes, switch, return |
| calls.rs | Direct/indirect calls, import stubs |
| intrinsics.rs | PVM + LLVM intrinsic lowering |
| regalloc.rs | Linear-scan register allocator |
Key optimizations at this stage:
- Per-block register cache: eliminates redundant loads (~50% gas reduction)
- Cross-block cache propagation: for single-predecessor blocks
- ICmp+Branch fusion: combines compare and branch into one PVM instruction
- Linear-scan register allocation: assigns loop values to callee-saved registers
- Peephole optimizer: fuses immediate chains, eliminates dead stores
Stage 5: SPI Assembly
File: crates/wasm-pvm/src/translate/mod.rs
Packages everything into a JAM/SPI program blob:
- Build entry header (jump to main function, optional secondary entry)
- Build dispatch table (for call_indirect) → ro_data
- Build globals + WASM memory initial data → rw_data (with trailing zero trim)
- Encode PVM program blob (jump table + bytecode + instruction mask)
- Write SPI header (ro_data_len, rw_data_len, heap_pages, stack_size)
ABI & Calling Conventions
Register assignments, calling convention, stack frame layout, memory layout, and the SPI/JAM program format used by the WASM-to-PVM recompiler.
The canonical source for constants lives in crates/wasm-pvm/src/abi.rs and crates/wasm-pvm/src/memory_layout.rs.
Register Assignments
PVM provides 13 general-purpose 64-bit registers (r0–r12). The compiler assigns them as follows:
| Register | Alias | Purpose | Saved by |
|---|---|---|---|
| r0 | ra | Return address (jump table index) | Callee |
| r1 | sp | Stack pointer (grows downward) | Callee |
| r2 | t0 | Temp: load operand 1 / immediates | Caller |
| r3 | t1 | Temp: load operand 2 | Caller |
| r4 | t2 | Temp: ALU result | Caller |
| r5 | s0 | Scratch | Caller |
| r6 | s1 | Scratch | Caller |
| r7 | a0 | Return value / SPI args_ptr | Caller |
| r8 | a1 | SPI args_len / second result | Caller |
| r9 | l0 | Local 0 / param 0 | Callee |
| r10 | l1 | Local 1 / param 1 | Callee |
| r11 | l2 | Local 2 / param 2 | Callee |
| r12 | l3 | Local 3 / param 3 | Callee |
Callee-saved (r0, r1, r9–r12): the callee must preserve these across calls. Caller-saved (r2–r8): the caller must assume these are clobbered by any call.
Stack Frame Layout
Every function allocates a stack frame. The stack grows downward (SP decreases).
Higher addresses
┌─────────────────────────┐
│ caller's frame ... │
old SP → ├─────────────────────────┤
│ Saved r0 (ra) +0 │ 8 bytes
│ Saved r9 (l0) +8 │ 8 bytes
│ Saved r10 (l1) +16 │ 8 bytes
│ Saved r11 (l2) +24 │ 8 bytes
│ Saved r12 (l3) +32 │ 8 bytes
├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤ FRAME_HEADER_SIZE = 40
│ SSA value slot 0 +40 │ 8 bytes
│ SSA value slot 1 +48 │ 8 bytes
│ ... │ 8 bytes per SSA value
new SP → ├─────────────────────────┤
│ (operand spill area) │ SP - 0x100 .. SP
└─────────────────────────┘
Lower addresses
Frame size = FRAME_HEADER_SIZE (40) + num_ssa_values * 8
The operand spill area at SP + OPERAND_SPILL_BASE (i.e. SP - 0x100) is used for
temporary storage during phi-node copies and indirect calls. The frame grows upward
from SP (toward higher addresses), while the spill area is below SP, so the two
regions never overlap regardless of frame size. However, a callee’s frame allocation
must not reach into the caller’s spill area — this is protected by the stack overflow
check which ensures SP - frame_size >= stack_limit.
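The frame arithmetic above fits in a few helpers. This is a sketch under assumptions: the constant names come from the text, but their Rust types, the sign convention for OPERAND_SPILL_BASE, and the helper names frame_size, ssa_slot_offset and frame_fits are illustrative rather than taken from abi.rs.

```rust
/// Bytes reserved for saved r0, r9, r10, r11, r12 (5 registers * 8 bytes).
const FRAME_HEADER_SIZE: i32 = 40;
/// Every SSA value gets one 8-byte slot.
const SLOT_SIZE: i32 = 8;
/// Spill area offset relative to SP (assumed type; the text gives SP - 0x100).
const OPERAND_SPILL_BASE: i32 = -0x100;

/// Total frame size for a function with `num_ssa_values` SSA values.
fn frame_size(num_ssa_values: i32) -> i32 {
    FRAME_HEADER_SIZE + num_ssa_values * SLOT_SIZE
}

/// SP-relative offset of the slot for SSA value number `slot_index`.
fn ssa_slot_offset(slot_index: i32) -> i32 {
    FRAME_HEADER_SIZE + slot_index * SLOT_SIZE
}

/// The stack overflow condition: SP - frame_size >= stack_limit (unsigned).
fn frame_fits(sp: u64, frame_size: u64, stack_limit: u64) -> bool {
    sp.checked_sub(frame_size)
        .map_or(false, |new_sp| new_sp >= stack_limit)
}
```

A function with three SSA values therefore needs a 64-byte frame, with its slots at SP+40, SP+48 and SP+56.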
Stack-Slot Approach with Register Allocation
Every LLVM SSA value gets a dedicated 8-byte stack slot. The baseline instruction sequence is:
- Load operands from stack slots into temp registers (t0, t1)
- Execute ALU operation, result in t2
- Store t2 back to the result’s stack slot
A linear-scan register allocator (regalloc.rs) improves on this when a function
contains loop back-edges; loop-free functions skip allocation entirely. Candidate
intervals are built from use-def live-interval analysis and filtered by a minimum-use
threshold (MIN_USES_FOR_ALLOCATION, currently 3), rather than requiring per-value
“loop-spanning” as the eligibility rule. The allocator assigns eligible values to
available callee-saved registers (r9-r12 when not used for this function’s incoming
parameters). In non-leaf functions, r9+ needed for outgoing call arguments are reserved
from allocation. Call-site clobber handling/reloads are performed by the emitter after
calls, not by explicit call-site invalidation logic inside regalloc itself. Combined
with the register cache, this eliminates most redundant memory traffic.
Per-Block Register Cache (Store-Load Forwarding)
PvmEmitter maintains a per-basic-block register cache (slot_cache: HashMap<i32, u8>,
reg_to_slot: [Option<i32>; 13]) that tracks which stack slot values are currently live
in registers. This eliminates redundant LoadIndU64 instructions:
- Cache hit, same register: Skip entirely (0 instructions emitted)
- Cache hit, different register: Emit AddImm64 dst, cached_reg, 0 (register copy)
- Cache miss: Emit normal LoadIndU64, then record in cache
The cache is invalidated:
- When a register is overwritten (auto-detected via Instruction::dest_reg())
- At block boundaries (define_label() clears the entire cache)
- After function calls (clear_reg_cache() after Fallthrough return points)
- After ecalli host calls (clear_reg_cache() after Ecalli)
Impact: ~50% gas reduction, ~15-40% code size reduction across benchmarks.
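To make the three lookup outcomes concrete, here is a small standalone model of the cache. It reuses the field names slot_cache and reg_to_slot from the text, but everything else (the RegCache and LoadAction types, the method names, and the choice to keep the original register as the cached owner after a copy) is an assumption of this sketch, not PvmEmitter's actual implementation.

```rust
use std::collections::HashMap;

/// What a load of `slot` into register `dst` would emit.
#[derive(Debug, PartialEq)]
enum LoadAction {
    Skip,          // cache hit, same register: nothing emitted
    CopyFrom(u8),  // cache hit, other register: AddImm64 dst, src, 0
    LoadFromStack, // cache miss: LoadIndU64, then record in cache
}

#[derive(Default)]
struct RegCache {
    slot_cache: HashMap<i32, u8>,   // stack slot offset -> register holding it
    reg_to_slot: [Option<i32>; 13], // register -> stack slot it mirrors
}

impl RegCache {
    /// Forget whatever `reg` was mirroring (the register is being overwritten).
    fn invalidate_reg(&mut self, reg: u8) {
        if let Some(slot) = self.reg_to_slot[reg as usize].take() {
            self.slot_cache.remove(&slot);
        }
    }

    /// Record that `reg` now holds the value of `slot`.
    fn record(&mut self, slot: i32, reg: u8) {
        self.invalidate_reg(reg);
        if let Some(old) = self.slot_cache.insert(slot, reg) {
            self.reg_to_slot[old as usize] = None;
        }
        self.reg_to_slot[reg as usize] = Some(slot);
    }

    /// Decide what to emit for loading `slot` into `dst`.
    fn load(&mut self, slot: i32, dst: u8) -> LoadAction {
        match self.slot_cache.get(&slot).copied() {
            Some(reg) if reg == dst => LoadAction::Skip,
            Some(reg) => {
                // dst is overwritten by the copy; the sketch keeps `reg` as owner.
                self.invalidate_reg(dst);
                LoadAction::CopyFrom(reg)
            }
            None => {
                self.record(slot, dst);
                LoadAction::LoadFromStack
            }
        }
    }

    /// Block boundary, call return point, or ecalli: drop everything.
    fn clear(&mut self) {
        *self = Self::default();
    }
}
```

Loading slot 40 into r2 twice in a row costs one LoadIndU64 and then nothing; loading a different slot into r2 afterwards evicts the entry for slot 40.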
Calling Convention
Parameter Passing
| Parameter | Location |
|---|---|
| 1st–4th | r9–r12 |
| 5th+ | param_overflow_base + (i-4)*8 in global memory (dynamic) |
The param-overflow base is computed per-module by compute_param_overflow_base
(see memory_layout.rs). It sits right after the globals/passive-length region,
8-byte aligned. The complementary helper compute_wasm_memory_base returns the
start of WASM linear memory, which lands immediately after the overflow
reservation when one is present. The 256-byte reservation is only emitted when
any module type signature — local function or call_indirect target — has more
than MAX_LOCAL_REGS params, tracked via WasmModule::needs_param_overflow.
For a typical AS program the base lands around 0x30010–0x30020; the old
fixed 0x32000 location is gone.
Return value: r7 (single i64).
Caller Sequence
1. Load arguments into r9–r12 (first 4)
2. Store overflow arguments to PARAM_OVERFLOW_BASE
3. LoadImm64 r0, <return_jump_table_index>
4. Jump <callee_code_offset>
── callee executes ──
5. (fallthrough) Store r7 to result slot if function returns a value
Callee Prologue
1. Stack overflow check (skipped for entry function):
LoadImm64 t1, stack_limit ; unsigned comparison!
AddImm64 t2, sp, -frame_size
BranchGeU t1, t2, continue
Trap ; stack overflow → panic
2. Allocate frame:
AddImm64 sp, sp, -frame_size
3. Save callee-saved registers:
StoreIndU64 [sp+0], r0
StoreIndU64 [sp+8], r9
StoreIndU64 [sp+16], r10
StoreIndU64 [sp+24], r11
StoreIndU64 [sp+32], r12
4. Copy parameters to SSA value slots:
- First 4 from r9–r12
- 5th+ loaded from PARAM_OVERFLOW_BASE
Callee Epilogue (return)
1. Load return value into r7 (if returning a value)
2. Restore callee-saved registers:
LoadIndU64 r9, [sp+8]
LoadIndU64 r10, [sp+16]
LoadIndU64 r11, [sp+24]
LoadIndU64 r12, [sp+32]
3. Restore return address:
LoadIndU64 r0, [sp+0]
4. Deallocate frame:
AddImm64 sp, sp, +frame_size
5. Return:
JumpInd r0, 0
Jump Table & Return Addresses
PVM’s JUMP_IND instruction uses a jump table — it is not a direct address jump:
JUMP_IND rA, offset
target_address = jumpTable[(rA + offset) / 2 - 1]
Return addresses stored in r0 are therefore jump-table indices, not code offsets:
r0 = (jump_table_index + 1) * 2
The jump table is laid out as:
[ return_addr_0, return_addr_1, ..., // for call return sites
func_0_entry, func_1_entry, ... ] // for indirect calls
Each entry is a 4-byte code offset (u32). Jump table entries for call_indirect
encode function entry points used by the dispatch table.
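The two formulas above are inverses of each other, which a short sketch makes easy to verify. The function names are hypothetical, and the model deliberately ignores the special EXIT_ADDRESS value; it only covers values that are meant to name a jump-table entry.

```rust
/// Value stored in r0 for the return site at `jump_table_index`.
fn return_address(jump_table_index: u32) -> u32 {
    (jump_table_index + 1) * 2
}

/// Resolve `JUMP_IND rA, offset` to a code offset using
/// target = jumpTable[(rA + offset) / 2 - 1].
/// Returns None when the value does not name a jump-table entry
/// (zero, odd, or past the end of the table).
fn jump_ind_target(jump_table: &[u32], ra: u32, offset: u32) -> Option<u32> {
    let addr = ra.wrapping_add(offset);
    if addr == 0 || addr % 2 != 0 {
        return None;
    }
    jump_table.get((addr / 2 - 1) as usize).copied()
}
```

With a table of [100, 200, 300], the return address for index 1 is 4, and JUMP_IND on that value lands at code offset 200.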
Indirect Calls (call_indirect)
A dispatch table at RO_DATA_BASE (0x10000) maps WASM table indices to
function entry points:
Dispatch table entry (8 bytes each):
[0–3] Jump address (u32, byte offset → jump table index)
[4–7] Type signature index (u32)
The indirect call sequence:
1. Compute dispatch_addr = RO_DATA_BASE + (table_index << 3)
2. Load type_idx from [dispatch_addr + 4]
3. Compare type_idx with expected_type_idx
4. Trap if mismatch (signature validation)
5. Load jump_addr from [dispatch_addr + 0]
6. LoadImmJumpInd jump_addr, r0, <return_jump_table_index>, 0
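Steps 1 to 5 can be mirrored on the host side against raw ro_data bytes. The sketch below assumes little-endian u32 fields and treats an out-of-range table index as an error; neither detail is spelled out above. resolve_indirect and dispatch_entry_addr are hypothetical names, and an Err here stands in for the Trap that the generated code would execute.

```rust
const RO_DATA_BASE: u32 = 0x10000;

/// PVM address of the dispatch entry for `table_index` (8 bytes per entry).
fn dispatch_entry_addr(table_index: u32) -> u32 {
    RO_DATA_BASE + (table_index << 3)
}

/// Look up the jump address for `table_index`, validating the signature.
fn resolve_indirect(
    ro_data: &[u8],
    table_index: u32,
    expected_type_idx: u32,
) -> Result<u32, &'static str> {
    let start = table_index as usize * 8;
    let entry = ro_data
        .get(start..start + 8)
        .ok_or("table index out of range")?;
    // Bytes 0..4: jump address, bytes 4..8: type signature index (assumed LE).
    let jump_addr = u32::from_le_bytes([entry[0], entry[1], entry[2], entry[3]]);
    let type_idx = u32::from_le_bytes([entry[4], entry[5], entry[6], entry[7]]);
    if type_idx != expected_type_idx {
        return Err("signature mismatch");
    }
    Ok(jump_addr)
}
```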
Import Calls
host_call_N(ecalli_index, r7, ..., r7+N-1) -> i64 → ecalli
A family of typed host call imports where N (0–6) indicates the number of data
arguments loaded into r7–r12. The first argument must be a compile-time constant
(the ecalli index). All variants return r7 as an i64.
| Import | Params | Registers set |
|---|---|---|
| host_call_0 | (i64) | none |
| host_call_1 | (i64 i64) | r7 |
| host_call_2 | (i64 i64 i64) | r7-r8 |
| host_call_3 | (i64 i64 i64 i64) | r7-r9 |
| host_call_4 | (i64 i64 i64 i64 i64) | r7-r10 |
| host_call_5 | (i64 i64 i64 i64 i64 i64) | r7-r11 |
| host_call_6 | (i64 i64 i64 i64 i64 i64 i64) | r7-r12 |
Example — JIP-1 log call with 5 register args:
(import "env" "host_call_5" (func $host_call_5 (param i64 i64 i64 i64 i64 i64) (result i64)))
(import "env" "pvm_ptr" (func $pvm_ptr (param i64) (result i64)))
;; ecalli 100 = log; r7=level, r8=target_ptr, r9=target_len, r10=msg_ptr, r11=msg_len
(drop (call $host_call_5
(i64.const 100) ;; ecalli index
(i64.const 3) ;; r7: log level
(call $pvm_ptr (i64.const 0)) ;; r8: target PVM pointer
(i64.const 8) ;; r9: target length
(call $pvm_ptr (i64.const 8)) ;; r10: message PVM pointer
(i64.const 15))) ;; r11: message length
host_call_Nb — two-register output variants
Same as host_call_N but also captures r8 after the ecalli to a dedicated
stack slot (R8_CAPTURE_SLOT_OFFSET relative to SP). Use the companion import
host_call_r8() -> i64 (no arguments) to retrieve the captured value. The
host_call_r8 call must be in the same function as the preceding host_call_Nb.
All *b variants (host_call_0b through host_call_6b) are supported.
Example:
(import "env" "host_call_2b" (func $host_call_2b (param i64 i64 i64) (result i64)))
(import "env" "host_call_r8" (func $host_call_r8 (result i64)))
;; Call ecalli 10, passing r7=100 and r8=200.
;; Store r7 return value, then retrieve r8.
(local $r7 i64)
(local $r8 i64)
(local.set $r7 (call $host_call_2b (i64.const 10) (i64.const 100) (i64.const 200)))
(local.set $r8 (call $host_call_r8))
pvm_ptr(wasm_addr) -> pvm_addr
Converts a WASM-space address to a PVM-space address by zero-extending to 64 bits
and adding wasm_memory_base.
Other imports
The abort import emits Trap (unrecoverable error). All other unresolved
imports cause a compilation error — they must be resolved via --imports or
--adapter before compilation succeeds.
Memory Layout
PVM Address Space:
0x00000 - 0x0FFFF Reserved / guard (fault on access)
0x10000 - 0x1FFFF Read-only data (RO_DATA_BASE) — dispatch tables
0x20000 - 0x2FFFF Gap zone (unmapped, guard between RO and RW)
0x30000 Mem-size slot (4 bytes, only when memory.size/grow/init used)
0x30000 / 0x30004+ User globals (per-global width: 4 B for i32/f32, 8 B for i64/f64,
packed in declaration order; offset by 4 when mem-size slot present)
after globals Passive data segment length slots (4 bytes each)
after lengths Parameter overflow area (256 bytes, 8-byte aligned, only when any module type signature has >`MAX_LOCAL_REGS` params — covers both local functions and `call_indirect` targets)
region_end WASM linear memory (sits immediately after last region — no 4KB alignment)
... (unmapped gap until stack)
0xFEFE0000 STACK_SEGMENT_END (initial SP)
0xFEFF0000 Arguments segment (input data, read-only)
0xFFFF0000 EXIT_ADDRESS (jump here → HALT)
Key formulas (see memory_layout.rs):
- Memory-size slot: 0x30000. Stable position, independent of num_globals. Emitted only when the module uses memory.size / memory.grow / memory.init.
- Global address: precomputed at parse time as WasmModule::global_offsets[idx]. Each user global occupies global_storage_width(type) bytes (4 B for i32/f32, 8 B for i64/f64), packed in declaration order with no inter-global padding. (global i64 ...) round-trips through LoadU64/StoreU64 without truncation; (global i32 ...) keeps its 4-byte slot and uses LoadU32/StoreU32. The LLVM frontend declares each global with its matching int type (i32/i64) and zext/truncs at global.get/global.set so the i64 WASM stack representation stays uniform.
- Passive segment length slot: 0x30000 + (has_mem_size ? 4 : 0) + sum(global_widths) + ordinal * 4 (lengths remain 4 bytes; they’re effective sizes, never i64).
- WASM memory base: compute_wasm_memory_base(num_globals, num_passive_segments, has_mem_size_global, needs_param_overflow). Sits immediately after the last present region with no 4KB alignment. anan-as page-aligns the rw_data tail (heapZerosStart = heapStart + alignToPageSize(rwLength)) separately, so the base can land at any byte offset. When every region is empty (no globals, no mem-size, no passive, no overflow), the base collapses to GLOBAL_MEMORY_BASE itself.
- Stack limit: 0xFEFE0000 - stack_size
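The global and passive-length formulas can be worked through with a small model. The GlobalType enum, the helper names other than global_storage_width, and the decision to return absolute addresses are assumptions of this sketch; the real values live in WasmModule::global_offsets and memory_layout.rs and may be represented differently there.

```rust
const GLOBAL_MEMORY_BASE: u32 = 0x30000;

#[derive(Clone, Copy)]
enum GlobalType {
    I32,
    I64,
    F32,
    F64,
}

/// Storage width in bytes: 4 for i32/f32, 8 for i64/f64.
fn global_storage_width(ty: GlobalType) -> u32 {
    match ty {
        GlobalType::I32 | GlobalType::F32 => 4,
        GlobalType::I64 | GlobalType::F64 => 8,
    }
}

/// First byte available to user globals (after the optional mem-size slot).
fn globals_start(has_mem_size: bool) -> u32 {
    GLOBAL_MEMORY_BASE + if has_mem_size { 4 } else { 0 }
}

/// Address of each global, packed in declaration order with no padding.
fn global_addresses(globals: &[GlobalType], has_mem_size: bool) -> Vec<u32> {
    let mut next = globals_start(has_mem_size);
    globals
        .iter()
        .map(|&ty| {
            let addr = next;
            next += global_storage_width(ty);
            addr
        })
        .collect()
}

/// Address of the 4-byte length slot for passive segment number `ordinal`.
fn passive_length_slot(globals: &[GlobalType], has_mem_size: bool, ordinal: u32) -> u32 {
    let widths: u32 = globals.iter().map(|&ty| global_storage_width(ty)).sum();
    globals_start(has_mem_size) + widths + ordinal * 4
}
```

For globals declared as i32, i64, i32 with a mem-size slot present, the addresses come out as 0x30004, 0x30008 and 0x30010, and the first passive length slot sits at 0x30014.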
RW data layout
SPI rw_data is a contiguous dump of every byte from GLOBAL_MEMORY_BASE up to the last initialized byte of the WASM heap. The loader copies this region to 0x30000 as a single block, so the blob carries no sparse encoding and no per-segment offsets. Because wasm_memory_base is placed tightly after the globals window (no 4KB alignment), the data-segment bytes start within a few bytes of rw_data[0], which eliminates the 4KB structural-padding page that the previous layout required for every memory-using program.
build_rw_data() trims trailing zero bytes before SPI encoding. Heap pages are zero-initialized, so the omitted trailing zeros are semantically equivalent.
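The trimming step itself is tiny. The following is a standalone sketch of the idea, not the body of build_rw_data(); the function name trim_trailing_zeros is hypothetical.

```rust
/// Return `data` without its trailing run of zero bytes.
fn trim_trailing_zeros(data: &[u8]) -> &[u8] {
    // Index of the last non-zero byte, if any, decides the new length.
    let len = data.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
    &data[..len]
}
```

Zeros in the middle of the region are kept, since only the tail can be reconstructed from zero-initialized heap pages.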
Entry Function (SPI Convention)
The entry function is special — it follows SPI conventions rather than the normal calling convention.
Initial register state (set by the PVM runtime):
| Register | Value | Purpose |
|---|---|---|
| r0 | 0xFFFF0000 | EXIT address — jump here to HALT |
| r1 | 0xFEFE0000 | Stack pointer (STACK_SEGMENT_END) |
| r7 | 0xFEFF0000 | Arguments pointer (PVM address) |
| r8 | args.length | Arguments length in bytes |
| r2–r6, r9–r12 | 0 | Available |
Entry prologue differences from a normal function:
- No stack overflow check (main function starts with full stack)
- Allocates frame and stores SSA slots
- No callee-saved register saves (no caller to return to)
- Adjusts args_ptr: r7 = r7 - wasm_memory_base (convert PVM address to WASM address)
- Stores r7 and r8 to parameter slots
Entry return — unified packed i64 convention:
The entry function must return a single i64 value encoding a pointer and length:
- Lower 32 bits = WASM pointer to result data
- Upper 32 bits = result length in bytes
- PVM output: r7 = (ret & 0xFFFFFFFF) + wasm_memory_base, r8 = r7 + (ret >> 32)
All entry functions end by jumping to EXIT_ADDRESS (0xFFFF0000).
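The address conversions at entry and exit are plain arithmetic. Here is a sketch with hypothetical function names; the wasm_memory_base used in the usage note is just an example value in the range mentioned for typical AssemblyScript programs.

```rust
/// Entry prologue: convert the PVM arguments pointer in r7 to a WASM address.
fn entry_args_ptr(r7: u64, wasm_memory_base: u64) -> u64 {
    r7 - wasm_memory_base
}

/// Entry return: derive the SPI output registers (r7 = start, r8 = end)
/// from the packed i64 returned by the WASM entry function.
fn entry_return_registers(ret: i64, wasm_memory_base: u64) -> (u64, u64) {
    let bits = ret as u64;
    let r7 = (bits & 0xFFFF_FFFF) + wasm_memory_base; // start address in PVM space
    let r8 = r7 + (bits >> 32);                       // end address = start + length
    (r7, r8)
}
```

With a base of 0x30010, the packed value 17179869184 (pointer 0, length 4) becomes r7 = 0x30010 and r8 = 0x30014.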
Start Function
If a WASM start function exists, the entry function calls it before processing arguments. r7/r8 are saved to the stack, the start function is called (no arguments), then r7/r8 are restored.
SPI/JAM Program Format
The compiled output is a JAM file in the SPI (Standard Program Interface) format:
Offset Size Field
────── ────── ─────────────────────
0 3 ro_data_len (u24 LE)
3 3 rw_data_len (u24 LE)
6 2 heap_pages (u16 LE)
8 3 stack_size (u24 LE)
11 N ro_data (dispatch table)
11+N M rw_data (globals + WASM memory initial data)
11+N+M 4 code_len (u32 LE)
15+N+M K code (PVM program blob)
heap_pages is computed from the WASM module’s initial_pages (not max_pages).
It represents the number of 4KB PVM pages pre-allocated as zero-initialized writable memory
at program start. Additional memory beyond this is allocated on demand via sbrk/memory.grow.
Programs declaring (memory 0) get a minimum of 16 WASM pages (1MB) to accommodate
AssemblyScript runtime memory accesses.
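The field table above translates directly into a byte-level encoder. The following is a minimal sketch, assuming the layout exactly as listed; `encode_spi` and `u24_le` are hypothetical names and are not claimed to match the functions in spi.rs.

```rust
/// Encode `value` as a 3-byte little-endian integer (u24 LE).
fn u24_le(value: u32) -> [u8; 3] {
    assert!(value < (1 << 24), "value does not fit in 24 bits");
    let b = value.to_le_bytes();
    [b[0], b[1], b[2]]
}

/// Assemble the SPI container: 11-byte header, ro_data, rw_data,
/// then a u32 LE code length followed by the PVM code blob.
fn encode_spi(
    ro_data: &[u8],
    rw_data: &[u8],
    heap_pages: u16,
    stack_size: u32,
    code: &[u8],
) -> Vec<u8> {
    let mut out = Vec::with_capacity(15 + ro_data.len() + rw_data.len() + code.len());
    out.extend_from_slice(&u24_le(ro_data.len() as u32)); // offset 0
    out.extend_from_slice(&u24_le(rw_data.len() as u32)); // offset 3
    out.extend_from_slice(&heap_pages.to_le_bytes());     // offset 6
    out.extend_from_slice(&u24_le(stack_size));           // offset 8
    out.extend_from_slice(ro_data);                       // offset 11
    out.extend_from_slice(rw_data);                       // offset 11+N
    out.extend_from_slice(&(code.len() as u32).to_le_bytes()); // offset 11+N+M
    out.extend_from_slice(code);                          // offset 15+N+M
    out
}
```

The total size is always 15 + N + M + K bytes, which is a quick sanity check when inspecting a .jam file by hand.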
PVM Code Blob
Inside the code section, the PVM blob format is:
- jump_table_len (varint u32)
- item_len (u8, always 4)
- code_len (varint u32)
- jump_table (4 bytes per entry, code offsets)
- instructions (PVM bytecode)
- mask (bit-packed instruction start markers)
Entry Header
The first 10 bytes of code are the entry header:
[0–4] Jump <main_function_offset> (5 bytes)
[5–9] Jump <secondary_entry_offset> (5 bytes, or Trap + padding)
The secondary entry is for future use (e.g. is_authorized). If unused, it emits
Trap followed by 4 Fallthrough instructions as padding.
Phi Node Handling
Phi nodes (SSA merge points) use a two-pass approach to avoid clobbering:
- Load pass: Load all incoming phi values into temp registers (t0, t1, t2, s0, s1)
- Store pass: Store all temps to their destination phi result slots
This supports up to 5 simultaneous phi values. The two-pass design prevents cycles where storing one phi value would overwrite a source needed by another phi.
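A small simulation shows why the two passes matter. This is a sketch over a plain array of stack slots, with the hypothetical name `apply_phi_copies`; it models the semantics only, not the emitted PVM instructions.

```rust
/// Apply phi copies `(dst_slot, src_slot)` with parallel-copy semantics.
fn apply_phi_copies(slots: &mut [u64], copies: &[(usize, usize)]) {
    // Five temps are available according to the text (t0, t1, t2, s0, s1).
    const MAX_TEMPS: usize = 5;
    assert!(copies.len() <= MAX_TEMPS, "more phis than temp registers");
    let mut temps = [0u64; MAX_TEMPS];

    // Load pass: read every source before any destination is written.
    for (i, &(_, src)) in copies.iter().enumerate() {
        temps[i] = slots[src];
    }
    // Store pass: write all destinations from the temps.
    for (i, &(dst, _)) in copies.iter().enumerate() {
        slots[dst] = temps[i];
    }
}
```

With the copies (0 ← 1) and (1 ← 0), a naive one-at-a-time copy would duplicate one value; the two-pass version swaps them correctly.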
Design Trade-offs
| Decision | Rationale |
|---|---|
| Stack-slot for every SSA value | Correctness-first baseline; linear-scan register allocator (for loop-containing functions) assigns high-use values to available callee-saved regs (r9-r12 when not used for this function’s incoming parameters), and per-block register cache eliminates most remaining redundant loads |
| Spill area below SP | Frame grows up from SP, spill area grows down — no overlap |
| Fixed-address overflow region (computed per-module) | Avoids stack frame complexity for overflow params; reserved only when some signature needs it (see needs_param_overflow) |
| Jump-table indices as return addresses | Required by PVM’s JUMP_IND semantics |
| Entry function has no stack check | Starts with full stack, nothing to overflow into |
| Unsigned stack limit comparison | LoadImm64 avoids sign-extension bugs with large addresses |
| unsafe forbidden | Workspace-level deny(unsafe_code) lint |
References
- crates/wasm-pvm/src/abi.rs: Register and frame constants
- crates/wasm-pvm/src/memory_layout.rs: Memory address constants
- crates/wasm-pvm/src/llvm_backend/emitter.rs: PvmEmitter and value management
- crates/wasm-pvm/src/llvm_backend/calls.rs: Calling convention implementation
- crates/wasm-pvm/src/llvm_backend/control_flow.rs: Prologue/epilogue/return
- crates/wasm-pvm/src/spi.rs: JAM/SPI format encoder
- Technical Reference: Technical reference and debugging journal
- Gray Paper: JAM/PVM specification
Translation Module
The translation module orchestrates the end-to-end WASM → LLVM IR → PVM lowering and assembles the final SPI/JAM output.
Source: crates/wasm-pvm/src/translate/
Files
| File | Role |
|---|---|
| mod.rs | Pipeline dispatch, SPI assembly, entry header + data sections |
| wasm_module.rs | WASM section parsing into WasmModule |
| memory_layout.rs | Memory address constants and helper functions |
Pipeline
- Parse module sections in wasm_module.rs (WasmModule::parse()).
- Translate WASM operators to LLVM IR in llvm_frontend/function_builder.rs.
- Run LLVM optimization pipeline (mem2reg, instcombine, simplifycfg, optional inlining, cleanup passes).
- Lower LLVM IR to PVM instructions in llvm_backend/mod.rs.
- Build SPI sections in mod.rs:
  - Entry header and dispatch tables
  - ro_data (jump table refs + passive data)
  - rw_data (globals + active data segments), with trailing zero trim
  - Encoded PVM blob + metadata
Key Behaviors
- calculate_heap_pages() uses WASM initial_pages (not max), with a minimum of 16 WASM pages for (memory 0).
- compute_wasm_memory_base() lays out (in order) the (optional) mem-size slot at GLOBAL_MEMORY_BASE, user globals, passive segment lengths, and (optionally) the 256-byte parameter overflow area, then places wasm_memory_base immediately after. No 4KB alignment is applied: anan-as page-aligns the rw_data tail (heapZerosStart) separately, so the base may sit at any byte offset. Mem-size is emitted only when the module uses memory.size / memory.grow / memory.init; overflow (tracked by needs_param_overflow) is emitted only when any module type signature has more than MAX_LOCAL_REGS (4) parameters. This covers both local function declarations and call_indirect target types.
- build_rw_data() copies globals and active segments into a contiguous image, then trims trailing zero bytes before SPI encoding.
- Call return addresses are pre-assigned as jump-table refs ((idx + 1) * 2) at emission time; fixup resolution accepts direct (LoadImmJump) and indirect (LoadImm / LoadImmJumpInd) return-address carriers.
- Entry resolution prefers canonical export names (main, main2) over aliases (refine*, accumulate*) regardless of export order.
- Entry exports (main / main2 and aliases) must target local (non-imported) functions; imported targets are rejected during parse with Error::Internal to avoid index-underflow panics.
- WASM name custom section (subsection 1, function names) is parsed into local_function_names: Vec<Option<String>>. WasmModule::local_function_display_name(local_idx) returns the name-section entry, falling back to the export name, then wasm_func_<global_idx>. Used by the function-body translator to wrap operator-dispatch errors in Error::Located { func_idx, func_name, op_offset, source }, the diagnostic surface for unsupported features. Errors emitted later in the pipeline (LLVM-to-PVM lowering, adapter merge) do not get this wrapping; they fire after the WASM byte offset has been lost.
Current Memory Layout
| Address | Purpose |
|---|---|
| 0x10000 | Read-only data |
| 0x30000 | Mem-size slot (4 bytes, only when memory.size/grow/init used), then user globals (per-global width: 4 B for i32/f32, 8 B for i64/f64; see docs/src/learnings.md “Global Storage Width”; addresses precomputed at parse time as WasmModule::global_offsets), passive segment length slots (4 bytes each), and (when any type signature has >4 params) a 256-byte parameter overflow area. Total size = align_up_8(globals_region_size(...)) + 256 when overflow is reserved (the overflow base is 8-byte aligned; see compute_param_overflow_base), else just globals_region_size(...). |
| region_end | WASM linear memory, placed without 4KB alignment immediately after the last region. For a module that only declares memory and never uses memory.size/grow/init, wasm_memory_base collapses to 0x30000. A memory-op-using program with zero user globals, no passive segments, and no overflow lands at 0x30004. A program that also needs overflow (e.g. a 5+ param call_indirect target) lands at 0x30108. |
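The three example addresses in the table can be reproduced with a short calculation. This is a sketch of the sizing rule as described, not the body of compute_wasm_memory_base(); the parameter names and the `PARAM_OVERFLOW_SIZE` constant are assumptions made for illustration.

```rust
const GLOBAL_MEMORY_BASE: u32 = 0x30000;
/// Hypothetical constant name for the 256-byte parameter overflow area.
const PARAM_OVERFLOW_SIZE: u32 = 256;

fn align_up_8(n: u32) -> u32 {
    (n + 7) & !7
}

/// Compute where WASM linear memory starts, following the layout order:
/// mem-size slot, user globals, passive segment lengths, overflow area.
fn wasm_memory_base(
    uses_mem_size_slot: bool,
    globals_bytes: u32,
    passive_segments: u32,
    needs_param_overflow: bool,
) -> u32 {
    let mem_size_slot = if uses_mem_size_slot { 4 } else { 0 };
    let region = mem_size_slot + globals_bytes + 4 * passive_segments;
    let total = if needs_param_overflow {
        // The overflow base is 8-byte aligned, then 256 bytes follow.
        align_up_8(region) + PARAM_OVERFLOW_SIZE
    } else {
        region
    };
    GLOBAL_MEMORY_BASE + total
}
```

With only the mem-size slot the region is 4 bytes (0x30004); adding overflow rounds 4 up to 8 and appends 256 bytes, giving 0x30108.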
Anti-Patterns
- Don’t change layout constants without validating pvm-in-pvm tests.
- Don’t bypass Result error handling with panics in library code.
- Don’t assume rw_data must include trailing zero bytes.
PVM Instruction Module
PolkaVM instruction definitions, opcodes, encoding/decoding, and the peephole optimizer.
Source: crates/wasm-pvm/src/pvm/
Files
| File | Lines | Role |
|---|---|---|
| instruction.rs | ~700 | Instruction enum, encoding/decoding logic |
| opcode.rs | ~130 | Opcode constants (~100 opcodes) |
| blob.rs | 143 | Program blob format with jump table |
| peephole.rs | ~400 | Post-codegen peephole optimizer (Fallthroughs, truncation NOPs, dead stores, immediate chain fusion, self-move elimination) |
Key Patterns
Instruction Encoding
pub enum Instruction {
    Add32 { dst: u8, src1: u8, src2: u8 },
    LoadIndU32 { dst: u8, base: u8, offset: i32 },
    MoveReg { dst: u8, src: u8 },
    BranchLtUImm { reg: u8, value: i32, offset: i32 },
    BranchEq { reg1: u8, reg2: u8, offset: i32 },
    CmovIzImm { dst: u8, cond: u8, value: i32 }, // TwoRegOneImm encoding
    StoreImmU32 { address: i32, value: i32 }, // TwoImm encoding
    StoreImmIndU32 { base: u8, offset: i32, value: i32 }, // OneRegTwoImm encoding
    AndImm { dst: u8, src: u8, value: i32 },
    ShloLImm32 { dst: u8, src: u8, value: i32 },
    NegAddImm32 { dst: u8, src: u8, value: i32 },
    SetGtUImm { dst: u8, src: u8, value: i32 },
    // ... ~100 variants total
}
Encoding Helpers
- encode_three_reg(opcode, dst, src1, src2) - ALU ops (3 regs)
- encode_two_reg(opcode, dst, src) - Moves/conversions (2 regs)
- encode_two_reg_one_imm(opcode, dst, src, value) - ALU immediate ops (2 regs + imm)
- encode_two_imm(opcode, imm1, imm2) - TwoImm format (StoreImm*)
- encode_one_reg_one_imm_one_off(opcode, reg, imm, offset) - Branch-immediate ops
- encode_one_reg_two_imm(opcode, base, offset, value) - Store immediate indirect
- encode_two_reg_one_off(opcode, reg1, reg2, offset) - Branch-register ops
- encode_two_reg_two_imm(opcode, reg1, reg2, imm1, imm2) - Compound indirect jump (LoadImmJumpInd)
- encode_imm(value) - Variable-length signed immediate (0-4 bytes)
- encode_uimm(value) - Variable-length unsigned immediate (0-4 bytes)
- encode_var_u32(value) - LEB128-style variable int
Decoding Helpers
- Instruction::decode(bytes) dispatches by opcode and returns (instruction, consumed_bytes)
- Opcode::from_u8 / Opcode::try_from provide explicit opcode-byte to enum conversion
- decode_imm_signed / decode_imm_unsigned handle 0-4 byte immediate expansion
- decode_offset_at reads fixed 4-byte branch/jump offsets
- For formats where the trailing immediate has no explicit length (OneImm, OneRegOneImm, TwoRegOneImm, TwoImm, OneRegTwoImm, TwoRegTwoImm), decode consumes the remaining bytes as that immediate
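The 0-4 byte signed immediate can be illustrated with a round-trip sketch. The helper names match those listed above, but the bodies are assumptions: they implement one plausible rule (shortest little-endian prefix whose sign extension reproduces the value) and may differ in detail from instruction.rs.

```rust
/// Sign-extend 0..=4 little-endian bytes to an i32. Zero bytes decode to 0.
fn decode_imm_signed(bytes: &[u8]) -> i32 {
    assert!(bytes.len() <= 4, "immediates are at most 4 bytes");
    if bytes.is_empty() {
        return 0;
    }
    // Fill the missing high bytes with the sign of the last encoded byte.
    let negative = bytes[bytes.len() - 1] & 0x80 != 0;
    let mut buf = if negative { [0xFF; 4] } else { [0x00; 4] };
    buf[..bytes.len()].copy_from_slice(bytes);
    i32::from_le_bytes(buf)
}

/// Emit the shortest prefix of the little-endian bytes that still
/// decodes back to `value`.
fn encode_imm(value: i32) -> Vec<u8> {
    let bytes = value.to_le_bytes();
    for n in 0..4 {
        if decode_imm_signed(&bytes[..n]) == value {
            return bytes[..n].to_vec();
        }
    }
    bytes.to_vec()
}
```

Under this rule 0 takes no bytes, -1 takes one byte (0xFF), and 128 needs two bytes because a lone 0x80 would sign-extend to -128.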
Terminating Instructions
Instructions that end a basic block:
pub fn is_terminating(&self) -> bool {
    matches!(self,
        Trap | Fallthrough | Jump {..} | LoadImmJump {..} | JumpInd {..} | LoadImmJumpInd {..} |
        BranchNeImm {..} | BranchEqImm {..} | ...)
}
Destination Register Query
Used by the register cache in emitter.rs to auto-invalidate stale cache entries:
pub fn dest_reg(&self) -> Option<u8> {
    // Returns Some(reg) for instructions that write to a register
    // Returns None for stores, branches, traps, ecalli
}
Peephole Notes
- Dead-code elimination runs only when a function has no labels (single-block code). Multi-block functions skip DCE to avoid incorrect liveness across control flow.
- DCE must track side-effects for all store variants: StoreIndU8/U16/U32/U64, StoreImmIndU8/U16/U32/U64, StoreImmU8/U16/U32/U64, StoreU8/U16/U32/U64
- DCE must track memory loads (can-trap, track dst) for all load variants: LoadIndU8/I8/U16/I16/U32/I32/U64, LoadU8/I8/U16/I16/U32/I32/U64
- Address-folding for AddImm* chains is width-aware: AddImm32 relations only fold into later AddImm32, and AddImm64 relations only fold into later AddImm64 (no cross-width fusion).
Where to Look
| Task | Location |
|---|---|
| Add new PVM instruction | opcode.rs (add enum variant) + instruction.rs (encoding + decoding) |
| Change instruction encoding | instruction.rs:impl Instruction |
| Check opcode exists | opcode.rs (~100 opcodes defined) |
| Build program blob | blob.rs:ProgramBlob::with_jump_table() |
| Variable int encoding | blob.rs:encode_var_u32() |
Branch Operand Convention (Important!)
Two-register branch instructions use reversed operand order:
Branch_op { reg1: a, reg2: b } branches when reg2 op reg1 (i.e., b op a).
For example, BranchLtU { reg1: 3, reg2: 2 } branches when reg[2] < reg[3], NOT reg[3] < reg[2].
This matches the PVM spec where branch_lt_u(rA, rB) branches when ω_rB < ω_rA.
In the binary encoding, reg1 = high nibble (rA), reg2 = low nibble (rB).
Immediate-form branches are straightforward: BranchLtUImm { reg, value } branches when reg < value.
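The reversed operand order is easy to get wrong, so here is a tiny model of it. The struct below is a hypothetical stand-in for the real Instruction variant (it omits the offset), and the 13-register file reflects r0-r12.

```rust
/// Hypothetical mini-model of the two-register BranchLtU instruction.
struct BranchLtU {
    reg1: u8,
    reg2: u8,
}

impl BranchLtU {
    /// Taken when reg[reg2] < reg[reg1] (unsigned): the second operand
    /// is the left-hand side of the comparison.
    fn is_taken(&self, regs: &[u64; 13]) -> bool {
        regs[self.reg2 as usize] < regs[self.reg1 as usize]
    }

    /// Register byte of the encoding: reg1 in the high nibble (rA),
    /// reg2 in the low nibble (rB).
    fn reg_byte(&self) -> u8 {
        (self.reg1 << 4) | (self.reg2 & 0x0F)
    }
}
```

With r2 = 1 and r3 = 5, BranchLtU { reg1: 3, reg2: 2 } is taken because 1 < 5, while swapping the operands gives a branch that is not taken.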
Anti-Patterns
- Don’t change opcode numbers - Would break existing JAM files
- Preserve register field order - (dst, src1, src2) convention
- Keep encoding compact - Variable-length immediates save space
Testing
Unit tests in same files under #[cfg(test)]:
- instruction.rs: Tests encoding and decode(encode) roundtrip coverage for all variants
- blob.rs: Tests mask packing, varint encoding
Gray Paper Reference
See gp-0.7.2.md Appendix A for PVM spec:
- Gas costs per instruction (ϱ∆)
- Semantics for each opcode
- This module implements the encoding, not semantics
Optimizations
All non-trivial optimizations can be individually toggled via OptimizationFlags (in translate/mod.rs, re-exported from lib.rs). Each defaults to enabled; CLI exposes --no-* flags.
LLVM Pass Pipeline
Four phases run on every compile. The whole pipeline is gated by the llvm_passes flag (CLI --debug-skip-llvm-passes); the inlining and mergefunc phases also have individual toggles.
- mem2reg, instcombine, simplifycfg (pre-inline cleanup)
- cgscc(inline) (optional, see --no-inline)
- instcombine<max-iterations=20>, simplifycfg, gvn, simplifycfg, dce
- mergefunc (optional, see --no-mergefunc)
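The gating of the four phases can be sketched as a function that returns one pipeline string per phase. The struct `PassFlags`, its `inline` and `mergefunc` fields, and `pipeline_phases` are hypothetical names; the sketch assumes each returned string is handed to a separate pass-manager invocation and does not call LLVM itself.

```rust
/// Hypothetical subset of the optimization flags that affect the LLVM pipeline.
struct PassFlags {
    llvm_passes: bool,
    inline: bool,
    mergefunc: bool,
}

/// Return the pipeline strings to run, in order, for the given flags.
fn pipeline_phases(flags: &PassFlags) -> Vec<&'static str> {
    // The whole pipeline is gated by llvm_passes (debug-only escape hatch).
    if !flags.llvm_passes {
        return Vec::new();
    }
    let mut phases = vec!["mem2reg,instcombine,simplifycfg"];
    if flags.inline {
        phases.push("cgscc(inline)");
    }
    phases.push("instcombine<max-iterations=20>,simplifycfg,gvn,simplifycfg,dce");
    if flags.mergefunc {
        // Deliberately last, so thunks are not inlined back into callers.
        phases.push("mergefunc");
    }
    phases
}
```

Keeping the phases as separate strings reflects the constraint that CGSCC passes and function passes are not mixed in one pipeline string.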
--debug-skip-llvm-passes (debug only)
Not a tunable optimization. This flag skips the entire pipeline above, including mem2reg. The PVM backend cannot lower alloca / unpromoted SSA — every input non-trivial enough to use locals (i.e. virtually every real WASM module) fails with:
Error: Unsupported WASM feature: LLVM opcode Alloca (in function #N during PVM lowering)
Per the experiments/opt_impact.sh sweep, 31 of 31 representative inputs (fixture WATs, AS-built WASM, polkadot runtimes) fail to compile with this flag set. Use only to inspect the raw frontend IR (--verbose / dumps) before any optimization runs. Do not include it in --no-opt bundles or treat it as comparable to --no-peephole, --no-register-cache, etc.
Function Inlining (--no-inline)
LLVM CGSCC inline pass for small callees. After inlining, instcombine may introduce new LLVM intrinsics (llvm.abs, llvm.smax, etc.) that the backend must handle.
Function Merging (--no-mergefunc)
LLVM’s mergefunc pass, run as Phase 4 after the function-level cleanup. Two behaviors:
- Aliasing: when two functions have byte-identical bodies and their linkage permits, one becomes an alias of the other and only one PVM body survives.
- Thunking: when functions are “weakly identical” (same shape, parameterizable differences), the pass factors a canonical body and emits thunks (call canonical; ret) for the originals.
Targets rustc monomorphizations — quicksort instantiated for several comparator types, scale_info::TypeInfo::type_info instantiated for many newtype wrappers. Their bodies share opcode shape but differ in inner call targets; the thunk parameterization handles this.
Must run after inlining. If cgscc(inline) ran after mergefunc, the thunks (very small bodies) would inline back into every caller and undo the merge. No trailing dce because the thunks are reachable from their callers — dce would drop nothing and only cost compile time.
Net effect on tiny functions can be negative because each thunked call costs ~5 bytes of call setup, which exceeds the saved body for very short functions. The polkadot wins come from large monomorphized helpers where the body dwarfs the call overhead.
Impact (polkadot fellowship v2.2.2, --trap-floats):
| Runtime | WASM | Baseline code | With mergefunc | Δ |
|---|---|---|---|---|
| glutton-kusama | 2.04 MiB | 4,636,361 B | 4,600,277 B | −0.78% |
| kusama | 8.43 MiB | 17,965,423 B | 17,832,758 B | −0.74% |
Saving scales roughly linearly with binary size. Compile-time impact: negligible (~+150 ms on glutton; within noise on kusama).
Peephole Optimizer (--no-peephole)
Post-codegen patterns in pvm/peephole.rs:
- Fallthrough elimination: remove redundant Fallthrough before jump/branch
- Truncation NOP removal: [32-bit-producer] → AddImm32(x,x,0) eliminated
- Dead store elimination: SP-relative stores never loaded from are removed
- Immediate chain fusion: LoadImm + AddImm → single LoadImm; chained AddImm → fused
- Self-move elimination: MoveReg r, r removed
- Address calculation folding: AddImm offsets folded into subsequent load/store offsets
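Two of the patterns above, self-move elimination and same-width AddImm chain fusion, can be sketched on a reduced instruction set. The `Inst` enum and `peephole` function are illustrative stand-ins, not the code in pvm/peephole.rs, and the sketch assumes no label sits between the two fused instructions.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    MoveReg { dst: u8, src: u8 },
    AddImm32 { dst: u8, src: u8, value: i32 },
    AddImm64 { dst: u8, src: u8, value: i32 },
    Other,
}

fn peephole(input: &[Inst]) -> Vec<Inst> {
    let mut out: Vec<Inst> = Vec::with_capacity(input.len());
    for inst in input {
        // Self-move elimination: `MoveReg r, r` does nothing.
        if let Inst::MoveReg { dst, src } = inst {
            if dst == src {
                continue;
            }
        }
        // Chain fusion: `r += a; r += b` becomes `r += a + b`, but only
        // when both adds have the same width and the sum fits in i32.
        let fused = match (out.last(), inst) {
            (
                Some(Inst::AddImm32 { dst: d1, src: s1, value: a }),
                Inst::AddImm32 { dst: d2, src: s2, value: b },
            ) if d1 == s1 && d2 == s2 && d1 == d2 => a
                .checked_add(*b)
                .map(|value| Inst::AddImm32 { dst: *d1, src: *d1, value }),
            (
                Some(Inst::AddImm64 { dst: d1, src: s1, value: a }),
                Inst::AddImm64 { dst: d2, src: s2, value: b },
            ) if d1 == s1 && d2 == s2 && d1 == d2 => a
                .checked_add(*b)
                .map(|value| Inst::AddImm64 { dst: *d1, src: *d1, value }),
            _ => None,
        };
        match fused {
            Some(f) => *out.last_mut().expect("fusion implies a previous instruction") = f,
            None => out.push(inst.clone()),
        }
    }
    out
}
```

An AddImm32 followed by an AddImm64 on the same register is left alone, since the 32-bit form truncates and sign-extends while the 64-bit form does not.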
Register Cache (--no-register-cache)
Per-basic-block store-load forwarding. Tracks which stack slots are live in registers:
- Cache hit, same register: skip entirely (0 instructions)
- Cache hit, different register: emit register copy (1 instruction)
- Cache miss: normal load + record in cache
Impact: ~50% gas reduction, ~15-40% code size reduction.
Invalidated at block boundaries, after function calls, and after ecalli.
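The three cache outcomes can be modelled with a small map from stack slot to register. This is a simplified sketch: `RegCache`, `Emitted`, and the method names are hypothetical, and the real emitter tracks more state than a single map.

```rust
use std::collections::HashMap;

/// What the emitter would output for one operand load.
#[derive(Debug, PartialEq)]
enum Emitted {
    Nothing,
    Move { dst: u8, src: u8 },
    Load { dst: u8, slot: i32 },
}

#[derive(Default)]
struct RegCache {
    slot_to_reg: HashMap<i32, u8>,
}

impl RegCache {
    /// Request the value of `slot` in register `dst`.
    fn load(&mut self, slot: i32, dst: u8) -> Emitted {
        match self.slot_to_reg.get(&slot).copied() {
            // Hit, same register: nothing to emit.
            Some(reg) if reg == dst => Emitted::Nothing,
            // Hit, different register: one register copy.
            Some(reg) => {
                self.invalidate_reg(dst);
                Emitted::Move { dst, src: reg }
            }
            // Miss: real load, then remember where the slot lives.
            None => {
                self.invalidate_reg(dst);
                self.slot_to_reg.insert(slot, dst);
                Emitted::Load { dst, slot }
            }
        }
    }

    /// Drop entries that claimed `reg` holds some slot (it was overwritten).
    fn invalidate_reg(&mut self, reg: u8) {
        self.slot_to_reg.retain(|_, r| *r != reg);
    }

    /// Forget everything, as at a block boundary or after a call.
    fn clear(&mut self) {
        self.slot_to_reg.clear();
    }
}
```

Calling clear() corresponds to the invalidation points listed above; after it, the next access to any slot is a miss again.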
Cross-Block Cache (--no-cross-block-cache)
When a block has exactly one predecessor and no phi nodes, the predecessor’s cache snapshot is propagated instead of clearing. The snapshot is taken before the terminator instruction.
ICmp+Branch Fusion (--no-icmp-fusion)
Combines an LLVM icmp + br pair into a single PVM branch instruction (e.g., BranchLtU), saving one instruction per conditional branch.
Shrink Wrapping (--no-shrink-wrap)
For non-entry functions, only callee-saved registers (r9-r12) that are actually used are saved/restored in prologue/epilogue. Reduces frame header size from fixed 40 bytes to 8 + 8 * num_used_callee_regs.
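The frame header formula is simple enough to check directly. The function name below is hypothetical; it only restates the arithmetic from the paragraph above.

```rust
/// Frame header size in bytes when only the used callee-saved
/// registers (out of r9-r12) are saved.
fn frame_header_size(num_used_callee_regs: u32) -> u32 {
    assert!(num_used_callee_regs <= 4, "only r9-r12 are callee-saved");
    8 + 8 * num_used_callee_regs
}
```

With all four registers in use the result is the original fixed 40 bytes; a function that uses none of them needs only 8.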
Dead Store Elimination (--no-dead-store-elim)
Removes StoreIndU64 instructions to SP-relative offsets that are never loaded from. Runs as part of the peephole optimizer.
Constant Propagation (--no-const-prop)
Skips LoadImm/LoadImm64 when the target register already holds the required constant value.
Register Allocation (--no-register-alloc)
Linear-scan allocator assigns SSA values to physical registers, reducing LoadIndU64 memory traffic. Allocates in all functions (looped and straight-line, leaf and non-leaf). Eviction uses a spill-weight model (use_count × 10^loop_depth) to keep loop-hot values in registers. In non-leaf functions, the existing call lowering (spill_allocated_regs + clear_reg_cache + lazy reload) handles spill/reload around calls automatically, and per-call-site arity-aware invalidation only clobbers registers used by each specific call. See the Register Allocation chapter for details.
Aggressive Register Allocation (--no-aggressive-regalloc)
Lowers the minimum-use threshold for register allocation candidates from 2 to 1, capturing more values when a register is free. Enabled by default.
Scratch Register Allocation (--no-scratch-reg-alloc)
Adds r5/r6 (abi::SCRATCH1/SCRATCH2) to the allocatable set in all functions that don’t clobber them (no bulk memory ops, no funnel shifts). Per-function LLVM IR scan detects clobbering operations. In non-leaf functions, r5/r6 are spilled before calls via spill_allocated_regs and lazily reloaded on next access. Doubles allocation capacity in the common case (e.g., 2-param function: 2 → 4 allocatable regs).
Caller-Saved Register Allocation (--no-caller-saved-alloc)
Adds r7/r8 (RETURN_VALUE_REG/ARGS_LEN_REG) to the allocatable set in leaf functions. These registers are idle after the prologue and are never clobbered by calls in leaf functions. In non-leaf functions, r7/r8 are not allocated because every call clobbers r7 (return value) and r8 (scratch), making the constant invalidation/reload overhead a net negative. Combined with r5/r6, gives up to 4 extra registers (r5, r6, r7, r8) beyond callee-saved r9-r12 in leaf functions. The full register convention: r0=return address, r1=SP, r2-r4=temps, r5-r6=scratch, r7=return value/args ptr, r8=args len, r9-r12=callee-saved locals.
Fallthrough Jump Elimination (--no-fallthrough-jumps)
Two coupled steps that elide trailing Jump instructions when the jump target is the next block in emission order:
- Block layout reorder. compute_block_layout in llvm_backend/mod.rs constructs the per-function emission order via greedy trace: from each unplaced block, walk preferred-successor links (uncond br dest → dest; cond br cond, then, else → else, since lower_br emits BranchIfX then; Jump else_label; switch → default). Iterate the original IR order to pick trace starts. The resulting layout is shared with the register allocator so live intervals are computed against the order the emitter actually executes; regalloc::run accepts the layout as the block_order parameter for that reason.
- Jump elision. When emit_jump_to_label is invoked with the next block in layout already known (next_block_label), the Jump is dropped; define_label emits a Fallthrough marker (1 byte) instead.
Trampoline paths in lower_br / lower_switch (used when phi copies are needed on every outgoing edge) emit a final Jump to a different target than the layout’s preferred-next. Such blocks miss the fallthrough but remain correct.
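The greedy trace can be sketched over block indices. This is a simplified model that takes the preferred successor of each block as input; the function shares its name with compute_block_layout for readability but is not the real implementation, and `elidable_jumps` is a hypothetical helper for counting the benefit.

```rust
/// Greedy trace layout. `preferred_succ[b]` is the block that `b` would
/// like to fall through to, if any. Trace starts follow original IR order.
fn compute_block_layout(preferred_succ: &[Option<usize>]) -> Vec<usize> {
    let n = preferred_succ.len();
    let mut placed = vec![false; n];
    let mut layout = Vec::with_capacity(n);
    for start in 0..n {
        // Follow preferred-successor links until reaching a placed block.
        let mut cur = Some(start);
        while let Some(b) = cur {
            if placed[b] {
                break;
            }
            placed[b] = true;
            layout.push(b);
            cur = preferred_succ[b];
        }
    }
    layout
}

/// Count blocks whose trailing Jump targets the next block in the layout.
fn elidable_jumps(layout: &[usize], preferred_succ: &[Option<usize>]) -> usize {
    layout
        .windows(2)
        .filter(|w| preferred_succ[w[0]] == Some(w[1]))
        .count()
}
```

For four blocks with preferences 0 → 2, 1 → 3, 2 → 1, the trace from block 0 yields the order 0, 2, 1, 3, in which every preferred edge is a fallthrough; the original order 0, 1, 2, 3 would allow none.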
Libcall Recognition (--no-libcall-recognition)
Replaces the body of recognized compiler-builtins runtime functions with hand-crafted PVM-friendly implementations. WASM has no i128 type, so rustc for wasm32-unknown-unknown lowers every (a as u128) * b, a / b and similar to calls into runtime helpers (__multi3, __udivti3). Those helpers carry their full Knuth-style bodies into the WASM (~30 IR instructions for __multi3, ~800+ for __udivti3 + specialized_div_rem); when we recognize them by name we can swap the body for something that uses PVM’s native opcodes directly.
Recognition is name-based, by matching the function’s WASM custom name section entry against a fixed table:
| Name | Replacement |
|---|---|
| __multi3 | 8 PVM instructions: Mul64 + MulUpperUU + 2×Mul64 + 2×Add64 + 2×StoreIndU64 |
| __udivti3 | Fast/slow dispatch on (a_hi \| b_hi) == 0: fast path is DivU64 + 2 stores; slow path forwards to the original specialized_div_rem (compiler-builtins) via the same stack-frame setup as the original wrapper |
Each recognition also checks the signature (5 i64 params, no return — the C sret convention) so a user function that happens to share a name isn’t silently mis-translated. For __udivti3, the body is also scanned to extract the slow-path callee and the __stack_pointer global; without both we silently no-op.
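The arithmetic behind both replacements can be checked against Rust's native u128. This is a sketch of the math only: the functions return tuples instead of writing through an sret pointer, the instruction comments show one plausible mapping rather than the emitted sequence, and the slow path simply uses native u128 division in place of specialized_div_rem.

```rust
/// Upper 64 bits of the unsigned 64x64 product (the role of MulUpperUU).
fn mul_upper_uu(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) >> 64) as u64
}

/// Wrapping 128-bit multiply from 64-bit halves. Returns (lo, hi).
fn multi3(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
    let lo = a_lo.wrapping_mul(b_lo); // Mul64
    let carry = mul_upper_uu(a_lo, b_lo); // MulUpperUU
    let cross1 = a_lo.wrapping_mul(b_hi); // Mul64
    let cross2 = a_hi.wrapping_mul(b_lo); // Mul64
    let hi = carry.wrapping_add(cross1).wrapping_add(cross2); // 2x Add64
    (lo, hi) // the real body stores both halves (2x StoreIndU64)
}

/// Unsigned 128-bit divide with the fast/slow dispatch. Returns (lo, hi).
/// Like any Rust division, this panics when the divisor is zero.
fn udivti3(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
    if (a_hi | b_hi) == 0 {
        // Fast path: both operands fit in 64 bits.
        (a_lo / b_lo, 0)
    } else {
        // Slow path stand-in for the original helper.
        let a = ((a_hi as u128) << 64) | a_lo as u128;
        let b = ((b_hi as u128) << 64) | b_lo as u128;
        let q = a / b;
        (q as u64, (q >> 64) as u64)
    }
}
```

The a_hi × b_hi term is absent from multi3 because it only contributes to bits 128 and above, which a wrapping 128-bit product discards.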
Impact (microbenchmark, 1000 iterations of the underlying operation):
| Workload | With | Without | Δ Gas | Δ Size |
|---|---|---|---|---|
| u128 mul | 75,029 | 119,029 | −37% | −170 B |
| u128 div (fast path, a_hi = b_hi = 0) | 76,029 | 129,029 | −41% | +110 B |
| u128 div (slow path, b_hi != 0) | 143,029 | 129,029 | +11% | +110 B |
The __udivti3 fast path is the b_hi specialization win: when callers pass i64 0 for the high halves (the dominant shape in substrate’s Perbill / currency arithmetic), it becomes a 5-PVM-instruction inline divide. The 11% slow-path regression is the cost of the dispatch (Or + ICmp + Branch) — accepted because real workloads are dominated by the fast path.
Limitations (documented in crates/wasm-pvm/src/llvm_frontend/libcall_recognition.rs):
- Stripping the WASM name custom section silently disables recognition (no correctness impact).
- Aggressive inlining (--inline-threshold > body size) inlines the libcall everywhere; recognition still applies to the standalone function, but the inlined call sites still run the slow original. A separate IR pattern matcher would be needed to catch those, which is explicitly out of scope.
- A user function literally named __multi3 with the exact 5-i64-param signature would be silently replaced. Mitigation: signature gate + the names are reserved by the C/Rust ABI.
Lazy Spill (--no-lazy-spill)
Eliminates write-through stack stores for register-allocated values. When a value is stored to a slot that has an allocated register, the value goes only into the register (marked “dirty”) and the StoreIndU64 to the stack is skipped. Values are flushed to the stack only when required:
- When the register is about to be clobbered by another instruction (auto-spill in invalidate_reg)
- Before function calls and ecalli (via spill_allocated_regs())
- Before the function epilogue (return)
- Before terminators at block boundaries
- After prologue parameter stores
With register-aware phi resolution, phi copies between blocks use direct register-to-register moves when both the incoming value and the phi destination are in allocated registers, avoiding stack round-trips. The target block restores alloc_reg_slot for phi destinations after define_label, so subsequent reads use the register directly. For mixed cases (some values allocated, some not), a two-pass approach loads all incoming values into temp registers, then stores to destinations (registers or stack). This handles all dependency cases including cycles without needing a separate parallel move resolver.
Requires register_allocation to be effective.
The sections below are codegen-only optimizations: no individual flag, always active when register_allocation is enabled. Implementation in llvm_backend/emitter.rs and llvm_backend/regalloc.rs.
Store-Side Coalescing
result_reg() / result_reg_or(fallback) in emitter.rs return a value’s allocated register so ALU / memory-load / intrinsic lowering writes the result there directly, eliminating the MoveReg from TEMP_RESULT that store_to_slot would otherwise emit. The _or(TEMP1) variant is used by zext/sext/trunc to preserve TEMP1-based cache behavior in the non-allocated path.
Not coalesced (TEMP_RESULT live across control flow, or load corrupts cache for subsequent operand loads): lower_select, emit_pvm_memory_grow, lower_abs.
Impact (anan-as compiler): store_moves 2720 → 1262 (−54%), instructions 37,225 → 35,744 (−4%), JAM 169,853 → 164,902 B (−2.9%).
Load-Side Coalescing
operand_reg() returns a value’s allocated register when it currently holds the right slot, so lowering uses it directly as the instr’s source operand instead of going through load_operand() + temp copy. Applied across binary arith (incl. immediate-folding), comparisons, zext/sext/trunc, load/store addresses and values, branch conditions, fused ICmp+Branch, switch values, min/max, bswap, ctlz/cttz/ctpop, rotations, and lower_select Cmov operands.
Not coalesced: div/rem (trap code clobbers SCRATCH1), non-rotation funnel shifts (use SCRATCH1/2 after spill), lower_abs, call argument setup, phi resolution.
Dst-conflict safety (apply_dst_conflict_fallback): when an operand’s allocated register matches the dst, fall back to the temp register to avoid invalidate_reg hazards. Exception: dst == TEMP_RESULT keeps the alias (PVM reads both srcs before writing dst), eliminating MoveReg r4 → r2 chains. bitreverse keeps the conservative fallback (clobbers TEMP_RESULT mid-sequence to materialize i64 masks).
Impact of dst==TEMP_RESULT relaxation alone on polkadot/glutton-kusama: MoveReg −61% (70,141 → 27,155), PVM instructions −4%, JAM −1.97%.
Spill Weight Refinement
effective_weight = base_weight − num_spanning_calls × 2.0. Live ranges that cross real call instructions get a 2.0 penalty per spanning call (representing the spill+reload pair), pushing the allocator toward values that don’t cross call boundaries. Call positions collected during linearization via is_real_call(), counted via binary search.
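The weight formula and the binary-search call count can be sketched together. The function names are hypothetical, and the sketch assumes a call "spans" an interval when its position lies strictly between the interval's start and end; the real allocator may draw the boundary differently.

```rust
/// Base spill weight: use_count x 10^loop_depth.
fn base_weight(use_count: u32, loop_depth: u32) -> f64 {
    use_count as f64 * 10f64.powi(loop_depth as i32)
}

/// Number of call positions strictly inside (start, end).
/// `call_positions` must be sorted ascending.
fn spanning_calls(call_positions: &[u32], start: u32, end: u32) -> usize {
    // partition_point performs the binary search on the sorted slice.
    let lo = call_positions.partition_point(|&p| p <= start);
    let hi = call_positions.partition_point(|&p| p < end);
    hi.saturating_sub(lo)
}

/// effective_weight = base_weight - num_spanning_calls x 2.0
fn effective_weight(
    use_count: u32,
    loop_depth: u32,
    call_positions: &[u32],
    start: u32,
    end: u32,
) -> f64 {
    let calls = spanning_calls(call_positions, start, end) as f64;
    base_weight(use_count, loop_depth) - calls * 2.0
}
```

A value used three times at loop depth 2 has base weight 300, so two spanning calls barely matter; the same value outside any loop drops from 3 to -1 and becomes a likely eviction candidate.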
Call Return Value Coalescing
LiveInterval.preferred_reg hints r7 (RETURN_VALUE_REG) for values defined by real calls — the return value is already in r7, so picking r7 (when free) eliminates the post-call MoveReg. Best-effort; if r7 is taken, a different register is used.
Loop Phi Early Interval Expiration
Loop phi destination intervals expire at their actual last use (before loop extension), freeing the register early so the incoming back-edge value can take it via the free pool. When both share the register, the phi copy becomes a no-op (emit_phi_copies_regaware skips it when incoming_reg == phi_reg AND is_alloc_reg_valid confirms the register still holds the incoming value). store_to_slot spills dirty values before overwriting alloc_reg_slot with a different slot.
A blanket pressure guard (intervals > 2× registers) disables this under register pressure, preventing freed registers from being stolen by unrelated values. Per-phi guards are unworkable — see learnings.md “Per-Phi Early Expiration Guard”.
Impact: fib(20) −15.7% gas / −7.2% code, factorial −5.6% gas.
Cross-Block Alloc State Propagation
At block boundaries with unprocessed predecessors (back-edges at loop headers), the dominator predecessor’s alloc_reg_slot is propagated instead of cleared. Filtered per register class to stay correct:
- Non-leaf: only callee-saved beyond max_call_args (r5–r8 may be invalidated after calls on other paths).
- Leaf + lazy spill: all registers (no calls to clobber them).
- Multi-predecessor blocks (both flavors): intersection logic keeps entries all processed predecessors agree on.
pred_map is built when has_regalloc && (!is_leaf || lazy_spill_enabled); set_alloc_reg_slot_filtered() applies the per-class filter.
Impact: fib(20) −5.1% gas, factorial(10) −7.1%, is_prime(25) −4.6%, PiP aslan-fib −0.52%.
Callee-Saved Preference for Call-Spanning Intervals
In non-leaf functions, the linear scan prefers callee-saved (r9–r12 beyond max_call_args) for intervals that span real calls (these survive calls without invalidation) and caller-saved (r5–r8) for intervals that don’t. LiveInterval.spans_calls set during interval construction; linear_scan() reads is_leaf and picks accordingly. preferred_reg hints (e.g. r7 for call returns) take priority. Leaf functions use default pop() order (no calls = no preference needed).
Impact: anan-as PVM interpreter −0.2% code (106,820 → 106,577 B). Primarily helps non-leaf functions with call-spanning values.
Adding a New Optimization
- Add a field to OptimizationFlags in translate/mod.rs
- Thread it through LoweringContext → EmitterConfig
- Guard the optimization code with e.config.<flag>
- Add a --no-* CLI flag in wasm-pvm-cli/src/main.rs
Benchmarks
All optimizations enabled (default):
| Benchmark | WASM size | JAM size | Code size | Gas Used |
|---|---|---|---|---|
| add(5,7) | 68 B | 164 B | 99 B | 28 |
| fib(20) | 110 B | 226 B | 148 B | 409 |
| factorial(10) | 102 B | 198 B | 124 B | 156 |
| is_prime(25) | 162 B | 285 B | 201 B | 62 |
| AS fib(10) | 235 B | 631 B | 504 B | 245 |
| AS factorial(7) | 234 B | 616 B | 490 B | 207 |
| AS gcd(2017,200) | 229 B | 640 B | 517 B | 174 |
| AS decoder | 1.5 KB | 6.6 KB | 4,944 B | 953 |
| AS array | 1.4 KB | 6.1 KB | 4,427 B | 820 |
| regalloc two loops | 252 B | 587 B | 461 B | 16,769 |
| host-call-log | 171 B | 458 B | 104 B | 40 |
| aslan-fib accumulate | - | 20.7 KB | 13,365 B | 11,474 |
| blake2b(“abc”, 32) | 1.1 KB | 3.8 KB | 2,558 B | 17,930 |
| sha512(“abc”) | 1.7 KB | 3.7 KB | 2,559 B | 17,981 |
| anan-as PVM interpreter | 53.4 KB | 115.6 KB | 84,281 B | - |
Register Allocation
The compiler uses a linear-scan register allocator to assign frequently-used SSA values to physical callee-saved registers (r9-r12), reducing memory traffic.
Overview
Every LLVM SSA value gets a dedicated 8-byte stack slot (the baseline). The register allocator improves on this by keeping hot values in registers across block boundaries and loop iterations.
Eligibility
- Only functions with loop back-edges are considered (loop-free functions skip allocation)
- Values must have ≥3 uses (MIN_USES_FOR_ALLOCATION)
- Live intervals are computed from use-def analysis with loop extension
Available Registers
Callee-saved registers r9-r12, minus those used for incoming parameters:
- A function with 2 parameters uses r9-r10 → r11-r12 are available for allocation
- In non-leaf functions, registers needed for outgoing call arguments are also reserved
Allocation Strategy
- Build candidate intervals from use-def live-interval analysis
- Filter by minimum-use threshold
- Run linear scan: assign to available callee-saved registers, evict lower-priority intervals when needed
- Naturally expired intervals remain in the mapping (earlier uses still benefit)
- Evicted intervals are removed entirely (whole-interval mapping invalid after eviction)
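The strategy above can be condensed into a small linear scan. This is a sketch under simplifying assumptions: intervals are plain (start, end) ranges with a precomputed weight, the minimum-use filter has already been applied, and the names `Interval` and `linear_scan` mirror the text without claiming to match regalloc.rs.

```rust
#[derive(Debug)]
struct Interval {
    value: usize, // SSA value id
    start: u32,
    end: u32,
    weight: f64, // higher = more valuable to keep in a register
}

/// Returns (value, register) pairs for intervals that keep a register.
fn linear_scan(mut intervals: Vec<Interval>, regs: &[u8]) -> Vec<(usize, u8)> {
    intervals.sort_by_key(|iv| iv.start);
    let mut free: Vec<u8> = regs.to_vec();
    let mut active: Vec<(Interval, u8)> = Vec::new();
    let mut assigned: Vec<(usize, u8)> = Vec::new();

    for iv in intervals {
        // Expire intervals that ended before this one starts.
        // Their mapping stays in `assigned`: earlier uses still benefit.
        let mut i = 0;
        while i < active.len() {
            if active[i].0.end < iv.start {
                let (_, reg) = active.swap_remove(i);
                free.push(reg);
            } else {
                i += 1;
            }
        }
        if let Some(reg) = free.pop() {
            assigned.push((iv.value, reg));
            active.push((iv, reg));
            continue;
        }
        // No free register: find the lowest-weight active interval.
        let victim = (0..active.len())
            .min_by(|&a, &b| active[a].0.weight.total_cmp(&active[b].0.weight));
        if let Some(idx) = victim {
            if active[idx].0.weight < iv.weight {
                // Evict it entirely; its whole-interval mapping is invalid.
                let (evicted, reg) = active.swap_remove(idx);
                assigned.retain(|&(v, _)| v != evicted.value);
                assigned.push((iv.value, reg));
                active.push((iv, reg));
            }
        }
    }
    assigned
}
```

Note the asymmetry between the two removal paths: an expired interval keeps its entry in the result, while an evicted one is dropped from it.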
Runtime Integration
- load_operand checks regalloc before stack: uses MoveReg from allocated reg instead of LoadIndU64
- store_to_slot uses write-through: copies to allocated reg AND stores to stack
- Dead store elimination removes the stack store if never loaded
- After calls in non-leaf functions, allocated register mappings are invalidated and lazily reloaded
Cross-Block Propagation
- Leaf functions: alloc_reg_slot is preserved across all block boundaries (allocated registers are never clobbered by calls)
- Non-leaf functions: Predecessor exit snapshots are intersected at multi-predecessor blocks; only entries where ALL predecessors agree are kept
- Back-edges (unprocessed predecessors) are treated conservatively
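The intersection rule for non-leaf functions can be sketched as follows. This is a minimal illustration, not the compiler's code: the AllocSnapshot type (physical register number mapped to the stack slot it mirrors) and the function name are hypothetical, and the conservative handling of back-edges is not modeled.

```rust
use std::collections::HashMap;

/// Hypothetical snapshot: physical register number -> stack slot it mirrors
/// at the exit of a predecessor block.
type AllocSnapshot = HashMap<u8, u32>;

/// Keep only the (register, slot) entries on which every predecessor agrees.
fn intersect_snapshots(preds: &[AllocSnapshot]) -> AllocSnapshot {
    let Some((first, rest)) = preds.split_first() else {
        // No processed predecessors: nothing can be assumed.
        return AllocSnapshot::new();
    };
    first
        .iter()
        .filter(|(reg, slot)| rest.iter().all(|p| p.get(*reg) == Some(*slot)))
        .map(|(reg, slot)| (*reg, *slot))
        .collect()
}
```

With two predecessors that agree on r9 but disagree on r10, only the r9 entry survives, so r10 would be reloaded lazily in the join block.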
Debugging
Enable allocator logs with RUST_LOG=wasm_pvm::regalloc=debug:
- regalloc::run() prints candidate/assignment stats
- lower_function() prints per-function usage counters (alloc_load_hits, alloc_store_hits, etc.)
Quick triage:
- allocatable_regs=0 → no allocation will happen
- Non-zero allocated_values with near-zero load/store hits → move/reload overhead dominates
For the full development journey, see Regalloc Cross-Block Journey.
Technical Reference
Accumulated technical knowledge from development — LLVM pass behavior, PVM instruction semantics, code generation patterns, and optimization details.
Entry Function ABI — Unified Packed i64 Convention
All entry functions (both WAT and AssemblyScript) must use main(args_ptr: i32, args_len: i32) -> i64.
The i64 return value packs a WASM pointer and length: (ptr as u64) | ((len as u64) << 32).
The PVM epilogue unpacks: r7 = (ret & 0xFFFFFFFF) + wasm_memory_base, r8 = r7 + (ret >> 32).
Common constant: ptr=0, len=4 → i64.const 17179869184 (= 4 << 32).
Previous conventions (globals-based, multi-value (result i32 i32), simple scalar) were removed.
AssemblyScript uses a writeResult(val: i32): i64 helper that stores the value and returns packResult(ptr, len).
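The packing and the epilogue arithmetic above can be checked with a small sketch. The function names are illustrative, and the wasm_memory_base value used in the usage note is only an example.

```rust
/// Pack a WASM pointer and length into the i64 return value
/// (shown as u64 so the bit layout is explicit).
fn pack_result(ptr: u32, len: u32) -> u64 {
    (ptr as u64) | ((len as u64) << 32)
}

/// Mirror of the epilogue arithmetic: returns (r7, r8), i.e. the
/// start and end of the result buffer as PVM addresses.
fn unpack_result(ret: u64, wasm_memory_base: u64) -> (u64, u64) {
    let r7 = (ret & 0xFFFF_FFFF) + wasm_memory_base;
    let r8 = r7 + (ret >> 32);
    (r7, r8)
}
```

pack_result(0, 4) yields 17179869184, the constant quoted above. With an example base of 0x30004, unpack_result(pack_result(16, 4), 0x30004) gives (0x30014, 0x30018).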
LLVM New Pass Manager (inkwell 0.8.0 / LLVM 18)
Pass Pipeline Syntax
- Module::run_passes() accepts a pipeline string parsed as a module-level pipeline
- Function passes (like mem2reg, instcombine) auto-wrap as module(function(...))
- CGSCC passes (like inline) cannot be mixed with function passes in a single string
- To run the inliner: use a separate run_passes("cgscc(inline)") call
- Pass parameters use angle brackets: instcombine<max-iterations=2>
instcombine Convergence
- instcombine defaults to max-iterations=1, which can cause LLVM ERROR: Instruction Combining did not reach a fixpoint on complex IR (e.g., after aggressive inlining). The error is a hard report_fatal_error (process abort), not a recoverable Rust error; it bypasses Error::Located diagnostics
- Fix: use instcombine<max-iterations=N> for a higher cap. We currently use N=20
- A cap of 2 is enough for typical IR shapes but not for --trap-floats on large modules: every float operator emits a @llvm.trap() + unreachable cluster, and propagating those through real control flow takes more iterations to fold (issue #212, observed on the polkadot-fellows v2.2.2 relay-chain runtimes)
- Running instcombine,simplifycfg before inlining also helps by simplifying the IR first
Inlining Creates New LLVM Intrinsics
- After inlining, instcombine may transform patterns into LLVM intrinsics that weren’t present before:
  - if x < 0 then -x else x becomes llvm.abs.i64
  - Similar patterns may produce llvm.smax, llvm.smin, llvm.umax, llvm.umin
- The PVM backend must handle these intrinsics (see llvm_backend/intrinsics.rs)
PassBuilderOptions
- set_inliner_threshold() is on PassManagerBuilder, NOT on PassBuilderOptions
- PassBuilderOptions has no direct way to set the inline threshold
- The inline pass uses LLVM’s default threshold (225) when invoked via cgscc(inline)
PVM Branch Operand Convention
Two-register branch instructions use reversed operand order: Branch_op { reg1: a, reg2: b } branches when reg2 op reg1 (i.e., b op a). For example, BranchLtU { reg1: 3, reg2: 2 } branches when reg[2] < reg[3]. This matches the Gray Paper where branch_lt_u(rA, rB) branches when ω_rB < ω_rA. In the encoding, reg1 = high nibble (rA), reg2 = low nibble (rB). Immediate-form branches are straightforward: BranchLtUImm { reg, value } branches when reg < value.
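A small model of the reversed operand order, assuming a 13-entry register file (r0-r12). The struct and helper names are illustrative rather than the crate's actual types.

```rust
/// Illustrative stand-in for a two-register branch instruction.
struct BranchLtU {
    reg1: u8, // rA: encoded in the high nibble
    reg2: u8, // rB: encoded in the low nibble
}

/// The branch is taken when reg[reg2] < reg[reg1] (i.e. b op a).
fn branch_lt_u_taken(regs: &[u64; 13], b: &BranchLtU) -> bool {
    regs[b.reg2 as usize] < regs[b.reg1 as usize]
}

/// Register byte of the encoding: reg1 in the high nibble, reg2 in the low nibble.
fn encode_reg_byte(b: &BranchLtU) -> u8 {
    (b.reg1 << 4) | (b.reg2 & 0x0F)
}
```

With reg[2] = 1 and reg[3] = 5, BranchLtU { reg1: 3, reg2: 2 } is taken, while swapping the two fields produces a branch that is not taken.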
PVM Memory Layout Optimization
- Globals only occupy the bytes they actually need: the compiler tracks globals_region_size = (num_globals + (1 if memory.size/grow/init used else 0) + num_passive_segments) * 4 bytes. The heap usually starts right after this region, but when the compiler also reserves a 256-byte parameter-overflow area (any module type signature has > MAX_LOCAL_REGS params), wasm_memory_base moves to compute_param_overflow_base(...) + 256. The mem-size slot is elided for programs that never read/grow memory size or use memory.init, saving 4 bytes of rw_data.
- Leading-zero rw_data trim (issue #195 Option 2A, extended): anan-as places rw_data at 0x30000 via a fixed memcpy, so leading zero bytes can’t be dropped without a format change. Two moves together collapse the 4KB structural-padding page that would otherwise prefix rw_data for every memory-using program:
  - Stable mem-size slot at 0x30000: the compiler-managed memory-size global is placed at a fixed offset (GLOBAL_MEMORY_BASE itself) independent of num_globals. User globals shift to 0x30004+ when the slot is present. Memory-op lowering (memory.size/grow/init) reads a constant address, unaware of the program’s global count.
  - No 4KB alignment on wasm_memory_base: anan-as allocates rw_data a page at a time via setData and computes heapZerosStart = heapStart + alignToPageSize(rwLength) independently, so the base can land at any byte offset inside the first page without leaving holes. Dropping the alignment places wasm_memory_base just past the globals/passive/overflow regions (typically 0x30004 to 0x30018), so the first data-segment byte sits almost at rw_data[0]. Saves ~4 KB per fixture that declares (memory N) with data segments, including AS-runtime programs (verified: -3.7 KB on anan-as-compiler.jam, -4 KB on most AS fixtures). Note: the WASM-side args_ptr value (ARGS_SEGMENT_START - wasm_memory_base) shifts with the base, which is an observable ABI change for tests that hard-coded it.
- heap_pages is computed after build_rw_data(): uses the actual (trimmed) rw_data length to cover WASM memory from GLOBAL_MEMORY_BASE to wasm_memory_base + initial_pages * 64KB. A single-page (+1) headroom at the heap boundary is reserved so the first memory.grow/sbrk call has a pre-allocated page, which is required for PVM-in-PVM execution to propagate correctly.
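The region-size formula can be written out directly. This sketch covers only the common case with no parameter-overflow area, where the heap starts right after the globals region; the function names are illustrative.

```rust
/// Start of the SPI rw_data zone, as stated in this document.
const GLOBAL_MEMORY_BASE: u32 = 0x30000;

/// Bytes occupied by user globals, the optional mem-size slot and the
/// passive-segment length slots (4 bytes each).
fn globals_region_size(num_globals: u32, uses_mem_size_slot: bool, num_passive_segments: u32) -> u32 {
    (num_globals + u32::from(uses_mem_size_slot) + num_passive_segments) * 4
}

/// Heap base when no parameter-overflow area is reserved (assumption of this sketch).
fn wasm_memory_base_without_overflow(region_size: u32) -> u32 {
    GLOBAL_MEMORY_BASE + region_size
}
```

For example, 5 globals plus the mem-size slot occupy 24 bytes, which places the base at 0x30018; a program with no globals that only uses memory.size lands at 0x30004.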
Code Generation
- Leaf Functions: Functions that make no calls don’t need to save/restore the return address (ra/r0) because it’s invariant. This optimization saves 2 instructions per leaf function.
- Address Calculation: Fusing AddImm into subsequent LoadInd/StoreInd offsets reduces instruction count.
- Dead Code Elimination: Basic DCE for ALU operations removes unused computations (e.g. from macro expansions).
StoreImm (TwoImm Encoding)
- Opcodes 30-33: StoreImmU8/U16/U32/U64
- TwoImm encoding: [opcode, addr_len & 0x0F, address_bytes..., value_bytes...]
- Both address and value are variable-length signed immediates (0-4 bytes each)
- Semantics: mem[address] = value (no registers involved)
- Used for: data.drop (store 0 to segment length addr), global.set with constants
- Savings: 3 instructions (LoadImm + LoadImm + StoreInd) → 1 instruction
StoreImmInd (Store Immediate Indirect)
Encoding (OneRegTwoImm)
- Format: [opcode, (offset_len << 4) | (base & 0x0F), offset_bytes..., value_bytes...]
- Both offset and value use variable-length signed encoding (encode_imm)
- Opcodes: StoreImmIndU8=70, StoreImmIndU16=71, StoreImmIndU32=72, StoreImmIndU64=73
- Semantics: mem[reg[base] + sign_extend(offset)] = value (truncated/sign-extended per width)
- For U64: value is sign-extended from i32 to i64
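A sketch of the OneRegTwoImm layout described above. The encode_imm here is a stand-in written for this example, on the assumption that the real helper emits the shortest little-endian prefix (0-4 bytes) that sign-extends back to the value; that matches the sizes quoted elsewhere in this document (0 bytes for value 0, 1 byte for value 2) but is not copied from the crate.

```rust
/// Assumed behaviour: shortest little-endian byte prefix (0-4 bytes)
/// whose sign-extension reproduces `value`.
fn encode_imm(value: i32) -> Vec<u8> {
    let bytes = value.to_le_bytes();
    for len in 0..4usize {
        let fits = if len == 0 {
            value == 0
        } else {
            let shift = 32 - 8 * (len as u32);
            ((value << shift) >> shift) == value
        };
        if fits {
            return bytes[..len].to_vec();
        }
    }
    bytes.to_vec()
}

/// [opcode, (offset_len << 4) | (base & 0x0F), offset_bytes..., value_bytes...]
fn encode_store_imm_ind(opcode: u8, base: u8, offset: i32, value: i32) -> Vec<u8> {
    let off = encode_imm(offset);
    let val = encode_imm(value);
    let mut out = vec![opcode, ((off.len() as u8) << 4) | (base & 0x0F)];
    out.extend(off);
    out.extend(val);
    out
}
```

Under these assumptions, a StoreImmIndU32 (opcode 72) of the constant -1 at offset 8 from r1 occupies four bytes: opcode, register/length byte, one offset byte and one value byte.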
Optimization Triggers
- emit_pvm_store: When WASM store value is a compile-time constant fitting i32
- Saves 1 instruction (LoadImm) per constant store to WASM linear memory
ALU Immediate Opcode Folding
Immediate folding for binary operations
- When one operand of a binary ALU op is a constant that fits in i32, use the *Imm variant (e.g., And + const → AndImm)
- Saves 1 gas per folded instruction (no separate LoadImm/LoadImm64 needed) + code size reduction
- Available for: Add, Mul, And, Or, Xor, ShloL, ShloR, SharR (both 32-bit and 64-bit)
- Sub with const RHS → AddImm with negated value; Sub with const LHS → NegAddImm
- ICmp UGT/SGT with const RHS → SetGtUImm/SetGtSImm (avoids swap trick)
- LLVM often constant-folds before reaching the PVM backend, so benefits are most visible in complex programs
Instruction Decoder (Instruction::decode)
- instruction.rs now has Instruction::decode(&[u8]) -> Result<(Instruction, usize)> so roundtrip tests and disassembly-style tooling can share one decode path.
- Opcode::from_u8 / TryFrom<u8> are now the canonical byte→opcode conversion helpers for code and tests.
- Fixed-width formats (Zero, ThreeReg, TwoReg, OneOff, TwoRegOneOff, OneRegOneExtImm, OneRegOneImmOneOff) return exact consumed length.
- Formats with trailing variable-length immediates but no explicit terminal length marker (OneImm, OneRegOneImm, TwoRegOneImm, TwoImm, OneRegTwoImm, TwoRegTwoImm) are decoded by consuming the remaining bytes for that trailing immediate.
- Unknown opcode passthrough is explicit: decode returns Instruction::Unknown { opcode, raw_bytes } with original bytes preserved.
Conditional Move (CmovIz/CmovNz)
Branchless select lowering
- select i1 %cond, %true_val, %false_val now uses CmovNz instead of a branch
- Old: load false_val, branch on cond==0, load true_val, define label (5-6 instructions)
- New: load false_val, load true_val, load cond, CmovNz (4 instructions, branchless)
- CmovIz/CmovNz are ThreeReg encoded: [opcode, (cond<<4)|src, dst]
- Semantics: if reg[cond] == 0 (CmovIz) / != 0 (CmovNz) then reg[dst] = reg[src]
- Note: CmovNz conditionally writes dst, so the register cache must invalidate dst after CmovNz/CmovIz
CmovIzImm / CmovNzImm (TwoRegOneImm Encoding)
- Opcodes 147-148: Conditional move with immediate value
- TwoRegOneImm encoding: [opcode, (cond << 4) | dst, imm_bytes...]
- CmovIzImm: if reg[cond] == 0 then reg[dst] = sign_extend(imm)
- CmovNzImm: if reg[cond] != 0 then reg[dst] = sign_extend(imm)
- Now used: optimize select when one operand is a compile-time constant that fits in i32
LoadImmJumpInd (Opcode 180) — Implemented
- TwoRegTwoImm encoding: fuses LoadImm + JumpInd into one instruction.
- Semantics: reg[dst] = sign_extend(value); jump to reg[base] + sign_extend(offset).
- call_indirect now emits LoadImmJumpInd { base: r8, dst: r0, value: preassigned_return_addr, offset: 0 }.
- Dispatch table address math for indirect calls can use ShloLImm32(..., value=3) instead of three Add32 doublings (idx*8), reducing one hot-path sequence from 3 instructions to 1 with equivalent 32-bit wrap/sign-extension semantics.
- Fixups remain stable by:
  - pre-assigning return jump-table slots at emission time, and
  - recording return_addr_instr == jump_ind_instr for this fused call instruction.
- return_addr_jump_table_idx() accepts LoadImmJump, LoadImm, and LoadImmJumpInd, so mixed old/new patterns still resolve safely.
- Important semantic pitfall: do not assume base == dst is safe for absolute jumps. Using LoadImmJumpInd for the main epilogue (EXIT_ADDRESS) caused global failures because jump target evaluation does not behave like a guaranteed “write dst first, then read base” in practice.
PVM Intrinsic Lowering
llvm.abs (absolute value)
- Signature: llvm.abs.i32(x, is_int_min_poison) / llvm.abs.i64(x, is_int_min_poison)
- Lowered as: if x >= 0 then x else 0 - x
- For i32: must sign-extend first (zero-extension from load_operand makes negatives look positive in i64 comparisons)
llvm.bitreverse vs llvm.bswap
Two distinct LLVM intrinsics easy to confuse:
- llvm.bswap.iN: reverses byte order (0xAABBCCDD → 0xDDCCBBAA). Lowers directly to PVM ReverseBytes (opcode 111). For widths < 64, ReverseBytes leaves the result in the high bytes of the 64-bit register, so the bswap path follows up with a ShloRImm64 to recover (shift by 64 - bits).
- llvm.bitreverse.iN: reverses bit order within the value (0x80000001 is a palindrome, so bitreverse maps it to itself). PVM has no native bit-reverse, so this is software-emulated via the standard “swap odd/even bits, swap pairs, swap nibbles, swap bytes” algorithm. Supported widths: i8, i16, i32, i64.
  - i8: 3 mask phases (masks 0x55/0x33/0x0F) using AndImm + ShloLImm32/ShloRImm32; no byte-swap step needed for a single byte (the running value stays clean within the low 8 bits).
  - i16: same shape with masks 0x5555/0x3333/0x0F0F, then ReverseBytes + ShloRImm64 by 48 to recover (matches the bswap path’s i16 recovery shift).
  - i32: masks 0x55555555/0x33333333/0x0F0F0F0F, then ReverseBytes + ShloRImm64 by 32.
  - i64: masks must be loaded via LoadImm64 into TEMP_RESULT and combined with the register-form And (since 64-bit masks don’t fit in AndImm’s i32 immediate); 64-bit shift variants throughout; no post-shift after ReverseBytes.
Substrate / polkadot-fellows runtimes hit llvm.bitreverse.i32 regularly (shared codec/hashing code). LLVM 18’s recognizeBSwapOrBitReverseIdiom pass folds the canonical open-coded pattern (at any width — we verified i8/i16/i32/i64) into the matching intrinsic before our lowering sees it, so writing the algorithm in WAT is sufficient to exercise every path in tests. For i8/i16 the trick is to load/store with narrow ops (i32.load8_u / i32.store8 etc.) so LLVM’s demanded-bits analysis narrows the width of the bitreverse intrinsic from the default i32.
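The i32 path can be modeled in plain Rust to check the masks and the recovery shift. The 64-bit swap_bytes followed by a right shift by 32 stands in for ReverseBytes + ShloRImm64; the function name is illustrative.

```rust
/// Software bit-reverse for 32 bits: three mask phases, then a byte swap.
fn bitreverse32(x: u32) -> u32 {
    let mut v = x;
    // Swap odd/even bits, then pairs, then nibbles.
    v = ((v >> 1) & 0x5555_5555) | ((v & 0x5555_5555) << 1);
    v = ((v >> 2) & 0x3333_3333) | ((v & 0x3333_3333) << 2);
    v = ((v >> 4) & 0x0F0F_0F0F) | ((v & 0x0F0F_0F0F) << 4);
    // Byte-swap the full 64-bit register: the swapped bytes end up in the
    // high half, so shift right by 32 to recover the 32-bit result.
    let wide = (v as u64).swap_bytes();
    (wide >> 32) as u32
}
```

The result agrees with u32::reverse_bits, and 0x80000001 maps to itself as noted above.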
LoadImmJump for Direct Calls
Combined Instruction Replaces LoadImm64 + Jump
- Direct function calls previously used two instructions: LoadImm64 { reg: r0, value } (10 bytes) + Jump { offset } (5 bytes) = 15 bytes, 2 gas
- LoadImmJump { reg: r0, value, offset } (opcode 80) combines both into a single instruction: 6-10 bytes, 1 gas
- Uses encode_one_reg_one_imm_one_off encoding: opcode(1) + (imm_len|reg)(1) + imm(0-4) + offset(4)
- For typical call return addresses (small positive integers like 2, 4, 6), the imm field is 1 byte, so total is 7 bytes
- LoadImmJump does not read any source registers; treat it like LoadImm/LoadImm64 in Instruction::src_regs for DCE
- PVM-in-PVM args are passed via a temp binary file; use a unique temp dir + random filename to avoid collisions under concurrent bun test workers. Debug knobs: PVM_IN_PVM_DEBUG=1 for extra logging, PVM_IN_PVM_KEEP_ARGS=1 to retain the temp args file on disk.
- DCE src_regs: Imm ALU ops read only src; StoreImm* reads no regs; StoreImmInd* reads base only.
Pre-Assignment of Jump Table Addresses
- Same challenge as LoadImm for return addresses: LoadImmJump has variable-size encoding, so the value must be known at emission time
- Solution: Thread a next_call_return_idx counter through the compilation pipeline, pre-computing (index + 1) * 2 at emission time
- During resolve_call_fixups, only the offset field is patched (always 4 bytes, size-stable)
- The value field is verified via debug_assert! to match the actual jump table index
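The pre-assignment amounts to a counter that hands out return addresses in emission order. A minimal sketch follows; the struct and method names are hypothetical, and only the next_call_return_idx counter and the (index + 1) * 2 formula come from the text above.

```rust
/// Hypothetical holder for the counter threaded through emission.
struct ReturnAddrAllocator {
    next_call_return_idx: u32,
}

impl ReturnAddrAllocator {
    fn new() -> Self {
        Self { next_call_return_idx: 0 }
    }

    /// Reserve the next jump-table slot and return the address to load
    /// into the return-address register: (index + 1) * 2.
    fn next_return_addr(&mut self) -> u32 {
        let idx = self.next_call_return_idx;
        self.next_call_return_idx += 1;
        (idx + 1) * 2
    }
}
```

Successive call sites receive 2, 4, 6 and so on, so the immediate's encoded size is known when the instruction is emitted and only the fixed-width offset needs patching later.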
Bonus: Peephole Fallthrough Elimination
- Since LoadImmJump is a terminating instruction, the peephole optimizer can remove a preceding Fallthrough
- This saves an additional 1 byte per call site where a basic block boundary precedes the call
- Total savings per call: -8 bytes (instruction) + -1 byte (Fallthrough removal) + -1 gas
Call Return Address Encoding
LoadImm vs LoadImm64 for Call Return Addresses
- Call return addresses are jump table addresses: (jump_table_index + 1) * 2
- These are always small positive integers (2, 4, 6, …) that fit in LoadImm (3-6 bytes)
- Previously used LoadImm64 (10 bytes) with placeholder value 0, patched during fixup resolution
- Problem with late patching: LoadImm has variable encoding size (2 bytes for value 0, 3 bytes for value 2), so changing the value after branch fixups are resolved corrupts relative offsets
- Solution: Pre-assign jump table indices at emission time by threading a next_call_return_idx counter through the compilation pipeline. This way LoadImm values are known during emission, ensuring correct byte_offset tracking for branch fixup resolution
- For direct calls, LoadImmJump combines return address load + jump into one instruction, using the same pre-assigned index
- For indirect calls (call_indirect), LoadImmJumpInd is used to combine return-address setup and the indirect jump
- Impact: Saves 7 bytes per indirect call site (LoadImm vs LoadImm64). Direct calls save even more via LoadImmJump fusion.
Why LoadImm64 was originally needed
- LoadImm64 has fixed 10-byte encoding regardless of value, so placeholder patching was safe
- LoadImm with value 0 encodes to 2 bytes, but after patching to value 2 becomes 3 bytes
- This size change would break branch fixups already resolved with the old instruction sizes
PVM 32-bit Instruction Semantics
Sign Extension
- All PVM 32-bit arithmetic/shift instructions produce u32SignExtend(result): the lower 32 bits are computed, then sign-extended to fill the full 64-bit register
- This means AddImm32(x, x, 0) after a 32-bit producer is a NOP (both sign-extend identically)
- Confirmed in anan-as reference: add_32, sub_32, mul_32, div_u_32, rem_u_32, shlo_l_32, etc. all call u32SignExtend()
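A small model of the semantics above shows why the extra AddImm32 is redundant. The function names mirror the anan-as helpers mentioned in the list but are re-implemented here purely for illustration.

```rust
/// Sign-extend a 32-bit result into a 64-bit register value.
fn u32_sign_extend(v: u32) -> u64 {
    v as i32 as i64 as u64
}

/// Model of a 32-bit add: compute on the low 32 bits, then sign-extend.
fn add_32(a: u64, b: u64) -> u64 {
    u32_sign_extend((a as u32).wrapping_add(b as u32))
}

/// Model of AddImm32 with the same sign-extension rule.
fn add_imm_32(src: u64, imm: i32) -> u64 {
    u32_sign_extend((src as u32).wrapping_add(imm as u32))
}
```

add_32(0x7FFF_FFFF, 1) produces 0xFFFF_FFFF_8000_0000, and applying add_imm_32 with 0 to that result returns the same value, which is the no-op the peephole removes.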
Peephole Truncation Pattern
- The pattern [32-bit-producer] → [AddImm32(x, x, 0)] is eliminated by peephole when directly adjacent
- In practice with LLVM passes enabled, instcombine already eliminates trunc(32-bit-op) at the LLVM IR level, so this peephole pattern fires rarely
- The peephole is still valuable for --debug-skip-llvm-passes mode and as defense-in-depth
- Known limitation: the pattern only matches directly adjacent instructions; a StoreIndU64 between producer and truncation breaks the match
Peephole AddImm Width Safety
- optimize_address_calculation() must not fold address relations across AddImm32/AddImm64 width boundaries.
- Track AddImm relation width alongside (base, offset) and only fold when widths match (32→32, 64→64), while still allowing width-agnostic MoveReg alias folding.
Cross-Block Register Cache
Approach
- Pre-scan computes block_single_pred map by scanning terminator successors
- For each block with exactly 1 predecessor and no phi nodes, restore the predecessor’s cache snapshot instead of clearing
- Snapshot is taken before the terminator instruction to avoid capturing path-specific phi copies
Key Pitfall: Terminator Phi Copies
- lower_switch emits phi copies for the default path inline (not in a trampoline)
- These phi copies modify the register cache (storing values to phi slots)
- If the exit cache includes these entries, they are WRONG for case targets (which don’t take the default path)
- Fix: snapshot before the terminator and invalidate TEMP1/TEMP2 (registers the terminator clobbers for operand loads)
- Same issue can occur with conditional branches when one path has phis and the other doesn’t (trampoline case)
Specialized PVM Instructions for Common Patterns
Absolute Address Load/Store (LoadU32/StoreU32)
- LoadU32 { dst, address } replaces LoadImm { reg, value: addr } + LoadIndU32 { dst, base: reg, offset: 0 } for known-address loads (globals)
- StoreU32 { src, address } similarly replaces the store pattern
- OneRegOneImm encoding: [opcode, reg & 0x0F, encode_imm(address)...]
- PVM-in-PVM layout sensitivity: Replacing multi-instruction sequences with single instructions changes bytecode layout (code size, jump offsets). Test each significant code generation change with the full PVM-in-PVM suite.
- LoadU32 is used for lower_wasm_global_load. StoreU32 is used for lower_wasm_global_store. Both absolute-address variants are now emitted everywhere applicable.
LoadIndI32 (Sign-Extending Indirect Load)
- Replaces LoadIndU32 { dst, base, offset } + AddImm32 { dst, src: dst, value: 0 } for signed i32 loads
- Single instruction: LoadIndI32 { dst, base, offset } (sign-extends result to 64 bits)
LoadIndI32 { dst, base, offset }(sign-extends result to 64 bits) - Safe for PVM-in-PVM (small layout change)
Min/Max/MinU/MaxU (Single-Instruction Min/Max)
- Replaces SetLt + branch + stores + jump pattern (~5-8 instructions) with Min/Max/MinU/MaxU (1 instruction)
- For i32 signed variants, must keep AddImm32 { value: 0 } sign-extension before the instruction (PVM compares full 64-bit values)
ReverseBytes (Byte Swap)
- llvm.bswap intrinsic lowered as ReverseBytes { dst, src } instead of byte-by-byte extraction
- For sub-64-bit types: add ShloRImm64 to align bytes (48 for i16, 32 for i32)
- Savings: i16: ~10→2 instructions, i32: ~20→2, i64: ~40→1
CmovIzImm/CmovNzImm (Conditional Move with Immediate)
- For select with one constant operand: CmovNzImm { dst, cond, value } or CmovIzImm { dst, cond, value }
- Load non-constant operand as default, then conditionally overwrite with immediate
- Note: LLVM may invert conditions, so select(cond, true_const, false_runtime) may emit CmovIzImm instead of CmovNzImm
RotL/RotR (Rotate Instructions)
- llvm.fshl(a, b, amt) / llvm.fshr(a, b, amt) when a == b (same SSA value) → rotation
- Detected via val_key_basic(a) == val_key_basic(b) identity check
- fshl with same operands → RotL32/RotL64, fshr → RotR32/RotR64
- Falls back to existing shift+or sequence when operands differ
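The equivalence behind this lowering is easy to check against LLVM's funnel-shift definition (concatenate a:b, shift, keep one half). The 32-bit models below are written for illustration only.

```rust
/// llvm.fshl.i32: high 32 bits of (a:b) shifted left by amt mod 32.
fn fshl32(a: u32, b: u32, amt: u32) -> u32 {
    let s = amt % 32;
    if s == 0 { a } else { (a << s) | (b >> (32 - s)) }
}

/// llvm.fshr.i32: low 32 bits of (a:b) shifted right by amt mod 32.
fn fshr32(a: u32, b: u32, amt: u32) -> u32 {
    let s = amt % 32;
    if s == 0 { b } else { (a << (32 - s)) | (b >> s) }
}
```

When both operands are the same value, fshl32 matches u32::rotate_left and fshr32 matches u32::rotate_right for every shift amount, which is what allows the single RotL/RotR instruction.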
Linear-Scan Register Allocation
- Allocates SSA values to physical registers using spill-weight eviction (use_count × 10^loop_depth).
- Operates on LLVM IR before PVM lowering; produces ValKey → physical register mapping
- load_operand checks regalloc before slot lookup: uses MoveReg from allocated reg instead of LoadIndU64 from stack
- store_to_slot uses write-through: copies to allocated reg AND stores to stack; DSE removes the stack store if never loaded
- r5/r6 allocatable in safe leaf functions (no bulk memory ops or funnel shifts); detected by scratch_regs_safe() LLVM IR scan
- r7/r8 allocatable in all leaf functions; lowering paths that use them as scratch trigger invalidate_reg via emit()
- Clobbered allocated scratch regs (when present) are handled with lazy invalidation/reload instead of eager spill+reload
- Allocates in all functions (looped and straight-line), not just loop-heavy code
- MIN_USES default=2 (aggressive=1); values with fewer uses are skipped
- Loop extension: back-edges detected by successor having lower block index; live ranges extended to cover the back-edge source
- Eviction uses spill weight (sum of 10^loop_depth per use) instead of furthest-end heuristic
- linear_scan must track active assignments separately from final assignments:
  - naturally expired intervals should remain in the final val_to_reg/slot_to_reg maps (their earlier uses still benefit),
  - evicted intervals must be removed from final mapping (whole-interval mapping is no longer valid after eviction).
- Unit tests cover both interval outcomes (non-overlapping reuse and eviction dropping).
- Targeted benchmark fixture: tests/fixtures/wat/regalloc-two-loops.jam.wat (regalloc two loops(500) row).
- Regalloc instrumentation:
  - regalloc::run() logs candidate/assignment stats at target wasm_pvm::regalloc (enable via RUST_LOG=wasm_pvm::regalloc=debug).
  - lower_function() logs per-function summary including allocation usage counters (alloc_load_hits, alloc_store_hits).
- Instrumentation root cause and fix:
  - Root cause was allocatable_regs=0 in non-leaf functions because only leaf functions exposed r9-r12 to regalloc.
  - Fix: expose available r9-r12 registers in both leaf and non-leaf functions; reserve outgoing argument registers (r9..r9+max_call_args-1) from non-leaf allocation and invalidate local-register mappings after calls.
  - Example (regalloc-two-loops): allocatable_regs=2, allocated_values=4, alloc_load_hits=11, alloc_store_hits=8.
- Non-leaf stabilization:
  - Reserve outgoing call-argument registers (r9.. by max call arity) from the non-leaf allocatable set.
  - Initially, alloc_reg_valid was reset at label boundaries (define_label/define_label_preserving_cache) because that validity state was not path-sensitive and CacheSnapshot did not yet snapshot alloc_reg_slot during cross-block cache propagation.
  - Without boundary reset, large workloads (notably anan-as-compiler.jam) can miscompile under pvm-in-pvm despite direct tests passing.
- Follow-up stabilization:
  - Corrective follow-up: CacheSnapshot now includes allocated-register slot ownership (alloc_reg_slot), which replaced the earlier label-boundary alloc_reg_valid reset approach by restoring allocation state path-sensitively across propagated edges.
  - alloc_reg_valid was removed; slot identity (alloc_reg_slot == Some(slot)) is sufficient to decide whether a lazy reload is needed.
  - Non-leaf gate: skip when no allocatable registers remain (all r9-r12 used by params/call args). Previously skipped at <2 regs and <24 SSA values, but these conservative gates were removed in Phase 2 (#165).
- Post-fix benchmark shape: consistent JAM size reductions from regalloc, but gas/time gains are workload-dependent and often near-noise on current microbenchmarks.
- Leaf detection fix: PVM intrinsics (__pvm_load_i32, __pvm_store_i32, etc.) are LLVM Call instructions but are NOT real function calls; they’re lowered inline using temp registers only. The is_real_call() function in emitter.rs distinguishes real calls (wasm_func_*, __pvm_call_indirect) from intrinsics (__pvm_*, llvm.*). Before this fix, ALL functions with memory access were classified as non-leaf, causing unnecessary callee-save prologue/epilogue overhead.
- Cross-block alloc_reg_slot propagation: In leaf functions (no real calls), alloc_reg_slot is preserved across all block boundaries because allocated registers are never clobbered. In non-leaf functions with multi-predecessor blocks, predecessor exit snapshots are intersected; only entries where ALL processed predecessors agree are kept. Back-edges (unprocessed predecessors) are treated conservatively.
- Phi node allocation is a gas regression in PVM: Allocating phi nodes at loop headers adds +1 MoveReg per iteration per phi (write-through to allocated reg) with 0 gas savings (MoveReg replaces LoadIndU64, both cost 1 gas). Net: +1 gas per iteration per allocated phi. Only beneficial when loads are cheaper than stores, when allocated regs can be used directly by instructions (avoiding MoveReg to temps), or when code size matters more than gas.
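The spill-weight heuristic can be sketched as below. spill_weight follows the "sum of 10^loop_depth per use" rule stated above; pick_eviction_victim is a hypothetical policy added for illustration (evict the lightest active interval only if the new candidate is heavier) and is not taken from the allocator's source.

```rust
/// Spill weight of a value: each use contributes 10^loop_depth.
fn spill_weight(use_loop_depths: &[u32]) -> u64 {
    use_loop_depths.iter().map(|&d| 10u64.pow(d)).sum()
}

/// Hypothetical eviction choice: index of the lowest-weight active interval,
/// but only when it is lighter than the incoming candidate.
fn pick_eviction_victim(active_weights: &[u64], candidate_weight: u64) -> Option<usize> {
    let (idx, &min) = active_weights
        .iter()
        .enumerate()
        .min_by_key(|&(_, &w)| w)?;
    (min < candidate_weight).then_some(idx)
}
```

A value used twice outside any loop and once at depth 1 weighs 12, while two uses at depth 2 weigh 200, so values used inside nested loops are the last to be evicted under this rule.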
Fused Inverted Bitwise (AndInv / OrInv / Xnor)
- and(a, xor(b, -1)) → AndInv(a, b) (bit clear): saves 1 instruction (eliminates separate Xor for NOT)
- or(a, xor(b, -1)) → OrInv(a, b) (or-not): same pattern
- xor(a, xor(b, -1)) → Xnor(a, b) (equivalence): note that LLVM instcombine may reassociate xor(a, xor(b, -1)) to xor(xor(a,b), -1), which makes Xnor fire less often in practice
- Detection is commutative: checks both LHS and RHS for the NOT pattern
- All three use ThreeReg encoding: [opcode, (src2<<4)|src1, dst]
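The three fused forms correspond to the following identities on 64-bit values, where xor with -1 is bitwise NOT. The function names are illustrative stand-ins for the instruction semantics.

```rust
/// AndInv(a, b): bit clear.
fn and_inv(a: u64, b: u64) -> u64 {
    a & !b
}

/// OrInv(a, b): or-not.
fn or_inv(a: u64, b: u64) -> u64 {
    a | !b
}

/// Xnor(a, b): bitwise equivalence.
fn xnor(a: u64, b: u64) -> u64 {
    !(a ^ b)
}
```

Each matches the unfused pattern, for example a & (b ^ u64::MAX) for AndInv. The reassociated form (a ^ b) ^ u64::MAX also equals xnor(a, b), which is why the backend may see that shape instead of the nested one.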
CmovIz Register Form for Inverted Select
- select(!cond, true_val, false_val) now uses CmovIz instead of computing the inversion + CmovNz
- Detected patterns: xor(cond, 1) (boolean flip) and icmp eq cond, 0 (i32.eqz)
- Saves 2-3 instructions by avoiding the boolean inversion sequence
- Note: LLVM instcombine often folds select(icmp eq x, 0, tv, fv) → select(x, fv, tv), so the pattern fires mainly in edge cases or with specific IR shapes
Intentionally Not Emitted Opcodes
- MulUpperSS/UU/SU (213-215): No WASM operator produces 128-bit multiply upper halves
- Alt shift immediates (reversed): dst = imm OP src form; no WASM pattern generates this (LLVM canonicalizes register on LHS)
- Absolute address non-32-bit sizes: All WASM globals use 4-byte (i32) slots; no need for U8/U16/U64 absolute address variants
RW Data Trimming
- translate::build_rw_data() now trims trailing zero bytes before SPI encoding.
- Semantics remain correct because heap pages are zero-initialized; omitted high-address zero tail bytes are equivalent.
- This is a low-risk blob-size optimization and does not materially affect gas.
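The trim itself is a one-liner. The helper below is a sketch of the idea under the stated zero-initialization assumption, not the body of build_rw_data().

```rust
/// Drop trailing zero bytes; the loader's zero-initialized pages make
/// the omitted tail equivalent.
fn trim_trailing_zeros(data: &[u8]) -> &[u8] {
    let end = data.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
    &data[..end]
}
```

Interior zeros are kept, so [1, 0, 2, 0, 0] becomes [1, 0, 2], and an all-zero buffer trims to an empty slice.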
Fallthrough Jump Elimination
- When LLVM block N ends with an unconditional branch to block N+1 (next in layout order), the Jump can be skipped: execution falls through naturally.
- Controlled by fallthrough_jumps optimization flag (--no-fallthrough-jumps to disable).
- Implementation: PvmEmitter.next_block_label tracks the label of the next block. emit_jump_to_label() skips the Jump when the target matches next_block_label.
- Critical pitfall (phi node trampolines): When conditional branches target blocks with phi nodes, the codegen emits per-edge trampoline code (phi copies + Jump) between blocks. The emit_jump_to_label() in trampoline code must NOT be eliminated, because the jump is not the last instruction before the next block’s define_label. Fix: lower_br and lower_switch temporarily clear next_block_label during trampoline emission.
- Entry header shrunk from 10 to 6 bytes when no secondary entry (removed 4 Fallthrough padding after Trap).
- Main function emitted first (right after entry header) to minimize Jump distance.
Memory Layout Sensitivity (PVM-in-PVM)
- Moving the globals/overflow/spill region around directly affects the base address that the interpreter loads as the WASM heap, so every change still requires a full pvm-in-pvm validation. Direct/unit runs may look fine, but the outer interpreter can panic if the linear memory isn’t page-aligned or overlaps reserved slots.
- Critical: The parameter overflow area must be >= GLOBAL_MEMORY_BASE (0x30000) because the SPI rw_data zone starts at 0x30000. The gap zone (0x20000-0x2FFFF) between ro_data and rw_data is unmapped. Placing constants in the gap zone causes PVM panics.
- The compact layout places the parameter overflow area dynamically right after globals (no fixed address), and SPILLED_LOCALS_BASE/SPILLED_LOCALS_PER_FUNC have been removed. This reduces the gap between globals and WASM linear memory, saving ~8KB RW data for typical programs (WASM memory base moves from ~0x33000 to ~0x31000 for a program with 5 globals).
Benchmark Comparison Parsing
- tests/utils/benchmark.sh emits two different result tables:
  - Direct: Benchmark | WASM Size | JAM Size | Gas Used | Time
  - PVM-in-PVM: Benchmark | JAM Size | Outer Gas Used | Time
- Branch comparison must parse JAM size and gas from the correct columns per table header (direct rows use columns 3/4; PiP rows use 2/3).
- With set -u, EXIT trap handlers must not depend on function-local variables at exit time; expand local values when installing the trap.
Peephole Immediate Chain Fusion (2026-03)
- LoadImm + AddImm fusion: LoadImm r1, A; AddImm r1, r1, B → LoadImm r1, A+B
  - Saves 1 instruction when loading a value then adjusting it
  - Only applies when combined result fits in i32
- Chained AddImm fusion: AddImm r1, r1, A; AddImm r1, r1, B → AddImm r1, r1, A+B
  - Collapses sequences of incremental adjustments
  - Common in address calculations and loop induction variables
- MoveReg self-elimination: MoveReg r1, r1 → removed entirely (no-op)
  - Can appear after register allocation or phi lowering
- Implementation in peephole.rs::optimize_immediate_chains()
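The three rewrites can be sketched over a toy instruction enum. This is a simplified stand-in for optimize_immediate_chains(): the Instr type is hypothetical, and basic-block boundaries, jump targets and 32/64-bit width distinctions are all ignored here.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    LoadImm { reg: u8, value: i32 },
    AddImm { dst: u8, src: u8, value: i32 },
    MoveReg { dst: u8, src: u8 },
}

/// Fuse LoadImm+AddImm and AddImm+AddImm chains on the same register,
/// and drop self-moves. Fusion is skipped when the sum overflows i32.
fn fuse_immediate_chains(input: &[Instr]) -> Vec<Instr> {
    let mut out: Vec<Instr> = Vec::new();
    for instr in input {
        match *instr {
            // MoveReg r, r is a no-op.
            Instr::MoveReg { dst, src } if dst == src => continue,
            Instr::AddImm { dst, src, value: b } if dst == src => {
                let fused = match out.last() {
                    Some(&Instr::LoadImm { reg, value: a }) if reg == dst => {
                        a.checked_add(b).map(|v| Instr::LoadImm { reg, value: v })
                    }
                    Some(&Instr::AddImm { dst: d, src: s, value: a }) if d == dst && s == dst => {
                        a.checked_add(b).map(|v| Instr::AddImm { dst, src, value: v })
                    }
                    _ => None,
                };
                if let (Some(f), Some(last)) = (fused, out.last_mut()) {
                    *last = f; // replace the previous instruction with the fused one
                } else {
                    out.push(instr.clone());
                }
            }
            _ => out.push(instr.clone()),
        }
    }
    out
}
```

For example, LoadImm r1, 10; AddImm r1, r1, 5; AddImm r1, r1, -3; MoveReg r2, r2 collapses to the single instruction LoadImm r1, 12.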
PVM-in-PVM Ecalli Forwarding (2026-03)
- Dynamic ecalli index is not supported by PVM: The ecalli instruction takes a static u32 immediate. To forward inner program ecalli with dynamic indices, either use a per-ecalli dispatch table in the adapter or use a fixed “proxy” ecalli with a data buffer protocol.
- Adapter import resolution against main exports: adapter_merge.rs resolves adapter imports matching main export names internally. Key use case: adapter importing host_read_memory/host_write_memory (exported by the compiler module) to access inner PVM memory during ecalli handling.
- Scratch buffer protocol for trace replay: The replay adapter allocates a single WASM memory page (memory.grow(1)) on the first ecalli call and caches the address at a sentinel location (0xFFFF0) for reuse on subsequent calls. The outer handler writes the ecalli response ([8:new_r7][8:new_r8][4:num_memwrites][8:new_gas][memwrites...]) to the buffer at the PVM address obtained via pvm_ptr. The adapter reads the response, applies memwrites via host_write_memory, and returns the new register values.
- Adapter globals not supported: adapter_merge only merges function-related sections (types, imports, functions, code) from the adapter. Globals, data sections, and memory declarations from the adapter are NOT included in the merged module. Workaround: use main module memory with fixed addresses or memory.grow.
- host_call_N requires compile-time constant ecalli index: The first argument to host_call_N imports must be a compile-time constant because it becomes the immediate operand of the PVM ecalli instruction. Runtime ecalli indices (e.g., forwarded from inner programs) cause compilation failure.
Comparison Code Size Optimizations (2026-03)
- NE comparison optimization was reverted for correctness in PVM-in-PVM: Xor + SetGtUImm(0) looked equivalent to Xor + LoadImm(0) + SetLtU, but it regressed as-decoder-subarray-test in layer5 (inner run returned empty Result: [0x]). Keep the conservative LoadImm(0) + SetLtU lowering for icmp ne.
- i1→i64 sign-extension: LoadImm(0) + Sub64 → NegAddImm64(0)
  - Original: 2 instructions to compute 0 - val (negate boolean to 0/-1)
  - Optimized: 1 instruction using NegAddImm64, which computes dst = imm - src
  - NegAddImm64(dst, src, 0) = dst = 0 - src = -src
  - Saves 1 instruction per boolean sign-extension
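The boolean sign-extension identity can be checked with a small model. This sketch assumes 64-bit wrapping arithmetic and models NegAddImm64 exactly as described in the notes (dst = imm - src); the function names are hypothetical and nothing here is taken from the backend.

```rust
/// Model of NegAddImm64 as described in the notes: dst = imm - src (wrapping).
fn neg_add_imm64(src: u64, imm: u64) -> u64 {
    imm.wrapping_sub(src)
}

/// Two-instruction form: LoadImm(0) followed by Sub64.
fn sext_i1_two_instr(val: u64) -> u64 {
    let zero = 0u64; // LoadImm(0)
    zero.wrapping_sub(val) // Sub64: 0 - val
}

/// One-instruction form: NegAddImm64 with an immediate of 0.
fn sext_i1_one_instr(val: u64) -> u64 {
    neg_add_imm64(val, 0)
}
```

For a boolean input (0 or 1) both forms yield 0 or all-ones, which is the i1→i64 sign-extension result.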
Register-Aware Phi Resolution (Phase 5, 2026-03)
- Ordering dependencies between reg→reg and reg→stack phi copies: When phi copies include both register-to-register copies and copies involving stack, they must be treated as a single set of parallel moves. An initial implementation separated them into two independent phases, but this caused incorrect results when a reg→reg copy clobbered a source register that a reg→stack copy also needed. The fix: use a unified two-pass approach (load ALL incoming values into temp registers first, then store all to destinations).
- Phi destinations must be restored after define_label: After define_label clears all alloc state at a block boundary, blocks with phi nodes must call restore_phi_alloc_reg_slots to re-establish alloc_reg_slot for phi destinations. Without this, load_operand falls back to stack loads, missing the values that the phi copy placed in registers.
- Dirty phi values and block exit: After restore_phi_alloc_reg_slots marks phi destinations as dirty, the before-terminator spill_all_dirty_regs() writes them to the stack. This is essential: non-phi successor blocks (like loop exit blocks) clear alloc state and read from the stack. Without the spill, exit paths read stale stack values. This limits the code-size benefit of lazy spill: each iteration still writes phi values to the stack once via the before-terminator spill.
- alloc_reg_slot shared between phi destination and incoming value: The same SSA value can be both a phi destination (in the header) and an incoming value (from the body). After mem2reg, phi incoming values from the loop body ARE the phi results from the current iteration. The regalloc may assign them the same physical register. When phi_reg == incoming_reg, the phi copy is a no-op (the value is already in the right register).
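The ordering hazard between reg→reg and reg→stack phi copies, and the unified two-pass fix, can be reproduced on a toy machine. This is a sketch under stated assumptions: `Loc`, `Machine`, and both copy functions are hypothetical names, and the model uses an unbounded list of temporaries where the real emitter has a fixed set of temp registers.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Loc {
    Reg(u8),
    Stack(i32),
}

/// Toy machine state: location -> value (missing entries read as 0).
type Machine = BTreeMap<Loc, u64>;

fn read(m: &Machine, loc: Loc) -> u64 {
    *m.get(&loc).unwrap_or(&0)
}

/// Buggy shape: each copy is applied immediately, so an earlier copy can
/// clobber a register that a later copy still needs as its source.
fn sequential_copies(m: &mut Machine, copies: &[(Loc, Loc)]) {
    for &(src, dst) in copies {
        let v = read(m, src);
        m.insert(dst, v);
    }
}

/// Two-pass shape: load ALL incoming values first, then store them all.
fn parallel_copies(m: &mut Machine, copies: &[(Loc, Loc)]) {
    let temps: Vec<u64> = copies.iter().map(|&(src, _)| read(m, src)).collect();
    for (&(_, dst), v) in copies.iter().zip(temps) {
        m.insert(dst, v);
    }
}
```

With r1 = 10, r2 = 20 and the copies r2→r1, r1→stack[0], the sequential version stores 20 to the stack slot, while the two-pass version stores the intended 10.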
Load-Side Coalescing (Phase 8, 2026-03)
- Eliminating MoveReg by reading directly from allocated registers: operand_reg() checks if a value is currently live in its allocated register and returns that register directly. Lowering code uses the allocated register as the instruction’s source operand instead of loading into TEMP1/TEMP2, eliminating the MoveReg that load_operand() would have emitted. This complements store-side coalescing; together they eliminate moves on both sides of instructions.
- Dst-conflict safety: When an operand’s allocated register equals the instruction’s destination register (result_reg), the operand must fall back to a temp register. Otherwise, emit() → invalidate_reg(dst) auto-spills the old value and clears alloc tracking before the instruction reads the operand. While the PVM instruction itself would execute correctly (read-before-write at hardware level), the conservative approach avoids subtle alloc-state corruption in edge cases.
- Div/rem excluded from coalescing: Signed division/remainder trap code (emit_wasm_signed_overflow_trap) uses SCRATCH1 (r5) as scratch for sign-extending 32-bit operands. If the LHS operand is in r5, the trap code clobbers it before the div instruction can read it. Rather than adding per-operation conflict checks, div/rem operations always load into TEMP1/TEMP2.
- Immediate-folding paths coalesced: The commutative_imm_instruction helper was parameterized to accept a src register instead of hardcoding TEMP1. This allows immediate-folding paths (the most common for LLVM-optimized code) to use the allocated register directly. Shift/sub immediate paths were similarly updated.
- Store instructions have no dst conflict: PVM store instructions (StoreIndU8, etc.) write to memory, not to a register, so they have no destination register. Both address and value operands can freely use allocated registers without conflict checks.
- Impact: The fib(20) benchmark dropped from 613 to 511 gas (17%), regalloc two loops from 23,334 to 16,776 gas (28%), and the anan-as PVM interpreter JAM size from 164.9 KB to 158.9 KB (3.6%).
Rematerialization — Not Feasible (Phase 8 investigation, 2026-03)
Reloading values with LoadImm instead of LoadIndU64 from stack has zero practical impact in this architecture. Three approaches (LLVM IR constant detection, PVM emitter reg_to_const tracking at store_to_slot time, regalloc-level val_constants map) all failed for the same reason: every value reaching the regalloc reload path is a non-constant instruction result. LLVM’s IRBuilder constant-folds at instruction creation time, so no all-constant-operand instruction survives into the IR; LLVM constants that do exist are intercepted by get_sign_extended_constant() at the top of load_operand(), before the alloc code path. There is no gap between “LLVM knows it’s constant” and “the emitter needs to reload it”.
Prerequisite for retrying: PVM-level constant propagation that tracks results across AddImm32 etc., not just LoadImm/LoadImm64. Significant feature, uncertain ROI.
Store-Side Coalescing (Phase 7, 2026-03)
- Avoiding MoveReg by computing directly into allocated registers: result_reg() returns the allocated register for the current instruction’s result slot, allowing ALU/memory-load/intrinsic lowering to use it as the output destination. This eliminates the MoveReg that store_to_slot would otherwise emit to copy from TEMP_RESULT into the allocated register. On the anan-as compiler, this reduced store_moves by 54% (2720 to 1262) and total instructions by 4%.
- lower_select store-side coalescing cannot be used: Loading the default value into the allocated register via load_operand(val, alloc_reg) triggers invalidate_reg(alloc_reg) in emit(), which corrupts register cache state for subsequent operand loads. However, load-side coalescing works (Phase 9): operand_reg() is used for all Cmov operands so values already in their allocated registers are used directly without MoveReg copies. This is safe because all select operands are simultaneously live (the allocator guarantees different registers) and the Cmov instruction’s dst register is only invalidated by emit(), not by load_operand() on the other operands.
- result_reg_or() needed for zext/sext/trunc: These lowering paths use TEMP1 (not TEMP_RESULT) as the working register in the non-allocated case, because the source operand is already in TEMP1 and the in-place truncation/extension writes back to the same register. Using TEMP_RESULT would require an extra MoveReg. result_reg_or(TEMP1) returns the allocated register when available, or TEMP1 as fallback, preserving the existing efficient non-allocated codepath.
- Control-flow-spanning TEMP_RESULT uses cannot be coalesced: emit_pvm_memory_grow and lower_abs both use TEMP_RESULT across branches (grow success/failure, positive/negative paths). Computing into the allocated register would corrupt it if the branch takes the alternative path. These remain uncoalesced.
Spill Weight Refinement and Call Return Hints (Phase 9, 2026-03)
- Spill weight call penalty: Values whose live ranges span real call instructions have a penalty of 2.0 per spanning call applied to their spill weight. This represents the cost of the spill+reload pair required when a register is allocated across a call boundary. Binary search on sorted call positions enables efficient counting. Trade-off: a tiny regression in very small functions with a single call (e.g., host-call-log: +3 gas) for consistent improvements in larger functions (e.g., AS fib: -2 gas, aslan-fib: -28 gas).
- Call return value register hints: The linear scan allocator accepts preferred_reg hints on live intervals. Values defined by real call instructions get a hint for r7 (RETURN_VALUE_REG), since the return value is already in r7 after a call. If r7 is free, it’s used; otherwise, a different register is allocated. This eliminates the MoveReg from r7 to the allocated register in store_to_slot.
- is_real_call() made pub(super): The function distinguishing real calls from PVM/LLVM intrinsics was made module-visible so regalloc.rs can use it for call position collection without code duplication.
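The binary-search counting mentioned for the spill weight penalty might look like the following. It is a sketch, not the regalloc.rs code: the function names, the strict `start < call < end` boundary rule, and returning the penalty as a separate amount (rather than folding it into a weight) are assumptions; only the 2.0-per-call figure comes from the notes.

```rust
/// Per-call penalty from the notes (2.0 per spanning call).
const CALL_SPANNING_PENALTY: f64 = 2.0;

/// Count call positions strictly inside (start, end).
/// `call_positions` must be sorted ascending.
fn calls_spanned(call_positions: &[usize], start: usize, end: usize) -> usize {
    // Index of the first call position greater than `start`.
    let lo = call_positions.partition_point(|&p| p <= start);
    // Index of the first call position that is not below `end`.
    let hi = call_positions.partition_point(|&p| p < end);
    hi.saturating_sub(lo)
}

/// Total penalty for one live interval.
fn call_spanning_penalty(call_positions: &[usize], start: usize, end: usize) -> f64 {
    CALL_SPANNING_PENALTY * calls_spanned(call_positions, start, end) as f64
}
```

Each query costs two binary searches, so collecting call positions once per function and reusing the sorted list keeps interval construction cheap.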
Loop Phi Early Interval Expiration (Phase 10, 2026-03)
- Post-allocation coalescing doesn’t work: Three approaches were tried and all failed due to the emitter’s per-register alloc_reg_slot tracking disagreeing with the allocator’s per-value liveness model. See git history for details.
- Early interval expiration works: Modifying the linear scan to expire loop phi destination intervals at their actual last use (before loop extension) frees the register earlier. The incoming back-edge value naturally gets the freed register via the free pool. Since the linear scan’s slot_to_reg maps reflect both assignments from the start, the emitter handles transitions correctly.
- Pressure guard: When intervals.len() > allocatable_regs.len() * 2, early expiration is disabled. Under high pressure, freed phi registers get taken by unrelated values, causing reload traffic that outweighs the MoveReg savings.
- Phi copy no-op: When incoming_reg == phi_reg AND the register currently holds the incoming value (verified by is_alloc_reg_valid), the phi copy is skipped; just update alloc_reg_slot. The is_alloc_reg_valid check is critical: without it, a third value that overwrote the register between the incoming’s store and the phi copy would cause silent data corruption.
- store_to_slot safety: When storing to a slot whose allocated register currently holds a DIFFERENT dirty slot, spill the dirty value first. Prevents data loss when multiple slots share a register via early expiration.
- Impact: fib(20) -15.7% gas / -7.2% code, factorial -5.6% gas. No regressions.
Cross-Block Alloc State Propagation (Phase 11, 2026-03)
- Back-edge dominator propagation instead of clearing: At loop headers with unprocessed predecessor back-edges, instead of clearing all alloc_reg_slot entries, the dominator predecessor’s alloc state is propagated through set_alloc_reg_slot_filtered(). This avoids unnecessary reloads at loop entry for values that remain valid across the back-edge.
- Register class filtering for safety: Non-leaf functions only propagate callee-saved registers beyond max_call_args; these are the only registers guaranteed safe across all paths (never clobbered by calls). Caller-saved registers (r5-r8) are excluded because other paths may invalidate them. Leaf functions with lazy spill propagate all registers since no calls exist.
- Leaf+lazy_spill intersection: Multi-predecessor blocks in leaf functions with lazy spill now use the same intersection logic as non-leaf functions. Previously, leaf+lazy_spill blocks used define_label (clear all) at every block boundary. With the pred_map now available, the intersection approach keeps entries that all processed predecessors agree on.
- pred_map condition expanded: The predecessor map was previously built only for non-leaf functions. It is now built whenever has_regalloc && (!is_leaf || lazy_spill_enabled), enabling alloc state propagation for leaf functions with lazy spill.
- Impact: fib(20) -5.1% gas, factorial(10) -7.1% gas, is_prime(25) -4.6% gas, PiP aslan-fib -0.52% gas.
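The intersection and register-class filtering described above can be sketched with plain maps. The types and function names here (`AllocState`, `intersect_alloc_state`, `filter_regs`) are hypothetical stand-ins for the emitter's alloc_reg_slot bookkeeping, and the register-class predicate is supplied by the caller rather than derived from max_call_args.

```rust
use std::collections::BTreeMap;

type Reg = u8;
type Slot = i32;
/// Snapshot of "register currently holds slot" at the end of a block.
type AllocState = BTreeMap<Reg, Slot>;

/// Keep only the (reg, slot) pairs that every processed predecessor agrees on.
fn intersect_alloc_state(preds: &[AllocState]) -> AllocState {
    let (first, rest) = match preds.split_first() {
        Some(pair) => pair,
        None => return AllocState::new(),
    };
    first
        .iter()
        .filter(|&(reg, slot)| rest.iter().all(|p| p.get(reg) == Some(slot)))
        .map(|(&reg, &slot)| (reg, slot))
        .collect()
}

/// Drop entries for registers that are not safe to propagate
/// (for example, caller-saved registers in a non-leaf function).
fn filter_regs(state: &AllocState, keep: impl Fn(Reg) -> bool) -> AllocState {
    state
        .iter()
        .filter(|&(&reg, _)| keep(reg))
        .map(|(&reg, &slot)| (reg, slot))
        .collect()
}
```

An entry survives only if both the register and the slot match in every predecessor, so a disagreement on either side simply forces a reload in the successor.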
Callee-Saved Preference for Call-Spanning Intervals (Phase 12, 2026-03)
- Problem: The linear scan’s default free_regs.pop() behavior assigns callee-saved registers (added last to allocatable_regs) to the FIRST intervals processed. Call-spanning intervals, penalized by CALL_SPANNING_PENALTY, sort later and get caller-saved registers that are invalidated after every call, the opposite of what’s optimal.
- Solution: LiveInterval.spans_calls flag marks intervals whose live range contains at least one real call. In non-leaf functions, call-spanning intervals explicitly prefer callee-saved registers (r9-r12 beyond max_call_args), while non-call-spanning intervals prefer caller-saved (r5-r8). In leaf functions, all registers are equal (no preference applied). The preferred_reg hint (e.g., r7 for call return values) takes priority over the class preference.
- Impact: Modest; primarily benefits non-leaf functions with call-spanning values. anan-as PVM interpreter -0.2% code size. Most benchmarks are leaf-dominated.
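The selection priority described in the Solution bullet (hint first, then register class, then anything free) can be sketched as follows. `pick_register` and its parameters are hypothetical; in particular the callee-saved predicate is passed in by the caller, whereas the notes derive it from r9-r12 beyond max_call_args.

```rust
/// Pick a register from the free pool.
/// Priority: explicit hint, then class preference (non-leaf only), then any.
fn pick_register(
    free: &mut Vec<u8>,
    is_callee_saved: impl Fn(u8) -> bool,
    spans_calls: bool,
    is_leaf: bool,
    preferred: Option<u8>,
) -> Option<u8> {
    // 1. A preferred_reg hint wins if that register is still free.
    if let Some(p) = preferred {
        if let Some(i) = free.iter().position(|&r| r == p) {
            return Some(free.remove(i));
        }
    }
    // 2. In non-leaf functions, call-spanning intervals want callee-saved
    //    registers and all other intervals want caller-saved ones.
    if !is_leaf {
        if let Some(i) = free.iter().position(|&r| is_callee_saved(r) == spans_calls) {
            return Some(free.remove(i));
        }
    }
    // 3. Otherwise take whatever is left (None when the pool is empty).
    free.pop()
}
```

Falling through to step 3 keeps allocation working when the preferred class is exhausted, at the cost of the reloads the preference was meant to avoid.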
TEMP_RESULT Chain Coalescing (Phase 13, 2026-05)
- Problem: The dst-conflict fallback in load-side coalescing (Phase 8) was uniform: whenever an operand’s cached register equalled the consuming instruction’s dst, the lowering forced a fallback temp (TEMP1 or TEMP2), which the per-block cache then satisfied with MoveReg TEMP1, TEMP_RESULT. For chains of non-allocated results (each landing in TEMP_RESULT = r4), this emitted ~47k redundant r4 → r2 moves per polkadot runtime (67% of all MoveReg in glutton-kusama).
- Observation: PVM 3-operand instructions read src1/src2 before writing dst. So Add r4, r4, ? evaluates correctly even when src1 aliases dst. The conservative fallback is only necessary when dst is an allocated register; there, alias-with-source can trip invalidate_reg, the slot_cache, or lazy-spill bookkeeping.
- Solution: Route every dst-conflict check through apply_dst_conflict_fallback(op_reg, fallback, dst) (emitter.rs). When dst == TEMP_RESULT, the helper keeps the alias; otherwise it falls back as before. Threaded through 17 lowering sites in alu.rs, intrinsics.rs, memory.rs.
- Excluded: bitreverse (intrinsics.rs) emits LoadImm64 TEMP_RESULT, mask mid-sequence; relaxing the alias would clobber val_reg. The conservative fallback is preserved with an inline comment.
- Naturally excluded because they bypass operand_reg: lower_select, emit_pvm_memory_grow, lower_abs use load_operand directly.
- Cascade beyond MoveReg elimination: The targeted optimization eliminates the r4 → r2 MoveReg pattern (47k instances observed). Actual MoveReg reduction is 42,986 (70,141 → 27,155, -61%), slightly below the targeted 47k because some r4 → r2 instances were already covered by other paths. But total PVM instruction reduction is 50,476 (-4.02%), more than the MoveReg drop alone: eliminating each MoveReg also shortens the surrounding sequence, allowing the following block-boundary cache invalidation / Fallthrough / constant-load chain to shrink. JAM size: -1.97%.
- Impact on polkadot/glutton-kusama: JAM 6,573,304 → 6,444,138 bytes (-129 KB, -1.97%). Code 4,751,176 → 4,636,361 bytes (-2.42%). Full integration suite (465 tests) green; clippy clean.
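The decision made by the dst-conflict helper can be written down in a few lines. This is a guess at its shape based only on the description above, so it is named `dst_conflict_fallback_sketch` to keep it distinct from the real apply_dst_conflict_fallback in emitter.rs; the `u8` register type is an assumption, and the register numbers follow the notes (TEMP_RESULT = r4, TEMP1 = r2).

```rust
/// Register numbers as given in the notes.
const TEMP_RESULT: u8 = 4;
const TEMP1: u8 = 2;

/// Choose the source register for an operand given the instruction's dst.
/// Assumes a 3-operand instruction that reads its sources before writing dst.
fn dst_conflict_fallback_sketch(op_reg: u8, fallback: u8, dst: u8) -> u8 {
    if op_reg != dst {
        // No alias between operand and destination: use the operand as-is.
        return op_reg;
    }
    if dst == TEMP_RESULT {
        // Alias on the non-allocated temp is harmless (read-before-write),
        // so keep it and avoid a MoveReg into the fallback temp.
        return op_reg;
    }
    // dst is an allocated register: stay conservative and use the temp.
    fallback
}
```

For example, an operand cached in r4 feeding an instruction whose dst is also r4 stays in r4, while the same alias on an allocated register such as r9 still routes through TEMP1.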
Non-Leaf r5-r8 Allocation and load_operand Reload Bug (Phase 6, 2026-03)
- Removing the leaf-only restriction for r5-r8: Previously r5/r6 (allocate_scratch_regs) and r7/r8 (allocate_caller_saved_regs) were only available in leaf functions. Phase 6 makes them available in all functions. The existing non-leaf call lowering infrastructure (spill_allocated_regs before calls, clear_reg_cache after calls, lazy reload on next access) handles caller-saved register spill/reload automatically, so no new mechanism was needed.
- Removing the calls_in_loops gate: Previously, non-leaf functions with calls inside loop bodies were skipped entirely by the register allocator (the theory being that reload traffic outweighs savings). Phase 6 removes this restriction. The lazy spill + per-call-site arity-aware invalidation makes allocation beneficial even with calls in loops, since only registers actually clobbered by a specific call’s arity are invalidated rather than all registers.
- load_operand reload-into-allocated-register bug: When an allocated register is invalidated (e.g., after a call) and load_operand is asked to reload the value into a different target register (e.g., TEMP1 for a binary operation), the original code would reload into the allocated register first, then copy to the target. This is incorrect when the allocated register is being used for call argument setup: writing to the allocated register corrupts the argument being prepared. The fix: when the allocated register is invalidated and the target register differs, load directly from the stack into the target register, bypassing the allocated register entirely. This prevents corruption during call argument setup sequences where multiple allocated values are being moved into argument registers (r9, r10, etc.).
- r7/r8 invalidation after calls: The reload_allocated_regs_after_call_with_arity predicate was extended to also invalidate r7/r8 after calls (not just r9-r12), since r7/r8 are now allocatable in non-leaf functions and are always clobbered by call return values.
- Impact: 79 non-leaf functions now receive allocation in the anan-as compiler (up from 0), bringing the total to 205 out of 210 functions allocated.
Callee-Saved State Preservation After Calls — Not Feasible (2026-03)
Preserving alloc_reg_slot for callee-saved registers (r9–r12) across calls breaks because operand_reg() (load-side coalescing) returns the allocated register directly as a source operand for memory lowering. The memory lowering code may then use the same register as both source AND destination when adding wasm_memory_base for address computation, clobbering the preserved value. Selective invalidation in clear_reg_cache, snapshot/restore around the call, and guarding operand_reg were all tried; all fail through interactions between preserved alloc state and the general register cache. Deterministic failure mode (as-array-push-test): wrong base register r7 with shifted +12 offset after a call, producing result = 0 instead of 28. Same root cause as “Non-Leaf r7/r8 Allocation” below — both would need operand_reg() to distinguish “data operand” (safe) from “address base” (unsafe).
Per-Phi Early Expiration Guard — Not Feasible (2026-03)
Replacing the blanket pressure guard (intervals.len() > allocatable_regs.len() * 2) that disables all loop-phi early expiration with a per-phi check fails under both pressure regimes: high pressure (multiple failures + timeouts because intervening intervals steal the freed register even when the incoming-start condition holds) and low pressure (fib(20) +19.6% gas because the per-phi guard disables expiration for phis whose incoming value is defined inside the loop body). Root cause: early expiration + register reuse depends on the linear scan’s allocation order, which can’t be predicted during interval computation. A correct per-phi guard would require lookahead into allocation decisions, defeating the purpose. The blanket pressure threshold is a crude but effective proxy.
Non-Leaf r7/r8 Allocation — Not Feasible (2026-03)
Same root cause as “Callee-Saved State Preservation After Calls” above. The operand_reg() hazard: any allocated register that participates in an address calculation can be corrupted when the lowering code uses it as both base and destination for in-place arithmetic. Fixing this would require operand_reg() to distinguish “use as data operand” from “use as address base” — a non-trivial emitter rework.
Multi-Predecessor Cross-Block Cache Propagation — Zero Realized Impact (2026-05)
Extending single-predecessor cross-block cache forwarding to multi-predecessor blocks (intersect predecessor snapshots, invalidate phi destinations) was correct and tested green — but byte-identical to baseline on glutton-kusama and kusama. Three reasons it didn’t fire:
1. load_operand skips slot_cache for allocated values: cache lookup is gated behind regalloc.val_to_reg. With aggressive regalloc, virtually every live value is allocated, so reads route through alloc_reg_slot and never reach the cache. The propagation helps a code shape that barely exists in regalloc’d runtimes.
2. alloc_reg_slot intersection was already present: the existing all_processed branch in lower_function already does set_alloc_reg_slot_from(pred0) + intersect_alloc_reg_slot(rest). The “new” propagation re-derives the same end state.
3. Block layout prevents all_processed at most multi-pred merges: compute_block_layout is greedy fallthrough-biased, not topological. For a canonical if-else entry → {then, else} → join, layout is entry, else, join, then; at join’s emission then is unprocessed and propagation skips. Instrumented: 48,473 merge candidates on glutton-kusama, ~250 entries actually propagated function-wide.
Unblocking would require RPO emission (sacrifices fallthrough), loop-body register-liveness analysis, or a two-pass exit-snapshot dataflow pre-pass — all substantially larger than a localized tweak.
Hand-Crafted Blake2b WAT (2026-04)
WAT memarg attribute order: offset must come before align
Writing (i64.load align=1 offset=8 ...) fails to parse in this project’s WAT frontend with “unknown operator or unexpected token”. Writing (i64.load offset=8 align=1 ...) parses cleanly. The WebAssembly text format grammar defines a memarg as an optional offset followed by an optional align, so the parser is following the spec here rather than exhibiting a tooling quirk. If you’re hand-writing WAT and see unexplained parse errors on i64.load / i64.store with memargs, check the attribute order first. Example from tests/fixtures/wat/blake2b.jam.wat.
Gas/size characteristics of a typical cryptographic hash on PVM
For reference when sizing new crypto workloads on PVM:
- Blake2b (“abc”, 32 B output): JAM = 8269 B, PVM code = 3076 B, gas = 17,749, time ≈ 71 ms single-run.
- Blake2b (1024 B input, 32 B output): gas = 138,478 (~15k gas per 128-byte compression block, roughly 9 blocks).
- In PVM-in-PVM, the same 3-byte input costs ~16.7M outer gas — a ~944× multiplier over direct PVM execution, consistent with what other compute-heavy fixtures show.
Per-compression-block gas is dominated by the 12 rounds × 8 G calls × ~18 i64 ops. No specific compiler optimization was needed to land this — the default pipeline (mem2reg, instcombine, GVN, peephole, register allocation) produced a correct, reasonably compact output on the first run.
Output-pointer convention for fixtures: don’t rely on WASM offset 0
blake2b.jam.wat currently writes its hash output to WASM-relative offset 0 and returns (ptr=0, len=out_len). This works today because the WAT has no globals, no prologue, and no data segments below 0x80. But this is fragile — if a future compiler change puts anything at offset 0, the hash would be silently corrupted. When writing new fixtures, prefer an explicit offset ≥ 0x100 for output buffers. Retrofitting blake2b to this convention is a cheap follow-up but was not done in the initial PR since the tests cover the output end-to-end.
(if COND (then (unreachable))) guards can be silently eliminated
While adding invalid-out_len trap tests for blake2b, we discovered that a bare (if COND (then (unreachable))) guard can be elided by the LLVM-based compiler even when COND is a runtime value. The trap appeared to fire for some inputs (e.g. out_len=0 via i32.eqz) but not others (out_len > 64 via i32.gt_u). Adding any side-effecting instruction before unreachable — e.g. (i32.store8 ...) — restores the guard.
Mechanism (hypothesized): LLVM treats unreachable as a UB hint — “control never reaches here.” The optimizer can legally conclude “if this path is UB, then COND is always false” and delete the check entirely. Which specific patterns get eliminated depends on how instcombine / simplifycfg / GVN canonicalize the condition. i32.eqz apparently canonicalizes into a form the optimizer preserves; i32.gt_u into a form it doesn’t.
Workaround: Put at least one side-effecting operation in the then block. A sentinel store to an unused memory byte is sufficient:
(if (some-condition)
(then
(i32.store8 (i32.const 0x268) (i32.const 0xEE))
(unreachable)))
Runtime trap observation from anan-as / SPI mode: a trapped program exits with OS exit code 0 (not an error), prints STATUS = -1 in debug output, and produces an empty Result: [0x]. runJamBytes therefore does not throw on trap — it returns an empty Uint8Array. Test assertions for trap behavior should check result.length === 0 rather than expect(...).toThrow().
Follow-up: a proper compiler-level fix would be to mark unreachable as a true trap (non-UB) in the PVM lowering, or emit an explicit trap instruction that the optimizer can’t eliminate. Until then, the sentinel-store workaround is the portable fix for WAT-level fixtures.
anan-as SPI mode: transient “Run out of pages” failure under sustained test load
Under rapid back-to-back bun test runs at high iteration counts (e.g. SHA512_RANDOM_COUNT=1000), the anan-as PVM runtime in --spi mode occasionally prints:
Warning: Run out of pages! Allocating.
Unhandled host call: ecalli 0. Finishing.
and the test result comes back empty. The default iteration count (SHA512_RANDOM_COUNT=50) has not reproduced the failure. The same input hex that triggered it under bun test succeeded on 10/10 standalone node anan-as ... run invocations, ruling out any problem in the SHA-512 WAT or the test harness.
This is a non-deterministic issue in the anan-as runtime itself, not a PVM compiler bug or SHA-512 correctness issue. The runtime appears to run out of pre-allocated pages and then fails to service the resulting allocation host call (shown as ecalli 0) in --spi mode; the exact trigger is unclear but correlates with sustained rapid test-suite execution.
Repro (against the original SHA-512 WAT): seed 0x0123456789abcdef, iteration 9 (inputLen 14439), run under bun test layer3/sha512.test.ts with SHA512_RANDOM_COUNT=1000.
WAT-level mitigation that correlated with a fix in the SHA-512 case: copy the entire input from the PVM args region (args_ptr, at 0xFEFF0000) into WASM memory in one upfront memory.copy, then stream from there. The hot compress loop now reads only from the pre-allocated WASM region. After this change, 1000-iter run went from 999/1000 pass in 1023 s to 1000/1000 pass in 506 s. We have only the observed correlation — the exact trigger inside anan-as remains unclear — but the scattered args-region reads are a plausible contributor to both the failure and the wall-clock overhead, and consolidating them into one contiguous read is defensible on design grounds regardless. The +143 B JAM-size / ~4% gas cost is cheap for the apparent stability and speed gains.
The blake2b follow-up (see next section) gives a more mechanical explanation for the wall-clock component — misaligned cross-page u64 loads — which is very likely the same root cause.
Cross-PVM-page memory.copy reads from a misaligned source blow up gas
memory.copy’s word loop issues one LoadIndU64 per iteration. When the source address is 8-byte-aligned, each load sits entirely inside one PVM page (pages are 4 KB). When the source is misaligned, one u64 read per page straddles two pages, and that cross-page u64 read is extremely expensive in anan-as (orders of magnitude slower than aligned reads). A WAT that streams from the PVM args region (0xFEFF0000, always 4 KB-aligned) via a pointer like args_ptr + 1 (misaligned by 1) will run out of gas (OOG) well before finishing a 32 KB input; the inflection point is at about 4 KB, right where the first cross-page straddle happens.
Observed with tests/fixtures/wat/blake2b.jam.wat while raising its differential input cap from 2 KB to 32 KB (issue #197):
- Original [out_len: u8][input: bytes] format placed the input at args_ptr + 1: misaligned, cliff at ~4 KB inputs.
- Any WAT-level “copy-into-WASM-memory-first” fix had to keep both the bulk copy’s source and destination 8-byte-aligned, or the same cross-page cost reappeared during the upfront copy (only now hidden from the naïve “stream from WASM memory” mental model).
- Final shape: pad the header to 8 bytes ([out_len: u8][7 zero bytes][input: bytes]). With args_ptr always 4 KB-aligned and the destination at 0x1000, the bulk copy is fully aligned, the input lands at args_ptr + 8 (still aligned), and data_ptr = 0x1008 keeps every downstream 128-byte stream copy aligned in WASM memory. The test harness only sees the new 8-byte args envelope via encodeBlake2bArgs().
- Gas at 32 KB went from “OOG past 1 B gas” to ~4.6 M gas. Linear scaling restored.
Heuristic for new WAT fixtures that read args in bulk: make the input portion of args start at an 8-byte-aligned offset from args_ptr (either by having no prefix, like SHA-512, or by padding any prefix out to 8 bytes, like the blake2b fix above). Keeping every downstream data_ptr / stream-memory.copy source 8-byte-aligned avoids the cliff regardless of which page of the args region the tail falls in.
The SHA-512 WAT happens to have no prefix (input starts at args_ptr + 0), which is why the earlier SHA-512 fix was sufficient for that fixture — it stayed aligned by accident of format. Blake2b needed the padding change to benefit from the same pattern.
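The alignment cliff can be quantified with a small model that counts how many 8-byte loads of a word-wise copy straddle a 4 KB page boundary. This is only an arithmetic sketch: it assumes one u64 load per 8 source bytes, as described above, and says nothing about the actual per-load cost in anan-as. `cross_page_loads` and `pad_to_8` are hypothetical helper names.

```rust
/// PVM page size from the notes (4 KB).
const PAGE_SIZE: u64 = 4096;

/// Number of 8-byte loads, starting at `src` and stepping by 8 bytes,
/// whose first and last byte fall in different pages.
fn cross_page_loads(src: u64, len: u64) -> u64 {
    let words = len / 8;
    (0..words)
        .filter(|&i| {
            let addr = src + i * 8;
            addr / PAGE_SIZE != (addr + 7) / PAGE_SIZE
        })
        .count() as u64
}

/// Round a header length up to the next multiple of 8 so that the payload
/// following it starts at an 8-byte-aligned offset.
fn pad_to_8(header_len: u64) -> u64 {
    (header_len + 7) & !7
}
```

With the args region at 0xFEFF0000, a source of args_ptr + 1 yields no straddling load for a 2 KB input, one for a 4 KB input, and one per page after that, while args_ptr + 8 yields none at any size.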
Compilation Reproducibility (2026-04)
The compiler must produce byte-identical JAM output for the same WASM input across invocations. Three subtle traps were hit and fixed; keep all of them in mind when adding code to the backend.
Trap 1: HashMap/HashSet iteration order is process-randomised
Rust’s default HashMap/HashSet use a per-process-randomised hasher, so iteration order changes between CLI invocations. Any iteration whose side effects reach the emitted bytes (emitting an instruction, assigning a register/offset, mutating state read by the next iteration) leaks that randomness. The mitigation is the AGENTS.md rule: prefer BTreeMap/BTreeSet throughout; for keys whose natural type has no Ord (inkwell SSA values and basic blocks), wrap with a per-function insertion-order ID — ValKey/BbKey in llvm_backend::emitter.
Trap 2: ValKey originally wrapped a raw LLVM pointer
ValKey used to wrap Value::as_value_ref() as usize — the raw LLVM pointer. LLVM allocates different Value subclasses (e.g. Argument, InstructionValue) from separate arenas at independent ASLR-randomised base addresses, so the derived Ord was pointer-address order: a BTreeMap<ValKey, _> iterated in pointer order, which flipped between process invocations whenever entries came from different arenas.
Where this bit us: compute_live_intervals iterated value_slots: BTreeMap<ValKey, i32> directly and then pushed intervals into a Vec in that order. The downstream linear scan is stable-sorted by (start, spill_weight); ties fell back to input order, which meant pointer order, which meant non-deterministic register assignments under aggressive allocation (more ties at min_uses=1).
The fix (issue #204) replaces the raw pointer with an insertion-order ID. A per-function ValKeyCache on PvmEmitter maps the LLVM pointer to a monotonically-increasing u32 the first time the value is observed during IR walking; subsequent observations return the same ID. Because the IR-walking order (pre_scan_function + regalloc linearisation) is deterministic, the IDs are too — BTreeMap<ValKey, _> iteration is now reproducible across runs by construction, no derived-key sort required.
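The insertion-order ID scheme can be sketched independently of inkwell. In this minimal version a `usize` stands in for the raw LLVM pointer, and `ValKey` / `ValKeyCache` reuse the names from the notes purely for readability; the field layout and the `intern` method name are assumptions, not the emitter's actual API.

```rust
use std::collections::BTreeMap;

/// Ordered key: a per-function insertion-order ID, not a pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ValKey(u32);

/// Maps a raw pointer to the ID assigned when it was first observed.
#[derive(Default)]
struct ValKeyCache {
    // Used for point lookups only and never iterated, so the
    // address-based ordering of this map cannot leak into the output.
    ids: BTreeMap<usize, ValKey>,
    next: u32,
}

impl ValKeyCache {
    /// Return the existing ID for `raw_ptr`, or assign the next one.
    fn intern(&mut self, raw_ptr: usize) -> ValKey {
        if let Some(&key) = self.ids.get(&raw_ptr) {
            return key;
        }
        let key = ValKey(self.next);
        self.next += 1;
        self.ids.insert(raw_ptr, key);
        key
    }
}
```

Because IDs are handed out in first-observation order, a map keyed by `ValKey` iterates in IR-walk order no matter where the underlying values were allocated.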
Trap 3: Order-dependent loops over HashMap<BasicBlock, _>
Most HashMap<BasicBlock, _> iteration sites in the backend are commutative (e.g. end = end.max(...) across loop headers, depths[i] += 1 across positions), so the carve-out for BasicBlock keys (which lack Ord) was considered safe. One site, however, was not commutative: the live-interval extension loop reads end in its predicate and mutates end in its body, so iteration N+1’s predicate depends on iteration N’s effect. The immediate fix returned Vec<(BasicBlock, usize)> sorted by header position from detect_loop_headers. Issue #205 then removed the carve-out entirely by mirroring the ValKey pattern for blocks: BbKey is a per-function insertion-order ID, every backend BasicBlock-keyed map is now BTreeMap<BbKey, _>, and the sort in detect_loop_headers is preserved only because BbKey order = first-IR-walk-intern order ≠ block-emission order.
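A loop whose predicate reads the value its body mutates gives different answers for different iteration orders, which is easy to show on plain integers. The predicate below (an interval is extended when it starts before a loop header and is still live at it) is a hypothetical stand-in for the real live-interval rule; only the read-then-mutate shape is taken from the notes.

```rust
/// Extend an interval's end across loops, visiting `loops` in the given order.
/// Each loop is (header_pos, loop_end_pos).
fn extend_interval(start: usize, mut end: usize, loops: &[(usize, usize)]) -> usize {
    for &(header, loop_end) in loops {
        // The predicate reads `end`, and the body mutates `end`, so a later
        // iteration can depend on an earlier one.
        if start < header && end >= header {
            end = end.max(loop_end);
        }
    }
    end
}

/// Deterministic variant: always visit loops sorted by header position.
fn extend_interval_sorted(start: usize, end: usize, loops: &[(usize, usize)]) -> usize {
    let mut sorted = loops.to_vec();
    sorted.sort_by_key(|&(header, _)| header);
    extend_interval(start, end, &sorted)
}
```

With loops (5, 10) and (8, 20) and an interval 0..6, visiting the first loop first extends the interval to 20, while the reverse order stops at 10; the sorted variant returns 20 for either input order.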
Detection
tests/utils/check-determinism.sh compiles a diverse set of fixtures N times in separate processes and diffs the output. A single-process cargo test cannot catch these traps because the HashMap hasher seed and the LLVM arena addresses are both fixed for the lifetime of one process. The script is wired into the integration CI job.
Trap-Floats Lowering — Don’t Set unreachable = true, and Use @llvm.trap
--trap-floats replaces every f32/f64 operator with an LLVM-level trap (PVM backend lowers to Trap). Two non-obvious traps to avoid in the implementation:
Trap A: setting self.unreachable = true after the float trap
The naive implementation is “emit the trap, set self.unreachable = true, push placeholder zeros for the operator’s outputs.” This is wrong on two counts:
- The placeholder zeros are never consumed. The dead-code skip path at the top of translate_operator returns Ok(()) for every non-control-flow op when self.unreachable is true, including any future op that would have consumed those zeros. Pushing them is dead work.
- Function-result phis end up with no incoming branches. The function-end implicit Block frame’s End handler skips the “pop result, branch to merge” path when self.unreachable is true. If the only path through the body trapped, the result phi at fn_return has zero incoming edges → LLVM verifier rejects the module. The same hazard applies to if-arm phis when both arms trap.
The correct lowering: emit unreachable, create a fresh after_float_trap basic block, position there, pop the operator’s inputs from the operand stack, push i64 0 placeholders for its outputs, and leave self.unreachable alone. Subsequent ops translate normally into the (provably-dead) block; End handlers run their reachable branch and add a placeholder-zero incoming to the merge phi; LLVM’s dce collapses the unreachable region away. Result: valid IR + correct runtime trap + no special-case handling for trap-floats in any other translator path.
The investigation cost was non-trivial — the broken phi only manifests when both arms of a structured construct trap, which is a rare pattern in the unit tests but common in trap-floats mode (entire float-heavy functions trap on the first const). The integration test trap_floats_inside_if_arm_compiles pins this down.
self.unreachable keeps its original meaning: “WASM operand-stack-aware dead code following an explicit unreachable/return/br operator.” The trap-floats lowering produces LLVM-level dead code, not WASM-level dead code, and the two abstractions must not be conflated.
Trap B: bare unreachable is folded by simplifycfg as UB
The first working version emitted only build_unreachable() (no @llvm.trap call). Tests verified compilation succeeded, but a runtime-execution test caught the real bug: floats inside an if-arm vanished. anan-as reported Status: 0 (clean halt) on the trap path because LLVM’s simplifycfg folds branches whose only path leads to unreachable — it treats unreachable as “this code is impossible; the condition must steer away from it” and rewrites the conditional branch to always take the other arm. Float-only else-bodies were silently deleted; the JAM ran the then-arm regardless of the condition.
The fix: emit @llvm.trap() (a real intrinsic call) followed by build_unreachable(). @llvm.trap is noreturn but not UB-on-reach — the optimizer treats it as a side-effecting call and preserves it. The PVM backend gains a dedicated case in lower_llvm_intrinsic that emits Instruction::Trap. The bare unreachable after the call is fine (it’s now redundant but lets the verifier see the BB has a terminator).
Detection lesson: a pure compilation test can’t catch this. The Rust integration tests all checked “JAM compiles and contains a Trap instruction” — which was true (the entry-header trap is always present). Only running the JAM through anan-as with both branch inputs and asserting Status: 1 on the trap path exposed the elimination. The bun layer1 test trap-floats.test.ts is the regression guard.
Loop End Must Preserve unreachable When the Body Has No Fall-Through (2026-05)
The only path into a loop’s merge_bb is the fall-through branch from the body — br N targeting a Loop jumps to the header, never to the merge. So when the body ends in unreachable state (e.g. loop { return …; br 0 }), merge_bb is left with zero predecessors, and post-loop code is physically dead.
The original ControlFrame::Loop End handler unconditionally reset self.unreachable = false, which broke this invariant: subsequent operators were translated as if reachable, even though their only path was through an empty merge_bb. In the polkadot-fellows v2.2.2 hashbrown insert (surfaced once --trap-floats lets us reach it), this caused the function-level End to call pop() on an empty operand stack and fail with Internal error: operand stack underflow.
The fix in function_builder.rs::translate_operator is two parts. (1) Loop’s End now mirrors the body’s fall-through: keep self.unreachable = true when the body didn’t fall through, and terminate the empty merge_bb with build_unreachable() so the LLVM verifier still accepts it. Just toggling the flag without the terminator trips Basic Block in function 'X' does not have terminator!. (2) The dead-code dispatcher’s “dummy” Block/If frames reuse the current — already terminated — block as merge_bb (and else_bb); their matching End/Else handlers must detect this via merge_bb.get_terminator().is_some() and skip the position_at_end/unreachable=false reset, otherwise the bug returns one nesting level out (a downstream operator emits past a terminator, or worse, the function-level End again sees a stale unreachable=false).
Why both fixes are needed together: with only fix (1), an inner construct (e.g. another loop (result T)) appearing after the unreachable loop becomes a dummy frame; its End handler still flipped unreachable=false, re-creating the same underflow at function-level End. The Rust test loop_unreachable_end.rs::unreachable_loop_followed_by_result_loop_compiles exercises both paths simultaneously and is the regression guard.
Validation note: the WASM validator does not propagate the loop body’s unreachable state into the surrounding scope — pop_ctrl() pushes the frame’s end_types onto the outer operand stack regardless of inner unreachability. So the most-minimal loop { return; br 0 } end_function shape is rejected upstream by wasmparser::validate. The bug only surfaces when the post-loop region is well-typed for the validator (e.g. a trailing unreachable, or a follow-up construct that pushes the function’s result type) but the compiler’s own unreachable tracking has been corrupted.
LLVM freeze Lowers to a Value Passthrough (#218)
LLVM’s freeze instruction takes a value that may be poison/undef and converts it into “some specific bit pattern, but we don’t say which” — operationally a no-op on a concrete value. Our LLVM optimizer occasionally emits it (instcombine sinking poison-carrying ops past branches; observed on polkadot-fellows v2.2.2 glutton-kusama_runtime and encointer-kusama_runtime under --trap-floats).
By the time IR reaches the PVM backend, every value is a concrete i64 in a stack slot — there is no poison/undef representation. So freeze is implemented as a value passthrough: take the operand, materialize it into the result slot. The arm sits next to Phi in lower_instruction (llvm_backend/mod.rs) and uses load-side coalescing — when the operand is already in an allocated register, store_to_slot writes from that register directly.
Two pieces are required for the lowering to work end-to-end:
- The match arm in lower_instruction (the visible fix).
- Freeze listed in instruction_produces_value (llvm_backend/emitter.rs). The pre-scan walks every block and allocates a stack slot for any instruction whose result is consumed downstream; without Freeze in the producer set, result_slot() later returns Error::Internal("no slot for Freeze result"). Easy to miss: the lower_instruction arm compiles cleanly without it and the passthrough is well-defined, so the failure only surfaces when the test actually runs.
Testing strategy: triggering freeze reliably from a small WAT input is hard. WASM produces no poison itself, our frontend never adds nsw/nuw flags, and the optimizer passes we run (mem2reg, instcombine, simplifycfg, gvn, dce) only emit freeze for specific shapes that don’t reduce to small fixtures. The regression test in llvm_backend::tests::freeze_is_lowered_as_passthrough parses hand-written LLVM IR text via Context::create_module_from_ir (inkwell 0.8 doesn’t expose build_freeze) and runs it through lower_function directly with a minimal LoweringContext. This bypasses the LLVM-version-dependent question of “what input emits freeze” and pins down the lowering arm directly.
Saturating-arithmetic intrinsic lowering (#217)
Lowering llvm.{u,s}{add,sub}.sat.iN splits cleanly by width:
- Narrow widths (i8/i16/i32): clamp via wider arithmetic:
  - Unsigned: zero-extend operands, do 64-bit add/subtract (which cannot overflow because both operands fit in 32 bits), then MinU (uadd) or branch + CmovNzImm dst, cond, 0 (usub) to saturate. Result is naturally zero-extended.
  - Signed: sign-extend operands (SignExtend8/SignExtend16 or AddImm32 _, _, 0 for i32), do 64-bit add/subtract (true result fits in i64 because two iN values differ/sum to at most 2^(N+1)), then clamp to [INT_MIN, INT_MAX] via signed Max/Min. Result is naturally sign-extended.
- i64: no wider register, must detect overflow in-place:
  - Unsigned: Add64, then test sum < a (unsigned) for wrap; CmovNz saturates to UINT64_MAX.
  - Signed: Hacker’s Delight: overflow flag is bit 63 of (a^b) & (a^sum) (sub) or (a^sum) & (b^sum) (add). SharRImm64 by 63 extracts the flag as 0 or -1; saturation value INT_MIN/INT_MAX is built from sign(a) XOR INT_MAX. The signed i64 paths use SCRATCH1/SCRATCH2 and bracket the sequence with spill_allocated_regs + reload_allocated_regs_after_scratch_clobber (same compromise as non-rotation fshl/fshr).
The narrow paths are 5-7 instructions; i64 paths are 4 (uadd) / 3 (usub) / 10 (ssub/sadd). All paths use result_reg-driven store-side coalescing so the final saturated value lands directly in the register-allocated destination.
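The arithmetic behind both strategies can be checked in plain Rust. This is a sketch of the math only, with hypothetical function names; it does not model PVM registers or the actual instruction sequences:

```rust
/// Narrow unsigned add: widen to 64 bits, add, clamp with an unsigned min.
pub fn uadd_sat_i32(a: u32, b: u32) -> u32 {
    let wide = u64::from(a) + u64::from(b);
    wide.min(u64::from(u32::MAX)) as u32
}

/// Narrow unsigned sub: widen, subtract, replace a negative result with 0.
pub fn usub_sat_i32(a: u32, b: u32) -> u32 {
    let wide = i64::from(a) - i64::from(b);
    if wide < 0 { 0 } else { wide as u32 }
}

/// Narrow signed add: sign-extend, add in 64 bits, clamp to the i32 range.
pub fn sadd_sat_i32(a: i32, b: i32) -> i32 {
    let wide = i64::from(a) + i64::from(b);
    wide.clamp(i64::from(i32::MIN), i64::from(i32::MAX)) as i32
}

/// i64 unsigned add: a wrapped sum is smaller than either operand.
pub fn uadd_sat_i64(a: u64, b: u64) -> u64 {
    let sum = a.wrapping_add(b);
    if sum < a { u64::MAX } else { sum }
}

/// i64 signed add: overflow iff both operands differ in sign from the sum.
pub fn sadd_sat_i64(a: i64, b: i64) -> i64 {
    let sum = a.wrapping_add(b);
    let overflow = ((a ^ sum) & (b ^ sum)) >> 63; // arithmetic shift: 0 or -1
    let sat = (a >> 63) ^ i64::MAX; // INT_MAX if a >= 0, INT_MIN otherwise
    (sum & !overflow) | (sat & overflow)
}

/// i64 signed sub: overflow iff operands differ in sign and the result's
/// sign differs from `a`.
pub fn ssub_sat_i64(a: i64, b: i64) -> i64 {
    let diff = a.wrapping_sub(b);
    let overflow = ((a ^ b) & (a ^ diff)) >> 63;
    let sat = (a >> 63) ^ i64::MAX;
    (diff & !overflow) | (sat & overflow)
}
```

The mask-based select in the signed i64 helpers mirrors the 0 / -1 flag described above; a real lowering may use a conditional move instead.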
Critical: avoid TEMP_RESULT clobber after dst is written. result_reg may return TEMP_RESULT (r4) when no allocated register is available. After Add64 dst, ... (or Sub64), any subsequent LoadImm TEMP_RESULT, ... would overwrite the sum/difference. The narrow-width sat helpers therefore load constants into TEMP1 (which is dead after Add/Sub), not TEMP_RESULT. The bug surfaced under register pressure in the layer3 fixture; it doesn’t show up in small unit tests where result_reg returns an allocated register.
Test coverage limitation: WAT-driven tests for narrow-width and signed sat intrinsics only fold to @llvm.{u,s}{add,sub}.sat.i64 (not the narrow widths) because LLVM 18 instcombine doesn’t fold the canonical clamp shape through outer zext/sext to i32. The narrow-width and signed-narrow backend paths are present and correct algorithmically, exercised by real-world Rust IR (verified via the polkadot-fellows v2.2.2 runtime smoke check). The dump_llvm_ir test-harness helper exposes the post-pass IR so unit tests can assert which intrinsics were folded.
Phi-Copy Resolution: Slot-Based Parallel Moves (#219)
The original phi-copy lowering snapshotted every incoming value into a distinct temp register (TEMP1, TEMP2, TEMP_RESULT, SCRATCH1, SCRATCH2 — five slots) and then wrote them all to their destinations, bailing with Unsupported("too many phi values for available temp registers") whenever a join block produced more than five copies on a single edge. The shape is rare in MVP-style code but appears reliably in the largest polkadot-fellows runtimes (asset-hub-{kusama,polkadot}, bridge-hub-polkadot) when compiled with --trap-floats.
The fix replaces the bail with a slot-based parallel-move resolver in llvm_backend/control_flow.rs::emit_phi_copies_via_slots. Key design points:
- Canonical state on the stack. spill_all_dirty_regs() runs first, so each value’s authoritative copy lives at its allocated slot. The resolver reads/writes slots directly with LoadIndU64/StoreIndU64 and never depends on register-cache state.
- Constants are detached from the dependency graph. A phi whose incoming value is a constant has no source slot, so it cannot participate in a cycle. Constants are emitted after the slot-to-slot moves with LoadImm + StoreIndU64. If the constant-copy destination happens to be another phi’s source, the slot reads have already happened, so the order is sound.
- Topological pass for the easy case. A copy whose destination slot isn’t anyone else’s source can fire immediately (2 instructions: load via TEMP1, store). Real-world phi shapes, even on hot blocks in large runtimes, are dominated by this case.
- Single-temp cycle handling for the hard case. Remaining copies form one or more disjoint permutation cycles. For each cycle (d_0, s_0) … (d_{k-1}, s_{k-1}) (closed when s_{k-1} == d_0), the resolver: saves slot d_0 to TEMP1, walks copies 0..k-1 via TEMP2 (2 instructions each), then finalizes the last write from TEMP1. Total 2k PVM instructions per cycle, the same as the old temp-snapshot path used to cost when it didn’t bail. Two temp registers are enough for arbitrary cycle length.
- Cache invalidation after every direct slot store. Each raw StoreIndU64 to a phi destination calls PvmEmitter::invalidate_cache_for_slot, which drops the general slot_cache entry and clears any alloc_reg_slot[r] == Some(slot) mapping. Without this, later operand_reg/load_operand calls in the same block could believe an allocated register still holds the (now stale) old value of the destination slot.
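The topological pass and the cycle walk described above can be sketched on a plain array of slots. This is an illustrative model under stated assumptions: slots are u64 cells indexed by usize, destinations are distinct, constants and cache invalidation are left out, and resolve_parallel_moves is a hypothetical name rather than the real emit_phi_copies_via_slots:

```rust
use std::collections::BTreeMap;

/// Applies `copies` (each `(dst, src)`) to `slots` with parallel-move
/// semantics: every source is read before any destination is overwritten.
/// Assumes destinations are distinct, as they are for phis on one edge.
pub fn resolve_parallel_moves(slots: &mut [u64], copies: &[(usize, usize)]) {
    // Self-copies are no-ops and would otherwise look like 1-cycles.
    let mut pending: Vec<(usize, usize)> =
        copies.iter().copied().filter(|&(d, s)| d != s).collect();

    // Topological pass: a copy whose destination no pending copy still
    // reads can fire immediately.
    loop {
        let ready = pending
            .iter()
            .position(|&(d, _)| !pending.iter().any(|&(_, s)| s == d));
        let Some(i) = ready else { break };
        let (d, s) = pending.swap_remove(i);
        slots[d] = slots[s];
    }

    // What remains is a set of disjoint permutation cycles.
    let mut by_dst: BTreeMap<usize, usize> = pending.into_iter().collect();
    while let Some((d0, s0)) = by_dst.pop_first() {
        let saved = slots[d0]; // plays the role of TEMP1
        let (mut d, mut s) = (d0, s0);
        while s != d0 {
            slots[d] = slots[s]; // one load + one store via TEMP2
            d = s;
            s = by_dst.remove(&d).expect("remaining copies form closed cycles");
        }
        slots[d] = saved; // final write closes the cycle
    }
}
```

For a swap, `resolve_parallel_moves(&mut [1, 2], &[(0, 1), (1, 0)])` leaves the slots as `[2, 1]`: one saved value, one direct move, one closing write.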
The two existing fast paths (≤5 copies) are kept verbatim: the regression risk is concentrated entirely in the new fallback, and benchmarks show zero gas/size delta across the standard benchmark suite (no benchmark hits the >5 threshold).
Why a stack-only resolver, not a register-based one? The regaware (lazy-spill) phi path could in principle resolve cycles in registers (it already discovers per-copy incoming_reg/phi_reg allocations). But once the fallback triggers, the active set is large enough that the dependency graph cuts across both register- and stack-only copies; the cleanest correctness story is to drop into a uniform slot-based representation. The resolver invalidates alloc_reg_slot for every destination slot it writes, so the next access through load_operand reloads from the canonical stack copy — no special-casing needed.
The loop-header swap as the canonical cycle. The motivating cycle shape comes from loops whose header contains multiple phis that reference each other on the back-edge, e.g.
header:
%a = phi [%init1, %entry], [%b, %latch]
%b = phi [%init2, %entry], [%a, %latch]
On the body→header edge this becomes two simultaneous copies — a.slot ← b.slot and b.slot ← a.slot — a 2-cycle. The test many_phi_values_with_loop_cycle_compiles (in crates/wasm-pvm/tests/phi_many_values.rs) drives a 6-cycle through this pattern.
O(N²) Byte-Size Scans Blocked Real-World Compilation (#225)
Once #214/#215/#217/#218/#224 closed every correctness gap that had been bailing the backend early on Polkadot runtimes, compilation finally reached translate/mod.rs::compile_via_llvm’s emission loop and resolve_call_fixups — and hung at 99% CPU past 10 minutes on the smallest 2 MiB runtime. Per-pass timing showed all LLVM passes finishing in ~2 s and per-function PVM backend lowering in ~1.6 s across 1631 functions; the missing minutes were spent in two adjacent O(N²) shapes neither of which had ever been exercised on a multi-MB module before.
The bug. Both loops computed instruction byte offsets the same way:
// Emission loop, per function:
let func_start_offset: usize = all_instructions.iter().map(|i| i.encode().len()).sum();
function_offsets[local_func_idx] = func_start_offset;
// ...
all_instructions.extend(translation.instructions);

// resolve_call_fixups, per direct + indirect call:
let return_addr_offset: usize = instructions[..=jump_idx]
    .iter().map(|i| i.encode().len()).sum();
let jump_start_offset: usize = instructions[..jump_idx]
    .iter().map(|i| i.encode().len()).sum();
Each invocation re-summed every preceding instruction’s encoded byte length. For F functions / C call sites / M total instructions, the work is O(F × M) + O(C × M). glutton-kusama_runtime lands at F=1631, C≈20 000, M≈1.5 M — roughly 3 × 10¹⁰ allocating encode() calls total. Instruction::encode() returns a fresh Vec<u8> whose only consumer was .len(), so the cost was 30 billion small Vec allocations on top of the arithmetic.
The shape had been latent for as long as the emission loop and the fixup resolver have existed. It went unnoticed because the backend used to fail early on real-world modules — every Polkadot runtime hit either bitreverse, usub.sat, freeze, or a “too many phi values” bail before reaching the offset-computation hot path.
The fix. Two O(N+M) replacements in translate/mod.rs:
- Emission loop: maintain a running current_code_bytes: usize seeded from the entry header (which is pushed before the loop), update it by summing only the newly appended slice after each function is lowered, and use it directly for function_offsets[local_func_idx].
- resolve_call_fixups: compute a byte_prefix: Vec<usize> once at function entry, with byte_prefix[i] = sum(instructions[0..i].encode().len()). Each fixup then reads byte_prefix[jump_idx] / byte_prefix[jump_idx + 1] directly.
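Both replacements reduce to a running sum. A minimal sketch, assuming a hypothetical Instr type that carries only its encoded length (the real code calls Instruction::encode().len()):

```rust
/// Hypothetical stand-in for an instruction; only its encoded size matters.
pub struct Instr {
    pub encoded_len: usize,
}

/// Test helper: builds instructions from a list of encoded lengths.
pub fn instrs(lens: &[usize]) -> Vec<Instr> {
    lens.iter().map(|&encoded_len| Instr { encoded_len }).collect()
}

/// prefix[i] is the byte offset at which instruction i starts;
/// prefix[len] is the total code size. Computed once, O(M).
pub fn byte_prefix(code: &[Instr]) -> Vec<usize> {
    let mut prefix = Vec::with_capacity(code.len() + 1);
    let mut total = 0usize;
    prefix.push(total);
    for instr in code {
        total += instr.encoded_len;
        prefix.push(total);
    }
    prefix
}

/// Start offset of each function, using a running counter seeded from the
/// entry header instead of re-summing everything emitted so far.
pub fn function_offsets(header: &[Instr], functions: &[Vec<Instr>]) -> Vec<usize> {
    let mut current_code_bytes: usize = header.iter().map(|i| i.encoded_len).sum();
    let mut offsets = Vec::with_capacity(functions.len());
    for body in functions {
        offsets.push(current_code_bytes);
        current_code_bytes += body.iter().map(|i| i.encoded_len).sum::<usize>();
    }
    offsets
}
```

For a call whose jump sits at index jump_idx, the jump's start offset is `prefix[jump_idx]` and the return address is `prefix[jump_idx + 1]`, each an O(1) lookup.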
Why the prefix sum stays valid through patching. The fixup loop patches LoadImmJump.offset (per encode_one_reg_one_imm_one_off, always a fixed 4-byte little-endian field — bytes.extend_from_slice(&offset.to_le_bytes())) and, after the loop returns, the entry-header Jump.offset (per Self::Jump { offset }, also to_le_bytes() so 4 bytes). Neither patch changes the instruction’s encoded length, so a prefix sum computed once at the top of resolve_call_fixups is safe to use throughout.
This is not true of encode_imm (used for plain LoadImm, JumpInd, AddImm32, etc.) which produces 0–4 bytes depending on the immediate’s magnitude — but those instructions aren’t patched anywhere in compile_via_llvm once emitted, so they stay constant from the prefix-sum computation onwards.
Verified-safe seeding. The emission loop pushes 2 entry-header instructions (one Jump + either another Jump or Trap) before iterating, so current_code_bytes is initialized from all_instructions.iter().map(|i| i.encode().len()).sum(), paying the one-time cost across exactly those two entries. An early version of the fix forgot this seed and started from 0; it passed glutton but broke test_branch_fixup_resolution (crates/wasm-pvm/tests/emitter_unit.rs:194-220), which compiles a single-if function where main is emitted first and the entry-header Jump.offset ends up at zero. That fast in-flight regression catch is why the test was worth keeping.
Result on glutton (2.04 MiB WASM, 1631 functions): compile time drops from >10 min (hard timeout, never finished) to ~4 s — ≥150× speedup. All 14 polkadot-fellows v2.2.2 runtimes now compile in 4:26 wall-clock total. Standard benchmark JAM/code/gas numbers are byte-identical across main and the fix (verified by md5sum), since this change is purely compile-time.
Libcall Recognition for __multi3 / __udivti3 (2026-05)
WASM has no i128 type, so rustc for wasm32-unknown-unknown lowers every 128-bit operation to a call into the compiler-builtins runtime, which it bakes into each binary. The two workhorses are __multi3 (i128 × i128 → i128, ~110 bytes WASM body of Knuth-style i64 partial products) and __udivti3 (u128 / u128 → u128, a thin wrapper over specialized_div_rem, ~1100 bytes total). Every (a as u128) * (b as u128), (a as u128) / (b as u128), and the *_hi helpers route through these.
After our LLVM optimization passes (with inline_threshold = Some(5)) these stay as separate functions — their body sizes far exceed the threshold so they’re marked noinline and the call sites remain visible as call wasm_func_N(sret, a_lo, a_hi, b_lo, b_hi). That gave us a clean intercept point.
Recognition is name-based. During WasmModule::parse we scan the local-function name table (from the WASM custom name section, falling back to exports), match against __multi3 / __udivti3, verify the signature is exactly (i32 sret, i64 a_lo, i64 a_hi, i64 b_lo, i64 b_hi) → void (in our i64-uniform IR: 5 i64 params, no return), and for __udivti3 additionally walk the body for its first Call (the slow-path callee) and first GlobalGet (the __stack_pointer global). Both are required for the synthesized body to have a working slow path; without them recognition silently no-ops. The signature gate prevents a user function that happens to share a reserved-by-ABI name from being silently mis-translated.
Why not IR pattern matching. Naive IR pattern matching on call sites would catch the post-inline case (when someone bumps --inline-threshold past the body size), but is fragile across rustc versions: different toolchain releases shuffle the Knuth-expansion shape and a matcher tuned for rustc 1.85 silently stops matching on 1.86. Name-based body replacement is robust as long as compiler-builtins keeps these reserved names, which is part of the C/Rust ABI.
__multi3 body (8 PVM instructions). For a × b mod 2^128 where a = a_lo + 2^64·a_hi and similarly for b:
low_half = a_lo × b_lo (Mul64)
high_half = upper64(a_lo × b_lo) + (a_lo × b_hi) + (a_hi × b_lo) (MulUpperUU + 2×Mul64 + 2×Add64)
All operations are mod 2^64, which conveniently provides the i128 sign correction: when callers pass sign-extended high halves ((a as i64) >> 63 = all-ones or all-zeros), (-1) × b_lo = -b_lo is exactly the correction term needed to convert the unsigned upper half into the signed upper half. So MulUpperUU (opcode 214) is sufficient — we don’t need MulUpperSS / MulUpperSU.
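The identity is easy to check against Rust's native 128-bit multiply. In this sketch the function name multi3 and the tuple return are illustrative; the widening multiply stands in for MulUpperUU:

```rust
/// 128-bit multiply mod 2^128 from 64-bit halves.
/// Returns (low_half, high_half).
pub fn multi3(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
    // Low 64 bits of a_lo * b_lo (Mul64).
    let low = a_lo.wrapping_mul(b_lo);
    // Upper 64 bits of the unsigned 64x64 product (stand-in for MulUpperUU).
    let upper = ((u128::from(a_lo) * u128::from(b_lo)) >> 64) as u64;
    // Cross terms only contribute their low 64 bits to the high half;
    // a_hi * b_hi is shifted out entirely.
    let high = upper
        .wrapping_add(a_lo.wrapping_mul(b_hi))
        .wrapping_add(a_hi.wrapping_mul(b_lo));
    (low, high)
}
```

Because every step wraps mod 2^64, the same routine gives the right answer for two's-complement signed inputs, for example (-3) × 7 with a sign-extended high half.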
__udivti3 body (fast/slow dispatch). Compiler-builtins’ specialized_div_rem is a polished Knuth Algorithm D implementation with CTLZ-based normalization, native udiv i64 for the quotient digits, and dispatch on operand sizes. It compiles to ~800 PVM instructions in our pipeline. Beating it from scratch is out of scope: a naive binary long-division replacement would be ~3000 PVM instructions (worse on every dimension). The pragmatic win is the b_hi specialization:
if (a_hi | b_hi) == 0:
q = a_lo / b_lo ; native PVM DivU64
sret = (q, 0)
return
else:
sp_old = __stack_pointer
__stack_pointer = sp_old - 32 ; specialized_div_rem writes 32 bytes (q + r)
call specialized_div_rem(sp_new, ...)
copy quotient (16 bytes) to caller sret
__stack_pointer = sp_old
return
The slow path re-implements the original __udivti3 wrapper verbatim — passing the caller’s 16-byte sret directly to specialized_div_rem is unsafe because it writes 32 bytes (quotient + remainder).
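The dispatch itself is small. In the sketch below the slow path is replaced by Rust's native u128 division as a stand-in for the call into specialized_div_rem, and the stack-pointer bookkeeping is omitted; the function name and tuple return are illustrative:

```rust
/// u128 / u128 with a fast path for operands that fit in 64 bits.
/// Returns the quotient as (low_half, high_half).
/// Like Rust's own division, this panics on a zero divisor.
pub fn udivti3(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
    if (a_hi | b_hi) == 0 {
        // Fast path: a single native 64-bit divide, high half is zero.
        return (a_lo / b_lo, 0);
    }
    // Slow path stand-in for the full 128-bit algorithm.
    let a = (u128::from(a_hi) << 64) | u128::from(a_lo);
    let b = (u128::from(b_hi) << 64) | u128::from(b_lo);
    let q = a / b;
    (q as u64, (q >> 64) as u64)
}
```

The single OR-and-compare on the high halves is the only cost added to the slow path, which is the overhead the slow-path row of the gas table measures.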
Measured dynamic gas impact (microbenchmarks at 1000 iterations through anan-as, see tests/fixtures/wat/u128-{mul,div}-bench*.jam.wat):
| Operation | Recognition off | Recognition on | Δ Gas | Notes |
|---|---|---|---|---|
| u128 mul | 119,029 | 75,029 | −37% | Body replacement, no dispatch |
| u128 div fast path (a_hi = b_hi = 0) | 129,029 | 76,029 | −41% | Native DivU64 vs full __udivti3 + specialized_div_rem stub |
| u128 div slow path (b_hi != 0) | 129,029 | 143,029 | +11% | Dispatch overhead (Or + ICmp + Branch) |
Measured static impact (real substrate runtimes via examples/polkadot/, combined mul + div recognition vs --no-libcall-recognition):
| Runtime | __multi3 calls | __udivti3 calls | Δ PVM instr | Δ JAM bytes |
|---|---|---|---|---|
| glutton-kusama | 79 | small | -20 | -64 |
| asset-hub-kusama | 962 | 135 | -20 | -64 |
The __multi3 body saves ~45 PVM instructions one-shot (it shrinks from ~53 to 8). The __udivti3 body grows by ~25 PVM instructions (the original was a thin 20-instr wrapper; we now carry a fast path + slow path + dispatch). Net per-runtime is roughly −20 instructions / −64 bytes, so static savings are minor in either direction. The real win is dynamic gas (microbench table above): the b_hi specialization fast path runs in ~5 PVM instructions instead of ~50 in the original. On workloads where most callers pass zero high halves (substrate’s Perbill::from_rational, currency math fitting comfortably in u64), every __udivti3 invocation pays a much smaller runtime cost.
The slow-path regression is the cost of the dispatch. For workloads dominated by full u128/u128 arithmetic, the 11% regression is real but bounded. In substrate, the pattern (x: u64 as u128) / (y: u64 as u128) is extremely common (Perbill::from_rational, currency arithmetic where balances comfortably fit in u64), so the fast path is expected to dominate. End-to-end runtime gas measurement requires running the chain, which is out of scope here — the microbench numbers above are the available signal.
What we explicitly did not do. Naive binary long division to replace specialized_div_rem entirely (loses ~2000 PVM instructions static, slow-path 3-4× worse). Newton-Raphson reciprocal or other algorithmic improvements (multi-week project for an uncertain win). Caller-side IR pattern matching to inline u64/u64 directly at call sites (fragile across LLVM passes, conflicts with our preference for body recognition). See crates/wasm-pvm/src/llvm_frontend/libcall_recognition.rs for the full design.
Block Layout for Fallthrough Bias (with regalloc realignment)
The pre-existing fallthrough-jump elision (the optimization that --no-fallthrough-jumps turns off) dropped trailing Jumps when the target happened to be the next block in function.get_basic_blocks() order. LLVM’s IR order isn’t picked with PVM fallthroughs in mind, so on glutton-kusama only 16,729 of 69,932 trailing branches actually fell through; the remaining 53,203 paid 5 bytes/Jump (~266 KB code) where 1 byte would do.
compute_block_layout(function) in llvm_backend/mod.rs chooses a per-function emission order via greedy trace from each unplaced IR block, following a “preferred successor” link per terminator:
- br dest → dest
- br cond, then, else → else (matches the trailing Jump else_label after BranchIfX then_label)
- switch val, default, ... → default (matches the trailing Jump default_label)
- ret / unreachable → none
Trampoline paths in lower_br / lower_switch (per-edge phi copies on both outgoing edges) emit a different final Jump target. Those blocks miss the fallthrough but stay correct.
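The greedy trace can be sketched over block indices. Here preferred[b] is assumed to hold the preferred successor of block b as chosen by the rules above; the function names are illustrative and not the real compute_block_layout signature:

```rust
/// Greedy trace layout: start from each unplaced block in IR order and
/// keep following the preferred successor until it is missing or placed.
pub fn compute_layout(preferred: &[Option<usize>]) -> Vec<usize> {
    let n = preferred.len();
    let mut placed = vec![false; n];
    let mut order = Vec::with_capacity(n);
    for start in 0..n {
        let mut cur = Some(start);
        while let Some(b) = cur {
            if b >= n || placed[b] {
                break;
            }
            placed[b] = true;
            order.push(b);
            cur = preferred[b];
        }
    }
    order
}

/// Counts blocks whose preferred successor is emitted directly after them,
/// i.e. trailing jumps that can be elided.
pub fn count_fallthroughs(order: &[usize], preferred: &[Option<usize>]) -> usize {
    order
        .windows(2)
        .filter(|w| preferred[w[0]] == Some(w[1]))
        .count()
}
```

For a diamond with preferred successors `[Some(2), Some(3), Some(3), None]`, IR order `[0, 1, 2, 3]` yields one fallthrough while the traced order `[0, 2, 3, 1]` yields two.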
Critical wiring detail. Regalloc must walk the same order the emitter does. regalloc::run accepts the layout as a block_order: &[BasicBlock] parameter; without that, live intervals were computed against IR order while the emitter executed in layout order, and downstream reads through operand_reg / load_operand picked up a register the linear scan thought still held a value but the layout had clobbered. The original symptom was the anan-as compiler’s compiled-PVM interpretation losing its r7/r8 mappings — the inner JAM ran fine in Layer 3 (direct anan-as on Node) but halted with empty output under Layer 4/Layer 5 PVM-in-PVM, because the compiled outer interpreter had the wrong live ranges. Realigning regalloc to layout order is what made pvm-in-pvm: as-flat-ternary-test green again.
The two pieces (block layout + jump elision) are coupled — the elision is meaningless without the layout choosing the right successor — so both sit behind the existing OptimizationFlags::fallthrough_jumps flag, default on.
Phi-Copy Temp/Destination Aliasing (Pre-existing Latent Bug)
emit_phi_copies_legacy and emit_phi_copies_regaware in control_flow.rs use a temp pool when 2-5 phi copies fit it:
let temp_regs = [TEMP1, TEMP2, TEMP_RESULT, SCRATCH1, SCRATCH2];
The trap: in llvm_backend/emitter.rs the names SCRATCH1 / SCRATCH2 are re-exported as ARGS_LEN_REG = r8 and ARGS_PTR_REG = r7 — not the r5 / r6 from crate::abi. So temp_regs == [r2, r3, r4, r8, r7]. With allocate_caller_saved_regs (default on), r7 and r8 are also valid phi destinations.
When a phi copy at index i has phi_reg == temp_regs[j] for some j > i, the legacy “Phase 1: load all temps; Phase 2: write all destinations in 0..N order” sequence corrupts itself: writing the destination at step i clobbers temp_regs[j] before step j reads it, and step j then silently writes the clobbered value to its own destination.
The regalloc-two-loops fixture exercised exactly this (5 phi copies, local_5’s phi_reg = r7 = temp_regs[4], local_3’s incoming value loaded into r7): local_3 (the loop counter i) ended up holding local_5’s value (b), so the loop iterated against the wrong counter and returned the wrong sum. The test expectations were calibrated to the buggy output (72 / 154 / 328 / …) — native WASM gives (76 / 211 / 720 / 2851 / 58958 / 165809572) for n ∈ {0,1,2,3,5,10}.
Fix in topo_order_phase2: build a dependency edge i → j whenever phi_regs[j] == temp_regs[i] (i != j), then Kahn-sort to produce a Phase-2 emission order where every consumer of a temp is processed before any producer that overwrites it. Cycles (phi_regs[3] = r7 AND phi_regs[4] = r8, etc.) drop to the slot-based emit_phi_copies_via_slots resolver. The temp pool and Phase-1 loads are unchanged; only Phase-2 ordering shifts.
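A compact sketch of the ordering step, assuming registers are plain u8 numbers and returning None where the real code falls back to the slot-based resolver (the signature is illustrative, not the actual topo_order_phase2):

```rust
use std::collections::VecDeque;

/// Orders Phase-2 writes so that copy i (which reads temp_regs[i]) runs
/// before any copy j whose destination phi_regs[j] is that same register.
/// Returns None when the constraints form a cycle.
pub fn topo_order_phase2(phi_regs: &[u8], temp_regs: &[u8]) -> Option<Vec<usize>> {
    let n = phi_regs.len();
    assert!(n <= temp_regs.len(), "one temp per copy is required");

    // Edge i -> j: copy j overwrites the temp that copy i still has to read.
    let mut succs = vec![Vec::new(); n];
    let mut indeg = vec![0usize; n];
    for i in 0..n {
        for j in 0..n {
            if i != j && phi_regs[j] == temp_regs[i] {
                succs[i].push(j);
                indeg[j] += 1;
            }
        }
    }

    // Kahn's algorithm: repeatedly emit copies with no unmet predecessor.
    let mut ready: VecDeque<usize> = (0..n).filter(|&k| indeg[k] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(i) = ready.pop_front() {
        order.push(i);
        for &j in &succs[i] {
            indeg[j] -= 1;
            if indeg[j] == 0 {
                ready.push_back(j);
            }
        }
    }
    (order.len() == n).then_some(order)
}
```

With temp_regs = [2, 3, 4, 8, 7] and a first copy whose destination is r7, the sort moves that copy last so the copy reading r7 as its temp runs first.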
Cross-Block Snapshot Must Mirror Terminator-Clobber Set
The cache snapshot taken before lowering a block’s terminator (llvm_backend/mod.rs) used to invalidate only TEMP1 and TEMP2, because they’re the operand-load temps for any branch/switch terminator. That was correct for branches without phi copies. But emit_phi_copies_regaware also uses TEMP_RESULT (r4) and the emitter-scope SCRATCH1 / SCRATCH2 (= r8 / r7) as Phase-1 temps for the 3rd/4th/5th active copy. When a successor restored that stale snapshot, its alloc_reg_slot showed r4 / r7 / r8 still owning whatever the predecessor’s block-body had put there — but the phi-copy that ran in between had overwritten them. Downstream reads via operand_reg / load_operand took the fast path against alloc_reg_slot and returned the wrong value.
Fix: invalidate TEMP1, TEMP2, TEMP_RESULT, SCRATCH1, SCRATCH2 in the snapshot, the full set of registers any terminator path may touch. This is a strict superset of what was invalidated before, so it can only drop cache entries; it can never make a successor trust an entry that the old snapshot would have discarded.
Intra-Block Trap-Bypass Labels Must Preserve Cache (#256)
The WASM-style trap helpers in llvm_backend/alu.rs (emit_wasm_div_zero_trap and emit_wasm_signed_overflow_trap) emit a one-shot bypass pattern: BranchNeImm/BranchGeS … → ok_label; Trap; ok_label:. The label is purely intra-block: its only predecessor is the branch above, because the Trap on the fall-through path never continues into it. Register state at the label therefore equals state at the branch. Both helpers used to call define_label(ok_label), which under the hood calls clear_reg_cache and wipes alloc_reg_slot / alloc_dirty for every allocated register.
With lazy spill on, that wipe is silently destructive. The back-edge phi copy in emit_phi_copies_regaware writes the new phi value into the phi’s allocated register (via emit_raw_move) and calls set_alloc_reg_for_slot(phi_reg, phi_slot) — but it does not emit a StoreIndU64 to the slot (that’s the entire point of lazy spill). The phi value lives in the register; the stack slot stays stale. Then on the next iteration, every load_operand(%phi, …) is supposed to take the fast path against alloc_reg_slot[phi_reg] == Some(phi_slot) and emit MoveReg from the alloc reg. If a define_label clears alloc_reg_slot between two uses of the phi, the second load_operand instead emits LoadIndU64 from the stale slot — and the loop reads garbage.
The aslan-ecalli fixture trips this for the value phi in AssemblyScript’s utoa_dec_simple (value % 10 then value / 10 per iteration, with the rem’s trap bypass between the two reads). With every other optimization on, the loop reads stack[64] forever and burns ~100M gas. Turning off any of --lazy-spill, --register-alloc, or --shrink-wrap masks the bug (the first two suppress the store elision; the third shifts the regalloc decision so the phi lands elsewhere) — the symptom is a 4-orders-of-magnitude gas swing from flipping unrelated-looking flags.
Fix: both trap helpers now call define_label_preserving_cache(ok_label). That records the label PC and emits a Fallthrough if needed, but does not clear any cache state. Safe because the only live edge into the label is the branch above, which doesn’t write any registers — so the cache at the label equals the cache before the branch.
This pattern is specific to single-predecessor intra-block labels where the fall-through path is unreachable. Other define_label callers in llvm_backend/intrinsics.rs (the abs two-path merge), llvm_backend/memory.rs (bulk-memory loop bodies), and llvm_backend/mod.rs (block boundaries with cross-block propagation) all have multiple live predecessors and must keep clearing — preservation would let one path’s stale alloc state leak into another. The trap-bypass case is uniquely safe.
Don’t generalize this by making define_label “smart” (e.g. “preserve if the previous instruction is the lone branch to this label”). The emitter doesn’t track which define_label calls are merge points vs. trap-bypasses, and a peephole-style “only one preceding branch” check would miss labels whose predecessors are also stitched in by emit_jump_to_label fixups elsewhere. The call-site distinction is structural and clearer.
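The failure mode can be reproduced in a toy model. The sketch below is not the real emitter: ToyEmitter, Load, and two_reads_around_bypass are hypothetical names, and only the cache-tracking behavior described above is modeled (the names define_label, define_label_preserving_cache, set_alloc_reg_for_slot, and load_operand are borrowed from the text, their bodies are assumptions).

```rust
/// What a toy `load_operand` would emit for a value (hypothetical model).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Load {
    /// Fast path: the value is still tracked in its allocated register.
    MoveReg,
    /// Slow path: reload from the stack slot, which is stale under lazy spill.
    LoadIndU64,
}

/// Toy stand-in for the emitter's allocated-register tracking.
struct ToyEmitter {
    /// alloc_reg_slot[reg] == Some(slot) means `reg` holds the value of `slot`.
    alloc_reg_slot: [Option<u32>; 4],
}

impl ToyEmitter {
    fn new() -> Self {
        Self { alloc_reg_slot: [None; 4] }
    }

    /// Lazy-spill phi copy: the register is updated, no store is emitted.
    fn set_alloc_reg_for_slot(&mut self, reg: usize, slot: u32) {
        self.alloc_reg_slot[reg] = Some(slot);
    }

    /// Merge-point label: forget all cache state.
    fn define_label(&mut self) {
        self.alloc_reg_slot = [None; 4];
    }

    /// Single-predecessor trap-bypass label: keep the cache untouched.
    fn define_label_preserving_cache(&mut self) {}

    fn load_operand(&self, reg: usize, slot: u32) -> Load {
        if self.alloc_reg_slot[reg] == Some(slot) {
            Load::MoveReg
        } else {
            Load::LoadIndU64
        }
    }
}

/// Two reads of the same phi with a trap-bypass label between them,
/// as in `value % 10` followed by `value / 10`.
fn two_reads_around_bypass(preserve: bool) -> (Load, Load) {
    let mut e = ToyEmitter::new();
    e.set_alloc_reg_for_slot(0, 64);
    let first = e.load_operand(0, 64);
    if preserve {
        e.define_label_preserving_cache();
    } else {
        e.define_label();
    }
    let second = e.load_operand(0, 64);
    (first, second)
}
```

In this model, two_reads_around_bypass(false) makes the second read fall back to LoadIndU64 from the slot that was never written, while two_reads_around_bypass(true) keeps both reads on the register fast path.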
Global Storage Width: Per-Type Slots, Not Uniform 8-Byte Widening
For most of the compiler’s history each WASM global was stored in a fixed 4-byte slot at 0x30000 + (has_mem_size ? 4 : 0) + idx * 4, and the lowering in llvm_backend/memory.rs emitted LoadU32/StoreU32 for every global.get / global.set. That worked invisibly because:
- the WASM parser only matched I32Const in eval_const_i32 (silently dropping I64Const initializers to 0);
- the LLVM frontend declared every global as LLVM i64 regardless of the WASM-declared type;
- and wasmparser::validate enforced that any WASM operator consuming a global’s value matched the global’s declared type, so for (global i32 ...) the trailing i32 ops truncated whatever garbage was in the top 32 bits.
The combination silently corrupted (global i64 ...) values whose high 32 bits were non-zero — store dropped them, load zero-filled them, and no test fixture exercised i64 globals at all so the regression never surfaced.
Rejected approach: uniform 8-byte widening. The first cut of this fix simply widened every global slot to 8 bytes (GLOBAL_SLOT_SIZE = 8) and switched lowering to LoadU64/StoreU64 unconditionally. That paid an i32-global-wide tax to fix a bug no current input triggers — every polkadot fellowship runtime (v2.2.2, 14 modules) has exactly 3 globals, all i32 (the standard Rust→wasm32 trio: stack pointer, __data_end, __heap_base). Rust→wasm32 effectively never emits i64 globals because pointers are 32-bit and most LLVM-managed globals live in linear memory. So uniform widening added 12 bytes of rw_data per polkadot runtime for zero observable benefit.
Chosen approach: per-global widths. Storage width matches the declared WASM type — 4 bytes for i32/f32, 8 bytes for i64/f64. Address resolution moves from a closed-form idx * SLOT formula to a precomputed WasmModule::global_offsets: Vec<i32> parallel to globals/global_widths. The LLVM frontend keeps its uniform load i64/store i64 shape (unchanged from before this PR); the backend reads the per-global width from ctx.global_widths[idx] and selects the matching PVM opcode. Keeping the LLVM IR shape identical avoids LLVM-pass outcomes drifting for i32-only modules — an exploratory variant that issued load i32/zext and trunc/store i32 regressed the anan-as PVM interpreter by ~2.5% (+2872 bytes) before being reverted.
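A minimal sketch of the precomputed-offset idea, assuming the globals region still starts at 0x30000 and that the optional 4-byte memory-size word still precedes the globals, as in the old closed-form formula. The function name follows the compute_global_offsets helper named in this section, but the body and the old_global_addr comparison helper are illustrative assumptions, not the project's code.

```rust
/// Base of the globals region, taken from the old closed-form formula.
const GLOBALS_BASE: i32 = 0x30000;

/// Sketch: absolute address of each global, packed in declaration order.
fn compute_global_offsets(widths: &[u32], has_mem_size: bool) -> Vec<i32> {
    // Assumption: the memory-size word, when present, occupies the first 4 bytes.
    let mem_size_bytes = if has_mem_size { 4 } else { 0 };
    let mut next = GLOBALS_BASE + mem_size_bytes;
    let mut offsets = Vec::with_capacity(widths.len());
    for &width in widths {
        offsets.push(next);
        next += width as i32;
    }
    offsets
}

/// The old uniform 4-byte formula, kept here only for comparison.
fn old_global_addr(idx: usize, has_mem_size: bool) -> i32 {
    let mem_size_bytes = if has_mem_size { 4 } else { 0 };
    GLOBALS_BASE + mem_size_bytes + (idx as i32) * 4
}
```

For an all-i32 module the two functions agree index by index, which is the property that keeps existing layouts unchanged; a width list such as [4, 8, 4] simply shifts every global after the i64 by four extra bytes.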
Implementation, layer by layer.
- WasmModule::parse only accepts i32/i64 globals; f32/f64, v128, and ref-type globals all error out with Error::Unsupported at parse time. (An earlier draft tolerated f32/f64 globals on the assumption that --trap-floats would catch reads, but global.get/global.set are lowered as plain integer loads/stores and --trap-floats only traps float operators, so a program could observe a zeroed float global via i32.reinterpret_f32 or by forwarding the loaded i64 elsewhere. Rejecting up front avoids that footgun. No real workload uses float globals: all 14 polkadot fellowship runtimes have 3 i32 globals each and zero floats.)
- WasmModule now carries global_init_values: Vec<i64>, global_widths: Vec<u32>, and global_offsets: Vec<i32>, all parallel to globals.
- eval_const_global_init accepts only a single I32Const/I64Const literal followed by End; multi-operator extended-const expressions (legal under wasmparser’s default EXTENDED_CONST feature) and any other operator (global.get of an imported const, ref.func, ref.null) error. The previous pattern of silently returning Ok(0) for unsupported init-exprs (or only consuming the first operator of a multi-op chain) would have corrupted a program’s initial state without any compile-time signal.
- memory_layout: globals_region_size, data_segment_length_offset, compute_param_overflow_base, and compute_wasm_memory_base now take a &[u32] widths slice instead of num_globals: usize. New compute_global_offsets(widths, has_mem_size) precomputes absolute PVM addresses; new global_storage_width(ValType) returns 4 or 8 per type (gated on feature = "compiler" because it consumes wasmparser::ValType; the rest of memory_layout stays usable without the compiler toolchain). The old global_addr(idx, has_mem_size) closed-form helper is gone; callers index WasmModule::global_offsets directly.
- LoweringContext gains global_offsets: Vec<i32> and global_widths: Vec<u32> (cloned from WasmModule at compile entry).
- The backend’s two global-access lowerings (lower_wasm_global_load, lower_wasm_global_store) look up ctx.global_offsets[idx] for the address and ctx.global_widths[idx] for the width, then pick LoadU32 vs LoadU64 (and StoreU32/StoreImmU32 vs StoreU64/StoreImmU64) per width. Width is not derived from the LLVM instruction’s type: the LLVM IR is uniformly i64 and would be misleading.
- The LLVM frontend (function_builder.rs) is unchanged from main: every global is declared as LLVM i64, and global.get/global.set issue load i64/store i64 uniformly. The width-vs-LLVM-IR mismatch (LLVM IR claims to read/write 8 bytes from a 4-byte i32 slot) is invisible to LLVM (no pass observes raw storage widths) and resolved at the backend via ctx.global_widths. The “top 32 bits = 0” invariant holds because the frontend’s i32 ops always zero-extend to i64 before pushing onto the operand stack.
- build_rw_data takes the widths slice and writes the low width bytes of each i64 init value into the appropriate slot, packed in declaration order. Returns Result<Vec<u8>> so layout-invariant violations (mismatched parallel arrays, unsupported widths > 8 B from a hypothetical bypassed parse guard) surface as Error::Internal rather than as a release-build slice panic (debug_assert! would have disappeared in release).
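The last item can be illustrated with a small sketch of the globals portion of the initial data image. Everything here is an assumption made for illustration: build_globals_bytes is a hypothetical helper (the real build_rw_data also has other responsibilities), the Error enum is a local stand-in, and little-endian byte order is assumed.

```rust
/// Local stand-in for the crate's error type (hypothetical).
#[derive(Debug, PartialEq)]
enum Error {
    Internal(String),
}

/// Sketch: write the low `width` bytes of each init value, packed in
/// declaration order, returning an error instead of panicking.
fn build_globals_bytes(init_values: &[i64], widths: &[u32]) -> Result<Vec<u8>, Error> {
    if init_values.len() != widths.len() {
        return Err(Error::Internal(format!(
            "parallel arrays differ: {} init values vs {} widths",
            init_values.len(),
            widths.len()
        )));
    }
    let mut out = Vec::new();
    for (&value, &width) in init_values.iter().zip(widths.iter()) {
        let width = width as usize;
        if width > 8 {
            return Err(Error::Internal(format!("unsupported global width {}", width)));
        }
        // Assumption: little-endian storage, so the low bytes come first.
        out.extend_from_slice(&value.to_le_bytes()[..width]);
    }
    Ok(out)
}
```

With init values [1, 0x1_0000_0002] and widths [4, 8], the result is 12 bytes: four for the i32 and eight for the i64, whose high word survives. The checks run in release builds too, which is the point of returning Result here.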
Why i32 globals are unchanged for typical programs. With per-global widths, an all-i32 module (every fixture, every polkadot runtime) sees byte-identical globals_region_size, wasm_memory_base, and rw_data layout as before this PR. The fix is invisible until someone actually compiles a module with (global i64 ...).
Verification. crates/wasm-pvm/tests/i64_globals.rs (9 cases): (i) i64 global.get lowers to LoadU64 (not LoadU32); (ii) i64 global.set with a small const lowers to StoreImmU64; (iii) i64 global.set with a >i32-range const lowers to LoadImm64 + StoreU64; (iv) i32 globals still lower to LoadU32 / StoreU32 (no 64-bit opcodes, no regression for the common case — split across two functions to defeat LLVM intra-function store→load forwarding); (v) mixed-width modules emit both i32 and i64 opcodes; (vi) v128 globals are rejected at parse; (vii) f32/f64 globals are rejected at parse; (viii) non-const-literal init expressions (e.g. global.get of an imported global) are rejected; (ix) extended-const init expressions (e.g. i32.add of two literals) are rejected. Plus two build_rw_data unit tests (rejects_mismatched_parallel_arrays, rejects_unsupported_global_width) covering the error-path replacements for the prior debug_assert!/slice panics. Full Rust + integration + PVM-in-PVM + differential suites stay green; benchmarks are byte-identical to main for every existing fixture (no fixture uses i64 globals).
Value-Lifetime-Aware DSE + Stack-Slot Reuse — Nothing Shipped (2026-05)
A position-aware DSE extension (kill SP-relative stores whose offset is overwritten later in the same basic block with no intervening load) was hypothesized to unblock a stack-slot reuse pass for a combined ~10 % code-size win on polkadot runtimes. Measured 0.03 % on glutton-kusama (4,636,361 → 4,634,900 B code; 6,444,121 → 6,442,477 B JAM) — two orders of magnitude below the hypothesis. Both the DSE rewrite and the slot-reuse port were reverted.
The new DSE alone is byte-identical to main: each SSA value currently owns a unique stack-slot offset, so the “two stores at the same offset, no intervening load” pattern doesn’t arise within a basic block. The new pass is dormant.
Slot reuse + DSE saves 0.03 %, and the win comes from offset-encoding compression (shared offsets fit in fewer varint bytes), not store elimination: when V1 and V2 share offset X, the emitted sequence is store V1@X; … load V1@X; store V2@X; … load V2@X. The intervening load V1@X clears the kill-pending set before V2’s overwrite, so V1’s store stays. Pass 2b only fires for stores with no reload at all (lazy-spill flushes satisfied entirely by the register cache) — rare, and lazy spill already optimizes the common cases.
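The interaction described above can be seen in a toy version of the position-aware rule. This is a sketch under simplifying assumptions, not the reverted pass: Inst and dead_stores are hypothetical, all accesses are treated as the same width, and calls, branches, and aliasing are ignored.

```rust
use std::collections::HashMap;

/// Toy SP-relative memory operations within one basic block (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Inst {
    Store { offset: i32 },
    Load { offset: i32 },
}

/// Returns the indices of stores that are overwritten later in the block
/// with no intervening load of the same offset.
fn dead_stores(block: &[Inst]) -> Vec<usize> {
    // offset -> index of the most recent store not yet observed by a load
    let mut kill_pending: HashMap<i32, usize> = HashMap::new();
    let mut dead = Vec::new();
    for (idx, inst) in block.iter().enumerate() {
        match *inst {
            Inst::Store { offset } => {
                // An earlier unobserved store to the same offset is dead.
                if let Some(prev) = kill_pending.insert(offset, idx) {
                    dead.push(prev);
                }
            }
            Inst::Load { offset } => {
                // The load observes the pending store, so it must stay.
                kill_pending.remove(&offset);
            }
        }
    }
    dead
}
```

For the shared-offset sequence store V1@X; load V1@X; store V2@X; load V2@X the function returns nothing, because the first load clears the pending entry before the second store arrives. Only a store followed directly by another store to the same offset is reported.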
Slot reuse also reduces the original pass 1’s kill rate: an SSA value held in a register with its slot otherwise unused has its store killed today (offset has no consumers); under slot reuse the offset is shared with a live value, so pass 1 keeps both stores. Pass 2b recovers most but not all.
Promising direction not pursued: attack the lazy-spill flush at the source — skip the just-in-case store at block exits when proven unreachable. Removing the store at the source also kills the matching reload.
PVM-in-PVM Execution
The compiler can compile the anan-as PVM interpreter (written in AssemblyScript) to PVM bytecode, then run PVM programs inside this PVM interpreter that is itself running on PVM. This serves as a comprehensive integration test and stress test of the compiler.
Goal
Run PVM programs (trap.jam, add.jam) through the anan-as PVM interpreter that is itself compiled to PVM bytecode and running on PVM.
Pipeline: inner.wat → inner.jam + compiler.wasm → compiler.jam → feed inner.jam as args to compiler.jam → outer anan-as CLI runs it all.
Bugs Found & Fixed
Bug 1: HasMetadata.Yes in anan-as entry point
File: vendor/anan-as/assembly/index-compiler.ts:91
The anan-as compiler entry point was calling:
prepareProgram(InputKind.SPI, HasMetadata.Yes, spiProgram, [], [], [], innerArgs);
With HasMetadata.Yes, the SPI parser first calls extractCodeAndMetadata() which reads a varint-encoded metadata length from the start of the data. Since inner JAM programs don’t have metadata, this read garbage values (e.g., the ro_data_length field), corrupting all subsequent parsing.
Symptom: Native WASM test failed with "Not enough bytes left. Need: 7561472, left: 56377" — the parser was reading the first SPI header bytes as a metadata length.
Fix: Changed to HasMetadata.No and rebuilt the vendor with npm run asbuild:compiler.
Bug 2: Unknown WASM imports compiled to TRAP
File: crates/wasm-pvm/src/llvm_backend/calls.rs:137-138
The wasm-pvm compiler mapped all unknown WASM imports (anything not host_call or pvm_ptr) to PVM TRAP instructions. The anan-as compiler.wasm imports two functions:
- env.abort: called on unrecoverable AS runtime errors
- env.console.log: called during normal execution for debug logging
Since console.log is called in the normal success path (confirmed by native WASM test showing console.log: 11952), the TRAP instruction killed the PVM program before it could complete.
Symptom: PVM execution panicked at PC 100640 (a TRAP instruction corresponding to the console.log import call). The outer anan-as interpreter reported "Unhandled host call: ecalli 0".
Fix: Changed unknown imports to be no-ops (silently skip) instead of TRAPs. The abort import specifically remains a TRAP since it indicates unrecoverable errors and should terminate execution.
// Before: all unknown imports → TRAP
// e.emit(Instruction::Trap);
// After: only abort → TRAP, others are no-ops
let is_abort = import_name == Some("abort");
if is_abort {
    e.emit(Instruction::Trap);
}
Debugging Journey
- Initial state: compiler.jam panicked at PC 150403 after ~95K instructions
- First hypothesis (from subagent): Jump table corruption — turned out to be incorrect; the verify-jam tool’s VarU32 decoder has an endianness bug that displayed wrong values
- Key insight: Ran compiler.wasm natively with the same args — it also failed! This proved the issue was in the input format, not wasm-pvm compilation
- Native error: "Not enough bytes left. Need: 7561472" pointed to SPI parsing reading garbage lengths
- Found Bug 1: HasMetadata.Yes → fixed to HasMetadata.No, rebuilt vendor
- After fix 1: Native WASM worked perfectly (trap.jam → PANIC, add.jam → result 12), but PVM version still failed with ecalli 0 at PC 100640
- Traced PVM execution: Confirmed PC 100640 contains opcode 0x00 (TRAP), which is the compiled console.log import
- Confirmed: Native WASM calls console.log during normal execution → in PVM this becomes TRAP → panic
- Found Bug 2: Fixed import handling to make non-abort imports no-ops
- Both tests pass: trap.jam returns inner PANIC, add.jam returns inner result 12
Performance Notes
PVM-in-PVM tests are inherently slow (~85 seconds each) because:
- The outer anan-as interpreter executes ~525M PVM instructions
- Most of this is the inner interpreter’s initialization (AS runtime setup, SPI parsing, memory page allocation)
- The actual inner program execution is tiny (~46-65K gas)
- The JS-based anan-as interpreter processes ~6M instructions/second
Tests have 180-second timeouts to accommodate this.
PVM-in-PVM Benchmarks
| Benchmark | JAM Size | Code Size | Outer Gas | Direct Gas | Overhead |
|---|---|---|---|---|---|
| TRAP (interpreter overhead) | 21 B | 1 B | 89,939 | - | - |
| add(5,7) | 164 B | 99 B | 1,219,622 | 28 | 43,558x |
| host-call-log | 458 B | 104 B | 1,265,258 | 40 | 31,631x |
| AS fib(10) | 631 B | 504 B | 1,571,677 | 245 | 6,415x |
| JAM-SDK fib(10)* | 25.4 KB | 16.2 KB | 9,582,904 | - | - |
| Jambrains fib(10)* | 61.1 KB | - | 29,245,041 | - | - |
| JADE fib(10)* | 67.3 KB | 45.7 KB | 20,493,145 | - | - |
| aslan-fib accumulate* | 20.7 KB | 13.1 KB | 15,849,103 | 11,474 | 1,381x |
| blake2b(“abc”, 32) | 3.8 KB | 2.5 KB | 16,243,164 | 17,930 | 906x |
| sha512(“abc”) | 3.7 KB | 2.5 KB | 15,533,350 | 17,981 | 864x |
*These programs exit on unhandled host calls (ecalli). Gas cost reflects parsing/loading plus partial execution up to the first unhandled ecalli.
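The figures in the Overhead column are consistent with outer gas divided by direct gas, rounded to the nearest integer. A small sketch of that arithmetic follows; overhead is a hypothetical helper name, and the rounding rule is inferred from the table rather than stated by the benchmark script.

```rust
/// Sketch: overhead factor as outer gas / direct gas, rounded to nearest.
/// Returns None when no direct-execution figure is available.
fn overhead(outer_gas: u64, direct_gas: u64) -> Option<u64> {
    if direct_gas == 0 {
        return None;
    }
    // Adding half the divisor before dividing rounds to the nearest integer.
    Some((outer_gas + direct_gas / 2) / direct_gas)
}
```

For example, overhead(1_219_622, 28) reproduces the add(5,7) row, and overhead(16_243_164, 17_930) reproduces the blake2b row.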
Regalloc Cross-Block Propagation Journey
A detailed account of implementing cross-block register allocation propagation — including failed approaches, debugging discoveries, and final results.
Issue: #127
Branch: feature/regalloc-cross-block-propagation
Goal: Propagate allocated-register state across block boundaries to avoid unnecessary reloads, especially at loop headers.
Current State (Baseline)
The register allocator assigns loop-carried values to callee-saved registers (r9-r12).
The runtime tracking (alloc_reg_slot) is cleared at every block boundary that doesn’t
qualify for single-predecessor cross-block cache propagation. This means loop headers
(which have 2+ predecessors: preheader + back-edge) always start cold, requiring a
reload on first use of each allocated value per loop iteration.
Attempt 1: Blanket alloc_reg_slot persistence (FAILED)
Change: Remove clear_allocated_reg_state() from clear_reg_cache() so
alloc_reg_slot is never cleared at block boundaries.
Result: Layers 1-3 (422 tests) pass. PVM-in-PVM fails on as-decoder-subarray-test
(2 failures). Direct execution of the same tests passes.
Root cause analysis: Multi-predecessor blocks (merge points) are unsafe because different predecessors may leave allocated registers in different states:
- Block B has a call → r9 is clobbered at runtime, alloc_reg_slot[r9] = None
- Block C has no call → alloc_reg_slot[r9] = Some(S) at compile time
- Block D (successor of both B and C) inherits C’s state (last processed)
- At runtime via B: r9 holds garbage but compile-time state says Some(S) → skip reload
The write-through argument only holds when NO instruction clobbers the register between the last write-through and the block entry. Calls clobber r9-r12.
Approach 2: Leaf-function-only + predecessor intersection (IMPLEMENTED)
Key insight: In leaf functions (no calls), allocated registers (r9-r12) are ONLY
written by store_to_slot (write-through) and load_operand (reload). Both correctly
update alloc_reg_slot. So alloc_reg_slot is ALWAYS accurate in leaf functions.
For non-leaf functions: Use predecessor exit snapshot intersection. At multi-predecessor
blocks, only keep alloc_reg_slot entries where ALL processed predecessors agree. For
back-edges (unprocessed predecessors), be conservative.
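The intersection rule can be sketched as follows. Snapshot and entry_state are hypothetical names, the four-entry array stands in for r9-r12, and treating any unprocessed predecessor as "drop everything" is one possible reading of "be conservative", not necessarily what the implementation does.

```rust
/// Compile-time view of the allocated registers at a block exit:
/// entry i is Some(slot) when register i is known to hold that slot's value.
type Snapshot = [Option<u32>; 4];

/// Sketch: entry state of a block from its predecessors' exit snapshots.
/// A `None` predecessor has not been processed yet (for example a back-edge).
fn entry_state(preds: &[Option<Snapshot>]) -> Snapshot {
    let mut merged: Option<Snapshot> = None;
    for pred in preds {
        let snap = match pred {
            Some(s) => *s,
            // Assumption: an unprocessed predecessor invalidates everything.
            None => return [None; 4],
        };
        merged = Some(match merged {
            None => snap,
            Some(mut acc) => {
                // Keep an entry only when both sides agree on it.
                for (entry, other) in acc.iter_mut().zip(snap.iter()) {
                    if *entry != *other {
                        *entry = None;
                    }
                }
                acc
            }
        });
    }
    merged.unwrap_or([None; 4])
}
```

Applied to the failing case from Attempt 1, a predecessor whose call clobbered the first register and a predecessor that still tracks it merge to "unknown" for that register, so the successor reloads instead of trusting stale state.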
Discovery: Leaf detection was broken (THE MAIN WIN)
Critical finding: ALL functions with memory access were classified as non-leaf because
PVM intrinsics (__pvm_load_i32, __pvm_store_i32, etc.) are LLVM Call instructions.
These are NOT real function calls — they’re lowered inline using temp registers and never
use the calling convention.
Fix: Added is_real_call() to distinguish real calls (wasm_func_*, __pvm_call_indirect)
from intrinsics (__pvm_*, llvm.*).
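A name-based sketch of that distinction is shown below. It assumes the classification can be made from the callee's symbol name alone and that unrecognized callees are treated as real calls; the actual is_real_call may inspect LLVM values differently, and is_leaf is a hypothetical helper.

```rust
/// Sketch: does a call to `callee` use the calling convention
/// (and therefore clobber the allocated registers)?
fn is_real_call(callee: &str) -> bool {
    // Checked first, because it shares the `__pvm_` prefix with intrinsics.
    if callee == "__pvm_call_indirect" || callee.starts_with("wasm_func_") {
        return true;
    }
    // PVM and LLVM intrinsics are lowered inline with temp registers.
    if callee.starts_with("__pvm_") || callee.starts_with("llvm.") {
        return false;
    }
    // Assumption: anything unrecognized is conservatively a real call.
    true
}

/// A function is a leaf when none of its callees is a real call.
fn is_leaf(callees: &[&str]) -> bool {
    !callees.iter().any(|callee| is_real_call(callee))
}
```

Under this rule a function whose only calls are __pvm_load_i32 and llvm.ctlz.i64 counts as a leaf, while a single wasm_func_0 call makes it non-leaf.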
Impact: Significant improvements because leaf functions get smaller stack frames (no callee-save prologue/epilogue):
| Benchmark | Code Change | Gas Change |
|---|---|---|
| AS decoder | -2.9% | -4.0% |
| AS array | -3.2% | -3.7% |
| PiP TRAP | 0 | -3.3% |
| PiP add | 0 | -1.0% |
| PiP Jambrains | 0 | -1.9% |
| is_prime | +0.4% | +2.6% (tiny: +2 gas absolute) |
Attempt: Phi node allocation (REVERTED)
Hypothesis: Phi nodes at loop headers represent loop-carried variables (induction variables, accumulators). Allow them to be register-allocated.
Result: All tests pass, but gas regressions on key benchmarks:
- is_prime: +6.4% gas
- AS factorial: +8.2% gas
- regalloc two loops: +8.8% gas
Root cause: In PVM, all basic instructions cost 1 gas. Write-through adds 1 MoveReg per phi copy per iteration. The “saved” load is just LoadIndU64 → MoveReg (same cost). Net: +1 gas per iteration per allocated phi node. The write-through model makes phi node allocation a gas regression in the current PVM gas model.
Learning: Register allocation for phi nodes only makes sense when:
- Loads are cheaper than stores (not the case in PVM: both cost 1 gas)
- OR the allocated register can be used directly without MoveReg to temp (not the case: allocated regs are r9-r12, temps are r2-r4)
- OR code size matters more than gas (MoveReg is 2 bytes vs LoadIndU64’s 5 bytes)
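The accounting behind this can be written out as a back-of-the-envelope model. The functions below are hypothetical and cover only the instructions named above: one stack store per phi copy, one extra MoveReg for write-through, and one read per use at 1 gas each, with the 2-byte and 5-byte encodings quoted in the list.

```rust
/// Sketch: gas per loop iteration for one phi value with `uses` reads.
fn gas_per_iteration(allocated: bool, uses: u64) -> u64 {
    let store = 1; // the phi copy writes the stack slot in both modes
    let write_through = if allocated { 1 } else { 0 }; // extra MoveReg into the allocated register
    let reads = uses; // LoadIndU64 or MoveReg, 1 gas either way
    store + write_through + reads
}

/// Sketch: code-size change in bytes when the phi is allocated,
/// using MoveReg = 2 bytes and LoadIndU64 = 5 bytes.
fn code_bytes_delta(uses: i64) -> i64 {
    let added_write_through = 2; // one extra MoveReg
    let saved_per_read = 5 - 2; // each LoadIndU64 becomes a MoveReg
    added_write_through - saved_per_read * uses
}
```

In this model the allocated variant always costs exactly one more gas per iteration regardless of the number of uses, while code size shrinks as soon as the value is read at least once.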
Final Results (Leaf Detection + Cross-Block Propagation)
| Benchmark | JAM Size | Code Size | Gas Change |
|---|---|---|---|
| AS decoder | -1.1% | -2.9% | -4.0% |
| AS array | -1.1% | -3.2% | -3.7% |
| anan-as PVM interpreter | -0.6% | -0.8% | - |
| PiP TRAP | 0 | 0 | -3.3% |
| PiP Jambrains | 0 | 0 | -1.9% |
| PiP JADE | 0 | 0 | -0.8% |
| is_prime | +0.3% | +0.4% | +2.6% |
Log
Step 1: Add targeted tests (DONE) — commit e0bfda7
- regalloc-nested-loops.jam.wat: nested loops with multiple carried values
- regalloc-loop-with-call.jam.wat: loop calling a function (non-leaf)
Step 2: Blanket alloc_reg_slot persistence (FAILED)
- PVM-in-PVM: 2 failures in as-decoder-subarray-test
- Root cause: multi-predecessor blocks with inconsistent predecessor states
Step 3: Leaf-only propagation + predecessor intersection (DONE) — commit e8694cd
- All 695 tests pass, zero benchmark impact (regalloc rarely activates)
Step 4: Fix leaf detection (DONE) — commit 6960512
- Distinguish PVM intrinsics from real calls
- Up to -4% gas, -3.2% code size on real workloads
Step 5: Phi node allocation (REVERTED) — commit 6af12fa → reverted 3445375
- Gas regression due to write-through MoveReg overhead
Future Opportunities
- Direct phi-to-register allocation: Instead of write-through to stack + MoveReg to allocated reg, emit phi copies directly to the allocated register and skip the stack store entirely (DSE would need to remove the dead store). This would make phi allocation gas-neutral and code-size-positive.
- Load-from-allocated-register without MoveReg: When the consumer of an allocated value can use r9-r12 directly (instead of requiring TEMP1/TEMP2), avoid the MoveReg. This requires instruction selection awareness of allocated registers.
- Non-leaf loop-safe propagation: For non-leaf functions, propagate alloc_reg_slot at loop headers where the loop body has no calls (requires loop-body analysis).
Contributing
Contributions are welcome! This page covers coding conventions, project structure, and where to look for different tasks.
Code Style
- rustfmt defaults, clippy warnings treated as errors
- unsafe_code = "deny" at workspace level
- thiserror for error types, tracing for logging
- Unit tests inline under #[cfg(test)]
Naming Conventions
- Types: PascalCase
- Functions: snake_case
- Constants: SCREAMING_SNAKE_CASE
- Indicate WASM vs PVM context in names where relevant
Where to Look
| Task | Location |
|---|---|
| Add WASM operator | crates/wasm-pvm/src/llvm_frontend/function_builder.rs |
| Add PVM lowering (arithmetic) | crates/wasm-pvm/src/llvm_backend/alu.rs |
| Add PVM lowering (memory) | crates/wasm-pvm/src/llvm_backend/memory.rs |
| Add PVM lowering (control flow) | crates/wasm-pvm/src/llvm_backend/control_flow.rs |
| Add PVM lowering (calls) | crates/wasm-pvm/src/llvm_backend/calls.rs |
| Add PVM lowering (intrinsics) | crates/wasm-pvm/src/llvm_backend/intrinsics.rs |
| Modify emitter core | crates/wasm-pvm/src/llvm_backend/emitter.rs |
| Add PVM instruction | crates/wasm-pvm/src/pvm/opcode.rs + crates/wasm-pvm/src/pvm/instruction.rs |
| Modify register allocator | crates/wasm-pvm/src/llvm_backend/regalloc.rs |
| Modify peephole optimizer | crates/wasm-pvm/src/pvm/peephole.rs |
| Fix WASM parsing | crates/wasm-pvm/src/translate/wasm_module.rs |
| Fix compilation pipeline | crates/wasm-pvm/src/translate/mod.rs |
| Fix adapter merge | crates/wasm-pvm/src/translate/adapter_merge.rs |
| Add integration test | tests/layer{1,2,3}/*.test.ts |
Anti-Patterns (Forbidden)
- No unsafe code: strictly forbidden by workspace lint
- No panics in library code: use Result<> with Error::Internal
- No floating point: PVM lacks FP support; reject WASM floats
- Don’t break register conventions: hardcoded in multiple files
- Don’t change opcode numbers: would break existing JAM files
Building & Testing
See the Getting Started and Testing chapters.
Documentation Policy
After every task or commit, update relevant documentation:
- AGENTS.md: new modules, build process changes, conventions
- learnings.md: technical discoveries and debugging insights
- architecture.md: ABI or calling convention changes
- internals/: module-specific implementation details
- SUMMARY.md: when adding new documentation pages
Testing
The project has a comprehensive multi-layer test suite covering unit tests, integration tests, differential tests, and PVM-in-PVM execution tests.
Quick Reference
# Rust unit tests
cargo test
# Lint
cargo clippy -- -D warnings
# Full integration tests (builds artifacts first)
cd tests && bun run test
# Quick validation (Layer 1 only — requires build first)
cd tests && bun build.ts && bun test layer1/
# PVM-in-PVM tests (requires build first)
cd tests && bun build.ts && bun test layer4/ layer5/ --test-name-pattern "pvm-in-pvm"
# Differential tests (PVM vs native WASM)
cd tests && bun run test:differential
Important: Always use bun run test (not bun test) from the tests/ directory — it runs bun build.ts first to compile fixtures.
Test Layers
| Layer | Tests | Purpose | Speed |
|---|---|---|---|
| Layer 1 | ~50 | Core/smoke tests | Fast — use for development |
| Layer 2 | ~100 | Feature tests | Medium |
| Layer 3 | ~220 | Regression/edge cases | Medium |
| Layer 4 | 3 | PVM-in-PVM smoke tests | Slow (~85s each) |
| Layer 5 | ~270 | Comprehensive PVM-in-PVM | Slow |
| Differential | ~142 | PVM vs native WASM comparison | Medium |
Test Organization
- Integration tests: tests/layer{1,2,3}/*.test.ts (each file calls defineSuite() with hex args, little-endian)
- Rust integration tests: crates/wasm-pvm/tests/ (operator coverage, emitter units, stack spill, property tests; true unit tests live inline under #[cfg(test)] in source files)
- Differential tests: tests/differential/differential.test.ts (verifies PVM output matches Bun’s WebAssembly engine)
- PVM-in-PVM tests: Layers 4-5 (the anan-as PVM interpreter compiled to PVM, running test programs inside)
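As an illustration of the little-endian hex argument format, here is a small sketch that encodes integer arguments into such a string. It assumes each argument is a 32-bit word; args_to_le_hex is a hypothetical helper and not part of the test harness, and individual fixtures may use other widths.

```rust
/// Sketch: encode 32-bit arguments as one little-endian hex string.
fn args_to_le_hex(args: &[u32]) -> String {
    let mut hex = String::new();
    for arg in args {
        // to_le_bytes yields the least significant byte first.
        for byte in arg.to_le_bytes().iter() {
            hex.push_str(&format!("{:02x}", byte));
        }
    }
    hex
}
```

Under that assumption, the arguments 5 and 7 become 0500000007000000: the low byte of each word comes first, so the hex reads "backwards" compared with the usual big-endian notation.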
CI Structure
CI runs in stages:
- Rust: lint, clippy, unit tests, release build
- Integration: layers 1-3
- Differential: PVM vs native WASM
- PVM-in-PVM: layers 4-5 (only if integration passes)
Fixtures
Test programs live in tests/fixtures/:
- wat/: hand-written WAT programs
- assembly/: AssemblyScript programs
- imports/: import maps (.imports) and adapter files (.adapter.wat)
Build Process
tests/build.ts orchestrates three phases:
- Compile AssemblyScript .ts → .wasm (skipped if .wasm exists)
- Compile .wat / .wasm → .jam files
- Compile anan-as compiler.wasm → compiler.jam (for PVM-in-PVM)
Important: Delete cached WASM files before working on fixtures:
rm -f tests/build/wasm/*.wasm
cd tests && bun build.ts
Benchmarks
Run ./tests/utils/benchmark.sh for performance data. For branch comparisons:
./tests/utils/benchmark.sh --base main --current <branch>
Every PR must include benchmark results in its description.