Zero-Copy GPU Inference From WebAssembly on Apple Silicon — Portable KV Caches Are Next

A working prototype shares memory directly between a WebAssembly sandbox and Apple Silicon's Metal GPU, with no copies and no serialization overhead. It was demonstrated with Llama 3.2 1B at ~9ms/token on M1, and restoring a KV cache snapshot is reported to be 5.45x faster than recomputing the cache. The approach enables portable, snapshotable conversation state that can migrate across machines.


A technical post published April 18 details a working prototype that eliminates the memory copy overhead that has long made sandboxed WebAssembly impractical for GPU inference. The approach exploits Apple Silicon's Unified Memory Architecture: since CPU and GPU share physical memory, a WebAssembly linear memory region and a Metal GPU buffer can reference the same pages with no copy between them.
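The idea of two consumers referencing the same pages can be illustrated without any GPU at all. The following is a minimal sketch in Python, not the author's implementation: an anonymous `mmap` region stands in for the Wasm linear memory, and a second `memoryview` over part of it stands in for the GPU buffer. The names `alloc_shared`, `wasm_view`, and `gpu_view` are hypothetical. Metal does offer a no-copy buffer constructor (`makeBuffer(bytesNoCopy:length:options:deallocator:)`) that wraps existing page-aligned memory, but the source does not say which API the prototype uses.

```python
import mmap
import struct

PAGE = mmap.PAGESIZE  # regions shared this way are sized in whole pages


def alloc_shared(n_pages: int) -> mmap.mmap:
    """Anonymous, page-aligned region; a stand-in for Wasm linear memory."""
    return mmap.mmap(-1, n_pages * PAGE)


region = alloc_shared(4)

# Two views over the same pages. Creating a view copies no bytes.
wasm_view = memoryview(region)                 # hypothetical guest side
gpu_view = memoryview(region)[PAGE:2 * PAGE]   # hypothetical GPU-buffer side

# The guest writes four float32 values into the second page...
struct.pack_into("<4f", wasm_view, PAGE, 1.0, 2.0, 3.0, 4.0)

# ...and the other view sees them at once, because it is the same memory.
seen = struct.unpack_from("<4f", gpu_view, 0)
print(seen)
```

On a discrete GPU the second view would instead be a separate allocation in device memory, and the write would have to be followed by an explicit transfer. That transfer is the cost the unified-memory approach is described as avoiding.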

The author validated the technique with matrix multiplication benchmarks before moving to a full Llama 3.2 1B inference test, achieving approximately 9ms per token on M1. For context, that's competitive with non-sandboxed local inference tools — the safety boundary of Wasm normally carries a 30-50% latency penalty that this approach eliminates on Apple Silicon.
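To put those figures side by side, here is a small arithmetic sketch. It assumes the quoted 30-50% penalty applies multiplicatively to per-token latency, which the source does not spell out, and the helper name `with_penalty` is hypothetical.

```python
MS_PER_TOKEN = 9.0  # approximate figure reported in the post

# Throughput implied by the reported latency.
tokens_per_second = 1000.0 / MS_PER_TOKEN


def with_penalty(ms: float, penalty: float) -> float:
    """Latency after adding a fractional overhead (assumed multiplicative)."""
    return ms * (1.0 + penalty)


low = with_penalty(MS_PER_TOKEN, 0.30)
high = with_penalty(MS_PER_TOKEN, 0.50)

print(f"{tokens_per_second:.0f} tokens/s at {MS_PER_TOKEN} ms/token")
print(f"with a 30-50% penalty: {low:.1f}-{high:.1f} ms/token")
```

Under that assumption, roughly 111 tokens per second would drop to somewhere between 74 and 85 tokens per second if the usual sandbox overhead applied.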

The more consequential finding is the KV cache angle. Because the GPU buffer is now just memory that Wasm can address directly, KV caches become serializable artifacts — you can snapshot the full conversation state, write it to disk or send it over the network, and resume inference on a different machine. The author reports a 5.45x speedup over full KV cache recomputation when restoring from a snapshot.
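Snapshotting and restoring can be sketched as plain byte serialization. This is an illustration under stated assumptions, not the post's format: the `KVS1` tag, the header layout, and the `(layers, heads, tokens, head_dim)` shape are all hypothetical, and the payload is written in native byte order, so it assumes both machines share endianness.

```python
import struct
from array import array

MAGIC = b"KVS1"                   # hypothetical format tag
HEADER = struct.Struct("<4s4I")   # magic, layers, heads, tokens, head_dim


def element_count(layers: int, heads: int, tokens: int, head_dim: int) -> int:
    # Factor of 2: one key tensor and one value tensor per layer.
    return layers * 2 * heads * tokens * head_dim


def snapshot(cache: array, layers: int, heads: int,
             tokens: int, head_dim: int) -> bytes:
    """Serialize a flat float32 KV cache into header + raw payload."""
    expected = element_count(layers, heads, tokens, head_dim)
    if cache.typecode != "f" or len(cache) != expected:
        raise ValueError("cache does not match the declared shape")
    # Payload is native byte order (assumption: same endianness on restore).
    return HEADER.pack(MAGIC, layers, heads, tokens, head_dim) + cache.tobytes()


def restore(blob: bytes):
    """Rebuild the cache and its shape from a snapshot blob."""
    magic, layers, heads, tokens, head_dim = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a KV snapshot")
    cache = array("f")
    cache.frombytes(blob[HEADER.size:])
    if len(cache) != element_count(layers, heads, tokens, head_dim):
        raise ValueError("payload size does not match header")
    return cache, (layers, heads, tokens, head_dim)


shape = (2, 4, 3, 8)  # toy sizes, far smaller than a real model
cache = array("f", (float(i) for i in range(element_count(*shape))))
blob = snapshot(cache, *shape)
restored, restored_shape = restore(blob)
```

The blob can be written to disk or sent over a socket like any other byte string. Restoring it replaces the prefill pass that would otherwise recompute keys and values for every earlier token, which is where the reported speedup would come from.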

The implications extend beyond Apple hardware. The core insight is that unified memory architectures collapse the host/accelerator boundary that made sandboxed inference expensive, and in principle it could carry over to AMD's APU lineup and upcoming x86 chips with unified memory. If the approach generalizes, WebAssembly becomes a viable runtime for multi-tenant LLM serving with real isolation guarantees.

This is early research, not a shipping product. But the prototype is real, the benchmarks are reproducible, and the architectural insight is novel. The combination of Wasm-level isolation, GPU-level performance, and portable state is exactly what stateful AI agent infrastructure needs.

Panel Takes

The Builder

Developer Perspective

“Portable KV caches are the sleeper feature here. The ability to snapshot and restore conversation state — with a claimed 5.45x speedup over recomputation — changes how you architect long-running agent sessions. Combine that with Wasm-level sandboxing and you have a credible story for multi-tenant LLM hosting with actual isolation.”

The Skeptic

Reality Check

“This is a single-author prototype on a single hardware platform. 9ms/token on M1 with a 1B parameter model isn't impressive — you can match that with llama.cpp without any of the Wasm complexity. The zero-copy technique also only works on unified memory architectures, which excludes most datacenter GPUs where this would actually matter at scale.”

The Futurist

Big Picture

“If Wasm becomes a viable inference runtime with near-native GPU performance, every major cloud platform gets a new primitive: truly isolated, portable AI workloads that can migrate between machines mid-conversation. Combined with Wasm's browser compatibility, you could run the same inference stack client-side and server-side with identical semantics.”
