Building an Unverified Compiler with Agents

April 16, 2026
Four agents spent 14 days and 93,000 lines of Lean building a verified JS-to-WASM compiler from scratch. The compiler ran; the proofs didn’t close.
Paris Métro Line 14 has run driverless since 1998. Its control software was built with the B method and formally verified before the first train ran.

If you can specify a system with a series of tests, frontier models are surprisingly good at producing code that passes them. But passing tests is not the same as being correct — and when we tried to raise the bar from “passes tests” to “formally verified” — mathematically proving the code is correct — we hit a wall.

This ability to generate test-passing software is being pushed to scales unthinkable just months ago. Anthropic threw 16 Claude agents at building a Rust-based C compiler from scratch, completing a rudimentary frontend capable of compiling the Linux kernel in just two weeks. Cursor’s CEO Michael Truell reports on using agents to build a web browser from scratch; while the code outsourced several key components to existing libraries, the resulting executable was capable of rendering real web pages, measuring over 3 million lines of code.

These systems are impressive. They are also riddled with bugs, frequently broken in ways that are hard to catch. Claude’s C compiler had several bugs, sometimes recreating classic failure modes of hobby C compilers, and other times making errors no human would ever write¹.

In theory, there is a stronger guarantee than testing. Formal methods are techniques for mathematically reasoning about the correctness of software, ranging from static type systems and program analysis to full formal verification — where a theorem prover constructs a machine-checked mathematical proof that code satisfies its specification. Type systems alone have already reshaped everyday software: languages like Rust eliminate entire classes of errors at compile time. Verification has produced landmark artifacts like the CompCert verified C compiler and the seL4 verified OS kernel, and the proof process routinely uncovers bugs that escape every test.

Formal verification has the potential to complement agentic software engineering, producing software that is automatically proved correct. The case for verified software as a solution to insecure AI-generated code is gaining traction in part because verification in theorem provers like Lean is itself a programming task — one that benefits directly from the same model advances driving LLM coding progress.

We set out to test this directly: can agents build a verified compiler using LLMs?

  • What is a verified compiler? We define the problem and explain why a JS-to-WASM compiler is a good landmark challenge for agent capabilities.
  • An initial attempt. We set four Claude Code agents to building a verified JS-to-WASM compiler in Lean 4 over the course of 14 days.
  • Results. Despite producing over 90k lines of Lean code, the agents ultimately failed. We analyse where they succeeded and where they fell short.

The codebase from our experiments is on GitHub.

What does it mean to verify a compiler?

A compiler at a high level is simply an executable that translates programs from one source language (i.e. the JavaScript programming language) to some target output language (i.e. WebAssembly). A verified compiler comes with a mathematical proof that this translation preserves the meaning of the program: that is, if the source program produced some result when it was executed, the compiled program will produce the exact same result.

To state this property mathematically, we need to define what it means for a program to “produce a result.” In formal methods, this is done by writing a semantics: an interpreter that defines the observable behaviour of every program in the language, the values it computes and the side effects it produces. This is to say that the definition of a language, for verification purposes, is nothing other than a kind of interpreter for the language.

The following diagram captures the core correctness property for a compiler. For every source program e, if the source semantics evaluates e to a value v with some observable trace, then compiling e and evaluating the result under the target semantics should produce the same value and trace:

Commuting diagram: JS program e evaluates to (v, trace) via eval_js; compiling e produces WASM module e', which evaluates to the same (v, trace) via eval_wasm. All three functions are agent-generated.

Building a verified compiler requires four artifacts: a semantics for the source language, a semantics for the target language, the compiler itself, and a proof that the diagram commutes. The diagram illustrates this for a JS-to-WASM compiler: eval_js defines the meaning of JS programs, eval_wasm defines the meaning of WASM programs, and compile translates between them. Each semantics can be independently validated against tests or a specification, but only the proof guarantees that the compiler preserves meaning.

This raises an obvious question about what verification actually buys you. Verifying a compiler leaves you with the compiler itself plus a mechanical proof of its correctness. If the reader has to inspect both to trust the verification, they might as well have just read the compiler directly. The point, though, is that the proof body is the one part you don’t have to read, since the theorem prover has already mechanically checked it. What does need reading is everything the proof depends on: the theorem statement, the definitions of eval_js and eval_wasm, and any axioms the proof invokes along the way. Mechanised verification, then, narrows what a human has to read down to the specifications.

To make this a little more concrete, here is what a toy correctness proof looks like in a theorem prover such as Lean. We define a language with just numbers and addition, write a trivial compiler that emits WASM i32.add instructions, and prove that compilation preserves evaluation:

inductive Expr where
  | num : Int → Expr
  | add : Expr → Expr → Expr

inductive WasmExpr where
  | i32_const : Int → WasmExpr
  | i32_add : WasmExpr → WasmExpr → WasmExpr

def eval_js : Expr → Int
  | .num n => n
  | .add a b => eval_js a + eval_js b

def eval_wasm : WasmExpr → Int
  | .i32_const n => n
  | .i32_add a b => eval_wasm a + eval_wasm b

def compile : Expr → WasmExpr
  | .num n => .i32_const n
  | .add a b => .i32_add (compile a) (compile b)

theorem correctness (e : Expr) : eval_wasm (compile e) = eval_js e := by
  induction e with
  | num n => simp [compile, eval_js, eval_wasm]
  | add a b ih_a ih_b => simp [compile, eval_js, eval_wasm, ih_a, ih_b]

The proof begins after := by. induction e splits the goal into one case per constructor of Expr, and simp closes each case by unfolding the definitions of compile, eval_js, and eval_wasm and checking that both sides reduce to the same thing. If a case cannot be discharged, Lean reports an error. If the author wants to leave a case for later, they can write sorry instead of a proof, marking it as an unproved hole that the system will accept but flag. A verified compiler is one with no sorries.
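For instance, replacing one branch of the toy proof with sorry still compiles, but Lean warns that the declaration uses sorry (this is a hypothetical variant of the theorem above, not code from the experiment):

```
-- Lean accepts the file but flags: "declaration uses 'sorry'"
theorem correctness_wip (e : Expr) : eval_wasm (compile e) = eval_js e := by
  induction e with
  | num n => simp [compile, eval_js, eval_wasm]
  | add a b ih_a ih_b => sorry  -- unproved hole: the hard case is deferred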

The proof above is trivial, but it illustrates the basic structure of what a verified compiler looks like: definitions, a theorem statement, and a proof that Lean checks mechanically. Each component of this pipeline is individually something that agents can already do. Agents can write interpreters from test suites. Agents can write compilers that pass conformance tests. Agents have independently written mechanical proofs to close open mathematical problems. The key question is whether they can do all of this together.

A landmark challenge

To test whether agents can build verified software, we need a concrete problem. We chose a JS-to-WASM compiler: a somewhat arbitrary choice, but one that turns out to be well-scoped for testing agent verification capabilities.

  • WebAssembly has a formal semantics. WASM is unusual in that the rules defining how WASM programs execute have already been mechanised in theorem provers (Rocq, Isabelle). This means agents have a formal reference to build from, rather than having to formalise the target language from scratch.

  • JavaScript is well-exercised. Beyond substantial prior formalization work, JavaScript has Test262, one of the most comprehensive conformance suites of any programming language, with over 46,000 tests defining correct behaviour for every language feature. This gives agents a way to validate their JS interpreter against ground truth.

  • No prior work exists. A verified JS-to-WASM compiler has, to our knowledge, never been developed before, so models cannot simply reproduce an existing artifact or proof.

This is not an easy problem. Verified compilers build upon a large body of theory. CompCert, the verified C compiler, took approximately 6 person-years of expert effort and covers a relatively static language. JavaScript has enormous surface area: prototype inheritance, closures, dynamic typing, async/await, generators, proxies. Each of these would require novel formalisation work that does not exist in the literature. A complete verified JS compiler would demand not just proof engineering but new verification theory.

The difficulty is what makes it a good test. If agents could verify a compiler for a language as complex as JavaScript, it would be strong evidence that formal methods can scale to agent-written software in general. A modern mechanized JavaScript semantics would also be independently valuable, both because JavaScript is widely deployed and because formalizing more of the language remains a substantial open problem. And if agents cannot verify a compiler, understanding why they fail tells us what tooling is missing.

An Initial Attempt

On March 7-8, at ARIA’s Scaling Trust Hackathon, we gave a single Claude Code agent a blank Lean 4 project, pointed it at CompCert, the ECMA-262 specification, and the WasmCert-Rocq codebase as references, and set it to work.

Over the course of the weekend, with our initial prompt and the references to prior art, the agent was able to produce a reasonable compilation pipeline, some plausible semantic relations for the source and target languages, and some theorem statements relating the two. The generated compiler binary ran over a small subset of the Test262 conformance suite and produced mostly working WebAssembly code that passed the respective tests (see below).

The compilation pipeline from JS to WASM: JavaScript → (Elaborate) → Core → (Closure Convert) → Flat → (ANF Convert) → ANF → (Lower) → Wasm IR → (Emit) → WebAssembly.

The catch, of course, was in the proofs. Lean, like many other theorem provers, provides a facility to allow leaving holes in a proof (in Lean’s case, this is the sorry tactic), or to declare certain theorems as axioms that are outside the scope of verification. This is useful during development, as it allows work to proceed on other parts of the codebase while leaving difficult proofs for later. While the agent had written something resembling a compiler, upon closer inspection, it had not actually written a verified compiler. Most of the theorem statements it had expressed were either trivial or left unproven:

theorem closureConvert_step_simulation
    (sc sf : State) (rel : CC_SimRel sc sf)
    (t : TraceEvent) (sc' : State) (h : step? sc = some (t, sc')) :
    ∃ sf', steps sf t sf' ∧ CC_SimRel sc' sf' := by
  sorry

Worse, some of the relations the agent produced were uninhabited. Recall from the commuting diagram that eval_js defines what it means for a JS program to produce a result. If the agent writes eval_js such that no program can ever produce a result under it, then the correctness theorem “for every program that produces a result under eval_js, the compiled version produces the same result under eval_wasm” becomes vacuously true. The hypothesis is never satisfied, so the theorem holds for free. The compiler could do anything and the proof would still go through.
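To make the failure concrete, here is a minimal sketch reusing the toy Expr, compile, and eval_wasm from earlier. BadEval is our hypothetical stand-in for an uninhabited semantics, not the agents' actual relation:

```
-- A relational "semantics" with no constructors: no program can ever
-- be shown to produce a result under it.
inductive BadEval : Expr → Int → Prop

-- Correctness stated against it is vacuously true for ANY compiler,
-- because the hypothesis `BadEval e v` can never be satisfied.
theorem vacuous (e : Expr) (v : Int) (h : BadEval e v) :
    eval_wasm (compile e) = v := by
  cases h  -- zero cases to consider, so the proof closes immediately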

Chat showing the agent realising ANF.Behaves is uninhabited, making correctness theorems vacuously true

The codebase was broken, but it had promise. Some of the theorems were reasonable, and the compiler at least ran end to end. Could better prompting and more agents have completed the task? We scaled things up to find out.

Experiment timeline: Mar 7-8 hackathon with 1 agent, then Mar 19 - Apr 1 with 4 autonomous agents

From March 19 to April 1, we ran four Claude Code instances autonomously on a single dedicated server: three worker agents and one supervisor. Anthropic’s own multi-agent experiment had identified coordination failures as a key bottleneck, with agents silently overwriting each other’s work and lacking awareness of the broader codebase. We attempted to pre-empt these issues: each agent ran as a separate Unix user with filesystem permissions preventing agents from editing one another’s files, and we included a supervisor agent to identify cycles and unblock stuck agents.

More details on the agent architecture are in the appendix.

Results

Overall, the agents failed. 93,516 lines of Lean, 3,026 commits, 14 days of autonomous operation, and the compiler does not verify. The sorry count fluctuated up and down over the two weeks with no clear trend towards zero. The proof files grew enormous, but most of the proved cases were trivial. No sorry was closed on any non-trivial property. 14 sorry holes and 17 axioms remain. The full codebase is available, with a per-module breakdown in the appendix.

The per-pass breakdown shows where the remaining holes are:

Pass          LOC      Sorries   Axioms   Builds?
JS → Core     59,573   0         0        ✓
Core → Flat   2,581    2         0        ✗ timeout
Flat → ANF    1,543    11        0        ✓ (sorry)
ANF → WASM    14,591   1         17       ✓
Total         93,516   14        17

Closure conversion and ANF conversion account for 13 of the 14 remaining sorries. The WASM lowering contributes the remaining sorry and 17 axioms, which, as we will see, is worse.

The commuting diagram decomposes a verified compiler into four artifacts: source semantics, target semantics, compiler, and proof. We examine the agents’ output guided by three research questions:

  • Does verification add value beyond testing for agent-written code?
  • Can agents complete the verification autonomously?
  • What can we learn from incomplete verification?

Does verification add value beyond testing?

Formal methods has a track record of catching bugs that tests miss in human-written software. Does the same hold for agent-written code?

The JS spec agent wrote 59,573 lines of formalised semantics with zero sorries. The agent iterated on its interpreter, driven by the Test262 conformance suite, until it passed the target test cases, producing a functioning evaluator for a substantial subset of JavaScript. The semantics had structural issues invisible to tests. The step? function, the small-step evaluation relation at the heart of every semantics, was declared as partial def. In Lean, every function must be proven to terminate before the proof system will reason about it. A partial def is a function that Lean cannot prove terminates, so it treats it as a black box: the function runs just fine, but the proof system refuses to look inside it. Every correctness theorem that depends on step? was stuck, unable to unfold, case-split, or reason about the function’s behaviour.

Mar 20 16:50 — proof agent diagnoses the blocker
"step? is partial def: All three step functions (Core, Flat, ANF) are partial def, making them opaque to Lean's proof system. Cannot unfold, case-split, or reason about behavior."
Mar 20 16:50 — proof agent writes the exact fix
"Recommended fix for jsspec: Make step? non-partial using termination_by sizeOf s.expr."
Mar 22 01:05 — jsspec finally makes Core.step? non-partial (32 hours later)

The change was mechanical (adding termination_by and decreasing_by clauses), but it required changes to the implementation that would never have been made under a test-focused approach. A partial def passes every test just fine. Making it total is only necessary for proof, and the longer a partial definition sits in the codebase with other code building on top of it, the harder it becomes to change. This is exactly the kind of issue that only verification catches.
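To illustrate with a hypothetical toy function (not from the codebase): a partial def executes normally but cannot be unfolded in proofs, while a total version with an explicit termination measure is transparent to them.

```
-- Runs fine, but opaque to the proof system: partial defs have no
-- equation lemmas, so `simp` and `unfold` cannot look inside them.
partial def collatzLen (n : Nat) : Nat :=
  if n ≤ 1 then 0
  else if n % 2 = 0 then 1 + collatzLen (n / 2)
  else 1 + collatzLen (3 * n + 1)

-- A total definition with an explicit measure can be reasoned about.
def halveLen (n : Nat) : Nat :=
  if h : n ≤ 1 then 0 else 1 + halveLen (n / 2)
termination_by n
decreasing_by omega  -- n / 2 < n when n > 1

example : halveLen 0 = 0 := by simp [halveLen]
```

For Collatz-style recursion no termination measure is known, which is why partial def exists at all; the agents' step? functions, by contrast, had straightforward measures like sizeOf s.expr.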

The proof process also drove improvements to the correctness specifications, which in turn constrain the compiler. The commuting diagram gives the high-level correctness property, but a real compiler has multiple passes, and each pass needs its own simulation relation: a precise invariant describing how the source and target states correspond at every step of execution. Getting these relations wrong means the compiler cannot be proved correct, even if it happens to produce the right output on every test. Closure conversion, the pass that lifts free variables into explicit environment objects so that closures can be compiled to flat function calls, is a good example. The simulation relation CC_SimRel must capture how source-level closures correspond to target-level environments and heap objects.

The agents refined this relation through a process that looks remarkably like CEGAR: attempt a proof, discover the relation is too weak (or too strong), refine it, try again.

The proof agent’s first attempt at HeapCorr used exact heap equality (sf.heap = sc.heap), which is too strong for any pass that allocates. The supervisor eventually wrote the corrected version, its first and most useful contribution of actual Lean code:

def HeapCorr (cheap fheap : Core.Heap) : Prop :=
  cheap.objects.size ≤ fheap.objects.size ∧
  ∀ addr, addr < cheap.objects.size →
    cheap.objects[addr]? = fheap.objects[addr]?

Getting here took three days and sixteen escalations between agents, for a definition that a human verification engineer would probably write in an afternoon. But the insight was correct: closure conversion allocates new environment objects, so the target heap is a superset of the source heap.

The final version of CC_SimRel has 8 conjuncts plus the expression correspondence. The well-formedness predicates (ExprAddrWF, EnvAddrWF, HeapValuesWF) are correctly defined:

private def HeapValuesWF (heap : Core.Heap) : Prop :=
  ∀ addr, addr < heap.objects.size →
    ∀ props, heap.objects[addr]? = some props →
      ∀ kv, kv ∈ props → ValueAddrWF kv.2 heap.objects.size

The relation was not general enough (the injection maps HeapInj and EnvCorrInj accept an injMap parameter and then ignore it; the CCState is existentially quantified when it should be Skolemised following CompCert’s approach), but the relation had the right conjuncts, in the right places, for the right reasons. The agents could not close the proofs that depended on it.
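The difference between the two styles can be sketched in miniature (hypothetical toy definitions, not the agents' code): an existential witness must be re-derived at every proof step, while a Skolemised witness is a function that proofs can unfold and compute with.

```
-- Existential style: every lemma must re-invent the witness `st`.
def SimRel_ex (src tgt : Nat) : Prop :=
  ∃ st : Nat, tgt = src + st

-- Skolemised style (CompCert's approach): the witness is a function
-- of the states, so it unfolds definitionally inside proofs and
-- induction hypotheses line up syntactically.
def ccState (src tgt : Nat) : Nat := tgt - src

def SimRel_sk (src tgt : Nat) : Prop :=
  tgt = src + ccState src tgt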

Verification also surfaced bugs in the semantics, though only shallow ones. The six Flat semantics bugs the WASM spec agent fixed (evalUnary .bitNot returning .undefined, toNumber producing wrong results, trace events formatted with repr instead of valueToString, initialState missing the console binding) were all in spec-level definitions, not compiler code, and simple tests would have caught most of them.

The answer, then, is mixed. Verification surfaced structural issues that tests would never catch, and the proof process drove real improvements to both the implementation and the specifications. But the bugs found through verification were shallow, and the deeper issues (uninhabited relations, partial definitions) required human inspection to identify.

Can agents complete the verification autonomously?

This is where things broke down. The closureConvert_step_simulation proof body size tells the story:

The proof agent spent its first week growing the proof body to 240KB by proving 40+ easy cases where closure conversion is structurally an identity (lit, seq, if, while, assign, tryCatch). The supervisor did not flag this on its own; we had to manually direct it to intervene:

"You are avoiding the hard cases. Stop. You have proved 40+ easy cases where closure conversion is identity. The ACTUAL verification is in the cases you keep marking as sorry: captured variable, function call, function definition. YOU MUST ATTEMPT THESE NOW. Not later. Not after 'one more easy case.' NOW."

The proof agent responded by deleting the 240KB proof body entirely and restructuring from scratch. The hard cases (captured variables, function calls, function definitions) remained sorry’d through to the end of the experiment. The ClosureConvertCorrect file grew to 7,302 lines and 239 declarations, but most of the work was mechanical. 122 of those declarations are step-function lemmas like this:

private theorem Flat_step?_break (s : Flat.State) (label : Option Core.LabelName) :
    Flat.step? { s with expr := .break label } =
      some (.error ("break:" ++ label.getD ""), { ... }) := by
  simp [Flat.step?]

These lemmas fall into groups that share identical proof tactics: 18 proved by simp [Core.step?, hnv, hss, Core.pushTrace], 17 by simp [Flat.step?, hss]; split <;> simp_all, 15 by simp [Flat.step?]. The ExprAddrWF_mono lemma has 31 cases, each doing the same thing with minor structural variations. A human would abstract these into a generic evaluation context lemma. The agents wrote each one individually.
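For example, rather than repeating the same tactic script across dozens of lemmas, a human proof engineer would typically wrap it in a tactic macro. The identifiers below are taken from the agents' logs; the macro itself is our sketch, not code from the repository:

```
-- One reusable tactic instead of 30+ copies of the same script:
macro "step_simp" : tactic =>
  `(tactic| simp_all [Flat.step?, Core.step?, Core.pushTrace])

-- Each step-function lemma then becomes a one-liner:
-- private theorem Flat_step?_break ... := by step_simp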

Lean’s simp tactic is the simplifier: it rewrites terms using known equalities and reduction rules, and is usually the first thing to try when proving a goal. As a simple example, if the goal is 1 + 0 = 1, simp will apply the rule that n + 0 = n and close the goal automatically. But carelessly adding rewrite rules can cause simp to loop. If a rule rewrites A to B and another rewrites B back to A, simp will cycle forever. For step?, simp triggers exactly this kind of loop: the equation lemma step?.eq_1 unfolds the function, but the unfolded form contains step? again, causing infinite recursion.
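As a self-contained illustration of the failure mode (hypothetical definitions, not from the experiment), registering both directions of an equality as simp rules creates exactly such a cycle:

```
def f (n : Nat) : Nat := n + 0
def g (n : Nat) : Nat := 0 + n

-- Each rule undoes the other, so simp can rewrite f ↦ g ↦ f ↦ ...
@[simp] theorem f_eq_g (n : Nat) : f n = g n := by simp [f, g]
@[simp] theorem g_eq_f (n : Nat) : g n = f n := by simp [f, g]

example : f 1 = 1 := by
  simp  -- cycles between f and g until simp hits its recursion limit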

Mar 21 — proof agent discovers the loop
"simp only [Flat.step?] loops on step?.eq_1. The split at h approach works but must be applied per-constructor."
Mar 21 — supervisor adds a system note
"SIMP LOOP IN Core/Semantics.lean — FIX NOW. Never pass step? directly to simp. Use unfold step? or simp only [step?.eq_def]."
Mar 22 — proof agent uses the pattern again
"Pattern: simp [step?, h]; rfl"
Mar 28 — jsspec adds @[simp] to 18 eval lemmas, breaking the CC build
"jsspec added @[simp] to 18 eval lemmas. These made simp [Flat.step?, ...] fully close goals that previously needed ; rfl. The trailing rfl now hits 'No goals to be solved'. All 12 errors are in lines 946–1101."

Each agent independently reached for simp [step?], because for most definitions in Lean it is the right tactic. The negative constraint (“not for this particular function”) did not persist across runs.

What can we learn from incomplete verification?

The WASM lowering pass offers a cautionary answer. The file Wasm/Semantics.lean is 12,830 lines long and contains 1 sorry. It includes a forward simulation proof, step_sim, that does real case analysis on every ANF expression form, extracts the corresponding IR code via LowerCodeCorr inversions, and constructs matching IR execution steps with stuttering:

theorem step_sim (prog : ANF.Program) (irmod : IRModule) :
    ∀ (s1 : ANF.State) (s2 : IRExecState) (t : TraceEvent) (s1' : ANF.State),
    LowerSimRel prog irmod s1 s2 → anfStepMapped s1 = some (t, s1') →
    ∃ (s2' : IRExecState) (ir_trace : List TraceEvent),
      IRSteps s2 ir_trace s2' ∧
      LowerSimRel prog irmod s1' s2' ∧
      observableEvents ir_trace = observableEvents [t] := by
  ...  -- 5000+ lines of case analysis

This is nontrivial reasoning about the compiler. It also rests on 17 axioms, including this one:

axiom lower_main_code_corr (prog : ANF.Program) (irmod : IRModule)
    (h : Wasm.lower prog = .ok irmod) :
    LowerCodeCorr prog.main (irInitialState irmod).code

This axiom is not obviously false. It states that if lowering succeeds, the lowered IR code corresponds to the source program’s main expression, which is plausible. But there is no reason to axiomatise it: it should have been proved by induction on the lowering function. If any of these 17 axioms is false, the development is inconsistent, everything follows from False, and the low sorry count is meaningless. A sorry is an honest hole: it flags incomplete work. An axiom is an unverified claim that Lean treats as true. The distinction matters, and current tools do not help agents see it.
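A hypothetical example of why an unchecked axiom is dangerous, and of the auditing Lean does offer: a single false axiom lets any theorem go through, and while #print axioms reports a proof's axiom dependencies, nothing forces anyone to look.

```
-- An inconsistent axiom: together with 0 ≠ 1 it implies False.
axiom all_nats_equal : ∀ a b : Nat, a = b

-- Any theorem at all now "proves", compiler correctness included:
theorem zero_eq_one : (0 : Nat) = 1 := all_nats_equal 0 1

-- Auditing the trusted base is possible, but entirely optional:
#print axioms zero_eq_one  -- depends on axioms: [all_nats_equal]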

The picture that emerges is that verification adds real value to agent-written code, but agents cannot complete the verification autonomously, and even their partial results require careful evaluation.

Conclusion

In 2003, Tony Hoare proposed a grand challenge for computing: a verifying compiler that uses mathematical proof to build confidence in the software it processes. Two decades of formal methods research have chipped away at the edges, verifying individual systems like CompCert and seL4 through extraordinary sustained human effort. The challenge remains open.

Our experiment tested whether agents could meet this challenge. They produced 93k lines of Lean in 14 days, discovered reasonable correctness criteria, and even drove real changes to an implementation through the proof process. They could not close a nontrivial proof. The gap between stating a theorem and proving it remains beyond what current agents can do autonomously.

Why did the agents fail to build a verified compiler? An interesting observation was that the failure modes we saw were different from human ones. Verification engineers typically struggle with scale, the tedium of case-splitting and invariant bookkeeping across a large codebase. Agents handled scale fine. What they could not do was reason at depth: generalise a relation when the current one is too weak, retain a negative constraint across sessions, or coordinate a fix that crosses an agent boundary.

Existing verification tools were designed for humans. Proof assistants like Lean, Rocq, and Isabelle assume a human in the loop who can read error messages, internalise failed attempts, and adjust strategy. Agent-first verification tooling does not exist yet. The findings from each section of this experiment point to the same gap:

  • Agents wrote semantics that passed tests but were opaque to proof. Tools should flag definitions that are well-typed but proof-unfriendly (partial def, uninhabited relations).
  • Agents discovered the right simulation relation through iterative refinement, but could not evaluate whether it was strong enough without attempting a full proof. Tools that can test relation strength, for example by checking whether key relations are inhabited, would accelerate this loop.
  • Agents could not build proof abstractions. 122 mechanical lemmas with identical tactics should have been a single generic lemma. Agents need a way to define custom tactics and metaprograms, with guardrails so they cannot inadvertently break the build.

We believe these are engineering problems, not fundamental limitations. Agents can already write interpreters, compilers, and mathematical proofs individually. What they cannot do is co-develop software and proofs together, and the tools they work with were never designed to help them do so. Building agent-first tooling is the next step towards enabling agentic verified software development.

Contributors

Research: Kiran Gopinathan, Jack Feser

Article: Kiran Gopinathan, Jack Feser, Dan Waxman, Zenna Tavares

Appendix: Agent Architecture

Multi-agent pipeline: JS spec writer, Proof writer, WASM spec writer, supervised by a Supervisor agent

The three worker agents each owned a slice of the codebase with orthogonal goals: a JS spec writer for autoformalising the ECMA-262 natural-language semantics into mechanised form, a WASM spec writer for translating the WebAssembly Rocq mechanisation into Lean, and a dedicated proof writer for implementing and verifying a compiler between the two semantic relations. The spec agents were prompted to ensure their relations were inhabited, proving that the test cases could be explained by the semantic relations they defined. To prevent collusion and preemptive downscoping of their tasks, agents could not read each other’s prompts and logs, nor rewrite their own prompts. The supervisor was given privileges to read each agent’s logs and rewrite their prompts to guide them. Hourly cron jobs restarted each agent if it stopped or crashed.

Appendix: Build Status

Per-module build status from a clean lake build with a 60-second timeout. Sorry and axiom counts via LeanSplitter.

Module                   Build        LOC      Sorries  Axioms

Core
  Syntax                 ✓            344      0        0
  Semantics              ✓            58,549   0        0
  Elaborate              ✓            680      0        0
Flat
  Syntax                 ✓            265      0        0
  Semantics              ✓            1,948    0        0
  ClosureConvert         ✓            368      0        0
ANF
  Syntax                 ✓            108      0        0
  Semantics              ✓            875      0        0
  Convert                ✓            560      0        0
WASM
  Syntax                 ✓            236      0        0
  IR                     ✓            61       0        0
  Lower                  ✓            1,464    0        0
  Semantics              ✓            12,830   1        17
Proofs
  ElaborateCorrect       ✓            22       0        0
  OptimizeCorrect        ✓            15       0        0
  ClosureConvertCorrect  ✗ timeout    7,302    2        0
  ANFConvertCorrect      ✓ (sorry)    8,276    11       0
  LowerCorrect           ✓            57       0        0
  EmitCorrect            ✓            67       0        0
  EndToEnd               ✗ timeout    72       0        0

  1. John Regehr writes: “They are mostly the kind of mistake that one would make if one was implementing a C compiler without reading the standard closely and carefully. They’re surface-level bugs that you would simply not find in a serious compiler. I don’t think we ever found a bug like these in GCC.” ↩︎