Epoch AI / METR · Research · 2026-04-29

MirrorCode: AI Can Already Complete Weeks-Long Coding Tasks — But There's a Catch

Epoch AI and METR's new MirrorCode benchmark shows Claude Opus 4.6 autonomously reimplementing a 16,000-line bioinformatics toolkit — a task estimated to take a human engineer weeks — raising significant questions about AI capability timelines and benchmark validity.

Epoch AI and METR jointly released MirrorCode, a new long-horizon software engineering benchmark, showing that Claude Opus 4.6 can successfully complete coding tasks that researchers estimate would take a human engineer weeks. The benchmark's headline result: Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics CLI toolkit from scratch, given only execute-only access to the original binary and a set of visible test cases.

**How MirrorCode Works**

The benchmark constructs tasks around existing software projects, each consisting of a CLI program the agent must reimplement exactly. The agent receives no source code — only the compiled binary to probe as a black box and test cases to validate against. Tasks span Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. The full benchmark contains 20+ target programs of varying complexity.
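
The black-box comparison described above can be sketched as a small differential-testing harness. This is a minimal illustration, not MirrorCode's actual scoring code: it assumes that "reimplement exactly" means matching the original's exit code and standard output on every test case, which the source does not confirm. The names `Case`, `run`, and `mismatches` are hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Case:
    """One test case (hypothetical structure, not MirrorCode's format)."""
    args: list[str]   # command-line arguments passed to both programs
    stdin: str = ""   # text fed to both programs on standard input


def run(cmd: list[str], case: Case, timeout: float = 10.0) -> tuple[int, str]:
    """Run one program on one case and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def mismatches(reference: list[str], candidate: list[str], cases: list[Case]) -> list[Case]:
    """Return the cases where the candidate's observable behavior differs.

    The reference is treated purely as an oracle: it is executed, never read.
    """
    return [c for c in cases if run(reference, c) != run(candidate, c)]
```

A call such as `mismatches(["./original"], ["./reimplementation"], cases)` (both paths are placeholders) returns an empty list when the two programs agree on every case. Comparing only exit code and stdout is a simplifying choice; a stricter harness might also compare stderr or files written to disk.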

**The Capability Signal**

Results show modern LLMs completing multi-week engineering tasks autonomously at success rates that would have seemed implausible 18 months ago. The bioinformatics toolkit result is the most striking: the model not only completed the task but did so with no access to the implementation, relying entirely on behavioral observation of the binary and the visible test cases.

**The Memorization Caveat**

Researchers flag an important limitation: there is a meaningful risk that AI performance is inflated by memorization of the target programs from training data. The team attempted to mitigate this by detecting and excluding likely-memorized targets, but it remains an open methodological challenge. Real-world weeks-long coding tasks also lack this clean benchmark structure: requirements are ambiguous, dependencies evolve, and there's no oracle to check against.

**Why It Matters**

MirrorCode is the clearest evidence yet that the "weeks-long task horizon" threshold for AI agents — long treated as a distant future capability — may have arrived, at least in controlled conditions. The implications for software employment, developer tooling investment, and AI safety timelines are all significant.

Panel Takes

The Builder

Developer Perspective

As a developer, the 16,000-line reimplementation result is simultaneously impressive and unsettling. The 'execute-only access' constraint is the key — that's closer to real-world reverse engineering than most benchmarks. I'm updating my timeline estimates for when AI will be able to replace junior engineer workflows on well-specified tasks.

The Skeptic

Reality Check

The memorization caveat is doing a lot of work here. Open-source bioinformatics toolkits with test suites are exactly the kind of well-represented training data that LLMs might be partially regurgitating. Real weeks-long engineering work involves ambiguous requirements, stakeholder negotiation, and infrastructure that changes under your feet — none of which appear in this benchmark.

The Futurist

Big Picture

Even with the memorization caveats, MirrorCode represents a methodological advance: long-horizon benchmarks that can't be gamed by short-burst performance. The weeks-long task horizon being demonstrated in controlled settings in April 2026 suggests fully autonomous software agents for well-defined work could arrive in the next 12 to 18 months at current capability growth rates.
