MirrorCode: AI Can Already Complete Weeks-Long Coding Tasks — But There's a Catch

Epoch AI and METR's new MirrorCode benchmark shows Claude Opus 4.6 autonomously reimplementing a 16,000-line bioinformatics toolkit — a task estimated to take a human engineer weeks — raising significant questions about AI capability timelines and benchmark validity.


Epoch AI and METR jointly released MirrorCode, a new long-horizon software engineering benchmark, showing that Claude Opus 4.6 can successfully complete coding tasks that researchers estimate would take a human engineer weeks. The benchmark's headline result: Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics CLI toolkit from scratch, given only execute-only access to the original binary and a set of visible test cases.

**How MirrorCode Works**

Each task is built around an existing software project: a CLI program that the agent must reimplement exactly. The agent receives no source code, only the compiled binary to probe as a black box and test cases to validate against. Tasks span Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. The full benchmark contains 20+ target programs of varying complexity.
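The black-box setup described above amounts to differential testing: run the original program and the reimplementation on the same inputs and compare what each one exposes. The sketch below is a minimal illustration under stated assumptions, not MirrorCode's actual harness. The names `Observation`, `run_cli`, and `compare` are hypothetical, and treating "exactly" as identical stdout, stderr, and exit code is an assumption rather than something the source specifies.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What a black-box run exposes (assumed): stdout, stderr, exit code."""
    stdout: bytes
    stderr: bytes
    returncode: int


def run_cli(command, args, stdin=b"", timeout=10.0):
    """Run a command as a black box and record its observable behavior."""
    proc = subprocess.run(
        [*command, *args],
        input=stdin,
        capture_output=True,
        timeout=timeout,
    )
    return Observation(proc.stdout, proc.stderr, proc.returncode)


def compare(reference, candidate, cases):
    """Return the (args, stdin) cases where the candidate differs from the reference."""
    failures = []
    for args, stdin in cases:
        expected = run_cli(reference, args, stdin)
        actual = run_cli(candidate, args, stdin)
        if expected != actual:
            failures.append((args, stdin))
    return failures


if __name__ == "__main__":
    # Hypothetical stand-ins for a real target binary and its reimplementation:
    # two tiny Python one-liners that upper-case whatever arrives on stdin.
    upper = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    reference = [sys.executable, "-c", upper]
    candidate = [sys.executable, "-c", upper]
    cases = [([], b"acgt\n"), ([], b"")]
    print(compare(reference, candidate, cases))
```

An empty list from `compare` means the candidate matched the reference on every supplied case. It says nothing about inputs outside those cases, which is why an agent in this setting would also need to probe the binary with inputs of its own.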

**The Capability Signal**

Results show modern LLMs handling multi-week engineering tasks autonomously at success rates that would have seemed implausible 18 months ago. The bioinformatics toolkit result is the most striking: the model not only completed the task but did so with no access to the implementation, relying entirely on observing the binary's behavior and on the visible test cases.

**The Memorization Caveat**

Researchers flag an important limitation: there is a meaningful risk that AI performance is inflated by memorization of the target programs from training data. The team attempted to mitigate this by detecting and excluding likely-memorized targets, but it remains an open methodological challenge. Real-world weeks-long coding tasks also lack this clean benchmark structure: requirements are ambiguous, dependencies evolve, and there is no oracle to check against.

**Why It Matters**

MirrorCode is the clearest evidence yet that the "weeks-long task horizon" threshold for AI agents — long treated as a distant future capability — may have arrived, at least in controlled conditions. The implications for software employment, developer tooling investment, and AI safety timelines are all significant.

Panel Takes

The Builder

Developer Perspective

“As a developer, the 16,000-line reimplementation result is simultaneously impressive and unsettling. The 'execute-only access' constraint is the key — that's closer to real-world reverse engineering than most benchmarks. I'm updating my timeline estimates for when AI will be able to replace junior engineer workflows on well-specified tasks.”

The Skeptic

Reality Check

“The memorization caveat is doing a lot of work here. Open-source bioinformatics toolkits with test suites are exactly the kind of well-represented training data that LLMs might be partially regurgitating. Real weeks-long engineering work involves ambiguous requirements, stakeholder negotiation, and infrastructure that changes under your feet — none of which appear in this benchmark.”

The Futurist

Big Picture

“Even with the memorization caveats, MirrorCode represents a methodological advance: long-horizon benchmarks that can't be gamed by short-burst performance. The weeks-long task horizon being demonstrated in controlled settings in April 2026 suggests fully autonomous software agents for well-defined work could arrive in the next 12-18 months at current capability growth rates.”
