OpenAI Declares SWE-bench Verified Dead — Contamination and Flawed Tests End a Key AI Coding Yardstick
OpenAI published findings that SWE-bench Verified is no longer a reliable measure of frontier coding capability — 59.4% of audited problems have flawed test cases, all major models show training contamination, and progress has stalled. The industry is pivoting to SWE-bench Pro.
## The Benchmark That Moved AI Coding Is Over
SWE-bench Verified, the dataset that became the standard leaderboard for AI coding agents over the past two years, has been declared no longer fit for purpose by OpenAI. In a detailed analysis published this week, the company found two disqualifying problems: systematic test case flaws that penalize correct solutions, and pervasive training contamination across every major frontier model.
## The Two Fatal Problems
**Flawed test cases:** OpenAI's audit found that at least 59.4% of the problems it audited have test cases that reject functionally correct solutions. A model can produce working code and still fail the benchmark because the test was written incorrectly, which means scores have been noisier than the community assumed.
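To make this failure mode concrete, here is a minimal, hypothetical sketch. It is not taken from the SWE-bench dataset or from OpenAI's audit; the functions `parse_port_reference`, `parse_port_alternative`, and both test helpers are invented for illustration. Both fixes reject out-of-range ports, but the overly specific test also pins the exact error message from the reference patch, so it fails the alternative fix even though its behavior is correct.

```python
def parse_port_reference(value: str) -> int:
    """Hypothetical reference fix: reject ports outside 1-65535."""
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range")
    return port


def parse_port_alternative(value: str) -> int:
    """Hypothetical alternative fix: same behavior, different error wording."""
    port = int(value)
    if port < 1 or port > 65535:
        raise ValueError(f"invalid port: {port}")
    return port


def overly_specific_test(parse) -> bool:
    """Passes only if the error message matches the reference patch exactly."""
    try:
        parse("70000")
    except ValueError as exc:
        return str(exc) == "port out of range"
    return False


def behavioral_test(parse) -> bool:
    """Passes if valid ports parse and out-of-range ports raise ValueError."""
    if parse("8080") != 8080:
        return False
    try:
        parse("70000")
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    for fix in (parse_port_reference, parse_port_alternative):
        print(fix.__name__, overly_specific_test(fix), behavioral_test(fix))
```

A test suite built like `overly_specific_test` measures agreement with one particular patch rather than correctness, which is the kind of flaw the audit describes.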
**Training contamination:** Every frontier model tested — GPT-5.x, Claude Opus variants, Gemini — was able to reproduce verbatim bug fixes or exact problem statement wording for a subset of SWE-bench tasks. This is strong evidence that all of them were trained on benchmark data, making improvements on the set increasingly reflective of memorization rather than generalizable engineering skill.
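The article does not say how verbatim reproduction was detected. One common heuristic, sketched here as an assumption and not as OpenAI's method, is to measure what fraction of a reference patch's word n-grams reappear in a model's output. The function names `ngrams` and `verbatim_overlap`, the whitespace tokenization, and the default `n = 5` are all illustrative choices.

```python
from __future__ import annotations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(model_output: str, reference: str, n: int = 5) -> float:
    """Fraction of the reference's n-grams that also occur in model_output.

    Returns 0.0 when the reference is shorter than n tokens, since there
    is nothing to compare.
    """
    reference_grams = ngrams(reference, n)
    if not reference_grams:
        return 0.0
    output_grams = ngrams(model_output, n)
    return len(reference_grams & output_grams) / len(reference_grams)
```

A score near 1.0 on long spans is hard to explain by chance and hints at memorization, while a low score does not rule contamination out, because a model can paraphrase what it has seen.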
## The Saturation Signal
The data tells the story: the state-of-the-art score on SWE-bench Verified rose from 74.9% to 80.9% over the last six months, a gain of only 6 percentage points. Combined with the contamination findings, that slowdown suggests models are approaching the ceiling of what benchmark gaming can achieve, rather than showing genuine capability improvements.
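The arithmetic behind the figures above can be checked with a short sketch. The two scores and the six-month window come from the article; the helper name `summarize_progress` and the nominal 100% ceiling are assumptions, and the flawed test cases described earlier suggest the practical ceiling is lower than 100%.

```python
def summarize_progress(start: float, end: float, months: int) -> dict[str, float]:
    """Summarize a score change given in percent over a number of months."""
    gain = round(end - start, 1)
    return {
        "gain_points": gain,                          # percentage points gained
        "gain_per_month": round(gain / months, 2),    # average monthly gain
        "nominal_headroom": round(100.0 - end, 1),    # assumes a 100% ceiling
    }


summary = summarize_progress(start=74.9, end=80.9, months=6)
print(summary)
```

On these inputs the gain works out to 6.0 points, or about 1.0 point per month, with 19.1 points of nominal headroom left.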
## What Comes Next: SWE-bench Pro
Scale AI's SWE-bench Pro is the emerging replacement. With 1,865 long-horizon tasks drawn from public, held-out, and commercial codebases — explicitly designed to minimize contamination — it provides a harder and more realistic target. The signal: GPT-5 and Claude Opus 4.1 both score around 23% on SWE-bench Pro, versus 80%+ on Verified. There's real room to measure progress again.
## Why This Matters
Benchmark retirement is painful for the industry because marketing, fundraising, and hiring claims are built on leaderboard positions. But the alternative — continuing to optimize for a broken signal — is worse. The field's willingness to sunset SWE-bench Verified is actually a healthy sign that the engineering culture around evals is maturing.
## Panel Takes
The Builder
Developer Perspective
“Honestly, I stopped trusting SWE-bench scores months ago when I'd see a model top the leaderboard and still struggle with my actual codebase. SWE-bench Pro's 23% scores are humbling but they finally feel honest.”
The Skeptic
Reality Check
“OpenAI is publishing this analysis while being one of the contaminating parties. That's convenient — retiring a benchmark you've maxed out lets you reset the leaderboard. SWE-bench Pro will face the same contamination problems in 18 months.”
The Futurist
Big Picture
“The benchmark treadmill is a fundamental problem: every eval eventually gets optimized into uselessness. The real answer is continuous, held-out benchmarking against real-world tasks — something the industry hasn't solved yet but urgently needs.”