OpenAI Declares SWE-bench Verified Dead — Contamination and Flawed Tests End a Key AI Coding Yardstick

OpenAI published findings that SWE-bench Verified is no longer a reliable measure of frontier coding capability — 59.4% of audited problems have flawed test cases, all major models show training contamination, and progress has stalled. The industry is pivoting to SWE-bench Pro.

## The Benchmark That Moved AI Coding Is Over

OpenAI has declared SWE-bench Verified, the dataset that became the standard leaderboard for AI coding agents over the past two years, no longer fit for purpose. In a detailed analysis published this week, the company identified two disqualifying problems: systematic test case flaws that penalize correct solutions, and pervasive training contamination across every major frontier model.

## The Two Fatal Problems

**Flawed test cases:** OpenAI's audit found that at least 59.4% of the problems it audited have test cases that reject functionally correct solutions. A model can produce working code and still fail the task because the test itself is flawed. Scores have therefore always been noisier than the community assumed.
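To make the failure mode concrete, here is a minimal sketch of one hypothetical way a test can reject a correct fix: it asserts on the exact wording of an error message instead of on behavior. The `parse_port` functions and both tests are invented for illustration and are not taken from SWE-bench or from OpenAI's audit.

```python
def parse_port_reference(value: str) -> int:
    """Hypothetical reference patch the benchmark test was written against."""
    port = int(value)
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def parse_port_candidate(value: str) -> int:
    """Hypothetical model patch: same behavior, different error wording."""
    port = int(value)
    if port < 0 or port > 65535:
        raise ValueError(f"invalid port {port}")
    return port


def narrow_test(parse) -> bool:
    """Flawed test: passes only if the message matches the reference text."""
    try:
        parse("70000")
    except ValueError as exc:
        return str(exc) == "port out of range: 70000"
    return False


def behavioral_test(parse) -> bool:
    """Sounder test: checks only the behavior the task asks for."""
    try:
        parse("70000")
    except ValueError:
        return parse("8080") == 8080
    return False


for name, fn in [("reference", parse_port_reference),
                 ("candidate", parse_port_candidate)]:
    print(name, "narrow:", narrow_test(fn), "behavioral:", behavioral_test(fn))
```

Both patches pass the behavioral test, but only the reference patch passes the narrow one, so a leaderboard built on the narrow test would count the candidate's correct fix as a failure.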

**Training contamination:** Every frontier model tested — GPT-5.x, Claude Opus variants, Gemini — was able to reproduce verbatim bug fixes or exact problem statement wording for a subset of SWE-bench tasks. This is strong evidence that all of them were trained on benchmark data, making improvements on the set increasingly reflective of memorization rather than generalizable engineering skill.
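One generic way to look for this kind of verbatim reproduction is to measure how many word n-grams from a reference patch reappear in a model's output. The sketch below is an illustrative assumption, not the method OpenAI describes; the function names, the whitespace tokenization, and the default `n` are all hypothetical choices.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the reference's n-grams that also occur in the output.

    Tokens are split on whitespace. Returns 0.0 when the reference is
    shorter than n tokens, since there is nothing to compare.
    """
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    out = ngrams(output.split(), n)
    return len(ref & out) / len(ref)


# Hypothetical reference fix and two hypothetical model outputs.
reference_patch = "if value is None: return default_value else: return int(value)"
memorized = reference_patch
independent = "try: result = int(value) except TypeError: result = default_value"

print(verbatim_overlap(memorized, reference_patch, n=3))
print(verbatim_overlap(independent, reference_patch, n=3))
```

A high overlap on a long `n` is suggestive rather than conclusive: short or formulaic fixes can match by chance, so a score like this would only be a first filter before manual review.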

## The Saturation Signal

State-of-the-art scores on SWE-bench Verified rose from 74.9% to 80.9% over the last six months, a gain of six percentage points. That is a slowdown, and combined with the contamination findings it suggests that the remaining gains owe more to fitting the benchmark than to genuine capability improvements.

## What Comes Next: SWE-bench Pro

Scale AI's SWE-bench Pro is the emerging replacement. With 1,865 long-horizon tasks drawn from public, held-out, and commercial codebases — explicitly designed to minimize contamination — it provides a harder and more realistic target. The signal: GPT-5 and Claude Opus 4.1 both score around 23% on SWE-bench Pro, versus 80%+ on Verified. There's real room to measure progress again.
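The "room to measure progress" point is simple arithmetic, sketched below using the figures quoted in this article. The `headroom` helper is a hypothetical name, and 23.0 stands in for "around 23%", so treat the outputs as approximate.

```python
def headroom(score_pct: float) -> float:
    """Percentage points remaining before a benchmark reads 100%."""
    return round(100.0 - score_pct, 1)


verified_sota = 80.9  # state-of-the-art on SWE-bench Verified, per the article
pro_score = 23.0      # approximate: the article says "around 23%"

print(headroom(verified_sota))  # 19.1
print(headroom(pro_score))      # 77.0
```

If a share of Verified's tests really do reject correct solutions, its practical ceiling sits below 100%, so the usable headroom there is smaller still than this calculation shows.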

## Why This Matters

Benchmark retirement is painful for the industry because marketing, fundraising, and hiring claims are built on leaderboard positions. But the alternative — continuing to optimize for a broken signal — is worse. The field's willingness to sunset SWE-bench Verified is actually a healthy sign that the engineering culture around evals is maturing.

## Panel Takes

**The Builder (Developer Perspective):** “Honestly, I stopped trusting SWE-bench scores months ago when I'd see a model top the leaderboard and still struggle with my actual codebase. SWE-bench Pro's 23% scores are humbling but they finally feel honest.”

**The Skeptic (Reality Check):** “OpenAI is publishing this analysis while being one of the contaminating parties. That's convenient: retiring a benchmark you've maxed out lets you reset the leaderboard. SWE-bench Pro will face the same contamination problems in 18 months.”

**The Futurist (Big Picture):** “The benchmark treadmill is a fundamental problem: every eval eventually gets optimized into uselessness. The real answer is continuous, held-out benchmarking against real-world tasks, something the industry hasn't solved yet but urgently needs.”
