AI Agent Costs Are Rising Faster Than Capabilities — Toby Ord's METR Analysis

Philosopher and AI safety researcher Toby Ord analyzes METR's agent time-horizon benchmarks and finds a troubling pattern: as agent capabilities improve, hourly compute costs scale up in parallel — with top models costing $40-$350/hr, potentially exceeding human engineer rates.

Original source

Toby Ord, the Oxford philosopher and AI safety researcher, published a sharp analysis today that's currently sitting at #3 on Hacker News with 226 points. Using data from METR's agent time-horizon evaluations, Ord argues that published improvements in AI agent capability have a hidden cost: they're partly purchased with dramatically increasing compute expenditures per hour of agent operation.

The numbers are stark. OpenAI's o3 at full capacity costs approximately $40 per hour of effective agent operation. More capable configurations, with extended thinking and higher sampling budgets, reach $350 per hour in some benchmarks — well above the fully-loaded cost of a mid-senior software engineer at roughly $120/hr. Ord notes that several headline "improvement" figures in METR evaluations, when adjusted for compute cost, show much more modest capability-per-dollar gains.

The analysis raises a pointed question for anyone building agent products: are you building on a foundation that gets cheaper over time (as GPU prices and model efficiency improve) or one where capability and cost are structurally coupled in a way that caps the addressable market?

Ord is careful to note that this isn't necessarily a fundamental barrier — hardware improvements and algorithmic efficiency could break the coupling. But he argues the current trend warrants much more attention than it's getting, especially given that most agent product pitches implicitly assume cost will approach zero as capability increases.

The piece has also reignited debate about whether METR's time-horizon metric is the right way to measure agent progress, with several researchers arguing in the HN comments that it systematically rewards compute-lavish approaches over efficient ones.

Panel Takes

The Builder

Developer Perspective

“This should be required reading for anyone pitching an agent product. The unit economics are brutal right now — you can't build a $100/month SaaS if the underlying agent costs $40/hr to run. The products that survive will be ones that minimize agent invocation through smart scripting and condition checks, not ones that run agents all day.”

The Skeptic

Reality Check

“Ord is doing important work here. The agent hype cycle has largely ignored cost — everyone cites benchmark improvements without noting that those benchmarks were achieved with 10x more compute than last year's baseline. 'Getting better' and 'getting more expensive to run' are being conflated constantly.”

The Futurist

Big Picture

“Historical precedent suggests compute costs do fall faster than capability grows — that's the entire arc of the last 20 years of Moore's Law. The current coupling may be a phase transition artifact rather than a permanent structural feature. But Ord is right that we shouldn't assume it away without empirical evidence.”

Panel Takes

Bookmarks