ClawBench
153 real-world browser tasks, live websites — best AI agent scores only 33%
Expert verdict
Ship
3-1
The Panel's Take
ClawBench is a browser agent evaluation framework built around 153 real-world tasks running on 144 live production websites, not simulated environments or curated sandboxes. Tasks span e-commerce, travel booking, SaaS dashboards, government portals, and developer tools. A built-in request interceptor blocks genuinely irreversible actions (payments, form submissions that send data) so evaluations can run safely on real sites.

The benchmark records five layers of data per run: session replays, screenshots at each decision point, raw HTTP traffic, agent reasoning traces, and browser action sequences. This makes failure analysis tractable: you can see exactly which DOM element the agent misidentified, not just a final score. The dataset is open and the evaluation harness is reproducible.

The headline finding is sobering: Claude Sonnet 4.6, the best performer, completes only 33.3% of tasks. GLM-5 is second at 24.2%. No model exceeds 50% on any individual task category. The implication is that current browser agents are far from autonomous on the open web, and the gap between performance on sandboxed benchmarks and performance on live production sites is still enormous.
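To make the request-interceptor idea concrete, here is a minimal sketch of how such a filter might decide what to block. It is not ClawBench's implementation: the `Request` type, the `should_block` and `intercept` functions, and the rule set (state-changing HTTP methods combined with path segments such as `checkout` or `payment`) are all hypothetical, chosen only to illustrate the "block irreversible actions, allow everything else" policy described above.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical rules for illustration only; the real benchmark's
# blocking criteria are not described in this review.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
IRREVERSIBLE_SEGMENTS = {"checkout", "payment", "payments", "order",
                         "orders", "purchase", "submit"}


@dataclass(frozen=True)
class Request:
    """A simplified stand-in for an outgoing browser request."""
    method: str
    url: str


def should_block(request: Request) -> bool:
    """Return True when the request looks like an irreversible action."""
    if request.method.upper() not in STATE_CHANGING_METHODS:
        return False  # reads such as GET or HEAD pass through in this sketch
    path = urlparse(request.url).path.lower()
    # Compare whole path segments so "pay" inside "payload" cannot match.
    segments = {part for part in path.split("/") if part}
    return bool(segments & IRREVERSIBLE_SEGMENTS)


def intercept(requests):
    """Split a sequence of requests into (allowed, blocked) lists."""
    allowed, blocked = [], []
    for req in requests:
        (blocked if should_block(req) else allowed).append(req)
    return allowed, blocked


if __name__ == "__main__":
    traffic = [
        Request("GET", "https://shop.example/cart"),
        Request("POST", "https://shop.example/api/search"),
        Request("POST", "https://shop.example/checkout/confirm"),
    ]
    allowed, blocked = intercept(traffic)
    for req in blocked:
        print("blocked:", req.method, req.url)
```

In a real harness a predicate like this would presumably be attached to the browser's network layer (for example, a route handler in a browser automation library) so that blocked requests are aborted before they reach the site. A URL-pattern rule set is the simplest option, but it would need per-site tuning, since many sites submit data through endpoints whose paths reveal nothing about what they do.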
The reviews
“The five-layer recording (replays, HTTP traffic, reasoning traces) is the right approach for actual debugging — finally a benchmark where failure analysis is tractable. The 33% score also sets honest expectations for teams planning to ship production browser agents right now.”
“Live website testing is a double-edged sword: sites change their DOM, anti-bot measures evolve, and a task that passes today may fail next week with no code change. Benchmark drift on live websites could make ClawBench scores meaningless over 6-month periods without constant maintenance.”
“33% on live websites is actually more impressive than it sounds given the adversarial diversity of the real web. The trajectory from 5% in 2024 to 33% in 2026 means we're likely crossing 60% in 18 months — at which point browser agents start displacing RPA software at scale.”
“As someone who uses browser agents for research and competitor monitoring, the failure mode analysis is exactly what I need. Knowing which website categories agents handle well (dev tools) vs. poorly (government portals) helps me route tasks appropriately right now.”