Question 1

Which is better: AI-Scientist-v2 or ClawBench?

Accepted Answer

Based on our expert panel, ClawBench has the stronger verdict: it received Ship, with a 75% Ship rate, while AI-Scientist-v2 received a panel verdict of Mixed.

Question 2

Is AI-Scientist-v2 free?

Accepted Answer

AI-Scientist-v2 pricing: Free / Open Source (custom license)

Question 3

Is ClawBench free?

Accepted Answer

ClawBench pricing: Free / Research

Question 4

What do experts say about AI-Scientist-v2 vs ClawBench?

Accepted Answer

AI-Scientist-v2: AI-Scientist-v2 is Sakana AI's second-generation autonomous research system that generates scientific papers end-to-end — from hypothesis formation through experimentation, data analysis, and manuscript writing. It's historically notable for producing the first AI-authored workshop paper accepted through peer review.

The v2 system removes reliance on human-authored templates that constrained the original, instead using a progressive agentic tree search guided by an experiment manager agent. This makes it more exploratory across ML domains, though Sakana acknowledges it trades v1's high template success rate for broader generalization with lower per-run success.
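To make the idea of a guided tree search concrete, here is a minimal sketch of a generic best-first search. It is not Sakana's implementation: the function name `best_first_search`, the `expand` and `score` callbacks, and the integer toy problem are all assumptions chosen for illustration. In a real system, `score` would stand in for whatever signal the experiment manager uses to rank candidate experiments.

```python
import heapq
from typing import Callable, Iterable, TypeVar

Node = TypeVar("Node")

def best_first_search(
    root: Node,
    expand: Callable[[Node], Iterable[Node]],
    score: Callable[[Node], float],
    budget: int,
) -> Node:
    """Expand the most promising node first until the budget is spent.

    Hypothetical sketch: `expand` proposes follow-up nodes and `score`
    rates how promising a node is (higher is better).
    """
    counter = 0  # tie-breaker so nodes themselves are never compared
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-score(root), counter, root)]
    best = root
    for _ in range(budget):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        if score(node) > score(best):
            best = node
        for child in expand(node):
            counter += 1
            heapq.heappush(frontier, (-score(child), counter, child))
    return best

# Toy problem: nodes are integers, each node branches into two children,
# and the "best" node is the one closest to 11.
def toy_expand(n: int) -> list:
    return [2 * n, 2 * n + 1] if n < 16 else []

def toy_score(n: int) -> float:
    return -abs(n - 11)

result = best_first_search(1, toy_expand, toy_score, budget=20)
```

The budget caps how many nodes are expanded, which loosely mirrors the trade-off described above: a broader search explores more options, but any single branch is less likely to pay off.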

Costs run roughly $20-25 per full research run using Claude 3.5 Sonnet. The system integrates with Semantic Scholar for literature review and supports OpenAI, Gemini, and Claude via AWS Bedrock. The custom license requires disclosure of AI use in resulting publications, a meaningful ethical constraint for a system that could otherwise flood conferences with AI-generated submissions.

ClawBench: ClawBench is a browser agent evaluation framework built around 153 real-world tasks running on 144 live production websites, not simulated environments or curated sandboxes. Tasks span e-commerce, travel booking, SaaS dashboards, government portals, and developer tools. A built-in request interceptor blocks genuinely irreversible actions (payments, form submissions that send data) so evaluations can run safely on real sites.
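As an illustration of how a request interceptor of this kind might decide what to block, here is a minimal sketch. The rule set, the `Request` type, and the `should_block` function are hypothetical and are not taken from ClawBench's code. The sketch assumes that blocking can be decided from the HTTP method and URL path alone, which a real interceptor would likely refine.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical rule set: these keywords are illustrative assumptions,
# not ClawBench's actual configuration.
BLOCKED_PATH_KEYWORDS = ("checkout", "payment", "submit", "order")
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

@dataclass(frozen=True)
class Request:
    method: str
    url: str

def should_block(request: Request) -> bool:
    """Return True when a request looks irreversible and should be stopped."""
    if request.method.upper() not in STATE_CHANGING_METHODS:
        return False  # read-only requests (GET, HEAD) pass through
    path = urlparse(request.url).path.lower()
    return any(keyword in path for keyword in BLOCKED_PATH_KEYWORDS)

browse = Request("GET", "https://shop.example/checkout")
pay = Request("POST", "https://shop.example/checkout/confirm")
search = Request("POST", "https://shop.example/search")
```

Under these assumed rules, viewing the checkout page is allowed, confirming the purchase is blocked, and a POST-based search still goes through.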

The benchmark records five layers of data per run: session replays, screenshots at each decision point, raw HTTP traffic, agent reasoning traces, and browser action sequences. This makes failure analysis tractable — you can see exactly which DOM element the agent misidentified, not just a final score. The dataset is open and the evaluation harness is reproducible.
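One way to picture the five recorded layers is as a single record per run. The sketch below is an assumption about shape only: the `RunRecord` class, its field names, and the `first_bad_action` helper are hypothetical and do not reflect ClawBench's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class RunRecord:
    """One evaluation run, one field per recorded layer (names are hypothetical)."""
    task_id: str
    session_replay: str = ""  # path or ID of the replay
    screenshots: List[str] = field(default_factory=list)
    http_traffic: List[Dict[str, str]] = field(default_factory=list)
    reasoning_trace: List[str] = field(default_factory=list)
    browser_actions: List[Dict[str, str]] = field(default_factory=list)

def first_bad_action(record: RunRecord, valid_selectors: Set[str]) -> Optional[int]:
    """Return the index of the first action that targets an unexpected element."""
    for index, action in enumerate(record.browser_actions):
        if action.get("selector") not in valid_selectors:
            return index
    return None

record = RunRecord(
    task_id="demo-task",
    browser_actions=[
        {"type": "click", "selector": "#search"},
        {"type": "click", "selector": "#newsletter-signup"},  # wrong element
        {"type": "click", "selector": "#add-to-cart"},
    ],
)
bad_index = first_bad_action(record, {"#search", "#add-to-cart"})
```

With the action sequence stored alongside the reasoning trace and screenshots, the index returned here could be used to pull up the matching trace entry and screenshot for that step.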

The headline finding is sobering: Claude Sonnet 4.6, the best performer, completes only 33.3% of tasks. GLM-5 is second at 24.2%. No model exceeds 50% on any individual task category. The implication is stark — current browser agents are far from autonomous on the open web, and the gap between benchmark performance and production performance is still enormous.
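To put the percentages in terms of tasks, the short calculation below converts each completion rate into an approximate task count out of 153. It assumes each task is scored as a simple pass or fail and that the reported rates are taken over all 153 tasks. The source does not confirm either assumption.

```python
TOTAL_TASKS = 153  # task count stated in the benchmark description

# Completion rates as reported above.
completion_rates = {
    "Claude Sonnet 4.6": 0.333,
    "GLM-5": 0.242,
}

def approx_tasks_completed(rate: float, total: int = TOTAL_TASKS) -> int:
    """Convert a completion rate into a whole number of tasks (assumes pass/fail scoring)."""
    return round(rate * total)

task_counts = {
    model: approx_tasks_completed(rate)
    for model, rate in completion_rates.items()
}
```

Under those assumptions, the best performer completes about 51 of 153 tasks and the runner-up about 37, leaving roughly two thirds of tasks unsolved even by the strongest model.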
