Question 1

Which is better: ClawBench or OpenMythos?

Accepted Answer

Both tools received a panel verdict of Ship. Our expert panel gives ClawBench the stronger verdict, with a 75% Ship rate.

Question 2

Is ClawBench free?

Accepted Answer

ClawBench pricing: Free / Research

Question 3

Is OpenMythos free?

Accepted Answer

OpenMythos pricing: Open Source (Apache 2.0)

Question 4

What do experts say about ClawBench vs OpenMythos?

Accepted Answer

ClawBench: ClawBench is a browser agent evaluation framework built around 153 real-world tasks running on 144 live production websites — not simulated environments or curated sandboxes. Tasks span e-commerce, travel booking, SaaS dashboards, government portals, and developer tools. A built-in request interceptor blocks genuinely irreversible actions (payments, form submissions that send data) so evaluations can run safely on real sites.
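To make the interceptor idea concrete, here is a minimal sketch of how such a filter might decide what to block. It is not ClawBench's actual implementation: the `Request` type, the `should_block` function, the method list, and the allowlisted paths are all hypothetical names chosen for illustration, and the rule (block any state-changing HTTP method unless the path is known to be harmless) is an assumption.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumption: state-changing HTTP methods are treated as potentially irreversible.
MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

# Assumption: a hypothetical allowlist of paths where a POST is harmless (e.g. search).
SAFE_PATHS = {"/search", "/api/search"}


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for an intercepted browser request."""
    method: str
    url: str


def should_block(req: Request) -> bool:
    """Return True when the request could change state on the live site."""
    if req.method.upper() not in MUTATING_METHODS:
        return False  # reads (GET, HEAD, ...) pass through untouched
    path = urlparse(req.url).path
    return path not in SAFE_PATHS


if __name__ == "__main__":
    for r in (
        Request("GET", "https://shop.example/cart"),
        Request("POST", "https://shop.example/search"),
        Request("POST", "https://shop.example/checkout/pay"),
    ):
        print(r.method, r.url, "blocked" if should_block(r) else "allowed")
```

A real interceptor would likely need per-site rules, since some sites change state through GET requests and some harmless actions use POST.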

The benchmark records five layers of data per run: session replays, screenshots at each decision point, raw HTTP traffic, agent reasoning traces, and browser action sequences. This makes failure analysis tractable — you can see exactly which DOM element the agent misidentified, not just a final score. The dataset is open and the evaluation harness is reproducible.
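As a rough illustration, the five layers could be modeled as one record per run. This is a sketch under assumptions: the `RunRecord` class, its field names, and the `missing_layers` helper are hypothetical and do not reflect the real ClawBench dataset schema.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical per-run record holding the five layers named above."""
    task_id: str
    session_replay: str = ""                              # path or ID of the replay
    screenshots: list = field(default_factory=list)       # one per decision point
    http_traffic: list = field(default_factory=list)      # raw request/response pairs
    reasoning_trace: list = field(default_factory=list)   # agent reasoning steps
    browser_actions: list = field(default_factory=list)   # clicks, typing, navigation

    def missing_layers(self) -> list:
        """List the layers that were not captured for this run."""
        layers = {
            "session_replay": self.session_replay,
            "screenshots": self.screenshots,
            "http_traffic": self.http_traffic,
            "reasoning_trace": self.reasoning_trace,
            "browser_actions": self.browser_actions,
        }
        return [name for name, value in layers.items() if not value]
```

Keeping the layers side by side in one record is what lets a reviewer line up a failed action with the screenshot and reasoning step that preceded it.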

The headline finding is sobering: Claude Sonnet 4.6, the best performer, completes only 33.3% of tasks. GLM-5 is second at 24.2%. No model exceeds 50% on any individual task category. The implication is stark: current browser agents are far from autonomous on the open web, and the gap between benchmark performance and production performance is still enormous.

OpenMythos: OpenMythos is a PyTorch reconstruction of the suspected architecture underlying Anthropic's Claude Mythos model, built entirely from published research. Creator Kye Gomez hypothesizes that Mythos uses a Recurrent-Depth Transformer (RDT), in which a subset of transformer layers loops multiple times per forward pass with shared weights rather than stacking unique layers. This allows the model to simulate "thinking" by iterating over the same compute graph, giving it emergent chain-of-thought behavior without explicit CoT prompting.
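The weight-sharing idea can be shown with a toy example. The sketch below is not OpenMythos code and uses plain Python rather than PyTorch so it runs anywhere; the "block" is a simple residual `tanh` layer standing in for a transformer layer, and every function name is hypothetical. It shows that looping one shared block N times gives N steps of depth while storing only one block's parameters.

```python
import math
import random


def make_block(dim, seed):
    """Create one toy 'layer': a dim x dim weight matrix (stand-in for a transformer block)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, h):
    """Residual update h + tanh(W h)."""
    return [hi + math.tanh(sum(w * x for w, x in zip(row, h)))
            for hi, row in zip(h, weights)]


def recurrent_depth_forward(shared, h, loops):
    """Recurrent depth: reuse the same weights on every iteration."""
    for _ in range(loops):
        h = apply_block(shared, h)
    return h


def stacked_forward(blocks, h):
    """Standard depth: a separate weight matrix per layer."""
    for weights in blocks:
        h = apply_block(weights, h)
    return h


def param_count(blocks):
    """Total number of stored weights across the given blocks."""
    return sum(len(w) * len(w[0]) for w in blocks)


if __name__ == "__main__":
    dim, depth = 4, 6
    shared = make_block(dim, seed=0)
    stacked = [make_block(dim, seed=i) for i in range(depth)]
    print("shared-weight params:", param_count([shared]))
    print("stacked params:", param_count(stacked))
```

In this toy setup the looped model stores one sixth of the stacked model's weights for the same number of sequential steps, which is the trade-off the RDT hypothesis relies on.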

At 770M parameters, the OpenMythos implementation reportedly matches the downstream quality of a 1.3B standard transformer on benchmarks. The architecture combines Multi-Latent Attention for memory compression, LTI (Linear Time-Invariant) stability constraints to prevent training instability during recurrence, Mixture of Experts routing for specialization, and Adaptive Computation Time (ACT) halting to decide when to stop looping per token.
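The halting step can be sketched in the style of the original Adaptive Computation Time formulation: keep looping while the accumulated halting probability stays below `1 - epsilon`, then assign the remainder to the final step. This is an assumed, simplified version for a single token; the `act_halt` function and its parameters are hypothetical and are not taken from the OpenMythos code.

```python
def act_halt(halting_probs, epsilon=0.01):
    """Decide how many loop iterations one token uses.

    halting_probs: per-iteration halting probabilities in [0, 1],
                   one per available loop iteration.
    Returns (steps_used, weights), where weights sum to 1 and would be
    used to mix the hidden states from each iteration.
    """
    weights = []
    cumulative = 0.0
    limit = len(halting_probs)
    for step, p in enumerate(halting_probs, start=1):
        # Halt once the threshold is crossed, or when iterations run out.
        if cumulative + p >= 1.0 - epsilon or step == limit:
            weights.append(1.0 - cumulative)  # remainder goes to the last step
            return step, weights
        weights.append(p)
        cumulative += p
    return 0, weights  # only reached when halting_probs is empty


if __name__ == "__main__":
    steps, weights = act_halt([0.2, 0.3, 0.6, 0.9])
    print(steps, weights)
```

A token that produces high halting probabilities early stops after a few loops, while a harder token keeps iterating up to the maximum loop count.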

The project took off on GitHub within days (6.2k stars, 1.2k forks), and Kye Gomez's X announcement drew heavy engagement (4.1k likes, 4.5k reposts). Community reaction is divided: some AI researchers call it "the most sophisticated reverse-engineering of an LLM architecture I've seen," while Anthropic has neither confirmed nor denied any of the architectural claims. This is educated speculation backed by real engineering, not a marketing exercise.
