GLM-5.1
#1 on SWE-Bench Pro — 744B MoE model that runs autonomously for 8 hours
Expert verdict
Skip
2-2The Panel's Take
GLM-5.1 is Z.AI's post-training upgrade of the 744B Mixture-of-Experts GLM-5 model, and it has just claimed the top spot on SWE-Bench Pro with a score of 58.4 — beating GPT-5.4 (57.7), Claude Opus 4.6 (57.3), and Gemini 3.1 Pro (54.2). The model is designed for long-horizon agentic tasks and can run autonomously for up to 8 hours across thousands of iterations on a single problem. The agentic capabilities include extended context retention, tool-calling with recovery loops, and a reinforcement-trained "persistence" mode that keeps the model on-task through failures and dead ends rather than surfacing errors to the user. The model was trained entirely on Huawei Ascend 910B chips using the MindSpore framework — no US silicon, no CUDA. The geopolitical dimension is as significant as the technical one: GLM-5.1 is direct evidence that US export controls on Nvidia hardware have not meaningfully slowed China's frontier model development. The 8-hour autonomous execution window is also a step-change from current agentic systems that struggle past 20-30 minutes of coherent work — if this benchmark holds up in real-world testing, it's a genuine advancement in the class of problems AI agents can independently solve.
Share this verdict
GLM-5.1 verdict: SKIP ⏭️ 2 ships · 2 skips from the expert panel Full review: shiporskip.io/tool/glm-51-zai-swe-bench-pro-1-744b-moe-8hour-autonomous-huawei-chips
Weekly AI Tool Verdicts
Get the next verdict in your inbox
7 critics review a new AI tool every day. Weekly digest — free.
Similar Products
Compare GLM-5.1 with Others
Looking for GLM-5.1 alternatives?
Compare GLM-5.1 with every other AI Models tool reviewed by our panel.
See all AI Models alternativesEmbed this verdict
Tool makers can add a live ShipOrSkip badge to their site. Badge loads track impressions; clicks route back to this review.
<a href="https://shiporskip.io/api/badge-click/glm-51-zai-swe-bench-pro-1-744b-moe-8hour-autonomous-huawei-chips" target="_blank" rel="noopener"><img src="https://shiporskip.io/api/badge/glm-51-zai-swe-bench-pro-1-744b-moe-8hour-autonomous-huawei-chips" alt="GLM-5.1 Skip verdict on ShipOrSkip" width="360" height="90" /></a>[](https://shiporskip.io/api/badge-click/glm-51-zai-swe-bench-pro-1-744b-moe-8hour-autonomous-huawei-chips)<iframe src="https://shiporskip.io/embed/glm-51-zai-swe-bench-pro-1-744b-moe-8hour-autonomous-huawei-chips" title="GLM-5.1 ShipOrSkip verdict" width="360" height="260" style="border:0;border-radius:16px;max-width:100%;" loading="lazy"></iframe>The reviews
“If the 8-hour autonomous execution claim is real and not cherry-picked, this changes the calculus for using AI on genuinely hard engineering problems. SWE-Bench Pro #1 is also a credible metric — I want to test this on my own repos immediately.”
“SWE-Bench benchmarks have historically shown poor correlation with real-world coding productivity, and the '8-hour autonomous' claim needs independent validation. Z.AI is also a relatively unknown quantity compared to Anthropic or Google — API reliability and pricing are completely unproven.”
“The strategic significance of a Chinese lab hitting #1 on the coding benchmark using zero US hardware cannot be overstated. The export control strategy is officially not working as intended, and GLM-5.1 will accelerate the geopolitical AI arms race in ways that reshape the entire industry.”
“For creative work, I need a model with strong multimodal capabilities and reliable API access — both unproven for GLM-5.1. The coding benchmark lead is impressive but not directly relevant to my workflows. I'll wait for independent reviews before switching.”