Winfunc Research · 2026-04-13

# N-Day-Bench: Frontier LLMs Now Score 80+ on Real Vulnerability Discovery — GPT-5.4 Leads, GLM-5.1 Surprises

A new cybersecurity benchmark called N-Day-Bench tests frontier LLMs on discovering real software vulnerabilities disclosed after their training cutoff — and the results show GPT-5.4 (83.93), GLM-5.1 (80.13), and Claude Opus 4.6 (79.95) all clustered above 79 in the April 2026 run.


## What Is N-Day-Bench?

Winfunc Research has published N-Day-Bench, a cybersecurity benchmark that evaluates large language models on their ability to discover real-world software vulnerabilities — specifically "N-days," meaning flaws that were publicly disclosed after a model's training knowledge cutoff. This design prevents models from simply reciting known CVE details; they have to actually reason about vulnerable code.
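To make the post-cutoff rule concrete, here is a minimal sketch of the filtering idea in Python. The model names, cutoff dates, and function name are illustrative assumptions; neither the vendors' actual cutoffs nor Winfunc's internal schema are published in the article.

```python
from datetime import date

# Hypothetical training cutoffs: illustrative values only, not
# figures published by the model vendors or by Winfunc Research.
MODEL_CUTOFFS = {
    "gpt-5.4": date(2025, 10, 1),
    "glm-5.1": date(2025, 9, 1),
    "claude-opus-4.6": date(2025, 11, 1),
}

def is_n_day_for_all_models(disclosure_date: date) -> bool:
    """An advisory qualifies as an N-day only if it was publicly
    disclosed after every model's cutoff, so no model can simply
    recall the CVE details from its training data."""
    return all(disclosure_date > cutoff for cutoff in MODEL_CUTOFFS.values())

# Example: a flaw disclosed in March 2026 post-dates every cutoff above.
assert is_n_day_for_all_models(date(2026, 3, 5))
```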

The benchmark scanned 1,000 security advisories for the April 2026 run, accepted 47 cases that met its rigor criteria, and ran each model under standardized conditions. Monthly updates keep the test cases current and the models at their latest versions, aiming to create a living measure of real offensive AI capability rather than a static snapshot.
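Continuing the hedged sketch above, this is roughly what that monthly funnel might look like. The `Advisory` fields and the two rigor checks (a verifiable patch diff, a buildable repository) are assumptions for illustration; the article does not specify the actual acceptance criteria behind the 1,000-to-47 cut.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Advisory:
    cve_id: str
    disclosed: date
    has_patch_diff: bool   # assumed rigor check: a verifiable fix exists
    repo_builds: bool      # assumed rigor check: vulnerable code compiles

def accept_for_run(advisories: list[Advisory], latest_cutoff: date) -> list[Advisory]:
    """Reduce the ~1,000 scanned advisories to the few dozen (47 in
    the April 2026 run) that are both post-cutoff and rigorous enough
    to score every model against under standardized conditions."""
    return [
        a for a in advisories
        if a.disclosed > latest_cutoff   # post-cutoff: rules out memorization
        and a.has_patch_diff
        and a.repo_builds
    ]
```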

## The Leaderboard

The April 13, 2026 results are striking in how tightly clustered the top models are:

- **GPT-5.4** (OpenAI): 83.93
- **GLM-5.1** (Zhipu AI): 80.13
- **Claude Opus 4.6** (Anthropic): 79.95

Zhipu's GLM-5.1 placing second — above Claude Opus — is the headline surprise, continuing its strong run after also claiming the #1 spot on SWE-Bench Pro for software engineering tasks. The gap of just under four points between first and third suggests these models are operating at comparable capability levels on real security tasks.

## Why This Matters

Previous cybersecurity benchmarks have drawn criticism for testing on well-documented vulnerabilities that may already be in training data. N-Day-Bench's post-cutoff design is a meaningful methodological improvement — if a model scores 80, it's because it can reason about vulnerabilities, not because it memorized them.

The timing is notable: AISI's evaluation of Claude Mythos earlier this year showed frontier models solving multi-step corporate network breach scenarios. N-Day-Bench adds a standardized, reproducible data point to that picture. Security teams, policymakers, and AI labs now have a public benchmark to anchor conversations about AI-assisted offensive security — and all the implications that follow.

## Panel Takes

### The Builder: Developer Perspective

A standardized benchmark for LLM vulnerability discovery on N-day flaws is exactly what security engineers needed. With GPT-5.4 at 83.93 and GLM-5.1 at 80.13, the gap between frontier models on real security tasks is narrowing fast — which means defensive tooling needs to keep pace.

### The Skeptic: Reality Check

47 accepted cases out of 1,000 advisories is a small evaluation set — scores can shift significantly with dataset changes. And measuring N-day discovery (known vulns after cutoff) is a proxy for real offensive capability, not a direct test. Treat these numbers as directional, not definitive.

### The Futurist: Big Picture

N-Day-Bench arriving at a moment when all frontier models score above 79 on real vulnerability discovery is a signal that AI-assisted exploit development is no longer a research curiosity. This benchmark will become a standard reference — expect security policy conversations to cite these numbers within months.