SWE-BENCH-PRO
Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals and fix your AI test blind spots
Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public leaderboards don’t reflect it yet. A de...
SWE-Bench Pro leaderboard: small gains at the top, big contexts, and mostly self-reported results
A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self-reported scores. The updated [SWE-Benc...
Coding-agent benchmarks are wobbling—trust results only after your own cross-context checks
SWE-Bench-style coding scores are spiking, but contamination and self-reported leaderboards mean you should trust results only after your own verifica...
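One way to act on "your own cross-context checks" is a small local harness that compares self-reported leaderboard scores against pass rates on a held-out task set you control. The sketch below is hypothetical: the model names, scores, and the 10-point tolerance are placeholders, not values from any real leaderboard.

```python
# Minimal sketch of a cross-context check: flag models whose local,
# held-out pass rate falls well below their self-reported leaderboard
# score (a common contamination smell). All data here is illustrative.

def cross_context_check(reported, local_results, tolerance=0.10):
    """Return models whose local pass rate trails the reported score
    by more than `tolerance`, mapped to (reported, local) pairs."""
    flagged = {}
    for model, score in reported.items():
        results = local_results.get(model)
        if not results:
            continue  # no local evidence either way; skip
        local_rate = sum(results) / len(results)
        if score - local_rate > tolerance:
            flagged[model] = (score, round(local_rate, 3))
    return flagged

# Hypothetical numbers for illustration only.
reported = {"model-a": 0.58, "model-b": 0.55}
local = {
    "model-a": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],  # 0.70 locally: consistent
    "model-b": [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # 0.20 locally: suspicious
}
print(cross_context_check(reported, local))  # → {'model-b': (0.55, 0.2)}
```

The point is not the arithmetic but the discipline: a model that only hits its headline number on tasks it may have seen during training should fail this kind of check on fresh, private tasks.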
E2E agentic benchmarks replace SWE-bench; Gemini 3.1 favors deliberation
Agentic coding benchmarks are shifting toward end-to-end app-building tests as SWE-bench Verified is being phased out, while Google’s Gemini 3.1 Pro t...
Coding agents: smarter context and sequential planning beat model-only upgrades
Third‑party tests show Bito’s AI Architect lifted a Claude Sonnet 4.5 agent to 60.8% on SWE‑Bench Pro by adding MCP‑delivered codebase intelligence—up...
E2E coding agents: 27% pass, cheaper scaling, and safer adoption
A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports...