SWE-Bench Pro
Term: A framework for software engineering benchmarking and evaluation.
5 stories
First: 2026-02-03
Last: 2026-04-17
Stories
Completed digest stories linked to this service.
- SWE-bench scores are spiking, but variant mix-ups make the leaderboard noisy for... (2026-04-12)
  Vendors are touting big SWE-bench jumps, but versions differ and scores alone won’t pick your coding copilot. ...
- Anthropic’s Mythos and Project Glasswing push AI into real-world vuln discovery,... (2026-04-09)
  Anthropic launched Project Glasswing and a Mythos Preview model that finds serious software bugs, pairing indu...
- Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals... (2026-04-08)
  Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public ...
- SWE-Bench Pro leaderboard: small gains at the top, big contexts, and mostly self... (2026-04-04)
  A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self...
- Coding-agent benchmarks are wobbling: trust results only after your own cross-con... (2026-03-24)
  SWE-Bench-style coding scores are spiking, but contamination and self-reported leaderboards mean you should tr...