LLM-BENCHMARKS
Anthropic’s Mythos and Project Glasswing push AI into real-world vuln discovery, with tight access and strong benchmark signals
Anthropic launched Project Glasswing and a Mythos Preview model that finds serious software bugs, pairing tightly restricted access for industry partners with strong benchmark results.
SWE-Bench Pro leaderboard: small gains at the top, big contexts, and mostly self-reported results
A new SWE-Bench Pro leaderboard shows top code models clustered around 0.55–0.58, with large contexts and self-reported scores.
Production reality check for coding agents: reliability over benchmarks
AI coding agents are hitting production walls where reliability, latency, and evaluation, not raw benchmarks, decide whether they help or hurt teams.
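
As a rough illustration of what "reliability over benchmarks" can mean before rollout, here is a minimal sketch of a harness that replays representative tasks against an agent and reports success rate and p95 latency. The `call_agent` stub and the generated task list are placeholders for whatever your team actually runs, not part of any product or benchmark named above.

```python
import statistics
import time

def call_agent(task: str) -> bool:
    # Placeholder: invoke your coding agent here and decide success yourself,
    # e.g. does the generated patch apply cleanly and pass the repo's tests?
    return hash(task) % 3 != 0  # dummy outcome so the sketch runs end-to-end

def measure(tasks: list[str]) -> None:
    latencies, successes = [], 0
    for task in tasks:
        start = time.perf_counter()
        ok = call_agent(task)
        latencies.append(time.perf_counter() - start)
        successes += int(ok)

    p95 = statistics.quantiles(latencies, n=100)[94]  # 95th-percentile latency
    print(f"success rate: {successes / len(tasks):.0%} over {len(tasks)} tasks")
    print(f"p95 latency:  {p95 * 1000:.2f} ms")

if __name__ == "__main__":
    # Replace with tasks drawn from your own backlog or incident history.
    measure([f"task-{i}" for i in range(40)])
```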
Coding-agent benchmarks are wobbling—trust results only after your own cross-context checks
SWE-Bench-style coding scores are spiking, but contamination and self-reported leaderboards mean you should trust results only after your own verification.
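
One concrete form such a check can take: a small sketch that re-scores a model on a locally held sample of tasks and compares the result with the self-reported leaderboard number. `run_model`, `passes`, and the JSONL task format are assumptions for illustration, not any published harness.

```python
import json
import random

def run_model(prompt: str) -> str:
    # Placeholder for your actual model call (API client, local runtime, etc.).
    return ""

def passes(task: dict, output: str) -> bool:
    # Placeholder for the task's own ground-truth check, e.g. applying the
    # generated patch and running the repository's test suite.
    return output.strip() == task.get("expected", "").strip()

def cross_check(tasks_path: str, reported_score: float, sample_size: int = 30) -> float:
    # Tasks live in a local JSONL file you curated (one JSON object per line),
    # so the vendor never tuned against this exact split.
    with open(tasks_path) as f:
        tasks = [json.loads(line) for line in f]

    sample = random.sample(tasks, min(sample_size, len(tasks)))
    solved = sum(passes(t, run_model(t["prompt"])) for t in sample)
    local_score = solved / len(sample)

    print(f"reported: {reported_score:.2f}  local: {local_score:.2f}  (n={len(sample)})")
    if reported_score - local_score > 0.10:
        print("Gap exceeds 10 points: treat the leaderboard number as unverified.")
    return local_score
```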
Benchmarks vs. reality: AI code review passes the test, fails the repo
Independent results show popular LLM code-review benchmarks overstate real-world quality; many “passing” AI fixes would be rejected by maintainers.