METR
LIVE_DATA_STREAM // APRIL_14_2026
OPENAI
MAR_14 // 07:40
Benchmarks Aren’t Shipping Code: How to Vet AI Code Agents Before CI
New evidence shows top-scoring AI coding tools pass benchmarks but stumble in real code review and day‑to‑day engineering workflows. METR reports tha...
SWE-BENCH
MAR_13 // 07:41
SWE-bench passes aren’t merge-ready: new reviews question benchmark claims and real-world gains
Fresh reviews suggest high SWE-bench scores don’t translate to mergeable code or big productivity gains. A discussion sparked by METR’s review finds ...
METR
MAR_12 // 07:40
METR study challenges SWE-bench wins as Sonar touts 79.2% "Verified" score
A new METR review finds many SWE-bench "passes" aren’t merge-worthy, casting recent leaderboard wins like Sonar’s 79.2% in a different light. Researc...
GITHUB-COPILOT
MAR_08 // 07:27
AI coding assistants can slow devs—fix the verification gap
Studies show AI coding assistants can slow experienced developers and raise bug rates, so leaders should add friction and track real productivity. A ...