METR
LIVE_DATA_STREAM // APRIL_14_2026
OPENAI
MAR_14 // 07:40
Benchmarks Aren’t Shipping Code: How to Vet AI Code Agents Before CI
New evidence shows top-scoring AI coding tools pass benchmarks but stumble in real code review and day‑to‑day engineering workflows. METR reports tha...
SWE-BENCH
MAR_13 // 07:41
SWE-bench passes aren’t merge-ready: new reviews question benchmark claims and real-world gains
Fresh reviews suggest high SWE-bench scores don’t translate to mergeable code or big productivity gains. A discussion sparked by METR’s review finds ...
METR
MAR_12 // 07:40
METR study challenges SWE-bench wins as Sonar touts 79.2% "Verified" score
A new METR review finds many SWE-bench "passes" aren’t merge-worthy, casting recent leaderboard wins like Sonar’s 79.2% in a different light. Researc...
GITHUB-COPILOT
MAR_08 // 07:27
AI coding assistants can slow devs—fix the verification gap
Studies show AI coding assistants can slow experienced developers and raise bug rates, so leaders should add friction and track real productivity. A ...