METR
Company
METR (Model Evaluation and Threat Research) is a Berkeley-based nonprofit that runs empirical studies and benchmarks to assess the real-world capabilities and risks of frontier AI systems. Its recent work includes auditing AI-generated code that passes the SWE-bench benchmark and showing that many such patches are rejected in human code review.
Stories
Completed digest stories linked to this service.
- Benchmarks vs. reality: AI code review passes the test, fails the repo (2026-03-15). Independent results show popular LLM code-review benchmarks overstate real-world quality; many “passing” AI fi...
- Benchmarks Aren’t Shipping Code: How to Vet AI Code Agents Before CI (2026-03-14). New evidence shows top-scoring AI coding tools pass benchmarks but stumble in real code review and day‑to‑day ...
- SWE-bench passes aren’t merge-ready: new reviews question benchmark claims and r... (2026-03-13). Fresh reviews suggest high SWE-bench scores don’t translate to mergeable code or big productivity gains. A di...
- METR study challenges SWE-bench wins as Sonar touts 79.2% "Verified" score (2026-03-12). A new METR review finds many SWE-bench "passes" aren’t merge-worthy, casting recent leaderboard wins like Sona...