METR
Company
METR (Model Evaluation and Threat Research) is a Berkeley-based nonprofit that runs empirical studies and benchmarks to assess the real-world capabilities and risks of frontier AI systems. Its recent work includes auditing AI-generated code that passes the SWE-bench benchmark and showing that many such patches are rejected in human code review.
Stories
Completed digest stories linked to this service.
- Benchmarks vs. reality: AI code review passes the test, fails the repo (2026-03-15). Independent results show popular LLM code-review benchmarks overstate real-world quality; many “passing” AI fi...
- Benchmarks Aren’t Shipping Code: How to Vet AI Code Agents Before CI (2026-03-14). New evidence shows top-scoring AI coding tools pass benchmarks but stumble in real code review and day‑to‑day ...
- SWE-bench passes aren’t merge-ready: new reviews question benchmark claims and r... (2026-03-13). Fresh reviews suggest high SWE-bench scores don’t translate to mergeable code or big productivity gains. A di...
- METR study challenges SWE-bench wins as Sonar touts 79.2% "Verified" score (2026-03-12). A new METR review finds many SWE-bench "passes" aren’t merge-worthy, casting recent leaderboard wins like Sona...