METR STUDY CHALLENGES SWE-BENCH WINS AS SONAR TOUTS 79.2% "VERIFIED" SCORE
A new METR review finds many SWE-bench "passes" aren’t merge-worthy, casting recent leaderboard wins like Sonar’s 79.2% in a different light.
Researchers had active maintainers of scikit-learn, Sphinx, and pytest review 296 AI-generated patches and found that roughly half of SWE-bench Verified "passing" solutions would be rejected in practice, often for real functional defects rather than style (The Decoder, daily.dev). One summary puts maintainer approval rates about 24 percentage points below the benchmark's automated grader (AIBase).
At the same time, Sonar announced that its Foundation Agent—built on Anthropic's Claude—now tops SWE-bench: 79.2% on Verified and 52.62% on Full, with claimed averages of 9 minutes and $1.90 per verified issue (PRNewswire). Community discussion of potential benchmark contamination and overfitting continues (YouTube).
Benchmark “passes” may not translate to merge-ready fixes, so relying on SWE-bench alone can mislead roadmap and staffing decisions.
Vendors will keep posting big scores; teams need guardrails and internal evaluation to avoid brittle auto-fixes landing in prod.
- Shadow-run an agent (including Sonar's) on your repos and require maintainer review; track merge rate vs. SWE-bench-style pass rate.
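The gap that shadow-running surfaces can be tracked with a simple metric pair. A minimal sketch, assuming a hypothetical per-issue record shape (the `PatchResult` fields and sample `GH-*` IDs are illustrative, not from any real run):

```python
# Compare an agent's automated "pass" rate against the rate at which
# maintainers actually merge its patches during a shadow run.
from dataclasses import dataclass

@dataclass
class PatchResult:
    issue_id: str
    tests_passed: bool   # SWE-bench-style automated verdict
    merged: bool         # maintainer accepted the patch in shadow review

def rates(results: list[PatchResult]) -> tuple[float, float]:
    pass_rate = sum(r.tests_passed for r in results) / len(results)
    merge_rate = sum(r.merged for r in results) / len(results)
    return pass_rate, merge_rate

runs = [
    PatchResult("GH-101", True, True),
    PatchResult("GH-102", True, False),   # passed tests, rejected in review
    PatchResult("GH-103", False, False),
    PatchResult("GH-104", True, False),   # passed tests, rejected in review
]
pass_rate, merge_rate = rates(runs)
print(f"pass rate {pass_rate:.0%}, merge rate {merge_rate:.0%}")
# prints: pass rate 75%, merge rate 25%
```

A persistent spread between the two numbers mirrors the METR finding: green checks are not the same thing as merge-worthy fixes.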
- Augment unit tests with property tests and hidden canaries, then measure how many agent fixes regress when tests are mutated.
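To illustrate the property-test-plus-mutation idea, here is a stdlib-only sketch: a randomized property check over a toy `clamp` function, and a hand-rolled "mutant" (the kind of subtly wrong fix an agent might produce) that the property catches. All names here are hypothetical examples, not part of any benchmark harness:

```python
import random

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(x, hi))

def in_range_identity(fn, trials: int = 1000) -> bool:
    """Property: result stays in [lo, hi]; in-range inputs pass through."""
    rng = random.Random(0)                    # deterministic for CI
    for _ in range(trials):
        lo = rng.uniform(-100.0, 100.0)
        hi = lo + rng.uniform(0.0, 100.0)
        x = rng.uniform(-200.0, 200.0)
        y = fn(x, lo, hi)
        if not (lo <= y <= hi):
            return False                      # result escaped the range
        if lo <= x <= hi and y != x:
            return False                      # in-range input was altered
    return True

# Mutated variant with min/max swapped -- it always returns lo, which a
# naive "stays in range" check alone would NOT catch.
def clamp_mutant(x: float, lo: float, hi: float) -> float:
    return min(lo, max(x, hi))

print(in_range_identity(clamp))        # True: property survives
print(in_range_identity(clamp_mutant)) # False: mutation caught
```

The point of the second property clause is exactly the METR concern: a weak oracle lets broken patches pass, so measure how many agent fixes survive when the oracle is strengthened or the code is mutated.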
Legacy codebase integration strategies
1. Gate all agent PRs behind code owners' review and expanded CI (mutation testing, static analysis, fuzzing) before allowing merges.
2. Start with low-risk classes of issues (docs, lints, flaky tests) and instrument rollback metrics for any automated remediation.
Fresh architecture paradigms
1. Design for agentic remediation: rich tests, invariants, and contracts by default so agents can't game shallow checks.
2. Structure services with clear module boundaries and golden datasets so AI-driven patches are safer to evaluate.
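One way to make invariants and contracts "default" is to encode them as executable pre/postconditions, so a shallow agent patch that merely makes tests green still trips the contract. A hypothetical sketch (the `transfer` function and account data are illustrative):

```python
# Assert-based contracts: preconditions guard inputs, a postcondition
# enforces the domain invariant (total money is conserved).
def transfer(balances: dict[str, int], src: str, dst: str, amount: int) -> None:
    total_before = sum(balances.values())
    assert amount > 0, "precondition: amount must be positive"
    assert balances[src] >= amount, "precondition: sufficient funds"
    balances[src] -= amount
    balances[dst] += amount
    assert sum(balances.values()) == total_before, "invariant: money conserved"

accounts = {"a": 100, "b": 50}
transfer(accounts, "a", "b", 30)
print(accounts)  # {'a': 70, 'b': 80}
```

A patch that, say, credits `dst` without debiting `src` would pass a test that only checks the destination balance, but it fails the conservation invariant immediately.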