A new METR review finds many SWE-bench "passes" aren’t merge-worthy, casting recent leaderboard wins like Sonar’s 79.2% in a different light. Researc...
Agentic coding benchmarks are shifting toward end-to-end app-building tests as SWE-bench Verified is being phased out, while Google’s Gemini 3.1 Pro t...