SWE-BENCH
Oracle-SWE dissects the “oracle hints” behind SWE-bench wins, challenging headline coding benchmarks
New research isolates which “oracle” hints actually move SWE-bench agent scores, explaining why headline results often don’t match real coding impact....
Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals and fix your AI test blind spots
Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public leaderboards don’t reflect it yet. A de...
Code agents grow up: CI-scale benchmarking, structured patch checks, and cheaper eval runs
Code agent evaluation is shifting to long-run maintainability, execution-free patch checks, and leaner, cheaper benchmark runs. A new benchmark, [SWE...
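For a sense of what an execution-free patch check can look like in practice, here is a minimal Python sketch. It assumes a unified-diff string and a local git checkout, relies on the third-party `unidiff` parser, and its rules (no test edits, clean dry-run apply) are illustrative choices, not any benchmark's actual gate.

```python
# A minimal sketch of an execution-free patch check. Assumes a
# unified-diff string and a local git checkout; `unidiff` is a
# third-party parser (pip install unidiff), and the specific rules
# below are illustrative, not a benchmark's own gate.
import io
import subprocess
from pathlib import Path

from unidiff import PatchSet


def execution_free_check(diff_text: str, repo: Path) -> list[str]:
    """Return structural problems in a candidate patch without running tests."""
    # 1. The diff must be well-formed unified-diff syntax.
    try:
        patch = PatchSet(io.StringIO(diff_text))
    except Exception as exc:
        return [f"malformed diff: {exc}"]

    problems = []
    # 2. Flag patches that edit or delete test files, a common way
    #    for agent-generated fixes to "pass" without fixing the bug.
    for pfile in patch:
        if "test" in pfile.path.lower():
            problems.append(f"{pfile.path}: touches test code")

    # 3. Dry-run the application: `git apply --check` verifies the
    #    hunks match the working tree without modifying anything.
    result = subprocess.run(
        ["git", "apply", "--check"],
        cwd=repo,
        input=diff_text,
        text=True,
        capture_output=True,
    )
    if result.returncode != 0:
        problems.append(f"does not apply cleanly: {result.stderr.strip()}")
    return problems
```

Because nothing is compiled or executed, checks like these cost milliseconds per patch instead of a containerized test run, which is where the "cheaper eval runs" framing comes from.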
Coding-agent benchmarks are wobbling—trust results only after your own cross-context checks
SWE-bench-style coding scores are spiking, but contamination and self-reported leaderboards mean you should trust results only after your own verifica...
NxCode ranks 2026 AI coding tools: Claude Code (Opus 4.6) tops with 80.8% SWE-bench
NxCode ranked 10 AI coding tools for 2026 and put Claude Code (Opus 4.6) first with an 80.8% SWE-bench score. The review weights five factors—SWE-ben...
Benchmarks vs. reality: AI code review passes the test, fails the repo
Independent results show popular LLM code-review benchmarks overstate real-world quality; many “passing” AI fixes would be rejected by maintainers. M...
Study: LLM-generated AGENTS.md hurts agent success and raises cost
A new ETH Zurich and LogicStar.ai study finds that LLM-generated repository context files like AGENTS.md reduce coding agent success and raise inferen...
Agents ace SWE-bench but stumble on OpenTelemetry tasks
Recent benchmarks show AI agents excel at code-fix tasks but falter on real-world observability work, signaling that teams must evaluate agents against dom...
Benchmark trust: SWE-bench questions; Qwen3‑Max emerges; Windsurf delivers
Community signals suggest AI coding assistants are advancing fast but require local validation: a practitioner credits [Windsurf with Claude Sonnet 3....
Pick One LLM Benchmark That Mirrors Your Backend/Data Work
A community prompt asks which single LLM benchmark best reflects real daily tasks. For backend and data engineering, practical choices are SWE-bench (...
Agentic AI: architecture patterns and what to measure before you ship
A new survey consolidates how LLM-based agents are built—policy/LLM core, memory, planners, tool routers, and critics—plus orchestration choices (sing...
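As one concrete reading of that component list, the sketch below wires a policy LLM, a tool router, and a critic into a single bounded loop. Every name in it (call_llm, TOOLS, critique, run_agent) is an illustrative stub, not an interface from the survey; the LLM is replaced by a canned script so the example actually runs.

```python
# A minimal sketch of the single-agent orchestration pattern the survey
# describes: a policy LLM proposes actions, a tool router dispatches
# them, and a critic vets each step before it enters memory. All names
# here are hypothetical stubs, not the survey's interfaces.
from dataclasses import dataclass, field

# Canned action script standing in for a real policy model.
SCRIPT = iter(["search: failing test trace", "done: proposed a one-line fix"])


def call_llm(prompt: str) -> str:
    """Stub policy model: replays the script so the sketch runs end to end."""
    return next(SCRIPT, "done: script exhausted")


def critique(action: str, observation: str) -> str:
    """Stub critic; a real system would use a second LLM pass here."""
    return "useful" if "unknown tool" not in observation else "wasted step"


TOOLS = {  # tool router table: action name -> callable
    "search": lambda q: f"top result for {q!r}",
    "run": lambda cmd: f"exit 0 from {cmd!r}",
}


@dataclass
class AgentState:
    goal: str
    memory: list[str] = field(default_factory=list)  # rolling scratchpad


def step(state: AgentState) -> str | None:
    """One plan -> act -> critique cycle; returns a final answer or None."""
    action = call_llm(f"Goal: {state.goal}\nHistory: {state.memory}\nNext?")
    if action.startswith("done:"):
        return action.removeprefix("done:").strip()
    name, _, args = action.partition(":")
    tool = TOOLS.get(name.strip(), lambda a: "unknown tool")
    observation = tool(args.strip())
    state.memory.append(f"{action} => {observation} [{critique(action, observation)}]")
    return None


def run_agent(goal: str, max_steps: int = 8) -> str | None:
    """Bounded loop: the step budget is one of the knobs worth measuring."""
    state = AgentState(goal)
    for _ in range(max_steps):
        answer = step(state)
        if answer is not None:
            return answer
    return None  # budget exhausted; also worth measuring before shipping


print(run_agent("reproduce and fix the flaky test"))
```

Capping max_steps is what makes cost measurable before you ship; a real critic would be a second model call scoring each step, rather than the string check stubbed here.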
AI weekly (Dec 26, 2025): code agents, model updates, SWE-bench
A single roundup video reports advances in coding agents and model refreshes. Highlights cited include a GitHub Copilot agent oriented to clearing bac...