SWE-BENCH
Oracle-SWE dissects the “oracle hints” behind SWE-bench wins, challenging headline coding benchmarks
New research isolates which “oracle” hints actually move SWE-bench agent scores, explaining why headline results often don’t match real coding impact....
Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals and fix your AI test blind spots
Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public leaderboards don’t reflect it yet. A de...
Code agents grow up: CI-scale benchmarking, structured patch checks, and cheaper eval runs
Code agent evaluation is shifting to long-run maintainability, execution-free patch checks, and leaner, cheaper benchmark runs. A new benchmark, [SWE...
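For a sense of what an execution-free patch check can look like in practice, here is a minimal Python sketch. It assumes a unified-diff string and a local git checkout, relies on the third-party `unidiff` parser, and its rules (no test edits, clean dry-run apply) are illustrative choices, not any benchmark's actual gate.

```python
# A minimal sketch of an execution-free patch check. Assumes a
# unified-diff string and a local git checkout; `unidiff` is a
# third-party parser (pip install unidiff), and the specific rules
# below are illustrative, not a benchmark's own gate.
import io
import subprocess
from pathlib import Path

from unidiff import PatchSet


def execution_free_check(diff_text: str, repo: Path) -> list[str]:
    """Return structural problems in a candidate patch without running tests."""
    # 1. The diff must be well-formed unified-diff syntax.
    try:
        patch = PatchSet(io.StringIO(diff_text))
    except Exception as exc:
        return [f"malformed diff: {exc}"]

    problems = []
    # 2. Flag patches that edit or delete test files, a common way
    #    for agent-generated fixes to "pass" without fixing the bug.
    for pfile in patch:
        if "test" in pfile.path.lower():
            problems.append(f"{pfile.path}: touches test code")

    # 3. Dry-run the application: `git apply --check` verifies the
    #    hunks match the working tree without modifying anything.
    result = subprocess.run(
        ["git", "apply", "--check"],
        cwd=repo,
        input=diff_text,
        text=True,
        capture_output=True,
    )
    if result.returncode != 0:
        problems.append(f"does not apply cleanly: {result.stderr.strip()}")
    return problems
```

Because nothing is compiled or executed, checks like these cost milliseconds per patch instead of a containerized test run, which is where the "cheaper eval runs" framing comes from.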
Coding-agent benchmarks are wobbling—trust results only after your own cross-context checks
SWE-bench-style coding scores are spiking, but contamination and self-reported leaderboards mean you should trust results only after your own verifica...
NxCode ranks 2026 AI coding tools: Claude Code (Opus 4.6) tops with 80.8% SWE-bench
NxCode ranked 10 AI coding tools for 2026 and put Claude Code (Opus 4.6) first with an 80.8% SWE-bench score. The review weights five factors—SWE-ben...
Benchmarks vs. reality: AI code review passes the test, fails the repo
Independent results show popular LLM code-review benchmarks overstate real-world quality; many “passing” AI fixes would be rejected by maintainers. M...
Study: LLM-generated AGENTS.md hurts agent success and raises cost
A new ETH Zurich and LogicStar.ai study finds that LLM-generated repository context files like AGENTS.md reduce coding agent success and raise inferen...
Agents ace SWE-bench but stumble on OpenTelemetry tasks
Recent benchmarks show AI agents excel at code-fix tasks but falter on real-world observability work, signaling that teams must evaluate agents against dom...
Benchmark trust: SWE-bench questions; Qwen3‑Max emerges; Windsurf delivers
Community signals suggest AI coding assistants are advancing fast but require local validation: a practitioner credits [Windsurf with Claude Sonnet 3....
Pick One LLM Benchmark That Mirrors Your Backend/Data Work
A community prompt asks which single LLM benchmark best reflects real daily tasks. For backend and data engineering, practical choices are SWE-bench (...
Agentic AI: architecture patterns and what to measure before you ship
A new survey consolidates how LLM-based agents are built—policy/LLM core, memory, planners, tool routers, and critics—plus orchestration choices (sing...
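As one concrete reading of that component list, the sketch below wires a policy LLM, a tool router, and a critic into a single bounded loop. Every name in it (call_llm, TOOLS, critique, run_agent) is an illustrative stub, not an interface from the survey; the LLM is replaced by a canned script so the example actually runs.

```python
# A minimal sketch of the single-agent orchestration pattern the survey
# describes: a policy LLM proposes actions, a tool router dispatches
# them, and a critic vets each step before it enters memory. All names
# here are hypothetical stubs, not the survey's interfaces.
from dataclasses import dataclass, field

# Canned action script standing in for a real policy model.
SCRIPT = iter(["search: failing test trace", "done: proposed a one-line fix"])


def call_llm(prompt: str) -> str:
    """Stub policy model: replays the script so the sketch runs end to end."""
    return next(SCRIPT, "done: script exhausted")


def critique(action: str, observation: str) -> str:
    """Stub critic; a real system would use a second LLM pass here."""
    return "useful" if "unknown tool" not in observation else "wasted step"


TOOLS = {  # tool router table: action name -> callable
    "search": lambda q: f"top result for {q!r}",
    "run": lambda cmd: f"exit 0 from {cmd!r}",
}


@dataclass
class AgentState:
    goal: str
    memory: list[str] = field(default_factory=list)  # rolling scratchpad


def step(state: AgentState) -> str | None:
    """One plan -> act -> critique cycle; returns a final answer or None."""
    action = call_llm(f"Goal: {state.goal}\nHistory: {state.memory}\nNext?")
    if action.startswith("done:"):
        return action.removeprefix("done:").strip()
    name, _, args = action.partition(":")
    tool = TOOLS.get(name.strip(), lambda a: "unknown tool")
    observation = tool(args.strip())
    state.memory.append(f"{action} => {observation} [{critique(action, observation)}]")
    return None


def run_agent(goal: str, max_steps: int = 8) -> str | None:
    """Bounded loop: the step budget is one of the knobs worth measuring."""
    state = AgentState(goal)
    for _ in range(max_steps):
        answer = step(state)
        if answer is not None:
            return answer
    return None  # budget exhausted; also worth measuring before shipping


print(run_agent("reproduce and fix the flaky test"))
```

Capping max_steps is what makes cost measurable before you ship; a real critic would be a second model call scoring each step, rather than the string check stubbed here.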
AI weekly (Dec 26, 2025): code agents, model updates, SWE-bench
A single roundup video reports advances in coding agents and model refreshes. Highlights cited include a GitHub Copilot agent oriented to clearing bac...