BENCHMARKS
30 days · UTC
Oracle-SWE dissects the “oracle hints” behind SWE-bench wins, challenging headline coding benchmarks
New research isolates which “oracle” hints actually move SWE-bench agent scores, explaining why headline results often don’t match real coding impact...
Cursor Composer 2 lands with agentic coding gains, cost claims, and questions about provenance and safety
Cursor launched Composer 2, a MoE-based agentic coding model claiming strong multi-file performance at lower cost, but its base model and stability ar...
Cursor ships Composer 2: a cheaper, stronger coding model with a fast default — and some early hiccups
Cursor launched Composer 2, a cheaper coding model that claims big quality gains and a new fast default variant. Cursor’s own post says [Composer 2](...
Benchmarks Aren’t Shipping Code: How to Vet AI Code Agents Before CI
New evidence shows top-scoring AI coding tools pass benchmarks but stumble in real code review and day‑to‑day engineering workflows. METR reports tha...
SWE-bench passes aren’t merge-ready: new reviews question benchmark claims and real-world gains
Fresh reviews suggest high SWE-bench scores don’t translate to mergeable code or big productivity gains. A discussion sparked by METR’s review finds ...
NVIDIA’s AI-Q tops DeepResearch benchmarks, hinting at a full-stack agent push with Nemotron 3 Super
NVIDIA’s AI-Q open agent stack hit #1 on DeepResearch Bench I and II and points to a broader open, enterprise agent strategy. NVIDIA details how its ...
Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check
Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination an...
Update: Anthropic Claude Opus 4.5
New third‑party coverage (AOL/Yahoo) reiterates that Claude Opus 4.5 is Anthropic's 'most intelligent' model but provides no added technical specs, be...