BENCHMARKS

30 days · UTC

LIVE_DATA_STREAM // APRIL_14_2026

MICROSOFT
APR_10 // 06:28

Oracle-SWE dissects the “oracle hints” behind SWE-bench wins, challenging headline coding benchmarks

New research isolates which “oracle” hints actually move SWE-bench agent scores, explaining why headline results often don’t match real coding impact....

CURSOR-IDE
MAR_24 // 07:27

Cursor Composer 2 lands with agentic coding gains, cost claims, and questions about provenance and safety

Cursor launched Composer 2, a MoE-based agentic coding model claiming strong multi-file performance at lower cost, but its base model and stability ar...

CURSOR
MAR_21 // 07:13

Cursor ships Composer 2: a cheaper, stronger coding model with a fast default — and some early hiccups

Cursor launched Composer 2, a cheaper coding model that claims big quality gains and a new fast default variant. Cursor’s own post says [Composer 2](...

OPENAI
MAR_14 // 07:40

Benchmarks Aren’t Shipping Code: How to Vet AI Code Agents Before CI

New evidence shows top-scoring AI coding tools pass benchmarks but stumble in real code review and day‑to‑day engineering workflows. METR reports tha...

SWE-BENCH
MAR_13 // 07:41

SWE-bench passes aren’t merge-ready: new reviews question benchmark claims and real-world gains

Fresh reviews suggest high SWE-bench scores don’t translate to mergeable code or big productivity gains. A discussion sparked by METR’s review finds ...

NVIDIA
MAR_12 // 07:42

NVIDIA’s AI-Q tops DeepResearch benchmarks, hinting at a full-stack agent push with Nemotron 3 Super

NVIDIA’s AI-Q open agent stack hit #1 on DeepResearch Bench I and II and points to a broader open, enterprise agent strategy. NVIDIA details how its ...

QWEN-35
MAR_03 // 23:22

Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination an...

ANTHROPIC
DEC_30 // 19:19

Update: Anthropic Claude Opus 4.5

New third‑party coverage (AOL/Yahoo) reiterates that Claude Opus 4.5 is Anthropic's 'most intelligent' model but provides no added technical specs, be...
