BENCHMARKS
30 days · UTC
Oracle-SWE dissects the “oracle hints” behind SWE-bench wins, challenging headline coding benchmarks
New research isolates which “oracle” hints actually move SWE-bench agent scores, explaining why headline results often don’t match real coding impact...
Cursor Composer 2 lands with agentic coding gains, cost claims, and questions about provenance and safety
Cursor launched Composer 2, a MoE-based agentic coding model claiming strong multi-file performance at lower cost, but its base model and stability ar...
Cursor ships Composer 2: a cheaper, stronger coding model with a fast default — and some early hiccups
Cursor launched Composer 2, a cheaper coding model that claims big quality gains and a new fast default variant. Cursor’s own post says [Composer 2](...
Benchmarks Aren’t Shipping Code: How to Vet AI Code Agents Before CI
New evidence shows top-scoring AI coding tools pass benchmarks but stumble in real code review and day‑to‑day engineering workflows. METR reports tha...
SWE-bench passes aren’t merge-ready: new reviews question benchmark claims and real-world gains
Fresh reviews suggest high SWE-bench scores don’t translate to mergeable code or big productivity gains. A discussion sparked by METR’s review finds ...
NVIDIA’s AI-Q tops DeepResearch benchmarks, hinting at a full-stack agent push with Nemotron 3 Super
NVIDIA’s AI-Q open agent stack hit #1 on DeepResearch Bench I and II and points to a broader open, enterprise agent strategy. NVIDIA details how its ...
Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check
Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination an...
Update: Anthropic Claude Opus 4.5
New third‑party coverage (AOL/Yahoo) reiterates that Claude Opus 4.5 is Anthropic's 'most intelligent' model but provides no added technical specs, be...