SWE-BENCH-VERIFIED

30 days · UTC

LIVE_DATA_STREAM // APRIL_14_2026

Synchronizing with global intelligence nodes...

DENSITY_RATIO: MAX

NEW LONG-HORIZON BENCHMARKS SAY CODING AGENTS REGRESS UNDER MAINTENANCE; TREAT THEM LIKE JUNIOR DEVS WITH TOUGHER CI

A new wave of long-horizon benchmarks shows most coding agents ship regressions over time, not just fixes. A summary in [TLDR Dev 2026-03-09](https:/...

MINIMAX-M25

MAR_04 // 20:48

MiniMax-M2.5 launches with SOTA coding claims; verify SWE-bench results

MiniMax launched MiniMax-M2.5, a fast, low-cost coding and agentic model, but teams should validate its headline SWE-bench gains with internal tests g...

QWEN-35

MAR_03 // 23:22

Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination an...

CLAUDE-45-SONNET

FEB_24 // 21:10

E2E agentic benchmarks replace SWE-bench; Gemini 3.1 favors deliberation

Agentic coding benchmarks are shifting toward end-to-end app-building tests as SWE-bench Verified is being phased out, while Google’s Gemini 3.1 Pro t...

PROJDEVBENCH

FEB_03 // 18:40

E2E coding agents: 27% pass, cheaper scaling, and safer adoption

A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports...