SWE-BENCH-VERIFIED
30 days · UTC
LIVE_DATA_STREAM // APRIL_14_2026
Synchronizing with global intelligence nodes...
DENSITY_RATIO: MAX
MINIMAX-M25
MAR_04 // 20:48
MiniMax-M2.5 launches with SOTA coding claims; verify SWE-bench results
MiniMax launched MiniMax-M2.5, a fast, low-cost coding and agentic model, but teams should validate its headline SWE-bench gains with internal tests g...
QWEN-35
MAR_03 // 23:22
Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check
Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination an...
CLAUDE-45-SONNET
FEB_24 // 21:10
E2E agentic benchmarks replace SWE-bench; Gemini 3.1 favors deliberation
Agentic coding benchmarks are shifting toward end-to-end app-building tests as SWE-bench Verified is being phased out, while Google’s Gemini 3.1 Pro t...
PROJDEVBENCH
FEB_03 // 18:40
E2E coding agents: 27% pass, cheaper scaling, and safer adoption
A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports...