GEMINI-31-PRO
30 days · UTC
Synchronizing with global intelligence nodes...
SWE-bench scores are spiking, but variant mix-ups make the leaderboard noisy for real-world tool choices
Vendors are touting big SWE-bench jumps, but versions differ and scores alone won’t pick your coding copilot. SWE-bench measures fail-to-pass bug fix...
Claude Mythos posts record SWE-bench numbers, but it’s gated; tighten your evals and fix your AI test blind spots
Anthropic’s Claude Mythos preview claims record SWE-bench results, but it isn’t publicly available and public leaderboards don’t reflect it yet. A de...
Google’s agentic dev stack: Gemini 3.1 long-context and ADK 2.0 deterministic graphs move from hype to practice
Google is consolidating its AI coding bet around Gemini 3.1 and a new ADK 2.0 graph workflow, pushing agentic, deterministic software delivery. A Web...
Cheaper coding LLMs and subagent stacks are here—time to re-architect your model routing
Production-ready, cheaper models plus subagent patterns are shifting AI economics for coding and document workflows. Z.ai’s new GLM-5.1 posts a 45.3 ...
Coding LLMs, March 2026: default to Sonnet 4.6, escalate to GPT-5.4, watch scaffold-driven benchmarks
March 2026 coding LLM benchmarks show mid-tier models rival flagships, but scaffolding and cost drive real-world choices. The latest multi-benchmark ...
Usable Context, Not Token Hype: How to pick and harden LLMs for long docs and agents
Choosing an LLM for long context and agents comes down to usable context and safety, not headline token counts. A careful comparison argues that cont...
Benchmarks Are Breaking: Evaluate LLMs in Your Harness, Not Theirs
LLM benchmark scores are failing under real-world conditions, so choose and tune models by testing them in your own harness with controlled tools and ...
E2E agentic benchmarks replace SWE-bench; Gemini 3.1 favors deliberation
Agentic coding benchmarks are shifting toward end-to-end app-building tests as SWE-bench Verified is being phased out, while Google’s Gemini 3.1 Pro t...
Google ships Gemini 3.1 Pro with big reasoning gains and 1M‑token context
Google released Gemini 3.1 Pro with major reasoning gains, a context window up to 1 million tokens, and broad availability across developer and enterp...
Windsurf ships new models, Linux ARM64, and enterprise hooks
Windsurf rolled out new frontier coding models, full Linux ARM64 support, and enterprise-grade Cascade Hooks while community feedback spotlights its t...