E2E CODING AGENTS: 27% PASS, CHEAPER SCALING, AND SAFER ADOPTION
A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports only 27.38% acceptance for agent-built repos, highlighting gaps in system design, complexity handling, and resource management. Efficiency is improving: SWE-Replay[^3] recycles prior agent trajectories to cut test-time compute by up to 17.4% while maintaining or slightly improving fix rates. For evaluation and safety, Together AI shows open LLM judges can beat GPT‑5.2 on preference alignment[^4], Java teams get a pragmatic path via ASTRA‑LangChain4j[^6], and an open‑weight coding LM, Qwen3‑Coder‑Next[^7], targets agentic workflows and local development.
[^1]: Adds: defines an E2E agent benchmark with architecture, correctness, and refinement criteria, plus pass-rate findings.
[^2]: Adds: benchmark repository for tasks, harnesses, and evaluation assets.
[^3]: Adds: test-time scaling via trajectory replay, with up to 17.4% cost reduction and small performance gains on SWE-Bench variants.
[^4]: Adds: DPO-tuned open "LLM-as-judge" models outperform GPT‑5.2 on RewardBench 2 preference alignment, with code and how-to.
[^5]: Adds: security analysis of self-propagating adversarial prompts ("prompt worms") and the OpenClaw agent network example.
[^6]: Adds: Java integration pattern for agent+LLM via ASTRA modules and LangChain4j, including BeliefRAG and Maven packaging.
[^7]: Adds: open-weight coding model positioned for agentic workflows and local development.
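SWE-Replay's core idea, reusing prior agent trajectories so that retries do not repay full test-time compute, can be sketched minimally. This is not the paper's implementation; the archive structure and task/agent names below are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TrajectoryArchive:
    """Toy archive of prior agent trajectories, keyed by task id.

    Illustrative only: SWE-Replay's actual selection and replay logic
    is more involved; all names here are hypothetical.
    """
    runs: dict[str, list[str]] = field(default_factory=dict)

    def record(self, task_id: str, actions: list[str]) -> None:
        self.runs[task_id] = actions

    def replay_or_run(
        self, task_id: str, agent: Callable[[str], list[str]]
    ) -> tuple[list[str], bool]:
        # Reuse a stored trajectory when one exists; otherwise pay for a fresh run.
        if task_id in self.runs:
            return self.runs[task_id], True  # replayed: no new agent compute
        actions = agent(task_id)
        self.record(task_id, actions)
        return actions, False

archive = TrajectoryArchive()
expensive_calls = 0

def agent(task_id: str) -> list[str]:
    global expensive_calls
    expensive_calls += 1  # stands in for LLM / test-harness cost
    return [f"edit:{task_id}", "run-tests"]

first, replayed_a = archive.replay_or_run("bug-42", agent)
second, replayed_b = archive.replay_or_run("bug-42", agent)
```

The second call returns the archived trajectory without invoking the agent, which is where the compute saving comes from; a real system would also validate that the replayed trajectory still passes tests.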
E2E success is still low, so teams need realistic benchmarks and cost-aware scaling to avoid overpromising agent capabilities.
Better judges and security patterns reduce regressions and mitigate risks from autonomous, networked agents[^5].
- Run ProjDevBench tasks on your agent stack and gate outputs with an open LLM judge to quantify quality and drift.
- Add a trajectory archive (SWE-Replay style) to agent retries and measure cost/latency vs. pass-rate deltas in CI.
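The judge gating suggested above can be sketched as a thin wrapper: score each agent output, accept only those above a threshold, and track the mean score over time to spot drift. The `judge` callable and the 0.7 threshold are stand-ins for any open LLM-as-judge call, not a specific API.

```python
from statistics import mean
from typing import Callable

def judge_gate(
    outputs: list[str], judge: Callable[[str], float], threshold: float = 0.7
) -> tuple[list[str], float]:
    """Gate agent outputs on a judge score in [0, 1].

    Returns the accepted outputs plus the batch's mean score, which
    can be logged per release to quantify quality drift.
    """
    scored = [(o, judge(o)) for o in outputs]
    accepted = [o for o, s in scored if s >= threshold]
    drift_metric = mean(s for _, s in scored)
    return accepted, drift_metric

# Stub judge for illustration: favors outputs that mention tests.
# In real use this would be a call to an open judge model.
toy_judge = lambda o: 0.9 if "tests" in o else 0.3

accepted, avg_score = judge_gate(["patch with tests", "drive-by patch"], toy_judge)
```

Keeping the judge behind a plain callable makes it easy to swap judge models in CI without touching the gate logic.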
Legacy codebase integration strategies...
1. Wrap existing CI/CD with judge-based checks, tool allowlists, and sandboxed execution before enabling agent autonomy.
2. For Java services, integrate LLM calls via ASTRA‑LangChain4j behind feature flags with audit logging and rollback.
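One minimal form of the allowlist-plus-audit step above is a gateway that every agent tool call must pass through. The tool names and policy here are illustrative assumptions; a production version would also execute permitted commands inside a sandbox.

```python
import shlex

class ToolGateway:
    """Allowlist gate for agent tool calls, with an audit log.

    A sketch of the 'allowlist + audit before autonomy' pattern;
    the allowed tools and commands are hypothetical examples.
    """
    def __init__(self, allowed: set[str]):
        self.allowed = allowed
        self.audit_log: list[tuple[str, bool]] = []

    def invoke(self, command: str) -> bool:
        tool = shlex.split(command)[0]
        permitted = tool in self.allowed
        # Every attempt is logged, whether or not it was permitted.
        self.audit_log.append((command, permitted))
        if not permitted:
            return False  # in CI, fail the job instead of executing
        # Real use: run the command in a sandbox (container, seccomp, etc.).
        return True

gateway = ToolGateway(allowed={"pytest", "git"})
ok = gateway.invoke("pytest -q")
blocked = gateway.invoke("curl http://evil.example")
```

The audit log gives the rollback and review trail the Java item above asks for; the same pattern maps directly onto a flag-gated wrapper in any language.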
Fresh architecture paradigms...
1. Design agent-first workflows with ephemeral environments, secrets isolation, and human-in-the-loop checkpoints.
2. Prototype locally with open-weight coding LMs and plug in judge models early for PR review and regression scoring.
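The human-in-the-loop checkpoint idea above can be sketched as a step runner that pauses risky actions for an approval decision. The step names, the risk set, and the `approve` callback are all hypothetical; in practice the callback would be a chat prompt, PR review, or ticket.

```python
from typing import Callable

# Hypothetical set of step verbs that require a human decision.
RISKY = {"deploy", "delete", "push"}

def run_with_checkpoints(
    steps: list[str], approve: Callable[[str], bool]
) -> list[str]:
    """Execute agent steps, gating risky ones on a human decision.

    Non-risky steps run unattended; risky steps run only if the
    approval callback says yes.
    """
    executed = []
    for step in steps:
        verb = step.split(":", 1)[0]
        if verb in RISKY and not approve(step):
            continue  # checkpoint declined: skip this step
        executed.append(step)
    return executed

# A reviewer who approves pushes but declines everything else risky.
reviewer = lambda step: step.startswith("push")

done = run_with_checkpoints(["edit:main.py", "delete:db", "push:branch"], reviewer)
```

Keeping the checkpoint as a callback means the same workflow runs fully autonomously in a sandbox (approve everything) and conservatively in production (route to a human).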