PROMPTFOO PUB_DATE: 2026.04.25

AGENT EVALS ARE NOW SYSTEM TESTS, NOT MODEL TESTS

Coding AI moved from single-shot prompts to agents you must evaluate as full systems.

The new Promptfoo agent eval guide (Promptfoo is now part of OpenAI) reframes testing around three runtime tiers: plain LLM, SDK-based agent, and rich client/server. In each tier, tool access, safety posture, and state drive outcomes. The guide pushes teams to log intermediate steps, cost, and latency, not just final answers.
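To make "log intermediate steps, cost, and latency" concrete, here is a minimal sketch of a per-task trace record. It does not use Promptfoo's API; the `Step` and `Trace` classes and every field name are hypothetical, chosen only to show what a trace might capture beyond the final answer.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    """One intermediate step of an agent run (hypothetical schema)."""
    kind: str         # e.g. "llm_call" or "tool_call"
    name: str         # model or tool name
    cost_usd: float   # cost attributed to this step
    latency_s: float  # wall time spent in this step


@dataclass
class Trace:
    """Full record of one task: every step plus the final answer."""
    task_id: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

    def record(self, kind, name, cost_usd, latency_s):
        self.steps.append(Step(kind, name, cost_usd, latency_s))

    def summary(self):
        # Roll the steps up into the numbers a dashboard or CI gate needs.
        return {
            "task_id": self.task_id,
            "num_steps": len(self.steps),
            "tool_calls": sum(1 for s in self.steps if s.kind == "tool_call"),
            "cost_usd": round(sum(s.cost_usd for s in self.steps), 6),
            "latency_s": round(sum(s.latency_s for s in self.steps), 3),
        }

    def to_json(self):
        # asdict() recurses into the nested Step dataclasses.
        return json.dumps(asdict(self))


trace = Trace("task-1")
trace.record("llm_call", "plan", cost_usd=0.002, latency_s=1.2)
trace.record("tool_call", "run_tests", cost_usd=0.0, latency_s=3.5)
trace.final_answer = "patched"
print(trace.summary())
```

Storing the JSON from `to_json()` alongside the score keeps the full step history available when a run needs to be inspected later.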

Benchmarks are also splitting. A short GLM-5.1 SWE-Bench explainer shows why “Verified” vs “Pro” scores diverge, while vendor videos tout wins on assorted suites (e.g., MiMo V2.5 Pro). The takeaway: fix your runtime boundary and scoring rubric before you compare anything.

[ WHY_IT_MATTERS ]

01.

Evaluating the system (tools, state, safety) reveals cost, latency, and failure modes hidden by final-answer scoring.

02.

SWE-Bench variants score differently; locked-down, reproducible harnesses stop apples-to-oranges comparisons.

[ WHAT_TO_TEST ]

Run the same model as plain LLM vs SDK-based agent on the same patch set; compare success, steps, tool calls, cost, and wall time.
Reproduce a SWE-Bench Verified run, then switch to Pro and document the delta; pin seeds and runtime boundary for both.
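A rough sketch of the first comparison, assuming each run has already been reduced to one record per task. The record keys (`task_id`, `success`, `steps`, `tool_calls`, `cost_usd`, `wall_s`) and both function names are hypothetical; the point is only that both runtimes must cover the same task set before their deltas mean anything.

```python
from statistics import mean

# Hypothetical per-task metrics averaged across the task set.
METRICS = ("steps", "tool_calls", "cost_usd", "wall_s")


def aggregate(runs):
    """Average success rate and metrics over a list of per-task records."""
    out = {"success_rate": mean(1.0 if r["success"] else 0.0 for r in runs)}
    for m in METRICS:
        out[m] = mean(r[m] for r in runs)
    return out


def compare(plain, agent):
    """Return agent-minus-plain deltas; refuse mismatched task sets."""
    if {r["task_id"] for r in plain} != {r["task_id"] for r in agent}:
        raise ValueError("both runs must cover the same task set")
    a, b = aggregate(plain), aggregate(agent)
    return {k: b[k] - a[k] for k in a}


# Illustrative numbers only, not measured results.
plain_runs = [
    {"task_id": "t1", "success": False, "steps": 1, "tool_calls": 0, "cost_usd": 0.01, "wall_s": 2.0},
    {"task_id": "t2", "success": True, "steps": 1, "tool_calls": 0, "cost_usd": 0.01, "wall_s": 2.0},
]
agent_runs = [
    {"task_id": "t1", "success": True, "steps": 5, "tool_calls": 3, "cost_usd": 0.05, "wall_s": 20.0},
    {"task_id": "t2", "success": True, "steps": 7, "tool_calls": 5, "cost_usd": 0.07, "wall_s": 30.0},
]
print(compare(plain_runs, agent_runs))
```

The same shape works for the Verified vs Pro delta: treat each benchmark variant as one run list and record the pinned seed and runtime boundary next to the output.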

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

01.
Add an agent-eval job to CI with a read-only sandbox and explicit tool allowlists; fail on cost/latency regressions or policy violations.
02.
Standardize on one agent tier (SDK or app-server) per pipeline, and store full traces so runs can be diffed across upgrades.
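One way the CI gate from item 01 could look, as a sketch under stated assumptions: the run summaries are plain dicts with hypothetical keys (`tools_used`, `cost_usd`, `latency_s`), the 10% tolerance is an arbitrary example value, and `check_run` is an illustrative name rather than part of any real tool.

```python
def check_run(current, baseline, allowed_tools, tolerance=0.10):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []

    # Policy check: every tool the agent touched must be on the allowlist.
    for tool in sorted(set(current["tools_used"]) - set(allowed_tools)):
        failures.append(f"policy: tool '{tool}' is not on the allowlist")

    # Regression check: cost and latency may drift up by at most `tolerance`.
    for metric in ("cost_usd", "latency_s"):
        limit = baseline[metric] * (1 + tolerance)
        if current[metric] > limit:
            failures.append(
                f"regression: {metric} {current[metric]:.4f} > limit {limit:.4f}"
            )
    return failures


# Illustrative values only.
baseline = {"cost_usd": 1.0, "latency_s": 10.0}
current = {"cost_usd": 1.2, "latency_s": 10.5, "tools_used": ["read_file", "shell"]}
for message in check_run(current, baseline, allowed_tools=["read_file", "run_tests"]):
    print(message)
```

In a CI job, the script would exit non-zero whenever the returned list is non-empty, which fails the build on either a policy violation or a cost/latency regression.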

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

01.
Design agents eval-first: step budgets, cost caps, tool permissions, and required traces baked in from day one.
02.
Choose the minimal runtime tier that meets needs; simpler boundaries reduce variance and blast radius.
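A minimal sketch of the eval-first idea in item 01: step budgets, cost caps, and tool permissions enforced at the point where the agent calls a tool, with a trace entry written for every allowed call. All names here (`Limits`, `Run`, `BudgetExceeded`, `call_tool`) are hypothetical, and the tool itself is not executed; only the accounting is shown.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a run would go past its step budget or cost cap."""


@dataclass(frozen=True)
class Limits:
    max_steps: int
    max_cost_usd: float
    allowed_tools: frozenset


@dataclass
class Run:
    limits: Limits
    steps: int = 0
    cost_usd: float = 0.0
    trace: list = field(default_factory=list)

    def call_tool(self, tool, cost_usd):
        # Permission is checked first, so a forbidden tool never consumes budget.
        if tool not in self.limits.allowed_tools:
            raise PermissionError(f"tool '{tool}' is not permitted")
        if self.steps + 1 > self.limits.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.cost_usd + cost_usd > self.limits.max_cost_usd:
            raise BudgetExceeded("cost cap exceeded")
        self.steps += 1
        self.cost_usd += cost_usd
        # Every allowed call leaves a trace entry, so traces exist by construction.
        self.trace.append({"step": self.steps, "tool": tool, "cost_usd": cost_usd})


run = Run(Limits(max_steps=2, max_cost_usd=0.10, allowed_tools=frozenset({"read_file"})))
run.call_tool("read_file", 0.04)
run.call_tool("read_file", 0.04)
print(run.steps, len(run.trace))
```

Because the limits object is frozen, the agent cannot loosen its own budget mid-run; any change has to come from whoever constructs the run.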
