PROMPTFOO PUB_DATE: 2026.04.25

AGENT EVALS ARE NOW SYSTEM TESTS, NOT MODEL TESTS

Coding AI moved from single-shot prompts to agents you must evaluate as full systems.

The new Promptfoo agent eval guide (Promptfoo is now part of OpenAI) reframes testing around three runtime tiers: plain LLM, SDK-based agent, and rich client/server. Across these tiers, tool access, safety posture, and state drive outcomes. The guide pushes teams to log intermediate steps, cost, and latency, not just final answers.
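
To make "log intermediate steps, cost, and latency" concrete, here is a minimal sketch of a per-task trace record. It is not Promptfoo's API: the `Step` and `Trace` classes, the `kind` labels, and the field names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    kind: str         # "llm_call" or "tool_call" (labels are assumptions)
    name: str         # model or tool name
    cost_usd: float   # cost attributed to this step
    latency_s: float  # wall time spent in this step


@dataclass
class Trace:
    """Hypothetical trace: every intermediate step plus the final answer."""
    task_id: str
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None

    def record(self, kind: str, name: str, cost_usd: float, latency_s: float) -> None:
        self.steps.append(Step(kind, name, cost_usd, latency_s))

    def summary(self) -> dict:
        # Roll the step log up into the numbers a final-answer score hides.
        return {
            "task_id": self.task_id,
            "steps": len(self.steps),
            "tool_calls": sum(1 for s in self.steps if s.kind == "tool_call"),
            "cost_usd": round(sum(s.cost_usd for s in self.steps), 6),
            "latency_s": round(sum(s.latency_s for s in self.steps), 3),
        }
```

Keeping the raw step list alongside the summary means two runs with the same final answer can still be compared step by step.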

Benchmarks are also splitting. A short GLM-5.1 SWE-Bench explainer shows why “Verified” vs “Pro” scores diverge, while vendor videos tout wins on assorted suites (e.g., MiMo V2.5 Pro). The takeaway: fix your runtime boundary and scoring rubric before you compare anything.

[ WHY_IT_MATTERS ]
01.

Evaluating the system (tools, state, safety) reveals cost, latency, and failure modes hidden by final-answer scoring.

02.

SWE-Bench variants score differently; locked-down, reproducible harnesses stop apples-to-oranges comparisons.

[ WHAT_TO_TEST ]
  • 01.

    Run the same model as plain LLM vs SDK-based agent on the same patch set; compare success, steps, tool calls, cost, and wall time.

  • 02.

    Reproduce a SWE-Bench Verified run, then switch to Pro and document the delta; pin seeds and runtime boundary for both.
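
The first comparison above can be sketched as a small aggregation over per-task run records. The record keys (`task_id`, `passed`, `steps`, `tool_calls`, `cost_usd`, `wall_s`) are hypothetical; adapt them to whatever your harness actually emits.

```python
from statistics import mean


def summarize(runs):
    """Aggregate per-task run records into one row for a configuration."""
    return {
        "success_rate": mean(1.0 if r["passed"] else 0.0 for r in runs),
        "mean_steps": mean(r["steps"] for r in runs),
        "mean_tool_calls": mean(r["tool_calls"] for r in runs),
        "total_cost_usd": round(sum(r["cost_usd"] for r in runs), 4),
        "total_wall_s": round(sum(r["wall_s"] for r in runs), 2),
    }


def compare(baseline, candidate):
    """Return candidate-minus-baseline deltas for each summary metric.

    Raises if the two runs cover different tasks, since that comparison
    would not be like-for-like.
    """
    if {r["task_id"] for r in baseline} != {r["task_id"] for r in candidate}:
        raise ValueError("runs cover different task sets")
    a, b = summarize(baseline), summarize(candidate)
    return {key: round(b[key] - a[key], 4) for key in a}
```

Calling `compare(plain_llm_runs, agent_runs)` on the same patch set yields one delta per metric, so a higher success rate can be read next to the extra steps, cost, and wall time it took to get there.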

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Add an agent-eval job to CI with a read-only sandbox and explicit tool allowlists; fail on cost/latency regressions or policy violations.

  • 02.

    Standardize on one agent tier (SDK vs app-server) per pipeline and store full traces for diffing across upgrades.
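
A CI gate along these lines might look like the sketch below. The allowlist contents, the 10% regression threshold, and the metric names are assumptions, not values from the guide. A CI step would call `main` and use its return value as the process exit code.

```python
import sys

# Hypothetical read-only allowlist for a sandboxed agent.
ALLOWED_TOOLS = {"read_file", "list_dir", "run_tests"}
# Assumed threshold: fail if cost or latency grows more than 10% over baseline.
MAX_REGRESSION = 0.10


def check_run(baseline, current, tools_used):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for tool in sorted(set(tools_used) - ALLOWED_TOOLS):
        failures.append(f"policy: tool '{tool}' is not on the allowlist")
    for metric in ("cost_usd", "latency_s"):
        limit = baseline[metric] * (1 + MAX_REGRESSION)
        if current[metric] > limit:
            failures.append(
                f"regression: {metric} {current[metric]:.4f} exceeds limit {limit:.4f}"
            )
    return failures


def main(baseline, current, tools_used):
    """Print failures to stderr and return a CI-style exit code (0 = pass)."""
    failures = check_run(baseline, current, tools_used)
    for message in failures:
        print(message, file=sys.stderr)
    return 1 if failures else 0
```

Reporting policy violations and regressions as separate messages keeps the CI log readable when an agent upgrade trips both at once.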

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design agents eval-first: step budgets, cost caps, tool permissions, and required traces baked in from day one.

  • 02.

    Choose the minimal runtime tier that meets needs; simpler boundaries reduce variance and blast radius.
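
One way to bake step budgets, cost caps, and tool permissions in from day one is a guard that the agent loop must call before every action. This is a minimal sketch: `RunBudget`, `BudgetExceeded`, and `charge` are hypothetical names, and trace capture is left out for brevity.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a run would exceed its step budget or cost cap."""


@dataclass
class RunBudget:
    max_steps: int
    max_cost_usd: float
    allowed_tools: frozenset
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, tool: str, cost_usd: float) -> None:
        """Approve one action, or raise before it is executed."""
        if tool not in self.allowed_tools:
            raise PermissionError(f"tool '{tool}' is not permitted")
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost cap reached")
        # Only count the action once all checks have passed.
        self.steps += 1
        self.cost_usd += cost_usd
```

Because `charge` raises before any counter is updated, a rejected action leaves the budget unchanged, which keeps the limits easy to reason about in tests.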
