PROMPTFOO PUB_DATE: 2026.04.29

PROMPTFOO JOINS OPENAI WITH A PRACTICAL PLAYBOOK FOR EVALUATING CODING AGENTS

Promptfoo is now part of OpenAI and has published a hands-on guide that reframes how to evaluate coding agents in the real world.

The guide breaks down why agent evals differ from single-shot LLM tests and shows tiered setups (plain LLM baselines, SDK-based agents, and rich app-server harnesses) with concrete trade-offs in cost, latency, and side effects (Promptfoo guide). It also maps provider choices like the Codex SDK/app-server and the Claude Agent SDK to specific runtime boundaries.
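
To make the tiers concrete, here is a minimal sketch of a shared runner that records pass/fail, latency, cost, and tool calls per tier. Every name in it (Tier, RunResult, runTask) is a hypothetical stand-in, not an API from the Promptfoo guide or any provider SDK.

    // Sketch: run the same coding task across three runtime tiers and record
    // latency, cost, and tool-call counts per run. All types and helpers are
    // hypothetical stand-ins, not Promptfoo or provider SDK APIs.

    type Tier = "plain-llm" | "sdk-agent" | "app-server";

    interface RunResult {
      tier: Tier;
      passed: boolean;   // did the task's checks (tests, lint) pass?
      latencyMs: number; // wall-clock time for the whole run
      costUsd: number;   // estimated from token usage
      toolCalls: number; // 0 for the plain-LLM tier by definition
    }

    // Assumed adapter: each tier exposes the same "solve this task" entry point.
    declare function runTask(
      tier: Tier,
      taskId: string,
    ): Promise<Omit<RunResult, "tier" | "latencyMs">>;

    async function compareTiers(taskIds: string[]): Promise<RunResult[]> {
      const results: RunResult[] = [];
      for (const taskId of taskIds) {
        for (const tier of ["plain-llm", "sdk-agent", "app-server"] as const) {
          const start = Date.now();
          const partial = await runTask(tier, taskId);
          results.push({ tier, latencyMs: Date.now() - start, ...partial });
        }
      }
      return results;
    }

Keeping one entry point per tier forces an apples-to-apples comparison: the only variable is the runtime boundary, not the task set or the scoring.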

Meanwhile, frontier models are crowding the top of the SWE-Bench Verified leaderboard, hinting that static patch benchmarks are nearing saturation for state-of-the-art models. Community chatter questions how much these scores reflect deploy-time agent behavior versus memorization, reinforcing the shift to system-level evals.

[ WHY_IT_MATTERS ]
01.

Evaluation focus is shifting from static benchmarks to end-to-end agent behavior with real tool use, costs, and failure modes.

02.

Teams can adopt a clearer, repeatable test harness to compare plain LLMs vs agent runtimes before betting on an SDK or platform; a minimal report over such runs is sketched below.
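
As one illustration, the per-tier results from the earlier runner sketch could be rolled up into pass rates and lift over the plain-LLM baseline. The types repeat the earlier hypothetical shapes; the report format is an assumption, not anything the guide prescribes.

    // Sketch: aggregate per-task results into pass rates and lift over the
    // plain-LLM baseline. Tier and RunResult match the earlier sketch.

    type Tier = "plain-llm" | "sdk-agent" | "app-server";

    interface RunResult {
      tier: Tier;
      passed: boolean;
      latencyMs: number;
      costUsd: number;
      toolCalls: number;
    }

    function report(results: RunResult[]): Map<Tier, { passRate: number; lift: number }> {
      const byTier = new Map<Tier, RunResult[]>();
      for (const r of results) {
        byTier.set(r.tier, [...(byTier.get(r.tier) ?? []), r]);
      }
      const passRate = (rs: RunResult[]) =>
        rs.length === 0 ? 0 : rs.filter((r) => r.passed).length / rs.length;
      const baseline = passRate(byTier.get("plain-llm") ?? []);
      const out = new Map<Tier, { passRate: number; lift: number }>();
      for (const [tier, rs] of byTier) {
        out.set(tier, { passRate: passRate(rs), lift: passRate(rs) - baseline });
      }
      return out;
    }

If an agent tier's lift over the baseline is small while its cost and latency are not, that is exactly the trade-off the guide's tiering is meant to surface.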

[ WHAT_TO_TEST ]
  • 01.

    Run the same tasks across three tiers (plain LLM, SDK agent, app-server) and record tool calls, latency, and cost; verify real gains over a plain LLM baseline.

  • 02.

    Toggle filesystem/shell access and approvals to quantify how much each capability changes pass rate, regressions, and side effects; one possible sweep is sketched after this list.
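
For the capability sweep, a full toggle grid over the same task set keeps the deltas attributable to one flag at a time. The Capabilities shape and runWithCapabilities entry point below are hypothetical; map them onto whatever switches your agent runtime actually exposes.

    // Sketch: sweep agent capability toggles and measure how each one moves
    // pass rate and side effects. All names are illustrative placeholders.

    interface Capabilities {
      filesystem: boolean; // read/write access to the workspace
      shell: boolean;      // permission to execute shell commands
      approvals: boolean;  // require human approval before risky actions
    }

    interface ToggleResult {
      caps: Capabilities;
      passRate: number;    // fraction of tasks whose checks passed
      sideEffects: number; // e.g. files touched outside the task's scope
    }

    declare function runWithCapabilities(
      caps: Capabilities,
      taskIds: string[],
    ): Promise<ToggleResult>;

    async function sweep(taskIds: string[]): Promise<ToggleResult[]> {
      const grid: Capabilities[] = [];
      for (const filesystem of [false, true])
        for (const shell of [false, true])
          for (const approvals of [false, true])
            grid.push({ filesystem, shell, approvals });
      // Same task set for every configuration, so a pass-rate delta is
      // attributable to the toggled capability rather than to task mix.
      const results: ToggleResult[] = [];
      for (const caps of grid) results.push(await runWithCapabilities(caps, taskIds));
      return results;
    }

Running configurations sequentially rather than in parallel also keeps any filesystem side effects from one run out of another's measurements.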

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Gate rollout by wiring agent evals into CI with read-only sandboxes and approvals off by default, then promote capabilities gradually; one possible gate is sketched after this list.

  • 02.

    Compare agent harnesses against your current scripts on flaky tests, refactors, and patch generation to measure real operational lift.
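
A CI gate over such evals can be as small as the script below. The results file, its shape, and the pinned threshold are assumptions about your pipeline, not features of promptfoo or any SDK.

    // Sketch: fail the build if the agent's pass rate on a pinned task suite
    // regresses, and refuse to gate at all if the run escaped its read-only
    // sandbox. File name, shape, and threshold are placeholders.

    import { readFileSync } from "node:fs";

    interface SuiteResult {
      passRate: number;         // fraction of tasks passed in this run
      sandboxReadOnly: boolean; // harness attests it made no workspace writes
    }

    const BASELINE_PASS_RATE = 0.7; // pinned from the last accepted run

    const result: SuiteResult = JSON.parse(
      readFileSync("agent-eval-results.json", "utf8"),
    );

    if (!result.sandboxReadOnly) {
      console.error("Eval ran outside the read-only sandbox; not gating on it.");
      process.exit(1);
    }

    if (result.passRate < BASELINE_PASS_RATE) {
      console.error(
        `Pass rate ${result.passRate} regressed below ${BASELINE_PASS_RATE}.`,
      );
      process.exit(1);
    }

    console.log("Agent eval gate passed.");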

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Pick the minimal runtime boundary (SDK vs app-server) that meets requirements; fewer tools mean simpler safety and lower variance.

  • 02.

    Start with a plain LLM baseline to prove each added ability (file access, shell, plugins) actually moves the needle; a greedy version of that loop is sketched below.
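
That baseline-first discipline can be written down as a greedy loop: add one ability at a time and keep it only if pass rate improves by a meaningful margin. The ability names and evalWith runner are illustrative, not part of any published API.

    // Sketch: grow capabilities from a plain-LLM baseline, keeping an ability
    // only when it beats the current best pass rate by a margin. Hypothetical
    // names throughout.

    type Ability = "file-access" | "shell" | "plugins";

    // Assumed runner: evaluates the task suite with the given abilities
    // enabled and returns the pass rate.
    declare function evalWith(abilities: Ability[], taskIds: string[]): Promise<number>;

    async function growCapabilities(
      taskIds: string[],
      margin = 0.02, // minimum lift worth the added surface area
    ): Promise<Ability[]> {
      const candidates: Ability[] = ["file-access", "shell", "plugins"];
      let kept: Ability[] = [];
      let best = await evalWith(kept, taskIds); // plain-LLM baseline
      for (const ability of candidates) {
        const score = await evalWith([...kept, ability], taskIds);
        // Only widen the runtime boundary when it demonstrably moves the needle.
        if (score >= best + margin) {
          kept = [...kept, ability];
          best = score;
        }
      }
      return kept;
    }

Whatever abilities survive the loop define the minimal runtime boundary from the first item: if nothing beats the baseline by the margin, ship the plain LLM.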
