How do I evaluate whether an extension improved the output?

Two approaches: deterministic checks (string search, AST parsing) and LLM-as-judge (natural language criteria). Deterministic checks are precise but brittle: a string search for "v3 SDK" passes if v3 appears in a comment, even if the code doesn't use it. LLM-as-judge can trace flows and evaluate architectural patterns, but may need custom tools for version comparison or schema validation. Write criteria the same way you'd write pull request acceptance criteria: specific enough that two reviewers would agree on the verdict.
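Here's a minimal sketch of that brittleness in Python. The module names `sdk_v2` and `sdk_v3` and the generated snippet are made up for illustration; swap in whatever your SDK is actually called.

```python
import ast

# Hypothetical agent output: the SDK name appears only in a comment.
GENERATED = '''
# TODO: migrate to v3 SDK
import sdk_v2

client = sdk_v2.Client()
'''


def string_check(source: str) -> bool:
    """Naive check: passes if the text "v3 SDK" appears anywhere."""
    return "v3 SDK" in source


def import_check(source: str, module: str = "sdk_v3") -> bool:
    """AST check: passes only if `module` is actually imported."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == module for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] == module:
                return True
    return False


print(string_check(GENERATED))  # True: the comment satisfies the search
print(import_check(GENERATED))  # False: sdk_v3 is never imported
```

The AST check is stricter, but it still only proves the import exists, not that the code uses the SDK correctly. That gap is where a judge with natural language criteria comes in.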

What is a scenario in agent extension evaluation?

A scenario is a specific task you ask the agent to complete, with three required components: (1) Starting workspace: the repository state before the agent starts (an empty folder or an existing project; agents behave differently in each); (2) Prompt: what you tell the agent to do, written the way a developer who doesn't know your extension exists would write it; (3) Evaluation criteria: how you determine whether the output is correct.
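One way to keep those three components together is a small record type. This is a sketch, not a prescribed schema: the `Scenario` class, its field names, and the example values are all assumptions.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Scenario:
    """One evaluation task: workspace + prompt + criteria (hypothetical schema)."""
    name: str
    workspace: Path             # starting repository state (empty folder or existing project)
    prompt: str                 # written as a developer unaware of the extension would write it
    criteria: tuple[str, ...]   # how you decide whether the output is correct

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means all components are present."""
        problems = []
        if not self.prompt.strip():
            problems.append("prompt is empty")
        if not self.criteria:
            problems.append("no evaluation criteria")
        return problems


scenario = Scenario(
    name="add-login",
    workspace=Path("fixtures/existing-project"),  # illustrative path
    prompt="Add sign-in to this app.",
    criteria=("The authentication flow uses the PKCE pattern with a redirect URI.",),
)
print(scenario.validate())  # [] when nothing is missing
```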

Is Your Agent Extension Actually Working?

Q: What metrics should I track when evaluating an agent extension?

Two core metrics: (1) Outcome improvement: did the output get better? (2) Cost: what did it cost in tokens, turns, and compute? A scenario that completes in 3 turns without your extension might take 7 turns with it. If outcomes improve by 10% but token costs triple, that's still lift, just an expensive one. Measure both dimensions.

Tool invocation looks like success. But if your agent produces the same output without your extension, your extension is drag, not lift. Here's how to measure whether your MCP server actually helps.

TL;DR: I built an MCP server for my agent. The agent called it in every turn. Then I disabled it and the agent produced the same output. I had built a 200-line no-op.

Key takeaways:

Tool invocation isn’t success: your extension might be drag (no improvement over baseline) or even negative lift (worse outcomes with it installed)

Run controlled comparisons: same model, harness, prompt, and workspace, so the only variable is your extension

Track two dimensions: outcome quality and cost (tokens, turns, compute). Expensive lift is still lift

Write evaluation criteria the way you’d write PR acceptance criteria: specific enough that two reviewers agree on the verdict

LLM-as-judge is more practical than deterministic checkers for architectural patterns, but it must be calibrated for accuracy and consistency

The most common mistake in agent extension work: treating tool invocation as a success signal. (It’s the same mistake as treating benchmark scores as agent loop performance: they measure different things.)

Your extension was called. It returned content. The agent used it. That looks like success.

Here’s the uncomfortable truth: your extension might be producing drag while appearing to work.

The MCP ecosystem is exploding. Google bet on MCP at I/O 2026, Accenture released MCP-Bench for standardized tool evaluation, and the research community has produced at least three dedicated benchmarks (MCPAgentBench, LiveMCPBench, MCP-Bench) in the last six months. Every new MCP server, every coding agent extension, every tool your agent calls carries the same unmeasured risk: it might be consuming context window space with zero benefit.

The problem is that nobody measures this. Tool providers publish invocation counts. Agents log which tools were called. Neither tells you whether the output improved because of the tool.

Why tool invocation is not success

When the agent calls your MCP server or coding agent extension, three things can happen:

Your extension changes the output for the better: the agent produces something it couldn’t have without your extension. This is lift.
Your extension changes nothing: the agent would have produced the same output using only its training data and workspace context. Your extension consumed context window space for zero benefit. This is drag.
Your extension makes things worse: it returned too much content, pushed relevant workspace context out of the context window, and the agent produced a worse result than the baseline. This is negative lift.

From the outside, all three look identical: the tool was called, content was returned. There’s no signal in the tool invocation itself that tells you whether the content helped.

How do I measure whether my agent extension works?

To know if your extension works, you need two data points:

Profile	What it is
Baseline	Agent with no extensions: only training data + workspace context
With Extension	Agent has your extension available

Run the same scenario against both. Everything else stays constant:

Same model
Same harness
Same prompt
Same starting workspace

Only variable: your extension.

If the output improves with your extension, it’s lift. If it’s identical, it’s drag. If it’s worse, your extension is net negative, and you need to know that before you ship it to users.
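The comparison can be sketched in a few lines of Python. Everything here is an assumption for illustration: `run_agent` stands in for whatever harness you use (it takes a flag saying whether the extension is enabled), and `score` is whatever outcome measure your criteria produce.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    score: float  # outcome quality, e.g. fraction of criteria passed
    tokens: int
    turns: int


def classify(run_agent: Callable[[bool], RunResult], tolerance: float = 0.0) -> str:
    """Run the same scenario with and without the extension and label the result."""
    baseline = run_agent(False)   # no extensions: training data + workspace only
    with_ext = run_agent(True)    # identical setup, extension available
    delta = with_ext.score - baseline.score
    if delta > tolerance:
        return "lift"
    if delta < -tolerance:
        return "negative lift"
    return "drag"


# Stand-in for a real harness: the extension costs more but changes nothing.
def fake_runner(extension_enabled: bool) -> RunResult:
    if extension_enabled:
        return RunResult(score=0.75, tokens=60_000, turns=7)
    return RunResult(score=0.75, tokens=20_000, turns=3)


print(classify(fake_runner))  # drag: same score with and without the extension
```

In practice a single pair of runs is noisy, so you would likely repeat each profile several times and set `tolerance` above the run-to-run variation you observe.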

What to measure

Two dimensions, both required:

Outcome quality: did the output get better? This is the hard part. Binary checks (does it compile? does it use the v3 SDK?) are easier to implement but miss architectural quality. LLM-as-judge can evaluate pattern matching and flow correctness, but needs calibration: the same model evaluating the same output twice might give different scores if the criteria aren’t specific enough.

Cost: what did it cost? Token usage, turns, compute time. A scenario that completes in 3 turns without your extension might take 7 turns with it. If outcomes improve by 10% but token costs triple, that’s still lift, just an expensive one. You need both dimensions to make a tradeoff decision.
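To make the tradeoff concrete, here is a small sketch that puts both dimensions side by side. The `Run` record and the numbers are invented to mirror the example above (10% better outcome, triple the tokens, 3 turns versus 7); it assumes the baseline score is above zero.

```python
from dataclasses import dataclass


@dataclass
class Run:
    score: float  # outcome quality from your criteria
    tokens: int
    turns: int


def tradeoff(baseline: Run, with_ext: Run) -> dict[str, float]:
    """Relative quality change and cost ratios of the extension run vs. baseline."""
    return {
        "quality_change": round(with_ext.score / baseline.score - 1, 2),
        "token_ratio": round(with_ext.tokens / baseline.tokens, 2),
        "turn_ratio": round(with_ext.turns / baseline.turns, 2),
    }


report = tradeoff(
    Run(score=0.70, tokens=20_000, turns=3),   # baseline (made-up numbers)
    Run(score=0.77, tokens=60_000, turns=7),   # with extension (made-up numbers)
)
print(report)
```

Whether a 10% gain is worth three times the tokens is a product decision, not something the numbers decide for you. The point is to have both numbers in front of you when you make it.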

How to write evaluation criteria

Write criteria the same way you’d write acceptance criteria for a pull request: specific enough that two reviewers would agree on the verdict.

❌ Vague	✅ Specific
"The code should be well-structured"	"The authentication flow should use the PKCE pattern with a redirect URI, not client credentials"
"The code uses the latest SDK version"	"The generated code imports and uses the v3 SDK in application code, not just in comments or string literals"

Vague criteria produce noisy results. Specific criteria produce reliable signals.

For pattern-based evaluation (is the auth flow correct? does the solution structure match documentation?), LLM-as-judge is more practical than writing deterministic checkers. A grader that inspects the AST to verify a PKCE flow is harder to maintain than the extension itself. Write the criteria in natural language, keep them in a text file you can update, and let a capable model apply them.
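A minimal sketch of the "criteria as a text file" idea: parse the criteria, then assemble them into a judge prompt. The prompt wording and function names are assumptions, and the call to the model itself is left out because it depends on your provider.

```python
# In practice this would live in a plain text file next to your scenarios.
CRITERIA_TEXT = """
# auth criteria
The authentication flow uses the PKCE pattern with a redirect URI, not client credentials.
The generated code imports and uses the v3 SDK in application code, not just in comments or string literals.
"""


def parse_criteria(text: str) -> list[str]:
    """One criterion per line; blank lines and '#' comment lines are ignored."""
    criteria = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            criteria.append(line)
    return criteria


def build_judge_prompt(criteria: list[str], generated_code: str) -> str:
    """Assemble a prompt asking for a PASS/FAIL verdict per criterion."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "You are reviewing code produced by a coding agent.\n"
        "For each criterion, answer PASS or FAIL with a one-line reason.\n\n"
        f"Criteria:\n{numbered}\n\n"
        f"Code under review:\n{generated_code}\n"
    )


prompt = build_judge_prompt(parse_criteria(CRITERIA_TEXT), "import sdk_v3\n")
print(prompt)
```

Asking for a verdict per criterion, rather than one overall score, makes it easier to see which criterion is causing a flip when results are inconsistent.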

After each criteria change, verify:

Accuracy: do scores reflect what’s in the code? Run against outputs where you know the correct verdict.
Consistency: do you get the same score when evaluating the same output multiple times? If verdicts flip, the criteria aren’t reliable.
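Both checks can be scripted. In this sketch the judge is any callable that takes an output and returns a pass/fail verdict; the stub judges below stand in for a real model call and are purely illustrative.

```python
from collections import Counter
from itertools import cycle
from typing import Callable

Judge = Callable[[str], bool]  # output text -> pass/fail verdict


def accuracy(judge: Judge, labeled: list[tuple[str, bool]]) -> float:
    """Share of known-verdict outputs where the judge agrees with the label."""
    correct = sum(judge(output) == expected for output, expected in labeled)
    return correct / len(labeled)


def consistency(judge: Judge, output: str, repeats: int = 5) -> float:
    """Share of repeated runs that agree with the most common verdict."""
    verdicts = Counter(judge(output) for _ in range(repeats))
    return verdicts.most_common(1)[0][1] / repeats


# Stub judges standing in for an LLM call.
def stable_judge(output: str) -> bool:
    return "sdk_v3" in output


_flips = cycle([True, False])


def flaky_judge(output: str) -> bool:
    return next(_flips)  # alternates verdicts regardless of input


labeled = [("import sdk_v3", True), ("import sdk_v2", False)]
print(accuracy(stable_judge, labeled))          # 1.0
print(consistency(stable_judge, "import sdk_v3"))  # 1.0
print(consistency(flaky_judge, "import sdk_v3"))   # 0.6
```

A consistency score well below 1.0 on the same output is the signal that the criteria, not the code, need work.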

Reliable evaluation criteria are the foundation of measurement. Do them wrong, and you end up amplifying noise and making decisions based on false signals.

Why does the starting workspace affect extension evaluation?

Agents behave differently in empty folders versus existing projects with structure, config, and dependencies. Your evaluation should cover both:

Empty folder: build from scratch. Tests whether the agent can use your extension to generate correct initial structure.
Existing project: extend or modify. Tests whether the agent can integrate your extension’s guidance with existing code.

If your extension only helps in empty folders but hurts in populated ones, you need to know before you ship.
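A sketch of how you might surface that split: group score differences by workspace type before averaging. The rows and their numbers are invented; in a real run they would come from your baseline and with-extension results.

```python
from collections import defaultdict
from statistics import mean

# (workspace kind, baseline score, with-extension score); made-up numbers
results = [
    ("empty", 0.50, 0.75),
    ("empty", 0.25, 0.75),
    ("existing", 0.75, 0.50),
    ("existing", 0.75, 0.75),
]


def lift_by_workspace(rows: list[tuple[str, float, float]]) -> dict[str, float]:
    """Average (extension - baseline) score difference per workspace kind."""
    deltas: dict[str, list[float]] = defaultdict(list)
    for kind, baseline, with_ext in rows:
        deltas[kind].append(with_ext - baseline)
    return {kind: mean(values) for kind, values in deltas.items()}


print(lift_by_workspace(results))  # {'empty': 0.375, 'existing': -0.125}
```

Averaged across all four rows the extension looks mildly positive, which is exactly how a regression in existing projects gets hidden.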

Why is a baseline essential for extension evaluation?

The hardest part of extension evaluation: accepting that you might be wrong. The baseline run exists to prove your extension wrong: to show that the agent would have done just as well without it.

If you only run scenarios with the extension installed, you’ll always see “the tool was called.” You’ll never see that the tool wasn’t needed. The baseline is the only honest measurement of whether your work matters.

This is the same discipline the Vertical Agent Method applies at the workflow level: before you commit to building an agent for a workflow, you verify that the workflow needs an agent, meaning that the baseline (doing it manually) is worse than the agent-assisted version. Extension work follows the same logic: the baseline is the floor you’re trying to beat, not a control group you can skip.

FAQ

Why is tool invocation a misleading success signal? Tool invocation means the agent called your extension and received content back. It doesn’t mean the content helped. The agent might have produced the same output using only its training data and workspace context, in which case your extension consumed context window space for no benefit. From the outside, everything looks fine: the tool was called, content was returned.

What is the baseline in agent extension evaluation? The baseline is the agent running with only its training data and workspace context, with no extensions installed. To measure your extension, you run the same scenario twice: once with your extension, once without. Everything else stays constant: same model, same harness, same prompt, same workspace. The only variable is your extension.

What metrics should I track? Two core metrics: (1) Outcome improvement: did the output get better? (2) Cost: what did it cost in tokens, turns, and compute? A scenario that completes in 3 turns without your extension might take 7 turns with it. If outcomes improve by 10% but token costs triple, that’s still lift, just an expensive one. Measure both dimensions to make a tradeoff decision.

How do I avoid writing evaluation criteria that favor my extension? Start with the prompt: write it the way a developer who doesn’t know your extension exists would write it. If you write “make sure it uses the v3 SDK” and then evaluate whether the v3 SDK was used, you’re optimizing for your extension rather than measuring its real value. The test should reveal whether your extension changes the outcome compared to the baseline, not whether the agent followed your hint.

Read Spec-Driven Development and the Vertical Agent Method for how the Vertical Agent Method applies the same verification discipline at the workflow level. Also see Best AI agent frameworks 2026 for a comparison of how different frameworks handle extension and tool evaluation.

Tools to evaluate and benchmark AI agent performance covers agent evaluation frameworks and metrics.

MCP evaluation guide covers measurement strategies for agent tool extensions.

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev