How to Build Evaluation Pipelines with LocalEvaluator and Keyword Checks in Agent Framework

You can build fast, API-free evaluation pipelines by combining LocalEvaluator with built-in checks like keyword_check to validate agent responses against expected keywords without external service calls.

The Microsoft Agent Framework provides a provider-agnostic evaluation engine for validating agent behavior with local checks. With LocalEvaluator and the keyword_check function, developers can build lightweight, CI-friendly pipelines that verify response content without consuming cloud API quotas. The approach rests on three core abstractions (EvalItem, EvalCheck, and the Evaluator protocol), all implemented in the framework's core evaluation module at python/packages/core/agent_framework/_evaluation.py.

Understand the Evaluation Architecture

The framework defines three foundational concepts in _evaluation.py:

  • EvalItem (line 181): A normalized representation of a single query-response interaction, containing conversation history, tool calls, context, and expected outputs.
  • EvalCheck (line 776): A callable that receives an EvalItem and returns a CheckResult with passed, reason, and check_name fields.
  • Evaluator protocol (line 504): An abstract interface implemented by any evaluation provider, whether cloud-based, LLM-as-judge, or local.
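To make these shapes concrete, here is a minimal, framework-free sketch. The dataclasses below are stand-ins written for illustration, not the framework's own definitions: the stand-in EvalItem carries only a response string (the real class holds more), and non_empty_check is a hypothetical check invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    """Stand-in with the three fields named above."""
    passed: bool
    reason: str
    check_name: str


@dataclass
class EvalItem:
    """Stand-in holding only the final response text (an assumption)."""
    response: str


# An EvalCheck is any callable that maps an EvalItem to a CheckResult.
EvalCheck = Callable[[EvalItem], CheckResult]


def non_empty_check(item: EvalItem) -> CheckResult:
    """Hypothetical check: pass when the response has non-whitespace text."""
    ok = bool(item.response.strip())
    return CheckResult(
        passed=ok,
        reason="Response has content" if ok else "Response is empty",
        check_name="non_empty_check",
    )


result = non_empty_check(EvalItem(response="It is sunny in Seattle."))
print(result.passed, result.reason)
```

Because a check is just a callable with this signature, the same function can be exercised directly in a unit test before it is handed to an evaluator.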

The public orchestration function evaluate_agent (lines 1540-1640) serves as the entry point. It normalizes input arguments, executes the agent (or reuses supplied responses), converts interactions into EvalItem objects via AgentEvalConverter.to_eval_item, attaches ground-truth data, and dispatches the items to every supplied evaluator.


# https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py#L1545-L1600

# Abridged outline of the function, not the full source
async def evaluate_agent(...):
    # Normalisation, validation, repetition handling, building the EvalItem list
    # ...
    # Dispatch to each evaluator (including LocalEvaluator)
    # ...

Use LocalEvaluator for Fast, API-Free Validation

LocalEvaluator implements the Evaluator protocol to run checks locally without external API calls. Defined in _evaluation.py (lines 1343-1347), the class stores a collection of EvalCheck objects and executes them sequentially against each EvalItem.

The evaluation loop (lines 1380-1410) handles both synchronous and asynchronous checks, aggregating pass/fail counts into EvalResults. Because validation occurs entirely within the Python process, LocalEvaluator is ideal for unit-test style validation, CI smoke tests, and rapid prototyping workflows.

from agent_framework import LocalEvaluator, keyword_check

local = LocalEvaluator(keyword_check("weather"))
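The idea of one loop serving both synchronous and asynchronous checks can be sketched without the framework. The code below is an illustration of the pattern, not LocalEvaluator's actual implementation: the dataclasses, both checks, and run_checks are hypothetical stand-ins, and the sketch detects coroutine results with inspect.isawaitable.

```python
import asyncio
import inspect
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str
    check_name: str


@dataclass
class EvalItem:
    response: str


def has_weather(item):
    """Synchronous stand-in check."""
    ok = "weather" in item.response.lower()
    return CheckResult(ok, "found" if ok else "missing 'weather'", "has_weather")


async def is_short(item):
    """Asynchronous stand-in check (in practice it might await I/O)."""
    ok = len(item.response) < 80
    return CheckResult(ok, "short enough" if ok else "too long", "is_short")


async def run_checks(items, checks):
    """Run every check against every item, awaiting only when needed."""
    passed = failed = 0
    for item in items:
        for check in checks:
            outcome = check(item)
            if inspect.isawaitable(outcome):
                outcome = await outcome
            if outcome.passed:
                passed += 1
            else:
                failed += 1
    return passed, failed


items = [EvalItem("The weather is mild today."), EvalItem("No idea.")]
passed, failed = asyncio.run(run_checks(items, [has_weather, is_short]))
print(f"passed={passed} failed={failed}")  # passed=3 failed=1
```

Calling the check first and inspecting what comes back keeps the loop indifferent to how each check was defined, which is the property that lets plain functions and coroutines be mixed in one evaluator.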

Implement Keyword Checks

The framework provides a ready-made keyword_check function (lines 886-905 in _evaluation.py) that validates the presence of required keywords in agent responses.


# https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py#L886-L905

@experimental(feature_id=ExperimentalFeature.EVALS)
def keyword_check(*keywords: str, case_sensitive: bool = False) -> EvalCheck:
    """Check that the response contains all specified keywords."""
    def _check(item: EvalItem) -> CheckResult:
        text = item.response if case_sensitive else item.response.lower()
        missing = [k for k in keywords if (k if case_sensitive else k.lower()) not in text]
        if missing:
            return CheckResult(passed=False,
                               reason=f"Missing keywords: {missing}",
                               check_name="keyword_check")
        return CheckResult(passed=True,
                           reason="All keywords found",
                           check_name="keyword_check")
    return _check

Parameters:

  • *keywords: Required words or phrases that must appear in the response.
  • case_sensitive: Optional boolean flag (default False) controlling string matching behavior.

The function returns an EvalCheck callable that can be passed directly to LocalEvaluator.
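The matching rules are easiest to see by running the check against a sample response. The sketch below copies the logic of the listing above onto stand-in dataclasses (the decorator is dropped, and the stand-in EvalItem holds only a response string), so it runs without the framework installed.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str
    check_name: str


@dataclass
class EvalItem:
    response: str


def keyword_check(*keywords, case_sensitive=False):
    """Local copy of the logic in the listing above, minus the decorator."""
    def _check(item):
        text = item.response if case_sensitive else item.response.lower()
        missing = [k for k in keywords
                   if (k if case_sensitive else k.lower()) not in text]
        if missing:
            return CheckResult(False, f"Missing keywords: {missing}", "keyword_check")
        return CheckResult(True, "All keywords found", "keyword_check")
    return _check


item = EvalItem(response="The Weather looks fine; expect a temperature of 18 C.")

loose = keyword_check("weather", "temperature")(item)
strict = keyword_check("weather", case_sensitive=True)(item)
partial = keyword_check("weather", "humidity")(item)

print(loose.passed)    # True
print(strict.reason)   # Missing keywords: ['weather']
print(partial.reason)  # Missing keywords: ['humidity']
```

Note that the comparison uses Python's `in` operator, so it is a substring match: a keyword such as "rain" would also be satisfied by "training".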

Build a Complete Evaluation Pipeline

A typical workflow involves four steps:

  1. Create checks using keyword_check() to define validation criteria.
  2. Instantiate LocalEvaluator with the configured checks.
  3. Call evaluate_agent with your agent, test queries, and the evaluator.
  4. Inspect EvalResults for aggregated statistics and per-item failure reasons.

# https://github.com/microsoft/agent-framework/blob/main/python/samples/02-agents/evaluation/evaluate_agent.py#L19-L34

from agent_framework import LocalEvaluator, keyword_check, evaluate_agent, Agent

# Build an agent (any subclass of Agent)
my_agent = Agent(...)          # ← configure your agent here

# 1️⃣ Define the checks you care about
kw_check = keyword_check("weather", "temperature")   # must mention both words

# 2️⃣ Wrap them in a LocalEvaluator
local = LocalEvaluator(kw_check)

# 3️⃣ Run the evaluation pipeline
# (`await` needs an async context: an `async def` run via asyncio.run(), or a notebook)
results = await evaluate_agent(
    agent=my_agent,
    queries=["What’s the weather like today?"],
    evaluators=local,
)

# 4️⃣ Inspect the outcome
print("Passed?", results[0].all_passed)               # True / False
print("Details:", results[0].items[0].scores)          # per-check score

Advanced Pipeline Features

The evaluation engine supports several advanced capabilities for robust validation:

Feature | Description | Implementation Location
--- | --- | ---
Multiple repetitions | Run each query N times to measure stability; each run creates a distinct EvalItem. | evaluate_agent (lines 1550-1560)
Custom split strategy | Override the default ConversationSplit.LAST_TURN to evaluate full conversations or specific turns. | evaluate_agent (lines 1600-1610)
Mixed evaluators | Combine LocalEvaluator with cloud providers like FoundryEvals in a single call. | evaluate_agent evaluator loop
Tool-call expectations | Validate specific tool invocations using expected_tool_calls and built-in checks like tool_calls_present. | Ground-truth stamping (lines 1640-1660)

Mix Local and Cloud Evaluators

You can validate keywords locally while simultaneously running semantic evaluations through cloud providers:


# https://github.com/microsoft/agent-framework/blob/main/python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py#L70-L85

from agent_framework import LocalEvaluator, keyword_check, evaluate_agent
from agent_framework.foundry import FoundryEvals

# Local check: ensure the word "weather" appears
local = LocalEvaluator(keyword_check("weather"))

# Cloud (Foundry) evaluator – requires a Foundry client (omitted for brevity)
foundry = FoundryEvals(project_client=client, model="gpt-4o")

# Run both providers in one call
results = await evaluate_agent(
    agent=my_agent,
    queries=["Tell me the weather in Seattle."],
    evaluators=[local, foundry],
)

# `results` is a list: first entry = Local, second = Foundry

for r in results:
    print(r.provider, "passed:", r.all_passed)
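In a CI job, the per-evaluator results usually need to collapse into a single pass/fail signal. The sketch below shows one way to do that. ResultStub is a stand-in exposing only the provider and all_passed attributes used in the loop above, and the provider names are made-up values for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResultStub:
    """Stand-in exposing only the two attributes read in the loop above."""
    provider: str
    all_passed: bool


def failing_providers(results):
    """Return the providers whose results did not all pass."""
    return [r.provider for r in results if not r.all_passed]


# Hypothetical outcome: the local checks pass, the cloud evaluator does not.
results = [ResultStub("local", True), ResultStub("foundry", False)]

failed = failing_providers(results)
exit_code = 1 if failed else 0
print("Failing providers:", failed, "exit code:", exit_code)
```

A script could pass exit_code to sys.exit so the CI runner marks the step as failed whenever any evaluator reports a failure.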

Test Consistency with Repetitions

Run queries multiple times to detect flaky behavior:


# Reuse any agent; we repeat each query 3 times

results = await evaluate_agent(
    agent=my_agent,
    queries=["What’s the capital of France?"],
    evaluators=LocalEvaluator(keyword_check("Paris")),
    num_repetitions=3,          # three independent runs
)

print("Total items evaluated:", len(results[0].items))   # → 3
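Once each repetition produces its own item, flakiness shows up as a pass rate strictly between 0 and 1 for a query. The sketch below computes that rate from plain records. ItemOutcome is a hypothetical stand-in for one evaluated run, so mapping real result items onto it is left as an assumption.

```python
from dataclasses import dataclass


@dataclass
class ItemOutcome:
    """Hypothetical stand-in for one evaluated run of a query."""
    query: str
    passed: bool


def pass_rates(outcomes):
    """Group runs by query and return the fraction that passed for each."""
    totals = {}
    for o in outcomes:
        passed, count = totals.get(o.query, (0, 0))
        totals[o.query] = (passed + int(o.passed), count + 1)
    return {q: p / c for q, (p, c) in totals.items()}


# Three made-up runs of the same query, one of which failed.
runs = [
    ItemOutcome("capital of France", True),
    ItemOutcome("capital of France", True),
    ItemOutcome("capital of France", False),
]

rates = pass_rates(runs)
flaky = [q for q, r in rates.items() if 0 < r < 1]
print(rates, flaky)
```

A query that passes every time or fails every time is at least consistent; the ones in flaky are the candidates for prompt or tool changes.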

Summary

  • LocalEvaluator provides fast, API-free validation by executing EvalCheck objects directly against EvalItem instances without external service calls.
  • keyword_check creates reusable validation rules that verify keyword presence in agent responses, supporting case-insensitive matching by default.
  • evaluate_agent orchestrates the entire pipeline, normalizing inputs, running agents, and dispatching to multiple evaluators including local and cloud providers.
  • The architecture supports mixed evaluation strategies, allowing you to combine lightweight keyword checks with semantic LLM-as-judge evaluators in a single execution.
  • All core components reside in python/packages/core/agent_framework/_evaluation.py, with reference implementations available in the python/samples/02-agents/evaluation/ directory.

Frequently Asked Questions

How does LocalEvaluator differ from cloud-based evaluators like FoundryEvals?

LocalEvaluator executes validation logic entirely within the Python process, checking EvalItem objects against provided EvalCheck functions without network requests. Cloud-based evaluators like FoundryEvals send data to external APIs for semantic judging. According to the source code in _evaluation.py (lines 1340-1410), LocalEvaluator iterates through checks locally, making it suitable for CI pipelines and unit tests where speed and offline operation are critical.

Can I use keyword_check for case-sensitive matching?

Yes. The keyword_check function accepts an optional case_sensitive parameter (default False). When set to True, the check preserves the original casing of both the agent response and the target keywords during comparison. The implementation in _evaluation.py (lines 886-905) handles this by conditionally applying .lower() to both texts based on the parameter value.

How do I evaluate multiple queries with different expected keywords?

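Create separate LocalEvaluator instances for each unique validation requirement, or compose multiple keyword_check calls into a single evaluator. The evaluate_agent function accepts a list of evaluators, allowing you to segment validation logic. Keep in mind that evaluate_agent dispatches every item to every supplied evaluator, so keywords that apply to only some queries call for either a separate evaluate_agent call per query group or a check that inspects the item itself. Each EvalResults object in the returned list corresponds to one evaluator and contains per-item pass/fail details for all queries processed by that evaluator's check set.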
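One way to vary keywords per query inside a single evaluator is a custom check that looks up its keywords from the item. The sketch below is a stand-in illustration only: it assumes the item exposes the query text under a `query` attribute, which is a hypothetical name here and may not match the real EvalItem, and keywords_by_query is a helper invented for this example.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str
    check_name: str


@dataclass
class EvalItem:
    query: str      # hypothetical attribute; the real EvalItem may differ
    response: str


def keywords_by_query(expected):
    """Build a check that looks up required keywords by the item's query."""
    def _check(item):
        wanted = expected.get(item.query, [])
        text = item.response.lower()
        missing = [k for k in wanted if k.lower() not in text]
        return CheckResult(
            passed=not missing,
            reason=f"Missing keywords: {missing}" if missing else "All keywords found",
            check_name="keywords_by_query",
        )
    return _check


check = keywords_by_query({
    "Weather in Seattle?": ["seattle", "weather"],
    "Capital of France?": ["paris"],
})

ok = check(EvalItem("Capital of France?", "The capital is Paris."))
bad = check(EvalItem("Weather in Seattle?", "It is raining."))
print(ok.passed, bad.reason)
```

Queries that are absent from the mapping pass trivially in this sketch; a stricter variant could fail them instead so that missing expectations are noticed.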

What file contains the unit tests for LocalEvaluator and keyword_check?

The unit tests for LocalEvaluator, keyword_check, and automatic function wrapping reside in python/packages/core/tests/core/test_local_eval.py in the repository. These tests demonstrate usage patterns and validate the local evaluation loop's handling of both synchronous and asynchronous check functions.
