How to Build Evaluation Pipelines with LocalEvaluator and Keyword Checks in Agent Framework
You can build fast, API-free evaluation pipelines by combining LocalEvaluator with built-in checks like keyword_check to validate agent responses against expected keywords without external service calls.
The Microsoft Agent Framework provides a provider-agnostic evaluation engine that enables rapid validation of agent behavior through local checks. By leveraging LocalEvaluator and the keyword_check function, developers can create lightweight CI-friendly pipelines that verify response content without consuming cloud API quotas. This approach centers on three core abstractions—EvalItem, EvalCheck, and the Evaluator protocol—implemented in the framework's core evaluation module at python/packages/core/agent_framework/_evaluation.py.
Understand the Evaluation Architecture
The framework defines three foundational concepts in _evaluation.py:
- EvalItem (line 181): A normalized representation of a single query-response interaction, containing conversation history, tool calls, context, and expected outputs.
- EvalCheck (line 776): A callable that receives an EvalItem and returns a CheckResult with passed, reason, and check_name fields.
- Evaluator protocol (line 504): An abstract interface implemented by any evaluation provider, whether cloud-based, LLM-as-judge, or local.
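To make the relationship between the three concepts concrete, here is a minimal sketch built from simplified stand-ins. These classes are not the framework's own definitions: the query field, the evaluate() signature, and the SingleCheckEvaluator and non_empty names are assumptions made for illustration. Only the passed, reason, and check_name fields come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol


@dataclass
class CheckResult:
    """Outcome of one check, with the three fields named in the article."""
    passed: bool
    reason: str
    check_name: str


@dataclass
class EvalItem:
    """Simplified stand-in: one query-response interaction."""
    query: str  # assumed field, for illustration only
    response: str
    expected_output: Optional[str] = None


# An EvalCheck is any callable that maps an EvalItem to a CheckResult.
EvalCheck = Callable[[EvalItem], CheckResult]


class Evaluator(Protocol):
    """Hypothetical protocol shape; the real signature may differ."""
    def evaluate(self, items: List[EvalItem]) -> List[CheckResult]: ...


def non_empty(item: EvalItem) -> CheckResult:
    """A trivial check: the response must contain non-whitespace text."""
    ok = bool(item.response.strip())
    return CheckResult(ok, "response present" if ok else "empty response", "non_empty")


class SingleCheckEvaluator:
    """Satisfies the protocol structurally; no inheritance is needed."""
    def __init__(self, check: EvalCheck) -> None:
        self.check = check

    def evaluate(self, items: List[EvalItem]) -> List[CheckResult]:
        return [self.check(item) for item in items]


evaluator: Evaluator = SingleCheckEvaluator(non_empty)
results = evaluator.evaluate([EvalItem("hi", "Hello!"), EvalItem("hi", "   ")])
print([r.passed for r in results])
```

The point of the protocol is that a local evaluator and a cloud evaluator can be swapped freely as long as both accept the same items and return comparable results.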
The public orchestration function evaluate_agent (lines 1540-1640) serves as the entry point. It normalizes input arguments, executes the agent (or reuses supplied responses), converts interactions into EvalItem objects via AgentEvalConverter.to_eval_item, attaches ground-truth data, and dispatches the items to every supplied evaluator.
# https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py#L1545-L1600
async def evaluate_agent(...):
    # Normalisation, validation, repetition handling, building EvalItem list
    # ...
    # Dispatch to each evaluator (including LocalEvaluator)
    # ...
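The orchestration steps described above (run or reuse, normalize, attach ground truth, dispatch) can be sketched with plain Python. This is an illustrative stand-in, not the body of evaluate_agent: the Item class, the evaluate_sketch function, and its parameters are all hypothetical names chosen for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional, Sequence


@dataclass
class Item:
    """Hypothetical stand-in for a normalized evaluation item."""
    query: str
    response: str
    expected_output: Optional[str] = None


async def evaluate_sketch(
    agent: Callable[[str], Awaitable[str]],
    queries: Sequence[str],
    evaluators: Sequence[Callable[[List[Item]], dict]],
    responses: Optional[Sequence[str]] = None,
    expected: Optional[Sequence[str]] = None,
) -> List[dict]:
    # 1. Run the agent, or reuse responses that were supplied up front.
    if responses is None:
        responses = [await agent(q) for q in queries]
    if len(responses) != len(queries):
        raise ValueError("queries and responses must have the same length")
    # 2. Normalize each interaction into an item.
    items = [Item(q, r) for q, r in zip(queries, responses)]
    # 3. Attach ground truth where it was provided.
    if expected is not None:
        for item, exp in zip(items, expected):
            item.expected_output = exp
    # 4. Dispatch the same items to every evaluator: one result per evaluator.
    return [evaluator(items) for evaluator in evaluators]


async def echo_agent(query: str) -> str:
    """Toy agent used only to exercise the sketch."""
    return f"You asked: {query}"


def count_items(items: List[Item]) -> dict:
    """Toy evaluator that just reports how many items it received."""
    return {"evaluator": "count", "total": len(items)}


results = asyncio.run(evaluate_sketch(echo_agent, ["hi", "bye"], [count_items]))
print(results)
```

Because every evaluator receives the same item list, the returned list has one entry per evaluator, in the order the evaluators were passed.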
Use LocalEvaluator for Fast, API-Free Validation
LocalEvaluator implements the Evaluator protocol to run checks locally without external API calls. Defined in _evaluation.py (lines 1343-1347), the class stores a collection of EvalCheck objects and executes them sequentially against each EvalItem.
The evaluation loop (lines 1380-1410) handles both synchronous and asynchronous checks, aggregating pass/fail counts into EvalResults. Because validation occurs entirely within the Python process, LocalEvaluator is ideal for unit-test style validation, CI smoke tests, and rapid prototyping workflows.
from agent_framework import LocalEvaluator, keyword_check
local = LocalEvaluator(keyword_check("weather"))
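A loop that accepts both synchronous and asynchronous checks can be sketched as follows. This is a simplified illustration of the general technique (call the check, then await the result only if it is awaitable), not the code in _evaluation.py. Result, contains_paris, short_enough, and run_checks are hypothetical names.

```python
import asyncio
import inspect
from dataclasses import dataclass


@dataclass
class Result:
    passed: bool
    reason: str


def contains_paris(response: str) -> Result:
    """Synchronous check."""
    ok = "paris" in response.lower()
    return Result(ok, "found 'paris'" if ok else "missing 'paris'")


async def short_enough(response: str) -> Result:
    """Asynchronous check; the sleep stands in for real async work."""
    await asyncio.sleep(0)
    ok = len(response) <= 80
    return Result(ok, "within 80 chars" if ok else "too long")


async def run_checks(responses, checks):
    """Run every check against every response and tally pass/fail counts."""
    passed = failed = 0
    for response in responses:
        for check in checks:
            outcome = check(response)
            # Async checks return an awaitable; sync checks return a Result.
            if inspect.isawaitable(outcome):
                outcome = await outcome
            if outcome.passed:
                passed += 1
            else:
                failed += 1
    return {"passed": passed, "failed": failed}


summary = asyncio.run(
    run_checks(["Paris is the capital.", "I am not sure."], [contains_paris, short_enough])
)
print(summary)
```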
Implement Keyword Checks
The framework provides a ready-made keyword_check function (lines 886-905 in _evaluation.py) that validates the presence of required keywords in agent responses.
# https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py#L886-L905
@experimental(feature_id=ExperimentalFeature.EVALS)
def keyword_check(*keywords: str, case_sensitive: bool = False) -> EvalCheck:
    """Check that the response contains all specified keywords."""
    def _check(item: EvalItem) -> CheckResult:
        text = item.response if case_sensitive else item.response.lower()
        missing = [k for k in keywords if (k if case_sensitive else k.lower()) not in text]
        if missing:
            return CheckResult(passed=False,
                               reason=f"Missing keywords: {missing}",
                               check_name="keyword_check")
        return CheckResult(passed=True,
                           reason="All keywords found",
                           check_name="keyword_check")
    return _check
Parameters:
- *keywords: Required words or phrases that must appear in the response.
- case_sensitive: Optional boolean flag (default False) controlling string matching behavior.
The function returns an EvalCheck callable that can be passed directly to LocalEvaluator.
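The matching rule in the listing above is a plain substring test, which has consequences worth seeing in isolation. The helper below copies only that matching expression into a standalone function; missing_keywords is a hypothetical name and is not part of the framework.

```python
def missing_keywords(response: str, *keywords: str, case_sensitive: bool = False) -> list:
    """Return the keywords absent from the response, using the listing's matching rule."""
    text = response if case_sensitive else response.lower()
    return [k for k in keywords if (k if case_sensitive else k.lower()) not in text]


# Default matching ignores case.
print(missing_keywords("The Weather is mild.", "weather"))
# Case-sensitive matching treats "Weather" and "weather" as different.
print(missing_keywords("The Weather is mild.", "weather", case_sensitive=True))
# Substring matching: "weather" is found inside "weathered".
print(missing_keywords("It weathered the storm.", "weather"))
# Every keyword must be present; all missing ones are reported.
print(missing_keywords("Sunny, 72 degrees.", "weather", "temperature"))
```

Because the test is substring-based, short keywords can match inside longer words. Choosing longer or more distinctive phrases reduces accidental passes.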
Build a Complete Evaluation Pipeline
A typical workflow involves four steps:
- Create checks using keyword_check() to define validation criteria.
- Instantiate LocalEvaluator with the configured checks.
- Call evaluate_agent with your agent, test queries, and the evaluator.
- Inspect EvalResults for aggregated statistics and per-item failure reasons.
# https://github.com/microsoft/agent-framework/blob/main/python/samples/02-agents/evaluation/evaluate_agent.py#L19-L34
from agent_framework import LocalEvaluator, keyword_check, evaluate_agent, Agent
# 1️⃣ Build an agent (any subclass of Agent)
my_agent = Agent(...) # ← configure your agent here
# 2️⃣ Define the checks you care about
kw_check = keyword_check("weather", "temperature") # must mention both words
# 3️⃣ Wrap them in a LocalEvaluator
local = LocalEvaluator(kw_check)
# 4️⃣ Run the evaluation pipeline
# `await` needs an async context: run this inside an async function or a notebook cell
results = await evaluate_agent(
    agent=my_agent,
    queries=["What’s the weather like today?"],
    evaluators=local,
)
# 5️⃣ Inspect the outcome
print("Passed?", results[0].all_passed)  # True / False
print("Details:", results[0].items[0].scores)  # per‑check score
Advanced Pipeline Features
The evaluation engine supports several advanced capabilities for robust validation:
| Feature | Description | Implementation Location |
|---|---|---|
| Multiple repetitions | Run each query N times to measure stability; each run creates a distinct EvalItem. | evaluate_agent (lines 1550-1560) |
| Custom split strategy | Override the default ConversationSplit.LAST_TURN to evaluate full conversations or specific turns. | evaluate_agent (lines 1600-1610) |
| Mixed evaluators | Combine LocalEvaluator with cloud providers like FoundryEvals in a single call. | evaluate_agent evaluator loop |
| Tool-call expectations | Validate specific tool invocations using expected_tool_calls and built-in checks like tool_calls_present. | Ground-truth stamping (lines 1640-1660) |
Mix Local and Cloud Evaluators
You can validate keywords locally while simultaneously running semantic evaluations through cloud providers:
# https://github.com/microsoft/agent-framework/blob/main/python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py#L70-L85
from agent_framework import LocalEvaluator, keyword_check, evaluate_agent
from agent_framework.foundry import FoundryEvals
# Local check: ensure the word "weather" appears
local = LocalEvaluator(keyword_check("weather"))
# Cloud (Foundry) evaluator – requires a Foundry client (omitted for brevity)
foundry = FoundryEvals(project_client=client, model="gpt-4o")
# Run both providers in one call
results = await evaluate_agent(
    agent=my_agent,
    queries=["Tell me the weather in Seattle."],
    evaluators=[local, foundry],
)
# `results` is a list: first entry = Local, second = Foundry
for r in results:
    print(r.provider, "passed:", r.all_passed)
Test Consistency with Repetitions
Run queries multiple times to detect flaky behavior:
# Reuse any agent; we repeat each query 3 times
results = await evaluate_agent(
    agent=my_agent,
    queries=["What’s the capital of France?"],
    evaluators=LocalEvaluator(keyword_check("Paris")),
    num_repetitions=3,  # three independent runs
)
print("Total items evaluated:", len(results[0].items)) # → 3
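Once repeated runs are available, flakiness can be summarized as a per-query pass rate. The sketch below works on hand-written sample data (a mapping from query to a list of pass/fail booleans) rather than on the framework's result object, since how to extract those booleans depends on the result structure. pass_rate and flaky_queries are hypothetical helper names.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of runs that passed; an empty list counts as 0.0."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def flaky_queries(runs: Dict[str, List[bool]]) -> List[str]:
    """Queries whose repeated runs disagree (some passed, some failed)."""
    return [q for q, outcomes in runs.items() if 0.0 < pass_rate(outcomes) < 1.0]


# Illustrative data only: three repetitions per query.
runs = {
    "What's the capital of France?": [True, True, True],
    "What's the weather like today?": [True, False, True],
}

for query, outcomes in runs.items():
    print(f"{query}: {pass_rate(outcomes):.0%} passed")
print("Flaky:", flaky_queries(runs))
```

A query that passes every time or fails every time is stable; only mixed outcomes indicate nondeterministic behavior worth investigating.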
Summary
- LocalEvaluator provides fast, API-free validation by executing EvalCheck objects directly against EvalItem instances without external service calls.
- keyword_check creates reusable validation rules that verify keyword presence in agent responses, supporting case-insensitive matching by default.
- evaluate_agent orchestrates the entire pipeline, normalizing inputs, running agents, and dispatching to multiple evaluators including local and cloud providers.
- The architecture supports mixed evaluation strategies, allowing you to combine lightweight keyword checks with semantic LLM-as-judge evaluators in a single execution.
- All core components reside in python/packages/core/agent_framework/_evaluation.py, with reference implementations available in the python/samples/02-agents/evaluation/ directory.
Frequently Asked Questions
How does LocalEvaluator differ from cloud-based evaluators like FoundryEvals?
LocalEvaluator executes validation logic entirely within the Python process, checking EvalItem objects against provided EvalCheck functions without network requests. Cloud-based evaluators like FoundryEvals send data to external APIs for semantic judging. According to the source code in _evaluation.py (lines 1340-1410), LocalEvaluator iterates through checks locally, making it suitable for CI pipelines and unit tests where speed and offline operation are critical.
Can I use keyword_check for case-sensitive matching?
Yes. The keyword_check function accepts an optional case_sensitive parameter (default False). When set to True, the check preserves the original casing of both the agent response and the target keywords during comparison. The implementation in _evaluation.py (lines 886-905) handles this by conditionally applying .lower() to both texts based on the parameter value.
How do I evaluate multiple queries with different expected keywords?
Create separate LocalEvaluator instances for each unique validation requirement, or pass several keyword_check checks to a single evaluator. The evaluate_agent function accepts a list of evaluators, allowing you to segment validation logic. Each EvalResults object in the returned list corresponds to one evaluator and contains per-item pass/fail details for all queries processed by that evaluator's check set.
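When the required keywords differ per query, one possible pattern is a custom check that looks up its keywords from the item it receives. The sketch below illustrates that idea with stand-in types. It assumes the item exposes the original query text, which the article does not confirm for the framework's EvalItem, and Item, Outcome, and per_query_keywords are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass
class Item:
    """Hypothetical stand-in for an evaluation item (not the framework class)."""
    query: str  # assumed to be available on the item
    response: str


@dataclass
class Outcome:
    passed: bool
    reason: str


def per_query_keywords(expected: Mapping[str, Sequence[str]]) -> Callable[[Item], Outcome]:
    """Build a check that selects the required keywords by the item's query."""
    def _check(item: Item) -> Outcome:
        required = expected.get(item.query, ())  # no entry means nothing is required
        text = item.response.lower()
        missing = [k for k in required if k.lower() not in text]
        if missing:
            return Outcome(False, f"Missing keywords: {missing}")
        return Outcome(True, "All keywords found")
    return _check


check = per_query_keywords({
    "Capital of France?": ["Paris"],
    "Weather in Seattle?": ["weather", "Seattle"],
})
outcomes = [
    check(Item("Capital of France?", "The capital is Paris.")),
    check(Item("Weather in Seattle?", "Expect rain all day.")),
]
for outcome in outcomes:
    print(outcome.passed, outcome.reason)
```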
What file contains the unit tests for LocalEvaluator and keyword_check?
The unit tests for LocalEvaluator, keyword_check, and automatic function wrapping reside in python/packages/core/tests/core/test_local_eval.py in the repository. These tests demonstrate usage patterns and validate the local evaluation loop's handling of both synchronous and asynchronous check functions.