# How to Build Evaluation Pipelines with LocalEvaluator and Keyword Checks in Agent Framework

> Build fast, API-free evaluation pipelines using Agent Framework's `LocalEvaluator` and keyword checks. Validate agent responses efficiently without external service calls.

- Repository: [Microsoft/agent-framework](https://github.com/microsoft/agent-framework)
- Tags: how-to-guide
- Published: 2026-04-05

---

**You can build fast, API-free evaluation pipelines by combining `LocalEvaluator` with built-in checks like `keyword_check` to validate agent responses against expected keywords without external service calls.**

The Microsoft Agent Framework provides a provider-agnostic evaluation engine that validates agent behavior through local checks. With `LocalEvaluator` and the `keyword_check` function, developers can create lightweight, CI-friendly pipelines that verify response content without consuming cloud API quotas. This approach centers on three core abstractions (`EvalItem`, `EvalCheck`, and the `Evaluator` protocol), all implemented in the framework's core evaluation module at [`python/packages/core/agent_framework/_evaluation.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py).

## Understand the Evaluation Architecture

The framework defines three foundational concepts in [`_evaluation.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py):

- **`EvalItem`** (line 181): A normalized representation of a single query-response interaction, containing conversation history, tool calls, context, and expected outputs.
- **`EvalCheck`** (line 776): A callable that receives an `EvalItem` and returns a `CheckResult` with `passed`, `reason`, and `check_name` fields.
- **`Evaluator` protocol** (line 504): An abstract interface implemented by any evaluation provider, whether cloud-based, LLM-as-judge, or local.
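To make the relationship between the three concepts concrete, here is a minimal sketch built from simplified stand-in types. The names `SketchItem`, `SketchCheckResult`, `SketchCheck`, `SketchEvaluator`, and `SingleCheckEvaluator` are hypothetical and are not the framework's actual definitions; only the `passed`, `reason`, and `check_name` fields come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class SketchItem:
    """Stand-in for EvalItem: one query/response pair plus optional ground truth."""
    query: str
    response: str
    expected_output: str | None = None


@dataclass
class SketchCheckResult:
    """Stand-in for CheckResult, using the three fields named above."""
    passed: bool
    reason: str
    check_name: str


# Stand-in for EvalCheck: any callable that maps an item to a result.
SketchCheck = Callable[[SketchItem], SketchCheckResult]


class SketchEvaluator(Protocol):
    """Stand-in for the Evaluator protocol: anything that can score a batch of items."""

    def evaluate(self, items: list[SketchItem]) -> list[SketchCheckResult]: ...


def non_empty(item: SketchItem) -> SketchCheckResult:
    """A trivial check: the response must contain non-whitespace text."""
    ok = bool(item.response.strip())
    return SketchCheckResult(ok, "response present" if ok else "empty response", "non_empty")


class SingleCheckEvaluator:
    """Satisfies SketchEvaluator structurally by applying one check to every item."""

    def __init__(self, check: SketchCheck) -> None:
        self.check = check

    def evaluate(self, items: list[SketchItem]) -> list[SketchCheckResult]:
        return [self.check(item) for item in items]


outcome = SingleCheckEvaluator(non_empty).evaluate(
    [SketchItem("hi", "Hello!"), SketchItem("hi", "   ")]
)
print([r.passed for r in outcome])
```

The point of the sketch is the layering: items carry data, checks are plain callables, and an evaluator is anything that applies checks to items, which is why local and cloud providers can share one interface.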

The **public orchestration function** `evaluate_agent` (lines 1540-1640) serves as the entry point. It normalizes input arguments, executes the agent (or reuses supplied responses), converts interactions into `EvalItem` objects via `AgentEvalConverter.to_eval_item`, attaches ground-truth data, and dispatches the items to every supplied evaluator.

```python
# https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py#L1545-L1600
async def evaluate_agent(...):
    # Normalisation, validation, repetition handling, building EvalItem list
    # ...
    # Dispatch to each evaluator (including LocalEvaluator)
    # ...
```

## Use LocalEvaluator for Fast, API-Free Validation

`LocalEvaluator` implements the `Evaluator` protocol and runs checks locally, with no external API calls. Defined in [`_evaluation.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py) (lines 1343-1347), the class stores a collection of `EvalCheck` objects and executes them sequentially against each `EvalItem`.

The evaluation loop (lines 1380-1410) handles both synchronous and asynchronous checks, aggregating pass/fail counts into `EvalResults`. Because validation occurs entirely within the Python process, `LocalEvaluator` is ideal for **unit-test style validation**, CI smoke tests, and rapid prototyping workflows.
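The mixed sync/async handling described above can be sketched in a few lines. This is an illustration of the general technique, not the framework's loop: `run_checks`, `SketchResult`, and both example checks are hypothetical, and the checks take plain strings rather than `EvalItem` objects.

```python
import asyncio
import inspect
from dataclasses import dataclass


@dataclass
class SketchResult:
    """Stand-in for a check result (hypothetical, not the framework's CheckResult)."""
    passed: bool
    reason: str


async def run_checks(responses, checks):
    """Run every check against every response, awaiting results that are awaitable."""
    passed = failed = 0
    for response in responses:
        for check in checks:
            result = check(response)
            if inspect.isawaitable(result):  # async checks return a coroutine
                result = await result
            if result.passed:
                passed += 1
            else:
                failed += 1
    return passed, failed


def mentions_weather(response: str) -> SketchResult:
    """Synchronous check: the response must mention 'weather'."""
    ok = "weather" in response.lower()
    return SketchResult(ok, "found" if ok else "missing 'weather'")


async def is_short(response: str) -> SketchResult:
    """Asynchronous check: the response must be under 80 characters."""
    ok = len(response) < 80
    return SketchResult(ok, "short enough" if ok else "too long")


counts = asyncio.run(
    run_checks(
        ["The weather is sunny.", "I cannot help with that."],
        [mentions_weather, is_short],
    )
)
print(counts)  # (passed, failed) across all response/check pairs
```

Calling the check first and awaiting only when the result is awaitable lets one loop serve both kinds of check without asking callers to declare which kind they wrote.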

```python
from agent_framework import LocalEvaluator, keyword_check

local = LocalEvaluator(keyword_check("weather"))
```

## Implement Keyword Checks

The framework provides a ready-made `keyword_check` function (lines 886-905 in [`_evaluation.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py)) that validates the presence of required keywords in agent responses.

```python
# https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py#L886-L905
@experimental(feature_id=ExperimentalFeature.EVALS)
def keyword_check(*keywords: str, case_sensitive: bool = False) -> EvalCheck:
    """Check that the response contains all specified keywords."""
    def _check(item: EvalItem) -> CheckResult:
        text = item.response if case_sensitive else item.response.lower()
        missing = [k for k in keywords if (k if case_sensitive else k.lower()) not in text]
        if missing:
            return CheckResult(passed=False,
                               reason=f"Missing keywords: {missing}",
                               check_name="keyword_check")
        return CheckResult(passed=True,
                           reason="All keywords found",
                           check_name="keyword_check")
    return _check
```

**Parameters**:

- `*keywords`: Required words or phrases that must appear in the response.
- `case_sensitive`: Optional boolean flag (default `False`) controlling string matching behavior.

The function returns an `EvalCheck` callable that can be passed directly to `LocalEvaluator`.
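To see how the matching rule behaves, the sketch below reproduces the comparison logic from the excerpt as a standalone function over plain strings. `missing_keywords` is a hypothetical helper written for this illustration; it is not part of the framework and skips the `EvalItem` and `CheckResult` wrappers.

```python
def missing_keywords(response: str, *keywords: str, case_sensitive: bool = False) -> list:
    """Return the keywords absent from response, using the excerpt's matching rule."""
    text = response if case_sensitive else response.lower()
    return [k for k in keywords if (k if case_sensitive else k.lower()) not in text]


reply = "Paris weather: the temperature is 18 degrees."

# Default mode lowercases both sides, so "TEMPERATURE" still matches.
print(missing_keywords(reply, "weather", "TEMPERATURE"))

# Case-sensitive mode compares as written, so "paris" does not match "Paris".
print(missing_keywords(reply, "paris", case_sensitive=True))

# Matching uses substring containment, so "temp" matches inside "temperature".
print(missing_keywords(reply, "temp"))
```

Because the rule is substring containment rather than whole-word matching, short keywords can pass on unrelated words; choose keywords specific enough to avoid accidental hits.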

## Build a Complete Evaluation Pipeline

A typical workflow involves five steps:

1. **Build an agent** to evaluate.
2. **Create checks** using `keyword_check()` to define validation criteria.
3. **Instantiate `LocalEvaluator`** with the configured checks.
4. **Call `evaluate_agent`** with your agent, test queries, and the evaluator.
5. **Inspect `EvalResults`** for aggregated statistics and per-item failure reasons.

```python
# https://github.com/microsoft/agent-framework/blob/main/python/samples/02-agents/evaluation/evaluate_agent.py#L19-L34
# Note: `await` must run inside an async function (or a notebook cell).
from agent_framework import Agent, LocalEvaluator, evaluate_agent, keyword_check

# 1️⃣ Build an agent (any subclass of Agent)
my_agent = Agent(...)  # ← configure your agent here

# 2️⃣ Define the checks you care about
kw_check = keyword_check("weather", "temperature")  # must mention both words

# 3️⃣ Wrap them in a LocalEvaluator
local = LocalEvaluator(kw_check)

# 4️⃣ Run the evaluation pipeline
results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather like today?"],
    evaluators=local,
)

# 5️⃣ Inspect the outcome
print("Passed?", results[0].all_passed)        # True / False
print("Details:", results[0].items[0].scores)  # per-check score
```

## Advanced Pipeline Features

The evaluation engine supports several advanced capabilities for robust validation:

| Feature | Description | Implementation Location |
|---|---|---|
| **Multiple repetitions** | Run each query N times to measure stability; each run creates a distinct `EvalItem`. | `evaluate_agent` (lines 1550-1560) |
| **Custom split strategy** | Override the default `ConversationSplit.LAST_TURN` to evaluate full conversations or specific turns. | `evaluate_agent` (lines 1600-1610) |
| **Mixed evaluators** | Combine `LocalEvaluator` with cloud providers like `FoundryEvals` in a single call. | `evaluate_agent` evaluator loop |
| **Tool-call expectations** | Validate specific tool invocations using `expected_tool_calls` and built-in checks like `tool_calls_present`. | Ground-truth stamping (lines 1640-1660) |

### Mix Local and Cloud Evaluators

You can validate keywords locally while simultaneously running semantic evaluations through cloud providers:

```python
# https://github.com/microsoft/agent-framework/blob/main/python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py#L70-L85
from agent_framework import LocalEvaluator, keyword_check, evaluate_agent
from agent_framework.foundry import FoundryEvals

# Local check: ensure the word "weather" appears
local = LocalEvaluator(keyword_check("weather"))

# Cloud (Foundry) evaluator: requires a Foundry client (omitted for brevity)
foundry = FoundryEvals(project_client=client, model="gpt-4o")

# Run both providers in one call
results = await evaluate_agent(
    agent=my_agent,
    queries=["Tell me the weather in Seattle."],
    evaluators=[local, foundry],
)

# `results` is a list: first entry = Local, second = Foundry
for r in results:
    print(r.provider, "passed:", r.all_passed)
```

### Test Consistency with Repetitions

Run queries multiple times to detect flaky behavior:

```python
# Reuse any agent; we repeat each query 3 times
results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the capital of France?"],
    evaluators=LocalEvaluator(keyword_check("Paris")),
    num_repetitions=3,  # three independent runs
)

print("Total items evaluated:", len(results[0].items))   # → 3
```
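Once you have one pass/fail outcome per repetition, summarizing stability is simple arithmetic. The sketch below assumes those outcomes have already been collected into a list of booleans; how to extract them from `EvalResults` is not shown here, and `pass_rate` and `is_flaky` are hypothetical helpers rather than framework functions.

```python
from __future__ import annotations


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of repeated runs that passed."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(outcomes) / len(outcomes)


def is_flaky(outcomes: list[bool]) -> bool:
    """True when the runs disagree: some passed and some failed."""
    return 0 < sum(outcomes) < len(outcomes)


# Hypothetical outcomes of three repetitions of the same query.
runs = [True, False, True]
print(f"pass rate: {pass_rate(runs):.2f}, flaky: {is_flaky(runs)}")
```

A query that always fails is broken rather than flaky, which is why `is_flaky` requires at least one pass and at least one failure.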

## Summary

- **`LocalEvaluator`** provides fast, API-free validation by executing `EvalCheck` objects directly against `EvalItem` instances without external service calls.
- **`keyword_check`** creates reusable validation rules that verify keyword presence in agent responses, supporting case-insensitive matching by default.
- **`evaluate_agent`** orchestrates the entire pipeline, normalizing inputs, running agents, and dispatching to multiple evaluators including local and cloud providers.
- The architecture supports **mixed evaluation strategies**, allowing you to combine lightweight keyword checks with semantic LLM-as-judge evaluators in a single execution.
- All core components reside in [`python/packages/core/agent_framework/_evaluation.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py), with reference implementations available in the `python/samples/02-agents/evaluation/` directory.

## Frequently Asked Questions

### How does LocalEvaluator differ from cloud-based evaluators like FoundryEvals?

`LocalEvaluator` executes validation logic entirely within the Python process, checking `EvalItem` objects against provided `EvalCheck` functions without network requests. Cloud-based evaluators like `FoundryEvals` send data to external APIs for semantic judging. According to the source code in [`_evaluation.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py) (lines 1340-1410), `LocalEvaluator` iterates through checks locally, making it suitable for CI pipelines and unit tests where speed and offline operation are critical.

### Can I use keyword_check for case-sensitive matching?

Yes. The `keyword_check` function accepts an optional `case_sensitive` parameter (default `False`). When set to `True`, the check preserves the original casing of both the agent response and the target keywords during comparison. The implementation in [`_evaluation.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/agent_framework/_evaluation.py) (lines 886-905) handles this by applying `.lower()` to both the response and the keywords only when `case_sensitive` is `False`.

### How do I evaluate multiple queries with different expected keywords?

Create separate `LocalEvaluator` instances for each unique validation requirement, or compose multiple `keyword_check` calls into a single evaluator. The `evaluate_agent` function accepts a list of evaluators, allowing you to segment validation logic. Each `EvalResults` entry in the returned list corresponds to one evaluator and contains per-item pass/fail details for all queries processed by that specific check set.
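One way to organize this, sketched here in plain Python, is to group queries by the keywords they share so that each group can be evaluated with its own check set. The `expectations` table and the `group_by_keywords` helper are hypothetical and independent of the framework.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical test table: each query mapped to the keywords its answer must contain.
expectations = {
    "What's the capital of France?": ("Paris",),
    "Which city hosts the Louvre?": ("Paris",),
    "What's the weather like today?": ("weather", "temperature"),
}


def group_by_keywords(table: dict[str, tuple[str, ...]]) -> dict[tuple[str, ...], list[str]]:
    """Group queries that share the same expected keywords into one batch."""
    batches: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for query, keywords in table.items():
        batches[tuple(keywords)].append(query)
    return dict(batches)


batches = group_by_keywords(expectations)
for keywords, queries in batches.items():
    print(keywords, "->", queries)
```

Each batch could then be passed to its own `evaluate_agent` call, with `queries` set to the batch and `evaluators` set to `LocalEvaluator(keyword_check(*keywords))`.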

### What file contains the unit tests for LocalEvaluator and keyword_check?

The unit tests for `LocalEvaluator`, `keyword_check`, and automatic function wrapping reside in [`python/packages/core/tests/core/test_local_eval.py`](https://github.com/microsoft/agent-framework/blob/main/python/packages/core/tests/core/test_local_eval.py) in the repository. These tests demonstrate usage patterns and validate the local evaluation loop's handling of both synchronous and asynchronous check functions.