# What Is the Dexter Evals System? Automated Benchmarking for Financial AI Agents

> Discover the Dexter Evals System, an automated framework for benchmarking financial AI agents. Learn how it accurately measures agent performance using dataset-driven tests and LLM-as-judge evaluation.

- Repository: [Virat Singh/dexter](https://github.com/virattt/dexter)
- Tags: internals
- Published: 2026-02-16

---

**The Dexter Evals System is an automated, end-to-end framework that measures how accurately the Dexter financial-research agent answers real-world finance questions using dataset-driven testing, LLM-as-judge evaluation, and LangSmith traceability.**

The Dexter Evals System serves as the quality assurance backbone for the [virattt/dexter](https://github.com/virattt/dexter) repository, providing a repeatable benchmark that validates agent performance against a curated dataset of over 300 finance Q&A pairs. By combining automated agent execution with structured correctness scoring, the system enables developers to quantify improvements and detect regressions across model iterations.

## Core Components of the Dexter Evals System

### Dataset-Driven Testing with CSV Parsing

The evaluation pipeline begins with a structured dataset located at `src/evals/dataset/finance_agent.csv`, which contains more than 300 finance questions and their corresponding reference answers. The system uses a robust CSV parser implemented in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts) that correctly handles quoted, multi-line fields through the `parseCSV` function, ensuring complex financial queries are parsed without data corruption.
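To make the quoted-field handling concrete, here is a minimal sketch of a character-by-character CSV parser that keeps commas and line breaks inside quoted fields. It is not the repository's `parseCSV`. The name `parseCsvSketch` and the sample row are hypothetical and only illustrate the general technique.

```typescript
// Hypothetical sketch: illustrates the technique, not the code in src/evals/run.ts.
function parseCsvSketch(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') {
          field += '"'; // a doubled quote is an escaped literal quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch; // commas and newlines are literal inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text.charAt(i + 1) === '\n') i++; // treat CRLF as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else {
      field += ch;
    }
  }

  // Flush the last record when the input does not end with a newline.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Made-up sample: the question field spans two lines and contains a comma.
const sampleCsv = 'Question,Answer\n"What was revenue,\nnet of returns?","$1.2B"\n';
const parsedRows = parseCsvSketch(sampleCsv);
console.log(parsedRows.length); // 2 (header row plus one record)
```

The single `inQuotes` flag is what prevents a naive `split('\n')` or `split(',')` from cutting a question in half.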

### Agent Execution and Target Functions

Each evaluation example instantiates a fresh Dexter `Agent` configured with `gpt-5.2` and 10 maximum iterations. The `target` function in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts) invokes the agent with the question from the dataset and captures the final generated answer, creating an isolated test environment for each query to prevent state contamination between evaluations.

### LLM-as-Judge Correctness Evaluation

The Dexter Evals System uses a second LLM call as an automated judge of answer correctness. The `correctnessEvaluator` function in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts) relies on a structured output schema, `EvaluatorOutputSchema`, which forces the judge to return a numeric `score` (0 or 1) and a free-form `comment` explaining its rationale. This removes the need for manual grading while keeping a written justification for each score.
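The contract described above (a binary `score` plus a `comment`) can be illustrated with a plain TypeScript type and a runtime check. This sketch does not reproduce `EvaluatorOutputSchema`. The names `JudgeOutput` and `parseJudgeOutput` are hypothetical, and the check is hand-written so the example has no dependencies.

```typescript
// Hypothetical names: JudgeOutput and parseJudgeOutput are illustrative only.
interface JudgeOutput {
  score: 0 | 1;
  comment: string;
}

// Reject anything that is not exactly { score: 0 | 1, comment: string }.
function parseJudgeOutput(raw: unknown): JudgeOutput {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('Judge output must be an object');
  }
  const { score, comment } = raw as Record<string, unknown>;
  if (score !== 0 && score !== 1) {
    throw new Error(`Judge score must be 0 or 1, got ${String(score)}`);
  }
  if (typeof comment !== 'string') {
    throw new Error('Judge comment must be a string');
  }
  return { score: score === 1 ? 1 : 0, comment };
}
```

Validating the judge's reply before recording it means a malformed response fails loudly instead of silently counting as a wrong answer.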

### LangSmith Integration for Traceability

Every evaluation run is instrumented with LangSmith to provide persistent, searchable traces. The runner creates or reuses a LangSmith dataset via the client configuration in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts), uploads evaluation examples, and emits detailed runs containing inputs, outputs, timing metrics, and evaluator scores. This integration enables versioned comparison across model iterations and facilitates audit trails for compliance-sensitive financial applications.

### Real-Time Terminal UI with React Ink

The evaluation process streams progress events to an interactive terminal interface built with React Ink. The `EvalApp` component in [`src/evals/components/EvalApp.tsx`](https://github.com/virattt/dexter/blob/main/src/evals/components/EvalApp.tsx) consumes event types including `init`, `question_start`, `question_end`, and `complete` to render a progress bar, current question status, live statistics, and a final summary. This real-time feedback loop allows developers to monitor long-running evaluation batches without losing visibility into individual question performance.
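One way to picture how a UI can turn this event stream into live statistics is a small reducer over a discriminated union. The four event names come from the description above, and `question`, `score`, and `comment` appear in this article's programmatic example. Every other field name (`total`, and the whole `EvalStats` shape) is an assumption made for this sketch, not the actual `EvalApp` state.

```typescript
// Field names other than `type`, `question`, `score`, and `comment` are assumptions.
type EvalEvent =
  | { type: 'init'; total: number }
  | { type: 'question_start'; question: string }
  | { type: 'question_end'; question: string; score: number; comment: string }
  | { type: 'complete' };

// Hypothetical view state that a progress bar and summary could render from.
interface EvalStats {
  total: number;
  completed: number;
  correct: number;
  current: string | null;
  done: boolean;
}

const initialStats: EvalStats = { total: 0, completed: 0, correct: 0, current: null, done: false };

// Pure reducer: given the previous stats and one event, return the next stats.
function reduceEvent(stats: EvalStats, ev: EvalEvent): EvalStats {
  switch (ev.type) {
    case 'init':
      return { ...initialStats, total: ev.total };
    case 'question_start':
      return { ...stats, current: ev.question };
    case 'question_end':
      return {
        ...stats,
        completed: stats.completed + 1,
        correct: stats.correct + (ev.score === 1 ? 1 : 0),
        current: null,
      };
    case 'complete':
      return { ...stats, done: true };
  }
}
```

Keeping the reducer pure makes it straightforward to unit test without rendering anything to the terminal.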

## Running the Dexter Evals System

### Command-Line Execution

Execute the full evaluation suite against the complete dataset or a random sample using the following commands:

```bash
# Execute all questions in the dataset
bun run src/evals/run.ts

# Run a random sample of 10 questions for quick validation
bun run src/evals/run.ts --sample 10
```

The CLI automatically renders the Ink-based UI defined in [`src/evals/components/EvalApp.tsx`](https://github.com/virattt/dexter/blob/main/src/evals/components/EvalApp.tsx), providing real-time progress visualization.
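The source does not show how `--sample` selects its questions. A common way to draw a random sample without repeats is a partial Fisher-Yates shuffle, sketched below. The function name `sampleWithoutReplacement` and its injectable `random` parameter are hypothetical and may differ from what `run.ts` actually does.

```typescript
// Hypothetical helper: one common sampling technique, not necessarily the one in run.ts.
function sampleWithoutReplacement<T>(
  items: readonly T[],
  n: number,
  random: () => number = Math.random,
): T[] {
  const pool = [...items]; // copy so the caller's array is not mutated
  const count = Math.max(0, Math.min(n, pool.length));
  for (let i = 0; i < count; i++) {
    // Pick a random index from the not-yet-chosen tail and swap it into position i.
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}
```

Passing a seeded generator as `random` would make a sampled run reproducible, which helps when comparing two model versions on the same subset.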

### Programmatic Integration

For custom testing pipelines or CI/CD integration, import the core runner directly:

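```typescript
// Path is relative to the repository root; adjust it to where your script lives.
import { createEvaluationRunner } from './src/evals/run.ts';

// Run a 5-question sample and collect results programmatically
(async () => {
  // The runner returns a function that yields progress events as an async iterator.
  const evalIter = createEvaluationRunner(5);
  for await (const ev of evalIter()) {
    // Only `question_end` events carry the judge's score and comment.
    if (ev.type === 'question_end') {
      console.log(`Q: ${ev.question}`);
      console.log(`Score: ${ev.score} - ${ev.comment}`);
    }
  }
})();
```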

This pattern enables automated regression testing by asserting on `ev.score` values within your test framework.
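As a sketch of what such an assertion could look like, the helpers below collect `question_end` scores from an event stream like the one in the loop above and compare overall accuracy to a threshold. The names `collectResults`, `checkAccuracy`, and `AccuracyReport`, and the loose event type, are hypothetical. Only the `type`, `question`, and `score` fields are taken from the example.

```typescript
// Hypothetical helpers for a CI gate; only `type`, `question`, and `score` come from the article.
interface ScoredResult {
  question: string;
  score: number;
}

interface AccuracyReport {
  accuracy: number;
  passed: boolean;
  failures: string[];
}

// Drain an event stream and keep only the scored `question_end` events.
async function collectResults(
  events: AsyncIterable<{ type: string; question?: string; score?: number }>,
): Promise<ScoredResult[]> {
  const results: ScoredResult[] = [];
  for await (const ev of events) {
    if (ev.type === 'question_end' && typeof ev.score === 'number') {
      results.push({ question: ev.question ?? '', score: ev.score });
    }
  }
  return results;
}

// Compare the fraction of fully correct answers against a minimum threshold.
function checkAccuracy(results: ScoredResult[], minAccuracy: number): AccuracyReport {
  if (results.length === 0) {
    return { accuracy: 0, passed: false, failures: [] }; // an empty run never passes
  }
  const failures = results.filter((r) => r.score < 1).map((r) => r.question);
  const accuracy = (results.length - failures.length) / results.length;
  return { accuracy, passed: accuracy >= minAccuracy, failures };
}
```

In a test, you might call `checkAccuracy(await collectResults(evalIter()), 0.8)` and fail the build when `passed` is false. The 0.8 threshold is only an example value. Because the agent and the judge are both LLMs, scores can vary between runs, so a threshold on aggregate accuracy is usually more stable than asserting on individual questions.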

## Extending the Evaluator

Customize the correctness evaluation by modifying the structured output schema in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts). For example, to capture additional metrics such as precision and recall:

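```typescript
import { z } from 'zod';

// Extended judge output: the original score and comment plus two optional metrics.
const NewSchema = z.object({
  score: z.number().min(0).max(1),
  comment: z.string(),
  precision: z.number().optional(),
  recall: z.number().optional(),
});
```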

Instantiate the structured LLM with `NewSchema` and update the `correctnessEvaluator` return logic to propagate these additional fields into the LangSmith traces.
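The propagation step might look like the sketch below, which maps one extended judge output to a list of named metric results. The local `MetricResult` type is loosely modeled on a key/score/comment result shape and is an assumption. The `correctness` key and the function name `toMetricResults` are hypothetical, so check the evaluator return type of the LangSmith SDK version you use before adopting it.

```typescript
// Assumed shapes for illustration; verify against the LangSmith SDK you have installed.
interface ExtendedJudgeOutput {
  score: number;
  comment: string;
  precision?: number;
  recall?: number;
}

interface MetricResult {
  key: string;
  score: number;
  comment?: string;
}

// Emit one result per metric so each appears as its own column in a results view.
function toMetricResults(out: ExtendedJudgeOutput): MetricResult[] {
  const results: MetricResult[] = [
    { key: 'correctness', score: out.score, comment: out.comment },
  ];
  // Optional metrics are included only when the judge actually returned them.
  if (out.precision !== undefined) results.push({ key: 'precision', score: out.precision });
  if (out.recall !== undefined) results.push({ key: 'recall', score: out.recall });
  return results;
}
```

Skipping absent metrics, rather than recording them as zero, keeps averages from being dragged down by questions where the judge did not report precision or recall.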

## Summary

- The **Dexter Evals System** provides automated benchmarking for the Dexter financial-research agent using a dataset of 300+ finance Q&A pairs stored in `src/evals/dataset/finance_agent.csv`.
- The system executes each question through a fresh `Agent` instance with `gpt-5.2` and evaluates responses using an **LLM-as-judge** pattern that returns structured scores (0/1) and comments via `correctnessEvaluator` in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts).
- **LangSmith integration** ensures all runs are traceable, versioned, and searchable for regression analysis.
- A **React Ink terminal UI** ([`src/evals/components/EvalApp.tsx`](https://github.com/virattt/dexter/blob/main/src/evals/components/EvalApp.tsx)) provides real-time progress monitoring with event-driven updates.

## Frequently Asked Questions

### How does the Dexter Evals System handle complex CSV parsing?

The system uses a custom `parseCSV` function implemented in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts) that correctly handles quoted fields and multi-line content within the `src/evals/dataset/finance_agent.csv` file. This ensures that complex financial questions containing commas or line breaks are parsed without data corruption before being fed to the evaluation pipeline.

### What model configuration does the Dexter Evals System use for evaluation?

According to the source code in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts), the system instantiates the Dexter `Agent` with `gpt-5.2` and a maximum of 10 iterations for each evaluation example. The LLM-as-judge evaluator uses the same `gpt-5.2` model to assess answer correctness, so the agent under test and the judge share one underlying model.

### Can I run the Dexter Evals System on a subset of questions?

Yes, the CLI supports sampling via the `--sample` flag. Running `bun run src/evals/run.ts --sample 10` executes a random sample of 10 questions from the full dataset instead of all 300+ entries. For programmatic use, the `createEvaluationRunner` function accepts a numeric parameter to limit the evaluation batch size.

### How are evaluation results stored and tracked?

The Dexter Evals System integrates with LangSmith to persist evaluation runs. As implemented in [`src/evals/run.ts`](https://github.com/virattt/dexter/blob/main/src/evals/run.ts), the runner creates or reuses a LangSmith dataset, uploads examples, and emits detailed runs containing inputs, outputs, timing metrics, and the evaluator's structured scores and comments. This enables versioned comparison across model iterations and provides searchable audit trails for compliance-sensitive financial applications.