What Is the Dexter Evals System? Automated Benchmarking for Financial AI Agents
The Dexter Evals System is an automated, end-to-end framework that measures how accurately the Dexter financial-research agent answers real-world finance questions using dataset-driven testing, LLM-as-judge evaluation, and LangSmith traceability.
The Dexter Evals System serves as the quality assurance backbone for the virattt/dexter repository, providing a repeatable benchmark that validates agent performance against a curated dataset of over 300 finance Q&A pairs. By combining automated agent execution with structured correctness scoring, the system enables developers to quantify improvements and detect regressions across model iterations.
Core Components of the Dexter Evals System
Dataset-Driven Testing with CSV Parsing
The evaluation pipeline begins with a structured dataset located at src/evals/dataset/finance_agent.csv, which contains more than 300 finance questions and their corresponding reference answers. The system uses a robust CSV parser implemented in src/evals/run.ts that correctly handles quoted, multi-line fields through the parseCSV function, ensuring complex financial queries are parsed without data corruption.
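The quoted, multi-line handling described above can be illustrated with a small state-machine parser. This is a minimal sketch of the general technique, not the repository's parseCSV; the name parseCsvSketch and the sample data are hypothetical.

```typescript
// Hypothetical sketch: illustrates quote-aware CSV parsing in general,
// not the actual parseCSV implementation in src/evals/run.ts.
function parseCsvSketch(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') {
          field += '"'; // a doubled quote inside a quoted field is a literal quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch; // commas and line breaks are literal while inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one line break
      row.push(field);
      field = "";
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
  }

  // Flush the last record when the input does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Made-up sample: the question field contains both a comma and a line break.
const sample = 'question,answer\n"What was revenue,\nin 2023?","$10B"\n';
const parsed = parseCsvSketch(sample);
console.log(parsed.length); // header row plus one data row
```

The key design point is the inQuotes flag: delimiters only split fields and records while it is false, which is what keeps a multi-line question in a single cell.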
Agent Execution and Target Functions
Each evaluation example instantiates a fresh Dexter Agent configured with gpt-5.2 and 10 maximum iterations. The target function in src/evals/run.ts invokes the agent with the question from the dataset and captures the final generated answer, creating an isolated test environment for each query to prevent state contamination between evaluations.
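The isolation idea above can be sketched as a target function that builds a new agent on every call. The AgentLike interface, AgentConfig type, makeTarget helper, and StubAgent class are all hypothetical stand-ins for illustration; the real Dexter Agent API may look different.

```typescript
// Hypothetical minimal agent shape; the real Dexter Agent API may differ.
interface AgentLike {
  run(question: string): Promise<string>;
}

interface AgentConfig {
  model: string;
  maxIterations: number;
}

// Builds a target function that creates a new agent for every example,
// so no state carries over from one question to the next.
function makeTarget(createAgent: (config: AgentConfig) => AgentLike) {
  const config: AgentConfig = { model: "gpt-5.2", maxIterations: 10 };
  return async (inputs: { question: string }): Promise<{ answer: string }> => {
    const agent = createAgent(config);
    const answer = await agent.run(inputs.question);
    return { answer };
  };
}

// Stub agent used only to demonstrate isolation: it remembers what it has seen.
class StubAgent implements AgentLike {
  private seen: string[] = [];

  async run(question: string): Promise<string> {
    this.seen.push(question);
    return `answered ${this.seen.length} question(s)`;
  }
}

let agentsCreated = 0;
const target = makeTarget(() => {
  agentsCreated++;
  return new StubAgent();
});
```

Because each call constructs its own agent, the stub always reports a history of one question, which is the property that prevents contamination between examples.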
LLM-as-Judge Correctness Evaluation
The Dexter Evals System uses an LLM as an automated judge of answer correctness. The correctnessEvaluator function in src/evals/run.ts uses a structured output schema, EvaluatorOutputSchema, which constrains the judge to return a numeric score (0 or 1) and a free-form comment explaining its rationale. This approach removes the need for manual grading while keeping the reasoning behind each score visible.
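To make the score-and-comment contract concrete, here is a minimal sketch that validates a judge verdict and aggregates verdicts into an accuracy figure. It uses plain TypeScript rather than the repository's schema library so that it stays self-contained; JudgeVerdict, parseVerdict, and accuracy are hypothetical names, and the actual EvaluatorOutputSchema is not reproduced here.

```typescript
// Hypothetical shape mirroring the described output: a binary score plus a comment.
interface JudgeVerdict {
  score: 0 | 1;
  comment: string;
}

// Validates an untrusted judge response and rejects anything off-contract.
function parseVerdict(raw: unknown): JudgeVerdict {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("verdict must be an object");
  }
  const { score, comment } = raw as Record<string, unknown>;
  if (score !== 0 && score !== 1) {
    throw new Error("score must be 0 or 1");
  }
  if (typeof comment !== "string") {
    throw new Error("comment must be a string");
  }
  return { score: score === 1 ? 1 : 0, comment };
}

// Fraction of answers judged correct; returns 0 for an empty list.
function accuracy(verdicts: JudgeVerdict[]): number {
  if (verdicts.length === 0) return 0;
  let correct = 0;
  for (const v of verdicts) correct += v.score;
  return correct / verdicts.length;
}
```

With binary scores, overall accuracy is simply the mean of the scores, which keeps results easy to compare across runs.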
LangSmith Integration for Traceability
Every evaluation run is instrumented with LangSmith to provide persistent, searchable traces. The runner creates or reuses a LangSmith dataset via the client configuration in src/evals/run.ts, uploads evaluation examples, and emits detailed runs containing inputs, outputs, timing metrics, and evaluator scores. This integration enables versioned comparison across model iterations and facilitates audit trails for compliance-sensitive financial applications.
Real-Time Terminal UI with React Ink
The evaluation process streams progress events to an interactive terminal interface built with React Ink. The EvalApp component in src/evals/components/EvalApp.tsx consumes event types including init, question_start, question_end, and complete to render a progress bar, current question status, live statistics, and a final summary. This real-time feedback loop allows developers to monitor long-running evaluation batches without losing visibility into individual question performance.
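The event-driven flow above can be sketched as a discriminated union plus a reducer that folds events into display state. The four event type names come from the description above; every payload field other than question, score, and comment, along with the EvalStats shape and reduceEvent name, is an assumption made for illustration and may not match EvalApp.tsx.

```typescript
// Event names follow the article; the payload fields are partly assumed.
type EvalEvent =
  | { type: "init"; total: number }
  | { type: "question_start"; question: string }
  | { type: "question_end"; question: string; score: number; comment: string }
  | { type: "complete" };

// Hypothetical state a progress UI could render from.
interface EvalStats {
  total: number;
  completed: number;
  correct: number;
  current: string | null;
  done: boolean;
}

const initialStats: EvalStats = {
  total: 0,
  completed: 0,
  correct: 0,
  current: null,
  done: false,
};

// Pure reducer: each event produces the next state without mutating the old one.
function reduceEvent(state: EvalStats, ev: EvalEvent): EvalStats {
  switch (ev.type) {
    case "init":
      return { ...initialStats, total: ev.total };
    case "question_start":
      return { ...state, current: ev.question };
    case "question_end":
      return {
        ...state,
        completed: state.completed + 1,
        correct: state.correct + ev.score,
        current: null,
      };
    case "complete":
      return { ...state, done: true };
  }
}

// Made-up event stream for a two-question run.
const events: EvalEvent[] = [
  { type: "init", total: 2 },
  { type: "question_start", question: "A" },
  { type: "question_end", question: "A", score: 1, comment: "correct" },
  { type: "question_start", question: "B" },
  { type: "question_end", question: "B", score: 0, comment: "wrong figure" },
  { type: "complete" },
];
const finalStats = events.reduce(reduceEvent, initialStats);
```

A progress bar then only needs completed / total, and the live statistics follow from correct / completed.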
Running the Dexter Evals System
Command-Line Execution
Execute the full evaluation suite against the complete dataset or a random sample using the following commands:
# Execute all questions in the dataset
bun run src/evals/run.ts
# Run a random sample of 10 questions for quick validation
bun run src/evals/run.ts --sample 10
The CLI automatically renders the Ink-based UI defined in src/evals/components/EvalApp.tsx, providing real-time progress visualization.
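How the runner reads the --sample flag and picks questions is not shown in this article. As a rough illustration only, the sketch below parses the flag and draws a sample without replacement using a partial Fisher-Yates shuffle; parseSampleSize and sampleWithoutReplacement are hypothetical helpers, not functions from the repository.

```typescript
// Hypothetical helper: returns the number after --sample, or undefined if absent or invalid.
function parseSampleSize(argv: string[]): number | undefined {
  const idx = argv.indexOf("--sample");
  if (idx === -1) return undefined;
  const n = Number.parseInt(argv[idx + 1] ?? "", 10);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Partial Fisher-Yates shuffle: picks n distinct items without mutating the input.
// The rng parameter exists so the behaviour can be made deterministic in tests.
function sampleWithoutReplacement<T>(
  items: readonly T[],
  n: number,
  rng: () => number = Math.random,
): T[] {
  const pool = [...items];
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    // Choose a random index from the not-yet-selected tail and swap it into place.
    const j = i + Math.floor(rng() * (pool.length - i));
    const tmp = pool[i] as T;
    pool[i] = pool[j] as T;
    pool[j] = tmp;
  }
  return pool.slice(0, count);
}

const questions = ["q1", "q2", "q3", "q4", "q5"];
const size = parseSampleSize(["--sample", "3"]) ?? questions.length;
const picked = sampleWithoutReplacement(questions, size);
```

Sampling without replacement guarantees that a quick validation run never scores the same question twice.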
Programmatic Integration
For custom testing pipelines or CI/CD integration, import the core runner directly:
import { createEvaluationRunner } from './src/evals/run.ts';
// Run a 5-question sample and collect results programmatically
(async () => {
  const evalIter = createEvaluationRunner(5);
  for await (const ev of evalIter()) {
    if (ev.type === 'question_end') {
      console.log(`Q: ${ev.question}`);
      console.log(`Score: ${ev.score} - ${ev.comment}`);
    }
  }
})();
This pattern enables automated regression testing by asserting on ev.score values within your test framework.
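One way to turn that idea into a regression gate is to fold the event stream into an accuracy figure and fail when it drops below a threshold. The sketch below assumes the question_end event shape used in the snippet above; stubRunner is a hypothetical stand-in for the real runner so the example can execute without network access, and the 0.7 threshold is arbitrary.

```typescript
// Event shape assumed from the snippet above; other event types are simplified.
type RunnerEvent =
  | { type: "init" }
  | { type: "question_end"; question: string; score: number; comment: string }
  | { type: "complete" };

// Consumes any async event stream and returns the fraction of correct answers.
async function accuracyFrom(events: AsyncIterable<RunnerEvent>): Promise<number> {
  let total = 0;
  let correct = 0;
  for await (const ev of events) {
    if (ev.type === "question_end") {
      total++;
      correct += ev.score;
    }
  }
  return total === 0 ? 0 : correct / total;
}

// Throws when accuracy falls below the given threshold, which fails a CI job.
async function assertAccuracyAtLeast(
  events: AsyncIterable<RunnerEvent>,
  threshold: number,
): Promise<number> {
  const acc = await accuracyFrom(events);
  if (acc < threshold) {
    throw new Error(`accuracy ${acc} is below threshold ${threshold}`);
  }
  return acc;
}

// Hypothetical stand-in for the real runner: yields four canned results.
async function* stubRunner(): AsyncGenerator<RunnerEvent> {
  yield { type: "init" };
  for (const score of [1, 1, 0, 1]) {
    yield { type: "question_end", question: "stub question", score, comment: "stub" };
  }
  yield { type: "complete" };
}

// Example gate: passes because 3 of the 4 stubbed answers are correct.
const gate = assertAccuracyAtLeast(stubRunner(), 0.7);
```

In a real pipeline the stub would be replaced by the iterator returned from createEvaluationRunner, and the threshold would be chosen from a known-good baseline run.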
Extending the Evaluator
Customize the correctness evaluation by modifying the structured output schema in src/evals/run.ts. For example, to capture additional metrics such as precision and recall:
import { z } from 'zod';

// Extends the score/comment output with optional extra metrics.
const NewSchema = z.object({
  score: z.number().min(0).max(1),
  comment: z.string(),
  precision: z.number().optional(),
  recall: z.number().optional(),
});
Instantiate the structured LLM with NewSchema and update the correctnessEvaluator return logic to propagate these additional fields into the LangSmith traces.
Summary
- The Dexter Evals System provides automated benchmarking for the Dexter financial-research agent using a dataset of 300+ finance Q&A pairs stored in src/evals/dataset/finance_agent.csv.
- The system executes each question through a fresh Agent instance with gpt-5.2 and evaluates responses using an LLM-as-judge pattern that returns structured scores (0/1) and comments via correctnessEvaluator in src/evals/run.ts.
- LangSmith integration ensures all runs are traceable, versioned, and searchable for regression analysis.
- A React Ink terminal UI (src/evals/components/EvalApp.tsx) provides real-time progress monitoring with event-driven updates.
Frequently Asked Questions
How does the Dexter Evals System handle complex CSV parsing?
The system uses a custom parseCSV function implemented in src/evals/run.ts that correctly handles quoted fields and multi-line content within the src/evals/dataset/finance_agent.csv file. This ensures that complex financial questions containing commas or line breaks are parsed without data corruption before being fed to the evaluation pipeline.
What model configuration does the Dexter Evals System use for evaluation?
According to the source code in src/evals/run.ts, the system instantiates the Dexter Agent with gpt-5.2 and a maximum of 10 iterations for each evaluation example. The LLM-as-judge evaluator uses the same gpt-5.2 model to assess answer correctness, so the agent under test and the judge share one model configuration.
Can I run the Dexter Evals System on a subset of questions?
Yes, the CLI supports sampling via the --sample flag. Running bun run src/evals/run.ts --sample 10 executes a random sample of 10 questions from the full dataset instead of all 300+ entries. For programmatic use, the createEvaluationRunner function accepts a numeric parameter to limit the evaluation batch size.
How are evaluation results stored and tracked?
The Dexter Evals System integrates with LangSmith to persist evaluation runs. As implemented in src/evals/run.ts, the runner creates or reuses a LangSmith dataset, uploads examples, and emits detailed runs containing inputs, outputs, timing metrics, and the evaluator's structured scores and comments. This enables versioned comparison across model iterations and provides searchable audit trails for compliance-sensitive financial applications.