# Setting Up AutoGen Bench for Benchmarking Agent Performance: A Complete Guide

> Benchmark agent performance with AutoGen Bench. This guide shows how to set up and run the official Microsoft AutoGen tool for evaluating agent capabilities.

- Repository: [Microsoft/autogen](https://github.com/microsoft/autogen)
- Tags: how-to-guide
- Published: 2026-03-07

---

**AutoGen Bench (agbench) is the official benchmarking suite in the Microsoft AutoGen repository that runs predefined scenarios under controlled conditions to evaluate agent performance across multiple runs and configurations.**

AutoGen Bench provides a reproducible framework for measuring how well AutoGen agents perform on standardized tasks. This guide walks through the complete setup process using the actual implementation from the `microsoft/autogen` repository, from installation to result analysis.

## What Is AutoGen Bench?

AutoGen Bench is a command-line tool that repeatedly executes collections of AutoGen scenarios under tightly controlled conditions. It expands scenario templates, orchestrates Docker containers (or native execution), captures execution logs, and aggregates performance metrics. The tool is implemented in `python/packages/agbench/` within the main AutoGen repository and enables researchers to compare agent configurations across languages, hardware, and repetition counts.

## Architecture and Key Components

Understanding the internal structure helps troubleshoot setup issues and customize workflows.

### CLI Entry Point

The `agbench` command parses user inputs in [`agbench/cli.py`](https://github.com/microsoft/autogen/blob/main/agbench/cli.py), dispatching to sub-commands like `run`, `tabulate`, and `remove_missing`. This file serves as the primary interface between you and the benchmarking engine.

### Scenario Runner

The core execution logic lives in [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py). The `run_scenarios` function (approximately lines 59-84) discovers JSONL scenario definitions, handles subsampling, and manages repetition loops. For each scenario instance, it calls `expand_scenario` to perform copy-and-replace operations on template files using substitution dictionaries from the JSONL definition.
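The copy-and-replace step can be pictured with a short sketch. This is a simplified illustration, not the repository's `expand_scenario`: the function name `expand_template`, the `__PROMPT__` placeholder, and the layout of the `substitutions` dictionary (file name mapped to placeholder/replacement pairs) are assumptions made for the example.

```python
import shutil
from pathlib import Path


def expand_template(
    template_dir: Path,
    output_dir: Path,
    substitutions: dict[str, dict[str, str]],
) -> None:
    """Copy a template folder, then apply per-file find-and-replace pairs.

    `substitutions` maps a file name (relative to the template folder) to a
    {placeholder: replacement} dictionary. This layout is hypothetical.
    """
    # copytree creates output_dir and fails if it already exists,
    # which protects earlier results from being overwritten.
    shutil.copytree(template_dir, output_dir)

    for rel_name, replacements in substitutions.items():
        target = output_dir / rel_name
        text = target.read_text(encoding="utf-8")
        for placeholder, value in replacements.items():
            text = text.replace(placeholder, value)
        target.write_text(text, encoding="utf-8")
```

Calling `expand_template(Path("Template"), Path("expanded"), {"prompt.txt": {"__PROMPT__": "add two numbers"}})` would leave the template untouched and write a concrete copy with the placeholder filled in.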

### Docker Orchestration

By default, AutoGen Bench uses the Python Docker SDK to create containers via `run_scenario_in_docker` in [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py). It mounts the expanded scenario folder, injects environment variables, enforces timeouts, and cleans up after execution. Alternatively, `run_scenario_natively` supports host execution, though this is discouraged because it weakens reproducibility.
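To make the mounting and environment injection more concrete, here is a minimal sketch of how those settings map onto keyword arguments accepted by the Docker SDK's `containers.run`. The helper name `build_container_kwargs` and the `/workspace` mount point are hypothetical, and the actual `run_scenario_in_docker` may be organized quite differently.

```python
from pathlib import Path


def build_container_kwargs(scenario_dir: Path, env: dict[str, str]) -> dict:
    """Build keyword arguments for docker-py's `containers.run`.

    The `/workspace` mount point is an assumption made for this sketch.
    """
    return {
        "detach": True,                 # return a Container object immediately
        "working_dir": "/workspace",    # start inside the mounted scenario
        "environment": dict(env),       # inject credentials as env variables
        "volumes": {
            # host path -> bind location and access mode inside the container
            str(scenario_dir.resolve()): {"bind": "/workspace", "mode": "rw"},
        },
    }


# With the `docker` package installed, the result could be used roughly like:
#   client = docker.from_env()
#   container = client.containers.run(image, command, **kwargs)
#   container.wait(timeout=seconds)   # enforce a time limit
#   container.remove()                # clean up afterwards
```

Keeping the argument construction in a pure function makes it easy to inspect what a container would receive before anything is started.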

### Result Storage and Tabulation

Results are stored in a deterministic hierarchy: `Results/<scenario>/<task_id>/<instance_id>/<repeat>`. Files written include stdout, stderr, and AutoGen trace files (lines 35-55 in [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py)). The [`tabulate_cmd.py`](https://github.com/microsoft/autogen/blob/main/tabulate_cmd.py) module reads these JSON logs to aggregate success rates and print markdown-compatible tables.

## Prerequisites and Installation

Before running benchmarks, ensure you have Docker installed and API credentials ready.

Install the package from a local clone of the repository. The `-e` flag installs in editable mode, which suits development; omit it for a standard install:

```bash
pip install -e autogen/python/packages/agbench
```

Verify installation by checking the CLI:

```bash
agbench --help
```

## Step-by-Step Setup Guide

### Configure API Credentials

AutoGen Bench requires OpenAI or Azure credentials. Create an `OAI_CONFIG_LIST` file in your working directory or export it as an environment variable:

```bash
# Quote the substitution so the file's whitespace and newlines are preserved
export OAI_CONFIG_LIST="$(cat ./OAI_CONFIG_LIST)"
```

Alternatively, set a single key directly:

```bash
export OPENAI_API_KEY=sk-...
```

For additional APIs like Bing, create an [`ENV.json`](https://github.com/microsoft/autogen/blob/main/ENV.json) file alongside your scenario folder:

```json
{
    "BING_API_KEY": "your-bing-key"
}
```

The `get_scenario_env` helper in [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py) loads these configurations at runtime.
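As a rough picture of what such a helper might do, the sketch below merges selected host environment variables with the contents of `ENV.json`. The function name `load_scenario_env`, the list of passed-through variables, and the rule that `ENV.json` values take precedence are all assumptions for illustration, not the behavior of `get_scenario_env` itself.

```python
import json
import os
from pathlib import Path


def load_scenario_env(
    env_file: Path,
    passthrough: tuple[str, ...] = ("OAI_CONFIG_LIST", "OPENAI_API_KEY"),
) -> dict[str, str]:
    """Collect the environment to hand to a scenario run.

    Host variables named in `passthrough` are copied first; values from the
    JSON file are applied afterwards, so they win on conflict (an assumption).
    """
    env: dict[str, str] = {}

    for name in passthrough:
        if name in os.environ:
            env[name] = os.environ[name]

    if env_file.exists():
        with env_file.open(encoding="utf-8") as fh:
            # Environment variables must be strings, so coerce JSON values.
            env.update({key: str(value) for key, value in json.load(fh).items()})

    return env
```

Printing `sorted(load_scenario_env(Path("ENV.json")))` lists the variable names without exposing their values, which is a convenient check before starting a long run.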

### Initialize Benchmark Tasks

Most benchmarks require task initialization to download datasets and create scenario files. For example, to set up HumanEval:

```bash
cd autogen/python/packages/agbench/benchmarks/HumanEval
python Scripts/init_tasks.py
```

This populates the `Tasks/` folder with JSONL files defining individual problem instances.
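After initialization, it can be useful to confirm that the generated files parse cleanly. The following sketch reads a JSONL file into a list of dictionaries; the helper name `load_scenarios` is hypothetical, and the field names inside each record depend on the benchmark, so none are assumed here.

```python
import json
from pathlib import Path


def load_scenarios(jsonl_path: Path) -> list[dict]:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    scenarios = []
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                scenarios.append(json.loads(line))
    return scenarios
```

For instance, `len(load_scenarios(Path("Tasks/human_eval_MagenticOne.jsonl")))` reports how many scenario instances a run would cover, and a `json.JSONDecodeError` points to a malformed line.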

## Running Your First Benchmark

Execute a benchmark using the JSONL scenario definition. The following command runs HumanEval tasks once in Docker:

```bash
agbench run Tasks/human_eval_MagenticOne.jsonl
```

A single run says little about a non-deterministic agent. To reduce the effect of run-to-run variance, use the `--repeat` flag to run each scenario multiple times:

```bash
agbench run --repeat 10 Tasks/human_eval_MagenticOne.jsonl
```
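To see why the repetition count matters, consider a rough estimate of the uncertainty in a pass rate. The sketch below treats each repeat as an independent pass/fail trial (a simplifying assumption) and uses the normal-approximation standard error of a proportion; the function name is hypothetical and this calculation is not part of `agbench`.

```python
import math


def pass_rate_with_error(successes: int, runs: int) -> tuple[float, float]:
    """Return (pass rate, standard error) for `successes` out of `runs`.

    Uses sqrt(p * (1 - p) / n), which assumes independent trials and is
    only a coarse guide when n is small or p is near 0 or 1.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= successes <= runs:
        raise ValueError("successes must be between 0 and runs")

    rate = successes / runs
    std_error = math.sqrt(rate * (1 - rate) / runs)
    return rate, std_error
```

Because the standard error shrinks with the square root of the number of runs, going from 10 to 40 repeats roughly halves the uncertainty at the same pass rate.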

### Docker vs Native Execution

**Docker (default):** Builds and uses the `agbench:default` image automatically. Supply a custom image with `--docker-image myrepo/agbench:custom`.

**Native mode:** Add `--native` to run directly on the host OS. This bypasses container isolation but may leave stray state on your machine and reduce reproducibility.

### Subsampling Options

Test on a subset of tasks using `--subsample`:

- `agbench run --subsample 0.5 Tasks/scenario.jsonl` runs 50% of tasks randomly
- `agbench run --subsample 5 Tasks/scenario.jsonl` runs exactly 5 random tasks
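The two forms of the argument can be illustrated with a small sketch: a value below 1 is read as a fraction of the task list, and a value of 1 or more as an absolute count. The function name, the rounding rule, and the optional `seed` parameter are assumptions for this example rather than documented `agbench` behavior.

```python
import random
from typing import Optional


def subsample(tasks: list, amount: float, seed: Optional[int] = None) -> list:
    """Pick a random subset of `tasks`.

    amount < 1  -> treated as a fraction of the list (rounded down)
    amount >= 1 -> treated as an absolute number of tasks
    """
    if amount <= 0:
        raise ValueError("amount must be positive")

    if amount < 1:
        count = int(len(tasks) * amount)
    else:
        count = int(amount)

    # Never ask for more tasks than exist.
    count = min(count, len(tasks))

    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(tasks, count)
```

Passing the same `seed` twice returns the same subset, which helps when comparing two agent configurations on identical tasks.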

## Analyzing Results with Tabulation

After execution, aggregate results using the tabulate command:

```bash
agbench tabulate Results/human_eval_MagenticOne
```

This command reads the per-run JSON logs from the results hierarchy and prints a concise table showing success rates, average runtime, and token usage. The output is markdown-compatible for easy inclusion in reports.
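The aggregation idea can be sketched independently of the real `tabulate_cmd.py`. The example below walks a results folder, treats a run as successful when its log contains a marker string, and reports a success rate per task. The log file name `console_log.txt` and the marker text are assumptions chosen for illustration; adjust both to match what your runs actually write.

```python
from collections import defaultdict
from pathlib import Path


def success_rates(
    results_dir: Path,
    log_name: str = "console_log.txt",    # assumed log file name
    marker: str = "ALL TESTS PASSED",     # assumed success marker
) -> dict[str, float]:
    """Return {task_id: fraction of runs whose log contains `marker`}."""
    outcomes: dict[str, list[bool]] = defaultdict(list)

    for log_path in sorted(results_dir.rglob(log_name)):
        # The first path component below results_dir identifies the task.
        task_id = log_path.relative_to(results_dir).parts[0]
        text = log_path.read_text(encoding="utf-8", errors="replace")
        outcomes[task_id].append(marker in text)

    return {task: sum(runs) / len(runs) for task, runs in outcomes.items()}
```

Each entry in the returned dictionary could then be formatted as one row of a markdown table.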

## Advanced Configuration Options

Customize execution behavior through these CLI options:

- **`--docker-image`**: Specify a pre-built image instead of building `agbench:default`
- **`--timeout`**: Override the default execution timeout for long-running scenarios
- **`remove_missing`**: A sub-command (not a flag) that cleans up result entries for scenarios that no longer exist in the task definition

For custom environment variables, modify the [`ENV.json`](https://github.com/microsoft/autogen/blob/main/ENV.json) file or extend the `get_scenario_env` logic in [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py) to handle additional secret management systems.

## Summary

- Install AutoGen Bench using `pip install -e autogen/python/packages/agbench` to access the `agbench` CLI.
- Configure credentials via `OAI_CONFIG_LIST` or `OPENAI_API_KEY`, with additional keys in [`ENV.json`](https://github.com/microsoft/autogen/blob/main/ENV.json).
- Initialize tasks using provided scripts like [`Scripts/init_tasks.py`](https://github.com/microsoft/autogen/blob/main/Scripts/init_tasks.py) before running benchmarks.
- Execute scenarios with `agbench run`, using `--repeat` for multiple trials and `--subsample` for partial datasets.
- Analyze outcomes using `agbench tabulate Results/<scenario>` to generate performance summaries.
- Prefer Docker execution (default) over `--native` to ensure reproducible, isolated environments.

## Frequently Asked Questions

### What file format defines benchmark scenarios?

AutoGen Bench uses **JSONL (JSON Lines)** files where each line describes a scenario instance with template paths and substitution dictionaries. The [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py) module expands these definitions into concrete execution folders before running.

### How does AutoGen Bench handle sensitive API keys?

The tool loads keys from `OAI_CONFIG_LIST` (OpenAI/Azure) and [`ENV.json`](https://github.com/microsoft/autogen/blob/main/ENV.json) (additional services) at runtime. The `get_scenario_env` helper in [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py) injects these as environment variables into Docker containers or native processes, so credentials do not need to be hard-coded in scenario files. Injection alone does not stop an agent or script from printing a key, so review logs before sharing them.

### Can I run benchmarks without Docker?

Yes, using the `--native` flag, but this is **discouraged** for production benchmarking. Native execution runs directly on your host OS and may leave residual files or conflicting state between runs, compromising the reproducibility that Docker provides through container isolation.

### Where are benchmark results stored?

Results follow a deterministic path: `Results/<scenario>/<task_id>/<instance_id>/<repeat>/`. Each directory contains `stdout`, `stderr`, and AutoGen trace files generated during execution, as implemented in lines 35-55 of [`run_cmd.py`](https://github.com/microsoft/autogen/blob/main/run_cmd.py).