Setting Up AutoGen Bench for Benchmarking Agent Performance: A Complete Guide

AutoGen Bench (agbench) is the official benchmarking suite in the Microsoft AutoGen repository that runs predefined scenarios under controlled conditions to evaluate agent performance across multiple runs and configurations.

AutoGen Bench provides a reproducible framework for measuring how well AutoGen agents perform on standardized tasks. This guide walks through the complete setup process using the actual implementation from the microsoft/autogen repository, from installation to result analysis.

What Is AutoGen Bench?

AutoGen Bench is a command-line tool that repeatedly executes collections of AutoGen scenarios under tightly controlled conditions. It expands scenario templates, orchestrates Docker containers (or native execution), captures execution logs, and aggregates performance metrics. The tool is implemented in python/packages/agbench/ within the main AutoGen repository and enables researchers to compare agent configurations across languages, hardware, and repetition counts.

Architecture and Key Components

Understanding the internal structure helps troubleshoot setup issues and customize workflows.

CLI Entry Point

The agbench command parses user inputs in agbench/cli.py, dispatching to sub-commands like run, tabulate, and remove_missing. This file serves as the primary interface between you and the benchmarking engine.

Scenario Runner

The core execution logic lives in run_cmd.py. The run_scenarios function (approximately lines 59-84) discovers JSONL scenario definitions, handles subsampling, and manages repetition loops. For each scenario instance, it calls expand_scenario to perform copy-and-replace operations on template files using substitution dictionaries from the JSONL definition.
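The copy-and-replace step can be sketched in a few lines of Python. This is an illustration of the technique, not the code in run_cmd.py: the function name expand_template, the __PROMPT__ placeholder, and the shape of the substitution dictionary (file name mapped to placeholder/value pairs) are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def expand_template(template_dir, output_dir, substitutions):
    """Copy template_dir to output_dir, then apply per-file string replacements.

    substitutions maps a file name (relative to the template root) to a
    {placeholder: value} dict. This layout is assumed for illustration.
    """
    template_dir, output_dir = Path(template_dir), Path(output_dir)
    # copytree creates output_dir (and any missing parents) and copies all files.
    shutil.copytree(template_dir, output_dir)
    for rel_name, replacements in substitutions.items():
        target = output_dir / rel_name
        text = target.read_text(encoding="utf-8")
        for placeholder, value in replacements.items():
            text = text.replace(placeholder, value)
        target.write_text(text, encoding="utf-8")
    return output_dir


# Build a tiny template in a temporary directory and expand it once.
workdir = Path(tempfile.mkdtemp())
template = workdir / "Template"
template.mkdir()
(template / "prompt.txt").write_text("Solve: __PROMPT__\n", encoding="utf-8")

expanded = expand_template(
    template,
    workdir / "Results" / "task_0" / "0",
    {"prompt.txt": {"__PROMPT__": "return the sum of two integers"}},
)
expanded_text = (expanded / "prompt.txt").read_text(encoding="utf-8")
print(expanded_text)
```

Because the template is copied before any replacement happens, the original template folder stays untouched and can be expanded again for the next instance or repetition.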

Docker Orchestration

By default, AutoGen Bench uses the Python Docker SDK to create containers via run_scenario_in_docker in run_cmd.py. It mounts the expanded scenario folder, injects environment variables, enforces timeouts, and cleans up after execution. Alternatively, run_scenario_natively supports host execution, though this is discouraged for reproducibility.
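The same three concerns (injecting environment variables, enforcing a timeout, capturing output) also apply to host execution. The sketch below shows them with the standard library's subprocess module. It is not the body of run_scenario_natively; the function name run_with_timeout and the DEMO_KEY variable are hypothetical.

```python
import os
import subprocess
import sys


def run_with_timeout(command, cwd, extra_env, timeout_seconds):
    """Run command in cwd with extra environment variables and a hard timeout."""
    env = dict(os.environ)  # start from the host environment
    env.update(extra_env)   # then layer the scenario-specific variables on top
    try:
        completed = subprocess.run(
            command,
            cwd=cwd,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return {"timed_out": True, "returncode": None, "stdout": "", "stderr": ""}
    return {
        "timed_out": False,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# A fast command that echoes an injected variable, and a slow one that times out.
result = run_with_timeout(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_KEY'])"],
    cwd=".",
    extra_env={"DEMO_KEY": "demo-value"},
    timeout_seconds=30,
)
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"],
    cwd=".",
    extra_env={},
    timeout_seconds=0.5,
)
print(result["stdout"].strip(), slow["timed_out"])
```

A container adds filesystem and dependency isolation on top of this, which is why the Docker path is the better choice when results need to be reproducible.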

Result Storage and Tabulation

Results are stored in a deterministic hierarchy: Results/<scenario>/<task_id>/<instance_id>/<repeat>. Files written include stdout, stderr, and AutoGen trace files (lines 35-55 in run_cmd.py). The tabulate_cmd.py module reads these JSON logs to aggregate success rates and print markdown-compatible tables.
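As a small illustration of why a deterministic hierarchy is convenient, the sketch below builds a run directory from its four components and recovers them again from the path. The helper names and the example identifiers (task_0, instance 0, repeat 3) are hypothetical, and the real identifiers depend on the benchmark.

```python
from pathlib import Path


def result_dir(results_root, scenario, task_id, instance_id, repeat):
    """Build Results/<scenario>/<task_id>/<instance_id>/<repeat> as a Path."""
    return Path(results_root) / scenario / task_id / str(instance_id) / str(repeat)


def parse_run_dir(run_dir, results_root):
    """Recover the four components from a run directory under results_root."""
    scenario, task_id, instance_id, repeat = Path(run_dir).relative_to(results_root).parts
    return {
        "scenario": scenario,
        "task_id": task_id,
        "instance_id": instance_id,
        "repeat": int(repeat),
    }


run_dir = result_dir("Results", "human_eval_MagenticOne", "task_0", 0, 3)
parsed = parse_run_dir(run_dir, "Results")
print(run_dir.as_posix(), parsed)
```

Since every component is encoded in the path, a tabulation step can group runs by scenario or task without keeping a separate index file.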

Prerequisites and Installation

Before running benchmarks, ensure you have Docker installed and API credentials ready.

Install the package in editable mode for development (drop the -e flag for a standard install):

pip install -e autogen/python/packages/agbench

Verify installation by checking the CLI:

agbench --help

Step-by-Step Setup Guide

Configure API Credentials

AutoGen Bench requires OpenAI or Azure credentials. Create an OAI_CONFIG_LIST file in your working directory or export it as an environment variable:

export OAI_CONFIG_LIST="$(cat ./OAI_CONFIG_LIST)"

Alternatively, set a single key directly:

export OPENAI_API_KEY=sk-...

For additional APIs like Bing, create an ENV.json file alongside your scenario folder:

{
    "BING_API_KEY": "your-bing-key"
}

The get_scenario_env helper in run_cmd.py loads these configurations at runtime.
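A minimal sketch of that kind of loading logic, assuming the helper forwards the credential variables from the host environment and then merges in the keys from ENV.json. The function name load_scenario_env, its parameters, and the merge order are assumptions for illustration; check get_scenario_env in run_cmd.py for the actual behavior.

```python
import json
import os
import tempfile
from pathlib import Path


def load_scenario_env(env_file, base_env=None,
                      passthrough=("OAI_CONFIG_LIST", "OPENAI_API_KEY")):
    """Collect the variables a scenario needs into a single dict.

    Only the names in passthrough are copied from the host environment;
    keys from the JSON file are added afterwards and win on conflict.
    """
    base_env = os.environ if base_env is None else base_env
    env = {name: base_env[name] for name in passthrough if name in base_env}
    env_path = Path(env_file)
    if env_path.is_file():
        with env_path.open(encoding="utf-8") as handle:
            extra = json.load(handle)
        # Environment variables must be strings, so coerce every value.
        env.update({key: str(value) for key, value in extra.items()})
    return env


# Write a throwaway ENV.json and merge it with a fake host environment.
env_file = Path(tempfile.mkdtemp()) / "ENV.json"
env_file.write_text(json.dumps({"BING_API_KEY": "your-bing-key"}), encoding="utf-8")

env = load_scenario_env(
    env_file,
    base_env={"OPENAI_API_KEY": "sk-demo", "HOME": "/home/demo"},
)
print(sorted(env))
```

Forwarding an explicit allow-list rather than the whole host environment keeps unrelated variables (HOME in the example) out of the scenario.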

Initialize Benchmark Tasks

Most benchmarks require task initialization to download datasets and create scenario files. For example, to set up HumanEval:

cd autogen/python/packages/agbench/benchmarks/HumanEval
python Scripts/init_tasks.py

This populates the Tasks/ folder with JSONL files defining individual problem instances.
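To inspect what an init script produced, a JSONL file can be read one line at a time. The sketch below parses JSONL text into a list of dictionaries. The field names used in the sample (id, template, substitutions) are assumed for illustration, so compare them with a real file in Tasks/ before relying on them.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report which line is malformed instead of failing anonymously.
            raise ValueError(f"line {line_number}: {exc.msg}") from exc
    return records


# Two hypothetical scenario instances separated by a blank line.
sample = "\n".join([
    json.dumps({"id": "task_0", "template": "Templates/Demo",
                "substitutions": {"prompt.txt": {"__PROMPT__": "add two numbers"}}}),
    "",
    json.dumps({"id": "task_1", "template": "Templates/Demo",
                "substitutions": {"prompt.txt": {"__PROMPT__": "reverse a string"}}}),
])

instances = parse_jsonl(sample)
print([instance["id"] for instance in instances])
```

For a file on disk, pass the result of Path("Tasks/<file>.jsonl").read_text() to the same function.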

Running Your First Benchmark

Execute a benchmark using the JSONL scenario definition. The following command runs HumanEval tasks once in Docker:

agbench run Tasks/human_eval_MagenticOne.jsonl

For statistically significant results, use the --repeat flag to run each scenario multiple times:

agbench run --repeat 10 Tasks/human_eval_MagenticOne.jsonl

Docker vs Native Execution

Docker (default): Builds and uses the agbench:default image automatically. Supply a custom image with --docker-image myrepo/agbench:custom.

Native mode: Add --native to run directly on the host OS. This bypasses container isolation, so it may leave stray state on your machine and reduce reproducibility.

Subsampling Options

Test on a subset of tasks using --subsample:

  • agbench run --subsample 0.5 Tasks/scenario.jsonl runs 50% of tasks randomly
  • agbench run --subsample 5 Tasks/scenario.jsonl runs exactly 5 random tasks
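The two interpretations of the subsample value can be sketched as follows. This is a hypothetical helper, not the agbench implementation: treating values strictly between 0 and 1 as fractions, the rounding rule, and the optional seed are all assumptions.

```python
import random


def subsample(tasks, amount, seed=None):
    """Pick a random subset of tasks, preserving their original order.

    amount strictly between 0 and 1 is read as a fraction of the tasks;
    any other value is read as an absolute count (capped at len(tasks)).
    """
    rng = random.Random(seed)
    if 0 < amount < 1:
        count = max(1, round(len(tasks) * amount))
    else:
        count = int(amount)
    count = min(count, len(tasks))
    # Sample indices rather than items so the original order can be kept.
    chosen = set(rng.sample(range(len(tasks)), count))
    return [task for index, task in enumerate(tasks) if index in chosen]


tasks = [f"task_{i}" for i in range(10)]
half = subsample(tasks, 0.5, seed=0)   # 50% of 10 tasks
five = subsample(tasks, 5, seed=0)     # exactly 5 tasks
capped = subsample(tasks, 50, seed=0)  # more than available: all 10
print(len(half), len(five), len(capped))
```

Fixing the seed makes the chosen subset repeatable, which matters when comparing two agent configurations on the same sample.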

Analyzing Results with Tabulation

After execution, aggregate results using the tabulate command:

agbench tabulate Results/human_eval_MagenticOne

This command reads the per-run JSON logs from the results hierarchy and prints a concise table showing success rates, average runtime, and token usage. The output is markdown-compatible for easy inclusion in reports.
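The aggregation itself is simple to reason about. The sketch below computes per-task success rates from a list of (task_id, succeeded) pairs and renders them as a markdown table. It is not the tabulate_cmd.py code: how a run is judged successful and the column layout are assumptions, and runtime and token columns are left out.

```python
from collections import defaultdict


def success_table(runs):
    """Render per-task success rates as a markdown table.

    runs is an iterable of (task_id, succeeded) pairs, one per repetition.
    """
    totals = defaultdict(lambda: [0, 0])  # task_id -> [successes, trials]
    for task_id, succeeded in runs:
        totals[task_id][1] += 1
        if succeeded:
            totals[task_id][0] += 1
    lines = ["| Task | Successes | Trials | Rate |", "| --- | --- | --- | --- |"]
    for task_id in sorted(totals):
        successes, trials = totals[task_id]
        lines.append(f"| {task_id} | {successes} | {trials} | {successes / trials:.0%} |")
    return "\n".join(lines)


# Hypothetical outcomes: task_0 was repeated twice, task_1 once.
table = success_table([("task_0", True), ("task_0", False), ("task_1", True)])
print(table)
```

With several repetitions per task, the rate column shows at a glance which tasks an agent solves reliably and which only occasionally.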

Advanced Configuration Options

Customize execution behavior through these CLI options:

  • --docker-image: Specify a pre-built image instead of building agbench:default
  • --timeout: Override the default execution timeout for long-running scenarios
  • remove_missing: A sub-command rather than a flag (see the CLI entry point above); cleans up result entries for scenarios that no longer exist in the task definition

For custom environment variables, modify the ENV.json file or extend the get_scenario_env logic in run_cmd.py to handle additional secret management systems.

Summary

  • Install AutoGen Bench using pip install -e autogen/python/packages/agbench to access the agbench CLI.
  • Configure credentials via OAI_CONFIG_LIST or OPENAI_API_KEY, with additional keys in ENV.json.
  • Initialize tasks using provided scripts like Scripts/init_tasks.py before running benchmarks.
  • Execute scenarios with agbench run, using --repeat for multiple trials and --subsample for partial datasets.
  • Analyze outcomes using agbench tabulate Results/<scenario> to generate performance summaries.
  • Prefer Docker execution (default) over --native to ensure reproducible, isolated environments.

Frequently Asked Questions

What file format defines benchmark scenarios?

AutoGen Bench uses JSONL (JSON Lines) files where each line describes a scenario instance with template paths and substitution dictionaries. The run_cmd.py module expands these definitions into concrete execution folders before running.

How does AutoGen Bench handle sensitive API keys?

The tool loads keys from OAI_CONFIG_LIST (OpenAI/Azure) and ENV.json (additional services) at runtime. The get_scenario_env helper in run_cmd.py injects these as environment variables into Docker containers or native processes, so credentials do not have to be hard-coded in scenario files. Environment injection alone does not stop an agent or script from printing a key, so review logs before sharing them.

Can I run benchmarks without Docker?

Yes, using the --native flag, but this is discouraged for production benchmarking. Native execution runs directly on your host OS and may leave residual files or conflicting state between runs, compromising the reproducibility that Docker provides through container isolation.

Where are benchmark results stored?

Results follow a deterministic path: Results/<scenario>/<task_id>/<instance_id>/<repeat>/. Each directory contains stdout, stderr, and AutoGen trace files generated during execution, as implemented in lines 35-55 of run_cmd.py.
