Setting Up AutoGen Bench for Benchmarking Agent Performance: A Complete Guide
AutoGen Bench (agbench) is the official benchmarking suite in the Microsoft AutoGen repository that runs predefined scenarios under controlled conditions to evaluate agent performance across multiple runs and configurations.
AutoGen Bench provides a reproducible framework for measuring how well AutoGen agents perform on standardized tasks. This guide walks through the complete setup process using the actual implementation from the microsoft/autogen repository, from installation to result analysis.
What Is AutoGen Bench?
AutoGen Bench is a command-line tool that repeatedly executes collections of AutoGen scenarios under tightly controlled conditions. It expands scenario templates, orchestrates Docker containers (or native execution), captures execution logs, and aggregates performance metrics. The tool is implemented in python/packages/agbench/ within the main AutoGen repository and enables researchers to compare agent configurations across languages, hardware, and repetition counts.
Architecture and Key Components
Understanding the internal structure helps troubleshoot setup issues and customize workflows.
CLI Entry Point
The agbench command parses user inputs in agbench/cli.py, dispatching to sub-commands like run, tabulate, and remove_missing. This file serves as the primary interface between you and the benchmarking engine.
Scenario Runner
The core execution logic lives in run_cmd.py. The run_scenarios function (approximately lines 59-84) discovers JSONL scenario definitions, handles subsampling, and manages repetition loops. For each scenario instance, it calls expand_scenario to perform copy-and-replace operations on template files using substitution dictionaries from the JSONL definition.
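The copy-and-replace step can be illustrated with a minimal sketch. This is not the expand_scenario code from the repository: the function name expand_template, the file name prompt.txt, and the placeholder __PROMPT__ are hypothetical, and the substitution layout (file name mapped to placeholder/value pairs) is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def expand_template(template_dir: Path, output_dir: Path,
                    substitutions: dict[str, dict[str, str]]) -> None:
    """Copy a template folder, then apply per-file string replacements.

    `substitutions` maps a file name (relative to the template root) to a
    dictionary of placeholder -> replacement text. This layout is assumed
    for illustration only.
    """
    shutil.copytree(template_dir, output_dir)
    for relative_name, replacements in substitutions.items():
        target = output_dir / relative_name
        text = target.read_text(encoding="utf-8")
        for placeholder, value in replacements.items():
            text = text.replace(placeholder, value)
        target.write_text(text, encoding="utf-8")


def demo() -> str:
    """Expand a one-file template in a temporary directory and return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        template = Path(tmp) / "template"
        template.mkdir()
        (template / "prompt.txt").write_text("Solve: __PROMPT__", encoding="utf-8")
        expanded = Path(tmp) / "expanded"
        expand_template(template, expanded,
                        {"prompt.txt": {"__PROMPT__": "add two numbers"}})
        return (expanded / "prompt.txt").read_text(encoding="utf-8")
```

Calling demo() returns the template text with the placeholder filled in, which is the essence of turning one template plus one JSONL line into a concrete scenario folder.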
Docker Orchestration
By default, AutoGen Bench uses the Python Docker SDK to create containers via run_scenario_in_docker in run_cmd.py. It mounts the expanded scenario folder, injects environment variables, enforces timeouts, and cleans up after execution. Alternatively, run_scenario_natively supports host execution, though this is discouraged for reproducibility.
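To make the "inject environment variables, enforce a timeout" idea concrete, here is a rough standard-library sketch of the native path. It does not use the Docker SDK and is not the repository's run_scenario_natively; the function name run_with_timeout and its return convention are hypothetical.

```python
import os
import subprocess
import sys
from typing import Optional, Tuple


def run_with_timeout(command: list[str], extra_env: dict[str, str],
                     timeout_s: float) -> Tuple[Optional[int], str]:
    """Run a command with extra environment variables and a hard timeout.

    Returns (exit_code, stdout). On timeout the child process is killed and
    (None, "TIMEOUT") is returned instead.
    """
    env = {**os.environ, **extra_env}  # scenario variables override host ones
    try:
        completed = subprocess.run(
            command, env=env, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None, "TIMEOUT"
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    code, out = run_with_timeout(
        [sys.executable, "-c", "import os; print(os.environ['DEMO_KEY'])"],
        {"DEMO_KEY": "abc"},
        timeout_s=30.0,
    )
    print(code, out.strip())
```

A container-based runner follows the same shape, with the container providing the isolation that a bare subprocess lacks.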
Result Storage and Tabulation
Results are stored in a deterministic hierarchy: Results/<scenario>/<task_id>/<instance_id>/<repeat>. Files written include stdout, stderr, and AutoGen trace files (lines 35-55 in run_cmd.py). The tabulate_cmd.py module reads these JSON logs to aggregate success rates and print markdown-compatible tables.
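As a small illustration of that hierarchy, the helper below builds the directory for one run. The function name result_dir and the example identifiers are hypothetical. Only the path layout comes from the description above.

```python
from pathlib import Path


def result_dir(root: str, scenario: str, task_id: str,
               instance_id: int, repeat: int) -> Path:
    """Build Results/<scenario>/<task_id>/<instance_id>/<repeat> as a Path."""
    return Path(root) / scenario / task_id / str(instance_id) / str(repeat)


if __name__ == "__main__":
    # Example identifiers are made up for illustration.
    print(result_dir("Results", "human_eval_MagenticOne", "HumanEval_0", 0, 2).as_posix())
```

Because the path is a pure function of the scenario, task, instance, and repeat index, a rerun can detect which runs already exist and skip them.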
Prerequisites and Installation
Before running benchmarks, ensure you have Docker installed and API credentials ready.
Install the package from a local clone of the repository. The -e flag installs it in editable mode for development; omit the flag for a standard install:
pip install -e autogen/python/packages/agbench
Verify installation by checking the CLI:
agbench --help
Step-by-Step Setup Guide
Configure API Credentials
AutoGen Bench requires OpenAI or Azure credentials. Create an OAI_CONFIG_LIST file in your working directory or export it as an environment variable:
export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)
Alternatively, set a single key directly:
export OPENAI_API_KEY=sk-...
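If you have not written an OAI_CONFIG_LIST before, the sketch below generates and sanity-checks one. It assumes the common AutoGen convention of a JSON array whose entries carry at least "model" and "api_key"; the helper names, the model name, and the placeholder key are illustrative, so check the agbench README for the fields your setup needs.

```python
import json


def build_config_list(api_key: str, models: list[str]) -> str:
    """Return JSON text for a config list with one entry per model."""
    entries = [{"model": model, "api_key": api_key} for model in models]
    return json.dumps(entries, indent=2)


def validate_config_list(raw: str) -> list[dict]:
    """Parse config-list JSON and check the fields assumed above."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("config list must be a JSON array")
    for index, entry in enumerate(entries):
        for field in ("model", "api_key"):
            if field not in entry:
                raise ValueError(f"entry {index} is missing '{field}'")
    return entries


if __name__ == "__main__":
    # "sk-placeholder" is not a real key; substitute your own.
    print(build_config_list("sk-placeholder", ["gpt-4o"]))
```

Redirect the printed JSON into a file named OAI_CONFIG_LIST, then export it as shown above.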
For additional APIs like Bing, create an ENV.json file alongside your scenario folder:
{
"BING_API_KEY": "your-bing-key"
}
The get_scenario_env helper in run_cmd.py loads these configurations at runtime.
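The merging behavior can be pictured with a short sketch. This is an assumption about the general approach, not the actual get_scenario_env code: the function name load_scenario_env and the rule that ENV.json values override the base environment are both hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_scenario_env(env_file: Path, base_env: dict[str, str]) -> dict[str, str]:
    """Return a copy of base_env with values from an ENV.json-style file applied."""
    env = dict(base_env)
    if env_file.exists():
        overrides = json.loads(env_file.read_text(encoding="utf-8"))
        for key, value in overrides.items():
            env[key] = str(value)  # environment variables must be strings
    return env


def demo() -> dict[str, str]:
    """Write a temporary ENV.json and merge it over a small base environment."""
    with tempfile.TemporaryDirectory() as tmp:
        env_file = Path(tmp) / "ENV.json"
        env_file.write_text(json.dumps({"BING_API_KEY": "your-bing-key"}),
                            encoding="utf-8")
        return load_scenario_env(env_file, {"OPENAI_API_KEY": "sk-placeholder"})
```

A missing ENV.json simply leaves the base environment unchanged, which matches the idea that the file is only needed for additional services.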
Initialize Benchmark Tasks
Most benchmarks require task initialization to download datasets and create scenario files. For example, to set up HumanEval:
cd autogen/python/packages/agbench/benchmarks/HumanEval
python Scripts/init_tasks.py
This populates the Tasks/ folder with JSONL files defining individual problem instances.
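Before launching a long run, it can help to inspect what the initializer produced. The reader below is a minimal sketch; the field names "id", "template", and "substitutions" in the sample lines are assumptions about the scenario schema and may differ from the files generated on your machine.

```python
import json
from typing import Iterable


def read_scenarios(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL text: one JSON object per non-blank line."""
    scenarios = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            scenarios.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return scenarios


# Hypothetical sample records, for illustration only.
SAMPLE_LINES = [
    '{"id": "task_0", "template": "Templates/Example", '
    '"substitutions": {"prompt.txt": {"__PROMPT__": "first"}}}',
    "",
    '{"id": "task_1", "template": "Templates/Example", '
    '"substitutions": {"prompt.txt": {"__PROMPT__": "second"}}}',
]

if __name__ == "__main__":
    for scenario in read_scenarios(SAMPLE_LINES):
        print(scenario["id"])
```

An open file handle works in place of SAMPLE_LINES, since iterating over a text file yields one line at a time.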
Running Your First Benchmark
Execute a benchmark using the JSONL scenario definition. The following command runs HumanEval tasks once in Docker:
agbench run Tasks/human_eval_MagenticOne.jsonl
A single run says little about a stochastic agent. To measure run-to-run variance and get more reliable success rates, use the --repeat flag to run each scenario multiple times:
agbench run --repeat 10 Tasks/human_eval_MagenticOne.jsonl
Docker vs Native Execution
Docker (default): Builds and uses the agbench:default image automatically. Supply a custom image with --docker-image myrepo/agbench:custom.
Native mode: Add --native to run directly on the host OS. This bypasses container isolation but may leave stray state on your machine and reduce reproducibility.
Subsampling Options
Test on a subset of tasks using --subsample:
- agbench run --subsample 0.5 Tasks/scenario.jsonl runs a random 50% of tasks
- agbench run --subsample 5 Tasks/scenario.jsonl runs exactly 5 random tasks
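The two forms of the subsample argument can be modeled as follows. This is a sketch of the idea, not agbench's selection code: the function name subsample_tasks, the rounding rule for fractions, and the seed parameter are assumptions.

```python
import math
import random
from typing import Sequence, Union


def subsample_tasks(tasks: Sequence[str], amount: Union[int, float],
                    seed: int = 0) -> list[str]:
    """Pick a random subset: a float is a fraction, an int is an exact count."""
    if isinstance(amount, float):
        if not 0 < amount <= 1:
            raise ValueError("a fractional subsample must be in (0, 1]")
        # Rounding up so small fractions still select at least one task
        # is an assumption made for this sketch.
        count = min(len(tasks), max(1, math.ceil(len(tasks) * amount)))
    else:
        count = min(amount, len(tasks))
    return random.Random(seed).sample(list(tasks), count)


if __name__ == "__main__":
    all_tasks = [f"task_{i}" for i in range(10)]
    print(len(subsample_tasks(all_tasks, 0.5)))  # fraction of the task list
    print(len(subsample_tasks(all_tasks, 5)))    # exact number of tasks
```

Fixing the seed makes a subsample repeatable, which matters if you want two agent configurations evaluated on the same subset.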
Analyzing Results with Tabulation
After execution, aggregate results using the tabulate command:
agbench tabulate Results/human_eval_MagenticOne
This command reads the per-run JSON logs from the results hierarchy and prints a concise table showing success rates, average runtime, and token usage. The output is markdown-compatible for easy inclusion in reports.
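The aggregation itself is simple to picture. The sketch below turns per-task pass/fail outcomes into a markdown table; it does not read agbench's log files, and the function name, input shape, and column headings are hypothetical rather than the tool's real output format.

```python
def tabulate_outcomes(outcomes: dict[str, list[bool]]) -> str:
    """Render per-task success rates across repeats as a markdown table."""
    lines = [
        "| Task | Runs | Successes | Success rate |",
        "| --- | --- | --- | --- |",
    ]
    for task_id in sorted(outcomes):
        runs = outcomes[task_id]
        successes = sum(runs)  # True counts as 1
        rate = successes / len(runs) if runs else 0.0
        lines.append(f"| {task_id} | {len(runs)} | {successes} | {rate:.0%} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up outcomes for two tasks, each repeated twice.
    print(tabulate_outcomes({
        "HumanEval_0": [True, False],
        "HumanEval_1": [True, True],
    }))
```

Keeping the outcomes per repeat, rather than only the average, lets you see whether a task fails consistently or only occasionally.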
Advanced Configuration Options
Customize execution behavior through these CLI options:
- --docker-image: Specify a pre-built image instead of building agbench:default
- --timeout: Override the default execution timeout for long-running scenarios
- remove_missing: Clean up result entries for scenarios that no longer exist in the task definition (a sub-command, invoked as agbench remove_missing, rather than a flag)
For custom environment variables, modify the ENV.json file or extend the get_scenario_env logic in run_cmd.py to handle additional secret management systems.
Summary
- Install AutoGen Bench using pip install -e autogen/python/packages/agbench to access the agbench CLI.
- Configure credentials via OAI_CONFIG_LIST or OPENAI_API_KEY, with additional keys in ENV.json.
- Initialize tasks using provided scripts like Scripts/init_tasks.py before running benchmarks.
- Execute scenarios with agbench run, using --repeat for multiple trials and --subsample for partial datasets.
- Analyze outcomes using agbench tabulate Results/<scenario> to generate performance summaries.
- Prefer Docker execution (default) over --native to ensure reproducible, isolated environments.
Frequently Asked Questions
What file format defines benchmark scenarios?
AutoGen Bench uses JSONL (JSON Lines) files where each line describes a scenario instance with template paths and substitution dictionaries. The run_cmd.py module expands these definitions into concrete execution folders before running.
How does AutoGen Bench handle sensitive API keys?
The tool loads keys from OAI_CONFIG_LIST (OpenAI/Azure) and ENV.json (additional services) at runtime. The get_scenario_env helper in run_cmd.py injects these as environment variables into Docker containers or native processes, so credentials stay out of the scenario templates and task definitions. They can still reach the execution logs if agent code prints its environment, so review logs before sharing them.
Can I run benchmarks without Docker?
Yes, using the --native flag, but this is discouraged for production benchmarking. Native execution runs directly on your host OS and may leave residual files or conflicting state between runs, compromising the reproducibility that Docker provides through container isolation.
Where are benchmark results stored?
Results follow a deterministic path: Results/<scenario>/<task_id>/<instance_id>/<repeat>/. Each directory contains stdout, stderr, and AutoGen trace files generated during execution, as implemented in lines 35-55 of run_cmd.py.