# Performance Tuning Options for GraphRAG Agent: Complete Configuration Guide

> Discover GraphRAG Agent performance tuning options. Optimize throughput, latency, and resources via env vars for concurrency, batching, GDS settings, and caching.

- Repository: [GLK/graph-rag-agent](https://github.com/1517005260/graph-rag-agent)
- Tags: performance
- Published: 2026-02-23

---

**You can optimize GraphRAG Agent throughput, latency, and resource consumption by adjusting environment variables for concurrency, batch processing, Neo4j GDS settings, and caching without modifying any source code.**

The GraphRAG Agent repository (`1517005260/graph-rag-agent`) exposes all performance-critical parameters as environment variables, allowing you to balance speed, memory usage, and result quality for different deployment scenarios. This guide covers every tunable knob available in the `.env.example` file and explains how each setting affects the core pipelines for ingestion, embedding, graph construction, and search.

## Concurrency and Parallelism Settings

The agent uses thread pools and worker processes to parallelize CPU-bound and I/O-bound tasks. Adjust these variables based on your available CPU cores and memory.

### API-Level Concurrency

- **`FASTAPI_WORKERS`**: Controls the number of FastAPI worker processes. Increasing this value raises requests-per-second capacity but consumes more CPU and RAM, since each worker is a separate process. Typical values range from 2 to 8 workers depending on core count.

- **`MAX_WORKERS`**: Sets the global thread-pool size used by batch-processing classes such as `EmbeddingManager` and `GraphStructureBuilder`. A larger pool enables more parallel batches during ingestion but increases CPU pressure. This value is injected into `ThreadPoolExecutor` instances throughout the codebase, such as in [`graphrag_agent/integrations/build/incremental_update.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/integrations/build/incremental_update.py) at line 51.
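To make the effect of `MAX_WORKERS` concrete, here is a minimal sketch of a bounded thread pool sized from the environment. The helper names `read_int_env` and `run_in_pool` and the fallback of 4 workers are assumptions for illustration; they are not taken from the repository.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_int_env(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to a default."""
    raw = os.getenv(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def run_in_pool(items, fn, max_workers=None):
    """Apply fn to every item with a bounded thread pool, preserving input order."""
    # The fallback of 4 is a placeholder, not the repository's default.
    workers = max_workers or read_int_env("MAX_WORKERS", 4)
    workers = max(1, workers)  # ThreadPoolExecutor rejects values below 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

Because the pool size caps how many batches are in flight at once, raising `MAX_WORKERS` helps most when each batch spends its time waiting on I/O such as embedding or database calls.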

### Multi-Agent Orchestration

- **`MA_WORKER_MAX_CONCURRENCY`**: Limits concurrent tasks when the multi-agent planner runs in parallel mode. Raising this improves throughput for complex queries but may spike memory consumption during map-reduce operations.

## Batch Size Configuration for Throughput

Batch sizes directly impact the trade-off between throughput and memory usage. The repository provides granular control over different pipeline stages.

### Ingestion and Embedding Batches

- **`BATCH_SIZE`**: General catch-all batch size for internal processing loops. Recommended range: 50-200.

- **`CHUNK_BATCH_SIZE`**: Specific to text-chunk ingestion and embedding generation. Recommended range: 80-150.

- **`ENTITY_BATCH_SIZE`**: Controls entity-level CRUD operations in the graph writer. Recommended range: 30-80 to prevent transaction timeouts.

- **`EMBEDDING_BATCH_SIZE`**: Number of vectors sent per HTTP call to the embedding service. Must respect provider API limits; typical range: 32-128.
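The batching pattern these variables control can be sketched as follows. This is an illustration only: `batched`, `embed_in_batches`, the `embed_fn` callback, and the fallback size of 64 are hypothetical and do not reflect the repository's `EmbeddingManager`.

```python
import os
from typing import Callable, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: Optional[int] = None,
) -> List[List[float]]:
    """Call embed_fn once per batch and concatenate the returned vectors."""
    # The fallback of 64 is a placeholder, not the repository's default.
    size = batch_size or int(os.getenv("EMBEDDING_BATCH_SIZE", "64"))
    vectors: List[List[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed_fn(batch))
    return vectors
```

A larger batch size means fewer HTTP round trips but a bigger payload per request, which is why the provider's per-request limit sets the practical ceiling.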

### Search and LLM Batches

- **`LLM_BATCH_SIZE`**: Number of prompts sent together to the LLM (used in `fusion_agent`). Range: 4-10 to balance latency and token usage.

- **`COMMUNITY_BATCH_SIZE`**: Batch size for graph algorithm community detection. Range: 30-70.

- **`GLOBAL_SEARCH_BATCH_SIZE`**: Processes multiple communities during global search. Range: 5-10.

- **`HYBRID_SEARCH_BATCH_SIZE`**: Controls batching for hybrid (vector + graph) search requests. Range: 8-20. This value is read by [`graphrag_agent/search/tool/hybrid_tool.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/search/tool/hybrid_tool.py) at line 41 via `HYBRID_SEARCH_SETTINGS["batch_size"]`.
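One way to catch misconfiguration early is to compare configured batch sizes against the ranges listed above. The checker below is a hypothetical helper, not part of the repository; the ranges are copied from this guide and are recommendations rather than hard limits.

```python
import os
from typing import List, Mapping, Optional

# Recommended ranges as stated in this guide (inclusive bounds).
RECOMMENDED_RANGES = {
    "BATCH_SIZE": (50, 200),
    "CHUNK_BATCH_SIZE": (80, 150),
    "ENTITY_BATCH_SIZE": (30, 80),
    "EMBEDDING_BATCH_SIZE": (32, 128),
    "LLM_BATCH_SIZE": (4, 10),
    "COMMUNITY_BATCH_SIZE": (30, 70),
    "GLOBAL_SEARCH_BATCH_SIZE": (5, 10),
    "HYBRID_SEARCH_BATCH_SIZE": (8, 20),
}


def out_of_range_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return one warning per batch-size variable that is set but outside its range."""
    env = os.environ if env is None else env
    warnings: List[str] = []
    for name, (low, high) in RECOMMENDED_RANGES.items():
        raw = env.get(name)
        if raw is None:
            continue  # unset variables are left to the application's defaults
        try:
            value = int(raw)
        except ValueError:
            warnings.append(f"{name}={raw!r} is not an integer")
            continue
        if not low <= value <= high:
            warnings.append(
                f"{name}={value} is outside the recommended range {low}-{high}"
            )
    return warnings
```

Running such a check at startup and logging the warnings keeps an unusual value visible without blocking deployments that exceed the ranges on purpose.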

## Neo4j Graph Data Science (GDS) Optimization

When running graph algorithms on large corpora, tune the Neo4j GDS parameters to prevent out-of-memory errors.

- **`GDS_MEMORY_LIMIT`**: Sets the upper bound on GDS heap memory in gigabytes. Increase this when processing more than 50,000 nodes.

- **`GDS_CONCURRENCY`**: Number of parallel GDS threads. Recommended: 2-8 depending on available CPU cores.

- **`GDS_NODE_COUNT_LIMIT`**: Safety cap for node count in a single GDS run. Raise for large corpora but monitor for OOM errors.

- **`GDS_TIMEOUT_SECONDS`**: Maximum wall-time for a GDS algorithm. Increase for deep community detection on dense graphs.

These values are consumed by the Neo4j service layer (e.g., [`server/services/kg_service.py`](https://github.com/1517005260/graph-rag-agent/blob/main/server/services/kg_service.py)) and passed to the GDS driver during algorithm invocation.
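As a rough illustration of how the four GDS variables fit together, the sketch below loads them into a settings object and applies the node-count cap before an algorithm run. The `GdsSettings` class, both function names, and all default values are assumptions; the repository's service layer may structure this differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GdsSettings:
    memory_limit_gb: int
    concurrency: int
    node_count_limit: int
    timeout_seconds: int


def load_gds_settings(env: Optional[Mapping[str, str]] = None) -> GdsSettings:
    """Build GdsSettings from environment variables."""
    env = os.environ if env is None else env
    # All defaults below are placeholders, not the repository's values.
    return GdsSettings(
        memory_limit_gb=int(env.get("GDS_MEMORY_LIMIT", "4")),
        concurrency=int(env.get("GDS_CONCURRENCY", "4")),
        node_count_limit=int(env.get("GDS_NODE_COUNT_LIMIT", "50000")),
        timeout_seconds=int(env.get("GDS_TIMEOUT_SECONDS", "300")),
    )


def can_run_gds(node_count: int, settings: GdsSettings) -> bool:
    """Return True when the graph is within the configured safety cap."""
    return node_count <= settings.node_count_limit
```

Checking the cap before projecting the graph lets the caller fail fast or fall back to a smaller subgraph instead of discovering the problem as an out-of-memory error.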

## Cache and Memory Management

The agent implements a multi-tier caching system to reduce redundant embedding calls and database queries.

- **`CACHE_MAX_MEMORY_SIZE`**: Upper bound of the in-memory vector cache in megabytes. Larger values reduce repeated embedding calls but increase RAM usage.

- **`CACHE_MAX_DISK_SIZE`**: Disk-cache capacity for embeddings and query results. Set according to available SSD space to prevent disk thrashing.

- **`CACHE_ENABLE_VECTOR_SIMILARITY`**: Toggle for vector-similarity lookup caching. Disabling saves memory at the cost of extra compute.

- **`CACHE_SIMILARITY_THRESHOLD`**: Similarity cutoff for cache hits. Higher thresholds result in fewer cache hits and more recomputation.

- **`CACHE_MAX_VECTORS`**: Hard cap on stored vectors to control the memory footprint of the vector index.

All cache configuration is defined in `.env.example` lines 92-115 and consumed by [`graphrag_agent/cache_manager/__init__.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/cache_manager/__init__.py).
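The interaction between `CACHE_SIMILARITY_THRESHOLD` and `CACHE_MAX_VECTORS` can be illustrated with a toy similarity cache. This is a simplified sketch under stated assumptions: the `SimilarityCache` class, the use of cosine similarity, and the oldest-first eviction are illustrative choices, not the repository's `cache_manager` implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SimilarityCache:
    """Toy cache that returns a stored value when a query vector is close enough."""

    def __init__(self, threshold: float, max_vectors: int) -> None:
        if max_vectors < 1:
            raise ValueError("max_vectors must be at least 1")
        self.threshold = threshold
        self.max_vectors = max_vectors
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, query: Sequence[float]) -> Optional[str]:
        """Return the best match if its similarity meets the threshold, else None."""
        best_score, best_value = 0.0, None
        for vector, value in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self.threshold else None

    def put(self, vector: Sequence[float], value: str) -> None:
        """Store a vector, evicting the oldest entry once the cap is reached."""
        if len(self._entries) >= self.max_vectors:
            self._entries.pop(0)
        self._entries.append((list(vector), value))
```

With a threshold of 0.9, a query at similarity 0.98 to a stored vector is a hit; raising the threshold to 0.99 turns the same query into a miss, which is the recomputation cost described above.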

## Search and Agent Execution Limits

Control result quality and computational bounds for search operations and agent recursion.

- **`SEARCH_VECTOR_LIMIT`**: Maximum nearest-vector hits per query (typical: 5-20). Higher values improve recall but return more candidates to process per query.

- **`SEARCH_TEXT_LIMIT`**: Maximum textual matches returned by BM25-style search (typical: 5-15).

- **`SEARCH_SEMANTIC_TOP_K`**: Semantic search top-K (typical: 5-10).

- **`LOCAL_SEARCH_TOP_COMMUNITIES`**: Number of communities returned for local graph search (typical: 3-8).

- **`LOCAL_SEARCH_TOP_ENTITIES`**: Number of entity nodes returned per community (typical: 10-30).

- **`AGENT_RECURSION_LIMIT`**: Prevents runaway LangGraph recursion (default: 5).

- **`AGENT_CHUNK_SIZE`**: Number of message fragments fed to LangGraph per turn.

- **`AGENT_STREAM_FLUSH_THRESHOLD`**: Character threshold for flushing streaming responses.

- **`MA_PLANNER_MAX_TASKS`**: Upper bound on tasks a planner can emit in one step.

- **`MA_MAX_TOKENS_PER_REDUCE`**: Token budget for each Reduce phase in Map-Reduce writing mode.

These limits are enforced in [`graphrag_agent/search/utils.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/search/utils.py) and the multi-agent dispatcher in `graphrag_agent/agents/multi_agent`.

## Monitoring Performance with Built-in Tools

The repository includes a lightweight decorator for tracking endpoint latency without external dependencies.

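```python
from fastapi import APIRouter
from pydantic import BaseModel
from server.utils.performance import measure_performance

router = APIRouter()

class SearchRequest(BaseModel):  # hypothetical request schema, for illustration
    query: str

@router.post("/search")
@measure_performance("search_endpoint")
async def search(request: SearchRequest):
    result = {"query": request.query, "results": []}  # placeholder for the search logic
    return result
```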

The `measure_performance` decorator (located in [`server/utils/performance.py`](https://github.com/1517005260/graph-rag-agent/blob/main/server/utils/performance.py) lines 5-26) prints timestamped performance metrics to stdout, such as `API性能 - search_endpoint: 0.2371s` (the prefix means "API performance"). This makes slow paths easy to spot in production logs while adding only a timer read and a print call to each request.
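For readers who want to see the general shape of such a decorator, here is a minimal sketch of a timing wrapper for async functions. It is not the code in `server/utils/performance.py`; the name `timed`, the output format, and the demo coroutine are assumptions made for illustration.

```python
import asyncio
import functools
import time


def timed(label: str):
    """Return a decorator that prints how long an async function took."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are timed too.
                elapsed = time.perf_counter() - start
                print(f"{label}: {elapsed:.4f}s")

        return wrapper

    return decorator


@timed("demo_endpoint")
async def demo():
    await asyncio.sleep(0.01)  # stand-in for real work
    return "ok"


result = asyncio.run(demo())
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which is the usual choice for measuring short intervals.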

## Summary

- **Concurrency controls** (`FASTAPI_WORKERS`, `MAX_WORKERS`, `MA_WORKER_MAX_CONCURRENCY`) scale API throughput and parallel processing but increase CPU and memory pressure.
- **Batch sizes** (`BATCH_SIZE`, `EMBEDDING_BATCH_SIZE`, `CHUNK_BATCH_SIZE`, etc.) tune the trade-off between ingestion speed and memory consumption across embedding, graph construction, and search pipelines.
- **Neo4j GDS parameters** (`GDS_MEMORY_LIMIT`, `GDS_CONCURRENCY`) prevent out-of-memory errors during graph algorithms on large corpora.
- **Cache settings** (`CACHE_MAX_MEMORY_SIZE`, `CACHE_MAX_DISK_SIZE`) reduce redundant embedding calls and database queries through multi-tier caching.
- **Execution limits** (`AGENT_RECURSION_LIMIT`, `SEARCH_VECTOR_LIMIT`) bound computational cost and prevent runaway agent behavior.
- **Performance monitoring** via the `measure_performance` decorator provides lightweight latency tracking without external dependencies.

## Frequently Asked Questions

### How do I increase API request throughput for the GraphRAG Agent?

Increase `FASTAPI_WORKERS` to match your CPU core count (typically 2-8 workers) and raise `MAX_WORKERS` to allow more parallel batch processing in the ingestion pipeline. Monitor CPU utilization and memory usage, as higher concurrency increases resource pressure on the host machine.

### What batch size should I use for large document corpora?

For datasets exceeding 10,000 documents, increase `BATCH_SIZE` to 200 and `CHUNK_BATCH_SIZE` to 150 to maximize throughput. Adjust `EMBEDDING_BATCH_SIZE` based on your provider's API limits (typically 32-128), and keep `ENTITY_BATCH_SIZE` between 30 and 80 to prevent Neo4j transaction timeouts during graph construction.

### How can I prevent out-of-memory errors during graph construction?

Raise `GDS_MEMORY_LIMIT` to allocate more heap memory for Neo4j Graph Data Science operations (increase when processing >50,000 nodes). Lower `GDS_CONCURRENCY` to reduce parallel thread pressure, and set `GDS_NODE_COUNT_LIMIT` as a safety cap. Additionally, reduce `MAX_WORKERS` and `BATCH_SIZE` to decrease memory pressure during the ingestion phase.

### How do I monitor slow endpoints in production?

Apply the `@measure_performance` decorator from [`server/utils/performance.py`](https://github.com/1517005260/graph-rag-agent/blob/main/server/utils/performance.py) to any FastAPI endpoint. This decorator prints timestamped latency metrics to stdout (e.g., `API性能 - search_endpoint: 0.2371s`, where the prefix means "API performance"), so you can identify bottlenecks in production logs with only a timer read and a print call added per request and no external monitoring tools.