Performance Tuning Options for GraphRAG Agent: Complete Configuration Guide

You can optimize GraphRAG Agent throughput, latency, and resource consumption by adjusting environment variables for concurrency, batch processing, Neo4j GDS settings, and caching without modifying any source code.

The GraphRAG Agent repository (1517005260/graph-rag-agent) exposes all performance-critical parameters as environment variables, allowing you to balance speed, memory usage, and result quality for different deployment scenarios. This guide covers every tunable knob available in the .env.example file and explains how each setting affects the core pipelines for ingestion, embedding, graph construction, and search.

Concurrency and Parallelism Settings

The agent uses thread pools and worker processes to parallelize CPU-bound and I/O-bound tasks. Adjust these variables based on your available CPU cores and memory.

API-Level Concurrency

  • FASTAPI_WORKERS: Controls the number of FastAPI worker processes. Increasing this value raises requests-per-second capacity but consumes more CPU and RAM. Typical values range from 2 to 8 workers depending on core count.

  • MAX_WORKERS: Sets the global thread-pool size used by batch-processing classes such as EmbeddingManager and GraphStructureBuilder. A larger pool enables more parallel batches during ingestion but increases CPU pressure. This value is injected into ThreadPoolExecutor instances throughout the codebase, such as in graphrag_agent/integrations/build/incremental_update.py at line 51.
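The MAX_WORKERS pattern can be sketched as follows. This is a minimal illustration, not the repository's code: `embed_batch` is a hypothetical stand-in for an embedding or graph-write call, and the fallback of 4 workers is an assumed placeholder rather than the project's default.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def embed_batch(batch: list[str]) -> list[int]:
    """Hypothetical stand-in for an I/O-bound call such as an embedding request."""
    return [len(text) for text in batch]


def process_in_parallel(batches: list[list[str]]) -> list[list[int]]:
    # Pool size comes from MAX_WORKERS; "4" is an assumed fallback, not the repo default.
    max_workers = int(os.getenv("MAX_WORKERS", "4"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though batches run concurrently.
        return list(pool.map(embed_batch, batches))


results = process_in_parallel([["alpha", "beta"], ["gamma"]])
print(results)  # [[5, 4], [5]]
```

Because Python threads share the GIL, a larger pool mainly helps I/O-bound stages such as HTTP calls to an embedding service; CPU-bound stages gain less.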

Multi-Agent Orchestration

  • MA_WORKER_MAX_CONCURRENCY: Limits concurrent tasks when the multi-agent planner runs in parallel mode. Raising this improves throughput for complex queries but may spike memory consumption during map-reduce operations.
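One common way to enforce a limit like MA_WORKER_MAX_CONCURRENCY is an `asyncio.Semaphore`. The sketch below assumes the planner's tasks are coroutines; the repository may enforce the limit differently, and `run_task` with its sleep is a hypothetical placeholder for an LLM or retrieval call.

```python
import asyncio
import os

# Assumed fallback of 3; the repository's default may differ.
MAX_CONCURRENCY = int(os.getenv("MA_WORKER_MAX_CONCURRENCY", "3"))


async def run_task(task_id: int, sem: asyncio.Semaphore, stats: dict) -> str:
    async with sem:  # at most MAX_CONCURRENCY tasks get past this line at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for an LLM or retrieval call
        stats["active"] -= 1
        return f"task-{task_id} done"


async def run_all(n_tasks: int) -> tuple[list[str], int]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(run_task(i, sem, stats) for i in range(n_tasks)))
    return list(results), stats["peak"]


results, peak = asyncio.run(run_all(10))
print(peak)  # never exceeds MAX_CONCURRENCY
```

All ten tasks are scheduled immediately, but the semaphore caps how many hold resources at the same time, which is what bounds the memory spike during map-reduce fan-out.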

Batch Size Configuration for Throughput

Batch sizes directly impact the trade-off between throughput and memory usage. The repository provides granular control over different pipeline stages.

Ingestion and Embedding Batches

  • BATCH_SIZE: General catch-all batch size for internal processing loops. Recommended range: 50-200.

  • CHUNK_BATCH_SIZE: Specific to text-chunk ingestion and embedding generation. Recommended range: 80-150.

  • ENTITY_BATCH_SIZE: Controls entity-level CRUD operations in the graph writer. Recommended range: 30-80 to prevent transaction timeouts.

  • EMBEDDING_BATCH_SIZE: Number of vectors sent per HTTP call to the embedding service. Must respect provider API limits; typical range: 32-128.
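The effect of EMBEDDING_BATCH_SIZE on request count can be shown with a small batching helper. This is an illustrative sketch, not the repository's implementation; `batched` is a hypothetical helper name and 64 is an assumed fallback.

```python
import math
import os
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Assumed fallback of 64; set EMBEDDING_BATCH_SIZE to respect your provider's limit.
embedding_batch_size = int(os.getenv("EMBEDDING_BATCH_SIZE", "64"))

chunks = [f"chunk {i}" for i in range(1000)]
batches = list(batched(chunks, embedding_batch_size))

# Each batch is one HTTP call, so the request count is ceil(N / batch size).
assert len(batches) == math.ceil(len(chunks) / embedding_batch_size)
print(len(batches))  # 16 when EMBEDDING_BATCH_SIZE=64
```

Doubling the batch size roughly halves the number of calls but doubles the payload held in memory per call. On Python 3.12+, `itertools.batched` offers the same slicing behavior (yielding tuples).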

Search and LLM Batches

  • LLM_BATCH_SIZE: Number of prompts sent together to the LLM (used in fusion_agent). Range: 4-10 to balance latency and token usage.

  • COMMUNITY_BATCH_SIZE: Batch size for graph algorithm community detection. Range: 30-70.

  • GLOBAL_SEARCH_BATCH_SIZE: Number of communities processed together in each batch during global search. Range: 5-10.

  • HYBRID_SEARCH_BATCH_SIZE: Controls batching for hybrid (vector + graph) search requests. Range: 8-20. This value is read by graphrag_agent/search/tool/hybrid_tool.py at line 41 via HYBRID_SEARCH_SETTINGS["batch_size"].
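The guide notes that the hybrid tool reads HYBRID_SEARCH_SETTINGS["batch_size"], but not how that dict is populated. The sketch below shows one assumed way to build such settings from the environment; `int_setting` is a hypothetical helper, the defaults are placeholders, and clamping to the recommended ranges is an illustrative guardrail, not documented behavior.

```python
import os
from typing import Optional


def int_setting(name: str, default: int, low: int, high: int) -> int:
    """Read an integer env var, falling back to `default`, clamped to [low, high]."""
    raw: Optional[str] = os.getenv(name)
    value = default if raw is None or raw.strip() == "" else int(raw)
    return max(low, min(value, high))


# Ranges mirror the recommendations above; the defaults are placeholders.
LLM_BATCH_SIZE = int_setting("LLM_BATCH_SIZE", 5, 4, 10)
COMMUNITY_BATCH_SIZE = int_setting("COMMUNITY_BATCH_SIZE", 50, 30, 70)
GLOBAL_SEARCH_BATCH_SIZE = int_setting("GLOBAL_SEARCH_BATCH_SIZE", 5, 5, 10)
HYBRID_SEARCH_SETTINGS = {
    "batch_size": int_setting("HYBRID_SEARCH_BATCH_SIZE", 10, 8, 20),
}

print(HYBRID_SEARCH_SETTINGS["batch_size"])  # 10 unless the env var is set
```

Clamping silently overrides out-of-range values; logging a warning instead may be preferable if you deliberately run outside the recommended ranges.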

Neo4j Graph Data Science (GDS) Optimization

When running graph algorithms on large corpora, tune the Neo4j GDS parameters to prevent out-of-memory errors.

  • GDS_MEMORY_LIMIT: Sets the upper bound on GDS heap memory in gigabytes. Increase this when processing more than 50,000 nodes.

  • GDS_CONCURRENCY: Number of parallel GDS threads. Recommended: 2-8 depending on available CPU cores.

  • GDS_NODE_COUNT_LIMIT: Safety cap for node count in a single GDS run. Raise for large corpora but monitor for OOM errors.

  • GDS_TIMEOUT_SECONDS: Maximum wall-time for a GDS algorithm. Increase for deep community detection on dense graphs.

These values are consumed by the Neo4j service layer (e.g., server/services/kg_service.py) and passed to the GDS driver during algorithm invocation.
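A sketch of how these settings could gate and parameterize a GDS run is shown below. It is not the code in kg_service.py: `GdsSettings`, `preflight`, and `leiden_write_query` are hypothetical names, the fallback values are placeholders, Leiden is used only as an example algorithm, and treating GDS_MEMORY_LIMIT as a budget compared against a memory estimate is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GdsSettings:
    memory_limit_gb: float
    concurrency: int
    node_count_limit: int
    timeout_seconds: int


def load_gds_settings() -> GdsSettings:
    # Fallback values are placeholders, not the repository's defaults.
    return GdsSettings(
        memory_limit_gb=float(os.getenv("GDS_MEMORY_LIMIT", "6")),
        concurrency=int(os.getenv("GDS_CONCURRENCY", "4")),
        node_count_limit=int(os.getenv("GDS_NODE_COUNT_LIMIT", "200000")),
        timeout_seconds=int(os.getenv("GDS_TIMEOUT_SECONDS", "300")),
    )


def preflight(node_count: int, estimated_bytes_max: int, s: GdsSettings) -> None:
    """Refuse to run when the graph exceeds the node cap or the memory budget."""
    if node_count > s.node_count_limit:
        raise ValueError(f"{node_count} nodes exceeds GDS_NODE_COUNT_LIMIT={s.node_count_limit}")
    budget_bytes = s.memory_limit_gb * 1024**3
    if estimated_bytes_max > budget_bytes:
        raise MemoryError(f"estimate {estimated_bytes_max} B exceeds {s.memory_limit_gb} GB budget")


def leiden_write_query(graph_name: str, s: GdsSettings) -> tuple[str, dict]:
    """Build a Leiden write call that passes GDS_CONCURRENCY through as a parameter."""
    query = (
        "CALL gds.leiden.write($graph, "
        "{writeProperty: 'community', concurrency: $concurrency})"
    )
    return query, {"graph": graph_name, "concurrency": s.concurrency}
```

GDS offers `.estimate` variants of its procedures (for example `gds.leiden.write.estimate`) that report a `bytesMax` value, which is one way to obtain the estimate checked above. A per-query timeout could be applied with the Neo4j Python driver's `neo4j.Query(text, timeout=...)`.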

Cache and Memory Management

The agent implements a multi-tier caching system to reduce redundant embedding calls and database queries.

  • CACHE_MAX_MEMORY_SIZE: Upper bound of the in-memory vector cache in megabytes. Larger values reduce repeated embedding calls but increase RAM usage.

  • CACHE_MAX_DISK_SIZE: Disk-cache capacity for embeddings and query results. Size it to fit within available SSD space so the cache cannot fill the disk.

  • CACHE_ENABLE_VECTOR_SIMILARITY: Toggle for vector-similarity lookup caching. Disabling saves memory at the cost of extra compute.

  • CACHE_SIMILARITY_THRESHOLD: Similarity cutoff for cache hits. Higher thresholds result in fewer cache hits and more recomputation.

  • CACHE_MAX_VECTORS: Hard cap on stored vectors to control the memory footprint of the vector index.

All cache configuration is defined in .env.example lines 92-115 and consumed by graphrag_agent/cache_manager/__init__.py.
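To show how CACHE_SIMILARITY_THRESHOLD and CACHE_MAX_VECTORS interact, here is a toy similarity cache. It is an assumed design (cosine similarity with least-recently-used eviction), not the implementation in cache_manager; `SimilarityCache` is a hypothetical class, and its linear scan stands in for the vector index a real cache would use.

```python
import math
from collections import OrderedDict
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SimilarityCache:
    """Toy vector cache: a hit is the closest stored key with cosine >= threshold."""

    def __init__(self, similarity_threshold: float = 0.9, max_vectors: int = 1000):
        self.threshold = similarity_threshold  # role of CACHE_SIMILARITY_THRESHOLD
        self.max_vectors = max_vectors          # role of CACHE_MAX_VECTORS
        self._entries: "OrderedDict[tuple, str]" = OrderedDict()

    def get(self, vector: list[float]) -> Optional[str]:
        best_key, best_sim = None, -1.0
        for key in self._entries:  # O(n) scan; fine for a sketch only
            sim = cosine(vector, list(key))
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.threshold:
            self._entries.move_to_end(best_key)  # mark as recently used
            return self._entries[best_key]
        return None

    def put(self, vector: list[float], value: str) -> None:
        key = tuple(vector)
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_vectors:
            self._entries.popitem(last=False)  # evict the least recently used entry


cache = SimilarityCache(similarity_threshold=0.95, max_vectors=2)
cache.put([1.0, 0.0], "answer A")
print(cache.get([0.99, 0.05]))  # answer A (cosine is about 0.9987)
print(cache.get([0.0, 1.0]))    # None (orthogonal vector)
```

Raising the threshold makes near-duplicates less likely to match, which is why higher thresholds mean fewer hits and more recomputation.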

Search and Agent Execution Limits

Control result quality and computational bounds for search operations and agent recursion.

  • SEARCH_VECTOR_LIMIT: Maximum nearest-vector hits per query (typical: 5-20). Higher values improve recall.

  • SEARCH_TEXT_LIMIT: Maximum textual matches returned by BM25-style search (typical: 5-15).

  • SEARCH_SEMANTIC_TOP_K: Semantic search top-K (typical: 5-10).

  • LOCAL_SEARCH_TOP_COMMUNITIES: Number of communities returned for local graph search (typical: 3-8).

  • LOCAL_SEARCH_TOP_ENTITIES: Number of entity nodes returned per community (typical: 10-30).

  • AGENT_RECURSION_LIMIT: Prevents runaway LangGraph recursion (default: 5).

  • AGENT_CHUNK_SIZE: Number of message fragments fed to LangGraph per turn.

  • AGENT_STREAM_FLUSH_THRESHOLD: Character threshold for flushing streaming responses.

  • MA_PLANNER_MAX_TASKS: Upper bound on tasks a planner can emit in one step.

  • MA_MAX_TOKENS_PER_REDUCE: Token budget for each Reduce phase in Map-Reduce writing mode.

These limits are enforced in graphrag_agent/search/utils.py and the multi-agent dispatcher in graphrag_agent/agents/multi_agent.
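Two of these limits describe small algorithms that are easy to sketch: packing Map outputs into Reduce calls under MA_MAX_TOKENS_PER_REDUCE, and buffering streamed text until AGENT_STREAM_FLUSH_THRESHOLD characters accumulate. Both functions below are hypothetical illustrations, not the repository's dispatcher code; the whitespace token count is a crude stand-in for a real tokenizer, and the env fallbacks are placeholders.

```python
import os
from typing import Callable

reduce_budget = int(os.getenv("MA_MAX_TOKENS_PER_REDUCE", "4000"))  # placeholder default
flush_threshold = int(os.getenv("AGENT_STREAM_FLUSH_THRESHOLD", "40"))  # placeholder default


def approx_tokens(text: str) -> int:
    """Crude token estimate (whitespace words); a real tokenizer would differ."""
    return len(text.split())


def pack_for_reduce(map_outputs: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily group Map results so each Reduce call stays within the budget."""
    groups: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in map_outputs:
        cost = approx_tokens(text)
        if current and used + cost > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(text)  # an oversized single item still gets its own group
        used += cost
    if current:
        groups.append(current)
    return groups


class StreamBuffer:
    """Accumulate streamed text and emit it once the character threshold is reached."""

    def __init__(self, sink: Callable[[str], None], threshold: int):
        self.sink = sink
        self.threshold = threshold
        self._parts: list[str] = []
        self._size = 0

    def write(self, piece: str) -> None:
        self._parts.append(piece)
        self._size += len(piece)
        if self._size >= self.threshold:
            self.flush()

    def flush(self) -> None:
        if self._parts:
            self.sink("".join(self._parts))
            self._parts, self._size = [], 0


print(pack_for_reduce(["a b c", "d e", "f g h i"], max_tokens=6))
# [['a b c', 'd e'], ['f g h i']]
```

A lower flush threshold makes streaming feel more responsive at the cost of more, smaller writes to the client; call `flush()` once at the end of a turn to emit any remainder.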

Monitoring Performance with Built-in Tools

The repository includes a lightweight decorator for tracking endpoint latency without external dependencies.

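```python
from fastapi import APIRouter
from pydantic import BaseModel
from server.utils.performance import measure_performance

router = APIRouter()

class SearchRequest(BaseModel):  # placeholder schema; use the project's own request model
    query: str

@router.post("/search")
@measure_performance("search_endpoint")  # innermost, so the route registers the timed wrapper
async def search(request: SearchRequest):
    # search logic goes here; this placeholder just echoes the query
    result = {"query": request.query}
    return result
```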

The measure_performance decorator (located in server/utils/performance.py lines 5-26) prints latency measurements to stdout, such as API性能 - search_endpoint: 0.2371s ("API性能" means "API performance"). This makes slow paths easy to spot in production logs while adding only minimal overhead to each request.
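For readers who want to see how such a decorator can work, here is an illustrative sketch that mimics the documented log format. It is not the code in server/utils/performance.py; `measure_latency` and `fake_search` are hypothetical names, and logging in a `finally` block (so failed requests are timed too) is a design choice of this sketch.

```python
import asyncio
import functools
import time


def measure_latency(endpoint_name: str):
    """Sketch of an async latency decorator; not the repository's actual code."""
    def decorator(func):
        @functools.wraps(func)  # preserves __name__ and the signature FastAPI inspects
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"API性能 - {endpoint_name}: {elapsed:.4f}s")
        return wrapper
    return decorator


@measure_latency("search_endpoint")
async def fake_search(query: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for real search work
    return f"results for {query}"


print(asyncio.run(fake_search("graph")))
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and intended for measuring short intervals.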

Summary

  • Concurrency controls (FASTAPI_WORKERS, MAX_WORKERS, MA_WORKER_MAX_CONCURRENCY) scale API throughput and parallel processing but increase CPU and memory pressure.
  • Batch sizes (BATCH_SIZE, EMBEDDING_BATCH_SIZE, CHUNK_BATCH_SIZE, etc.) tune the trade-off between ingestion speed and memory consumption across embedding, graph construction, and search pipelines.
  • Neo4j GDS parameters (GDS_MEMORY_LIMIT, GDS_CONCURRENCY) prevent out-of-memory errors during graph algorithms on large corpora.
  • Cache settings (CACHE_MAX_MEMORY_SIZE, CACHE_MAX_DISK_SIZE) reduce redundant embedding calls and database queries through multi-tier caching.
  • Execution limits (AGENT_RECURSION_LIMIT, SEARCH_VECTOR_LIMIT) bound computational cost and prevent runaway agent behavior.
  • Performance monitoring via the measure_performance decorator provides lightweight latency tracking without external dependencies.

Frequently Asked Questions

How do I increase API request throughput for the GraphRAG Agent?

Increase FASTAPI_WORKERS to match your CPU core count (typically 2-8 workers) and raise MAX_WORKERS to allow more parallel batch processing in the ingestion pipeline. Monitor CPU utilization and memory usage, as higher concurrency increases resource pressure on the host machine.
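A minimal sketch of wiring FASTAPI_WORKERS into a uvicorn launch, capped at the host's core count, is shown below. The repository's own entry point may read this variable differently; `resolve_worker_count`, `serve`, the fallback of 2, and the `"server.main:app"` module path are all assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_worker_count(env: Mapping[str, str] = os.environ,
                         cpu_count: Optional[int] = None) -> int:
    """Read FASTAPI_WORKERS and cap it at the number of CPU cores."""
    cores = cpu_count or os.cpu_count() or 1
    requested = int(env.get("FASTAPI_WORKERS", "2"))  # placeholder fallback
    return max(1, min(requested, cores))


def serve() -> None:
    """Launch uvicorn with the resolved worker count (not called in this sketch)."""
    import uvicorn  # imported lazily so the helper above needs no extra dependency

    # "server.main:app" is a hypothetical module path; uvicorn requires an
    # import string (not an app object) when workers > 1.
    uvicorn.run("server.main:app", host="0.0.0.0", port=8000,
                workers=resolve_worker_count())
```

Capping at the core count reflects the advice above; on hosts shared with Neo4j or an embedding service, you may want to leave some cores free.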

What batch size should I use for large document corpora?

For datasets exceeding 10,000 documents, increase BATCH_SIZE to 200 and CHUNK_BATCH_SIZE to 150 to maximize throughput. Adjust EMBEDDING_BATCH_SIZE based on your provider's API limits (typically 32-128), and keep ENTITY_BATCH_SIZE between 30 and 80 to prevent Neo4j transaction timeouts during graph construction.

How can I prevent out-of-memory errors during graph construction?

Raise GDS_MEMORY_LIMIT to allocate more heap memory for Neo4j Graph Data Science operations (increase when processing >50,000 nodes). Lower GDS_CONCURRENCY to reduce parallel thread pressure, and set GDS_NODE_COUNT_LIMIT as a safety cap. Additionally, reduce MAX_WORKERS and BATCH_SIZE to decrease memory pressure during the ingestion phase.

How do I monitor slow endpoints in production?

Apply the @measure_performance decorator from server/utils/performance.py to any FastAPI endpoint. This decorator prints latency measurements to stdout (e.g., API性能 - search_endpoint: 0.2371s, where "API性能" means "API performance"), letting you identify bottlenecks in production logs with minimal per-request overhead and without external monitoring tools.
