
Scaling RAGFlow for Large Volumes of Documents and Queries: Architecture and Best Practices

February 23, 2026 infiniflow/ragflow ↗

To scale RAGFlow for enterprise-grade workloads involving tens of millions of documents and thousands of queries per second, scale the ingestion pipeline, embedding workers, and vector store shards horizontally, deploy stateless API replicas behind a load balancer, and add Redis caching and read-only vector store replicas.

RAGFlow by Infiniflow is architected as a modular micro-service platform designed for horizontal scalability across storage, embedding, retrieval, and API serving layers. When scaling RAGFlow for large volumes of documents and queries, administrators must tune worker pools, vector index configurations, and caching strategies across multiple backend services. This guide covers the specific configuration files, code paths, and deployment patterns required to achieve sub-second latency under massive concurrent load.

Horizontal Scaling Architecture Overview

RAGFlow decouples document ingestion, embedding generation, vector storage, and query execution into independent services that can be scaled separately. The platform supports three interchangeable vector backends—Infinity, Elasticsearch, and OceanBase—all implementing the common DocStoreBase interface in common/doc_store/doc_store_base.py.

Each scaling layer offers distinct configuration knobs:

Document Ingestion: Async worker pools in conf/service_conf.yaml
Embedding Generation: Batch processing via EMBEDDING_BATCH_SIZE in common/settings.py
Vector Retrieval: Sharding and connection pooling in common/doc_store/infinity_conn_pool.py and common/doc_store/es_conn_pool.py
Query Execution: Parallel workers in common/query_base.py
API Serving: Stateless Flask/Quart instances in api/apps

Scaling Document Ingestion and Embedding Generation

High-volume ingestion requires parallelizing both file processing and neural embedding generation.

Ingestion Worker Configuration

The ingestion API (/api/v1/document) streams files to a background worker pool defined in conf/service_conf.yaml. Increase worker_num and enable RabbitMQ or Kafka for distributed task queuing to handle bulk uploads.
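The worker-pool pattern behind a setting like worker_num can be sketched with a plain asyncio queue. This is a minimal illustration, not RAGFlow's ingestion code: ingest_worker, run_ingestion, and the document IDs are hypothetical names, and the sleep call stands in for the real parse, chunk, and embed work.

```python
import asyncio


async def ingest_worker(name, queue, results):
    """Pull document IDs off the queue until a None sentinel arrives."""
    while True:
        doc_id = await queue.get()
        if doc_id is None:              # sentinel: shut this worker down
            queue.task_done()
            break
        await asyncio.sleep(0)          # placeholder for parse/chunk/embed work
        results.append((name, doc_id))
        queue.task_done()


async def run_ingestion(doc_ids, worker_num=4):
    """Process doc_ids with worker_num concurrent workers."""
    queue = asyncio.Queue()
    results = []
    workers = [
        asyncio.create_task(ingest_worker(f"worker-{i}", queue, results))
        for i in range(worker_num)
    ]
    for doc_id in doc_ids:
        queue.put_nowait(doc_id)
    for _ in workers:                   # one sentinel per worker
        queue.put_nowait(None)
    await asyncio.gather(*workers)
    return results


processed = asyncio.run(
    run_ingestion([f"doc-{n}" for n in range(10)], worker_num=3)
)
```

Raising worker_num only helps while the downstream services keep up; a distributed queue such as RabbitMQ or Kafka replaces the in-process queue when workers run on separate hosts.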

Embedding Throughput Optimization

Embedding models are registered in conf/llm_factories.json. For CPU-only deployments, use lightweight transformers like sentence-transformers/all-MiniLM-L6-v2. For GPU throughput, configure multi-GPU servers and adjust EMBEDDING_BATCH_SIZE in common/settings.py.


# Optimize embedding throughput
from common import settings

settings.EMBEDDING_WORKER_NUM = 12      # Match CPU cores or GPU streams
settings.EMBEDDING_BATCH_SIZE = 256     # Chunks per inference batch

For GPU acceleration, enabling mixed-precision inference (torch.float16) can raise throughput considerably, approaching 2x on GPUs with fast half-precision support, without modifying the core rag/llm package logic.
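What a batch size setting controls can be shown with a small, model-free sketch. The helper names iter_batches and embed_all are hypothetical, and fake_embed is a placeholder for a real embedding call; only the slicing logic is the point.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(chunks: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(chunks), batch_size):
        yield list(chunks[start:start + batch_size])


def embed_all(chunks: Sequence[str],
              embed_fn: Callable[[List[str]], List[List[float]]],
              batch_size: int = 256) -> List[List[float]]:
    """Embed every chunk, calling embed_fn once per batch."""
    vectors: List[List[float]] = []
    for batch in iter_batches(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Placeholder model: one 1-dimensional vector per input text."""
    return [[float(len(text))] for text in batch]


chunks = [f"chunk {i}" for i in range(600)]
batch_sizes = [len(b) for b in iter_batches(chunks, 256)]  # [256, 256, 88]
vectors = embed_all(chunks, fake_embed, batch_size=256)
```

Larger batches amortize per-call overhead but increase peak memory, so the useful upper bound is set by the accelerator's memory rather than by the setting itself.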

Vector Store Sharding and Read Replicas

RAGFlow's vector storage layer supports horizontal sharding and read replica scaling to distribute search load.

Backend-Specific Sharding

In Infinity, configure num_shards in infinity_mapping.json. For Elasticsearch, use index routing keys (_routing) to ensure each tenant writes to dedicated shards, minimizing cross-tenant query overhead.
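The idea behind routing keys is that a stable hash of the key always selects the same shard. The sketch below illustrates that principle only: shard_for is a hypothetical helper, and SHA-256 is a stand-in, since Elasticsearch applies its own hash function internally.

```python
import hashlib


def shard_for(routing_key: str, num_shards: int) -> int:
    """Map a routing key to a shard index with a process-independent hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


tenants = ["tenant-a", "tenant-b", "tenant-c"]
placement = {tenant: shard_for(tenant, 4) for tenant in tenants}
```

Because the mapping depends on num_shards, changing the shard count moves keys to different shards, which is why shard counts are usually fixed when an index is created.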

Connection Pool Management

The DocStoreBase abstraction allows switching between production and read-only instances. When read_only=True is passed in the client configuration, RAGFlow routes queries to replica nodes while the primary handles write operations.


# Configure Infinity with sharding and replicas
infinity_cfg = {
    "hosts": ["infinity-01:443", "infinity-02:443"],
    "num_shards": 4,               # Horizontal data distribution
    "replicas": 2,                 # Read-only replicas for query scaling
    "use_ssl": True,
}

# Instantiate via DocStoreFactory.get_store(...)

Connection pooling logic resides in common/doc_store/infinity_conn_pool.py and common/doc_store/es_conn_pool.py, managing failover and load distribution across the shard cluster.
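The read/write split described in this section can be pictured as a small router that sends writes to the primary and rotates reads across replicas. ReadWriteRouter and the infinity-primary host name are hypothetical and written for illustration; this is not the logic in the connection pool modules.

```python
import itertools
from typing import Iterable


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary: str, replicas: Iterable[str]):
        replicas = list(replicas)
        self.primary = primary
        # Round-robin iterator; None means reads fall back to the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, read_only: bool) -> str:
        if read_only and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter(
    "infinity-primary:443",                      # hypothetical primary host
    ["infinity-01:443", "infinity-02:443"],
)
targets = [router.endpoint_for(read_only=True) for _ in range(3)]
write_target = router.endpoint_for(read_only=False)
```

Round-robin is the simplest policy; a production pool would also track replica health and skip nodes that fail.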

Query Execution and Caching Strategies

The QueryExecutor class in common/query_base.py orchestrates vector search and LLM reranking, supporting parallel execution through configurable worker_num parameters.

Parallel Query Workers

Launch multiple query workers to handle concurrent ANN searches and reranking operations. The executor automatically distributes retrieval tasks across available workers.
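Fanning a query out to several shards and merging the partial results is a standard scatter-gather pattern. The following sketch assumes in-memory shards and a brute-force dot-product search as a stand-in for a real ANN index; search_shard and parallel_search are hypothetical names, not part of RAGFlow.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_vec, k):
    """Brute-force top-k by dot product within one shard (stand-in for ANN)."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), doc_id)
        for doc_id, vec in shard.items()
    ]
    return heapq.nlargest(k, scored)


def parallel_search(shards, query_vec, k=2, worker_num=4):
    """Search every shard concurrently, then merge into a global top-k."""
    with ThreadPoolExecutor(max_workers=worker_num) as pool:
        partials = pool.map(lambda s: search_shard(s, query_vec, k), shards)
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)


shards = [
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
    {"c": [0.9, 0.1], "d": [0.2, 0.8]},
]
top = parallel_search(shards, [1.0, 0.0], k=2)
```

Each shard returns its own top k, so the merge step sees at most k times the number of shards candidates, which keeps the final selection cheap.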

Redis Query Caching

Enable Redis caching in conf/system_settings.json to store the embedding vectors and results of recent queries, avoiding repeated embedding calls and redundant approximate nearest neighbor (ANN) searches.


# Enable Redis caching for query results
import redis
from common import cache_utils

redis_client = redis.StrictRedis(
    host="redis",
    port=6379,
    db=0,
)
cache_utils.set_client(redis_client, ttl=300)   # 5-minute cache
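The read-through pattern that such a cache implements can be sketched without a running Redis server. TTLCache below is an in-memory stand-in that mimics only the get and setex calls of a Redis client; cache_key and cached_search are hypothetical helpers, and the key layout is an assumption.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory stand-in for a Redis client (get/setex semantics only)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:     # entry has expired
            del self._store[key]
            return None
        return value


def cache_key(tenant_id, query):
    """Normalize case and whitespace so trivially different queries share a key."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(f"{tenant_id}:{normalized}".encode("utf-8")).hexdigest()
    return f"query:{digest}"


def cached_search(cache, tenant_id, query, search_fn, ttl=300):
    """Return cached results when present; otherwise search and store them."""
    key = cache_key(tenant_id, query)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = search_fn(query)
    cache.setex(key, ttl, json.dumps(result))
    return result


calls = []


def slow_search(query):
    calls.append(query)                     # record real searches for inspection
    return ["doc-1", "doc-2"]


cache = TTLCache()
first = cached_search(cache, "tenant-a", "How do I scale?", slow_search)
second = cached_search(cache, "tenant-a", "how do  i scale?", slow_search)
```

Including the tenant ID in the key prevents one tenant's cached results from being served to another, and the TTL bounds how long stale results can outlive a document update.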

Define fallback retrieval paths (e.g., BM25 on Elasticsearch) in retrieval.py using the fallback flag for graceful degradation when the ANN index experiences high load.
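Graceful degradation of this kind amounts to trying the primary retriever and switching to a cheaper one when it fails. The sketch below is a generic illustration with hypothetical function names; it does not reproduce the fallback flag or the contents of retrieval.py.

```python
def retrieve_with_fallback(query, primary, fallback,
                           exceptions=(TimeoutError, ConnectionError)):
    """Try the primary retriever; use the fallback on the listed errors."""
    try:
        return primary(query), "primary"
    except exceptions:
        return fallback(query), "fallback"


def overloaded_ann(query):
    """Simulates an ANN index that cannot answer in time."""
    raise TimeoutError("ANN index is saturated")


def bm25_search(query):
    """Stand-in for a keyword (BM25) search."""
    return [f"keyword match for {query!r}"]


hits, path = retrieve_with_fallback("scaling", overloaded_ann, bm25_search)
```

Returning the path taken alongside the hits makes it easy to count fallbacks in metrics and notice when the ANN index is degrading.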

Stateless API Layer and Load Balancing

The Flask/Quart API layer in api/apps is deliberately stateless, requiring no session affinity. Deploy multiple instances behind an L4/L7 load balancer (NGINX, Envoy, or cloud-native load balancers) to distribute incoming requests.

Because all application state resides in the vector store and Redis cache, API instances can be scaled horizontally without coordination. Configure resource limits and replica counts in docker/docker-compose.yml:


# docker/docker-compose.yml scaling configuration
services:
  api:
    deploy:
      replicas: 4
      resources:
        limits:
          cpus: "2"
          memory: "4G"

Monitoring and Autoscaling Configuration

RAGFlow emits Prometheus-compatible metrics via common/log_utils.py, enabling integration with Grafana dashboards and Kubernetes Horizontal Pod Autoscaling (HPA).

Key Metrics to Monitor

Export and alert on:

request_latency_seconds (target: 95th percentile below 0.5 s)
embedding_qps and search_qps throughput rates
Worker pool saturation from service_conf.yaml statistics
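To make the latency target concrete, a 95th percentile can be computed from raw samples with the nearest-rank method. This is a worked illustration with invented sample values; Prometheus normally estimates percentiles from histogram buckets instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Request latencies in seconds (illustrative values only).
latencies = [0.12, 0.20, 0.18, 0.95, 0.22, 0.30, 0.25, 0.15, 0.40, 0.28]
p95 = percentile(latencies, 95)
breach = p95 > 0.5          # True when the 500 ms target is missed
```

A single slow request dominates the tail in a small sample, which is why alerts on p95 are usually evaluated over a window of several minutes.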

Configure HPA rules based on CPU, memory, or QPS thresholds to automatically scale API and embedding worker instances during traffic spikes.
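The scaling decision itself follows the formula given in the Kubernetes documentation: desired replicas equal the current replica count multiplied by the ratio of the current metric to the target, rounded up. The sketch below reproduces that arithmetic; the function name, the bounds, and the sample QPS figures are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """ceil(current_replicas * current/target), clamped to the replica bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:       # close enough to target: no change
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 API pods each seeing 900 QPS against a 500 QPS per-pod target.
scaled_up = desired_replicas(4, current_metric=900, target_metric=500)
# 520 QPS is within the 10% tolerance, so the replica count stays put.
steady = desired_replicas(4, current_metric=520, target_metric=500)
```

The tolerance band prevents the replica count from flapping when the metric hovers near the target.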

Summary

RAGFlow scales horizontally across ingestion, embedding, vector storage, and API layers using micro-service architecture patterns implemented in infiniflow/ragflow.
Configure worker pools in conf/service_conf.yaml and common/settings.py to parallelize document processing and embedding generation.
Shard vector stores using Infinity or Elasticsearch backends with read-only replicas to separate write and query traffic.
Deploy stateless API instances behind load balancers with Redis caching enabled to handle thousands of concurrent queries.
Monitor Prometheus metrics from common/log_utils.py to trigger autoscaling before latency degrades.

Frequently Asked Questions

How do I configure RAGFlow to handle millions of document chunks?

Configure worker_num in conf/service_conf.yaml to increase background ingestion workers, set EMBEDDING_BATCH_SIZE to 256 or higher in common/settings.py, and shard your vector store across multiple nodes using the num_shards parameter in infinity_mapping.json or Elasticsearch routing keys.

What is the best vector database backend for high-throughput RAGFlow deployments?

RAGFlow supports Infinity, Elasticsearch, and OceanBase through the DocStoreBase interface. Infinity offers native sharding and replica support via common/doc_store/infinity_conn_pool.py, while Elasticsearch provides mature routing and index management for multi-tenant scenarios.

How can I reduce query latency during traffic spikes?

Enable Redis caching by setting REDIS_HOST and REDIS_TTL in conf/system_settings.json to cache embedding vectors and query results. Deploy read-only vector store replicas and scale the QueryExecutor workers in common/query_base.py to process searches in parallel.

Does RAGFlow support Kubernetes autoscaling?

Yes. The API layer in api/apps is stateless and emits Prometheus metrics from common/log_utils.py, allowing you to configure Horizontal Pod Autoscaling based on CPU, memory, or custom QPS metrics. All state is externalized to Redis and the vector store, making pods safe to terminate and reschedule.
