# Scaling RAGFlow for Large Volumes of Documents and Queries: Architecture and Best Practices

> Discover how to scale RAGFlow for massive document volumes and high query loads. Explore architecture, best practices, and key considerations for enterprise-grade performance.

- Repository: [InfiniFlow/ragflow](https://github.com/infiniflow/ragflow)
- Tags: architecture
- Published: 2026-02-23

---

**To scale RAGFlow for enterprise-grade workloads involving tens of millions of documents and thousands of queries per second, horizontally scale the ingestion pipeline, embedding workers, and vector store shards, deploy stateless API replicas behind a load balancer, and add Redis caching and read-only replicas.**

RAGFlow by InfiniFlow is a modular microservice platform designed to scale horizontally across its storage, embedding, retrieval, and API serving layers. Scaling it for large volumes of documents and queries means tuning worker pools, vector index configurations, and caching strategies across multiple backend services. This guide covers the configuration files, code paths, and deployment patterns involved in reaching sub-second latency under heavy concurrent load.

## Horizontal Scaling Architecture Overview

RAGFlow decouples document ingestion, embedding generation, vector storage, and query execution into independent services that can be scaled separately. The platform supports three interchangeable vector backends—**Infinity**, **Elasticsearch**, and **OceanBase**—all implementing the common `DocStoreBase` interface in [`common/doc_store/doc_store_base.py`](https://github.com/infiniflow/ragflow/blob/main/common/doc_store/doc_store_base.py).

Each scaling layer offers distinct configuration knobs:

- **Document Ingestion**: Async worker pools in [`conf/service_conf.yaml`](https://github.com/infiniflow/ragflow/blob/main/conf/service_conf.yaml)
- **Embedding Generation**: Batch processing via `EMBEDDING_BATCH_SIZE` in [`common/settings.py`](https://github.com/infiniflow/ragflow/blob/main/common/settings.py)
- **Vector Retrieval**: Sharding and connection pooling in [`common/doc_store/infinity_conn_pool.py`](https://github.com/infiniflow/ragflow/blob/main/common/doc_store/infinity_conn_pool.py) and [`common/doc_store/es_conn_pool.py`](https://github.com/infiniflow/ragflow/blob/main/common/doc_store/es_conn_pool.py)
- **Query Execution**: Parallel workers in [`common/query_base.py`](https://github.com/infiniflow/ragflow/blob/main/common/query_base.py)
- **API Serving**: Stateless Flask/Quart instances in `api/apps`

## Scaling Document Ingestion and Embedding Generation

High-volume ingestion requires parallelizing both file processing and neural embedding generation.

### Ingestion Worker Configuration

The ingestion API (`/api/v1/document`) streams files to a background worker pool defined in [`conf/service_conf.yaml`](https://github.com/infiniflow/ragflow/blob/main/conf/service_conf.yaml). Increase `worker_num` and enable **RabbitMQ** or **Kafka** for distributed task queuing to handle bulk uploads.
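The worker-pool pattern described above can be sketched with `asyncio`. This is a minimal illustration, not RAGFlow's actual ingestion code: `ingest_worker`, `run_ingestion`, and `ingest` are hypothetical names, and the parsing step is a placeholder.

```python
import asyncio


async def ingest_worker(name, queue, results):
    """Pull document paths off the shared queue until cancelled (hypothetical worker)."""
    while True:
        path = await queue.get()
        try:
            # Placeholder for real parsing and chunking work.
            await asyncio.sleep(0)
            results.append((name, path))
        finally:
            # Mark the item done even if processing failed, so join() can return.
            queue.task_done()


async def run_ingestion(paths, worker_num=4):
    """Process all paths with a fixed-size pool of worker tasks."""
    queue = asyncio.Queue()
    for path in paths:
        queue.put_nowait(path)

    results = []
    workers = [
        asyncio.create_task(ingest_worker(f"worker-{i}", queue, results))
        for i in range(worker_num)
    ]
    await queue.join()           # Wait until every queued path is processed.
    for worker in workers:
        worker.cancel()          # Workers loop forever, so stop them explicitly.
    await asyncio.gather(*workers, return_exceptions=True)
    return results


def ingest(paths, worker_num=4):
    """Synchronous entry point for the sketch."""
    return asyncio.run(run_ingestion(paths, worker_num))
```

Raising `worker_num` in this sketch increases concurrency only while the work is I/O-bound; CPU-heavy parsing would need separate processes or a distributed queue instead.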

### Embedding Throughput Optimization

Embedding models are registered in [`conf/llm_factories.json`](https://github.com/infiniflow/ragflow/blob/main/conf/llm_factories.json). For CPU-only deployments, use lightweight transformers like `sentence-transformers/all-MiniLM-L6-v2`. For GPU throughput, configure multi-GPU servers and adjust `EMBEDDING_BATCH_SIZE` in [`common/settings.py`](https://github.com/infiniflow/ragflow/blob/main/common/settings.py).

```python
# Optimize embedding throughput
from common import settings

settings.EMBEDDING_WORKER_NUM = 12      # Match CPU cores or GPU streams
settings.EMBEDDING_BATCH_SIZE = 256     # Chunks per inference batch
```

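For GPU acceleration, enabling mixed-precision inference (`torch.float16`) can substantially raise throughput on GPUs with half-precision support, without modifying the core `rag/llm` package logic. The actual gain depends on the model and hardware, so benchmark before and after the change.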
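To make the effect of the batch size concrete, here is a minimal sketch of batched embedding. The helper names `batched` and `embed_all` are hypothetical, and `embed_fn` stands in for whatever model call a deployment uses; none of this is taken from the RAGFlow codebase.

```python
from typing import Callable, Iterator, List, Sequence


def batched(chunks: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(chunks), batch_size):
        yield list(chunks[start:start + batch_size])


def embed_all(
    chunks: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 256,
) -> List[List[float]]:
    """Embed every chunk, calling embed_fn once per batch (hypothetical helper)."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Larger batches amortize per-call overhead but increase peak memory, so the right value depends on the model and the available GPU or CPU memory.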

## Vector Store Sharding and Read Replicas

RAGFlow's vector storage layer supports horizontal sharding and read replica scaling to distribute search load.

### Backend-Specific Sharding

In **Infinity**, configure `num_shards` in [`infinity_mapping.json`](https://github.com/infiniflow/ragflow/blob/main/infinity_mapping.json). For **Elasticsearch**, use index routing keys (`_routing`) to ensure each tenant writes to dedicated shards, minimizing cross-tenant query overhead.
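The idea behind tenant-based routing is that a stable hash of the routing key always selects the same shard. The sketch below illustrates that principle only: `shard_for` is a hypothetical helper, and it uses SHA-256 for simplicity, whereas Elasticsearch applies its own hash function to the `_routing` value.

```python
import hashlib


def shard_for(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard index deterministically (illustrative only)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping depends on `num_shards`, changing the shard count remaps most tenants, which is one reason shard counts are usually fixed at index creation time.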

### Connection Pool Management

The `DocStoreBase` abstraction allows switching between production and read-only instances. When `read_only=True` is passed in the client configuration, RAGFlow routes queries to replica nodes while the primary handles write operations.

```python
# Configure Infinity with sharding and replicas
infinity_cfg = {
    "hosts": ["infinity-01:443", "infinity-02:443"],
    "num_shards": 4,               # Horizontal data distribution
    "replicas": 2,                 # Read-only replicas for query scaling
    "use_ssl": True,
}

# Instantiate via DocStoreFactory.get_store(...)
```

Connection pooling logic resides in [`common/doc_store/infinity_conn_pool.py`](https://github.com/infiniflow/ragflow/blob/main/common/doc_store/infinity_conn_pool.py) and [`common/doc_store/es_conn_pool.py`](https://github.com/infiniflow/ragflow/blob/main/common/doc_store/es_conn_pool.py), managing failover and load distribution across the shard cluster.
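The read/write split described in this section can be sketched as a small router that sends reads to replicas in round-robin order and everything else to the primary. `ReadWriteRouter` is a hypothetical class written for illustration; it does not reflect the actual pool implementation in the files above.

```python
import itertools


class ReadWriteRouter:
    """Choose a backend endpoint per request (hypothetical illustration)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Cycle through replicas for reads; with no replicas, reads use the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def pick(self, read_only: bool):
        """Return a replica for read-only requests, otherwise the primary."""
        if read_only and self._replicas is not None:
            return next(self._replicas)
        return self.primary
```

A production pool would also track replica health and skip nodes that fail checks; the sketch omits failover to keep the routing rule visible.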

## Query Execution and Caching Strategies

The `QueryExecutor` class in [`common/query_base.py`](https://github.com/infiniflow/ragflow/blob/main/common/query_base.py) orchestrates vector search and LLM reranking, supporting parallel execution through configurable `worker_num` parameters.

### Parallel Query Workers

Launch multiple query workers to handle concurrent ANN searches and reranking operations. The executor automatically distributes retrieval tasks across available workers.
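One common way to distribute retrieval is to fan a query out to every shard in parallel and merge the per-shard results into a global top-k. The sketch below assumes each shard is a callable returning `(score, doc_id)` tuples; `search_all` is a hypothetical name, not the executor's real API.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_all(shards, query, top_k=5, worker_num=4):
    """Query every shard concurrently and merge hits by descending score."""
    with ThreadPoolExecutor(max_workers=worker_num) as pool:
        # Each shard callable takes (query, top_k) and returns a list of (score, doc_id).
        per_shard = list(pool.map(lambda search: search(query, top_k), shards))
    hits = [hit for result in per_shard for hit in result]
    return heapq.nlargest(top_k, hits, key=lambda hit: hit[0])
```

Each shard must return its own top-k for the merged result to be the true global top-k, which is why `top_k` is passed through unchanged.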

### Redis Query Caching

Enable Redis caching in [`conf/system_settings.json`](https://github.com/infiniflow/ragflow/blob/main/conf/system_settings.json) to store embedding vectors of recent queries and avoid redundant approximate nearest neighbor (ANN) searches.

```python
# Enable Redis caching for query results
import redis
from common import cache_utils

redis_client = redis.StrictRedis(
    host="redis",
    port=6379,
    db=0,
)
cache_utils.set_client(redis_client, ttl=300)   # 5-minute cache
```
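Independently of how the cache is wired in, the underlying cache-aside pattern is short enough to sketch. The helpers `cache_key` and `cached_search` are hypothetical; the only client methods used are `get` and `setex`, both of which redis-py provides, and the key prefix `ragq:` is an arbitrary choice.

```python
import hashlib
import json


def cache_key(tenant_id: str, query: str) -> str:
    """Build a stable key from the tenant and a whitespace/case-normalized query."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(f"{tenant_id}:{normalized}".encode("utf-8")).hexdigest()
    return f"ragq:{digest}"


def cached_search(client, tenant_id, query, search_fn, ttl=300):
    """Return cached results when present; otherwise search and cache for ttl seconds."""
    key = cache_key(tenant_id, query)
    raw = client.get(key)
    if raw is not None:
        return json.loads(raw)           # Cache hit: skip the vector search entirely.
    result = search_fn(query)
    client.setex(key, ttl, json.dumps(result))
    return result
```

Including the tenant in the key keeps one tenant's cached results from being served to another. Results must be JSON-serializable for this sketch to work.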

Define fallback retrieval paths (e.g., BM25 on Elasticsearch) in [`retrieval.py`](https://github.com/infiniflow/ragflow/blob/main/retrieval.py) using the `fallback` flag for graceful degradation when the ANN index experiences high load.
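Graceful degradation of this kind usually amounts to catching a timeout or connection failure on the primary path and retrying on a cheaper one. The sketch below is an assumption about how such a flag could behave: `retrieve`, `ann_search`, and `keyword_search` are hypothetical names, not functions from the RAGFlow repository.

```python
def retrieve(query, ann_search, keyword_search, fallback=True):
    """Try vector search first; on timeout or connection failure, use keyword search.

    Returns a (results, path) tuple so callers can log which path served the query.
    """
    try:
        return ann_search(query), "ann"
    except (TimeoutError, ConnectionError):
        if not fallback:
            raise                      # Fallback disabled: surface the original error.
        return keyword_search(query), "keyword"
```

Returning the path name alongside the results makes it easy to count fallback events, which is a useful signal that the ANN index is overloaded.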

## Stateless API Layer and Load Balancing

The Flask/Quart API layer in `api/apps` is deliberately stateless, requiring no session affinity. Deploy multiple instances behind an L4/L7 load balancer (NGINX, Envoy, or cloud-native load balancers) to distribute incoming requests.

Because all application state resides in the vector store and Redis cache, API instances can be scaled horizontally without coordination. Configure resource limits and replica counts in [`docker/docker-compose.yml`](https://github.com/infiniflow/ragflow/blob/main/docker/docker-compose.yml):

```yaml
# docker/docker-compose.yml scaling configuration
services:
  api:
    deploy:
      replicas: 4
      # resources must be nested under deploy for the limits to apply
      resources:
        limits:
          cpus: "2"
          memory: "4G"
```

## Monitoring and Autoscaling Configuration

RAGFlow emits Prometheus-compatible metrics via [`common/log_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/log_utils.py), enabling integration with Grafana dashboards and Kubernetes Horizontal Pod Autoscaling (HPA).

### Key Metrics to Monitor

Export and alert on:

- `request_latency_seconds` (target 95th percentile < 500ms)
- `embedding_qps` and `search_qps` throughput rates
- Worker pool saturation from [`service_conf.yaml`](https://github.com/infiniflow/ragflow/blob/main/service_conf.yaml) statistics

Configure HPA rules based on CPU, memory, or QPS thresholds to automatically scale API and embedding worker instances during traffic spikes.
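As a worked example of how such a rule behaves, the Kubernetes HPA computes the desired replica count as the current count scaled by the ratio of the observed metric to its target, rounded up. The sketch below reproduces that arithmetic together with a nearest-rank 95th percentile; the function names are hypothetical and the floor of one replica is an assumption of this sketch.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def desired_replicas(current_replicas, current_value, target_value):
    """HPA scaling rule: ceil(current_replicas * current_value / target_value)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return max(1, math.ceil(current_replicas * current_value / target_value))
```

For instance, with 4 replicas, an observed p95 latency of 0.75 s, and a 0.5 s target, `desired_replicas(4, 0.75, 0.5)` yields 6 replicas.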

## Summary

- **RAGFlow scales horizontally** across ingestion, embedding, vector storage, and API layers using micro-service architecture patterns implemented in `infiniflow/ragflow`.
- **Configure worker pools** in [`conf/service_conf.yaml`](https://github.com/infiniflow/ragflow/blob/main/conf/service_conf.yaml) and [`common/settings.py`](https://github.com/infiniflow/ragflow/blob/main/common/settings.py) to parallelize document processing and embedding generation.
- **Shard vector stores** using Infinity or Elasticsearch backends with read-only replicas to separate write and query traffic.
- **Deploy stateless API instances** behind load balancers with Redis caching enabled to handle thousands of concurrent queries.
- **Monitor Prometheus metrics** from [`common/log_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/log_utils.py) to trigger autoscaling before latency degrades.

## Frequently Asked Questions

### How do I configure RAGFlow to handle millions of document chunks?

Configure `worker_num` in [`conf/service_conf.yaml`](https://github.com/infiniflow/ragflow/blob/main/conf/service_conf.yaml) to increase background ingestion workers, set `EMBEDDING_BATCH_SIZE` to 256 or higher in [`common/settings.py`](https://github.com/infiniflow/ragflow/blob/main/common/settings.py), and shard your vector store across multiple nodes using the `num_shards` parameter in [`infinity_mapping.json`](https://github.com/infiniflow/ragflow/blob/main/infinity_mapping.json) or Elasticsearch routing keys.

### What is the best vector database backend for high-throughput RAGFlow deployments?

RAGFlow supports **Infinity**, **Elasticsearch**, and **OceanBase** through the `DocStoreBase` interface. Infinity offers native sharding and replica support via [`common/doc_store/infinity_conn_pool.py`](https://github.com/infiniflow/ragflow/blob/main/common/doc_store/infinity_conn_pool.py), while Elasticsearch provides mature routing and index management for multi-tenant scenarios.

### How can I reduce query latency during traffic spikes?

Enable Redis caching by setting `REDIS_HOST` and `REDIS_TTL` in [`conf/system_settings.json`](https://github.com/infiniflow/ragflow/blob/main/conf/system_settings.json) to cache embedding vectors and query results. Deploy read-only vector store replicas and scale the `QueryExecutor` workers in [`common/query_base.py`](https://github.com/infiniflow/ragflow/blob/main/common/query_base.py) to process searches in parallel.

### Does RAGFlow support Kubernetes autoscaling?

Yes. The API layer in `api/apps` is stateless and emits Prometheus metrics from [`common/log_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/log_utils.py), allowing you to configure Horizontal Pod Autoscaling based on CPU, memory, or custom QPS metrics. All state is externalized to Redis and the vector store, making pods safe to terminate and reschedule.