Scaling RAGFlow for Large Volumes of Documents and Queries: Architecture and Best Practices
To scale RAGFlow for enterprise-grade workloads involving tens of millions of documents and thousands of queries per second, horizontally scale the ingestion pipeline, embedding workers, and vector store shards, and deploy stateless API replicas behind a load balancer with Redis caching and read-only vector store replicas.
RAGFlow by Infiniflow is architected as a modular micro-service platform designed for horizontal scalability across storage, embedding, retrieval, and API serving layers. When scaling RAGFlow for large volumes of documents and queries, administrators must tune worker pools, vector index configurations, and caching strategies across multiple backend services. This guide covers the specific configuration files, code paths, and deployment patterns required to achieve sub-second latency under massive concurrent load.
Horizontal Scaling Architecture Overview
RAGFlow decouples document ingestion, embedding generation, vector storage, and query execution into independent services that can be scaled separately. The platform supports three interchangeable vector backends—Infinity, Elasticsearch, and OceanBase—all implementing the common DocStoreBase interface in common/doc_store/doc_store_base.py.
Each scaling layer offers distinct configuration knobs:
- Document Ingestion: Async worker pools in conf/service_conf.yaml
- Embedding Generation: Batch processing via EMBEDDING_BATCH_SIZE in common/settings.py
- Vector Retrieval: Sharding and connection pooling in common/doc_store/infinity_conn_pool.py and common/doc_store/es_conn_pool.py
- Query Execution: Parallel workers in common/query_base.py
- API Serving: Stateless Flask/Quart instances in api/apps
Scaling Document Ingestion and Embedding Generation
High-volume ingestion requires parallelizing both file processing and neural embedding generation.
Ingestion Worker Configuration
The ingestion API (/api/v1/document) streams files to a background worker pool defined in conf/service_conf.yaml. Increase worker_num and enable RabbitMQ or Kafka for distributed task queuing to handle bulk uploads.
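The effect of raising worker_num can be illustrated with a small standalone sketch. It is not RAGFlow's ingestion code: process_document and ingest are hypothetical names, and an in-process asyncio queue stands in for RabbitMQ or Kafka. It only shows how a fixed number of workers drains a shared backlog of uploads.

```python
import asyncio


async def process_document(doc_id: str) -> str:
    """Hypothetical stand-in for parsing and chunking one document."""
    await asyncio.sleep(0)  # yield control, simulating I/O-bound work
    return f"parsed:{doc_id}"


async def ingest(doc_ids, worker_num: int):
    """Drain a queue of document IDs with `worker_num` concurrent workers."""
    queue = asyncio.Queue()
    for doc_id in doc_ids:
        queue.put_nowait(doc_id)

    results = []

    async def worker() -> None:
        while True:
            try:
                doc_id = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # backlog is empty, so this worker exits
            results.append(await process_document(doc_id))

    # All workers pull from the same queue until it is empty.
    await asyncio.gather(*(worker() for _ in range(worker_num)))
    return results
```

Calling asyncio.run(ingest(["a", "b", "c"], worker_num=2)) processes all three documents with at most two in flight at once; completion order is not guaranteed.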
Embedding Throughput Optimization
Embedding models are registered in conf/llm_factories.json. For CPU-only deployments, use lightweight transformers like sentence-transformers/all-MiniLM-L6-v2. For GPU throughput, configure multi-GPU servers and adjust EMBEDDING_BATCH_SIZE in common/settings.py.
# Optimize embedding throughput
from common import settings
settings.EMBEDDING_WORKER_NUM = 12 # Match CPU cores or GPU streams
settings.EMBEDDING_BATCH_SIZE = 256 # Chunks per inference batch
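For GPU acceleration, enable mixed-precision inference (torch.float16), which can substantially raise throughput on GPUs with half-precision support, without modifying the core rag/llm package logic. Measure the gain and check embedding quality on your own model before relying on it.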
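To make the batch-size setting concrete, the following sketch shows what batching chunks before inference amounts to. The names batched and embed_all are illustrative and are not taken from the RAGFlow codebase; embed_batch is a placeholder for whatever model call a deployment actually uses.

```python
from typing import Callable, Iterator, List, Sequence


def batched(chunks: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(chunks), batch_size):
        yield list(chunks[start:start + batch_size])


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 256,
) -> List[List[float]]:
    """Embed every chunk, issuing one model call per batch."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With 600 chunks and a batch size of 256, the model is called three times (256, 256, and 88 chunks). Larger batches reduce per-call overhead but increase peak memory, so the right value depends on the model and hardware.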
Vector Store Sharding and Read Replicas
RAGFlow's vector storage layer supports horizontal sharding and read replica scaling to distribute search load.
Backend-Specific Sharding
In Infinity, configure num_shards in infinity_mapping.json. For Elasticsearch, use index routing keys (_routing) to ensure each tenant writes to dedicated shards, minimizing cross-tenant query overhead.
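The idea behind routing keys is that every document and query for a tenant maps to the same shard. The sketch below illustrates that mapping with a stable hash. It is an assumption-laden illustration: shard_for_tenant is a hypothetical helper, and Elasticsearch applies its own internal hash to _routing values rather than SHA-256.

```python
import hashlib


def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant ID to a shard index using a process-independent hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Python's built-in hash() is salted per process, so use a real digest
    # to keep the mapping stable across workers and restarts.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping depends on num_shards, changing the shard count moves tenants between shards, which is one reason to choose the count with future growth in mind.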
Connection Pool Management
The DocStoreBase abstraction allows switching between production and read-only instances. When read_only=True is passed in the client configuration, RAGFlow routes queries to replica nodes while the primary handles write operations.
# Configure Infinity with sharding and replicas
infinity_cfg = {
    "hosts": ["infinity-01:443", "infinity-02:443"],
    "num_shards": 4,   # Horizontal data distribution
    "replicas": 2,     # Read-only replicas for query scaling
    "use_ssl": True,
}
# Instantiate via DocStoreFactory.get_store(...)
Connection pooling logic resides in common/doc_store/infinity_conn_pool.py and common/doc_store/es_conn_pool.py, managing failover and load distribution across the shard cluster.
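The read/write split described in this section can be pictured as a small router that sends writes to the primary and rotates reads across replicas. This is a conceptual sketch only: ReadWriteRouter and endpoint_for are hypothetical names and do not reflect how the connection pool modules are actually written.

```python
import itertools
from typing import List


class ReadWriteRouter:
    """Pick an endpoint: primary for writes, round-robin replicas for reads."""

    def __init__(self, primary: str, replicas: List[str]):
        self.primary = primary
        # cycle() would loop forever on an empty list, so guard that case.
        self._replica_cycle = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, read_only: bool) -> str:
        if read_only and self._replica_cycle is not None:
            return next(self._replica_cycle)
        # Writes, and reads when no replica is configured, go to the primary.
        return self.primary
```

For example, a router built with ReadWriteRouter("infinity-01:443", ["infinity-02:443"]) returns the primary for endpoint_for(read_only=False) and the replica for endpoint_for(read_only=True). A production pool would also need health checks and failover, which this sketch omits.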
Query Execution and Caching Strategies
The QueryExecutor class in common/query_base.py orchestrates vector search and LLM reranking, supporting parallel execution through configurable worker_num parameters.
Parallel Query Workers
Launch multiple query workers to handle concurrent ANN searches and reranking operations. The executor automatically distributes retrieval tasks across available workers.
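One common way to distribute retrieval across workers is to search each shard in parallel and merge the partial results into a global top-k. The sketch below shows that pattern with the standard library. It is not the QueryExecutor implementation; search_shards and the (score, doc_id) hit format are assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Hit = Tuple[float, str]  # (similarity score, document ID); assumed format


def search_shards(
    query_vec: Sequence[float],
    shards: Sequence[str],
    search_fn: Callable[[str, Sequence[float], int], List[Hit]],
    top_k: int,
    worker_num: int,
) -> List[Hit]:
    """Query every shard concurrently, then keep the best `top_k` hits."""
    with ThreadPoolExecutor(max_workers=worker_num) as pool:
        partials = list(
            pool.map(lambda shard: search_fn(shard, query_vec, top_k), shards)
        )
    merged = [hit for hits in partials for hit in hits]
    # Each shard returns its own top_k, so the global top_k is among them.
    return heapq.nlargest(top_k, merged, key=lambda hit: hit[0])
```

Threads suit this workload because each shard search spends most of its time waiting on network I/O. Setting worker_num above the number of shards gives no further speedup for a single query.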
Redis Query Caching
Enable Redis caching in conf/system_settings.json to store the embedding vectors and results of recent queries, avoiding both repeated embedding calls and redundant approximate nearest neighbor (ANN) searches.
# Enable Redis caching for query results
import redis
from common import cache_utils

redis_client = redis.StrictRedis(
    host="redis",
    port=6379,
    db=0,
)
cache_utils.set_client(redis_client, ttl=300)  # 5-minute cache
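The caching behavior itself follows the usual cache-aside pattern, sketched below under stated assumptions. The functions cache_key and cached_search, the "ragq:" key prefix, and the JSON serialization are all hypothetical choices for illustration. The only client methods used are get and setex, which redis-py provides, so any object exposing those two methods works.

```python
import hashlib
import json


def cache_key(kb_id: str, query: str, top_k: int) -> str:
    """Build a deterministic key from the normalized query parameters."""
    payload = json.dumps(
        {"kb": kb_id, "q": query.strip().lower(), "k": top_k}, sort_keys=True
    )
    return "ragq:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_search(client, kb_id, query, top_k, search_fn, ttl=300):
    """Return cached results if present; otherwise search and cache them."""
    key = cache_key(kb_id, query, top_k)
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip the ANN search entirely
    result = search_fn(query, top_k)
    client.setex(key, ttl, json.dumps(result))  # expires after `ttl` seconds
    return result
```

Normalizing the query before hashing lets trivially different spellings share one entry. The TTL bounds staleness, so it should be shorter than the interval at which newly ingested documents must become visible in results.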
Define fallback retrieval paths (e.g., BM25 on Elasticsearch) in retrieval.py using the fallback flag for graceful degradation when the ANN index experiences high load.
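Graceful degradation of this kind can be sketched as a wrapper that tries the vector search first and falls back to keyword search when it fails. The function retrieve_with_fallback and its callable parameters are hypothetical and are not the contents of retrieval.py; which exceptions signal overload depends on the client library in use.

```python
from typing import Callable, List, Tuple


def retrieve_with_fallback(
    query: str,
    ann_search: Callable[[str, int], List[str]],
    keyword_search: Callable[[str, int], List[str]],
    top_k: int = 5,
) -> Tuple[List[str], str]:
    """Try ANN search; on timeout or connection failure, use keyword search.

    Returns the hits plus a label naming the path that produced them.
    """
    try:
        return ann_search(query, top_k), "ann"
    except (TimeoutError, ConnectionError):
        # Degrade to lexical retrieval (e.g. BM25) instead of failing.
        return keyword_search(query, top_k), "fallback"
```

Returning the path label makes it easy to count fallback events in metrics, which shows how often the ANN index is overloaded.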
Stateless API Layer and Load Balancing
The Flask/Quart API layer in api/apps is deliberately stateless, requiring no session affinity. Deploy multiple instances behind an L4/L7 load balancer (NGINX, Envoy, or cloud-native load balancers) to distribute incoming requests.
Because all application state resides in the vector store and Redis cache, API instances can be scaled horizontally without coordination. Configure resource limits and replica counts in docker/docker-compose.yml:
# docker/docker-compose.yml scaling configuration
services:
  api:
    deploy:
      replicas: 4
      resources:
        limits:
          cpus: "2"
          memory: "4G"
Monitoring and Autoscaling Configuration
RAGFlow emits Prometheus-compatible metrics via common/log_utils.py, enabling integration with Grafana dashboards and Kubernetes Horizontal Pod Autoscaling (HPA).
Key Metrics to Monitor
Export and alert on:
- request_latency_seconds (target 95th percentile < 500ms)
- embedding_qps and search_qps throughput rates
- Worker pool saturation from service_conf.yaml statistics
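As a reminder of what the 95th-percentile target means, the sketch below computes a nearest-rank percentile from raw latency samples. In practice Prometheus estimates quantiles from histogram buckets, so this hypothetical percentile helper is only an illustration of the definition, not of how the metric is produced.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with >= pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

A p95 under 500 ms means at most 5% of requests took longer than that, which is a stricter and more useful target than an average because it exposes tail latency.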
Configure HPA rules based on CPU, memory, or QPS thresholds to automatically scale API and embedding worker instances during traffic spikes.
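The scaling decision the HPA makes follows the formula from the Kubernetes documentation: desired replicas equal the ceiling of current replicas multiplied by the ratio of the current metric to its target, with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that arithmetic; the desired_replicas function and the example bounds are illustrative, and the real controller also applies stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling formula, clamped to the configured bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For instance, four API replicas serving 900 QPS each against a 600 QPS target give a ratio of 1.5, so the controller would ask for six replicas.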
Summary
- RAGFlow scales horizontally across ingestion, embedding, vector storage, and API layers using micro-service architecture patterns implemented in infiniflow/ragflow.
- Configure worker pools in conf/service_conf.yaml and common/settings.py to parallelize document processing and embedding generation.
- Shard vector stores using Infinity or Elasticsearch backends with read-only replicas to separate write and query traffic.
- Deploy stateless API instances behind load balancers with Redis caching enabled to handle thousands of concurrent queries.
- Monitor Prometheus metrics from common/log_utils.py to trigger autoscaling before latency degrades.
Frequently Asked Questions
How do I configure RAGFlow to handle millions of document chunks?
Configure worker_num in conf/service_conf.yaml to increase background ingestion workers, set EMBEDDING_BATCH_SIZE to 256 or higher in common/settings.py, and shard your vector store across multiple nodes using the num_shards parameter in infinity_mapping.json or Elasticsearch routing keys.
What is the best vector database backend for high-throughput RAGFlow deployments?
RAGFlow supports Infinity, Elasticsearch, and OceanBase through the DocStoreBase interface. Infinity offers native sharding and replica support via common/doc_store/infinity_conn_pool.py, while Elasticsearch provides mature routing and index management for multi-tenant scenarios.
How can I reduce query latency during traffic spikes?
Enable Redis caching by setting REDIS_HOST and REDIS_TTL in conf/system_settings.json to cache embedding vectors and query results. Deploy read-only vector store replicas and scale the QueryExecutor workers in common/query_base.py to process searches in parallel.
Does RAGFlow support Kubernetes autoscaling?
Yes. The API layer in api/apps is stateless and emits Prometheus metrics from common/log_utils.py, allowing you to configure Horizontal Pod Autoscaling based on CPU, memory, or custom QPS metrics. All state is externalized to Redis and the vector store, making pods safe to terminate and reschedule.