# How RAGFlow Manages Document Metadata for Efficient Retrieval and Indexing

> Discover how RAGFlow manages document metadata for efficient retrieval and indexing. Leverage per-tenant search indices for millisecond-level filtering alongside vector search.

- Repository: [InfiniFlow/ragflow](https://github.com/infiniflow/ragflow)
- Tags: deep-dive
- Published: 2026-02-23

---

**RAGFlow stores document metadata in dedicated per-tenant search indices with automatic value normalization and partial update support, enabling millisecond-level metadata filtering alongside vector search.**

RAGFlow, the open-source RAG engine maintained by InfiniFlow, treats document-level metadata as a first-class searchable entity rather than as plain relational columns. This architecture enables hybrid retrieval that combines vector similarity with structured metadata filtering on either the Elasticsearch or the Infinity backend.

## Per-Tenant Index Architecture

### Dynamic Index Creation

RAGFlow creates metadata indices lazily using `DocMetadataService._get_doc_meta_index_name()` in [`api/db/services/doc_metadata_service.py`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/doc_metadata_service.py) (lines 40-55). When a tenant uploads their first document, the system generates an index name following the pattern `ragflow_doc_meta_<tenant_id>` and invokes `settings.docStoreConn.create_doc_meta_idx()` to initialize the schema on the fly.
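
The lazy-creation pattern can be sketched as follows. This is a minimal illustration, not RAGFlow's code: `InMemoryDocStore`, `doc_meta_index_name()`, and `ensure_doc_meta_index()` are hypothetical names, and the in-memory store stands in for the real document store connection. Only the `ragflow_doc_meta_<tenant_id>` naming pattern comes from the description above.

```python
from typing import Set

INDEX_PREFIX = "ragflow_doc_meta_"  # naming pattern described in the text


class InMemoryDocStore:
    """Hypothetical stand-in for the real document store connection."""

    def __init__(self) -> None:
        self.indices: Set[str] = set()

    def index_exists(self, name: str) -> bool:
        return name in self.indices

    def create_index(self, name: str) -> None:
        self.indices.add(name)


def doc_meta_index_name(tenant_id: str) -> str:
    """Resolve the tenant-specific metadata index name."""
    return f"{INDEX_PREFIX}{tenant_id}"


def ensure_doc_meta_index(store: InMemoryDocStore, tenant_id: str) -> str:
    """Create the tenant's metadata index on first use and return its name."""
    name = doc_meta_index_name(tenant_id)
    if not store.index_exists(name):
        store.create_index(name)
    return name


store = InMemoryDocStore()
first = ensure_doc_meta_index(store, "tenant42")
second = ensure_doc_meta_index(store, "tenant42")  # no-op: index already exists
```

Calling the helper twice for the same tenant creates the index only once, which is what makes it safe to invoke at the start of every operation.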

### Tenant Isolation

Each tenant receives a dedicated metadata index, preventing cross-tenant data leakage and enabling independent sharding strategies. This isolation is enforced throughout [`api/db/services/doc_metadata_service.py`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/doc_metadata_service.py), where every CRUD operation first resolves the tenant-specific index name before executing queries against the underlying document store.

## Metadata CRUD Operations

### Inserting Document Metadata

The `DocMetadataService.insert_document_metadata()` method (lines 45-84) handles metadata ingestion by constructing a JSON document containing `id`, `kb_id`, and `meta_fields`. The service automatically splits combined string values, such as names joined by the Chinese enumeration comma (e.g., "张三、李四"), into deduplicated arrays, ensuring consistent term-level indexing. For Elasticsearch backends, the operation includes an immediate index refresh so that the new metadata is searchable right away.

### Updating with Partial ES Updates

For Elasticsearch backends, RAGFlow uses the partial-update API via `DocMetadataService.update_document_metadata()` (lines 131-166) to modify only the changed `meta_fields` without reindexing the entire document. If the partial update fails, or if the backend is Infinity, the service falls back to a delete-then-insert strategy. All incoming values undergo the same splitting and deduplication logic applied during insertion.
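
The update-with-fallback flow might look roughly like the sketch below. `FakeStore`, `PartialUpdateError`, and `update_meta_fields()` are hypothetical names introduced for illustration, and the `supports_partial` flag merely imitates the difference between the two backends. Value normalization is omitted to keep the example short.

```python
from typing import Any, Dict, List


class PartialUpdateError(Exception):
    """Raised by the stand-in store when an in-place update is not possible."""


class FakeStore:
    """Hypothetical store; `supports_partial` mimics the backend difference."""

    def __init__(self, supports_partial: bool) -> None:
        self.supports_partial = supports_partial
        self.docs: Dict[str, Dict[str, Any]] = {}
        self.calls: List[str] = []  # records which operations ran

    def insert(self, doc: Dict[str, Any]) -> None:
        self.calls.append("insert")
        self.docs[doc["id"]] = doc

    def delete(self, doc_id: str) -> Dict[str, Any]:
        self.calls.append("delete")
        return self.docs.pop(doc_id)

    def partial_update(self, doc_id: str, fields: Dict[str, Any]) -> None:
        if not self.supports_partial:
            raise PartialUpdateError("partial update unavailable")
        self.calls.append("partial_update")
        self.docs[doc_id].update(fields)


def update_meta_fields(store: FakeStore, doc_id: str, meta_fields: Dict[str, Any]) -> str:
    """Try an in-place update first; fall back to delete-then-insert."""
    try:
        store.partial_update(doc_id, {"meta_fields": meta_fields})
        return "partial"
    except PartialUpdateError:
        old = store.delete(doc_id)  # keep the other fields (e.g. kb_id)
        store.insert({**old, "meta_fields": meta_fields})
        return "reinsert"


es_like = FakeStore(supports_partial=True)
es_like.insert({"id": "d1", "kb_id": "kb1", "meta_fields": {"author": ["Alice"]}})
es_mode = update_meta_fields(es_like, "d1", {"author": ["Bob"]})

infinity_like = FakeStore(supports_partial=False)
infinity_like.insert({"id": "d1", "kb_id": "kb1", "meta_fields": {"author": ["Alice"]}})
infinity_mode = update_meta_fields(infinity_like, "d1", {"author": ["Bob"]})
```

Both paths end with the same stored document; they differ only in how many store operations are needed to get there.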

### Deletion and Index Cleanup

The `DocMetadataService.delete_document_metadata()` method (lines 197-215) removes individual document metadata entries and automatically drops the entire tenant index when it becomes empty, preventing storage bloat in multi-tenant deployments.
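
The delete-then-cleanup behavior can be illustrated with plain dictionaries. In this sketch the outer dictionary plays the role of the document store and each inner dictionary is one tenant index; `delete_doc_meta()` is a hypothetical helper, not the service method itself.

```python
from typing import Dict

Index = Dict[str, dict]  # doc_id -> metadata document


def delete_doc_meta(indices: Dict[str, Index], tenant_id: str, doc_id: str) -> bool:
    """Remove one metadata entry; drop the tenant index once it is empty."""
    index_name = f"ragflow_doc_meta_{tenant_id}"
    index = indices.get(index_name)
    if index is None or doc_id not in index:
        return False  # nothing to delete
    del index[doc_id]
    if not index:  # last entry removed, so the index itself goes away
        del indices[index_name]
    return True


indices = {
    "ragflow_doc_meta_t1": {
        "d1": {"meta_fields": {"author": ["Alice"]}},
        "d2": {"meta_fields": {"author": ["Bob"]}},
    }
}
removed_first = delete_doc_meta(indices, "t1", "d1")
still_present = "ragflow_doc_meta_t1" in indices
removed_last = delete_doc_meta(indices, "t1", "d2")
```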

## Value Normalization and Deduplication

RAGFlow normalizes metadata values in [`common/metadata_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/metadata_utils.py) to handle real-world data inconsistencies. The system splits concatenated strings using locale-aware separators and deduplicates entries before indexing. This preprocessing ensures that filters like `author contains "Alice"` match documents regardless of whether the source listed authors as "Alice,Bob" or "Alice、Bob".
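
A minimal sketch of the split-and-deduplicate step is shown below. The separator set (ASCII and full-width commas and semicolons plus the enumeration comma `、`) is an assumption made for this example, and `split_and_dedupe()` is a hypothetical name; the actual rules in `common/metadata_utils.py` may differ.

```python
import re
from typing import Any, List

# Assumed separator set for illustration only.
_SEPARATORS = re.compile(r"[,，、;；]")


def split_and_dedupe(value: Any) -> List[Any]:
    """Split combined strings and drop duplicates, preserving first-seen order.

    Assumes values are hashable scalars (strings, numbers) or lists of them.
    """
    items = value if isinstance(value, list) else [value]
    seen = set()
    result: List[Any] = []
    for item in items:
        parts = _SEPARATORS.split(item) if isinstance(item, str) else [item]
        for part in parts:
            if isinstance(part, str):
                part = part.strip()
                if not part:
                    continue  # skip empty fragments such as "a,,b"
            if part not in seen:
                seen.add(part)
                result.append(part)
    return result


ascii_authors = split_and_dedupe("Alice,Bob")
cjk_authors = split_and_dedupe("Alice、Bob")
names = split_and_dedupe("张三、李四、张三")
```

Because both inputs normalize to the same list, a filter on `Alice` behaves identically regardless of which separator the source document used.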

## Metadata Filtering and Search

### Building Value-to-Document Maps

The `DocMetadataService._search_metadata()` method (lines 112-139) executes generic searches against the tenant's metadata index, returning results normalized via `_iter_search_results()` to handle variations between Infinity DataFrames, Elasticsearch responses, and plain lists. The service constructs an in-memory map of metadata values to document IDs (`metas`), enabling average-case O(1) lookups by value during filter evaluation.
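
One way to picture the value-to-document map is a nested dictionary keyed first by field name and then by value. The exact shape of `metas` in RAGFlow is not specified here, so the `{key: {value: [doc_ids]}}` layout and the `build_metas()` helper below are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Metas = Dict[str, Dict[str, List[str]]]


def build_metas(rows: Iterable[Dict[str, Any]]) -> Metas:
    """Invert per-document metadata into {key: {value: [doc_ids]}}."""
    metas: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for row in rows:
        doc_id = row["id"]
        for key, value in row.get("meta_fields", {}).items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                bucket = metas[key][str(v)]  # values are keyed as strings
                if doc_id not in bucket:
                    bucket.append(doc_id)
    # Convert to plain dicts so missing keys do not get created on lookup.
    return {key: dict(by_value) for key, by_value in metas.items()}


rows = [
    {"id": "d1", "meta_fields": {"author": ["Alice", "Bob"], "year": 2024}},
    {"id": "d2", "meta_fields": {"author": ["Alice"]}},
]
metas = build_metas(rows)
alice_docs = metas["author"]["Alice"]
```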

### Operator-Based Filtering

The `meta_filter()` function in [`common/metadata_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/metadata_utils.py) (lines 42-60) implements operator-based matching supporting equality (`=`), inequality (`≠`), containment (`contains`), prefix matching (`start with`), and more. This operates against the pre-built value-to-document map rather than scanning the entire index for each query.
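
Operator-based matching against such a map can be sketched with a small dispatch table. The `filter_docs()` function, the `{key: {value: [doc_ids]}}` map shape, and the exact semantics chosen for each operator are assumptions for this example; only the four operator names are taken from the text.

```python
from typing import Callable, Dict, List

Metas = Dict[str, Dict[str, List[str]]]

# Each operator compares a stored value with the value from the filter.
OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "=": lambda stored, wanted: stored == wanted,
    "≠": lambda stored, wanted: stored != wanted,
    "contains": lambda stored, wanted: wanted in stored,
    "start with": lambda stored, wanted: stored.startswith(wanted),
}


def filter_docs(metas: Metas, key: str, op: str, value: str) -> List[str]:
    """Return IDs of documents with at least one stored value matching the condition."""
    matches = OPERATORS[op]
    hits: List[str] = []
    for stored, doc_ids in metas.get(key, {}).items():
        if matches(stored, value):
            for doc_id in doc_ids:
                if doc_id not in hits:
                    hits.append(doc_id)
    return hits


metas = {"author": {"Alice": ["d1", "d2"], "Bob": ["d2"], "Alicia": ["d3"]}}
equal_hits = filter_docs(metas, "author", "=", "Alice")
prefix_hits = filter_docs(metas, "author", "start with", "Ali")
contains_hits = filter_docs(metas, "author", "contains", "ob")
```

Note that the loop visits each distinct value of one field, not each document, so the cost depends on how many distinct values the field has.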

### LLM-Driven Filter Generation

For advanced use cases, `apply_meta_data_filter()` in [`common/metadata_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/metadata_utils.py) (lines 62-84) supports three modes: automatic LLM-generated filters, semi-automatic key selection, and manual conditions. The automatic mode leverages [`rag/prompts/generator.py`](https://github.com/infiniflow/ragflow/blob/main/rag/prompts/generator.py) (`gen_meta_filter`) to translate natural language questions into structured filter objects, enabling conversational metadata retrieval without manual field specification.
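
The three modes can be thought of as a dispatch over where the filter conditions come from. In the sketch below, the mode strings (`"auto"`, `"semi_auto"`, `"manual"`), the condition shape, and `resolve_conditions()` are all hypothetical, and `fake_generate()` is a hard-coded stand-in for the LLM call so the example runs without a model.

```python
from typing import Callable, Dict, List, Optional

Condition = Dict[str, str]  # e.g. {"key": "author", "op": "=", "value": "Alice"}
Generator = Callable[[str, List[str]], List[Condition]]


def resolve_conditions(
    mode: str,
    question: str,
    available_keys: List[str],
    generate: Generator,
    selected_keys: Optional[List[str]] = None,
    manual: Optional[List[Condition]] = None,
) -> List[Condition]:
    """Pick filter conditions according to the configured mode."""
    if mode == "auto":
        # The generator may use any metadata key.
        return generate(question, available_keys)
    if mode == "semi_auto":
        # The user restricts which keys the generator may use.
        chosen = selected_keys or []
        return generate(question, [k for k in available_keys if k in chosen])
    if mode == "manual":
        # Conditions are supplied directly; no generation step.
        return list(manual or [])
    raise ValueError(f"unknown mode: {mode}")


def fake_generate(question: str, keys: List[str]) -> List[Condition]:
    """Stand-in for the LLM: emits an author filter when the question names Alice."""
    if "author" in keys and "Alice" in question:
        return [{"key": "author", "op": "=", "value": "Alice"}]
    return []


keys = ["author", "year"]
auto = resolve_conditions("auto", "Papers by Alice?", keys, fake_generate)
semi = resolve_conditions("semi_auto", "Papers by Alice?", keys, fake_generate, selected_keys=["year"])
manual = resolve_conditions(
    "manual", "", keys, fake_generate,
    manual=[{"key": "year", "op": "=", "value": "2024"}],
)
```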

## Schema Management

RAGFlow provides schema utilities in [`common/metadata_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/metadata_utils.py) to ensure consistent metadata handling across the UI and API. The `metadata_schema()` and `turn2jsonschema()` functions (lines 74-98 and 128-144) generate JSON Schema representations from user-provided field lists or existing schemas, facilitating form generation and validation in frontend applications.
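
To illustrate the idea of deriving a JSON Schema from a field list, the sketch below converts a list of field descriptors into an object schema. The descriptor shape (`key`, `type`, `description`, `enum`, `required`) and the function name `fields_to_json_schema()` are assumptions for this example and do not reflect the signatures of `metadata_schema()` or `turn2jsonschema()`.

```python
from typing import Any, Dict, List


def fields_to_json_schema(fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a JSON Schema object definition from simple field descriptors."""
    properties: Dict[str, Any] = {}
    required: List[str] = []
    for field in fields:
        prop: Dict[str, Any] = {"type": field.get("type", "string")}
        if "description" in field:
            prop["description"] = field["description"]
        if "enum" in field:
            prop["enum"] = list(field["enum"])
        properties[field["key"]] = prop
        if field.get("required"):
            required.append(field["key"])
    schema: Dict[str, Any] = {"type": "object", "properties": properties}
    if required:
        schema["required"] = required
    return schema


schema = fields_to_json_schema([
    {"key": "author", "description": "Document author", "required": True},
    {"key": "status", "enum": ["draft", "final"]},
    {"key": "year", "type": "integer"},
])
```

A frontend can render one form control per entry in `properties` and use `enum` to populate dropdowns.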

## Summary

- RAGFlow stores document metadata in **dedicated per-tenant search indices** (Elasticsearch or Infinity) rather than relational columns, enabling scalable metadata retrieval.
- The system uses **lazy index creation** via `DocMetadataService._get_doc_meta_index_name()` and automatically drops a tenant index once it becomes empty.
- **Partial update APIs** optimize metadata modifications in Elasticsearch, falling back to delete-insert for Infinity or failed updates.
- **Automatic value normalization** splits concatenated strings and deduplicates entries before indexing, ensuring consistent term-level matching.
- **In-memory value-to-document maps** and operator-based filtering in `meta_filter()` provide O(1) metadata lookups without full index scans.
- **LLM-driven filter generation** via `apply_meta_data_filter()` enables natural language metadata queries through [`rag/prompts/generator.py`](https://github.com/infiniflow/ragflow/blob/main/rag/prompts/generator.py).

## Frequently Asked Questions

### How does RAGFlow isolate metadata between different tenants?

RAGFlow creates a separate search index for each tenant using the naming convention `ragflow_doc_meta_<tenant_id>`. This physical isolation prevents cross-tenant data leakage and allows metadata storage to scale independently per tenant. The logic lives in [`api/db/services/doc_metadata_service.py`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/doc_metadata_service.py).

### What search backends does RAGFlow support for metadata storage?

RAGFlow abstracts metadata storage through `settings.docStoreConn`, supporting both Elasticsearch and Infinity backends. The system automatically handles backend-specific behaviors such as index refreshing in Elasticsearch and DataFrame result handling in Infinity.

### How does RAGFlow handle metadata updates without reindexing entire documents?

For Elasticsearch backends, RAGFlow uses the partial-update API via `DocMetadataService.update_document_metadata()` to modify only the changed `meta_fields`. If the partial update fails, or if the backend is Infinity, the service falls back to a delete-then-insert strategy to ensure consistency.

### Can RAGFlow automatically generate metadata filters from natural language queries?

Yes, through the `apply_meta_data_filter()` function in [`common/metadata_utils.py`](https://github.com/infiniflow/ragflow/blob/main/common/metadata_utils.py), RAGFlow supports an automatic mode that leverages [`rag/prompts/generator.py`](https://github.com/infiniflow/ragflow/blob/main/rag/prompts/generator.py) to translate natural language questions into structured filter conditions, enabling conversational metadata retrieval without manual field specification.