How to Configure Custom Chunking Strategies Using RAGFlow's Template-Based Chunker
Configure custom chunking strategies in RAGFlow by setting the parser_config dictionary with chunk_token_num, chunk_overlap, and delimiter parameters, which the system passes to the FirecrawlProcessor.chunk_content method in tools/firecrawl/firecrawl_processor.py.
RAGFlow provides a flexible template-based chunking system that lets you define exactly how documents are split into retrieval units without modifying core source code. By adjusting the parser_config object associated with a dataset, you control token limits, overlap windows, and logical delimiters that govern the chunking behavior implemented in the Firecrawl integration.
Understanding Template-Based Chunking Architecture
The template-based chunker is implemented in tools/firecrawl/firecrawl_processor.py within the FirecrawlProcessor class. Its chunk_content method accepts raw document text alongside configuration parameters and emits a list of JSON-serializable chunk objects containing id, content, and metadata fields.
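To make the chunk shape concrete, here is a minimal sketch of one chunk object round-tripping through JSON. Only the three top-level field names (id, content, metadata) come from the description above; the ID scheme and the keys inside metadata are hypothetical placeholders.

```python
import json

# One chunk with the three top-level fields described above.
chunk = {
    "id": "doc_001_chunk_0",  # hypothetical ID scheme
    "content": "First section text...",
    "metadata": {"source_id": "doc_001", "chunk_index": 0},  # hypothetical keys
}

# JSON-serializable means the list survives a dumps/loads round trip.
serialized = json.dumps([chunk])
restored = json.loads(serialized)
print(restored[0]["id"])
```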
The system relies on the parser_config dictionary stored at the dataset level. When you create or update a knowledge base, RAGFlow persists this configuration in the datasets table and retrieves it during task execution via api/db/services/document_service.py → get_chunking_config.
Key Configuration Parameters
RAGFlow recognizes the following fields inside parser_config to drive custom chunking strategies:
| Field | Purpose | Example Values |
|---|---|---|
| chunk_token_num | Maximum tokens per chunk (size limit). | 128, 512, 1024 |
| chunk_overlap | Tokens shared between consecutive chunks to preserve context. | 0, 50, 200 |
| delimiter | String or regex marking logical boundaries (paragraphs, headers, custom markers). | "\n\n", " ", "`##`" |
| layout_recognize | Layout parser for structured documents (optional). | "DeepDOC", "Plain Text" |
| children_delimiter | Hierarchical separator for nested sections (optional). | "`---`", "`##`" |
Default values for missing fields are supplied by api/utils/api_utils.py, ensuring backward compatibility when you omit optional parameters.
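The default-filling step can be pictured as a plain dictionary merge in which user-supplied keys win. The sketch below is an illustration only: `DEFAULT_PARSER_CONFIG` and `with_defaults` are hypothetical names, and the default values are placeholders rather than the values defined in api/utils/api_utils.py.

```python
from typing import Any, Dict, Optional

# Placeholder defaults for illustration; the real values live in
# api/utils/api_utils.py and may differ.
DEFAULT_PARSER_CONFIG: Dict[str, Any] = {
    "chunk_token_num": 512,
    "chunk_overlap": 0,
    "delimiter": "\n",
    "layout_recognize": "DeepDOC",
}


def with_defaults(parser_config: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a new dict in which user-supplied keys override the defaults."""
    merged = dict(DEFAULT_PARSER_CONFIG)
    merged.update(parser_config or {})
    return merged


effective = with_defaults({"chunk_token_num": 1024, "delimiter": "`##`"})
print(effective)
```

Copying the defaults before updating keeps the shared default dictionary unchanged between calls.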
How Configuration Flows Through the System
Understanding the data flow helps debug custom chunking strategies and verify that your settings reach the processor:
- Dataset Creation – You supply parser_config via the Python SDK, REST API, or web UI.
- Persistence – The backend stores the configuration in the datasets table.
- Task Retrieval – When a document is uploaded, api/db/services/document_service.py → get_chunking_config extracts the stored parameters.
- Execution – rag/svr/task_executor.py instantiates FirecrawlProcessor and invokes chunk_content with the retrieved chunk_token_num, chunk_overlap, and delimiter values.
- Low-level Processing – The processor utilizes token-aware merge utilities from rag/nlp/__init__.py (naive_merge, naive_merge_with_images) to respect token limits while honoring delimiter boundaries.
Because the chunker operates on plain text, you can implement domain-specific strategies—such as splitting on markdown headers, legal clause markers, or XML tags—purely through configuration changes.
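The idea of splitting on a delimiter and then merging pieces under a token budget can be sketched in a few lines. This is a simplified stand-in, not RAGFlow's naive_merge: `count_tokens` and `split_and_merge` are hypothetical helpers, tokens are approximated by whitespace-separated words, the delimiter is treated as a literal string, and overlap is ignored.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_and_merge(text: str, delimiter: str, chunk_token_num: int) -> List[str]:
    """Split on a literal delimiter, then greedily pack pieces into chunks."""
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n{piece}" if current else piece
        if current and count_tokens(candidate) > chunk_token_num:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "## Intro\none two three\n## Terms\nfour five six seven\n## Notes\neight"
for i, chunk in enumerate(split_and_merge(doc, "## ", chunk_token_num=6)):
    print(i, repr(chunk))
```

A single piece that is larger than the budget is kept whole in this sketch; a production chunker would need a further split for that case.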
Implementing Custom Chunking Strategies: Code Examples
Method 1: Configure via Python SDK
Use the ragflow_sdk package to create a dataset with a tailored parser_config:
from ragflow_sdk import RagflowClient

client = RagflowClient(
    base_url="http://localhost:8080",
    api_key="YOUR_API_KEY"
)

# Define custom chunking template
parser_cfg = {
    "chunk_token_num": 1024,
    "chunk_overlap": 200,
    "delimiter": "`##`",  # Split on custom ## markers
    "layout_recognize": "Plain Text"
}

# Create dataset with template-based chunking
kb = client.create_dataset(
    name="CustomChunkKB",
    description="Template-based chunking with custom delimiters",
    parser_config=parser_cfg,
    chunk_method="naive"  # Uses FirecrawlProcessor under the hood
)

print("Dataset ID:", kb.id)
Source: sdk/python/ragflow/client.py – https://github.com/infiniflow/ragflow/blob/main/sdk/python/ragflow/client.py
Method 2: Update via REST API
Modify an existing dataset’s chunking strategy by patching the parser_config:
PATCH /api/v1/datasets/<dataset_id>
Content-Type: application/json
Authorization: Bearer <token>

{
  "parser_config": {
    "chunk_token_num": 512,
    "delimiter": "\n\n",
    "chunk_overlap": 50,
    "layout_recognize": "DeepDOC"
  }
}
Source: api/apps/sdk/dataset.py – https://github.com/infiniflow/ragflow/blob/main/api/apps/sdk/dataset.py
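For scripted updates, the same request can be assembled with Python's standard library. The sketch below only builds the request object and does not send it. The method, path, and payload mirror the request shown above, the base URL is the one used in Method 1, and `build_update_request` is a hypothetical helper name. Check the method and path against the HTTP API reference for your RAGFlow version before relying on them.

```python
import json
import urllib.request


def build_update_request(base_url: str, dataset_id: str, token: str,
                         parser_config: dict) -> urllib.request.Request:
    """Build (but do not send) a request that updates a dataset's parser_config."""
    body = json.dumps({"parser_config": parser_config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/datasets/{dataset_id}",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


req = build_update_request(
    "http://localhost:8080", "DATASET_ID", "YOUR_API_KEY",
    {"chunk_token_num": 512, "delimiter": "\n\n", "chunk_overlap": 50,
     "layout_recognize": "DeepDOC"},
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```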
Method 3: Direct Processor Invocation
For advanced use cases—such as preprocessing documents outside the standard pipeline—instantiate the FirecrawlProcessor directly:
from rag.tools.firecrawl.firecrawl_processor import FirecrawlProcessor

processor = FirecrawlProcessor()

# Read the file with a context manager so the handle is closed promptly
with open("large_text.txt", "r", encoding="utf-8") as f:
    doc = {
        "id": "doc_001",
        "content": f.read(),
    }

# Override parser_config values at runtime
chunks = processor.chunk_content(
    document=doc,
    chunk_size=1500,
    chunk_overlap=300
)

# Report the size of each emitted chunk
for chunk in chunks:
    print(f"Chunk {chunk['id']}: {len(chunk['content'])} chars")
Source: tools/firecrawl/firecrawl_processor.py – https://github.com/infiniflow/ragflow/blob/main/tools/firecrawl/firecrawl_processor.py
Method 4: Verify Runtime Configuration
Inspect the effective configuration applied to a running task to ensure your custom settings were propagated correctly:
# Assumes `client` from Method 1 and the ID of a parsing task in `task_id`
task = client.get_task(task_id)
print("Effective chunk config:", task["parser_config"])
This output reflects the merged values after system defaults from api/utils/api_utils.py are applied to any missing fields.
Source: api/utils/api_utils.py (lines 363–408) – https://github.com/infiniflow/ragflow/blob/main/api/utils/api_utils.py
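When checking that settings were propagated, it can help to compare the configuration you requested with the one that is reported back. The helper below is a hypothetical sketch operating on plain dictionaries; the example values are made up and do not come from a real task.

```python
from typing import Any, Dict, Tuple


def config_diff(requested: Dict[str, Any],
                effective: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each requested key whose effective value differs to (requested, effective)."""
    diff: Dict[str, Tuple[Any, Any]] = {}
    for key, value in requested.items():
        actual = effective.get(key)  # None when the key is absent
        if actual != value:
            diff[key] = (value, actual)
    return diff


requested = {"chunk_token_num": 1024, "chunk_overlap": 200, "delimiter": "`##`"}
effective = {"chunk_token_num": 1024, "chunk_overlap": 0, "delimiter": "`##`",
             "layout_recognize": "DeepDOC"}
print(config_diff(requested, effective))
```

Keys that appear only in the effective configuration (here layout_recognize) are ignored, since they are expected to come from defaults.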
Summary
- Template-based chunking in RAGFlow is controlled by the parser_config dictionary stored at the dataset level, enabling custom strategies without code modifications.
- The FirecrawlProcessor.chunk_content method in tools/firecrawl/firecrawl_processor.py consumes chunk_token_num, chunk_overlap, and delimiter to generate chunks.
- Configuration flows from dataset creation → datasets table → document_service.py → task_executor.py → processor, with defaults supplied by api/utils/api_utils.py.
- You can implement domain-specific splitting (markdown headers, legal clauses, XML tags) by customizing the delimiter field while relying on token-aware merging from rag/nlp/__init__.py to enforce size limits.
Frequently Asked Questions
What is the maximum chunk size supported by RAGFlow?
RAGFlow does not enforce a hardcoded maximum chunk size in the template-based chunker; the limit is determined by the chunk_token_num parameter you provide in parser_config. In practice, values between 128 and 2048 tokens are common, but you should align this with your embedding model's context window to avoid truncation during vectorization.
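A quick sanity check before saving a configuration can catch a chunk size that exceeds the embedding model's limit. The function below is a hypothetical sketch, not part of RAGFlow; `model_max_tokens` is whatever limit your embedding model documents.

```python
from typing import Any, Dict, List


def validate_parser_config(cfg: Dict[str, Any], model_max_tokens: int) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    size = cfg.get("chunk_token_num")
    overlap = cfg.get("chunk_overlap", 0)
    if not isinstance(size, int) or size <= 0:
        problems.append("chunk_token_num must be a positive integer")
        return problems
    if size > model_max_tokens:
        problems.append(
            f"chunk_token_num={size} exceeds the embedding limit of {model_max_tokens}"
        )
    if not 0 <= overlap < size:
        problems.append("chunk_overlap must be >= 0 and smaller than chunk_token_num")
    return problems


print(validate_parser_config({"chunk_token_num": 2048, "chunk_overlap": 200}, 512))
print(validate_parser_config({"chunk_token_num": 512, "chunk_overlap": 50}, 8192))
```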
Can I use regular expressions as delimiters in template-based chunking?
Yes, the delimiter field in parser_config accepts regular expressions or literal strings that define logical split points. For example, you can use "`##`" to split on markdown headers, "\n\n" for paragraph breaks, or custom markers like "[SECTION]" to isolate specific clauses. If a marker that contains regex metacharacters is interpreted as a pattern, escape it (for example "\[SECTION\]"), because unescaped square brackets denote a character class. The FirecrawlProcessor applies these delimiters before enforcing the chunk_token_num limit.
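The difference between a literal marker and a regex pattern is easy to see with Python's own re module. This sketch demonstrates standard regex behavior only; how RAGFlow interprets a given delimiter value is not shown here.

```python
import re

text = "Preamble [SECTION] Clause one [SECTION] Clause two"

# Unescaped, "[SECTION]" is a character class that matches any ONE of the
# letters S, E, C, T, I, O, N, so the text is shredded at every such letter.
as_regex = re.split(r"[SECTION]", text)

# re.escape turns the marker into a literal pattern.
as_literal = [p.strip() for p in re.split(re.escape("[SECTION]"), text)]

print(len(as_regex))
print(as_literal)  # ['Preamble', 'Clause one', 'Clause two']
```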
How does chunk overlap affect retrieval accuracy?
The chunk_overlap parameter specifies how many tokens consecutive chunks share, which preserves context across boundaries. An overlap of 50–200 tokens helps keep sentences or concepts that straddle a boundary retrievable from either chunk, reducing the risk of losing critical context. However, excessive overlap increases storage costs and can introduce redundancy in search results.
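The storage cost of overlap can be estimated with simple arithmetic. The sketch below assumes an idealized fixed-size sliding window (every chunk exactly chunk_token_num tokens, no delimiter-driven early breaks), so real chunk counts will differ; `chunk_count` is a hypothetical helper.

```python
import math


def chunk_count(total_tokens: int, chunk_token_num: int, chunk_overlap: int) -> int:
    """Number of fixed-size sliding-window chunks needed to cover a document."""
    if not 0 <= chunk_overlap < chunk_token_num:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_token_num")
    if total_tokens <= chunk_token_num:
        return 1
    stride = chunk_token_num - chunk_overlap  # new tokens contributed per chunk
    return 1 + math.ceil((total_tokens - chunk_token_num) / stride)


for overlap in (0, 50, 200):
    n = chunk_count(total_tokens=10_000, chunk_token_num=512, chunk_overlap=overlap)
    print(f"overlap={overlap}: {n} chunks")
# overlap=0: 20 chunks
# overlap=50: 22 chunks
# overlap=200: 32 chunks
```

Under these assumptions, a 200-token overlap on 512-token chunks produces 60% more chunks than no overlap for the same 10,000-token document.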
Where are custom chunking configurations stored in RAGFlow?
Custom configurations are persisted in the datasets table of the RAGFlow database as the parser_config JSON column. When a document is processed, api/db/services/document_service.py retrieves these settings via get_chunking_config, and rag/svr/task_executor.py injects them into the FirecrawlProcessor at runtime. This architecture ensures that each dataset maintains its own independent chunking strategy.