
How to Configure Custom Chunking Strategies Using RAGFlow's Template-Based Chunker

February 23, 2026 infiniflow/ragflow ↗

Configure custom chunking strategies in RAGFlow by setting the parser_config dictionary with chunk_token_num, chunk_overlap, and delimiter parameters, which the system passes to the FirecrawlProcessor.chunk_content method in tools/firecrawl/firecrawl_processor.py.

RAGFlow provides a flexible template-based chunking system that lets you define exactly how documents are split into retrieval units without modifying core source code. By adjusting the parser_config object associated with a dataset, you control token limits, overlap windows, and logical delimiters that govern the chunking behavior implemented in the Firecrawl integration.

Understanding Template-Based Chunking Architecture

The template-based chunker is implemented in tools/firecrawl/firecrawl_processor.py within the FirecrawlProcessor class. Its chunk_content method accepts raw document text alongside configuration parameters and emits a list of JSON-serializable chunk objects containing id, content, and metadata fields.
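To make the output shape concrete, here is a minimal sketch of a chunker that returns objects with id, content, and metadata fields. It is not the FirecrawlProcessor implementation: the function name `sketch_chunk_content`, the whitespace tokenization, and the metadata keys are assumptions chosen for illustration.

```python
import json


def sketch_chunk_content(doc_id, text, chunk_size=5, chunk_overlap=2):
    """Split text into overlapping windows of whitespace tokens (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "id": f"{doc_id}_chunk_{index}",
            "content": " ".join(window),
            # Hypothetical metadata keys; the real processor may record other fields.
            "metadata": {"start_token": start, "token_count": len(window)},
        })
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


chunks = sketch_chunk_content("doc_001", "a b c d e f g h", chunk_size=5, chunk_overlap=2)
print(json.dumps(chunks, indent=2))  # every chunk is plain JSON-serializable data
```

Because each chunk is a plain dictionary of strings and integers, it can be serialized or stored without custom encoders.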

The system relies on the parser_config dictionary stored at the dataset level. When you create or update a knowledge base, RAGFlow persists this configuration in the datasets table and retrieves it during task execution via api/db/services/document_service.py → get_chunking_config.

Key Configuration Parameters

RAGFlow recognizes the following fields inside parser_config to drive custom chunking strategies:

Field	Purpose	Example Values
`chunk_token_num`	Maximum tokens per chunk (size limit).	`128`, `512`, `1024`
`chunk_overlap`	Tokens shared between consecutive chunks to preserve context.	`0`, `50`, `200`
`delimiter`	String or regex marking logical boundaries (paragraphs, headers, custom markers).	`"\n\n"`, `" "`, "`##`"
`layout_recognize`	Layout parser for structured documents (optional).	`"DeepDOC"`, `"Plain Text"`
`children_delimiter`	Hierarchical separator for nested sections (optional).	"`---`", "`##`"

Default values for missing fields are supplied by api/utils/api_utils.py, ensuring backward compatibility when you omit optional parameters.
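The effect of default filling can be sketched as a dictionary merge. The default values below are placeholders invented for this example, not the values defined in api/utils/api_utils.py, and `apply_defaults` is a hypothetical helper name.

```python
# Placeholder defaults for illustration only; consult api/utils/api_utils.py for the real ones.
ASSUMED_DEFAULTS = {
    "chunk_token_num": 512,
    "chunk_overlap": 0,
    "delimiter": "\n",
}


def apply_defaults(parser_config, defaults=ASSUMED_DEFAULTS):
    """Return a new dict in which user-supplied values override the defaults."""
    merged = dict(defaults)             # copy so the defaults are never mutated
    merged.update(parser_config or {})  # user values win; None is treated as empty
    return merged


effective = apply_defaults({"chunk_token_num": 1024, "delimiter": "\n\n"})
print(effective)
```

In this sketch, a field you omit (here chunk_overlap) falls back to the default, while fields you set are passed through unchanged.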

How Configuration Flows Through the System

Understanding the data flow helps debug custom chunking strategies and verify that your settings reach the processor:

1. Dataset Creation – You supply parser_config via the Python SDK, REST API, or web UI.
2. Persistence – The backend stores the configuration in the datasets table.
3. Task Retrieval – When a document is uploaded, api/db/services/document_service.py → get_chunking_config extracts the stored parameters.
4. Execution – rag/svr/task_executor.py instantiates FirecrawlProcessor and invokes chunk_content with the retrieved chunk_token_num, chunk_overlap, and delimiter values.
5. Low-level Processing – The processor utilizes token-aware merge utilities from rag/nlp/__init__.py (naive_merge, naive_merge_with_images) to respect token limits while honoring delimiter boundaries.
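The final step, packing delimiter-separated pieces into chunks under a token budget, can be sketched as a greedy loop. This is a simplified stand-in, not the naive_merge code in rag/nlp/__init__.py: it counts whitespace-separated words instead of model tokens, ignores overlap, and `greedy_merge` and `count_tokens` are hypothetical names.

```python
def count_tokens(text):
    """Crude token count: whitespace-separated words (a real tokenizer would differ)."""
    return len(text.split())


def greedy_merge(text, delimiter="\n\n", chunk_token_num=8):
    """Split on the delimiter, then pack pieces into chunks under the token budget."""
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for piece in pieces:
        piece_tokens = count_tokens(piece)
        # Start a new chunk when adding this piece would exceed the budget.
        if current and current_tokens + piece_tokens > chunk_token_num:
            chunks.append(delimiter.join(current))
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += piece_tokens
    if current:
        chunks.append(delimiter.join(current))
    return chunks


sample = "one two three\n\nfour five six\n\nseven eight nine ten"
merged = greedy_merge(sample, delimiter="\n\n", chunk_token_num=8)
for chunk in merged:
    print(repr(chunk))
```

Note that this sketch never splits inside a piece, so a single piece larger than the budget becomes its own oversized chunk.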

Because the chunker operates on plain text, you can implement domain-specific strategies—such as splitting on markdown headers, legal clause markers, or XML tags—purely through configuration changes.
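As an illustration of a domain-specific boundary, the sketch below splits Markdown text at level-2 headers using Python's re module. It shows the splitting idea only; how RAGFlow interprets the delimiter string internally is not covered here, and `split_on_headers` is a hypothetical helper.

```python
import re


def split_on_headers(markdown_text):
    """Split before each line that starts with '## ', keeping the header with its section."""
    # The lookahead (?=## ) marks the split position without consuming the header.
    sections = re.split(r"(?m)^(?=## )", markdown_text)
    return [s.strip() for s in sections if s.strip()]


doc = "Intro text\n## Setup\nInstall it.\n## Usage\nRun it.\n### Details\nMore.\n"
sections = split_on_headers(doc)
for section in sections:
    print(repr(section))
```

Deeper headers such as `###` stay inside their parent section because the pattern requires exactly two hash characters followed by a space.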

Implementing Custom Chunking Strategies: Code Examples

Method 1: Configure via Python SDK

Use the ragflow_sdk package to create a dataset with a tailored parser_config:

from ragflow_sdk import RAGFlow  # the client class exported by the ragflow_sdk package

client = RAGFlow(
    api_key="YOUR_API_KEY",
    base_url="http://localhost:8080"
)

# Define custom chunking template
parser_cfg = {
    "chunk_token_num": 1024,
    "chunk_overlap": 200,
    "delimiter": "`##`",          # Split on custom ## markers
    "layout_recognize": "Plain Text"
}

# Create dataset with template-based chunking
kb = client.create_dataset(
    name="CustomChunkKB",
    description="Template-based chunking with custom delimiters",
    parser_config=parser_cfg,
    chunk_method="naive"          # Uses FirecrawlProcessor under the hood
)

print("Dataset ID:", kb.id)

Source: sdk/python/ragflow/client.py – https://github.com/infiniflow/ragflow/blob/main/sdk/python/ragflow/client.py

Method 2: Update via REST API

Modify an existing dataset’s chunking strategy by sending an update request with the new parser_config:

PUT /api/v1/datasets/<dataset_id>
Content-Type: application/json
Authorization: Bearer <token>

{
  "parser_config": {
    "chunk_token_num": 512,
    "delimiter": "\n\n",
    "chunk_overlap": 50,
    "layout_recognize": "DeepDOC"
  }
}

Source: api/apps/sdk/dataset.py – https://github.com/infiniflow/ragflow/blob/main/api/apps/sdk/dataset.py

Method 3: Direct Processor Invocation

For advanced use cases—such as preprocessing documents outside the standard pipeline—instantiate the FirecrawlProcessor directly:

from rag.tools.firecrawl.firecrawl_processor import FirecrawlProcessor

processor = FirecrawlProcessor()

# Read the source text; the context manager closes the file afterwards
with open("large_text.txt", "r", encoding="utf-8") as f:
    doc = {"id": "doc_001", "content": f.read()}

# Override parser_config values at runtime
chunks = processor.chunk_content(
    document=doc,
    chunk_size=1500,
    chunk_overlap=300
)

for chunk in chunks:
    print(f"Chunk {chunk['id']}: {len(chunk['content'])} chars")

Source: tools/firecrawl/firecrawl_processor.py – https://github.com/infiniflow/ragflow/blob/main/tools/firecrawl/firecrawl_processor.py

Method 4: Verify Runtime Configuration

Inspect the effective configuration applied to a running task to ensure your custom settings were propagated correctly:

# task_id is the ID of the parsing task you want to inspect
task = client.get_task(task_id)
print("Effective chunk config:", task["parser_config"])

This output reflects the merged values after system defaults from api/utils/api_utils.py are applied to any missing fields.

Source: api/utils/api_utils.py (lines 363–408) – https://github.com/infiniflow/ragflow/blob/main/api/utils/api_utils.py

Summary

Template-based chunking in RAGFlow is controlled by the parser_config dictionary stored at the dataset level, enabling custom strategies without code modifications.
The FirecrawlProcessor.chunk_content method in tools/firecrawl/firecrawl_processor.py consumes chunk_token_num, chunk_overlap, and delimiter to generate chunks.
Configuration flows from dataset creation → datasets table → document_service.py → task_executor.py → processor, with defaults supplied by api/utils/api_utils.py.
You can implement domain-specific splitting (markdown headers, legal clauses, XML tags) by customizing the delimiter field while relying on token-aware merging from rag/nlp/__init__.py to enforce size limits.

Frequently Asked Questions

What is the maximum chunk size supported by RAGFlow?

RAGFlow does not enforce a hardcoded maximum chunk size in the template-based chunker; the limit is determined by the chunk_token_num parameter you provide in parser_config. In practice, values between 128 and 2048 tokens are common, but you should align this with your embedding model's context window to avoid truncation during vectorization.
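One way to catch an oversized setting before ingestion is a small pre-flight check. The helper below is hypothetical and not part of RAGFlow; the embedding limit is a value you look up for your own model.

```python
def check_chunk_size(parser_config, embedding_max_tokens):
    """Return a list of warnings for a parser_config (an empty list means no issues found)."""
    warnings = []
    size = parser_config.get("chunk_token_num")
    overlap = parser_config.get("chunk_overlap", 0)
    if size is None or size <= 0:
        warnings.append("chunk_token_num must be a positive integer")
        return warnings
    if size > embedding_max_tokens:
        warnings.append(
            f"chunk_token_num {size} exceeds the embedding limit of {embedding_max_tokens} tokens"
        )
    if overlap >= size:
        warnings.append("chunk_overlap should be smaller than chunk_token_num")
    return warnings


too_large = check_chunk_size({"chunk_token_num": 2048, "chunk_overlap": 200}, embedding_max_tokens=512)
fits = check_chunk_size({"chunk_token_num": 512, "chunk_overlap": 50}, embedding_max_tokens=512)
print(too_large)
print(fits)
```

Running a check like this against each dataset's configuration makes truncation risks visible before any documents are embedded.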

Can I use regular expressions as delimiters in template-based chunking?

Yes, the delimiter field in parser_config accepts regular expressions or literal strings that define logical split points. For example, you can use "`##`" to split on markdown headers, "\n\n" for paragraph breaks, or custom markers like "[SECTION]" to isolate specific clauses. The FirecrawlProcessor applies these delimiters before enforcing the chunk_token_num limit.
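When a delimiter is interpreted as a regular expression, markers that contain metacharacters need escaping. The sketch below uses Python's re module to show the difference for "[SECTION]"; it demonstrates general regex behavior, not RAGFlow's internal delimiter handling.

```python
import re

text = "Clause A[SECTION]Clause B[SECTION]Clause C"

# Unescaped, "[SECTION]" is a character class matching any single one of S, E, C, T, I, O, N.
unescaped = re.split("[SECTION]", text)

# re.escape turns the marker into a literal pattern, so only the full marker splits the text.
escaped = re.split(re.escape("[SECTION]"), text)

print(escaped)
print(len(unescaped), "fragments without escaping")
```

If your marker contains characters such as `[`, `]`, `.`, or `*`, test it against sample text before applying it to a whole dataset.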

How does chunk overlap affect retrieval accuracy?

The chunk_overlap parameter specifies how many tokens consecutive chunks share, which preserves context across boundaries. Setting this to 50–200 tokens ensures that sentences or concepts split across chunks remain semantically connected during retrieval, reducing the risk of losing critical context. However, excessive overlap increases storage costs and can introduce redundancy in search results.
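The storage cost of overlap can be estimated with simple arithmetic. Assuming fixed-size sliding windows (a simplification, since delimiter-aware chunking produces variable sizes), each new chunk advances by chunk_token_num minus chunk_overlap tokens. The function name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(total_tokens, chunk_token_num, chunk_overlap):
    """Estimate how many fixed-size windows are needed to cover total_tokens."""
    if not 0 <= chunk_overlap < chunk_token_num:
        raise ValueError("need 0 <= chunk_overlap < chunk_token_num")
    if total_tokens <= chunk_token_num:
        return 1 if total_tokens > 0 else 0
    stride = chunk_token_num - chunk_overlap  # tokens of new text per additional chunk
    return 1 + math.ceil((total_tokens - chunk_token_num) / stride)


for overlap in (0, 50, 200):
    n = estimate_chunk_count(100_000, 512, overlap)
    print(f"overlap={overlap}: about {n} chunks")
```

The chunk count grows as the stride shrinks, which is why large overlaps relative to the chunk size inflate both index size and the number of near-duplicate search hits.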

Where are custom chunking configurations stored in RAGFlow?

Custom configurations are persisted in the datasets table of the RAGFlow database as the parser_config JSON column. When a document is processed, api/db/services/document_service.py retrieves these settings via get_chunking_config, and rag/svr/task_executor.py injects them into the FirecrawlProcessor at runtime. This architecture ensures that each dataset maintains its own independent chunking strategy.
