# How to Configure Custom Chunking Strategies Using RAGFlow's Template-Based Chunker

> Learn to configure custom chunking strategies in RAGFlow using template-based chunking. Control chunk size, overlap, and delimiters via parser_config for optimized RAG pipelines.

- Repository: [InfiniFlow/ragflow](https://github.com/infiniflow/ragflow)
- Tags: how-to-guide
- Published: 2026-02-23

---

**Configure custom chunking strategies in RAGFlow by setting the `parser_config` dictionary with `chunk_token_num`, `chunk_overlap`, and `delimiter` parameters, which the system passes to the `FirecrawlProcessor.chunk_content` method in [`tools/firecrawl/firecrawl_processor.py`](https://github.com/infiniflow/ragflow/blob/main/tools/firecrawl/firecrawl_processor.py).**

RAGFlow provides a flexible **template-based chunking** system that lets you define exactly how documents are split into retrieval units without modifying core source code. By adjusting the `parser_config` object associated with a dataset, you control token limits, overlap windows, and logical delimiters that govern the chunking behavior implemented in the Firecrawl integration.

## Understanding Template-Based Chunking Architecture

The template-based chunker is implemented in [`tools/firecrawl/firecrawl_processor.py`](https://github.com/infiniflow/ragflow/blob/main/tools/firecrawl/firecrawl_processor.py) within the `FirecrawlProcessor` class. Its `chunk_content` method accepts raw document text alongside configuration parameters and emits a list of JSON-serializable chunk objects containing `id`, `content`, and `metadata` fields.

The system relies on the **`parser_config`** dictionary stored at the dataset level. When you create or update a knowledge base, RAGFlow persists this configuration in the `datasets` table and retrieves it during task execution via [`api/db/services/document_service.py`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/document_service.py) → `get_chunking_config`.

### Key Configuration Parameters

RAGFlow recognizes the following fields inside `parser_config` to drive custom chunking strategies:

| Field | Purpose | Example Values |
|-------|---------|----------------|
| `chunk_token_num` | Maximum tokens per chunk (size limit). | `128`, `512`, `1024` |
| `chunk_overlap` | Tokens shared between consecutive chunks to preserve context. | `0`, `50`, `200` |
| `delimiter` | String or regex marking logical boundaries (paragraphs, headers, custom markers). | `"\n\n"`, `" "`, ``"`##`"`` |
| `layout_recognize` | Layout parser for structured documents (optional). | `"DeepDOC"`, `"Plain Text"` |
| `children_delimiter` | Hierarchical separator for nested sections (optional). | ``"`---`"``, ``"`##`"`` |

Default values for missing fields are supplied by [`api/utils/api_utils.py`](https://github.com/infiniflow/ragflow/blob/main/api/utils/api_utils.py), ensuring backward compatibility when you omit optional parameters.
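The defaulting behavior described above can be pictured as a plain dictionary merge. This is a minimal sketch and not the code in `api/utils/api_utils.py`: the `ASSUMED_DEFAULTS` values and the `with_defaults` helper are hypothetical, chosen only to show how user-supplied keys override defaults.

```python
from typing import Any, Dict

# Hypothetical defaults for illustration only; the real values live in
# api/utils/api_utils.py and may differ.
ASSUMED_DEFAULTS: Dict[str, Any] = {
    "chunk_token_num": 512,
    "chunk_overlap": 0,
    "delimiter": "\n",
    "layout_recognize": "DeepDOC",
}


def with_defaults(user_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which user-supplied keys override the assumed defaults."""
    merged = dict(ASSUMED_DEFAULTS)  # copy so the defaults are never mutated
    merged.update(user_cfg)
    return merged


effective = with_defaults({"chunk_token_num": 1024, "delimiter": "\n\n"})
print(effective)
```

Fields you omit (here `chunk_overlap` and `layout_recognize`) fall back to the default, while fields you set are passed through unchanged.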

## How Configuration Flows Through the System

Understanding the data flow helps debug custom chunking strategies and verify that your settings reach the processor:

1. **Dataset Creation** – You supply `parser_config` via the Python SDK, REST API, or web UI.
2. **Persistence** – The backend stores the configuration in the `datasets` table.
3. **Task Retrieval** – When a document is uploaded, [`api/db/services/document_service.py`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/document_service.py) → `get_chunking_config` extracts the stored parameters.
4. **Execution** – [`rag/svr/task_executor.py`](https://github.com/infiniflow/ragflow/blob/main/rag/svr/task_executor.py) instantiates `FirecrawlProcessor` and invokes `chunk_content` with the retrieved `chunk_token_num`, `chunk_overlap`, and `delimiter` values.
5. **Low-level Processing** – The processor utilizes token-aware merge utilities from [`rag/nlp/__init__.py`](https://github.com/infiniflow/ragflow/blob/main/rag/nlp/__init__.py) (`naive_merge`, `naive_merge_with_images`) to respect token limits while honoring delimiter boundaries.

Because the chunker operates on plain text, you can implement domain-specific strategies, such as splitting on markdown headers, legal clause markers, or XML tags, purely through configuration changes.
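To make the interplay of `delimiter`, `chunk_token_num`, and `chunk_overlap` concrete, here is a simplified stand-in for a delimiter-aware, token-limited merge. It is not RAGFlow's `naive_merge`: the `count_tokens` and `chunk_text` helpers are hypothetical, tokens are approximated by whitespace-separated words, and the delimiter is treated as a non-empty literal string.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunk_text(text: str, delimiter: str, chunk_token_num: int,
               chunk_overlap: int = 0) -> List[str]:
    """Split on a literal delimiter, then greedily pack pieces into chunks.

    The token limit is soft: a single piece longer than chunk_token_num is
    kept whole, and carried-over overlap words are not counted against it.
    """
    if chunk_overlap >= chunk_token_num:
        raise ValueError("chunk_overlap must be smaller than chunk_token_num")
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n{piece}" if current else piece
        if current and count_tokens(candidate) > chunk_token_num:
            chunks.append(current)
            # Carry the last chunk_overlap words into the next chunk.
            tail = current.split()[-chunk_overlap:] if chunk_overlap else []
            current = " ".join(tail + [piece])
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Intro text here\n\nSecond paragraph has five words\n\nThird one"
for i, chunk in enumerate(chunk_text(sample, "\n\n", chunk_token_num=8, chunk_overlap=2)):
    print(i, repr(chunk))
```

Pieces are never cut in the middle, so the delimiter decides where a chunk may end and the token limit decides how many pieces fit into one chunk.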

## Implementing Custom Chunking Strategies: Code Examples

### Method 1: Configure via Python SDK

Use the `ragflow_sdk` package to create a dataset with a tailored `parser_config`:

```python
from ragflow_sdk import RAGFlow

# Adjust the host and port to match your deployment
client = RAGFlow(
    api_key="YOUR_API_KEY",
    base_url="http://localhost:8080"
)

# Define custom chunking template
parser_cfg = {
    "chunk_token_num": 1024,
    "chunk_overlap": 200,
    "delimiter": "`##`",          # Split on custom ## markers
    "layout_recognize": "Plain Text"
}

# Create dataset with template-based chunking
# (some SDK versions expect a DataSet.ParserConfig object rather than a plain dict)
kb = client.create_dataset(
    name="CustomChunkKB",
    description="Template-based chunking with custom delimiters",
    parser_config=parser_cfg,
    chunk_method="naive"          # Uses FirecrawlProcessor under the hood
)

print("Dataset ID:", kb.id)
```

> **Source:** [`sdk/python/ragflow/client.py`](https://github.com/infiniflow/ragflow/blob/main/sdk/python/ragflow/client.py)

### Method 2: Update via REST API

Modify an existing dataset’s chunking strategy by patching the `parser_config`:

```http
PATCH /api/v1/datasets/<dataset_id>
Content-Type: application/json
Authorization: Bearer <token>

{
  "parser_config": {
    "chunk_token_num": 512,
    "delimiter": "\n\n",
    "chunk_overlap": 50,
    "layout_recognize": "DeepDOC"
  }
}
```

> **Source:** [`api/apps/sdk/dataset.py`](https://github.com/infiniflow/ragflow/blob/main/api/apps/sdk/dataset.py)
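If you prefer to issue the update from a script, the request above can be assembled with the Python standard library. This sketch only builds the request object and does not send it. The `build_update_request` helper, host, dataset ID, and token are placeholders, and the HTTP verb simply mirrors the example above, so check the HTTP API reference for your RAGFlow version before relying on it.

```python
import json
import urllib.request


def build_update_request(base_url: str, dataset_id: str, token: str,
                         parser_config: dict) -> urllib.request.Request:
    """Build (but do not send) the dataset update request shown above."""
    url = f"{base_url.rstrip('/')}/api/v1/datasets/{dataset_id}"
    body = json.dumps({"parser_config": parser_config}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="PATCH",  # mirrors the example above; may differ by version
    )


req = build_update_request(
    "http://localhost:8080",  # placeholder host
    "DATASET_ID",             # placeholder dataset ID
    "YOUR_API_KEY",           # placeholder token
    {"chunk_token_num": 512, "delimiter": "\n\n", "chunk_overlap": 50},
)
print(req.get_method(), req.full_url)
# To send it against a reachable server: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to log or inspect the exact payload before it reaches the server.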

### Method 3: Direct Processor Invocation

For advanced use cases, such as preprocessing documents outside the standard pipeline, instantiate the `FirecrawlProcessor` directly:

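```python
# Assumes the repository root is on sys.path so that the module at
# tools/firecrawl/firecrawl_processor.py is importable
from tools.firecrawl.firecrawl_processor import FirecrawlProcessor

processor = FirecrawlProcessor()

# Read the file inside a context manager so the handle is closed promptly
with open("large_text.txt", "r", encoding="utf-8") as f:
    doc = {
        "id": "doc_001",
        "content": f.read()
    }

# Pass size and overlap explicitly instead of reading them from parser_config
chunks = processor.chunk_content(
    document=doc,
    chunk_size=1500,
    chunk_overlap=300
)

for chunk in chunks:
    print(f"Chunk {chunk['id']}: {len(chunk['content'])} chars")
```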

> **Source:** [`tools/firecrawl/firecrawl_processor.py`](https://github.com/infiniflow/ragflow/blob/main/tools/firecrawl/firecrawl_processor.py)

### Method 4: Verify Runtime Configuration

Inspect the effective configuration applied to a running task to ensure your custom settings were propagated correctly:

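```python
# `client` is the SDK client created in Method 1
task_id = "YOUR_TASK_ID"  # placeholder: ID of the parsing task to inspect
task = client.get_task(task_id)
print("Effective chunk config:", task["parser_config"])
```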

This output reflects the merged values after system defaults from [`api/utils/api_utils.py`](https://github.com/infiniflow/ragflow/blob/main/api/utils/api_utils.py) are applied to any missing fields.

> **Source:** [`api/utils/api_utils.py`](https://github.com/infiniflow/ragflow/blob/main/api/utils/api_utils.py) (lines 363–408)
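Once you have the effective configuration as a dictionary, a small comparison makes mismatches obvious. The `config_diff` helper below is hypothetical and works on plain dictionaries only, so it makes no assumptions about how the effective configuration was fetched.

```python
from typing import Any, Dict, Tuple


def config_diff(requested: Dict[str, Any],
                effective: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (requested_value, effective_value)} for keys that differ.

    Keys missing from the effective config are reported with None.
    """
    missing = object()  # sentinel that never equals a real value
    diff: Dict[str, Tuple[Any, Any]] = {}
    for key, want in requested.items():
        got = effective.get(key, missing)
        if got != want:
            diff[key] = (want, None if got is missing else got)
    return diff


requested = {"chunk_token_num": 1024, "delimiter": "\n\n"}
effective_cfg = {"chunk_token_num": 512, "delimiter": "\n\n", "layout_recognize": "DeepDOC"}
print(config_diff(requested, effective_cfg))
```

An empty result means every value you requested arrived unchanged. Extra keys in the effective configuration, such as filled-in defaults, are ignored on purpose.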

## Summary

- **Template-based chunking** in RAGFlow is controlled by the `parser_config` dictionary stored at the dataset level, enabling custom strategies without code modifications.
- The `FirecrawlProcessor.chunk_content` method in [`tools/firecrawl/firecrawl_processor.py`](https://github.com/infiniflow/ragflow/blob/main/tools/firecrawl/firecrawl_processor.py) consumes `chunk_token_num`, `chunk_overlap`, and `delimiter` to generate chunks.
- Configuration flows from dataset creation → `datasets` table → [`document_service.py`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/document_service.py) → [`task_executor.py`](https://github.com/infiniflow/ragflow/blob/main/rag/svr/task_executor.py) → processor, with defaults supplied by [`api/utils/api_utils.py`](https://github.com/infiniflow/ragflow/blob/main/api/utils/api_utils.py).
- You can implement domain-specific splitting (markdown headers, legal clauses, XML tags) by customizing the `delimiter` field while relying on token-aware merging from [`rag/nlp/__init__.py`](https://github.com/infiniflow/ragflow/blob/main/rag/nlp/__init__.py) to enforce size limits.

## Frequently Asked Questions

### What is the maximum chunk size supported by RAGFlow?

RAGFlow does not enforce a hardcoded maximum chunk size in the template-based chunker; the limit is determined by the `chunk_token_num` parameter you provide in `parser_config`. In practice, values between `128` and `2048` tokens are common, but you should align this with your embedding model's context window to avoid truncation during vectorization.
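A quick pre-flight check can catch settings that conflict with your embedding model before any documents are parsed. This is a sketch under stated assumptions: `validate_chunk_settings` is a hypothetical helper, and `model_max_tokens` is a value you look up in your embedding model's documentation, not something read from RAGFlow.

```python
from typing import List


def validate_chunk_settings(chunk_token_num: int, chunk_overlap: int,
                            model_max_tokens: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if chunk_token_num <= 0:
        problems.append("chunk_token_num must be positive")
    if chunk_overlap < 0:
        problems.append("chunk_overlap must not be negative")
    if chunk_token_num > 0 and chunk_overlap >= chunk_token_num:
        problems.append("chunk_overlap must be smaller than chunk_token_num")
    if chunk_token_num > model_max_tokens:
        problems.append(
            f"chunk_token_num ({chunk_token_num}) exceeds the embedding "
            f"model's context window ({model_max_tokens})"
        )
    return problems


# Example: a 1024-token chunk against a model limited to 512 tokens
for problem in validate_chunk_settings(1024, 200, model_max_tokens=512):
    print("WARNING:", problem)
```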

### Can I use regular expressions as delimiters in template-based chunking?

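Yes, the `delimiter` field in `parser_config` accepts regular expressions or literal strings that define logical split points. For example, you can use ``"`##`"`` to split on markdown headers, `"\n\n"` for paragraph breaks, or custom markers like `"[SECTION]"` to isolate specific clauses. If a marker is interpreted as a regular expression, escape its metacharacters: an unescaped `[SECTION]` is a character class that matches any single one of those letters, so write `\[SECTION\]` to match the literal text. The `FirecrawlProcessor` applies these delimiters before enforcing the `chunk_token_num` limit.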

### How does chunk overlap affect retrieval accuracy?

The `chunk_overlap` parameter specifies how many tokens consecutive chunks share, which preserves context across boundaries. Setting this to `50`–`200` tokens helps a sentence or concept that straddles a chunk boundary appear intact in at least one chunk, reducing the risk of losing critical context at retrieval time. However, larger overlaps produce more chunks for the same document, which increases storage and indexing costs and can introduce near-duplicate search results.
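The cost side of that trade-off can be estimated with simple arithmetic. The sketch below assumes an idealized fixed-size sliding window, where each new chunk starts `chunk_token_num - chunk_overlap` tokens after the previous one. Real chunk counts will vary because delimiters move the boundaries, and `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(total_tokens: int, chunk_token_num: int, chunk_overlap: int) -> int:
    """Estimate the chunk count for a fixed-size sliding window with overlap."""
    if not 0 <= chunk_overlap < chunk_token_num:
        raise ValueError("need 0 <= chunk_overlap < chunk_token_num")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_token_num:
        return 1
    stride = chunk_token_num - chunk_overlap  # tokens of new text per chunk
    return math.ceil((total_tokens - chunk_overlap) / stride)


# A 10,000-token document with 512-token chunks
print(estimate_chunks(10_000, 512, 0))    # 20
print(estimate_chunks(10_000, 512, 50))   # 22
print(estimate_chunks(10_000, 512, 200))  # 32
```

In this idealized model, raising the overlap from 0 to 200 tokens increases the chunk count from 20 to 32, which is a useful rough figure when budgeting embedding calls and vector storage.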

### Where are custom chunking configurations stored in RAGFlow?

Custom configurations are persisted in the `datasets` table of the RAGFlow database as the `parser_config` JSON column. When a document is processed, [`api/db/services/document_service.py`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/document_service.py) retrieves these settings via `get_chunking_config`, and [`rag/svr/task_executor.py`](https://github.com/infiniflow/ragflow/blob/main/rag/svr/task_executor.py) injects them into the `FirecrawlProcessor` at runtime. This architecture ensures that each dataset maintains its own independent chunking strategy.