# How the Text Chunking Mechanism Works in GraphRAG Agent: A Deep Dive into ChineseTextChunker

> Explore the text chunking mechanism in GraphRAG Agent. Learn how ChineseTextChunker splits documents into sentence-aware token chunks using a sliding-window algorithm and HanLP.

- Repository: [GLK/graph-rag-agent](https://github.com/1517005260/graph-rag-agent)
- Tags: deep-dive
- Published: 2026-02-22

---

**GraphRAG Agent uses the `ChineseTextChunker` class to split long Chinese documents into overlapping, sentence-aware token chunks using a sliding-window algorithm with HanLP tokenization.**

GraphRAG Agent is an open-source retrieval-augmented generation system optimized for Chinese knowledge bases. Its **text chunking mechanism** transforms raw documents into semantically coherent segments that preserve sentence boundaries while maintaining configurable overlap for context retention. The implementation centers on the `ChineseTextChunker` class located in [`graphrag_agent/pipelines/ingestion/text_chunker.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/text_chunker.py), which orchestrates tokenization, preprocessing, and intelligent boundary detection to prepare text for graph-based indexing.

## Core Architecture and Configuration

The chunking pipeline operates through three distinct logical stages governed by the `ChineseTextChunker` class. Each stage handles specific constraints related to tokenization limits, document length, and semantic boundary preservation.

### Tokenizer Initialization and Hyperparameters

The constructor (`__init__`) initializes the **HanLP** tokenizer using the `COARSE_ELECTRA_SMALL_ZH` pretrained model. It loads three critical configuration parameters from [`graphrag_agent/config/settings.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/config/settings.py):

- **`CHUNK_SIZE`**: The target number of tokens per chunk
- **`OVERLAP`**: The number of tokens shared between consecutive chunks to maintain context continuity
- **`MAX_TEXT_LENGTH`**: The maximum character length the tokenizer can safely process without memory or performance degradation

These parameters drive all subsequent chunking decisions and ensure the pipeline respects hardware constraints while optimizing for retrieval quality.
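As a rough illustration of how these three values might be read and sanity-checked, here is a minimal sketch. It is not the project's `settings.py`: the `ChunkerSettings` class, the `load_settings` helper, and the `MAX_TEXT_LENGTH` default of 500000 are hypothetical, while the `CHUNK_SIZE=500` and `OVERLAP=100` defaults follow the example later in this article.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkerSettings:
    """Hypothetical container for the three chunking parameters."""
    chunk_size: int        # target tokens per chunk
    overlap: int           # tokens shared between consecutive chunks
    max_text_length: int   # characters the tokenizer should receive at most


def load_settings(env: Mapping[str, str] = os.environ) -> ChunkerSettings:
    """Read the parameters from a mapping such as os.environ."""
    settings = ChunkerSettings(
        chunk_size=int(env.get("CHUNK_SIZE", "500")),
        overlap=int(env.get("OVERLAP", "100")),
        # Assumption: placeholder default, not taken from the repository.
        max_text_length=int(env.get("MAX_TEXT_LENGTH", "500000")),
    )
    # An overlap as large as the chunk itself would stop the window from advancing.
    if not 0 <= settings.overlap < settings.chunk_size:
        raise ValueError("OVERLAP must be non-negative and smaller than CHUNK_SIZE")
    return settings


print(load_settings({"CHUNK_SIZE": "300", "OVERLAP": "50"}))
```

Validating `OVERLAP < CHUNK_SIZE` up front is a design choice of this sketch: it turns a configuration mistake into an immediate error rather than a stalled chunking loop.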

### Preprocessing Documents That Exceed Maximum Length

When a document exceeds `MAX_TEXT_LENGTH`, the `_preprocess_large_text` method (lines 43-71) segments the raw string into smaller units before tokenization. The method implements a hierarchical splitting strategy:

1. **Paragraph Boundary Detection**: It first attempts to split on double-newline characters (`\n\n`) to preserve logical document structure
2. **Fallback to Line Breaks**: If the resulting paragraphs are too few or still too large, it falls back to single newline (`\n`) delimiters
3. **Character-Level Splitting**: For individual paragraphs that remain oversized, the `_split_long_paragraph` method (lines 104-138) applies character-level segmentation to guarantee no segment exceeds the tokenizer's safe operating threshold

This preprocessing is designed to keep oversized input away from the HanLP tokenizer, where it could cause out-of-memory errors or very long processing times.
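The hierarchical strategy can be sketched in a few lines of plain Python. This is an approximation under stated assumptions, not the repository's code: the function names are hypothetical, "too few paragraphs" is interpreted as "the double-newline split produced at most one part", and the character-level step uses fixed-size slices.

```python
from typing import List


def split_long_paragraph(paragraph: str, max_len: int) -> List[str]:
    """Character-level fallback: fixed-size slices of at most max_len characters."""
    return [paragraph[i:i + max_len] for i in range(0, len(paragraph), max_len)]


def preprocess_large_text(text: str, max_len: int) -> List[str]:
    """Split text into segments no longer than max_len characters."""
    if len(text) <= max_len:
        return [text]

    # Tier 1: paragraph boundaries (double newline).
    parts = [p for p in text.split("\n\n") if p.strip()]

    # Tier 2: fall back to single newlines when the paragraph split did not help.
    # Assumption: "did not help" means it produced at most one part.
    if len(parts) <= 1:
        parts = [p for p in text.split("\n") if p.strip()]

    # Tier 3: character-level slicing for parts that are still too long.
    segments: List[str] = []
    for part in parts:
        if len(part) <= max_len:
            segments.append(part)
        else:
            segments.extend(split_long_paragraph(part, max_len))
    return segments


sample = "第一段。" + "\n\n" + "很长的第二段" * 4
for segment in preprocess_large_text(sample, max_len=10):
    print(len(segment), segment)
```

Whatever the input looks like, every returned segment is at most `max_len` characters long, which is the property the tokenizer depends on.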

## The Chunk Generation Pipeline

Once preprocessed, segments flow through a tokenization and sliding-window assembly process that prioritizes sentence integrity over rigid token counts.

### Safe Tokenization with Fallback Strategies

The `_safe_tokenize` method (lines 65-82) sends each preprocessed segment to the HanLP tokenizer. If a segment still exceeds `MAX_TEXT_LENGTH` (which can occur with pathological input lacking whitespace), the method falls back to treating the text as a list of characters. This fallback keeps the output well-defined even when the neural tokenizer fails: the method always returns a list of tokens (or characters) that the downstream chunker can process.
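A minimal sketch of this fallback pattern follows. It takes the tokenizer as a plain callable so that it runs without HanLP installed; `safe_tokenize` and `two_char_tokenizer` are hypothetical names, and catching a tokenizer exception is an assumption about how a failure would surface.

```python
from typing import Callable, List


def safe_tokenize(
    text: str,
    tokenizer: Callable[[str], List[str]],
    max_len: int,
) -> List[str]:
    """Tokenize text, falling back to a list of characters when needed."""
    if not text:
        return []
    # Oversized input never reaches the tokenizer.
    if len(text) > max_len:
        return list(text)
    try:
        return tokenizer(text)
    except Exception:
        # Assumption: any tokenizer error triggers the character-level fallback.
        return list(text)


def two_char_tokenizer(text: str) -> List[str]:
    """Stand-in for a real word segmenter: cuts the text into 2-character pieces."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


print(safe_tokenize("学生管理手册", two_char_tokenizer, max_len=100))
print(safe_tokenize("学生管理手册", two_char_tokenizer, max_len=3))
```

In both cases the caller receives a list of strings whose concatenation equals the input, so the downstream chunker does not need to know which path was taken.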

### Sliding-Window Algorithm with Sentence-Aware Boundaries

The `chunk_text` method iterates over the preprocessed segments and delegates the chunking logic to `_chunk_single_segment`, which runs a while-loop that respects Chinese sentence-final punctuation:

1. **Initial Window Proposal**: The algorithm proposes a chunk ending at `start_pos + CHUNK_SIZE`
2. **Forward Boundary Adjustment**: If the proposed end is not the final token, `_find_next_sentence_end` searches forward for sentence terminators (「。」「！」「？」). The search allows extending the chunk up to `CHUNK_SIZE + 100` tokens to prevent splitting mid-sentence
3. **Chunk Extraction**: The selected token slice becomes a discrete chunk
4. **Overlap Calculation**: The next `start_pos` is calculated by stepping back `OVERLAP` tokens from the current end. The `_find_previous_sentence_end` method optionally snaps this position to the nearest preceding sentence boundary to ensure overlapping windows remain semantically coherent
5. **Iteration**: The loop continues until the token list is exhausted

Helper methods `_is_sentence_end`, `_find_next_sentence_end`, and `_find_previous_sentence_end` provide the punctuation-aware logic that distinguishes this implementation from naive character-counting chunkers.
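The steps above can be sketched as a standalone function over a token list. This is an approximation of the described behaviour, not the repository's implementation: the function names drop the leading underscore to mark them as illustrative, and two details are assumptions of the sketch, namely that the backward snap searches at most `overlap` additional tokens and that the window always advances by at least one token.

```python
from typing import List, Optional

SENTENCE_END = {"。", "！", "？"}
MAX_EXTENSION = 100  # forward search limit: CHUNK_SIZE + 100 tokens


def is_sentence_end(token: str) -> bool:
    return token in SENTENCE_END


def find_next_sentence_end(tokens: List[str], start: int, limit: int) -> Optional[int]:
    """Index just past the first terminator in tokens[start:limit], or None."""
    for i in range(start, min(limit, len(tokens))):
        if is_sentence_end(tokens[i]):
            return i + 1
    return None


def find_previous_sentence_end(tokens: List[str], pos: int, floor: int) -> Optional[int]:
    """Index just past the last terminator in tokens[floor:pos], or None."""
    for i in range(pos - 1, floor - 1, -1):
        if is_sentence_end(tokens[i]):
            return i + 1
    return None


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int,
                 max_extension: int = MAX_EXTENSION) -> List[List[str]]:
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[List[str]] = []
    start, n = 0, len(tokens)
    while start < n:
        # 1. Propose a window of chunk_size tokens.
        end = min(start + chunk_size, n)
        # 2. Extend forward to the next terminator, within the extension limit.
        if end < n and not is_sentence_end(tokens[end - 1]):
            nxt = find_next_sentence_end(tokens, end, start + chunk_size + max_extension)
            if nxt is not None:
                end = nxt
        # 3. Extract the chunk.
        chunks.append(tokens[start:end])
        if end >= n:
            break
        # 4. Step back by overlap tokens, always advancing by at least one token.
        candidate = max(end - overlap, start + 1)
        # Assumption: snap back at most `overlap` further tokens.
        floor = max(start + 1, candidate - overlap)
        snapped = find_previous_sentence_end(tokens, candidate, floor)
        start = snapped if snapped is not None else candidate
    return chunks


tokens = list("今天天气好。我们去公园！你来吗？好的。")  # one character per token
for chunk in chunk_tokens(tokens, chunk_size=8, overlap=3):
    print("".join(chunk))
```

With one character per token, the first proposed window would end inside 「我们去公园！」, so the sketch extends it to the 「！」 before cutting; the next window then starts three tokens before that cut.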

## Monitoring Chunking Statistics

Before processing large document collections, you can call `get_text_stats` (lines 86-103) to inspect how a specific document will be treated. This method returns metadata including whether preprocessing is required, the estimated number of chunks based on current settings, and the paragraph count. Use this to validate configuration parameters against your corpus characteristics before triggering expensive indexing operations.
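To make the idea of an estimated chunk count concrete, here is a small sketch of the arithmetic behind a sliding window. The function names and dictionary keys are hypothetical and do not reflect the actual return value of `get_text_stats`; the estimate also uses character count as a stand-in for token count and ignores sentence-boundary adjustments.

```python
import math
from typing import Dict, Union


def estimate_chunk_count(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of windows needed when each window advances by chunk_size - overlap."""
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((n_tokens - chunk_size) / step)


def text_stats(text: str, chunk_size: int, overlap: int,
               max_text_length: int) -> Dict[str, Union[int, bool]]:
    """Hypothetical summary of how a document would be treated."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "text_length": len(text),
        "needs_preprocessing": len(text) > max_text_length,
        "paragraph_count": len(paragraphs),
        # Assumption: characters approximate tokens for a rough estimate.
        "estimated_chunks": estimate_chunk_count(len(text), chunk_size, overlap),
    }


# 2300 tokens, CHUNK_SIZE=500, OVERLAP=100: windows start at 0, 400, ..., 2000.
print(estimate_chunk_count(2300, chunk_size=500, overlap=100))  # prints 6
print(text_stats("第一段。\n\n第二段。", 500, 100, 500000))
```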

## Practical Implementation Example

The following example demonstrates initializing the chunker and processing a Chinese document:

```python
from pathlib import Path

from graphrag_agent.pipelines.ingestion.text_chunker import ChineseTextChunker

# Load a sample document as plain text. Path.read_text() cannot parse a PDF,
# so extract the text first if the source document is a PDF.
doc_path = Path("/path/to/华东理工大学学生管理手册.txt")
raw_text = doc_path.read_text(encoding="utf-8")

# Initialise the chunker (uses env defaults: CHUNK_SIZE=500, OVERLAP=100)
chunker = ChineseTextChunker()

# Produce overlapping token chunks
chunks = chunker.chunk_text(raw_text)

print(f"Total chunks: {len(chunks)}")
print("First chunk (tokens → string):")
print("".join(chunks[0]))
```

`chunk_text` returns a list of token lists that can be joined into strings for vector indexing. The output below is illustrative; the actual chunk count and text depend on the document:

```text
Total chunks: 23
First chunk (tokens → string):
根据《学生管理条例》第三条，学生应……
```

Downstream components in [`graphrag_agent/integrations/build/build_chunk_index.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/integrations/build/build_chunk_index.py) consume these chunks to construct vector indices for similarity search, while [`graphrag_agent/pipelines/ingestion/document_processor.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/document_processor.py) demonstrates batch processing entire document directories with metadata attachment.

## Summary

- **ChineseTextChunker** in [`graphrag_agent/pipelines/ingestion/text_chunker.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/text_chunker.py) implements the complete text chunking mechanism using the HanLP tokenizer (`COARSE_ELECTRA_SMALL_ZH`)
- **Three-stage pipeline**: Configuration loading, large-document preprocessing via `_preprocess_large_text`, and sentence-aware sliding-window chunking via `_chunk_single_segment`
- **Sentence boundary preservation**: The algorithm recognizes Chinese terminators (「。」「！」「？」) and adjusts chunk boundaries up to 100 tokens beyond `CHUNK_SIZE` to avoid splitting sentences
- **Configurable overlap**: The `OVERLAP` parameter ensures context continuity between chunks, with optional snapping to previous sentence boundaries
- **Safety guarantees**: `_safe_tokenize` provides a character-level fallback when neural tokenization fails or a segment is still too long, and preprocessing is designed to keep segments within `MAX_TEXT_LENGTH`

## Frequently Asked Questions

### What tokenizer does GraphRAG Agent use for Chinese text chunking?

GraphRAG Agent uses the **HanLP** library with the `COARSE_ELECTRA_SMALL_ZH` pretrained model. This tokenizer is loaded in the `ChineseTextChunker.__init__` method and performs coarse-grained word segmentation for Chinese text. If input exceeds safety limits, the `_safe_tokenize` method automatically falls back to character-level tokenization so that it still returns usable output.

### How does the chunker handle documents longer than the maximum text length?

The `_preprocess_large_text` method implements a three-tier splitting strategy. It first attempts to split on double-newline paragraph boundaries, falls back to single newlines if necessary, and finally uses `_split_long_paragraph` for character-level segmentation of stubbornly long sections. This preprocessing occurs before tokenization to prevent memory issues with the HanLP model.

### What sentence delimiters does the text chunking mechanism recognize?

The chunking mechanism specifically recognizes Chinese sentence terminators: **「。」** (period), **「！」** (exclamation), and **「？」** (question mark). The `_is_sentence_end` helper method checks for these characters, while `_find_next_sentence_end` and `_find_previous_sentence_end` navigate to them when adjusting chunk boundaries to preserve semantic coherence.

### How is the overlap between chunks calculated?

The overlap is calculated using the `OVERLAP` parameter (default typically 100 tokens). After extracting a chunk, the algorithm sets the next starting position by stepping back `OVERLAP` tokens from the current chunk's end. Optionally, `_find_previous_sentence_end` snaps this position to the nearest preceding sentence boundary to ensure overlapping regions remain semantically meaningful rather than cutting arbitrarily in the middle of phrases.