How the Text Chunking Mechanism Works in GraphRAG Agent: A Deep Dive into ChineseTextChunker

GraphRAG Agent uses the ChineseTextChunker class to split long Chinese documents into overlapping, sentence-aware token chunks using a sliding-window algorithm with HanLP tokenization.

GraphRAG Agent is an open-source retrieval-augmented generation system optimized for Chinese knowledge bases. Its text chunking mechanism transforms raw documents into semantically coherent segments that preserve sentence boundaries while maintaining configurable overlap for context retention. The implementation centers on the ChineseTextChunker class located in graphrag_agent/pipelines/ingestion/text_chunker.py, which orchestrates tokenization, preprocessing, and intelligent boundary detection to prepare text for graph-based indexing.

Core Architecture and Configuration

The chunking pipeline operates through three distinct logical stages governed by the ChineseTextChunker class. Each stage handles specific constraints related to tokenization limits, document length, and semantic boundary preservation.

Tokenizer Initialization and Hyperparameters

The constructor (__init__) initializes the HanLP tokenizer using the COARSE_ELECTRA_SMALL_ZH pretrained model. It loads three critical configuration parameters from graphrag_agent/config/settings.py:

  • CHUNK_SIZE: The target number of tokens per chunk
  • OVERLAP: The number of tokens shared between consecutive chunks to maintain context continuity
  • MAX_TEXT_LENGTH: The maximum character length the tokenizer can safely process without memory or performance degradation

These parameters drive all subsequent chunking decisions and ensure the pipeline respects hardware constraints while optimizing for retrieval quality.
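As a rough illustration of how the three values constrain each other, the sketch below bundles them into a small dataclass and rejects combinations that would stall a sliding window. The class name ChunkerSettings and the MAX_TEXT_LENGTH value of 500,000 are assumptions made for illustration. Only CHUNK_SIZE=500 and OVERLAP=100 appear in this article's usage example, and the project's settings.py may load and validate these values differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkerSettings:
    """Hypothetical container for the three chunking hyperparameters."""

    chunk_size: int = 500            # target tokens per chunk
    overlap: int = 100               # tokens shared by consecutive chunks
    max_text_length: int = 500_000   # assumed character limit, not the project's value

    def __post_init__(self) -> None:
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.overlap < self.chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        if self.max_text_length <= 0:
            raise ValueError("max_text_length must be positive")

    @property
    def stride(self) -> int:
        """Tokens the window advances per chunk when no boundary snapping occurs."""
        return self.chunk_size - self.overlap
```

With the defaults above, the window advances 400 tokens per chunk. An overlap equal to or larger than the chunk size is rejected because the window would never move forward.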

Preprocessing Documents That Exceed Maximum Length

When a document exceeds MAX_TEXT_LENGTH, the _preprocess_large_text method (lines 43-71) segments the raw string into smaller units before tokenization. The method implements a hierarchical splitting strategy:

  1. Paragraph Boundary Detection: It first attempts to split on double-newline characters (\n\n) to preserve logical document structure
  2. Fallback to Line Breaks: If the resulting paragraphs are too few or still too large, it falls back to single newline (\n) delimiters
  3. Character-Level Splitting: For individual paragraphs that remain oversized, the _split_long_paragraph method (lines 104-138) applies character-level segmentation to guarantee no segment exceeds the tokenizer's safe operating threshold

This preprocessing is intended to keep the HanLP tokenizer from receiving input large enough to cause out-of-memory errors or excessive processing delays.
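The three tiers can be sketched roughly as follows. This is a simplified illustration, not the project's code: the function names, the exact fallback condition (fewer than two paragraphs, or any paragraph over the limit), and the fixed-size character slicing are all assumptions.

```python
def split_long_paragraph(paragraph: str, max_len: int) -> list[str]:
    """Cut an oversized paragraph into fixed-size character slices."""
    return [paragraph[i:i + max_len] for i in range(0, len(paragraph), max_len)]


def preprocess_large_text(text: str, max_len: int) -> list[str]:
    """Split text into segments of at most max_len characters."""
    if len(text) <= max_len:
        return [text]

    # Tier 1: paragraph boundaries (blank lines).
    parts = [p for p in text.split("\n\n") if p.strip()]

    # Tier 2: fall back to single newlines when paragraphs are too few or too big.
    if len(parts) <= 1 or any(len(p) > max_len for p in parts):
        parts = [p for p in text.split("\n") if p.strip()]

    segments: list[str] = []
    for part in parts:
        if len(part) <= max_len:
            segments.append(part)
        else:
            # Tier 3: character-level slicing for anything still oversized.
            segments.extend(split_long_paragraph(part, max_len))
    return segments
```

Note that this sketch drops empty and whitespace-only parts, so the segments do not reproduce the original newlines.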

The Chunk Generation Pipeline

Once preprocessed, segments flow through a tokenization and sliding-window assembly process that prioritizes sentence integrity over rigid token counts.

Safe Tokenization with Fallback Strategies

The _safe_tokenize method (lines 65-82) sends each preprocessed segment to the HanLP tokenizer. If a segment still exceeds MAX_TEXT_LENGTH, which can occur with pathological input lacking whitespace, the method falls back to treating the text as a list of characters. The same fallback keeps the output deterministic when the neural tokenizer fails, so the downstream chunker always receives a list of tokens or characters it can process.
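A minimal sketch of this fallback pattern is shown below. It takes the tokenizer as a plain callable so that it runs without HanLP installed. The function name, the parameter names, and the choice to catch any exception are assumptions, not the project's actual signature.

```python
from typing import Callable, List


def safe_tokenize(
    text: str,
    tokenizer: Callable[[str], List[str]],
    max_len: int,
) -> List[str]:
    """Tokenize text, degrading to one token per character when that is unsafe."""
    if not text:
        return []
    if len(text) > max_len:
        # Oversized input: skip the model entirely.
        return list(text)
    try:
        return list(tokenizer(text))
    except Exception:
        # Model failure: characters are a crude but deterministic substitute.
        return list(text)
```

In a real setup the tokenizer argument would presumably be the callable returned by hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH). Any function that maps a string to a list of tokens works for experimenting with the fallback behaviour.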

Sliding-Window Algorithm with Sentence-Aware Boundaries

The chunk_text method iterates over preprocessed segments and delegates chunking logic to _chunk_single_segment. This method runs a while-loop over the segment's tokens and adjusts each chunk boundary to Chinese sentence punctuation:

  1. Initial Window Proposal: The algorithm proposes a chunk ending at start_pos + CHUNK_SIZE
  2. Forward Boundary Adjustment: If the proposed end is not the final token, _find_next_sentence_end searches forward for sentence terminators (「。」「!」「?」). The search allows extending the chunk up to CHUNK_SIZE + 100 tokens to prevent splitting mid-sentence
  3. Chunk Extraction: The selected token slice becomes a discrete chunk
  4. Overlap Calculation: The next start_pos is calculated by stepping back OVERLAP tokens from the current end. The _find_previous_sentence_end method optionally snaps this position to the nearest preceding sentence boundary to ensure overlapping windows remain semantically coherent
  5. Iteration: The loop continues until the token list is exhausted

Helper methods _is_sentence_end, _find_next_sentence_end, and _find_previous_sentence_end provide the punctuation-aware logic that distinguishes this implementation from naive character-counting chunkers.
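The five steps above can be sketched as a standalone function over a token list. This is a hedged approximation rather than the project's implementation: the function names are unprefixed stand-ins, the terminator set includes full-width 「!」「?」 as an assumption, and bounding the backward search by max_extend is a choice made here to keep the overlap from growing without limit.

```python
from typing import List, Sequence

# ASCII terminators as listed above; the full-width variants are an assumption.
SENTENCE_END = frozenset({"。", "!", "?", "!", "?"})


def is_sentence_end(token: str) -> bool:
    return token in SENTENCE_END


def find_next_sentence_end(tokens: Sequence[str], end: int, limit: int) -> int:
    """Return the first boundary in [end, limit] that follows a terminator, else end."""
    for i in range(end - 1, min(limit, len(tokens))):
        if is_sentence_end(tokens[i]):
            return i + 1
    return end


def find_previous_sentence_end(tokens: Sequence[str], pos: int, floor: int) -> int:
    """Return the last boundary in (floor, pos] that follows a terminator, else pos."""
    for i in range(pos - 1, floor - 1, -1):
        if is_sentence_end(tokens[i]):
            return i + 1
    return pos


def chunk_tokens(
    tokens: Sequence[str],
    chunk_size: int = 500,
    overlap: int = 100,
    max_extend: int = 100,
) -> List[List[str]]:
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[List[str]] = []
    n = len(tokens)
    start = 0
    while start < n:
        # Steps 1-2: propose a window, then extend it to finish the sentence.
        end = min(start + chunk_size, n)
        if end < n:
            end = find_next_sentence_end(tokens, end, end + max_extend)
        # Step 3: extract the chunk.
        chunks.append(list(tokens[start:end]))
        if end >= n:
            break
        # Step 4: step back by overlap, then snap to a preceding sentence boundary.
        candidate = max(end - overlap, start + 1)
        floor = max(start + 1, candidate - max_extend)
        start = find_previous_sentence_end(tokens, candidate, floor)
    return chunks
```

For the single-character tokens of "一二三。四五六。七八九。十百千。" with chunk_size=6, overlap=2, and max_extend=4, this sketch yields three chunks, each containing two whole sentences and sharing one sentence with its neighbour. The start position always advances by at least one token, so the loop terminates even on text without punctuation.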

Monitoring Chunking Statistics

Before processing large document collections, you can call get_text_stats (lines 86-103) to inspect how a specific document will be treated. This method returns metadata including whether preprocessing is required, the estimated number of chunks based on current settings, and the paragraph count. Use this to validate configuration parameters against your corpus characteristics before triggering expensive indexing operations.
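To show the kind of estimate such a method can produce, here is a small sketch that uses character count as a stand-in for token count. That is an assumption, and it overestimates, because word-level tokens often span several characters. The function name, the dictionary keys, and the max_text_length default are hypothetical and are not taken from get_text_stats.

```python
import math


def text_stats(
    text: str,
    chunk_size: int = 500,
    overlap: int = 100,
    max_text_length: int = 500_000,  # assumed limit, for illustration only
) -> dict:
    """Estimate how a document would be treated, without running a tokenizer."""
    stride = chunk_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    n = len(text)
    if n == 0:
        estimated = 0
    elif n <= chunk_size:
        estimated = 1
    else:
        # One full window, plus one more for every stride needed to cover the rest.
        estimated = 1 + math.ceil((n - chunk_size) / stride)

    return {
        "text_length": n,
        "needs_preprocessing": n > max_text_length,
        "paragraphs": sum(1 for p in text.split("\n\n") if p.strip()),
        "estimated_chunks": estimated,
    }
```

For a 1,300-character text with the defaults, the windows cover positions 0-500, 400-900, and 800-1300, so the estimate is three chunks. Sentence-boundary snapping in the real chunker can shift that number slightly.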

Practical Implementation Example

The following example demonstrates initializing the chunker and processing a Chinese document:

from pathlib import Path

from graphrag_agent.pipelines.ingestion.text_chunker import ChineseTextChunker

# Load a sample document that has already been extracted to plain text.
# Path.read_text cannot parse a binary PDF, so convert PDFs to text first.
doc_path = Path("/path/to/华东理工大学学生管理手册.txt")
raw_text = doc_path.read_text(encoding="utf-8")

# Initialise the chunker (uses env defaults: CHUNK_SIZE=500, OVERLAP=100)
chunker = ChineseTextChunker()

# Produce overlapping token chunks
chunks = chunker.chunk_text(raw_text)

print(f"Total chunks: {len(chunks)}")
print("First chunk (tokens → string):")
print("".join(chunks[0]))

The chunk_text method returns a list of token lists, each of which can be joined into a string for vector indexing. Sample output:

Total chunks: 23
First chunk (tokens → string):
“根据《学生管理条例》第三条,学生应……”

Downstream components in graphrag_agent/integrations/build/build_chunk_index.py consume these chunks to construct vector indices for similarity search, while graphrag_agent/pipelines/ingestion/document_processor.py demonstrates batch processing entire document directories with metadata attachment.

Summary

  • ChineseTextChunker in graphrag_agent/pipelines/ingestion/text_chunker.py implements the complete text chunking mechanism using the HanLP tokenizer (COARSE_ELECTRA_SMALL_ZH)
  • Three-stage pipeline: Configuration loading, large-document preprocessing via _preprocess_large_text, and sentence-aware sliding-window chunking via _chunk_single_segment
  • Sentence boundary preservation: The algorithm recognizes Chinese terminators (「。」「!」「?」) and adjusts chunk boundaries up to 100 tokens beyond CHUNK_SIZE to avoid splitting sentences
  • Configurable overlap: The OVERLAP parameter ensures context continuity between chunks, with optional snapping to previous sentence boundaries
  • Safety measures: preprocessing keeps segments within MAX_TEXT_LENGTH wherever possible, and _safe_tokenize provides a character-level fallback when a segment is still too long or neural tokenization fails

Frequently Asked Questions

What tokenizer does GraphRAG Agent use for Chinese text chunking?

GraphRAG Agent uses the HanLP library with the COARSE_ELECTRA_SMALL_ZH pretrained model. This tokenizer is loaded in the ChineseTextChunker.__init__ method and performs coarse-grained word segmentation for Chinese text. If input exceeds MAX_TEXT_LENGTH, the _safe_tokenize method automatically falls back to character-level tokenization to ensure deterministic output.

How does the chunker handle documents longer than the maximum text length?

The _preprocess_large_text method implements a three-tier splitting strategy. It first attempts to split on double-newline paragraph boundaries, falls back to single newlines if necessary, and finally uses _split_long_paragraph for character-level segmentation of stubbornly long sections. This preprocessing occurs before tokenization to prevent memory issues with the HanLP model.

What sentence delimiters does the text chunking mechanism recognize?

The chunking mechanism specifically recognizes Chinese sentence terminators: 「。」 (period), 「!」 (exclamation), and 「?」 (question mark). The _is_sentence_end helper method checks for these characters, while _find_next_sentence_end and _find_previous_sentence_end navigate to them when adjusting chunk boundaries to preserve semantic coherence.

How is the overlap between chunks calculated?

The overlap is controlled by the OVERLAP parameter (100 tokens in the example configuration above). After extracting a chunk, the algorithm sets the next starting position by stepping back OVERLAP tokens from the current chunk's end. Optionally, _find_previous_sentence_end snaps this position to the nearest preceding sentence boundary, so that the overlapping region starts at the beginning of a sentence rather than in the middle of a phrase.
