How the Text Chunking Mechanism Works in GraphRAG Agent: A Deep Dive into ChineseTextChunker
GraphRAG Agent uses the ChineseTextChunker class to split long Chinese documents into overlapping, sentence-aware token chunks using a sliding-window algorithm with HanLP tokenization.
GraphRAG Agent is an open-source retrieval-augmented generation system optimized for Chinese knowledge bases. Its text chunking mechanism transforms raw documents into semantically coherent segments that preserve sentence boundaries while maintaining configurable overlap for context retention. The implementation centers on the ChineseTextChunker class located in graphrag_agent/pipelines/ingestion/text_chunker.py, which orchestrates tokenization, preprocessing, and intelligent boundary detection to prepare text for graph-based indexing.
Core Architecture and Configuration
The chunking pipeline operates through three distinct logical stages governed by the ChineseTextChunker class. Each stage handles specific constraints related to tokenization limits, document length, and semantic boundary preservation.
Tokenizer Initialization and Hyperparameters
The constructor (__init__) initializes the HanLP tokenizer using the COARSE_ELECTRA_SMALL_ZH pretrained model. It loads three critical configuration parameters from graphrag_agent/config/settings.py:
- CHUNK_SIZE: The target number of tokens per chunk
- OVERLAP: The number of tokens shared between consecutive chunks to maintain context continuity
- MAX_TEXT_LENGTH: The maximum character length the tokenizer can safely process without memory or performance degradation
These parameters drive all subsequent chunking decisions and ensure the pipeline respects hardware constraints while optimizing for retrieval quality.
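As a rough illustration of how these three values relate, the sketch below reads them from an environment-like mapping and checks that they are mutually consistent. It is not the project's settings.py: the function name load_chunk_settings is hypothetical, the CHUNK_SIZE and OVERLAP defaults are the ones quoted in the example later in this article, and the MAX_TEXT_LENGTH default is an invented placeholder.

```python
import os


def load_chunk_settings(env=None):
    """Read chunking hyperparameters from an environment-like mapping.

    Hypothetical helper for illustration; the real values live in
    graphrag_agent/config/settings.py.
    """
    env = os.environ if env is None else env
    settings = {
        "CHUNK_SIZE": int(env.get("CHUNK_SIZE", 500)),
        "OVERLAP": int(env.get("OVERLAP", 100)),
        # Placeholder default: the article does not state the real value.
        "MAX_TEXT_LENGTH": int(env.get("MAX_TEXT_LENGTH", 500_000)),
    }
    if settings["CHUNK_SIZE"] <= 0:
        raise ValueError("CHUNK_SIZE must be positive")
    # The window can only advance if the overlap is smaller than the chunk.
    if not 0 <= settings["OVERLAP"] < settings["CHUNK_SIZE"]:
        raise ValueError("OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    if settings["MAX_TEXT_LENGTH"] <= 0:
        raise ValueError("MAX_TEXT_LENGTH must be positive")
    return settings
```

The OVERLAP check matters in practice: a sliding window whose overlap equals or exceeds its size can never move forward.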
Preprocessing Documents That Exceed Maximum Length
When a document exceeds MAX_TEXT_LENGTH, the _preprocess_large_text method (lines 43-71) segments the raw string into smaller units before tokenization. The method implements a hierarchical splitting strategy:
- Paragraph Boundary Detection: It first attempts to split on double-newline characters (\n\n) to preserve logical document structure
- Fallback to Line Breaks: If the resulting paragraphs are too few or still too large, it falls back to single newline (\n) delimiters
- Character-Level Splitting: For individual paragraphs that remain oversized, the _split_long_paragraph method (lines 104-138) applies character-level segmentation to guarantee no segment exceeds the tokenizer's safe operating threshold
This preprocessing is designed to keep the HanLP tokenizer from receiving input long enough to cause out-of-memory errors or severe slowdowns.
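The three tiers can be sketched in a few lines of plain Python. This is a minimal approximation written from the description above, not the code in text_chunker.py: the function names are stand-ins without the leading underscore, and the exact conditions under which the real method falls back from one tier to the next are assumptions.

```python
def split_long_paragraph(paragraph, max_len):
    """Tier 3: cut an oversized piece into fixed-size character slices."""
    return [paragraph[i:i + max_len] for i in range(0, len(paragraph), max_len)]


def preprocess_large_text(text, max_len):
    """Split text into segments of at most max_len characters."""
    if len(text) <= max_len:
        return [text]  # short documents pass through untouched

    segments = []
    # Tier 1: paragraph boundaries (blank lines).
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue
        if len(paragraph) <= max_len:
            segments.append(paragraph)
            continue
        # Tier 2: the paragraph is still too large, so try single newlines.
        for line in paragraph.split("\n"):
            if not line.strip():
                continue
            if len(line) <= max_len:
                segments.append(line)
            else:
                # Tier 3: no usable line breaks left, slice by characters.
                segments.extend(split_long_paragraph(line, max_len))
    return segments
```

A document with no double newlines at all arrives at tier 2 as one oversized paragraph, which covers the "too few paragraphs" case without a separate branch.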
The Chunk Generation Pipeline
Once preprocessed, segments flow through a tokenization and sliding-window assembly process that prioritizes sentence integrity over rigid token counts.
Safe Tokenization with Fallback Strategies
The _safe_tokenize method (lines 65-82) sends each preprocessed segment to the HanLP tokenizer. If a segment still exceeds MAX_TEXT_LENGTH, which can occur with pathological input lacking whitespace, the method falls back to treating the text as a list of characters. This fallback keeps the output well-defined even when the neural tokenizer cannot be used: the method always returns a list of tokens (or characters) that the downstream chunker can process.
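The fallback pattern looks roughly like the sketch below. To keep it runnable without HanLP, the tokenizer is passed in as a plain callable and pair_tokenizer is a toy stand-in; both names are hypothetical. Catching tokenizer exceptions is also an assumption, since the description above only states the length check.

```python
def pair_tokenizer(text):
    """Toy stand-in for a real tokenizer: groups characters in pairs."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def safe_tokenize(text, tokenizer, max_len):
    """Tokenize text, degrading to one token per character when needed."""
    if not text:
        return []
    if len(text) > max_len:
        # Too long for the model: fall back to a character list.
        return list(text)
    try:
        return list(tokenizer(text))
    except Exception:
        # Assumption: a tokenizer error triggers the same fallback.
        return list(text)


tokens = safe_tokenize("你好世界", pair_tokenizer, max_len=10)
chars = safe_tokenize("你好世界", pair_tokenizer, max_len=3)
```

Here tokens holds the tokenizer's own output, while chars holds individual characters because the input is longer than max_len.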
Sliding-Window Algorithm with Sentence-Aware Boundaries
The chunk_text method iterates over preprocessed segments and delegates chunking logic to _chunk_single_segment. This method runs a while-loop that moves a window across the token list and adjusts each boundary to Chinese sentence-ending punctuation:
- Initial Window Proposal: The algorithm proposes a chunk ending at start_pos + CHUNK_SIZE
- Forward Boundary Adjustment: If the proposed end is not the final token, _find_next_sentence_end searches forward for sentence terminators (「。」「!」「?」). The search allows extending the chunk up to CHUNK_SIZE + 100 tokens to prevent splitting mid-sentence
- Chunk Extraction: The selected token slice becomes a discrete chunk
- Overlap Calculation: The next start_pos is calculated by stepping back OVERLAP tokens from the current end. The _find_previous_sentence_end method optionally snaps this position to the nearest preceding sentence boundary to ensure overlapping windows remain semantically coherent
- Iteration: The loop continues until the token list is exhausted
Helper methods _is_sentence_end, _find_next_sentence_end, and _find_previous_sentence_end provide the punctuation-aware logic that distinguishes this implementation from naive character-counting chunkers.
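The loop and its helpers can be approximated as follows. This is a sketch reconstructed from the steps above rather than a copy of _chunk_single_segment: the function names drop the leading underscore, the terminator set adds the full-width forms 「!」「?」 as an assumption, and the progress guard on the last line of the loop is an addition to rule out an infinite loop.

```python
# Terminators from the article, plus full-width forms (an assumption).
SENTENCE_ENDS = {"。", "!", "?", "!", "?"}


def is_sentence_end(token):
    return token in SENTENCE_ENDS


def find_next_sentence_end(tokens, start, limit):
    """Index just past the first terminator in tokens[start:limit], else None."""
    for i in range(start, min(limit, len(tokens))):
        if is_sentence_end(tokens[i]):
            return i + 1
    return None


def find_previous_sentence_end(tokens, pos, floor):
    """Index just past the last terminator in tokens[floor:pos], else None."""
    for i in range(pos - 1, floor - 1, -1):
        if is_sentence_end(tokens[i]):
            return i + 1
    return None


def chunk_tokens(tokens, chunk_size, overlap, max_extension=100):
    """Sentence-aware sliding window over a token list."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks, start, n = [], 0, len(tokens)
    while start < n:
        end = min(start + chunk_size, n)  # initial window proposal
        if end < n:
            # Extend forward (bounded) so the chunk finishes its sentence.
            boundary = find_next_sentence_end(
                tokens, end - 1, start + chunk_size + max_extension)
            if boundary is not None:
                end = boundary
        chunks.append(tokens[start:end])
        if end >= n:
            break
        next_start = end - overlap  # step back by the overlap
        snapped = find_previous_sentence_end(tokens, next_start, start)
        if snapped is not None:
            next_start = snapped  # begin the next chunk on a sentence start
        start = max(next_start, start + 1)  # progress guard
    return chunks


tokens = list("一二三。四五六七。八九十。")  # one character per token
result = ["".join(c) for c in chunk_tokens(tokens, 6, 2, max_extension=4)]
# result == ["一二三。四五六七。", "四五六七。八九十。"]
```

In this small run the first window is extended from 6 to 9 tokens to reach 「。」, and the next start snaps back to the beginning of the second sentence, so the overlap is a whole sentence rather than two arbitrary tokens.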
Monitoring Chunking Statistics
Before processing large document collections, you can call get_text_stats (lines 86-103) to inspect how a specific document will be treated. This method returns metadata including whether preprocessing is required, the estimated number of chunks based on current settings, and the paragraph count. Use this to validate configuration parameters against your corpus characteristics before triggering expensive indexing operations.
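To show the kind of arithmetic behind such an estimate, here is a rough stand-in. It is not the real get_text_stats: the function name estimate_text_stats, the dictionary keys, and the max_text_length default are hypothetical, and it treats one character as one token, which tends to overestimate the chunk count for Chinese text segmented into multi-character words.

```python
import math


def estimate_text_stats(text, chunk_size=500, overlap=100,
                        max_text_length=500_000):
    """Rough pre-flight statistics for one document (illustrative only)."""
    stride = chunk_size - overlap  # tokens gained by each new window
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    length = len(text)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if length == 0:
        estimated = 0
    elif length <= chunk_size:
        estimated = 1
    else:
        # One full window, then one more per stride of remaining text.
        estimated = 1 + math.ceil((length - chunk_size) / stride)
    return {
        "text_length": length,
        "needs_preprocessing": length > max_text_length,
        "estimated_chunks": estimated,
        "paragraph_count": len(paragraphs),
    }
```

For a 1,300-character document with the defaults, the stride is 400, so the estimate is 1 + ceil(800 / 400) = 3 chunks. Sentence-aware boundary adjustment will shift the real count somewhat.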
Practical Implementation Example
The following example demonstrates initializing the chunker and processing a Chinese document:
from graphrag_agent.pipelines.ingestion.text_chunker import ChineseTextChunker
from pathlib import Path
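# Load a sample document that has already been extracted to plain text
# (read_text cannot decode a binary PDF, so convert the PDF to text first)
doc_path = Path("/path/to/华东理工大学学生管理手册.txt")
raw_text = doc_path.read_text(encoding="utf-8")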
# Initialise the chunker (uses env defaults: CHUNK_SIZE=500, OVERLAP=100)
chunker = ChineseTextChunker()
# Produce overlapping token chunks
chunks = chunker.chunk_text(raw_text)
print(f"Total chunks: {len(chunks)}")
print("First chunk (tokens → string):")
print("".join(chunks[0]))
chunk_text returns a list of token lists, each of which can be joined into a string for vector indexing. Example output:
Total chunks: 23
First chunk (tokens → string):
“根据《学生管理条例》第三条,学生应……”
Downstream components in graphrag_agent/integrations/build/build_chunk_index.py consume these chunks to construct vector indices for similarity search, while graphrag_agent/pipelines/ingestion/document_processor.py demonstrates batch processing entire document directories with metadata attachment.
Summary
- ChineseTextChunker in graphrag_agent/pipelines/ingestion/text_chunker.py implements the complete text chunking mechanism using the HanLP tokenizer (COARSE_ELECTRA_SMALL_ZH)
- Three-stage pipeline: Configuration loading, large-document preprocessing via _preprocess_large_text, and sentence-aware sliding-window chunking via _chunk_single_segment
- Sentence boundary preservation: The algorithm recognizes Chinese terminators (「。」「!」「?」) and adjusts chunk boundaries up to 100 tokens beyond CHUNK_SIZE to avoid splitting sentences
- Configurable overlap: The OVERLAP parameter ensures context continuity between chunks, with optional snapping to previous sentence boundaries
- Safety measures: Preprocessing keeps segments within MAX_TEXT_LENGTH, and _safe_tokenize falls back to character-level tokens for any segment that still exceeds it
Frequently Asked Questions
What tokenizer does GraphRAG Agent use for Chinese text chunking?
GraphRAG Agent uses the HanLP library with the COARSE_ELECTRA_SMALL_ZH pretrained model. This tokenizer is loaded in the ChineseTextChunker.__init__ method and performs coarse-grained Chinese word segmentation. If input exceeds MAX_TEXT_LENGTH, the _safe_tokenize method falls back to character-level tokenization so that it still returns a usable token list.
How does the chunker handle documents longer than the maximum text length?
The _preprocess_large_text method implements a three-tier splitting strategy. It first attempts to split on double-newline paragraph boundaries, falls back to single newlines if necessary, and finally uses _split_long_paragraph for character-level segmentation of stubbornly long sections. This preprocessing occurs before tokenization to prevent memory issues with the HanLP model.
What sentence delimiters does the text chunking mechanism recognize?
The chunking mechanism specifically recognizes Chinese sentence terminators: 「。」 (period), 「!」 (exclamation), and 「?」 (question mark). The _is_sentence_end helper method checks for these characters, while _find_next_sentence_end and _find_previous_sentence_end navigate to them when adjusting chunk boundaries to preserve semantic coherence.
How is the overlap between chunks calculated?
The overlap is calculated using the OVERLAP parameter (default typically 100 tokens). After extracting a chunk, the algorithm sets the next starting position by stepping back OVERLAP tokens from the current chunk's end. Optionally, _find_previous_sentence_end snaps this position to the nearest preceding sentence boundary to ensure overlapping regions remain semantically meaningful rather than cutting arbitrarily in the middle of phrases.