Open Notebook Chunking Strategy: How `chunking.py` Splits Text Before Embedding
Open Notebook uses a multi-step, content-type-aware pipeline in open_notebook/utils/chunking.py that detects whether input is HTML, Markdown, or plain text, applies the appropriate LangChain splitter, recursively breaks oversized sections, and filters out tiny fragments below a configurable token threshold.
The lfnovo/open-notebook repository prepares text for embedding in open_notebook/utils/chunking.py. The chunking strategy defined there transforms raw documents into token-bounded pieces that vector-embedding models can safely consume.
Environment-Driven Configuration
Open Notebook does not hard-code chunk dimensions. Instead, chunking.py reads three environment variables:
- OPEN_NOTEBOOK_CHUNK_SIZE: maximum tokens per chunk (default: 400).
- OPEN_NOTEBOOK_CHUNK_OVERLAP: overlap between consecutive chunks (default: 15% of the chunk size).
- OPEN_NOTEBOOK_MIN_CHUNK_SIZE: minimum token count for a fragment to be retained (default: 5).
These settings flow into the splitters and govern every downstream decision.
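As a rough illustration of this configuration step, the sketch below reads the three variables with the defaults listed above. The function name `load_chunk_settings` and the returned dictionary are hypothetical, and treating the overlap as a token count that defaults to 15% of the chunk size is an assumption, not a description of the actual module.

```python
import os


def load_chunk_settings(env=None):
    """Read chunking settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    chunk_size = int(env.get("OPEN_NOTEBOOK_CHUNK_SIZE", "400"))
    # Assumption: overlap is expressed in tokens and defaults to 15% of chunk_size.
    default_overlap = chunk_size * 15 // 100
    chunk_overlap = int(env.get("OPEN_NOTEBOOK_CHUNK_OVERLAP", str(default_overlap)))
    min_chunk_size = int(env.get("OPEN_NOTEBOOK_MIN_CHUNK_SIZE", "5"))
    return {
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "min_chunk_size": min_chunk_size,
    }


if __name__ == "__main__":
    print(load_chunk_settings())
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check without touching the real environment.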
Content-Type Detection
Before splitting, the pipeline must decide whether the raw input is HTML, Markdown, or plain text. chunking.py resolves this through a two-tier detection system.
Extension-Based Detection
The code maps common file extensions to a ContentType enum (HTML, MARKDOWN, PLAIN). If the caller provides a file_path, the extension is checked first.
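A minimal sketch of extension-based detection might look like the following. The enum members mirror the three types named above, but the enum values, the specific extensions in `EXTENSION_MAP`, and the helper name `detect_from_extension` are illustrative assumptions; the real mapping may differ.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class ContentType(Enum):
    HTML = "html"
    MARKDOWN = "markdown"
    PLAIN = "plain"


# Hypothetical mapping; the extensions recognized by chunking.py may differ.
EXTENSION_MAP = {
    ".html": ContentType.HTML,
    ".htm": ContentType.HTML,
    ".md": ContentType.MARKDOWN,
    ".markdown": ContentType.MARKDOWN,
    ".txt": ContentType.PLAIN,
}


def detect_from_extension(file_path: Optional[str]) -> Optional[ContentType]:
    """Return the mapped type, or None when the extension is missing or unknown."""
    if not file_path:
        return None
    return EXTENSION_MAP.get(Path(file_path).suffix.lower())
```

Returning `None` for unknown extensions leaves room for a content-based fallback to make the decision.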
Heuristic-Based Fallback
When the extension is missing or generic, the system inspects the first 5,000 characters and scores HTML-specific patterns (e.g., <!DOCTYPE>, <html>, header tags) and Markdown-specific patterns (e.g., headers, links, code fences, lists). The heuristic returns a confidence-weighted type. The final decision prefers the extension guess, but a very high-confidence heuristic can override a plain-text assumption.
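The scoring idea can be sketched as follows. The regular expressions, the confidence formula (share of all pattern hits), and the 0.8 override threshold are assumptions chosen for illustration; only the 5,000-character sample size and the overall precedence rule come from the description above.

```python
import re

# Illustrative patterns only; the real module may use different ones.
HTML_PATTERNS = [r"<!DOCTYPE\s+html", r"<html[\s>]", r"<h[1-6][\s>]", r"<(?:p|div|body)[\s>]"]
MARKDOWN_PATTERNS = [r"^#{1,6}\s+\S", r"\[[^\]]+\]\([^)]+\)", r"^`{3}", r"^\s*[-*+]\s+\S"]


def _score(patterns, sample):
    """Count how many times any of the patterns occurs in the sample."""
    flags = re.IGNORECASE | re.MULTILINE
    return sum(len(re.findall(p, sample, flags)) for p in patterns)


def detect_from_content(text, sample_size=5000):
    """Return (type, confidence) based on the first sample_size characters."""
    sample = text[:sample_size]
    html_score = _score(HTML_PATTERNS, sample)
    md_score = _score(MARKDOWN_PATTERNS, sample)
    total = html_score + md_score
    if total == 0:
        return "plain", 0.0
    if html_score >= md_score:
        return "html", html_score / total
    return "markdown", md_score / total


def resolve_type(extension_guess, text, override_threshold=0.8):
    """Prefer the extension guess; let a confident heuristic override 'plain'."""
    heuristic_type, confidence = detect_from_content(text)
    if extension_guess is None:
        return heuristic_type
    if (extension_guess == "plain" and heuristic_type != "plain"
            and confidence >= override_threshold):
        return heuristic_type
    return extension_guess
```

Type names are plain strings here to keep the sketch short; in the actual code they would be ContentType members.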
Splitter Selection
Once the content type is known, chunking.py instantiates a LangChain splitter tuned to that format.
- HTML: HTMLHeaderTextSplitter configured for <h1>, <h2>, and <h3> elements.
- Markdown: MarkdownHeaderTextSplitter configured for #, ##, and ### headers.
- Plain text: RecursiveCharacterTextSplitter using the configured CHUNK_SIZE and CHUNK_OVERLAP, a token-count length function (provided by open_notebook/utils/token_utils.py), and a separator hierarchy of "\n\n", "\n", ". ", ", ", " ", and "".
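To show what a separator hierarchy does, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's implementation: it omits overlap, drops the separator at chunk boundaries, and uses whitespace-delimited words as a stand-in for the real token counter in token_utils.py. The names `count_tokens` and `recursive_split` are hypothetical.

```python
SEPARATORS = ["\n\n", "\n", ". ", ", ", " ", ""]


def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-delimited words.
    return len(text.split())


def recursive_split(text, chunk_size, separators=SEPARATORS):
    """Split on the coarsest separator first, recursing on oversized parts."""
    if count_tokens(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut into fixed-size word windows.
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if count_tokens(candidate) <= chunk_size:
            current = candidate  # keep merging small parts
            continue
        if current.strip():
            chunks.append(current)
        if count_tokens(part) > chunk_size:
            # The part alone is too large: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current.strip():
        chunks.append(current)
    return chunks
```

Trying paragraph breaks before sentence and word boundaries keeps chunks aligned with the largest natural unit that still fits the limit.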
Primary and Secondary Chunking
The chunk_text function is the entry point used by the embedding pipeline. It first checks whether the input is already small enough (≤ CHUNK_SIZE tokens); if so, it returns the text unchanged. Otherwise, it detects the content type (unless overridden by the caller) and runs the selected splitter to produce raw chunks.
Because semantic splitters such as HTMLHeaderTextSplitter and MarkdownHeaderTextSplitter can emit large sections—for example, a massive <div>—the pipeline applies a secondary chunking step. Any chunk that exceeds CHUNK_SIZE tokens is fed into the plain-text RecursiveCharacterTextSplitter to guarantee token-safe boundaries.
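The two-pass flow can be illustrated with a small sketch for Markdown input. Everything here is a simplified stand-in: a regex replaces MarkdownHeaderTextSplitter, a fixed-size word window replaces RecursiveCharacterTextSplitter, and words stand in for tokens. The function names are hypothetical.

```python
import re


def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-delimited words.
    return len(text.split())


def split_on_markdown_headers(text):
    """Primary pass: start a new section at each #, ##, or ### header line."""
    sections = re.split(r"(?m)^(?=#{1,3}\s)", text)
    return [s.strip() for s in sections if s.strip()]


def window_split(text, chunk_size):
    """Secondary pass stand-in: cut an oversized section into word windows."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def chunk_markdown(text, chunk_size):
    if count_tokens(text) <= chunk_size:
        return [text]  # already small enough, returned unchanged
    chunks = []
    for section in split_on_markdown_headers(text):
        if count_tokens(section) > chunk_size:
            chunks.extend(window_split(section, chunk_size))
        else:
            chunks.append(section)
    return chunks
```

The point of the second pass is that no chunk leaves the function above the limit, regardless of how the document's headers happen to divide it.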
Filtering Tiny Fragments
After splitting, chunking.py filters out noise. Any chunk whose token count falls below MIN_CHUNK_SIZE is dropped, unless dropping it would leave no chunks at all. The filter keeps meaningless punctuation-only fragments away from the embedding model. Such fragments are dangerous because some providers, notably llama.cpp-based services, can return null vectors for them, which crashes downstream processing.
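The filtering rule, including its fallback, fits in a few lines. This is a sketch under the same simplification used elsewhere (words as tokens); `filter_small_chunks` is a hypothetical name.

```python
def filter_small_chunks(chunks, min_chunk_size=5, count_tokens=lambda t: len(t.split())):
    """Drop chunks below min_chunk_size unless that would leave nothing."""
    kept = [c for c in chunks if count_tokens(c) >= min_chunk_size]
    # Fallback: if every chunk is tiny, keep the original list instead of returning nothing.
    return kept if kept else list(chunks)
```

For example, `filter_small_chunks(["a b c d e f", "...", "x"])` keeps only the first chunk, while `filter_small_chunks(["...", "ok"])` returns both because filtering would otherwise produce an empty result.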
Code Examples
The following examples demonstrate how to call the chunking pipeline in practice.
```python
import os

from open_notebook.utils.chunking import chunk_text, ContentType

# Example: raw HTML document
with open("article.html", "r", encoding="utf-8") as f:
    html = f.read()

# Automatic detection + chunking
chunks = chunk_text(html, file_path="article.html")
print(f"Generated {len(chunks)} chunks")  # the log shows CHUNK_SIZE, overlap, etc.

# For pure Markdown (override detection if you already know the type)
md = "# Title\n\nSome text\n\n## Section\nMore content"
chunks = chunk_text(md, content_type=ContentType.MARKDOWN)

# Plain text with custom environment variables.
# Note: if chunking.py reads these at import time, set them before importing it.
os.environ["OPEN_NOTEBOOK_CHUNK_SIZE"] = "500"
os.environ["OPEN_NOTEBOOK_CHUNK_OVERLAP"] = "75"
text = "Long paragraph ... " * 200
chunks = chunk_text(text)  # intended: 500-token limit with a 75-token (15%) overlap
```
Summary
- Open Notebook centralizes its chunking strategy in open_notebook/utils/chunking.py.
- Behavior is controlled by environment variables for chunk size, overlap, and minimum fragment size.
- Content type is determined by extension first, then by a 5,000-character heuristic scan.
- Splitters are matched to format: HTML headers, Markdown headers, or recursive plain-text splitting.
- Oversized semantic chunks undergo secondary recursive splitting to enforce token limits.
- Fragments below the minimum size are discarded to avoid null-vector crashes.
Frequently Asked Questions
How does Open Notebook decide whether a file is HTML or Markdown?
It first checks the file extension against a known mapping. If the extension is ambiguous or missing, it scores the first 5,000 characters for HTML and Markdown signatures and selects the type with the highest confidence.
What happens if a single HTML section is larger than the token limit?
The HTMLHeaderTextSplitter may produce large sections. When a chunk exceeds CHUNK_SIZE, the pipeline automatically applies the RecursiveCharacterTextSplitter to break it into smaller, token-safe pieces.
Why does the pipeline drop very small chunks?
Chunks below MIN_CHUNK_SIZE tokens (default 5) are removed because punctuation-only fragments generate low-quality embeddings. Some embedding providers return null vectors for such input, which can cause runtime failures.
Can I force a specific splitter without relying on auto-detection?
Yes. The chunk_text function accepts an optional content_type argument. Pass ContentType.HTML, ContentType.MARKDOWN, or ContentType.PLAIN to bypass extension and heuristic detection.