Content Chunking Strategy for Vector Embeddings in Open Notebook

Open Notebook implements a configurable, multi-stage content chunking strategy for vector embeddings that detects content type via file extensions and heuristics, applies format-specific splitters (HTML headers, Markdown headings, or recursive text), and enforces token limits through secondary chunking and minimum-size filtering.

The lfnovo/open-notebook repository turns arbitrary documents into embedding-ready segments through a pipeline defined in open_notebook/utils/chunking.py. The pipeline tries to preserve semantic structure while keeping every chunk within a fixed token budget, so that content in different formats can be embedded consistently.

Environment-Based Configuration

The chunking pipeline initializes critical parameters from environment variables, falling back to sensible defaults when values are absent or malformed. In open_notebook/utils/chunking.py (lines 33-58), the system reads:

  • OPEN_NOTEBOOK_CHUNK_SIZE: Maximum tokens per chunk (default: 400)
  • OPEN_NOTEBOOK_CHUNK_OVERLAP: Percentage of overlap between adjacent chunks (default: 15%)
  • OPEN_NOTEBOOK_MIN_CHUNK_SIZE: Minimum viable token count to retain a chunk (default: 5 tokens)

These variables govern every splitting decision without requiring code modification, allowing runtime adaptation to different embedding model context windows.
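
The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper read_int_env is hypothetical, and the sketch assumes the overlap variable holds a whole-number percentage (for example 15), which the source does not confirm.

```python
import os


def read_int_env(name: str, default: int, minimum: int = 0) -> int:
    """Return an integer environment variable, or `default` if unset or malformed.

    Hypothetical helper for illustration; the real parsing logic may differ.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # malformed value: fall back to the default
    return value if value >= minimum else default


# Defaults taken from the article; the variable format is an assumption.
CHUNK_SIZE = read_int_env("OPEN_NOTEBOOK_CHUNK_SIZE", 400, minimum=1)
CHUNK_OVERLAP_PERCENT = read_int_env("OPEN_NOTEBOOK_CHUNK_OVERLAP", 15)
MIN_CHUNK_SIZE = read_int_env("OPEN_NOTEBOOK_MIN_CHUNK_SIZE", 5)

# Convert the percentage into an absolute token overlap for the splitter.
CHUNK_OVERLAP = CHUNK_SIZE * CHUNK_OVERLAP_PERCENT // 100
```

With the defaults, a 15% overlap on 400-token chunks works out to 60 tokens shared between adjacent chunks.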

Content-Type Detection Pipeline

Before splitting, the detect_content_type() function (lines 221-263) determines whether content is HTML, Markdown, or plain text using a dual-strategy approach:

Extension-Based Detection

The system maps file extensions to a ContentType enum (lines 36-70). Common patterns like .html, .md, and .txt trigger immediate classification into their respective categories.

Heuristic Scoring Fallback

When extensions are missing or ambiguous, a heuristic scorer evaluates raw text for structural markers—including HTML tags, Markdown headings, links, and code fences—returning a confidence score (lines 95-128). The final type selection prefers the file extension unless heuristics report high confidence (≥ 0.8) suggesting a different format.
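
To make the interplay between extension and heuristics concrete, here is a rough sketch of the decision rule. The 0.8 threshold comes from the description above; everything else (the regular expressions, the scoring formula, the extension map, and the function names sketch_heuristic and sketch_detect_content_type) is invented for illustration and is not the repository's implementation.

```python
import re
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple


class ContentType(Enum):
    HTML = "html"
    MARKDOWN = "markdown"
    PLAIN = "plain"


# Assumed extension map; the real mapping may cover more extensions.
EXTENSION_MAP = {
    ".html": ContentType.HTML,
    ".htm": ContentType.HTML,
    ".md": ContentType.MARKDOWN,
    ".txt": ContentType.PLAIN,
}

HTML_TAG = re.compile(r"</?(html|body|div|p|h[1-6]|span|a)\b[^>]*>", re.IGNORECASE)
MD_MARKERS = [
    re.compile(r"^#{1,6}\s+\S", re.MULTILINE),  # headings
    re.compile(r"\[[^\]]+\]\([^)]+\)"),          # inline links
    re.compile(r"^```", re.MULTILINE),           # code fences
]


def sketch_heuristic(text: str) -> Tuple[ContentType, float]:
    """Guess a type and a confidence in [0, 1]; the scoring is made up."""
    html_hits = len(HTML_TAG.findall(text))
    md_hits = sum(len(p.findall(text)) for p in MD_MARKERS)
    if html_hits == 0 and md_hits == 0:
        return ContentType.PLAIN, 0.5
    if html_hits >= md_hits:
        return ContentType.HTML, min(1.0, 0.5 + 0.1 * html_hits)
    return ContentType.MARKDOWN, min(1.0, 0.5 + 0.1 * md_hits)


def sketch_detect_content_type(
    text: str, file_path: Optional[str] = None, threshold: float = 0.8
) -> ContentType:
    """Prefer the extension unless the heuristic confidently disagrees."""
    guessed, confidence = sketch_heuristic(text)
    by_ext = EXTENSION_MAP.get(Path(file_path).suffix.lower()) if file_path else None
    if by_ext is None:
        return guessed  # no usable extension: trust the heuristic
    if guessed != by_ext and confidence >= threshold:
        return guessed  # confident disagreement overrides the extension
    return by_ext
```

Under these assumptions, a file named page.txt that is full of HTML tags is classified as HTML, while a Markdown file containing a single stray tag keeps its extension-based type.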

Format-Specific Splitter Selection

Based on detected content type, the pipeline instantiates specialized splitters that respect document semantics while preparing content for vector embeddings:

HTML Content Processing

HTML documents are processed using HTMLHeaderTextSplitter, configured to break at <h1>, <h2>, and <h3> tags (lines 65-73). This preserves logical document sections while creating embedding-friendly segments.

Markdown Content Processing

Markdown files utilize MarkdownHeaderTextSplitter, splitting at #, ##, and ### headings (lines 75-84). Header-based chunking maintains contextual boundaries between document sections.

Plain Text Processing

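For unstructured text, the system falls back to RecursiveCharacterTextSplitter (lines 88-95). This splitter recursively divides text on a hierarchy of separators (paragraph breaks "\n\n", then line breaks "\n", then sentence ends ". ", and so on), moving to a finer separator only when a piece is still too large, and respects the configured CHUNK_SIZE and CHUNK_OVERLAP values.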
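
The recursive idea can be illustrated without any dependencies. The sketch below is a simplified stand-in, not the library's splitter: the function name recursive_split is hypothetical, it counts whitespace-separated words instead of tokens, it ignores overlap, and it drops the separator at each split point.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    max_len: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split on the coarsest separator first, recursing into oversized pieces.

    Length is measured in words as a crude proxy for tokens.
    """
    def length(s: str) -> int:
        return len(s.split())

    if length(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard cut by word count.
        words = text.split()
        return [" ".join(words[i:i + max_len]) for i in range(0, len(words), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if length(candidate) <= max_len:
            current = candidate  # keep merging small pieces together
            continue
        if current.strip():
            chunks.append(current)
        if length(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too big: retry with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Trying coarse separators first means paragraphs stay intact whenever they fit, and sentences or words are only broken apart as a last resort.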

Two-Stage Chunking Pipeline

The chunking process operates in two distinct phases to ensure strict token compliance:

Primary Chunking

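The selected splitter processes the entire document and produces the initial chunks (lines 48-69). The header-based splitters return Document objects rather than strings, so their text content is extracted into plain strings before further processing. Header-based splitting happens in this phase so that chunk boundaries follow the document's semantic structure.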

Secondary Chunking for Oversized Segments

Semantic splitters like HTML and Markdown headers can produce chunks exceeding CHUNK_SIZE. The _apply_secondary_chunking() function (lines 98-115) runs the plain-text splitter on any oversized chunks, ensuring every final segment respects the token limit while maintaining the configured overlap.
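
The two phases can be sketched end to end as follows. This is an illustrative approximation under stated assumptions: headings are found with a regular expression rather than MarkdownHeaderTextSplitter, the secondary pass uses a simple sliding word window instead of RecursiveCharacterTextSplitter, sizes are counted in words rather than tokens, and all function names are hypothetical.

```python
import re
from typing import List


def split_on_headings(markdown: str) -> List[str]:
    """Primary pass: cut before every #, ## or ### heading line."""
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]


def window_split(text: str, size: int, overlap: int) -> List[str]:
    """Secondary pass: fixed-size word windows sharing `overlap` words."""
    words = text.split()
    step = max(1, size - overlap)
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def two_stage_chunk(markdown: str, size: int = 400, overlap_percent: int = 15) -> List[str]:
    """Split by headings, then re-split only the sections that are too large."""
    overlap = size * overlap_percent // 100
    final: List[str] = []
    for section in split_on_headings(markdown):
        if len(section.split()) <= size:
            final.append(section)  # already within budget: keep as-is
        else:
            final.extend(window_split(section, size, overlap))
    return final
```

Sections that already fit are passed through untouched, so the secondary pass only affects the parts of the document where header-based splitting alone was not enough.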

Quality Control and Filtering

After splitting, the pipeline applies final validation steps. Chunks shorter than MIN_CHUNK_SIZE tokens are dropped unless they represent the entire document (lines 77-84), because fragments below this threshold typically yield degraded or null embeddings from many AI providers.
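
A minimal sketch of this filter is shown below. It assumes that "represents the entire document" means the splitter returned a single chunk, and it counts words rather than tokens; both are simplifications, and the function name filter_short_chunks is hypothetical.

```python
from typing import List


def filter_short_chunks(chunks: List[str], min_tokens: int = 5) -> List[str]:
    """Drop chunks below the minimum size, except a lone whole-document chunk."""
    if len(chunks) <= 1:
        # Assumed interpretation: a single chunk is the entire document,
        # so it is kept even when it is very short.
        return list(chunks)
    return [c for c in chunks if len(c.split()) >= min_tokens]
```

For example, a one-line note is kept as its only chunk, while a stray two-word fragment left over from splitting a long document is discarded.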

All size calculations rely on token_count() defined in open_notebook/utils/token_utils.py (lines 15-31), which uses the tiktoken library with a fallback to crude word-count estimation when unavailable.
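
The fallback pattern might look roughly like this. The sketch is not the contents of token_utils.py: the function name approx_token_count, the cl100k_base encoding, and the 1.3 tokens-per-word factor are all assumptions chosen for illustration.

```python
def approx_token_count(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when possible, else estimate from words."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is missing or failed to load its encoding:
        # fall back to a crude estimate (assumed ~1.3 tokens per word).
        return int(len(text.split()) * 1.3)
```

Catching the failure inside the function keeps callers unaware of which path was taken, at the cost of less precise size checks when the estimate is used.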

Implementation Examples

from open_notebook.utils.chunking import chunk_text, ContentType

# Example 1 - Plain text (default)
plain = "Lorem ipsum dolor sit amet, " * 200  # far more than the 400-token default limit
chunks = chunk_text(plain)                     # Uses RecursiveCharacterTextSplitter
print(len(chunks), chunks[0][:60])

# Example 2 - Markdown with headings
md = """
# Title

Intro paragraph...

## Section A

Content for section A...

## Section B

Content for section B...
"""
chunks = chunk_text(md, file_path="notes.md")  # Detects MARKDOWN via extension
print([c[:30] for c in chunks])

# Example 3 - HTML content
html = """
<!DOCTYPE html>
<html><head><title>Demo</title></head>
<body><h1>Header</h1><p>Some paragraph text.</p></body>
</html>
"""
chunks = chunk_text(html, file_path="page.html")  # Detects HTML via extension
print(chunks)

Summary

  • Configurable parameters via environment variables control maximum size, overlap percentage, and minimum thresholds without requiring code modification.
  • Intelligent detection combines file extension mapping with heuristic text analysis to accurately classify HTML, Markdown, and plain text content.
  • Semantic preservation uses header-aware splitters for structured formats before applying recursive character splitting for oversized segments.
  • Token-accurate validation ensures every chunk meets size requirements through secondary chunking and minimum-size filtering, using tiktoken for precise counting.

Frequently Asked Questions

What is content chunking for vector embeddings?

Content chunking for vector embeddings is the process of dividing large documents into smaller, token-limited text segments that can be converted into numerical vectors by AI embedding models. This strategy prevents context window overflows and improves retrieval accuracy by ensuring each chunk contains focused, semantically coherent information.

How does Open Notebook detect content types for chunking?

Open Notebook detects content types in open_notebook/utils/chunking.py by combining two signals: a lookup of the file extension in a known mapping, and a heuristic score computed from structural markers in the raw text, such as HTML tags and Markdown syntax. The extension wins unless the heuristic points to a different format with a confidence of at least 0.8; when the extension is missing or ambiguous, the heuristic result is used.

What happens if a chunk exceeds the maximum token size?

When primary splitters like HTMLHeaderTextSplitter or MarkdownHeaderTextSplitter produce oversized chunks, Open Notebook's _apply_secondary_chunking() function automatically applies RecursiveCharacterTextSplitter to those segments. This secondary pass breaks large chunks into token-compliant pieces while maintaining the configured overlap percentage.

Why filter out small chunks during the chunking process?

The pipeline filters chunks below MIN_CHUNK_SIZE (default 5 tokens) because extremely short text fragments often generate low-quality or null embeddings from many AI providers. The filter is implemented in open_notebook/utils/chunking.py and makes one exception: a chunk that represents the entire document is kept regardless of its length. This keeps the embedding space populated with meaningful, retrievable vectors.
