
# Open Notebook Chunking Strategy: How `chunking.py` Splits Text Before Embedding

> Discover the Open Notebook chunking strategy for text splitting before embedding. Learn how chunking.py intelligently handles HTML, Markdown, and text with LangChain and token filtering.

- Repository: [Luis Novo/open-notebook](https://github.com/lfnovo/open-notebook)
- Tags: how-to-guide
- Published: 2026-06-06

---

**Open Notebook uses a multi-step, content-type-aware pipeline in [`open_notebook/utils/chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py) that detects whether input is HTML, Markdown, or plain text, applies the appropriate LangChain splitter, recursively breaks oversized sections, and filters out tiny fragments below a configurable token threshold.**

The `lfnovo/open-notebook` repository implements an embedding preparation pipeline. Its core **chunking strategy**, defined in [`open_notebook/utils/chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py), transforms raw documents into token-bounded pieces that vector-embedding models can safely consume.

## Environment-Driven Configuration

Open Notebook does not hard-code chunk dimensions. Instead, [`chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py) reads three environment variables:

- `OPEN_NOTEBOOK_CHUNK_SIZE`: maximum tokens per chunk (default: 400).
- `OPEN_NOTEBOOK_CHUNK_OVERLAP`: overlap between consecutive chunks, in tokens (default: 15% of the chunk size).
- `OPEN_NOTEBOOK_MIN_CHUNK_SIZE`: minimum token count for a fragment to be retained (default: 5).

These settings flow into the splitters and govern every downstream decision.
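
As a minimal sketch of how such settings could be read, the snippet below parses the three variables and falls back to the defaults listed above. The helper name `load_chunk_settings` is hypothetical, and treating the overlap as a token count that defaults to 15% of the chunk size is an assumption, not a copy of the repository's code.

```python
import os


def load_chunk_settings(env=os.environ):
    """Hypothetical helper: read the three chunking variables with defaults."""
    chunk_size = int(env.get("OPEN_NOTEBOOK_CHUNK_SIZE", "400"))
    # Assumption: the overlap is a token count defaulting to 15% of the chunk size.
    default_overlap = chunk_size * 15 // 100
    chunk_overlap = int(env.get("OPEN_NOTEBOOK_CHUNK_OVERLAP", str(default_overlap)))
    min_chunk_size = int(env.get("OPEN_NOTEBOOK_MIN_CHUNK_SIZE", "5"))
    return chunk_size, chunk_overlap, min_chunk_size
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check without touching the real environment.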

## Content-Type Detection

Before splitting, the pipeline must decide whether the raw input is HTML, Markdown, or plain text. [`chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py) resolves this through a two-tier detection system.

### Extension-Based Detection

The code maps common file extensions to a `ContentType` enum (`HTML`, `MARKDOWN`, `PLAIN`). If the caller provides a `file_path`, the extension is checked first.
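
The extension lookup can be pictured roughly as follows. This is an illustrative sketch: the enum values, the `EXTENSION_MAP` table, and the `detect_from_extension` helper are assumptions made for the example, and the real mapping may cover different extensions.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class ContentType(Enum):
    HTML = "html"
    MARKDOWN = "markdown"
    PLAIN = "plain"


# Hypothetical mapping; the actual table may list other extensions.
EXTENSION_MAP = {
    ".html": ContentType.HTML,
    ".htm": ContentType.HTML,
    ".md": ContentType.MARKDOWN,
    ".markdown": ContentType.MARKDOWN,
    ".txt": ContentType.PLAIN,
}


def detect_from_extension(file_path: Optional[str]) -> Optional[ContentType]:
    """Return the mapped type, or None when the path is missing or unknown."""
    if not file_path:
        return None
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    return EXTENSION_MAP.get(suffix)
```

Returning `None` for unknown extensions leaves room for a content-based fallback to make the call.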

### Heuristic-Based Fallback

When the extension is missing or generic, the system inspects the first 5,000 characters and scores HTML-specific patterns (e.g., `<!DOCTYPE>`, `<html>`, heading tags) and Markdown-specific patterns (e.g., headers, links, code fences, lists). The heuristic returns a detected type together with a confidence score. The final decision prefers the extension guess, but a very high-confidence heuristic result can override a plain-text assumption.
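
A simplified version of this scoring idea might look like the sketch below. The patterns, weights, the `0.8` override threshold, and the function names are all assumptions chosen for illustration; types are plain strings here to keep the example self-contained.

```python
import re

SCAN_LIMIT = 5000  # only the first 5,000 characters are inspected

# Hypothetical (pattern, weight) pairs; the real heuristic may differ.
HTML_PATTERNS = [
    (re.compile(r"<!DOCTYPE\s+html", re.IGNORECASE), 3),
    (re.compile(r"<html[\s>]", re.IGNORECASE), 3),
    (re.compile(r"<h[1-6][\s>]", re.IGNORECASE), 1),
]
MARKDOWN_PATTERNS = [
    (re.compile(r"^#{1,6}\s+\S", re.MULTILINE), 2),    # headers
    (re.compile(r"\[[^\]]+\]\([^)]+\)"), 1),           # inline links
    (re.compile(r"^`{3}", re.MULTILINE), 2),           # code fences
    (re.compile(r"^\s*[-*+]\s+\S", re.MULTILINE), 1),  # list items
]


def _score(sample, patterns):
    """Sum the weights of every pattern found at least once."""
    return sum(weight for pattern, weight in patterns if pattern.search(sample))


def detect_from_content(text):
    """Return (type, confidence), where confidence is score / maximum score."""
    sample = text[:SCAN_LIMIT]
    html = _score(sample, HTML_PATTERNS)
    md = _score(sample, MARKDOWN_PATTERNS)
    if html == 0 and md == 0:
        return "plain", 0.0
    if html >= md:
        return "html", html / sum(w for _, w in HTML_PATTERNS)
    return "markdown", md / sum(w for _, w in MARKDOWN_PATTERNS)


def resolve(extension_guess, text, override_threshold=0.8):
    """Prefer the extension; let a confident heuristic override 'plain'."""
    kind, confidence = detect_from_content(text)
    if extension_guess is None:
        return kind
    if extension_guess == "plain" and confidence >= override_threshold:
        return kind
    return extension_guess
```

With these assumed weights, a full HTML document saved as `.txt` would still be routed to the HTML path, while a lightly formatted note would stay plain text.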

## Splitter Selection

Once the content type is known, [`chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py) instantiates a LangChain splitter tuned to that format.

- **HTML** — `HTMLHeaderTextSplitter` configured for `<h1>`, `<h2>`, and `<h3>` elements.
- **Markdown** — `MarkdownHeaderTextSplitter` configured for `#`, `##`, and `###` headers.
- **Plain text** — `RecursiveCharacterTextSplitter` using the configured `CHUNK_SIZE` and `CHUNK_OVERLAP`, a token-count length function (provided by [`open_notebook/utils/token_utils.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/token_utils.py)), and a separator hierarchy of `"\n\n"`, `"\n"`, `". "`, `", "`, `" "`, and `""`.
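
To show what the separator hierarchy buys, here is a dependency-free sketch of recursive splitting. It is not LangChain's implementation: it omits overlap, drops the separator at chunk boundaries, and uses a whitespace word count as a stand-in for the real token counter. The names `recursive_split` and `count_tokens` are hypothetical.

```python
SEPARATORS = ["\n\n", "\n", ". ", ", ", " ", ""]


def count_tokens(text):
    # Stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def recursive_split(text, max_tokens, separators=SEPARATORS):
    """Split on the coarsest separator first, recursing to finer ones."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing finer left to split on
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, max_tokens, finer))
    if current:
        chunks.append(current)
    return chunks
```

Paragraph breaks are tried before line breaks, sentences before clauses, and words only as a last resort, so chunks tend to end at natural boundaries.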

## Primary and Secondary Chunking

The `chunk_text` function is the entry point used by the embedding pipeline. It first checks whether the input is already small enough (≤ `CHUNK_SIZE` tokens); if so, it returns the text unchanged as a single chunk. Otherwise, it detects the content type (unless overridden by the caller) and runs the selected splitter to produce raw chunks.

Because semantic splitters such as `HTMLHeaderTextSplitter` and `MarkdownHeaderTextSplitter` can emit large sections—for example, a massive `<div>`—the pipeline applies a **secondary chunking** step. Any chunk that exceeds `CHUNK_SIZE` tokens is fed into the plain-text `RecursiveCharacterTextSplitter` to guarantee token-safe boundaries.
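
The two-pass flow described in this section can be sketched as below. The function name `chunk_with_fallback` and the idea of passing the splitters in as callables are illustrative assumptions, and the word-count tokenizer is a stand-in for the real token counter.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def chunk_with_fallback(
    text: str,
    max_tokens: int,
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Run the primary splitter, then re-split any oversized section."""
    # Short-circuit: small inputs are returned as a single chunk.
    if count_tokens(text) <= max_tokens:
        return [text]
    result: List[str] = []
    for section in primary(text):
        if count_tokens(section) <= max_tokens:
            result.append(section)
        else:
            # Secondary chunking: enforce the token limit on large sections.
            result.extend(fallback(section))
    return result
```

Sections that already fit are passed through untouched, so the semantic boundaries found by the header-based splitter survive wherever possible.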

## Filtering Tiny Fragments

After splitting, [`chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py) filters out noise. Any chunk whose token count falls below `MIN_CHUNK_SIZE` is dropped, unless dropping it would leave the result set empty. This filter keeps meaningless punctuation-only fragments away from the embedding model. Such fragments matter because some providers, notably `llama.cpp`-based services, can return null vectors for them, which crashes downstream processing.
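
A minimal sketch of this filter, including the guard against an empty result, is shown below. The name `filter_small_chunks` and the default word-count tokenizer are assumptions for the example.

```python
def filter_small_chunks(chunks, min_tokens, count_tokens=lambda t: len(t.split())):
    """Drop chunks below min_tokens, unless that would drop everything."""
    kept = [chunk for chunk in chunks if count_tokens(chunk) >= min_tokens]
    # Guard: if nothing survives, fall back to the unfiltered chunks.
    return kept if kept else list(chunks)
```

The fallback means a very short document still yields at least one chunk to embed.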

## Code Examples

The following examples demonstrate how to call the chunking pipeline in practice.

```python
from open_notebook.utils.chunking import chunk_text

# Example: raw HTML document
with open("article.html", "r", encoding="utf-8") as f:
    html = f.read()

# Automatic detection + chunking: the file_path extension is checked first
chunks = chunk_text(html, file_path="article.html")
print(f"Generated {len(chunks)} chunks")
```

```python
from open_notebook.utils.chunking import chunk_text, ContentType

# For pure Markdown (override detection if you already know the type)
md = "# Title\n\nSome text\n\n## Section\nMore content"

chunks = chunk_text(md, content_type=ContentType.MARKDOWN)
```

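```python
import os

# Plain text with custom environment variables.
# Set them before importing the module, in case the values are read at import time.
os.environ["OPEN_NOTEBOOK_CHUNK_SIZE"] = "500"
os.environ["OPEN_NOTEBOOK_CHUNK_OVERLAP"] = "75"  # 75 tokens = 15% of 500

from open_notebook.utils.chunking import chunk_text

text = "Long paragraph ... " * 200
chunks = chunk_text(text)  # uses the 500-token limit and the 75-token overlap
```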

## Summary

- Open Notebook centralizes its **chunking strategy** in [`open_notebook/utils/chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py).
- Behavior is controlled by environment variables for chunk size, overlap, and minimum fragment size.
- Content type is determined by extension first, then by a 5,000-character heuristic scan.
- Splitters are matched to format: HTML headers, Markdown headers, or recursive plain-text splitting.
- Oversized semantic chunks undergo secondary recursive splitting to enforce token limits.
- Fragments below the minimum size are discarded to avoid null-vector crashes.

## Frequently Asked Questions

### How does Open Notebook decide whether a file is HTML or Markdown?

It first checks the file extension against a known mapping. If the extension is ambiguous or missing, it scores the first 5,000 characters for HTML and Markdown signatures and selects the type with the highest confidence.

### What happens if a single HTML section is larger than the token limit?

The `HTMLHeaderTextSplitter` may produce large sections. When a chunk exceeds `CHUNK_SIZE`, the pipeline automatically applies the `RecursiveCharacterTextSplitter` to break it into smaller, token-safe pieces.

### Why does the pipeline drop very small chunks?

Chunks below `MIN_CHUNK_SIZE` tokens (default 5) are removed because punctuation-only fragments generate low-quality embeddings. Some embedding providers return null vectors for such fragments, which can cause runtime failures.

### Can I force a specific splitter without relying on auto-detection?

Yes. The `chunk_text` function accepts an optional `content_type` argument. Pass `ContentType.HTML`, `ContentType.MARKDOWN`, or `ContentType.PLAIN` to bypass extension and heuristic detection.