How the `token_count()` Function Estimates Token Usage and Detects Large Contexts in Open Notebook
The token_count() function in Open Notebook estimates token consumption by first attempting exact tokenization with tiktoken's "o200k_base" encoding, then falling back to a word count multiplied by 1.3 when tiktoken is unavailable. The model provisioning code uses the result to select a large-context model when content exceeds 105,000 tokens.
Open Notebook relies on token estimation to manage language model costs and prevent context window overflow. The token_count() function serves as the central utility for this measurement across the codebase, and its output drives model selection, text chunking, and cost estimates.
Implementation Details of token_count()
The core implementation resides in open_notebook/utils/token_utils.py. This utility provides a robust two-tier approach to token estimation.
Exact Tokenization with tiktoken
When available, the function imports tiktoken and instantiates the "o200k_base" encoding, which is the encoding used by OpenAI's GPT-4o model family. This yields an exact token count for models that share that encoding; for models with a different tokenizer the result is an approximation. The encoding is deterministic, so repeated calls on the same string return the same count.
Fallback Heuristic for Offline Environments
If tiktoken cannot be imported, for example because the dependency is missing or the environment is offline, the function degrades gracefully to a word-count estimate multiplied by 1.3. The multiplier approximates the average number of tokens per word in English prose, so the estimate stays reasonable when an exact tokenizer is unavailable.
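The two tiers described above can be sketched roughly as follows. This is an illustration, not the code in token_utils.py: the names estimate_tokens and word_estimate are hypothetical, truncating the heuristic to an integer is an assumption, and the sketch catches any exception rather than only a failed import so that it also runs where the encoding file cannot be downloaded.

```python
def word_estimate(text: str) -> int:
    """Fallback tier: whitespace-separated word count scaled by 1.3."""
    # Truncating to int is an assumption of this sketch.
    return int(len(text.split()) * 1.3)


def estimate_tokens(text: str) -> int:
    """Exact count via tiktoken when usable, word heuristic otherwise."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding("o200k_base")
        return len(encoding.encode(text))
    except Exception:
        # Covers a missing package as well as a failed vocabulary download
        # in an offline environment (broader than an import check alone).
        return word_estimate(text)


print(estimate_tokens("Hello, world! This is a short test."))
```

Because the two tiers can disagree on the same input, callers that compare counts across machines may want to know which tier produced them.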
Cost Estimation Integration
The same module exposes token_cost(), which converts the output of token_count() into estimated monetary costs. This allows the system to log and budget API expenses based on actual token consumption rather than rough character counts.
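The conversion from tokens to money is simple arithmetic. The sketch below shows one way it could look; the function name estimate_cost, its parameters, and the per-million-token rate are all assumptions made for illustration, and the actual signature of token_cost() may differ.

```python
def estimate_cost(token_total: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into an estimated cost in US dollars."""
    if token_total < 0:
        raise ValueError("token_total must be non-negative")
    return token_total / 1_000_000 * usd_per_million_tokens


# 250,000 tokens at a made-up rate of $2.00 per million tokens.
print(f"${estimate_cost(250_000, 2.00):.2f}")  # prints $0.50
```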
Large-Context Detection Logic
The primary production use of token_count() occurs in open_notebook/ai/provision.py. Here, the function drives intelligent model routing based on content size.
When provision_langchain_model() receives content for processing, it immediately invokes token_count(content) at line 19. If the returned value exceeds 105,000 tokens, the system logs a warning at line 26 and automatically selects the large-context model via model_manager.get_default_model("large_context"). This threshold prevents truncation errors by ensuring the system switches to models capable of handling extended prompts before processing begins.
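The routing decision reduces to a single comparison against the threshold. The following sketch isolates that comparison; select_model_type is a hypothetical stand-in, and the real provision_langchain_model() may weigh other inputs (such as an explicit model ID) that are not modeled here.

```python
LARGE_CONTEXT_THRESHOLD = 105_000  # tokens, per the description above


def select_model_type(token_total: int, default_type: str = "chat") -> str:
    """Return the model category to request for a given content size."""
    if token_total > LARGE_CONTEXT_THRESHOLD:
        return "large_context"
    return default_type


for size in (2_000, 105_000, 105_001):
    print(size, select_model_type(size))
```

The comparison is strict, so content of exactly 105,000 tokens stays on the default model in this sketch.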
Token Counting Across the Codebase
Beyond model provisioning, token_count() maintains consistent size measurements throughout the data pipeline.
Text Chunking
In open_notebook/utils/chunking.py, the chunker repeatedly queries token_count() to split documents into pieces that respect configured token budgets such as CHUNK_SIZE. This ensures no chunk exceeds the embedding model's input limits.
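A token-budgeted splitter can be sketched as a greedy loop that keeps adding words until the next one would exceed the budget. This is not the algorithm in chunking.py, which may split on sentences or other separators; split_by_budget and word_counter are hypothetical names, and the counter is passed in so the example runs without tiktoken.

```python
from typing import Callable, List


def split_by_budget(
    text: str,
    max_tokens: int,
    count: Callable[[str], int],
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count(candidate) > max_tokens:
            # Adding this word would overflow the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def word_counter(text: str) -> int:
    """Stand-in counter: one token per word, so the sketch runs anywhere."""
    return len(text.split())


pieces = split_by_budget("a b c d e f g", max_tokens=3, count=word_counter)
print(pieces)  # ['a b c', 'd e f', 'g']
```

Two limitations are worth noting: a single word larger than the budget still becomes its own oversized chunk, and re-counting the candidate on every step costs quadratic time on long inputs.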
Embedding Pipeline
The embedding utility in open_notebook/utils/embedding.py records token sizes for each text chunk before vectorization. These measurements inform decisions about whether to embed content directly or require pre-splitting, optimizing both API usage and storage.
Context Building
open_notebook/utils/context_builder.py employs lazy evaluation, computing token counts only when not explicitly provided. This enables downstream components to perform size checks without redundant re-tokenization of previously processed documents.
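The compute-only-when-missing pattern can be illustrated with a small container that stores a count once it is known. ContextItem and counting_stub below are hypothetical and do not reflect the actual classes in context_builder.py.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ContextItem:
    """Hypothetical container pairing text with an optional known token count."""

    text: str
    token_count: Optional[int] = None

    def tokens(self, count: Callable[[str], int]) -> int:
        # Tokenize only when no count was supplied, then remember the result.
        if self.token_count is None:
            self.token_count = count(self.text)
        return self.token_count


calls = []


def counting_stub(text: str) -> int:
    """Stand-in counter that records each invocation."""
    calls.append(text)
    return len(text.split())


item = ContextItem("three short words")
item.tokens(counting_stub)
item.tokens(counting_stub)
print(len(calls))  # 1: the second call reuses the stored value

known = ContextItem("already measured", token_count=42)
print(known.tokens(counting_stub), len(calls))  # 42 1
```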
Practical Code Examples
The following examples demonstrate common usage patterns for the token_count() function.
Direct token counting:
from open_notebook.utils.token_utils import token_count
text = "Hello, world! This is a short test."
print(token_count(text))  # exact count with tiktoken, or the 1.3x word estimate without it
Large-context model selection:
from open_notebook.ai.provision import provision_langchain_model
async def get_model_for_content(content):
    # Automatically selects large-context model if content > 105k tokens
    return await provision_langchain_model(
        content,
        model_id=None,
        default_type="chat",
    )
Chunking by token budget:
from open_notebook.utils.chunking import chunk_text
from open_notebook.utils.token_utils import token_count
MAX_TOKENS = 2_000
long_document = "Lorem ipsum dolor sit amet. " * 2_000  # placeholder input text
chunks = chunk_text(long_document, max_tokens=MAX_TOKENS)
# Verify all chunks respect the limit
assert all(token_count(c) <= MAX_TOKENS for c in chunks)
Summary
- The `token_count()` function resides in `open_notebook/utils/token_utils.py` and serves as the single source of truth for token estimation across the Open Notebook codebase.
- It prioritizes exact counts via tiktoken's "o200k_base" encoding, falling back to a word-count heuristic (×1.3) when the library is unavailable.
- Large-context detection occurs in `open_notebook/ai/provision.py`, where content exceeding 105,000 tokens triggers automatic selection of a high-capacity model.
- The utility supports downstream operations including text chunking, embedding size validation, and cost estimation through `token_cost()`.
Frequently Asked Questions
What encoding does the token_count() function use?
The function attempts to use tiktoken's "o200k_base" encoding first, which corresponds to modern OpenAI model tokenization schemes. If tiktoken is not installed, it falls back to a statistical estimate based on word count.
What happens if tiktoken is not installed?
When tiktoken is unavailable, the function calculates a word count and multiplies by 1.3 to approximate token density. This heuristic provides reasonable estimates for English text without requiring external dependencies.
Why is the large-context threshold set to 105,000 tokens?
The 105,000 token threshold in open_notebook/ai/provision.py acts as a safety buffer below common large-context model limits (such as 128k or 200k contexts). This ensures adequate headroom for system prompts and response generation while preventing truncation errors.
How accurate is the fallback word-count heuristic?
The 1.3× multiplier statistically approximates the average ratio of tokens to words in English prose. While less precise than tiktoken for code or non-English text, it provides sufficient accuracy for chunking decisions and rough cost estimation when exact tokenizers are unavailable.