How the Open Notebook sources_service Handles Content Extraction and Vectorization
The sources_service in lfnovo/open-notebook is an API-layer service that performs no extraction or embedding itself. That work is delegated to a LangGraph workflow in open_notebook/graphs/source.py, where content_core extracts the raw text and embeddings generated through the Esperanto abstraction are stored in SurrealDB.
The ingestion pipeline in the lfnovo/open-notebook repository centers on the sources_service, which handles the initial request when users submit files, URLs, or raw text. While the service manages the API boundary and returns structured results, the actual content extraction and vectorization occur inside a dedicated workflow graph. Understanding this boundary is essential for customizing ingestion behavior or debugging embedding failures.
The API Layer: What SourcesService Actually Does
SourcesService acts strictly as a wrapper around FastAPI client operations. When you call create_source, the service forwards the request through api_client.create_source and passes the embed parameter that signals whether the backend should generate vector embeddings after extraction.
If async_processing is disabled, the call returns a fully materialized Source object. If enabled, it returns a SourceProcessingResult containing a command ID for polling. The service itself never touches PDF parsers, embedding models, or vector databases; it only initiates the request and formats the response.
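A caller therefore has to handle two result shapes. The following is a minimal sketch of that branching, not the project's code: the two dataclasses are hypothetical stand-ins that model only the fields named above, and describe_result is an illustrative helper.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Source:
    """Hypothetical stand-in for the fully materialized source record."""
    id: str
    full_text: Optional[str] = None


@dataclass
class SourceProcessingResult:
    """Hypothetical stand-in for the async result; only command_id is modeled."""
    command_id: str


def describe_result(result: Union[Source, SourceProcessingResult]) -> str:
    """Return a short message depending on which shape came back."""
    if isinstance(result, SourceProcessingResult):
        # Async path: extraction has not finished, so keep the ID for polling.
        return f"queued: poll command {result.command_id}"
    # Sync path: the record is already complete.
    return f"ready: source {result.id}"
```

Branching on the type, rather than on which attributes happen to be present, keeps the two paths explicit.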
How the LangGraph Workflow Orchestrates Extraction and Embedding
Once the backend receives the request, it persists a Source record and immediately invokes the source_graph LangGraph workflow defined in open_notebook/graphs/source.py. This graph coordinates the heavy lifting across two primary nodes: content_process and save_source.
Extracting Raw Text with the content_process Node
The content_process node calls extract_content(state["content_state"]) from the content_core library. This utility supports diverse input types—including PDF, DOCX, audio files, and YouTube URLs—and returns a processed_state dictionary containing content, title, url, and file_path.
In open_notebook/graphs/source.py, the workflow passes the incoming source state into this node, which delegates format-specific parsing to content_core. The result is a standardized text payload that downstream nodes can handle uniformly regardless of the original file type.
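The shape of this node can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: fake_extract_content is a hypothetical stand-in for content_core's extract_content and does no real parsing. It only shows how different inputs collapse into the same four keys.

```python
from typing import Any, Dict


def fake_extract_content(content_state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the real extractor; no parsing happens here."""
    if content_state.get("file_path"):
        text = f"[text extracted from {content_state['file_path']}]"
    elif content_state.get("url"):
        text = f"[text fetched from {content_state['url']}]"
    else:
        # Raw text input passes through unchanged.
        text = content_state.get("content", "")
    # Every input type ends up with the same four keys.
    return {
        "content": text,
        "title": content_state.get("title"),
        "url": content_state.get("url"),
        "file_path": content_state.get("file_path"),
    }


def content_process(state: Dict[str, Any]) -> Dict[str, Any]:
    """Node-style function: read content_state, return a partial state update."""
    processed_state = fake_extract_content(state["content_state"])
    return {"content_state": processed_state}
```

Because the node returns a dictionary with a fixed set of keys, later nodes never need to know whether the source was a file, a URL, or pasted text.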
Persisting Content and Triggering Vectorization in save_source
After extraction, the save_source node retrieves the existing Source record, updates its asset metadata, and writes the extracted full_text into the database. If the original request included embed=True, this node also executes await source.vectorize() before the workflow completes.
The save_source node is also responsible for optional title overrides: if the user provided a placeholder title, the node can replace it with inferred metadata from the extraction step. Once persistence and vectorization finish, the workflow may invoke additional transformation graphs—such as summarization—through trigger_transformations.
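The control flow described for save_source can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: FakeSource, its save method, the PLACEHOLDER_TITLES values, and the layout of the state dictionary are hypothetical. Only the order of operations (write the text, optionally override the title, vectorize when embed is set) follows the description above.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeSource:
    """In-memory stand-in for the persisted Source record (hypothetical)."""
    title: Optional[str] = None
    full_text: Optional[str] = None
    vectorized: bool = False

    async def save(self) -> None:
        # A real record would be written to the database here.
        pass

    async def vectorize(self) -> None:
        # A real implementation would chunk, embed, and store vectors.
        self.vectorized = True


# Hypothetical placeholder values; the real check is not specified in the text.
PLACEHOLDER_TITLES = {None, "", "Processing..."}


async def save_source(source: FakeSource, state: Dict[str, Any]) -> FakeSource:
    extracted = state["content_state"]
    source.full_text = extracted["content"]
    # Override the title only when the user left a placeholder.
    if source.title in PLACEHOLDER_TITLES and extracted.get("title"):
        source.title = extracted["title"]
    await source.save()
    # Vectorization is opt-in and runs after the text is persisted.
    if state.get("embed"):
        await source.vectorize()
    return source
```

Running it with asyncio.run(save_source(FakeSource(title=""), {"content_state": {"content": "body", "title": "Inferred"}, "embed": True})) yields a record whose title is "Inferred" and whose vectorized flag is set.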
Inside Source.vectorize: Chunking, Embedding, and Storage
The actual embedding logic lives in open_notebook/domain/notebook.py inside the Source.vectorize method. This method performs three distinct operations:
- Chunking – It splits the stored full_text into discrete chunks suitable for the selected embedding model's context window.
- Embedding generation – It calls the AI provider through the Esperanto abstraction layer. The provider selection and API routing logic is further supported by open_notebook/utils/embedding.py.
- Vector storage – It writes the resulting embedding vectors to SurrealDB, enabling semantic search across notebook sources.
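The three steps can be illustrated with a self-contained sketch. It does not reproduce Source.vectorize: the fixed-size character chunking, the hash-based fake_embed function, and the in-memory list used as a store are all assumptions chosen so the example runs without an embedding provider or a database.

```python
import hashlib
from typing import Dict, List


def chunk_text(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
        start += size - overlap
    return chunks


def fake_embed(chunk: str, dims: int = 8) -> List[float]:
    """Deterministic placeholder vector; a real system calls an embedding model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dims]]


def vectorize(source_id: str, full_text: str, store: List[Dict]) -> int:
    """Chunk, embed, and append one record per chunk to the store."""
    chunks = chunk_text(full_text)
    for order, chunk in enumerate(chunks):
        store.append({
            "source": source_id,
            "order": order,
            "content": chunk,
            "embedding": fake_embed(chunk),
        })
    return len(chunks)
```

Storing the chunk text and its position next to each vector is one common choice, since it lets a search result be traced back to the passage that produced it.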
This design cleanly separates the domain model (Source) from the ingestion transport (SourcesService) and the workflow engine (LangGraph).
Code Examples: Ingesting Sources with Open Notebook
You can request synchronous ingestion with immediate vectorization using create_source:
from api.sources_service import sources_service

result = sources_service.create_source(
    notebooks=["notebook-123"],
    source_type="upload",
    file_path="/tmp/report.pdf",
    title="Quarterly Report",
    embed=True,  # request vectorization
    async_processing=False,
)

# `result` is a `Source` instance with `full_text` and `embedded_chunks` populated
print(result.id, result.full_text[:200])
For large files or slow URLs, use the asynchronous path to avoid blocking:
from api.sources_service import sources_service

async_result = sources_service.create_source_async(
    notebooks=["notebook-123"],
    source_type="url",
    url="https://example.com/article.html",
    embed=True,
)
print("Command ID:", async_result.command_id)

# Poll the status later:
status = sources_service.get_source_status(async_result.source.id)
print(status)
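A single status check is rarely enough, so a small polling loop may help. The helper below is a generic sketch. The shape of the object returned by get_source_status is not described here, so the caller supplies both the fetch function and the completion test, and the status strings in the usage note are hypothetical.

```python
import time
from typing import Any, Callable


def wait_until_done(
    fetch_status: Callable[[], Any],
    is_done: Callable[[Any], bool],
    interval: float = 2.0,
    timeout: float = 120.0,
) -> Any:
    """Call fetch_status repeatedly until is_done accepts the result."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("source processing did not finish in time")
        time.sleep(interval)
```

With the objects from the example above, one possible call is wait_until_done(lambda: sources_service.get_source_status(async_result.source.id), is_done=lambda s: ...), where the is_done predicate depends on how the status object reports completion.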
Summary
- SourcesService is an API-layer orchestrator in api/sources_service.py; it does not extract text or generate embeddings itself.
- The source_graph LangGraph workflow in open_notebook/graphs/source.py handles the actual pipeline through the content_process and save_source nodes.
- content_core.extract_content parses files and URLs into standardized full_text regardless of input format.
- Source.vectorize in open_notebook/domain/notebook.py chunks text, retrieves embeddings via Esperanto, and stores vectors in SurrealDB.
- The embed=True parameter triggers vectorization inside the save_source node after text extraction completes.
Frequently Asked Questions
Does sources_service extract PDF text directly?
No. SourcesService only forwards the creation request to the backend. According to the lfnovo/open-notebook source code, the actual PDF parsing is performed inside the content_process node of open_notebook/graphs/source.py, which delegates to content_core.extract_content via extract_content(state["content_state"]).
Where does the embedding generation happen in the Open Notebook pipeline?
Embedding generation occurs in the Source.vectorize method defined in open_notebook/domain/notebook.py. The workflow node save_source calls this method after extraction if the request included embed=True. The method uses the Esperanto abstraction and open_notebook/utils/embedding.py to call the configured AI provider.
What database stores the vector embeddings?
The vectors are written to SurrealDB. After Source.vectorize chunks the full_text and obtains numerical embeddings through the selected provider, it persists those vectors into SurrealDB to power semantic search across sources and notebooks.
Can I process large files without blocking the API request?
Yes. Use sources_service.create_source_async instead of create_source. This returns a SourceProcessingResult with a command_id that you can poll through get_source_status, allowing the LangGraph workflow to run extraction and vectorization in the background.