
How the Open Notebook sources_service Handles Content Extraction and Vectorization

June 7, 2026 lfnovo/open-notebook

The sources_service in lfnovo/open-notebook is an API-layer orchestrator. It delegates all extraction and embedding work to a LangGraph workflow in open_notebook/graphs/source.py, where content_core parses the raw input into text and embeddings generated through the Esperanto abstraction are stored in SurrealDB.

The ingestion pipeline in the lfnovo/open-notebook repository centers on the sources_service, which handles the initial request when users submit files, URLs, or raw text. While the service manages the API boundary and returns structured results, the actual content extraction and vectorization occur inside a dedicated workflow graph. Understanding this boundary is essential for customizing ingestion behavior or debugging embedding failures.

The API Layer: What `SourcesService` Actually Does

SourcesService acts strictly as a wrapper around FastAPI client operations. When you call create_source, the service forwards the request through api_client.create_source and passes the embed parameter that signals whether the backend should generate vector embeddings after extraction.

If async_processing is disabled, the call returns a fully materialized Source object. If enabled, it returns a SourceProcessingResult containing a command ID for polling. The service itself never touches PDF parsers, embedding models, or vector databases; it only initiates the request and formats the response.
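Because the return type depends on the processing mode, calling code usually needs to branch on it. The following is a minimal sketch of that branching, not the project's own code: the two dataclasses are simplified stand-ins for Source and SourceProcessingResult, and their field names (including source_id) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Source:
    """Stand-in for the materialized source returned by the synchronous path."""
    id: str
    full_text: str


@dataclass
class SourceProcessingResult:
    """Stand-in for the asynchronous result; field names are assumptions."""
    command_id: str
    source_id: Optional[str] = None


def describe_result(result: Union[Source, SourceProcessingResult]) -> str:
    """Return a short message describing what the caller should do next."""
    if isinstance(result, SourceProcessingResult):
        # Asynchronous path: nothing is extracted yet, only a handle for polling.
        return f"queued: poll command {result.command_id}"
    # Synchronous path: extraction has already finished.
    return f"ready: source {result.id} has {len(result.full_text)} characters"
```

Keeping this branch in one helper means the rest of the caller does not need to know which mode was requested.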

How the LangGraph Workflow Orchestrates Extraction and Embedding

Once the backend receives the request, it persists a Source record and immediately invokes the source_graph LangGraph workflow defined in open_notebook/graphs/source.py. This graph coordinates the heavy lifting across two primary nodes: content_process and save_source.

Extracting Raw Text with the `content_process` Node

The content_process node calls extract_content(state["content_state"]) from the content_core library. This utility supports diverse input types—including PDF, DOCX, audio files, and YouTube URLs—and returns a processed_state dictionary containing content, title, url, and file_path.

In open_notebook/graphs/source.py, the workflow passes the incoming source state into this node, which delegates format-specific parsing to content_core. The result is a standardized text payload that downstream nodes can handle uniformly regardless of the original file type.
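The shape of such a node can be sketched without the real library. In the example below, fake_extract_content is a hypothetical stand-in for content_core's extract_content (assumed here to be awaitable), and the placeholder text it returns is invented for illustration. Only the state keys content, title, url, and file_path come from the description above.

```python
import asyncio
from typing import Any, Dict


async def fake_extract_content(content_state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for content_core's extract_content."""
    if "content" in content_state:
        # Raw text was submitted directly, so there is nothing to parse.
        text = content_state["content"]
    else:
        origin = content_state.get("url") or content_state.get("file_path")
        text = f"(text extracted from {origin})"
    return {
        "content": text,
        "title": content_state.get("title") or "Untitled",
        "url": content_state.get("url"),
        "file_path": content_state.get("file_path"),
    }


async def content_process(state: Dict[str, Any]) -> Dict[str, Any]:
    """Node sketch: run extraction and return the updated slice of state."""
    processed_state = await fake_extract_content(state["content_state"])
    return {"content_state": processed_state}
```

Running asyncio.run(content_process({"content_state": {"url": "https://example.com/a"}})) yields a dictionary with the same four keys whether the input was a URL, a file path, or raw text, which is what lets later nodes ignore the original format.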

Persisting Content and Triggering Vectorization in `save_source`

After extraction, the save_source node retrieves the existing Source record, updates its asset metadata, and writes the extracted full_text into the database. If the original request included embed=True, this node also executes await source.vectorize() before the workflow completes.

The save_source node is also responsible for optional title overrides: if the user provided a placeholder title, the node can replace it with inferred metadata from the extraction step. Once persistence and vectorization finish, the workflow may invoke additional transformation graphs—such as summarization—through trigger_transformations.
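The control flow of this node can be sketched as follows. This is an illustration under stated assumptions rather than the repository's implementation: FakeSource is a hypothetical in-memory record, the PLACEHOLDER_TITLE value is invented, and the source is passed in directly instead of being loaded from the database.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Assumed placeholder value; the real project may use a different marker.
PLACEHOLDER_TITLE = "Processing..."


@dataclass
class FakeSource:
    """Hypothetical in-memory stand-in for the persisted Source record."""
    title: str
    full_text: Optional[str] = None
    vectorized: bool = False

    async def save(self) -> None:
        pass  # a real record would be written to the database here

    async def vectorize(self) -> None:
        self.vectorized = True  # a real record would chunk and embed here


async def save_source(state: Dict[str, Any], source: FakeSource) -> FakeSource:
    """Node sketch: persist extracted text, then embed only when requested."""
    extracted = state["content_state"]
    source.full_text = extracted["content"]
    # Replace a placeholder title with the one inferred during extraction.
    if source.title == PLACEHOLDER_TITLE and extracted.get("title"):
        source.title = extracted["title"]
    await source.save()
    if state.get("embed"):
        await source.vectorize()
    return source
```

Saving before vectorizing means the extracted text is kept even if the embedding step later fails, which is one reason to order the two steps this way.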

Inside `Source.vectorize`: Chunking, Embedding, and Storage

The actual embedding logic lives in open_notebook/domain/notebook.py inside the Source.vectorize method. This method performs three distinct operations:

1. Chunking – It splits the stored full_text into discrete chunks suitable for the selected embedding model's context window.
2. Embedding generation – It calls the AI provider through the Esperanto abstraction layer, with provider selection and API routing handled with help from open_notebook/utils/embedding.py.
3. Vector storage – It writes the resulting embedding vectors to SurrealDB, enabling semantic search across notebook sources.

This design cleanly separates the domain model (Source) from the ingestion transport (SourcesService) and the workflow engine (LangGraph).
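The three operations can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the fixed-size character chunking with overlap, the toy_embed function (which stands in for a real embedding model), and the plain list that stands in for SurrealDB. The actual Source.vectorize may chunk, embed, and store differently.

```python
from typing import Callable, Dict, List

Vector = List[float]


def chunk_text(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window just emitted already reached the end
        start += size - overlap
    return chunks


def toy_embed(chunk: str) -> Vector:
    """Toy embedding: [length, vowel count]. A real model returns dense vectors."""
    vowels = sum(ch in "aeiou" for ch in chunk.lower())
    return [float(len(chunk)), float(vowels)]


def vectorize(
    full_text: str,
    embed: Callable[[str], Vector],
    store: List[Dict[str, object]],
    size: int = 200,
    overlap: int = 20,
) -> int:
    """Chunk the text, embed each chunk, and append records to the store."""
    chunks = chunk_text(full_text, size, overlap)
    for order, chunk in enumerate(chunks):
        store.append({"order": order, "content": chunk, "embedding": embed(chunk)})
    return len(chunks)
```

Passing the embedding function and the store in as arguments mirrors the separation described above: the chunking logic does not need to know which provider or database sits behind it.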

Code Examples: Ingesting Sources with Open Notebook

You can request synchronous ingestion with immediate vectorization using create_source:

from api.sources_service import sources_service

result = sources_service.create_source(
    notebooks=["notebook-123"],
    source_type="upload",
    file_path="/tmp/report.pdf",
    title="Quarterly Report",
    embed=True,              # request vectorization
    async_processing=False,  # wait for the finished Source
)

# `result` is a `Source` instance with `full_text` and `embedded_chunks` populated

print(result.id, result.full_text[:200])

For large files or slow URLs, use the asynchronous path to avoid blocking:

from api.sources_service import sources_service

async_result = sources_service.create_source_async(
    notebooks=["notebook-123"],
    source_type="url",
    url="https://example.com/article.html",
    embed=True,
)

print("Command ID:", async_result.command_id)

# Poll the status later:

status = sources_service.get_source_status(async_result.source.id)
print(status)
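A single status check is rarely enough, so callers typically poll until the job finishes. The helper below is a generic sketch of that loop: wait_for_status is a hypothetical function, and the terminal status strings "completed" and "failed" are assumptions, since the values the backend actually reports are not specified here.

```python
import time
from typing import Callable, Tuple


def wait_for_status(
    fetch_status: Callable[[], str],
    terminal: Tuple[str, ...] = ("completed", "failed"),  # assumed values
    interval: float = 2.0,
    timeout: float = 120.0,
) -> str:
    """Call fetch_status repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        time.sleep(interval)
```

In practice, fetch_status could wrap the get_source_status call shown above, adapting whatever shape that call returns into a plain status string. The timeout keeps a stuck job from blocking the caller indefinitely.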

Summary

SourcesService is an API-layer orchestrator in api/sources_service.py; it does not extract text or generate embeddings itself.
The source_graph LangGraph workflow in open_notebook/graphs/source.py handles the actual pipeline through the content_process and save_source nodes.
content_core.extract_content parses files and URLs into standardized full_text regardless of input format.
Source.vectorize in open_notebook/domain/notebook.py chunks text, retrieves embeddings via Esperanto, and stores vectors in SurrealDB.
The embed=True parameter triggers vectorization inside the save_source node after text extraction completes.

Frequently Asked Questions

Does `sources_service` extract PDF text directly?

No. SourcesService only forwards the creation request to the backend. According to the lfnovo/open-notebook source code, the actual PDF parsing is performed inside the content_process node of open_notebook/graphs/source.py, which calls content_core's extract_content(state["content_state"]).

Where does the embedding generation happen in the Open Notebook pipeline?

Embedding generation occurs in the Source.vectorize method defined in open_notebook/domain/notebook.py. The workflow node save_source calls this method after extraction if the request included embed=True. The method uses the Esperanto abstraction and open_notebook/utils/embedding.py to call the configured AI provider.

What database stores the vector embeddings?

The vectors are written to SurrealDB. After Source.vectorize chunks the full_text and obtains numerical embeddings through the selected provider, it persists those vectors into SurrealDB to power semantic search across sources and notebooks.

Can I process large files without blocking the API request?

Yes. Use sources_service.create_source_async instead of create_source. This returns a SourceProcessingResult that carries a command_id, and you can check progress by passing the source ID to get_source_status, as in the asynchronous example above, while the LangGraph workflow runs extraction and vectorization in the background.
