# How the Open Notebook sources_service Handles Content Extraction and Vectorization

> Discover how the sources_service in lfnovo/open-notebook extracts content and generates vector embeddings using LangGraph and Esperanto for efficient storage in SurrealDB.

- Repository: [Luis Novo/open-notebook](https://github.com/lfnovo/open-notebook)
- Tags: how-to-guide
- Published: 2026-06-07

---

**The `sources_service` in `lfnovo/open-notebook` is an API-layer orchestrator that delegates all extraction and embedding work to a LangGraph workflow in [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), where `content_core` extracts the raw text and vector embeddings are generated through the Esperanto abstraction and stored in SurrealDB.**

The ingestion pipeline in the `lfnovo/open-notebook` repository centers on the `sources_service`, which handles the initial request when users submit files, URLs, or raw text. While the service manages the API boundary and returns structured results, the actual content extraction and vectorization occur inside a dedicated workflow graph. Understanding this boundary is essential for customizing ingestion behavior or debugging embedding failures.

## The API Layer: What `SourcesService` Actually Does

`SourcesService` acts strictly as a thin wrapper around API client calls to the FastAPI backend. When you call `create_source`, the service forwards the request through `api_client.create_source` and passes along the `embed` parameter, which signals whether the backend should generate vector embeddings after extraction.

If `async_processing` is disabled, the call returns a fully materialized `Source` object. If enabled, it returns a `SourceProcessingResult` containing a command ID for polling. The service itself never touches PDF parsers, embedding models, or vector databases; it only initiates the request and formats the response.
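
A caller therefore has to branch on the return type. The sketch below illustrates that branch with stand-in dataclasses. The real `Source` and `SourceProcessingResult` classes in the repository carry more fields than shown here, and `describe_result` is a hypothetical helper, not part of the service.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Source:
    """Stand-in for the real Source model (assumed, simplified fields)."""
    id: str
    full_text: Optional[str] = None


@dataclass
class SourceProcessingResult:
    """Stand-in for the async result: a source plus a command ID to poll."""
    source: Source
    command_id: str


def describe_result(result: Union[Source, SourceProcessingResult]) -> str:
    """Hypothetical helper: summarize whichever shape the service returned."""
    if isinstance(result, SourceProcessingResult):
        # Asynchronous path: processing is still running in the background.
        return f"queued: poll command {result.command_id} for source {result.source.id}"
    # Synchronous path: the source is fully materialized.
    return f"ready: source {result.id} ({len(result.full_text or '')} chars)"
```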

## How the LangGraph Workflow Orchestrates Extraction and Embedding

Once the backend receives the request, it persists a `Source` record and immediately invokes the `source_graph` LangGraph workflow defined in [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py). This graph coordinates the heavy lifting across two primary nodes: `content_process` and `save_source`.

### Extracting Raw Text with the `content_process` Node

The `content_process` node calls `extract_content(state["content_state"])` from the `content_core` library. This utility supports diverse input types, including PDF, DOCX, audio files, and YouTube URLs, and returns a `processed_state` dictionary containing `content`, `title`, `url`, and `file_path`.

In [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), the workflow passes the incoming source state into this node, which delegates format-specific parsing to `content_core`. The result is a standardized text payload that downstream nodes can handle uniformly regardless of the original file type.

### Persisting Content and Triggering Vectorization in `save_source`

After extraction, the `save_source` node retrieves the existing `Source` record, updates its asset metadata, and writes the extracted `full_text` into the database. If the original request included `embed=True`, this node also executes `await source.vectorize()` before the workflow completes.

The `save_source` node is also responsible for optional title overrides: if the user provided a placeholder title, the node can replace it with inferred metadata from the extraction step. Once persistence and vectorization finish, the workflow may invoke additional transformation graphs, such as summarization, through `trigger_transformations`.
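
To make the two-node flow concrete, here is a minimal sketch in plain Python. It deliberately does not use LangGraph or `content_core`: `fake_extract`, `FAKE_DB`, `run_pipeline`, and the `"Untitled"` placeholder title are all assumptions made for illustration, and the real nodes are asynchronous and work with the persisted `Source` model.

```python
from typing import Callable, Dict, List, TypedDict


class SourceState(TypedDict, total=False):
    content_state: Dict[str, str]
    source_id: str
    embed: bool


# In-memory stand-in for persisted Source records (hypothetical).
FAKE_DB: Dict[str, Dict[str, object]] = {
    "source:1": {"title": "Untitled", "full_text": None, "vectorized": False}
}


def fake_extract(content_state: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for content_core's extract_content; returns canned values."""
    return {**content_state, "content": "Extracted text", "title": "Quarterly Report"}


def content_process(state: SourceState) -> SourceState:
    """First node: turn the raw input description into extracted content."""
    return {"content_state": fake_extract(state["content_state"])}


def save_source(state: SourceState) -> SourceState:
    """Second node: persist the text, override a placeholder title, maybe embed."""
    record = FAKE_DB[state["source_id"]]
    extracted = state["content_state"]
    record["full_text"] = extracted["content"]
    if record["title"] == "Untitled" and extracted.get("title"):
        record["title"] = extracted["title"]
    if state.get("embed"):
        record["vectorized"] = True  # the real node awaits source.vectorize()
    return {}


def run_pipeline(
    state: SourceState, nodes: List[Callable[[SourceState], SourceState]]
) -> SourceState:
    """Run nodes in order, merging each node's partial update into the state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state
```

Each node returns only the keys it changed, and the runner merges them into the shared state, which mirrors how graph-style workflows typically pass state between nodes.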

## Inside `Source.vectorize`: Chunking, Embedding, and Storage

The actual embedding logic lives in [`open_notebook/domain/notebook.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/domain/notebook.py) inside the `Source.vectorize` method. This method performs three distinct operations:

1. **Chunking** – It splits the stored `full_text` into discrete chunks suitable for the selected embedding model's context window.
2. **Embedding generation** – It calls the AI provider through the **Esperanto** abstraction layer. The provider selection and API routing logic is further supported by [`open_notebook/utils/embedding.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/embedding.py).
3. **Vector storage** – It writes the resulting embedding vectors to SurrealDB, enabling semantic search across notebook sources.

This design cleanly separates the domain model (`Source`) from the ingestion transport (`SourcesService`) and the workflow engine (LangGraph).
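
The three steps listed above can be sketched as follows. This is not the repository's implementation: the fixed-size character chunker, the hash-based `fake_embed` function, and the in-memory `store` dictionary are stand-ins chosen so the example runs without an embedding provider or SurrealDB.

```python
import hashlib
from typing import Dict, List, Tuple

Chunk = Tuple[str, List[float]]  # (chunk text, embedding vector)


def chunk_text(text: str, chunk_size: int = 200) -> List[str]:
    """Split text into fixed-size character windows (a deliberately naive chunker)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def fake_embed(chunk: str, dims: int = 8) -> List[float]:
    """Stand-in for a provider call: derive a deterministic pseudo-vector from a hash."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def vectorize(
    source_id: str,
    full_text: str,
    store: Dict[str, List[Chunk]],
    chunk_size: int = 200,
) -> int:
    """Chunk the text, embed each chunk, and write the results to the store."""
    chunks = chunk_text(full_text, chunk_size)
    store[source_id] = [(chunk, fake_embed(chunk)) for chunk in chunks]
    return len(chunks)
```

A production chunker would more likely split on tokens or sentence boundaries so that each chunk stays within the embedding model's context window, as the first step describes.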

## Code Examples: Ingesting Sources with Open Notebook

You can request synchronous ingestion with immediate vectorization using `create_source`:

```python
from api.sources_service import sources_service

result = sources_service.create_source(
    notebooks=["notebook-123"],
    source_type="upload",
    file_path="/tmp/report.pdf",
    title="Quarterly Report",
    embed=True,              # request vectorization after extraction
    async_processing=False,  # return a fully materialized Source
)

# `result` is a `Source` instance with `full_text` and `embedded_chunks` populated
print(result.id, result.full_text[:200])
```

For large files or slow URLs, use the asynchronous path to avoid blocking:

```python
from api.sources_service import sources_service

async_result = sources_service.create_source_async(
    notebooks=["notebook-123"],
    source_type="url",
    url="https://example.com/article.html",
    embed=True,
)

print("Command ID:", async_result.command_id)

# Poll the status later:
status = sources_service.get_source_status(async_result.source.id)
print(status)
```
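
Polling usually needs a loop with a timeout. The helper below is a hypothetical sketch: it assumes the status call returns a dictionary with a `"status"` key whose terminal values are `"completed"` or `"failed"`, which may not match the actual response shape, so adjust it to whatever `get_source_status` returns in your version.

```python
import time
from typing import Any, Callable, Dict, Tuple


def wait_for_source(
    get_status: Callable[[str], Dict[str, Any]],
    source_id: str,
    timeout: float = 60.0,
    interval: float = 2.0,
    done_states: Tuple[str, ...] = ("completed", "failed"),  # assumed values
) -> Dict[str, Any]:
    """Call get_status until it reports a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(source_id)
        if status.get("status") in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source {source_id} did not finish within {timeout}s")
        time.sleep(interval)
```

Passing the status function in as a parameter, for example `wait_for_source(sources_service.get_source_status, async_result.source.id)`, keeps the helper independent of the service and easy to test with a fake.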

## Summary

- **`SourcesService`** is an API-layer orchestrator in [`api/sources_service.py`](https://github.com/lfnovo/open-notebook/blob/main/api/sources_service.py); it does not extract text or generate embeddings itself.
- The **`source_graph`** LangGraph workflow in [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py) handles the actual pipeline through the `content_process` and `save_source` nodes.
- **`content_core.extract_content`** parses files and URLs into standardized `content`, which `save_source` stores as `full_text`, regardless of input format.
- **`Source.vectorize`** in [`open_notebook/domain/notebook.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/domain/notebook.py) chunks text, retrieves embeddings via Esperanto, and stores vectors in SurrealDB.
- The **`embed=True` parameter** triggers vectorization inside the `save_source` node after text extraction completes.

## Frequently Asked Questions

### Does `sources_service` extract PDF text directly?

No. `SourcesService` only forwards the creation request to the backend. According to the `lfnovo/open-notebook` source code, the actual PDF parsing is performed inside the `content_process` node of [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), which delegates to `content_core.extract_content` via `extract_content(state["content_state"])`.

### Where does the embedding generation happen in the Open Notebook pipeline?

Embedding generation occurs in the `Source.vectorize` method defined in [`open_notebook/domain/notebook.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/domain/notebook.py). The workflow node `save_source` calls this method after extraction if the request included `embed=True`. The method uses the Esperanto abstraction and [`open_notebook/utils/embedding.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/embedding.py) to call the configured AI provider.

### What database stores the vector embeddings?

The vectors are written to **SurrealDB**. After `Source.vectorize` chunks the `full_text` and obtains numerical embeddings through the selected provider, it persists those vectors into SurrealDB to power semantic search across sources and notebooks.

### Can I process large files without blocking the API request?

Yes. Use `sources_service.create_source_async` instead of `create_source`. This returns a `SourceProcessingResult` with a `command_id` that you can poll through `get_source_status`, allowing the LangGraph workflow to run extraction and vectorization in the background.