# Open Notebook Content Processing Pipeline for PDFs, Videos, Audio, and URLs with `content_core`

> Discover how Open Notebook uses content_core to process PDFs, videos, audio, and URLs into searchable markdown. Explore the source graph for detailed pipeline orchestration.

- Repository: [Luis Novo/open-notebook](https://github.com/lfnovo/open-notebook)
- Tags: how-to-guide
- Published: 2026-06-06

---

**Open Notebook relies on the third-party `content_core` library, orchestrated through the source graph in [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), to transform raw PDFs, videos, audio files, and web URLs into clean, searchable markdown text.**

The `lfnovo/open-notebook` repository uses `content_core` to unify ingestion across documents and multimedia. A dedicated state graph orchestrates extraction and normalizes diverse inputs into a consistent markdown format ready for indexing and vector search.

## Pipeline Architecture in [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py)

The extraction workflow is implemented as a state machine inside [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py). It prepares the input state, invokes `content_core`, validates the result, and persists the output.

### Step 1: Engine Configuration via `ContentSettings`

Before extraction begins, the graph loads default engine settings from **`ContentSettings`**. Two key defaults are injected into the `content_state` dictionary:

- **`default_content_processing_engine_doc`** for local files such as PDFs, DOCX, and PPTX.
- **`default_content_processing_engine_url`** for web resources and HTML pages.

As seen in [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py) (lines 34–60), these values populate the `content_state` dict that is passed to the extractor. Users can override the defaults through the Settings UI by selecting specific loaders such as `pdfminer`, `pandoc`, or `yt-dlp`, or they can leave the engine set to `"auto"` so that `content_core` selects the best available handler.
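
The defaulting step described above can be sketched as follows. This is a minimal illustration rather than the repository's code: `EngineDefaults` and `apply_engine_defaults` are hypothetical names standing in for `ContentSettings` and the graph's setup logic, and the sketch assumes that engine values already present in `content_state` are preserved rather than overwritten.

```python
from dataclasses import dataclass


@dataclass
class EngineDefaults:
    """Hypothetical stand-in for ContentSettings (not the real class)."""
    default_content_processing_engine_doc: str = "auto"
    default_content_processing_engine_url: str = "auto"


def apply_engine_defaults(content_state: dict, defaults: EngineDefaults) -> dict:
    """Return a copy of content_state with engine keys filled in where missing.

    Assumption: an empty setting falls back to "auto", and values supplied
    by the caller win over the configured defaults.
    """
    state = dict(content_state)
    state.setdefault(
        "document_engine", defaults.default_content_processing_engine_doc or "auto"
    )
    state.setdefault(
        "url_engine", defaults.default_content_processing_engine_url or "auto"
    )
    return state


if __name__ == "__main__":
    print(apply_engine_defaults({"url": "https://example.com"}, EngineDefaults()))
```

Returning a copy instead of mutating the input keeps the caller's dictionary unchanged, which makes the step easier to test in isolation.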

### Step 2: Optional Speech-to-Text Selection

For video and audio sources, the graph queries the Open Notebook model manager for the default speech-to-text configuration. When available, it adds **`audio_provider`** and **`audio_model`** to the extraction request ([`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), lines 62–71). This enables automatic transcript generation for YouTube videos, MP4 files, and raw audio streams when subtitles are not present.
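
A rough sketch of this optional step is shown below. The `SpeechToTextConfig` type and `add_audio_config` helper are hypothetical and do not reflect the real model manager API; only the `audio_provider` and `audio_model` keys come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechToTextConfig:
    """Hypothetical description of a default STT model."""
    provider: str
    name: str


def add_audio_config(content_state: dict, stt: Optional[SpeechToTextConfig]) -> dict:
    """Return a copy of content_state with STT keys added when a model is configured."""
    state = dict(content_state)
    if stt is not None:
        state["audio_provider"] = stt.provider
        state["audio_model"] = stt.name
    return state


if __name__ == "__main__":
    # Provider and model names here are placeholders, not recommendations.
    stt = SpeechToTextConfig(provider="example-provider", name="example-model")
    print(add_audio_config({"file_path": "talk.mp3"}, stt))
```

When no default model is configured, the state is returned without audio keys, so the extractor receives no STT configuration from this step.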

### Step 3: Core Extraction via `extract_content`

The central entry point is the async call `await extract_content(content_state)`, where `extract_content` is imported from `content_core` on lines 4–5 of [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py). Internally, `content_core` performs the following operations:

- **Detects the source type** by inspecting the submitted `url` or `file_path`.
- **Selects the appropriate extractor** based on `url_engine` and `document_engine`.
- **Executes extraction** for the detected media:
  - **PDF / DOCX / PPTX** – parses the binary file and converts extracted text into markdown.
  - **HTML / generic URLs** – fetches the page, strips boilerplate, and returns the main article body.
  - **YouTube / other videos** – downloads subtitles if available; otherwise streams audio to the configured STT model to produce a transcript.
  - **Audio files (MP3, WAV, etc.)** – streams directly to the configured STT model.
- **Normalizes the result** into a **`ProcessSourceState`** object containing:
  - `content` – the extracted markdown text.
  - `title` – a guessed or extracted title.
  - `url` or `file_path` – the original source location.
  - `metadata` – MIME type, language hints, and other details.
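
To make the detection step more concrete, here is a simplified sketch of how a source could be classified from its `url` or `file_path`. It is not `content_core`'s actual logic, which this article does not show; `classify_source` and `YOUTUBE_HOSTS` are hypothetical names, and the sketch relies only on the standard library.

```python
import mimetypes
from urllib.parse import urlparse

# Hypothetical host list used only for this illustration.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_source(state: dict) -> str:
    """Classify a state dict as 'youtube', 'url', 'audio', 'video', 'document', or 'unknown'."""
    url = state.get("url")
    if url:
        host = (urlparse(url).hostname or "").lower()
        return "youtube" if host in YOUTUBE_HOSTS else "url"

    file_path = state.get("file_path")
    if file_path:
        # Guess the MIME type from the file extension only.
        mime, _ = mimetypes.guess_type(file_path)
        if mime is None:
            return "unknown"
        if mime.startswith("audio/"):
            return "audio"
        if mime.startswith("video/"):
            return "video"
        return "document"

    return "unknown"


if __name__ == "__main__":
    for candidate in ({"url": "https://youtu.be/abc"}, {"file_path": "report.pdf"}):
        print(candidate, "->", classify_source(candidate))
```

A production extractor would likely inspect file contents or HTTP headers as well, since extensions alone can be misleading.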

### Step 4: Error Handling and Validation

After `extract_content` returns, the graph checks whether the `content` field is populated. If the result is empty, it raises a **`ValueError`** that explains the failure, such as missing subtitles, an unsupported format, or a broken URL ([`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), lines 80–92). This ensures that downstream steps never silently process invalid or empty extractions.
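
The validation idea can be illustrated with a small guard function. This is a sketch under assumptions: `ensure_content` is a hypothetical helper, and the error message wording is invented for the example rather than copied from the repository.

```python
from typing import Optional


def ensure_content(content: Optional[str], source_label: str) -> str:
    """Return content unchanged, or raise ValueError if it is missing or blank."""
    if content is None or not content.strip():
        raise ValueError(
            f"Could not extract any text from {source_label}. "
            "Possible causes: missing subtitles, an unsupported format, "
            "or an unreachable URL."
        )
    return content


if __name__ == "__main__":
    try:
        ensure_content("   ", "https://example.com/video")
    except ValueError as exc:
        print(f"Extraction rejected: {exc}")
```

Treating whitespace-only text as empty is a design choice in this sketch; it prevents records that contain only blank lines from passing the check.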

### Step 5: Persisting Results and Embedding

Once validation passes, the graph calls `save_source` to store the normalized text in a **`Source`** record, update the associated **`Asset`**, and optionally trigger the embedding step for vector search ([`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), lines 97–122). At this point, the raw resource is fully converted into a searchable markdown document inside Open Notebook.
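
The save-then-embed flow can be pictured with an in-memory stand-in. Everything here is hypothetical: `SourceRecord` and `save_extracted` are not the repository's `Source` model or `save_source` function, and the `embedder` callable is a placeholder for whatever embedding step is configured.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SourceRecord:
    """Hypothetical in-memory stand-in for a persisted Source record."""
    id: str
    title: Optional[str] = None
    full_text: Optional[str] = None
    embedded: bool = False


def save_extracted(
    record: SourceRecord,
    title: Optional[str],
    content: str,
    embed: bool,
    embedder: Callable[[str], object],
) -> SourceRecord:
    """Store the extracted text on the record, then embed it only if requested."""
    record.title = title
    record.full_text = content
    if embed:
        embedder(content)  # Placeholder for the real embedding call
        record.embedded = True
    return record


if __name__ == "__main__":
    record = save_extracted(SourceRecord(id="source:1"), "Demo", "# Body", True, len)
    print(record)
```

Storing the text before embedding means the extracted content is kept on the record even if the embedding call later fails.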

## Practical Code Examples

### Ingesting Sources Through the Source Graph

The recommended path is to invoke `source_graph` with a prepared `content_state`. The graph fills in engines, audio configuration, and persistence logic automatically.

```python
from open_notebook.graphs.source import source_graph
from open_notebook.domain.notebook import Source


async def ingest(source_id: str, notebook_ids: list[str], url: str | None = None,
                 file_path: str | None = None, embed: bool = True):
    # Build the initial state passed to content_core; the graph fills in
    # the remaining keys (engines, audio config, etc.).
    content_state = {
        "url": url,
        "file_path": file_path,
    }

    # Kick off the workflow
    result = await source_graph.ainvoke(
        {
            "content_state": content_state,
            "apply_transformations": [],  # No extra transformations
            "source_id": source_id,
            "notebook_ids": notebook_ids,
            "source": await Source.get(source_id),  # Pre-created empty Source record
            "embed": embed,
        }
    )
    return result["source"]  # Persisted Source with full_text populated
```

### Direct `content_core` Extraction

For custom scripts or testing, you can call `content_core` directly without the graph overhead.

```python
from content_core import extract_content
from content_core.common import ProcessSourceState


async def raw_extract(url: str) -> ProcessSourceState:
    # Basic state; the library will auto-detect the best engine
    state = {
        "url": url,
        "url_engine": "auto",
        "document_engine": "auto",
        "output_format": "markdown",
    }
    return await extract_content(state)  # Returns ProcessSourceState with .content etc.
```

## Key Files in the Extraction Workflow

- **[`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py)** – Orchestrates the complete extraction workflow; configures defaults, invokes `extract_content`, validates results, and stores output.
- **[`open_notebook/domain/content_settings.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/domain/content_settings.py)** – Defines the configurable defaults for document and URL extraction engines.
- **[`api/sources_service.py`](https://github.com/lfnovo/open-notebook/blob/main/api/sources_service.py)** – Receives uploads and URLs from the API, builds the initial `content_state`, and triggers the source graph.
- **`content_core`** (external package) – Performs the heavy lifting of file-type detection, text extraction, subtitle download, and STT transcription.
- **[`open_notebook/graphs/CLAUDE.md`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/CLAUDE.md)** – Documents the graph architecture and the role of `content_core` within the ingestion pipeline.

## Summary

- **Unified ingestion** – [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py) orchestrates conversion of PDFs, Office documents, web pages, videos, and audio into markdown through `content_core`.
- **Configurable engines** – Defaults from `ContentSettings` can be overridden per source, or left on `"auto"` for automatic loader selection.
- **Speech-to-text support** – The pipeline optionally integrates STT models to generate transcripts for video and audio content.
- **Strict validation** – Empty extractions raise informative `ValueError` exceptions before any data is persisted.
- **End-to-end persistence** – Valid results are saved to `Source` records and optionally embedded for vector search.

## Frequently Asked Questions

### What file and media types does Open Notebook support?

According to the `lfnovo/open-notebook` source code, the pipeline supports PDF, DOCX, PPTX, HTML pages, generic URLs, YouTube videos, MP4 files, and raw audio such as MP3 and WAV. The `content_core` library detects the type automatically and routes it to the appropriate extractor.

### How does the pipeline handle videos without subtitles?

If subtitles are unavailable, `content_core` streams the video’s audio track to the speech-to-text model specified by `audio_provider` and `audio_model` in the `content_state`. This behavior is configured inside [`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py) (lines 62–71) and produces a full transcript when no text track exists.

### Can I override the default extraction engine for a specific source?

Yes. `ContentSettings` supplies global defaults for `default_content_processing_engine_doc` and `default_content_processing_engine_url`, which users can change through the Settings UI. To override the engine for a single source, inject explicit `url_engine` and `document_engine` values into the `content_state` before extraction.

### What happens if `content_core` fails to return usable text?

The source graph validates the `content` field immediately after extraction. If it is empty, the graph raises a descriptive `ValueError` ([`open_notebook/graphs/source.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/graphs/source.py), lines 80–92) that explains why the operation failed, preventing broken or blank records from being saved to the database.