Open Notebook Content Processing Pipeline for PDFs, Videos, Audio, and URLs with `content_core`

Open Notebook relies on the third-party content_core library, orchestrated through the source graph in open_notebook/graphs/source.py, to transform raw PDFs, videos, audio files, and web URLs into clean, searchable markdown text.

The lfnovo/open-notebook repository uses content_core to unify ingestion across documents and multimedia. A dedicated state graph orchestrates extraction and normalizes diverse inputs into a consistent markdown format ready for indexing and vector search.

Pipeline Architecture in open_notebook/graphs/source.py

The extraction workflow is implemented as a state machine inside open_notebook/graphs/source.py. It prepares the input state, invokes content_core, validates the result, and persists the output.

Step 1: Engine Configuration via ContentSettings

Before extraction begins, the graph loads default engine settings from ContentSettings. Two key defaults are injected into the content_state dictionary:

  • default_content_processing_engine_doc for local files such as PDFs, DOCX, and PPTX.
  • default_content_processing_engine_url for web resources and HTML pages.

As seen in open_notebook/graphs/source.py (lines 34–60), these values populate the content_state dict that is passed to the extractor. Users can override the defaults through the Settings UI—selecting specific loaders like pdfminer, pandoc, or yt-dlp—or leave the engine set to "auto" so that content_core selects the best available handler.
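The default-injection step might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in source.py: the apply_engine_defaults helper is hypothetical, the url_engine and document_engine key names are borrowed from the extraction step described later, and the rule that explicit values already in the state take precedence is an assumption.

```python
from typing import Optional


def apply_engine_defaults(
    content_state: dict,
    doc_engine: Optional[str],
    url_engine: Optional[str],
) -> dict:
    """Return a copy of content_state with both engine keys filled in.

    Hypothetical helper. Assumed precedence: a value already present in
    the state wins, then the configured default, then "auto".
    """
    state = dict(content_state)  # leave the caller's dict untouched
    state.setdefault("document_engine", doc_engine or "auto")
    state.setdefault("url_engine", url_engine or "auto")
    return state


# Example: a document default is configured, the URL default is not.
state = apply_engine_defaults({"url": "https://example.com"}, "pdfminer", None)
```

In this sketch an unset default falls back to "auto", which matches the behavior described above where content_core chooses the handler itself.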

Step 2: Optional Speech-to-Text Selection

For video and audio sources, the graph queries the Open Notebook model manager for the default speech-to-text configuration. When available, it adds audio_provider and audio_model to the extraction request (open_notebook/graphs/source.py, lines 62–71). This enables automatic transcript generation for YouTube videos, MP4 files, and raw audio streams when subtitles are not present.
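As a rough sketch of this optional step, the snippet below adds the two audio keys only when a default speech-to-text model is configured. The SpeechToTextModel class and the add_audio_config helper are hypothetical stand-ins for whatever the model manager returns; only the audio_provider and audio_model key names come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechToTextModel:
    """Hypothetical stand-in for the default STT model record."""
    provider: str
    name: str


def add_audio_config(content_state: dict, stt: Optional[SpeechToTextModel]) -> dict:
    """Return a copy of content_state, adding audio keys if an STT model exists."""
    state = dict(content_state)
    if stt is not None:
        state["audio_provider"] = stt.provider
        state["audio_model"] = stt.name
    return state  # unchanged copy when no STT model is configured
```

When no model is configured the state is passed through without the audio keys, so extraction still proceeds for sources that do not need transcription.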

Step 3: Core Extraction via extract_content

The central entry point is the async call await extract_content(content_state), where extract_content is imported from content_core on lines 4–5 of open_notebook/graphs/source.py. Internally, content_core performs the following operations:

  • Detects the source type by inspecting the submitted url or file_path.
  • Selects the appropriate extractor based on url_engine and document_engine.
  • Executes extraction for the detected media:
    • PDF / DOCX / PPTX – parses the binary file and converts extracted text into markdown.
    • HTML / generic URLs – fetches the page, strips boilerplate, and returns the main article body.
    • YouTube / other videos – downloads subtitles if available; otherwise streams audio to the configured STT model to produce a transcript.
    • Audio files (MP3, WAV, etc.) – streams directly to the configured STT model.
  • Normalizes the result into a ProcessSourceState object containing:
    • content – the extracted markdown text.
    • title – a guessed or extracted title.
    • url or file_path – the original source location.
    • metadata – MIME type, language hints, and other details.
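To make the detection step concrete, here is a deliberately simplified sketch of how a source could be classified from its url or file_path. It is not content_core's actual detection logic, which is not shown in this article: the classify_source function, the extension sets, and the host list are all assumptions chosen for illustration.

```python
import os
from typing import Optional
from urllib.parse import urlparse

# Illustrative lookup tables (assumptions, not content_core's real lists).
DOCUMENT_EXTS = {".pdf", ".docx", ".pptx"}
AUDIO_EXTS = {".mp3", ".wav"}
VIDEO_EXTS = {".mp4"}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def classify_source(url: Optional[str] = None, file_path: Optional[str] = None) -> str:
    """Return a coarse media category for the given url or file_path."""
    if file_path:
        ext = os.path.splitext(file_path)[1].lower()
        if ext in DOCUMENT_EXTS:
            return "document"
        if ext in AUDIO_EXTS:
            return "audio"
        if ext in VIDEO_EXTS:
            return "video"
        return "unknown"
    if url:
        host = (urlparse(url).hostname or "").lower()
        return "video" if host in YOUTUBE_HOSTS else "web"
    raise ValueError("either url or file_path is required")
```

A real detector would likely also inspect MIME types or file contents; extension and hostname checks are used here only to keep the example short.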

Step 4: Error Handling and Validation

After extract_content returns, the graph checks whether the content field is populated. If the result is empty, it raises a clear ValueError explaining the failure—such as missing subtitles, an unsupported format, or a broken URL (open_notebook/graphs/source.py, lines 80–92). This ensures that downstream steps never process invalid or empty extractions silently.
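The check described here can be sketched as a small guard function. The ensure_content name, the dict-shaped result, and the wording of the error message are assumptions for illustration; the actual graph code may differ.

```python
def ensure_content(result: dict) -> dict:
    """Raise ValueError if the extraction result has no usable text.

    Hypothetical guard: treats missing, None, and whitespace-only
    content as empty.
    """
    content = (result.get("content") or "").strip()
    if not content:
        location = result.get("url") or result.get("file_path") or "unknown source"
        raise ValueError(
            f"No content could be extracted from {location}. "
            "Possible causes: missing subtitles, unsupported format, or a broken URL."
        )
    return result
```

Failing fast at this point means the persistence step only ever receives results with non-empty text.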

Step 5: Persisting Results and Embedding

Once validation passes, the graph calls save_source to store the normalized text in a Source record, update the associated Asset, and optionally trigger the embedding step for vector search (open_notebook/graphs/source.py, lines 97–122). At this point, the raw resource is fully converted into a searchable markdown document inside Open Notebook.
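The save-then-optionally-embed flow can be illustrated with an in-memory sketch. Everything here is hypothetical: SourceRecord stands in for the real Source model, persist_result stands in for save_source, and the embedder callback stands in for whatever triggers embedding. Only the full_text field name and the embed flag come from this article.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SourceRecord:
    """Hypothetical in-memory stand-in for a Source record."""
    id: str
    full_text: str = ""
    title: Optional[str] = None
    embedded: bool = False


def persist_result(
    record: SourceRecord,
    result: dict,
    embed: bool,
    embedder: Optional[Callable[[str], object]] = None,
) -> SourceRecord:
    """Copy extracted text onto the record, then embed it if requested."""
    record.full_text = result["content"]
    record.title = result.get("title") or record.title
    if embed and embedder is not None:
        embedder(record.full_text)  # e.g. enqueue the text for vector indexing
        record.embedded = True
    return record
```

Keeping embedding behind a flag mirrors the optional nature of that step: a source can be stored and read without being indexed for vector search.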

Practical Code Examples

Ingesting Sources Through the Source Graph

The recommended path is to invoke source_graph with a prepared content_state. The graph fills in engines, audio configuration, and persistence logic automatically.

from open_notebook.graphs.source import source_graph
from open_notebook.domain.notebook import Source

async def ingest(source_id: str, notebook_ids: list[str], url: str | None = None,
                 file_path: str | None = None, embed: bool = True):
    # Build the initial ProcessSourceState expected by content_core.
    # The remaining keys (engines, audio config, etc.) are filled in by the graph.
    content_state = {
        "url": url,
        "file_path": file_path,
    }

    # Kick off the workflow
    result = await source_graph.ainvoke(
        {
            "content_state": content_state,
            "apply_transformations": [],            # No extra transformations
            "source_id": source_id,
            "notebook_ids": notebook_ids,
            "source": await Source.get(source_id),  # Pre-created empty Source record
            "embed": embed,
        }
    )
    return result["source"]  # Persisted Source with full_text populated

Direct content_core Extraction

For custom scripts or testing, you can call content_core directly without the graph overhead.

from content_core import extract_content
from content_core.common import ProcessSourceState

async def raw_extract(url: str) -> ProcessSourceState:
    # Basic state; the library will auto-detect the best engine
    state = {
        "url": url,
        "url_engine": "auto",
        "document_engine": "auto",
        "output_format": "markdown",
    }
    return await extract_content(state)  # Returns ProcessSourceState with .content etc.

Key Files in the Extraction Workflow

  • open_notebook/graphs/source.py – Orchestrates the complete extraction workflow; configures defaults, invokes extract_content, validates results, and stores output.
  • open_notebook/domain/content_settings.py – Defines the configurable defaults for document and URL extraction engines.
  • api/sources_service.py – Receives uploads and URLs from the API, builds the initial content_state, and triggers the source graph.
  • content_core (external package) – Performs the heavy lifting of file-type detection, text extraction, subtitle download, and STT transcription.
  • open_notebook/graphs/CLAUDE.md – Documents the graph architecture and the role of content_core within the ingestion pipeline.

Summary

  • Unified ingestion – open_notebook/graphs/source.py orchestrates conversion of PDFs, Office documents, web pages, videos, and audio into markdown through content_core.
  • Configurable engines – Defaults from ContentSettings can be overridden per source, or left on "auto" for automatic loader selection.
  • Speech-to-text support – The pipeline optionally integrates STT models to generate transcripts for video and audio content.
  • Strict validation – Empty extractions raise informative ValueError exceptions before any data is persisted.
  • End-to-end persistence – Valid results are saved to Source records and optionally embedded for vector search.

Frequently Asked Questions

What file and media types does Open Notebook support?

According to the lfnovo/open-notebook source code, the pipeline supports PDF, DOCX, PPTX, HTML pages, generic URLs, YouTube videos, MP4 files, and raw audio such as MP3 and WAV. The content_core library detects the type automatically and routes it to the appropriate extractor.

How does the pipeline handle videos without subtitles?

If subtitles are unavailable, content_core streams the video’s audio track to the speech-to-text model specified by audio_provider and audio_model in the content_state. This behavior is configured inside open_notebook/graphs/source.py (lines 62–71) and produces a full transcript when no text track exists.

Can I override the default extraction engine for a specific source?

Yes. While ContentSettings supplies global defaults for default_content_processing_engine_doc and default_content_processing_engine_url, users can override these per-source through the Settings UI or by injecting explicit url_engine and document_engine values into the content_state before extraction.

What happens if content_core fails to return usable text?

The source graph validates the content field immediately after extraction. If it is empty, the graph raises a descriptive ValueError (open_notebook/graphs/source.py, lines 80–92) that explains why the operation failed, preventing broken or blank records from being saved to the database.
