How to Regenerate Vector Embeddings for Existing Sources in Open Notebook

To regenerate vector embeddings for existing sources, trigger the rebuild_embeddings command via POST /api/embeddings/rebuild with mode="existing", which deletes old vectors from the source_embedding table, re-chunks the text, and inserts fresh embeddings asynchronously.

Open Notebook persists vector embeddings in the source_embedding table to enable semantic search across uploaded documents. When you change embedding models or adjust chunking strategies in open_notebook/utils/chunking.py, you must regenerate these vectors to keep search results accurate. The repository provides an asynchronous pipeline, exposed through the REST API and Python client, that rebuilds embeddings for sources that already have them and skips sources that have never been embedded.

Understanding the Rebuild Architecture

The regeneration workflow coordinates five distinct components to process existing sources idempotently:

  1. API Router – api/routers/embedding_rebuild.py accepts the rebuild request, validates parameters, and queues the background job.
  2. Command Controller – commands/embedding_commands.py contains rebuild_embeddings_command, which calls collect_items_for_rebuild to gather source IDs and submits individual embed_source jobs.
  3. Source Processor – The embed_source_command loads each source via Source.get, deletes existing embeddings using repo_query("DELETE source_embedding …"), detects content type, and chunks text.
  4. Embedding Utilities – open_notebook/utils/embedding.py provides generate_embeddings to vectorize chunks via the configured model_manager, storing results via repo_insert.
  5. Status Endpoint – GET /api/embeddings/rebuild/{command_id}/status exposes real-time progress metrics.
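The delete, re-chunk, and insert sequence described in step 3 can be illustrated with a small in-memory sketch. Everything here is a hypothetical stand-in rather than Open Notebook's actual code: InMemoryStore plays the role of the source_embedding table, chunk_text is a naive fixed-width chunker, and fake_embed replaces the real model call so the example runs without a provider.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the source_embedding table."""
    rows: list = field(default_factory=list)  # each row: (source_id, chunk_index, vector)

    def delete_for_source(self, source_id):
        # Drop every row that belongs to the given source.
        self.rows = [row for row in self.rows if row[0] != source_id]

    def insert(self, source_id, chunk_index, vector):
        self.rows.append((source_id, chunk_index, vector))


def chunk_text(text, size=20):
    """Naive fixed-width chunking (assumption: the real chunker is more involved)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(chunk):
    """Toy two-dimensional 'embedding' so the sketch needs no model provider."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


def embed_source(store, source_id, text):
    """Delete old vectors first, then re-chunk and insert fresh ones."""
    store.delete_for_source(source_id)
    chunks = chunk_text(text)
    for index, chunk in enumerate(chunks):
        store.insert(source_id, index, fake_embed(chunk))
    return len(chunks)


store = InMemoryStore()
text = "Open Notebook stores embeddings per chunk of source text."
first = embed_source(store, "source:1", text)
second = embed_source(store, "source:1", text)  # re-run leaves no duplicates
print(first, second, len(store.rows))
```

Because the deletion happens before any insert, running embed_source twice for the same source leaves the same number of rows as running it once.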

Triggering a Rebuild via REST API

Send a POST request to /api/embeddings/rebuild with mode="existing" to limit processing to sources that already have embeddings. The operation is idempotent because each source's old records are removed before new ones are inserted.

curl -X POST "http://localhost:5055/api/embeddings/rebuild" \
  -H "Content-Type: application/json" \
  -d '{"mode":"existing","include_sources":true,"include_notes":false,"include_insights":false}'

The response returns a command_id and the estimated number of items:

{
  "command_id": "rebuild_01hv...",
  "total_items": 42
}
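For scripts that cannot shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch: build_rebuild_request is a hypothetical helper, the URL and payload fields are copied from the curl example above, and the network call is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_rebuild_request(base_url, mode="existing"):
    """Assemble (but do not send) the POST request for a rebuild."""
    payload = {
        "mode": mode,
        "include_sources": True,
        "include_notes": False,
        "include_insights": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/embeddings/rebuild",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_rebuild_request("http://localhost:5055")
print(request.get_method(), request.full_url)

# Sending requires a running Open Notebook server, so it stays commented out:
# with urllib.request.urlopen(request) as response:
#     body = json.loads(response.read())
#     print(body["command_id"], body["total_items"])
```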

Using the Python Client

The OpenNotebookClient class in api/client.py provides a typed interface for the same operation:

from open_notebook.api.client import OpenNotebookClient

client = OpenNotebookClient(base_url="http://localhost:5055")

rebuild_resp = client.rebuild_embeddings(
    mode="existing",      # Only sources with existing embeddings
    include_sources=True,
    include_notes=False,
    include_insights=False,
)

print(f"Command ID: {rebuild_resp.command_id}")
print(f"Estimated items: {rebuild_resp.total_items}")

Monitoring Rebuild Progress

Because the rebuild runs asynchronously, poll the status endpoint using the command_id returned at startup:

curl "http://localhost:5055/api/embeddings/rebuild/<command_id>/status"

Programmatic Status Polling

Use the Python client to track completion state:

import time

while True:
    status = client.get_rebuild_status(rebuild_resp.command_id)
    # Counts reflect jobs submitted so far, summed across item types.
    total_submitted = (
        status.stats.sources_submitted
        + status.stats.notes_submitted
        + status.stats.insights_submitted
    )
    print(f"Status: {status.status} | Submitted: {total_submitted}/{status.progress.total}")

    if status.status in ("completed", "failed"):
        break
    time.sleep(5)

print("Final stats:", status.stats)
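The loop above polls indefinitely, which can hang a script if a job stalls. One way to bound the wait is sketched below. wait_for_rebuild is a hypothetical helper, not part of the client, and the "queued" and "running" states in the demo are assumed placeholder values; only "completed" and "failed" come from the example above.

```python
import time


def wait_for_rebuild(fetch_status, command_id, interval=5.0, timeout=600.0):
    """Poll until a terminal status is seen or the timeout elapses.

    fetch_status is any callable that maps a command ID to a status string.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(command_id)
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"rebuild {command_id} still '{status}' after {timeout}s")
        time.sleep(interval)


# Demo with a fake status source so the sketch runs without a server.
states = iter(["queued", "running", "completed"])
result = wait_for_rebuild(lambda _command_id: next(states), "rebuild_demo", interval=0.0)
print(result)
```

With the real client, fetch_status could plausibly be a small lambda that calls client.get_rebuild_status and returns its status field.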

Summary

  • Idempotent regeneration – The mode="existing" parameter targets only sources with prior embeddings, automatically deleting old vectors via repo_query before inserting new ones.
  • Asynchronous execution – The rebuild request returns immediately with a command_id while the background commands in commands/embedding_commands.py process each source.
  • End-to-end visibility – Track submitted, completed, and failed jobs via GET /api/embeddings/rebuild/{command_id}/status.
  • Configurable chunking – The pipeline respects current chunking settings defined in open_notebook/utils/chunking.py, ensuring new embeddings reflect your latest configuration.

Frequently Asked Questions

What does mode="existing" filter for in Open Notebook?

When you set mode="existing", the collect_items_for_rebuild function in commands/embedding_commands.py queries only sources that already have rows in the source_embedding table. This leaves untouched any sources that have never been processed, preventing unnecessary API calls and database operations on newly added files that lack prior embeddings.
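The filtering idea can be sketched in a few lines. This is an illustration, not the actual collect_items_for_rebuild implementation: collect_items is a hypothetical function, the real code queries the database rather than comparing lists, and treating every other mode as "select everything" is an assumption made only to keep the example complete.

```python
def collect_items(all_source_ids, embedded_source_ids, mode):
    """Return the source IDs a rebuild would process for the given mode."""
    if mode == "existing":
        embedded = set(embedded_source_ids)
        # Keep only sources that already have at least one embedding row.
        return [source_id for source_id in all_source_ids if source_id in embedded]
    # Assumption: any other mode selects every source.
    return list(all_source_ids)


sources = ["source:1", "source:2", "source:3"]
already_embedded = ["source:1", "source:3"]
selected = collect_items(sources, already_embedded, mode="existing")
print(selected)
```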

How does the system prevent duplicate embeddings during regeneration?

The embed_source_command explicitly executes a deletion query, repo_query("DELETE source_embedding …"), targeting the specific source ID before generating new vectors. As a result, re-running the rebuild job produces exactly one set of embeddings per source chunk, even after intermittent failures or retries.

Can I regenerate embeddings for a single source instead of all existing ones?

Yes, with some extra work. The standard endpoint processes batches, and its include_sources boolean only toggles whether sources are processed at all; it does not select individual ones. To target a single source, you can invoke the underlying embed_source_command directly, or extend the rebuild_embeddings_command function to accept an explicit list of source IDs.

Which file handles the actual call to the embedding model provider?

The low-level embedding generation is implemented in open_notebook/utils/embedding.py, which exports generate_embedding for single texts and generate_embeddings for batch processing. These functions interface with the model_manager to route requests to your configured provider (OpenAI, Ollama, etc.) and handle tokenization before the vectors are persisted via repo_insert in open_notebook/database/repository.py.
