How to Regenerate Vector Embeddings for Existing Sources in Open Notebook
To regenerate vector embeddings for existing sources, trigger the rebuild_embeddings command via POST /api/embeddings/rebuild with mode="existing", which deletes old vectors from the source_embedding table, re-chunks the text, and inserts fresh embeddings asynchronously.
Open Notebook persists vector embeddings in the source_embedding table to enable semantic search across uploaded documents. When you change embedding models or adjust chunking strategies in open_notebook/utils/chunking.py, you must regenerate these vectors to keep search results accurate. The repository provides an asynchronous pipeline, exposed through the REST API and Python client, that rebuilds embeddings for existing sources and skips sources that have never been embedded.
Understanding the Rebuild Architecture
The regeneration workflow coordinates five distinct components to process existing sources idempotently:
- API Router – api/routers/embedding_rebuild.py accepts the rebuild request, validates parameters, and queues the background job.
- Command Controller – commands/embedding_commands.py contains rebuild_embeddings_command, which calls collect_items_for_rebuild to gather source IDs and submits individual embed_source jobs.
- Source Processor – The embed_source_command loads each source via Source.get, deletes existing embeddings using repo_query("DELETE source_embedding …"), detects content type, and chunks text.
- Embedding Utilities – open_notebook/utils/embedding.py provides generate_embeddings to vectorize chunks via the configured model_manager, storing results via repo_insert.
- Status Endpoint – GET /api/embeddings/rebuild/{command_id}/status exposes real-time progress metrics.
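The delete, chunk, embed, insert sequence described above can be sketched in a few lines. This is only an illustration of the pattern, not the project's code: the in-memory embedding_table, the fixed-width chunk_text, and the fake_embed vectorizer are all hypothetical stand-ins for the source_embedding table, the chunking module, and the configured embedding model.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the source_embedding table:
# maps source_id -> list of (chunk_text, vector) rows.
embedding_table = defaultdict(list)


def chunk_text(text, size=20):
    """Toy fixed-width chunker (assumption; not the project's chunking logic)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(chunks):
    """Placeholder vectorizer: one 2-d vector per chunk, not a real model."""
    return [[float(len(c)), float(sum(map(ord, c)) % 97)] for c in chunks]


def embed_source(source_id, text):
    """Delete-then-insert, so re-running never leaves duplicate rows."""
    embedding_table.pop(source_id, None)                      # 1. drop old vectors
    chunks = chunk_text(text)                                 # 2. re-chunk
    vectors = fake_embed(chunks)                              # 3. vectorize
    embedding_table[source_id] = list(zip(chunks, vectors))   # 4. insert
    return len(chunks)


def rebuild(sources):
    """Embed every collected source; returns the chunk count per source."""
    return {sid: embed_source(sid, text) for sid, text in sources.items()}
```

Calling rebuild twice on the same input leaves the same number of rows per source, which is the property the deletion step is there to provide.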
Triggering a Rebuild via REST API
Send a POST request to /api/embeddings/rebuild with mode="existing" to limit processing to sources that already have embeddings. The operation is idempotent because each source's old records are removed before new ones are inserted.
curl -X POST "http://localhost:5055/api/embeddings/rebuild" \
  -H "Content-Type: application/json" \
  -d '{"mode":"existing","include_sources":true,"include_notes":false,"include_insights":false}'
The response returns a command_id and the estimated number of items:
{
  "command_id": "rebuild_01hv...",
  "total_items": 42
}
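For callers who prefer not to shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch: the helper names build_rebuild_request and start_rebuild are hypothetical, and the payload fields simply mirror the curl example above.

```python
import json
import urllib.request


def build_rebuild_request(base_url, mode="existing", include_sources=True,
                          include_notes=False, include_insights=False):
    """Build (but do not send) the POST request for /api/embeddings/rebuild."""
    payload = {
        "mode": mode,
        "include_sources": include_sources,
        "include_notes": include_notes,
        "include_insights": include_insights,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/embeddings/rebuild",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_rebuild(base_url, timeout=30):
    """Send the request and return the decoded JSON body (needs a running server)."""
    req = build_rebuild_request(base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a live server.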
Using the Python Client
The OpenNotebookClient class in api/client.py provides a typed interface for the same operation:
from api.client import OpenNotebookClient
client = OpenNotebookClient(base_url="http://localhost:5055")
rebuild_resp = client.rebuild_embeddings(
    mode="existing",  # Only sources with existing embeddings
    include_sources=True,
    include_notes=False,
    include_insights=False,
)
print(f"Command ID: {rebuild_resp.command_id}")
print(f"Estimated items: {rebuild_resp.total_items}")
Monitoring Rebuild Progress
Because the rebuild runs asynchronously, poll the status endpoint using the command_id returned at startup:
curl "http://localhost:5055/api/embeddings/rebuild/<command_id>/status"
Programmatic Status Polling
Use the Python client to track completion state:
import time
while True:
    status = client.get_rebuild_status(rebuild_resp.command_id)
    total_done = (status.stats.sources_submitted +
                  status.stats.notes_submitted +
                  status.stats.insights_submitted)
    print(f"Status: {status.status} | Progress: {total_done}/{status.progress.total}")
    if status.status in ("completed", "failed"):
        break
    time.sleep(5)
print("Final stats:", status.stats)
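The loop above runs until the job reaches a terminal state, so a stalled job would block the caller indefinitely. One possible variant adds a deadline. In this sketch, wait_for_rebuild is a hypothetical helper; it accepts any callable that returns an object with a status attribute (for example client.get_rebuild_status), and the sleep and clock parameters exist only so the function can be exercised without real waiting.

```python
import time


def wait_for_rebuild(fetch_status, command_id, interval=5.0, timeout=600.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(command_id) until a terminal state or the deadline.

    Returns the final status object, or raises TimeoutError if the job is
    still running once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(command_id)
        if status.status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"rebuild {command_id} still {status.status!r} after {timeout}s"
            )
        sleep(interval)
```

A call such as wait_for_rebuild(client.get_rebuild_status, rebuild_resp.command_id, timeout=900) would then either return the final status or fail loudly.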
Summary
- Idempotent regeneration – The mode="existing" parameter targets only sources with prior embeddings, automatically deleting old vectors via repo_query before inserting new ones.
- Asynchronous execution – The rebuild_embeddings_command returns immediately with a command_id while background workers in commands/embedding_commands.py handle the heavy lifting.
- End-to-end visibility – Track submitted, completed, and failed jobs via GET /api/embeddings/rebuild/{command_id}/status.
- Configurable chunking – The pipeline respects current chunking settings defined in open_notebook/utils/chunking.py, ensuring new embeddings reflect your latest configuration.
Frequently Asked Questions
What does mode="existing" filter for in Open Notebook?
When you set mode="existing", the collect_items_for_rebuild function in commands/embedding_commands.py queries only sources that already have rows in the source_embedding table. This leaves untouched any sources that have never been processed, preventing unnecessary API calls and database operations on newly added files that lack prior embeddings.
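Conceptually, the filter keeps a source only if at least one embedding row points at it. The sketch below shows that idea in plain Python; the function name collect_existing_source_ids and the row shape (a dict with a "source" key) are assumptions made for illustration and are not taken from the repository.

```python
def collect_existing_source_ids(all_source_ids, embedding_rows):
    """Return the source IDs that already own at least one embedding row.

    embedding_rows is an iterable of dicts with a "source" key, a hypothetical
    shape standing in for rows of the source_embedding table.
    """
    embedded = {row["source"] for row in embedding_rows}
    # Preserve the caller's ordering and skip never-embedded sources.
    return [sid for sid in all_source_ids if sid in embedded]
```

Multiple rows for the same source collapse into one entry, so each source is queued for re-embedding once.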
How does the system prevent duplicate embeddings during regeneration?
The embed_source_command explicitly executes a deletion query—repo_query("DELETE source_embedding …")—targeting the specific source ID before generating new vectors. This guarantees that re-running the rebuild job multiple times produces exactly one set of embeddings per source chunk, regardless of intermittent failures or retries.
Can I regenerate embeddings for a single source instead of all existing ones?
Yes, though not through the standard endpoint as documented here, which processes batches. You can invoke the underlying embed_source_command directly for a single source. The include_sources boolean only toggles sources as a category; accepting an explicit list of source IDs would require extending the rebuild_embeddings_command function.
Which file handles the actual call to the embedding model provider?
The low-level embedding generation is implemented in open_notebook/utils/embedding.py, which exports generate_embedding for single texts and generate_embeddings for batch processing. These functions interface with the model_manager to route requests to your configured provider (OpenAI, Ollama, etc.) and handle tokenization before the vectors are persisted via repo_insert in open_notebook/database/repository.py.