# How to Regenerate Vector Embeddings for Existing Sources in Open Notebook

> Learn how to regenerate vector embeddings for existing sources in Open Notebook. Trigger the rebuild_embeddings API and refresh your data efficiently.

- Repository: [Luis Novo/open-notebook](https://github.com/lfnovo/open-notebook)
- Tags: how-to-guide
- Published: 2026-06-07

---

**To regenerate vector embeddings for existing sources, trigger the `rebuild_embeddings` command via `POST /api/embeddings/rebuild` with `mode="existing"`, which deletes old vectors from the `source_embedding` table, re-chunks the text, and inserts fresh embeddings asynchronously.**

Open Notebook persists vector embeddings in the `source_embedding` table to enable semantic search across uploaded documents. When you change embedding models or adjust chunking strategies in [`open_notebook/utils/chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py), you must regenerate these vectors to keep search results accurate. The repository provides an asynchronous pipeline, exposed through the REST API and Python client, that rebuilds embeddings for sources that already have them and skips sources that have never been embedded.

## Understanding the Rebuild Architecture

The regeneration workflow coordinates five distinct components to process existing sources idempotently:

1. **API Router** – [`api/routers/embedding_rebuild.py`](https://github.com/lfnovo/open-notebook/blob/main/api/routers/embedding_rebuild.py) accepts the rebuild request, validates parameters, and queues the background job.
2. **Command Controller** – [`commands/embedding_commands.py`](https://github.com/lfnovo/open-notebook/blob/main/commands/embedding_commands.py) contains `rebuild_embeddings_command`, which calls `collect_items_for_rebuild` to gather source IDs and submits individual `embed_source` jobs.
3. **Source Processor** – The `embed_source_command` loads each source via `Source.get`, deletes existing embeddings using `repo_query("DELETE source_embedding …")`, detects content type, and chunks text.
4. **Embedding Utilities** – [`open_notebook/utils/embedding.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/embedding.py) provides `generate_embeddings` to vectorize chunks via the configured `model_manager`, storing results via `repo_insert`.
5. **Status Endpoint** – `GET /api/embeddings/rebuild/{command_id}/status` exposes real-time progress metrics.
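
The flow through these components can be sketched with in-memory stand-ins. This is an illustration only, not the repository's code: `SOURCES`, `EMBEDDINGS`, `fake_chunk`, `fake_embed`, and the function names below are hypothetical placeholders for the source records, the `source_embedding` table, the chunking utilities, and `generate_embeddings`.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the sources and the source_embedding table.
SOURCES: Dict[str, str] = {
    "source:a": "alpha beta gamma delta epsilon",
    "source:b": "one two three",
    "source:c": "never embedded before",
}
EMBEDDINGS: Dict[str, List[List[float]]] = {
    "source:a": [[0.0]],  # stale vectors from an earlier model
    "source:b": [[0.0]],
}


def fake_chunk(text: str, size: int = 2) -> List[str]:
    """Split text into chunks of `size` words (placeholder for real chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def fake_embed(chunks: List[str]) -> List[List[float]]:
    """Return one toy vector per chunk (placeholder for a real embedding model)."""
    return [[float(len(chunk))] for chunk in chunks]


def collect_existing() -> List[str]:
    """Mimic mode="existing": keep only sources that already have vectors."""
    return [source_id for source_id in SOURCES if source_id in EMBEDDINGS]


def embed_source(source_id: str) -> int:
    """Delete old vectors, re-chunk, embed, and store; return the chunk count."""
    EMBEDDINGS.pop(source_id, None)             # delete step
    chunks = fake_chunk(SOURCES[source_id])     # chunk step
    EMBEDDINGS[source_id] = fake_embed(chunks)  # embed + insert step
    return len(chunks)


def rebuild_existing() -> Dict[str, int]:
    """Run one job per collected source and report chunks written per source."""
    return {source_id: embed_source(source_id) for source_id in collect_existing()}


result = rebuild_existing()
print(result)  # {'source:a': 3, 'source:b': 2}
```

`source:c` is skipped because it has no prior vectors, and calling `rebuild_existing()` again returns the same counts, since each job clears its own rows before writing.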

## Triggering a Rebuild via REST API

Send a `POST` request to `/api/embeddings/rebuild` with `"mode": "existing"` in the JSON body to limit processing to sources that already have embeddings. The operation is idempotent because each source's old records are removed before new ones are inserted, not because of the mode itself.

```bash
curl -X POST "http://localhost:5055/api/embeddings/rebuild" \
  -H "Content-Type: application/json" \
  -d '{"mode":"existing","include_sources":true,"include_notes":false,"include_insights":false}'
```

The response returns a `command_id` and the estimated number of items:

```json
{
  "command_id": "rebuild_01hv...",
  "total_items": 42
}
```
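
If you would rather not depend on the bundled client, the same request can be built with Python's standard library. The following is a minimal sketch: the helper names `build_rebuild_request` and `parse_rebuild_response` are made up for this example, and the only response fields it reads are the two shown above.

```python
import json
import urllib.request
from typing import Tuple


def build_rebuild_request(base_url: str, mode: str = "existing") -> urllib.request.Request:
    """Build (but do not send) the POST request for a sources-only rebuild."""
    payload = {
        "mode": mode,
        "include_sources": True,
        "include_notes": False,
        "include_insights": False,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/embeddings/rebuild",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_rebuild_response(body: bytes) -> Tuple[str, int]:
    """Extract command_id and total_items from the JSON response body."""
    data = json.loads(body)
    return data["command_id"], int(data["total_items"])


request = build_rebuild_request("http://localhost:5055")
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=30) as response:
#     command_id, total_items = parse_rebuild_response(response.read())
```

Building the request separately from sending it makes the payload easy to inspect or log before any work is queued.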

### Using the Python Client

The `OpenNotebookClient` class in [`api/client.py`](https://github.com/lfnovo/open-notebook/blob/main/api/client.py) provides a typed interface for the same operation:

```python
from api.client import OpenNotebookClient  # path follows api/client.py at the repository root

client = OpenNotebookClient(base_url="http://localhost:5055")

rebuild_resp = client.rebuild_embeddings(
    mode="existing",      # Only sources with existing embeddings
    include_sources=True,
    include_notes=False,
    include_insights=False,
)

print(f"Command ID: {rebuild_resp.command_id}")
print(f"Estimated items: {rebuild_resp.total_items}")
```

## Monitoring Rebuild Progress

Because the rebuild runs asynchronously, poll the status endpoint using the `command_id` returned at startup:

```bash
curl "http://localhost:5055/api/embeddings/rebuild/<command_id>/status"
```

### Programmatic Status Polling

Use the Python client to track completion state:

```python
import time

# Poll every 5 seconds until the rebuild reaches a terminal state.
while True:
    status = client.get_rebuild_status(rebuild_resp.command_id)
    # These counters track jobs submitted so far, not jobs finished.
    total_submitted = (
        status.stats.sources_submitted
        + status.stats.notes_submitted
        + status.stats.insights_submitted
    )
    print(f"Status: {status.status} | Submitted: {total_submitted}/{status.progress.total}")

    if status.status in ("completed", "failed"):
        break
    time.sleep(5)

print("Final stats:", status.stats)
```
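
A bare `while True` loop never exits if the job stalls. One way to bound the wait is a small helper with a deadline, sketched below. The helper `wait_for_rebuild` is hypothetical, and the `"running"` state in the demo is an assumed placeholder; only `"completed"` and `"failed"` are taken from the loop above.

```python
import time
from typing import Callable


def wait_for_rebuild(
    fetch_state: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state until it reports a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state in ("completed", "failed"):
            return state
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"rebuild still {state!r} after {timeout}s")
        sleep(interval)


# Demo with canned states instead of a live server.
states = iter(["running", "running", "completed"])
final_state = wait_for_rebuild(lambda: next(states), interval=0.0, timeout=1.0)
print(final_state)  # completed
```

With the client from the earlier example, the callable could be `lambda: client.get_rebuild_status(rebuild_resp.command_id).status`.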

## Summary

- **Idempotent regeneration** – The `mode="existing"` parameter targets only sources with prior embeddings, automatically deleting old vectors via `repo_query` before inserting new ones.
- **Asynchronous execution** – The `POST /api/embeddings/rebuild` endpoint returns immediately with a `command_id`, while `rebuild_embeddings_command` in [`commands/embedding_commands.py`](https://github.com/lfnovo/open-notebook/blob/main/commands/embedding_commands.py) runs in the background and submits the per-source jobs.
- **End-to-end visibility** – Track submitted, completed, and failed jobs via `GET /api/embeddings/rebuild/{command_id}/status`.
- **Configurable chunking** – The pipeline respects current chunking settings defined in [`open_notebook/utils/chunking.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/chunking.py), ensuring new embeddings reflect your latest configuration.

## Frequently Asked Questions

### What does `mode="existing"` filter for in Open Notebook?

When you set `mode="existing"`, the `collect_items_for_rebuild` function in [`commands/embedding_commands.py`](https://github.com/lfnovo/open-notebook/blob/main/commands/embedding_commands.py) queries only sources that already have rows in the `source_embedding` table. Sources that have never been processed are skipped, which avoids unnecessary embedding API calls and database writes for newly added files.

### How does the system prevent duplicate embeddings during regeneration?

The `embed_source_command` executes a deletion query, `repo_query("DELETE source_embedding …")`, scoped to the specific source ID before generating new vectors. Re-running the rebuild therefore leaves one set of embeddings per source instead of accumulating duplicates, even when an earlier run failed partway or was retried.
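
The delete-before-insert pattern can be demonstrated with SQLite as a stand-in. This is a sketch under assumptions: Open Notebook goes through its own `repo_query` and `repo_insert` helpers and not through `sqlite3`, the three-column schema is invented for the example, and wrapping both statements in one transaction is a choice made here, not something the source confirms.

```python
import sqlite3
from typing import List


def replace_embeddings(conn: sqlite3.Connection, source_id: str, chunks: List[str]) -> None:
    """Delete a source's old rows, then insert one row per chunk, in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM source_embedding WHERE source_id = ?", (source_id,))
        conn.executemany(
            "INSERT INTO source_embedding (source_id, chunk_index, content) VALUES (?, ?, ?)",
            [(source_id, index, chunk) for index, chunk in enumerate(chunks)],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_embedding (source_id TEXT, chunk_index INTEGER, content TEXT)"
)

# Running the same job three times still leaves exactly one row per chunk.
for _ in range(3):
    replace_embeddings(conn, "source:a", ["first chunk", "second chunk"])

row_count = conn.execute(
    "SELECT COUNT(*) FROM source_embedding WHERE source_id = ?", ("source:a",)
).fetchone()[0]
print(row_count)  # 2
```

Without the `DELETE`, the same three runs would leave six rows for the source.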

### Can I regenerate embeddings for a single source instead of all existing ones?

Yes, with some extra work. The standard endpoint processes batches, and the payload shown in this guide exposes only the `include_sources`, `include_notes`, and `include_insights` booleans, so it cannot target one source by ID as written. To re-embed a single source, invoke the underlying `embed_source_command` for that source directly, or extend `rebuild_embeddings_command` to accept an explicit list of source IDs.

### Which file handles the actual call to the embedding model provider?

The low-level embedding generation is implemented in [`open_notebook/utils/embedding.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/utils/embedding.py), which exports `generate_embedding` for single texts and `generate_embeddings` for batch processing. These functions interface with the `model_manager` to route requests to your configured provider (OpenAI, Ollama, etc.) and handle tokenization before the vectors are persisted via `repo_insert` in [`open_notebook/database/repository.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/database/repository.py).