# How `model_discovery.py` Discovers and Fetches Available AI Models for Each Supported Provider

> Learn how `model_discovery.py` automates AI model discovery across providers. The module queries vendor APIs and static definitions, classifies each model, and persists the results to SurrealDB.

- Repository: [Luis Novo/open-notebook](https://github.com/lfnovo/open-notebook)
- Tags: deep-dive
- Published: 2026-06-06

---

**The [`open_notebook/ai/model_discovery.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/ai/model_discovery.py) module automates model discovery by querying vendor APIs or static definitions, classifying each model into a canonical type, and persisting the results to SurrealDB.**

Tracking every available AI model across multiple providers by hand is tedious and error-prone. In the `lfnovo/open-notebook` repository, [`open_notebook/ai/model_discovery.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/ai/model_discovery.py) centralizes that workflow. It unifies dynamic API queries, static vendor lists, and SurrealDB persistence behind a single async interface.

## Classifying Models with `classify_model_type()`

Inside [`open_notebook/ai/model_discovery.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/ai/model_discovery.py), the `classify_model_type()` function (lines 56–90) maps raw model identifiers to one of four canonical types: **language**, **embedding**, **speech_to_text**, or **text_to_speech**. It matches identifiers against provider-specific pattern tables such as `OPENAI_MODEL_TYPES` and `GOOGLE_MODEL_TYPES`. This ensures every model returned by a discovery coroutine is tagged consistently regardless of its source.
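
The pattern-table lookup can be sketched roughly as follows. This is a minimal illustration, not the module's code: the table contents, the substring-matching rule, and the `language` fallback are all assumptions, and the real `OPENAI_MODEL_TYPES` and `GOOGLE_MODEL_TYPES` tables may be structured differently.

```python
# Hypothetical pattern table: provider -> canonical type -> identifier substrings.
# The contents of the real OPENAI_MODEL_TYPES / GOOGLE_MODEL_TYPES are not shown here.
EXAMPLE_MODEL_TYPES = {
    "example_provider": {
        "embedding": ["embed"],
        "speech_to_text": ["whisper", "transcribe"],
        "text_to_speech": ["tts"],
        "language": ["gpt", "chat"],
    }
}


def classify_model_type_sketch(model_name: str, provider: str) -> str:
    """Return one of the four canonical types for a raw model identifier."""
    name = model_name.lower()
    # Tables are checked in insertion order, so specific types win over "language".
    for model_type, patterns in EXAMPLE_MODEL_TYPES.get(provider, {}).items():
        if any(pattern in name for pattern in patterns):
            return model_type
    # Assumed fallback: treat unmatched identifiers as language models.
    return "language"


for identifier in ["text-embedding-3-small", "whisper-1", "tts-1", "gpt-4o"]:
    print(identifier, "->", classify_model_type_sketch(identifier, "example_provider"))
```

Ordering the specific types before `language` matters in a scheme like this, because an identifier can match more than one pattern.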

## Per-Provider Discovery Coroutines

For every supported backend, [`model_discovery.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/ai/model_discovery.py) exposes an async coroutine named `discover_<provider>_models()`. Each coroutine obtains the provider's model identifiers, either from a live API call or from a static list, and assigns each one a canonical type before returning a list of `DiscoveredModel` objects.

### Dynamic API-Based Discovery

Providers that expose a public listing endpoint are queried live with **httpx**.

- `discover_openai_models()` (lines 98–130) reads the OpenAI API key from the environment or a credential record, sends a GET request to the provider’s models endpoint, and iterates the JSON payload to extract identifiers.
- `discover_google_models()` (lines 151–188) follows the same pattern for Google’s API.
- `discover_ollama_models()` (lines 194–226) fetches the local Ollama model registry dynamically.

After parsing the JSON response, each function calls `classify_model_type()` to determine the model’s role.
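
The parse-and-classify step can be sketched without any network access. The example below is an assumption-laden illustration: `DiscoveredModelSketch`, `parse_listing`, and `toy_classifier` are hypothetical names, the record fields mirror the attributes used in this article's examples, and the HTTP request itself (made with **httpx** in the real module) is left out. The sample payload follows the `{"data": [{"id": ...}]}` shape returned by OpenAI's model-listing endpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DiscoveredModelSketch:
    # Hypothetical stand-in for DiscoveredModel; fields follow the article's examples.
    provider: str
    name: str
    model_type: str


def parse_listing(
    provider: str, payload: dict, classify: Callable[[str], str]
) -> list[DiscoveredModelSketch]:
    """Turn a decoded model-listing response into typed records."""
    models = []
    for entry in payload.get("data", []):
        model_id = entry.get("id")
        if not model_id:
            continue  # skip malformed entries instead of failing the whole run
        models.append(DiscoveredModelSketch(provider, model_id, classify(model_id)))
    return models


def toy_classifier(name: str) -> str:
    # Simplified classifier used only for this sketch.
    return "embedding" if "embed" in name else "language"


# Already-decoded JSON, so no HTTP call is needed to run the sketch.
sample = {"object": "list", "data": [{"id": "text-embedding-3-small"}, {"id": "gpt-4o"}, {}]}
parsed = parse_listing("openai", sample, toy_classifier)
print([(m.name, m.model_type) for m in parsed])
```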

### Static Hard-Coded Lists for Private Catalogues

Some vendors do not expose a public listing API. For **Anthropic**, **ElevenLabs**, and **Voyage**, the module falls back to static hard-coded lists:

- Anthropic (lines 132–149)
- ElevenLabs (lines 250–258)
- Voyage (lines 310–322)

These functions still return standardized `DiscoveredModel` records, preserving a uniform interface across all providers.
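
A static discovery coroutine might look like the sketch below. The vendor name and model identifiers are placeholders rather than a real catalogue, and `DiscoveredModelSketch` is a hypothetical stand-in for `DiscoveredModel`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class DiscoveredModelSketch:
    # Hypothetical stand-in for DiscoveredModel; fields follow the article's examples.
    provider: str
    name: str
    model_type: str


# Placeholder identifiers, not a real vendor catalogue.
STATIC_EXAMPLE_MODELS = [
    ("example-chat-large", "language"),
    ("example-embed-v1", "embedding"),
]


async def discover_static_example_models() -> list[DiscoveredModelSketch]:
    """Return the curated list; async only so it matches the live-API coroutines."""
    return [
        DiscoveredModelSketch("example_vendor", name, model_type)
        for name, model_type in STATIC_EXAMPLE_MODELS
    ]


static_models = asyncio.run(discover_static_example_models())
print([(m.name, m.model_type) for m in static_models])
```

Keeping the static functions asynchronous, even though they never await anything, lets callers treat every provider identically.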

## The `PROVIDER_DISCOVERY_FUNCTIONS` Registry

To avoid scattered conditional logic, the module declares a dictionary named `PROVIDER_DISCOVERY_FUNCTIONS` that maps each provider name to its dedicated coroutine. Providers that require credential-based discovery, such as **Azure** and **Vertex**, are currently mapped to `None` as placeholders for a future implementation.
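
A registry of this kind, together with a dispatcher that tolerates `None` placeholders, can be sketched as follows. The names `REGISTRY_SKETCH`, `discover_example_models`, and `discover_sketch` are hypothetical, and returning an empty list for placeholder or unknown providers is an assumption; the real module may handle those cases differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def discover_example_models() -> list[str]:
    # Placeholder coroutine standing in for a discover_<provider>_models() function.
    return ["example-model-a", "example-model-b"]


# Provider name -> discovery coroutine, with None marking unimplemented providers.
REGISTRY_SKETCH: dict[str, Optional[Callable[[], Awaitable[list[str]]]]] = {
    "example": discover_example_models,
    "azure": None,
    "vertex": None,
}


async def discover_sketch(provider: str) -> list[str]:
    func = REGISTRY_SKETCH.get(provider)
    if func is None:
        # Assumption: placeholder and unknown providers both yield an empty list.
        return []
    return await func()


print(asyncio.run(discover_sketch("example")))
print(asyncio.run(discover_sketch("azure")))
```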

## Orchestrating Discovery and Database Sync

The remaining functions coordinate when and how discovery runs, plus how results land in SurrealDB.

### On-Demand Discovery with `discover_provider_models()`

The `discover_provider_models(provider)` function (lines 240–262) looks up the target provider in `PROVIDER_DISCOVERY_FUNCTIONS` and awaits the corresponding coroutine. It returns a list of `DiscoveredModel` objects that can be consumed immediately.

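```python
# Example 1 – Discover models for a single provider (e.g., OpenAI)
import asyncio

from open_notebook.ai.model_discovery import discover_provider_models


async def main() -> None:
    # `await` is only valid inside a coroutine, so the call is wrapped in main()
    models = await discover_provider_models("openai")
    for m in models:
        print(f"{m.provider}/{m.name} → {m.model_type}")


asyncio.run(main())
```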

### Synchronizing Results to SurrealDB

`sync_provider_models(provider, auto_register=True)` (lines 264–311) pairs discovery with persistence. It first calls `discover_provider_models()`, then fetches the provider's existing records from SurrealDB up front via `repo_query` (lines 288–298), which avoids an N+1 pattern of one lookup per discovered model. Models not yet present are inserted as `open_notebook.ai.models.Model` records. The function returns a tuple of **(discovered, new, existing)** counts, giving callers a clear delta of what changed.
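
The delta computation can be illustrated with plain Python collections standing in for SurrealDB. In this sketch `plan_sync` is a hypothetical helper, and keying records by `(name, model_type)` is an assumption about how duplicates are detected.

```python
def plan_sync(
    discovered: list[tuple[str, str]], existing: set[tuple[str, str]]
) -> tuple[list[tuple[str, str]], tuple[int, int, int]]:
    """Split discovered models into new vs. already stored, using one prefetched set."""
    # Set membership keeps this a single pass instead of one database lookup per model.
    to_insert = [model for model in discovered if model not in existing]
    counts = (len(discovered), len(to_insert), len(discovered) - len(to_insert))
    return to_insert, counts


# `existing_rows` stands in for the result of a single up-front query.
existing_rows = {("model-a", "language")}
discovered_now = [("model-a", "language"), ("model-b", "embedding")]

to_insert, counts = plan_sync(discovered_now, existing_rows)
print(to_insert)
print(counts)
```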

### Parallel Bulk Sync with `sync_all_providers()`

For administrative tasks or background jobs, `sync_all_providers()` (lines 322–357) launches `sync_provider_models()` concurrently for every entry in `PROVIDER_DISCOVERY_FUNCTIONS`. Results are gathered into a single dictionary keyed by provider name.

```python
# Example 2 – Sync all providers and see the per-provider statistics
import asyncio

from open_notebook.ai.model_discovery import sync_all_providers


async def main() -> None:
    # Each value is a (discovered, new, existing) tuple for that provider
    results = await sync_all_providers()
    for provider, (disc, new, exist) in results.items():
        print(f"{provider}: discovered={disc}, new={new}, existing={exist}")


asyncio.run(main())
```
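
Concurrent fan-out of this kind is commonly built on `asyncio.gather`. The sketch below shows the general pattern with a fake per-provider sync; whether the real module uses `gather`, and how it reports a failing provider, are assumptions here. `fake_sync` and `sync_all_sketch` are hypothetical names.

```python
import asyncio


async def fake_sync(provider: str) -> tuple[int, int, int]:
    # Stand-in for sync_provider_models(); returns (discovered, new, existing).
    await asyncio.sleep(0)
    if provider == "broken":
        raise RuntimeError("provider unavailable")
    return (3, 1, 2)


async def sync_all_sketch(providers: list[str]) -> dict[str, tuple[int, int, int]]:
    # return_exceptions=True keeps one failing provider from cancelling the batch.
    results = await asyncio.gather(
        *(fake_sync(p) for p in providers), return_exceptions=True
    )
    summary = {}
    for provider, result in zip(providers, results):
        # Assumption: a failed provider is reported as zero counts.
        summary[provider] = (0, 0, 0) if isinstance(result, BaseException) else result
    return summary


summary = asyncio.run(sync_all_sketch(["alpha", "broken"]))
print(summary)
```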

## Dashboard Aggregation with `get_provider_model_count()`

The `get_provider_model_count(provider)` utility (lines 359–389) aggregates registered models by their canonical type. This powers UI dashboards that need at-a-glance counts of language, embedding, speech-to-text, and text-to-speech availability.

```python
# Example 3 – Get a summary of registered model counts for a provider
import asyncio

from open_notebook.ai.model_discovery import get_provider_model_count


async def main() -> None:
    counts = await get_provider_model_count("google")
    # Illustrative shape only; actual numbers depend on what is registered, e.g.
    # {'language': 3, 'embedding': 1, 'speech_to_text': 0, 'text_to_speech': 0}
    print(counts)


asyncio.run(main())
```
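
The aggregation itself amounts to counting records per canonical type. A minimal sketch follows, assuming the function always reports all four types (including zeros) as in the illustrative output above; `count_by_type` is a hypothetical helper operating on a plain list rather than on SurrealDB rows.

```python
from collections import Counter

CANONICAL_TYPES = ("language", "embedding", "speech_to_text", "text_to_speech")


def count_by_type(model_types: list[str]) -> dict[str, int]:
    """Count models per canonical type, reporting zero for absent types."""
    counts = Counter(model_types)
    return {model_type: counts.get(model_type, 0) for model_type in CANONICAL_TYPES}


print(count_by_type(["language", "language", "embedding", "language"]))
```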

## Summary

- `classify_model_type()` normalizes every model identifier into one of four canonical types using provider-specific pattern tables.
- Each backend exposes a `discover_<provider>_models()` coroutine that either calls a live API via **httpx** or returns a curated static list.
- `PROVIDER_DISCOVERY_FUNCTIONS` wires every provider to its discovery routine, with `None` placeholders for providers such as Azure and Vertex.
- `sync_provider_models()` avoids duplicate database writes by querying SurrealDB with `repo_query` before inserting new `Model` records.
- `sync_all_providers()` runs parallel discovery across every supported provider.
- `get_provider_model_count()` provides type-aggregated statistics for dashboard consumption.

## Frequently Asked Questions

### How does [`model_discovery.py`](https://github.com/lfnovo/open-notebook/blob/main/open_notebook/ai/model_discovery.py) classify a discovered model?

The `classify_model_type()` function inspects the model identifier against provider-specific pattern tables such as `OPENAI_MODEL_TYPES` and `GOOGLE_MODEL_TYPES` (lines 56–90). It returns one of four canonical strings: `language`, `embedding`, `speech_to_text`, or `text_to_speech`.

### What happens when a provider has no public model listing API?

For providers like Anthropic, ElevenLabs, and Voyage that lack a public endpoint, the module uses static hard-coded lists inside the respective `discover_<provider>_models()` functions (lines 132–149, 250–258, and 310–322). These lists still produce fully typed `DiscoveredModel` objects.

### Where are discovered models stored in Open Notebook?

Discovered models are persisted as `open_notebook.ai.models.Model` records in SurrealDB. The `sync_provider_models()` routine (lines 264–311) first queries the existing catalogue via `repo_query` to avoid duplicates, then inserts only new entries.

### Can I trigger model discovery for every provider at once?

Yes. The `sync_all_providers()` coroutine (lines 322–357) concurrently executes `sync_provider_models()` for every provider registered in `PROVIDER_DISCOVERY_FUNCTIONS`. It returns a dictionary mapping each provider to its `(discovered, new, existing)` tuple.