How `model_discovery.py` Discovers and Fetches Available AI Models for Each Supported Provider

June 6, 2026 · lfnovo/open-notebook

The open_notebook/ai/model_discovery.py module automates model discovery by querying vendor APIs or static definitions, classifying each model into a canonical type, and persisting the results to SurrealDB.

Tracking every available AI model across multiple providers by hand is tedious and error-prone. In the lfnovo/open-notebook repository, open_notebook/ai/model_discovery.py centralizes that work: it unifies dynamic API queries, static vendor fallbacks, and SurrealDB persistence behind a single async interface.

Classifying Models with `classify_model_type()`

Inside open_notebook/ai/model_discovery.py, the classify_model_type() function (lines 56–90) maps raw model identifiers to one of four canonical types: language, embedding, speech_to_text, or text_to_speech. It matches identifiers against provider-specific pattern tables such as OPENAI_MODEL_TYPES and GOOGLE_MODEL_TYPES. This ensures every model returned by a discovery coroutine is tagged consistently regardless of its source.
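To make the idea of a pattern table concrete, here is a minimal sketch of substring-based classification. The table EXAMPLE_MODEL_TYPES, the function classify_example(), and the fallback to language for unmatched identifiers are all assumptions made for illustration; the actual contents of OPENAI_MODEL_TYPES and the real matching rules may differ.

```python
from typing import Dict, List

# Hypothetical pattern table: canonical type -> substrings to look for.
# Order matters: more specific types are checked before "language".
EXAMPLE_MODEL_TYPES: Dict[str, List[str]] = {
    "embedding": ["embedding"],
    "speech_to_text": ["whisper", "transcribe"],
    "text_to_speech": ["tts"],
    "language": ["gpt", "chat"],
}


def classify_example(
    model_id: str,
    table: Dict[str, List[str]] = EXAMPLE_MODEL_TYPES,
    default: str = "language",  # assumed fallback, not confirmed by the source
) -> str:
    """Return the first canonical type whose pattern appears in model_id."""
    lowered = model_id.lower()
    for model_type, patterns in table.items():
        if any(pattern in lowered for pattern in patterns):
            return model_type
    return default


for name in ["gpt-4o", "text-embedding-3-small", "whisper-1", "tts-1"]:
    print(name, "->", classify_example(name))
```

Checking the narrower types first keeps an identifier that contains both a generic and a specific pattern from being tagged as a plain language model.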

Per-Provider Discovery Coroutines

For every supported backend, model_discovery.py exposes an async coroutine named discover_<provider>_models(). API-backed coroutines handle authentication, issue an HTTP request, parse the response, and delegate type classification; static ones build their results from a hard-coded list. Either way, the coroutine returns a list of DiscoveredModel objects.

Dynamic API-Based Discovery

Providers that expose a model-listing endpoint, whether hosted or local, are queried live with httpx.

discover_openai_models() (lines 98–130) reads the OpenAI API key from the environment or a credential record, sends a GET request to the provider’s models endpoint, and iterates the JSON payload to extract identifiers.
discover_google_models() (lines 151–188) follows the same pattern for Google’s API.
discover_ollama_models() (lines 194–226) fetches the local Ollama model registry dynamically.

After parsing the JSON response, each function calls classify_model_type() to determine the model’s role.
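The parse-then-classify step might look roughly like the sketch below. It leaves out the httpx request and works on an already decoded payload so it can run offline. DiscoveredModelSketch is a hypothetical stand-in for DiscoveredModel (its three fields are the ones used in this article's examples), parse_listing() is an illustrative helper, and the data/id shape is the one OpenAI's list-models endpoint returns; other providers use different keys.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class DiscoveredModelSketch:
    """Hypothetical stand-in for DiscoveredModel."""
    name: str
    provider: str
    model_type: str


def parse_listing(
    payload: Dict[str, Any],
    provider: str,
    classify: Callable[[str], str],
) -> List[DiscoveredModelSketch]:
    """Turn a {"data": [{"id": ...}, ...]} payload into typed records."""
    models: List[DiscoveredModelSketch] = []
    for entry in payload.get("data", []):
        model_id = entry.get("id")
        if not model_id:
            continue  # skip malformed entries instead of failing the whole run
        models.append(
            DiscoveredModelSketch(
                name=model_id,
                provider=provider,
                model_type=classify(model_id),
            )
        )
    return models


def simple_classifier(model_id: str) -> str:
    # Toy classifier used only to keep this sketch self-contained.
    return "embedding" if "embedding" in model_id else "language"


sample = {"object": "list", "data": [{"id": "gpt-4o"}, {"id": "text-embedding-3-small"}, {}]}
parsed = parse_listing(sample, "openai", simple_classifier)
for model in parsed:
    print(model)
```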

Static Hard-Coded Lists for Private Catalogues

Some vendors do not expose a public listing API. For Anthropic, ElevenLabs, and Voyage, the module falls back to static hard-coded lists:

Anthropic (lines 132–149)
ElevenLabs (lines 250–258)
Voyage (lines 310–322)

These functions still return standardized DiscoveredModel records, preserving a uniform interface across all providers.
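A static fallback of this kind can be sketched as a coroutine that wraps a hard-coded table. Everything named here is hypothetical: the model names are invented placeholders rather than real vendor catalogue entries, and DiscoveredModelSketch only approximates DiscoveredModel.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DiscoveredModelSketch:
    """Hypothetical stand-in for DiscoveredModel."""
    name: str
    provider: str
    model_type: str


# Placeholder entries: (model name, canonical type). Names are invented.
EXAMPLE_STATIC_MODELS: List[Tuple[str, str]] = [
    ("example-chat-large", "language"),
    ("example-chat-small", "language"),
]


async def discover_example_static_models() -> List[DiscoveredModelSketch]:
    """Async for parity with API-backed discovery, although no I/O happens."""
    return [
        DiscoveredModelSketch(name=name, provider="example", model_type=model_type)
        for name, model_type in EXAMPLE_STATIC_MODELS
    ]


static_models = asyncio.run(discover_example_static_models())
print(static_models)
```

Keeping the static variant a coroutine means callers can await every provider the same way, at the cost of having to update the list by hand when a vendor ships new models.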

The `PROVIDER_DISCOVERY_FUNCTIONS` Registry

To avoid scattered conditional logic, the module declares a dictionary named PROVIDER_DISCOVERY_FUNCTIONS (lines 180–206) that maps each provider name to its dedicated coroutine. Providers requiring credential-based discovery such as Azure and Vertex are currently mapped to None, acting as placeholders for future implementation.
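The registry pattern described above can be illustrated with a small dictionary of coroutines. EXAMPLE_REGISTRY, discover_alpha_models(), and discover_for() are hypothetical names, and returning an empty list for a None placeholder or an unknown provider is an assumption; the article does not say how the real module handles those cases.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

DiscoveryFn = Callable[[], Awaitable[List[str]]]


async def discover_alpha_models() -> List[str]:
    # Stand-in for a real discover_<provider>_models() coroutine.
    return ["alpha-1", "alpha-2"]


# Hypothetical registry: provider name -> coroutine, or None as a placeholder.
EXAMPLE_REGISTRY: Dict[str, Optional[DiscoveryFn]] = {
    "alpha": discover_alpha_models,
    "beta": None,  # discovery not implemented yet
}


async def discover_for(provider: str) -> List[str]:
    """Look up the provider's coroutine and await it, if there is one."""
    discovery_fn = EXAMPLE_REGISTRY.get(provider)
    if discovery_fn is None:
        return []  # assumed behaviour for placeholders and unknown providers
    return await discovery_fn()


alpha_result = asyncio.run(discover_for("alpha"))
beta_result = asyncio.run(discover_for("beta"))
print(alpha_result, beta_result)
```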

Orchestrating Discovery and Database Sync

The remaining functions coordinate when and how discovery runs, plus how results land in SurrealDB.

On-Demand Discovery with `discover_provider_models()`

The discover_provider_models(provider) function (lines 240–262) looks up the target provider in PROVIDER_DISCOVERY_FUNCTIONS and awaits the corresponding coroutine. It returns a list of DiscoveredModel objects that can be consumed immediately.


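# Example 1 – Discover models for a single provider (e.g., OpenAI)

import asyncio

from open_notebook.ai.model_discovery import discover_provider_models


async def main() -> None:
    # Await the coroutine registered for "openai" and print each result
    models = await discover_provider_models("openai")
    for m in models:
        print(f"{m.provider}/{m.name} → {m.model_type}")


asyncio.run(main())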

Synchronizing Results to SurrealDB

sync_provider_models(provider, auto_register=True) (lines 264–311) pairs discovery with persistence. It first calls discover_provider_models(), then loads the provider's existing records from SurrealDB via repo_query (lines 288–298) up front rather than checking each model individually, which avoids N+1 lookups. New models are inserted as open_notebook.ai.models.Model records (lines 312–322). The function returns a tuple of (discovered, new, existing) counts, giving callers a clear delta of what changed.
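The de-duplication logic can be sketched without a database by treating the result of the bulk query as a set of names. plan_sync() is a hypothetical helper, keying records by name alone is an assumption, and so is the reading of the third count as "discovered models that were already stored".

```python
from typing import Iterable, List, Set, Tuple


def plan_sync(
    discovered: Iterable[str],
    existing: Set[str],
) -> Tuple[List[str], Tuple[int, int, int]]:
    """Return the names to insert plus (discovered, new, existing) counts."""
    unique = list(dict.fromkeys(discovered))  # drop duplicates, keep order
    to_insert = [name for name in unique if name not in existing]
    already_stored = len(unique) - len(to_insert)
    return to_insert, (len(unique), len(to_insert), already_stored)


# In the real module this set would come from one bulk query, not a literal.
stored = {"model-a", "model-b"}
to_insert, counts = plan_sync(["model-a", "model-c", "model-c"], stored)
print(to_insert, counts)
```

Because membership checks run against an in-memory set, the number of database reads stays constant no matter how many models a provider returns.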

Parallel Bulk Sync with `sync_all_providers()`

For administrative tasks or background jobs, sync_all_providers() (lines 322–357) launches sync_provider_models() concurrently for every entry in PROVIDER_DISCOVERY_FUNCTIONS. Results are gathered into a single dictionary keyed by provider name.


# Example 2 – Sync all providers and see the per-provider statistics

import asyncio

from open_notebook.ai.model_discovery import sync_all_providers


async def main() -> None:
    # Each value is a (discovered, new, existing) tuple
    results = await sync_all_providers()
    for provider, (disc, new, exist) in results.items():
        print(f"{provider}: discovered={disc}, new={new}, existing={exist}")


asyncio.run(main())
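Concurrent fan-out of this sort is commonly written with asyncio.gather. The sketch below is an illustration only: fake_sync() and sync_many() are hypothetical, and mapping a failed provider to (0, 0, 0) is an assumed error policy, since the article does not describe how sync_all_providers() handles failures.

```python
import asyncio
from typing import Dict, List, Tuple

SyncCounts = Tuple[int, int, int]


async def fake_sync(provider: str) -> SyncCounts:
    """Stand-in for sync_provider_models(); one provider fails on purpose."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    if provider == "broken":
        raise RuntimeError("simulated provider outage")
    return (3, 1, 2)


async def sync_many(providers: List[str]) -> Dict[str, SyncCounts]:
    """Run every sync concurrently and key the results by provider name."""
    outcomes = await asyncio.gather(
        *(fake_sync(provider) for provider in providers),
        return_exceptions=True,  # one failure should not cancel the others
    )
    results: Dict[str, SyncCounts] = {}
    for provider, outcome in zip(providers, outcomes):
        if isinstance(outcome, BaseException):
            results[provider] = (0, 0, 0)  # assumed fallback for failures
        else:
            results[provider] = outcome
    return results


bulk = asyncio.run(sync_many(["alpha", "broken"]))
print(bulk)
```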

Dashboard Aggregation with `get_provider_model_count()`

The get_provider_model_count(provider) utility (lines 359–389) aggregates registered models by their canonical type. This powers UI dashboards that need at-a-glance counts of language, embedding, speech-to-text, and text-to-speech availability.


# Example 3 – Get a summary of registered model counts for a provider

import asyncio

from open_notebook.ai.model_discovery import get_provider_model_count


async def main() -> None:
    counts = await get_provider_model_count("google")
    print(counts)   # e.g. {'language': 3, 'embedding': 1, 'speech_to_text': 0, 'text_to_speech': 0}


asyncio.run(main())
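A per-type tally like the one above could be produced along these lines. count_by_type() and CANONICAL_TYPES are hypothetical names, and the input is a plain list of type strings rather than records loaded from SurrealDB.

```python
from collections import Counter
from typing import Dict, Iterable

CANONICAL_TYPES = ("language", "embedding", "speech_to_text", "text_to_speech")


def count_by_type(model_types: Iterable[str]) -> Dict[str, int]:
    """Count models per canonical type, reporting zero for absent types."""
    tally = Counter(model_types)
    return {model_type: tally.get(model_type, 0) for model_type in CANONICAL_TYPES}


example_counts = count_by_type(["language", "language", "embedding", "language"])
print(example_counts)
```

Listing every canonical type, including those with a zero count, spares a dashboard from special-casing missing keys.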

Summary

classify_model_type() normalizes every model identifier into one of four canonical types using provider-specific pattern tables.
Each backend exposes a discover_<provider>_models() coroutine that either calls a live API via httpx or returns a curated static list.
PROVIDER_DISCOVERY_FUNCTIONS (lines 180–206) wires every provider to its discovery routine.
sync_provider_models() avoids duplicate database writes by querying SurrealDB with repo_query before inserting new Model records.
sync_all_providers() runs parallel discovery across every supported provider.
get_provider_model_count() provides type-aggregated statistics for dashboard consumption.

Frequently Asked Questions

How does `model_discovery.py` classify a discovered model?

The classify_model_type() function inspects the model identifier against provider-specific pattern tables such as OPENAI_MODEL_TYPES and GOOGLE_MODEL_TYPES (lines 56–90). It returns one of four canonical strings: language, embedding, speech_to_text, or text_to_speech.

What happens when a provider has no public model listing API?

For providers like Anthropic, ElevenLabs, and Voyage that lack a public endpoint, the module uses static hard-coded lists inside the respective discover_<provider>_models() functions (lines 132–149, 250–258, and 310–322). These lists still produce fully typed DiscoveredModel objects.

Where are discovered models stored in Open Notebook?

Discovered models are persisted as open_notebook.ai.models.Model records in SurrealDB. The sync_provider_models() routine (lines 264–311) first queries the existing catalogue via repo_query to avoid duplicates, then inserts only new entries.

Can I trigger model discovery for every provider at once?

Yes. The sync_all_providers() coroutine (lines 322–357) concurrently executes sync_provider_models() for every provider registered in PROVIDER_DISCOVERY_FUNCTIONS. It returns a dictionary mapping each provider to its (discovered, new, existing) tuple.
