How an ONNX-Based Approach Powers Headline Memory RAG Semantic Search in IndexedDB

World Monitor implements a retrieval-augmented generation (RAG) "headline memory" that embeds headlines with ONNX Runtime Web and the MiniLM model, stores the vectors in the browser's IndexedDB, and searches them by cosine similarity, all on the client.

The koala73/worldmonitor repository demonstrates how to build an offline-first semantic search engine for news headlines. By leveraging an ONNX-based approach, the application converts text into high-dimensional vectors using the lightweight MiniLM model, stores them in IndexedDB, and executes cosine similarity searches to retrieve relevant headlines for RAG contexts.

ONNX Runtime Web and MiniLM: The Embedding Engine

The semantic search pipeline uses the @xenova/transformers library, which runs on ONNX Runtime Web, to execute the Xenova/all-MiniLM-L6-v2 model entirely within the browser. Inference therefore needs no round trip to a server, and each headline is mapped to a 384-dimensional dense embedding.

Model Configuration in ml-config.ts

The embedding model is declared centrally in src/config/ml-config.ts to ensure consistent vector dimensions across the application:

// src/config/ml-config.ts
export const ML_CONFIG = {
  embeddings: {
    model: 'Xenova/all-MiniLM-L6-v2',
    dimensions: 384,
    dtype: 'fp32',
  },
  // Additional configuration...
};

This configuration specifies the All-MiniLM-L6-v2 ONNX model, which generates 384-dimensional float vectors optimized for semantic similarity tasks.

Web Worker Inference Pipeline

To prevent blocking the main thread during model loading and inference, the application delegates embedding generation to a dedicated ML Web Worker defined in src/workers/ml.worker.ts. The worker initializes the pipeline using the feature-extraction task:

// src/workers/ml.worker.ts
import { pipeline } from '@xenova/transformers';

let embedder;

async function initializeEmbedder() {
  embedder = await pipeline(
    'feature-extraction',
    'Xenova/all-MiniLM-L6-v2',
    { dtype: 'fp32' }
  );
}

When processing headlines, the worker generates normalized embeddings:

// Inside ml.worker.ts message handler
const output = await embedder(text, { pooling: 'mean', normalize: true });
const vector = output.data; // Float32Array of length 384

These vectors are then passed to the vector database layer for persistent storage.
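To make the pooling and normalize options more concrete, here is a minimal sketch of what mean pooling followed by L2 normalization computes. The function name meanPoolAndNormalize is hypothetical and is not taken from the repository. It also simplifies the real pipeline, which additionally uses the attention mask to ignore padding tokens.

```typescript
// Hypothetical helper for illustration only; the library performs this
// step internally when called with { pooling: 'mean', normalize: true }.
function meanPoolAndNormalize(tokenVectors: number[][]): Float32Array {
  if (tokenVectors.length === 0) {
    throw new Error('expected at least one token vector');
  }
  const dims = tokenVectors[0].length;
  const pooled = new Float32Array(dims);

  // Mean pooling: average each dimension across all token vectors.
  for (const token of tokenVectors) {
    if (token.length !== dims) {
      throw new Error('token vectors must share one dimension');
    }
    for (let i = 0; i < dims; i++) pooled[i] += token[i];
  }
  for (let i = 0; i < dims; i++) pooled[i] /= tokenVectors.length;

  // L2 normalization: scale the pooled vector to unit length.
  let sumSquares = 0;
  for (let i = 0; i < dims; i++) sumSquares += pooled[i] * pooled[i];
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (let i = 0; i < dims; i++) pooled[i] /= norm;
  }
  return pooled;
}
```

Because the result has unit length, later similarity comparisons depend only on direction, not on how long the headline was.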

IndexedDB Vector Storage Architecture

The ONNX-based approach requires a client-side storage solution capable of handling high-dimensional float arrays. The implementation uses IndexedDB via src/workers/vector-db.ts to create a durable, offline-first vector store named worldmonitor_vector_store.

Storing Headline Embeddings

The storeVectors function in src/workers/vector-db.ts persists both the raw headline metadata and the 384-dimensional MiniLM vectors:

// src/workers/vector-db.ts
export async function storeVectors(items: VectorItem[]): Promise<void> {
  const db = await openDB('worldmonitor_vector_store', 1);
  const tx = db.transaction('vectors', 'readwrite');
  const store = tx.objectStore('vectors');
  
  for (const item of items) {
    await store.put({
      id: item.id,
      text: item.text,
      vector: item.vector, // Float32Array(384) from MiniLM
      metadata: item.metadata,
      timestamp: Date.now(),
    });
  }
  
  await tx.done;
}

Each record keeps the headline text and metadata next to its embedding, so a search hit can be returned with its original headline without a second lookup.
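The record shape written by storeVectors can be sketched as a pair of types plus a small guard. The interface definitions below are assumptions inferred from the fields shown above (the article does not show the definition of VectorItem), and toVectorRecord is a hypothetical helper, not a function from the repository. It illustrates one way to reject vectors whose length does not match the configured 384 dimensions before they are persisted.

```typescript
// Assumed shapes, inferred from the fields used in storeVectors.
interface VectorItem {
  id: string;
  text: string;
  vector: Float32Array;
  metadata: Record<string, unknown>;
}

interface VectorRecord extends VectorItem {
  timestamp: number;
}

// Mirrors ML_CONFIG.embeddings.dimensions from the configuration section.
const EXPECTED_DIMENSIONS = 384;

// Hypothetical guard: validate an item and attach the storage timestamp.
function toVectorRecord(item: VectorItem, now: number = Date.now()): VectorRecord {
  if (item.vector.length !== EXPECTED_DIMENSIONS) {
    throw new RangeError(
      `expected ${EXPECTED_DIMENSIONS} dimensions, got ${item.vector.length}`
    );
  }
  for (let i = 0; i < item.vector.length; i++) {
    if (!Number.isFinite(item.vector[i])) {
      throw new RangeError(`non-finite value at index ${i}`);
    }
  }
  return { ...item, timestamp: now };
}
```

A check like this fails fast at write time, which tends to be easier to debug than a dimension mismatch that only surfaces during a later similarity search.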

Cosine Similarity Search Implementation

Semantic retrieval is implemented through the searchVectors function, which computes cosine similarity between the query vector and all stored vectors:

// src/workers/vector-db.ts
export async function searchVectors(
  queryVector: Float32Array,
  topK: number = 5,
  minScore: number = 0.6
): Promise<SearchResult[]> {
  const db = await openDB('worldmonitor_vector_store', 1);
  const tx = db.transaction('vectors', 'readonly');
  const store = tx.objectStore('vectors');
  const allVectors = await store.getAll();
  
  // Calculate cosine similarity
  const results = allVectors.map(item => {
    const similarity = cosineSimilarity(queryVector, item.vector);
    return { ...item, score: similarity };
  });
  
  // Filter and sort
  return results
    .filter(r => r.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(1, Math.min(topK, 20))); // Clamp topK to the range 1-20
}

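Because ranking depends on embedding similarity rather than shared keywords, related headlines can score highly even when their wording differs. Note that searchVectors loads every stored record with getAll() and scores each one in a linear scan, so query cost grows with the number of stored headlines.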
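The searchVectors snippet calls a cosineSimilarity helper that is not shown. A minimal sketch of such a helper follows; it is an illustration of the standard formula, not the repository's actual implementation, and returning 0 for a zero-length vector is a choice made here.

```typescript
// Illustrative implementation of cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Assumption: treat a zero vector as "no similarity" instead of dividing by zero.
  return denominator === 0 ? 0 : dot / denominator;
}
```

When both vectors are already unit length, as they are when embeddings are produced with normalize: true, the denominator is 1 and the score reduces to the dot product.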

Integrating the ONNX-based approach into the application layer requires coordinating the ML worker with the vector database through typed message passing.

Ingesting Headlines into Memory

To populate the headline memory, the application sends a vector-store-ingest message to the worker:

// src/services/ml-worker.ts (service wrapper)
async function ingestHeadlines(headlines: Headline[]) {
  const items = headlines.map(h => ({
    text: h.title,
    pubDate: h.pubDate,
    source: h.source,
    url: h.link,
    tags: h.tags,
  }));
  
  mlWorker.postMessage({
    type: 'vector-store-ingest',
    id: crypto.randomUUID(),
    items,
  });
}

The worker sanitizes each headline, embeds it with MiniLM via the ONNX pipeline, and persists the resulting Float32Array vector in IndexedDB.
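The article does not show what the sanitization step does. As a rough sketch under stated assumptions, a sanitizer might replace control characters, collapse whitespace, and cap the length before embedding. The function name sanitizeHeadline, the specific rules, and the 512-character cap are all hypothetical.

```typescript
// Assumed cap; the real worker may use a different limit or none at all.
const MAX_HEADLINE_CHARS = 512;

// Hypothetical sanitizer: the actual rules in the worker are not shown here.
function sanitizeHeadline(raw: string): string {
  return raw
    .replace(/[\u0000-\u001f\u007f]/g, ' ') // control characters become spaces
    .replace(/\s+/g, ' ')                   // collapse runs of whitespace
    .trim()
    .slice(0, MAX_HEADLINE_CHARS)           // bound the text sent to the model
    .trim();                                // drop a trailing space left by the cut
}

// Hypothetical batch step: sanitize and drop headlines that end up empty.
function sanitizeAll(texts: string[]): string[] {
  return texts.map(sanitizeHeadline).filter((t) => t.length > 0);
}
```

Dropping empty strings before embedding avoids storing vectors that carry no headline content.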

Querying the Headline Memory

For RAG retrieval, the application posts a vector-store-search message:

// Querying for relevant headlines
mlWorker.postMessage({
  type: 'vector-store-search',
  id: crypto.randomUUID(),
  queries: ['US-China trade tensions', 'inflation outlook'],
  topK: 5,
  minScore: 0.6, // Cosine similarity threshold
});

The worker embeds the queries using the same ONNX MiniLM model, executes cosine similarity against the stored vectors, and returns the top-K results.
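The two request messages shown so far can be modeled as a discriminated union on the type field, which lets the TypeScript compiler check that every message kind is handled. The interface names and optional fields below are assumptions based on the fields that appear in this article; the repository's own type names may differ, and describeRequest is a hypothetical function used only to demonstrate the narrowing.

```typescript
// Assumed message shapes, inferred from the postMessage calls in this article.
interface IngestItem {
  text: string;
  pubDate?: string;
  source?: string;
  url?: string;
  tags?: string[];
}

interface VectorStoreIngestMessage {
  type: 'vector-store-ingest';
  id: string;
  items: IngestItem[];
}

interface VectorStoreSearchMessage {
  type: 'vector-store-search';
  id: string;
  queries: string[];
  topK?: number;
  minScore?: number;
}

type WorkerRequest = VectorStoreIngestMessage | VectorStoreSearchMessage;

// Hypothetical dispatcher: the switch narrows `msg` to one concrete shape.
function describeRequest(msg: WorkerRequest): string {
  switch (msg.type) {
    case 'vector-store-ingest':
      return `ingest ${msg.items.length} item(s)`;
    case 'vector-store-search':
      return `search ${msg.queries.length} query(ies), topK=${msg.topK ?? 5}`;
    default: {
      // Compile-time exhaustiveness check: fails if a new type is added.
      const unreachable: never = msg;
      throw new Error(`unhandled message: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The never assignment in the default branch turns a forgotten message type into a compile error instead of a silent runtime no-op.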

Handling Search Results

The main thread receives results through the worker's message handler:

mlWorker.onmessage = (e) => {
  const data = e.data;
  if (data.type === 'vector-store-search-result') {
    console.log('Retrieved headlines for RAG context:', data.results);
    // Inject into LLM prompt...
  }
};

This architecture ensures that the ONNX-based semantic search runs entirely client-side, providing offline-capable RAG functionality without external API dependencies.
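Each request above carries an id generated with crypto.randomUUID(), which suggests replies can be matched to the request that caused them. One way to do that is a small registry of pending promises keyed by id. The PendingRequests class, the WorkerReply shape, and its error field are hypothetical and are not taken from the repository.

```typescript
// Assumed reply shape; only `type` and `results` appear in the article.
interface WorkerReply {
  id: string;
  type: string;
  results?: unknown;
  error?: string; // hypothetical field for failed requests
}

interface Settlers {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

// Hypothetical helper that pairs worker replies with in-flight requests.
class PendingRequests {
  private readonly pending = new Map<string, Settlers>();

  // Register a request id; the promise settles when its reply arrives.
  track(id: string): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
    });
  }

  // Intended to be called from worker.onmessage. Returns false for unknown ids.
  settle(reply: WorkerReply): boolean {
    const entry = this.pending.get(reply.id);
    if (!entry) return false;
    this.pending.delete(reply.id);
    if (reply.error) entry.reject(new Error(reply.error));
    else entry.resolve(reply.results);
    return true;
  }

  get size(): number {
    return this.pending.size;
  }
}
```

With a registry like this, the service wrapper could expose an async search function that posts the message, awaits track(id), and hands the resolved results to the prompt-building step.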

Summary

The ONNX-based approach powering World Monitor's headline memory RAG semantic search combines several key technologies:

  • ONNX Runtime Web via @xenova/transformers enables in-browser execution of the Xenova/all-MiniLM-L6-v2 model without server infrastructure.
  • 384-dimensional embeddings generated by the MiniLM pipeline capture semantic meaning for accurate similarity matching.
  • IndexedDB vector storage in worldmonitor_vector_store persists headlines and vectors for offline-first retrieval.
  • Cosine similarity search implemented in src/workers/vector-db.ts ranks results by semantic relevance rather than keyword matching.
  • Web Worker architecture isolates ML inference from the main thread, maintaining UI responsiveness during embedding generation.

Frequently Asked Questions

What ONNX model does World Monitor use for headline embeddings?

World Monitor uses the Xenova/all-MiniLM-L6-v2 model, an ONNX conversion of the sentence-transformers all-MiniLM-L6-v2 model. According to the configuration in src/config/ml-config.ts, it generates 384-dimensional dense vectors and is requested with dtype 'fp32'. The model is optimized for semantic similarity tasks and is small enough for browser-based inference via ONNX Runtime Web.

How does the application prevent UI freezing during embedding generation?

The application delegates all ONNX inference to a dedicated ML Web Worker defined in src/workers/ml.worker.ts. By running the @xenova/transformers pipeline inside a worker thread, the main UI thread remains responsive while the MiniLM model loads and generates embeddings. The worker communicates results back to the main thread via asynchronous message passing.

Why does the search use cosine similarity?

Cosine similarity measures the angle between two vectors in the 384-dimensional embedding space, so it captures semantic relatedness independently of vector magnitude. Headlines that discuss similar topics can therefore receive high scores even when they share no exact keywords. The implementation in src/workers/vector-db.ts filters results using a configurable minScore threshold (default 0.6) before returning the top-K matches.

Can the headline memory function completely offline?

Yes, provided the model files are already available locally (for example, cached by the browser after a first load) and headlines have already been ingested. The MiniLM model runs locally via ONNX Runtime Web, and the resulting embeddings are stored in IndexedDB (worldmonitor_vector_store). From that point, semantic search queries execute against the local IndexedDB store without network connectivity or external API calls.
