# How to Use Embeddings for Building Efficient Search Applications

> Learn how to build efficient search applications using embeddings. Convert text to vectors, store them in an index, and find relevant results with cosine similarity.

- Repository: [Microsoft/generative-ai-for-beginners](https://github.com/microsoft/generative-ai-for-beginners)
- Tags: tutorial
- Published: 2026-02-26

---

**Build a semantic search application by converting text into high-dimensional vectors using OpenAI's embedding models, storing them in a vector index, and retrieving relevant results through cosine similarity calculations.**

The `microsoft/generative-ai-for-beginners` repository provides a complete reference implementation for using embeddings to search video transcripts. It demonstrates how to transform raw text data into a searchable semantic index that matches on meaning rather than on keywords alone.

## Three-Stage Architecture for Embedding-Based Search

The reference implementation in `08-building-search-applications` follows a clear three-stage pipeline that separates data preparation from query-time operations. This architecture ensures that computationally expensive embedding generation happens once during indexing, while queries remain fast and lightweight.

### Stage 1: Data Preparation and Embedding Generation

The first stage processes raw content (in this case, YouTube transcripts) and converts it into vector representations. The script [`transcript_enrich_embeddings.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/transcript_enrich_embeddings.py) handles the entire pipeline:

- Downloads transcripts and chunks them into ~3-minute overlapping segments
- Generates summaries of approximately 60 words for each segment
- Creates 1536-dimensional vectors using the `text-embedding-ada-002` model
- Stores results in [`embedding_index_3m.json`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/embedding_index_3m.json)
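
The chunking step in the list above can be sketched as follows. This is a minimal illustration, not the code from `transcript_enrich_embeddings.py`: the function name `chunk_transcript`, the `start` and `text` fields on each transcript line, and the 30-second overlap are all assumptions made for the example.

```python
from typing import Dict, List


def chunk_transcript(
    entries: List[Dict],
    window_seconds: int = 180,   # ~3-minute segments, as described above
    overlap_seconds: int = 30,   # assumed value; the source does not state the overlap
) -> List[Dict]:
    """Group timestamped transcript lines into overlapping time windows."""
    if not entries:
        return []
    if not 0 <= overlap_seconds < window_seconds:
        raise ValueError("overlap_seconds must be in [0, window_seconds)")

    step = window_seconds - overlap_seconds
    last_start = max(entry["start"] for entry in entries)
    segments = []
    window_start = 0
    while window_start <= last_start:
        window_end = window_start + window_seconds
        # Collect every line whose start time falls inside the current window.
        lines = [e["text"] for e in entries if window_start <= e["start"] < window_end]
        if lines:
            segments.append({"seconds": window_start, "text": " ".join(lines)})
        window_start += step
    return segments
```

Because consecutive windows share `overlap_seconds` of content, a sentence that straddles a boundary still appears whole in at least one segment, which is the usual reason for overlapping chunks.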

The core embedding function uses exponential backoff retry logic to handle API rate limits:

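```python
# Written against the legacy openai<1.0 SDK, which provides
# openai.InvalidRequestError and openai.embeddings_utils.get_embedding.
import openai
from openai.embeddings_utils import get_embedding
from tenacity import (retry, retry_if_not_exception_type,
                      stop_after_attempt, wait_random_exponential)

@retry(
    wait=wait_random_exponential(min=6, max=30),
    stop=stop_after_attempt(20),
    # An invalid request will never succeed, so it is not retried.
    retry=retry_if_not_exception_type(openai.InvalidRequestError),
)
def get_text_embedding(text: str):
    """Call OpenAI Embedding API and return the vector."""
    return get_embedding(text, engine="text-embedding-ada-002", timeout=60)
```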

Each segment's vector is stored under the key `"ada_v2"` in the output JSON, creating a persistent index that can be loaded for querying without regenerating embeddings.
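
A persistent index of this kind can be illustrated with a small JSON round trip. The sketch below assumes a flat list of segment records; the helper names `save_index` and `load_index` are hypothetical, and the exact schema of `embedding_index_3m.json` may differ. Only the field names `videoId`, `seconds`, `summary`, and `ada_v2` are taken from this article.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_index(segments: List[Dict], path: str) -> None:
    """Write segment records (metadata plus vector) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(segments, f)


def load_index(path: str) -> List[Dict]:
    """Read the segment records back without calling the embedding API."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Toy record with a 3-dimensional vector standing in for a 1536-dimensional one.
segments = [
    {"videoId": "abc123", "seconds": 0, "summary": "Intro to embeddings", "ada_v2": [0.1, 0.2, 0.3]}
]

with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "embedding_index_demo.json")
    save_index(segments, index_path)
    restored = load_index(index_path)
```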

### Stage 2: Index Storage and Vector Databases

While the tutorial uses a local Pandas DataFrame loaded from [`embedding_index_3m.json`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/embedding_index_3m.json), production deployments should migrate to a dedicated vector database. The [`README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/README.md) in the lesson directory explains that Azure Cognitive Search (since renamed Azure AI Search), Redis, Pinecone, or Weaviate provide the necessary infrastructure for scaling to millions of vectors with low-latency approximate nearest neighbor (ANN) lookups.

### Stage 3: Query Processing and Similarity Search

At query time, the application embeds the user's natural language question using the same `text-embedding-ada-002` model, then computes cosine similarity against all stored segment vectors. The notebook `aoai-solution.ipynb` implements this in the `get_videos()` function:

```python
import numpy as np
import pandas as pd

# `client`, `model`, and `SIMILARITIES_RESULTS_THRESHOLD` must be defined before this code runs.

def cosine_similarity(a, b):
    # Zero-pad the shorter vector so both have the same length, then compute the cosine similarity
    if len(a) > len(b):
        b = np.pad(b, (0, len(a) - len(b)), 'constant')
    elif len(b) > len(a):
        a = np.pad(a, (0, len(b) - len(a)), 'constant')
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(query: str, dataset: pd.DataFrame, rows: int) -> pd.DataFrame:
    video_vectors = dataset.copy()  # leave the caller's DataFrame untouched
    # 1️⃣ embed the query
    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding
    # 2️⃣ compute similarity for each stored segment
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))
    )
    # 3️⃣ filter & rank
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].sort_values(by="similarity", ascending=False).head(rows)
    return video_vectors
```

The function keeps only results at or above a similarity threshold (typically 0.75), sorts them by descending similarity, and returns at most `rows` matches. The helper `display_results()` then formats the output with direct YouTube links that jump to the exact timestamp:

```python
def display_results(videos: pd.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        return f"https://youtu.be/{video_id}?t={seconds}"
    print(f"\nVideos similar to '{query}':")
    for _, row in videos.iterrows():
        yt = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {yt}")
        print(f"   Similarity: {row['similarity']}")
        print(f"   Speakers: {row['speaker']}")
```
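
The filter-and-rank step and the timestamped link can be tried without an API key on pre-scored toy data. The sketch below is a dependency-free stand-in for the Pandas logic above; `filter_and_rank`, `youtube_url`, and the sample records are hypothetical names and values chosen for illustration.

```python
from typing import Dict, List


def filter_and_rank(scored: List[Dict], threshold: float = 0.75, rows: int = 3) -> List[Dict]:
    """Drop results below the threshold, then return the best `rows` of the rest."""
    kept = [item for item in scored if item["similarity"] >= threshold]
    kept.sort(key=lambda item: item["similarity"], reverse=True)
    return kept[:rows]


def youtube_url(video_id: str, seconds: int) -> str:
    """Build a link that starts playback at the segment's timestamp."""
    return f"https://youtu.be/{video_id}?t={seconds}"


# Made-up similarity scores; a real run would compute them from embeddings.
scored_segments = [
    {"videoId": "vid-a", "seconds": 180, "similarity": 0.81},
    {"videoId": "vid-b", "seconds": 0, "similarity": 0.62},
    {"videoId": "vid-c", "seconds": 360, "similarity": 0.90},
    {"videoId": "vid-d", "seconds": 540, "similarity": 0.75},
]

top = filter_and_rank(scored_segments, threshold=0.75, rows=5)
links = [youtube_url(item["videoId"], item["seconds"]) for item in top]
```

Here `rows=5` is requested but only three segments clear the threshold, so three are returned. The threshold bounds result quality and `rows` bounds result count, and a query can legitimately return fewer results than requested, or none.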

## Why Embeddings Make Search Efficient

Using embeddings for search applications provides three fundamental efficiency advantages over traditional keyword-based approaches:

**Semantic matching** – The 1536-dimensional vectors capture conceptual meaning rather than lexical overlap. A query for "Jupyter notebooks" will match content describing "interactive Python notebooks" even without keyword overlap, improving recall and relevance.

**Cheap similarity computation** – For unit-length vectors, cosine similarity reduces to a single dot product, so each comparison costs time proportional to the vector dimension. A brute-force scan across thousands of vectors is fast on a single CPU, and vector databases use approximate nearest neighbor (ANN) algorithms to keep latency low as the index grows to millions of vectors.
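
The dot-product shortcut can be checked on toy data. The sketch below normalizes the stored vectors once, at indexing time, and then scores a query with plain dot products. It is kept dependency-free for clarity, and the same idea becomes a single matrix-vector product in NumPy. All names and values are illustrative.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Reference implementation that divides by both norms on every call."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


stored = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
unit_index = [normalize(v) for v in stored]   # done once, at indexing time

query = [6.0, 8.0]
unit_query = normalize(query)                 # done once per query

# Query-time work is now one dot product per stored vector.
fast_scores = [dot(unit_query, v) for v in unit_index]
reference_scores = [cosine_similarity(query, v) for v in stored]
```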

**Low-latency architecture** – In the tutorial setup, the only network round-trip required at query time is the single embedding API call for the user's question, because all vector comparisons happen locally in memory. A hosted vector database adds one further request, but neither setup needs repeated full-text scans or expensive database joins.

## Production Considerations

When moving from the tutorial implementation to production workloads, consider these architectural improvements:

**Vector database migration** – Replace the Pandas DataFrame and JSON file ([`embedding_index_3m.json`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/embedding_index_3m.json)) with Azure Cognitive Search, Redis, Pinecone, or Weaviate. These services provide persistent storage, metadata filtering, and horizontal scaling for millions of vectors.

**Batch embedding generation** – The [`transcript_enrich_embeddings.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/transcript_enrich_embeddings.py) script uses 6 parallel threads to speed up initial vector creation. For millions of documents, implement queue-based processing with retry logic and checkpointing to handle API rate limits gracefully.
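
One way to combine parallel workers with checkpointing is sketched below. It is not the repository's implementation: `embed_with_checkpoint`, the `id` and `text` fields, the batch size, and the checkpoint file layout are assumptions, and `fake_embed` stands in for a real embedding call that would carry its own retry logic.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def embed_with_checkpoint(
    segments: List[Dict],
    embed_fn: Callable[[str], List[float]],
    checkpoint_path: str,
    max_workers: int = 6,
    batch_size: int = 2,
) -> Dict[str, List[float]]:
    """Embed segments in parallel batches, saving progress after each batch."""
    done: Dict[str, List[float]] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            done = json.load(f)

    # Skip anything a previous (possibly interrupted) run already embedded.
    pending = [s for s in segments if s["id"] not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i in range(0, len(pending), batch_size):
            batch = pending[i:i + batch_size]
            vectors = list(pool.map(lambda s: embed_fn(s["text"]), batch))
            for segment, vector in zip(batch, vectors):
                done[segment["id"]] = vector
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(done, f)
    return done


calls = []

def fake_embed(text: str) -> List[float]:
    calls.append(text)            # records how often the "API" is hit
    return [float(len(text))]

demo_segments = [
    {"id": "seg-1", "text": "hello"},
    {"id": "seg-2", "text": "vector search"},
    {"id": "seg-3", "text": "embeddings"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    first = embed_with_checkpoint(demo_segments, fake_embed, path)
    calls_after_first = len(calls)
    second = embed_with_checkpoint(demo_segments, fake_embed, path)
    calls_after_second = len(calls)
```

The second run finds every segment in the checkpoint and makes no further embedding calls, which is the property that makes a long indexing job safe to restart.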

**Cache query embeddings** – Store frequently requested query vectors in a cache (Redis or in-memory) to avoid redundant API calls for repeated searches, reducing latency and API costs.
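
An in-memory version of this cache can be built with `functools.lru_cache` from the standard library. In the sketch below, `make_cached_embedder` and `fake_embed` are hypothetical names, and `fake_embed` replaces the real embedding call. Collapsing whitespace before the lookup is a design choice, made so that trivially different spellings of the same query share one cache entry.

```python
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple


def make_cached_embedder(embed_fn: Callable[[str], Sequence[float]], maxsize: int = 1024):
    """Wrap an embedding function so repeated queries skip the API call."""

    @lru_cache(maxsize=maxsize)
    def _cached(normalized_query: str) -> Tuple[float, ...]:
        # Store an immutable tuple so callers cannot mutate the cached vector.
        return tuple(embed_fn(normalized_query))

    def embed_query(query: str) -> Tuple[float, ...]:
        return _cached(" ".join(query.split()))

    embed_query.cache_info = _cached.cache_info
    return embed_query


api_calls: List[str] = []

def fake_embed(text: str) -> List[float]:
    api_calls.append(text)        # records each simulated API call
    return [float(len(text)), 1.0]

embed_query = make_cached_embedder(fake_embed)
first = embed_query("what is RAG?")
second = embed_query("  what is   RAG? ")   # same query after whitespace cleanup
```

A shared cache such as Redis follows the same pattern, with the normalized query string as the key and the serialized vector as the value.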

**Security hardening** – Keep `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` in environment variables or Azure Key Vault. Never commit credentials to version control; the repository uses `.env` files for local development.
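
Reading the two settings from the environment and failing fast when one is missing might look like the sketch below. The function name `load_settings` is hypothetical, and the optional `env` parameter exists only to make the function testable without touching the real environment.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_SETTINGS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        # Report only the variable names, never their values.
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Calling `load_settings()` once at startup surfaces a misconfigured deployment immediately instead of on the first user query.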

## Summary

- **Embeddings convert text into high-dimensional vectors** that capture semantic meaning, enabling search systems to understand concepts rather than just match keywords.
- **The three-stage pipeline** involves: (1) chunking and embedding content with [`transcript_enrich_embeddings.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/transcript_enrich_embeddings.py), (2) storing vectors in [`embedding_index_3m.json`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/embedding_index_3m.json) or a vector database, and (3) querying via cosine similarity in `aoai-solution.ipynb`.
- **Cosine similarity provides efficient ranking** by computing dot products between the query vector and stored segment vectors, filtering by a threshold (typically 0.75) to ensure relevance.
- **Production scaling requires vector databases** like Azure Cognitive Search or Pinecone to handle millions of vectors with low-latency approximate nearest neighbor lookups.

## Frequently Asked Questions

### What is the difference between keyword search and embedding-based semantic search?

Keyword search relies on lexical matching and inverted indexes to find documents containing specific terms. Embedding-based semantic search converts both queries and documents into high-dimensional vectors that capture conceptual meaning, allowing the system to return relevant results even when they share no common keywords. For example, a query for "cloud cost optimization" will match content discussing "reducing AWS bills" through semantic similarity rather than keyword overlap.
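
The keyword side of that comparison is easy to verify. The sketch below uses a deliberately naive term-overlap check (the function `keyword_overlap` is illustrative, not a real search engine) to show that the two phrases from the example share no terms, so a purely lexical match has nothing to work with.

```python
def keyword_overlap(query: str, document: str) -> set:
    """Return the lowercase terms that appear in both strings."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    return query_terms & document_terms


no_overlap = keyword_overlap("cloud cost optimization", "reducing AWS bills")
some_overlap = keyword_overlap("cloud cost optimization", "optimization tips for cloud storage")
```

The first call returns an empty set, while the second finds shared terms even though the document is about storage rather than cost. Lexical overlap and topical relevance are separate signals, which is the gap embedding-based search addresses.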

### Why does the tutorial use cosine similarity specifically for comparing embeddings?

Cosine similarity measures the cosine of the angle between two vectors, comparing their orientation rather than their magnitude. This suits text embeddings because the score does not depend on vector length: a short query and a long transcript segment can still score highly if their vectors point in the same semantic direction. The implementation in `aoai-solution.ipynb` zero-pads the shorter vector if the two dimensions differ, then divides the dot product by the product of the vector magnitudes. Because every vector in this index comes from the same model, the dimensions should already match, and the padding acts only as a safeguard.
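
The independence from magnitude can be confirmed numerically. The sketch below uses small made-up vectors and a plain-Python cosine function: scaling one vector by 10 leaves the score unchanged, and a vector pointing in the opposite direction scores -1.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 0.5]
scaled_b = [10.0 * y for y in b]            # same direction, 10x the magnitude

base = cosine_similarity(a, b)
scaled = cosine_similarity(a, scaled_b)     # equal to `base` up to rounding
opposite = cosine_similarity(a, [-x for x in a])
```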

### How do I scale this embedding search application to handle millions of documents?

For production workloads with millions of vectors, migrate from the Pandas DataFrame approach to a dedicated vector database such as Azure Cognitive Search, Pinecone, Weaviate, or Redis. These systems implement approximate nearest neighbor (ANN) algorithms such as HNSW or IVF, which avoid comparing the query against every stored vector. Search time therefore grows much more slowly than the linear cost of a brute-force scan, at the price of occasionally missing a true nearest neighbor. Actual latency depends on the index type, its parameters, and the hardware, so measure it on your own data. Additionally, implement batch processing with parallel threads (as shown in [`transcript_enrich_embeddings.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/transcript_enrich_embeddings.py)) for initial indexing, and cache frequent query embeddings to reduce API costs.

### What security measures should I implement when deploying this search application?

Never hardcode API credentials in your source code. Store `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` in environment variables using a `.env` file for local development (excluded from version control via `.gitignore`), and use Azure Key Vault or a similar secret management service in production. Implement retry logic with exponential backoff (as demonstrated in the `get_text_embedding()` function) to handle transient API failures, and return generic error messages to end users rather than raw API errors. Additionally, validate user input before sending it to the embedding API, for example by rejecting empty or oversized queries, to limit abuse and unnecessary API costs.
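
Input validation before the embedding call might look like the sketch below. The function name `clean_query` and the 500-character limit are assumptions chosen for illustration, so pick a limit that fits your own queries and the model's token budget.

```python
MAX_QUERY_CHARS = 500  # hypothetical limit, not taken from the tutorial


def clean_query(raw: str) -> str:
    """Normalize a user query and reject input that should not reach the API."""
    if not isinstance(raw, str):
        raise TypeError("query must be a string")
    # Drop control characters, keep ordinary whitespace for the next step.
    cleaned = "".join(ch for ch in raw if ch.isprintable() or ch.isspace())
    # Collapse runs of whitespace and trim the ends.
    cleaned = " ".join(cleaned.split())
    if not cleaned:
        raise ValueError("query is empty")
    if len(cleaned) > MAX_QUERY_CHARS:
        raise ValueError(f"query exceeds {MAX_QUERY_CHARS} characters")
    return cleaned
```

Rejecting bad input here means an empty or oversized query never costs an API call.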