How to Use Embeddings for Building Efficient Search Applications
Build a semantic search application by converting text into high-dimensional vectors using OpenAI's embedding models, storing them in a vector index, and retrieving relevant results through cosine similarity calculations.
The microsoft/generative-ai-for-beginners repository provides a complete worked example of using embeddings to search video transcripts. This implementation demonstrates how to transform raw text data into a searchable semantic index that understands meaning rather than just matching keywords.
Three-Stage Architecture for Embedding-Based Search
The reference implementation in 08-building-search-applications follows a clear three-stage pipeline that separates data preparation from query-time operations. This architecture ensures that computationally expensive embedding generation happens once during indexing, while queries remain fast and lightweight.
Stage 1: Data Preparation and Embedding Generation
The first stage processes raw content—in this case, YouTube transcripts—and converts them into vector representations. The script transcript_enrich_embeddings.py handles the entire pipeline:
- Downloads transcripts and chunks them into ~3-minute overlapping segments
- Generates summaries of approximately 60 words for each segment
- Creates 1536-dimensional vectors using the text-embedding-ada-002 model
- Stores results in embedding_index_3m.json
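The chunking step in the list above can be sketched as follows. This is a minimal illustration rather than the code in transcript_enrich_embeddings.py: the `Caption` tuple shape, the `chunk_transcript` name, and the 180-second window with a 30-second overlap are assumptions chosen to mirror the "~3-minute overlapping segments" description.

```python
from typing import List, Tuple

# Hypothetical input shape: (start_time_in_seconds, caption_text).
Caption = Tuple[float, str]


def chunk_transcript(
    captions: List[Caption],
    window_s: float = 180.0,  # assumed ~3-minute window
    overlap_s: float = 30.0,  # assumed overlap between neighbouring windows
) -> List[dict]:
    """Group timed captions into overlapping fixed-length segments."""
    if overlap_s >= window_s:
        raise ValueError("overlap_s must be smaller than window_s")
    if not captions:
        return []

    step = window_s - overlap_s
    last_start = max(start for start, _ in captions)
    segments = []
    window_start = 0.0
    while window_start <= last_start:
        window_end = window_start + window_s
        # Collect every caption that starts inside the current window.
        text = " ".join(
            t for start, t in captions if window_start <= start < window_end
        )
        if text:
            segments.append({"seconds": int(window_start), "text": text})
        window_start += step
    return segments


captions = [(0.0, "a"), (100.0, "b"), (160.0, "c"), (200.0, "d")]
segments = chunk_transcript(captions)
```

With these toy captions, the caption at 160 seconds falls into both the window starting at 0 and the window starting at 150, which is the point of the overlap: a sentence that straddles a boundary still appears whole in at least one segment.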
The core embedding function uses exponential backoff retry logic to handle API rate limits:
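# Imports assumed from the legacy (pre-1.0) openai SDK and the tenacity library.
import openai
from openai.embeddings_utils import get_embedding
from tenacity import retry, retry_if_not_exception_type, stop_after_attempt, wait_random_exponential

@retry(
    wait=wait_random_exponential(min=6, max=30),  # randomized exponential backoff, capped at 30 s
    stop=stop_after_attempt(20),  # give up after 20 attempts
    retry=retry_if_not_exception_type(openai.InvalidRequestError),  # an invalid request will not succeed on retry
)
def get_text_embedding(text: str):
    """Call OpenAI Embedding API and return the vector."""
    embedding = get_embedding(text, engine="text-embedding-ada-002", timeout=60)
    return embedding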
Each segment's vector is stored under the key "ada_v2" in the output JSON, creating a persistent index that can be loaded for querying without regenerating embeddings.
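A persisted index of this kind might be written and reloaded as in the sketch below. The field names (videoId, seconds, title, summary, speaker, ada_v2) are the ones the lesson's code reads; the sample values, the three-element vector (real ones have 1536 entries), the demo file name, and the helpers `save_index` and `load_index` are made up for illustration.

```python
import json
import os
import tempfile
from typing import List


def save_index(segments: List[dict], path: str) -> None:
    """Write the list of segment records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(segments, f)


def load_index(path: str, dim: int) -> List[dict]:
    """Read the index back and check every vector has the expected length."""
    with open(path, encoding="utf-8") as f:
        segments = json.load(f)
    for seg in segments:
        if len(seg["ada_v2"]) != dim:
            raise ValueError(f"segment at {seg['seconds']}s has a bad vector")
    return segments


# Made-up sample record; a real vector would hold 1536 floats.
sample = [{
    "videoId": "abc123",
    "seconds": 180,
    "title": "Example title",
    "summary": "Example summary text.",
    "speaker": "Example Speaker",
    "ada_v2": [0.1, -0.2, 0.3],
}]

path = os.path.join(tempfile.mkdtemp(), "embedding_index_demo.json")
save_index(sample, path)
restored = load_index(path, dim=3)
```

Checking the vector length at load time is a cheap guard against mixing vectors from different embedding models in one index.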
Stage 2: Index Storage and Vector Databases
While the tutorial uses a local Pandas DataFrame loaded from embedding_index_3m.json, production deployments should migrate to a dedicated vector database. The README.md in the lesson directory explains that Azure Cognitive Search, Redis, Pinecone, or Weaviate provide the necessary infrastructure for scaling to millions of vectors with low-latency approximate nearest neighbor (ANN) lookups.
Stage 3: Query Processing and Similarity Search
At query time, the application embeds the user's natural language question using the same text-embedding-ada-002 model, then computes cosine similarity against all stored segment vectors. The notebook aoai-solution.ipynb implements this in the get_videos() function:
import numpy as np
import pandas as pd

# `client`, `model`, and SIMILARITIES_RESULTS_THRESHOLD are assumed to be defined elsewhere in the notebook.

def cosine_similarity(a, b):
    # Pad the shorter vector with zeros, then compute the cosine similarity
    if len(a) > len(b):
        b = np.pad(b, (0, len(a) - len(b)), 'constant')
    elif len(b) > len(a):
        a = np.pad(a, (0, len(b) - len(a)), 'constant')
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(query: str, dataset: pd.DataFrame, rows: int) -> pd.DataFrame:
    video_vectors = dataset.copy()
    # 1️⃣ embed the query
    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding
    # 2️⃣ compute similarity for each stored segment
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))
    )
    # 3️⃣ filter & rank
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].sort_values(by="similarity", ascending=False).head(rows)
    return video_vectors
The function filters results by a similarity threshold (typically 0.75), sorts by relevance, and returns the top-k matches. The helper display_results() then formats the output with direct YouTube links that jump to the exact timestamp:
def display_results(videos: pd.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    for _, row in videos.iterrows():
        yt = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f" Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f" YouTube: {yt}")
        print(f" Similarity: {row['similarity']}")
        print(f" Speakers: {row['speaker']}")
Why Embeddings Make Search Efficient
Using embeddings for search applications provides three fundamental efficiency advantages over traditional keyword-based approaches:
Semantic matching – The 1536-dimensional vectors capture conceptual meaning rather than lexical overlap. A query for "Jupyter notebooks" will match content describing "interactive Python notebooks" even without keyword overlap, improving recall and relevance.
Cheap similarity computation – For normalized vectors, cosine similarity reduces to a simple dot product, so scoring a query against thousands of stored vectors is a fast linear scan on a single CPU. Vector databases use approximate nearest neighbor (ANN) algorithms to avoid scanning every vector, which keeps latency low as the index grows to millions of vectors.
Low latency architecture – The only network round-trip required at query time is the single embedding API call for the user's question. All vector comparison operations happen locally in memory or within the vector database, eliminating the need for repeated full-text scans or expensive database joins.
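The dot-product view of cosine similarity described above can be sketched with NumPy. This is an illustration under stated assumptions, not the notebook's code: it stacks the stored vectors into one matrix, normalizes each row once, and then scores a query with a single matrix-vector product. The names `normalize_rows` and `top_k` and the toy 3-dimensional vectors are hypothetical.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)  # guard against zero vectors


def top_k(query: np.ndarray, unit_index: np.ndarray, k: int, threshold: float):
    """Return (row, score) pairs at or above the threshold, best first."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = unit_index @ q  # one pass over all stored vectors
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]


# Toy index: rows 0 and 1 point in nearly the same direction, row 2 does not.
index = normalize_rows(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]))
hits = top_k(np.array([2.0, 0.0, 0.0]), index, k=2, threshold=0.75)
```

The scan is still linear in the number of stored vectors; it is fast because the per-vector work is a single dot product that NumPy executes in compiled code.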
Production Considerations
When moving from the tutorial implementation to production workloads, consider these architectural improvements:
Vector database migration – Replace the Pandas DataFrame and JSON file (embedding_index_3m.json) with Azure Cognitive Search, Redis, Pinecone, or Weaviate. These services provide persistent storage, metadata filtering, and horizontal scaling for millions of vectors.
Batch embedding generation – The transcript_enrich_embeddings.py script uses 6 parallel threads to speed up initial vector creation. For millions of documents, implement queue-based processing with retry logic and checkpointing to handle API rate limits gracefully.
Cache query embeddings – Store frequently requested query vectors in a cache (Redis or in-memory) to avoid redundant API calls for repeated searches, reducing latency and API costs.
Security hardening – Keep AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT in environment variables or Azure Key Vault. Never commit credentials to version control; the repository uses .env files for local development.
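The query-embedding cache suggested above might look like the following in-memory sketch. `fake_embed` is a stand-in for the real embedding API call, and `normalize_query`, `embed_query`, and the cache size of 1024 are assumptions made for illustration. Lowercasing the query before embedding is also a design choice: it raises the hit rate but means the cached vector is for the lowercased text.

```python
from functools import lru_cache
from typing import Tuple

CALLS = {"count": 0}  # counts how often the stand-in API is hit


def fake_embed(text: str) -> Tuple[float, ...]:
    """Stand-in for the real embedding API call; returns a toy vector."""
    CALLS["count"] += 1
    return (float(len(text)), float(text.count(" ")))


def normalize_query(text: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(text.lower().split())


@lru_cache(maxsize=1024)
def _cached_embed(normalized: str) -> Tuple[float, ...]:
    return fake_embed(normalized)


def embed_query(text: str) -> Tuple[float, ...]:
    """Return the embedding for a query, reusing a cached result when possible."""
    return _cached_embed(normalize_query(text))


first = embed_query("Jupyter  notebooks")
second = embed_query("jupyter notebooks")  # served from the cache
```

Returning a tuple rather than a list keeps callers from mutating the cached value. A shared cache such as Redis would follow the same shape, with the normalized query as the key.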
Summary
- Embeddings convert text into high-dimensional vectors that capture semantic meaning, enabling search systems to understand concepts rather than just match keywords.
- The three-stage pipeline involves: (1) chunking and embedding content with transcript_enrich_embeddings.py, (2) storing vectors in embedding_index_3m.json or a vector database, and (3) querying via cosine similarity in aoai-solution.ipynb.
- Cosine similarity provides efficient ranking by computing dot products between the query vector and stored segment vectors, filtering by a threshold (typically 0.75) to ensure relevance.
- Production scaling requires vector databases like Azure Cognitive Search or Pinecone to handle millions of vectors with low-latency approximate nearest neighbor lookups.
Frequently Asked Questions
What is the difference between keyword search and embedding-based semantic search?
Keyword search relies on lexical matching and inverted indexes to find documents containing specific terms. Embedding-based semantic search converts both queries and documents into high-dimensional vectors that capture conceptual meaning, allowing the system to return relevant results even when they share no common keywords. For example, a query for "cloud cost optimization" will match content discussing "reducing AWS bills" through semantic similarity rather than keyword overlap.
Why does the tutorial use cosine similarity specifically for comparing embeddings?
Cosine similarity measures the cosine of the angle between two vectors, effectively comparing their orientation rather than magnitude. This is ideal for text embeddings because it normalizes for document length—short queries and long documents can still achieve high similarity if they point in the same semantic direction. The implementation in aoai-solution.ipynb pads shorter vectors to match dimensions before computing the dot product normalized by vector magnitudes, ensuring consistent comparisons across the dataset.
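A small worked example makes the orientation-versus-magnitude point concrete. The vectors are toy values rather than real embeddings, and this version raises an error on mismatched lengths instead of padding, which is a deliberate simplification for the sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short = [1.0, 2.0, 3.0]
long_ = [10.0, 20.0, 30.0]     # same direction, ten times the magnitude
orthogonal = [3.0, 0.0, -1.0]  # dot product with `short` is zero

same_direction = cosine_similarity(short, long_)
unrelated = cosine_similarity(short, orthogonal)
```

Here `same_direction` comes out at 1 (up to floating-point rounding) even though one vector is ten times longer, while `unrelated` is 0 because the two vectors are perpendicular.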
How do I scale this embedding search application to handle millions of documents?
For production workloads with millions of vectors, migrate from the Pandas DataFrame approach to a dedicated vector database such as Azure Cognitive Search, Pinecone, Weaviate, or Redis. These systems implement approximate nearest neighbor (ANN) algorithms like HNSW or IVF that avoid comparing the query against every stored vector: graph-based indexes such as HNSW scale roughly logarithmically with index size, while IVF searches only a few clusters. The trade-off is that an approximate index can occasionally miss a true nearest neighbor. Additionally, implement batch processing with parallel threads (as shown in transcript_enrich_embeddings.py) for initial indexing, and cache frequent query embeddings to reduce API costs.
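Batch indexing with parallel threads and a checkpoint could be sketched as follows. This is not the logic of transcript_enrich_embeddings.py: `embed_all`, the JSON checkpoint file, and `fake_embed` (a stand-in for the real API call) are hypothetical, and the default of 6 workers simply mirrors the thread count mentioned for the lesson's script.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def embed_all(
    texts: Dict[str, str],
    embed: Callable[[str], List[float]],
    checkpoint_path: str,
    workers: int = 6,
) -> Dict[str, List[float]]:
    """Embed every text that is not already in the checkpoint file."""
    done: Dict[str, List[float]] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            done = json.load(f)

    pending = [key for key in texts if key not in done]
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            vectors = pool.map(lambda key: embed(texts[key]), pending)
            for key, vector in zip(pending, vectors):
                done[key] = vector
    finally:
        # Save whatever finished, even if one call raised, so a rerun resumes.
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(done, f)
    return done


calls = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for the real embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text))]


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
texts = {"seg-0": "hello", "seg-1": "embeddings"}
first_run = embed_all(texts, fake_embed, path)
second_run = embed_all(texts, fake_embed, path)  # nothing left to embed
```

The second run makes no embedding calls because every key is already in the checkpoint. For very large corpora, writing the checkpoint periodically rather than once at the end would bound the work lost to a crash.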
What security measures should I implement when deploying this search application?
Never hardcode API credentials in your source code. Store AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT in environment variables using a .env file for local development (excluded from version control via .gitignore), and use Azure Key Vault or similar secret management services in production. Implement retry logic with exponential backoff (as demonstrated in the get_text_embedding() function) to handle transient API failures, and return generic error messages to end users rather than raw API errors, which can expose sensitive details. Additionally, validate user input before sending it to the embedding API, for example by rejecting empty queries and enforcing a maximum length, to avoid wasted calls and unnecessary API costs.
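Two of these measures, reading credentials from the environment and validating queries, can be sketched as below. The helper names `load_settings` and `clean_query` and the 500-character limit are assumptions for illustration; only the two environment variable names come from the lesson.

```python
import os
from typing import Mapping

MAX_QUERY_CHARS = 500  # assumed limit; tune for your application


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the two credentials from the environment, failing fast if absent."""
    names = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Report which variables are missing, never their values.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


def clean_query(raw: str) -> str:
    """Drop control characters, collapse whitespace, and enforce a length cap."""
    text = "".join(ch for ch in raw if ch.isprintable() or ch.isspace())
    text = " ".join(text.split())
    if not text:
        raise ValueError("query is empty")
    if len(text) > MAX_QUERY_CHARS:
        raise ValueError("query is too long")
    return text


settings = load_settings({
    "AZURE_OPENAI_API_KEY": "placeholder-key",
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
})
query = clean_query("  what are\x00 embeddings?\n")
```

Passing the environment as a parameter keeps the function easy to test without touching real credentials; in normal use it defaults to `os.environ`.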