RAG Workflow in Open Notebook's ask.py: Search and Synthesis Explained
Open Notebook's ask.py implements a search-then-synthesize RAG pipeline where a language model first plans multi-step searches, vector retrieval fetches relevant chunks, and sequential synthesis stages produce a unified final answer.
The lfnovo/open-notebook repository uses this RAG workflow inside a LangGraph state machine to answer user questions with grounded, retrieved context. The entire pipeline is defined in open_notebook/graphs/ask.py and orchestrates model-driven strategy generation, parallel vector search, and focused answer aggregation.
How the RAG Pipeline Works in ask.py
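The ask.py module compiles a LangGraph state graph that turns a user question into a polished answer. The walkthrough below covers four processing stages (strategy generation, fan-out retrieval, per-search synthesis, and final aggregation) followed by the wiring that connects them. Each stage maps to a node or a conditional edge in the state graph.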
Step 1: Strategy Generation
The pipeline begins with the agent node calling call_model_with_messages. This function builds a system prompt from the ask/entry template and invokes provision_langchain_model to generate a JSON-encoded strategy. The strategy contains up to five search objects, each specifying a term and instructions for downstream synthesis. The parsed Strategy object is stored in state["strategy"] before the graph proceeds.
In open_notebook/graphs/ask.py, this logic occupies the entry-point node that converts natural language intent into a structured retrieval plan.
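To make the shape of the retrieval plan concrete, here is a minimal sketch of parsing a JSON strategy into typed objects. It uses only the standard library; the `reasoning` field, the `searches` key, the `parse_strategy` helper, and the in-code cap of five are assumptions for illustration, and the real module may define and validate these differently.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Search:
    term: str          # query text handed to the vector search
    instructions: str  # guidance for the downstream synthesis step


@dataclass
class Strategy:
    reasoning: str = ""  # hypothetical field, not confirmed by the article
    searches: list[Search] = field(default_factory=list)


MAX_SEARCHES = 5  # the article says "up to five"; enforcing it here is an assumption


def parse_strategy(raw_json: str) -> Strategy:
    """Turn the model's JSON reply into a Strategy, capping the search count."""
    data = json.loads(raw_json)
    searches = [
        Search(term=item["term"], instructions=item["instructions"])
        for item in data.get("searches", [])[:MAX_SEARCHES]
    ]
    return Strategy(reasoning=data.get("reasoning", ""), searches=searches)


# A hand-written reply standing in for the model output.
raw = json.dumps({
    "reasoning": "Cover the definition first, then the experiments.",
    "searches": [
        {"term": "quantum entanglement definition", "instructions": "Define it briefly."},
        {"term": "Bell test experiments", "instructions": "Summarize key results."},
    ],
})

state = {"question": "How does quantum entanglement work?"}
state["strategy"] = parse_strategy(raw)
print([s.term for s in state["strategy"].searches])
```

Keeping the plan as typed objects rather than raw dictionaries means a malformed model reply fails at parse time instead of deep inside a retrieval branch.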
Step 2: Parallel Retrieval with Dynamic Queries
Once the strategy is materialized, LangGraph evaluates a conditional edge from the agent node. For every search item in the strategy, the graph spawns a Send node that forwards the original question, the search term, and its instructions to the provide_answer sub-graph. This design executes multiple retrieval branches in parallel according to the model's own search plan.
The dynamic expansion is defined in the state-graph wiring inside open_notebook/graphs/ask.py.
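The fan-out described above can be sketched as a function that returns one dispatch object per planned search. To stay dependency-free, the `Send` class below is a stand-in that mimics the (node, arg) shape of LangGraph's `Send`; the function name `trigger_queries` and the exact payload keys are assumptions, not a copy of the repository's code.

```python
from typing import Any, NamedTuple


class Send(NamedTuple):
    """Stand-in for LangGraph's Send: a target node name plus that branch's input."""
    node: str
    arg: dict[str, Any]


def trigger_queries(state: dict[str, Any]) -> list[Send]:
    """Hypothetical conditional-edge function: emit one Send per planned search."""
    return [
        Send(
            "provide_answer",
            {
                "question": state["question"],
                "term": search["term"],
                "instructions": search["instructions"],
            },
        )
        for search in state["strategy"]["searches"]
    ]


state = {
    "question": "How does quantum entanglement work?",
    "strategy": {
        "searches": [
            {"term": "quantum entanglement definition", "instructions": "Define it briefly."},
            {"term": "Bell test experiments", "instructions": "Summarize key results."},
        ]
    },
}

sends = trigger_queries(state)
for send in sends:
    print(send.node, "<-", send.arg["term"])
```

Because each branch receives its own small payload rather than the whole shared state, the number of branches is decided at run time by the length of the strategy.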
Step 3: Per-Chunk Synthesis
Each provide_answer node performs a vector similarity search by calling vector_search (defined in open_notebook/domain/notebook.py) using the term supplied by the strategy. The function returns the top-k relevant documents as results. These raw excerpts are combined with the search-specific instructions and fed to a second language model configured with the ask/query_process prompt template. The model returns a concise, focused answer that is cleaned and stored as a partial result.
This step realizes the first synthesis layer: turning retrieved chunks into digestible evidence.
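As a rough illustration of what one branch does, the sketch below ranks a few in-memory chunks by cosine similarity and assembles the text that a synthesis prompt might receive. It is a toy under stated assumptions: the three-dimensional embeddings are made up, `top_k` and `build_query_prompt` are hypothetical helpers, and the real `vector_search` queries stored embeddings rather than a Python list.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to query_vec."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


def build_query_prompt(question: str, term: str, instructions: str, results: list[dict]) -> str:
    """Combine the retrieved excerpts with the search-specific instructions."""
    excerpts = "\n".join(f"- {r['text']}" for r in results)
    return (
        f"Question: {question}\n"
        f"Search term: {term}\n"
        f"Instructions: {instructions}\n"
        f"Excerpts:\n{excerpts}"
    )


# Made-up embeddings; a real system would obtain these from an embedding model.
chunks = [
    {"text": "Entangled particles share a joint quantum state.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Bell tests rule out local hidden variables.", "embedding": [0.7, 0.6, 0.1]},
    {"text": "Sourdough needs a long fermentation.", "embedding": [0.0, 0.1, 0.9]},
]

results = top_k([1.0, 0.2, 0.0], chunks, k=2)
prompt = build_query_prompt(
    "How does quantum entanglement work?",
    "quantum entanglement definition",
    "Define it briefly.",
    results,
)
print(prompt)
```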
Step 4: Final Answer Aggregation
After all parallel provide_answer nodes complete, the write_final_answer node reads the accumulated thread state: the original question, the chosen strategy, and the list of partial answers. A third language model call, using the ask/final_answer template, stitches these pieces into a single coherent response. The final output is written to state["final_answer"], concluding the pipeline.
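One detail worth spelling out is how several branches can write to the same list of partial answers. In LangGraph this is typically handled by annotating the state key with an additive reducer such as `operator.add`; whether ask.py declares its answers key exactly this way is an assumption here. The sketch below reproduces that merge with the standard library and then builds a plain-text input for the final call; `merge_answers` and `build_final_prompt` are hypothetical helpers.

```python
import operator
from functools import reduce


def merge_answers(updates: list[list[str]]) -> list[str]:
    """Combine per-branch updates the way an additive list reducer would."""
    return reduce(operator.add, updates, [])


def build_final_prompt(question: str, answers: list[str]) -> str:
    """Lay out the question and partial answers for the aggregation call."""
    lines = [f"Question: {question}", "", "Partial answers:"]
    lines.extend(f"- {answer}" for answer in answers)
    lines.append("")
    lines.append("Write one coherent answer grounded only in the partial answers above.")
    return "\n".join(lines)


# Each branch returns its own single-item update to the shared key.
branch_updates = [
    ["Entanglement links the states of two particles."],
    ["Bell tests confirm the correlations are non-classical."],
]

answers = merge_answers(branch_updates)
final_prompt = build_final_prompt("How does quantum entanglement work?", answers)
print(final_prompt)
```

An additive reducer means no branch overwrites another's result, which is why the aggregation node can rely on seeing every partial answer.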
Step 5: LangGraph State Machine Wiring
The compiled graph follows a linear chain: START → agent → (conditional provide_answer) → write_final_answer → END. The conditional edge after agent dynamically expands into as many provide_answer instances as the generated strategy specifies. This wiring is visible in the graph construction at the bottom of open_notebook/graphs/ask.py.
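The control flow of this chain can be simulated without LangGraph at all. The sketch below is not the repository's graph construction; it uses `asyncio.gather` as a stand-in for the fan-out and fan-in, and all three node functions are stubs that return canned strings instead of calling a model or a database.

```python
import asyncio


async def agent(state: dict) -> dict:
    """Stub planner: returns a fixed two-item strategy instead of calling a model."""
    return {
        "strategy": [
            {"term": "definition", "instructions": "Define it briefly."},
            {"term": "experiments", "instructions": "Summarize key results."},
        ]
    }


async def provide_answer(question: str, search: dict) -> str:
    """Stub branch: stands in for vector search plus per-search synthesis."""
    await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
    return f"[{search['term']}] partial answer"


async def write_final_answer(state: dict) -> dict:
    """Stub aggregator: joins partial answers instead of calling a model."""
    return {"final_answer": " | ".join(state["answers"])}


async def run_pipeline(question: str) -> dict:
    state = {"question": question}
    state.update(await agent(state))
    # Fan out one branch per planned search, then wait for all of them.
    state["answers"] = await asyncio.gather(
        *(provide_answer(question, search) for search in state["strategy"])
    )
    state.update(await write_final_answer(state))
    return state


result = asyncio.run(run_pipeline("How does quantum entanglement work?"))
print(result["final_answer"])
```

The number of concurrent branches is determined by the strategy at run time, which mirrors how the conditional edge expands into a variable number of provide_answer instances.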
Key Source Files and Functions
Several modules collaborate to realize the RAG workflow in ask.py:
- open_notebook/graphs/ask.py: Defines the core state graph, including the agent, provide_answer, and write_final_answer nodes.
- open_notebook/domain/notebook.py: Implements vector_search, which executes similarity search against SurrealDB embeddings.
- open_notebook/ai/provision.py: Exports provision_langchain_model, the factory that instantiates the configured LLM across providers.
- templates/ask/entry.jinja: Prompt template that instructs the model to emit a JSON search strategy.
- templates/ask/query_process.jinja: Prompt template for synthesizing a focused answer from each retrieved chunk.
- templates/ask/final_answer.jinja: Prompt template that merges partial answers into the final unified response.
- api/routers/ask.py: Thin FastAPI wrapper that exposes the compiled graph via an HTTP endpoint.
Running the RAG Workflow
You can invoke the pipeline directly from Python using the compiled graph object, or through the project's FastAPI server.
Direct Graph Invocation
import asyncio

from open_notebook.graphs.ask import graph


async def ask_question(q: str):
    # Invoke the compiled LangGraph with the user question
    result = await graph.ainvoke({"question": q})
    print("🧠 Final answer:", result["final_answer"])


asyncio.run(ask_question("How does quantum entanglement work?"))
FastAPI Endpoint
import asyncio

import httpx


async def ask_via_api(q: str):
    # Post the question to the /ask route of a locally running server
    async with httpx.AsyncClient(base_url="http://localhost:5055") as client:
        resp = await client.post("/ask", json={"question": q})
        resp.raise_for_status()
        print(resp.json()["final_answer"])


# Example
asyncio.run(ask_via_api("Explain the benefits of RAG in AI assistants"))
The /ask route simply forwards the payload to the compiled graph. See api/routers/ask.py for the thin wrapper implementation.
Summary
- ask.py implements a classic retrieval-augmented generation pipeline inside a LangGraph state machine.
- Strategy generation uses an entry model to plan up to five searches, each with a term and synthesis instructions.
- Parallel retrieval executes vector searches via vector_search in open_notebook/domain/notebook.py.
- Two-stage synthesis first produces focused answers per chunk (ask/query_process), then aggregates them into a final response (ask/final_answer).
- The graph runs as a linear chain with a dynamic conditional edge that scales provide_answer nodes to match the strategy.
Frequently Asked Questions
What makes the RAG workflow in ask.py different from a single-shot retrieval system?
Unlike a single-shot approach that performs one search and answers immediately, ask.py uses a model-generated strategy to spawn targeted, parallel searches. Each branch retrieves and synthesizes its own evidence before a dedicated aggregation step writes the final answer, improving coverage and coherence for complex questions.
Which prompt templates drive the three model calls in the pipeline?
The pipeline relies on three Jinja templates: templates/ask/entry.jinja for the initial JSON strategy, templates/ask/query_process.jinja for per-chunk synthesis, and templates/ask/final_answer.jinja for global answer aggregation. Each template is rendered at a distinct stage of the graph to control model behavior. Together they separate planning, evidence synthesis, and final reasoning into discrete prompting layers.
How does ask.py handle parallel execution of multiple searches?
The LangGraph state machine defines a conditional edge after the agent node that emits a Send for every search defined in the strategy. LangGraph schedules these provide_answer nodes concurrently, and write_final_answer runs only after every branch has returned its partial result.
Where is the vector search executed in the open-notebook codebase?
The similarity search is performed by vector_search, implemented in open_notebook/domain/notebook.py. The provide_answer node in open_notebook/graphs/ask.py calls this function using the term supplied by the strategy-generated search plan.