Implementing Multi-Tool Orchestration with RAG Using the OpenAI Responses API

Implementing multi-tool orchestration with RAG enables large language models to dynamically route queries to vector stores, web search, or internal file indexes, executing sequential retrieval steps before synthesizing a final answer.

The openai/openai-cookbook repository demonstrates a complete implementation of this pattern, combining the Responses API with multiple retrieval tools to build agents that autonomously decide which knowledge sources to query. This approach eliminates brittle conditional logic by letting the model itself determine when to retrieve context, which tool to use, and how to chain multiple retrieval operations.

Architecture of Dynamic Multi-Tool RAG

The cookbook implements a model-driven routing architecture where the LLM acts as the orchestrator. Instead of hard-coding retrieval paths, you provide a catalog of tools and instructions, allowing the model to select the optimal retrieval strategy for each query.

The Responses API as the Central Router

At the core of the implementation is the Responses API create call (client.responses.create() in the OpenAI Python client), invoked in examples/responses_api/responses_api_tool_orchestration.ipynb. This endpoint accepts a tools list containing JSON schema definitions for each available function and an instructions parameter that guides the model's routing decisions. When the model decides that a custom function is needed, the response's output list contains function_call items that your application executes and answers. Conversation state is either resent as input items on each request or chained by passing previous_response_id.

Built-in vs. Custom Retrieval Tools

The implementation combines three distinct retrieval mechanisms:

  • file_search: A built-in tool that performs vector search over documents uploaded to an OpenAI-hosted vector store. The API executes it for you, so no custom retrieval code is required.
  • search_vector_store: A custom Python function that queries an external vector database (e.g., Pinecone) for domain-specific documents, registered via a JSON schema in the tools list.
  • web_search: A custom function implementing Google Custom Search or similar, enabling the model to fetch up-to-date public information when internal knowledge bases are insufficient.
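
The split between built-in and custom tools can be sketched as one catalog. This is a minimal illustration, not the notebook's code: the `file_search` entry follows the hosted-tool shape of the Responses API, the vector store ID `vs_example_id` is a placeholder, and `split_catalog` is a hypothetical helper introduced here to show which tools your application has to execute itself.

```python
# Hypothetical catalog mixing one built-in tool with two custom functions.
# "vs_example_id" is a placeholder, not a real vector store ID.
QUERY_PARAMETERS = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

TOOL_CATALOG = [
    {"type": "file_search", "vector_store_ids": ["vs_example_id"]},
    {
        "type": "function",
        "name": "search_vector_store",
        "description": "Search a domain-specific vector DB (e.g., Pinecone).",
        "parameters": QUERY_PARAMETERS,
    },
    {
        "type": "function",
        "name": "web_search",
        "description": "Run a live web search for up-to-date information.",
        "parameters": QUERY_PARAMETERS,
    },
]


def split_catalog(catalog):
    """Return (hosted_tool_types, local_function_names).

    Assumption: entries of type "function" run in your application,
    while every other type is executed by the API on your behalf.
    """
    hosted = [tool["type"] for tool in catalog if tool["type"] != "function"]
    local = [tool["name"] for tool in catalog if tool["type"] == "function"]
    return hosted, local


hosted, local = split_catalog(TOOL_CATALOG)
print("Executed by the API:", hosted)
print("Executed by your code:", local)
```

Only the names in the second list need a Python implementation and a dispatch entry in the orchestration loop.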

Sequential Tool Execution Flow

The orchestration loop supports multiple sequential tool calls. The model can request a web search, examine the results, then decide to query a vector store for additional context, or vice versa. Your application inspects each response for function call requests (function_call items in the Responses API), executes each function, appends every result to the conversation history as a function_call_output item carrying the matching call_id, and resubmits to the API until the model returns a final answer with no further calls.
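
One step of that flow, executing the requested functions and packaging their results, might look like the sketch below. It assumes the Responses API item shapes (`function_call` items with `name`, `arguments`, and `call_id`), uses `SimpleNamespace` stubs in place of a real response so it runs offline, and the handler bodies are placeholders rather than real retrieval code.

```python
import json
from types import SimpleNamespace


def search_vector_store(query: str) -> str:
    # Placeholder for a real vector DB lookup.
    return json.dumps({"documents": [f"Doc about {query}"]})


def web_search(query: str) -> str:
    # Placeholder for a real web search request.
    return json.dumps({"title": "Example", "snippet": f"Result for {query}"})


HANDLERS = {"search_vector_store": search_vector_store, "web_search": web_search}


def execute_function_calls(output_items, handlers):
    """Run every function_call item and build the matching output items."""
    results = []
    for item in output_items:
        if item.type != "function_call":
            continue  # messages and built-in tool items need no local action
        handler = handlers.get(item.name)
        if handler is None:
            output = json.dumps({"error": f"unknown tool: {item.name}"})
        else:
            output = handler(**json.loads(item.arguments))
        results.append(
            {"type": "function_call_output", "call_id": item.call_id, "output": output}
        )
    return results


# Stub standing in for `response.output` from a real API call.
fake_output = [
    SimpleNamespace(
        type="function_call",
        name="web_search",
        call_id="call_1",
        arguments='{"query": "pricing"}',
    ),
    SimpleNamespace(type="message", content="ignored by the dispatcher"),
]

outputs = execute_function_calls(fake_output, HANDLERS)
print(outputs)
```

Returning an error payload for an unknown tool, instead of raising, lets the model see the failure and choose a different tool on the next turn.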

Python Implementation of the Orchestration Loop

The following implementation mirrors the pattern found in examples/responses_api/responses_api_tool_orchestration.ipynb, demonstrating how to wire custom retrieval functions into the Responses API.

import json

from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
# Never hard-code API keys in source files.
client = OpenAI()

# ----------------------------------------------------
# 1️⃣  Tool implementations and JSON-schema definitions
# ----------------------------------------------------

def search_vector_store(query: str) -> str:
    """
    Retrieve relevant chunks from an external Pinecone index.
    Returns a JSON string with a list of `documents`.
    """
    # Insert your Pinecone client code here; the return value is a placeholder.
    return json.dumps(
        {"documents": [f"Result 1 for {query}", f"Result 2 for {query}"]}
    )

def web_search(query: str) -> str:
    """Perform a Google Custom Search and return the top result."""
    # Insert your Google-CSE request here; the return value is a placeholder.
    return json.dumps({"title": "Example", "snippet": f"Result for {query}"})

# The Responses API expects flat function definitions: `name`, `description`,
# and `parameters` sit at the top level, not under a nested "function" key.
tools = [
    {
        "type": "function",
        "name": "search_vector_store",
        "description": "Search a domain-specific vector DB (e.g., Pinecone).",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "type": "function",
        "name": "web_search",
        "description": "Run a live web search for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    # To enable the built-in file_search tool, add an entry such as
    # {"type": "file_search", "vector_store_ids": ["<your-vector-store-id>"]}.
]

# Map each custom tool name to its Python implementation.
TOOL_IMPLEMENTATIONS = {
    "search_vector_store": search_vector_store,
    "web_search": web_search,
}

# `file_search` is only available to the model if it is listed in `tools`.
INSTRUCTIONS = (
    "Use `file_search` for internal docs, "
    "`search_vector_store` for domain knowledge, "
    "and `web_search` for fresh public info. "
    "If multiple tools are needed, call them sequentially."
)

# ----------------------------------------------------
# 2️⃣  Orchestration loop
# ----------------------------------------------------

def run_multi_tool_rag(user_prompt: str, max_turns: int = 8) -> str:
    input_items = [{"role": "user", "content": user_prompt}]

    # Bound the loop so a misbehaving model cannot call tools forever.
    for _ in range(max_turns):
        response = client.responses.create(
            model="gpt-4o-mini",
            input=input_items,
            tools=tools,
            instructions=INSTRUCTIONS,
        )

        function_calls = [
            item for item in response.output if item.type == "function_call"
        ]

        # No function calls → the model produced its final answer
        if not function_calls:
            return response.output_text

        # Keep the model's own output (including its function_call items)
        # in the history so each result can be matched to its call.
        input_items += response.output

        for call in function_calls:
            # Arguments arrive as a JSON string; parse them, never eval() them.
            args = json.loads(call.arguments)

            # Dispatch to the real Python implementation.
            # Built-in tools such as file_search are executed by the API itself
            # and never reach this branch.
            handler = TOOL_IMPLEMENTATIONS.get(call.name)
            if handler is None:
                result = json.dumps({"error": f"Unknown tool: {call.name}"})
            else:
                result = handler(**args)

            # Append the tool result so the model can see it
            input_items.append(
                {
                    "type": "function_call_output",
                    "call_id": call.call_id,
                    "output": result,
                }
            )

    raise RuntimeError(f"No final answer after {max_turns} turns")

# ----------------------------------------------------
# 3️⃣  Example usage
# ----------------------------------------------------

if __name__ == "__main__":
    # Assumes OPENAI_API_KEY is already set in the environment.
    query = "What are the latest pricing changes for Azure OpenAI?"
    answer = run_multi_tool_rag(query)
    print("\nFinal Answer:\n", answer)

Defining Tool Schemas for RAG

Each retrieval tool requires a JSON schema definition in the tools list. The schema for search_vector_store specifies a single query string parameter, enabling the model to extract appropriate search terms from the user's intent. This declarative approach allows the LLM to understand tool capabilities without exposure to implementation details.
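
One way to keep a schema and its implementation from drifting apart is to register both in a single step. The sketch below is an assumption-laden illustration: `retrieval_tool`, `TOOLS`, and `HANDLERS` are hypothetical names introduced here, and every registered function is assumed to take exactly one `query` string, as `search_vector_store` does.

```python
import json

TOOLS = []     # schemas to pass as the `tools` argument
HANDLERS = {}  # tool name -> Python callable


def retrieval_tool(description: str):
    """Register a single-argument (`query`) retrieval function as a tool."""

    def decorator(func):
        TOOLS.append(
            {
                "type": "function",
                "name": func.__name__,
                "description": description,
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Search terms"}
                    },
                    "required": ["query"],
                    "additionalProperties": False,
                },
            }
        )
        HANDLERS[func.__name__] = func
        return func

    return decorator


@retrieval_tool("Search a domain-specific vector DB (e.g., Pinecone).")
def search_vector_store(query: str) -> str:
    # Placeholder result; a real implementation would query the index.
    return json.dumps({"documents": [f"Result for {query}"]})


print(json.dumps(TOOLS, indent=2))
```

With this layout, adding a retrieval source means writing one decorated function; the dispatch table and the schema list are updated together.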

Handling Tool Call Sequences

The loop inside run_multi_tool_rag implements the orchestration logic. When the model requests a tool, the code extracts the function name and arguments, dispatches to the corresponding Python implementation, and appends the result to the conversation so that the next request includes it. This pattern supports chained retrieval: the model can issue a second tool call based on evidence gathered from the first, combining several sources in one answer. In practice the loop should also cap the number of turns so that a model that keeps requesting tools cannot run indefinitely.
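
Chained retrieval can be exercised without network access by replaying a scripted sequence of model turns. The sketch below is a test harness under stated assumptions, not the cookbook's code: `scripted_model`, `call`, and `orchestrate` are hypothetical helpers, and the stub objects only imitate the `output` and `output_text` fields of a Responses API result.

```python
import json
from types import SimpleNamespace


def call(name, call_id, **kwargs):
    """Build a stub function_call item (stands in for a real API output item)."""
    return SimpleNamespace(
        type="function_call", name=name, call_id=call_id, arguments=json.dumps(kwargs)
    )


def scripted_model(script):
    """Return a fake `create` callable that replays `script` turn by turn."""
    turns = iter(script)

    def create(input_items):
        turn = next(turns)
        if isinstance(turn, list):  # a list means "the model requests tools"
            return SimpleNamespace(output=turn, output_text="")
        return SimpleNamespace(output=[], output_text=turn)  # final answer

    return create


def orchestrate(create, handlers, prompt, max_turns=5):
    """Loop until the model stops requesting tools; return (answer, trace)."""
    history = [{"role": "user", "content": prompt}]
    trace = []
    for _ in range(max_turns):
        response = create(history)
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            return response.output_text, trace
        history.extend(response.output)  # keep the model's call items
        for c in calls:
            result = handlers[c.name](**json.loads(c.arguments))
            trace.append(c.name)
            history.append(
                {"type": "function_call_output", "call_id": c.call_id, "output": result}
            )
    raise RuntimeError("no final answer within max_turns")


handlers = {
    "web_search": lambda query: json.dumps({"snippet": f"web result for {query}"}),
    "search_vector_store": lambda query: json.dumps({"documents": [f"doc for {query}"]}),
}
script = [
    [call("web_search", "call_1", query="Azure OpenAI pricing")],
    [call("search_vector_store", "call_2", query="internal pricing notes")],
    "Final answer built from both sources.",
]

answer, trace = orchestrate(scripted_model(script), handlers, "What changed?")
print(trace, answer)
```

The trace records the order in which tools ran, which makes it straightforward to assert on routing behavior in unit tests before pointing the loop at the real API.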

Source Files and Production Patterns

The cookbook provides reference implementations and deployment patterns beyond the basic notebook.

Primary Notebook Reference

The file examples/responses_api/responses_api_tool_orchestration.ipynb contains the complete walkthrough, including Pinecone integration setup, exact JSON schemas for each tool, and interactive examples of the orchestration loop. The notebook is registered in registry.yaml (lines 680-686) under the entry "Multi-Tool Orchestration with RAG approach using OpenAI's Responses API."

MCP Micro-Service Deployment Pattern

For production scaling, the cookbook demonstrates exposing RAG tools as microservices using the Model Context Protocol (MCP). The file examples/partners/mcp_powered_voice_agents/search_server.py shows how to wrap search_vector_store and web_search as independent MCP servers, enabling horizontal scaling, independent versioning of retrieval components, and integration with voice agents. This pattern decouples the retrieval infrastructure from the main orchestration logic.
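
Once retrieval runs behind an MCP server, the orchestrating request can reference it as a remote tool instead of a local function. The sketch below does not reproduce search_server.py; it only builds a tool entry in the shape the Responses API uses for remote MCP servers. The helper name `mcp_tool`, the label `retrieval`, and the URL `https://example.com/mcp` are placeholders chosen for illustration.

```python
def mcp_tool(server_label, server_url, allowed_tools=None):
    """Build a remote MCP tool entry for the `tools` list.

    Assumption: "require_approval": "never" is acceptable because the
    server is one you operate and trust; tighten it otherwise.
    """
    entry = {
        "type": "mcp",
        "server_label": server_label,
        "server_url": server_url,
        "require_approval": "never",
    }
    if allowed_tools:
        # Restrict the model to a subset of the tools the server exposes.
        entry["allowed_tools"] = list(allowed_tools)
    return entry


# Placeholder label and URL; substitute your own deployment.
retrieval_server = mcp_tool(
    "retrieval",
    "https://example.com/mcp",
    allowed_tools=["search_vector_store", "web_search"],
)
print(retrieval_server)
```

Because the server is addressed by URL, the retrieval components can be redeployed or scaled without changing the orchestration code, only this entry.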

Supporting Documentation

  • examples/vector_databases/pinecone/README.md: Provides helper functions for Pinecone connectivity used by the search_vector_store implementation.
  • AGENTS.md: Contains general architectural guidance for building agentic systems that the multi-tool orchestration pattern extends.

Summary

  • Model-driven routing eliminates hard-coded if/else logic by letting the LLM select between file_search, vector stores, and web search based on query context.
  • Sequential execution enables complex RAG workflows where the model chains multiple retrieval operations before generating an answer.
  • Schema-based registration allows you to add new retrieval sources by defining JSON schemas and Python functions, without modifying the core orchestration loop.
  • Production scalability is achieved through MCP-based microservice patterns demonstrated in search_server.py, allowing independent scaling of vector search and web search components.

Frequently Asked Questions

What is multi-tool orchestration in RAG systems?

Multi-tool orchestration is an architectural pattern where the language model autonomously selects and sequences multiple retrieval tools—such as vector databases, web search, and file indexes—to gather evidence before generating a response. Unlike static RAG pipelines, this approach allows the model to decide which knowledge sources are relevant for each specific query.

How does the Responses API handle multiple sequential tool calls?

The Responses API supports multi-turn tool use, but custom functions are executed by your application, not by the API. When you submit a request with a tools list, the response can contain function_call items instead of a final answer. Your application executes these functions, returns each result as a function_call_output item tied to the originating call_id, and sends the updated conversation back to the API. The model can then request additional tools based on the new context, repeating until it produces a final response with no further calls.

Can I combine the built-in file_search tool with custom vector databases?

Yes. The implementation in examples/responses_api/responses_api_tool_orchestration.ipynb demonstrates combining the built-in file_search tool (for internal document indexes) with custom functions like search_vector_store (for external Pinecone indexes) and web_search. You include all available tools in the tools array and use the instructions parameter to guide the model on when to use each source.

Where can I find the reference implementation for production deployment?

The primary reference is examples/responses_api/responses_api_tool_orchestration.ipynb in the openai-cookbook repository. For production deployment patterns using microservices, examine examples/partners/mcp_powered_voice_agents/search_server.py, which demonstrates how to expose these tools as MCP-compatible servers for scalable, decoupled architectures.
