How to Optimize LLM Costs When Running the AI Hedge Fund

You can optimize LLM costs in the AI Hedge Fund by configuring cheaper default models, reducing retry attempts, batching multiple tickers into single prompts, implementing response caching, and monitoring usage via the built-in llm_calls_count tracker.

The virattt/ai-hedge-fund repository orchestrates multiple LLM-driven agents—such as Warren Buffett and Peter Lynch personas—that analyze financial data to generate trading signals. Because each agent may invoke premium models like GPT-4 multiple times per run, operational costs can escalate rapidly without proper optimization strategies.

Model Selection Strategy

The most immediate cost lever is the model itself. In src/utils/llm.py, the call_llm function (lines 33-40) and get_agent_model_config (lines 24-47) resolve the model name from request state or fall back to the system default—currently gpt-4.1 with the OPENAI provider.

To reduce costs, override the default with a cheaper alternative such as gpt-3.5-turbo or a local Ollama model. The repository already exposes a model catalog in src/llm/models.py (line 108) via the LLM_ORDER list, which can be surfaced in a CLI flag or UI dropdown.


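# src/cli/input.py – example override
# Assumes `parser` is the module's existing argparse.ArgumentParser and that
# AVAILABLE_MODELS is exported by src/llm/models.py alongside LLM_ORDER.
from src.llm.models import AVAILABLE_MODELS

parser.add_argument(
    "--model",
    choices=[m.model_name for m in AVAILABLE_MODELS],
    default="gpt-3.5-turbo",  # must match a model_name in AVAILABLE_MODELS
    help="Cheaper LLM model to use for all agents",
)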
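The resolution order described in this section (a specific choice from the request state first, the system default last) can be sketched in isolation. This is a simplified stand-in rather than the repository's get_agent_model_config: the function name resolve_model, the layout of the state dictionary, and the agent_models key are all assumptions made for illustration.

```python
from typing import Any

# Default named in this article; the rest of this sketch is hypothetical.
DEFAULT_MODEL = ("gpt-4.1", "OPENAI")


def resolve_model(state: dict[str, Any], agent_name: str) -> tuple[str, str]:
    """Return (model_name, provider) for an agent, most specific setting first."""
    metadata = state.get("metadata", {})
    # 1. Per-agent override, e.g. a premium model reserved for one agent.
    per_agent = metadata.get("agent_models", {}).get(agent_name)
    if per_agent:
        return tuple(per_agent)
    # 2. Run-wide override, e.g. populated from a --model CLI flag.
    if metadata.get("model_name"):
        return metadata["model_name"], metadata.get("model_provider", "OPENAI")
    # 3. Nothing specified: fall back to the system default.
    return DEFAULT_MODEL


state = {
    "metadata": {
        "model_name": "gpt-3.5-turbo",
        "agent_models": {"valuation": ("gpt-4.1", "OPENAI")},
    }
}
print(resolve_model(state, "news_sentiment"))  # uses the run-wide cheap model
print(resolve_model(state, "valuation"))       # keeps its per-agent override
print(resolve_model({}, "news_sentiment"))     # falls back to the default
```

A layered lookup like this lets most agents run on the cheap model while one or two keep a premium model, which is usually a better trade than switching everything at once.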

Retry Policy Tuning

The LLM wrapper retries automatically to handle transient failures. In src/utils/llm.py (lines 58-78), the retry loop makes up to three attempts before returning a default response. This improves reliability, but every extra attempt is another billable request, so a call that fails twice can cost up to three times as much as one that succeeds on the first attempt.

For non-critical agents—such as news sentiment analysis—reduce max_retries to 1. Reserve higher retry counts for deterministic agents like valuation models.


# src/agents/news_sentiment.py

response = call_llm(
    prompt,
    Sentiment,
    agent_name="news_sentiment",
    state=state,
    max_retries=1,  # Reduced from default 3
    default_factory=lambda: Sentiment(decision="NEUTRAL"),
)
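To judge how much a lower retry limit can save, it helps to estimate the expected number of requests per call. The sketch below assumes that each attempt fails independently with the same probability, that the loop stops at the first success, and that failed attempts are billed (as they are when the model replies but the output cannot be parsed). The 20% failure rate is an illustrative value, not a measurement from the repository.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of LLM requests made by one retried call.

    Assumes independent attempts that each fail with probability
    failure_rate, stopping at the first success or after max_retries tries.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    # Attempt k (0-based) only happens if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries))


for retries in (1, 3):
    print(retries, round(expected_attempts(0.2, retries), 4))
```

Under these assumptions, three attempts at a 20% failure rate average 1.24 requests per call against exactly 1 with a single attempt. The saving from lowering max_retries therefore grows with the failure rate and is small when calls rarely fail.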

Prompt Engineering and Batching

Token consumption scales with prompt length. The call_llm function accepts any prompt object, and long contexts—such as analyst biographies or extensive ticker lists—directly increase costs.

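Optimize by moving static context to a shared system prompt stored once per run. Additionally, batch multiple tickers into a single prompt rather than invoking the LLM per ticker. This reduces API calls from N to 1. Output tokens still scale with the number of tickers, so the saving comes mainly from sending the shared instructions and context once instead of N times.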

batched_prompt = f"""
For each ticker in {tickers}, provide a short buy/sell/hold decision.
Return JSON: {{ "ticker": "decision", ... }}
"""
result = call_llm(
    batched_prompt,
    DecisionsModel,
    agent_name="portfolio_manager",
    state=state
)
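A batched reply needs more validation than a single-ticker reply, because the model may omit a ticker or return an unexpected label. The following sketch shows one way to build the prompt and check the JSON that comes back. The helper names build_batched_prompt and parse_batched_response are hypothetical, the reply is a hard-coded stand-in for a real model response, and defaulting to "hold" is a design choice made here, not something the repository prescribes.

```python
import json

ALLOWED_DECISIONS = {"buy", "sell", "hold"}


def build_batched_prompt(tickers: list[str]) -> str:
    """Build one prompt that asks for a decision on every ticker."""
    ticker_list = ", ".join(tickers)
    return (
        f"For each of these tickers: {ticker_list}, give a buy/sell/hold decision.\n"
        'Return only JSON shaped like {"TICKER": "decision"}.'
    )


def parse_batched_response(raw: str, tickers: list[str]) -> dict[str, str]:
    """Validate the JSON reply; missing or invalid entries become 'hold'."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        data = {}
    decisions = {}
    for ticker in tickers:
        decision = str(data.get(ticker, "hold")).lower()
        decisions[ticker] = decision if decision in ALLOWED_DECISIONS else "hold"
    return decisions


tickers = ["AAPL", "MSFT", "NVDA"]
prompt = build_batched_prompt(tickers)
# Stand-in for the model's reply; NVDA is deliberately missing.
fake_reply = '{"AAPL": "BUY", "MSFT": "hold"}'
print(parse_batched_response(fake_reply, tickers))
```

Falling back to a neutral decision keeps one malformed entry from discarding the whole batch, which would otherwise force a repeat call and cancel the saving.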

Response Caching Implementation

Unlike API data—which is cached in src/data/cache.py—LLM responses are not cached by default. Repeated analyses, such as sentiment scoring for the same news article, trigger redundant costly calls.

Implement a memo-cache keyed by <model_name>|<prompt_hash> to deduplicate identical requests. Use the existing cache pattern from src/data/cache.py as a model.


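# src/utils/llm.py – add at top of file

import hashlib
import json

from pydantic import BaseModel

# In-process memo; entries live only for the duration of one run.
_llm_memo: dict[str, BaseModel] = {}

def memoized_call_llm(prompt, pydantic_model, model_name="default", **kwargs):
    # model_name is a hypothetical extra argument: pass the resolved LLM name
    # so responses from different models never share a cache entry.
    # default=str lets non-JSON-serializable prompt objects be hashed via str().
    prompt_hash = hashlib.sha256(
        f"{pydantic_model.__name__}:{json.dumps(prompt, sort_keys=True, default=str)}".encode()
    ).hexdigest()
    key = f"{model_name}|{prompt_hash}"
    if key in _llm_memo:
        return _llm_memo[key]
    result = call_llm(prompt, pydantic_model, **kwargs)
    _llm_memo[key] = result
    return result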
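To check the <model_name>|<prompt_hash> key scheme without spending tokens, a stub can stand in for the real call. The sketch below is self-contained and does not touch the repository: make_memoized and fake_llm are hypothetical names, and the hit and miss counters are there only to show how much traffic a cache would absorb.

```python
import hashlib
from typing import Any, Callable


def make_memoized(llm_call: Callable[[str, str], Any]):
    """Wrap llm_call(model_name, prompt) with an in-memory cache and counters."""
    cache: dict[str, Any] = {}
    stats = {"hits": 0, "misses": 0}

    def wrapper(model_name: str, prompt: str) -> Any:
        prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        key = f"{model_name}|{prompt_hash}"
        if key in cache:
            stats["hits"] += 1
            return cache[key]
        stats["misses"] += 1
        cache[key] = llm_call(model_name, prompt)
        return cache[key]

    return wrapper, stats


def fake_llm(model_name: str, prompt: str) -> str:
    """Stub that stands in for a billable LLM request."""
    return f"{model_name} answered a {len(prompt)}-character prompt"


cached_llm, stats = make_memoized(fake_llm)
cached_llm("gpt-4.1", "Score this headline")
cached_llm("gpt-4.1", "Score this headline")        # identical request: cache hit
cached_llm("gpt-3.5-turbo", "Score this headline")  # other model: separate entry
print(stats)
```

Including the model name in the key matters when models are switched between runs or agents: without it, a response produced by a cheap model could be served in place of one requested from a premium model.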

Usage Monitoring and Cost Tracking

The database schema tracks every LLM invocation. In app/backend/database/models.py (line 88), the llm_calls_count column increments with each call, while estimated_cost provides visibility into spend.

Query these fields after backtests to identify cost hotspots. Schedule alerts when a run exceeds predefined thresholds to prevent budget overruns.

from sqlalchemy.orm import Session
from app.backend.database.models import HedgeFundFlowRunCycle

def print_llm_costs(session: Session, run_id: int):
    # Session.get replaces the legacy Query.get (SQLAlchemy 1.4 and later).
    cycle = session.get(HedgeFundFlowRunCycle, run_id)
    if cycle is None:
        print(f"No run cycle found with id {run_id}")
        return
    print(
        f"LLM calls: {cycle.llm_calls_count}, "
        f"API calls: {cycle.api_calls_count}, "
        f"Estimated cost: {cycle.estimated_cost}"
    )
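The threshold check mentioned above can be prototyped without a database. In this sketch, CycleUsage is a hypothetical stand-in for a run-cycle row, the thresholds and sample figures are illustrative, and the cost is assumed to be available as a number (convert it first if the column stores it as text).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CycleUsage:
    """Hypothetical stand-in for one run-cycle row."""
    cycle_id: int
    llm_calls_count: int
    estimated_cost: Optional[float]


def find_overruns(cycles, max_calls: int, max_cost: float):
    """Return (cycle_id, reason) pairs for cycles that exceed either threshold."""
    alerts = []
    for c in cycles:
        if c.llm_calls_count > max_calls:
            alerts.append((c.cycle_id, f"{c.llm_calls_count} LLM calls > {max_calls}"))
        # A missing cost estimate is skipped rather than treated as zero.
        if c.estimated_cost is not None and c.estimated_cost > max_cost:
            alerts.append((c.cycle_id, f"cost {c.estimated_cost:.2f} > {max_cost:.2f}"))
    return alerts


cycles = [
    CycleUsage(1, 40, 0.35),
    CycleUsage(2, 180, 2.10),
    CycleUsage(3, 25, None),
]
for cycle_id, reason in find_overruns(cycles, max_calls=100, max_cost=1.00):
    print(f"cycle {cycle_id}: {reason}")
```

Checking call count and cost separately is deliberate: a spike in calls on a cheap model may stay under the cost limit yet still point to a retry loop or a missing cache.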

Summary

  • Select cheaper models by overriding the default gpt-4.1 fallback in src/utils/llm.py with gpt-3.5-turbo or Ollama models via the LLM_ORDER catalog.
  • Reduce retry counts from the default 3 to 1 for non-critical agents so that failed calls are not billed repeatedly.
  • Batch ticker processing into single prompts rather than individual calls, cutting API volume from N to 1.
  • Implement response caching using a <model_name>|<prompt_hash> key to avoid recomputing identical analyses.
  • Monitor usage via the llm_calls_count and estimated_cost columns in app/backend/database/models.py to identify cost hotspots.

Frequently Asked Questions

What is the default LLM model used by the AI Hedge Fund?

The system defaults to gpt-4.1 provided by OpenAI. This fallback is defined in src/utils/llm.py within the call_llm function (lines 37-40) when no model is specified in the request state.

How can I switch to a cheaper model without modifying agent code?

You can override the model selection via CLI flags or request metadata. The repository exposes an LLM_ORDER list in src/llm/models.py (line 108) that catalogs available models. Pass a cheaper option like gpt-3.5-turbo or an Ollama model name through the request state, and get_agent_model_config in src/utils/llm.py (lines 24-47) will resolve it.

Why does the AI Hedge Fund retry LLM calls, and how does this affect costs?

The call_llm wrapper in src/utils/llm.py (lines 58-78) implements a retry loop that makes up to three attempts before falling back to a default response. This improves reliability against transient API failures, but each retry is a fresh billable request. For non-critical agents like news sentiment analysis, setting max_retries to 1 caps each call at a single request; the trade-off is that a failed call falls straight through to the default response.

How do I track which agents are driving the highest LLM costs?

The database schema in app/backend/database/models.py (line 88) includes an llm_calls_count column that increments with every LLM invocation, alongside an estimated_cost field. After running a backtest, query the HedgeFundFlowRunCycle table to identify which workflow cycles generated the most calls and the highest costs. These counters are recorded per run cycle rather than per agent, so to attribute spend to a specific agent, compare cycles run with different agent selections and then target the agents that account for the difference.
