
How the Summarize CLI Handles Output Length and Token Limits

February 19, 2026 | steipete/summarize

The summarize CLI handles output length and token limits by translating --max-output-tokens flags or --length presets into a token budget, then capping that value against model-specific maximums before every LLM request.

The steipete/summarize repository provides a command-line interface for generating text summaries with large language models. Knowing how the CLI handles output length and token limits helps you avoid provider rejections and control API costs. The codebase implements a multi-stage pipeline that converts user preferences into a token budget, honoring explicit flags first and then applying model-specific constraints.

Parsing User Input for Output Length and Token Limits

The process begins in src/run/run-settings.ts, where the parseMaxOutputTokensArg function processes the --max-output-tokens flag. If provided, this value becomes the explicit maxOutputTokensArg returned to the run context. When the user omits the flag, the CLI falls back to deriving a token budget from the --length argument (defaulting to the "xl" preset).
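The flag-parsing step can be pictured with a minimal sketch. This is not the repository's parseMaxOutputTokensArg; the name parseMaxOutputTokensSketch, the accepted input format (plain positive integers only), and the error messages are all assumptions made for illustration.

```typescript
// Hypothetical sketch, NOT the actual parseMaxOutputTokensArg from
// src/run/run-settings.ts. Accepted formats and messages are assumptions.
function parseMaxOutputTokensSketch(raw: string | undefined): number | null {
  // Flag omitted: return null so the caller can fall back to the --length preset.
  if (raw === undefined) return null;

  const trimmed = raw.trim();
  // Accept only plain digit strings such as "300".
  if (!/^\d+$/.test(trimmed)) {
    throw new Error(`Invalid --max-output-tokens value: ${raw}`);
  }

  const value = Number.parseInt(trimmed, 10);
  if (value <= 0) {
    throw new Error(`--max-output-tokens must be greater than zero: ${raw}`);
  }
  return value;
}
```

Returning null for an omitted flag keeps "not provided" distinct from any numeric value, which is what allows a later step to decide whether to derive the budget from --length.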

In src/run/run-output.ts, the resolveDesiredOutputTokens function performs this conversion: it divides the preset's character target by four and applies a floor of 16 tokens. For example, if the "short" preset targets approximately 1,500 characters, the budget comes out to roughly 375 tokens. This turns a qualitative length preference into a concrete number that can be sent to the provider as an output limit.
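The arithmetic behind that conversion is small enough to sketch directly. The function name estimateTokensFromChars, the constant names, and the choice to round up are assumptions; only the divide-by-four ratio and the 16-token floor come from the description above.

```typescript
// Hypothetical sketch of the characters-to-tokens heuristic.
// Rounding up is an assumption; the source only states "characters ÷ 4, minimum 16".
const CHARS_PER_TOKEN_SKETCH = 4;
const MIN_OUTPUT_TOKENS_SKETCH = 16;

function estimateTokensFromChars(targetChars: number): number {
  if (!Number.isFinite(targetChars) || targetChars < 0) {
    throw new RangeError(`targetChars must be a non-negative finite number: ${targetChars}`);
  }
  const estimated = Math.ceil(targetChars / CHARS_PER_TOKEN_SKETCH);
  // Never go below the floor, even for very small character targets.
  return Math.max(MIN_OUTPUT_TOKENS_SKETCH, estimated);
}

const shortBudget = estimateTokensFromChars(1500); // 375
const tinyBudget = estimateTokensFromChars(10);    // 16 (floor applied)
```

The floor matters for very small targets: without it, a request for a handful of characters would yield a budget too small for a model to produce a complete sentence.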

Enforcing Model-Specific Token Caps

Before any network request, the system enforces hard limits defined by the model provider. The src/run/run-metrics.ts file implements resolveMaxOutputTokensForCall, which loads the LiteLLM catalog via loadLiteLlmCatalog to retrieve the model's advertised maximum output token count. The function returns the smaller of the user-requested value and the model-specific ceiling, preventing API errors from oversized requests.
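The capping rule can be illustrated as follows. The catalog here is a plain in-memory Map standing in for whatever loadLiteLlmCatalog returns; the CatalogEntrySketch type, the capOutputTokensSketch name, the model ID, and the pass-through behavior for unknown models are assumptions rather than the repository's actual code.

```typescript
// Hypothetical sketch: the real catalog shape and lookup live in the repo.
type CatalogEntrySketch = {
  maxOutputTokens?: number;
  maxInputTokens?: number;
};

function capOutputTokensSketch(
  requested: number,
  modelId: string,
  catalog: ReadonlyMap<string, CatalogEntrySketch>,
): number {
  const ceiling = catalog.get(modelId)?.maxOutputTokens;
  // Assumption: with no usable ceiling on record, pass the request through unchanged.
  if (ceiling === undefined || ceiling <= 0) return requested;
  // Otherwise use the smaller of the user's request and the model's ceiling.
  return Math.min(requested, ceiling);
}

const demoCatalog = new Map<string, CatalogEntrySketch>([
  ["example/model", { maxOutputTokens: 4096, maxInputTokens: 128000 }],
]);

const cappedLarge = capOutputTokensSketch(8000, "example/model", demoCatalog); // 4096
const cappedSmall = capOutputTokensSketch(300, "example/model", demoCatalog);  // 300
```

Taking the minimum means the user's flag can only ever lower the limit below the model's ceiling, never raise it above.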

The same module provides resolveMaxInputTokensForCall to guard against prompts that exceed the model's input capacity. This dual validation ensures that both the prompt and the expected response remain within provider constraints.

Executing Constrained LLM Requests

The final enforcement occurs in src/run/summary-engine.ts within the runSummaryAttempt function. This function obtains the definitive maxOutputTokensForCall by awaiting deps.resolveMaxOutputTokensForCall with the resolved model ID. When streaming is disabled or falls back to non-streaming mode, this capped value is passed directly to summarizeWithModelId, which forwards it to the provider's SDK.
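The hand-off described here, where an awaited resolver supplies the cap and the result is forwarded to the summarizer, might look roughly like the sketch below. The AttemptDepsSketch type, the argument shapes, and the runAttemptSketch name are assumptions; the real runSummaryAttempt handles more than this, including streaming.

```typescript
// Hypothetical sketch of the non-streaming path only; signatures are assumed.
type SummarizeArgsSketch = {
  modelId: string;
  prompt: string;
  maxOutputTokens?: number;
};

type AttemptDepsSketch = {
  resolveMaxOutputTokensForCall: (modelId: string) => Promise<number | null>;
  summarizeWithModelId: (args: SummarizeArgsSketch) => Promise<string>;
};

async function runAttemptSketch(
  deps: AttemptDepsSketch,
  modelId: string,
  prompt: string,
): Promise<string> {
  // Resolve the definitive cap for this specific model before calling it.
  const maxOutputTokensForCall = await deps.resolveMaxOutputTokensForCall(modelId);

  const args: SummarizeArgsSketch = { modelId, prompt };
  // Only set the limit when one was resolved; otherwise leave it to the provider default.
  if (maxOutputTokensForCall !== null) {
    args.maxOutputTokens = maxOutputTokensForCall;
  }
  return deps.summarizeWithModelId(args);
}
```

Injecting the resolver and the summarizer as dependencies keeps the attempt logic testable with fakes, with no network access needed.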

For input validation, runSummaryAttempt uses gpt-tokenizer to check the prompt length against resolveMaxInputTokensForCall. If the input exceeds the limit, the CLI throws a clear error before any network request is initiated, saving unnecessary API calls.
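A pre-flight input check of this kind can be sketched with an injected token counter. The names assertPromptFitsSketch and roughTokenCount are hypothetical, the error message is invented, and the counter below is a crude length-based stand-in, not gpt-tokenizer; in practice a real tokenizer-backed counter would be passed in.

```typescript
// Hypothetical sketch: the counter is injected so any tokenizer can be plugged in.
type TokenCounterSketch = (text: string) => number;

function assertPromptFitsSketch(
  prompt: string,
  maxInputTokens: number | null,
  countTokens: TokenCounterSketch,
): void {
  // Assumption: a null limit means the model's input capacity is unknown, so skip the check.
  if (maxInputTokens === null) return;

  const used = countTokens(prompt);
  if (used > maxInputTokens) {
    // Fail before any network request is made.
    throw new Error(
      `Input is ${used} tokens, which exceeds the model limit of ${maxInputTokens}.`,
    );
  }
}

// Crude stand-in counter (about 4 characters per token); NOT a real tokenizer.
const roughTokenCount: TokenCounterSketch = (text) => Math.ceil(text.length / 4);

assertPromptFitsSketch("a".repeat(40), 10, roughTokenCount); // 10 tokens, passes
```

Because the check runs locally, an oversized prompt costs nothing beyond the tokenization itself.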

Practical Examples of Token Budget Control

Explicitly set a hard token limit for precise control:

summarize https://example.com/article --max-output-tokens 300

Allow the CLI to infer the budget from a length preset:

summarize https://example.com/article --length short

# Internally: chars ≈ 1,500 → tokens ≈ 375 (minimum 16)

Combine both approaches to request a short summary while enforcing a strict upper bound:

summarize https://example.com/article --length short --max-output-tokens 200

In a TypeScript context, you can replicate the budget resolution logic:

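// Assumption: this snippet lives in src/, next to flags.ts and the run/ directory.
// Adjust both relative paths if you place it elsewhere.
import { resolveDesiredOutputTokens } from './run/run-output';
import { parseLengthArg } from './flags';

// Convert a length preset into a token budget
const lengthArg = parseLengthArg('short');

// Passing null for maxOutputTokensArg means no explicit --max-output-tokens was given,
// so the budget is derived from the length preset alone.
const desiredTokens = resolveDesiredOutputTokens({
  lengthArg,
  maxOutputTokensArg: null,
});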

Summary

The CLI parses --max-output-tokens explicitly or derives tokens from --length presets using a characters-to-tokens ratio in src/run/run-output.ts.
The resolveMaxOutputTokensForCall helper in src/run/run-metrics.ts caps user requests against model-specific limits from the LiteLLM catalog.
src/run/summary-engine.ts applies the final token budget to each LLM call via runSummaryAttempt, so requests stay within the limits advertised for the model.
Input prompts are pre-validated using gpt-tokenizer to prevent requests that would exceed the model's input token capacity.
The orchestration in src/run/runner.ts coordinates this pipeline so that user intent and provider constraints are reconciled for each summary run.

Frequently Asked Questions

What happens if I don't specify `--max-output-tokens`?

The CLI automatically derives a token budget from the --length argument you provide (defaulting to "xl" if omitted). The resolveDesiredOutputTokens function in src/run/run-output.ts calculates this by dividing the character target by four and ensuring a minimum of 16 tokens.

How does the CLI prevent exceeding model token limits?

Before each API call, the resolveMaxOutputTokensForCall function in src/run/run-metrics.ts compares your requested limit against the model's maximum output tokens listed in the LiteLLM catalog. It automatically uses the lower of the two values, ensuring the request complies with provider constraints regardless of user input.

Can I combine `--length` and `--max-output-tokens` flags?

Yes. When both flags are present, --max-output-tokens acts as an explicit cap that overrides the token count derived from --length. This allows you to request a "short" summary style while strictly limiting the response to a specific token budget, such as 200 tokens. The explicit value is still subject to the model-specific ceiling applied by resolveMaxOutputTokensForCall.

What error occurs if the input text exceeds token limits?

The runSummaryAttempt function in src/run/summary-engine.ts uses gpt-tokenizer to count input tokens before sending the request. If the prompt exceeds the limit returned by resolveMaxInputTokensForCall, the CLI throws a clear error immediately, preventing the API call and saving unnecessary costs.
