# How the Summarize CLI Handles Output Length and Token Limits

> Discover how the summarize CLI manages output length and token limits. Learn about token budgets, model maximums, and LLM request capping for efficient text summarization.

- Repository: [Peter Steinberger/summarize](https://github.com/steipete/summarize)
- Tags: internals
- Published: 2026-02-19

---

**The `summarize` CLI handles output length and token limits by translating `--max-output-tokens` flags or `--length` presets into a token budget, then capping that value against model-specific maximums before every LLM request.**

The `steipete/summarize` repository provides a command-line interface for generating text summaries with large language models. Understanding how the CLI handles output length and token limits helps you avoid provider rejections and control API costs. The codebase implements a multi-stage pipeline that converts user preferences into concrete token budgets while respecting both explicit flags and model-specific constraints.

## Parsing User Input for Output Length and Token Limits

The process begins in [`src/run/run-settings.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-settings.ts), where the `parseMaxOutputTokensArg` function processes the `--max-output-tokens` flag. If provided, this value becomes the explicit `maxOutputTokensArg` returned to the run context. When the user omits the flag, the CLI falls back to deriving a token budget from the `--length` argument (defaulting to the "xl" preset).

In [`src/run/run-output.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-output.ts), the `resolveDesiredOutputTokens` function performs this conversion by approximating **characters ÷ 4** with a **minimum of 16 tokens**. For example, the "short" preset might allocate approximately 1,500 characters, translating to roughly 375 tokens. This ensures that even vague length preferences translate into concrete token budgets that the LLM can respect.
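
The conversion described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the names `estimateTokensFromChars` and `resolveBudgetSketch` are hypothetical, and rounding up is an assumption, since the text only says the result is approximated as characters ÷ 4 with a floor of 16.

```typescript
// Hypothetical sketch of the characters-to-tokens heuristic.
// Constants mirror the figures quoted in the text (÷ 4, minimum 16).
const CHARS_PER_TOKEN = 4;
const MIN_OUTPUT_TOKENS = 16;

// Convert a character target into a token budget.
// Rounding up is an assumption; the source only says "approximating".
function estimateTokensFromChars(targetChars: number): number {
  return Math.max(MIN_OUTPUT_TOKENS, Math.ceil(targetChars / CHARS_PER_TOKEN));
}

// An explicit --max-output-tokens value takes precedence over the
// budget derived from the length preset.
function resolveBudgetSketch(
  targetChars: number,
  maxOutputTokensArg: number | null,
): number {
  return maxOutputTokensArg ?? estimateTokensFromChars(targetChars);
}

console.log(estimateTokensFromChars(1500)); // 375
console.log(estimateTokensFromChars(10)); // 16 (floor applies)
console.log(resolveBudgetSketch(1500, 200)); // 200 (explicit flag wins)
```

The floor keeps very small character targets from producing a budget too small for the model to emit a usable answer.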

## Enforcing Model-Specific Token Caps

Before any network request, the system enforces hard limits defined by the model provider. The [`src/run/run-metrics.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-metrics.ts) file implements `resolveMaxOutputTokensForCall`, which loads the LiteLLM catalog via `loadLiteLlmCatalog` to retrieve the model's advertised maximum output token count. The function returns the **smaller** of the user-requested value and the model-specific ceiling, preventing API errors from oversized requests.
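
A rough sketch of the capping step follows. The catalog here is a hypothetical in-memory object standing in for the LiteLLM catalog, the model ID and its limits are invented for illustration, and the behavior for models missing from the catalog (pass the request through unchanged) is an assumption rather than something the source confirms.

```typescript
// Hypothetical shape for per-model limits; the real catalog has more fields.
type ModelLimits = { maxOutputTokens?: number; maxInputTokens?: number };

// Invented example entry, not a real model or real limits.
const exampleCatalog: Record<string, ModelLimits> = {
  "example/small-model": { maxOutputTokens: 4096, maxInputTokens: 128000 },
};

// Return the smaller of the requested budget and the model's ceiling.
function capOutputTokens(
  requested: number | null,
  modelId: string,
  catalog: Record<string, ModelLimits>,
): number | null {
  const ceiling = catalog[modelId]?.maxOutputTokens ?? null;
  if (requested === null) return ceiling; // nothing requested: use the ceiling
  if (ceiling === null) return requested; // unknown model: assumption, pass through
  return Math.min(requested, ceiling);
}

console.log(capOutputTokens(8000, "example/small-model", exampleCatalog)); // 4096
console.log(capOutputTokens(300, "example/small-model", exampleCatalog)); // 300
```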

The same module provides `resolveMaxInputTokensForCall` to guard against prompts that exceed the model's input capacity. This dual validation ensures that both the prompt and the expected response remain within provider constraints.

## Executing Constrained LLM Requests

The final enforcement occurs in [`src/run/summary-engine.ts`](https://github.com/steipete/summarize/blob/main/src/run/summary-engine.ts) within the `runSummaryAttempt` function. This function obtains the definitive `maxOutputTokensForCall` by awaiting `deps.resolveMaxOutputTokensForCall` with the resolved model ID. When streaming is disabled or falls back to non-streaming mode, this capped value is passed directly to `summarizeWithModelId`, which forwards it to the provider's SDK.
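
The hand-off can be pictured with a small dependency-injected sketch. Only the names `resolveMaxOutputTokensForCall` and `summarizeWithModelId` come from the description above; the `Deps` type, the argument shapes, and `runAttemptSketch` are hypothetical and cover the non-streaming path only.

```typescript
// Hypothetical dependency shapes; the real signatures may differ.
type Deps = {
  resolveMaxOutputTokensForCall: (modelId: string) => Promise<number | null>;
  summarizeWithModelId: (args: {
    modelId: string;
    prompt: string;
    maxOutputTokens: number | null;
  }) => Promise<string>;
};

// Resolve the capped budget first, then forward it with the request.
async function runAttemptSketch(
  deps: Deps,
  modelId: string,
  prompt: string,
): Promise<string> {
  const maxOutputTokensForCall = await deps.resolveMaxOutputTokensForCall(modelId);
  return deps.summarizeWithModelId({
    modelId,
    prompt,
    maxOutputTokens: maxOutputTokensForCall,
  });
}
```

Injecting both functions keeps the attempt logic testable without a network call: a test can supply stubs and check that the capped value is the one that reaches the provider call.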

For input validation, `runSummaryAttempt` uses `gpt-tokenizer` to check the prompt length against `resolveMaxInputTokensForCall`. If the input exceeds the limit, the CLI throws a clear error before any network request is initiated, saving unnecessary API calls.
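
A pre-flight check of this kind might look like the sketch below. The token counter is injected so the example stays self-contained: `roughCount` is a crude length ÷ 4 stand-in, not `gpt-tokenizer`, and `assertPromptFits` plus its error wording are hypothetical.

```typescript
// Any function that maps text to a token count.
type CountTokens = (text: string) => number;

// Throw before any request is sent if the prompt exceeds the model's input limit.
// A null limit means the limit is unknown, so the check is skipped (assumption).
function assertPromptFits(
  prompt: string,
  maxInputTokens: number | null,
  countTokens: CountTokens,
): number {
  const promptTokens = countTokens(prompt);
  if (maxInputTokens !== null && promptTokens > maxInputTokens) {
    throw new Error(
      `Input is ${promptTokens} tokens, which exceeds the model limit of ${maxInputTokens}.`,
    );
  }
  return promptTokens;
}

// Crude stand-in for a real tokenizer: roughly one token per four characters.
const roughCount: CountTokens = (text) => Math.ceil(text.length / 4);

console.log(assertPromptFits("a".repeat(40), 10, roughCount)); // 10
```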

## Practical Examples of Token Budget Control

Explicitly set a hard token limit for precise control:

```bash
summarize https://example.com/article --max-output-tokens 300
```

Allow the CLI to infer the budget from a length preset:

```bash
summarize https://example.com/article --length short
# Internally (illustrative figures): chars ≈ 1,500 → tokens ≈ 375 (minimum 16)
```

Combine both approaches to request a short summary while enforcing a strict upper bound:

```bash
summarize https://example.com/article --length short --max-output-tokens 200
```

In a TypeScript context, you can call the budget resolution helper directly:

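```typescript
// Import paths are illustrative; adjust them to where this file sits
// relative to the repository's src/ directory.
import { resolveDesiredOutputTokens } from './run/run-output';
import { parseLengthArg } from '../flags';

// Convert a length preset into a token budget
const lengthArg = parseLengthArg('short');

// With no explicit --max-output-tokens value, the budget is derived
// from the preset's character target (characters ÷ 4, minimum 16).
const desiredTokens = resolveDesiredOutputTokens({
  lengthArg,
  maxOutputTokensArg: null,
});
```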

## Summary

- The CLI parses `--max-output-tokens` explicitly or derives tokens from `--length` presets using a characters-to-tokens ratio in [`src/run/run-output.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-output.ts).
- The `resolveMaxOutputTokensForCall` helper in [`src/run/run-metrics.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-metrics.ts) caps user requests against model-specific limits from the LiteLLM catalog.
- [`src/run/summary-engine.ts`](https://github.com/steipete/summarize/blob/main/src/run/summary-engine.ts) resolves the final token budget in `runSummaryAttempt` and passes it to the model call, so requests stay within the limits the catalog advertises for that model.
- Input prompts are pre-validated using `gpt-tokenizer` to prevent requests that would exceed the model's input token capacity.
- The orchestration in [`src/run/runner.ts`](https://github.com/steipete/summarize/blob/main/src/run/runner.ts) coordinates this pipeline so that the requested length is reconciled with provider limits before each summary is generated.

## Frequently Asked Questions

### What happens if I don't specify `--max-output-tokens`?

The CLI automatically derives a token budget from the `--length` argument you provide (defaulting to "xl" if omitted). The `resolveDesiredOutputTokens` function in [`src/run/run-output.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-output.ts) calculates this by dividing the character target by four and ensuring a minimum of 16 tokens.

### How does the CLI prevent exceeding model token limits?

Before each API call, the `resolveMaxOutputTokensForCall` function in [`src/run/run-metrics.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-metrics.ts) compares your requested limit against the model's maximum output tokens listed in the LiteLLM catalog. It automatically uses the lower of the two values, ensuring the request complies with provider constraints regardless of user input.

### Can I combine `--length` and `--max-output-tokens` flags?

Yes. When both flags are present, `--max-output-tokens` acts as an explicit cap that overrides the token count derived from `--length`. This allows you to request a "short" summary style while strictly limiting the response to a specific token budget, such as 200 tokens.

### What error occurs if the input text exceeds token limits?

The `runSummaryAttempt` function in [`src/run/summary-engine.ts`](https://github.com/steipete/summarize/blob/main/src/run/summary-engine.ts) uses `gpt-tokenizer` to count input tokens before sending the request. If the prompt exceeds the limit returned by `resolveMaxInputTokensForCall`, the CLI throws a clear error immediately, preventing the API call and saving unnecessary costs.