Message Compaction for Optimizing Conversation History in Long-Running Flue Agents
Flue's compaction layer checks the conversation history after each turn and summarizes it once it exceeds a token threshold, preventing unbounded memory growth and reducing token costs.
Long-running Flue agents maintain stateful sessions that accumulate every interaction between the user and the LLM. Without intervention, these message histories grow indefinitely, leading to increased latency, higher API costs, and potential context window exhaustion. The withastro/flue repository addresses this with a message compaction system implemented in packages/sdk/src/compaction.ts, which runs after each turn to keep payloads lightweight.
Why Message Compaction Matters for Long-Running Agents
The Unbounded Growth Problem
Flue agents execute as stateful LLM-driven services that preserve a running log of every interaction within a session. When an agent operates over extended periods—whether as a persistent chatbot handling dozens of turns or a background task making periodic LLM calls—the session's message history expands without bound. Each turn adds both the user prompt and assistant response to the context, causing the payload size to grow linearly with conversation length.
Performance and Cost Implications
Unchecked history growth creates three critical issues:
- Increased latency — Larger payloads require more processing time, especially for providers with per-request overhead.
- Higher token costs — Token-based pricing models charge for every token in the context window; unbounded histories directly inflate bills.
- Provider limits — Long-running agents risk hitting maximum context length restrictions imposed by LLM providers.
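To make the cost argument concrete, the sketch below compares total input tokens over a conversation with and without a cap on history size. The function name, the per-turn token figure, and the cap are illustrative assumptions, not measurements from Flue, and the cap is only a crude stand-in for compaction.

```typescript
// Hypothetical model: every turn appends `tokensPerTurn` tokens to the history,
// and each request resends the whole history as input.
function cumulativeInputTokens(turns: number, tokensPerTurn: number, cap?: number): number {
  let history = 0; // tokens currently held in the session history
  let total = 0; // input tokens sent across all turns so far
  for (let turn = 0; turn < turns; turn++) {
    history += tokensPerTurn;
    // Crude stand-in for compaction: never let the history exceed the cap.
    if (cap !== undefined && history > cap) history = cap;
    total += history;
  }
  return total;
}

const uncapped = cumulativeInputTokens(50, 200); // 255000
const capped = cumulativeInputTokens(50, 200, 3000); // 129000
console.log({ uncapped, capped });
```

With these hypothetical numbers, the per-request payload grows linearly, so the total tokens billed over the conversation grow roughly quadratically unless the history is capped.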
How Flue Implements Message Compaction
The Compaction Algorithm
The compaction logic in packages/sdk/src/compaction.ts executes a three-step strategy after each turn:
- Identify a stable point — The algorithm locates the first message containing a complete, self-contained answer, typically the assistant's most recent response.
- Merge earlier messages — All messages preceding the stable point fold into a single summary message that captures essential context. This summary assumes a user role in the message list.
- Replace the original sequence — The system substitutes the merged history with the summary plus the most recent few messages (defaulting to approximately 5), dramatically shrinking payloads from dozens of kilobytes to a few hundred bytes while preserving semantic continuity.
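The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in packages/sdk/src/compaction.ts: the Message shape, the compactMessages name, the rule used to pick the stable point (the cut moves back until the retained tail starts right after an assistant reply), and the synchronous summarize callback are all assumptions made for brevity.

```typescript
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical helper for illustration only.
function compactMessages(
  messages: Message[],
  keepLast: number,
  summarize: (older: Message[]) => string,
): Message[] {
  // Step 1: pick a stable point. Start at the keepLast boundary and move the
  // cut earlier until the message just before it is an assistant reply.
  let cut = Math.max(messages.length - keepLast, 0);
  while (cut > 0 && messages[cut - 1]?.role !== 'assistant') cut--;
  if (cut === 0) return messages; // nothing can safely be folded away

  // Step 2: merge everything before the cut into one user-role summary.
  const summary: Message = {
    role: 'user',
    content: summarize(messages.slice(0, cut)),
  };

  // Step 3: replace the original sequence with the summary plus the retained tail.
  return [summary, ...messages.slice(cut)];
}
```

Because the cut only ever moves earlier, this variant keeps at least keepLast messages verbatim and never splits a user prompt from the assistant reply that answers it.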
Integration with the Session Lifecycle
The compaction system hooks into Flue's Session → Run → Turn architecture. Every turn originates from session.prompt (or session.skill), which records the interaction via runStore.record in run-store.ts / node/run-store.ts. After the LLM response returns, session.prompt invokes compactIfNeeded() to evaluate whether the message list exceeds the configured maxTokens threshold. If compaction triggers, the shortened list immediately replaces the session's stored messages, ensuring subsequent runs operate on the optimized history.
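A rough sketch of the post-turn check described above, assuming a four-characters-per-token estimate and an in-memory message list. SessionSketch, estimateTokens, recordTurn, and the synchronous summarize callback are hypothetical; only the compactIfNeeded name and the maxTokens and keepLast options come from the text, and the real method's signature may differ.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Crude heuristic (assumption): roughly four characters per token.
function estimateTokens(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

class SessionSketch {
  messages: ChatMessage[] = [];
  maxTokens: number;
  keepLast: number;
  summarize: (older: ChatMessage[]) => string;

  constructor(maxTokens: number, keepLast: number, summarize: (older: ChatMessage[]) => string) {
    this.maxTokens = maxTokens;
    this.keepLast = keepLast;
    this.summarize = summarize;
  }

  // Record one completed turn, then run the threshold check.
  recordTurn(userText: string, assistantText: string): void {
    this.messages.push(
      { role: 'user', content: userText },
      { role: 'assistant', content: assistantText },
    );
    this.compactIfNeeded();
  }

  // Replace the stored list only when the estimate exceeds maxTokens.
  compactIfNeeded(): void {
    if (estimateTokens(this.messages) <= this.maxTokens) return;
    const cut = this.messages.length - this.keepLast;
    if (cut <= 0) return;
    const summary: ChatMessage = {
      role: 'user',
      content: this.summarize(this.messages.slice(0, cut)),
    };
    this.messages = [summary, ...this.messages.slice(cut)];
  }
}
```

The stored list is swapped immediately, so the next turn starts from the shortened history. A real summarizer would call a model asynchronously.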
Provider Abstraction for Summaries
Flue delegates summarization to the provider layer defined in packages/sdk/src/runtime/providers.ts. By default, the system uses the configured LLM provider (e.g., Anthropic or OpenAI) with a lightweight summarization prompt. Developers may alternatively supply a custom summarizer function to the compaction configuration for specialized summarization strategies or to use a cheaper model for the condensation step.
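One way to picture the fallback between a custom summarizer and the default provider is sketched below. The Summarizer type, buildSummaryPrompt, and resolveSummarizer are hypothetical names introduced for illustration; the actual interfaces in packages/sdk/src/runtime/providers.ts are not shown here and may look different.

```typescript
interface TranscriptMessage {
  role: string;
  content: string;
}

// Assumed shape: a summarizer takes the messages to condense and resolves to text.
type Summarizer = (messages: TranscriptMessage[]) => Promise<string>;

// Build a lightweight summarization prompt from the messages being folded away.
function buildSummaryPrompt(messages: TranscriptMessage[], tokenBudget: number): string {
  const transcript = messages.map((m) => `${m.role}: ${m.content}`).join('\n');
  return `Summarize the following conversation in under ${tokenBudget} tokens:\n${transcript}`;
}

// Prefer the developer-supplied summarizer; otherwise fall back to the default.
function resolveSummarizer(custom: Summarizer | undefined, fallback: Summarizer): Summarizer {
  return custom ?? fallback;
}
```

Keeping the prompt builder separate from the model call makes it straightforward to point the condensation step at a cheaper model without touching the main agent logic.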
Configuring Message Compaction in Flue
Enable compaction through the init() function by providing a compaction configuration object. The following example demonstrates setting a custom token limit and preserving the last five messages unsummarized:
import { init } from '@flue/sdk';

// Enable compaction with a custom token limit (default ≈ 4k tokens)
const harness = init({
  model: 'anthropic/claude-sonnet-4-6',
  compaction: {
    // Compact when the total token count exceeds 3000
    maxTokens: 3000,
    // Keep the last N messages un-summarized (default 5)
    keepLast: 5,
    // Optional custom summarizer using a cheaper model.
    // harness is only read when the summarizer runs, so this self-reference is
    // safe as long as init() does not invoke the summarizer synchronously.
    summarizer: async (messages) => {
      const summary = await harness.provider.run({
        model: 'anthropic/claude-haiku-4-5',
        prompt: `Summarize the following conversation in <200 tokens:\n${messages.map(m => `${m.role}: ${m.content}`).join('\n')}`,
      });
      return summary;
    },
  },
});
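A minimal sketch of how the defaults mentioned above might be applied when options are omitted. The CompactionOptions type, the resolveCompaction helper, and the exact default of 4000 are assumptions for illustration (the text only says the default is roughly 4k tokens); the SDK may resolve its configuration differently.

```typescript
interface CompactionOptions {
  maxTokens?: number;
  keepLast?: number;
}

interface ResolvedCompaction {
  maxTokens: number;
  keepLast: number;
}

// Hypothetical helper: fill in any option the caller left out.
function resolveCompaction(options: CompactionOptions = {}): ResolvedCompaction {
  return {
    maxTokens: options.maxTokens ?? 4000, // assumed value for "≈ 4k tokens"
    keepLast: options.keepLast ?? 5,
  };
}

console.log(resolveCompaction({ maxTokens: 2500 }));
```

Under these assumptions, a configuration that sets only maxTokens still keeps the last five messages unsummarized.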
For a complete agent implementation that benefits from automatic compaction, refer to examples/hello-world/.flue/agents/with-compaction.ts:
import { init } from '@flue/sdk';

export default async function (ctx) {
  const harness = init({
    model: 'openai/gpt-4o-mini',
    compaction: { maxTokens: 2500 },
  });

  // Simple echo loop – each user message triggers a response
  while (true) {
    const userMsg = await ctx.prompt('User says:');
    const reply = await harness.prompt({
      role: 'assistant',
      content: `You said: ${userMsg}`,
    });
    ctx.output(reply);
  }
}
After several dozen turns, the session's internal message list automatically compacts, maintaining responsive performance and minimizing token expenditure.
Summary
- Message compaction in Flue triggers automatically after each turn when the maxTokens threshold is exceeded, preventing unbounded context growth in long-running agents.
- The compaction algorithm in packages/sdk/src/compaction.ts identifies stable points, merges historical messages into a single user-role summary, and retains the most recent messages (default 5) unsummarized.
- Configuration occurs through the init() function's compaction option, supporting custom maxTokens limits, keepLast counts, and optional custom summarizer functions.
- The system integrates with Flue's session lifecycle via compactIfNeeded() and persists compacted state through run-store.ts, reducing latency, token costs, and memory usage.
Frequently Asked Questions
When does compaction trigger in Flue?
Compaction is evaluated after each turn, when session.prompt calls compactIfNeeded() in packages/sdk/src/compaction.ts. The system checks whether the total token count of the current message list exceeds the configurable maxTokens threshold (default approximately 4,000 tokens). If the history exceeds this limit, compaction runs right away, so the next LLM call operates on the shortened history.
How does Flue's compaction preserve conversation context?
The algorithm preserves context by identifying a stable point—typically the most recent complete assistant response—and collapsing only the messages preceding it into a summary. This summary captures the essential semantic content of the removed history, while the recent messages (controlled by the keepLast parameter, defaulting to 5) remain untouched in their original form, ensuring the LLM retains immediate conversational context.
Can I use a custom LLM for summarization?
Yes. The compaction configuration accepts an optional summarizer function that receives the message array requiring condensation. You can implement this function to call any provider or model available through packages/sdk/src/runtime/providers.ts, such as using a smaller, faster model like Claude Haiku for cost-effective summarization while reserving the primary model for the main agent logic.
What happens to the original messages after compaction?
After compaction completes, the original messages that were summarized are removed from the active session state and replaced by the single summary message. This updated message list immediately persists to the session store (run-store.ts / node/run-store.ts), meaning future turns retrieve only the compacted history. The compaction is destructive to the original message array but preserves the semantic content through the summary.