# How to Optimize Token Usage and Manage Context Window Limits in LLMs

> Master LLM token usage & context window limits. Learn to count tokens, budget responses, and chunk documents to prevent input truncation and output degradation. Optimize your LLM performance today.

- Repository: [microsoft/generative-ai-for-beginners](https://github.com/microsoft/generative-ai-for-beginners)
- Tags: performance
- Published: 2026-02-26

---

**Large Language Models process text as discrete tokens and enforce a fixed context-window limit; exceeding this limit truncates input or degrades output, so you must count tokens beforehand, reserve budget for responses, and chunk large documents.**

The `microsoft/generative-ai-for-beginners` repository provides a comprehensive curriculum for mastering token economics and context-window management. This guide distills the essential strategies from lessons on prompt engineering, text generation, retrieval-augmented generation (RAG), and model selection to help you build cost-effective, reliable LLM applications.

## Understanding Tokens and Context Windows

LLMs do not read raw characters; they process **tokens**—common sequences of characters that might represent whole words, sub-words, or punctuation. According to the *Prompt Engineering Fundamentals* lesson in [`04-prompt-engineering-fundamentals/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/04-prompt-engineering-fundamentals/README.md), tokenization directly impacts both **cost** (APIs charge per token) and **quality** (splitting words awkwardly can confuse the model).
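
Because billing is per token, a rough cost estimate is simple arithmetic once the token counts are known. The sketch below assumes separate per-1,000-token prices for input and output; `estimate_cost` is a hypothetical helper, and the prices in the usage lines are placeholders rather than any provider's actual rates.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the estimated cost of one request, in the currency of the prices."""
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = completion_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost

# Hypothetical prices, for illustration only.
cost = estimate_cost(prompt_tokens=2000, completion_tokens=500,
                     input_price_per_1k=0.001, output_price_per_1k=0.002)
print(f"Estimated cost: {cost:.4f}")
```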

Every model enforces a **context-window limit**, the maximum number of tokens (prompt and completion combined) it can consider in a single request. As documented in [`21-meta/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/21-meta/README.md), these limits vary by provider, ranging from 8,192 tokens in older models to 128,000 tokens in newer variants like GPT-4-Turbo. Exceeding this window results in either a hard error or truncation of your prompt; when the prompt is truncated, the model loses earlier instructions or context.

## Counting Tokens Before You Send

Pre-flight token counting prevents runtime failures and budget overruns. The repository demonstrates this in [`20-mistral/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/20-mistral/README.md) using the `tiktoken` library, which encodes text using the same tokenizer as OpenAI models.

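```python
import tiktoken

def token_count(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Return the number of tokens `text` occupies for the given model."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding.
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

prompt = "Explain the water cycle in three sentences."
print(f"Prompt uses {token_count(prompt)} tokens.")
```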

Always sum the tokens from **all** message roles—system, user, and assistant—before calling the API.
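
One way to sum across roles is sketched below. The `count` parameter is any function that maps a string to a token count; the default here is a whitespace split, used only as a stand-in so the example runs without dependencies, and in practice you would pass a tokenizer-based counter. The `per_message_overhead` value is an assumption: chat formats add a few tokens of framing per message, but the exact number varies by model.

```python
from typing import Callable

def approx_count(text: str) -> int:
    """Stand-in counter: one token per whitespace-separated word (rough approximation)."""
    return len(text.split())

def count_message_tokens(messages: list[dict],
                         count: Callable[[str], int] = approx_count,
                         per_message_overhead: int = 3) -> int:
    """Sum tokens over every message, whatever its role, plus assumed framing overhead."""
    return sum(count(m["content"]) + per_message_overhead for m in messages)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the water cycle in three sentences."},
]
total = count_message_tokens(messages)
print(f"Conversation uses roughly {total} tokens.")
```

Keeping the counter pluggable lets the same function work with whichever tokenizer matches the target model.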

## Setting Limits with max_tokens

Controlling the response length is critical for staying within budget. The *Text Generation Apps* lesson in [`06-text-generation-apps/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/06-text-generation-apps/README.md) shows how the `max_tokens` parameter in [`oai-app.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/oai-app.py) caps the model’s output, so that the prompt and the completion together fit inside the context window.

```python
from openai import OpenAI

client = OpenAI()          # reads OPENAI_API_KEY from the environment

MAX_WINDOW = 8192          # illustrative limit; use your model's documented window
MAX_RESPONSE = 500         # reserve room for the model's answer

def safe_chat(messages, model="gpt-3.5-turbo"):
    # token_count is the tiktoken helper from the previous section
    total = sum(token_count(m["content"], model) for m in messages)
    if total + MAX_RESPONSE > MAX_WINDOW:
        raise ValueError("Prompt plus reserved response exceeds context window")
    return client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=MAX_RESPONSE,
        temperature=0.7,
    )
```

Setting `max_tokens` conservatively prevents the model from consuming your entire remaining window with a verbose completion.
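
When prompt sizes vary, the cap can also be derived from the space left in the window instead of being a single fixed number. The following is a minimal sketch under that assumption; `response_budget` and its parameter names are hypothetical, and the window size is whatever your model documents.

```python
def response_budget(prompt_tokens: int, window: int, desired: int) -> int:
    """Return the largest max_tokens value that fits, capped at `desired`.

    Raises ValueError when the prompt alone fills the window.
    """
    remaining = window - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt leaves no room for a response")
    return min(desired, remaining)

# A long prompt shrinks the cap; a short prompt gets the full desired length.
print(response_budget(prompt_tokens=7900, window=8192, desired=500))
print(response_budget(prompt_tokens=1000, window=8192, desired=500))
```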

## Chunking Large Documents for RAG

When processing documents that exceed the context window, **chunking** splits text into manageable pieces. The *RAG and Vector Databases* lesson in [`15-rag-and-vector-databases/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/15-rag-and-vector-databases/README.md) recommends chunk sizes around **800 tokens** to balance granularity with semantic coherence.

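```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 800, model: str = "gpt-3.5-turbo"):
    """Yield consecutive pieces of `text`, each at most `chunk_size` tokens long."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    for i in range(0, len(tokens), chunk_size):
        yield enc.decode(tokens[i:i + chunk_size])

with open("transcript.txt", encoding="utf-8") as f:
    long_doc = f.read()
chunks = list(chunk_text(long_doc))

# Embed each chunk and store in a vector DB (pseudo-code):
# `embed` and `vector_db` are placeholders for your embedding function and store.
embeddings = [embed(chunk) for chunk in chunks]
vector_db.upsert(ids=range(len(chunks)), vectors=embeddings)
```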

By retrieving only the most relevant chunks at query time, you stay well within the token limit while still leveraging large source documents.
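
At query time, the retrieved chunks still have to fit the space left after the system message, the question, and the reserved response. A minimal sketch of one possible selection policy follows: it walks the chunks in relevance order and keeps each one that still fits. The function name `pack_chunks`, the greedy policy, and the word-based counter are illustrative assumptions, not something the repository prescribes.

```python
from typing import Callable

def pack_chunks(ranked_chunks: list[str], budget: int,
                count: Callable[[str], int]) -> list[str]:
    """Keep chunks in relevance order, skipping any that would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count(chunk)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected

def word_count(text: str) -> int:
    """Stand-in counter for the example: one token per whitespace-separated word."""
    return len(text.split())

ranked = ["alpha beta gamma", "delta epsilon zeta eta", "theta"]
packed = pack_chunks(ranked, budget=4, count=word_count)
print(packed)
```

Skipping an oversized chunk rather than stopping at it lets smaller, lower-ranked chunks use the leftover space; stopping at the first miss is an equally reasonable choice when strict relevance order matters more.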

## Choosing the Right Model Context Window

Sometimes the only solution is more capacity. As cataloged in [`21-meta/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/21-meta/README.md), model selection directly dictates your available context:

- **8K models**: Early GPT-4 (8,192 tokens); the original GPT-3.5-turbo was smaller still, at 4,096 tokens.
- **32K/128K models**: GPT-4-32k provides 32,768 tokens, while GPT-4-Turbo and Mistral-Large support up to 128,000 tokens.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Switch to a 32k-token model for large prompts
response = client.chat.completions.create(
    model="gpt-4-32k",
    messages=[{"role": "system", "content": "You are a helpful assistant."},
              # very_long_prompt: your oversized input string
              {"role": "user", "content": very_long_prompt}],
    max_tokens=1000,
    temperature=0.6,
)
```

Evaluate the trade-off between token cost (larger windows are typically more expensive per token) and the complexity of chunking or summarization logic.
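
If larger windows cost more, one simple policy is to pick the smallest window that fits the request. The sketch below assumes that policy; `pick_model` is a hypothetical helper, and the model identifiers and window sizes in `CONTEXT_WINDOWS` are illustrative values to be checked against your provider's documentation.

```python
# Illustrative window sizes; verify against your provider's documentation.
CONTEXT_WINDOWS = {
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-turbo": 128_000,
}

def pick_model(prompt_tokens: int, max_response: int,
               windows: dict[str, int] = CONTEXT_WINDOWS) -> str:
    """Return the model with the smallest window that fits prompt plus response."""
    needed = prompt_tokens + max_response
    fitting = [(size, name) for name, size in windows.items() if size >= needed]
    if not fitting:
        raise ValueError(f"No model fits {needed} tokens; chunk or summarize instead")
    return min(fitting)[1]

print(pick_model(prompt_tokens=7000, max_response=500))
print(pick_model(prompt_tokens=8000, max_response=500))
```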

## Accounting for System Messages and Functions

Hidden token consumers can silently erode your context budget. The *Function Calling* lesson in [`11-integrating-with-function-calling/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/11-integrating-with-function-calling/README.md) warns that **function definitions** and **system messages** are tokenized and counted against the window limit just like user content.

When using tools or multi-turn conversations, always include these overhead tokens in your `safe_chat` calculations:

```python
import json

# Approximate token count including system message and functions
system_msg = "You are a helpful assistant."
functions = [...]  # JSON schema definitions

total_tokens = (
    token_count(system_msg) +
    # Schemas are dicts, so serialize each one to text before counting
    sum(token_count(json.dumps(f)) for f in functions) +
    # `messages` holds the remaining user and assistant turns
    sum(token_count(m["content"]) for m in messages)
)
```
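
In multi-turn conversations the history keeps growing, so at some point older turns have to go. A minimal sketch of one common approach is to keep the system message and drop the oldest turns until the total fits. `trim_history` is a hypothetical helper, the word-based counter is a stand-in for a real tokenizer, and the sketch assumes system messages belong at the start of the conversation.

```python
from typing import Callable

def trim_history(messages: list[dict], budget: int,
                 count: Callable[[str], int]) -> list[dict]:
    """Drop the oldest non-system messages until the total fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(count(m["content"]) for m in msgs)

    while history and total(system) + total(history) > budget:
        history.pop(0)  # remove the oldest turn first
    return system + history  # the input list is left unmodified

def word_count(text: str) -> int:
    """Stand-in counter for the example: one token per whitespace-separated word."""
    return len(text.split())

conversation = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question"},
]
trimmed = trim_history(conversation, budget=7, count=word_count)
print([m["content"] for m in trimmed])
```

The budget passed in should already exclude the tokens reserved for function definitions and for the response.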

## Summary

- **Tokenization is fundamental**: LLMs process tokens, not characters, and APIs charge per token consumed.
- **Measure before sending**: Use `tiktoken` to count tokens in prompts, system messages, and function definitions to avoid overruns.
- **Cap the response**: Set `max_tokens` conservatively, as demonstrated in [`oai-app.py`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/oai-app.py), so the model’s completion cannot consume the rest of the window.
- **Chunk for RAG**: Split documents into ~800-token segments as shown in [`15-rag-and-vector-databases/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/15-rag-and-vector-databases/README.md) to fit retrieval contexts within limits.
- **Scale the window**: Select models with larger context windows (32k or 128k tokens) from the catalog in [`21-meta/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/21-meta/README.md) when chunking is insufficient.
- **Account for overhead**: Remember that system prompts and function schemas consume tokens, as noted in [`11-integrating-with-function-calling/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/11-integrating-with-function-calling/README.md).

## Frequently Asked Questions

### What happens if I exceed the context window limit?

If your prompt plus the requested `max_tokens` exceeds the model’s context window, the API typically rejects the request with an error; some providers and client frameworks instead truncate the oldest parts of your input. Truncation causes the model to lose earlier instructions or conversation history, leading to incomplete or incoherent completions. Always pre-calculate token counts, for example with `tiktoken`, to prevent this scenario.

### How do I choose the right chunk size for RAG applications?

The `microsoft/generative-ai-for-beginners` repository recommends chunk sizes around **800 tokens** for retrieval-augmented generation, as documented in [`15-rag-and-vector-databases/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/15-rag-and-vector-databases/README.md). This size balances semantic coherence (keeping related sentences together) against the need to fit multiple retrieved chunks within the remaining context window alongside the user’s query and system instructions.

### Do system messages and tool definitions count toward the token limit?

Yes. As emphasized in [`11-integrating-with-function-calling/README.md`](https://github.com/microsoft/generative-ai-for-beginners/blob/main/11-integrating-with-function-calling/README.md), system prompts, function schemas, and previous conversation turns all consume tokens from the context budget. When calculating available space for new user input, you must sum tokens from the system message, any function definitions, the conversation history, and the requested `max_tokens` for the completion.