# How ModelOutput Classes Are Structured for Different Task Heads in Hugging Face Transformers

> Explore the structured ModelOutput classes in Hugging Face Transformers. Learn how this unified hierarchy ensures consistent access and pytree compatibility for all model heads.

- Repository: [Hugging Face/transformers](https://github.com/huggingface/transformers)
- Tags: internals
- Published: 2026-02-22

---

**Hugging Face Transformers implements a unified dataclass-based hierarchy where all task-specific outputs inherit from the base `ModelOutput` class, providing consistent dictionary-like access, attribute access, and PyTorch pytree compatibility across every model head.**

The Transformers library standardizes model return values through this output hierarchy. Knowing how `ModelOutput` classes are structured for each task head helps you write cleaner inference code and use built-in utilities such as tuple conversion and distributed-training support.

## The Base ModelOutput Architecture

All output types in the library descend from the core **`ModelOutput`** class defined in [`src/transformers/utils/generic.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/generic.py). The base class inherits from `OrderedDict`, and every concrete subclass is decorated with `@dataclass`, so an output behaves as both an object with typed fields and a dictionary.

```python
class ModelOutput(OrderedDict):
    ...
    def __getitem__(self, k):
        if isinstance(k, str):
            # String keys behave like dictionary lookups
            return dict(self.items())[k]
        else:
            # Integers and slices index into the tuple of non-None fields
            return self.to_tuple()[k]
    ...
```

The `ModelOutput` base provides three critical capabilities:

- **Tuple-like indexing** – Access outputs by integer index (e.g., `output[0]` for the first field)
- **Safe attribute access** – Use dot notation (e.g., `output.logits`); a field the model did not compute is simply `None`
- **PyTorch pytree registration** – Registration in `__init_subclass__` keeps outputs compatible with `torch.nn.parallel.DistributedDataParallel` when `static_graph=True`

When a concrete output dataclass is instantiated, `ModelOutput.__post_init__` automatically populates the underlying `OrderedDict` with any non-`None` attributes. This design allows the same object to behave like a dictionary (`output["logits"]`) while preserving attribute access (`output.logits`).
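
The mechanism can be sketched without the library. The classes below (`MiniModelOutput`, `MiniClassifierOutput`) are hypothetical stand-ins written for illustration, not the actual Transformers code, and plain lists take the place of tensors.

```python
from collections import OrderedDict
from dataclasses import dataclass, fields
from typing import Any, Optional


class MiniModelOutput(OrderedDict):
    """Simplified stand-in for ModelOutput (illustrative only)."""

    def __post_init__(self):
        # Mirror every non-None dataclass field into the underlying dict.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                self[f.name] = value

    def __getitem__(self, k):
        if isinstance(k, str):
            # String key: ordinary dictionary lookup.
            return super().__getitem__(k)
        # Integer or slice: index into the tuple of populated fields.
        return self.to_tuple()[k]

    def to_tuple(self):
        return tuple(self.values())


@dataclass
class MiniClassifierOutput(MiniModelOutput):
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None


out = MiniClassifierOutput(logits=[0.1, 0.9])

print(out.logits)        # attribute access
print(out["logits"])     # dictionary access
print(out[0])            # tuple-style access
print(list(out.keys()))  # only populated fields appear as keys
```

All three access styles return the same object, and `loss` and `hidden_states` never enter the dictionary because they were left as `None`.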

## Task-Specific Output Classes

Every task head declares its return type as a thin dataclass in [`src/transformers/modeling_outputs.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_outputs.py). These classes specify which fields the head computes, such as `loss`, `logits`, or `past_key_values`.

### Causal Language Modeling Outputs

Autoregressive models return variants of the causal LM output family. The simplest, **`CausalLMOutput`** (line 28), contains `loss`, `logits`, `hidden_states`, and `attentions`.

Models that keep a key/value cache return **`CausalLMOutputWithPast`** (line 58), which has the same fields plus `past_key_values` for efficient generation. The output class is fixed by the model, not by the call: `use_cache` only controls whether `past_key_values` is populated or left as `None`. GPT-2, used here, returns the `CausalLMOutputWithCrossAttentions` variant, which also carries `past_key_values`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")
output = model(**inputs)               # → CausalLMOutputWithCrossAttentions for GPT-2

logits = output.logits                 # torch.FloatTensor (1, seq_len, vocab_size)

# Request the cache explicitly
output_with_cache = model(**inputs, use_cache=True)
past = output_with_cache.past_key_values  # Cache for next token generation
```
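
To see why a key/value cache saves work, here is a toy single-head attention over scalar "embeddings". The projection constants and helper names (`project`, `full_pass`, `cached_step`) are made up for illustration and do not correspond to any library function; the point is only that processing one new token against cached keys and values gives the same result as recomputing the whole prefix.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one scalar query over scalar keys and values."""
    scores = [query * k for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def project(x):
    """Hypothetical fixed projections: (query, key, value) for one token."""
    return 0.5 * x, 0.8 * x, 1.5 * x


def full_pass(embeddings):
    """Recompute keys and values for the whole prefix, attend from the last token."""
    keys = [project(x)[1] for x in embeddings]
    values = [project(x)[2] for x in embeddings]
    return attend(project(embeddings[-1])[0], keys, values)


def cached_step(new_embedding, past):
    """Process only the new token and extend the cache with its key and value."""
    query, key, value = project(new_embedding)
    keys, values = past
    keys, values = keys + [key], values + [value]
    return attend(query, keys, values), (keys, values)


embeddings = [0.3, -1.2, 0.7, 2.0]

past = ([], [])  # empty cache
for x in embeddings:
    cached_out, past = cached_step(x, past)

print(math.isclose(cached_out, full_pass(embeddings)))
```

Each cached step performs one projection instead of one per prefix token, which is the saving `past_key_values` provides during generation.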

Decoders that can attend to encoder outputs use **`CausalLMOutputWithCrossAttentions`** (line 93), adding `cross_attentions` to the field list.

### Classification Outputs

Sequence classification heads return **`SequenceClassifierOutput`** (line 960), whose `logits` field holds class scores with shape `(batch_size, num_labels)`. Classifiers built on decoder models that keep a key/value cache return **`SequenceClassifierOutputWithPast`** (line 735), which includes `past_key_values`.

Token-level tasks like NER use **`TokenClassifierOutput`** (line 78), which shares the same structure as the sequence variant but returns `logits` with shape `(batch_size, sequence_length, num_labels)`.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love this movie!", return_tensors="pt")
out = model(**inputs)                         # → SequenceClassifierOutput

# No labels were passed, so loss is None and to_tuple() holds only the logits
(logits,) = out.to_tuple()

pred = logits.argmax(dim=-1).item()
```
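
The difference in `logits` shape between the sequence and token variants can be illustrated with nested Python lists standing in for tensors. The logit values and the `id2label` map are invented for this sketch.

```python
def argmax(scores):
    """Index of the largest score in a flat list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Sequence classification: logits shaped (batch_size, num_labels).
sequence_logits = [[-1.3, 2.1]]
sequence_preds = [argmax(row) for row in sequence_logits]

# Token classification: logits shaped (batch_size, sequence_length, num_labels).
token_logits = [[[2.0, 0.1, -0.5], [0.2, 1.7, 0.3], [0.0, -0.2, 3.1]]]
token_preds = [[argmax(tok) for tok in seq] for seq in token_logits]

id2label = {0: "O", 1: "B-PER", 2: "I-PER"}  # hypothetical label map
token_labels = [[id2label[i] for i in seq] for seq in token_preds]

print(sequence_preds)  # [1]
print(token_labels)    # [['O', 'B-PER', 'I-PER']]
```

A sequence head yields one prediction per input, while a token head yields one prediction per position, so the argmax runs over the last axis in both cases.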

### Question Answering and Seq2Seq Outputs

Span-based QA models return **`QuestionAnsweringModelOutput`** (line 107), which splits predictions into `start_logits` and `end_logits` rather than a single logits tensor.

Encoder-decoder architectures use **`Seq2SeqLMOutput`** (line 800), the most comprehensive output type. It includes:
- `encoder_last_hidden_state`
- `decoder_hidden_states` and `decoder_attentions`
- `cross_attentions` between encoder and decoder
- `past_key_values` for the decoder cache

Similarly, **`Seq2SeqQuestionAnsweringModelOutput`** (line 138) adapts the QA structure for encoder-decoder models while maintaining cache fields.

### Specialized Outputs

Masked language modeling heads return **`MaskedLMOutput`** (line 69), following the same pattern as causal LM but without past-key-value support.

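Mixture-of-Experts (MoE) models add routing information to the causal LM fields. Mixtral, for example, returns the similarly named `MoeCausalLMOutputWithPast` with `aux_loss` and `router_logits`, while **`MoECausalLMOutputWithPast`** (line 82) carries the full set of MoE-specific fields: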
- `router_logits` – Routing decisions for each token
- `aux_loss` – Load balancing auxiliary loss
- `z_loss` – Router z-loss for training stability
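
As a rough illustration of what these fields represent, the sketch below picks the top-2 experts per token from router logits and computes a z-loss as the mean squared log-sum-exp of those logits. Both the top-2 choice and this z-loss formulation are assumptions based on common MoE practice; individual models may compute these quantities differently, and the helper names are hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    peak = max(xs)
    return peak + math.log(sum(math.exp(x - peak) for x in xs))


def top_k_experts(token_logits, k=2):
    """Indices of the k largest router logits, best first."""
    order = sorted(range(len(token_logits)), key=lambda i: token_logits[i], reverse=True)
    return order[:k]


def router_z_loss(router_logits):
    """Mean over tokens of the squared log-sum-exp of the router logits."""
    return sum(logsumexp(t) ** 2 for t in router_logits) / len(router_logits)


# Two tokens, four experts (values invented for illustration).
router_logits = [
    [2.0, 0.5, -1.0, 0.1],
    [-0.3, 0.0, 1.2, 1.1],
]

routes = [top_k_experts(t) for t in router_logits]
z_loss = router_z_loss(router_logits)

print(routes)  # [[0, 1], [2, 3]]
print(z_loss)
```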

## Common Behaviors and Utilities

All task-specific output classes inherit standardized behaviors from `ModelOutput`:

- **Dictionary-style indexing** – Access fields via strings: `output["logits"]` is equivalent to `output.logits`
- **Tuple conversion** – The `to_tuple()` method returns a tuple of all non-`None` fields, so unpacking such as `loss, logits = output.to_tuple()` works only when exactly those fields are populated (for example, when `labels` were passed so that a loss was computed)
- **Restricted mutation** – Outputs are not immutable: attribute and item assignment keep both views in sync, while `pop`, `update`, `setdefault`, and `del output[key]` raise an exception
- **Filtering** – The `__post_init__` method automatically excludes `None` values from the underlying dictionary representation

These behaviors ensure that pipelines in `src/transformers/pipelines` can process outputs uniformly regardless of the underlying task head.
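
Because `None` fields are skipped, the length of the tuple depends on what the model computed. The stand-in classes below (`SketchOutput`, `SketchClassifierOutput`) are hypothetical and only mimic the filtering behavior described above.

```python
from collections import OrderedDict
from dataclasses import dataclass, fields
from typing import Any, Optional


class SketchOutput(OrderedDict):
    """Illustrative stand-in for ModelOutput; not the library implementation."""

    def __post_init__(self):
        # Only non-None fields become dictionary entries.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                self[f.name] = value

    def to_tuple(self):
        return tuple(self.values())


@dataclass
class SketchClassifierOutput(SketchOutput):
    loss: Optional[Any] = None
    logits: Optional[Any] = None


inference = SketchClassifierOutput(logits=[0.2, 0.8])
training = SketchClassifierOutput(loss=0.35, logits=[0.2, 0.8])

(logits,) = inference.to_tuple()           # one element: loss was never computed
loss, train_logits = training.to_tuple()   # two elements: loss comes first

print(len(inference.to_tuple()), len(training.to_tuple()))  # 1 2
```

When the set of populated fields is not known in advance, attribute access (`output.logits`) is the safer choice, since positions in the tuple shift as fields appear or disappear.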

## Practical Usage Examples

### Accessing Decoder Cache for Efficient Generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

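input_ids = tokenizer("The answer is", return_tensors="pt").input_ids

# First forward pass processes the whole prompt and returns past_key_values
output = model(input_ids, use_cache=True)
past = output.past_key_values

# Pick the next token greedily from the last position
next_token = output.logits[:, -1:].argmax(dim=-1)   # shape (1, 1)

# Feed only the new token; the cache already covers the prompt
next_token_logits = model(next_token, past_key_values=past, use_cache=True).logits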
```

### Handling Seq2Seq Model Outputs

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

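inputs = tokenizer("translate English to French: Hello world", return_tensors="pt")
labels = tokenizer("Bonjour le monde", return_tensors="pt").input_ids

# T5 needs decoder inputs; passing labels derives them and also computes the loss
outputs = model(**inputs, labels=labels)  # → Seq2SeqLMOutput

# Access encoder states and decoder logits
encoder_states = outputs.encoder_last_hidden_state
decoder_logits = outputs.logits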
```

## Summary

- **Base architecture**: All outputs inherit from `ModelOutput` in [`src/transformers/utils/generic.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/generic.py), an `OrderedDict` subclass whose concrete subclasses are dataclasses, so every output supports both attribute and dictionary access.
- **Task-specific classes**: Concrete implementations live in [`src/transformers/modeling_outputs.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_outputs.py), with each head (CausalLM, SequenceClassification, QuestionAnswering, Seq2Seq) declaring its specific fields like `logits`, `past_key_values`, or `start_logits`.
- **Unified interface**: Every output supports tuple unpacking via `to_tuple()`, integer indexing, and PyTorch pytree registration for distributed training compatibility.
- **Extension pattern**: Specialized variants (e.g., `WithPast`, `WithCrossAttentions`, MoE outputs) repeat the base field layout and add task-specific tensors, keeping the API consistent.

## Frequently Asked Questions

### What is the difference between CausalLMOutput and CausalLMOutputWithPast?

**CausalLMOutput** is the simplest output for autoregressive language models and contains `loss`, `logits`, `hidden_states`, and `attentions`. **CausalLMOutputWithPast** has the same fields plus `past_key_values`, a cache of previous key and value tensors that enables efficient token-by-token generation without recomputing attention for prior context. Which class you receive depends on the model rather than on the call: cache-capable models such as Llama return `CausalLMOutputWithPast`, and `use_cache=True` determines whether `past_key_values` is filled in or left as `None`.

### How do ModelOutput classes support both attribute and dictionary-style access?

The `ModelOutput` base class in [`src/transformers/utils/generic.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/generic.py) inherits from `OrderedDict`, and each concrete output subclass applies the `@dataclass` decorator. The `__post_init__` method populates the dictionary with the non-`None` dataclass fields, and `__getitem__` handles both string keys (dictionary access) and integer indices (tuple access). This dual interface allows code like `output["logits"]` and `output.logits` to return identical tensors.

### Can ModelOutput objects be used with PyTorch's DistributedDataParallel?

Yes. The `ModelOutput` class includes pytree registration logic in `__init_subclass__` that makes instances compatible with PyTorch's tree utilities. This registration ensures that `DistributedDataParallel` with `static_graph=True` can properly handle gradient synchronization across devices when model outputs contain complex nested structures like `past_key_values` tuples.
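
Conceptually, pytree registration supplies a pair of functions: one that splits an output into its leaf values plus the context needed to rebuild it, and one that reverses the split. The sketch below imitates that idea in plain Python; `flatten`, `unflatten`, and the stand-in classes are hypothetical names, and the real registration goes through PyTorch's own pytree utilities.

```python
from collections import OrderedDict
from dataclasses import dataclass, fields
from typing import Any, Optional


class SketchOutput(OrderedDict):
    """Illustrative stand-in for ModelOutput; not the library implementation."""

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                self[f.name] = value


@dataclass
class SketchLMOutput(SketchOutput):
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    past_key_values: Optional[Any] = None


def flatten(output):
    """Split an output into leaf values and the context needed to rebuild it."""
    return list(output.values()), (type(output), list(output.keys()))


def unflatten(values, context):
    """Rebuild an output of the original type from (possibly transformed) leaves."""
    output_type, keys = context
    return output_type(**dict(zip(keys, values)))


original = SketchLMOutput(logits=[1.0, 2.0], past_key_values=("k", "v"))
leaves, context = flatten(original)

# A framework can now transform the leaves without knowing the container type.
doubled = [[x * 2 for x in leaf] if isinstance(leaf, list) else leaf for leaf in leaves]
rebuilt = unflatten(doubled, context)

print(type(rebuilt).__name__, rebuilt.logits)
```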

### Where are new task-specific output classes defined in the Transformers library?

New output classes should be defined in [`src/transformers/modeling_outputs.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_outputs.py) following the existing pattern: subclass `ModelOutput`, use the `@dataclass` decorator, declare fields with type hints, and set default values to `None`. Model implementations in `src/transformers/models/*/modeling_*.py` then import and instantiate these classes in their `forward` methods to ensure type consistency across the library.