# Greedy Search vs Beam Search vs Temperature Sampling in Hugging Face Transformers

> Explore Greedy Search, Beam Search, and Temperature Sampling in Hugging Face Transformers. Understand how each text generation strategy works to produce diverse and optimal outputs.

- Repository: [huggingface/transformers](https://github.com/huggingface/transformers)
- Tags: deep-dive
- Published: 2026-02-21

---

**Greedy search deterministically selects the highest-probability token at each step, beam search keeps several candidate sequences alive to find a sequence with higher overall probability, and temperature sampling introduces controlled randomness by scaling logits before drawing from the probability distribution.**

The `generate()` method in the Hugging Face Transformers repository supports several decoding strategies, and this article compares three of them: greedy search, beam search, and temperature sampling. All three are configured through the `GenerationConfig` class, and the choice determines whether the model produces deterministic or stochastic outputs. Understanding how these algorithms differ lets you balance generation speed, output quality, and creative diversity.

## How Generation Strategies Are Selected

The routing logic in [`src/transformers/generation/utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py) selects a decoding strategy based on two settings: the integer `num_beams` and the boolean `do_sample`. When `do_sample=False`, generation is deterministic: greedy search if `num_beams=1`, beam search if `num_beams>1`. Setting `do_sample=True` activates stochastic sampling, where logits pass through warpers such as `TemperatureLogitsWarper` before a token is drawn by multinomial sampling.
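The mapping from these two settings to a strategy can be sketched as a small pure-Python function. This is a simplified illustration rather than the library's routing code: `select_decoding_strategy` is a hypothetical helper, and it ignores every other option that `generate()` takes into account.

```python
def select_decoding_strategy(num_beams: int = 1, do_sample: bool = False) -> str:
    """Return a label for the strategy implied by num_beams and do_sample.

    Hypothetical helper for illustration only; not part of the Transformers API.
    """
    if num_beams < 1:
        raise ValueError("num_beams must be a positive integer")
    if num_beams == 1:
        # Single hypothesis: argmax when deterministic, multinomial draw otherwise.
        return "sampling" if do_sample else "greedy search"
    # Several hypotheses: classic beam search, or beam sampling when stochastic.
    return "beam sampling" if do_sample else "beam search"


for beams, sample in [(1, False), (5, False), (1, True), (5, True)]:
    label = select_decoding_strategy(beams, sample)
    print(f"num_beams={beams}, do_sample={sample} -> {label}")
```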

## Greedy Search (Deterministic Single-Path Decoding)

### Implementation Details

Greedy search is the simplest decoding strategy, implemented in the no-beam path of [`src/transformers/generation/utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py). When `generation_config.num_beams=1` and `generation_config.do_sample=False`, generation takes the **argmax** of the logits at every step. In recent releases this path is handled by the `_sample` method, which covers both greedy decoding and sampling depending on `do_sample`; very early releases used a method named `_generate_no_beam_search`. Method names and line numbers shift between versions, so treat references such as lines 3420–3450 as approximate. Because no randomness is involved, the output is identical across runs with the same input and model.
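A minimal sketch of the argmax loop, using a toy lookup table in place of a real model. The table `TOY_LOGITS` and the function `greedy_decode` are made up for illustration; the library operates on batched tensors rather than dictionaries.

```python
# Toy next-token scores keyed by the previous token (made-up numbers).
TOY_LOGITS = {
    "<bos>": {"the": 2.0, "a": 1.5, "<eos>": -1.0},
    "the": {"cat": 1.2, "dog": 1.0, "<eos>": 0.1},
    "a": {"dog": 2.5, "cat": 0.3, "<eos>": 0.0},
    "cat": {"<eos>": 2.0, "the": 0.1, "a": 0.0},
    "dog": {"<eos>": 3.0, "the": 0.2, "a": 0.1},
}


def greedy_decode(start="<bos>", max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until <eos>."""
    sequence = [start]
    for _ in range(max_new_tokens):
        logits = TOY_LOGITS[sequence[-1]]
        next_token = max(logits, key=logits.get)  # argmax over the toy vocabulary
        sequence.append(next_token)
        if next_token == "<eos>":
            break
    return sequence


print(greedy_decode())
```

Calling `greedy_decode()` twice returns the same list, which is the reproducibility property described above.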

### When to Use Greedy Search

Use greedy search when you need fast, reproducible results and can tolerate potentially repetitive or locally suboptimal completions. It works well for factual extraction or constrained generation tasks where creativity is undesirable. Its weakness is that it commits to each token immediately: it cannot revise an early choice, so it can miss a sequence with higher overall probability that begins with a slightly less likely token.

## Beam Search (Deterministic Multi-Hypothesis Decoding)

### Implementation Details

Beam search keeps several candidate sequences alive at once to explore a broader part of the search space. Implemented in the `_beam_search` method of [`src/transformers/generation/utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py) (starting around line 3062; line numbers drift between releases), the algorithm maintains `num_beams` parallel hypotheses. At each timestep, every active beam is expanded with possible next tokens, the cumulative log-probability of each extension is computed, and only the top-scoring `num_beams` hypotheses survive to the next iteration. Because a hypothesis that starts with a slightly less likely token can stay alive, beam search usually finds sequences with higher total probability than greedy decoding, although it is still a heuristic and does not guarantee the globally best sequence.
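The expand, score, and prune cycle can be sketched on a toy model as follows. The table `TOY_PROBS` and the function `beam_search` are illustrative assumptions; the sketch leaves out batching, length penalties, and the other scoring options the library applies.

```python
import math

# Toy next-token probabilities keyed by the previous token (made-up numbers).
TOY_PROBS = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "<eos>": 0.25},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}


def beam_search(num_beams=2, max_new_tokens=5):
    """Return the surviving (log_prob, tokens) hypotheses, best first."""
    beams = [(0.0, ["<bos>"])]
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "<eos>":
                candidates.append((score, tokens))  # finished hypotheses carry over
                continue
            for token, prob in TOY_PROBS[tokens[-1]].items():
                # Joint log-probability of the extended sequence.
                candidates.append((score + math.log(prob), tokens + [token]))
        # Prune: keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "<eos>" for _, tokens in beams):
            break
    return beams


for width in (1, 2):
    score, tokens = beam_search(num_beams=width)[0]
    print(width, tokens, round(math.exp(score), 2))
```

With a single beam the procedure reduces to greedy decoding and commits to `the` first; with two beams the hypothesis starting with `a` survives the first pruning step and ends up with the higher sequence probability.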

### When to Use Beam Search

Configure beam search when output quality matters more than generation speed and you need deterministic results. Setting `num_beams=5` or higher often helps on tasks with a well-defined target, such as translation or summarization; on open-ended generation, beam search tends to produce generic or repetitive text. Compute and memory grow roughly linearly with the beam count. With `early_stopping=True`, generation stops as soon as `num_beams` finished candidates have been collected.

## Temperature Sampling (Stochastic Decoding)

### Implementation Details

Temperature sampling introduces randomness by reshaping the probability distribution before drawing tokens. The `TemperatureLogitsWarper` class in [`src/transformers/generation/logits_process.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/logits_process.py) (lines 283–298 at the time of writing) implements this by dividing the logits by the temperature parameter: `scores / temperature`. The warped logits are then converted to probabilities with a softmax, and `torch.multinomial` draws the next token from that distribution. The temperature must be strictly positive. A value greater than 1.0 flattens the distribution, increasing diversity and randomness, while a value below 1.0 sharpens it, making the output more conservative and closer to greedy decoding.
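The effect of dividing logits by the temperature can be checked numerically with a few lines of standard-library Python. The helper names `softmax_with_temperature` and `entropy` and the example logits are assumptions made for this sketch, which mirrors the `scores / temperature` step followed by a softmax rather than reproducing the library's tensor code.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalise with a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

The printed entropy rises with the temperature, and the probability of the top token shrinks, which is the flattening described above.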

### Configuring Temperature and Sampling

Activate temperature sampling by setting `do_sample=True` and specifying a `temperature` value in the `GenerationConfig`. This mode supports additional constraints such as `top_k` and `top_p` (nucleus sampling), which prevent sampling from the long tail of unlikely tokens. Unlike the deterministic methods, repeated calls can produce different outputs unless the random seed is fixed, which makes this approach well suited to creative writing or brainstorming applications.
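To make the `top_k` and `top_p` constraints concrete, here is a simplified sketch that filters a toy probability table before sampling. The functions `top_k_filter`, `top_p_filter`, and `sample` are hypothetical helpers; the library applies equivalent filters to logit tensors, and its exact handling of the cutoff boundary may differ from this version.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up next-token distribution with an unlikely tail token.
probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "xylophone": 0.05}
rng = random.Random(0)  # fixed seed so repeated runs draw the same tokens

print(sorted(top_k_filter(probs, 2)))
print(sorted(top_p_filter(probs, 0.9)))
print([sample(top_k_filter(probs, 2), rng) for _ in range(5)])
```

Both filters remove `xylophone` from the candidate set, so it can never be drawn regardless of the temperature.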

## Practical Implementation Examples

The following example demonstrates how to configure each strategy using the `generate()` API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Once upon a time"
# Keep the attention mask together with the input IDs so generate() does not have to infer it.
inputs = tokenizer(prompt, return_tensors="pt")

# 1️⃣ Greedy Search (deterministic)
greedy_output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,
    num_beams=1,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

# 2️⃣ Beam Search (deterministic, 5 beams)
beam_output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,
    num_beams=5,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

# 3️⃣ Temperature Sampling (stochastic; output can change from run to run)
temp_output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(temp_output[0], skip_special_tokens=True))
```

## Summary

- **Greedy search** selects the single highest-probability token at each step by taking the argmax on the no-beam path of `generate()`, providing fast but potentially myopic results.
- **Beam search** explores multiple candidate sequences in parallel via the `_beam_search` method, offering higher-probability deterministic outputs at increased computational cost.
- **Temperature sampling** applies randomness through the `TemperatureLogitsWarper` class, enabling diverse generation controlled by the temperature parameter.
- All three strategies are configured via `GenerationConfig` parameters (`num_beams`, `do_sample`, `temperature`) and executed through the unified `generate()` API in [`src/transformers/generation/utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py).

## Frequently Asked Questions

### Which generation method produces the most deterministic output?

Greedy search produces completely deterministic outputs when `do_sample=False` and `num_beams=1`, as implemented in the no-beam path of [`utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py). Beam search is equally deterministic when sampling is disabled, though it can return a different sequence than greedy search because it tracks several hypotheses.

### How does temperature affect text diversity in Transformers?

Temperature scales logits before the softmax operation in `TemperatureLogitsWarper` (lines 283–298 of [`logits_process.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/logits_process.py) at the time of writing). Values above 1.0 increase entropy and diversity by flattening the probability distribution, while values approaching 0.0 concentrate almost all probability on the top token, approximating greedy decoding. The temperature itself must be strictly positive.

### Can I combine beam search with temperature sampling?

Yes, setting `num_beams>1` with `do_sample=True` enables beam sampling, where the model maintains multiple beams but samples from the temperature-adjusted distribution within each beam rather than selecting the argmax. This hybrid approach balances diversity with structured search; the routing for it lives in [`src/transformers/generation/utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py).

### Where is the generation logic implemented in the Transformers source code?

The main generation loop resides in [`src/transformers/generation/utils.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py), containing both the `_beam_search` method and the greedy fallback path. Temperature processing and other logit warpers are defined in [`src/transformers/generation/logits_process.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/logits_process.py), while high-level strategy documentation lives in [`docs/source/en/generation_strategies.md`](https://github.com/huggingface/transformers/blob/main/docs/source/en/generation_strategies.md).