Greedy Search vs Beam Search vs Temperature Sampling in Hugging Face Transformers

Question

How do greedy search, beam search, and temperature sampling work in Hugging Face Transformers, and how does each text generation strategy trade off determinism, output quality, and diversity?

Accepted Answer

Greedy search deterministically selects the highest probability token at each step, beam search maintains multiple candidate sequences to find globally better outputs, and temperature sampling introduces controlled randomness by scaling logits before drawing from the probability distribution.

The generate() method in the Hugging Face Transformers repository supports several decoding strategies; this answer covers three that behave very differently: greedy search, beam search, and temperature sampling. All three are configured through the GenerationConfig class, and the choice determines whether the model produces deterministic or stochastic outputs. Understanding how these algorithms differ lets you balance generation speed, output quality, and creative diversity.

How Generation Strategies Are Selected

The routing logic in src/transformers/generation/utils.py selects a decoding strategy based on two parameters: the integer num_beams and the boolean do_sample. When do_sample=False, the system enters deterministic mode, using greedy search if num_beams=1 or beam search if num_beams>1. Setting do_sample=True activates stochastic sampling, where logits are processed through warpers like TemperatureLogitsWarper before multinomial sampling occurs.
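
The selection rule above can be sketched as a small pure-Python function. This is only an illustration: the name select_strategy and the returned labels are hypothetical, and the library's real dispatch covers more modes than the ones discussed here.

```python
def select_strategy(num_beams: int = 1, do_sample: bool = False) -> str:
    """Hypothetical helper that mirrors the rule in the text; not library code."""
    if num_beams < 1:
        raise ValueError("num_beams must be a positive integer")
    if do_sample:
        # Stochastic decoding; with several beams this becomes beam sampling.
        return "beam sampling" if num_beams > 1 else "sampling"
    # Deterministic decoding.
    return "beam search" if num_beams > 1 else "greedy search"


for beams, sample in [(1, False), (5, False), (1, True), (5, True)]:
    print(f"num_beams={beams}, do_sample={sample} -> {select_strategy(beams, sample)}")
```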

Greedy Search (Deterministic Single-Path Decoding)

Implementation Details

Greedy search is the simplest decoding strategy. It runs in the no-beam path of src/transformers/generation/utils.py when generation_config.num_beams=1 and generation_config.do_sample=False. The method cited here, _generate_no_beam_search (around lines 3420–3450), takes the argmax of the logits at every generation step; method names and line numbers in this file change between releases, so check them against the version you have installed. Because the token with the highest conditional probability is always selected and no randomness is involved, the output is identical across runs with the same input.
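
As a rough illustration of the argmax step, the sketch below runs greedy decoding over a toy lookup table instead of a real model. The table TOY_LOGITS, its numbers, and the function greedy_decode are all made up for this example and do not come from the library.

```python
# Toy stand-in for a model: maps the last token to next-token logits.
# All tokens and scores are invented for illustration.
TOY_LOGITS = {
    "<s>": {"the": 2.0, "a": 1.5, "</s>": -1.0},
    "the": {"cat": 1.2, "dog": 1.0, "</s>": 0.1},
    "a": {"cat": 0.3, "dog": 2.5, "</s>": 0.0},
    "cat": {"</s>": 3.0, "the": 0.0, "a": 0.0},
    "dog": {"</s>": 3.0, "the": 0.0, "a": 0.0},
}


def greedy_decode(start="<s>", eos="</s>", max_new_tokens=10):
    tokens = [start]
    for _ in range(max_new_tokens):
        logits = TOY_LOGITS[tokens[-1]]
        # Argmax: always commit to the single highest-scoring token.
        next_token = max(logits, key=logits.get)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


print(greedy_decode())
```

Calling greedy_decode() repeatedly returns the same token list every time, which is the reproducibility property described above.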

When to Use Greedy Search

Use greedy search when you require fast, reproducible results and can tolerate potentially repetitive or locally suboptimal completions. This method works well for factual extraction or constrained generation tasks where creativity is undesirable. Its weakness is that it commits to one token per step and cannot revisit an early choice that later leads to a low-probability continuation, so it can miss the sequence with the highest overall probability.

Beam Search (Deterministic Multi-Hypothesis Decoding)

Implementation Details

Beam search maintains multiple candidate sequences simultaneously to explore a broader solution space. Implemented in the _beam_search method of src/transformers/generation/utils.py (starting around line 3062), this algorithm keeps num_beams parallel hypotheses. At each timestep, every active beam is expanded with potential next tokens, the joint log-probabilities are computed for all extensions, and only the top-scoring beams survive to the next iteration. Keeping alternatives alive lets the search prefer a sequence whose first token was not the most likely one but whose overall probability is higher, so it typically finds higher-probability sequences than greedy decoding.
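
The expand, score, and prune loop described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: TOY_PROBS and beam_search are invented for the example, and the sketch omits details a real implementation deals with, such as length penalties and batching.

```python
import math

# Toy stand-in for a model: maps the last token to next-token probabilities.
# All tokens and numbers are invented for illustration.
TOY_PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"dog": 0.9, "cat": 0.05, "</s>": 0.05},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams=2, start="<s>", eos="</s>", max_new_tokens=5):
    # Each hypothesis is (cumulative log-probability, token list).
    beams = [(0.0, [start])]
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:
                candidates.append((score, tokens))  # finished: carry over unchanged
                continue
            for token, prob in TOY_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Prune: keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams


best_score, best_tokens = beam_search(num_beams=2)[0]
greedy_score, greedy_tokens = beam_search(num_beams=1)[0]  # one beam reduces to greedy
print(best_tokens, round(math.exp(best_score), 3))
print(greedy_tokens, round(math.exp(greedy_score), 3))
```

In this toy table the most likely first token ("the") leads to a weaker continuation than the runner-up ("a"), so the two-beam search ends with a higher joint probability than the single-beam run.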

When to Use Beam Search

Configure beam search when output quality matters more than generation speed and you need deterministic results. A setting such as num_beams=5 often helps on tasks with a well-defined target, such as translation or summarization; on open-ended generation, beam search can still produce repetitive text. Computational and memory cost grow roughly linearly with the beam count. Set early_stopping=True to stop generation as soon as num_beams complete candidates have been found.

Temperature Sampling (Stochastic Decoding)

Implementation Details

Temperature sampling introduces randomness by reshaping the probability distribution before drawing tokens. The TemperatureLogitsWarper class in src/transformers/generation/logits_process.py (lines 283–298) implements this by dividing the logits by the temperature parameter: scores / temperature. After warping, the scores are converted to probabilities with softmax and torch.multinomial samples the next token from the resulting distribution. A temperature greater than 1.0 flattens the distribution, increasing diversity and randomness, while a value below 1.0 sharpens it, making the model more conservative and less random. The temperature must be strictly positive.
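
To make the effect of scores / temperature concrete, here is a minimal standard-library sketch. The helpers softmax and apply_temperature and the example logits are assumptions made for illustration, and random.Random.choices stands in for torch.multinomial.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    return [x / temperature for x in logits]


logits = [2.0, 1.0, 0.1]  # made-up scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax(apply_temperature(logits, t))
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

# Draw one token index from the warped distribution.
rng = random.Random(0)  # fixed seed so the draw is repeatable
probs = softmax(apply_temperature(logits, 0.7))
sampled_index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print("sampled token index:", sampled_index)
```

The printed rows show the top token's probability shrinking as the temperature rises, which is the flattening described above.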

Configuring Temperature and Sampling

Activate temperature sampling by setting do_sample=True and specifying a temperature value in the GenerationConfig. This mode supports additional constraints like top_k and top_p (nucleus sampling) to prevent sampling from the long tail of unlikely tokens. Unlike the deterministic methods, repeated calls can return different outputs unless the random seed is fixed, which makes this approach well suited to creative writing or brainstorming applications.
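
The idea behind top_k and top_p can be sketched on a plain probability list. The functions top_k_filter and top_p_filter below are hypothetical helpers written for this example; they work on probabilities and renormalize, which is a simplification of how a library would mask logits.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize the rest to zero."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

In both cases the least likely token ends up with zero probability, so it can never be drawn in the sampling step.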

Practical Implementation Examples

The following example demonstrates how to configure each strategy using the generate() API:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Once upon a time"
# Keep the attention mask together with input_ids so generate() does not have to infer it.
inputs = tokenizer(prompt, return_tensors="pt")
# GPT-2 defines no pad token; reuse EOS so generate() does not warn about it.
pad_token_id = tokenizer.eos_token_id

# 1. Greedy search (deterministic)
greedy_output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,
    num_beams=1,
    pad_token_id=pad_token_id,
)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

# 2. Beam search (deterministic, 5 beams)
beam_output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,
    num_beams=5,
    early_stopping=True,
    pad_token_id=pad_token_id,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

# 3. Temperature sampling (stochastic; output varies between runs)
temp_output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    pad_token_id=pad_token_id,
)
print(tokenizer.decode(temp_output[0], skip_special_tokens=True))

Summary

Greedy search selects the single highest-probability token at each step using argmax in _generate_no_beam_search, providing fast but potentially myopic results.
Beam search explores multiple candidate sequences in parallel via the _beam_search method, offering higher quality deterministic outputs at increased computational cost.
Temperature sampling applies randomness through the TemperatureLogitsWarper class, enabling diverse generation controlled by the temperature parameter.
All three strategies are configured via GenerationConfig parameters (num_beams, do_sample, temperature) and executed through the unified generate() API in src/transformers/generation/utils.py.

Frequently Asked Questions

Which generation method produces the most deterministic output?

Greedy search produces completely deterministic outputs when do_sample=False and num_beams=1, as implemented in the _generate_no_beam_search path in utils.py. Beam search is also deterministic when sampling is disabled, though it may produce different results than greedy search due to its multi-hypothesis nature.

How does temperature affect text diversity in Transformers?

Temperature scales logits before the softmax operation in TemperatureLogitsWarper (lines 283–298 of logits_process.py). Values above 1.0 increase entropy and diversity by flattening the probability distribution, while values approaching 0.0 concentrate almost all probability on the top token, effectively approximating greedy decoding.

Can I combine beam search with temperature sampling?

Yes. Setting num_beams>1 together with do_sample=True enables beam sampling: the model maintains multiple beams but samples each beam's next token from the temperature-adjusted distribution instead of selecting the argmax. This hybrid, implemented in src/transformers/generation/utils.py, balances diversity with structured search.

Where is the generation logic implemented in the Transformers source code?

The main generation loop resides in src/transformers/generation/utils.py, containing both the _beam_search method and the greedy fallback path. Temperature processing and other logit warpers are defined in src/transformers/generation/logits_process.py, while high-level strategy documentation lives in docs/source/en/generation_strategies.md.
