# How Tokenizer-Agnostic Bits-Per-Byte (BPB) Is Calculated in OpenAI's Parameter-Golf

> Learn how tokenizer-agnostic bits-per-byte is calculated in OpenAI parameter-golf. Discover the formula converting cross-entropy loss to bits using token-to-byte ratios.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: deep-dive
- Published: 2026-04-17

---

**The tokenizer-agnostic bits-per-byte (BPB) metric is calculated by converting the average cross-entropy loss from nats to bits, then scaling by the ratio of tokens to bytes derived from SentencePiece vocabulary lookup tables.**

The `openai/parameter-golf` repository evaluates language model compression using this BPB calculation. Unlike standard perplexity metrics that depend on specific tokenizer vocabularies, this approach measures the actual information density relative to raw bytes, enabling fair comparison across different tokenization strategies.

## Understanding the Tokenizer-Agnostic BPB Formula

The final BPB value represents the average number of bits required to encode each byte of text. The calculation follows this mathematical relationship:

```
BPB = (AvgLoss_nats / log(2)) × (TokenCount / ByteCount)
```

Where:
- **AvgLoss_nats** is the mean cross-entropy loss across all tokens (in natural log units)
- **TokenCount / ByteCount** is the inverse of the average bytes per token, derived from the tokenizer's vocabulary mapping

This formula is implemented in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) within the `_loss_bpb` helper function.
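As a quick numeric check of the formula, the sketch below plugs in made-up values. The function name `bits_per_byte` and the numbers are illustrative assumptions, not code from the repository.

```python
import math


def bits_per_byte(avg_loss_nats: float, token_count: int, byte_count: int) -> float:
    """Convert a mean per-token loss in nats to bits per byte."""
    bits_per_token = avg_loss_nats / math.log(2)  # nats -> bits
    tokens_per_byte = token_count / byte_count    # inverse of bytes per token
    return bits_per_token * tokens_per_byte


# Hypothetical run: 2 bits per token on average, 4 bytes per token on average.
example_bpb = bits_per_byte(avg_loss_nats=2 * math.log(2), token_count=1_000, byte_count=4_000)
print(f"{example_bpb:.3f}")  # 2 bits/token * 0.25 tokens/byte = 0.5 bits/byte
```

A model that spends 2 bits on each token, where tokens average 4 bytes, spends half a bit on each byte.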

## Step 1: Building Byte-Length Lookup Tables from SentencePiece

Before training begins, the system constructs lookup tables that map every token ID to its corresponding byte length. This happens in the `build_sentencepiece_luts` function located at lines 80-99 of [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py).

### Normal Tokens vs. Byte Tokens

The function iterates through the SentencePiece vocabulary to determine byte lengths:

- **Normal tokens**: Length equals `len(piece.encode("utf-8"))`, the UTF-8 byte length of the token string, so a multi-byte character counts as several bytes
- **Byte tokens**: Byte-fallback tokens that each represent one raw byte (such as `<0x00>`) have a fixed length of `1`
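To make the two cases concrete, here is a small sketch of the per-token length rule. The helper `piece_byte_length` is a hypothetical name for illustration, and it detects byte tokens by their `<0xNN>` spelling, which is an assumption of this sketch rather than something the repository is confirmed to do.

```python
import re

# Byte-fallback pieces are spelled like <0x0A>; matching on the spelling is an
# assumption made for this sketch.
_BYTE_PIECE = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")


def piece_byte_length(piece: str) -> int:
    """Return how many raw bytes a vocabulary piece stands for."""
    if _BYTE_PIECE.match(piece):
        return 1  # one byte token encodes exactly one byte
    return len(piece.encode("utf-8"))


# ASCII text, a 2-byte character, two 3-byte characters, and a byte token.
for piece in ["hello", "\u00e9", "\u65e5\u672c", "<0xE3>"]:
    print(repr(piece), piece_byte_length(piece))
```

The character count and the byte count diverge as soon as the text leaves ASCII, which is why the table stores encoded lengths rather than `len(piece)`.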

### Handling Leading Spaces and Boundary Tokens

The lookup tables also track metadata critical for accurate byte counting:

- **Leading space flag**: Stored in `has_leading_space_lut` to indicate whether a token begins with a space, which SentencePiece writes as the `▁` (U+2581) marker at the start of a piece
- **Boundary tokens**: Control, unknown, and unused tokens are marked in `is_boundary_token_lut`. A token that follows one of these never receives the extra leading-space byte

```python
# Simplified sketch of build_sentencepiece_luts (train_gpt.py lines 80-99)
import torch


def build_sentencepiece_luts(sp, vocab_size, device):
    base_bytes = torch.zeros(vocab_size, dtype=torch.int16, device=device)
    has_leading_space = torch.zeros(vocab_size, dtype=torch.bool, device=device)
    is_boundary = torch.zeros(vocab_size, dtype=torch.bool, device=device)

    for token_id in range(vocab_size):
        # Control, unknown, and unused tokens carry no text bytes
        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
            is_boundary[token_id] = True
            continue

        if sp.is_byte(token_id):
            base_bytes[token_id] = 1  # Byte token such as <0x0A>
            continue

        piece = sp.id_to_piece(token_id)
        # Assumption: the "▁" marker is stripped before measuring, because
        # its byte is added separately during validation
        if piece.startswith("▁"):
            has_leading_space[token_id] = True
            piece = piece[1:]
        base_bytes[token_id] = len(piece.encode("utf-8"))

    return base_bytes, has_leading_space, is_boundary
```
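The table-building logic can be exercised without a trained model. The sketch below runs the same kind of classification over a tiny hand-written vocabulary using plain Python lists. The vocabulary, the `kind` labels, and the function name `build_toy_luts` are all hypothetical. Stripping the `▁` marker before measuring and giving boundary tokens zero bytes are assumptions, chosen to be consistent with the leading-space byte being added separately.

```python
# Hypothetical toy vocabulary: (piece, kind). The kinds mirror SentencePiece's
# piece types; the entries themselves are made up for illustration.
TOY_VOCAB = [
    ("<unk>", "unknown"),
    ("<s>", "control"),
    ("</s>", "control"),
    ("<0x0A>", "byte"),
    ("\u2581the", "normal"),  # "▁the": word-initial piece
    ("ing", "normal"),        # continuation piece, no leading space
]


def build_toy_luts(vocab):
    base_bytes, has_leading_space, is_boundary = [], [], []
    for piece, kind in vocab:
        if kind in ("control", "unknown", "unused"):
            # Boundary tokens stand for no text bytes (assumption).
            base_bytes.append(0)
            has_leading_space.append(False)
            is_boundary.append(True)
        elif kind == "byte":
            base_bytes.append(1)
            has_leading_space.append(False)
            is_boundary.append(False)
        else:
            leading = piece.startswith("\u2581")
            text = piece[1:] if leading else piece
            base_bytes.append(len(text.encode("utf-8")))
            has_leading_space.append(leading)
            is_boundary.append(False)
    return base_bytes, has_leading_space, is_boundary


base_bytes, has_leading_space, is_boundary = build_toy_luts(TOY_VOCAB)
print(base_bytes)         # [0, 0, 0, 1, 3, 3]
print(has_leading_space)  # [False, False, False, False, True, False]
print(is_boundary)        # [True, True, True, False, False, False]
```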

## Step 2: Accumulating Loss and Byte Counts During Validation

During the validation loop (lines 262-267 of [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py)), the system aggregates three key metrics: loss sum, token count, and byte count.

The byte accumulation logic handles the interaction between consecutive tokens:

1. **Base bytes**: Retrieved from `base_bytes_lut` for the current token
2. **Leading space adjustment**: If the current token has a leading space (`has_leading_space_lut`) and the previous token is **not** a boundary token (`~is_boundary_token_lut[prev_ids]`), add one extra byte

```python
# Excerpt from validation loop (train_gpt.py lines 262-267)
prev_ids = x.reshape(-1)
tgt_ids = y.reshape(-1)

# Look up base byte lengths for target tokens
token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16)

# Add byte for leading space if previous token isn't a boundary
token_bytes += (has_leading_space_lut[tgt_ids] &
                ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)

val_byte_count += token_bytes.to(torch.float64).sum()
```

This way a space is counted only when it corresponds to a real byte between two pieces of text, not when the marker follows a control token or comes from tokenizer formatting.
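A small worked example shows how the rule recovers the true byte count. The sketch below uses hand-written lookup lists instead of tensors and a made-up four-token sequence; all names and values are hypothetical.

```python
# Hypothetical lookup tables indexed by token ID:
#   0 = <s> (control), 1 = "▁hello", 2 = "▁world", 3 = "!"
BASE_BYTES = [0, 5, 5, 1]
HAS_LEADING_SPACE = [False, True, True, False]
IS_BOUNDARY = [True, False, False, False]


def count_bytes(prev_ids, tgt_ids):
    """Sum the bytes the target tokens stand for, given each one's predecessor."""
    total = 0
    for prev, tgt in zip(prev_ids, tgt_ids):
        total += BASE_BYTES[tgt]
        # The space before a word is real text only if text precedes it.
        if HAS_LEADING_SPACE[tgt] and not IS_BOUNDARY[prev]:
            total += 1
    return total


# Sequence <s> ▁hello ▁world ! : inputs are tokens[:-1], targets are tokens[1:].
tokens = [0, 1, 2, 3]
byte_count = count_bytes(tokens[:-1], tokens[1:])
print(byte_count, len("hello world!".encode("utf-8")))  # 12 12
```

The marker on `▁hello` adds nothing because it follows `<s>`, while the marker on `▁world` adds the one space byte that actually appears in `hello world!`.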

## Step 3: Converting Average Loss to Bits-Per-Byte

After validation completes, the `_loss_bpb` function (lines 1240-1243 of [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py)) converts the accumulated statistics into the final BPB metric.

The conversion follows three mathematical operations:

1. **Average loss per token**: `val_loss = loss_sum / token_count` (in nats)
2. **Nats to bits**: Divide by `log(2)` to convert from natural logarithm to base-2 logarithm
3. **Scale by token-to-byte ratio**: Multiply by `token_count / byte_count` to get bits per byte rather than bits per token

```python
# _loss_bpb implementation (train_gpt.py lines 1240-1243)
import math


def _loss_bpb(loss_sum, token_count, byte_count):
    # Inputs are scalar tensors accumulated during validation
    val_loss = (loss_sum / token_count).item()  # mean loss per token, in nats
    val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item())
    return val_loss, val_bpb
```

This final value is the average number of bits that an ideal entropy coder driven by the model's predicted probabilities would spend per byte of validation text. Because it is normalized by bytes rather than tokens, it can be compared across models that use different tokenizers.
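One way to read a BPB value is as a compressed-size estimate: multiplying it by the byte count gives the total number of bits spent on the text. The sketch below does that arithmetic. The function name and the numbers are made up for illustration and do not come from the repository.

```python
def ideal_compressed_bytes(bpb: float, byte_count: int) -> float:
    """Estimate the compressed size in bytes of text scored at `bpb` bits per byte."""
    total_bits = bpb * byte_count
    return total_bits / 8  # 8 bits per byte


# Hypothetical: 1,000,000 bytes of text scored at 1.2 bits per byte.
original_size = 1_000_000
compressed_size = ideal_compressed_bytes(bpb=1.2, byte_count=original_size)
ratio = compressed_size / original_size
print(compressed_size, ratio)  # about 150,000 bytes, 15% of the original
```

Since raw text costs 8 bits per byte, any BPB below 8 corresponds to some compression, and lower is better.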

## Summary

- **Tokenizer-agnostic BPB** measures compression efficiency in bits per raw byte, eliminating bias from specific tokenizer vocabularies.
- **Lookup table construction** in `build_sentencepiece_luts` maps tokens to byte lengths while tracking leading spaces and boundary tokens.
- **Validation accumulation** counts bytes accurately by adding space bytes only when they represent actual content between non-boundary tokens.
- **Final conversion** via `_loss_bpb` applies the formula: `(avg_loss / log(2)) * (tokens / bytes)`.
- All of this logic lives in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) in the `openai/parameter-golf` repository.

## Frequently Asked Questions

### What makes BPB "tokenizer-agnostic"?

Per-token loss and perplexity depend on the tokenizer. A tokenizer that splits text into more, shorter tokens makes each token easier to predict, which lowers the per-token loss without compressing the text any better. The tokenizer-agnostic BPB metric instead normalizes by **bytes**, accounting for how many bytes each token actually represents. This allows fair comparison between models using different tokenization strategies, because the metric reflects compression of the raw text rather than how the text happens to be split.
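The sketch below illustrates the point with a hypothetical scenario: two models assign the same total loss to the same 4,000-byte text, but one tokenizer produces 1,000 tokens and the other 2,000. The function name and all numbers are assumptions made for the example.

```python
import math


def per_token_bits_and_bpb(total_loss_nats: float, token_count: int, byte_count: int):
    """Return (bits per token, bits per byte) for one evaluation run."""
    total_bits = total_loss_nats / math.log(2)
    return total_bits / token_count, total_bits / byte_count


BYTE_COUNT = 4_000
TOTAL_NATS = 4_000 * math.log(2)  # hypothetical: 4,000 bits over the whole text

coarse = per_token_bits_and_bpb(TOTAL_NATS, token_count=1_000, byte_count=BYTE_COUNT)
fine = per_token_bits_and_bpb(TOTAL_NATS, token_count=2_000, byte_count=BYTE_COUNT)
print(coarse)  # roughly (4.0, 1.0)
print(fine)    # roughly (2.0, 1.0)
```

The per-token figure halves when the token count doubles, while the per-byte figure is unchanged. Dividing total bits by the byte count is algebraically the same as `(avg_loss / log(2)) * (tokens / bytes)`.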

### Why convert from nats to bits?

Neural network loss functions typically compute cross-entropy with the natural logarithm, so the loss is measured in nats. Compression metrics conventionally use base-2 logarithms, so they are measured in bits. Dividing by `math.log(2.0)` converts the average loss from nats to bits, so the final BPB value can be read as the code length, in bits per byte, that an ideal entropy coder using the model's probabilities would achieve, in line with Shannon's source coding theorem.
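A useful reference point for the conversion is a hypothetical byte-level model that assigns equal probability to all 256 byte values. The short sketch below, which is illustrative and not taken from the repository, shows that its loss converts to exactly the 8 bits of an uncompressed byte.

```python
import math

# Loss in nats of a model that assigns probability 1/256 to every byte value.
uniform_loss_nats = -math.log(1 / 256)
uniform_loss_bits = uniform_loss_nats / math.log(2)
print(uniform_loss_nats, uniform_loss_bits)  # about 5.545 nats, 8 bits
```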

### How does Parameter-Golf handle special tokens in byte counting?

Control tokens, unknown tokens, and unused vocabulary entries are marked as *boundary tokens* in the `is_boundary_token_lut` table. A token that follows a boundary token never receives the extra leading-space byte, which prevents artificial inflation of byte counts when the model encounters special sequences. This ensures that only actual text content (normal tokens and byte tokens) contributes to the compression metric.

### Can this metric be used with non-SentencePiece tokenizers?

The current implementation in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) specifically builds lookup tables from SentencePiece models using `build_sentencepiece_luts`. However, the underlying mathematical approach (mapping token IDs to byte lengths and calculating bits per byte) is tokenizer-agnostic by design. To use it with other BPE or WordPiece tokenizers, you would need to implement equivalent logic to build the byte-length lookup tables for those vocabulary types.
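As a rough sketch of what that equivalent logic might look like, the example below builds a byte-length table from a generic mapping of token IDs to raw bytes. The function `build_byte_length_lut`, its parameters, and the toy vocabulary are hypothetical; real tokenizer libraries expose their vocabularies in different ways, and none of this is part of the repository.

```python
def build_byte_length_lut(id_to_bytes: dict[int, bytes], special_ids: set[int]) -> list[int]:
    """Map each token ID to the number of raw bytes it stands for."""
    lut = [0] * (max(id_to_bytes) + 1)
    for token_id, raw in id_to_bytes.items():
        # Special tokens stand for no text, so they contribute zero bytes.
        lut[token_id] = 0 if token_id in special_ids else len(raw)
    return lut


# Hypothetical byte-level vocabulary in which spaces are part of the token bytes.
toy_vocab = {
    0: b"<|endoftext|>",
    1: b" the",
    2: b"ing",
    3: "\u00e9".encode("utf-8"),
}
lut = build_byte_length_lut(toy_vocab, special_ids={0})
print(lut)  # [0, 4, 3, 2]
```

In a vocabulary like this one, where every token already maps to its exact bytes including any space, a separate leading-space table would not be needed.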