
How Tokenizer-Agnostic Bits-Per-Byte (BPB) Is Calculated in OpenAI's Parameter-Golf

April 17, 2026 openai/parameter-golf ↗

The tokenizer-agnostic bits-per-byte (BPB) metric is calculated by converting the average cross-entropy loss from nats to bits, then scaling by the ratio of tokens to bytes derived from SentencePiece vocabulary lookup tables.

The openai/parameter-golf repository evaluates language model compression using this BPB calculation. Unlike standard perplexity metrics that depend on specific tokenizer vocabularies, this approach measures the actual information density relative to raw bytes, enabling fair comparison across different tokenization strategies.

Understanding the Tokenizer-Agnostic BPB Formula

The final BPB value represents the average number of bits required to encode each byte of text. The calculation follows this mathematical relationship:


BPB = (AvgLoss_nats / log(2)) × (TokenCount / ByteCount)

Where:

AvgLoss_nats is the mean cross-entropy loss across all tokens (in natural log units)
TokenCount / ByteCount is the inverse of the average bytes per token, derived from the tokenizer's vocabulary mapping

This formula is implemented in train_gpt.py within the _loss_bpb helper function.
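One way to see why the formula holds is to start from "total bits divided by total bytes" and rearrange. The derivation below is a sketch; the symbols \ell_i (loss of token i in nats), T (TokenCount), and B (ByteCount) are introduced here for illustration and do not appear in the repository.

```latex
\begin{aligned}
\mathrm{BPB}
  &= \frac{\text{total bits}}{\text{total bytes}}
   = \frac{1}{B}\sum_{i=1}^{T}\frac{\ell_i}{\ln 2} \\
  &= \frac{1}{\ln 2}\cdot\left(\frac{1}{T}\sum_{i=1}^{T}\ell_i\right)\cdot\frac{T}{B}
   = \frac{\mathrm{AvgLoss}_{\text{nats}}}{\ln 2}\times\frac{T}{B}
\end{aligned}
```

The token count cancels, so the result depends only on the total loss and the total number of bytes.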

Step 1: Building Byte-Length Lookup Tables from SentencePiece

Before training begins, the system constructs lookup tables that map every token ID to its corresponding byte length. This happens in the build_sentencepiece_luts function located at lines 80-99 of train_gpt.py.

Normal Tokens vs. Byte Tokens

The function iterates through the SentencePiece vocabulary to determine byte lengths:

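Normal tokens: Length equals len(piece.encode("utf-8")), the UTF-8 byte length of the token string. For this to agree with the one-byte space adjustment in Step 2, the length has to exclude SentencePiece's leading-space marker (▁, U+2581), which is three bytes in UTF-8
Byte tokens: Byte-fallback tokens that each represent a single raw byte (like <0x00>) have a fixed length of 1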
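The two cases can be illustrated without loading a real model. The helper below is a minimal sketch: piece_byte_length is a hypothetical name, the pieces are made-up examples, and the byte-token check is a plain string test rather than the SentencePiece API. Leading-space handling is left out here and is covered under "Handling Leading Spaces and Boundary Tokens".

```python
def piece_byte_length(piece: str) -> int:
    """Byte length of one vocabulary piece (illustrative helper, not from the repo)."""
    # Byte-fallback pieces look like "<0x41>" and stand for exactly one raw byte.
    if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
        return 1
    # Every other piece contributes its UTF-8 encoded length.
    return len(piece.encode("utf-8"))


# Hypothetical pieces: ASCII, an accented Latin word, two CJK characters, a byte token.
examples = ["cat", "caf\u00e9", "\u65e5\u672c", "<0x0A>"]
lengths = {piece: piece_byte_length(piece) for piece in examples}
print(lengths)
```

Character count and byte count differ for non-ASCII text, which is why the lookup stores encoded lengths rather than len(piece).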

Handling Leading Spaces and Boundary Tokens

The lookup tables also track metadata critical for accurate byte counting:

Leading space flag: Stored in has_leading_space_lut to indicate whether a token begins with a space (in SentencePiece pieces, the ▁ marker)
Boundary tokens: Control, unknown, and unused tokens are marked in is_boundary_token_lut. A token that directly follows a boundary token does not receive the extra leading-space byte


# Simplified sketch of build_sentencepiece_luts (train_gpt.py lines 80-99)
import torch

def build_sentencepiece_luts(sp, vocab_size, device):
    base_bytes = torch.zeros(vocab_size, dtype=torch.int16, device=device)
    has_leading_space = torch.zeros(vocab_size, dtype=torch.bool, device=device)
    is_boundary = torch.zeros(vocab_size, dtype=torch.bool, device=device)

    for token_id in range(vocab_size):
        # Control, unknown, and unused tokens carry no text bytes (assumed 0 here)
        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
            is_boundary[token_id] = True
            continue
        if sp.is_byte(token_id):
            base_bytes[token_id] = 1  # Byte token such as <0x00>
            continue
        piece = sp.id_to_piece(token_id)
        if piece.startswith("\u2581"):
            # The marker is 3 bytes in UTF-8; strip it and count the space separately
            has_leading_space[token_id] = True
            piece = piece[1:]
        base_bytes[token_id] = len(piece.encode("utf-8"))

    return base_bytes, has_leading_space, is_boundary
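To make the three tables concrete, here is a toy version that runs without torch or a trained SentencePiece model. Everything in it is an assumption made for illustration: the five-entry vocabulary, the "kind" labels, the build_toy_luts name, and the choice to give boundary tokens a base length of 0.

```python
MARKER = "\u2581"  # SentencePiece's visible stand-in for a leading space

# Hypothetical toy vocabulary: (piece, kind) pairs indexed by token ID.
TOY_VOCAB = [
    ("<s>", "control"),
    ("<unk>", "unknown"),
    ("<0x0A>", "byte"),
    (MARKER + "the", "normal"),
    ("re", "normal"),
]


def build_toy_luts(vocab):
    """Return (base_bytes, has_leading_space, is_boundary) as plain lists."""
    base_bytes, has_leading_space, is_boundary = [], [], []
    for piece, kind in vocab:
        boundary = kind in ("control", "unknown", "unused")
        leading = False
        if boundary:
            n = 0  # assumption: boundary tokens carry no text bytes
        elif kind == "byte":
            n = 1  # a byte token stands for exactly one raw byte
        else:
            leading = piece.startswith(MARKER)
            text = piece[1:] if leading else piece  # drop the 3-byte marker
            n = len(text.encode("utf-8"))
        base_bytes.append(n)
        has_leading_space.append(leading)
        is_boundary.append(boundary)
    return base_bytes, has_leading_space, is_boundary


base_bytes, has_leading_space, is_boundary = build_toy_luts(TOY_VOCAB)
print(base_bytes)         # [0, 0, 1, 3, 2]
print(has_leading_space)  # [False, False, False, True, False]
print(is_boundary)        # [True, True, False, False, False]
```

Stripping the marker before measuring matters: counted as written, "▁the" would be 6 bytes, while the text it stands for (" the") is 4.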

Step 2: Accumulating Loss and Byte Counts During Validation

During the validation loop (lines 262-267 of train_gpt.py), the system aggregates three key metrics: loss sum, token count, and byte count.

The byte accumulation logic handles the interaction between consecutive tokens:

Base bytes: Retrieved from base_bytes_lut for the current token
Leading space adjustment: If the current token has a leading space (has_leading_space_lut) and the previous token is not a boundary token (~is_boundary_token_lut[prev_ids]), add one extra byte


# Excerpt from validation loop (train_gpt.py lines 262-267)
prev_ids = x.reshape(-1)
tgt_ids = y.reshape(-1)

# Look up base byte lengths for target tokens
token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16)

# Add byte for leading space if previous token isn't a boundary
token_bytes += (has_leading_space_lut[tgt_ids] &
                ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)

val_byte_count += token_bytes.to(torch.float64).sum()

This approach ensures that space characters are only counted when they represent actual byte content between tokens, not when they result from tokenizer formatting or control sequences.
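A small worked example shows the rule reproducing the true byte length of a sentence. This is a sketch in plain Python rather than the tensor code above; the piece sequence, the count_bytes helper, and the assumption that "<s>" is the only boundary token are all hypothetical.

```python
MARKER = "\u2581"  # SentencePiece's leading-space marker

# Hypothetical tokenization of "the cat sat", preceded by a start-of-text token.
pieces = ["<s>", MARKER + "the", MARKER + "cat", MARKER + "sat"]
boundary_pieces = {"<s>"}


def count_bytes(pieces, boundary_pieces):
    """Sum bytes over (previous, target) pairs, mirroring the validation rule."""
    total = 0
    for prev, tgt in zip(pieces[:-1], pieces[1:]):
        leading = tgt.startswith(MARKER)
        body = tgt[1:] if leading else tgt
        n = len(body.encode("utf-8"))  # base bytes of the target token
        if leading and prev not in boundary_pieces:
            n += 1  # the space is real text only when it follows ordinary content
        total += n
    return total


total = count_bytes(pieces, boundary_pieces)
print(total, len("the cat sat".encode("utf-8")))  # 11 11
```

The first "▁the" follows "<s>", so its marker is not counted; the other two markers each add one byte, giving 3 + 4 + 4 = 11.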

Step 3: Converting Average Loss to Bits-Per-Byte

After validation completes, the _loss_bpb function (lines 1240-1243 of train_gpt.py) converts the accumulated statistics into the final BPB metric.

The conversion follows three mathematical operations:

Average loss per token: val_loss = loss_sum / token_count (in nats)
Nats to bits: Divide by log(2) to convert from natural logarithm to base-2 logarithm
Scale by token-to-byte ratio: Multiply by token_count / byte_count to get bits per byte rather than bits per token


# _loss_bpb implementation (train_gpt.py lines 1240-1243)

def _loss_bpb(loss_sum, token_count, byte_count):
    val_loss = (loss_sum / token_count).item()
    val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item())
    return val_loss, val_bpb
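The same arithmetic can be checked with plain floats. The function below is a sketch that mirrors _loss_bpb without tensors; the name loss_bpb and the input numbers are made up so that the result is easy to verify by hand.

```python
import math


def loss_bpb(loss_sum: float, token_count: int, byte_count: int):
    """Plain-float analogue of _loss_bpb (illustrative, not the repo's code)."""
    val_loss = loss_sum / token_count            # average nats per token
    bits_per_token = val_loss / math.log(2.0)    # nats -> bits
    tokens_per_byte = token_count / byte_count   # inverse of bytes per token
    return val_loss, bits_per_token * tokens_per_byte


# Hypothetical run: 1000 tokens covering 4000 bytes, 4 bits of loss per token.
loss_sum = 1000 * 4 * math.log(2.0)
val_loss, val_bpb = loss_bpb(loss_sum, token_count=1000, byte_count=4000)
print(round(val_loss, 4), round(val_bpb, 4))  # 2.7726 1.0

# Cross-check: total bits divided by total bytes gives the same value.
direct = loss_sum / math.log(2.0) / 4000
```

Four bits per token spread over four bytes per token comes out to one bit per byte.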

This final value is the average number of bits an ideal entropy coder driven by the model's predicted probabilities would spend on each byte of the validation text. Because it is normalized by bytes rather than tokens, it measures compression performance without depending on how finely the tokenizer splits the text.

Summary

Tokenizer-agnostic BPB measures compression efficiency in bits per raw byte, eliminating bias from specific tokenizer vocabularies.
Lookup table construction in build_sentencepiece_luts maps tokens to byte lengths while tracking leading spaces and boundary tokens.
Validation accumulation counts bytes accurately by adding space bytes only when they represent actual content between non-boundary tokens.
Final conversion via _loss_bpb applies the formula: (avg_loss / log(2)) * (tokens / bytes).
This implementation lives in train_gpt.py in the openai/parameter-golf repository.

Frequently Asked Questions

What makes BPB "tokenizer-agnostic"?

Per-token metrics such as cross-entropy loss and perplexity are not comparable across tokenizers. A tokenizer that splits text into more, shorter tokens spreads the same information over more predictions and so tends to report a lower per-token loss, even when the model compresses the text no better. The tokenizer-agnostic BPB metric removes this effect by accounting for how many bytes each token actually represents. This allows fair comparison between models using different tokenization strategies, because the metric reflects compression of the raw text rather than token granularity.
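The effect is easy to reproduce with made-up numbers. In the sketch below, two hypothetical tokenizers split the same 1200-byte text into 300 and 400 tokens, and both models are assumed to spend the same total of 1200 bits on it; the names and figures are illustrative only.

```python
import math


def bits_per_byte(avg_loss_nats: float, token_count: int, byte_count: int) -> float:
    return avg_loss_nats / math.log(2.0) * (token_count / byte_count)


byte_count = 1200
total_nats = 1200 * math.log(2.0)        # assumed: 1200 bits in total for the text
token_counts = {"coarse": 300, "fine": 400}  # hypothetical tokenizers

results = {}
for name, tokens in token_counts.items():
    avg_loss = total_nats / tokens       # per-token loss depends on granularity
    results[name] = (avg_loss, bits_per_byte(avg_loss, tokens, byte_count))
    print(name, round(avg_loss, 3), round(results[name][1], 3))
```

The per-token losses come out near 2.773 and 2.079 nats, yet both runs score 1.0 bits per byte, which is the comparison the metric is designed to make.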

Why convert from nats to bits?

Neural network loss functions typically compute cross-entropy using the natural logarithm, so the loss is in nats. Information theory and compression work conventionally use base-2 logarithms, giving bits. Dividing by math.log(2.0) converts the average loss from nats to bits, so the final BPB value can be read as the expected code length, in bits per byte, of an entropy coder that uses the model's probabilities.
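A quick sanity check for the conversion is a hypothetical model that guesses uniformly over all 256 byte values, with one token per byte. Its loss should convert to 8 bits, the cost of storing a byte with no compression at all.

```python
import math

# Cross-entropy (in nats) of a uniform guess over the 256 possible byte values.
loss_nats = -math.log(1.0 / 256.0)

# Nats -> bits: divide by ln(2).
loss_bits = loss_nats / math.log(2.0)

# With one token per byte, tokens / bytes is 1, so BPB equals bits per token.
bpb = loss_bits * 1.0
print(round(loss_nats, 4), round(bpb, 4))
```

Any model that scores below 8 bits per byte on real text is therefore compressing it relative to raw storage.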

How does Parameter-Golf handle special tokens in byte counting?

Control tokens, unknown tokens, and unused vocabulary entries are marked as boundary tokens in the is_boundary_token_lut table. When a token with a leading space directly follows one of these, the extra space byte is not added, which prevents artificial inflation of byte counts around special sequences. This ensures that only actual text content (normal tokens and byte tokens) contributes to the compression metric.

Can this metric be used with non-SentencePiece tokenizers?

The current implementation in train_gpt.py specifically builds lookup tables from SentencePiece models using build_sentencepiece_luts. However, the underlying approach of mapping token IDs to byte lengths and calculating bits per byte does not depend on SentencePiece. To use it with another tokenizer, such as a byte-level BPE or WordPiece implementation, you would need to implement equivalent logic to build the byte-length lookup tables for that vocabulary type.
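As a rough sketch of what that equivalent logic might look like, assume a byte-level vocabulary where each token ID maps directly to the raw bytes it decodes to. The toy vocabulary, the special_ids set, and the build_byte_length_lut name are all hypothetical and not part of the repository. In this setting spaces are stored inside the tokens, so no separate leading-space adjustment is needed.

```python
# Hypothetical byte-level vocabulary: token ID -> raw bytes the token decodes to.
toy_vocab = {
    0: b"",                 # special token with no text content
    1: b"the",
    2: b" cat",
    3: b" caf\xc3\xa9",     # " café": the accented letter takes two bytes
}
special_ids = {0}


def build_byte_length_lut(vocab, special_ids):
    """Return a list where lut[token_id] is the token's byte length."""
    lut = [0] * (max(vocab) + 1)
    for token_id, raw in vocab.items():
        if token_id not in special_ids:
            lut[token_id] = len(raw)
    return lut


lut = build_byte_length_lut(toy_vocab, special_ids)
token_ids = [1, 2, 3]
byte_count = sum(lut[t] for t in token_ids)
decoded = b"".join(toy_vocab[t] for t in token_ids)
print(lut, byte_count, len(decoded))  # [0, 3, 4, 6] 13 13
```

Whatever the tokenizer, a useful check is the one shown on the last line: the summed table lengths should equal the byte length of the decoded text.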
