How Tokenizer-Agnostic Bits-Per-Byte (BPB) Is Calculated in OpenAI's Parameter-Golf
The tokenizer-agnostic bits-per-byte (BPB) metric is calculated by converting the average cross-entropy loss from nats to bits, then scaling by the ratio of tokens to bytes derived from SentencePiece vocabulary lookup tables.
The openai/parameter-golf repository evaluates language model compression using this BPB calculation. Unlike standard perplexity metrics that depend on specific tokenizer vocabularies, this approach measures the actual information density relative to raw bytes, enabling fair comparison across different tokenization strategies.
Understanding the Tokenizer-Agnostic BPB Formula
The final BPB value represents the average number of bits required to encode each byte of text. The calculation follows this mathematical relationship:
BPB = (AvgLoss_nats / log(2)) × (TokenCount / ByteCount)
Where:
- AvgLoss_nats is the mean cross-entropy loss across all tokens (in natural log units)
- TokenCount / ByteCount is the inverse of the average bytes per token, derived from the tokenizer's vocabulary mapping
This formula is implemented in train_gpt.py within the _loss_bpb helper function.
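As a quick sanity check of the formula, the sketch below plugs in made-up numbers. The loss sum, token count, and byte count are illustrative assumptions, not results from the repository.

```python
import math

# Illustrative inputs (assumed values, not measurements from parameter-golf).
loss_sum_nats = 2_500.0   # summed cross-entropy over all tokens, in nats
token_count = 1_000       # number of predicted tokens
byte_count = 4_000        # UTF-8 bytes those tokens decode to

avg_loss_nats = loss_sum_nats / token_count          # nats per token
bits_per_token = avg_loss_nats / math.log(2)         # nats -> bits
bpb = bits_per_token * (token_count / byte_count)    # bits per byte

print(f"bits/token = {bits_per_token:.4f}, BPB = {bpb:.4f}")
# prints: bits/token = 3.6067, BPB = 0.9017
```

With an average of four bytes per token, the per-byte figure is one quarter of the per-token figure.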
Step 1: Building Byte-Length Lookup Tables from SentencePiece
Before training begins, the system constructs lookup tables that map every token ID to its corresponding byte length. This happens in the build_sentencepiece_luts function located at lines 80-99 of train_gpt.py.
Normal Tokens vs. Byte Tokens
The function iterates through the SentencePiece vocabulary to determine byte lengths:
- Normal tokens: Length equals len(piece.encode("utf-8")), the actual UTF-8 byte length of the token string
- Byte tokens: Special tokens representing single bytes (like <0x00>) have a fixed length of 1
Handling Leading Spaces and Boundary Tokens
The lookup tables also track metadata critical for accurate byte counting:
- Leading space flag: Stored in has_leading_space_lut to indicate whether a token begins with a space character
- Boundary tokens: Control, unknown, and unused tokens are marked in is_boundary_token_lut; these never contribute extra bytes for spacing
# Simplified excerpt from build_sentencepiece_luts (train_gpt.py lines 80-99)
import torch
def build_sentencepiece_luts(sp, vocab_size, device):
    # One entry per token ID: byte length, leading-space flag, boundary flag
    base_bytes = torch.zeros(vocab_size, dtype=torch.int16, device=device)
    has_leading_space = torch.zeros(vocab_size, dtype=torch.bool, device=device)
    is_boundary = torch.zeros(vocab_size, dtype=torch.bool, device=device)
    for token_id in range(vocab_size):
        piece = sp.id_to_piece(token_id)
        # Determine byte length based on token type
        if piece.startswith("<0x") and piece.endswith(">"):
            base_bytes[token_id] = 1  # Byte token
        else:
            base_bytes[token_id] = len(piece.encode("utf-8"))
    # Filling has_leading_space and is_boundary is omitted from this excerpt
    return base_bytes, has_leading_space, is_boundary
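The excerpt above leaves the two flag tables unfilled. One way they might be populated, following the description in this section, is sketched below in plain Python. FakeSP is a hypothetical stand-in for a SentencePiece processor so the example runs without a model file; its method names mirror id_to_piece, is_control, and is_byte from the sentencepiece Python API. Treating "▁" as the leading-space marker and excluding it from the base length is an assumption of this sketch, not something confirmed against train_gpt.py, and unknown and unused tokens are left out for brevity.

```python
# Hypothetical stand-in for a SentencePiece processor (assumption for illustration).
class FakeSP:
    def __init__(self, pieces, control_ids=(), byte_ids=()):
        self.pieces = list(pieces)
        self.control_ids = set(control_ids)
        self.byte_ids = set(byte_ids)

    def id_to_piece(self, i):
        return self.pieces[i]

    def is_control(self, i):
        return i in self.control_ids

    def is_byte(self, i):
        return i in self.byte_ids


def build_luts(sp, vocab_size):
    base_bytes = [0] * vocab_size
    has_leading_space = [False] * vocab_size
    is_boundary = [False] * vocab_size
    for token_id in range(vocab_size):
        if sp.is_control(token_id):
            is_boundary[token_id] = True      # contributes no text bytes
            continue
        if sp.is_byte(token_id):
            base_bytes[token_id] = 1          # e.g. "<0x0A>" is one raw byte
            continue
        piece = sp.id_to_piece(token_id)
        if piece.startswith("\u2581"):        # "▁" marks a leading space
            has_leading_space[token_id] = True
            piece = piece[1:]                 # the space is counted separately
        base_bytes[token_id] = len(piece.encode("utf-8"))
    return base_bytes, has_leading_space, is_boundary


sp = FakeSP(["<s>", "<0x0A>", "\u2581the", "caf\u00e9"], control_ids=[0], byte_ids=[1])
base_bytes, has_leading_space, is_boundary = build_luts(sp, 4)
print(base_bytes, has_leading_space, is_boundary)
```

In this toy vocabulary, "café" counts as five bytes because "é" takes two bytes in UTF-8, while "▁the" counts as three base bytes plus a leading-space flag.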
Step 2: Accumulating Loss and Byte Counts During Validation
During the validation loop (lines 262-267 of train_gpt.py), the system aggregates three key metrics: loss sum, token count, and byte count.
The byte accumulation logic handles the interaction between consecutive tokens:
- Base bytes: Retrieved from base_bytes_lut for the current token
- Leading space adjustment: If the current token has a leading space (has_leading_space_lut) and the previous token is not a boundary token (~is_boundary_token_lut[prev_ids]), add one extra byte
# Excerpt from validation loop (train_gpt.py lines 262-267)
prev_ids = x.reshape(-1)
tgt_ids = y.reshape(-1)
# Look up base byte lengths for target tokens
token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16)
# Add byte for leading space if previous token isn't a boundary
token_bytes += (has_leading_space_lut[tgt_ids] &
                ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)
val_byte_count += token_bytes.to(torch.float64).sum()
This approach ensures that space characters are only counted when they represent actual byte content between tokens, not when they result from tokenizer formatting or control sequences.
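The accumulation rule can be traced by hand on a tiny example. The sketch below uses plain Python lists in place of tensors, and the four-token vocabulary and its table values are invented for illustration.

```python
# Toy lookup tables for a 4-token vocabulary (illustrative values only).
# id 0 = "<s>" (boundary), 1 = "▁the", 2 = "▁cat", 3 = "s"
base_bytes_lut = [0, 3, 3, 1]
has_leading_space_lut = [False, True, True, False]
is_boundary_token_lut = [True, False, False, False]


def count_bytes(prev_ids, tgt_ids):
    """Sum bytes for target tokens, adding one per leading space that
    follows a non-boundary token."""
    total = 0
    for prev, tgt in zip(prev_ids, tgt_ids):
        total += base_bytes_lut[tgt]
        if has_leading_space_lut[tgt] and not is_boundary_token_lut[prev]:
            total += 1
    return total


# Sequence "<s> ▁the ▁cat s": inputs are tokens 0..n-1, targets are tokens 1..n.
tokens = [0, 1, 2, 3]
prev_ids, tgt_ids = tokens[:-1], tokens[1:]
val_byte_count = count_bytes(prev_ids, tgt_ids)
print(val_byte_count)  # prints: 8
```

The result matches the eight bytes of "the cats": the space before "the" follows a boundary token and is skipped, while the space before "cat" is counted.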
Step 3: Converting Average Loss to Bits-Per-Byte
After validation completes, the _loss_bpb function (lines 1240-1243 of train_gpt.py) converts the accumulated statistics into the final BPB metric.
The conversion follows three mathematical operations:
- Average loss per token: val_loss = loss_sum / token_count (in nats)
- Nats to bits: Divide by log(2) to convert from natural logarithm to base-2 logarithm
- Scale by token-to-byte ratio: Multiply by token_count / byte_count to get bits per byte rather than bits per token
# _loss_bpb implementation (train_gpt.py lines 1240-1243)
import math
def _loss_bpb(loss_sum, token_count, byte_count):
    # Inputs are scalar tensors accumulated during validation
    val_loss = (loss_sum / token_count).item()
    val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item())
    return val_loss, val_bpb
This final value is the average number of bits an ideal entropy coder driven by the model's probability distribution would need to encode each byte of the validation text. Because the token count cancels out of the formula, it measures compression performance independently of how the tokenizer segments the text.
Summary
- Tokenizer-agnostic BPB measures compression efficiency in bits per raw byte, eliminating bias from specific tokenizer vocabularies.
- Lookup table construction in build_sentencepiece_luts maps tokens to byte lengths while tracking leading spaces and boundary tokens.
- Validation accumulation counts bytes accurately by adding space bytes only when they represent actual content between non-boundary tokens.
- Final conversion via _loss_bpb applies the formula: (avg_loss / log(2)) * (tokens / bytes).
- This implementation lives in train_gpt.py in the openai/parameter-golf repository.
Frequently Asked Questions
What makes BPB "tokenizer-agnostic"?
Per-token metrics such as perplexity depend on how the tokenizer segments text: a tokenizer that splits the same text into more, shorter tokens tends to report a lower per-token loss even when the model compresses the text no better. The tokenizer-agnostic BPB metric converts per-token loss to bits per byte by accounting for how many bytes each token actually represents. This allows fair comparison between models using different tokenization strategies, as the metric reflects actual compression of the raw text rather than tokenizer efficiency.
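A small numeric sketch makes the point concrete. It assumes a hypothetical scenario in which two tokenizers split the same text into different numbers of tokens while the model assigns the same total code length; all numbers are invented, and loss_bpb here is a plain-float analogue of the repository's _loss_bpb helper.

```python
import math


def loss_bpb(loss_sum_nats, token_count, byte_count):
    """Return (nats per token, bits per byte) from accumulated totals."""
    val_loss = loss_sum_nats / token_count
    val_bpb = val_loss / math.log(2.0) * (token_count / byte_count)
    return val_loss, val_bpb


# Assumed scenario: the same 10,000-byte text and the same total
# code length (6,000 nats), split into different numbers of tokens.
byte_count = 10_000
total_nats = 6_000.0

coarse = loss_bpb(total_nats, token_count=2_000, byte_count=byte_count)
fine = loss_bpb(total_nats, token_count=5_000, byte_count=byte_count)

print(f"coarse tokenizer: {coarse[0]:.2f} nats/token, {coarse[1]:.4f} BPB")
print(f"fine tokenizer:   {fine[0]:.2f} nats/token, {fine[1]:.4f} BPB")
```

The per-token losses differ (3.00 versus 1.20 nats), but both runs land on the same BPB of about 0.8656, because the token count cancels when the per-token loss is multiplied by tokens per byte.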
Why convert from nats to bits?
Neural network loss functions typically compute cross-entropy using the natural logarithm (nats). However, information theory and compression metrics conventionally use base-2 logarithms (bits). The division by math.log(2.0) converts the average loss from nats to bits, so the final BPB value can be read as an average code length in bits per byte, the quantity that Shannon's source coding theorem ties to achievable compression.
How does Parameter-Golf handle special tokens in byte counting?
Control tokens, unknown tokens, and unused vocabulary entries are marked as boundary tokens in the is_boundary_token_lut table. These tokens never contribute extra bytes for leading spaces, preventing artificial inflation of byte counts when the model encounters special sequences. This ensures that only actual text content (normal tokens and byte tokens) contributes to the compression metric.
Can this metric be used with non-SentencePiece tokenizers?
The current implementation in train_gpt.py specifically builds lookup tables from SentencePiece models using build_sentencepiece_luts. However, the underlying mathematical approach—mapping token IDs to byte lengths and calculating bits per byte—is tokenizer-agnostic by design. To use this with BPE or WordPiece tokenizers, you would need to implement equivalent logic to build the byte-length lookup tables for those vocabulary types.
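As a rough sketch of what that equivalent logic might look like, consider a tokenizer whose vocabulary maps each ID directly to a byte string, as byte-level BPE vocabularies typically do. In that case the table is a plain length lookup, and no leading-space adjustment is needed because any space is already part of the token's bytes. The vocab dictionary, special_ids set, and build_byte_luts function below are hypothetical names invented for this example; they are not part of train_gpt.py.

```python
# Hypothetical byte-level vocabulary: token id -> raw bytes (assumption for illustration).
vocab = {
    0: b"",                       # special token, e.g. end-of-text
    1: b" the",                   # the space is part of the token's bytes
    2: b"cat",
    3: "\u00e9".encode("utf-8"),  # "é" is two bytes in UTF-8
}
special_ids = {0}


def build_byte_luts(vocab, special_ids):
    size = max(vocab) + 1
    base_bytes = [0] * size
    is_boundary = [False] * size
    for token_id, raw in vocab.items():
        if token_id in special_ids:
            is_boundary[token_id] = True   # special tokens add no text bytes
        else:
            base_bytes[token_id] = len(raw)
    return base_bytes, is_boundary


base_bytes, is_boundary = build_byte_luts(vocab, special_ids)
byte_count = sum(base_bytes[t] for t in [1, 2, 3])
print(base_bytes, byte_count)  # prints: [0, 4, 3, 2] 9
```

The resulting byte count can then be fed into the same BPB formula unchanged.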