How LZMA Code Compression Reduces Submission Artifact Size in Parameter Golf

LZMA code compression reduces submission artifacts by 39% compared to zlib by applying high-ratio entropy coding to tightly packed, quantized weight streams, enabling a ~16.6 KB self-extracting wrapper that fits the 16 MB competition limit.

The openai/parameter-golf repository demonstrates how extreme quantization combined with Lempel-Ziv-Markov chain Algorithm (LZMA) compression creates tiny model submissions. By first quantizing neural network weights to binary or ternary values, packing them into dense byte streams, and then applying LZMA's dictionary-based range coding, the pipeline achieves dramatic size reductions without sacrificing the ability to reconstruct full tensors at runtime.

The Multi-Stage Compression Pipeline

The repository implements a three-stage pipeline of quantization, packing, and compression, where LZMA serves as the final entropy coding layer. Quantization cuts the information content of the weights, packing removes unused bits from their representation, and LZMA removes the redundancy that remains.

Quantization and Packing

Model weights are first quantized to ultra-low-precision formats to minimize information content. The parameter-golf codebase supports 1-bit binary (values in {-1, +1}), 3-value ternary (values in {-1, 0, +1}), and int6/int8 quantizations.
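To make the quantization step concrete, here is a minimal sketch of binary and ternary quantization. The function names, the sign-based binary rule, and the `threshold_ratio=0.7` heuristic are assumptions chosen for illustration; they are not taken from the parameter-golf code, which may quantize differently.

```python
import numpy as np

def quantize_binary(w: np.ndarray) -> np.ndarray:
    """Map each weight to -1 or +1 by sign (zeros go to +1). Hypothetical helper."""
    return np.where(w >= 0, 1, -1).astype(np.int8)

def quantize_ternary(w: np.ndarray, threshold_ratio: float = 0.7) -> np.ndarray:
    """Zero out small weights and keep the sign of the rest. Hypothetical helper."""
    # Assumed heuristic: threshold is a fraction of the mean absolute weight
    threshold = threshold_ratio * np.abs(w).mean()
    return (np.sign(w) * (np.abs(w) > threshold)).astype(np.int8)

rng = np.random.default_rng(0)
float_weights = rng.normal(size=10).astype(np.float32)
print(quantize_binary(float_weights))
print(quantize_ternary(float_weights))
```

Either output contains only two or three distinct values, which is what makes the dense packing described next possible.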

After quantization, the discrete values are tightly packed into byte arrays:

  • Bit-packing stores each binary weight in a single bit, yielding an 8× density improvement over int8 storage.
  • Base-3 packing maps ternary values to trits (0, 1, 2) and encodes 5 trits per byte (3^5 = 243 ≤ 256), or 1.6 bits per trit, close to the theoretical minimum of log2(3) ≈ 1.585 bits, as implemented in the binary/ternary packing utilities.
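The bit-packing case can be sketched with NumPy's `packbits` and `unpackbits`. This is an illustration under assumed inputs (random binary weights), not the repository's packing utility; the last two lines simply check the density arithmetic for base-3 packing.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary weights in {-1, +1}
weights = rng.choice(np.array([-1, 1], dtype=np.int8), size=1_000)

# Map -1 -> 0 and +1 -> 1, then store 8 weights per byte
bits = (weights > 0).astype(np.uint8)
packed = np.packbits(bits)

# Unpack; count= drops any padding bits added to fill the last byte
restored_bits = np.unpackbits(packed, count=weights.size)
restored = restored_bits.astype(np.int8) * 2 - 1

print(weights.nbytes, "bytes as int8 ->", packed.nbytes, "bytes packed")

# Base-3 density check: five trits give 3**5 = 243 states, which fit in one byte
assert 3**5 <= 256
print(8 / 5, "bits per trit packed vs", round(math.log2(3), 3), "theoretical minimum")
```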

These packed streams are already 30–40% smaller than a standard int8 + zlib baseline before LZMA is applied, according to the results documented in records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/RESULTS.md.

LZMA Entropy Coding

The packed byte streams are handed to Python's lzma module with preset 9 (the highest compression level). LZMA combines a large sliding-window LZ77 dictionary with range coding, which excels at exploiting the high redundancy and long runs of identical values that remain after quantization and packing.

With this final stage in place, ternary models reach a ≈39% size reduction compared to the int8 + zlib baseline, as reported in the same results file. LZMA's dictionary stage encodes repetitive weight patterns, which are common in quantized neural networks, as back-references rather than literal values.
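A quick way to see how the two compressors behave on redundant data is to run both on the same stream. The sketch below uses synthetic, mostly-zero ternary weights stored as raw int8; the distribution is an assumption, and the printed sizes will not match the repository's reported figures, which depend on real trained weights and packing.

```python
import lzma
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse ternary weights: mostly zeros, so the stream is redundant
weights = rng.choice(
    np.array([-1, 0, 1], dtype=np.int8), size=500_000, p=[0.1, 0.8, 0.1]
)
raw = weights.tobytes()

zlib_blob = zlib.compress(raw, 9)          # DEFLATE at its highest level
lzma_blob = lzma.compress(raw, preset=9)   # LZMA at its highest preset

print(f"raw:  {len(raw)} bytes")
print(f"zlib: {len(zlib_blob)} bytes")
print(f"lzma: {len(lzma_blob)} bytes")
```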

Implementing the Self-Extracting Wrapper

The compressed payload is embedded into a minimal Python loader that reconstructs the original tensors at runtime. This wrapper is significantly smaller than the raw weight blob it replaces.

Offline Compression Script

The following pattern, derived from the repository's training scripts, demonstrates how weights are packed and compressed for submission:

import base64
import lzma

import numpy as np

# Example: ternary weights in {-1, 0, +1}
weights = np.random.choice([-1, 0, 1], size=1_000_000).astype(np.int8)

# Base-3 packing: map -1→0, 0→1, +1→2
trits = (weights + 1).astype(np.uint8)

# Pack 5 trits per byte (simplified illustration): each group of five trits
# becomes one base-3 number in 0..242. Pad so the length is a multiple of 5.
# Actual implementation uses efficient bit manipulation
groups = np.pad(trits, (0, -len(trits) % 5)).reshape(-1, 5)
powers = np.array([1, 3, 9, 27, 81], dtype=np.uint8)
packed = (groups * powers).sum(axis=1).astype(np.uint8)

# LZMA compress with preset 9 (highest ratio)
compressed = lzma.compress(packed.tobytes(), preset=9)

# Encode for embedding in the wrapper
payload = base64.b85encode(compressed).decode()
print(f"Payload length: {len(payload)} characters")

Runtime Decompression Loader

The self-extracting wrapper, as implemented in records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py, decompresses and reconstructs the tensors:

import base64 as B
import lzma as L

import numpy as np
import torch

# The compressed payload embedded as a base85 string (placeholder)
payload = "<payload>"

# Decompress to raw bytes
raw = L.decompress(B.b85decode(payload))

# Reconstruct as a byte tensor, then reshape. np.frombuffer on bytes returns
# a read-only array, so copy it before handing it to torch.
tensor = torch.from_numpy(
    np.frombuffer(raw, dtype=np.uint8).copy()
).view(<shape>)  # Placeholder: reshape to original dimensions

# Unpack the values and assign them to model parameters (not shown)

This wrapper approach saves approximately 43 KB per submission compared to storing raw weight files, as documented in records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/README.md.
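The full round trip, from ternary weights to an embeddable string and back, can be sketched without PyTorch. The helper names `pack_trits` and `unpack_trits` and the digit order within each byte are assumptions made for this example; the repository's loader may lay out the bytes differently.

```python
import base64
import lzma
import numpy as np

POWERS = np.array([1, 3, 9, 27, 81], dtype=np.uint8)

def pack_trits(weights: np.ndarray) -> bytes:
    """Pack ternary weights in {-1, 0, +1} at five trits per byte."""
    trits = (weights.ravel() + 1).astype(np.uint8)
    trits = np.pad(trits, (0, -trits.size % 5))  # pad to a multiple of 5
    return (trits.reshape(-1, 5) * POWERS).sum(axis=1).astype(np.uint8).tobytes()

def unpack_trits(blob: bytes, shape: tuple) -> np.ndarray:
    """Invert pack_trits and restore the original shape."""
    packed = np.frombuffer(blob, dtype=np.uint8)
    trits = (packed[:, None] // POWERS) % 3      # extract five base-3 digits
    count = int(np.prod(shape))                  # drop the padding trits
    return (trits.reshape(-1)[:count].astype(np.int8) - 1).reshape(shape)

rng = np.random.default_rng(0)
weights = rng.integers(-1, 2, size=(64, 33)).astype(np.int8)

# Offline: pack, compress, and encode as text
payload = base64.b85encode(lzma.compress(pack_trits(weights), preset=9)).decode()

# Runtime: decode, decompress, and unpack
restored = unpack_trits(lzma.decompress(base64.b85decode(payload)), weights.shape)
assert np.array_equal(restored, weights)
```

The restored array could then be wrapped with `torch.from_numpy` and copied into the model's parameters.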

Alternative: LZMA2 Filters for Archive Bundling

For submissions requiring multiple files, the repository also demonstrates using LZMA2 filters directly with tarfile for maximum compression of the entire submission bundle:

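import lzma
import tarfile

# tarfile.open() in "w:xz" mode accepts `preset` but not `filters`, so open
# the XZ stream with an explicit LZMA2 filter chain and write the tar into it.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 9}]

with lzma.open("submission.tar.xz", "wb",
               format=lzma.FORMAT_XZ, filters=filters) as xz:
    with tarfile.open(fileobj=xz, mode="w",
                      format=tarfile.PAX_FORMAT) as tar:
        tar.add("model_weights.bin")
        tar.add("inference_code.py")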

This pattern appears in records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_WD090_AllInt6/train_gpt.py, showing how the competition entries leverage modern LZMA2 filtering for archive-level compression.
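To check that an archive written through an explicit LZMA2 filter chain is still an ordinary `.tar.xz`, it can be built in memory and read back with `tarfile`'s standard `r:xz` mode. The member name and contents below are placeholders for illustration only.

```python
import io
import lzma
import tarfile

data = b"\x00" * 4096  # stand-in for packed weight bytes
buf = io.BytesIO()
filters = [{"id": lzma.FILTER_LZMA2, "preset": 9}]

# Write: tar stream -> XZ container with an explicit LZMA2 filter -> memory
with lzma.open(buf, "wb", format=lzma.FORMAT_XZ, filters=filters) as xz:
    with tarfile.open(fileobj=xz, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name="model_weights.bin")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Read back with the standard xz mode
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tar:
    names = tar.getnames()
    restored = tar.extractfile("model_weights.bin").read()

print(names, len(restored))
```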

Summary

  • LZMA code compression in openai/parameter-golf serves as the final entropy coding stage after aggressive quantization and bit-packing.
  • The algorithm achieves ≈39% size reduction over int8+zlib baselines by exploiting redundancies in ternary and binary weight streams using dictionary compression and range coding.
  • A ~16.6 KB self-extracting Python wrapper decompresses the payload at runtime using lzma.decompress() and base64.b85decode(), saving approximately 43 KB per submission.
  • The pipeline automatically evaluates bit-mask + LZMA versus base-3 + LZMA to select the optimal compression strategy for each model architecture.

Frequently Asked Questions

How does LZMA achieve better compression than zlib for quantized neural networks?

LZMA utilizes a large sliding-window dictionary (up to 4 GB in theory, though typically smaller in practice) combined with range coding, whereas zlib uses DEFLATE with a smaller 32 KB window and Huffman coding. Quantized neural network weights exhibit long runs of identical values and repeated patterns after packing. LZMA's larger dictionary captures these redundancies across longer distances, while range coding achieves fractional bit-precision for frequent symbols, resulting in the ≈39% reduction observed over zlib in the parameter-golf benchmarks.

What is the size overhead of the self-extracting LZMA wrapper?

The self-extracting Python loader, which embeds the compressed payload and decompresses it at runtime, weighs approximately 16.6 KB, as documented in records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/README.md. Combining LZMA with base85 encoding in this way saves roughly 43 KB per submission compared to packaging the same content without it.

Can LZMA compression be used with other packing formats besides binary and ternary?

Yes, the parameter-golf repository demonstrates that LZMA is format-agnostic regarding the packed bitstream, as long as the data exhibits exploitable entropy patterns. The code evaluates both bit-mask packing (for binary weights) and base-3 packing (for ternary weights) before selecting the smaller LZMA-compressed artifact. The repository also shows LZMA2 filters being applied to tar archives containing arbitrary model components, indicating the compression works effectively across different serialization formats provided the underlying data has sufficient redundancy or low entropy from quantization.
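The "compress both ways and keep the smaller" idea can be sketched as follows. The bit-mask layout here (one non-zero bit per weight followed by one sign bit per non-zero weight) and all function names are assumptions for illustration; the repository's actual bit-mask format and selection logic are not reproduced here.

```python
import lzma
import numpy as np

POWERS = np.array([1, 3, 9, 27, 81], dtype=np.uint8)

def base3_pack(w: np.ndarray) -> bytes:
    """Five trits per byte."""
    trits = (w + 1).astype(np.uint8)
    trits = np.pad(trits, (0, -trits.size % 5))
    return (trits.reshape(-1, 5) * POWERS).sum(axis=1).astype(np.uint8).tobytes()

def bitmask_pack(w: np.ndarray) -> bytes:
    """Assumed layout: non-zero mask bits, then sign bits for the non-zeros."""
    nonzero = w != 0
    signs = w[nonzero] > 0
    return np.packbits(nonzero).tobytes() + np.packbits(signs).tobytes()

def smallest_artifact(w: np.ndarray) -> tuple:
    """Compress each candidate packing with LZMA and return the smaller one."""
    candidates = {
        "base3": lzma.compress(base3_pack(w), preset=9),
        "bitmask": lzma.compress(bitmask_pack(w), preset=9),
    }
    name = min(candidates, key=lambda k: len(candidates[k]))
    return name, candidates[name]

rng = np.random.default_rng(0)
w = rng.choice(np.array([-1, 0, 1], dtype=np.int8), size=100_000, p=[0.05, 0.9, 0.05])
name, blob = smallest_artifact(w)
print(name, len(blob), "bytes")
```

Which packing wins depends on the sparsity and structure of the weights, which is why measuring both is more reliable than guessing.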

How does the LZMA preset level affect submission size and decompression speed?

The parameter-golf implementation uses preset 9, the highest compression level available in Python's lzma module. Higher presets increase the dictionary size and compression depth, yielding smaller payloads at the cost of slower compression time. Since submissions are compressed once offline but decompressed at runtime during evaluation, the trade-off favors maximum compression (preset 9) to minimize the final artifact size. Decompression speed remains fast enough for inference because LZMA decompression is typically much faster than compression and the weight tensors are loaded once at startup.
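The preset trade-off is easy to measure directly. The sketch below times compression and decompression at three presets on synthetic, skewed ternary data; the data distribution is an assumption, and timings vary by machine, so treat the output as indicative only.

```python
import lzma
import time
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical trit stream dominated by one value
data = rng.choice(
    np.array([0, 1, 2], dtype=np.uint8), size=200_000, p=[0.1, 0.8, 0.1]
).tobytes()

results = {}
for preset in (0, 6, 9):
    start = time.perf_counter()
    blob = lzma.compress(data, preset=preset)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = lzma.decompress(blob)
    decompress_s = time.perf_counter() - start

    assert restored == data
    results[preset] = (len(blob), compress_s, decompress_s)
    print(f"preset {preset}: {len(blob)} bytes, "
          f"compress {compress_s:.3f}s, decompress {decompress_s:.3f}s")
```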
