# How to Implement LLM Quantization for Inference: A Complete Guide

> Learn LLM quantization for inference: Shrink models to INT4/INT8 with GPTQ/AWQ for efficient deployment on limited hardware while preserving accuracy. A complete guide for AI engineers on GitHub.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-10

---

**LLM quantization for inference reduces model size from FP16 to INT4 or INT8 using symmetric scaling, per-channel quantization, and advanced algorithms like GPTQ and AWQ, enabling deployment on limited hardware while maintaining accuracy.**

The rohitg00/ai-engineering-from-scratch repository provides a comprehensive implementation of LLM quantization techniques in Phase 10, Lesson 11. This guide walks through the complete pipeline from number-format fundamentals to production deployment, demonstrating how to shrink model footprints by up to 75% while keeping perplexity degradation within acceptable bounds.

## Why Quantize? Memory and Speed Trade-offs

Quantization converts weights from 16-bit floating point (FP16) to lower-precision formats (INT8, INT4, or FP8), dramatically reducing GPU memory requirements and increasing inference throughput. The following table illustrates the trade-offs for a Llama 3 70B model:

| Format | Size | Speed-up (A100) | Perplexity Δ |
|--------|------|-----------------|--------------|
| FP16 | 140 GB | 1.0× baseline | 0 |
| INT8 | 70 GB | 1.5× | < 0.5 |
| INT4 | 35 GB | 2.5× | 1–2 |

According to the lesson documentation in [`phases/10-llms-from-scratch/11-quantization/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/11-quantization/docs/en.md), aggressive quantization of weights to INT4 can reduce memory usage fourfold with minimal quality loss when combined with per-channel scaling.
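The sizes in the table follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces them under the assumption of exactly 70 billion parameters and no overhead for scales or metadata; `weight_memory_gb` is a hypothetical helper, not a function from the repository.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores per-channel scales, zero-points, and file metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Assumed parameter count for a 70B model
NUM_PARAMS = 70e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Real checkpoints come out slightly larger because every channel or block also stores its scale.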

## The Quantization Sensitivity Hierarchy

Not all model components tolerate quantization equally. The repository defines a sensitivity hierarchy visualized in the lesson documentation, where errors propagate and amplify through the network:

- **Weights** tolerate aggressive quantization (INT4) with per-channel scales.
- **Activations** require INT8 or FP16; outlier channels must remain in higher precision.
- **KV-cache** benefits from FP8/INT8 to save memory during long contexts.
- **Attention logits** must stay in FP16 or BF16 due to softmax sensitivity.

This hierarchy determines which quantization strategy applies to each tensor type during the conversion process.
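As a minimal sketch of the activation rule above, the snippet below flags outlier channels in a batch of calibration activations and separates them from the channels that can go to INT8. The function name, the synthetic data, and the threshold of 6.0 are assumptions for illustration; the repository may detect outliers differently.

```python
import numpy as np


def find_outlier_channels(activations, threshold=6.0):
    """Return indices of channels whose peak magnitude exceeds `threshold`.

    `activations` has shape (tokens, channels). The threshold is an
    illustrative assumption, not a value taken from the lesson.
    """
    peak = np.max(np.abs(activations), axis=0)
    return np.flatnonzero(peak > threshold)


rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64))   # synthetic (tokens, channels) activations
acts[:, [3, 17]] *= 20.0            # inject two synthetic outlier channels

outliers = find_outlier_channels(acts)

# Split the tensor: outlier channels stay in FP16, the rest can be quantized.
mask = np.zeros(acts.shape[1], dtype=bool)
mask[outliers] = True
kept_fp16 = acts[:, mask].astype(np.float16)
to_int8 = acts[:, ~mask]

print("outlier channels:", outliers.tolist())
print("FP16 part:", kept_fp16.shape, "INT8 candidates:", to_int8.shape)
```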

## Core Quantization Operations

The foundational math lives in [`phases/10-llms-from-scratch/11-quantization/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/11-quantization/code/main.py), implementing symmetric quantization as the baseline approach.

### Symmetric Per-Tensor Quantization

Symmetric quantization uses a single scale factor derived from the maximum absolute value:

```python
import numpy as np

def quantize_symmetric(tensor, num_bits=8):
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    abs_max = np.max(np.abs(tensor))
    if abs_max == 0:
        return np.zeros_like(tensor, dtype=np.int32), 1.0
    scale = abs_max / qmax
    quantized = np.clip(np.round(tensor / scale), qmin, qmax).astype(np.int32)
    return quantized, float(scale)

def dequantize_symmetric(quantized, scale):
    return quantized.astype(np.float64) * scale

```

This method works well for uniformly distributed weights but can suffer from high reconstruction error when outliers exist.
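To make the outlier problem concrete, here is a small self-contained sketch that measures round-trip error with and without a single large value. It re-implements the per-tensor scheme inline so it runs on its own; `roundtrip_error` and the synthetic weights are illustrative assumptions.

```python
import numpy as np


def roundtrip_error(tensor, num_bits=8):
    """Mean absolute error after symmetric per-tensor quantize/dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(tensor)) / qmax
    q = np.clip(np.round(tensor / scale), -qmax - 1, qmax)
    return float(np.mean(np.abs(tensor - q * scale)))


rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=4096)

with_outlier = weights.copy()
with_outlier[0] = 1.0   # one large value stretches the shared scale

err_clean = roundtrip_error(weights)
err_outlier = roundtrip_error(with_outlier)

print(f"no outlier:   {err_clean:.6f}")
print(f"with outlier: {err_outlier:.6f}")
```

Because the scale is tied to the largest magnitude, one outlier widens every quantization step, so all the other weights are rounded more coarsely.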

### Per-Channel Quantization

Per-channel scaling reduces error by computing a separate scale for each channel, so a large value in one channel no longer coarsens the grid for the others. The implementation in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/11-quantization/code/main.py) handles both row-wise and column-wise scaling:

```python
def quantize_per_channel(tensor, num_bits=8, axis=0):
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    if axis == 0:
        abs_max = np.max(np.abs(tensor), axis=1, keepdims=True)
    else:
        abs_max = np.max(np.abs(tensor), axis=0, keepdims=True)
    abs_max = np.where(abs_max == 0, 1.0, abs_max)
    scales = abs_max / qmax
    quantized = np.clip(np.round(tensor / scales), qmin, qmax).astype(np.int32)
    return quantized, scales.squeeze()

```

**Per-channel quantization** captures variance across dimensions better than per-tensor methods, enabling INT4 compression with lower reconstruction loss.
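The following sketch compares the two schemes at 4 bits on a synthetic weight matrix whose rows differ widely in magnitude. It is self-contained rather than importing the repository code, and the helper name and test matrix are assumptions chosen to make the effect visible.

```python
import numpy as np


def dequant_mse(w, scales, num_bits=4):
    """Mean squared error after symmetric quantize/dequantize with `scales`."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(w / scales), -qmax - 1, qmax)
    return float(np.mean((w - q * scales) ** 2))


rng = np.random.default_rng(0)
# Four output channels (rows) with very different magnitudes
row_std = np.array([0.01, 0.05, 0.2, 1.0])[:, None]
w = rng.normal(size=(4, 256)) * row_std

qmax = 2 ** (4 - 1) - 1
per_tensor_scale = np.max(np.abs(w)) / qmax                        # one scalar
per_channel_scales = np.max(np.abs(w), axis=1, keepdims=True) / qmax  # one per row

mse_tensor = dequant_mse(w, per_tensor_scale)
mse_channel = dequant_mse(w, per_channel_scales)

print(f"per-tensor MSE:  {mse_tensor:.6f}")
print(f"per-channel MSE: {mse_channel:.6f}")
```

With a single shared scale, the small-magnitude rows collapse to a handful of quantization levels; per-row scales give each row the full 4-bit range.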

## Post-Training Quantization Algorithms

The repository implements simplified simulations of two widely used PTQ methods that outperform naive rounding: **GPTQ** (Hessian-guided) and **AWQ** (activation-aware).

### GPTQ: Hessian-Based Column Quantization

GPTQ quantizes a weight matrix one column at a time, using an approximate Hessian computed from calibration data to adjust the not-yet-quantized columns so that they absorb each column's rounding error. The `simulated_gptq` function (lines 775–825 in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/11-quantization/code/main.py)) implements this column-wise optimization:

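```python
import numpy as np

# simulated_gptq is defined in the lesson's main.py; weights_fp16 is the
# layer's FP16 weight matrix and is assumed to be loaded already.
# Synthetic stand-in for a list of real calibration activation tensors:
calibration_inputs = [np.random.randn(8, 1024) * 0.1 for _ in range(32)]

q_int4, scales, gptq_info = simulated_gptq(
    weights_fp16,
    calibration_inputs,
    num_bits=4
)
```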

GPTQ's objective is the layer's output error on the calibration inputs, not the raw difference between the original and quantized weights; the Hessian supplies the second-order information needed to minimize it.
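For readers who want to see the column-by-column update in isolation, here is a compact NumPy sketch of the GPTQ idea as described in the published algorithm. It is not the repository's `simulated_gptq`: `toy_gptq`, `rtn_rows`, the damping value, and the synthetic correlated inputs are all assumptions for illustration.

```python
import numpy as np


def rtn_rows(w, num_bits=4):
    """Round-to-nearest with one symmetric scale per output row."""
    qmax = 2 ** (num_bits - 1) - 1
    scales = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(w / scales), -qmax - 1, qmax)
    return q * scales, scales


def toy_gptq(w, x, num_bits=4, damp=0.01):
    """Simplified GPTQ for w of shape (out, in) and inputs x of shape (n, in)."""
    qmax = 2 ** (num_bits - 1) - 1
    w = w.astype(np.float64).copy()
    _, scales = rtn_rows(w, num_bits)           # fix the grid up front
    h = 2.0 * x.T @ x                           # Hessian of the output error
    h += damp * np.mean(np.diag(h)) * np.eye(h.shape[0])  # damping for stability
    u = np.linalg.cholesky(np.linalg.inv(h)).T  # upper factor: inv(h) = u.T @ u
    w_q = np.zeros_like(w)
    for i in range(w.shape[1]):
        col = w[:, i]
        q = np.clip(np.round(col / scales[:, 0]), -qmax - 1, qmax) * scales[:, 0]
        w_q[:, i] = q
        err = (col - q) / u[i, i]
        # Push this column's rounding error onto the columns still to come.
        w[:, i + 1:] -= np.outer(err, u[i, i + 1:])
    return w_q


rng = np.random.default_rng(0)
# Correlated calibration activations: 4 latent factors mixed into 32 channels
x = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 32))
x += 0.05 * rng.normal(size=(256, 32))
w = rng.normal(size=(16, 32)) * 0.1

w_rtn, _ = rtn_rows(w)
w_gptq = toy_gptq(w, x)
err_rtn = np.linalg.norm(x @ w.T - x @ w_rtn.T)
err_gptq = np.linalg.norm(x @ w.T - x @ w_gptq.T)
print(f"output error, round-to-nearest: {err_rtn:.4f}")
print(f"output error, toy GPTQ:         {err_gptq:.4f}")
```

The comparison is made on layer outputs rather than on the weights themselves, since that is the quantity the error compensation targets. The benefit depends on the input channels being correlated, which is why the synthetic activations are built from a few shared factors.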

### AWQ: Activation-Aware Weight Scaling

**AWQ** (Activation-aware Weight Quantization) identifies "salient weights"—those multiplying large activation magnitudes—and scales them up before quantization. The `simulated_awq` function (lines 845–865) demonstrates this three-step process:

1. Compute activation magnitudes from calibration data
2. Scale salient weights by a factor (typically 1.5–2.0) before quantization
3. Reverse the scaling during dequantization

```python
weights_awq, awq_info = simulated_awq(
    weights_fp16, 
    calibration_inputs, 
    num_bits=4
)

```

AWQ is often competitive with or better than GPTQ on downstream tasks because it preserves the weights that have a disproportionate impact on the output distribution.
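The three steps listed above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the repository's `simulated_awq`: the function names, the 10% salient fraction, and the candidate factors are assumptions, and the factor is chosen by a small search that includes 1.0 (plain rounding) as a fallback.

```python
import numpy as np


def quantize_rows(w, num_bits=4):
    """Symmetric round-to-nearest with one scale per output row, dequantized."""
    qmax = 2 ** (num_bits - 1) - 1
    scales = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    return np.clip(np.round(w / scales), -qmax - 1, qmax) * scales


def toy_awq(w, x, num_bits=4, top_frac=0.1, factors=(1.0, 1.5, 2.0)):
    """w: (out, in) weights, x: (n, in) calibration activations."""
    act_mag = np.mean(np.abs(x), axis=0)            # step 1: activation magnitudes
    k = max(1, int(top_frac * w.shape[1]))
    salient = np.argsort(act_mag)[-k:]              # input channels with largest activations
    ref = x @ w.T
    best = None
    for f in factors:
        s = np.ones(w.shape[1])
        s[salient] = f                              # step 2: scale salient columns up
        w_hat = quantize_rows(w * s, num_bits) / s  # step 3: undo scaling after dequantization
        err = float(np.linalg.norm(ref - x @ w_hat.T))
        if best is None or err < best[0]:
            best = (err, f, w_hat)
    return best[2], {"factor": best[1], "salient": salient, "error": best[0]}


rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))
x[:, :4] *= 10.0                                    # four channels carry large activations
w = rng.normal(size=(32, 64)) * 0.05

w_awq, info = toy_awq(w, x)
err_rtn = float(np.linalg.norm(x @ w.T - x @ quantize_rows(w).T))
print("chosen factor:", info["factor"])
print(f"output error, plain rounding: {err_rtn:.4f}")
print(f"output error, toy AWQ:        {info['error']:.4f}")
```

Scaling a salient column up before rounding shrinks its relative rounding error, at the cost of a slightly coarser grid for the rest of the row; the search keeps whichever factor gives the lowest output error.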

## Deployment Formats and Serving

Once quantized, models must be packaged for target hardware. The repository covers two primary deployment paths.

### GGUF for CPU Inference

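**GGUF** is the self-contained format used by llama.cpp, supporting mixed-precision layers and embedding tokenizer metadata. Conversion takes two steps with the tools from the llama.cpp repository: `convert_hf_to_gguf.py` turns a local HuggingFace checkpoint into a 16-bit GGUF file, and `llama-quantize` then compresses it to `Q4_K_M`:

```bash
# Run inside a llama.cpp checkout, which ships convert_hf_to_gguf.py
pip install -r requirements.txt

# Step 1: convert a locally downloaded HuggingFace checkpoint to 16-bit GGUF
python convert_hf_to_gguf.py path/to/Llama-3.1-8B \
    --outtype f16 \
    --outfile llama-8b-f16.gguf

# Step 2: quantize to Q4_K_M (llama-quantize is built from llama.cpp;
# its location depends on your build directory)
llama-quantize llama-8b-f16.gguf llama-8b-q4km.gguf Q4_K_M
```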

The `Q4_K_M` type stores most tensors in 4-bit (`Q4_K`) while keeping a subset of the more sensitive attention and feed-forward tensors in 6-bit (`Q6_K`), balancing quality and compression.
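GGUF quantization types operate on small blocks of weights, each carrying its own scale. The sketch below shows that general idea with one FP16 scale per block of 32 values; it is a simplified illustration, not the actual on-disk layout of `Q4_K_M`, and the function names are hypothetical.

```python
import numpy as np


def quantize_blocks(weights, block_size=32, num_bits=4):
    """Symmetric quantization with one FP16 scale per block of `block_size` values."""
    qmax = 2 ** (num_bits - 1) - 1
    blocks = weights.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / qmax
    # Round scales to FP16 first so quantize and dequantize use the same value.
    scales = scales.astype(np.float16).astype(np.float64)
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales


def bits_per_weight(block_size=32, num_bits=4, scale_bits=16):
    """Effective storage cost once the per-block scale is amortized."""
    return num_bits + scale_bits / block_size


rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=1024)
q, scales = quantize_blocks(weights)
restored = (q * scales).reshape(-1)

print("blocks:", q.shape, "bits per weight:", bits_per_weight())
print(f"max abs error: {np.max(np.abs(weights - restored)):.6f}")
```

The amortized scale is why "4-bit" files come out somewhat larger than four bits per parameter.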

### vLLM for GPU Serving

For NVIDIA GPUs, serve quantized models directly through vLLM without manual dequantization:

```bash
pip install vllm

vllm serve path/to/llama-8b-gptq-int4 \
    --quantization gptq \
    --dtype half \
    --max-model-len 8192

```

vLLM automatically fuses the dequantization and matrix multiplication operations, minimizing latency overhead.
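Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the default address `http://localhost:8000` and that the model name matches the path passed to `vllm serve`; the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the server's /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, model, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]
```

With the server from the command above running, `complete("Explain quantization in one sentence.", "path/to/llama-8b-gptq-int4")` would return the generated text.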

## End-to-End Implementation Pipeline

Follow this six-step workflow to quantize any LLM for inference:

1. **Select target hardware** – Determine if deploying to CPU (GGUF), consumer GPU (INT4), or datacenter GPU (FP8).
2. **Gather calibration data** – Run 200–500 representative prompts through the FP16 model to collect activation statistics.
3. **Apply sensitivity analysis** – Identify outlier channels in activations that require FP16 preservation.
4. **Execute PTQ** – Run `simulated_gptq` or `simulated_awq` using the calibration set to generate per-channel scales.
5. **Validate quality** – Compute perplexity on held-out data and run downstream benchmarks (MMLU, HumanEval).
6. **Export and serve** – Package as GGUF for llama.cpp or as a GPTQ-format checkpoint for vLLM with `--quantization gptq`.

The `__main__` block in [`phases/10-llms-from-scratch/11-quantization/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/11-quantization/code/main.py) provides a complete reference script executing this entire pipeline.
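Step 5 relies on perplexity, which is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming you already have natural-log probabilities for each reference token from the FP16 and quantized models (the arrays below are synthetic placeholders):

```python
import numpy as np


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return float(np.exp(-np.mean(token_log_probs)))


# Sanity check: a model that is uniform over a 50-token vocabulary
vocab_size = 50
uniform_log_probs = np.full(1000, np.log(1.0 / vocab_size))
ppl_uniform = perplexity(uniform_log_probs)

# Synthetic placeholders for per-token log-probs from two models
rng = np.random.default_rng(0)
fp16_log_probs = -np.abs(rng.normal(loc=2.0, scale=0.5, size=1000))
quant_log_probs = fp16_log_probs - 0.05   # slightly worse on every token

delta = perplexity(quant_log_probs) - perplexity(fp16_log_probs)
print(f"uniform perplexity: {ppl_uniform:.2f}")
print(f"perplexity increase after quantization: {delta:.3f}")
```

Comparing the two values on the same held-out text gives the perplexity delta reported in the table at the top of this guide.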

## Summary

- **LLM quantization for inference** converts FP16 weights to INT8 or INT4 using symmetric or per-channel scaling, reducing memory by 50–75%.
- **Per-channel quantization** outperforms per-tensor methods by capturing variance across dimensions, essential for INT4 deployment.
- **GPTQ** uses Hessian approximation to optimize column-wise quantization, while **AWQ** preserves salient weights through activation-aware scaling.
- **GGUF** enables CPU inference via llama.cpp, while **vLLM** serves quantized models on GPUs with automatic kernel fusion.
- The rohitg00/ai-engineering-from-scratch repository provides complete implementations in [`phases/10-llms-from-scratch/11-quantization/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/11-quantization/code/main.py).

## Frequently Asked Questions

### What is the difference between symmetric and asymmetric quantization?

Symmetric quantization uses a single scale factor centered at zero, assuming the weight distribution is balanced around zero. Asymmetric quantization uses separate scale and zero-point values to handle biased distributions. For transformers, symmetric quantization is preferred because weights are typically centered, and it avoids the computational overhead of zero-point subtraction during dequantization.
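A minimal sketch of the asymmetric variant, for comparison with the symmetric code earlier in this guide. The function names are hypothetical, and the example tensor is deliberately one-sided (ReLU-like) to show where a zero-point helps.

```python
import numpy as np


def quantize_asymmetric(tensor, num_bits=8):
    """Map [min, max] (extended to include 0) onto unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    t_min = min(float(np.min(tensor)), 0.0)
    t_max = max(float(np.max(tensor)), 0.0)
    if t_max == t_min:
        return np.zeros_like(tensor, dtype=np.int32), 1.0, 0
    scale = (t_max - t_min) / (qmax - qmin)
    zero_point = int(np.clip(np.round(qmin - t_min / scale), qmin, qmax))
    q = np.clip(np.round(tensor / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    return (q.astype(np.float64) - zero_point) * scale


def symmetric_roundtrip(tensor, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(tensor)) / qmax
    return np.clip(np.round(tensor / scale), -qmax - 1, qmax) * scale


rng = np.random.default_rng(0)
one_sided = np.maximum(rng.normal(size=2048), 0.0)   # ReLU-like, no negative values

q, scale, zero_point = quantize_asymmetric(one_sided)
mse_asym = float(np.mean((one_sided - dequantize_asymmetric(q, scale, zero_point)) ** 2))
mse_sym = float(np.mean((one_sided - symmetric_roundtrip(one_sided)) ** 2))
print(f"symmetric MSE:  {mse_sym:.8f}")
print(f"asymmetric MSE: {mse_asym:.8f}")
```

On one-sided data the symmetric grid wastes its negative half, while the asymmetric grid spends all 256 levels on the occupied range. For zero-centered weights the difference largely disappears, which is why the simpler symmetric form is preferred there.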

### How much accuracy do I lose with INT4 quantization?

INT4 quantization typically increases perplexity by 1–2 points compared to FP16, while INT8 adds less than 0.5 perplexity points. Using **GPTQ** or **AWQ** rather than naive rounding can reduce this degradation by half. The repository's benchmark suite in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/11-quantization/code/main.py) demonstrates that AWQ-INT4 often matches FP16 performance on downstream tasks like MMLU.

### Can I quantize the KV-cache separately from the weights?

Yes, the KV-cache can be quantized to FP8 or INT8 independently using per-token scaling factors, while keeping weights in INT4. This is crucial for long-context inference where the KV-cache dominates memory usage. The sensitivity hierarchy in the lesson documentation recommends treating KV-cache as more sensitive than weights but less sensitive than attention logits.
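The sketch below illustrates both points: how quickly the cache grows with context length, and what per-token INT8 scaling looks like. The layer count, KV-head count, and head dimension are illustrative assumptions for a large grouped-query-attention model, and the helper names are hypothetical.

```python
import numpy as np


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """Cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def quantize_per_token(kv, num_bits=8):
    """kv has shape (tokens, head_dim); one symmetric scale per token row."""
    qmax = 2 ** (num_bits - 1) - 1
    scales = np.max(np.abs(kv), axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(kv / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales


# Assumed model shape: 80 layers, 8 KV heads, head dimension 128
fp16_bytes = kv_cache_bytes(80, 8, 128, seq_len=8192, bytes_per_value=2)
int8_bytes = kv_cache_bytes(80, 8, 128, seq_len=8192, bytes_per_value=1)
print(f"FP16 cache: {fp16_bytes / 2**30:.2f} GiB, INT8 cache: {int8_bytes / 2**30:.2f} GiB")

rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 128))           # synthetic keys for 16 tokens
q, scales = quantize_per_token(keys)
max_err = float(np.max(np.abs(keys - q * scales)))
print(f"max abs round-trip error: {max_err:.5f}")
```

The byte counts ignore the small overhead of storing one scale per token. Per-token scales suit the cache because new rows are appended one token at a time, so each row can be quantized as it arrives.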

### What hardware supports INT4 inference natively?

On NVIDIA GPUs, weight-only INT4 schemes such as GPTQ and AWQ store weights in 4 bits and dequantize them to FP16 inside the kernel, so the matrix multiplication itself runs on FP16 Tensor Cores and the gain comes mainly from reduced memory traffic. CPUs run GGUF models through the SIMD-optimized kernels in llama.cpp. On Apple Silicon (M1/M2/M3), llama.cpp runs GGUF quantization types such as Q4_K_M through its Metal GPU backend rather than the ANE (Apple Neural Engine).