# How to Fine-Tune LLMs Using LoRA and QLoRA: A Complete Implementation Guide

> Master LoRA and QLoRA to fine-tune LLMs efficiently. Learn how these techniques drastically cut VRAM needs while preserving model quality. Implement now.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-11

---

**Fine-tuning large language models with LoRA (Low-Rank Adaptation) trains small adapter matrices instead of the entire base model, and QLoRA (Quantized LoRA) additionally stores the frozen base in 4-bit precision, cutting base-weight memory by up to 75% relative to fp16 while maintaining near-full-model quality.**

The `ai-engineering-from-scratch` repository provides a production-ready framework for parameter-efficient fine-tuning, including decision matrices for hardware constraints and a complete implementation in [`phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py). This guide distills the best practices from the repository's [`prompt-lora-advisor.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/prompt-lora-advisor.md) file and source code to help you implement these techniques correctly.

## Selecting the Right Fine-Tuning Method Based on VRAM

Your available VRAM determines whether you should use full fine-tuning, LoRA, or QLoRA. According to the decision table in [`phases/11-llm-engineering/08-fine-tuning-lora/outputs/prompt-lora-advisor.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/outputs/prompt-lora-advisor.md), match your hardware to the method as follows:

- **Full fine-tuning**: Requires **≥ 2× model size** (fp16). Only viable with enterprise-grade GPUs or multi-GPU setups.
- **LoRA**: Requires **≥ model size** (fp16). Keeps the base model in fp16 and adds low-rank trainable adapters.
- **QLoRA**: Requires **≥ model size / 4**. Quantizes the frozen base model to 4-bit (NF4) while keeping adapters in fp16.
- **Below threshold**: Use a smaller base model or CPU off-loading strategies.

If your VRAM is tight but above the QLoRA threshold, quantize the base model using the `quantize_model()` function implemented in [`lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/lora.py).
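The thresholds above can be expressed as a small lookup. The sketch below is illustrative only: `choose_method` and `fp16_size_gb` are hypothetical helper names, not functions from the repository, and the size estimate counts fp16 weights only (2 bytes per parameter), ignoring activations and optimizer state.

```python
def fp16_size_gb(n_params_billion: float) -> float:
    """Approximate fp16 weight size: 2 bytes per parameter, so 1B params is about 2 GB."""
    return 2.0 * n_params_billion


def choose_method(vram_gb: float, model_fp16_gb: float) -> str:
    """Map available VRAM to a method using the thresholds from the list above."""
    if vram_gb >= 2 * model_fp16_gb:
        return "full fine-tuning"
    if vram_gb >= model_fp16_gb:
        return "LoRA"
    if vram_gb >= model_fp16_gb / 4:
        return "QLoRA"
    return "smaller base model or CPU off-loading"


# Example: a 7B-parameter model is about 14 GB in fp16.
model_gb = fp16_size_gb(7)
for vram in (2, 8, 16, 32):
    print(f"{vram:>2} GB VRAM -> {choose_method(vram, model_gb)}")
```

Treat the result as a first filter rather than a guarantee, since real memory use also depends on sequence length, batch size, and optimizer choice.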

## Configuring LoRA Rank (r) and Alpha (α) Parameters

The **rank** (`r`) controls the expressiveness of your adapters, while **alpha** (`α`) scales the learning signal. The repository recommends specific values based on task complexity:

| Task type | Recommended rank (r) | Alpha (α) |
|-----------|---------------------|-----------|
| Binary classification, sentiment analysis | 4 | 8 |
| Single-domain Q&A, summarization, translation | 8 | 16 |
| Multi-domain instruction following, chat | 16 | 32 |
| Code generation, complex reasoning | 32 | 64 |
| Experimental/ablation only | 64 | 128 |

Follow the rule **α ≈ 2 × r**. Adjust downward (α = r) if training becomes unstable, or upward (α = 4 × r) if convergence is too slow. These guidelines are extracted from the rank selection tables in [`prompt-lora-advisor.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/prompt-lora-advisor.md).
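To make the effect of `r` and `α` concrete, the following sketch computes the adapter scaling factor (`alpha / rank`) and the number of trainable parameters a single adapted linear layer gains. The helper names and the 4096 × 4096 layer size are assumptions chosen for illustration; they do not come from the repository.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Factor applied to the adapter product B @ A."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # assumed projection size, for illustration only
full = d_in * d_out
for rank, alpha in [(4, 8), (8, 16), (16, 32), (32, 64)]:
    added = lora_param_count(d_in, d_out, rank)
    print(
        f"r={rank:<2} alpha={alpha:<2} scaling={lora_scaling(rank, alpha):.1f} "
        f"trainable={added:,} ({100 * added / full:.2f}% of the full matrix)"
    )
```

With the α ≈ 2 × r rule the scaling factor stays at 2.0 across ranks, so changing the rank changes adapter capacity without also changing the magnitude of the update.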

## Selecting Target Modules for Adapter Injection

Not all layers require adaptation. Start with the minimal viable set and expand only if validation metrics indicate underfitting:

- **Minimum viable**: Target `q_proj` and `v_proj` (attention query and value projections).
- **Standard recommendation**: Include `q_proj`, `k_proj`, `v_proj`, and `o_proj` (all attention projections).
- **Maximum coverage**: Add MLP layers including `gate_proj`, `up_proj`, and `down_proj`.

In [`lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/lora.py), the `inject_lora()` function accepts a `target_modules` list where you specify which layers to adapt. For most instruction-tuning tasks, the Standard set provides the best efficiency-to-quality ratio.
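The three target sets can be kept as named presets and matched against a model's module names. This is a minimal sketch under assumptions: the preset dictionary, the `select_modules` helper, and the example module names are hypothetical, and the matching rule used by the repository's `inject_lora()` may differ.

```python
TARGET_SETS = {
    "minimum": ["q_proj", "v_proj"],
    "standard": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "maximum": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def select_modules(module_names, targets):
    """Keep module names whose last dotted component is one of the targets."""
    wanted = set(targets)
    return [name for name in module_names if name.split(".")[-1] in wanted]


# Hypothetical module names for one transformer block.
names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
    "layers.0.input_layernorm",
]

for preset, targets in TARGET_SETS.items():
    print(preset, len(select_modules(names, targets)))
```

Listing the matched names before training is a cheap way to confirm that the adapters land on the layers you intended.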

## Hyperparameter Guidelines for Training

LoRA and QLoRA typically use higher learning rates than full fine-tuning, because only the small adapter matrices are updated while the base weights stay frozen:

| Method | Learning rate range | Effective batch size |
|--------|--------------------|---------------------|
| Full fine-tuning | 1e‑5 – 5e‑5 | 16 – 64 |
| LoRA (fp16 base) | 5e‑5 – 2e‑4 | 16 – 64 |
| QLoRA (4‑bit base) | 1e‑4 – 3e‑4 | 16 – 64 |

When VRAM is constrained, set `per_device_batch_size=1` and increase `gradient_accumulation_steps` to 16 or 32 to maintain the effective batch size.
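The relationship between these settings is simple arithmetic: effective batch size = per-device batch size × number of devices × accumulation steps. A small sketch, using a hypothetical helper name:

```python
def accumulation_steps(effective_batch: int, per_device_batch: int, n_devices: int = 1) -> int:
    """Gradient accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch * n_devices
    if effective_batch % per_step != 0:
        raise ValueError(
            f"effective batch {effective_batch} is not a multiple of {per_step}"
        )
    return effective_batch // per_step


# Single GPU, one example per step, target effective batch of 32.
print(accumulation_steps(32, per_device_batch=1))

# Four GPUs, two examples per device, target effective batch of 64.
print(accumulation_steps(64, per_device_batch=2, n_devices=4))
```

Accumulation trades wall-clock time for memory: each optimizer update sees the same number of examples, but they are processed in smaller chunks.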

Regularization via dropout prevents overfitting on small datasets:

- **< 5K examples**: `lora_dropout=0.10`
- **5K – 100K examples**: `lora_dropout=0.05` (default)
- **> 100K examples**: `lora_dropout=0.00`
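The dropout bands above translate directly into a helper. The function name is hypothetical, and the boundaries follow the list as written (exactly 5K and exactly 100K examples fall in the middle band).

```python
def pick_lora_dropout(n_examples: int) -> float:
    """Choose lora_dropout from the dataset size, following the bands listed above."""
    if n_examples < 5_000:
        return 0.10
    if n_examples <= 100_000:
        return 0.05
    return 0.0


for n in (1_200, 20_000, 250_000):
    print(f"{n:>7} examples -> lora_dropout={pick_lora_dropout(n):.2f}")
```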

## Implementing QLoRA with NF4 Quantization

QLoRA quantizes the frozen base model to 4-bit Normal Float (NF4) while preserving adapter weights in fp16. The implementation in [`lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/lora.py) follows this workflow:

1. Freeze all base parameters by setting `requires_grad=False`.
2. Apply `quantize_model()` to convert frozen tensors to NF4 using per-block scaling factors.
3. Inject LoRA adapters that remain trainable in fp16.

The quantization utilities `quantize_to_nf4` and `dequantize_from_nf4` handle the bit-packing and scaling automatically within the `LoRALayer` class.
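To illustrate what block-wise NF4 quantization does, here is a minimal pure-Python sketch. It is not the repository's `quantize_to_nf4`: the function names are hypothetical, the 16 code values are the NF4 levels published with QLoRA rounded to four decimals, the block size of 64 is an assumption, and the packing of two 4-bit codes into one byte is omitted.

```python
# NF4 code values (rounded); index 7 is exactly zero, the ends are exactly -1 and 1.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(values):
    """Absmax-scale one block into [-1, 1], then snap each value to the nearest level."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for all-zero blocks
    codes = []
    for v in values:
        x = v / scale
        codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x)))
    return scale, codes


def quantize(values, block_size=64):
    """Split a flat list of weights into blocks; store one scale plus 4-bit codes per block."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    out = []
    for scale, codes in blocks:
        out.extend(NF4_LEVELS[c] * scale for c in codes)
    return out


weights = [0.5, -0.25, 0.0, 0.1, 0.02, -0.01, 0.03, -0.04]
blocks = quantize(weights, block_size=4)
restored = dequantize(blocks)
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f}")
```

Because each block has its own scale, a single large weight only reduces precision for the values in its own block rather than for the whole tensor.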

## Training Workflow and Evaluation

Monitor three key signals during training:

- **Loss curves**: Should show steady decrease without spikes.
- **Gradient norms**: Watch for sudden growth (exploding gradients, a sign of instability) or collapse toward zero (vanishing gradients).
- **Evaluation metrics**: Compare base model, LoRA-adapted model, and a fully fine-tuned reference on a held-out set of ~200 examples using accuracy, BLEU, or ROUGE scores as appropriate.

Run evaluation after each epoch or every 500 steps to detect overfitting early, especially when using higher ranks (r ≥ 32).
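The first two signals can be checked with a few lines of code. The sketch below is framework-agnostic and uses plain lists of floats; the helper names, the window of 5 steps, and the 1.5× spike threshold are assumptions for illustration, not values taken from the repository.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values, given one list of floats per parameter tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def find_loss_spikes(losses, window=5, factor=1.5):
    """Return step indices where the loss exceeds `factor` times the mean of the previous `window` steps."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = sum(losses[i - window:i]) / window
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes


losses = [1.0, 0.9, 0.8, 0.7, 0.6, 2.0, 0.5]
print("spike steps:", find_loss_spikes(losses))
print("grad norm:", global_grad_norm([[3.0], [4.0]]))
```

Logging the gradient norm at every optimizer step and reviewing flagged spike steps alongside the evaluation metrics makes it easier to tell a noisy batch from genuine divergence.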

## Persisting and Serving LoRA Adapters

Adapters are significantly smaller than full models, enabling efficient storage and multi-task serving:

- **Saving**: `save_lora_adapter()` in [`lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/lora.py) (lines 55‑71) stores only the LoRA matrices (`A` and `B`) along with rank and alpha metadata.
- **Loading**: `load_lora_adapter()` (lines 166‑176) restores adapters into a model that already has LoRA layers injected via `inject_lora()`.
- **Multi-adapter serving**: Train separate adapters on disjoint data splits, then switch adapters at inference time for task-specific routing without reloading the base model.

To merge adapters permanently for faster inference, call `merge_lora_weights()`, which adds the scaled adapter product `(α / r) · BA` to the original weight matrix.
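Merging works because the adapter path is linear: applying `W` and the adapter separately gives the same output as applying the single merged matrix `W + (alpha / rank) * B @ A`. The sketch below checks this on tiny hand-written matrices. It is a standalone illustration with hypothetical helper names, not the repository's `merge_lora_weights()`.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    BA = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, BA)
    ]


# Shapes: W is (d_out, d_in), A is (rank, d_in), B is (d_out, rank).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
alpha, rank = 2.0, 1
x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Unmerged forward pass: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_unmerged = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged forward pass: a single matrix multiply.
W_merged = merge(W, A, B, alpha, rank)
y_merged = matmul(W_merged, x)

print(y_unmerged, y_merged)
```

The merged model needs one matrix multiply per layer instead of three, which is where the inference speed-up comes from.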

## Complete Implementation Example

The following workflow demonstrates end-to-end fine-tuning using the repository's implementation:

```python
# Install dependencies first:
#   pip install -r requirements.txt

import os
import sys
import tempfile

# The directory names contain hyphens and leading digits, so they cannot be
# written as a dotted import path. Add the code directory to sys.path instead
# (run this script from the repository root).
sys.path.insert(0, "phases/11-llm-engineering/08-fine-tuning-lora/code")

from lora import (
    create_demo_model,
    inject_lora,
    train_lora,
    quantize_model,
    merge_lora_weights,
    save_lora_adapter,
    load_lora_adapter,
    create_demo_data,
)

# Initialize base model (or load from HuggingFace)
model = create_demo_model()

# Inject LoRA adapters: rank=8, alpha=16, targeting layers 0 and 2
lora_layers = inject_lora(
    model,
    target_modules=["0", "2"],
    rank=8,
    alpha=16,
)

# Prepare training data
data = create_demo_data()

# Train adapters only (base model remains frozen)
losses = train_lora(model, data, epochs=10, lr=1e-3, batch_size=4)

# For QLoRA: quantize base to NF4, keep adapters fp16
quant_state = quantize_model(model)
# Training proceeds identically with train_lora()

# Optional: merge weights for deployment speed
merge_lora_weights(model)

# Persist adapters (typically <10MB vs GBs for full model)
with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
    tmp_path = f.name  # close the handle before writing to the path
n_saved = save_lora_adapter(model, tmp_path)
print(f"Saved {n_saved} LoRA tensors ({os.path.getsize(tmp_path)/1024:.1f} KB)")

# Load for later inference
load_lora_adapter(model, tmp_path)
```

## Summary

- **Choose QLoRA** when VRAM is below the fp16 model size but at least a quarter of it; use standard LoRA when you can fit the fp16 base model.
- **Set rank between 4 and 32** depending on task complexity, with alpha approximately twice the rank.
- **Target attention projections first** (`q_proj`, `v_proj`), expanding to MLP layers only if quality is insufficient.
- **Quantize to NF4** using `quantize_model()` before training to cut the frozen base model's weight memory by roughly 75% relative to fp16.
- **Monitor loss curves and gradient norms** closely when using aggressive learning rates (1e‑4 – 3e‑4) typical for QLoRA.
- **Persist only adapter weights** via `save_lora_adapter()` for efficient storage and multi-task deployment.

## Frequently Asked Questions

### What is the difference between LoRA and QLoRA?

**LoRA** keeps the base model in 16-bit floating point and adds trainable low-rank matrices to specific layers. **QLoRA** first quantizes the frozen base model to 4-bit Normal Float (NF4), drastically reducing memory usage while keeping the LoRA adapters in fp16. Under the VRAM thresholds in the `ai-engineering-from-scratch` decision table, QLoRA lets you fine-tune models up to roughly 4× larger than LoRA with an fp16 base would allow on the same GPU.

### How do I choose the correct LoRA rank for my task?

Start with **r = 8** for single-domain tasks like summarization or translation, and **r = 16** for multi-domain instruction following. Use **r = 4** for simple binary classification, and **r = 32** only for code generation or complex reasoning tasks. The repository reserves **r = 64** for experiments and ablation studies, as higher ranks increase compute without proportional quality gains.

### Can I merge LoRA weights back into the original model?

Yes. The `merge_lora_weights()` function in [`lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/lora.py) computes the product of the LoRA matrices `B` and `A`, scales by `alpha/rank`, and adds the result to the frozen base weights. This produces a standard model with no inference overhead, though you lose the ability to switch between different adapters dynamically.

### Why does QLoRA use NF4 quantization instead of INT4?

NF4 (Normal Float 4-bit) is optimized for the zero-centered normal distributions typical of neural network weights, providing better accuracy than uniform INT4 quantization for the same bit width. The `quantize_to_nf4` implementation in [`lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/lora.py) applies block-wise scaling to minimize quantization error on the frozen parameters while keeping trainable adapters in higher precision.