# Best Practices for LLM Fine-Tuning with LoRA and QLoRA: A Complete Guide

> Fine-tune LLMs with LoRA and QLoRA. Learn best practices that cut base-model memory by up to 75% while keeping quality close to full fine-tuning.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: best-practices
- Published: 2026-05-21

---

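**Fine-tuning large language models with LoRA and QLoRA trains small low-rank adapter matrices instead of updating the base parameters. QLoRA additionally quantizes the frozen base model to 4-bit, which cuts base-weight memory by up to 75% relative to fp16 while keeping quality close to full fine-tuning.**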

The `ai-engineering-from-scratch` repository by Rohit Ghumare provides a reference implementation of parameter-efficient fine-tuning. This guide distills the decision tables and code from `phases/11-llm-engineering/08-fine-tuning-lora/` into best practices for rank selection, quantization, and adapter persistence.

## Choose the Right Adaptation Method for Your Hardware

Selecting between full fine-tuning, LoRA, and QLoRA depends primarily on available VRAM. According to the decision framework in [`phases/11-llm-engineering/08-fine-tuning-lora/outputs/prompt-lora-advisor.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/outputs/prompt-lora-advisor.md), match your hardware to the approach using these thresholds:

| VRAM (≈ relative to model size) | Recommended approach |
|----------------------------------|----------------------|
| ≥ 2 × model size (fp16) | Full fine-tuning (only if budget permits) |
| ≥ model size (fp16) | **LoRA** – keep the base model in fp16 and add low-rank adapters |
| ≥ model size / 4 | **QLoRA** – quantize the base to 4-bit (NF4) and keep adapters in fp16 |
| < model size / 4 | Use a smaller base model or CPU off-loading |

If VRAM is scarce, use QLoRA to quantize the frozen base model to NF4 while keeping trainable adapters in fp16.
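The thresholds in the table can be expressed as a small lookup. This is a minimal sketch, not code from the repository: `recommend_method` is a hypothetical helper, and the 14 GB figure assumes a 7B-parameter model stored at 2 bytes per parameter.

```python
def recommend_method(vram_gb: float, model_fp16_gb: float) -> str:
    """Map available VRAM to an adaptation method using the table's thresholds."""
    if vram_gb >= 2 * model_fp16_gb:
        return "full fine-tuning"
    if vram_gb >= model_fp16_gb:
        return "LoRA"
    if vram_gb >= model_fp16_gb / 4:
        return "QLoRA"
    return "smaller base model or CPU offloading"


# Assumption: a 7B-parameter model is about 14 GB in fp16.
for vram in (48, 24, 8, 2):
    print(f"{vram:>2} GB -> {recommend_method(vram, model_fp16_gb=14)}")
```

Treat the result as a starting point: activations, optimizer state, and sequence length also consume memory, so leave headroom above each threshold.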

## Configure Rank and Alpha for Your Use Case

The rank (**r**) and scaling factor (**α**) determine adapter capacity and learning dynamics. As documented in the advisor file, use this task-based mapping:

| Task type | Recommended LoRA rank | α (default) |
|-----------|----------------------|-------------|
| Binary classification, sentiment | r = 4 | α = 8 |
| Single-domain Q&A, summarization, translation | r = 8 | α = 16 |
| Multi-domain instruction following, chat | r = 16 | α = 32 |
| Code generation / complex reasoning | r = 32 | α = 64 |
| Rarely needed (ablate first) | r = 64 | α = 128 |

The rule of thumb is **α ≈ 2 × r**. Adjust if training proves unstable (set α = r) or converges too slowly (set α = 4 × r).
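In the standard LoRA formulation, the low-rank update is multiplied by α / r, so the α ≈ 2 × r rule corresponds to a scale of 2, α = r to a scale of 1, and α = 4 × r to a scale of 4. The sketch below computes that scale and the number of trainable parameters per adapted layer. The helper names and the hidden size of 4096 are illustrative assumptions, and the repository's `inject_lora()` may apply the scaling differently.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted layer: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


d = 4096  # hypothetical hidden size, for illustration only
for rank, alpha in [(4, 8), (8, 16), (16, 32), (32, 64), (64, 128)]:
    params = lora_param_count(d, d, rank)
    share = params / (d * d)
    print(f"r={rank:<3} alpha={alpha:<4} scale={lora_scale(rank, alpha):.1f} "
          f"params={params:,} ({share:.2%} of the dense layer)")
```

Because every row of the table keeps α / r fixed at 2, raising the rank adds capacity without changing the effective magnitude of the update.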

## Select Target Modules Strategically

Not all layers require adaptation. The repository recommends starting minimal and expanding only if quality gaps persist:

| Depth | Modules to fine-tune |
|-------|----------------------|
| Minimum viable | `q_proj`, `v_proj` (attention query & value) |
| Standard | `q_proj`, `k_proj`, `v_proj`, `o_proj` (all attention projections) |
| Maximum | All linear layers, including MLP parts (`gate_proj`, `up_proj`, `down_proj`) |

In [`phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py), the `inject_lora()` function accepts a `target_modules` list to specify which layers receive adapter matrices.
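To illustrate how a `target_modules` list narrows the set of adapted layers, the following sketch filters layer names by their final path component. `select_target_layers` and the layer names are hypothetical; the matching logic inside `inject_lora()` is not shown in this guide and may differ.

```python
def select_target_layers(layer_names, target_modules):
    """Return the layer names whose final path component is in target_modules."""
    targets = set(target_modules)
    return [name for name in layer_names if name.rsplit(".", 1)[-1] in targets]


# Hypothetical layer names in the style used by many decoder-only models.
layer_names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
]

minimal = select_target_layers(layer_names, ["q_proj", "v_proj"])
standard = select_target_layers(layer_names, ["q_proj", "k_proj", "v_proj", "o_proj"])
print(len(minimal), "layers adapted at minimum depth")
print(len(standard), "layers adapted at standard depth")
```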

## Optimize Learning Rate and Batch Size

Parameter-efficient methods tolerate higher learning rates than full fine-tuning. Use these ranges from the advisor guidelines:

| Method | LR range | Typical effective batch size |
|--------|----------|------------------------------|
| Full fine-tuning | 1e-5 – 5e-5 | 16 – 64 |
| LoRA (fp16 base) | 5e-5 – 2e-4 | 16 – 64 |
| QLoRA (4-bit base) | 1e-4 – 3e-4 | 16 – 64 |

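When VRAM is tight, set the per-device batch size to 1 and raise `gradient_accumulation_steps` (e.g., to 16) so that the effective batch size (per-device batch size × accumulation steps × number of GPUs) stays in the target range. In Hugging Face `TrainingArguments`, the per-device setting is named `per_device_train_batch_size`.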
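The accumulation setting follows from simple arithmetic: effective batch size = per-device batch size × accumulation steps × number of devices. A minimal sketch, with `accumulation_steps` as a hypothetical helper name:

```python
import math


def accumulation_steps(target_effective_batch: int,
                       per_device_batch: int,
                       num_devices: int = 1) -> int:
    """Smallest number of accumulation steps that reaches the target effective batch."""
    samples_per_step = per_device_batch * num_devices
    return math.ceil(target_effective_batch / samples_per_step)


# One sample per step on a single GPU needs 16 accumulation steps for batch 16.
print(accumulation_steps(16, per_device_batch=1))
# Four GPUs with two samples each need 8 steps for an effective batch of 64.
print(accumulation_steps(64, per_device_batch=2, num_devices=4))
```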

## Apply Dropout Based on Dataset Size

Prevent overfitting by adjusting `lora_dropout` according to training example count:

| Dataset size | Suggested `lora_dropout` |
|--------------|--------------------------|
| < 5 K examples | 0.10 |
| 5 K – 100 K | 0.05 (default) |
| > 100 K | 0.00 |
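The table translates directly into a lookup. `suggest_lora_dropout` is a hypothetical helper, and the handling of the exact boundaries (5,000 and 100,000 examples fall in the middle band) is an assumption, since the table does not specify it.

```python
def suggest_lora_dropout(num_examples: int) -> float:
    """Pick a lora_dropout value from the dataset-size table."""
    if num_examples < 5_000:
        return 0.10
    if num_examples <= 100_000:
        return 0.05
    return 0.0


for n in (1_000, 20_000, 500_000):
    print(f"{n:>7} examples -> lora_dropout={suggest_lora_dropout(n):.2f}")
```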

## Implement NF4 Quantization for QLoRA

QLoRA relies on 4-bit Normal Float (NF4) quantization of the frozen base model. The implementation in [`phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py) follows this workflow:

1. Freeze all base-model parameters (`requires_grad=False`).
2. Quantize each frozen tensor to NF4 using `quantize_model()`, storing per-block scales.
3. Keep LoRA adapters in fp16 (trainable).

The `train_lora()` function works identically on both quantized and non-quantized models, simplifying the training pipeline.
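To make step 2 concrete, here is a pure-Python sketch of block-wise 4-bit quantization with per-block absmax scales. It is an independent illustration, not the repository's `quantize_model()`: the function names are hypothetical, the 16 levels are the NF4 codebook rounded to four decimals, and the block size of 64 is an assumption.

```python
import random

# NF4 codebook, rounded to four decimals for readability (assumption: illustrative values).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(block):
    """Quantize one block; returns (list of 4-bit codes, absmax scale)."""
    scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero for all-zero blocks
    codes = [min(range(16), key=lambda i: abs(NF4_LEVELS[i] - v / scale)) for v in block]
    return codes, scale


def quantize(values, block_size=64):
    """Split values into blocks and quantize each block independently."""
    return [quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def dequantize(blocks):
    """Map codes back to floats using each block's stored scale."""
    return [NF4_LEVELS[c] * scale for codes, scale in blocks for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
blocks = quantize(weights)
restored = dequantize(blocks)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(blocks)} blocks, max abs reconstruction error {max_err:.5f}")
```

Storing one scale per block keeps a single outlier from degrading the precision of the whole tensor, which is why the scales are kept alongside the 4-bit codes.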

## Persist and Serve Adapters

Adapter persistence enables efficient multi-task serving without duplicating base model weights.

- **Saving:** `save_lora_adapter()` stores only LoRA matrices (`A`, `B`, plus rank and α metadata).
- **Loading:** `load_lora_adapter()` restores adapters into a model that already has LoRA layers injected.
- **Merging:** `merge_lora_weights()` fuses adapters into base weights for latency-critical deployments.

The reference implementation demonstrates multi-adapter serving: it trains separate adapters on disjoint data splits and switches between them at inference time for task-level routing.
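Merging works because the adapter is a linear correction: in the standard LoRA formulation, W' = W + (α / r) · B · A produces the same outputs as running the base layer and the adapter side by side. The toy example below checks that equivalence in pure Python. The helper names and matrices are hypothetical, and `merge_lora_weights()` in the repository may differ in detail.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def forward(W, x):
    """Compute y = W x for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def merge(W, A, B, rank, alpha):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy layer: d_out=2, d_in=3, rank=1 (values chosen for illustration).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]   # shape (rank, d_in)
B = [[2.0], [1.0]]       # shape (d_out, rank)
rank, alpha = 1, 2
x = [1.0, 2.0, 3.0]

# Unmerged path: base output plus the scaled adapter output.
adapter_out = forward(B, forward(A, x))
unmerged = [b + (alpha / rank) * a for b, a in zip(forward(W, x), adapter_out)]

# Merged path: a single matrix multiply with the fused weights.
merged = forward(merge(W, A, B, rank, alpha), x)
print(unmerged, merged)
```

The merged path needs one matrix multiply per layer instead of three, which is where the inference speedup comes from.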

## End-to-End Implementation Example

Below is a complete workflow using utilities from [`phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py):

```python
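import sys

# The repository's directory names start with digits and contain hyphens, so
# they cannot be imported as a dotted package path. Add the code directory to
# sys.path instead (this assumes the script runs from the repository root).
sys.path.insert(0, "phases/11-llm-engineering/08-fine-tuning-lora/code")

from lora import (
    create_demo_model,
    inject_lora,
    train_lora,
    quantize_model,
    merge_lora_weights,
    save_lora_adapter,
    load_lora_adapter,
    create_demo_data,
)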

# Initialize model

model = create_demo_model()

# Inject LoRA (rank=8, α=16) targeting specific layers

lora_layers = inject_lora(
    model, 
    target_modules=["0", "2"], 
    rank=8, 
    alpha=16
)

# Prepare data

data = create_demo_data()

# Standard LoRA training

losses = train_lora(model, data, epochs=10, lr=1e-3, batch_size=4)

# QLoRA workflow: quantize frozen base, keep adapters fp16

quant_state = quantize_model(model)
losses = train_lora(model, data, epochs=10, lr=1e-4, batch_size=4)

# Persist adapters only (do this before merging, while the adapters are still separate)

import os
import tempfile
tmp_path = tempfile.NamedTemporaryFile(suffix=".pt", delete=False).name
n_saved = save_lora_adapter(model, tmp_path)
print(f"Saved {n_saved} LoRA tensors → {os.path.getsize(tmp_path)/1024:.1f} KB")

# Load later into identical architecture

load_lora_adapter(model, tmp_path)

# Merge for deployment (last step: merged weights can no longer be swapped)

merge_lora_weights(model)
```

## Summary

- **Match method to VRAM:** Use LoRA when your VRAM is at least the fp16 model size; use QLoRA when it is at least one-quarter of that size.
- **Set rank and alpha by task complexity:** Start with r=8/α=16 for general Q&A, r=32/α=64 for code generation.
- **Target attention first:** Begin with `q_proj` and `v_proj`, expanding to MLP layers only if validation quality lags.
- **Quantize correctly:** Freeze base parameters before applying NF4 quantization in QLoRA workflows.
- **Persist adapters:** Store only the low-rank matrices (typically 10–100 MB) rather than full model checkpoints.

## Frequently Asked Questions

### What is the difference between LoRA and QLoRA?

LoRA (Low-Rank Adaptation) keeps the base model in fp16 and trains small adapter matrices. QLoRA (Quantized LoRA) first quantizes the frozen base model to 4-bit NF4, which shrinks the base weights to roughly 25% of their fp16 footprint. Both methods keep adapters in fp16 during training.

### How do I choose the right rank for LoRA fine-tuning?

Select rank based on task complexity: use r=4 for simple binary classification, r=8 for single-domain Q&A, r=16 for multi-domain instruction following, and r=32 for code generation. Reserve r=64 for cases where an ablation shows it helps. Always start lower and increase only if validation metrics plateau below your target.

### Can I merge LoRA adapters back into the base model?

Yes. Use `merge_lora_weights()` from [`lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py) to fuse the trained adapter matrices into the base weights. This eliminates the adapter's inference overhead but removes the ability to swap adapters dynamically. Store the original base model separately if you need to revert or switch adapters later.

### How much VRAM is required for QLoRA fine-tuning a 7B parameter model?

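QLoRA requires approximately one-quarter of the fp16 model size in VRAM for the base weights. A 7B-parameter model is about 14 GB in fp16 and about 3.5 GB at 4 bits, so expect roughly 4–6 GB in practice once adapters, optimizer state, and activations are included, compared to 14+ GB for standard LoRA. This enables fine-tuning on consumer GPUs.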
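The weight-memory figures come from multiplying the parameter count by the bits stored per parameter. A minimal sketch of that arithmetic, with `weight_memory_gb` as a hypothetical helper that counts weights only (no per-block scales, adapters, optimizer state, or activations):

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # 7B-parameter model
fp16_gb = weight_memory_gb(params, 16)
nf4_gb = weight_memory_gb(params, 4)
print(f"fp16 weights: {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB ({nf4_gb / fp16_gb:.0%} of fp16)")
```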