Best Practices for LLM Fine-Tuning with LoRA and QLoRA: A Complete Guide

May 21, 2026 rohitg00/ai-engineering-from-scratch ↗

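Fine-tuning large language models with LoRA and QLoRA trains small low-rank adapter matrices instead of updating the base parameters. With QLoRA's 4-bit base model, this can cut the memory needed for model weights by roughly 75% relative to fp16 while staying close to full fine-tuning quality on many tasks.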

The ai-engineering-from-scratch repository by Rohit Ghumare provides a reference implementation of parameter-efficient fine-tuning. This guide distills the decision tables and code from phases/11-llm-engineering/08-fine-tuning-lora/ into actionable best practices for rank selection, quantization strategy, and adapter persistence.

Choose the Right Adaptation Method for Your Hardware

Selecting between full fine-tuning, LoRA, and QLoRA depends entirely on available VRAM. According to the decision framework in phases/11-llm-engineering/08-fine-tuning-lora/outputs/prompt-lora-advisor.md, match your hardware to the approach using these thresholds:

Available VRAM (relative to fp16 model size)	Recommended approach
≥ 2 × model size (fp16)	Full fine-tuning (only if budget permits)
≥ model size (fp16)	LoRA – keep the base model in fp16 and add low-rank adapters
≥ model size / 4	QLoRA – quantize the base to 4-bit (NF4) and keep adapters in fp16
< model size / 4	Use a smaller base model or CPU off-loading

If VRAM is scarce, use QLoRA to quantize the frozen base model to NF4 while keeping trainable adapters in fp16.
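
The thresholds in the table can be read as a simple ratio check. The sketch below is illustrative only: `recommend_method` is a hypothetical helper, not a function from the repository, and it looks only at weight size, ignoring activations and optimizer state.

```python
def recommend_method(vram_gb: float, model_fp16_gb: float) -> str:
    """Map available VRAM to an approach using the thresholds in the table above."""
    if model_fp16_gb <= 0:
        raise ValueError("model_fp16_gb must be positive")
    ratio = vram_gb / model_fp16_gb
    if ratio >= 2.0:
        return "full fine-tuning"
    if ratio >= 1.0:
        return "LoRA"
    if ratio >= 0.25:
        return "QLoRA"
    return "smaller base model or CPU off-loading"


# A 7B model is about 14 GB in fp16 (7e9 parameters x 2 bytes).
print(recommend_method(vram_gb=24, model_fp16_gb=14))  # LoRA
print(recommend_method(vram_gb=8, model_fp16_gb=14))   # QLoRA
```

Treat the result as a starting point; sequence length and batch size can push real memory use well above the weight footprint.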

Configure Rank and Alpha for Your Use Case

The rank (r) and scaling factor (α) determine adapter capacity and learning dynamics. As documented in the advisor file, use this task-based mapping:

Task type	Recommended LoRA rank	α (default)
Binary classification, sentiment	r = 4	α = 8
Single-domain Q&A, summarization, translation	r = 8	α = 16
Multi-domain instruction following, chat	r = 16	α = 32
Code generation / complex reasoning	r = 32	α = 64
Rarely needed (ablate first)	r = 64	α = 128

The rule of thumb is α ≈ 2 × r. Adjust if training proves unstable (set α = r) or converges too slowly (set α = 4 × r).
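
To see why α ≈ 2 × r is a stable default, recall that in the standard LoRA formulation the low-rank update is multiplied by α / r, so every row of the table applies the same scale of 2. The sketch below encodes the table and computes that scale along with the per-layer adapter size. The dictionary keys and function names are hypothetical labels chosen for illustration, and the 4096 hidden size is an assumed example.

```python
# Task-to-hyperparameter table copied from the mapping above.
# The keys are illustrative labels, not identifiers from the repository.
RANK_ALPHA_BY_TASK = {
    "classification": (4, 8),
    "single_domain": (8, 16),
    "multi_domain_chat": (16, 32),
    "code_reasoning": (32, 64),
}


def lora_scaling(rank: int, alpha: float) -> float:
    """Scale applied to the low-rank update B @ A in the standard LoRA formulation."""
    return alpha / rank


def adapter_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


for task, (rank, alpha) in RANK_ALPHA_BY_TASK.items():
    # 4096 x 4096 is an assumed projection size for illustration.
    print(task, lora_scaling(rank, alpha), adapter_param_count(4096, 4096, rank))
```

Because the scale stays constant across rows, raising the rank adds capacity without also amplifying the update; switching to α = r or α = 4 × r changes the scale to 1 or 4 respectively.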

Select Target Modules Strategically

Not all layers require adaptation. The repository recommends starting minimal and expanding only if quality gaps persist:

Depth	Modules to fine-tune
Minimum viable	`q_proj`, `v_proj` (attention query & value)
Standard	`q_proj`, `k_proj`, `v_proj`, `o_proj` (all attention projections)
Maximum	All linear layers, including MLP parts (`gate_proj`, `up_proj`, `down_proj`)

In phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py, the inject_lora() function accepts a target_modules list to specify which layers receive adapter matrices.
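
How inject_lora() matches the names in target_modules is not shown in this guide. As a rough illustration of the three depths, the sketch below assumes matching on the last component of a dotted module name; `select_modules` and the module names are hypothetical.

```python
TARGETS_BY_DEPTH = {
    "minimum": ["q_proj", "v_proj"],
    "standard": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "maximum": ["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"],
}


def select_modules(module_names, depth):
    """Return the names whose last path component is a target for this depth."""
    targets = set(TARGETS_BY_DEPTH[depth])
    return [name for name in module_names if name.split(".")[-1] in targets]


# Hypothetical module names in the style of a Llama-like decoder layer.
names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
    "layers.0.input_layernorm",
]
print(select_modules(names, "minimum"))
```

Listing the selected names before training is a cheap way to confirm that the adapter count matches what you expect for the chosen depth.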

Optimize Learning Rate and Batch Size

Parameter-efficient methods tolerate higher learning rates than full fine-tuning. Use these ranges from the advisor guidelines:

Method	LR range	Typical effective batch size
Full fine-tuning	1e‑5 – 5e‑5	16 – 64
LoRA (fp16 base)	5e‑5 – 2e‑4	16 – 64
QLoRA (4-bit base)	1e‑4 – 3e‑4	16 – 64

When VRAM is tight, set per_device_batch_size=1 and increase gradient_accumulation_steps (e.g., to 16) to maintain the effective batch size.
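
The arithmetic behind that advice is: effective batch size = per-device batch size × accumulation steps × number of devices. A minimal sketch, with `accumulation_steps` as a hypothetical helper name:

```python
import math


def accumulation_steps(target_effective_batch: int,
                       per_device_batch: int,
                       num_devices: int = 1) -> int:
    """Smallest number of accumulation steps that reaches the target effective batch size."""
    samples_per_step = per_device_batch * num_devices
    return math.ceil(target_effective_batch / samples_per_step)


# One sample per step on a single GPU, aiming for an effective batch of 16.
print(accumulation_steps(target_effective_batch=16, per_device_batch=1))  # 16
```

When the target is not an exact multiple of the per-step sample count, rounding up makes the effective batch slightly larger than the target.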

Apply Dropout Based on Dataset Size

Prevent overfitting by adjusting lora_dropout according to training example count:

Dataset size	Suggested `lora_dropout`
< 5 K examples	0.10
5 K – 100 K	0.05 (default)
> 100 K	0.00
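
The table translates directly into a lookup. In this sketch `suggested_lora_dropout` is a hypothetical helper, and it assumes that a dataset of exactly 100,000 examples falls in the middle band.

```python
def suggested_lora_dropout(num_examples: int) -> float:
    """Return the lora_dropout suggested by the table above for a dataset size."""
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    if num_examples < 5_000:
        return 0.10
    if num_examples <= 100_000:
        return 0.05
    return 0.0


print(suggested_lora_dropout(2_000))    # 0.1
print(suggested_lora_dropout(50_000))   # 0.05
print(suggested_lora_dropout(250_000))  # 0.0
```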

Implement NF4 Quantization for QLoRA

QLoRA relies on 4-bit Normal Float (NF4) quantization of the frozen base model. The implementation in phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py follows this workflow:

Freeze all base-model parameters (requires_grad=False).
Quantize each frozen tensor to NF4 using quantize_model(), storing per-block scales.
Keep LoRA adapters in fp16 (trainable).

The train_lora() function works identically on both quantized and non-quantized models, simplifying the training pipeline.
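
To make "NF4 with per-block scales" concrete, here is a minimal pure-Python sketch of block-wise quantization. It is not the repository's quantize_model(): the function names are hypothetical, the block size of 64 is an assumption, and the 16 code values are the NF4 levels rounded to four decimals, so verify them against your quantization library before relying on them. Real implementations also pack two 4-bit indices per byte, which is omitted here.

```python
# NF4 code values rounded to four decimals (assumption: verify against your library).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(values):
    """Quantize one block to 4-bit indices plus a single absmax scale."""
    # Fall back to 1.0 so an all-zero block does not divide by zero.
    scale = max(abs(v) for v in values) or 1.0
    indices = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - v / scale))
        for v in values
    ]
    return indices, scale


def dequantize_block(indices, scale):
    """Map indices back to approximate float values."""
    return [NF4_LEVELS[i] * scale for i in indices]


def quantize_tensor(values, block_size=64):
    """Quantize a flat list of weights block by block, one scale per block."""
    return [
        quantize_block(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


weights = [0.02, -0.5, 0.0, 0.25]
indices, scale = quantize_block(weights)
restored = dequantize_block(indices, scale)
print(indices, scale)
print(restored)
```

The largest-magnitude weight in each block maps to exactly ±1 and is restored without error; every other weight snaps to the nearest of the 16 levels, which is where the quantization error comes from.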

Persist and Serve Adapters

Adapter persistence enables efficient multi-task serving without duplicating base model weights.

Saving: save_lora_adapter() stores only LoRA matrices (A, B, plus rank and α metadata).
Loading: load_lora_adapter() restores adapters into a model that already has LoRA layers injected.
Merging: merge_lora_weights() fuses adapters into base weights for inference-speed critical deployments.

The reference implementation demonstrates multi-adapter serving—training separate adapters on disjoint data splits and switching them at inference time for task-level routing.
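
The trade-off between switching and merging follows from the standard LoRA identity W' = W + (α / r) · B · A. The toy sketch below uses plain lists rather than the repository's functions, and every name in it is hypothetical. It routes one input through two adapters that share a base weight, then checks that the merged weight gives the same output as the unmerged path.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy sizes: d_in = d_out = 2, rank = 1. One shared base weight, two adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS = {  # hypothetical task names
    "summarize": {"A": [[1.0, 2.0]], "B": [[0.5], [0.0]]},
    "translate": {"A": [[0.0, 1.0]], "B": [[0.0], [1.0]]},
}
x = [[1.0], [1.0]]  # input as a column vector


def forward(task, alpha=2.0, rank=1):
    """Unmerged path: base output plus the scaled low-rank update for one task."""
    adapter = ADAPTERS[task]
    base = matmul(W, x)
    update = matmul(adapter["B"], matmul(adapter["A"], x))
    return [[b[0] + (alpha / rank) * u[0]] for b, u in zip(base, update)]


for task, adapter in ADAPTERS.items():
    merged = merge(W, adapter["A"], adapter["B"], alpha=2.0, rank=1)
    print(task, forward(task), matmul(merged, x))
```

Switching adapters only changes which small A and B are applied on top of the shared W, while merging produces a separate full-size weight per task. That is why merging suits single-task, latency-sensitive deployments and unmerged adapters suit task-level routing.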

End-to-End Implementation Example

Below is a complete workflow using utilities from phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py:

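import sys

# The directory names contain hyphens and leading digits, so they cannot be
# imported as a dotted package path. Add the code directory to sys.path instead
# (run this from the repository root).
sys.path.insert(0, "phases/11-llm-engineering/08-fine-tuning-lora/code")

from lora import (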
    create_demo_model,
    inject_lora,
    train_lora,
    quantize_model,
    merge_lora_weights,
    save_lora_adapter,
    load_lora_adapter,
    create_demo_data,
)

import os
import tempfile

# Initialize the demo model
model = create_demo_model()

# Inject LoRA (rank=8, α=16) targeting specific layers
lora_layers = inject_lora(
    model,
    target_modules=["0", "2"],
    rank=8,
    alpha=16,
)

# Prepare data
data = create_demo_data()

# Standard LoRA training
losses = train_lora(model, data, epochs=10, lr=1e-3, batch_size=4)

# QLoRA workflow: quantize the frozen base, keep adapters in fp16, then train
quant_state = quantize_model(model)
losses = train_lora(model, data, epochs=10, lr=1e-4, batch_size=4)

# Persist adapters only. Save before merging, since merging fuses the
# adapters into the base weights.
tmp_path = os.path.join(tempfile.mkdtemp(), "adapter.pt")
n_saved = save_lora_adapter(model, tmp_path)
print(f"Saved {n_saved} LoRA tensors → {os.path.getsize(tmp_path)/1024:.1f} KB")

# Load later into a model with the same architecture and LoRA layers injected
load_lora_adapter(model, tmp_path)

# Merge for deployment
merge_lora_weights(model)

Summary

Match method to VRAM: Use LoRA when you have at least the model size in fp16 VRAM; use QLoRA when you have one-quarter that amount.
Set rank and alpha by task complexity: Start with r=8/α=16 for general Q&A, r=32/α=64 for code generation.
Target attention first: Begin with q_proj and v_proj, expanding to MLP layers only if validation quality lags.
Quantize correctly: Freeze base parameters before applying NF4 quantization in QLoRA workflows.
Persist adapters: Store only the low-rank matrices (typically 10–100 MB) rather than full model checkpoints.

Frequently Asked Questions

What is the difference between LoRA and QLoRA?

LoRA (Low-Rank Adaptation) keeps the base model in fp16 and trains small adapter matrices, while QLoRA (Quantized LoRA) first quantizes the frozen base model to 4-bit NF4 format, which cuts the memory needed for base weights to roughly 25% of the fp16 footprint. Both methods keep adapters in fp16 during training.

How do I choose the right rank for LoRA fine-tuning?

Select rank based on task complexity: use r=4 for simple binary classification, r=8–16 for single-domain Q&A, r=16–32 for multi-domain instruction following, and r=32–64 for code generation. Always start lower and increase only if validation metrics plateau.

Can I merge LoRA adapters back into the base model?

Yes. Use merge_lora_weights() from lora.py to fuse the trained adapter matrices into the base weights. This eliminates inference overhead but loses the ability to dynamically swap adapters. Store the original base model separately if you need to revert or switch adapters later.

How much VRAM is required for QLoRA fine-tuning a 7B parameter model?

QLoRA requires approximately one-quarter of the fp16 model size in VRAM—roughly 4–6 GB for a 7B parameter model using 4-bit quantization, compared to 14+ GB for standard LoRA. This enables fine-tuning on consumer GPUs.
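
The weight portion of those figures is simple arithmetic: parameters × bits per parameter ÷ 8. The sketch below uses a hypothetical helper, `weight_memory_gb`, and counts weights only; the gap between 3.5 GB and the 4–6 GB quoted above goes to quantization scales, adapters, optimizer state, and activations.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9
print(f"fp16 weights:  {weight_memory_gb(params, 16):.1f} GB")  # 14.0 GB
print(f"4-bit weights: {weight_memory_gb(params, 4):.1f} GB")   # 3.5 GB
```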
