Best Practices for LLM Fine-Tuning with LoRA and QLoRA: A Complete Guide
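Fine-tuning large language models with LoRA and QLoRA trains small low-rank adapter matrices instead of updating the base parameters. QLoRA's 4-bit quantization additionally cuts base-model weight memory by roughly 75% relative to fp16, and both methods aim to stay close to full fine-tuning quality.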
The ai-engineering-from-scratch repository by Rohit Ghumare provides a production-ready framework for parameter-efficient fine-tuning. This guide distills the decision tables and reference implementation from phases/11-llm-engineering/08-fine-tuning-lora/ into actionable best practices for optimizing rank selection, quantization strategies, and adapter persistence.
Choose the Right Adaptation Method for Your Hardware
Selecting between full fine-tuning, LoRA, and QLoRA depends entirely on available VRAM. According to the decision framework in phases/11-llm-engineering/08-fine-tuning-lora/outputs/prompt-lora-advisor.md, match your hardware to the approach using these thresholds:
| VRAM (≈ relative to model size) | Recommended approach |
|---|---|
| ≥ 2 × model size (fp16) | Full fine-tuning (only if budget permits) |
| ≥ model size (fp16) | LoRA – keep the base model in fp16 and add low-rank adapters |
| ≥ model size / 4 | QLoRA – quantize the base to 4-bit (NF4) and keep adapters in fp16 |
| < model size / 4 | Use a smaller base model or CPU off-loading |
If VRAM is scarce, use QLoRA to quantize the frozen base model to NF4 while keeping trainable adapters in fp16.
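The thresholds in the table can be encoded as a small helper. This is a minimal sketch rather than code from the repository: the names fp16_size_gb and recommend_method are hypothetical, and the 2-bytes-per-parameter estimate counts weights only.

```python
def fp16_size_gb(n_params: float) -> float:
    """Approximate fp16 weight size in GB (assumption: 2 bytes per parameter, weights only)."""
    return n_params * 2 / 1e9


def recommend_method(vram_gb: float, n_params: float) -> str:
    """Map available VRAM to an approach using the thresholds from the table."""
    model_gb = fp16_size_gb(n_params)
    if vram_gb >= 2 * model_gb:
        return "full fine-tuning"
    if vram_gb >= model_gb:
        return "LoRA"
    if vram_gb >= model_gb / 4:
        return "QLoRA"
    return "smaller base model or CPU off-loading"


# A 7B-parameter model is about 14 GB in fp16 under the assumption above.
for vram in (40, 24, 8, 2):
    print(f"{vram} GB -> {recommend_method(vram, 7e9)}")
```

Treat the result as a starting point: activations, optimizer state, and sequence length all add to the real footprint.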
Configure Rank and Alpha for Your Use Case
The rank (r) and scaling factor (α) determine adapter capacity and learning dynamics. As documented in the advisor file, use this task-based mapping:
| Task type | Recommended LoRA rank | α (default) |
|---|---|---|
| Binary classification, sentiment | r = 4 | α = 8 |
| Single-domain Q&A, summarization, translation | r = 8 | α = 16 |
| Multi-domain instruction following, chat | r = 16 | α = 32 |
| Code generation / complex reasoning | r = 32 | α = 64 |
| Rarely needed (ablate first) | r = 64 | α = 128 |
The rule of thumb is α ≈ 2 × r. Adjust if training proves unstable (set α = r) or converges too slowly (set α = 4 × r).
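To see why α is tied to r, it helps to compute the multiplier applied to the adapter update. The sketch below assumes the usual LoRA formulation, in which the low-rank product B·A is scaled by α / r; the repository's lora.py may differ. The function names and the 4096 × 4096 layer size are hypothetical, chosen only for illustration.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Multiplier applied to the low-rank update B @ A (assumes the usual alpha / rank scaling)."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Presets copied from the table above: task label -> (rank, alpha).
PRESETS = {
    "classification": (4, 8),
    "single_domain": (8, 16),
    "multi_domain": (16, 32),
    "code_reasoning": (32, 64),
}

for task, (r, a) in PRESETS.items():
    # 4096 x 4096 is a hypothetical projection size, not a value from the source.
    print(task, "scale =", lora_scale(r, a), "params/layer =", lora_param_count(4096, 4096, r))
```

Under the α ≈ 2 × r rule every preset has the same scale of 2.0, so raising the rank adds capacity without changing the update magnitude. Setting α = r lowers the scale to 1.0, and α = 4 × r raises it to 4.0.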
Select Target Modules Strategically
Not all layers require adaptation. The repository recommends starting minimal and expanding only if quality gaps persist:
| Depth | Modules to fine-tune |
|---|---|
| Minimum viable | q_proj, v_proj (attention query & value) |
| Standard | q_proj, k_proj, v_proj, o_proj (all attention projections) |
| Maximum | All linear layers, including MLP parts (gate_proj, up_proj, down_proj) |
In phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py, the inject_lora() function accepts a target_modules list to specify which layers receive adapter matrices.
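The three depths can be kept as presets and matched against module names. This is a hypothetical sketch: the suffix-matching rule, the select_modules helper, and the Llama-style module names are assumptions for illustration, not how inject_lora() is documented to resolve its target_modules argument.

```python
# Presets mirroring the table above. "maximum" lists the linear layers of a
# typical Llama-style block (assumed naming).
TARGET_PRESETS = {
    "minimum": ["q_proj", "v_proj"],
    "standard": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "maximum": ["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"],
}


def select_modules(module_names, targets):
    """Return names whose final dotted component is in targets (hypothetical matching rule)."""
    wanted = set(targets)
    return [name for name in module_names if name.split(".")[-1] in wanted]


# Hypothetical module names for a single transformer block.
names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
    "layers.0.input_layernorm",
]

for depth, targets in TARGET_PRESETS.items():
    print(depth, "->", len(select_modules(names, targets)), "adapted modules per block")
```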
Optimize Learning Rate and Batch Size
Parameter-efficient methods tolerate higher learning rates than full fine-tuning. Use these ranges from the advisor guidelines:
| Method | LR range | Typical effective batch size |
|---|---|---|
| Full fine-tuning | 1e-5 – 5e-5 | 16 – 64 |
| LoRA (fp16 base) | 5e-5 – 2e-4 | 16 – 64 |
| QLoRA (4-bit base) | 1e-4 – 3e-4 | 16 – 64 |
When VRAM is tight, set per_device_batch_size=1 and increase gradient_accumulation_steps (e.g., to 16) to maintain the effective batch size.
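The effective batch size is the per-device batch size multiplied by the number of devices and the number of accumulation steps. A minimal sketch of that arithmetic follows; the helper name accumulation_steps is hypothetical and not part of the repository.

```python
import math


def accumulation_steps(target_effective_batch: int,
                       per_device_batch: int = 1,
                       num_devices: int = 1) -> int:
    """Gradient-accumulation steps needed to reach the target effective batch size."""
    samples_per_step = per_device_batch * num_devices
    # Round up so the effective batch never falls below the target.
    return math.ceil(target_effective_batch / samples_per_step)


print(accumulation_steps(16))        # single GPU, batch size 1
print(accumulation_steps(64, 2, 4))  # 4 GPUs, batch size 2 each
```

Accumulation trades wall-clock time for memory: gradients from several small forward and backward passes are summed before each optimizer step.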
Apply Dropout Based on Dataset Size
Prevent overfitting by adjusting lora_dropout according to training example count:
| Dataset size | Suggested lora_dropout |
|---|---|
| < 5 K examples | 0.10 |
| 5 K – 100 K | 0.05 (default) |
| > 100 K | 0.00 |
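The table translates directly into a lookup. In this sketch the function name suggest_lora_dropout is hypothetical, and placing exactly 5,000 and exactly 100,000 examples in the middle bucket is an assumption about how the table's boundaries should be read.

```python
def suggest_lora_dropout(n_examples: int) -> float:
    """Pick lora_dropout from dataset size, following the table above."""
    if n_examples < 5_000:
        return 0.10
    if n_examples <= 100_000:  # assumption: both boundary values use the default
        return 0.05
    return 0.0


for n in (1_200, 20_000, 250_000):
    print(n, "examples ->", suggest_lora_dropout(n))
```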
Implement NF4 Quantization for QLoRA
QLoRA relies on 4-bit Normal Float (NF4) quantization of the frozen base model. The implementation in phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py follows this workflow:
- Freeze all base-model parameters (requires_grad=False).
- Quantize each frozen tensor to NF4 using quantize_model(), storing per-block scales.
- Keep LoRA adapters in fp16 (trainable).
The train_lora() function works identically on both quantized and non-quantized models, simplifying the training pipeline.
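The idea of block-wise 4-bit quantization with per-block scales can be illustrated in plain Python. This is a simplified stand-in, not the repository's quantize_model() and not the exact NF4 table: the 16-entry codebook here is built from evenly spaced quantiles of a standard normal distribution, and every function name is hypothetical.

```python
from statistics import NormalDist


def make_codebook(levels: int = 16) -> list[float]:
    """16 normal-distribution quantiles rescaled to [-1, 1] (simplified, not the exact NF4 table)."""
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]


def quantize(values, codebook, block_size=4):
    """Split values into blocks; store one absmax scale and 4-bit codes per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        codes = [
            min(range(len(codebook)), key=lambda k: abs(codebook[k] - v / scale))
            for v in block
        ]
        blocks.append((codes, scale))
    return blocks


def dequantize(blocks, codebook):
    """Rebuild approximate values from the codes and per-block scales."""
    return [codebook[c] * scale for codes, scale in blocks for c in codes]


codebook = make_codebook()
weights = [0.02, -0.5, 0.31, 0.07, 1.2, -0.9, 0.0, 0.4]
blocks = quantize(weights, codebook)
restored = dequantize(blocks, codebook)
print([round(r, 3) for r in restored])
```

Each weight is stored as an index from 0 to 15 (4 bits), plus one scale shared by the whole block. Because the codebook is denser near zero, small weights, which are the most common in normally distributed tensors, are reconstructed more precisely than large ones.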
Persist and Serve Adapters
Adapter persistence enables efficient multi-task serving without duplicating base model weights.
- Saving: save_lora_adapter() stores only LoRA matrices (A, B, plus rank and α metadata).
- Loading: load_lora_adapter() restores adapters into a model that already has LoRA layers injected.
- Merging: merge_lora_weights() fuses adapters into base weights for inference-speed critical deployments.
The reference implementation demonstrates multi-adapter serving—training separate adapters on disjoint data splits and switching them at inference time for task-level routing.
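The merge step listed above works because the adapter path is linear. The toy example below assumes the usual LoRA update W + (α / r)·B·A; the repository's merge_lora_weights() may be implemented differently, and the matmul and merge helpers are hypothetical. It shows that the merged weight gives the same output as running the base layer and the adapter separately.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge(W, A, B, rank, alpha):
    """Return W + (alpha / rank) * B @ A as a new matrix (assumed LoRA merge rule)."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: W is 2x3 (d_out x d_in), A is 1x3 (rank x d_in), B is 2x1 (d_out x rank).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
x = [[1.0], [1.0], [1.0]]  # input as a column vector
rank, alpha = 1, 2

# Unmerged: base output plus the scaled adapter output (two extra matmuls per call).
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged: a single matmul against the fused weight.
merged = matmul(merge(W, A, B, rank, alpha), x)

print(unmerged, merged)
```

Once merged, the adapter is no longer a separate object, which is why switching adapters at inference time requires keeping them unmerged.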
End-to-End Implementation Example
Below is a complete workflow using utilities from phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py:
import os
import sys
import tempfile

# The directory names contain hyphens and leading digits, so they cannot be
# written as a dotted import. Assuming the script runs from the repository
# root, add the code directory to sys.path and import lora.py directly.
sys.path.insert(0, "phases/11-llm-engineering/08-fine-tuning-lora/code")
from lora import (
    create_demo_model,
    inject_lora,
    train_lora,
    quantize_model,
    merge_lora_weights,
    save_lora_adapter,
    load_lora_adapter,
    create_demo_data,
)

# Initialize model
model = create_demo_model()

# Inject LoRA (rank=8, α=16) targeting specific layers
lora_layers = inject_lora(
    model,
    target_modules=["0", "2"],
    rank=8,
    alpha=16,
)

# Prepare data
data = create_demo_data()

# Standard LoRA training
losses = train_lora(model, data, epochs=10, lr=1e-3, batch_size=4)

# QLoRA workflow: quantize frozen base, keep adapters fp16
quant_state = quantize_model(model)
losses = train_lora(model, data, epochs=10, lr=1e-4, batch_size=4)

# Persist adapters only. This is done before merging on the assumption that
# merging folds the adapters into the base weights.
tmp_path = tempfile.NamedTemporaryFile(suffix=".pt", delete=False).name
n_saved = save_lora_adapter(model, tmp_path)
print(f"Saved {n_saved} LoRA tensors → {os.path.getsize(tmp_path)/1024:.1f} KB")

# Load later into identical architecture
load_lora_adapter(model, tmp_path)

# Merge for deployment
merge_lora_weights(model)
Summary
- Match method to VRAM: Use LoRA when you have at least the model size in fp16 VRAM; use QLoRA when you have at least one-quarter of that amount.
- Set rank and alpha by task complexity: Start with r=8/α=16 for general Q&A, r=32/α=64 for code generation.
- Target attention first: Begin with q_proj and v_proj, expanding to MLP layers only if validation quality lags.
- Quantize correctly: Freeze base parameters before applying NF4 quantization in QLoRA workflows.
- Persist adapters: Store only the low-rank matrices (typically 10–100 MB) rather than full model checkpoints.
Frequently Asked Questions
What is the difference between LoRA and QLoRA?
LoRA (Low-Rank Adaptation) keeps the base model in fp16 and trains small adapter matrices. QLoRA (Quantized LoRA) first quantizes the frozen base model to 4-bit NF4, shrinking the base weights to roughly 25% of their fp16 size. Both methods keep the adapters in fp16 during training.
How do I choose the right rank for LoRA fine-tuning?
Select rank based on task complexity: r=4 for simple binary classification, r=8 for single-domain Q&A, r=16 for multi-domain instruction following, and r=32 for code generation or complex reasoning. r=64 is rarely needed. Start low and increase only if validation metrics plateau below your target.
Can I merge LoRA adapters back into the base model?
Yes. Use merge_lora_weights() from lora.py to fuse the trained adapter matrices into the base weights. This eliminates inference overhead but loses the ability to dynamically swap adapters. Store the original base model separately if you need to revert or switch adapters later.
How much VRAM is required for QLoRA fine-tuning a 7B parameter model?
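QLoRA requires approximately one-quarter of the fp16 model size in VRAM. A 7B-parameter model occupies about 14 GB in fp16 (7B × 2 bytes), so its 4-bit weights take about 3.5 GB; with adapters, activations, and optimizer state, expect roughly 4–6 GB, compared to 14+ GB for standard LoRA. This enables fine-tuning on consumer GPUs.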