How to Fine-Tune LLMs Using LoRA and QLoRA: A Complete Implementation Guide
Fine-tuning large language models with LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) trains small adapter matrices instead of the entire base model. QLoRA's 4-bit quantization additionally cuts base-model weight memory by up to 75% relative to fp16, while quality stays close to that of full fine-tuning.
The ai-engineering-from-scratch repository provides a production-ready framework for parameter-efficient fine-tuning, including decision matrices for hardware constraints and a complete implementation in phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py. This guide distills the best practices from the repository's prompt-lora-advisor.md file and source code to help you implement these techniques correctly.
Selecting the Right Fine-Tuning Method Based on VRAM
Your available VRAM determines whether you should use full fine-tuning, LoRA, or QLoRA. According to the decision table in phases/11-llm-engineering/08-fine-tuning-lora/outputs/prompt-lora-advisor.md, match your hardware to the method as follows:
- Full fine-tuning: Requires ≥ 2× model size (fp16). Only viable with enterprise-grade GPUs or multi-GPU setups.
- LoRA: Requires ≥ model size (fp16). Keeps the base model in fp16 and adds low-rank trainable adapters.
- QLoRA: Requires ≥ model size / 4. Quantizes the frozen base model to 4-bit (NF4) while keeping adapters in fp16.
- Below threshold: Use a smaller base model or CPU off-loading strategies.
If your VRAM is tight but above the QLoRA threshold, quantize the base model using the quantize_model() function implemented in lora.py.
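The thresholds above can be sketched as a small lookup. This is only an illustration of the decision table as summarized here: the function names `choose_method` and `fp16_size_gb` are hypothetical and do not come from the repository, and the estimate counts weights only (2 bytes per fp16 parameter).

```python
def fp16_size_gb(n_params_billion: float) -> float:
    """fp16 stores 2 bytes per parameter, so 1B parameters is about 2 GB."""
    return n_params_billion * 2.0


def choose_method(vram_gb: float, fp16_model_gb: float) -> str:
    """Map available VRAM to a method using the thresholds listed above."""
    if vram_gb >= 2 * fp16_model_gb:
        return "full"
    if vram_gb >= fp16_model_gb:
        return "lora"
    if vram_gb >= fp16_model_gb / 4:
        return "qlora"
    return "smaller-model-or-offload"


# Hypothetical example: a 7B-parameter model is about 14 GB in fp16.
model_gb = fp16_size_gb(7)
for vram in (40, 24, 8, 2):
    print(f"{vram} GB VRAM -> {choose_method(vram, model_gb)}")
```

Treat the result as a starting point; activations, optimizer state, and sequence length all add memory beyond the weights.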
Configuring LoRA Rank (r) and Alpha (α) Parameters
The rank (r) controls the expressiveness of your adapters, while alpha (α) scales the learning signal. The repository recommends specific values based on task complexity:
| Task type | Recommended rank (r) | Alpha (α) |
|---|---|---|
| Binary classification, sentiment analysis | 4 | 8 |
| Single-domain Q&A, summarization, translation | 8 | 16 |
| Multi-domain instruction following, chat | 16 | 32 |
| Code generation, complex reasoning | 32 | 64 |
| Experimental/ablation only | 64 | 128 |
Follow the rule α ≈ 2 × r. Adjust downward (α = r) if training becomes unstable, or upward (α = 4 × r) if convergence is too slow. These guidelines are extracted from the rank selection tables in prompt-lora-advisor.md.
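To see why rank is cheap to raise and why α is tied to r, the sketch below counts adapter parameters for one square projection and prints the α/r scaling factor that is applied to the adapter output. The hidden size of 4096 and the helper names are assumptions made for illustration, not values from the repository.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Adapter output is multiplied by alpha / rank before being added."""
    return alpha / rank


d = 4096  # hypothetical hidden size of one projection layer
full_params = d * d
for rank in (4, 8, 16, 32, 64):
    alpha = 2 * rank  # the alpha ≈ 2 × r rule from the table above
    n = lora_trainable_params(d, d, rank)
    share = 100 * n / full_params
    print(f"r={rank:<2} alpha={alpha:<3} scale={lora_scale(alpha, rank)} "
          f"adapter params={n} ({share:.2f}% of the layer)")
```

With α = 2 × r the scale stays at 2 for every rank, so changing r alters adapter capacity without also changing the magnitude of the update.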
Selecting Target Modules for Adapter Injection
Not all layers require adaptation. Start with the minimal viable set and expand only if validation metrics indicate underfitting:
- Minimum viable: Target q_proj and v_proj (attention query and value projections).
- Standard recommendation: Include q_proj, k_proj, v_proj, and o_proj (all attention projections).
- Maximum coverage: Add MLP layers including gate_proj, up_proj, and down_proj.
In lora.py, the inject_lora() function accepts a target_modules list where you specify which layers to adapt. For most instruction-tuning tasks, the Standard set provides the best efficiency-to-quality ratio.
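The three tiers can be expressed as name filters. The sketch below is not the matching logic of inject_lora(); it assumes a hypothetical LLaMA-style naming scheme (`layers.N.self_attn.q_proj` and so on) and simply selects modules whose final name component is in the target list.

```python
TARGET_SETS = {
    "minimum": ["q_proj", "v_proj"],
    "standard": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "maximum": ["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"],
}


def select_modules(module_names, target_modules):
    """Keep modules whose last dotted component is a target name."""
    targets = set(target_modules)
    return [name for name in module_names if name.split(".")[-1] in targets]


# Hypothetical module names for a two-layer transformer.
module_names = []
for i in range(2):
    module_names.append(f"layers.{i}.input_layernorm")
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
        module_names.append(f"layers.{i}.self_attn.{proj}")
    for proj in ("gate_proj", "up_proj", "down_proj"):
        module_names.append(f"layers.{i}.mlp.{proj}")

for tier, targets in TARGET_SETS.items():
    print(tier, len(select_modules(module_names, targets)))
```

Listing the matched names before training is a cheap way to confirm that the adapters land where you expect.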
Hyperparameter Guidelines for Training
LoRA and QLoRA typically use higher learning rates than full fine-tuning, because only the small adapter matrices are updated:
| Method | Learning rate range | Effective batch size |
|---|---|---|
| Full fine-tuning | 1e‑5 – 5e‑5 | 16 – 64 |
| LoRA (fp16 base) | 5e‑5 – 2e‑4 | 16 – 64 |
| QLoRA (4‑bit base) | 1e‑4 – 3e‑4 | 16 – 64 |
When VRAM is constrained, set per_device_batch_size=1 and increase gradient_accumulation_steps to 16 or 32 to maintain the effective batch size.
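A minimal sketch of the arithmetic behind gradient accumulation follows. The helper names are hypothetical, and the toy one-parameter model exists only to show that averaging equal-sized micro-batch gradients reproduces the gradient of the full batch.

```python
def accumulation_steps(effective_batch: int, per_device_batch: int,
                       n_devices: int = 1) -> int:
    """Steps needed so per_device_batch * n_devices * steps == effective_batch."""
    per_step = per_device_batch * n_devices
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be a multiple of the per-step batch")
    return effective_batch // per_step


def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


print(accumulation_steps(16, 1))  # batch size 1 needs 16 accumulation steps

# Toy data: compare one batch of 4 against four accumulated micro-batches of 1.
xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5
full_grad = grad_mse(w, xs, ys)
micro_grads = [grad_mse(w, xs[i:i + 1], ys[i:i + 1]) for i in range(len(xs))]
accumulated_grad = sum(micro_grads) / len(micro_grads)
print(full_grad, accumulated_grad)
```

The trade-off is wall-clock time: each optimizer update now costs as many forward and backward passes as there are accumulation steps.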
Regularization via dropout prevents overfitting on small datasets:
- < 5K examples: lora_dropout=0.10
- 5K – 100K examples: lora_dropout=0.05 (default)
- > 100K examples: lora_dropout=0.00
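The dropout tiers above reduce to a simple lookup. The function name `choose_lora_dropout` is hypothetical, and treating exactly 100K examples as part of the middle tier is an assumption about where the boundary falls.

```python
def choose_lora_dropout(n_examples: int) -> float:
    """Return a lora_dropout value from the dataset-size tiers listed above."""
    if n_examples < 5_000:
        return 0.10
    if n_examples <= 100_000:
        return 0.05
    return 0.0


for n in (1_200, 40_000, 250_000):
    print(n, choose_lora_dropout(n))
```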
Implementing QLoRA with NF4 Quantization
QLoRA quantizes the frozen base model to 4-bit Normal Float (NF4) while preserving adapter weights in fp16. The implementation in lora.py follows this workflow:
- Freeze all base parameters by setting requires_grad=False.
- Apply quantize_model() to convert frozen tensors to NF4 using per-block scaling factors.
- Inject LoRA adapters that remain trainable in fp16.
The quantization utilities quantize_to_nf4 and dequantize_from_nf4 handle the bit-packing and scaling automatically within the LoRALayer class.
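To make block-wise NF4 quantization concrete, here is a dependency-free sketch. It is not the repository's quantize_to_nf4: the function names are hypothetical, the codebook values are the commonly published NF4 levels rounded to four decimals (treat them as an assumption), and the block size of 4 is chosen only to keep the demo small; real implementations usually use larger blocks such as 64.

```python
# Assumed NF4 codebook, rounded to 4 decimals; index 7 is exactly zero.
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def nf4_quantize_sketch(weights, block_size=4):
    """Return a list of (absmax scale, 4-bit code list), one entry per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        codes = [min(range(16), key=lambda i: abs(NF4_LEVELS[i] - v / scale))
                 for v in block]
        blocks.append((scale, codes))
    return blocks


def nf4_dequantize_sketch(blocks):
    """Look up each code in the codebook and rescale by the block's absmax."""
    return [NF4_LEVELS[c] * scale for scale, codes in blocks for c in codes]


def pack_codes(codes):
    """Pack two 4-bit codes into each byte (pads odd lengths with a zero code)."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack_codes(packed, n):
    """Inverse of pack_codes; n is the original number of codes."""
    out = []
    for byte in packed:
        out.extend((byte >> 4, byte & 0x0F))
    return out[:n]


weights = [0.5, -1.0, 0.0, 0.25, 0.1, -0.2, 0.05, 0.4]
blocks = nf4_quantize_sketch(weights, block_size=4)
restored = nf4_dequantize_sketch(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

Because each block carries its own scale, one outlier weight only degrades the precision of its own block rather than the whole tensor.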
Training Workflow and Evaluation
Monitor three key signals during training:
- Loss curves: Should show steady decrease without spikes.
- Gradient norms: Watch for explosion (indicating instabilities) or vanishing.
- Evaluation metrics: Compare base model, LoRA-adapted model, and a fully fine-tuned reference on a held-out set of ~200 examples using accuracy, BLEU, or ROUGE scores as appropriate.
Run evaluation after each epoch or every 500 steps to detect overfitting early, especially when using higher ranks (r ≥ 32).
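The first and third signals can be checked automatically. The sketch below is one possible approach, with hypothetical function names and arbitrary thresholds (a spike is any loss more than 1.5 times the mean of the previous five steps; stopping triggers after two evaluations without improvement). Tune both for your run.

```python
def find_loss_spikes(losses, window=5, factor=1.5):
    """Indices where the loss exceeds factor x the mean of the previous window."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = sum(losses[i - window:i]) / window
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes


def should_stop(eval_losses, patience=2):
    """True when the last `patience` evaluations never beat the earlier best."""
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    return all(loss >= best_before for loss in eval_losses[-patience:])


# Made-up numbers for illustration only.
train_losses = [2.0, 1.8, 1.6, 1.5, 1.4, 3.0, 1.3]
eval_losses = [1.0, 0.8, 0.7, 0.75, 0.9]
print("spikes at steps:", find_loss_spikes(train_losses))
print("stop early:", should_stop(eval_losses))
```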
Persisting and Serving LoRA Adapters
Adapters are significantly smaller than full models, enabling efficient storage and multi-task serving:
- Saving: save_lora_adapter() in lora.py (lines 55‑71) stores only the LoRA matrices (A and B) along with rank and alpha metadata.
- Loading: load_lora_adapter() (lines 166‑176) restores adapters into a model that already has LoRA layers injected via inject_lora().
- Multi-adapter serving: Train separate adapters on disjoint data splits, then switch adapters at inference time for task-specific routing without reloading the base model.
To merge adapters permanently for faster inference, call merge_lora_weights(), which adds the scaled adapter product (α/r)·BA to the original weight matrix.
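Merging works because the adapter path is linear: applying W and the adapter separately gives the same output as applying a single matrix W + (α/r)·BA. The sketch below checks this with plain Python lists. It is an illustration with hypothetical helper names and tiny made-up matrices, not the repository's merge_lora_weights().

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, x):
    """Multiply matrix m by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def lora_forward(W, A, B, alpha, rank, x):
    """Unmerged path: y = W x + (alpha / rank) * B (A x)."""
    scale = alpha / rank
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def merge_sketch(W, A, B, alpha, rank):
    """Merged weights: W' = W + (alpha / rank) * B A."""
    scale = alpha / rank
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, shape (2, 3)
A = [[1.0, 2.0, 3.0]]                    # LoRA A, shape (rank=1, 3)
B = [[0.5], [-1.0]]                      # LoRA B, shape (2, rank=1)
x = [1.0, 1.0, 1.0]

unmerged_out = lora_forward(W, A, B, alpha=2, rank=1, x=x)
merged_out = matvec(merge_sketch(W, A, B, alpha=2, rank=1), x)
print(unmerged_out, merged_out)
```

After merging, inference uses a single matrix multiply per layer, which is why the merged model has no adapter overhead.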
Complete Implementation Example
The following workflow demonstrates end-to-end fine-tuning using the repository's implementation:
```python
# Install dependencies first
# pip install -r requirements.txt
import os
import sys
import tempfile

# The repository's directory names contain hyphens and leading digits, so they
# cannot be used in a dotted import. Add the code directory to sys.path instead
# (this assumes you run the script from the repository root).
sys.path.insert(0, "phases/11-llm-engineering/08-fine-tuning-lora/code")

from lora import (
    create_demo_model,
    inject_lora,
    train_lora,
    quantize_model,
    merge_lora_weights,
    save_lora_adapter,
    load_lora_adapter,
    create_demo_data,
)

# Initialize base model (or load from HuggingFace)
model = create_demo_model()

# Inject LoRA adapters: rank=8, alpha=16, targeting layers 0 and 2
lora_layers = inject_lora(
    model,
    target_modules=["0", "2"],
    rank=8,
    alpha=16,
)

# Prepare training data
data = create_demo_data()

# Train adapters only (base model remains frozen)
losses = train_lora(model, data, epochs=10, lr=1e-3, batch_size=4)

# For QLoRA: quantize base to NF4, keep adapters fp16
quant_state = quantize_model(model)
# Training proceeds identically with train_lora()

# Optional: merge weights for deployment speed
merge_lora_weights(model)

# Persist adapters (typically <10MB vs GBs for full model)
with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as tmp_file:
    tmp_path = tmp_file.name
n_saved = save_lora_adapter(model, tmp_path)
print(f"Saved {n_saved} LoRA tensors ({os.path.getsize(tmp_path)/1024:.1f} KB)")

# Load for later inference
load_lora_adapter(model, tmp_path)
```
Summary
- Choose QLoRA when VRAM is below the fp16 model size but at least a quarter of it; use standard LoRA when you can fit the fp16 base model.
- Set rank between 4 and 32 depending on task complexity, with alpha approximately twice the rank.
- Target attention projections first (q_proj, v_proj), expanding to MLP layers only if quality is insufficient.
- Quantize to NF4 using quantize_model() before training to reduce base-model weight memory by roughly 75% relative to fp16.
- Monitor loss curves and gradient norms closely when using aggressive learning rates (1e‑4 – 3e‑4) typical for QLoRA.
- Persist only adapter weights via save_lora_adapter() for efficient storage and multi-task deployment.
Frequently Asked Questions
What is the difference between LoRA and QLoRA?
LoRA keeps the base model in 16-bit floating point and adds trainable low-rank matrices to specific layers. QLoRA first quantizes the frozen base model to 4-bit Normal Float (NF4), drastically reducing memory usage while keeping the LoRA adapters in fp16. According to the ai-engineering-from-scratch implementation, QLoRA allows fine-tuning models up to 4× larger than would fit in fp16 on the same VRAM.
How do I choose the correct LoRA rank for my task?
Start with r = 8 for single-domain tasks like summarization or translation, and r = 16 for multi-domain instruction following. Use r = 4 for simple binary classification, and r = 32 only for code generation or complex reasoning tasks. The repository reserves r = 64 for experimental or ablation studies, as higher ranks increase compute without proportional quality gains.
Can I merge LoRA weights back into the original model?
Yes. The merge_lora_weights() function in lora.py computes the product of the LoRA matrices B and A, scales by alpha/rank, and adds the result to the frozen base weights. This produces a standard model with no inference overhead, though you lose the ability to switch between different adapters dynamically.
Why does QLoRA use NF4 quantization instead of INT4?
NF4 (Normal Float 4-bit) is optimized for the zero-centered normal distributions typical of neural network weights, providing better accuracy than uniform INT4 quantization for the same bit width. The quantize_to_nf4 implementation in lora.py applies block-wise scaling to minimize quantization error on the frozen parameters while keeping trainable adapters in higher precision.