How LoRA Adapters Are Merged into Base Weights and Dynamically Unloaded in Hugging Face Transformers

LoRA adapters are merged into base weights by concatenating the lora_A matrices along the rank dimension and combining the lora_B matrices block-diagonally, while dynamic unloading is achieved by toggling enable flags in wrapper layers without modifying the underlying tensors.

The Hugging Face Transformers library integrates Parameter-Efficient Fine-Tuning (PEFT) through a dedicated mixin that handles LoRA adapter loading, merging, and runtime management. To see how LoRA adapters are merged into base weights or dynamically unloaded, this article walks through the internal weight conversion pipeline and the lightweight wrapper architecture that lets adapters be switched without reloading the base model.

Understanding LoRA Adapter Architecture in Transformers

Separate Parameter Storage

When a LoRA adapter is loaded via load_adapter(), the integration does not immediately add the adapter's tensors to the original weight matrices. Instead, the library maintains the LoRA parameters in a separate adapter state-dict and injects lightweight wrapper layers (such as peft.tuners.lora.layer.Linear) around the base model's linear layers. This separation allows multiple adapters to coexist without duplicating the base model weights.

Wrapper Layer Injection

The wrapper layers intercept the forward pass to compute the low-rank update W'x = Wx + BAx, where W represents the frozen base weights and A/B are the trainable LoRA matrices. This architecture is implemented in src/transformers/integrations/peft.py through the PeftAdapterMixin class, which manages the lifecycle of these injected modules.
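As a minimal sketch of that forward pass, the NumPy snippet below checks that the wrapped computation matches a dense layer whose weight is W + BA. NumPy stands in for the torch tensors, all shapes and names are illustrative assumptions, and the lora_alpha / r scaling that PEFT applies is reduced to a plain `scaling` argument.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2                 # illustrative sizes, not from any real model

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # lora_A: down-projection to rank r
B = rng.standard_normal((d_out, r))      # lora_B: up-projection back to d_out
x = rng.standard_normal(d_in)


def lora_forward(x, W, A, B, scaling=1.0):
    """Return Wx + scaling * B(Ax) without materializing the d_out x d_in product BA."""
    return W @ x + scaling * (B @ (A @ x))


y_wrapped = lora_forward(x, W, A, B)
y_dense = (W + B @ A) @ x                # same result from the explicitly fused weight
print(np.allclose(y_wrapped, y_dense))
```

Computing B(Ax) rather than (BA)x is what keeps the wrapper cheap: the two small products cost far less than one full-size matrix multiply.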

How LoRA Adapters Are Merged into Base Weights

When exporting a single checkpoint with the adapter folded into the base weights (for example, via save_pretrained with merge_adapter=True), the library fuses the LoRA tensors into the original W matrix through a structured conversion pipeline.

Building the Weight Mapping

The process begins with _build_peft_weight_mapping in src/transformers/integrations/peft.py (lines 95-107), which constructs a mapping that tells the loader how to map the adapter's lora_A and lora_B tensors onto the base model's weight keys. This mapping uses PeftConcatenate objects specifically for handling the B matrices.

Concatenating lora_A Tensors

The lora_A matrices (the down-projection layers) are concatenated along the rank dimension using standard tensor concatenation. The PeftConcatenate class inherits from Concatenate, and its docstring (lines 75-82 of peft.py) notes that this step needs no special block-diagonal logic.
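A small NumPy sketch of this step, assuming two hypothetical lora_A matrices of ranks 4 and 2 that share the same input dimension (names and shapes are illustrative, not taken from the library):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in = 8
A_first = rng.standard_normal((4, d_in))    # lora_A with rank 4
A_second = rng.standard_normal((2, d_in))   # lora_A with rank 2

# Stack along axis 0, the rank dimension; the shared input dimension is unchanged.
A_cat = np.concatenate([A_first, A_second], axis=0)
print(A_cat.shape)
```

The result has rank dimension 4 + 2 = 6, and each original matrix can still be read back as a contiguous slice of rows.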

Block-Diagonal Merge of lora_B

The lora_B matrices (the up-projection layers) require special handling to preserve the separate contributions of each adapter. The implementation merges these block-diagonally so that each adapter's contribution occupies a distinct block of the fused weight.

This logic lives in PeftConcatenate.convert (lines 107-130) and uses:

  • torch.block_diag for standard 2-D tensors
  • _block_diag_3d for 3-D tensors (used in Mixture-of-Experts architectures)

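The block-diagonal structure keeps the merged factors mathematically equivalent to applying each original lora_B/lora_A pair on its own: multiplying the block-diagonal B by the rank-concatenated A stacks the individual updates B_i A_i with no cross terms between blocks, while everything resides in a single weight matrix.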
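The equivalence can be checked numerically. The sketch below uses a small hand-written `block_diag` helper as a NumPy stand-in for torch.block_diag. The two LoRA pairs, their ranks, and their output sizes are made-up values for illustration.

```python
import numpy as np


def block_diag(*mats):
    """NumPy stand-in for torch.block_diag: matrices on the diagonal, zeros elsewhere."""
    rows = sum(m.shape[0] for m in mats)
    cols = sum(m.shape[1] for m in mats)
    out = np.zeros((rows, cols), dtype=mats[0].dtype)
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r, c = r + m.shape[0], c + m.shape[1]
    return out


rng = np.random.default_rng(0)
d_in = 8
A1, B1 = rng.standard_normal((4, d_in)), rng.standard_normal((5, 4))   # rank 4, 5 output rows
A2, B2 = rng.standard_normal((2, d_in)), rng.standard_normal((3, 2))   # rank 2, 3 output rows

A_cat = np.concatenate([A1, A2], axis=0)     # (6, 8): ranks stacked
B_merged = block_diag(B1, B2)                # (8, 6): one diagonal block per pair

delta_merged = B_merged @ A_cat                      # update from the merged factors
delta_separate = np.concatenate([B1 @ A1, B2 @ A2])  # each pair applied on its own
print(np.allclose(delta_merged, delta_separate))
```

The zero blocks are what prevent B1 from ever multiplying rows of A2 (and vice versa), which is why the two results agree.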

Final Weight Fusion

The conversion pipeline applies the formula W' = W + B_merged @ A (the same factor order as the forward pass Wx + BAx), where:

  • W is the original frozen base weight
  • A represents the concatenated lora_A tensors
  • B_merged represents the block-diagonal merged lora_B tensors

The WeightConverter and WeightRenaming classes handle the final state-dict manipulation, writing the fused tensor back into the state-dict that model.save_pretrained ultimately serializes. This process is triggered when load_adapter calls _load_pretrained_model with the weight_mapping=peft_weight_conversions parameter.

from transformers import AutoModelForCausalLM

# Load base model and LoRA adapter
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
model.load_adapter("hf-internal-testing/adapter-bloom-lora", adapter_name="lora")

# Merge adapter into base weights and save unified checkpoint
model.save_pretrained("bloom-560m-merged-lora", merge_adapter=True)
# Result: saved folder contains only base model weights with LoRA fused in
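What "fused in" means at the state-dict level can be sketched as follows. The function name `fuse_lora_state_dict` and the simplified key layout are hypothetical and exist only for this illustration; the library's own WeightConverter pipeline is not reproduced here.

```python
import numpy as np


def fuse_lora_state_dict(state, prefix, scaling=1.0):
    """Return a new dict where `prefix`'s LoRA factors are folded into its base weight.

    The key layout (`<prefix>.weight`, `<prefix>.lora_A.weight`,
    `<prefix>.lora_B.weight`) is a simplified, hypothetical naming scheme.
    """
    fused = dict(state)                              # shallow copy; input stays intact
    A = fused.pop(f"{prefix}.lora_A.weight")
    B = fused.pop(f"{prefix}.lora_B.weight")
    fused[f"{prefix}.weight"] = state[f"{prefix}.weight"] + scaling * (B @ A)
    return fused


rng = np.random.default_rng(0)
state = {
    "proj.weight": rng.standard_normal((6, 8)),
    "proj.lora_A.weight": rng.standard_normal((2, 8)),
    "proj.lora_B.weight": rng.standard_normal((6, 2)),
}
fused = fuse_lora_state_dict(state, "proj")
print(sorted(fused))
```

After fusion only the plain weight key remains, so the result can be loaded by code that knows nothing about LoRA.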

Dynamic Unloading and Adapter Management

When dynamically unloading a LoRA adapter at runtime, the library does not delete the adapter's tensors or modify the base weights. Instead, it toggles internal flags within each tuner layer to bypass the LoRA computation.

Runtime Toggle Mechanism

The PeftAdapterMixin in src/transformers/integrations/peft.py exposes methods that call enable_adapters(...) on BaseTunerLayer and ModulesToSaveWrapper objects. These wrappers surround the original linear layers and conditionally add the low-rank contribution based on the flag state.

Disabling Adapters

The disable_adapters() method (lines 172-180) iterates over all tuner layers and calls module.enable_adapters(False). When disabled, the wrapper layers pass through the base model's output without computing BAx, effectively running inference with pure base weights while preserving the adapter parameters in memory.


# Dynamically disable LoRA at inference time
# (token_ids is assumed to be a tensor of input IDs produced by the model's tokenizer)
model.disable_adapters()
output = model.generate(token_ids)  # Uses only base model weights
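The toggle itself can be pictured with a toy wrapper. `ToyLoRAWrapper` and its `_disabled` attribute are hypothetical names for this sketch and are not PEFT's actual classes; only the method name enable_adapters mirrors the call described above.

```python
import numpy as np


class ToyLoRAWrapper:
    """Hypothetical stand-in for a tuner layer: base weight, one LoRA pair, one flag."""

    def __init__(self, W, A, B):
        self.W, self.A, self.B = W, A, B
        self._disabled = False                # the flag that forward() consults

    def enable_adapters(self, enabled):
        """Flip the flag only; W, A and B are left untouched."""
        self._disabled = not enabled

    def forward(self, x):
        base = self.W @ x
        if self._disabled:
            return base                       # bypass: pure base-model output
        return base + self.B @ (self.A @ x)   # base output plus low-rank update


rng = np.random.default_rng(0)
layer = ToyLoRAWrapper(
    rng.standard_normal((6, 8)), rng.standard_normal((2, 8)), rng.standard_normal((6, 2))
)
x = rng.standard_normal(8)

y_lora = layer.forward(x)
layer.enable_adapters(False)
y_base = layer.forward(x)        # adapter parameters still in memory, just skipped
layer.enable_adapters(True)
y_restored = layer.forward(x)
```

Because nothing is deleted or rewritten, re-enabling reproduces the earlier LoRA output exactly.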

Enabling and Switching Adapters

The enable_adapters() method (lines 236-244) reactivates the wrappers, restoring the LoRA forward pass. For models with multiple loaded adapters, set_adapter() (lines 76-84) selects specific adapter(s) by calling module.set_adapter(adapter_name) on each wrapper, allowing granular control over which low-rank updates apply during inference.


# Re-enable the adapter
model.enable_adapters()
output = model.generate(token_ids)  # LoRA contribution restored

# Switch to a different adapter
model.set_adapter("other_lora_adapter")
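Selecting among several loaded adapters can be sketched the same way. `ToyMultiAdapterLayer` and `add_adapter` below are hypothetical, and the real wrappers track more state than a single active name.

```python
import numpy as np


class ToyMultiAdapterLayer:
    """Hypothetical wrapper: one shared base weight, several named LoRA pairs."""

    def __init__(self, W):
        self.W = W
        self.adapters = {}        # name -> (A, B)
        self.active = None

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def set_adapter(self, name):
        if name not in self.adapters:
            raise ValueError(f"unknown adapter: {name}")
        self.active = name        # selection is a name change; no weights are copied

    def forward(self, x):
        out = self.W @ x
        if self.active is not None:
            A, B = self.adapters[self.active]
            out = out + B @ (A @ x)
        return out


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
layer = ToyMultiAdapterLayer(W)
layer.add_adapter("lora", rng.standard_normal((2, 8)), rng.standard_normal((6, 2)))
layer.add_adapter("other_lora_adapter", rng.standard_normal((4, 8)), rng.standard_normal((6, 4)))
x = rng.standard_normal(8)

layer.set_adapter("lora")
y_first = layer.forward(x)
layer.set_adapter("other_lora_adapter")
y_second = layer.forward(x)
```

Both adapters share the single copy of W, which is why switching does not require reloading the base model.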

Hot-Swapping Adapters

For advanced use cases involving torch.compile or frequent adapter switching without reloading the base model, the mixin supports hot-swapping. The enable_peft_hotswap(target_rank=..., check_compiled=...) method prepares the model to accept adapters of different ranks by pre-allocating buffers.

When loading with load_adapter(..., hotswap=True), the library replaces the tensors of the already-loaded adapter in-place using the same wrapper objects, avoiding graph recompilation. This is particularly valuable for serving infrastructure where latency matters.


# Prepare for hot-swapping with higher rank adapters
model.enable_peft_hotswap(target_rank=256)

# Load first adapter
model.load_adapter("path/to/first_lora", adapter_name="lora")

# Hot-swap to second adapter without recompilation
model.load_adapter("path/to/second_lora", adapter_name="lora", hotswap=True)
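The idea behind pre-allocating for a target rank can be illustrated with fixed-shape buffers that are overwritten in place. This is a conceptual sketch under the assumption that smaller adapters are zero-padded up to the target rank; `hotswap_in_place` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

d_in, d_out, target_rank = 8, 6, 4

# Buffers allocated once at the largest rank expected at serving time.
A_buf = np.zeros((target_rank, d_in))
B_buf = np.zeros((d_out, target_rank))


def hotswap_in_place(A_buf, B_buf, A_new, B_new):
    """Copy a new adapter into the existing buffers, zero-padding unused rank slots."""
    r = A_new.shape[0]
    if r > A_buf.shape[0]:
        raise ValueError("adapter rank exceeds the pre-allocated target rank")
    A_buf[...] = 0.0
    B_buf[...] = 0.0
    A_buf[:r] = A_new
    B_buf[:, :r] = B_new


rng = np.random.default_rng(0)
first = (rng.standard_normal((2, d_in)), rng.standard_normal((d_out, 2)))    # rank 2
second = (rng.standard_normal((4, d_in)), rng.standard_normal((d_out, 4)))   # rank 4

hotswap_in_place(A_buf, B_buf, *first)
delta_first = B_buf @ A_buf
buffers_before = (id(A_buf), A_buf.shape)

hotswap_in_place(A_buf, B_buf, *second)
delta_second = B_buf @ A_buf
buffers_after = (id(A_buf), A_buf.shape)
```

The zero-padded rank slots contribute nothing to B @ A, and the buffer objects and shapes never change, which is the property a compiled graph relies on to avoid retracing.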

Summary

  • Separate Storage: LoRA adapters remain distinct from base weights in separate state-dicts with wrapper layers handling the forward pass, as implemented in src/transformers/integrations/peft.py.
  • Block-Diagonal Merging: When merging, lora_B tensors are combined block-diagonally via PeftConcatenate.convert using torch.block_diag, while lora_A uses standard concatenation, producing the fused weight W' = W + B_merged @ A.
  • Dynamic Unloading: Runtime adapter disabling occurs through disable_adapters(), which sets internal flags in BaseTunerLayer wrappers to bypass LoRA computation without deleting parameters or modifying base weights.
  • Hot-Swapping: The enable_peft_hotswap and load_adapter(..., hotswap=True) APIs allow in-place tensor replacement for compiled models, enabling low-latency adapter switching in serving environments.

Frequently Asked Questions

What happens to the original model weights when a LoRA adapter is merged?

The original base weights remain intact in memory during the merge process, but the exported checkpoint contains only the fused weights W' = W + B_merged @ A. The save_pretrained method with merge_adapter=True computes this fusion via the WeightConverter pipeline in src/transformers/integrations/peft.py and writes a standard state-dict without LoRA-specific keys, effectively baking the adaptations permanently into the linear layers.

Can I switch between multiple LoRA adapters without reloading the base model?

Yes, the PeftAdapterMixin supports dynamic adapter switching through the set_adapter() method, which calls module.set_adapter(adapter_name) on every BaseTunerLayer wrapper to select the active low-rank matrices. For scenarios involving torch.compile, you can use enable_peft_hotswap() followed by load_adapter(..., hotswap=True) to replace adapter tensors in-place without triggering graph recompilation, which keeps switching latency low in production environments.

Does merging LoRA adapters affect model performance or inference speed?

Merging adapters removes the overhead of the separate LoRA forward pass (Wx + BAx): the merged model contains only standard linear layers without wrapper objects, so inference runs at the same speed as the original base model. Unmerged adapter inference incurs a small latency penalty from the additional matrix multiplications and memory accesses required to combine base and LoRA outputs. Model outputs are unchanged by merging apart from floating-point rounding, since the fused weight computes the same function.
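A rough sense of the arithmetic overhead comes from counting multiply-accumulate operations per token for one linear layer. The hidden size and rank below are hypothetical. The count also ignores kernel-launch and memory-traffic costs, so measured latency overhead is typically larger than this ratio suggests.

```python
d_in = d_out = 4096   # hypothetical hidden size
r = 16                # hypothetical LoRA rank

base_macs = d_out * d_in              # Wx
lora_macs = r * d_in + d_out * r      # Ax, then B(Ax)
overhead = lora_macs / base_macs

print(f"base: {base_macs:,} MACs, LoRA extra: {lora_macs:,} MACs ({overhead:.2%})")
```

For these values the LoRA path adds 131,072 operations on top of 16,777,216, under one percent of the base cost, and merging removes even that.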

Where is the logic for block-diagonal merging implemented in Transformers?

The block-diagonal merging logic resides in the PeftConcatenate class within src/transformers/integrations/peft.py, specifically in the convert method (lines 107-130). This method uses torch.block_diag for standard 2-D tensors and a custom _block_diag_3d helper for 3-D tensors used in Mixture-of-Experts architectures, ensuring that multiple adapters' lora_B matrices occupy distinct diagonal blocks in the fused weight matrix.
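For the 3-D case, a batched block-diagonal can be sketched as below. It assumes a leading expert dimension shared by all inputs. The function is a hypothetical NumPy illustration and may differ from the library's _block_diag_3d in layout and details.

```python
import numpy as np


def block_diag_3d(*tensors):
    """Batched block-diagonal: inputs (E, m_i, n_i) -> output (E, sum m_i, sum n_i)."""
    num_experts = tensors[0].shape[0]
    rows = sum(t.shape[1] for t in tensors)
    cols = sum(t.shape[2] for t in tensors)
    out = np.zeros((num_experts, rows, cols), dtype=tensors[0].dtype)
    r = c = 0
    for t in tensors:
        out[:, r:r + t.shape[1], c:c + t.shape[2]] = t   # same block position for every expert
        r, c = r + t.shape[1], c + t.shape[2]
    return out


rng = np.random.default_rng(0)
B1 = rng.standard_normal((3, 5, 4))   # 3 experts, 5 output rows, rank 4
B2 = rng.standard_normal((3, 3, 2))   # 3 experts, 3 output rows, rank 2

B_merged = block_diag_3d(B1, B2)
print(B_merged.shape)
```

Each expert slice of the result is an ordinary 2-D block-diagonal matrix built from that expert's own lora_B blocks.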
