# How Stable Diffusion Generates Images: Architecture Explained

> Discover the Stable Diffusion architecture, from VAE to U-Net, and understand how it transforms text prompts into stunning images by denoising latent space.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: architecture
- Published: 2026-05-21

---

**Stable Diffusion is a latent diffusion model that uses a Variational Auto-Encoder (VAE), CLIP text encoder, U-Net denoiser, and noise scheduler to transform text prompts into high-fidelity images by iteratively refining random noise in a compressed 4×64×64 latent space.**

Stable Diffusion made open-source AI image generation practical by moving the computationally expensive diffusion process from high-resolution pixel space to a compressed latent representation. According to the `rohitg00/ai-engineering-from-scratch` repository, this architectural choice shrinks the data the denoiser must process by approximately 48× (a 3×512×512 image versus a 4×64×64 latent) while maintaining visual quality, enabling text-to-image synthesis on consumer GPUs. The implementation documented in [`phases/04-computer-vision/11-stable-diffusion/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/11-stable-diffusion/docs/en.md) shows how five core components work together to turn a text prompt into an image.

## The Five Core Components of Stable Diffusion

The architecture consists of specialized components that handle compression, conditioning, denoising, and scheduling. Most are neural networks; the scheduler is a sampling algorithm rather than a learned model. As detailed in the source documentation, each component serves a distinct purpose in the generation pipeline.

### Variational Auto-Encoder (VAE)

The **VAE** compresses high-resolution images into a low-dimensional latent space and reconstructs them after denoising. In [`phases/04-computer-vision/11-stable-diffusion/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/11-stable-diffusion/docs/en.md) (lines 48‑49), the encoder maps a 3×512×512 RGB image to a 4×64×64 latent tensor, representing a 48× reduction in dimensionality. The decoder performs the inverse operation, converting the final latent representation back into a viewable image. This compression makes the diffusion process tractable on standard hardware without sacrificing fidelity.
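The 48× figure follows directly from the two tensor shapes quoted above. The short check below is only arithmetic on those shapes; the variable names are illustrative and not taken from the repository.

```python
from math import prod

# Shapes quoted in the text, as channels x height x width
pixel_shape = (3, 512, 512)
latent_shape = (4, 64, 64)

pixel_values = prod(pixel_shape)    # number of values in one RGB image
latent_values = prod(latent_shape)  # number of values in one latent tensor
compression = pixel_values / latent_values

# Each spatial side shrinks by the same factor
downsample = pixel_shape[1] // latent_shape[1]

print(f"{pixel_values} -> {latent_values} values ({compression:.0f}x fewer)")
print(f"spatial downsampling per side: {downsample}x")
```

The spatial side shrinks by 8×, so the area shrinks by 64×; the channel count grows from 3 to 4, which brings the overall reduction to 48×.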

### Text Encoder

The **text encoder** converts human prompts into numerical embeddings that guide the image generation process. The repository notes that Stable Diffusion 1.x and 2.x use CLIP-L, SDXL combines CLIP-L and CLIP-G, and newer variants like SD 3 use T5-XXL (line 49 in [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md)). Note that the released SD 2.x checkpoints ship with an OpenCLIP ViT-H text encoder rather than OpenAI's CLIP-L, and SD 3 uses T5-XXL alongside two CLIP encoders. These embeddings serve as conditioning vectors that the U-Net uses to align visual features with the semantic concepts described in the prompt.

### U-Net Denoiser

The **U-Net** (or Diffusion Transformer in newer variants) constitutes the core neural network that predicts and removes noise from latent representations. According to lines 50‑52 and 75‑81 of the documentation, the U-Net architecture contains cross-attention blocks that allow latent patches to attend to text embeddings at every resolution level. Additionally, a time-embedding MLP injects timestep information, enabling the network to handle different noise levels throughout the iterative denoising process.
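To make the cross-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention in which queries come from latent patches and keys/values come from text tokens. It is a simplification under stated assumptions: a real U-Net block first applies learned linear projections to produce queries, keys, and values and uses multiple heads, and the toy vectors and function names below are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: one vector per latent patch; keys/values: one vector per text token."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this latent patch to every text token, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted mix of the token value vectors
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy data (hypothetical): 2 latent patches attending over 3 text tokens
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(latent_queries, text_keys, text_values)
for i, w in enumerate(weights):
    print(f"patch {i} attention over tokens: {[round(x, 3) for x in w]}")
```

Each latent patch ends up with a weighted mix of token values, with most weight on the token whose key best matches its query. This is how text information is routed to specific spatial locations.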

### Scheduler

The **scheduler** algorithmically manages the step-by-step transition from pure Gaussian noise to a coherent image. The source code references multiple scheduler implementations including DDIM, Euler, DPM-Solver++, and LCM/Turbo (lines 98‑104 in [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md)). These algorithms determine how much noise to remove at each step and whether to introduce stochasticity, directly impacting generation speed and output quality.
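As an illustration of what a scheduler step computes, the sketch below implements the forward noising formula and one deterministic DDIM update (the `eta = 0` case) on plain Python lists. The schedule values and function names are hypothetical, and the loop feeds in the true noise instead of a U-Net prediction, so it only demonstrates the update rule rather than a working sampler.

```python
import math

def add_noise(x0, noise, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return [math.sqrt(alpha_bar_t) * x + math.sqrt(1 - alpha_bar_t) * n
            for x, n in zip(x0, noise)]

def ddim_step(x_t, pred_noise, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level t to the next, cleaner level."""
    out = []
    for x, eps in zip(x_t, pred_noise):
        # Estimate the clean sample implied by the predicted noise
        x0_pred = (x - math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
        # Re-noise that estimate to the lower noise level
        out.append(math.sqrt(alpha_bar_prev) * x0_pred
                   + math.sqrt(1 - alpha_bar_prev) * eps)
    return out

# Hypothetical toy "latent" and a coarse schedule from noisy (0.1) to clean (1.0)
x0 = [2.0, -1.0, 0.5]
noise = [0.3, -0.7, 1.2]
alpha_bars = [0.1, 0.4, 0.8, 1.0]

x = add_noise(x0, noise, alpha_bars[0])
for a_t, a_prev in zip(alpha_bars[:-1], alpha_bars[1:]):
    # Oracle stand-in for the U-Net: pass the true noise as the prediction
    x = ddim_step(x, noise, a_t, a_prev)

print([round(v, 6) for v in x])
```

With a perfect noise prediction the loop returns to the original sample. In practice the U-Net's prediction is imperfect, which is why step count and scheduler choice affect quality.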

### Safety Checker

An optional **safety checker** screens generated images for NSFW or otherwise inappropriate content. It is not part of the core diffusion mathematics: it runs after the U-Net has completed its iterations (in the Hugging Face `diffusers` pipelines, it is applied to the decoded image) so that outputs comply with usage policies.

## The Latent Diffusion Process Step-by-Step

The generation pipeline follows a fixed sequence of steps, illustrated in the Mermaid diagram at lines 29‑41 of the documentation:

1. **Encode**: The VAE encoder converts the initial canvas or input image into latent space.
2. **Add Noise**: Gaussian noise perturbs the latent representation based on the current timestep *t*.
3. **Predict Noise**: The U-Net predicts the noise component, conditioned on text embeddings through cross-attention mechanisms and timestep embeddings through the MLP.
4. **Apply CFG**: The system executes **classifier-free guidance** by computing both conditional and unconditional noise predictions, then combining them as `ε = ε_uncond + w·(ε_cond − ε_uncond)` where *w* represents the guidance scale (lines 54‑62 in [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md)).
5. **Denoise**: The scheduler updates the latent vector toward a cleaner state using the predicted noise residual.
6. **Decode**: After *N* iterations of the predict, guide, and denoise steps (typically 20‑30 with DPM-Solver++), the VAE decoder converts the final latent tensor into a 512×512 RGB image.
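The guidance formula in step 4 is a simple linear extrapolation, sketched below on plain lists. The function name and the toy prediction values are hypothetical; in a real pipeline both inputs would be U-Net outputs with the shape of the latent.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine predictions as eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Hypothetical noise predictions for three latent values
eps_uncond = [0.10, -0.20, 0.05]
eps_cond = [0.30, -0.10, 0.00]

for w in (0.0, 1.0, 7.5):
    guided = classifier_free_guidance(eps_uncond, eps_cond, w)
    print(f"w={w}: {[round(g, 3) for g in guided]}")
```

Setting `w = 0` ignores the prompt entirely, `w = 1` reproduces the conditional prediction, and larger values push the result further along the direction from the unconditional to the conditional prediction.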

Because diffusion occurs in the compressed latent space rather than pixel space, training and inference require significantly fewer floating-point operations while preserving high-resolution output capabilities (lines 21‑22 in [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md)).

## Fine-Tuning with LoRA

**Low-Rank Adaptation (LoRA)** enables efficient customization of Stable Diffusion without retraining the entire base model. As implemented in the repository (lines 85‑94), LoRA injects low-rank matrices into the U-Net's attention layers while keeping the massive pretrained weights frozen. This approach yields adapter files sized between 10‑50 MB that can be dynamically loaded at inference time (lines 91‑94). Fine-tuning with LoRA requires only a small dataset of example images and significantly less GPU memory than full model fine-tuning.
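The sketch below shows the low-rank update in miniature: the effective weight is the frozen matrix plus a scaled product of two thin matrices. It is an illustrative sketch, not the repository's implementation. The 768-wide layer, the rank of 4, and all function names are assumptions chosen for the example, and the zero initialization of `B` follows the original LoRA formulation.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, lora_b, lora_a, scale):
    """Effective weight W + scale * (B @ A); W stays frozen, only A and B train."""
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def param_counts(d_out, d_in, rank):
    """Trainable parameters: full matrix versus the two low-rank factors."""
    return d_out * d_in, rank * (d_out + d_in)

# Toy 3x3 frozen weight with a rank-1 adapter
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[0.5, -0.5, 1.0]]             # rank x d_in
lora_b_init = [[0.0], [0.0], [0.0]]     # d_out x rank, zero at initialization
lora_b_trained = [[1.0], [0.0], [2.0]]  # hypothetical values after training

unchanged = lora_weight(w, lora_b_init, lora_a, scale=1.0)
adapted = lora_weight(w, lora_b_trained, lora_a, scale=0.8)

# Hypothetical 768x768 attention projection adapted at rank 4
full, lora = param_counts(768, 768, rank=4)
print(f"full: {full} params, LoRA: {lora} params ({full // lora}x fewer)")
```

Because `B` starts at zero, the adapter initially leaves the model's behavior unchanged. The `scale` argument plays the same role as the `lora_scale` value used when fusing adapters at inference time.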

## Practical Implementation

The repository provides runnable examples in [`phases/04-computer-vision/11-stable-diffusion/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/11-stable-diffusion/code/main.py) demonstrating pipeline initialization, scheduler configuration, and LoRA integration.

### Basic Text-to-Image Generation

```python
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load the pipeline with FP16 optimization
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Configure DPM-Solver++ for efficient deterministic sampling
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Generate image with CFG
image = pipe(
    prompt="a dog riding a skateboard in Tokyo, studio Ghibli style",
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("dog.png")
```

### Swapping Schedulers

Change the sampling algorithm without reloading the model:

```python
from diffusers import EulerAncestralDiscreteScheduler

# `pipe` is the pipeline created in the text-to-image example
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```

### Image-to-Image Translation

Modify existing images by adding controlled noise and redenoising:

```python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
import torch

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("dog.png").convert("RGB").resize((512, 512))

result = img2img(
    prompt="a dog riding a skateboard, oil painting",
    image=init_image,
    strength=0.6,  # Controls noise injection level
    guidance_scale=7.5,
).images[0]

result.save("dog_oil.png")
```

### Loading LoRA Adapters

Apply custom styles or concepts without permanent model modification:

```python
# `pipe` is the pipeline created in the text-to-image example

# Load pretrained LoRA weights
pipe.load_lora_weights("sayakpaul/sd-lora-ghibli")

# Fuse weights into base model for inference speed
pipe.fuse_lora(lora_scale=0.8)
```

### LoRA Training Structure

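The repository includes pseudo-code illustrating the training loop (from [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py)). The version here keeps that structure and, as an assumption, fills in `diffusers`/`transformers`-style calls so the loop can run once the model objects exist:

```python
import torch
import torch.nn.functional as F

# Assumes `vae`, `unet`, `text_encoder`, `tokenizer`, `scheduler`, `optimizer`,
# and `dataloader` were created beforehand (diffusers/transformers-style objects).
for step, batch in enumerate(dataloader):
    images, prompts = batch

    # Encode images to latent space; 0.18215 is the SD 1.x latent scaling factor
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample() * 0.18215

    # Sample one random timestep per image in the batch
    num_train_timesteps = scheduler.config.num_train_timesteps
    t = torch.randint(0, num_train_timesteps, (latents.shape[0],), device=latents.device)

    # Forward diffusion: add noise to latents
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Text conditioning from the frozen text encoder
    tokens = tokenizer(list(prompts), padding="max_length", truncation=True, return_tensors="pt")
    with torch.no_grad():
        text_emb = text_encoder(tokens.input_ids.to(latents.device))[0]

    # Predict noise with U-Net (LoRA weights active in attention layers)
    pred_noise = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample

    # Optimization step: clear stale gradients, backpropagate, update LoRA weights
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```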

## Summary

- **Latent Space Efficiency**: The VAE compresses 3×512×512 images to 4×64×64 latents, so the denoiser processes about 48× fewer values than pixel-space diffusion would.
- **Cross-Attention Conditioning**: The U-Net utilizes cross-attention layers to bind visual features with CLIP text embeddings, enabling precise prompt adherence.
- **Classifier-Free Guidance**: The CFG mechanism balances prompt fidelity and image diversity through the guidance scale parameter *w*.
- **Scheduler Flexibility**: Algorithms like DPM-Solver++ and Euler enable quality generation in 20‑30 steps, while LCM/Turbo variants achieve real-time synthesis.
- **Efficient Adaptation**: LoRA fine-tuning trains only small low-rank matrices added to the attention layers, producing 10‑50 MB adapters that customize generation without full model retraining.

## Frequently Asked Questions

### What is the difference between pixel-space and latent-space diffusion?

Pixel-space diffusion operates directly on RGB values at full resolution, so every denoising step must process a 3×512×512 tensor. Latent-space diffusion, as implemented in Stable Diffusion, uses a VAE to compress images into a 4×64×64 latent representation before applying the diffusion process. This architectural shift, documented in [`phases/04-computer-vision/11-stable-diffusion/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/11-stable-diffusion/docs/en.md) (lines 21‑22), shrinks the tensor handled at each step by approximately 48×, which substantially lowers memory consumption and inference time while maintaining comparable output quality.

### How does Classifier-Free Guidance (CFG) affect image generation?

Classifier-Free Guidance controls the trade-off between prompt adherence and image diversity by combining conditional and unconditional noise predictions. According to lines 54‑62 of the repository documentation, the formula `ε = ε_uncond + w·(ε_cond − ε_uncond)` allows users to scale the influence of the text prompt via the guidance scale *w*. Values between 7.5 and 12 typically produce prompt-faithful results, while values below 5 increase diversity and artistic interpretation at the cost of accuracy.

### Why is LoRA the preferred method for fine-tuning Stable Diffusion?

LoRA (Low-Rank Adaptation) freezes the base model's parameters and instead trains small, low-rank matrices injected into the U-Net's attention layers. As detailed in [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) (lines 85‑94), this approach requires roughly two orders of magnitude less storage (typically 10‑50 MB versus 4‑7 GB, a ratio of about 80× to 700×) and significantly less GPU memory than full fine-tuning. Users can swap LoRA weights at inference time to switch between artistic styles, character representations, or conceptual modifications without loading separate complete model checkpoints.

### Which scheduler should I use for Stable Diffusion inference?

The optimal scheduler depends on your speed and quality requirements. The repository highlights **DPM-Solver++** for high-quality generation in 20‑30 steps, **Euler** and **Euler Ancestral** for artistic styles with moderate step counts, and **LCM (Latent Consistency Models)** or **Turbo** variants for real-time generation in 4‑8 steps. Configuration examples in [`phases/04-computer-vision/11-stable-diffusion/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/11-stable-diffusion/code/main.py) demonstrate that schedulers can be swapped with a single line of code without re-instantiating the pipeline.