How Stable Diffusion Generates Images: Architecture Explained

Stable Diffusion is a latent diffusion model that uses a Variational Auto-Encoder (VAE), CLIP text encoder, U-Net denoiser, and noise scheduler to transform text prompts into high-fidelity images by iteratively refining random noise in a compressed 4×64×64 latent space.

Stable Diffusion changed open-source AI image generation by moving the computationally expensive diffusion process from high-resolution pixel space to a compressed latent representation. According to the rohitg00/ai-engineering-from-scratch repository, this architectural choice shrinks the tensor being denoised by approximately 48×, with a corresponding drop in compute, while maintaining visual quality and enabling text-to-image synthesis on consumer GPUs. The implementation documented in phases/04-computer-vision/11-stable-diffusion/docs/en.md shows how five core components work together to turn a text prompt into an image.

The Five Core Components of Stable Diffusion

The architecture combines specialized neural networks for compression, conditioning, and denoising with a scheduling algorithm and an optional output filter. As detailed in the source documentation, each component serves a distinct purpose in the generation pipeline.

Variational Auto-Encoder (VAE)

The VAE compresses high-resolution images into a low-dimensional latent space and reconstructs them after denoising. In phases/04-computer-vision/11-stable-diffusion/docs/en.md (lines 48‑49), the encoder maps a 3×512×512 RGB image to a 4×64×64 latent tensor, a 48× reduction in the number of values. The decoder performs the inverse operation, converting the final latent representation back into a viewable image. Because the VAE is lossy, the reconstruction is not exact, but the compression makes the diffusion process tractable on standard hardware with little perceptible loss of fidelity.
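The 48× figure follows directly from the two tensor shapes quoted above. A minimal sketch of that arithmetic, using only the shapes given in the text (the helper name `numel` is illustrative, not taken from the repository):

```python
# Element counts for the shapes quoted above (channels x height x width).
pixel_shape = (3, 512, 512)
latent_shape = (4, 64, 64)


def numel(shape):
    """Return the total number of scalar values in a tensor of this shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_values = numel(pixel_shape)
latent_values = numel(latent_shape)
compression = pixel_values / latent_values

# Each latent position summarizes a square patch of pixels along each axis.
spatial_factor = pixel_shape[1] // latent_shape[1]

print(f"{pixel_values} -> {latent_values} values ({compression:.0f}x fewer)")
print(f"each latent position covers a {spatial_factor}x{spatial_factor} pixel patch")
```

Note that the ratio describes how many values the denoiser touches per step; actual speed and memory savings depend on the network and hardware.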

Text Encoder

The text encoder converts human prompts into numerical embeddings that guide the image generation process. The repository lists CLIP-L for Stable Diffusion 1.x and 2.x, a combination of CLIP-L and CLIP-G for SDXL, and T5-XXL for newer variants like SD 3 (line 49 in docs/en.md). Two details are worth adding: the released Stable Diffusion 2.x checkpoints use an OpenCLIP ViT-H text encoder rather than the original CLIP ViT-L/14, and SD 3 pairs T5-XXL with CLIP encoders rather than replacing them. These embeddings serve as conditioning vectors that the U-Net uses to align visual features with semantic concepts described in the prompt.

U-Net Denoiser

The U-Net (or Diffusion Transformer in newer variants) constitutes the core neural network that predicts and removes noise from latent representations. According to lines 50‑52 and 75‑81 of the documentation, the U-Net architecture contains cross-attention blocks that allow latent patches to attend to text embeddings at every resolution level. Additionally, a time-embedding MLP injects timestep information, enabling the network to handle different noise levels throughout the iterative denoising process.
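To make the cross-attention idea concrete, here is a toy sketch in plain Python in which latent positions act as queries and text tokens supply keys and values. It is an illustration only: the real U-Net applies learned linear projections to produce queries, keys, and values and uses multiple heads, all of which are omitted here, and every number below is made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (latent position) takes a softmax-weighted average of the values (text tokens)."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this latent position and every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return outputs, weights


# Toy sizes: 2 latent positions, 3 text tokens, feature width 2.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

attended, attn_weights = cross_attention(latent_queries, text_keys, text_values)
print(attn_weights[0])  # how strongly latent position 0 attends to each text token
```

Each row of `attn_weights` sums to 1, so every latent position receives a mixture of text-token features weighted by relevance.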

Scheduler

The scheduler algorithmically manages the step-by-step transition from pure Gaussian noise to a coherent image. The source code references multiple scheduler implementations including DDIM, Euler, DPM-Solver++, and LCM/Turbo (lines 98‑104 in docs/en.md). These algorithms determine how much noise to remove at each step and whether to introduce stochasticity, directly impacting generation speed and output quality.
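As one concrete example of what a scheduler computes, the sketch below implements a single deterministic DDIM update (the eta = 0 case) on plain Python lists. The symbol `alpha_bar` (the cumulative signal fraction at a timestep) and the function name are assumptions introduced for illustration; the repository's own scheduler code may be organized differently.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0), applied elementwise to a list of floats."""
    x_prev = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the current noise prediction.
        x0_pred = (x - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
        # Re-noise that estimate to the lower noise level of the previous timestep.
        x_prev.append(math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps)
    return x_prev


# Sanity check: with a perfect noise prediction, stepping to alpha_bar = 1 recovers x0.
x0 = [0.5, -1.0, 2.0]
true_eps = [0.3, -0.7, 1.1]
alpha_bar_t = 0.25
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e for a, e in zip(x0, true_eps)]

recovered = ddim_step(x_t, true_eps, alpha_bar_t, alpha_bar_prev=1.0)
print(recovered)
```

Stochastic samplers such as Euler Ancestral differ mainly in that they add fresh noise after each update instead of reusing the predicted noise alone.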

Safety Checker

An optional safety checker screens generated content for NSFW or inappropriate material. While not part of the core diffusion mathematics, in the diffusers pipelines this module operates on the decoded image, after the U-Net iterations and VAE decoding are complete, and replaces flagged outputs to keep results compliant with usage policies.

The Latent Diffusion Process Step-by-Step

The generation pipeline follows the workflow illustrated in the Mermaid diagram at lines 29‑41 of the documentation. Steps 3 through 5 repeat once per sampling step:

  1. Encode: For image-to-image generation and training, the VAE encoder converts the input image into latent space. Plain text-to-image generation skips the encoder and starts from a randomly sampled latent.
  2. Add Noise: Gaussian noise perturbs the latent representation according to the current timestep t. In text-to-image generation the starting latent is pure Gaussian noise.
  3. Predict Noise: The U-Net predicts the noise component, conditioned on text embeddings through cross-attention mechanisms and on timestep embeddings through the MLP.
  4. Apply CFG: The system executes classifier-free guidance by computing both conditional and unconditional noise predictions, then combining them as ε = ε_uncond + w·(ε_cond − ε_uncond), where w is the guidance scale (lines 54‑62 in docs/en.md).
  5. Denoise: The scheduler updates the latent vector toward a cleaner state using the predicted noise residual.
  6. Decode: After N iterations (typically 20‑30 steps with DPM-Solver++), the VAE decoder converts the final latent tensor into a 512×512 RGB image.
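The guidance formula from step 4 can be sketched in a few lines of plain Python. The function name `apply_cfg` and the sample values are illustrative; a real pipeline applies the same expression to full latent-shaped tensors.

```python
def apply_cfg(eps_uncond, eps_cond, w):
    """Combine two noise predictions: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Made-up noise predictions for three latent values.
eps_uncond = [0.10, -0.20, 0.00]
eps_cond = [0.30, -0.10, 0.40]

unguided = apply_cfg(eps_uncond, eps_cond, w=0.0)     # ignores the prompt entirely
conditional = apply_cfg(eps_uncond, eps_cond, w=1.0)  # plain conditional prediction
guided = apply_cfg(eps_uncond, eps_cond, w=7.5)       # extrapolates past the conditional prediction

print(guided)
```

With w above 1, the result is pushed beyond the conditional prediction in the direction that distinguishes it from the unconditional one, which is why larger guidance scales strengthen prompt adherence.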

Because diffusion occurs in the compressed latent space rather than pixel space, training and inference require significantly fewer floating-point operations while preserving high-resolution output capabilities (lines 21‑22 in docs/en.md).

Fine-Tuning with LoRA

Low-Rank Adaptation (LoRA) enables efficient customization of Stable Diffusion without retraining the entire base model. As implemented in the repository (lines 85‑94), LoRA injects low-rank matrices into the U-Net's attention layers while keeping the massive pretrained weights frozen. This approach yields adapter files sized between 10‑50 MB that can be dynamically loaded at inference time (lines 91‑94). Fine-tuning with LoRA requires only a small dataset of example images and significantly less GPU memory than full model fine-tuning.
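A small sketch shows why the adapters are so compact. It counts trainable parameters for one weight matrix and applies the low-rank update y = W·x + scale·B·(A·x). The layer width of 768, the rank of 4, and the zero initialization of B (the convention from the original LoRA formulation) are illustrative assumptions, not values read from the repository.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full matrix, parameters in the low-rank pair A and B)."""
    full = d_out * d_in
    low_rank = rank * (d_in + d_out)
    return full, low_rank


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scale):
    """y = W x + scale * B (A x), with W frozen and only A and B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


full, low_rank = lora_param_counts(768, 768, rank=4)
print(f"full: {full}, LoRA: {low_rank}, ratio: {full / low_rank:.0f}x")

# Tiny 2x2 example: with B initialized to zero, the adapter starts as a no-op.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]      # shape: rank x d_in
B = [[0.0], [0.0]]    # shape: d_out x rank
y = lora_forward([2.0, 3.0], W, A, B, scale=1.0)
print(y)
```

Because only A and B are saved, the adapter file scales with the rank rather than with the size of the base model.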

Practical Implementation

The repository provides runnable examples in phases/04-computer-vision/11-stable-diffusion/code/main.py demonstrating pipeline initialization, scheduler configuration, and LoRA integration.

Basic Text-to-Image Generation

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load the pipeline with FP16 optimization
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Configure DPM-Solver++ for efficient deterministic sampling
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Generate image with CFG; the seeded generator makes the result reproducible
image = pipe(
    prompt="a dog riding a skateboard in Tokyo, studio Ghibli style",
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("dog.png")

Swapping Schedulers

Change the sampling algorithm without reloading the model:

from diffusers import EulerAncestralDiscreteScheduler
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

Image-to-Image Translation

Modify existing images by adding controlled noise and redenoising:

from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("dog.png").convert("RGB").resize((512, 512))

result = img2img(
    prompt="a dog riding a skateboard, oil painting",
    image=init_image,
    strength=0.6,  # Controls noise injection level
    guidance_scale=7.5,
).images[0]

result.save("dog_oil.png")
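The `strength` argument determines how far into the noise schedule the input image is pushed, and therefore how many denoising steps actually run. The sketch below mirrors the step-selection arithmetic used by the diffusers image-to-image pipeline as I understand it; the function name `img2img_steps` is hypothetical, and the count of 50 scheduled steps is an assumed setting for the example.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (steps skipped, steps actually run) for a given strength in [0, 1]."""
    # Higher strength means more noise is added, so more of the schedule must be denoised.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = max(num_inference_steps - steps_to_run, 0)
    return steps_skipped, steps_to_run


skipped, run = img2img_steps(50, 0.6)
print(f"strength=0.6 with 50 scheduled steps: skip {skipped}, run {run}")
```

At strength 1.0 the input image is fully replaced by noise and the call behaves like text-to-image generation; at strength 0.0 no denoising runs and the image is returned essentially unchanged.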

Loading LoRA Adapters

Apply custom styles or concepts without permanent model modification:

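
# `pipe` is the StableDiffusionPipeline created in the basic example above
# Load pretrained LoRA weights
pipe.load_lora_weights("sayakpaul/sd-lora-ghibli")

# Fuse weights into base model for inference speed
# (call pipe.unfuse_lora() afterwards to restore the original weights)
pipe.fuse_lora(lora_scale=0.8)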

LoRA Training Structure

The repository includes pseudo-code illustrating the training loop (from main.py):

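import torch
import torch.nn.functional as F

# Assumes vae, unet, text_encoder, tokenizer, scheduler, optimizer, and dataloader
# already exist, and that only the LoRA parameters are registered with the optimizer.
for step, batch in enumerate(dataloader):
    images, prompts = batch

    # Encode images to latent space with the SD 1.x scaling factor (frozen VAE, no gradients)
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample() * 0.18215

    # Random timestep selection, one per sample in the batch
    t = torch.randint(
        0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
    )

    # Forward diffusion: add noise to latents
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Text conditioning: tokenize the prompts, then take the encoder's hidden states
    tokens = tokenizer(
        list(prompts), padding="max_length", truncation=True, return_tensors="pt"
    )
    text_emb = text_encoder(tokens.input_ids.to(latents.device))[0]

    # Predict noise with U-Net (LoRA weights active in attention layers)
    pred_noise = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample

    # Optimization step; gradients are cleared so they do not accumulate across batches
    loss = F.mse_loss(pred_noise, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()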
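The forward-diffusion call in the loop above, `scheduler.add_noise`, has a simple closed form: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below reproduces it in plain Python. The linear beta schedule and its endpoints (1e-4 to 0.02 over 1000 steps, the common DDPM defaults) are assumptions for illustration; Stable Diffusion checkpoints ship their own schedule settings.

```python
import math
import random


def linear_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    out, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, noise, t, alphas_cumprod):
    """Closed-form forward diffusion: mix the clean sample and the noise at timestep t."""
    a = alphas_cumprod[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]


alphas_cumprod = linear_alphas_cumprod()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
noise = [rng.gauss(0.0, 1.0) for _ in x0]

slightly_noisy = add_noise(x0, noise, 0, alphas_cumprod)    # almost identical to x0
mostly_noise = add_noise(x0, noise, 999, alphas_cumprod)    # almost identical to the noise
print(slightly_noisy, mostly_noise)
```

Because the network is trained to recover `noise` from the mixed sample at a random timestep, the loss in the loop is a plain mean-squared error between predicted and true noise.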

Summary

  • Latent Space Efficiency: The VAE compresses 3×512×512 images to 4×64×64 tensors, so the denoiser processes 48× fewer values than pixel-space diffusion would.
  • Cross-Attention Conditioning: The U-Net utilizes cross-attention layers to bind visual features with CLIP text embeddings, enabling precise prompt adherence.
  • Classifier-Free Guidance: The CFG mechanism balances prompt fidelity and image diversity through the guidance scale parameter w.
  • Scheduler Flexibility: Algorithms like DPM-Solver++ and Euler enable quality generation in 20‑30 steps, while LCM/Turbo variants achieve real-time synthesis.
  • Efficient Adaptation: LoRA fine-tuning trains only small low-rank matrices added to the attention layers, producing 10‑50 MB adapters that customize generation without full model retraining.

Frequently Asked Questions

What is the difference between pixel-space and latent-space diffusion?

Pixel-space diffusion operates directly on RGB values at full resolution, so every denoising step must process a 3×512×512 tensor (786,432 values). Latent-space diffusion, as implemented in Stable Diffusion, uses a VAE to compress images into a 4×64×64 latent representation (16,384 values) before applying the diffusion process. This architectural shift, documented in phases/04-computer-vision/11-stable-diffusion/docs/en.md (lines 21‑22), shrinks the tensor being denoised by 48×, which substantially lowers memory consumption and inference time while maintaining comparable output quality.

How does Classifier-Free Guidance (CFG) affect image generation?

Classifier-Free Guidance controls the trade-off between prompt adherence and image diversity by combining conditional and unconditional noise predictions. According to lines 54‑62 of the repository documentation, the formula ε = ε_uncond + w·(ε_cond − ε_uncond) allows users to scale the influence of the text prompt via the guidance scale w. Values between 7.5 and 12 typically produce prompt-faithful results, while values below 5 increase diversity and artistic interpretation at the cost of accuracy.

Why is LoRA the preferred method for fine-tuning Stable Diffusion?

LoRA (Low-Rank Adaptation) freezes the base model's parameters (roughly a billion for Stable Diffusion 1.x) and instead trains small, low-rank matrices injected into the U-Net's attention layers. As detailed in docs/en.md (lines 85‑94), this approach requires on the order of 100× less storage (typically 10‑50 MB versus 4‑7 GB) and significantly less GPU memory than full fine-tuning. Users can swap LoRA weights at inference time to switch between artistic styles, character representations, or conceptual modifications without loading separate complete model checkpoints.

Which scheduler should I use for Stable Diffusion inference?

The optimal scheduler depends on your speed and quality requirements. The repository highlights DPM-Solver++ for high-quality generation in 20‑30 steps, Euler and Euler Ancestral for artistic styles with moderate step counts, and LCM (Latent Consistency Models) or Turbo variants for real-time generation in 4‑8 steps. Configuration examples in phases/04-computer-vision/11-stable-diffusion/code/main.py demonstrate that schedulers can be swapped with a single line of code without re-instantiating the pipeline.
