
Multimodal AI Architectures Explained: CLIP and LLaVA Implementations from Scratch

May 21, 2026 rohitg00/ai-engineering-from-scratch ↗

CLIP and LLaVA represent two foundational approaches to multimodal AI—CLIP learns joint image-text embeddings via contrastive training, while LLaVA projects visual tokens into large language models using lightweight adapter networks.

The rohitg00/ai-engineering-from-scratch repository provides pure Python implementations of these multimodal AI architectures without PyTorch or NumPy dependencies. These implementations reveal the mathematical core of how vision and language models interact, from contrastive loss functions to token-level fusion strategies.

CLIP – Contrastive Language-Image Pre-training

CLIP (Contrastive Language-Image Pre-training) establishes a shared embedding space where matching image-text pairs align and mismatched pairs repel. The repository implements this framework-free in phases/12-multimodal-ai/02-clip-contrastive-pretraining/code/main.py.

Core Mathematical Components

Embedding normalization makes the dot product of two embeddings equal to their cosine similarity. The normalize function (line 16) applies L2 normalization to both image and text vectors before comparison.
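A minimal sketch of that normalization step, written in pure Python to match the repository's framework-free approach. The names l2_normalize and dot are illustrative assumptions and may not match the repository's own normalize function.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 2-dimensional embeddings, for illustration only.
image_vec = l2_normalize([3.0, 4.0])
text_vec = l2_normalize([4.0, 3.0])

# Both vectors now have unit length, so the dot product is the cosine.
cosine = dot(image_vec, text_vec)
print(round(cosine, 4))
```

Once both sides are normalized, every later step can use a plain dot product and never needs to divide by the norms again.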

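Similarity matrix computation scales the pairwise dot products by a temperature parameter τ, conventionally by dividing by τ so that a smaller temperature sharpens the distribution. The similarity_matrix function (line 25) computes cosine similarities for every image-text pair in the batch, producing a logits matrix in which each row is treated as a classification over the batch.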
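The computation can be sketched as follows. This is an illustration under assumptions, not the repository's code: the function name similarity_logits, the helper unit, and the default temperature of 0.07 are chosen for the example, and dividing by the temperature is assumed.

```python
import math


def unit(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Pairwise cosine similarities divided by the temperature.

    Row i, column j compares image i with text j.
    """
    imgs = [unit(v) for v in image_embs]
    txts = [unit(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]


# Two perfectly aligned pairs: the diagonal holds the matching scores.
logits = similarity_logits(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    temperature=0.5,
)
print(logits)
```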

InfoNCE loss provides symmetric training signals. The infonce_loss function (line 43) penalizes both image-to-text and text-to-image directions simultaneously, maximizing the diagonal of the similarity matrix while minimizing off-diagonal elements.
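A hedged sketch of the symmetric loss, assuming the input is a square logits matrix whose diagonal holds the matching pairs. The names log_softmax and symmetric_infonce are hypothetical; the repository's infonce_loss may be structured differently.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_infonce(logits):
    """Average of image-to-text and text-to-image cross-entropy.

    Assumes logits[i][i] is the score of the i-th matching pair.
    """
    n = len(logits)
    # Image -> text: each row is a classification over the texts.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: each column is a classification over the images.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = [[5.0, 0.0], [0.0, 5.0]]   # matches on the diagonal
shuffled = [[0.0, 5.0], [5.0, 0.0]]  # matches moved off the diagonal
aligned_loss = symmetric_infonce(aligned)
shuffled_loss = symmetric_infonce(shuffled)
print(aligned_loss < shuffled_loss)
```

The shuffled matrix produces a much larger loss because the highest scores no longer sit where the labels say they should.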

SigLIP loss offers an alternative for distributed training scenarios. The sigmoid_loss function (line 64) applies binary cross-entropy per pair rather than softmax across the full batch, eliminating the need for expensive all-gather operations across GPUs.
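The per-pair formulation can be sketched like this. It is an illustration rather than the repository's sigmoid_loss: the function names are hypothetical, and the scale and bias values are illustrative stand-ins for parameters that would normally be learned.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(similarities, scale=10.0, bias=-10.0):
    """Binary cross-entropy on every image-text pair, no softmax.

    similarities[i][j] is the cosine similarity of image i and text j;
    pairs on the diagonal are positives, all others are negatives.
    """
    n = len(similarities)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * similarities[i][j] + bias))
    return -total / n


aligned_loss = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = pairwise_sigmoid_loss([[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < shuffled_loss)
```

Each term depends on a single pair, so the sum can be computed chunk by chunk without first normalizing over the whole batch.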

Zero-Shot Classification Pipeline

After training, CLIP enables zero-shot inference without task-specific fine-tuning. The zero_shot_classify function (line 81) matches query images against text prompts via argmax cosine similarity, returning predicted labels based solely on the aligned embedding space.
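As a rough sketch of that pipeline, assuming the image and the text prompts have already been encoded into the shared space. The embeddings below are made up for illustration, and zero_shot_label is a hypothetical name, not the repository's zero_shot_classify.

```python
import math


def unit(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_label(image_emb, prompt_embs, labels):
    """Pick the label whose prompt embedding is most cosine-similar."""
    img = unit(image_emb)
    scores = [
        sum(a * b for a, b in zip(img, unit(prompt)))
        for prompt in prompt_embs
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores


# Hypothetical embeddings for prompts such as "a photo of a cat".
labels = ["cat", "dog"]
prompt_embs = [[1.0, 0.1], [0.1, 1.0]]
predicted, scores = zero_shot_label([0.9, 0.2], prompt_embs, labels)
print(predicted)
```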

Implementing CLIP from Scratch

The following example demonstrates the contrastive training pipeline using the repository's pure Python implementation:


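# Demo from phases/12-multimodal-ai/02-clip-contrastive-pretraining/code/main.py
# The directory names contain hyphens, so the file is loaded by path
# (run from the repository root) rather than imported as a package.
import importlib.util

PATH = "phases/12-multimodal-ai/02-clip-contrastive-pretraining/code/main.py"
spec = importlib.util.spec_from_file_location("clip_main", PATH)
clip_main = importlib.util.module_from_spec(spec)
spec.loader.exec_module(clip_main)

if __name__ == "__main__":
    clip_main.demo_infonce()    # symmetric InfoNCE loss on aligned pairs
    clip_main.demo_shuffled()   # loss increases on mismatched pairs
    clip_main.demo_zero_shot()  # zero-shot classification via cosine similarity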

Key insight: InfoNCE requires the full batch similarity matrix for softmax normalization, while SigLIP treats each image-text pair independently. That makes SigLIP attractive for large-scale training, where assembling the full similarity matrix across devices becomes a memory and communication bottleneck.

LLaVA – Large Language and Vision Assistant

LLaVA (Large Language and Vision Assistant) builds on a pretrained CLIP vision encoder by adding a trainable projection bridge between the frozen vision encoder and a language model. Unlike CLIP's joint embedding approach, LLaVA converts visual patches into language-model-compatible tokens. The implementation resides in phases/12-multimodal-ai/05-llava-visual-instruction-tuning/code/main.py.

Visual Token Projection Architecture

Patch extraction simulates a Vision Transformer output. The fake_vit_output function (line 63) generates 16×16-dimensional patch embeddings representing a grid of image regions.

Two-layer MLP projector adapts visual features to LLM dimensions. The MLPProjector class (line 47) maps patches through PATCH_DIM → HIDDEN_DIM → LLM_DIM transformations, compressing or expanding visual information to match the language model's token dimensionality.
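A minimal sketch of such a projector in pure Python. The 16-dimensional patch size and 24-dimensional LLM size follow the example values quoted in this article's FAQ; HIDDEN_DIM = 32, the GELU activation, the random initialization, and all function names are assumptions for illustration and may differ from the repository's MLPProjector.

```python
import math
import random

PATCH_DIM, HIDDEN_DIM, LLM_DIM = 16, 32, 24  # HIDDEN_DIM is assumed


def gelu(x):
    """Exact GELU via the error function (assumed activation)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim x in_dim) and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, vec):
    """Apply y = W x + b."""
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def project(patch, layer1, layer2):
    """PATCH_DIM -> HIDDEN_DIM -> LLM_DIM with a nonlinearity between."""
    hidden = [gelu(h) for h in linear(layer1, patch)]
    return linear(layer2, hidden)


rng = random.Random(0)
layer1 = make_linear(PATCH_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)

# Four stand-in patch embeddings in place of real ViT output.
patches = [[rng.gauss(0.0, 1.0) for _ in range(PATCH_DIM)] for _ in range(4)]
visual_tokens = [project(p, layer1, layer2) for p in patches]
print(len(visual_tokens), len(visual_tokens[0]))
```

The projector is applied to each patch independently, so the number of visual tokens equals the number of patches; only the width of each token changes.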

Prompt construction injects visual tokens into text sequences. The build_llava_prompt function (line 67) replaces <image> placeholders with projected visual tokens, allowing the LLM to process images as sequential data alongside text.
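The splicing step might look like the sketch below. It is a simplified illustration: toy_text_embed is a hypothetical stand-in for an LLM embedding lookup, whitespace splitting stands in for real tokenization, and splice_visual_tokens is not the repository's build_llava_prompt.

```python
IMAGE_PLACEHOLDER = "<image>"


def toy_text_embed(word, dim=4):
    """Hypothetical stand-in for an LLM embedding-table lookup."""
    return [float(len(word))] * dim


def splice_visual_tokens(prompt, visual_tokens):
    """Replace the first <image> placeholder with projected visual tokens.

    Returns a list of (kind, vector) pairs in sequence order.
    """
    before, found, after = prompt.partition(IMAGE_PLACEHOLDER)
    if not found:
        raise ValueError("prompt has no <image> placeholder")
    sequence = [("text", toy_text_embed(w)) for w in before.split()]
    sequence += [("image", tok) for tok in visual_tokens]
    sequence += [("text", toy_text_embed(w)) for w in after.split()]
    return sequence


# Three made-up visual tokens with the same width as the text embeddings.
visual_tokens = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
sequence = splice_visual_tokens(
    "USER: <image> What is shown? ASSISTANT:", visual_tokens
)
kinds = [kind for kind, _ in sequence]
print(kinds)
```

From the language model's point of view the result is a single sequence of vectors; nothing downstream needs to know which positions came from the image.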

AnyRes and Higher Resolution Handling

Modern LLaVA variants support higher resolution inputs through tiling strategies. The demo_anyres function (line 28) calculates token budgets for various image resolutions, demonstrating that a standard 336×336 pixel image consumes 576 visual tokens (a 24×24 grid, assuming the 14-pixel patches of CLIP ViT-L/14), or about 29% of a 2,000-token context window.
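The arithmetic behind the 576-token figure can be checked with a few lines. The 14-pixel patch size is an assumption based on the CLIP ViT-L/14 encoder commonly paired with LLaVA, and visual_token_count is a hypothetical helper, not the repository's demo_anyres.

```python
def visual_token_count(image_size, patch_size=14):
    """Number of ViT patches for a square image (one token per patch)."""
    if image_size % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side


tokens = visual_token_count(336)  # 24 patches per side
share = tokens / 2000             # fraction of a 2,000-token context
print(tokens, f"{share:.1%}")
```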

Building a LLaVA Prompt

This example shows how to construct a multimodal prompt with projected visual tokens:


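# Demo from phases/12-multimodal-ai/05-llava-visual-instruction-tuning/code/main.py
# The directory names contain hyphens, so the file is loaded by path
# (run from the repository root) rather than imported as a package.
import importlib.util

PATH = "phases/12-multimodal-ai/05-llava-visual-instruction-tuning/code/main.py"
spec = importlib.util.spec_from_file_location("llava_main", PATH)
llava_main = importlib.util.module_from_spec(spec)
spec.loader.exec_module(llava_main)

if __name__ == "__main__":
    llava_main.demo_projector()  # executes the 2-layer MLP on fake ViT patches
    llava_main.demo_prompt()     # creates <image> placeholder and estimates token budget
    llava_main.demo_anyres()     # displays token costs for multi-resolution inputs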

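Critical distinction: CLIP trains its image and text encoders end-to-end, whereas LLaVA keeps the vision encoder frozen. In the original two-stage LLaVA recipe, the first stage trains only the lightweight projection layers with the language model also frozen, and the second stage fine-tunes the projector together with the language model. Either way, the compute required is far lower than pretraining a multimodal model from scratch.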

Extended Multimodal Architectures

Beyond CLIP and LLaVA, the repository catalogs alternative fusion strategies that modify how visual and textual information interact:

Flamingo-style gated cross-attention (phases/12-multimodal-ai/04-flamingo-gated-cross-attention/code/main.py): Inserts cross-attention layers between frozen language model layers, using gating mechanisms to control visual information flow.
Chameleon early-fusion (phases/12-multimodal-ai/11-chameleon-early-fusion-tokens/code/main.py): Interleaves image patches directly into the token stream without projection layers, training the model on mixed-modality sequences from scratch.
InternVL-3 multi-expert (phases/12-multimodal-ai/10-internvl3-native-multimodal/code/main.py): Combines multiple vision encoders for handling diverse visual tasks including video temporal grounding.
Multi-encoder hybrids (phases/12-multimodal-ai/07-open-weight-vlm-recipes/code/main.py): Concatenates features from CLIP, DINOv2, SigLIP, and ConvNeXt encoders, examining how additive visual representations affect downstream performance.
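To make the gating idea in the Flamingo-style entry concrete, here is a deliberately simplified sketch: a single text vector attends over visual tokens, and a tanh gate scales how much of the result is added back. The learned query, key, and value projections are omitted, and every name is hypothetical rather than taken from the repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, visual_tokens):
    """Scaled dot-product attention of one query over the visual tokens."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim)
              for tok in visual_tokens]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, visual_tokens))
            for d in range(dim)]


def gated_cross_attention(text_state, visual_tokens, alpha):
    """Residual update x + tanh(alpha) * attention(x, visual)."""
    gate = math.tanh(alpha)
    attended = cross_attend(text_state, visual_tokens)
    return [x + gate * a for x, a in zip(text_state, attended)]


text_state = [1.0, 0.0]
visual = [[0.0, 2.0], [2.0, 0.0]]
closed = gated_cross_attention(text_state, visual, alpha=0.0)
opened = gated_cross_attention(text_state, visual, alpha=1.0)
print(closed, opened)
```

With alpha at zero the gate is closed and the text state passes through unchanged, which is why inserting such a layer does not disturb a frozen language model at the start of training.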

Summary

CLIP architectures learn joint embeddings through contrastive losses (InfoNCE or SigLIP), enabling zero-shot classification by matching images to text descriptions in normalized vector space.
LLaVA architectures repurpose frozen CLIP encoders by adding trainable MLP projectors that convert visual patches into LLM-compatible tokens, then insert these tokens into language model prompts.
Fusion strategies determine computational cost and capability: projectors (LLaVA) balance efficiency and performance, gated cross-attention (Flamingo) adds expressiveness at higher compute cost, and early fusion (Chameleon) removes architectural boundaries between modalities.
The rohitg00/ai-engineering-from-scratch repository implements all variants using pure Python, exposing the mathematical operations underlying modern multimodal systems.

Frequently Asked Questions

What is the primary difference between CLIP and LLaVA architectures?

CLIP learns a shared embedding space where images and text coexist as comparable vectors, suitable for retrieval and zero-shot classification. LLaVA instead treats images as token sequences that feed into autoregressive language models, enabling generative tasks like visual question answering. CLIP optimizes for similarity matching, while LLaVA optimizes for next-token prediction in multimodal contexts.

How does the LLaVA projector function mathematically?

The projector is a two-layer MLP implemented in phases/12-multimodal-ai/05-llava-visual-instruction-tuning/code/main.py as the MLPProjector class. It transforms visual patch embeddings from the CLIP encoder's output dimension (e.g., 16 dimensions) through a hidden layer into the LLM's input dimension (e.g., 24 dimensions). This learned mapping, which is nonlinear whenever an activation sits between the two layers, adapts visual features to the linguistic representation space without modifying the underlying encoders.

When should SigLIP loss be preferred over standard InfoNCE?

SigLIP, implemented via the sigmoid_loss function in phases/12-multimodal-ai/02-clip-contrastive-pretraining/code/main.py, uses binary cross-entropy per image-text pair rather than batch-wide softmax. This design eliminates the need for gathering all batch statistics across distributed training nodes, making SigLIP more efficient for training runs with massive batch sizes or limited inter-GPU bandwidth.

How do higher resolution images affect LLaVA's token budget?

Higher resolutions increase the number of visual patches extracted by the Vision Transformer, and token consumption grows in step with the patch count, that is, with image area rather than side length. The demo_anyres function demonstrates that tiling a high-resolution image or adding thumbnail representations can consume 30-50% of the LLM's context window. The repository's token cost tables in phases/12-multimodal-ai/05-llava-visual-instruction-tuning/code/main.py provide specific calculations for balancing visual fidelity against available context length.
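A rough way to estimate the cost of tiling is sketched below. It assumes 336-pixel tiles, 14-pixel patches, simple ceiling division to count tiles, and one extra downscaled thumbnail; real AnyRes implementations choose from fixed grid layouts and may drop padding tokens, so treat the numbers as upper-level estimates. The function name is hypothetical.

```python
import math


def anyres_token_count(width, height, tile=336, patch=14, thumbnail=True):
    """Estimate visual tokens for a tiled image plus optional thumbnail."""
    tokens_per_tile = (tile // patch) ** 2
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return (tiles + (1 if thumbnail else 0)) * tokens_per_tile


for width, height in [(336, 336), (672, 672), (1008, 672)]:
    print(width, height, anyres_token_count(width, height))
```

Doubling the side length from 336 to 672 pixels quadruples the tile count, which is why token budgets rise so quickly with resolution.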
