How to Build Vision Models with Backbones and Heads in TensorFlow Models

TensorFlow Models provides a modular architecture that separates vision models into backbones for feature extraction, decoders for multiscale fusion, and task-specific heads, enabling you to assemble state-of-the-art detectors and segmenters by mixing and matching components.

The tensorflow/models repository offers a production-grade vision framework that decouples feature extraction from task-specific predictions. When you build vision models with backbones and heads, you follow a standardized three-stage pipeline that supports everything from image classification to instance segmentation. This modular design allows you to swap architectures—such as replacing ResNet with Vision Transformers—without rewriting downstream logic.

Architecture Overview: The Three-Component System

The TensorFlow Models vision stack organizes models into three kinds of components (the decoder is optional for some tasks):

| Component | Purpose | Key Implementation Files |
| --- | --- | --- |
| Backbone | Extracts hierarchical feature maps from raw input images. | official/vision/modeling/backbones/resnet.py, official/vision/modeling/backbones/efficientnet.py, official/vision/modeling/backbones/vit.py |
| Decoder (Neck) | Fuses multiscale backbone outputs into a unified feature pyramid. | official/vision/modeling/decoders/fpn.py, official/vision/modeling/decoders/nasfpn.py |
| Head | Performs final task-specific predictions (classification, boxes, masks). | official/vision/modeling/heads/dense_prediction_heads.py, official/vision/modeling/heads/segmentation_heads.py |

The backbone returns a dictionary of feature tensors keyed by level (e.g., "2", "3", "4"), which the decoder processes into a pyramid compatible with the head's input requirements.
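
To make the level-keyed dictionary concrete, here is a minimal framework-free sketch of the spatial size you would expect at each level. It assumes the usual convention that level l has a stride of 2**l relative to the input; the helper name pyramid_shapes is hypothetical and not part of TensorFlow Models.

```python
import math


def pyramid_shapes(height, width, min_level, max_level):
    """Returns {level: (h, w)}, assuming level l has stride 2**l."""
    shapes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Ceiling division covers inputs that are not a multiple of the stride.
        shapes[str(level)] = (math.ceil(height / stride),
                              math.ceil(width / stride))
    return shapes


# Levels 2..5 for a 640x640 input (assumed example resolution).
shapes = pyramid_shapes(640, 640, min_level=2, max_level=5)
for level, size in shapes.items():
    print(f"level {level}: {size}")
```

String keys are used here only to match the "2", "3", "4" style described above.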

Step-by-Step Assembly Workflow

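Constructing a vision model follows a consistent composition pattern. You instantiate each component in order, passing each one's output specs to the next, then bind them inside a high-level model class that runs the forward pass and exposes the components for checkpointing. Losses and the training loop are handled outside the model class, by the task and trainer code.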

1. Select a Backbone

Start by instantiating a feature extractor. In official/vision/modeling/backbones/resnet.py, the ResNet class accepts parameters like model_id (depth) and stem_type to control architecture variants.

from official.vision.modeling.backbones import resnet
import tf_keras

backbone = resnet.ResNet(
    model_id=50,
    input_specs=tf_keras.layers.InputSpec(shape=[None, None, None, 3]),
    stem_type='v0',
    use_sync_bn=False)

The output_specs property of every backbone provides the decoder with metadata about tensor shapes and levels.

2. Add a Decoder (Optional)

For detection and segmentation tasks, pass the backbone's output specs to a decoder. The FPN class in official/vision/modeling/decoders/fpn.py builds a top-down pathway with lateral connections.

from official.vision.modeling.decoders import fpn

decoder = fpn.FPN(
    input_specs=backbone.output_specs,
    min_level=3,
    max_level=7,
    num_filters=256,
    fusion_type='sum',
    use_separable_conv=False)

Image classification models typically skip this step and feed backbone outputs directly into a classification head.
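
The top-down pathway with 'sum' fusion can be illustrated with a toy example. The sketch below works on 1-D Python lists instead of tensors and treats the lateral and output convolutions as identity, so it shows only the data flow, not the FPN class itself. The function names are hypothetical.

```python
def upsample2x(row):
    """Nearest-neighbor upsampling of a 1-D feature row by a factor of 2."""
    return [value for value in row for _ in range(2)]


def top_down_fuse(features, min_level, max_level):
    """Fuses {level: row} from coarse to fine using 'sum' fusion.

    Lateral 1x1 and output 3x3 convolutions are omitted (identity), which is
    a simplification made for this sketch.
    """
    fused = {str(max_level): list(features[str(max_level)])}
    for level in range(max_level - 1, min_level - 1, -1):
        coarser = upsample2x(fused[str(level + 1)])
        lateral = features[str(level)]
        fused[str(level)] = [a + b for a, b in zip(lateral, coarser)]
    return fused


# Toy backbone outputs: each level halves the resolution of the previous one.
toy_features = {"3": [1] * 8, "4": [10] * 4, "5": [100] * 2}
pyramid = top_down_fuse(toy_features, min_level=3, max_level=5)
print(pyramid)
```

Each finer level ends up carrying its own features plus everything propagated down from the coarser levels, which is the point of the top-down pathway.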

3. Attach Task-Specific Heads

Heads consume the decoder's feature pyramid (or backbone outputs) and produce logits. For object detection, RetinaNetHead in official/vision/modeling/heads/dense_prediction_heads.py handles class scores and bounding-box regression.

from official.vision.modeling.heads import dense_prediction_heads as dp_heads

head = dp_heads.RetinaNetHead(
    min_level=3,
    max_level=7,
    num_classes=80,
    num_anchors_per_location=9,
    num_convs=4,
    num_filters=256,
    use_separable_conv=False)
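
To see what these parameters imply, the following sketch estimates the per-level output channels and the total anchor count for a dense detection head. It assumes a square 640x640 input, a stride of 2**level per level, num_classes * num_anchors_per_location class channels, and 4 * num_anchors_per_location box channels. The helper name is hypothetical.

```python
import math


def dense_head_summary(image_size, min_level, max_level,
                       num_classes, num_anchors_per_location):
    """Returns per-level grid sizes, channel counts, and the total anchors."""
    per_level = {}
    total_anchors = 0
    for level in range(min_level, max_level + 1):
        side = math.ceil(image_size / 2 ** level)
        per_level[level] = {
            "grid": (side, side),
            "class_channels": num_classes * num_anchors_per_location,
            "box_channels": 4 * num_anchors_per_location,
        }
        total_anchors += side * side * num_anchors_per_location
    return per_level, total_anchors


per_level, total_anchors = dense_head_summary(
    image_size=640, min_level=3, max_level=7,
    num_classes=80, num_anchors_per_location=9)
print(per_level[3], total_anchors)
```

Most anchors sit on the finest level, which is why min_level has a large effect on memory and compute.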

4. Wrap in a Model Class

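Combine the components inside a task-specific model. The RetinaNetModel class in official/vision/modeling/retinanet_model.py orchestrates the forward pass. Besides the backbone, decoder, and head, its constructor also expects a detection generator, which decodes and filters the raw head outputs at inference time. The snippet below assumes MultilevelDetectionGenerator from official/vision/modeling/layers/detection_generator.py with its default settings.

from official.vision.modeling import retinanet_model
from official.vision.modeling.layers import detection_generator

# Post-processing layer (box decoding and NMS) used at inference time.
generator = detection_generator.MultilevelDetectionGenerator()

model = retinanet_model.RetinaNetModel(
    backbone=backbone,
    decoder=decoder,
    head=head,
    detection_generator=generator)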
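
Conceptually, a wrapper of this kind chains the components in its forward pass. The following framework-free sketch uses hypothetical stand-in callables (ToyDetector, toy_backbone, toy_decoder, toy_head) rather than the real Keras classes, and the output key names are illustrative.

```python
class ToyDetector:
    """Framework-free stand-in for a backbone/decoder/head wrapper."""

    def __init__(self, backbone, head, decoder=None):
        self.backbone = backbone
        self.decoder = decoder
        self.head = head

    def __call__(self, images):
        features = self.backbone(images)   # {level: feature}
        if self.decoder is not None:       # the decoder is optional
            features = self.decoder(features)
        scores, boxes = self.head(features)
        return {"cls_outputs": scores, "box_outputs": boxes}


def toy_backbone(image_size):
    # Each "feature" is just the spatial size at that level.
    return {"3": image_size // 8, "4": image_size // 16, "5": image_size // 32}


def toy_decoder(features):
    # Identity stand-in: a real decoder would mix information across levels.
    return dict(features)


def toy_head(features):
    scores = {level: ("scores", size) for level, size in features.items()}
    boxes = {level: ("boxes", size) for level, size in features.items()}
    return scores, boxes


detector = ToyDetector(toy_backbone, toy_head, decoder=toy_decoder)
outputs = detector(640)  # an integer size stands in for an image tensor
print(outputs["cls_outputs"])
```

Because each stage only depends on the level-keyed dictionary it receives, any stage can be swapped without touching the others.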

Complete Code Example: Building a RetinaNet Detector

Below is a script that assembles a full RetinaNet object detector from a ResNet-50 backbone, an FPN decoder, a RetinaNet head, and a detection generator for post-processing. It assumes the tf-models-official package is installed.

import tf_keras
from official.vision.modeling import retinanet_model
from official.vision.modeling.backbones import resnet
from official.vision.modeling.decoders import fpn
from official.vision.modeling.heads import dense_prediction_heads as dp_heads
from official.vision.modeling.layers import detection_generator

# 1️⃣ Build the backbone.
backbone = resnet.ResNet(
    model_id=50,
    input_specs=tf_keras.layers.InputSpec(shape=[None, None, None, 3]),
    stem_type='v0',
    use_sync_bn=False)

# 2️⃣ Build the FPN decoder using the backbone's output specs.
decoder = fpn.FPN(
    input_specs=backbone.output_specs,
    min_level=3,
    max_level=7,
    num_filters=256,
    fusion_type='sum',
    use_separable_conv=False)

# 3️⃣ Build the RetinaNet head.
head = dp_heads.RetinaNetHead(
    min_level=3,
    max_level=7,
    num_classes=80,                # COCO has 80 classes.
    num_anchors_per_location=9,
    num_convs=4,
    num_filters=256,
    use_separable_conv=False)

# 4️⃣ Assemble the full model, including the post-processing layer.
generator = detection_generator.MultilevelDetectionGenerator()
model = retinanet_model.RetinaNetModel(
    backbone=backbone,
    decoder=decoder,
    head=head,
    detection_generator=generator)

# Inspect the components that the wrapper exposes for checkpointing.
print(sorted(model.checkpoint_items))   # keys for the backbone, decoder, and head

Adapting the Pattern for Other Vision Tasks

The same backbone-decoder-head pattern supports multiple vision paradigms by varying the final model wrapper.

Image Classification

Classification models omit the decoder entirely. The backbone feeds directly into a classification head. See official/vision/examples/starter/example_model.py for a minimal implementation that demonstrates custom assembly without multiscale fusion.
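
A common form of classification head is global average pooling followed by a dense layer. The sketch below shows that computation on plain nested lists. It is an assumed, simplified illustration with hypothetical function names, not the code in example_model.py.

```python
def global_average_pool(feature_map):
    """Averages an H x W x C nested list over its spatial dimensions."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]


def dense(vector, weights, bias):
    """Computes logits[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [
        sum(v * w_row[j] for v, w_row in zip(vector, weights)) + bias[j]
        for j in range(len(bias))
    ]


# A 2x2 feature map with 2 channels, standing in for the last backbone level.
feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[5.0, 6.0], [7.0, 8.0]]]
pooled = global_average_pool(feature_map)
weights = [[1.0, 0.0, -1.0],   # channel 0 -> 3 classes
           [0.0, 1.0, 1.0]]    # channel 1 -> 3 classes
bias = [0.0, 0.5, 0.0]
logits = dense(pooled, weights, bias)
print(pooled, logits)
```

Only the deepest feature map is consumed, which is why no multiscale decoder is needed for this task.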

Semantic Segmentation

For pixel-level prediction, use an FPN decoder followed by a segmentation head. The SegmentationModel class in official/vision/modeling/segmentation_model.py combines these components and applies a per-pixel classification layer defined in official/vision/modeling/heads/segmentation_heads.py.

Instance Segmentation with Mask R-CNN

Complex tasks require multiple heads. In official/vision/modeling/maskrcnn_model.py, the MaskRCNNModel attaches both a detection head (RPN + class/box head) and a mask head to the FPN decoder output, enabling simultaneous bounding-box and mask prediction.

Configuration-Driven Model Building

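Beyond manual assembly, TensorFlow Models provides factory modules that construct models from structured config objects (Python dataclasses that can be overridden from configuration files). The factory system reads definitions from official/vision/configs/retinanet.py or official/vision/configs/semantic_segmentation.py and dispatches to component builders such as those in official/vision/modeling/backbones/factory.py.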

This approach allows you to switch architectures—such as swapping ResNet for EfficientNet in official/vision/modeling/backbones/efficientnet.py or using the Vision Transformer backbone in official/vision/modeling/backbones/vit.py—by editing configuration files rather than source code.

Summary

  • TensorFlow Models structures vision architectures into backbones, decoders, and heads for maximum modularity.
  • The standard workflow instantiates a backbone, optionally wraps it with an FPN decoder, attaches task-specific heads, and binds everything in a model class like RetinaNetModel or SegmentationModel.
  • Backbones expose output_specs that automatically configure downstream decoders and heads.
  • You can assemble models programmatically or use the factory system in official/vision/modeling/backbones/factory.py and related files to build from configuration files.

Frequently Asked Questions

What is the difference between a backbone and a head in TensorFlow Models?

The backbone is a feature extractor—typically a ResNet, EfficientNet, or Vision Transformer—that processes raw images into hierarchical feature maps. The head is a task-specific network that converts those features into predictions, such as class probabilities in official/vision/modeling/heads/dense_prediction_heads.py or segmentation masks in official/vision/modeling/heads/segmentation_heads.py.

Do I always need a decoder between the backbone and head?

No. Image classification models connect the backbone directly to a classification head, skipping the decoder. Object detection and segmentation models, however, generally rely on a decoder like the FPN in official/vision/modeling/decoders/fpn.py to fuse multiscale features into a pyramid that the head can process.

How do I switch from ResNet to EfficientNet in my vision model?

Change the backbone instantiation to import from official/vision/modeling/backbones/efficientnet.py instead of resnet.py, and ensure the input_specs match your image resolution. The factory system in official/vision/modeling/backbones/factory.py automates this swap when you update the backbone.type field in your configuration file.

Can I use custom backbones or heads with the TensorFlow Models factory system?

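Yes, most directly for backbones and decoders. Register a builder function for your class with the register_backbone_builder decorator in official/vision/modeling/backbones/factory.py (decoders have an equivalent, register_decoder_builder, in official/vision/modeling/decoders/factory.py). The builder parses your config parameters and returns the component. Heads do not appear to have a separate registry module: they are constructed inside the central model factory, official/vision/modeling/factory.py, so a custom head is typically wired in there or assembled manually. Once registered, your custom components can be instantiated alongside standard ones like those in official/vision/modeling/backbones/vit.py.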
