How to Build Vision Models with Backbones and Heads in TensorFlow Models
TensorFlow Models provides a modular architecture that separates vision models into backbones for feature extraction, decoders for multiscale fusion, and task-specific heads, enabling you to assemble state-of-the-art detectors and segmenters by mixing and matching components.
The tensorflow/models repository offers a production-grade vision framework that decouples feature extraction from task-specific predictions. When you build vision models with backbones and heads, you follow a standardized three-stage pipeline that supports everything from image classification to instance segmentation. This modular design allows you to swap architectures—such as replacing ResNet with Vision Transformers—without rewriting downstream logic.
Architecture Overview: The Three-Component System
The TensorFlow Models vision stack organizes a model into up to three components:
| Component | Purpose | Key Implementation Files |
|---|---|---|
| Backbone | Extracts hierarchical feature maps from raw input images. | official/vision/modeling/backbones/resnet.py, official/vision/modeling/backbones/efficientnet.py, official/vision/modeling/backbones/vit.py |
| Decoder (Neck) | Fuses multiscale backbone outputs into a unified feature pyramid. | official/vision/modeling/decoders/fpn.py, official/vision/modeling/decoders/nasfpn.py |
| Head | Performs final task-specific predictions (classification, boxes, masks). | official/vision/modeling/heads/dense_prediction_heads.py, official/vision/modeling/heads/segmentation_heads.py |
The backbone returns a dictionary of feature tensors keyed by level (e.g., "2", "3", "4"), which the decoder processes into a pyramid compatible with the head's input requirements.
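To make the level keys concrete, the sketch below maps each level to a spatial size, assuming the common convention that a level-l feature map is downsampled by a factor of 2^l relative to the input. It is plain Python rather than library code, and the helper name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(height, width, min_level, max_level):
    """Return {level: (h, w)}, assuming level l has stride 2**l."""
    sizes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Ceiling division models 'same'-padded stride-2 downsampling.
        sizes[str(level)] = (-(-height // stride), -(-width // stride))
    return sizes


sizes = feature_map_sizes(640, 640, min_level=2, max_level=5)
for level, (h, w) in sizes.items():
    print(f"level {level}: {h}x{w}")
```

For a 640x640 input this gives 160x160 at level 2 down to 20x20 at level 5, which is why coarser levels suit large objects and finer levels suit small ones.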
Step-by-Step Assembly Workflow
Constructing a vision model follows a fixed composition pattern. You instantiate each component in order, then bind them inside a high-level model class that runs the forward pass and exposes its sub-models for checkpointing.
1. Select a Backbone
Start by instantiating a feature extractor. In official/vision/modeling/backbones/resnet.py, the ResNet class accepts parameters like model_id (depth) and stem_type to control architecture variants.
from official.vision.modeling.backbones import resnet
import tf_keras

backbone = resnet.ResNet(
    model_id=50,
    input_specs=tf_keras.layers.InputSpec(shape=[None, None, None, 3]),
    stem_type='v0',
    use_sync_bn=False)
The output_specs property of every backbone provides the decoder with metadata about tensor shapes and levels.
2. Add a Decoder (Optional)
For detection and segmentation tasks, pass the backbone's output specs to a decoder. The FPN class in official/vision/modeling/decoders/fpn.py builds a top-down pathway with lateral connections.
from official.vision.modeling.decoders import fpn

decoder = fpn.FPN(
    input_specs=backbone.output_specs,
    min_level=3,
    max_level=7,
    num_filters=256,
    fusion_type='sum',
    use_separable_conv=False)
Image classification models typically skip this step and feed backbone outputs directly into a classification head.
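When max_level exceeds the coarsest level the backbone emits, the pyramid needs extra levels. The sketch below is a bookkeeping illustration in plain Python, not the FPN implementation. The function name `fpn_level_plan` and its string labels are hypothetical. It assumes backbone levels are fused through lateral connections and that coarser levels are derived by stride-2 downsampling of the previous level.

```python
def fpn_level_plan(backbone_levels, min_level, max_level):
    """Classify each pyramid level by how it would be produced."""
    backbone_max = max(backbone_levels)
    plan = {}
    for level in range(min_level, max_level + 1):
        if level in backbone_levels:
            # Lateral connection on the backbone feature, fused with the
            # upsampled coarser level where one exists.
            plan[level] = "lateral + top-down"
        elif level > backbone_max:
            # Coarser than anything the backbone emits.
            plan[level] = "stride-2 conv on level %d" % (level - 1)
        else:
            raise ValueError("level %d is not provided by the backbone" % level)
    return plan


plan = fpn_level_plan(backbone_levels=[2, 3, 4, 5], min_level=3, max_level=7)
for level, source in sorted(plan.items()):
    print(level, source)
```

With a backbone that emits levels 2 through 5 and a pyramid spanning levels 3 through 7, levels 3 to 5 come from the backbone and levels 6 and 7 are synthesized.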
3. Attach Task-Specific Heads
Heads consume the decoder's feature pyramid (or backbone outputs) and produce logits. For object detection, RetinaNetHead in official/vision/modeling/heads/dense_prediction_heads.py handles class scores and bounding-box regression.
from official.vision.modeling.heads import dense_prediction_heads as dp_heads

head = dp_heads.RetinaNetHead(
    min_level=3,
    max_level=7,
    num_classes=80,
    num_anchors_per_location=9,
    num_convs=4,
    num_filters=256,
    use_separable_conv=False)
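The head parameters determine the size of the raw outputs. As a worked example in plain Python, the sketch below computes per-level output shapes, assuming a square 640x640 input, a stride of 2^l at level l, one score per class per anchor, and four box offsets per anchor. The helper name `retinanet_output_shapes` is hypothetical and not part of the library.

```python
def retinanet_output_shapes(image_size, min_level, max_level,
                            num_classes, num_anchors_per_location):
    """Per-level (height, width, class_channels, box_channels)."""
    shapes = {}
    for level in range(min_level, max_level + 1):
        side = image_size // (2 ** level)
        shapes[level] = (
            side,
            side,
            num_classes * num_anchors_per_location,  # class scores per location
            4 * num_anchors_per_location,            # box offsets per location
        )
    return shapes


shapes = retinanet_output_shapes(640, 3, 7, num_classes=80,
                                 num_anchors_per_location=9)
total_anchors = sum(h * w * 9 for h, w, _, _ in shapes.values())
print(shapes[3])      # (80, 80, 720, 36)
print(total_anchors)  # 76725
```

Level 3 alone contributes 57,600 of the 76,725 anchors, so most of the prediction cost sits at the finest level.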
4. Wrap in a Model Class
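Combine the components inside a task-specific model. The RetinaNetModel class in official/vision/modeling/retinanet_model.py runs the forward pass and, at inference time, post-processes the raw head outputs into final detections. It therefore also takes a detection generator, and loss computation is handled by the training task rather than by the model class.
from official.vision.modeling import retinanet_model
from official.vision.modeling.layers import detection_generator

# The detection generator decodes boxes and applies NMS at inference time.
generator = detection_generator.MultilevelDetectionGenerator()
model = retinanet_model.RetinaNetModel(
    backbone=backbone,
    decoder=decoder,
    head=head,
    detection_generator=generator)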
Complete Code Example: Building a RetinaNet Detector
Below is a runnable script that assembles a full RetinaNet object detector from a ResNet-50 backbone, an FPN decoder, and a RetinaNet head.
import tensorflow as tf, tf_keras
from official.vision.modeling import retinanet_model
from official.vision.modeling.backbones import resnet
from official.vision.modeling.decoders import fpn
from official.vision.modeling.heads import dense_prediction_heads as dp_heads
from official.vision.modeling.layers import detection_generator

# 1️⃣ Build the backbone.
backbone = resnet.ResNet(
    model_id=50,
    input_specs=tf_keras.layers.InputSpec(shape=[None, None, None, 3]),
    stem_type='v0',
    use_sync_bn=False)

# 2️⃣ Build the FPN decoder using the backbone's output specs.
decoder = fpn.FPN(
    input_specs=backbone.output_specs,
    min_level=3,
    max_level=7,
    num_filters=256,
    fusion_type='sum',
    use_separable_conv=False)

# 3️⃣ Build the RetinaNet head.
head = dp_heads.RetinaNetHead(
    min_level=3,
    max_level=7,
    num_classes=80,  # COCO has 80 classes.
    num_anchors_per_location=9,
    num_convs=4,
    num_filters=256,
    use_separable_conv=False)

# 4️⃣ Assemble the full model. RetinaNetModel also needs a detection
# generator, which decodes boxes and applies NMS at inference time.
generator = detection_generator.MultilevelDetectionGenerator()
model = retinanet_model.RetinaNetModel(
    backbone=backbone,
    decoder=decoder,
    head=head,
    detection_generator=generator)

# Inspect the sub-models the wrapper exposes for checkpointing.
print(list(model.checkpoint_items))
Adapting the Pattern for Other Vision Tasks
The same backbone-decoder-head pattern supports multiple vision paradigms by varying the final model wrapper.
Image Classification
Classification models omit the decoder entirely. The backbone feeds directly into a classification head. See official/vision/examples/starter/example_model.py for a minimal implementation that demonstrates custom assembly without multiscale fusion.
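The optional-decoder idea can be illustrated without any framework. The sketch below is a toy in plain Python, not the library's model classes. `ComposedModel` and the `toy_*` functions are hypothetical stand-ins that pass dictionaries keyed by level, as the real components do.

```python
class ComposedModel:
    """Chains backbone -> optional decoder -> head."""

    def __init__(self, backbone, head, decoder=None):
        self.backbone = backbone
        self.decoder = decoder
        self.head = head

    def __call__(self, image):
        features = self.backbone(image)
        if self.decoder is not None:
            features = self.decoder(features)
        return self.head(features)


def toy_backbone(image):
    # Pretend each level summarises the image at a coarser scale.
    return {"4": [sum(image[:2]), sum(image[2:])], "5": [sum(image)]}


def toy_decoder(features):
    # Add the coarsest value to every level as a crude top-down fusion.
    top = features[max(features, key=int)][0]
    return {lvl: [v + top for v in vals] for lvl, vals in features.items()}


def toy_classifier_head(features):
    # Use only the coarsest level, as a classifier typically would.
    coarsest = features[max(features, key=int)]
    return sum(coarsest) / len(coarsest)


classifier = ComposedModel(toy_backbone, toy_classifier_head)
fused = ComposedModel(toy_backbone, toy_classifier_head, decoder=toy_decoder)
print(classifier([1.0, 2.0, 3.0, 4.0]))  # 10.0
print(fused([1.0, 2.0, 3.0, 4.0]))       # 20.0
```

Because the head only sees a dictionary of features, it does not need to know whether a decoder ran first, which is what makes the components interchangeable.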
Semantic Segmentation
For pixel-level prediction, use an FPN decoder followed by a segmentation head. The SegmentationModel class in official/vision/modeling/segmentation_model.py combines these components and applies a per-pixel classification layer defined in official/vision/modeling/heads/segmentation_heads.py.
Instance Segmentation with Mask R-CNN
Complex tasks require multiple heads. In official/vision/modeling/maskrcnn_model.py, the MaskRCNNModel attaches both a detection head (RPN + class/box head) and a mask head to the FPN decoder output, enabling simultaneous bounding-box and mask prediction.
Configuration-Driven Model Building
Beyond manual assembly, TensorFlow Models provides factory modules that construct models from dataclass-based configs. The factory system reads definitions from official/vision/configs/retinanet.py or official/vision/configs/semantic_segmentation.py and dispatches to:
- official/vision/modeling/backbones/factory.py for backbone instantiation
- official/vision/modeling/decoders/factory.py for decoder selection
- official/vision/modeling/heads/factory.py for head creation
This approach allows you to switch architectures—such as swapping ResNet for EfficientNet in official/vision/modeling/backbones/efficientnet.py or using the Vision Transformer backbone in official/vision/modeling/backbones/vit.py—by editing configuration files rather than source code.
Summary
- TensorFlow Models structures vision architectures into backbones, decoders, and heads for maximum modularity.
- The standard workflow instantiates a backbone, optionally wraps it with an FPN decoder, attaches task-specific heads, and binds everything in a model class like RetinaNetModel or SegmentationModel.
- Backbones expose output_specs that automatically configure downstream decoders and heads.
- You can assemble models programmatically or use the factory system in official/vision/modeling/backbones/factory.py and related files to build from configuration files.
Frequently Asked Questions
What is the difference between a backbone and a head in TensorFlow Models?
The backbone is a feature extractor—typically a ResNet, EfficientNet, or Vision Transformer—that processes raw images into hierarchical feature maps. The head is a task-specific network that converts those features into predictions, such as class probabilities in official/vision/modeling/heads/dense_prediction_heads.py or segmentation masks in official/vision/modeling/heads/segmentation_heads.py.
Do I always need a decoder between the backbone and head?
No. Image classification models connect the backbone directly to a classification head, skipping the decoder. Object detection and segmentation models, however, typically use a decoder like the FPN in official/vision/modeling/decoders/fpn.py to fuse multiscale features into a pyramid that the head can process.
How do I switch from ResNet to EfficientNet in my vision model?
Change the backbone instantiation to import from official/vision/modeling/backbones/efficientnet.py instead of resnet.py, and ensure the input_specs match your image resolution. The factory system in official/vision/modeling/backbones/factory.py automates this swap when you update the backbone.type field in your configuration file.
Can I use custom backbones or heads with the TensorFlow Models factory system?
Yes. Register your custom class in the respective factory module—official/vision/modeling/backbones/factory.py for backbones or official/vision/modeling/heads/factory.py for heads—by adding a build function that parses your config parameters. Once registered, the central model factory can instantiate your custom components alongside standard ones like those in official/vision/modeling/backbones/vit.py.