# How to Build Vision Models with Backbones and Heads in TensorFlow Models

> Learn to build vision models using TensorFlow Models. Understand backbones, decoders, and heads to create state-of-the-art detectors and segmenters with this modular architecture.

- Repository: [tensorflow/models](https://github.com/tensorflow/models)
- Tags: tutorial
- Published: 2026-02-28

---

**TensorFlow Models provides a modular architecture that separates vision models into backbones for feature extraction, decoders for multiscale fusion, and task-specific heads, enabling you to assemble state-of-the-art detectors and segmenters by mixing and matching components.**

The `tensorflow/models` repository offers a production-grade vision framework that decouples feature extraction from task-specific predictions. When you build vision models with backbones and heads, you follow a standardized three-stage pipeline that supports everything from image classification to instance segmentation. This modular design allows you to swap architectures—such as replacing ResNet with Vision Transformers—without rewriting downstream logic.

## Architecture Overview: The Three-Component System

The TensorFlow Models vision stack organizes models into three kinds of components:

| Component | Purpose | Key Implementation Files |
|-----------|---------|--------------------------|
| **Backbone** | Extracts hierarchical feature maps from raw input images. | [`official/vision/modeling/backbones/resnet.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/resnet.py), [`official/vision/modeling/backbones/efficientnet.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/efficientnet.py), [`official/vision/modeling/backbones/vit.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/vit.py) |
| **Decoder** (Neck) | Fuses multiscale backbone outputs into a unified feature pyramid. | [`official/vision/modeling/decoders/fpn.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/decoders/fpn.py), [`official/vision/modeling/decoders/nasfpn.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/decoders/nasfpn.py) |
| **Head** | Performs final task-specific predictions (classification, boxes, masks). | [`official/vision/modeling/heads/dense_prediction_heads.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/heads/dense_prediction_heads.py), [`official/vision/modeling/heads/segmentation_heads.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/heads/segmentation_heads.py) |

The backbone returns a dictionary of feature tensors keyed by level (e.g., `"2"`, `"3"`, `"4"`), which the decoder processes into a pyramid compatible with the head's input requirements.
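To make the level keys concrete: in feature-pyramid models, level `l` conventionally has a stride of `2**l` relative to the input, so each level halves the resolution of the previous one. The sketch below is plain arithmetic, not a library call. `feature_map_sizes` is a hypothetical helper, and it assumes "same"-style padding so that sizes round up.

```python
import math


def feature_map_sizes(height, width, min_level, max_level):
    """Returns {level: (h, w)}, assuming level l has stride 2**l."""
    sizes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        sizes[str(level)] = (math.ceil(height / stride), math.ceil(width / stride))
    return sizes


# ResNet-style backbones typically expose levels 2 through 5.
sizes = feature_map_sizes(640, 640, min_level=2, max_level=5)
for level, (h, w) in sizes.items():
    print(f"level {level}: {h}x{w}")
```

For a 640x640 input this gives 160x160 at level 2 down to 20x20 at level 5.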

## Step-by-Step Assembly Workflow

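Constructing a vision model follows a fixed composition pattern. You instantiate each component in order, then bind them inside a high-level model class that runs the forward pass and exposes the components for checkpointing. The training loop itself lives in the task and trainer code, not in the model class.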

### 1. Select a Backbone

Start by instantiating a feature extractor. In [`official/vision/modeling/backbones/resnet.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/resnet.py), the `ResNet` class accepts parameters like `model_id` (depth) and `stem_type` to control architecture variants.

```python
from official.vision.modeling.backbones import resnet
import tf_keras

backbone = resnet.ResNet(
    model_id=50,
    input_specs=tf_keras.layers.InputSpec(shape=[None, None, None, 3]),
    stem_type='v0',
    use_sync_bn=False)
```

The `output_specs` property of every backbone provides the decoder with metadata about tensor shapes and levels.

### 2. Add a Decoder (Optional)

For detection and segmentation tasks, pass the backbone's output specs to a decoder. The `FPN` class in [`official/vision/modeling/decoders/fpn.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/decoders/fpn.py) builds a top-down pathway with lateral connections.

```python
from official.vision.modeling.decoders import fpn

decoder = fpn.FPN(
    input_specs=backbone.output_specs,
    min_level=3,
    max_level=7,
    num_filters=256,
    fusion_type='sum',
    use_separable_conv=False)
```
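The top-down pathway can be illustrated without the library. The following NumPy sketch is a simplified stand-in, not the `FPN` implementation: it assumes the lateral 1x1 convolutions have already given every level the same channel count, uses `'sum'` fusion, and omits the smoothing convolutions and the extra coarse levels. The helper names are hypothetical.

```python
import numpy as np


def nearest_upsample_2x(x):
    """Doubles the height and width of an (H, W, C) array by repeating pixels."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def top_down_sum(laterals):
    """Fuses per-level features coarse-to-fine with element-wise addition.

    `laterals` maps an integer level to an (H, W, C) array whose channel
    count is already equal across levels.
    """
    levels = sorted(laterals, reverse=True)
    fused = {levels[0]: laterals[levels[0]]}
    for level in levels[1:]:
        # Each finer level adds the upsampled result of the level above it.
        fused[level] = laterals[level] + nearest_upsample_2x(fused[level + 1])
    return fused


laterals = {
    3: np.ones((8, 8, 1)),
    4: np.ones((4, 4, 1)),
    5: np.ones((2, 2, 1)),
}
fused = top_down_sum(laterals)
print({level: float(feat.max()) for level, feat in fused.items()})
```

With all-ones inputs, level 5 stays at 1, level 4 becomes 2, and level 3 becomes 3, which shows how coarse semantic information accumulates in the finer levels.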

Image classification models typically skip this step and feed backbone outputs directly into a classification head.

### 3. Attach Task-Specific Heads

Heads consume the decoder's feature pyramid (or backbone outputs) and produce raw predictions. For object detection, `RetinaNetHead` in [`official/vision/modeling/heads/dense_prediction_heads.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/heads/dense_prediction_heads.py) outputs per-level class scores and bounding-box regression values.

```python
from official.vision.modeling.heads import dense_prediction_heads as dp_heads

head = dp_heads.RetinaNetHead(
    min_level=3,
    max_level=7,
    num_classes=80,
    num_anchors_per_location=9,
    num_convs=4,
    num_filters=256,
    use_separable_conv=False)
```
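The head's parameters translate into output sizes that are easy to check by hand. The sketch below assumes the usual RetinaNet layout, in which each pyramid location predicts `num_classes` scores and 4 box offsets per anchor, and in which level `l` has stride `2**l`. Both helpers are hypothetical and only do arithmetic; they do not call the library.

```python
def retinanet_output_channels(num_classes, num_anchors_per_location):
    """Channel counts of the per-level class and box outputs."""
    return {
        "cls": num_classes * num_anchors_per_location,
        "box": 4 * num_anchors_per_location,
    }


def total_anchors(height, width, min_level, max_level, num_anchors_per_location):
    """Counts anchors over all pyramid levels, assuming stride 2**level."""
    total = 0
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        total += (height // stride) * (width // stride) * num_anchors_per_location
    return total


channels = retinanet_output_channels(num_classes=80, num_anchors_per_location=9)
anchors = total_anchors(640, 640, min_level=3, max_level=7, num_anchors_per_location=9)
print(channels)  # {'cls': 720, 'box': 36}
print(anchors)   # 76725
```

Nine anchors per location commonly corresponds to 3 scales times 3 aspect ratios, so changing either in the anchor settings means changing `num_anchors_per_location` to match.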

### 4. Wrap in a Model Class

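Combine the components inside a task-specific model. The `RetinaNetModel` class in [`official/vision/modeling/retinanet_model.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/retinanet_model.py) runs the forward pass and, at inference time, uses a detection generator to turn raw head outputs into final boxes, so its constructor expects one in addition to the three components. Loss computation is handled by the task code rather than by the model class.

```python
from official.vision.modeling import retinanet_model
from official.vision.modeling.layers import detection_generator

# The detection generator post-processes raw head outputs at inference time.
generator = detection_generator.MultilevelDetectionGenerator(
    max_num_detections=100)

model = retinanet_model.RetinaNetModel(
    backbone=backbone,
    decoder=decoder,
    head=head,
    detection_generator=generator)
```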

## Complete Code Example: Building a RetinaNet Detector

The script below assembles a full RetinaNet object detector from a ResNet-50 backbone, an FPN decoder, a RetinaNet head, and a detection generator. It assumes the `tf-models-official` package is installed.

```python
import tensorflow as tf
import tf_keras
from official.vision.modeling import retinanet_model
from official.vision.modeling.backbones import resnet
from official.vision.modeling.decoders import fpn
from official.vision.modeling.heads import dense_prediction_heads as dp_heads
from official.vision.modeling.layers import detection_generator

# 1. Build the backbone.
backbone = resnet.ResNet(
    model_id=50,
    input_specs=tf_keras.layers.InputSpec(shape=[None, None, None, 3]),
    stem_type='v0',
    use_sync_bn=False)

# 2. Build the FPN decoder using the backbone's output specs.
decoder = fpn.FPN(
    input_specs=backbone.output_specs,
    min_level=3,
    max_level=7,
    num_filters=256,
    fusion_type='sum',
    use_separable_conv=False)

# 3. Build the RetinaNet head.
head = dp_heads.RetinaNetHead(
    min_level=3,
    max_level=7,
    num_classes=80,  # COCO has 80 classes.
    num_anchors_per_location=9,
    num_convs=4,
    num_filters=256,
    use_separable_conv=False)

# 4. Build the detection generator used for inference-time post-processing.
generator = detection_generator.MultilevelDetectionGenerator(
    max_num_detections=100)

# 5. Assemble the full model.
model = retinanet_model.RetinaNetModel(
    backbone=backbone,
    decoder=decoder,
    head=head,
    detection_generator=generator)

# Run a training-mode forward pass on a dummy batch and inspect the raw outputs.
outputs = model(tf.zeros([1, 640, 640, 3]), training=True)
print(sorted(outputs.keys()))  # Expected to include 'box_outputs' and 'cls_outputs'.
```

## Adapting the Pattern for Other Vision Tasks

The same backbone-decoder-head pattern supports multiple vision paradigms by varying the final model wrapper.

### Image Classification

Classification models omit the decoder entirely. The backbone feeds directly into a classification head. See [`official/vision/examples/starter/example_model.py`](https://github.com/tensorflow/models/blob/main/official/vision/examples/starter/example_model.py) for a minimal implementation that demonstrates custom assembly without multiscale fusion.
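Conceptually, the classification path pools the coarsest backbone feature map and applies a linear layer. The NumPy sketch below illustrates that data flow under simplifying assumptions: random arrays stand in for backbone endpoints, and `classify` is a hypothetical helper, not the library's classification model.

```python
import numpy as np


def classify(features, weights, bias):
    """Global-average-pools an (H, W, C) feature map, then applies a linear layer."""
    pooled = features.mean(axis=(0, 1))  # shape (C,)
    return pooled @ weights + bias       # shape (num_classes,)


rng = np.random.default_rng(0)
# Stand-ins for backbone endpoints keyed by level.
endpoints = {"4": rng.normal(size=(14, 14, 8)), "5": rng.normal(size=(7, 7, 16))}
top_level = max(endpoints, key=int)      # use the coarsest (highest) level
logits = classify(endpoints[top_level], rng.normal(size=(16, 10)), np.zeros(10))
print(top_level, logits.shape)
```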

### Semantic Segmentation

For pixel-level prediction, use an FPN decoder followed by a segmentation head. The `SegmentationModel` class in [`official/vision/modeling/segmentation_model.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/segmentation_model.py) combines these components and applies a per-pixel classification layer defined in [`official/vision/modeling/heads/segmentation_heads.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/heads/segmentation_heads.py).
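The per-pixel classification step reduces to an argmax over the class dimension once the logits are at the target resolution. The sketch below uses nearest-neighbor upsampling as a simple substitute for whatever resizing the segmentation head or post-processing actually applies. `logits_to_mask` is a hypothetical helper.

```python
import numpy as np


def logits_to_mask(logits, output_size):
    """Nearest-neighbor upsamples (h, w, num_classes) logits, then takes a per-pixel argmax."""
    h, w, _ = logits.shape
    out_h, out_w = output_size
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    upsampled = logits[rows][:, cols]     # shape (out_h, out_w, num_classes)
    return upsampled.argmax(axis=-1)


logits = np.zeros((2, 2, 3))
logits[0, 0, 1] = 1.0  # top-left cell favors class 1
logits[0, 1, 2] = 1.0  # top-right cell favors class 2
logits[1, 0, 0] = 1.0  # bottom-left cell favors class 0
logits[1, 1, 2] = 1.0  # bottom-right cell favors class 2
mask = logits_to_mask(logits, output_size=(4, 4))
print(mask)
```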

### Instance Segmentation with Mask R-CNN

Complex tasks require multiple heads. In [`official/vision/modeling/maskrcnn_model.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/maskrcnn_model.py), the `MaskRCNNModel` attaches both a detection head (RPN + class/box head) and a mask head to the FPN decoder output, enabling simultaneous bounding-box and mask prediction.

## Configuration-Driven Model Building

Beyond manual assembly, TensorFlow Models provides factory modules that construct models from config objects (Python dataclasses whose fields can be overridden, for example from YAML files). The model factory in [`official/vision/modeling/factory.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/factory.py) reads definitions from [`official/vision/configs/retinanet.py`](https://github.com/tensorflow/models/blob/main/official/vision/configs/retinanet.py) or [`official/vision/configs/semantic_segmentation.py`](https://github.com/tensorflow/models/blob/main/official/vision/configs/semantic_segmentation.py) and dispatches to:

- [`official/vision/modeling/backbones/factory.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/factory.py) for backbone instantiation
- [`official/vision/modeling/decoders/factory.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/decoders/factory.py) for decoder selection

Heads do not appear to have a separate factory module; the model factory constructs them directly from the `head` section of the config.

This approach allows you to switch architectures—such as swapping ResNet for EfficientNet in [`official/vision/modeling/backbones/efficientnet.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/efficientnet.py) or using the Vision Transformer backbone in [`official/vision/modeling/backbones/vit.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/vit.py)—by editing configuration files rather than source code.

## Summary

- **TensorFlow Models** structures vision architectures into **backbones**, **decoders**, and **heads** for maximum modularity.
- The standard workflow instantiates a backbone, optionally wraps it with an FPN decoder, attaches task-specific heads, and binds everything in a model class like `RetinaNetModel` or `SegmentationModel`.
- Backbones expose `output_specs`, which decoders consume as their `input_specs`; heads are aligned with the decoder through matching `min_level` and `max_level` values.
- You can assemble models programmatically or use the factory system in [`official/vision/modeling/backbones/factory.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/factory.py) and related files to build from configuration files.

## Frequently Asked Questions

### What is the difference between a backbone and a head in TensorFlow Models?

The **backbone** is a feature extractor—typically a ResNet, EfficientNet, or Vision Transformer—that processes raw images into hierarchical feature maps. The **head** is a task-specific network that converts those features into predictions, such as class probabilities in [`official/vision/modeling/heads/dense_prediction_heads.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/heads/dense_prediction_heads.py) or segmentation masks in [`official/vision/modeling/heads/segmentation_heads.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/heads/segmentation_heads.py).

### Do I always need a decoder between the backbone and head?

No. **Image classification** models connect the backbone directly to a classification head, skipping the decoder. **Object detection** and **segmentation** models, however, typically place a decoder such as the FPN in [`official/vision/modeling/decoders/fpn.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/decoders/fpn.py) between the two, so that multiscale features are fused into a pyramid the head can process.

### How do I switch from ResNet to EfficientNet in my vision model?

Change the backbone instantiation to import from [`official/vision/modeling/backbones/efficientnet.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/efficientnet.py) instead of [`official/vision/modeling/backbones/resnet.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/resnet.py), and ensure the `input_specs` match your image resolution. The factory system in [`official/vision/modeling/backbones/factory.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/factory.py) automates this swap when you update the `backbone.type` field in your configuration file.

### Can I use custom backbones or heads with the TensorFlow Models factory system?

Yes, for backbones and decoders. Decorate a builder function with `register_backbone_builder` from [`official/vision/modeling/backbones/factory.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/factory.py) (or `register_decoder_builder` from [`official/vision/modeling/decoders/factory.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/decoders/factory.py)); the builder receives your config and returns the component. Once registered, the model factory can instantiate your custom component alongside standard ones like those in [`official/vision/modeling/backbones/vit.py`](https://github.com/tensorflow/models/blob/main/official/vision/modeling/backbones/vit.py). Heads do not appear to have an equivalent registry, so a custom head is usually wired in through your own model-building function, as in [`official/vision/examples/starter/example_model.py`](https://github.com/tensorflow/models/blob/main/official/vision/examples/starter/example_model.py).