How to Implement Image Segmentation Models with TensorFlow Models
To implement image segmentation models using the TensorFlow Models repository, assemble a backbone network, optional decoder, and segmentation head via the SegmentationModel class, then configure specialized losses and metrics from the official Vision library for end-to-end training.
The tensorflow/models repository provides a production-ready, modular framework for building both semantic and instance segmentation pipelines. By composing reusable components—from ResNet backbones to DeepLab-style fusion heads—you can implement image segmentation models without writing low-level TensorFlow operations from scratch.
Core Architecture Components
The segmentation stack consists of four interconnected components defined in official/vision/modeling/. The backbone extracts hierarchical features, the decoder optionally upsamples and enriches these features, the segmentation head produces per-pixel logits, and an optional mask-scoring head refines instance mask quality.
SegmentationModel Class
Located in official/vision/modeling/segmentation_model.py, the SegmentationModel class (lines 64-76) acts as the orchestration layer. Its constructor accepts:
- backbone: Any tf_keras.Model returning a dictionary of feature maps (e.g., {2: feat2, 3: feat3, ...})
- decoder: Optional upsampling module (omit to connect backbone directly to head)
- head: A SegmentationHead instance producing logits
- mask_scoring_head: Optional MaskScoring layer for instance segmentation
The call method executes these stages sequentially, returning a dictionary containing logits and optionally mask_scores.
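The staged call flow described above can be sketched with plain Python callables. This is an illustration only, not the library's code: ToySegmentationModel, toy_backbone, and toy_head are hypothetical stand-ins, and passing the head a (backbone_features, decoder_features) pair is an assumption of this sketch. Only the stage order and the logits / mask_scores output keys come from the description above.

```python
from typing import Any, Callable, Dict, Optional


class ToySegmentationModel:
    """Hypothetical stand-in that mirrors the staged call flow (not the real class)."""

    def __init__(self,
                 backbone: Callable,
                 head: Callable,
                 decoder: Optional[Callable] = None,
                 mask_scoring_head: Optional[Callable] = None):
        self.backbone = backbone
        self.decoder = decoder
        self.head = head
        self.mask_scoring_head = mask_scoring_head

    def __call__(self, inputs: Any) -> Dict[str, Any]:
        # Stage 1: the backbone returns a level-indexed feature dictionary.
        backbone_features = self.backbone(inputs)
        # Stage 2: the decoder is optional; without it, backbone features pass through.
        if self.decoder is not None:
            decoder_features = self.decoder(backbone_features)
        else:
            decoder_features = backbone_features
        # Stage 3: the head turns features into per-pixel logits.
        outputs = {"logits": self.head((backbone_features, decoder_features))}
        # Stage 4: optional mask scoring adds a second output key.
        if self.mask_scoring_head is not None:
            outputs["mask_scores"] = self.mask_scoring_head(outputs["logits"])
        return outputs


def toy_backbone(image):
    return {2: f"{image}:level2", 4: f"{image}:level4"}


def toy_head(features):
    _, decoder_features = features
    return f"logits[{decoder_features[4]}]"


toy_model = ToySegmentationModel(backbone=toy_backbone, head=toy_head)
outputs = toy_model("image")
```

Because no decoder is supplied, the head here reads the level-4 backbone feature directly, which is the same wiring the end-to-end example in this article uses.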
Segmentation Head Implementation
The SegmentationHead class in official/vision/modeling/heads/segmentation_heads.py (lines 91-100) generates pixel-wise classifications. Key configuration parameters include:
- feature_fusion: Controls low-level feature integration. Options include deeplabv3plus, pyramid_fusion, panoptic_fpn_fusion, or None (lines 334-376).
- upsample_factor: Nearest-neighbor upsampling ratio applied after convolutions (lines 60-62).
- num_convs and num_filters: Define the depth and width of the convolutional stack before the final 1×1 classifier (lines 63-88).
The final output layer is a Conv2D with num_classes filters, optionally followed by logit_activation (softmax or sigmoid).
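To make the last two stages concrete, here is a small NumPy sketch of a 1×1 classifier followed by nearest-neighbor upsampling and a softmax. It is not the SegmentationHead implementation. The helper names, the shapes, and the random weights are assumptions chosen for illustration. The order of classifier and upsampling is also a choice of this sketch; with a 1×1 kernel the two operations commute.

```python
import numpy as np


def classify_1x1(features, kernel, bias):
    """A 1x1 convolution is a per-pixel matrix multiply over the channel axis."""
    # features: [H, W, C], kernel: [C, num_classes], bias: [num_classes]
    return features @ kernel + bias


def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling: repeat every pixel `factor` times along H and W."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)


def softmax(logits):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 16))   # toy feature map
kernel = rng.normal(size=(16, 21))       # 21 output classes
bias = np.zeros(21)

logits = upsample_nearest(classify_1x1(features, kernel, bias), factor=4)
probs = softmax(logits)                  # plays the role of logit_activation
```

Each 4×4 block of the upsampled output holds identical values, which is the defining property of nearest-neighbor upsampling.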
Mask Scoring for Instance Segmentation
For Mask R-CNN-style architectures, the MaskScoring class (same file, lines 24-45) refines mask quality predictions. It applies depth-wise convolutions, resizes features to fc_input_size, and passes them through fully-connected layers (lines 97-104) to produce per-class mask confidence scores.
Loss Functions and Evaluation Metrics
The framework provides specialized utilities for segmentation training and evaluation.
SegmentationLosses (in official/vision/losses/segmentation_losses.py) combines pixel-wise cross-entropy, focal loss, and Dice loss terms. It accepts ignore_label parameters to handle unannotated pixels common in datasets like Cityscapes.
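The effect of an ignore label on a pixel-wise cross-entropy can be sketched in NumPy as follows. This is a minimal illustration of the masking idea, not the library's loss. The function name and the default ignore_label=255 are assumptions of the sketch.

```python
import numpy as np


def masked_softmax_cross_entropy(logits, labels, ignore_label=255):
    """Mean softmax cross-entropy over pixels whose label is not `ignore_label`."""
    # logits: [H, W, num_classes], labels: [H, W] integer class ids
    valid = labels != ignore_label
    # Ignored pixels get a placeholder class so indexing stays in range.
    safe_labels = np.where(valid, labels, 0)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, safe_labels[..., None], axis=-1)[..., 0]
    # Zero out ignored pixels, then average over the valid ones only.
    per_pixel = -picked * valid
    return per_pixel.sum() / max(valid.sum(), 1)


logits = np.zeros((2, 2, 3))                 # uniform predictions over 3 classes
labels = np.array([[0, 255], [2, 1]])        # one unannotated pixel
loss = masked_softmax_cross_entropy(logits, labels)
```

With uniform logits over three classes the loss equals ln(3), and changing the logits at the ignored pixel leaves it unchanged.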
SegmentationMetrics (in official/vision/evaluation/segmentation_metrics.py) computes Mean Intersection-over-Union (mIoU), per-class IoU, and boundary F-scores. Both utilities automatically handle the model's output dictionary, including optional mask_scores.
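For reference, mean IoU can be computed from a confusion matrix as in the NumPy sketch below. It illustrates the metric itself rather than the library class. The helper names and ignore_label=255 are assumptions, and classes that never appear are excluded from the mean.

```python
import numpy as np


def confusion_matrix(labels, preds, num_classes, ignore_label=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = labels != ignore_label
    index = labels[valid].astype(np.int64) * num_classes + preds[valid].astype(np.int64)
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def per_class_iou(cm):
    """IoU = TP / (TP + FP + FN); NaN for classes absent from labels and predictions."""
    true_positive = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - true_positive
    return np.where(union > 0, true_positive / np.maximum(union, 1), np.nan)


labels = np.array([[0, 0, 1], [1, 2, 255]])
preds = np.array([[0, 1, 1], [1, 2, 0]])

cm = confusion_matrix(labels, preds, num_classes=3)
ious = per_class_iou(cm)      # [0.5, 2/3, 1.0]
miou = np.nanmean(ious)       # 13/18, about 0.722
```

The pixel labeled 255 is dropped before counting, so its prediction does not affect any class.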
Data Pipeline Configuration
The SegmentationInput class in official/vision/dataloaders/segmentation_input.py parses TFRecord datasets containing paired images and uint8 masks. It yields a dictionary:
{
    'inputs': image_tensor,        # [H, W, 3] float32
    'groundtruths': {
        'label': mask_tensor,      # [H, W, 1] int32
    }
}
The loader supports on-the-fly augmentation including random flipping, scaling, and cropping during training.
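The key constraint for segmentation augmentation is that the image and its mask receive the same geometric transform. The NumPy sketch below shows this for a random horizontal flip and a random crop. The function name, the crop size, and the toy data are assumptions, and the loader itself operates on TensorFlow tensors rather than NumPy arrays.

```python
import numpy as np


def random_flip_and_crop(image, mask, crop_size, rng):
    """Apply one shared horizontal flip and crop window to an image and its mask."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    crop_h, crop_w = crop_size
    top = rng.integers(0, image.shape[0] - crop_h + 1)
    left = rng.integers(0, image.shape[1] - crop_w + 1)
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    return image[window], mask[window]


rng = np.random.default_rng(0)
mask = rng.integers(0, 21, size=(64, 64, 1)).astype(np.int32)
# Toy image whose channels copy the mask, so alignment is easy to verify.
image = np.repeat(mask, 3, axis=-1).astype(np.float32)

aug_image, aug_mask = random_flip_and_crop(image, mask, (32, 32), rng)
example = {
    "inputs": aug_image,                   # [32, 32, 3] float32
    "groundtruths": {"label": aug_mask},   # [32, 32, 1] int32
}
```

Because the toy image encodes the mask, any misalignment between the two would show up as a mismatch between aug_image[..., 0] and aug_mask[..., 0].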
End-to-End Implementation Example
Below is a complete workflow demonstrating how to implement image segmentation models with a ResNet-50 backbone and DeepLabV3+ fusion:
import tensorflow as tf
import tensorflow.keras as tf_keras
from official.vision.modeling.segmentation_model import SegmentationModel
from official.vision.modeling.heads.segmentation_heads import SegmentationHead
from official.vision.modeling.backbones import resnet_deeplab
from official.vision.dataloaders.segmentation_input import SegmentationInput
from official.vision.losses.segmentation_losses import SegmentationLosses
from official.vision.evaluation.segmentation_metrics import SegmentationMetrics

# 1️⃣ Build backbone (dilated ResNet-50; output_stride is an argument of
# DilatedResNet, not of the plain ResNet class)
backbone = resnet_deeplab.DilatedResNet(
    model_id=50,
    output_stride=16,
    norm_momentum=0.99,
    norm_epsilon=0.001)
# 2️⃣ Build segmentation head with DeepLabV3+ feature fusion
head = SegmentationHead(
    num_classes=21,  # Pascal VOC classes
    level=4,
    num_convs=2,
    num_filters=256,
    feature_fusion='deeplabv3plus',
    low_level=2,
    low_level_num_filters=48,
    upsample_factor=4,
    use_sync_bn=False,
    norm_momentum=0.99,
    norm_epsilon=0.001)

# 3️⃣ Assemble model (no decoder in this example)
model = SegmentationModel(backbone=backbone, decoder=None, head=head)
# 4️⃣ Configure loss and metrics
losses = SegmentationLosses(
    loss_type='softmax_cross_entropy',
    ignore_label=-1)
metrics = SegmentationMetrics(
    num_classes=21,
    ignore_label=-1)
model.compile(
    optimizer=tf_keras.optimizers.Adam(learning_rate=1e-4),
    loss=losses,
    metrics=[metrics])

# 5️⃣ Create TFRecord data pipeline
train_input = SegmentationInput(
    file_pattern='gs://my-bucket/dataset/train-*-of-*.tfrecord',
    is_training=True,
    batch_size=8,
    input_size=(512, 512))
train_dataset = train_input.make_dataset()

# 6️⃣ Train
model.fit(train_dataset, epochs=50)
This example connects a ResNet-50 feature extractor directly to a DeepLabV3+ style head, omitting a separate decoder module. The head upsamples predictions by a factor of 4 to match input resolution.
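The resolution bookkeeping behind that factor of 4 can be checked with a few lines of arithmetic. The sketch assumes the common convention that a feature map at level l has stride 2**l, and that the DeepLabV3+ fusion brings the fused features to the low-level (level 2) resolution before the final upsampling. feature_size is a hypothetical helper.

```python
def feature_size(input_size, level):
    """Spatial size of a feature map at `level`, assuming stride 2**level."""
    stride = 2 ** level
    return tuple(dim // stride for dim in input_size)


input_size = (512, 512)
high_level = feature_size(input_size, level=4)  # stride 16, matches output_stride=16
low_level = feature_size(input_size, level=2)   # stride 4, used for fusion

upsample_factor = 4
output_size = tuple(dim * upsample_factor for dim in low_level)
```

Under these assumptions the level-4 features are 32×32, the fused features are 128×128, and the upsampled logits are 512×512, the same as the input.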
Customization Strategies
To adapt the framework for specific research needs, modify these key components:
Custom Decoders: Subclass tf_keras.Model to process backbone_features and return decoder tensors, then pass this instance as the decoder argument to SegmentationModel.
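Multi-Scale Inference: Wrap a direct model call such as model(images, training=False) in a tf.function that processes image pyramids, resizes the resulting logits, and averages predictions across scales. Call the model directly rather than through model.predict, which Keras does not allow inside a tf.function.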
Panoptic Segmentation: Set feature_fusion='panoptic_fpn_fusion' and configure decoder_min_level and decoder_max_level parameters (lines 49-55 in segmentation_heads.py) to control FPN hierarchy integration.
Custom Loss Functions: While SegmentationLosses supports softmax cross-entropy, focal, and Dice losses, you can pass any callable to model.compile(loss=my_custom_loss) for specialized objectives like Lovász-Softmax.
Mask Quality Estimation: Instantiate MaskScoring(num_classes, fc_input_size, ...) and provide it as mask_scoring_head when building SegmentationModel for Mask Scoring R-CNN implementations.
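The multi-scale inference strategy listed above can be sketched framework-independently in NumPy. This is an outline under stated assumptions, not a drop-in for the TensorFlow model: resize_nearest and multi_scale_predict are hypothetical helpers, predict_fn stands in for any callable that maps an [H, W, 3] image to [H, W, num_classes] logits, and nearest-neighbor resizing is used only to keep the example short.

```python
import numpy as np


def resize_nearest(x, size):
    """Nearest-neighbor resize of an [H, W, C] array to size = (new_h, new_w)."""
    rows = (np.arange(size[0]) * x.shape[0] / size[0]).astype(int)
    cols = (np.arange(size[1]) * x.shape[1] / size[1]).astype(int)
    return x[rows][:, cols]


def multi_scale_predict(predict_fn, image, scales=(0.5, 1.0, 2.0)):
    """Average logits predicted at several input scales, at the original resolution."""
    height, width = image.shape[:2]
    total = 0.0
    for scale in scales:
        scaled_size = (max(1, round(height * scale)), max(1, round(width * scale)))
        logits = predict_fn(resize_nearest(image, scaled_size))
        # Bring every prediction back to the original size before averaging.
        total = total + resize_nearest(logits, (height, width))
    return total / len(scales)


rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 5))


def toy_predict(image):
    """Toy per-pixel linear 'model' with 5 classes."""
    return image @ weights


image = rng.normal(size=(16, 16, 3))
averaged_logits = multi_scale_predict(toy_predict, image)
predicted_classes = averaged_logits.argmax(axis=-1)
```

Averaging softmax probabilities instead of raw logits is a common variant; swap the accumulation line accordingly if that suits your evaluation protocol.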
Essential Source Files
When implementing custom segmentation architectures, reference these specific files in the tensorflow/models repository:
- official/vision/modeling/segmentation_model.py: Core SegmentationModel class tying components together.
- official/vision/modeling/heads/segmentation_heads.py: SegmentationHead and MaskScoring implementations.
- official/vision/losses/segmentation_losses.py: Loss computation utilities.
- official/vision/evaluation/segmentation_metrics.py: mIoU and boundary F-score metrics.
- official/vision/dataloaders/segmentation_input.py: TFRecord parsing and augmentation pipeline.
- docs/vision/semantic_segmentation.ipynb: End-to-end training notebook for Cityscapes and Pascal VOC.
Summary
- Assemble components: Use SegmentationModel to combine backbone, decoder, and head from official/vision/modeling/.
- Configure fusion: Set feature_fusion in SegmentationHead to deeplabv3plus or panoptic_fpn_fusion for multi-scale feature integration.
- Handle data: Use SegmentationInput to parse TFRecords with image and mask pairs.
- Evaluate properly: Employ SegmentationMetrics for mIoU calculation and SegmentationLosses for pixel-wise classification objectives.
- Extend functionality: Add MaskScoring heads for instance segmentation or custom decoders for specific architectural requirements.
Frequently Asked Questions
What backbone architectures are supported for segmentation in TensorFlow Models?
The framework supports any tf_keras.Model following the backbone API convention, returning feature maps as a level-indexed dictionary. Pre-implemented options include ResNet (50/101) and EfficientNet variants located in official/vision/modeling/backbones/. You can also inject custom backbones provided they output the expected feature dictionary format.
How do I handle datasets with ignored or unlabeled pixels?
Pass the ignore_label parameter (typically -1 or 255) to both SegmentationLosses and SegmentationMetrics during initialization. These classes automatically mask out these indices when computing cross-entropy losses or IoU metrics, ensuring invalid pixels do not affect gradient updates or evaluation scores.
Can I use this framework for instance segmentation or only semantic segmentation?
While primarily designed for semantic segmentation, the framework supports instance segmentation through the MaskScoring class. Instantiate this head and pass it as mask_scoring_head when creating SegmentationModel. For full Mask R-CNN implementations, combine this with detection components from the official detection modeling library.
What is the difference between using a decoder and using feature fusion in the head?
The decoder is a separate network module (e.g., U-Net style upsampling) that processes backbone features before they reach the head. Feature fusion (configured via feature_fusion in SegmentationHead) happens inside the head itself, merging low-level backbone features with high-level features using operations like DeepLabV3+ or FPN-style aggregation. You can use both together or independently depending on your architecture requirements.
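As a rough picture of what fusion inside the head means, the NumPy sketch below merges a coarse high-level feature map with a finer low-level one in the DeepLabV3+ style: upsample the high-level features, project the low-level channels with a 1×1 convolution, and concatenate. The function name, shapes, and random weights are assumptions, and nearest-neighbor upsampling stands in for whatever resizing the real head applies.

```python
import numpy as np


def fuse_deeplabv3plus_style(high, low, low_projection):
    """Concatenate upsampled high-level features with projected low-level features."""
    # high: [h, w, C_high], low: [H, W, C_low] with H a multiple of h
    factor = low.shape[0] // high.shape[0]
    high_up = np.repeat(np.repeat(high, factor, axis=0), factor, axis=1)
    # A 1x1 convolution reduces the low-level channels (per-pixel matmul).
    low_proj = low @ low_projection
    return np.concatenate([high_up, low_proj], axis=-1)


rng = np.random.default_rng(0)
high = rng.normal(size=(4, 4, 256))          # e.g. level-4 features
low = rng.normal(size=(16, 16, 64))          # e.g. level-2 features
low_projection = rng.normal(size=(64, 48))   # mirrors low_level_num_filters=48

fused = fuse_deeplabv3plus_style(high, low, low_projection)
```

The fused tensor has the low-level spatial size and 256 + 48 channels, which the head's convolution stack would then process.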