How to Implement Image Segmentation Models with TensorFlow Models

To implement image segmentation models using the TensorFlow Models repository, assemble a backbone network, optional decoder, and segmentation head via the SegmentationModel class, then configure specialized losses and metrics from the official Vision library for end-to-end training.

The tensorflow/models repository provides a production-ready, modular framework for building both semantic and instance segmentation pipelines. By composing reusable components—from ResNet backbones to DeepLab-style fusion heads—you can implement image segmentation models without writing low-level TensorFlow operations from scratch.

Core Architecture Components

The segmentation stack consists of four interconnected components defined in official/vision/modeling/. The backbone extracts hierarchical features, the decoder optionally upsamples and enriches these features, the segmentation head produces per-pixel logits, and an optional mask-scoring head refines instance mask quality.

SegmentationModel Class

Located in official/vision/modeling/segmentation_model.py, the SegmentationModel class (lines 64-76) acts as the orchestration layer. Its constructor accepts:

  • backbone: Any tf_keras.Model returning a dictionary of feature maps keyed by level (e.g., {'2': feat2, '3': feat3, ...}; the bundled backbones use string keys)
  • decoder: Optional upsampling module (omit to connect backbone directly to head)
  • head: A SegmentationHead instance producing logits
  • mask_scoring_head: Optional MaskScoring layer for instance segmentation

The call method executes these stages sequentially, returning a dictionary containing logits and optionally mask_scores.
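The control flow described above can be sketched in plain Python. This is an illustration of the sequencing only, not the library's code: run_segmentation_model and the toy stand-ins are hypothetical names, the string level keys are an assumption, and so is the choice to hand the head both the backbone and the decoder features (which is what lets it fuse low-level features later).

```python
from typing import Any, Callable, Dict, Optional

Features = Dict[str, Any]  # level-indexed feature maps, e.g. {"2": ..., "4": ...}


def run_segmentation_model(
    images: Any,
    backbone: Callable[[Any], Features],
    head: Callable[[Features, Features], Any],
    decoder: Optional[Callable[[Features], Features]] = None,
    mask_scoring_head: Optional[Callable[[Any], Any]] = None,
) -> Dict[str, Any]:
    """Runs backbone -> optional decoder -> head -> optional mask scoring."""
    backbone_features = backbone(images)
    # Without a decoder, the head consumes the backbone features directly.
    decoder_features = decoder(backbone_features) if decoder else backbone_features
    outputs = {"logits": head(backbone_features, decoder_features)}
    if mask_scoring_head is not None:
        outputs["mask_scores"] = mask_scoring_head(outputs["logits"])
    return outputs


# Toy stand-ins: strings make the data flow visible without any tensors.
def toy_backbone(x: str) -> Features:
    return {"2": f"{x}@level2", "4": f"{x}@level4"}


def toy_head(backbone_features: Features, decoder_features: Features) -> str:
    return f"logits({decoder_features['4']}, low={backbone_features['2']})"


result = run_segmentation_model("img", toy_backbone, toy_head)
print(result)
```

With decoder left as None, the head sees the backbone features twice, which matches the "omit to connect backbone directly to head" behavior in the list above.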

Segmentation Head Implementation

The SegmentationHead class in official/vision/modeling/heads/segmentation_heads.py (lines 91-100) generates pixel-wise classifications. Key configuration parameters include:

  • feature_fusion: Controls low-level feature integration. Options include deeplabv3plus, pyramid_fusion, panoptic_fpn_fusion, or None (lines 334-376).
  • upsample_factor: Nearest-neighbor upsampling ratio applied after convolutions (lines 60-62).
  • num_convs and num_filters: Defines the convolutional stack depth and width before the final 1×1 classifier (lines 63-88).

The final output layer is a Conv2D with num_classes filters, optionally followed by logit_activation (softmax or sigmoid).
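To make the head's output concrete, here is a minimal sketch of how per-pixel logits become a class map. It assumes logit_activation is left unset, so the head returns raw logits of shape [H, W, num_classes]; nested lists stand in for tensors, and softmax and predict_classes are hypothetical helper names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_classes(logit_map: List[List[List[float]]]) -> List[List[int]]:
    """Takes the argmax over the class axis of an [H][W][C] logit map."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in logit_map
    ]


# A 2x2 "image" with 3 classes per pixel.
logit_map = [
    [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]],
    [[-0.5, 0.0, 1.5], [4.0, 1.0, 1.0]],
]
class_map = predict_classes(logit_map)
print(class_map)  # [[0, 1], [2, 0]]
print(softmax(logit_map[0][0]))
```

Softmax does not change the argmax, so the predicted class map is the same whether or not an activation is applied to the logits.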

Mask Scoring for Instance Segmentation

For Mask R-CNN-style architectures, the MaskScoring class (same file, lines 24-45) refines mask quality predictions. It applies depth-wise convolutions, resizes features to fc_input_size, and passes them through fully-connected layers (lines 97-104) to produce per-class mask confidence scores.

Loss Functions and Evaluation Metrics

The framework provides specialized utilities for segmentation training and evaluation.

SegmentationLosses (in official/vision/losses/segmentation_losses.py) combines pixel-wise cross-entropy, focal loss, and Dice loss terms. It accepts an ignore_label parameter so that unannotated pixels, common in datasets like Cityscapes, contribute nothing to the loss.
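The effect of ignore_label on a cross-entropy term can be shown with a small sketch. This is not the library's implementation; masked_cross_entropy is a hypothetical helper, pixels are flattened into a list, and 255 is used as the ignore value purely for illustration.

```python
import math
from typing import List, Sequence


def masked_cross_entropy(
    logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    ignore_label: int = 255,
) -> float:
    """Mean softmax cross-entropy over pixels whose label is not ignore_label."""
    total, count = 0.0, 0
    for pixel_logits, label in zip(logits, labels):
        if label == ignore_label:
            continue  # unannotated pixel: no loss, no gradient
        peak = max(pixel_logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in pixel_logits))
        total += log_sum_exp - pixel_logits[label]
        count += 1
    return total / count if count else 0.0


# Three pixels, two classes; the last pixel is unannotated.
logits: List[List[float]] = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
labels = [0, 1, 255]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Dividing by the number of valid pixels rather than the total pixel count keeps the loss scale stable across images with different amounts of unlabeled area.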

SegmentationMetrics (in official/vision/evaluation/segmentation_metrics.py) computes Mean Intersection-over-Union (mIoU), per-class IoU, and boundary F-scores. Both utilities automatically handle the model's output dictionary, including optional mask_scores.
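For reference, mIoU with an ignore label reduces to per-class intersection and union counts. The sketch below is a plain-Python illustration under stated assumptions: mean_iou is a hypothetical helper, pixels are flattened, and classes that never appear in either predictions or labels are skipped in the average (the library may treat such classes differently).

```python
from typing import List, Optional, Sequence, Tuple


def mean_iou(
    predictions: Sequence[int],
    labels: Sequence[int],
    num_classes: int,
    ignore_label: int = 255,
) -> Tuple[float, List[Optional[float]]]:
    """Returns (mIoU, per-class IoU); absent classes get None and are skipped."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred, label in zip(predictions, labels):
        if label == ignore_label:
            continue  # ignored pixels affect neither intersection nor union
        if pred == label:
            intersection[label] += 1
            union[label] += 1
        else:
            union[pred] += 1
            union[label] += 1
    per_class = [
        intersection[c] / union[c] if union[c] else None for c in range(num_classes)
    ]
    present = [iou for iou in per_class if iou is not None]
    return (sum(present) / len(present) if present else 0.0), per_class


predictions = [0, 0, 1, 1, 2, 0]
labels = [0, 1, 1, 1, 255, 0]
miou, per_class_iou = mean_iou(predictions, labels, num_classes=3)
print(miou, per_class_iou)
```

Here the only pixel predicted as class 2 carries the ignore label, so class 2 has an empty union and does not drag the mean down.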

Data Pipeline Configuration

The SegmentationInput class in official/vision/dataloaders/segmentation_input.py parses TFRecord datasets containing paired images and uint8 masks. It yields a dictionary:

{
    'inputs': image_tensor,          # [H, W, 3] float32
    'groundtruths': {
        'label': mask_tensor,        # [H, W, 1] int32
    }
}

The loader supports on-the-fly augmentation including random flipping, scaling, and cropping during training.
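The key constraint for segmentation augmentation is that the image and its mask must receive the same geometric transform. The sketch below illustrates this with nested lists in place of tensors; random_horizontal_flip, random_crop, and augment are hypothetical helpers, not the loader's API.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]  # [H][W] grid; stands in for an image or a mask


def random_horizontal_flip(
    image: Grid, mask: Grid, rng: random.Random, prob: float = 0.5
) -> Tuple[Grid, Grid]:
    """Flips image and mask together, using a single random decision."""
    if rng.random() < prob:
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    return image, mask


def random_crop(
    image: Grid, mask: Grid, crop_h: int, crop_w: int, rng: random.Random
) -> Tuple[Grid, Grid]:
    """Crops image and mask with one shared, randomly chosen offset."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)

    def crop(grid: Grid) -> Grid:
        return [row[left:left + crop_w] for row in grid[top:top + crop_h]]

    return crop(image), crop(mask)


def augment(
    image: Grid, mask: Grid, crop_h: int, crop_w: int, rng: random.Random
) -> Tuple[Grid, Grid]:
    image, mask = random_horizontal_flip(image, mask, rng)
    return random_crop(image, mask, crop_h, crop_w, rng)


# Using identical grids for image and mask makes any misalignment obvious.
rng = random.Random(0)
image = [[r * 10 + c for c in range(4)] for r in range(3)]
mask = [row[:] for row in image]
aug_image, aug_mask = augment(image, mask, crop_h=2, crop_w=3, rng=rng)
print(aug_image, aug_mask)
```

Drawing the random flip decision and crop offsets once and applying them to both tensors is what keeps every pixel paired with its label.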

End-to-End Implementation Example

Below is a complete workflow demonstrating how to implement image segmentation models with a ResNet-50 backbone and DeepLabV3+ fusion:

import tensorflow as tf
# Recent tf-models-official releases build on the standalone Keras 2 package;
# on older releases, use `tf_keras = tf.keras` instead.
import tf_keras
from official.vision.modeling.segmentation_model import SegmentationModel
from official.vision.modeling.heads.segmentation_heads import SegmentationHead
from official.vision.modeling.backbones import resnet
from official.vision.dataloaders.segmentation_input import SegmentationInput
from official.vision.losses.segmentation_losses import SegmentationLosses
from official.vision.evaluation.segmentation_metrics import SegmentationMetrics

# Class and argument names follow this article. Some releases name them
# differently (for example SegmentationLoss, or MeanIoU and PerClassIoU), so
# check each import against the installed version before running.

# 1. Build backbone (ResNet-50). Its level-4 endpoint has stride 16.
backbone = resnet.ResNet(
    model_id=50,
    norm_momentum=0.99,
    norm_epsilon=0.001)

# 2. Build segmentation head with DeepLabV3+ feature fusion
head = SegmentationHead(
    num_classes=21,               # Pascal VOC classes
    level=4,                      # high-level features (stride 16)
    num_convs=2,
    num_filters=256,
    feature_fusion='deeplabv3plus',
    low_level=2,                  # low-level features (stride 4)
    low_level_num_filters=48,
    upsample_factor=4,            # stride 4 -> full input resolution
    use_sync_bn=False,
    norm_momentum=0.99,
    norm_epsilon=0.001)

# 3. Assemble model (no decoder in this example)
model = SegmentationModel(backbone=backbone, decoder=None, head=head)

# 4. Configure loss and metrics
losses = SegmentationLosses(
    loss_type='softmax_cross_entropy',
    ignore_label=-1)

metrics = SegmentationMetrics(
    num_classes=21,
    ignore_label=-1)

model.compile(
    optimizer=tf_keras.optimizers.Adam(learning_rate=1e-4),
    loss=losses,
    metrics=[metrics])

# 5. Create TFRecord data pipeline
train_input = SegmentationInput(
    file_pattern='gs://my-bucket/dataset/train-*-of-*.tfrecord',
    is_training=True,
    batch_size=8,
    input_size=(512, 512))

train_dataset = train_input.make_dataset()

# 6. Train
model.fit(train_dataset, epochs=50)

This example connects a ResNet-50 feature extractor directly to a DeepLabV3+ style head, omitting a separate decoder module. The head fuses the level-4 features with the level-2 (stride-4) low-level features, then upsamples the predictions by a factor of 4 to match the input resolution.
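The resolution bookkeeping behind upsample_factor=4 can be checked with a few lines of arithmetic. The sketch assumes the usual convention that a level-l feature map has stride 2**l relative to the input, and that DeepLabV3+ fusion brings the high-level features to the low-level resolution before the classifier; feature_size is a hypothetical helper.

```python
def feature_size(input_size: int, level: int) -> int:
    """Spatial size of a feature map at a pyramid level (stride 2**level)."""
    return input_size // (2 ** level)


input_size = 512
high_level_size = feature_size(input_size, level=4)  # stride 16 -> 32
low_level_size = feature_size(input_size, level=2)   # stride 4  -> 128

# After fusion the head works at the low-level resolution.
fused_size = low_level_size
upsample_factor = 4
output_size = fused_size * upsample_factor

print(high_level_size, low_level_size, output_size)  # 32 128 512
```

If low_level is changed, upsample_factor should change with it so that the product of the fused stride and the factor stays at 1.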

Customization Strategies

To adapt the framework for specific research needs, modify these key components:

Custom Decoders: Subclass tf_keras.Model to process backbone_features and return decoder tensors, then pass this instance as the decoder argument to SegmentationModel.

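Multi-Scale Inference: Call the model directly inside a tf.function (model.predict runs its own batching loop and is not meant to be traced), feed it an image pyramid, resize the resulting logits back to a common resolution, and average the predictions across scales.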
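The averaging logic for multi-scale inference is independent of TensorFlow and can be sketched with nested lists. Everything here is illustrative: resize_nearest, multi_scale_predict, and toy_predict are hypothetical names, nearest-neighbor resizing stands in for whatever interpolation you would use on real tensors, and the predict function is assumed to return one score vector per input pixel.

```python
from typing import Callable, List, Sequence

Grid = List[List]  # [H][W] grid of per-pixel payloads


def resize_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbor resize of an [H][W] grid."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def multi_scale_predict(
    image: Grid,
    predict_fn: Callable[[Grid], Grid],
    scales: Sequence[float] = (0.5, 1.0, 2.0),
) -> Grid:
    """Averages per-pixel class scores over several input scales."""
    h, w = len(image), len(image[0])
    total = None
    for scale in scales:
        scaled_h, scaled_w = max(1, round(h * scale)), max(1, round(w * scale))
        scores = predict_fn(resize_nearest(image, scaled_h, scaled_w))
        scores = resize_nearest(scores, h, w)  # back to the original size
        if total is None:
            total = [[list(pixel) for pixel in row] for row in scores]
        else:
            for r in range(h):
                for c in range(w):
                    total[r][c] = [a + b for a, b in zip(total[r][c], scores[r][c])]
    return [[[v / len(scales) for v in pixel] for pixel in row] for row in total]


def toy_predict(image: Grid) -> Grid:
    """Stand-in model: two-class scores derived from pixel brightness."""
    return [[[1.0 - value, value] for value in row] for row in image]


image = [[0.0, 1.0], [1.0, 0.0]]
averaged = multi_scale_predict(image, toy_predict)
print(averaged[0][1])
```

In this toy case the half-resolution pass sees only the top-left pixel, so it votes for class 0 everywhere while the other two scales agree with the original image, which is why the bright pixel ends up at roughly one third versus two thirds.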

Panoptic Segmentation: Set feature_fusion='panoptic_fpn_fusion' and configure decoder_min_level and decoder_max_level parameters (lines 49-55 in segmentation_heads.py) to control FPN hierarchy integration.

Custom Loss Functions: While SegmentationLosses supports softmax cross-entropy, focal, and Dice losses, you can pass any callable to model.compile(loss=my_custom_loss) for specialized objectives like Lovász-Softmax.

Mask Quality Estimation: Instantiate MaskScoring(num_classes, fc_input_size, ...) and provide it as mask_scoring_head when building SegmentationModel for Mask Scoring R-CNN implementations.

Essential Source Files

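When implementing custom segmentation architectures, reference these files in the tensorflow/models repository:

  • official/vision/modeling/segmentation_model.py: SegmentationModel
  • official/vision/modeling/heads/segmentation_heads.py: SegmentationHead and MaskScoring
  • official/vision/modeling/backbones/: ResNet and EfficientNet backbones
  • official/vision/losses/segmentation_losses.py: SegmentationLosses
  • official/vision/evaluation/segmentation_metrics.py: SegmentationMetrics
  • official/vision/dataloaders/segmentation_input.py: SegmentationInput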

Summary

  • Assemble components: Use SegmentationModel to combine backbone, decoder, and head from official/vision/modeling/.
  • Configure fusion: Set feature_fusion in SegmentationHead to deeplabv3plus or panoptic_fpn_fusion for multi-scale feature integration.
  • Handle data: Use SegmentationInput to parse TFRecords with image and mask pairs.
  • Evaluate properly: Employ SegmentationMetrics for mIoU calculation and SegmentationLosses for pixel-wise classification objectives.
  • Extend functionality: Add MaskScoring heads for instance segmentation or custom decoders for specific architectural requirements.

Frequently Asked Questions

What backbone architectures are supported for segmentation in TensorFlow Models?

The framework supports any tf_keras.Model following the backbone API convention, returning feature maps as a level-indexed dictionary. Pre-implemented options include ResNet (50/101) and EfficientNet variants located in official/vision/modeling/backbones/. You can also inject custom backbones provided they output the expected feature dictionary format.

How do I handle datasets with ignored or unlabeled pixels?

Pass the ignore_label parameter (typically -1 or 255) to both SegmentationLosses and SegmentationMetrics during initialization. These classes automatically mask out these indices when computing cross-entropy losses or IoU metrics, ensuring invalid pixels do not affect gradient updates or evaluation scores.

Can I use this framework for instance segmentation or only semantic segmentation?

While primarily designed for semantic segmentation, the framework supports instance segmentation through the MaskScoring class. Instantiate this head and pass it as mask_scoring_head when creating SegmentationModel. For full Mask R-CNN implementations, combine this with detection components from the official detection modeling library.

What is the difference between using a decoder and using feature fusion in the head?

The decoder is a separate network module (e.g., U-Net style upsampling) that processes backbone features before they reach the head. Feature fusion (configured via feature_fusion in SegmentationHead) happens inside the head itself, merging low-level backbone features with high-level features using operations like DeepLabV3+ or FPN-style aggregation. You can use both together or independently depending on your architecture requirements.
