How to Implement Video Classification Models with TensorFlow Models
To implement video classification models with TensorFlow Models, use the modular config-driven stack in official/vision that wires together experiment configurations, 3D backbones, and task logic to train on video datasets like Kinetics-400 without modifying core training code.
The TensorFlow Models repository provides a production-grade framework for building video classification models using a declarative configuration system. By leveraging the official/vision components, you can assemble complete training pipelines, from data ingestion through training and evaluation, by editing Python dataclass configurations rather than writing imperative training code.
Configuration-Driven Architecture
The implementation follows a strict config → data → model → task pattern. All hyperparameters are centralized in official/vision/configs/video_classification.py, which defines DataConfig for datasets like Kinetics or UCF-101, VideoClassificationModel for architecture selection, and experiment factories (e.g., video_classification_kinetics400) that assemble complete ExperimentConfig objects with trainer and optimizer settings.
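The config-first idea can be illustrated without installing the library. The following is a minimal sketch using only standard-library dataclasses; every class and field name prefixed with Sketch is hypothetical and merely mirrors the kind of fields described above, not the library's actual definitions.

```python
import dataclasses
from typing import Tuple


@dataclasses.dataclass(frozen=True)
class SketchDataConfig:
    """Hypothetical stand-in for a dataset configuration."""
    input_path: str = ""
    feature_shape: Tuple[int, int, int, int] = (64, 224, 224, 3)  # (T, H, W, C)
    num_classes: int = 400
    global_batch_size: int = 128


@dataclasses.dataclass(frozen=True)
class SketchModelConfig:
    """Hypothetical stand-in for an architecture configuration."""
    backbone_type: str = "resnet_3d"
    dropout_rate: float = 0.2


@dataclasses.dataclass(frozen=True)
class SketchExperimentConfig:
    """Hypothetical top-level config that nests data and model settings."""
    train_data: SketchDataConfig = dataclasses.field(default_factory=SketchDataConfig)
    model: SketchModelConfig = dataclasses.field(default_factory=SketchModelConfig)
    train_steps: int = 1000


# Derive a small debugging variant without mutating the base configuration.
base = SketchExperimentConfig()
small = dataclasses.replace(
    base,
    train_data=dataclasses.replace(
        base.train_data, feature_shape=(16, 112, 112, 3), global_batch_size=8
    ),
    train_steps=100,
)
```

Because the variant is produced by replacing fields rather than editing code paths, the training logic that consumes the config never needs to change.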
Experiment Factory Pattern
Rather than manually instantiating classes, you retrieve pre-built configurations via exp_factory.get_exp_config. This factory, located in official/core/exp_factory.py, returns a complete experiment specification including batch sizes, learning rate schedules, and warm-up steps configured through the add_trainer helper in the config file.
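The lookup-by-name mechanism behind an experiment factory can be sketched as a small registry. This is a toy illustration in plain Python, assuming nothing about the internals of official/core/exp_factory.py; the registry dictionary, the experiment name sketch_video_classification, and the returned values are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry mapping experiment names to zero-argument factories.
_REGISTRY: Dict[str, Callable[[], dict]] = {}


def register_config_factory(name: str):
    """Decorator that records a factory function under an experiment name."""
    def decorator(factory: Callable[[], dict]) -> Callable[[], dict]:
        if name in _REGISTRY:
            raise KeyError(f"Experiment {name!r} is already registered")
        _REGISTRY[name] = factory
        return factory
    return decorator


def get_exp_config(name: str) -> dict:
    """Builds a fresh config for a registered experiment name."""
    if name not in _REGISTRY:
        raise KeyError(f"Unknown experiment {name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register_config_factory("sketch_video_classification")
def sketch_video_classification() -> dict:
    # Illustrative values only; a real factory returns a full experiment config.
    return {
        "task": {"num_classes": 400, "feature_shape": (64, 224, 224, 3)},
        "trainer": {"train_steps": 1000, "warmup_steps": 100},
    }


config = get_exp_config("sketch_video_classification")
```

Each call runs the factory again, so callers can modify the returned config without affecting later lookups.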
Data Pipeline and Preprocessing
The input pipeline is handled by official/vision/dataloaders/video_input.py, which parses TFRecord or TFDS video examples and applies temporal sampling strategies. The parser performs frame decoding and random augmentations (crop, flip, rotation, AutoAugment/RandAugment), then returns a features dictionary of the form {'image': <tensor>} together with the label. Each clip matches DataConfig.feature_shape, which is (T, H, W, C), so a batched tensor has shape (batch, T, H, W, C).
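Temporal sampling can be pictured as choosing which frame indices make up a clip. The sketch below is a simplified pure-Python illustration, not the parser's actual code: the function name sample_clip_indices is hypothetical, and clamping to the last frame for short videos is a simplifying assumption.

```python
import random
from typing import List, Optional


def sample_clip_indices(
    num_video_frames: int,
    num_frames: int,
    stride: int,
    is_training: bool,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Returns the frame indices of one clip of `num_frames` frames."""
    # Number of source frames covered by the clip, including skipped ones.
    span = (num_frames - 1) * stride + 1
    max_start = max(num_video_frames - span, 0)
    if is_training:
        # Training: random temporal offset for augmentation.
        rng = rng or random.Random()
        start = rng.randint(0, max_start)
    else:
        # Evaluation: deterministic, centered clip.
        start = max_start // 2
    # Assumption: videos shorter than the span repeat their last frame.
    last = num_video_frames - 1
    return [min(start + i * stride, last) for i in range(num_frames)]


eval_indices = sample_clip_indices(100, num_frames=8, stride=2, is_training=False)
```

With 100 source frames, 8 output frames, and a stride of 2, the evaluation clip covers 15 source frames and starts at frame 42.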
Input Reader Construction
The official/vision/dataloaders/input_reader_factory.py module constructs the final tf.data pipeline using the video parser, handling distributed reading and batching across TPU or GPU workers.
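Two bookkeeping steps are typical of distributed input pipelines: splitting the global batch across replicas and giving each worker a disjoint subset of files. The following is a rough sketch of that arithmetic under those assumptions; both helper names are hypothetical and do not correspond to functions in the library.

```python
from typing import List, Sequence


def per_replica_batch_size(global_batch_size: int, num_replicas: int) -> int:
    """Splits a global batch evenly across replicas."""
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    if global_batch_size % num_replicas != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"num_replicas={num_replicas}"
        )
    return global_batch_size // num_replicas


def shard_files(files: Sequence[str], num_workers: int, worker_index: int) -> List[str]:
    """Assigns every num_workers-th file to the given worker."""
    return list(files[worker_index::num_workers])


batch = per_replica_batch_size(128, 8)
shard = shard_files(["a.tfrecord", "b.tfrecord", "c.tfrecord"], num_workers=2, worker_index=0)
```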
Model Definition and 3D Backbones
The VideoClassificationModel class in official/vision/modeling/video_classification_model.py serves as a thin wrapper around 3D convolutional backbones. It takes features from the backbone (such as the ResNet-3D variants configured in official/vision/configs/backbones_3d.py), applies global pooling and optional dropout, and projects the result to num_classes via a dense layer.
Backbone Selection
You specify the backbone architecture through the configuration's backbone.type field. When official/vision/modeling/factory_3d.py builds the model, backbone construction is delegated to the builder registered for that type string, so switching between registered backbones requires no changes to training code.
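Selecting an architecture through a type string can be sketched as a "one-of" config plus a lookup table of builders. This is an illustrative toy, assuming nothing about the real config classes: all Sketch names, the field stochastic_depth_drop_rate, and the string return values are hypothetical placeholders for real config objects and Keras models.

```python
import dataclasses
from typing import Callable, Dict


@dataclasses.dataclass
class SketchResNet3D:
    model_id: int = 50


@dataclasses.dataclass
class SketchResNet3DRS:
    model_id: int = 50
    stochastic_depth_drop_rate: float = 0.0


@dataclasses.dataclass
class SketchBackbone3D:
    """Holds one sub-config per backbone; `type` names the active one."""
    type: str = "resnet_3d"
    resnet_3d: SketchResNet3D = dataclasses.field(default_factory=SketchResNet3D)
    resnet_3d_rs: SketchResNet3DRS = dataclasses.field(default_factory=SketchResNet3DRS)

    def get(self):
        # Return the sub-config selected by the type string.
        return getattr(self, self.type)


# Builders return descriptive strings here; real builders would return models.
_BACKBONE_BUILDERS: Dict[str, Callable[[object], str]] = {
    "resnet_3d": lambda cfg: f"ResNet3D-{cfg.model_id}",
    "resnet_3d_rs": lambda cfg: f"ResNet3D-RS-{cfg.model_id}",
}


def build_backbone(backbone_cfg: SketchBackbone3D) -> str:
    """Dispatches to the builder registered for the configured type."""
    if backbone_cfg.type not in _BACKBONE_BUILDERS:
        raise ValueError(f"No builder registered for {backbone_cfg.type!r}")
    return _BACKBONE_BUILDERS[backbone_cfg.type](backbone_cfg.get())


default_backbone = build_backbone(SketchBackbone3D())
```

Changing only the type field (and, if needed, the matching sub-config) is enough to select a different builder.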
Task Logic and Training Loop
VideoClassificationTask in official/vision/tasks/video_classification.py orchestrates the entire training process. It builds the Keras model via factory_3d.build_model, loads optional pretrained checkpoints, constructs the input pipeline, and defines the loss function (categorical or binary cross-entropy) and metrics (top-1/top-5 accuracy, AUC, per-class recall).
The task implementation handles mixed-precision training automatically and defines the forward pass logic for both training and validation steps.
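To make the top-1/top-5 metrics concrete, here is a small pure-Python sketch of top-k accuracy. It is only an illustration of the definition; the function name is hypothetical and the task itself relies on Keras metric classes rather than code like this.

```python
from typing import Sequence


def top_k_accuracy(
    logits: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for scores, label in zip(logits, labels):
        # Class indices sorted by descending score, truncated to the top k.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


example_logits = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.5, 0.3, 0.2],  # predicted class 0
    [0.2, 0.3, 0.5],  # predicted class 2
]
example_labels = [1, 1, 0]
top1 = top_k_accuracy(example_logits, example_labels, k=1)
top2 = top_k_accuracy(example_logits, example_labels, k=2)
```

In this example only the first clip is correct at k=1, while the second clip also counts at k=2 because its true label has the second-highest score.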
End-to-End Implementation Example
The following code demonstrates how to assemble a complete video classification training setup using the factory pattern:
# 1️⃣ Import TensorFlow, the experiment factory, and the task module.
# Importing the task module also registers the video classification experiments.
import tensorflow as tf
from official.core import exp_factory
from official.vision.tasks import video_classification as video_task

# 2️⃣ Load the experiment configuration for Kinetics-400.
exp_cfg = exp_factory.get_exp_config('video_classification_kinetics400')

# 3️⃣ Optionally override hyper-parameters.
exp_cfg.trainer = exp_cfg.trainer.replace(
    steps_per_loop=1000,  # custom step granularity
    optimizer_config=exp_cfg.trainer.optimizer_config.replace(
        optimizer={'type': 'adam'},  # switch optimizer
    ),
)

# 4️⃣ Build the model (the task constructs it internally).
# The task takes its config as the first positional argument.
task = video_task.VideoClassificationTask(exp_cfg.task)
model = task.build_model()  # builds ResNet-3D backbone + classification head

# 5️⃣ Inspect the model summary.
model.summary()

# 6️⃣ (Optional) Compile and run a quick sanity-check fit.
# The classification head is assumed to emit raw logits, hence from_logits=True.
model.compile(
    optimizer='sgd',
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# Dummy data: one batch of clips shaped (batch, T, H, W, C), where
# (T, H, W, C) matches DataConfig.feature_shape.
dummy_x = tf.random.uniform([4, 64, 224, 224, 3])
dummy_y = tf.one_hot(tf.random.uniform([4], maxval=400, dtype=tf.int32), 400)
# The model is assumed to take a dict keyed by 'image', as the parser emits.
model.fit({'image': dummy_x}, dummy_y, epochs=1)
Running Distributed Training
Once configured, launch training with the Model Garden training driver, official/vision/train.py, selecting the experiment by name with --experiment and supplying any overrides through --config_file or --params_override. The script creates a Trainer instance from the config's trainer field and executes distributed training on TPUs or GPUs using the distribution strategy defined in the runtime configuration.
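A launch command can be assembled programmatically, which is convenient for sweeps. The sketch below assumes the Model Garden entry point official/vision/train.py and its --experiment, --mode, --model_dir, and --params_override flags; the helper name build_train_command, the output directory, and the override values are hypothetical, and the dotted key=value override syntax should be checked against your installed version.

```python
import shlex
from typing import List, Optional


def build_train_command(
    experiment: str,
    model_dir: str,
    mode: str = "train_and_eval",
    params_override: Optional[str] = None,
) -> List[str]:
    """Builds the argv list for launching a registered experiment."""
    cmd = [
        "python3", "-m", "official.vision.train",
        f"--experiment={experiment}",
        f"--mode={mode}",
        f"--model_dir={model_dir}",
    ]
    if params_override:
        # Assumed format: comma-separated dotted keys, e.g. "a.b=1,c.d=2".
        cmd.append(f"--params_override={params_override}")
    return cmd


cmd = build_train_command(
    experiment="video_classification_kinetics400",
    model_dir="/tmp/video_cls",  # hypothetical output directory
    params_override="task.train_data.global_batch_size=8,trainer.train_steps=100",
)
print(shlex.join(cmd))  # copy-pasteable shell command
```

The resulting list can be passed to subprocess.run as is, which avoids shell-quoting problems with the override string.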
Summary
- Use exp_factory.get_exp_config to retrieve pre-built experiment configurations for standard datasets like Kinetics-400 or UCF-101.
- Modify official/vision/configs/video_classification.py to change hyperparameters, backbones (such as the ResNet-3D variants), or augmentation strategies without touching training logic.
- Leverage VideoClassificationTask to handle model construction, checkpoint loading, loss computation, and metric tracking in a single class.
- Process video data through official/vision/dataloaders/video_input.py, which handles temporal sampling, decoding, and augmentations for TFRecord or TFDS sources.
- Scale to distributed hardware by launching the experiment with official/vision/train.py, which builds the Trainer and applies the configured distribution strategy for TPU or GPU clusters.
Frequently Asked Questions
How do I switch from ResNet-3D to SlowFast backbone?
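Update the backbone.type field in your experiment configuration to the registered name of the target backbone before building the task. Check the Backbone3D config in official/vision/configs/backbones_3d.py for the type strings available in your checkout: it covers ResNet-3D variants such as resnet_3d, and a 'slowfast' type only works if a SlowFast backbone config and builder have been registered, for example by a project or your own code. Once the type is registered, the backbone is built from that string and VideoClassificationModel applies its pooling and projection layers to the new backbone's output.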
Can I use custom video datasets instead of Kinetics-400?
Yes. Create a new configuration function in official/vision/configs/video_classification.py that returns an ExperimentConfig with your custom DataConfig pointing to your TFRecord files, and register it under a new name with the exp_factory.register_config_factory decorator. Adjust feature_shape and num_classes to match your clip dimensions and label space, then call exp_factory.get_exp_config with your new experiment name.
Where is the mixed-precision training logic implemented?
Mixed-precision handling is built into VideoClassificationTask in official/vision/tasks/video_classification.py, whose training step applies loss scaling when the optimizer is wrapped in a loss-scale optimizer. The precision mode itself is not an optimizer setting: it is selected through the experiment's runtime configuration (the mixed_precision_dtype field), which the training driver uses to set the global precision policy.
How do I add custom augmentations to the video pipeline?
Modify the parser in official/vision/dataloaders/video_input.py or adjust the aug_type field in your DataConfig (defined in official/vision/configs/video_classification.py). The existing implementation supports RandAugment and AutoAugment policies, and you can extend the parser's training-time parsing logic to inject custom temporal or spatial transformations.
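One detail matters for any custom spatial augmentation on video: the random decision should be made once per clip, so that every frame is transformed identically. The sketch below shows this with a horizontal flip on nested Python lists; the function name random_flip_clip is hypothetical and a real parser hook would operate on tensors instead.

```python
import random
from typing import List

Frame = List[List[int]]  # rows of pixel values


def random_flip_clip(clip: List[Frame], rng: random.Random, prob: float = 0.5) -> List[Frame]:
    """Flips every frame of the clip horizontally, or none of them."""
    # Draw once per clip so all frames stay temporally consistent.
    if rng.random() >= prob:
        return clip
    return [[row[::-1] for row in frame] for frame in clip]


clip = [
    [[1, 2, 3], [4, 5, 6]],     # frame 0
    [[7, 8, 9], [10, 11, 12]],  # frame 1
]
flipped = random_flip_clip(clip, random.Random(0), prob=1.0)
unchanged = random_flip_clip(clip, random.Random(0), prob=0.0)
```

Drawing the random value per frame instead would produce clips that flicker between mirrored and unmirrored frames, which corrupts motion cues.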