How to Use the TensorFlow Models Optimization Module for Training

The TensorFlow Models optimization module provides a config-driven factory pattern to create optimizers, learning-rate schedules, and warm-up policies through the OptimizerFactory class in official/modeling/optimization/optimizer_factory.py.

The tensorflow/models repository includes a powerful optimization framework under official/modeling/optimization/ that decouples hyperparameter configuration from training logic. By using the OptimizationConfig dataclass and the OptimizerFactory, you can switch between AdamW, SGD, or LARS optimizers, toggle cosine or stepwise learning-rate schedules, and enable exponential moving average (EMA) without modifying your training loop code.

Understanding the Optimization Module Architecture

The module follows a strict separation between configuration data and construction logic. This design allows JSON or YAML configs to drive the training process while the factory handles object instantiation.
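As a minimal sketch of this separation, the stdlib-only example below keeps hyperparameters in a plain dataclass and leaves construction to a lookup-based builder. Every name in it (ToyOptimizationConfig, ToySGD, ToyAdam, TOY_OPTIMIZERS, build_toy_optimizer) is a hypothetical stand-in for illustration, not part of the tensorflow/models API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyOptimizationConfig:
    """Hypothetical config object: holds data only, no construction logic."""
    optimizer_type: str
    optimizer_params: dict = field(default_factory=dict)


class ToySGD:
    """Placeholder optimizer that only records its hyperparameters."""
    def __init__(self, learning_rate, momentum=0.0):
        self.learning_rate = learning_rate
        self.momentum = momentum


class ToyAdam:
    """Second placeholder so the registry has more than one entry."""
    def __init__(self, learning_rate, beta_1=0.9):
        self.learning_rate = learning_rate
        self.beta_1 = beta_1


# Registry mapping a config string to the class that should be built.
TOY_OPTIMIZERS = {"sgd": ToySGD, "adam": ToyAdam}


def build_toy_optimizer(config, learning_rate):
    """Construction logic lives here, separate from the config data."""
    optimizer_cls = TOY_OPTIMIZERS[config.optimizer_type]
    return optimizer_cls(learning_rate=learning_rate, **config.optimizer_params)


# The config could come from a JSON or YAML file; a JSON string is used here.
raw = json.loads('{"optimizer_type": "sgd", "optimizer_params": {"momentum": 0.9}}')
toy_optimizer = build_toy_optimizer(ToyOptimizationConfig(**raw), learning_rate=0.1)
```

Switching the JSON value from "sgd" to "adam" (and adjusting the parameters) changes the object that gets built without touching the builder or the code that calls it.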

Configuration Layer (OptimizationConfig)

The top-level configuration is defined in official/modeling/optimization/configs/optimization_config.py. The OptimizationConfig dataclass aggregates three sub-configs:

  • optimizer – Specifies the optimizer type (e.g., sgd, adamw, lamb) and its hyper-parameters (momentum, weight decay, etc.), defined in official/modeling/optimization/configs/optimizer_config.py.
  • learning_rate – Defines the schedule type (stepwise, cosine, polynomial, exponential) and schedule-specific parameters such as decay_steps or boundaries.
  • warmup (optional) – Configures warm-up strategy (linear, polynomial) to stabilize early training steps.
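The config dictionaries used throughout this article share one layout: each section names a type and carries a sub-dictionary keyed by that same name. The helper below is a hypothetical, stdlib-only illustration of how such a section can be resolved; it is not the library's own parsing code.

```python
def active_params(section):
    """Return (type, params) for a section shaped like {"type": name, name: {...}}."""
    kind = section["type"]
    # Only the sub-dictionary whose key matches "type" is treated as active.
    return kind, section.get(kind, {})


config = {
    "optimizer": {"type": "sgd", "sgd": {"momentum": 0.9}},
    "learning_rate": {"type": "cosine", "cosine": {"decay_steps": 20000}},
    "warmup": {"type": "linear", "linear": {"warmup_steps": 1000}},
}

# Resolve every section to the (type, params) pair a builder would consume.
resolved = {name: active_params(section) for name, section in config.items()}
```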

Factory Layer (OptimizerFactory)

The OptimizerFactory class in official/modeling/optimization/optimizer_factory.py consumes an OptimizationConfig instance and provides two primary methods:

  1. build_learning_rate() – Constructs the learning-rate schedule and optionally wraps it with a warm-up schedule, using the schedule classes defined in official/modeling/optimization/lr_schedule.py.
  2. build_optimizer(lr) – Instantiates the optimizer, handling legacy vs. new Keras optimizer APIs, gradient clipping, and EMA wrapping via official/modeling/optimization/ema_optimizer.py.

Creating Learning Rate Schedules and Warm-Up Policies

The factory supports multiple decay strategies through the LR_CLS mapping in optimizer_factory.py. You can combine any schedule with a warm-up wrapper from the WARMUP_CLS mapping.

Stepwise and Cosine Decay Schedules

Stepwise decay reduces the learning rate at specific step boundaries, while cosine decay applies a smooth cosine annealing curve. In official/modeling/optimization/lr_schedule.py, the CosineDecayWithOffset class handles cosine schedules with optional step offsets.

To configure a cosine schedule with linear warm-up:

params = {
    "learning_rate": {
        "type": "cosine",
        "cosine": {
            "initial_learning_rate": 0.1,  # illustrative peak LR; the cosine schedule needs one
            "decay_steps": 20000,
            "alpha": 0.0  # final LR fraction of initial
        }
    },
    "warmup": {
        "type": "linear",
        "linear": {
            "warmup_steps": 1000,
            "warmup_learning_rate": 0.001
        }
    }
}
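To make the roles of decay_steps and alpha concrete, here is a plain-Python sketch of the standard cosine decay formula. It is a stand-in written for illustration and ignores the step offset that CosineDecayWithOffset adds; the function name cosine_decay and the sample values are assumptions of this sketch.

```python
import math


def cosine_decay(step, initial_learning_rate, decay_steps, alpha=0.0):
    """Cosine-annealed learning rate; alpha is the final fraction of the initial LR."""
    step = min(step, decay_steps)  # hold the final value once decay has finished
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_learning_rate * ((1.0 - alpha) * cosine + alpha)


lr_start = cosine_decay(0, initial_learning_rate=0.1, decay_steps=20000)
lr_middle = cosine_decay(10000, initial_learning_rate=0.1, decay_steps=20000)
lr_end = cosine_decay(20000, initial_learning_rate=0.1, decay_steps=20000)
```

With alpha set to 0.0 the rate starts at the initial value, passes through half of it at the midpoint, and reaches zero at decay_steps.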

Linear and Polynomial Warm-Up

Warm-up gradually increases the learning rate from an initial value to the target schedule value over a specified number of steps. The LinearWarmup and PolynomialWarmup classes in lr_schedule.py implement these policies.

When build_learning_rate() detects a warmup config, it wraps the base schedule:

# Simplified paraphrase of the wrapping step inside
# OptimizerFactory.build_learning_rate(); attribute and variable names are illustrative.
if self._warmup_config:
    warmup_cls = WARMUP_CLS[self._warmup_config.type]
    lr_schedule = warmup_cls(lr_schedule, **warmup_params)
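The wrapping idea can be sketched in plain Python as a function that takes a base schedule and returns a new one. This sketch assumes the warm-up ramps linearly from warmup_learning_rate to the value the base schedule would have at warmup_steps; the names with_linear_warmup and constant_schedule are hypothetical, and the library's LinearWarmup class may differ in detail.

```python
def with_linear_warmup(base_schedule, warmup_steps, warmup_learning_rate=0.0):
    """Wrap a step -> LR callable with a linear ramp over the first warmup_steps."""
    # Value the ramp should reach when warm-up ends (assumption of this sketch).
    target = base_schedule(warmup_steps)

    def schedule(step):
        if step < warmup_steps:
            fraction = step / warmup_steps
            return warmup_learning_rate + fraction * (target - warmup_learning_rate)
        # After warm-up, defer entirely to the wrapped schedule.
        return base_schedule(step)

    return schedule


def constant_schedule(step):
    """Trivial base schedule used to keep the example easy to check."""
    return 0.1


lr_at = with_linear_warmup(constant_schedule, warmup_steps=1000, warmup_learning_rate=0.001)
```

Because the wrapper returns the same kind of callable it receives, the caller does not need to know whether warm-up is enabled.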

Building Optimizers with the Factory

The build_optimizer() method handles optimizer instantiation, version selection, and optional EMA wrapping.

Legacy vs. New Keras Optimizers

TensorFlow Models supports both legacy optimizers (TF 2.x stable) and new experimental optimizers (TF 2.11+). The factory selects the appropriate class based on the use_legacy_optimizer flag and the optimizer.type field.

Supported optimizers are mapped in optimizer_factory.py:

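  • Legacy: tf.keras.optimizers.legacy.SGD, tf.keras.optimizers.legacy.Adam, etc. (on TensorFlow releases where the original optimizers live under the legacy namespace).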
  • New: tf.keras.optimizers.experimental.SGD, tf.keras.optimizers.experimental.AdamW, etc.
  • Shared: LARS, LAMB, and other custom optimizers.
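A flag-driven choice between registries like these can be sketched as follows. The dictionaries and the pick_optimizer function are hypothetical placeholders, with strings standing in for optimizer classes so the example runs without TensorFlow; they are not the contents of optimizer_factory.py.

```python
# Entries available regardless of which API generation is requested.
SHARED = {"lars": "LARS", "lamb": "LAMB"}

# Each registry starts from its own entries and then includes the shared ones.
LEGACY = {"sgd": "legacy SGD", "adam": "legacy Adam", **SHARED}
NEW = {"sgd": "new SGD", "adam": "new Adam", **SHARED}


def pick_optimizer(optimizer_type, use_legacy_optimizer=True):
    """Select the registry by flag, then look up the requested optimizer type."""
    registry = LEGACY if use_legacy_optimizer else NEW
    if optimizer_type not in registry:
        raise ValueError(f"Unknown optimizer type: {optimizer_type!r}")
    return registry[optimizer_type]


legacy_choice = pick_optimizer("sgd")
new_choice = pick_optimizer("sgd", use_legacy_optimizer=False)
shared_choice = pick_optimizer("lamb", use_legacy_optimizer=False)
```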

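To build the Keras experimental AdamW optimizer with weight decay:

from official.modeling import optimization

# Assumes the "adamw_experimental" key selects tf.keras.optimizers.experimental.AdamW
# in your checkout; the plain "adamw" key may select a legacy weight-decay Adam instead.
params = {
    "optimizer": {
        "type": "adamw_experimental",
        "adamw_experimental": {"weight_decay": 0.01, "beta_1": 0.9, "beta_2": 0.999}
    },
    # The factory also needs a learning-rate type; a constant rate keeps this minimal.
    "learning_rate": {"type": "constant", "constant": {"learning_rate": 0.001}}
}

opt_factory = optimization.OptimizerFactory(optimization.OptimizationConfig(params))
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr, use_legacy_optimizer=False)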

Enabling Exponential Moving Average (EMA)

EMA maintains a shadow copy of model weights with exponential decay, often improving model generalization. To enable EMA, add the ema field to your config:

params = {
    "ema": {
        "average_decay": 0.9999,
        "trainable_weights_only": True
    }
}

When build_optimizer() detects the ema config, it wraps the base optimizer with ema_optimizer.ExponentialMovingAverage from official/modeling/optimization/ema_optimizer.py.
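The core EMA update is small enough to show directly. The sketch below applies the textbook rule shadow = decay * shadow + (1 - decay) * weight to plain Python lists; it is an illustration only and leaves out whatever extras the ExponentialMovingAverage wrapper provides. The function name ema_update and the sample numbers are assumptions of this sketch.

```python
def ema_update(shadow, weights, average_decay):
    """Blend current weights into the shadow copy with exponential decay."""
    return [
        average_decay * s + (1.0 - average_decay) * w
        for s, w in zip(shadow, weights)
    ]


# Two updates with an unrealistically small decay so the numbers are easy to follow.
shadow = [0.0, 0.0]
for weights in ([1.0, 2.0], [1.0, 2.0]):
    shadow = ema_update(shadow, weights, average_decay=0.5)
```

With a decay near 1.0, such as the 0.9999 used in the config above, the shadow copy moves far more slowly and smooths out step-to-step noise in the weights.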

Complete Training Examples

The following snippets demonstrate end-to-end usage patterns for common scenarios.

SGD with Stepwise Decay and Linear Warm-Up

This example uses the legacy optimizer API with a piecewise constant learning rate schedule.

from official.modeling import optimization

config_dict = {
    "optimizer": {
        "type": "sgd",
        "sgd": {"momentum": 0.9}
    },
    "learning_rate": {
        "type": "stepwise",
        "stepwise": {
            "boundaries": [10000, 20000],
            "values": [0.1, 0.01, 0.001]
        }
    },
    "warmup": {
        "type": "linear",
        "linear": {"warmup_steps": 500, "warmup_learning_rate": 0.01}
    }
}

opt_cfg = optimization.OptimizationConfig(config_dict)
factory = optimization.OptimizerFactory(opt_cfg)

learning_rate = factory.build_learning_rate()
optimizer = factory.build_optimizer(learning_rate)

# `model` is assumed to be an existing tf.keras.Model defined elsewhere.
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
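For reference, the boundaries and values in this config describe a piecewise constant lookup. The plain-Python sketch below follows the convention of Keras's PiecewiseConstantDecay, where the first value applies while step <= boundaries[0]; the function name stepwise_lr is hypothetical, and the sketch ignores warm-up and any step offset.

```python
from bisect import bisect_left


def stepwise_lr(step, boundaries, values):
    """Piecewise constant LR: values[i] applies up to and including boundaries[i]."""
    # bisect_left counts the boundaries strictly below `step`, which is the
    # index of the value that should be active at this step.
    return values[bisect_left(boundaries, step)]


boundaries = [10000, 20000]
values = [0.1, 0.01, 0.001]

lr_early = stepwise_lr(0, boundaries, values)
lr_after_first_drop = stepwise_lr(10001, boundaries, values)
lr_final = stepwise_lr(25000, boundaries, values)
```

Note that values needs exactly one more entry than boundaries, since N boundaries split training into N + 1 intervals.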

AdamW with Cosine Decay (New Optimizer API)

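This configuration targets tf.keras.optimizers.experimental.AdamW with cosine annealing.

import tensorflow as tf
from official.modeling import optimization

# Assumes the "adamw_experimental" key selects tf.keras.optimizers.experimental.AdamW
# in your checkout; the plain "adamw" key may select a legacy weight-decay Adam instead.
params = {
    "optimizer": {
        "type": "adamw_experimental",
        "adamw_experimental": {"weight_decay": 0.01}
    },
    "learning_rate": {
        "type": "cosine",
        # initial_learning_rate is an illustrative value; the cosine schedule needs one.
        "cosine": {"initial_learning_rate": 0.001, "decay_steps": 20000, "alpha": 0.0}
    },
    "warmup": {
        "type": "linear",
        "linear": {"warmup_steps": 1000}
    }
}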

opt_cfg = optimization.OptimizationConfig(params)
factory = optimization.OptimizerFactory(opt_cfg)

lr = factory.build_learning_rate()
optimizer = factory.build_optimizer(lr, use_legacy_optimizer=False)

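# `model` and `loss_fn` are assumed to be defined elsewhere
# (a tf.keras.Model and a loss callable, respectively).
@tf.function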
def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

Enabling EMA with SGD

This example adds exponential moving average shadow variables to a standard SGD optimizer.

from official.modeling import optimization

params = {
    "optimizer": {
        "type": "sgd",
        "sgd": {"momentum": 0.9}
    },
    "learning_rate": {"type": "constant", "constant": {"learning_rate": 0.01}},
    "ema": {
        "average_decay": 0.9999,
        "trainable_weights_only": True
    }
}

opt_cfg = optimization.OptimizationConfig(params)
factory = optimization.OptimizerFactory(opt_cfg)

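lr = factory.build_learning_rate()
optimizer = factory.build_optimizer(lr)

# `model` is assumed to be an existing, built tf.keras.Model. The EMA wrapper
# is assumed here to need shadow variables registered before training starts.
optimizer.shadow_copy(model)
model.compile(optimizer=optimizer, loss="categorical_crossentropy")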

Summary

  • The TensorFlow Models optimization module centralizes optimizer configuration through the OptimizationConfig dataclass and OptimizerFactory class.
  • Configuration files define optimizer type, learning-rate schedules (stepwise, cosine, polynomial), and optional warm-up policies without modifying training code.
  • Learning-rate schedules are constructed via OptimizerFactory.build_learning_rate(), which automatically wraps base schedules with warm-up classes from lr_schedule.py when configured.
  • Optimizer instantiation supports both legacy and new Keras APIs through build_optimizer(), with automatic EMA wrapping (implemented in official/modeling/optimization/ema_optimizer.py) when the ema config field is present.
  • This factory pattern enables rapid experimentation by changing JSON/YAML configs rather than Python training scripts.

Frequently Asked Questions

How do I switch between legacy and new Keras optimizers in the factory?

Pass the use_legacy_optimizer boolean argument to OptimizerFactory.build_optimizer(). When set to False, the factory selects classes from NEW_OPTIMIZERS_CLS (e.g., tf.keras.optimizers.experimental.Adam); when True, it uses LEGACY_OPTIMIZERS_CLS (e.g., tf.keras.optimizers.legacy.Adam). Both mappings and the flag are defined in official/modeling/optimization/optimizer_factory.py.

What learning-rate schedules are available in the optimization module?

The module supports stepwise, cosine, polynomial, exponential, power, and constant schedules. These are mapped to implementation classes in LR_CLS within optimizer_factory.py, with specific logic residing in official/modeling/optimization/lr_schedule.py. For example, CosineDecayWithOffset handles cosine annealing with step offsets.

Can I use warm-up without changing my training loop code?

Yes. Warm-up is configured entirely within the OptimizationConfig. When you include a warmup dictionary (specifying type: linear or type: polynomial), the OptimizerFactory.build_learning_rate() method automatically wraps your base learning-rate schedule with the corresponding warm-up class from WARMUP_CLS. No changes to the training loop are required.

How does EMA (Exponential Moving Average) integration work?

To enable EMA, add an ema field to your configuration dictionary containing average_decay and optionally trainable_weights_only. When OptimizerFactory.build_optimizer() detects this field, it wraps the constructed optimizer with ExponentialMovingAverage from official/modeling/optimization/ema_optimizer.py. This wrapper maintains shadow variables for model weights and updates them after each training step, which often improves generalization and removes the need for manual EMA logic in your training script.
