# How to Use the TensorFlow Models Optimization Module for Training

> Learn to use the TensorFlow Models optimization module for training with OptimizerFactory. Configure optimizers, learning-rate schedules, and warm-up policies efficiently.

- Repository: [tensorflow/models](https://github.com/tensorflow/models)
- Tags: how-to-guide
- Published: 2026-02-28

---

**The TensorFlow Models optimization module provides a config-driven factory pattern to create optimizers, learning-rate schedules, and warm-up policies through the `OptimizerFactory` class in [`official/modeling/optimization/optimizer_factory.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/optimizer_factory.py).**

The `tensorflow/models` repository includes an optimization framework under `official/modeling/optimization/` that decouples hyperparameter configuration from training logic. By using the `OptimizationConfig` dataclass and the `OptimizerFactory`, you can switch between **AdamW**, **SGD**, or **LARS** optimizers, toggle **cosine** or **stepwise** learning-rate schedules, and enable **exponential moving average (EMA)** without modifying your training loop code.

## Understanding the Optimization Module Architecture

The module follows a strict separation between configuration data and construction logic. This design allows JSON or YAML configs to drive the training process while the factory handles object instantiation.

### Configuration Layer (OptimizationConfig)

The top-level configuration is defined in [`official/modeling/optimization/configs/optimization_config.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/configs/optimization_config.py). The `OptimizationConfig` dataclass aggregates three sub-configs:

- **`optimizer`** – Specifies the optimizer type (e.g., `sgd`, `adamw`, `lamb`) and its hyper-parameters (momentum, weight decay, etc.), defined in [`official/modeling/optimization/configs/optimizer_config.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/configs/optimizer_config.py).
- **`learning_rate`** – Defines the schedule type (`stepwise`, `cosine`, `polynomial`, `exponential`) and schedule-specific parameters such as `decay_steps` or `boundaries`.
- **`warmup`** (optional) – Configures warm-up strategy (`linear`, `polynomial`) to stabilize early training steps.
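Because the configuration is plain nested data, it can be loaded from a JSON document before it reaches the factory. The sketch below is a minimal illustration using only the standard library; the `CONFIG_JSON` payload and the `load_optimization_params` helper are hypothetical names introduced here, not part of the repository.

```python
import json

# Hypothetical JSON payload; in practice this might be read from a file.
CONFIG_JSON = """
{
  "optimizer": {"type": "sgd", "sgd": {"momentum": 0.9}},
  "learning_rate": {
    "type": "stepwise",
    "stepwise": {"boundaries": [10000, 20000], "values": [0.1, 0.01, 0.001]}
  },
  "warmup": {
    "type": "linear",
    "linear": {"warmup_steps": 500, "warmup_learning_rate": 0.01}
  }
}
"""


def load_optimization_params(text: str) -> dict:
    """Parse JSON into the nested dict shape used by the three sub-configs."""
    params = json.loads(text)
    # Illustrative sanity check: each section names a type and carries a
    # block of hyperparameters under that same key.
    for section in ("optimizer", "learning_rate"):
        section_type = params[section]["type"]
        if section_type not in params[section]:
            raise ValueError(f"{section!r} is missing a {section_type!r} block")
    return params


params = load_optimization_params(CONFIG_JSON)
# With the Model Garden installed, this dict would then be passed on as
# optimization.OptimizationConfig(params).
```

Keeping the parsing step separate from the factory means the same training script can be pointed at a different config file without code changes.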

### Factory Layer (OptimizerFactory)

The `OptimizerFactory` class in [`official/modeling/optimization/optimizer_factory.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/optimizer_factory.py) consumes an `OptimizationConfig` instance and provides two primary methods:

1. **`build_learning_rate()`** – Constructs the learning-rate schedule and optionally wraps it with a warm-up schedule using mappings defined in [`official/modeling/optimization/lr_schedule.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/lr_schedule.py).
2. **`build_optimizer(lr)`** – Instantiates the optimizer, handling legacy vs. new Keras optimizer APIs, gradient clipping, and EMA wrapping via [`official/modeling/optimization/ema_optimizer.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/ema_optimizer.py).
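The two-method flow above can be illustrated with a toy, dependency-free factory. Everything in this sketch (`ToyConstantSchedule`, `ToySGD`, `ToyOptimizerFactory`, and the `TOY_*` registries) is a hypothetical stand-in meant to show the registry-lookup pattern; it is not the repository's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyConstantSchedule:
    """Hypothetical stand-in for a learning-rate schedule class."""
    learning_rate: float

    def __call__(self, step: int) -> float:
        return self.learning_rate


@dataclass
class ToySGD:
    """Hypothetical stand-in for an optimizer class."""
    learning_rate: Callable[[int], float]
    momentum: float = 0.0


# Registries map the config's `type` string to the class to construct.
TOY_LR_CLS: Dict[str, type] = {"constant": ToyConstantSchedule}
TOY_OPTIMIZER_CLS: Dict[str, type] = {"sgd": ToySGD}


class ToyOptimizerFactory:
    """Reads a nested config dict and builds objects from the registries."""

    def __init__(self, params: dict):
        lr_params = params["learning_rate"]
        opt_params = params["optimizer"]
        self._lr_type = lr_params["type"]
        self._lr_kwargs = lr_params[self._lr_type]
        self._opt_type = opt_params["type"]
        self._opt_kwargs = opt_params[self._opt_type]

    def build_learning_rate(self):
        return TOY_LR_CLS[self._lr_type](**self._lr_kwargs)

    def build_optimizer(self, lr):
        return TOY_OPTIMIZER_CLS[self._opt_type](learning_rate=lr, **self._opt_kwargs)


toy_factory = ToyOptimizerFactory({
    "optimizer": {"type": "sgd", "sgd": {"momentum": 0.9}},
    "learning_rate": {"type": "constant", "constant": {"learning_rate": 0.01}},
})
toy_lr = toy_factory.build_learning_rate()
toy_optimizer = toy_factory.build_optimizer(toy_lr)
```

The design point is that adding a new schedule or optimizer only requires a new registry entry; the calling code stays the same.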

## Creating Learning Rate Schedules and Warm-Up Policies

The factory supports multiple decay strategies through the `LR_CLS` mapping in [`optimizer_factory.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/optimizer_factory.py). You can combine any schedule with a warm-up wrapper from the `WARMUP_CLS` mapping.

### Stepwise and Cosine Decay Schedules

**Stepwise decay** reduces the learning rate at specific step boundaries, while **cosine decay** applies a smooth cosine annealing curve. In [`official/modeling/optimization/lr_schedule.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/lr_schedule.py), the `CosineDecayWithOffset` class handles cosine schedules with optional step offsets.

To configure a cosine schedule with linear warm-up:

```python
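params = {
    "learning_rate": {
        "type": "cosine",
        "cosine": {
            "initial_learning_rate": 0.01,  # illustrative peak LR; tune for your model
            "decay_steps": 20000,
            "alpha": 0.0  # final LR as a fraction of the initial LR
        }
    },
    "warmup": {
        "type": "linear",
        "linear": {
            "warmup_steps": 1000,
            "warmup_learning_rate": 0.001  # LR at the first warm-up step
        }
    }
}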
```
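To make the roles of `decay_steps` and `alpha` concrete, the following sketch evaluates the standard cosine decay formula in plain Python. The `cosine_decay` function and the initial rate of `0.1` are illustrative assumptions; the sketch ignores the step offset that `CosineDecayWithOffset` supports.

```python
import math


def cosine_decay(step: int, initial_lr: float, decay_steps: int, alpha: float = 0.0) -> float:
    """Evaluate the standard cosine decay formula at `step`."""
    step = min(step, decay_steps)  # hold the final value once decay finishes
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    # `alpha` sets the floor as a fraction of the initial learning rate.
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


lr_start = cosine_decay(0, initial_lr=0.1, decay_steps=20000)
lr_middle = cosine_decay(10000, initial_lr=0.1, decay_steps=20000)
lr_end = cosine_decay(20000, initial_lr=0.1, decay_steps=20000)
lr_floor = cosine_decay(30000, initial_lr=0.1, decay_steps=20000, alpha=0.1)
```

With `alpha=0.0` the rate falls from the initial value to zero over `decay_steps`; a nonzero `alpha` leaves a floor of `alpha * initial_lr`.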

### Linear and Polynomial Warm-Up

Warm-up gradually increases the learning rate from an initial value to the target schedule value over a specified number of steps. The `LinearWarmup` and `PolynomialWarmup` classes in [`lr_schedule.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/lr_schedule.py) implement these policies.

When `build_learning_rate()` detects a `warmup` config, it wraps the base schedule:

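```python
# Simplified sketch of the wrapping step inside
# OptimizerFactory.build_learning_rate(); attribute names here are
# illustrative and may differ from the actual source.
if self._warmup_config:
    warmup_cls = WARMUP_CLS[self._warmup_type]
    lr_schedule = warmup_cls(lr_schedule, **self._warmup_config.as_dict())
```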
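The wrapping idea can be sketched without TensorFlow as a function that takes a base schedule and returns a new one. This is a minimal sketch that assumes the ramp ends at the base schedule's value at `warmup_steps`; `with_linear_warmup` and `constant_base` are hypothetical names, not the repository's `LinearWarmup` class.

```python
from typing import Callable


def with_linear_warmup(
    base_schedule: Callable[[int], float],
    warmup_steps: int,
    warmup_learning_rate: float,
) -> Callable[[int], float]:
    """Return a schedule that ramps linearly into `base_schedule`."""
    # Assumption: the ramp targets the base schedule's value at the end of warm-up.
    target_lr = base_schedule(warmup_steps)

    def schedule(step: int) -> float:
        if step < warmup_steps:
            fraction = step / warmup_steps
            return warmup_learning_rate + fraction * (target_lr - warmup_learning_rate)
        return base_schedule(step)

    return schedule


def constant_base(step: int) -> float:
    """Hypothetical base schedule with a fixed learning rate."""
    return 0.1


warm = with_linear_warmup(constant_base, warmup_steps=1000, warmup_learning_rate=0.001)
```

Because the wrapper has the same call signature as the schedule it wraps, the optimizer does not need to know whether warm-up is active.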

## Building Optimizers with the Factory

The `build_optimizer()` method handles optimizer instantiation, version selection, and optional EMA wrapping.

### Legacy vs. New Keras Optimizers

TensorFlow Models supports both the legacy Keras optimizers and the newer optimizer implementations that became the Keras default in TF 2.11. The factory selects the class from the `use_legacy_optimizer` argument of `build_optimizer()` and the `optimizer.type` field.

Supported optimizers are mapped in [`optimizer_factory.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/optimizer_factory.py):

- **Legacy**: `tf.keras.optimizers.legacy.SGD`, `tf.keras.optimizers.legacy.Adam`, etc.
- **New**: `tf.keras.optimizers.experimental.SGD`, `tf.keras.optimizers.experimental.AdamW`, etc.
- **Shared**: `LARS`, `LAMB`, and other custom optimizers.
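The selection logic implied by the three groups above can be sketched as a registry lookup. The dictionaries below hold placeholder strings rather than real classes, and `select_optimizer_cls` is a hypothetical helper; the sketch only assumes that shared entries are available regardless of the flag.

```python
# Placeholder registries; the real mappings hold optimizer classes.
TOY_LEGACY = {"sgd": "legacy SGD", "adam": "legacy Adam"}
TOY_NEW = {"sgd": "new SGD", "adam": "new Adam"}
TOY_SHARED = {"lars": "LARS", "lamb": "LAMB"}


def select_optimizer_cls(optimizer_type: str, use_legacy_optimizer: bool = True) -> str:
    """Pick a registry from the flag, then look up the requested type."""
    registry = dict(TOY_LEGACY if use_legacy_optimizer else TOY_NEW)
    registry.update(TOY_SHARED)  # shared entries are visible on both paths
    if optimizer_type not in registry:
        raise ValueError(f"Unsupported optimizer type: {optimizer_type!r}")
    return registry[optimizer_type]


legacy_choice = select_optimizer_cls("sgd", use_legacy_optimizer=True)
new_choice = select_optimizer_cls("sgd", use_legacy_optimizer=False)
shared_choice = select_optimizer_cls("lars", use_legacy_optimizer=False)
```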

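To request AdamW with weight decay through the non-legacy path (the `type` key and hyperparameter names must match the fields declared in `optimizer_config.py` for your checkout):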

```python
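from official.modeling import optimization

params = {
    "optimizer": {
        "type": "adamw",
        "adamw": {"weight_decay": 0.01, "beta_1": 0.9, "beta_2": 0.999}
    },
    # build_learning_rate() needs a learning-rate type; a constant rate is
    # used here purely for illustration.
    "learning_rate": {"type": "constant", "constant": {"learning_rate": 0.001}}
}

opt_factory = optimization.OptimizerFactory(optimization.OptimizationConfig(params))
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr, use_legacy_optimizer=False)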
```

### Enabling Exponential Moving Average (EMA)

EMA maintains a shadow copy of model weights with exponential decay, often improving model generalization. To enable EMA, add the `ema` field to your config:

```python
params = {
    "ema": {
        "average_decay": 0.9999,
        "trainable_weights_only": True
    }
}
```

When `build_optimizer()` detects the `ema` config, it wraps the base optimizer with `ema_optimizer.ExponentialMovingAverage` from [`official/modeling/optimization/ema_optimizer.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/ema_optimizer.py).
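The shadow-weight idea behind EMA reduces to a one-line update per variable. The sketch below applies the plain exponential-moving-average rule to Python floats; `ema_update` is a hypothetical helper, and the repository's wrapper may differ in detail (for example in when averaging starts or how the decay is adjusted).

```python
from typing import List


def ema_update(shadow: List[float], weights: List[float], average_decay: float) -> List[float]:
    """Apply one EMA step: shadow <- decay * shadow + (1 - decay) * weight."""
    return [
        average_decay * s + (1.0 - average_decay) * w
        for s, w in zip(shadow, weights)
    ]


# Track a single weight that jumps to 1.0 and stays there.
shadow_weights = [0.0]
shadow_weights = ema_update(shadow_weights, [1.0], average_decay=0.9)
after_one_step = shadow_weights[0]
shadow_weights = ema_update(shadow_weights, [1.0], average_decay=0.9)
after_two_steps = shadow_weights[0]
```

A decay close to 1, such as the `0.9999` used in the config, makes the shadow weights change slowly and smooths out step-to-step noise in the trained weights.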

## Complete Training Examples

The following snippets demonstrate end-to-end usage patterns for common scenarios.

### SGD with Stepwise Decay and Linear Warm-Up

This example uses the legacy optimizer API with a piecewise constant learning rate schedule.

```python
from official.modeling import optimization

config_dict = {
    "optimizer": {
        "type": "sgd",
        "sgd": {"momentum": 0.9}
    },
    "learning_rate": {
        "type": "stepwise",
        "stepwise": {
            "boundaries": [10000, 20000],
            "values": [0.1, 0.01, 0.001]
        }
    },
    "warmup": {
        "type": "linear",
        "linear": {"warmup_steps": 500, "warmup_learning_rate": 0.01}
    }
}

opt_cfg = optimization.OptimizationConfig(config_dict)
factory = optimization.OptimizerFactory(opt_cfg)

learning_rate = factory.build_learning_rate()
optimizer = factory.build_optimizer(learning_rate)

# `model` is assumed to be an existing tf.keras.Model defined elsewhere.
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```
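The `boundaries` and `values` lists in a stepwise schedule can be traced by hand with a small lookup. This sketch assumes the piecewise-constant convention in which a step equal to a boundary still receives the earlier value, and it ignores the warm-up phase; `stepwise_lr` is a hypothetical helper.

```python
from bisect import bisect_left
from typing import Sequence


def stepwise_lr(step: int, boundaries: Sequence[int], values: Sequence[float]) -> float:
    """Return the piecewise-constant learning rate for `step`."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("`values` needs exactly one more entry than `boundaries`")
    # bisect_left counts the boundaries strictly below `step`, which is the
    # index of the interval that `step` falls into.
    return values[bisect_left(boundaries, step)]


boundaries = [10000, 20000]
values = [0.1, 0.01, 0.001]
lr_early = stepwise_lr(5000, boundaries, values)
lr_at_first_boundary = stepwise_lr(10000, boundaries, values)
lr_mid = stepwise_lr(10001, boundaries, values)
lr_late = stepwise_lr(25000, boundaries, values)
```

Note that `values` always has one more entry than `boundaries`, since two boundaries split training into three intervals.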

### AdamW with Cosine Decay (New Optimizer API)

This configuration requests AdamW through the non-legacy optimizer path, combined with cosine annealing and linear warm-up.

```python
import tensorflow as tf

from official.modeling import optimization

params = {
    "optimizer": {
        "type": "adamw",
        "adamw": {"weight_decay": 0.01}
    },
    "learning_rate": {
        "type": "cosine",
        "cosine": {
            "initial_learning_rate": 0.001,  # illustrative peak LR; tune for your model
            "decay_steps": 20000,
            "alpha": 0.0
        }
    },
    "warmup": {
        "type": "linear",
        "linear": {"warmup_steps": 1000}
    }
}

opt_cfg = optimization.OptimizationConfig(params)
factory = optimization.OptimizerFactory(opt_cfg)

lr = factory.build_learning_rate()
optimizer = factory.build_optimizer(lr, use_legacy_optimizer=False)

# `model` and `loss_fn` are assumed to be defined elsewhere, e.g. a
# tf.keras.Model and a tf.keras.losses instance.
@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = loss_fn(labels, logits)
    # Compute gradients and apply one optimizer update.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

### Enabling EMA with SGD

This example adds exponential moving average shadow variables to a standard SGD optimizer.

```python
from official.modeling import optimization

params = {
    "optimizer": {
        "type": "sgd",
        "sgd": {"momentum": 0.9}
    },
    "learning_rate": {"type": "constant", "constant": {"learning_rate": 0.01}},
    "ema": {
        "average_decay": 0.9999,
        "trainable_weights_only": True
    }
}

opt_cfg = optimization.OptimizationConfig(params)
factory = optimization.OptimizerFactory(opt_cfg)

lr = factory.build_learning_rate()
optimizer = factory.build_optimizer(lr)

# `model` is assumed to be an existing tf.keras.Model defined elsewhere.
model.compile(optimizer=optimizer, loss="categorical_crossentropy")
```

## Summary

- The **TensorFlow Models optimization module** centralizes optimizer configuration through the `OptimizationConfig` dataclass and `OptimizerFactory` class.
- **Configuration files** define optimizer type, learning-rate schedules (stepwise, cosine, polynomial), and optional warm-up policies without modifying training code.
- **Learning-rate schedules** are constructed via `OptimizerFactory.build_learning_rate()`, which automatically wraps base schedules with warm-up classes from [`lr_schedule.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/lr_schedule.py) when configured.
- **Optimizer instantiation** supports both legacy and new Keras APIs through `build_optimizer()`, which automatically wraps the optimizer with `ExponentialMovingAverage` from [`official/modeling/optimization/ema_optimizer.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/ema_optimizer.py) when the `ema` config field is present.
- This factory pattern enables rapid experimentation by changing JSON/YAML configs rather than Python training scripts.

## Frequently Asked Questions

### How do I switch between legacy and new Keras optimizers in the factory?

Pass the `use_legacy_optimizer` boolean argument to `OptimizerFactory.build_optimizer()`. When set to `False`, the factory selects classes from `NEW_OPTIMIZERS_CLS` (e.g., `tf.keras.optimizers.experimental.AdamW`); when `True`, it uses `LEGACY_OPTIMIZERS_CLS` (e.g., `tf.keras.optimizers.legacy.Adam`). This flag is defined in [`official/modeling/optimization/optimizer_factory.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/optimizer_factory.py).

### What learning-rate schedules are available in the optimization module?

The module supports `stepwise`, `cosine`, `polynomial`, `exponential`, `power`, and `constant` schedules. These are mapped to implementation classes in `LR_CLS` within [`optimizer_factory.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/optimizer_factory.py), with specific logic residing in [`official/modeling/optimization/lr_schedule.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/lr_schedule.py). For example, `CosineDecayWithOffset` handles cosine annealing with step offsets.

### Can I use warm-up without changing my training loop code?

Yes. Warm-up is configured entirely within the `OptimizationConfig`. When you include a `warmup` dictionary (specifying `type: linear` or `type: polynomial`), the `OptimizerFactory.build_learning_rate()` method automatically wraps your base learning-rate schedule with the corresponding warm-up class from `WARMUP_CLS`. No changes to the training loop are required.

### How does EMA (Exponential Moving Average) integration work?

To enable EMA, add an `ema` field to your configuration dictionary containing `average_decay` and optionally `trainable_weights_only`. When `OptimizerFactory.build_optimizer()` detects this field, it wraps the constructed optimizer with `ExponentialMovingAverage` from [`official/modeling/optimization/ema_optimizer.py`](https://github.com/tensorflow/models/blob/main/official/modeling/optimization/ema_optimizer.py). This wrapper maintains shadow variables for model weights and updates them after each training step, which often improves generalization without requiring manual EMA logic in your training script.