# How to Create Custom Training Loops with Controller and Trainer in TensorFlow Models

> Create custom training loops in TensorFlow Models. Subclass Controller to modify sampling/steps or extend Trainer to override orchestration for greater flexibility.

- Repository: [tensorflow/models](https://github.com/tensorflow/models)
- Tags: how-to-guide
- Published: 2026-02-28

---

**To implement custom training loops with Controller and Trainer in the tensorflow/models repository, subclass the Controller class to modify episode sampling or training steps, or extend the Trainer class to override the orchestration flow while retaining the underlying factory methods for graph construction and environment initialization.**

The `Controller` and `Trainer` classes in the `research/pcl_rl` package provide the training infrastructure for that package's reinforcement learning experiments. Together they cover environment interaction, replay buffer management, graph construction, and distributed training coordination. Extending these classes lets you implement custom training workflows (for example custom stopping criteria, alternative exploration strategies, or multi-objective optimization) without rewriting the underlying TensorFlow operations.

## Understanding the Core Components

### Controller: The Training Step Orchestrator

The **Controller** class, defined in [`research/pcl_rl/controller.py`](https://github.com/tensorflow/models/blob/main/research/pcl_rl/controller.py), manages the inner loop of reinforcement learning training. It samples episodes from the environment, converts observations into the format expected by your model, manages replay buffer insertion, and executes the model's training steps. The Controller also provides a greedy-policy evaluation routine for validation purposes.

Key responsibilities include calling `sample_episodes` (which invokes the internal `_sample_episodes` method) to collect experience batches, optionally adding data to a replay buffer via `add_to_replay_buffer`, and invoking the internal `_train` method to perform either trust-region optimization (`model.trust_region_step`) or standard gradient descent (`model.train_step`).
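The choice between the two update methods can be pictured with a small stand-alone sketch. `StubModel` and `train_once` are hypothetical stand-ins, not code from `controller.py`; they only illustrate a single flag selecting between the two model methods named above.

```python
class StubModel:
    """Hypothetical stand-in for the pcl_rl model; it only records calls."""

    def __init__(self):
        self.calls = []

    def train_step(self, batch):
        self.calls.append("train_step")
        return 0.5  # placeholder loss value

    def trust_region_step(self, batch):
        self.calls.append("trust_region_step")
        return 0.25  # placeholder loss value


def train_once(model, batch, use_trust_region):
    """Pick the update method based on a flag, as the text describes for `_train`."""
    if use_trust_region:
        return model.trust_region_step(batch)
    return model.train_step(batch)


model = StubModel()
train_once(model, batch=[1, 2, 3], use_trust_region=False)
train_once(model, batch=[1, 2, 3], use_trust_region=True)
print(model.calls)
```

A `Controller` subclass that overrides `_train` would replace this dispatch with its own update logic.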

### Trainer: The High-Level Training Driver

The **Trainer** class, located in [`research/pcl_rl/trainer.py`](https://github.com/tensorflow/models/blob/main/research/pcl_rl/trainer.py), serves as the entry point for training jobs. It builds the TensorFlow graph, creates `GymWrapper` environment instances, instantiates a `Controller` through factory methods, and executes the standard training loop. The Trainer also manages checkpointing, supports multi-replica training via Supervisor, and handles hyper-parameter configuration through `tf.flags`.

During initialization, `Trainer.__init__` reads the already-parsed command-line flag values, constructs the environment, and stores hyper-parameters in `self.hparams`. The `Trainer.get_controller` method then constructs a `Controller` instance, passing factory callables such as `get_model`, `get_replay_buffer`, and `get_buffer_seeds`.
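Passing factories as callables is what makes the later customizations cheap: a subclass overrides one factory method and the controller picks up the change. The sketch below shows the pattern with hypothetical classes (`TinyTrainer`, `TinyController`, `FifoTrainer`) and string placeholders instead of real models and buffers.

```python
class TinyController:
    """Hypothetical controller that builds its parts from injected factories."""

    def __init__(self, get_model, get_replay_buffer):
        self.model = get_model()
        self.replay_buffer = get_replay_buffer()


class TinyTrainer:
    """Hypothetical trainer exposing overridable factory methods."""

    def get_model(self):
        return "default-model"

    def get_replay_buffer(self):
        return "prioritized-buffer"

    def get_controller(self):
        # Bound methods are passed as callables, so subclass overrides are used.
        return TinyController(self.get_model, self.get_replay_buffer)


class FifoTrainer(TinyTrainer):
    """Overrides a single factory; everything else is inherited."""

    def get_replay_buffer(self):
        return "fifo-buffer"


controller = FifoTrainer().get_controller()
print(controller.model, controller.replay_buffer)
```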

## How Controller and Trainer Interact

The standard execution flow follows four distinct phases that connect these two classes:

1. **Initialization**: `Trainer.__init__` reads the parsed flag values and builds the `GymWrapper` environment.
2. **Controller Construction**: `Trainer.get_controller` instantiates the `Controller`, injecting factory methods for model and buffer creation.
3. **Session Setup**: `Trainer.run` creates the TensorFlow session, calls `controller.setup(train=True)` to build the model graph, and initializes variables.
4. **Training Loop**: The trainer repeatedly calls `controller.train(sess)`, which internally executes `sample_episodes`, manages the replay buffer, and runs `_train` to update model parameters. The Controller may also update the relative-entropy coefficient (`eps_lambda`) if configured.
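The four phases above can be traced with stub classes. `MiniTrainer` and `RecordingController` are hypothetical stand-ins that only log the order of calls; no TensorFlow is involved, and the real `Trainer.run` does more than this (checkpointing, for instance).

```python
class RecordingController:
    """Hypothetical controller stub that logs the order of calls."""

    def __init__(self, log):
        self.log = log

    def setup(self, train=True):
        self.log.append("setup(train=%s)" % train)

    def train(self, sess):
        # The real method samples episodes, updates the buffer, then runs _train.
        self.log.append("train")
        # Placeholder values for: loss, summary, total_rewards, episode_rewards.
        return 0.0, None, [1.0], [1.0]


class MiniTrainer:
    """Hypothetical trainer stub mirroring the four phases."""

    def __init__(self, num_steps):
        self.log = ["init"]                  # phase 1
        self.num_steps = num_steps

    def get_controller(self):
        self.log.append("get_controller")    # phase 2
        return RecordingController(self.log)

    def run(self):
        controller = self.get_controller()
        controller.setup(train=True)         # phase 3
        sess = object()                      # placeholder for a session
        for _ in range(self.num_steps):      # phase 4
            controller.train(sess)
        return self.log


print(MiniTrainer(num_steps=2).run())
```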

Because the training logic resides within `Controller.train`, you can intercept or replace behavior at multiple points while maintaining access to low-level tensors such as `model.global_step` and `model.inc_global_step`.

## Approaches to Customization

You can create custom training loops using three primary extension strategies. Each approach provides different levels of control over the training process.

**Subclass Controller** to modify how episodes are sampled or how training steps are executed. Override `sample_episodes`, `_train`, or `train` to implement custom exploration strategies, update schedules, or data processing pipelines.

**Subclass Trainer** to modify the outer training orchestration while reusing existing factory methods. Override `run` to implement custom stopping criteria, validation schedules, or multi-phase training regimens.

**Bypass Trainer.run Entirely** to implement completely custom loops. Instantiate `Trainer` to access its factory methods and hyper-parameters, then manually create a session and invoke `controller.train(sess)` within your own loop structure.

## Implementation Examples

### Example 1: Custom Loop Without Trainer.run

This approach uses the `Trainer` class for environment and model setup, but replaces the standard training loop with custom logic that stops after a fixed number of training iterations or as soon as a reward threshold is reached.

```python
# custom_loop.py
# Trainer() reads FLAGS, so the flags must have been parsed before this point
# (for example by running the script through tf.app.run).

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.Session)
from research.pcl_rl.trainer import Trainer

# Build the trainer to access its factories and hyper-parameters.
trainer = Trainer()

with tf.Graph().as_default():
    controller = trainer.get_controller(trainer.env)
    controller.setup(train=True)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        max_iterations = 500
        for iteration in range(1, max_iterations + 1):
            # Each call samples a batch of episodes and applies one update.
            loss, summary, total_rewards, episode_rewards = controller.train(sess)

            # Average with NumPy; building tf.reduce_mean here would add a new
            # op to the graph on every iteration.
            avg_reward = np.mean(total_rewards)

            # Early stop when the average reward exceeds the threshold.
            if avg_reward > 200.0:
                print('Target reward reached, stopping.')
                break

            if iteration % 50 == 0:
                print(f'Iteration {iteration}: avg reward = {avg_reward}')
```

This pattern retains the `Trainer`'s configuration parsing and factory methods while allowing arbitrary loop logic, custom metrics logging, or conditional checkpointing.
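The loop above compares a single batch's mean reward against the threshold, which can trigger on one lucky batch. A common alternative is to require a moving average over several batches. The helper below is a minimal sketch of that idea; `RewardWindow` is a hypothetical name and is not part of `pcl_rl`.

```python
from collections import deque


class RewardWindow:
    """Moving average over the last `size` batch rewards (hypothetical helper)."""

    def __init__(self, size, target):
        self.values = deque(maxlen=size)
        self.target = target

    def update(self, batch_mean_reward):
        """Record one value; return True once a full window averages >= target."""
        self.values.append(float(batch_mean_reward))
        full = len(self.values) == self.values.maxlen
        return full and sum(self.values) / len(self.values) >= self.target


window = RewardWindow(size=3, target=200.0)
decisions = [window.update(r) for r in [250.0, 100.0, 210.0, 220.0, 230.0]]
print(decisions)
```

In a custom loop, the threshold comparison could then become `if window.update(np.mean(total_rewards)): break`.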

### Example 2: Custom Sampling Strategy via Controller Subclass

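To experiment with epsilon-greedy exploration instead of the default policy sampling, extend the `Controller` class and override the `_sample_episodes` method. The sketch below takes the simplest route and perturbs the actions after the base class has sampled them.

```python
# my_controller.py

import numpy as np
from research.pcl_rl import controller as base


class EpsilonGreedyController(base.Controller):
    """Controller that randomly perturbs a fraction of the sampled actions."""

    def __init__(self, *args, epsilon=0.1, **kwargs):
        super(EpsilonGreedyController, self).__init__(*args, **kwargs)
        self.epsilon = epsilon

    def _sample_episodes(self, sess, greedy=False):
        init_state, obs, act, rew, pads = super(
            EpsilonGreedyController, self)._sample_episodes(sess, greedy=greedy)

        # Caveat: the environment has already been stepped with the original
        # actions, so overwriting them here leaves `act` out of sync with `obs`
        # and `rew`. Faithful epsilon-greedy has to swap the action before the
        # environment step, inside the sampling loop.
        # Assumptions to verify against controller.py and env_spec.py: each
        # entry of `act` is a NumPy array, and `sample_random_actions` (not a
        # confirmed EnvSpec method) returns an array of the same shape.
        for a in act:
            mask = np.random.rand(*a.shape) < self.epsilon
            random_actions = self.env_spec.sample_random_actions(a.shape[0])
            a[mask] = random_actions[mask]
        return init_state, obs, act, rew, pads
```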

To use this custom controller, override `get_controller` in your trainer subclass:

```python
# custom_trainer.py

from research.pcl_rl.trainer import Trainer
from my_controller import EpsilonGreedyController

class CustomTrainer(Trainer):
    def get_controller(self, env):
        return EpsilonGreedyController(
            env, self.env_spec, self.internal_dim,
            use_online_batch=self.use_online_batch,
            batch_by_steps=self.batch_by_steps,
            unify_episodes=self.unify_episodes,
            replay_batch_size=self.replay_batch_size,
            max_step=self.max_step,
            cutoff_agent=self.cutoff_agent,
            save_trajectories_file=self.save_trajectories_file,
            use_trust_region=self.trust_region_p,
            use_value_opt=self.value_opt not in [None, 'None'],
            update_eps_lambda=self.update_eps_lambda,
            prioritize_by=self.prioritize_by,
            get_model=self.get_model,
            get_replay_buffer=self.get_replay_buffer,
            get_buffer_seeds=self.get_buffer_seeds,
            epsilon=0.05)
```
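Epsilon-greedy exploration is usually applied at the moment an action is chosen, before the environment is stepped. The following library-independent sketch shows that selection rule for a discrete action space. `epsilon_greedy` and its parameters are illustrative names and are not part of `pcl_rl`.

```python
import random


def epsilon_greedy(policy_action, num_actions, epsilon, rng=random):
    """Return a uniformly random action with probability epsilon,
    otherwise keep the action proposed by the policy."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return policy_action


rng = random.Random(0)  # seeded generator for repeatable behaviour

# epsilon=0.0 always keeps the policy's action.
kept = [epsilon_greedy(2, num_actions=4, epsilon=0.0, rng=rng) for _ in range(10)]

# epsilon=1.0 always draws a random valid action.
explored = [epsilon_greedy(2, num_actions=4, epsilon=1.0, rng=rng) for _ in range(10)]

print(kept, explored)
```

Applying this rule inside a `Controller` subclass means intercepting the sampled action within the sampling loop, so the stored trajectory records the action that was actually executed.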

### Example 3: Reward-Based Early Stopping via Trainer Subclass

To stop training when the environment reaches a target reward threshold, override the `run` method in your `Trainer` subclass.

```python
# early_stop_trainer.py

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.Session)
from research.pcl_rl.trainer import Trainer


class EarlyStopTrainer(Trainer):
    # Hypothetical checkpoint prefix; point it at your own location.
    checkpoint_path = '/tmp/pcl_rl_early_stop.ckpt'

    def run(self):
        target_reward = 250.0

        # Simplified single-process setup: no Supervisor is created here.
        self.sv = None
        self.controller = self.get_controller(self.env)
        self.controller.setup(train=True)
        saver = tf.train.Saver()
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        global_step = self.controller.model.global_step

        while True:
            loss, summary, total_rewards, episode_rewards = self.controller.train(self.sess)
            avg_reward = np.mean(total_rewards)
            step = self.sess.run(global_step)
            print(f'Step {step}, avg reward {avg_reward:.2f}')

            if avg_reward >= target_reward:
                print('Reached target reward, exiting.')
                break

        # Final checkpoint save.
        saver.save(self.sess, self.checkpoint_path, global_step=global_step)
```

## Common Customization Scenarios

**Change the stopping criterion**: Override `Trainer.run` and replace the standard `for step in xrange(1 + self.num_steps):` loop with a `while` loop that checks reward-based or convergence conditions.

**Use a different sampling policy**: Subclass `Controller` and replace the call to `self.model.sample_step` in `_sample_episodes` with custom logic for epsilon-greedy, Boltzmann exploration, or learned exploration bonuses.

**Add extra logging or metrics**: Insert additional TensorBoard summaries or print statements in `Controller.train` after the `_train` call, or within your custom training loop by running `self.sess.run(custom_summary_op)`.

**Swap the replay buffer implementation**: Override `Trainer.get_replay_buffer` to return a different buffer class, such as a simple FIFO buffer or a novel prioritization scheme.

**Combine multiple objectives**: Extend `Trainer.get_objective` to instantiate a composite objective that wraps existing PCL, A3C, or custom loss functions for multi-task learning.
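One way to picture a composite objective is a weighted sum of individual loss terms. The sketch below uses plain Python callables on lists of numbers. `CompositeLoss` and both toy losses are hypothetical, and a real override of `get_objective` would have to return whatever object the model expects and combine TensorFlow tensors instead.

```python
class CompositeLoss:
    """Hypothetical wrapper: weighted sum of several scalar loss callables."""

    def __init__(self, weighted_losses):
        # weighted_losses: iterable of (weight, callable) pairs.
        self.weighted_losses = list(weighted_losses)

    def __call__(self, batch):
        return sum(weight * fn(batch) for weight, fn in self.weighted_losses)


def mean_squared(batch):
    """Toy loss: mean of squared values."""
    return sum(x * x for x in batch) / len(batch)


def mean_abs(batch):
    """Toy loss: mean of absolute values."""
    return sum(abs(x) for x in batch) / len(batch)


loss = CompositeLoss([(0.5, mean_squared), (2.0, mean_abs)])
print(loss([1.0, -2.0, 3.0]))
```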

## Key Source Files

The following files constitute the complete training stack in the `research/pcl_rl` package:

- [`research/pcl_rl/controller.py`](https://github.com/tensorflow/models/blob/main/research/pcl_rl/controller.py): Core class handling episode sampling, replay buffer interaction, and per-batch training steps.
- [`research/pcl_rl/trainer.py`](https://github.com/tensorflow/models/blob/main/research/pcl_rl/trainer.py): High-level orchestration, graph construction, and distributed training support.
- [`research/pcl_rl/model.py`](https://github.com/tensorflow/models/blob/main/research/pcl_rl/model.py): Policy and value network definitions, target network lag, and low-level training methods (`train_step`, `trust_region_step`).
- [`research/pcl_rl/replay_buffer.py`](https://github.com/tensorflow/models/blob/main/research/pcl_rl/replay_buffer.py): Prioritized experience replay implementation used when `FLAGS.replay_buffer_freq > 0`.
- [`research/pcl_rl/gym_wrapper.py`](https://github.com/tensorflow/models/blob/main/research/pcl_rl/gym_wrapper.py): Batched environment wrapper for OpenAI Gym interfaces.

## Summary

- **Controller** manages the inner training loop including episode sampling, replay buffer management, and model updates via `train_step` or `trust_region_step`.
- **Trainer** handles outer-loop orchestration, graph construction, and hyper-parameter management through factory methods like `get_controller` and `get_model`.
- Subclass **Controller** to modify sampling strategies (`_sample_episodes`) or training procedures (`_train`).
- Subclass **Trainer** to implement custom stopping criteria, validation schedules, or multi-phase training workflows by overriding `run`.
- Reach low-level training tensors such as `model.global_step` through the Controller's `model` attribute, and control which replay buffer is built through `Trainer.get_replay_buffer`.

## Frequently Asked Questions

### How do I access the model's global step tensor from a custom training loop?

The global step tensor is available through the model instance created by the Controller. When you call `trainer.get_controller(env)`, the Controller stores a reference to the model at `self.model`. You can access `self.model.global_step` or `self.model.inc_global_step` within your Controller subclass methods, or evaluate them in your custom loop via `sess.run(controller.model.global_step)`.

### Can I use a custom replay buffer implementation without modifying the base classes?

Yes. Override the `get_replay_buffer` method in your `Trainer` subclass to return an instance of your custom buffer class instead of the default prioritized replay buffer. The Controller will use this buffer automatically when `add_to_replay_buffer` is called during training, provided your implementation exposes the same methods the Controller calls on the default buffer. Check `replay_buffer.py` for the exact method names and signatures before relying on a particular interface.
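As a starting point, a FIFO buffer with uniform sampling fits in a few lines. This is a minimal sketch: `FifoReplayBuffer` is a hypothetical class, and its method names (`add`, `sample`) are assumptions that may need renaming to match what the Controller actually calls.

```python
import random
from collections import deque


class FifoReplayBuffer:
    """Hypothetical FIFO buffer with uniform sampling (not the pcl_rl interface)."""

    def __init__(self, max_size, seed=None):
        # A bounded deque drops the oldest entries automatically.
        self.episodes = deque(maxlen=max_size)
        self.rng = random.Random(seed)

    def add(self, episodes):
        """Append new episodes, evicting the oldest ones when full."""
        self.episodes.extend(episodes)

    def sample(self, n):
        """Uniformly sample up to n stored episodes without replacement."""
        return self.rng.sample(list(self.episodes), min(n, len(self.episodes)))

    def __len__(self):
        return len(self.episodes)


buf = FifoReplayBuffer(max_size=3, seed=0)
buf.add(["ep1", "ep2", "ep3", "ep4"])  # "ep1" is evicted by the size limit
print(len(buf), sorted(buf.sample(3)))
```

A `Trainer` subclass would then return an instance of such a class from `get_replay_buffer`.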

### What is the difference between overriding `Controller.train` versus `Controller._train`?

Override `Controller.train` to modify the high-level training step sequence, including episode sampling, replay buffer management, and summary generation. Override `Controller._train` to change only the parameter update mechanism itself, such as replacing the trust region optimization with a custom gradient descent routine, while retaining the standard data collection and preprocessing logic.

### How do I implement distributed training with a custom Controller subclass?

The base `Trainer` class handles distributed training coordination via TensorFlow's Supervisor (`self.sv`). If you only need a custom Controller, override `get_controller` to return your subclass and leave `run` untouched, so the base class keeps creating the Supervisor and session. If you also override `run`, reproduce the Supervisor and session setup from the base class yourself. In both cases the Supervisor manages variable synchronization across replicas while your custom Controller logic executes on each worker.