How to Create Custom Training Loops with Controller and Trainer in TensorFlow Models

To implement custom training loops with Controller and Trainer in the tensorflow/models repository, subclass the Controller class to modify episode sampling or training steps, or extend the Trainer class to override the orchestration flow while retaining the underlying factory methods for graph construction and environment initialization.

The Controller and Trainer classes in the research/pcl_rl package provide the foundational infrastructure for reinforcement learning experiments in TensorFlow Models. These components handle environment interaction, replay buffer management, graph construction, and distributed training coordination. Extending them lets you implement advanced training workflows, such as custom stopping criteria, alternative exploration strategies, or multi-objective optimization, without rewriting the underlying TensorFlow operations.

Understanding the Core Components

Controller: The Training Step Orchestrator

The Controller class, defined in research/pcl_rl/controller.py, manages the inner loop of reinforcement learning training. It samples episodes from the environment, converts observations into the format expected by your model, manages replay buffer insertion, and executes the model's training steps. The Controller also provides a greedy-policy evaluation routine for validation purposes.

Key responsibilities include calling sample_episodes (which invokes the internal _sample_episodes method) to collect experience batches, optionally adding data to a replay buffer via add_to_replay_buffer, and invoking the internal _train method to perform either trust-region optimization (model.trust_region_step) or standard gradient descent (model.train_step).
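
The choice between the two update routines can be pictured with a small, self-contained sketch. StubModel and run_update are hypothetical stand-ins written for this article, not the actual pcl_rl classes. Only the method names trust_region_step and train_step come from the text above, and their real signatures may differ.

```python
class StubModel:
    """Hypothetical stand-in that records which update path was used."""

    def __init__(self):
        self.calls = []

    def train_step(self, batch):
        # Placeholder for a standard gradient-descent update.
        self.calls.append("train_step")
        return sum(batch) / len(batch)

    def trust_region_step(self, batch):
        # Placeholder for a trust-region update.
        self.calls.append("trust_region_step")
        return sum(batch) / len(batch)


def run_update(model, batch, use_trust_region):
    """Dispatch to one of the two update routines based on a flag."""
    if use_trust_region:
        return model.trust_region_step(batch)
    return model.train_step(batch)


model = StubModel()
run_update(model, [1.0, 2.0, 3.0], use_trust_region=False)
run_update(model, [1.0, 2.0, 3.0], use_trust_region=True)
print(model.calls)
```

The flag mirrors the use_trust_region argument that appears in the controller construction later in this article.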

Trainer: The High-Level Training Driver

The Trainer class, located in research/pcl_rl/trainer.py, serves as the entry point for training jobs. It builds the TensorFlow graph, creates GymWrapper environment instances, instantiates a Controller through factory methods, and executes the standard training loop. The Trainer also manages checkpointing, supports multi-replica training via Supervisor, and handles hyper-parameter configuration through tf.flags.

During initialization, Trainer.__init__ parses command-line flags, constructs the environment, and stores hyper-parameters in self.hparams. The Trainer.get_controller method then constructs a Controller instance, passing factory callables such as get_model, get_replay_buffer, and get_buffer_seeds.

How Controller and Trainer Interact

The standard execution flow follows four distinct phases that connect these two classes:

  1. Initialization: Trainer.__init__ parses flags and builds the GymWrapper environment.
  2. Controller Construction: Trainer.get_controller instantiates the Controller, injecting factory methods for model and buffer creation.
  3. Session Setup: Trainer.run creates the TensorFlow session, calls controller.setup(train=True) to build the model graph, and initializes variables.
  4. Training Loop: The trainer repeatedly calls controller.train(sess), which internally executes sample_episodes, manages the replay buffer, and runs _train to update model parameters. The Controller may also update the relative-entropy coefficient (eps_lambda) if configured.
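
The four phases can be mimicked without TensorFlow in a short sketch. SketchTrainer and SketchController are hypothetical classes invented for illustration. They borrow the method names get_controller, get_model, setup, train, and run from the text, but everything inside them is a stand-in.

```python
class SketchController:
    """Hypothetical controller: receives a factory and builds the model lazily."""

    def __init__(self, env, get_model):
        self.env = env
        self.get_model = get_model
        self.model = None

    def setup(self, train=True):
        # Phase 3: the model is only built when setup() is called.
        self.model = self.get_model()

    def train(self, sess):
        # Phase 4: one training iteration; here it only bumps a counter.
        self.model["step"] += 1
        return self.model["step"]


class SketchTrainer:
    def __init__(self, num_steps):
        # Phase 1: configuration and environment creation.
        self.num_steps = num_steps
        self.env = "stub-env"

    def get_model(self):
        return {"step": 0}

    def get_controller(self, env):
        # Phase 2: inject the factory instead of a ready-made model.
        return SketchController(env, get_model=self.get_model)

    def run(self):
        controller = self.get_controller(self.env)
        controller.setup(train=True)
        sess = None  # a real run would create a TensorFlow session here
        last = 0
        for _ in range(self.num_steps):
            last = controller.train(sess)
        return last


print(SketchTrainer(num_steps=3).run())
```

Passing a factory rather than a finished model is what lets a subclass swap the model or buffer without touching the controller's constructor logic.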

Because the training logic resides within Controller.train, you can intercept or replace behavior at multiple points while maintaining access to low-level tensors such as model.global_step and model.inc_global_step.

Approaches to Customization

You can create custom training loops using three primary extension strategies. Each approach provides different levels of control over the training process.

Subclass Controller: Modify how episodes are sampled or how training steps are executed. Override sample_episodes, _train, or train to implement custom exploration strategies, loss functions, or data processing pipelines.

Subclass Trainer: Modify the outer training orchestration while reusing existing factory methods. Override run to implement custom stopping criteria, validation schedules, or multi-phase training regimens.

Bypass Trainer.run entirely: Implement a completely custom loop. Instantiate Trainer to access its factory methods and hyper-parameters, then manually create a session and invoke controller.train(sess) within your own loop structure.

Implementation Examples

Example 1: Custom Loop Without Trainer.run

This approach uses the Trainer class for environment and model setup, but replaces the standard training loop with custom logic for early stopping based on episode count or performance thresholds.


# custom_loop.py

import tensorflow as tf
import numpy as np
from research.pcl_rl.trainer import Trainer
from research.pcl_rl.controller import Controller

# Build trainer to access factories and hyper-parameters

trainer = Trainer()

with tf.Graph().as_default():
    sess = tf.Session()
    controller = trainer.get_controller(trainer.env)
    controller.setup(train=True)
    sess.run(tf.global_variables_initializer())

    max_episodes = 500
    episode = 0
    while episode < max_episodes:
        loss, summary, total_rewards, episode_rewards = controller.train(sess)
        
        # Early stop when the average reward exceeds the threshold.
        # np.mean avoids adding a new op to the graph on every iteration.
        if np.mean(total_rewards) > 200.0:
            print('Target reward reached, stopping.')
            break
        
        episode += 1
        if episode % 50 == 0:
            print(f'Episode {episode}: avg reward = {np.mean(total_rewards)}')

This pattern retains the Trainer's configuration parsing and factory methods while allowing arbitrary loop logic, custom metrics logging, or conditional checkpointing.

Example 2: Custom Sampling Strategy via Controller Subclass

To implement epsilon-greedy exploration instead of the default policy sampling, extend the Controller class and override the _sample_episodes method.


# my_controller.py

import numpy as np
from research.pcl_rl import controller as base

class EpsilonGreedyController(base.Controller):
    """Controller with epsilon-greedy exploration."""
    def __init__(self, *args, epsilon=0.1, **kwargs):
        super(EpsilonGreedyController, self).__init__(*args, **kwargs)
        self.epsilon = epsilon

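    def _sample_episodes(self, sess, greedy=False):
        init_state, obs, act, rew, pads = super(
            EpsilonGreedyController, self)._sample_episodes(sess, greedy=greedy)

        # Caveat: this rewrites the recorded actions after the rollout, so the
        # environment has already stepped with the original actions. For true
        # epsilon-greedy behavior, randomize where the action is chosen inside
        # the sampling loop instead.
        # Assumption: sample_random_actions returns an array shaped like `a`.
        for a in act:
            mask = np.random.rand(*a.shape) < self.epsilon
            random_actions = self.env_spec.sample_random_actions(a.shape[0])
            a[mask] = random_actions[mask]
        return init_state, obs, act, rew, pads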

To use this custom controller, override get_controller in your trainer subclass:


# custom_trainer.py

from research.pcl_rl.trainer import Trainer
from my_controller import EpsilonGreedyController

class CustomTrainer(Trainer):
    def get_controller(self, env):
        return EpsilonGreedyController(
            env, self.env_spec, self.internal_dim,
            use_online_batch=self.use_online_batch,
            batch_by_steps=self.batch_by_steps,
            unify_episodes=self.unify_episodes,
            replay_batch_size=self.replay_batch_size,
            max_step=self.max_step,
            cutoff_agent=self.cutoff_agent,
            save_trajectories_file=self.save_trajectories_file,
            use_trust_region=self.trust_region_p,
            use_value_opt=self.value_opt not in [None, 'None'],
            update_eps_lambda=self.update_eps_lambda,
            prioritize_by=self.prioritize_by,
            get_model=self.get_model,
            get_replay_buffer=self.get_replay_buffer,
            get_buffer_seeds=self.get_buffer_seeds,
            epsilon=0.05)

Example 3: Reward-Based Early Stopping via Trainer Subclass

To stop training when the environment reaches a target reward threshold, override the run method in your Trainer subclass.


# early_stop_trainer.py

import numpy as np
import tensorflow as tf
from research.pcl_rl.trainer import Trainer

class EarlyStopTrainer(Trainer):
    def run(self):
        target_reward = 250.0
        
        # Standard initialization from Trainer.run

        self.sess = tf.Session()
        self.controller = self.get_controller(self.env)
        self.controller.setup(train=True)
        self.sess.run(tf.global_variables_initializer())
        
        while True:
            loss, summary, total_rewards, episode_rewards = self.controller.train(self.sess)
            avg_reward = np.mean(total_rewards)
            step = self.sess.run(self.controller.model.global_step)
            print(f'Step {step}, avg reward {avg_reward:.2f}')
            
            if avg_reward >= target_reward:
                print('Reached target reward, exiting.')
                break
        
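        # Final checkpoint save, only if a Supervisor was set up on self.sv
        # (this simplified run() does not create one itself).
        sv = getattr(self, 'sv', None)
        if sv is not None:
            sv.saver.save(self.sess, sv.save_path,
                          global_step=sv.global_step)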
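
A single batch average can be noisy, so a stopping rule based on one iteration may trigger early. As a minimal sketch, the helper below stops only when the mean over a window of recent iterations reaches the target. RewardStopper is a hypothetical class for illustration and is not part of pcl_rl.

```python
from collections import deque


class RewardStopper:
    """Signals a stop once the mean of the last `window` rewards hits `target`."""

    def __init__(self, target, window=10):
        self.target = target
        self.recent = deque(maxlen=window)

    def update(self, reward):
        # Record the newest reward; old ones fall out of the window.
        self.recent.append(float(reward))
        window_full = len(self.recent) == self.recent.maxlen
        mean = sum(self.recent) / len(self.recent)
        return window_full and mean >= self.target


stopper = RewardStopper(target=250.0, window=3)
for step, reward in enumerate([100.0, 300.0, 260.0, 255.0, 270.0]):
    if stopper.update(reward):
        print("stop at step", step)
        break
```

Inside a custom run method, the value passed to update would be the per-iteration average reward, and a True return would break out of the loop.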

Common Customization Scenarios

Change the stopping criterion: Override Trainer.run and replace the standard loop (for step in xrange(1 + self.num_steps)) with a while loop that checks reward-based or convergence conditions.

Use a different sampling policy: Subclass Controller and replace the call to self.model.sample_step in _sample_episodes with custom logic for epsilon-greedy, Boltzmann exploration, or learned exploration bonuses.

Add extra logging or metrics: Insert additional TensorBoard summaries or print statements in Controller.train after the _train call, or within your custom training loop by running self.sess.run(custom_summary_op).

Swap the replay buffer implementation: Override Trainer.get_replay_buffer to return a different buffer class, such as a simple FIFO buffer or a novel prioritization scheme.

Combine multiple objectives: Extend Trainer.get_objective to instantiate a composite objective that wraps existing PCL, A3C, or custom loss functions for multi-task learning.
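
For the sampling-policy scenario, Boltzmann exploration can be sketched in plain Python as follows. The function boltzmann_sample and its parameters are hypothetical and do not come from pcl_rl. The sketch only shows the action-selection rule, not how it would be wired into _sample_episodes.

```python
import math
import random


def boltzmann_sample(preferences, temperature=1.0, rng=random):
    """Sample one action index with probability softmax(preferences / T)."""
    scaled = [p / temperature for p in preferences]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
action = boltzmann_sample([2.0, 0.5, 0.1], temperature=0.5, rng=rng)
print(action)
```

Lower temperatures concentrate probability on the highest-preference action, while higher temperatures approach uniform random selection.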

Key Source Files

The two files discussed in this article are research/pcl_rl/controller.py, which defines the Controller class, and research/pcl_rl/trainer.py, which defines the Trainer class.

Summary

  • Controller manages the inner training loop including episode sampling, replay buffer management, and model updates via train_step or trust_region_step.
  • Trainer handles outer-loop orchestration, graph construction, and hyper-parameter management through factory methods like get_controller and get_model.
  • Subclass Controller to modify sampling strategies (_sample_episodes) or training procedures (_train).
  • Subclass Trainer to implement custom stopping criteria, validation schedules, or multi-phase training workflows by overriding run.
  • Access low-level training tensors including model.global_step and replay buffer instances through the public APIs of both classes.

Frequently Asked Questions

How do I access the model's global step tensor from a custom training loop?

The global step tensor is available through the model instance created by the Controller. When you call trainer.get_controller(env), the Controller stores a reference to the model at self.model. You can access self.model.global_step or self.model.inc_global_step within your Controller subclass methods, or evaluate them in your custom loop via sess.run(controller.model.global_step).

Can I use a custom replay buffer implementation without modifying the base classes?

Yes. Override the get_replay_buffer method in your Trainer subclass to return an instance of your custom buffer class instead of the default prioritized replay buffer. The Controller will use this buffer automatically when add_to_replay_buffer is called during training, provided your implementation matches the expected interface for add and sample operations.
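
As a rough illustration, a FIFO buffer might look like the sketch below. The class name and the methods add and get_batch are assumptions made for this example. Check the interface the Controller actually calls on the default buffer before substituting your own.

```python
import random
from collections import deque


class FifoReplayBuffer:
    """Hypothetical FIFO buffer: the oldest episodes are evicted first."""

    def __init__(self, max_size, seed=None):
        self.episodes = deque(maxlen=max_size)
        self.rng = random.Random(seed)

    def add(self, episodes):
        # deque(maxlen=...) silently drops the oldest items when full.
        self.episodes.extend(episodes)

    def get_batch(self, n):
        # Uniform sampling without replacement, capped at the current size.
        n = min(n, len(self.episodes))
        return self.rng.sample(list(self.episodes), n)

    def __len__(self):
        return len(self.episodes)


buf = FifoReplayBuffer(max_size=3, seed=0)
buf.add(["ep1", "ep2", "ep3", "ep4", "ep5"])
print(len(buf), sorted(buf.get_batch(10)))
```

A get_replay_buffer override would then return an instance of this class, adapted to whatever constructor arguments and method names the base buffer uses.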

What is the difference between overriding Controller.train versus Controller._train?

Override Controller.train to modify the high-level training step sequence, including episode sampling, replay buffer management, and summary generation. Override Controller._train to change only the parameter update mechanism itself, such as replacing the trust region optimization with a custom gradient descent routine, while retaining the standard data collection and preprocessing logic.
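
The distinction can be seen in a toy class hierarchy. BaseController below is a hypothetical stand-in that copies only the train/_train split described above. Its method bodies are placeholders, not the pcl_rl implementation.

```python
class BaseController:
    """Hypothetical base with a public train() and an internal _train()."""

    def sample_episodes(self):
        return [1.0, 2.0, 3.0]

    def _train(self, batch):
        # Default parameter update: a stand-in that returns a fake loss.
        return sum(batch)

    def train(self):
        # High-level sequence: collect data, then update.
        batch = self.sample_episodes()
        return self._train(batch)


class CustomUpdate(BaseController):
    def _train(self, batch):
        # Only the update rule changes; data collection is inherited.
        return sum(batch) / len(batch)


class CustomSequence(BaseController):
    def train(self):
        # The whole step is replaced: here, two updates per sampled batch.
        batch = self.sample_episodes()
        self._train(batch)
        return self._train(batch)


print(BaseController().train(), CustomUpdate().train(), CustomSequence().train())
```

CustomUpdate still runs the inherited sampling code, whereas CustomSequence takes responsibility for every stage of the step.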

How do I implement distributed training with a custom Controller subclass?

The base Trainer class handles distributed training coordination via TensorFlow's Supervisor (self.sv). When you subclass Trainer and override run, ensure you initialize the Supervisor and session using the standard pattern from the base class. Then instantiate your custom Controller subclass within get_controller. The Supervisor will manage variable synchronization across replicas while your custom Controller logic executes on each worker.
