How to Use Data Loaders for NLP Tasks in TensorFlow Models: A Complete Guide

TensorFlow Models provides a plug-in data loader framework that uses DataConfig dataclasses, an abstract DataLoader base class, and a DataLoaderFactory registry to automatically route datasets to the correct loader without hard-coding class names.

The tensorflow/models repository offers a standardized data loading architecture for official NLP tasks that decouples dataset configuration from implementation. This system allows researchers to switch between TFRecord, TFDS, and raw text sources by changing configuration objects rather than model code. Understanding how to leverage DataConfig, DataLoader, and the factory registry is essential for training custom NLP models on new datasets.

The Three Core Components of the NLP Data Loading Framework

The architecture revolves around three abstractions defined in official/nlp/data/:

DataConfig: The Dataset Specification

DataConfig is a dataclass that holds user-specified parameters for a specific dataset. It defines paths, batch sizes, sequence lengths, and format-specific flags.
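As a rough illustration of the idea, the sketch below uses a plain Python dataclass. ToyDataConfig and its file paths are hypothetical stand-ins invented for this example, not classes or files from the repository; the real configs carry more fields and inherit from the library's own config base class.

```python
import dataclasses


@dataclasses.dataclass
class ToyDataConfig:
    """Hypothetical stand-in for a DataConfig subclass (not a real class)."""
    input_path: str = ""
    global_batch_size: int = 32
    seq_length: int = 128
    is_training: bool = True


# Training and evaluation pipelines differ only in configuration values.
train_cfg = ToyDataConfig(input_path="/tmp/train.tfrecord")
eval_cfg = dataclasses.replace(
    train_cfg, input_path="/tmp/eval.tfrecord", is_training=False)
```

Because the config is only data, switching datasets means constructing a different object while the loading code stays unchanged.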

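Concrete implementations used in this guide include SentencePredictionDataConfig, SentencePredictionTextDataConfig, and TaggingDataConfig.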

DataLoader: The Abstract Base Class

DataLoader in official/nlp/data/data_loader.py defines the interface all concrete loaders must implement. Subclasses override the load(self, input_context=None) method to return a tf.data.Dataset.

Concrete loaders built on this interface typically follow a three-step pipeline:

  1. _decode: Deserialize raw tf.Example records
  2. _parse: Transform decoded tensors into model-friendly dictionaries
  3. InputReader.read: Stitch functions together and apply sharding/batching
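The three steps can be sketched without TensorFlow. In this toy version, JSON strings stand in for serialized tf.Example records and a plain generator stands in for InputReader.read. ToyLoader and JsonLinesLoader are hypothetical names made up for the illustration; only the method names _decode, _parse, and load follow the text above.

```python
import abc
import json
from typing import Any, Dict, Iterable, Iterator, List


class ToyLoader(abc.ABC):
    """Hypothetical stand-in for the abstract loader interface."""

    @abc.abstractmethod
    def load(self) -> Iterator[List[Dict[str, Any]]]:
        ...


class JsonLinesLoader(ToyLoader):
    """Toy loader: JSON strings stand in for serialized records."""

    def __init__(self, records: Iterable[str], seq_length: int,
                 batch_size: int):
        self._records = list(records)
        self._seq_length = seq_length
        self._batch_size = batch_size

    def _decode(self, record: str) -> Dict[str, Any]:
        # Step 1: deserialize one raw record.
        return json.loads(record)

    def _parse(self, decoded: Dict[str, Any]) -> Dict[str, Any]:
        # Step 2: truncate or pad to seq_length and use model-facing keys.
        ids = decoded["ids"][: self._seq_length]
        pad = self._seq_length - len(ids)
        return {
            "input_word_ids": ids + [0] * pad,
            "input_mask": [1] * len(ids) + [0] * pad,
        }

    def load(self) -> Iterator[List[Dict[str, Any]]]:
        # Step 3: chain the stages and batch the results.
        parsed = [self._parse(self._decode(r)) for r in self._records]
        for start in range(0, len(parsed), self._batch_size):
            yield parsed[start:start + self._batch_size]


raw = ['{"ids": [101, 7592, 102]}', '{"ids": [101, 2088, 999, 102, 0, 0]}']
batches = list(JsonLinesLoader(raw, seq_length=4, batch_size=2).load())
```

The short record is padded and the long one is truncated, so every example in a batch has the same length.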

DataLoaderFactory: The Registry

The data_loader_factory module in official/nlp/data/data_loader_factory.py (referred to here as DataLoaderFactory) maintains a module-level registry that maps DataConfig subclasses to their corresponding DataLoader classes. This eliminates hard-coded class names in task definitions.

How the Factory Registers and Resolves Loaders

The factory uses a decorator-based registration system to link configurations to implementations.

Registration via Decorator

Concrete loaders register themselves using @data_loader_factory.register_data_loader_cls, which stores the mapping in a module-level dictionary named _REGISTERED_DATA_LOADER_CLS:


# Located in official/nlp/data/sentence_prediction_dataloader.py
@data_loader_factory.register_data_loader_cls(SentencePredictionDataConfig)
class SentencePredictionDataLoader(data_loader.DataLoader):
    def load(self, input_context=None):
        # Implementation details omitted
        pass

Runtime Resolution

NLP tasks obtain the correct loader by calling get_data_loader with a configuration instance. The factory inspects the config type, looks up the registered class, and instantiates the loader:


# From official/nlp/tasks/sentence_prediction.py build_inputs()
return data_loader_factory.get_data_loader(params).load(input_context)

Because resolution happens at runtime, the same build_inputs code works for any dataset format without modification.
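The registration and lookup mechanics described above can be approximated in a few lines of plain Python. This is a minimal sketch of the pattern, not the library's code: register_loader, get_loader, _REGISTRY, and the toy config and loader classes are all hypothetical names chosen for the example, and the duplicate-registration check is an assumption about sensible behavior.

```python
import dataclasses
from typing import Any, Callable, Dict, Type

# Module-level registry mapping config classes to loader classes.
_REGISTRY: Dict[Type[Any], Type[Any]] = {}


def register_loader(config_cls: Type[Any]) -> Callable[[Type[Any]], Type[Any]]:
    """Returns a class decorator that maps config_cls to the decorated loader."""

    def decorator(loader_cls: Type[Any]) -> Type[Any]:
        if config_cls in _REGISTRY:
            raise ValueError(f"{config_cls.__name__} is already registered")
        _REGISTRY[config_cls] = loader_cls
        return loader_cls

    return decorator


def get_loader(config: Any) -> Any:
    """Looks up the loader by the config's exact type and instantiates it."""
    return _REGISTRY[type(config)](config)


@dataclasses.dataclass
class ClassifyConfig:
    input_path: str = ""


@dataclasses.dataclass
class TagConfig:
    input_path: str = ""


@register_loader(ClassifyConfig)
class ClassifyLoader:
    def __init__(self, params: ClassifyConfig):
        self._params = params

    def load(self) -> str:
        return f"classify:{self._params.input_path}"


@register_loader(TagConfig)
class TagLoader:
    def __init__(self, params: TagConfig):
        self._params = params

    def load(self) -> str:
        return f"tag:{self._params.input_path}"


def build_inputs(params: Any) -> str:
    # The caller never names a loader class; the config type selects it.
    return get_loader(params).load()
```

In this sketch the lookup keys on the exact config type, so a subclass of a registered config would need its own registration.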

Implementing the DataLoader Interface

A concrete DataLoader implementation typically constructs an InputReader from official/core/input_reader.py to handle dataset logistics:

def load(self, input_context=None):
    reader = input_reader.InputReader(
        dataset_fn=dataset_fn.pick_dataset_fn(self._params.file_type),
        params=self._params,
        decoder_fn=self._decode,
        parser_fn=self._parse)
    return reader.read(input_context)

InputReader manages:

  • Sharding across multiple accelerators via input_context
  • Batching to global_batch_size
  • Prefetching and optimization

The dataset_fn.pick_dataset_fn utility in official/common/dataset_fn.py selects the appropriate low-level reader for TFRecord, SSTable, or other file types based on the file_type parameter in the config.
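To make the sharding and batching responsibilities concrete, here is a small arithmetic sketch in plain Python. The helper names shard_files and per_replica_batch_size are hypothetical and the file names are invented. The logic assumes round-robin file sharding across input pipelines and an even split of the global batch across replicas, which is how tf.data sharding and tf.distribute batch splitting are generally described.

```python
from typing import List, Sequence


def shard_files(files: Sequence[str], num_pipelines: int,
                pipeline_id: int) -> List[str]:
    """Keeps every num_pipelines-th file, starting at pipeline_id."""
    return [f for i, f in enumerate(files) if i % num_pipelines == pipeline_id]


def per_replica_batch_size(global_batch_size: int, num_replicas: int) -> int:
    """Splits the global batch evenly; rejects sizes that do not divide."""
    if global_batch_size % num_replicas != 0:
        raise ValueError(
            f"global_batch_size {global_batch_size} is not divisible by "
            f"{num_replicas} replicas")
    return global_batch_size // num_replicas


files = ["part-0", "part-1", "part-2", "part-3", "part-4"]
worker0_files = shard_files(files, num_pipelines=2, pipeline_id=0)
worker1_files = shard_files(files, num_pipelines=2, pipeline_id=1)
local_batch = per_replica_batch_size(global_batch_size=64, num_replicas=8)
```

Each input pipeline reads a disjoint subset of files, and each replica processes its share of the global batch.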

Practical Code Examples for NLP Tasks

Loading Sentence Prediction Datasets (TFRecord)

For classification tasks using preprocessed TFRecord files:

from official.nlp.data.sentence_prediction_dataloader import (
    SentencePredictionDataConfig)
from official.nlp.data import data_loader_factory

# Configure the dataset
cfg = SentencePredictionDataConfig(
    input_path='gs://my-bucket/sentences.tfrecord',
    global_batch_size=64,
    seq_length=128,
    label_type='int',
    is_training=False)

# Factory instantiates SentencePredictionDataLoader automatically
loader = data_loader_factory.get_data_loader(cfg)
dataset = loader.load()

# Iterate for evaluation
for batch in dataset.take(1):
    print(batch)  # Contains 'input_word_ids', 'input_mask', etc.

Loading Tagging Datasets with Sentence IDs

For NER or POS tagging that requires sentence identifiers:

from official.nlp.data.tagging_dataloader import TaggingDataConfig
from official.nlp.data import data_loader_factory

cfg = TaggingDataConfig(
    input_path='gs://my-bucket/tagging.tfrecord',
    global_batch_size=32,
    seq_length=128,
    include_sentence_id=True,
    is_training=True)

loader = data_loader_factory.get_data_loader(cfg)
ds = loader.load()

for features, labels in ds.take(1):
    print(features['input_word_ids'].shape, labels.shape)

Loading Raw Text via TensorFlow Datasets

For direct text loading without pre-tokenized TFRecords:

from official.nlp.data.sentence_prediction_dataloader import (
    SentencePredictionTextDataConfig)
from official.nlp.data import data_loader_factory

cfg = SentencePredictionTextDataConfig(
    tfds_name='glue/sst2',
    tfds_split='train',
    vocab_file='gs://my-bucket/vocab.txt',
    tokenization='WordPiece',
    seq_length=128,
    global_batch_size=32)

loader = data_loader_factory.get_data_loader(cfg)
ds = loader.load()

for batch in ds.take(1):
    print(batch)  # Tokenized and padded automatically

Extending the Framework for Custom Datasets

To add support for a new dataset format:

  1. Define a DataConfig subclass with task-specific parameters
  2. Inherit from DataLoader and implement _decode, _parse, and load
  3. Apply the registration decorator: @data_loader_factory.register_data_loader_cls(MyConfig)

No modifications to task code are required—the factory automatically resolves the new loader when the corresponding config is passed to get_data_loader.

Summary

  • TensorFlow Models uses a registry-based data loading framework in official/nlp/data/ that separates configuration from implementation
  • DataConfig dataclasses define dataset parameters, while DataLoader subclasses implement the tf.data.Dataset construction logic
  • DataLoaderFactory maps configs to loaders via the @register_data_loader_cls decorator, enabling runtime resolution without hard-coded imports
  • InputReader in official/core/input_reader.py handles cross-replica sharding, batching, and prefetching uniformly across all NLP tasks
  • The system supports TFRecord, TFDS, and raw text sources through the same unified interface

Frequently Asked Questions

What is the difference between DataConfig and DataLoader?

DataConfig is a dataclass that stores hyperparameters like file paths, batch size, and sequence length, while DataLoader is a class that contains the actual logic to read files and return a tf.data.Dataset. The config describes what to load; the loader implements how to load it.

How do I register a custom data loader for a new NLP dataset?

Create a new DataConfig subclass for your dataset parameters, then implement a DataLoader that inherits from the base class in official/nlp/data/data_loader.py. Decorate your loader class with @data_loader_factory.register_data_loader_cls(YourConfigClass). The factory will automatically instantiate your loader when your config is passed to get_data_loader().

Can I use these data loaders with non-TFRecord formats?

Yes. Each loader passes dataset_fn.pick_dataset_fn(self._params.file_type), from official/common/dataset_fn.py, to InputReader, so the low-level reader is selected by the file_type parameter in your DataConfig. The framework supports SSTable, raw text, and TensorFlow Datasets (TFDS) in addition to TFRecord.

Where does the batching and prefetching logic live?

The InputReader class in official/core/input_reader.py handles all dataset optimization including sharding across accelerators, batching to the specified global_batch_size, and applying tf.data optimizations like prefetching and caching. DataLoader implementations delegate these concerns to InputReader rather than reimplementing them.
