How to Use Data Loaders for NLP Tasks in TensorFlow Models: A Complete Guide
TensorFlow Models provides a plug-in data loader framework that uses DataConfig dataclasses, an abstract DataLoader base class, and a DataLoaderFactory registry to automatically route datasets to the correct loader without hard-coding class names.
The tensorflow/models repository offers a standardized data loading architecture for official NLP tasks that decouples dataset configuration from implementation. This system allows researchers to switch between TFRecord, TFDS, and raw text sources by changing configuration objects rather than model code. Understanding how to leverage DataConfig, DataLoader, and the factory registry is essential for training custom NLP models on new datasets.
The Three Core Components of the NLP Data Loading Framework
The architecture revolves around three abstractions defined in official/nlp/data/:
DataConfig: The Dataset Specification
DataConfig is a dataclass that holds user-specified parameters for a specific dataset. It defines paths, batch sizes, sequence lengths, and format-specific flags.
Concrete implementations include:
- SentencePredictionDataConfig in official/nlp/data/sentence_prediction_dataloader.py for classification tasks
- TaggingDataConfig in official/nlp/data/tagging_dataloader.py for NER and POS tagging
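As a rough illustration of what such a config carries, the sketch below uses plain Python dataclasses. It is not the library's class: ToyDataConfig, ToySentencePredictionConfig, and all default values are hypothetical. Only the field names (input_path, global_batch_size, seq_length, label_type, is_training) are borrowed from the examples later in this guide.

```python
import dataclasses


@dataclasses.dataclass
class ToyDataConfig:
    """Hypothetical stand-in for a base dataset config (not the library class)."""
    input_path: str = ""
    global_batch_size: int = 32
    is_training: bool = True


@dataclasses.dataclass
class ToySentencePredictionConfig(ToyDataConfig):
    """Hypothetical task-specific config that adds format-specific fields."""
    seq_length: int = 128
    label_type: str = "int"


# Only the values that differ from the defaults need to be spelled out.
toy_cfg = ToySentencePredictionConfig(
    input_path="/tmp/sentences.tfrecord",
    global_batch_size=64,
    is_training=False,
)

# dataclasses.replace copies the config with selected fields overridden,
# which is one way to derive a variant without mutating the original.
toy_small_batch_cfg = dataclasses.replace(toy_cfg, global_batch_size=16)
print(dataclasses.asdict(toy_small_batch_cfg))
```

Because the config is pure data, it can be built, copied, and compared without touching any dataset files.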
DataLoader: The Abstract Base Class
DataLoader in official/nlp/data/data_loader.py defines the interface all concrete loaders must implement. Subclasses override the load(self, input_context=None) method to return a tf.data.Dataset.
Concrete loaders built on this base class typically follow a three-step pipeline:
- _decode: Deserialize raw tf.Example records
- _parse: Transform decoded tensors into model-friendly dictionaries
- InputReader.read: Stitch the functions together and apply sharding and batching
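The decode, parse, and read steps can be pictured with a dependency-free sketch. Everything here is an assumption made for illustration: JSON strings stand in for serialized tf.Example records, a list of lists stands in for a batched tf.data.Dataset, and the function names decode, parse, and read are hypothetical rather than the library's methods.

```python
import json


def decode(raw_record):
    """Step 1: deserialize one raw record (JSON stands in for tf.Example)."""
    return json.loads(raw_record)


def parse(decoded):
    """Step 2: select and rename fields into the dictionary a model expects."""
    return {
        "input_word_ids": decoded["input_ids"],
        "input_mask": decoded["input_mask"],
    }


def read(raw_records, decoder_fn, parser_fn, batch_size):
    """Step 3: run decode then parse on every record and group into batches."""
    examples = [parser_fn(decoder_fn(r)) for r in raw_records]
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]


# Five toy records; the middle token id varies per record.
raw = [json.dumps({"input_ids": [101, i, 102], "input_mask": [1, 1, 1]})
       for i in range(5)]
batches = read(raw, decode, parse, batch_size=2)
print(len(batches), batches[0][0])
```

Keeping decode and parse as separate callables means the batching step never needs to know the record format.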
DataLoaderFactory: The Registry
DataLoaderFactory in official/nlp/data/data_loader_factory.py maintains a global registry that maps DataConfig subclasses to their corresponding DataLoader classes. This eliminates hard-coded class names in task definitions.
How the Factory Registers and Resolves Loaders
The factory uses a decorator-based registration system to link configurations to implementations.
Registration via Decorator
Concrete loaders register themselves using @data_loader_factory.register_data_loader_cls, which stores the mapping in a module-level dictionary named _REGISTERED_DATA_LOADER_CLS:
# Located in official/nlp/data/sentence_prediction_dataloader.py
@data_loader_factory.register_data_loader_cls(SentencePredictionDataConfig)
class SentencePredictionDataLoader(data_loader.DataLoader):

    def load(self, input_context=None):
        # Implementation details
        pass
Runtime Resolution
NLP tasks obtain the correct loader by calling get_data_loader with a configuration instance. The factory inspects the config type, looks up the registered class, and instantiates the loader:
# From official/nlp/tasks/sentence_prediction.py build_inputs()
return data_loader_factory.get_data_loader(params).load(input_context)
Because resolution happens at runtime, the same build_inputs code works for any dataset format without modification.
Implementing the DataLoader Interface
A concrete DataLoader implementation typically constructs an InputReader from official/core/input_reader.py to handle dataset logistics:
def load(self, input_context=None):
    reader = input_reader.InputReader(
        dataset_fn=dataset_fn.pick_dataset_fn(self._params.file_type),
        params=self._params,
        decoder_fn=self._decode,
        parser_fn=self._parse)
    return reader.read(input_context)
InputReader manages:
- Sharding across multiple accelerators via input_context
- Batching to global_batch_size
- Prefetching and optimization
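To make the sharding and batching arithmetic concrete, here is a small sketch in plain Python. It only illustrates the idea and does not reproduce InputReader: ToyInputContext, shard_files, and per_replica_batch_size are hypothetical names, although the three context fields mirror attributes found on tf.distribute.InputContext.

```python
import dataclasses


@dataclasses.dataclass
class ToyInputContext:
    """Hypothetical stand-in for the context a distribution strategy passes in."""
    num_input_pipelines: int
    input_pipeline_id: int
    num_replicas_in_sync: int


def shard_files(files, ctx):
    """Gives each input pipeline every num_input_pipelines-th file."""
    return files[ctx.input_pipeline_id::ctx.num_input_pipelines]


def per_replica_batch_size(global_batch_size, ctx):
    """Splits the global batch evenly across replicas."""
    if global_batch_size % ctx.num_replicas_in_sync != 0:
        raise ValueError("global_batch_size must divide evenly across replicas")
    return global_batch_size // ctx.num_replicas_in_sync


files = [f"part-{i:05d}.tfrecord" for i in range(8)]
ctx = ToyInputContext(
    num_input_pipelines=2, input_pipeline_id=1, num_replicas_in_sync=8)

my_files = shard_files(files, ctx)         # files with odd indices
my_batch = per_replica_batch_size(64, ctx)  # 64 // 8
print(my_files, my_batch)
```

With no input context (single-host use), a loader would simply read every file and batch to the full global_batch_size.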
The dataset_fn.pick_dataset_fn utility in official/common/dataset_fn.py selects the appropriate low-level reader for TFRecord, SSTable, or other file types based on the file_type parameter in the config.
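Selecting a reader from a string key is a plain dispatch pattern. The sketch below shows that pattern with ordinary text files so it runs anywhere; it is not the library's pick_dataset_fn. The names read_lines, _TOY_READERS, and toy_pick_dataset_fn, and the file_type keys "lines" and "lines_compressed", are hypothetical.

```python
import functools
import gzip
import os
import tempfile


def read_lines(path, compressed=False):
    """Reads a text file (optionally gzip-compressed) into a list of lines."""
    opener = gzip.open if compressed else open
    with opener(path, "rt") as f:
        return [line.rstrip("\n") for line in f]


# One entry per supported file_type; partial() pre-binds format options.
_TOY_READERS = {
    "lines": read_lines,
    "lines_compressed": functools.partial(read_lines, compressed=True),
}


def toy_pick_dataset_fn(file_type):
    """Returns the reader callable registered for file_type."""
    if file_type not in _TOY_READERS:
        raise ValueError(f"Unrecognized file_type: {file_type}")
    return _TOY_READERS[file_type]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt.gz")
    with gzip.open(path, "wt") as f:
        f.write("first\nsecond\n")
    rows = toy_pick_dataset_fn("lines_compressed")(path)
print(rows)
```

Failing loudly on an unknown file_type surfaces config typos at pipeline construction time rather than during training.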
Practical Code Examples for NLP Tasks
Loading Sentence Prediction Datasets (TFRecord)
For classification tasks using preprocessed TFRecord files:
from official.nlp.data.sentence_prediction_dataloader import (
    SentencePredictionDataConfig)
from official.nlp.data import data_loader_factory

# Configure the dataset
cfg = SentencePredictionDataConfig(
    input_path='gs://my-bucket/sentences.tfrecord',
    global_batch_size=64,
    seq_length=128,
    label_type='int',
    is_training=False)

# Factory instantiates SentencePredictionDataLoader automatically
loader = data_loader_factory.get_data_loader(cfg)
dataset = loader.load()

# Iterate for evaluation
for batch in dataset.take(1):
    print(batch)  # Contains 'input_word_ids', 'input_mask', etc.
Loading Tagging Datasets with Sentence IDs
For NER or POS tagging that requires sentence identifiers:
from official.nlp.data.tagging_dataloader import TaggingDataConfig
from official.nlp.data import data_loader_factory

cfg = TaggingDataConfig(
    input_path='gs://my-bucket/tagging.tfrecord',
    global_batch_size=32,
    seq_length=128,
    include_sentence_id=True,
    is_training=True)

loader = data_loader_factory.get_data_loader(cfg)
ds = loader.load()

for features, labels in ds.take(1):
    print(features['input_word_ids'].shape, labels.shape)
Loading Raw Text via TensorFlow Datasets
For direct text loading without pre-tokenized TFRecords:
from official.nlp.data.sentence_prediction_dataloader import (
    SentencePredictionTextDataConfig)
from official.nlp.data import data_loader_factory

cfg = SentencePredictionTextDataConfig(
    tfds_name='glue/sst2',
    tfds_split='train',
    vocab_file='gs://my-bucket/vocab.txt',
    tokenization='WordPiece',
    seq_length=128,
    global_batch_size=32)

loader = data_loader_factory.get_data_loader(cfg)
ds = loader.load()

for batch in ds.take(1):
    print(batch)  # Tokenized and padded automatically
Extending the Framework for Custom Datasets
To add support for a new dataset format:
- Define a DataConfig subclass with task-specific parameters
- Inherit from DataLoader and implement _decode, _parse, and load
- Apply the registration decorator: @data_loader_factory.register_data_loader_cls(MyConfig)
No modifications to task code are required—the factory automatically resolves the new loader when the corresponding config is passed to get_data_loader.
Summary
- TensorFlow Models uses a registry-based data loading framework in official/nlp/data/ that separates configuration from implementation
- DataConfig dataclasses define dataset parameters, while DataLoader subclasses implement the tf.data.Dataset construction logic
- DataLoaderFactory maps configs to loaders via the @register_data_loader_cls decorator, enabling runtime resolution without hard-coded imports
- InputReader in official/core/input_reader.py handles cross-replica sharding, batching, and prefetching uniformly across all NLP tasks
- The system supports TFRecord, TFDS, and raw text sources through the same unified interface
Frequently Asked Questions
What is the difference between DataConfig and DataLoader?
DataConfig is a dataclass that stores hyperparameters like file paths, batch size, and sequence length, while DataLoader is a class that contains the actual logic to read files and return a tf.data.Dataset. The config describes what to load; the loader implements how to load it.
How do I register a custom data loader for a new NLP dataset?
Create a new DataConfig subclass for your dataset parameters, then implement a DataLoader that inherits from the base class in official/nlp/data/data_loader.py. Decorate your loader class with @data_loader_factory.register_data_loader_cls(YourConfigClass). The factory will automatically instantiate your loader when your config is passed to get_data_loader().
Can I use these data loaders with non-TFRecord formats?
Yes. Concrete loaders call dataset_fn.pick_dataset_fn from official/common/dataset_fn.py with the file_type parameter from your DataConfig and pass the resulting reader to InputReader. The framework supports SSTable, raw text, and TensorFlow Datasets (TFDS) in addition to TFRecord.
Where does the batching and prefetching logic live?
The InputReader class in official/core/input_reader.py handles all dataset optimization including sharding across accelerators, batching to the specified global_batch_size, and applying tf.data optimizations like prefetching and caching. DataLoader implementations delegate these concerns to InputReader rather than reimplementing them.