# How to Use Data Loaders for NLP Tasks in TensorFlow Models: A Complete Guide

> Master TensorFlow data loaders for NLP. This guide explains using DataConfig, DataLoader, and DataLoaderFactory for seamless dataset routing in your models.

- Repository: [tensorflow/models](https://github.com/tensorflow/models)
- Tags: how-to-guide
- Published: 2026-02-28

---

**TensorFlow Models provides a plug-in data loader framework that uses `DataConfig` dataclasses, an abstract `DataLoader` base class, and a `DataLoaderFactory` registry to automatically route datasets to the correct loader without hard-coding class names.**

The `tensorflow/models` repository offers a standardized data loading architecture for official NLP tasks that decouples dataset configuration from implementation. This system allows researchers to switch between TFRecord, TFDS, and raw text sources by changing configuration objects rather than model code. Understanding how to leverage `DataConfig`, `DataLoader`, and the factory registry is essential for training custom NLP models on new datasets.

## The Three Core Components of the NLP Data Loading Framework

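The architecture revolves around three abstractions. The `DataLoader` base class and the factory live in `official/nlp/data/`, while the base `DataConfig` class comes from `official/core/config_definitions.py`: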

### DataConfig: The Dataset Specification

**`DataConfig`** is a dataclass that holds user-specified parameters for a specific dataset. It defines paths, batch sizes, sequence lengths, and format-specific flags.

Concrete implementations include:
- **`SentencePredictionDataConfig`** in [`official/nlp/data/sentence_prediction_dataloader.py`](https://github.com/tensorflow/models/blob/main/official/nlp/data/sentence_prediction_dataloader.py) for classification tasks
- **`TaggingDataConfig`** in [`official/nlp/data/tagging_dataloader.py`](https://github.com/tensorflow/models/blob/main/official/nlp/data/tagging_dataloader.py) for NER and POS tagging
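
As a rough illustration of what such a config holds, the sketch below uses a plain standard-library dataclass. `ToyDataConfig` and its defaults are hypothetical stand-ins; the real classes derive from the Model Garden's `DataConfig` base and carry more fields.

```python
import dataclasses


@dataclasses.dataclass
class ToyDataConfig:
    """Hypothetical stand-in for a DataConfig subclass (not the real class)."""
    input_path: str = ""
    global_batch_size: int = 32
    seq_length: int = 128
    is_training: bool = True


# A training config and an evaluation variant that differs in two fields only.
train_cfg = ToyDataConfig(input_path="/tmp/train.tfrecord")
eval_cfg = dataclasses.replace(
    train_cfg, input_path="/tmp/eval.tfrecord", is_training=False)

print(dataclasses.asdict(eval_cfg))
```

Keeping every dataset setting in one object is what lets the rest of the pipeline stay unchanged when only the data source differs.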

### DataLoader: The Abstract Base Class

**`DataLoader`** in [`official/nlp/data/data_loader.py`](https://github.com/tensorflow/models/blob/main/official/nlp/data/data_loader.py) defines the interface all concrete loaders must implement. Subclasses override the `load(self, input_context=None)` method to return a `tf.data.Dataset`.

Concrete loaders typically follow a three-step pipeline. The base class itself only requires `load`; the steps below are a convention:
1. **`_decode`**: Deserialize raw `tf.Example` records
2. **`_parse`**: Transform decoded tensors into model-friendly dictionaries
3. **`InputReader.read`**: Stitch the two functions together and apply sharding and batching
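
The three steps can be mimicked without TensorFlow to show how they compose. In this sketch JSON strings stand in for serialized `tf.Example` records, and `decode`, `parse`, and `read` are hypothetical plain functions, not the library's implementations.

```python
import json


def decode(record: str) -> dict:
    """Step 1: deserialize one raw record (JSON stands in for tf.Example)."""
    return json.loads(record)


def parse(example: dict, seq_length: int) -> dict:
    """Step 2: pad or truncate token ids and build the model input dict."""
    ids = example["ids"][:seq_length]
    pad = seq_length - len(ids)
    return {
        "input_word_ids": ids + [0] * pad,
        "input_mask": [1] * len(ids) + [0] * pad,
        "label_ids": example["label"],
    }


def read(records, seq_length, batch_size):
    """Step 3: chain decode and parse over the records, then batch."""
    parsed = [parse(decode(r), seq_length) for r in records]
    return [parsed[i:i + batch_size] for i in range(0, len(parsed), batch_size)]


raw = ['{"ids": [5, 6, 7], "label": 1}', '{"ids": [8], "label": 0}',
       '{"ids": [9, 10, 11, 12, 13], "label": 1}']
batches = read(raw, seq_length=4, batch_size=2)
print(batches[0][0])
```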

### DataLoaderFactory: The Registry

**`DataLoaderFactory`**, implemented as the `data_loader_factory` module in [`official/nlp/data/data_loader_factory.py`](https://github.com/tensorflow/models/blob/main/official/nlp/data/data_loader_factory.py), maintains a module-level registry that maps `DataConfig` subclasses to their corresponding `DataLoader` classes. Task definitions therefore never need to name a loader class directly.

## How the Factory Registers and Resolves Loaders

The factory uses a decorator-based registration system to link configurations to implementations.

### Registration via Decorator

Concrete loaders register themselves using `@data_loader_factory.register_data_loader_cls`, which stores the mapping in a module-level dictionary named `_REGISTERED_DATA_LOADER_CLS`:

```python
# Excerpt from official/nlp/data/sentence_prediction_dataloader.py
from official.nlp.data import data_loader
from official.nlp.data import data_loader_factory


@data_loader_factory.register_data_loader_cls(SentencePredictionDataConfig)
class SentencePredictionDataLoader(data_loader.DataLoader):
    def load(self, input_context=None):
        ...  # Builds and returns a tf.data.Dataset; body elided here.
```

### Runtime Resolution

NLP tasks obtain the correct loader by calling `get_data_loader` with a configuration instance. The factory inspects the config type, looks up the registered class, and instantiates the loader:

```python
# From build_inputs() in official/nlp/tasks/sentence_prediction.py
return data_loader_factory.get_data_loader(params).load(input_context)
```

Because resolution happens at runtime, the same `build_inputs` code works without modification for any config type that has a registered loader.

## Implementing the DataLoader Interface

A concrete `DataLoader` implementation typically constructs an `InputReader` from [`official/core/input_reader.py`](https://github.com/tensorflow/models/blob/main/official/core/input_reader.py) to handle dataset logistics:

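```python
# Method of a DataLoader subclass. Requires:
#   from official.common import dataset_fn
#   from official.core import input_reader
def load(self, input_context=None):
    """Builds the tf.data.Dataset described by self._params."""
    reader = input_reader.InputReader(
        dataset_fn=dataset_fn.pick_dataset_fn(self._params.file_type),
        params=self._params,
        decoder_fn=self._decode,
        parser_fn=self._parse)
    return reader.read(input_context)
```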

**`InputReader`** manages:
- Sharding across multiple accelerators via `input_context`
- Batching to `global_batch_size`
- Prefetching and optimization

The `dataset_fn.pick_dataset_fn` utility in [`official/common/dataset_fn.py`](https://github.com/tensorflow/models/blob/main/official/common/dataset_fn.py) returns the low-level dataset constructor that matches the `file_type` parameter in the config, for example `tf.data.TFRecordDataset` for `tfrecord`. The set of accepted `file_type` values depends on the version of that module you have installed.
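
To make the sharding and batching responsibilities concrete, the following plain-Python sketch mimics them on a list of file names. The helper names are hypothetical; in TensorFlow the corresponding pieces are `tf.data.Dataset.shard` and `tf.distribute.InputContext.get_per_replica_batch_size`, and the exact order in which `InputReader` applies them is not shown here.

```python
def shard(items, num_input_pipelines, input_pipeline_id):
    """Keeps every num_input_pipelines-th item, starting at input_pipeline_id."""
    return items[input_pipeline_id::num_input_pipelines]


def per_replica_batch_size(global_batch_size, num_replicas_in_sync):
    """Splits the global batch evenly; rejects sizes that do not divide."""
    if global_batch_size % num_replicas_in_sync:
        raise ValueError(
            f"global_batch_size {global_batch_size} is not divisible by "
            f"{num_replicas_in_sync} replicas")
    return global_batch_size // num_replicas_in_sync


files = [f"part-{i:05d}.tfrecord" for i in range(6)]
# Two input pipelines (for example, two hosts) each read a disjoint subset.
host0 = shard(files, num_input_pipelines=2, input_pipeline_id=0)
host1 = shard(files, num_input_pipelines=2, input_pipeline_id=1)
print(host0, host1)
print(per_replica_batch_size(64, 8))
```

The practical consequence is that `global_batch_size` in the config should be a multiple of the number of replicas you train on.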

## Practical Code Examples for NLP Tasks

### Loading Sentence Prediction Datasets (TFRecord)

For classification tasks using preprocessed TFRecord files:

```python
from official.nlp.data.sentence_prediction_dataloader import (
    SentencePredictionDataConfig)
from official.nlp.data import data_loader_factory

# Configure the dataset
cfg = SentencePredictionDataConfig(
    input_path='gs://my-bucket/sentences.tfrecord',
    global_batch_size=64,
    seq_length=128,
    label_type='int',
    is_training=False)

# Factory instantiates SentencePredictionDataLoader automatically
loader = data_loader_factory.get_data_loader(cfg)
dataset = loader.load()

# Iterate for evaluation
for batch in dataset.take(1):
    print(batch)  # Contains 'input_word_ids', 'input_mask', etc.
```

### Loading Tagging Datasets with Sentence IDs

For NER or POS tagging that requires sentence identifiers:

```python
from official.nlp.data.tagging_dataloader import TaggingDataConfig
from official.nlp.data import data_loader_factory

cfg = TaggingDataConfig(
    input_path='gs://my-bucket/tagging.tfrecord',
    global_batch_size=32,
    seq_length=128,
    include_sentence_id=True,
    is_training=True)

loader = data_loader_factory.get_data_loader(cfg)
ds = loader.load()

for features, labels in ds.take(1):
    print(features['input_word_ids'].shape, labels.shape)
```

### Loading Raw Text via TensorFlow Datasets

For direct text loading without pre-tokenized TFRecords:

```python
from official.nlp.data.sentence_prediction_dataloader import (
    SentencePredictionTextDataConfig)
from official.nlp.data import data_loader_factory

cfg = SentencePredictionTextDataConfig(
    tfds_name='glue/sst2',
    tfds_split='train',
    # Assumption: the text loader needs the names of the text features to
    # tokenize; 'sentence' is the text feature of glue/sst2 in TFDS.
    text_fields=['sentence'],
    vocab_file='gs://my-bucket/vocab.txt',
    tokenization='WordPiece',
    seq_length=128,
    global_batch_size=32)

loader = data_loader_factory.get_data_loader(cfg)
ds = loader.load()

for batch in ds.take(1):
    print(batch)  # Tokenized and padded automatically
```

## Extending the Framework for Custom Datasets

To add support for a new dataset format:

1. **Define a `DataConfig` subclass** with task-specific parameters
2. **Inherit from `DataLoader`** and implement `_decode`, `_parse`, and `load`
3. **Apply the registration decorator**: `@data_loader_factory.register_data_loader_cls(MyConfig)`

No modifications to task code are required: the factory resolves the new loader whenever the corresponding config is passed to `get_data_loader`, provided the module that defines the loader has been imported so that the decorator has run.
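
Put together, the three steps look roughly like the sketch below. It is deliberately library-free so it runs anywhere: `CsvLineConfig`, `CsvLineLoader`, the abstract `DataLoader` stand-in, and the two-function registry are all hypothetical and only mirror the shape of the real classes. A real loader would subclass `data_loader.DataLoader` and return a `tf.data.Dataset` from `load`.

```python
import abc
import dataclasses

_REGISTRY = {}


def register_data_loader_cls(config_cls):
    def decorator(loader_cls):
        _REGISTRY[config_cls] = loader_cls
        return loader_cls
    return decorator


def get_data_loader(config):
    return _REGISTRY[type(config)](config)


class DataLoader(abc.ABC):
    """Stand-in for the abstract base: subclasses must provide load()."""

    @abc.abstractmethod
    def load(self, input_context=None):
        ...


# Step 1: a config with the dataset-specific parameters.
@dataclasses.dataclass
class CsvLineConfig:
    lines: tuple = ()
    global_batch_size: int = 2


# Steps 2 and 3: implement the loader and register it for the config.
@register_data_loader_cls(CsvLineConfig)
class CsvLineLoader(DataLoader):
    def __init__(self, params):
        self._params = params

    def _decode(self, line):
        # Split on the last comma so commas inside the text survive.
        text, label = line.rsplit(",", 1)
        return {"text": text, "label": label}

    def _parse(self, record):
        return {"tokens": record["text"].split(),
                "label_ids": int(record["label"])}

    def load(self, input_context=None):
        rows = [self._parse(self._decode(l)) for l in self._params.lines]
        size = self._params.global_batch_size
        return [rows[i:i + size] for i in range(0, len(rows), size)]


cfg = CsvLineConfig(lines=("good movie,1", "dull, slow plot,0", "fine,1"))
batches = get_data_loader(cfg).load()
print(len(batches), batches[0][1])
```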

## Summary

- **TensorFlow Models** uses a registry-based data loading framework in `official/nlp/data/` that separates configuration from implementation
- **`DataConfig`** dataclasses define dataset parameters, while **`DataLoader`** subclasses implement the `tf.data.Dataset` construction logic
- **`DataLoaderFactory`** maps configs to loaders via the `@register_data_loader_cls` decorator, enabling runtime resolution without hard-coded imports
- **`InputReader`** in [`official/core/input_reader.py`](https://github.com/tensorflow/models/blob/main/official/core/input_reader.py) handles cross-replica sharding, batching, and prefetching uniformly across all NLP tasks
- The system supports TFRecord, TFDS, and raw text sources through the same unified interface

## Frequently Asked Questions

### What is the difference between DataConfig and DataLoader?

**`DataConfig`** is a dataclass that stores hyperparameters like file paths, batch size, and sequence length, while **`DataLoader`** is a class that contains the actual logic to read files and return a `tf.data.Dataset`. The config describes what to load; the loader implements how to load it.

### How do I register a custom data loader for a new NLP dataset?

Create a new `DataConfig` subclass for your dataset parameters, then implement a `DataLoader` that inherits from the base class in [`official/nlp/data/data_loader.py`](https://github.com/tensorflow/models/blob/main/official/nlp/data/data_loader.py). Decorate your loader class with `@data_loader_factory.register_data_loader_cls(YourConfigClass)`. The factory will automatically instantiate your loader when your config is passed to `get_data_loader()`.

### Can I use these data loaders with non-TFRecord formats?

Yes. Each loader calls `dataset_fn.pick_dataset_fn` from [`official/common/dataset_fn.py`](https://github.com/tensorflow/models/blob/main/official/common/dataset_fn.py) with the `file_type` parameter from your `DataConfig` and hands the resulting reader to `InputReader`. Beyond file-based formats, the framework also reads TensorFlow Datasets (TFDS) and raw text through dedicated configs such as `SentencePredictionTextDataConfig`.

### Where does the batching and prefetching logic live?

The **`InputReader`** class in [`official/core/input_reader.py`](https://github.com/tensorflow/models/blob/main/official/core/input_reader.py) handles all dataset optimization including sharding across accelerators, batching to the specified `global_batch_size`, and applying `tf.data` optimizations like prefetching and caching. DataLoader implementations delegate these concerns to `InputReader` rather than reimplementing them.