# How to Implement a TransformerEncoderBlock Layer with tfm.nlp

> Learn to implement a TransformerEncoderBlock layer with tfm.nlp. Explore RMSNorm, grouped query attention, and block-sparse attention in this guide.

- Repository: [tensorflow/models](https://github.com/tensorflow/models)
- Tags: how-to-guide
- Published: 2026-02-28

---

**The `TransformerEncoderBlock` class in `tfm.nlp` provides a configurable Keras layer implementing the standard transformer encoder with support for RMSNorm, grouped query attention, and block-sparse attention, located in [`official/nlp/modeling/layers/transformer_encoder_block.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/transformer_encoder_block.py).**

The TensorFlow Models repository (`tensorflow/models`) offers a robust natural language processing toolkit through the `tfm.nlp` module. When you implement a **TransformerEncoderBlock layer with tfm.nlp**, you leverage a production-tested implementation that balances research flexibility with deployment performance, as defined in [`official/nlp/modeling/layers/transformer_encoder_block.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/transformer_encoder_block.py).

## Core Architecture Components

The `TransformerEncoderBlock` implements the original "Attention Is All You Need" encoder architecture with modern optimizations. According to the source code in [`official/nlp/modeling/layers/transformer_encoder_block.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/transformer_encoder_block.py), the layer composes four primary sub-components:

### Multi-Head Attention Mechanisms

The attention subsystem defaults to `tf.keras.layers.MultiHeadAttention` but automatically upgrades to optimized variants based on configuration. In lines 86-108, the constructor selects `multi_query_attention.MultiHeadAttention` when `num_kv_heads` is specified, or `block_sparse_attention.MultiHeadAttention` when `src_block_size` and `tgt_block_size` are provided. The layer handles query-key-value projections, optional bias terms, and attention dropout through the `attention_dropout` parameter.

### Feed-Forward Network (FFN)

The FFN sub-layer, implemented in lines 140-162, consists of two dense transformations. The first expands the input to `inner_dim` and applies `inner_activation`; the second projects back to the input dimension (or to `output_last_dim` if specified). Dropout at the `inner_dropout` rate is applied to the intermediate activations, and `output_dropout` is applied after the second projection.
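To make these sizes concrete, the arithmetic below counts the trainable weights of the two dense layers for a BERT-base-style configuration. This is a sketch, not library code: the hidden size of 768 and the helper name `ffn_param_count` are illustrative assumptions, and the count assumes each dense layer has a bias term.

```python
def ffn_param_count(hidden_dim, inner_dim, output_dim=None, use_bias=True):
    """Count weights in a two-layer feed-forward sub-layer (hypothetical helper)."""
    # The second projection returns to the input width unless told otherwise.
    output_dim = hidden_dim if output_dim is None else output_dim
    # First dense layer: hidden_dim -> inner_dim.
    expand = hidden_dim * inner_dim + (inner_dim if use_bias else 0)
    # Second dense layer: inner_dim -> output_dim.
    project = inner_dim * output_dim + (output_dim if use_bias else 0)
    return expand + project


bert_base_ffn = ffn_param_count(hidden_dim=768, inner_dim=3072)
print(bert_base_ffn)  # 4722432
```

With `inner_dim` four times the hidden size, the FFN holds roughly 4.7 million weights per block under these assumptions.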

### Normalization Strategies

Normalization supports both standard `LayerNormalization` and `RMSNorm`. Lines 165-176 configure the normalizer based on `use_rms_norm` and `norm_first` parameters. The `RMSNorm` implementation (lines 71-78) removes mean centering for computational efficiency. When `norm_first=True`, normalization applies before attention and FFN sub-layers (pre-norm); otherwise, it applies after (post-norm).
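The difference between the two normalizers can be sketched in plain Python. This is an illustration of the math only, not the `RMSNorm` class from the repository: the learned scale and offset are omitted, and the `eps` value and function names are assumptions.

```python
import math


def layer_norm(x, eps=1e-6):
    """Standard layer normalization: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMS normalization: divide by the root-mean-square, no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


features = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(features))  # centered around zero
print(rms_norm(features))    # rescaled only; the mean stays positive
```

`layer_norm` needs two passes over the statistics (mean, then variance), while `rms_norm` needs only the mean of squares, which is where the computational saving comes from.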

### Residual Connections and Regularization

Residual connections wrap both attention and FFN sub-layers. Lines 190-208 implement the residual logic with an optional `use_query_residual` path that preserves original query tensors. Three independent dropout rates control regularization: `attention_dropout` for attention output, `inner_dropout` for FFN intermediate states, and `output_dropout` for the final block output.
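The ordering of normalization and the residual addition can be sketched as follows. This is a simplified illustration under stated assumptions: `apply_sublayer` is a hypothetical helper, dropout and the `use_query_residual` option are left out, and scalar stand-ins replace real tensors and layers.

```python
def apply_sublayer(x, sublayer, norm, norm_first):
    """Combine a sub-layer with its residual connection and normalization."""
    if norm_first:
        # Pre-norm: normalize the input, transform it, then add the raw input back.
        return x + sublayer(norm(x))
    # Post-norm: transform the input, add the residual, then normalize the sum.
    return norm(x + sublayer(x))


# Toy stand-ins so the two orderings give visibly different results.
def halve(v):
    return v / 2.0   # plays the role of the normalizer


def triple(v):
    return v * 3.0   # plays the role of attention or the FFN


print(apply_sublayer(4.0, triple, halve, norm_first=True))   # 10.0
print(apply_sublayer(4.0, triple, halve, norm_first=False))  # 8.0
```

In the pre-norm case the residual path carries the unnormalized input straight through, which is the property usually credited with easier optimization of deep stacks.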

## Configuration Patterns

### Standard BERT-Style Configuration

For traditional BERT architectures, use post-normalization with standard LayerNorm:

```python
import tensorflow_models as tfm

block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=12,
    inner_dim=3072,
    inner_activation='gelu',
    norm_first=False,        # Post-norm (default)
    use_rms_norm=False,      # Standard LayerNorm
    output_dropout=0.1,
    attention_dropout=0.1,
    inner_dropout=0.1,
)
```

### Pre-Normalization and RMSNorm

Deep transformer stacks often benefit from pre-normalization. Set `norm_first=True` to apply normalization before sub-layers, and enable `use_rms_norm=True` to use the RMSNorm implementation from lines 71-78 instead of standard LayerNorm.

```python
deep_block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=8,
    inner_dim=2048,
    inner_activation='relu',
    norm_first=True,        # Pre-norm for training stability
    use_rms_norm=True,      # Lightweight normalization
    output_dropout=0.1,
)
```

### Grouped Query Attention (GQA)

Reduce memory bandwidth during inference by sharing key-value heads across query heads. Set `num_kv_heads` to a value smaller than `num_attention_heads`, and set `enable_gqa_optimization=True` to activate the optimized path in [`multi_query_attention.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/multi_query_attention.py):

```python
gqa_block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=12,
    inner_dim=3072,
    inner_activation='gelu',
    num_kv_heads=2,                   # 2 KV heads vs 12 query heads
    enable_gqa_optimization=True,     # Optimized attention path
)
```
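To illustrate what sharing means for the 12-query-head, 2-KV-head configuration above, the sketch below maps each query head to its KV head and compares the number of key and value entries that would be stored per layer. The helper names, the contiguous grouping scheme, the per-head size of 64, and the sequence length of 128 are assumptions for illustration; they are not taken from the library source.

```python
def kv_head_for_query_head(query_head, num_attention_heads, num_kv_heads):
    """Map a query head index to the KV head it shares (contiguous groups assumed)."""
    if num_attention_heads % num_kv_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_kv_heads")
    group_size = num_attention_heads // num_kv_heads
    return query_head // group_size


def kv_entries(seq_len, num_kv_heads, head_dim):
    """Number of key plus value entries for one sequence in one layer."""
    return 2 * seq_len * num_kv_heads * head_dim


mapping = [kv_head_for_query_head(h, 12, 2) for h in range(12)]
print(mapping)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

full = kv_entries(seq_len=128, num_kv_heads=12, head_dim=64)
grouped = kv_entries(seq_len=128, num_kv_heads=2, head_dim=64)
print(full // grouped)  # 6
```

Under these assumptions, six query heads read from each KV head, and the stored keys and values shrink by the same factor of six.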

### Block-Sparse Attention for Long Sequences

For long sequences where full attention is too expensive, configure block-sparse attention by specifying block sizes:

```python
long_seq_block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=16,
    inner_dim=4096,
    inner_activation='gelu',
    src_block_size=64,
    tgt_block_size=64,
    # Automatically selects block_sparse_attention.MultiHeadAttention
)
```
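The saving can be estimated with a small calculation. The sketch below assumes a block-diagonal pattern in which each query block attends to exactly one key block; that pattern, the helper names, and the sequence length of 4096 are assumptions made for illustration and may not match the sparsity layout the library actually uses.

```python
def dense_score_count(seq_len):
    """Attention scores per head for full self-attention."""
    return seq_len * seq_len


def block_diagonal_score_count(seq_len, src_block_size, tgt_block_size):
    """Attention scores per head when each query block sees one key block."""
    if seq_len % src_block_size or seq_len % tgt_block_size:
        raise ValueError("seq_len must be divisible by both block sizes")
    num_src_blocks = seq_len // src_block_size
    num_tgt_blocks = seq_len // tgt_block_size
    if num_src_blocks != num_tgt_blocks:
        raise ValueError("query and key sequences must split into equally many blocks")
    return num_src_blocks * src_block_size * tgt_block_size


dense = dense_score_count(4096)
sparse = block_diagonal_score_count(4096, src_block_size=64, tgt_block_size=64)
print(dense, sparse, dense // sparse)  # 16777216 262144 64
```

Under this assumed pattern the score count grows linearly with sequence length for a fixed block size, rather than quadratically.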

## Complete Implementation Example

Integrate multiple blocks within a Keras model for text classification:

```python
import tensorflow as tf
import tensorflow_models as tfm

def build_transformer_classifier(vocab_size, max_seq_len, embed_dim=768, num_layers=2):
    inputs = tf.keras.Input(shape=(max_seq_len,), dtype=tf.int32)
    
    # Token embeddings (positional embeddings are omitted in this minimal example)
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)
    x = tf.keras.layers.Dropout(0.1)(x)

    # Stack TransformerEncoderBlock layers
    for _ in range(num_layers):
        x = tfm.nlp.layers.TransformerEncoderBlock(
            num_attention_heads=12,
            inner_dim=3072,
            inner_activation='gelu',
            norm_first=False,
            output_dropout=0.1,
            attention_dropout=0.1,
            inner_dropout=0.1,
        )(x)

    # Classification head: mean-pool over the sequence, then a binary output
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(pooled)

    return tf.keras.Model(inputs, outputs)

model = build_transformer_classifier(vocab_size=30522, max_seq_len=128)
```

## Key Source Files

The implementation spans several specialized modules within `tensorflow/models`:

- **[`official/nlp/modeling/layers/transformer_encoder_block.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/transformer_encoder_block.py)**: Core `TransformerEncoderBlock` class and `RMSNorm` helper (lines 71-78, 86-126, 140-162, 165-176, 190-208).
- **[`official/nlp/modeling/layers/multi_query_attention.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/multi_query_attention.py)**: Optimized GQA implementation activated via `num_kv_heads`.
- **[`official/nlp/modeling/layers/block_sparse_attention.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/block_sparse_attention.py)**: Sub-quadratic attention for long sequences.
- **[`official/nlp/modeling/layers/talking_heads_attention.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/talking_heads_attention.py)**: Talking-heads variant enabled by `enable_talking_heads`.
- **[`official/nlp/modeling/layers/util.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/util.py)**: Shape utilities and mask creation helpers.

## Summary

- The `TransformerEncoderBlock` in `tfm.nlp` provides a **configurable transformer encoder** supporting both classic and modern architectures.
- Configure **normalization placement** via `norm_first` (pre-norm vs post-norm) and **normalization type** via `use_rms_norm`.
- Enable **memory-efficient attention** through Grouped Query Attention (`num_kv_heads`) or **long-context support** via block-sparse attention (`src_block_size`, `tgt_block_size`).
- The layer automatically selects optimized attention implementations from [`multi_query_attention.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/multi_query_attention.py) or [`block_sparse_attention.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/block_sparse_attention.py) based on parameter configuration.
- All components integrate seamlessly with standard Keras workflows, accepting tensors of shape `(batch, seq_len, embed_dim)`.

## Frequently Asked Questions

### What is the difference between norm_first=True and norm_first=False in TransformerEncoderBlock?

Setting `norm_first=True` applies LayerNorm or RMSNorm **before** the attention and FFN sub-layers, which often improves training stability in deep transformer stacks. When `norm_first=False` (the default, matching the original BERT), normalization applies **after** each sub-layer's residual addition. According to the source code in lines 165-176, this parameter controls where the normalization layer sits relative to the residual branch.

### How does Grouped Query Attention (GQA) reduce memory usage in tfm.nlp?

GQA reduces memory bandwidth by sharing key and value projections across multiple query heads. When you set `num_kv_heads` to a value lower than `num_attention_heads` (for example, 2 KV heads versus 12 query heads), the implementation in [`multi_query_attention.py`](https://github.com/tensorflow/models/blob/main/official/nlp/modeling/layers/multi_query_attention.py) broadcasts the smaller KV tensors during attention computation. This shrinks the key and value tensors that must be stored and read during inference, usually at a small cost in model quality compared with full multi-head attention.

### When should I use block-sparse attention instead of standard multi-head attention?

Use block-sparse attention when sequences are long enough that full quadratic attention becomes prohibitive in memory or compute; the exact threshold depends on your hardware, batch size, and model width. When you specify `src_block_size` and `tgt_block_size` in the `TransformerEncoderBlock` constructor, the layer automatically switches to the `block_sparse_attention.MultiHeadAttention` implementation, whose cost grows sub-quadratically with sequence length instead of as O(n²).

### Can I combine RMSNorm with pre-normalization in the same encoder block?

Yes. Set both `use_rms_norm=True` and `norm_first=True` in the constructor. This configuration, common in modern LLM architectures such as Llama, applies RMSNorm before each sub-layer. The `RMSNorm` class (lines 71-78) computes normalization without mean subtraction, offering slight computational savings compared to standard LayerNorm while maintaining training stability in deep pre-norm stacks.