How to Implement a TransformerEncoderBlock Layer with tfm.nlp

The TransformerEncoderBlock class in tfm.nlp provides a configurable Keras layer implementing the standard transformer encoder with support for RMSNorm, grouped query attention, and block-sparse attention, located in official/nlp/modeling/layers/transformer_encoder_block.py.

The TensorFlow Models repository (tensorflow/models) exposes its natural language processing toolkit through the tfm.nlp module. Building on its TransformerEncoderBlock gives you a maintained implementation that serves both research experimentation and deployment, rather than a hand-rolled encoder layer.

Core Architecture Components

The TransformerEncoderBlock implements the original "Attention Is All You Need" encoder architecture with modern optimizations. According to the source code in official/nlp/modeling/layers/transformer_encoder_block.py, the layer composes four primary sub-components:

Multi-Head Attention Mechanisms

The attention subsystem defaults to tf.keras.layers.MultiHeadAttention but automatically upgrades to optimized variants based on configuration. In lines 86-108, the constructor selects multi_query_attention.MultiHeadAttention when num_kv_heads is specified, or block_sparse_attention.MultiHeadAttention when src_block_size and tgt_block_size are provided. The layer handles query-key-value projections, optional bias terms, and attention dropout through the attention_dropout parameter.
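
The selection rule described above can be sketched in plain Python. This is an illustration only, not the library's code: pick_attention_variant is a hypothetical helper, and the precedence when both num_kv_heads and block sizes are set is an assumption, since the text does not cover that case.

```python
from typing import Optional


def pick_attention_variant(
    num_kv_heads: Optional[int] = None,
    src_block_size: Optional[int] = None,
    tgt_block_size: Optional[int] = None,
) -> str:
    """Hypothetical helper: returns the name of the attention class the text describes."""
    # Assumption: block sizes take precedence if both options are set.
    if src_block_size is not None and tgt_block_size is not None:
        return "block_sparse_attention.MultiHeadAttention"
    if num_kv_heads is not None:
        return "multi_query_attention.MultiHeadAttention"
    return "tf.keras.layers.MultiHeadAttention"


print(pick_attention_variant())
print(pick_attention_variant(num_kv_heads=2))
print(pick_attention_variant(src_block_size=64, tgt_block_size=64))
```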

Feed-Forward Network (FFN)

The FFN sub-layer, implemented in lines 140-162, consists of two dense transformations. The first expands input dimensions to inner_dim and applies inner_activation, while the second projects back to the original dimension (or output_last_dim if specified). The inner_dropout rate applies to the activated intermediate output, and output_dropout applies after the second projection.
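
To make the expand-then-project shape flow concrete, here is a minimal NumPy sketch of a two-layer feed-forward network. It is not the tfm.nlp implementation: the weights are random, the dimensions are toy values chosen for illustration, GELU uses the tanh approximation, and dropout is omitted as it would be at inference time.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def feed_forward(x, w_in, b_in, w_out, b_out, activation=gelu):
    """Expand to inner_dim, apply the activation, project back (dropout omitted)."""
    hidden = activation(x @ w_in + b_in)  # (batch, seq_len, inner_dim)
    return hidden @ w_out + b_out         # (batch, seq_len, embed_dim)


rng = np.random.default_rng(0)
batch, seq_len, embed_dim, inner_dim = 2, 4, 8, 32  # toy sizes, not library defaults
x = rng.normal(size=(batch, seq_len, embed_dim))
w_in = rng.normal(scale=0.02, size=(embed_dim, inner_dim))
b_in = np.zeros(inner_dim)
w_out = rng.normal(scale=0.02, size=(inner_dim, embed_dim))
b_out = np.zeros(embed_dim)

y = feed_forward(x, w_in, b_in, w_out, b_out)
print(y.shape)  # (2, 4, 8)
```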

Normalization Strategies

Normalization supports both standard LayerNormalization and RMSNorm. Lines 165-176 configure the normalizer based on use_rms_norm and norm_first parameters. The RMSNorm implementation (lines 71-78) removes mean centering for computational efficiency. When norm_first=True, normalization applies before attention and FFN sub-layers (pre-norm); otherwise, it applies after (post-norm).
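
The difference between the two normalizers is easiest to see side by side. The NumPy sketch below follows the textbook definitions and leaves out the learned scale and offset parameters; it is an assumption-level illustration, not a copy of the RMSNorm class in the source file.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Subtract the mean, then divide by the standard deviation.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def rms_norm(x, eps=1e-6):
    # No mean subtraction: divide by the root mean square only.
    mean_sq = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps)


x = np.array([[1.0, 2.0, 3.0, 6.0]])
ln = layer_norm(x)
rn = rms_norm(x)

print(ln.mean(axis=-1))                     # approximately 0: LayerNorm centers the features
print(np.sqrt(np.mean(rn**2, axis=-1)))     # approximately 1: RMSNorm rescales to unit RMS
print(rn.mean(axis=-1))                     # nonzero: RMSNorm does not center
```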

Residual Connections and Regularization

Residual connections wrap both attention and FFN sub-layers. Lines 190-208 implement the residual logic with an optional use_query_residual path that preserves original query tensors. Three independent dropout rates control regularization: attention_dropout for attention output, inner_dropout for FFN intermediate states, and output_dropout for the final block output.
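
The two residual wirings selected by norm_first can be sketched generically. In this NumPy illustration, toy_sublayer is a hypothetical stand-in for either attention or the FFN, dropout and use_query_residual are left out, and the wiring follows the common pre-norm and post-norm definitions rather than the exact source lines.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def post_norm_step(x, sublayer, norm=layer_norm):
    # norm_first=False: add the residual, then normalize the sum.
    return norm(x + sublayer(x))


def pre_norm_step(x, sublayer, norm=layer_norm):
    # norm_first=True: normalize the sub-layer input; the residual path stays un-normalized.
    return x + sublayer(norm(x))


def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return 0.5 * x


x = np.array([[1.0, 2.0, 3.0, 6.0]])
post = post_norm_step(x, toy_sublayer)
pre = pre_norm_step(x, toy_sublayer)

print(post.mean())  # approximately 0: the block output itself is normalized
print(pre.mean())   # approximately 3: the input passes through the residual path unchanged
```

The pre-norm output keeps the scale of the input, which is one common explanation for why gradients flow more easily through deep pre-norm stacks.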

Configuration Patterns

Standard BERT-Style Configuration

For traditional BERT architectures, use post-normalization with standard LayerNorm:

import tensorflow_models as tfm

block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=12,
    inner_dim=3072,
    inner_activation='gelu',
    norm_first=False,        # Post-norm (default)
    use_rms_norm=False,      # Standard LayerNorm
    output_dropout=0.1,
    attention_dropout=0.1,
    inner_dropout=0.1,
)
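
As a rough sanity check on what this configuration costs, the parameter count of one such block can be worked out by hand. The sketch below assumes an input width of 768 (the configuration above does not state it), per-head dimension equal to hidden / num_heads, and biases on every dense layer; encoder_block_params is a hypothetical helper, not part of tfm.nlp.

```python
def encoder_block_params(hidden, inner_dim):
    """Trainable parameters in one post-norm encoder block under the stated assumptions."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V, and output projections
    ffn = (hidden * inner_dim + inner_dim) + (inner_dim * hidden + hidden)
    layer_norms = 2 * (2 * hidden)              # scale and offset for two LayerNorms
    return attention + ffn + layer_norms


total = encoder_block_params(hidden=768, inner_dim=3072)
print(total)  # 7087872
```

With the per-head dimension tied to hidden / num_heads, the number of heads does not change the total; it only changes how the projections are split.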

Pre-Normalization and RMSNorm

Deep transformer stacks often benefit from pre-normalization. Set norm_first=True to apply normalization before sub-layers, and enable use_rms_norm=True to use the RMSNorm implementation from lines 71-78 instead of standard LayerNorm.

deep_block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=8,
    inner_dim=2048,
    inner_activation='relu',
    norm_first=True,         # Pre-norm for training stability
    use_rms_norm=True,       # Lightweight normalization
    output_dropout=0.1,
)

Grouped Query Attention (GQA)

Reduce memory bandwidth during inference by sharing key-value heads across query heads. Set num_kv_heads lower than num_attention_heads and pass enable_gqa_optimization=True to activate the optimized kernel from multi_query_attention.py:

gqa_block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=12,
    inner_dim=3072,
    inner_activation='gelu',
    num_kv_heads=2,                   # 2 KV heads vs 12 query heads
    enable_gqa_optimization=True,     # Optimized attention path
)
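
What "sharing key-value heads" means computationally can be shown with a small NumPy sketch. It repeats each KV head across its group of query heads and then runs ordinary scaled dot-product attention. This is the generic GQA formulation under toy shapes; the optimized path in multi_query_attention.py may avoid the explicit repeat, and masking and dropout are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """q: (batch, num_heads, seq, dim); k, v: (batch, num_kv_heads, seq, dim)."""
    num_heads, num_kv_heads = q.shape[1], k.shape[1]
    assert num_heads % num_kv_heads == 0, "query heads must split evenly into KV groups"
    group = num_heads // num_kv_heads
    # Repeat each KV head so every query head in a group sees the same keys and values.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(1, 12, 5, 64))  # 12 query heads
k = rng.normal(size=(1, 2, 5, 64))   # 2 shared KV heads
v = rng.normal(size=(1, 2, 5, 64))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (1, 12, 5, 64)
```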

Block-Sparse Attention for Long Sequences

For sequences exceeding standard transformer lengths, configure block-sparse attention patterns by specifying block sizes:

long_seq_block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=16,
    inner_dim=4096,
    inner_activation='gelu',
    src_block_size=64,
    tgt_block_size=64,
    # Automatically selects block_sparse_attention.MultiHeadAttention
)
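
To get a feel for the savings, the sketch below builds a block-diagonal attention mask and counts how many query-key pairs survive. The block-diagonal pattern with a single block size for source and target is an assumption made for illustration; the pattern actually used in block_sparse_attention.py may differ, and block_diagonal_mask is a hypothetical helper.

```python
import numpy as np


def block_diagonal_mask(seq_len, block_size):
    """1 where query i and key j fall in the same block, else 0."""
    block_ids = np.arange(seq_len) // block_size
    return (block_ids[:, None] == block_ids[None, :]).astype(np.int32)


mask = block_diagonal_mask(seq_len=256, block_size=64)
full_pairs = 256 * 256
kept_pairs = int(mask.sum())
print(kept_pairs, full_pairs)  # 16384 65536
```

Under this pattern the number of scored pairs grows as seq_len * block_size rather than seq_len squared, which is where the long-sequence benefit comes from.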

Complete Implementation Example

Integrate multiple blocks within a Keras model for text classification:

import tensorflow as tf
import tensorflow_models as tfm

def build_transformer_classifier(vocab_size, max_seq_len, embed_dim=768, num_layers=2):
    inputs = tf.keras.Input(shape=(max_seq_len,), dtype=tf.int32)

    # Token embedding (this minimal example omits positional encoding)
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)
    x = tf.keras.layers.Dropout(0.1)(x)

    # Stack TransformerEncoderBlock layers
    for _ in range(num_layers):
        x = tfm.nlp.layers.TransformerEncoderBlock(
            num_attention_heads=12,
            inner_dim=3072,
            inner_activation='gelu',
            norm_first=False,
            output_dropout=0.1,
            attention_dropout=0.1,
            inner_dropout=0.1,
        )(x)

    # Classification head: mean-pool over the sequence, then a binary output
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(pooled)

    return tf.keras.Model(inputs, outputs)

model = build_transformer_classifier(vocab_size=30522, max_seq_len=128)

Key Source Files

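The implementation spans several specialized modules within tensorflow/models:

  • official/nlp/modeling/layers/transformer_encoder_block.py: the TransformerEncoderBlock layer itself.
  • multi_query_attention.py: the attention variant used when num_kv_heads is set.
  • block_sparse_attention.py: the attention variant used when src_block_size and tgt_block_size are set.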

Summary

  • The TransformerEncoderBlock in tfm.nlp provides a configurable transformer encoder supporting both classic and modern architectures.
  • Configure normalization placement via norm_first (pre-norm vs post-norm) and normalization type via use_rms_norm.
  • Enable memory-efficient attention through Grouped Query Attention (num_kv_heads) or long-context support via block-sparse attention (src_block_size, tgt_block_size).
  • The layer automatically selects optimized attention implementations from multi_query_attention.py or block_sparse_attention.py based on parameter configuration.
  • All components integrate seamlessly with standard Keras workflows, accepting tensors of shape (batch, seq_len, embed_dim).

Frequently Asked Questions

What is the difference between norm_first=True and norm_first=False in TransformerEncoderBlock?

Setting norm_first=True applies LayerNorm or RMSNorm before the attention and FFN sub-layers, which often improves training stability in deep transformers (24+ layers). When norm_first=False (the default, used in original BERT), normalization applies after each sub-layer. According to the source code in lines 165-176, this parameter controls the placement of the normalization layer within the residual branch.

How does Grouped Query Attention (GQA) reduce memory usage in tfm.nlp?

GQA reduces memory bandwidth by sharing key and value projections across multiple query heads. When you set num_kv_heads to a value lower than num_attention_heads (for example, 2 KV heads versus 12 query heads), the implementation in multi_query_attention.py broadcasts the smaller KV tensors during attention computation. This reduces the memory footprint of the KV cache during inference while maintaining model quality.
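
The cache saving is simple arithmetic. The sketch below compares 12 KV heads against 2 for a hypothetical 12-layer model with 64-dimensional heads, a 2,048-token sequence, and 2-byte (half-precision) values; all of these numbers are illustrative assumptions, and kv_cache_bytes is not a tfm.nlp function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    # Factor of 2 covers both the key tensor and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_value


mha = kv_cache_bytes(num_layers=12, num_kv_heads=12, head_dim=64, seq_len=2048)
gqa = kv_cache_bytes(num_layers=12, num_kv_heads=2, head_dim=64, seq_len=2048)
print(mha, gqa, mha // gqa)  # 75497472 12582912 6
```

The cache shrinks by the ratio num_attention_heads / num_kv_heads, here a factor of 6.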

When should I use block-sparse attention instead of standard multi-head attention?

Use block-sparse attention when processing sequences longer than 2,048 tokens where full quadratic attention becomes computationally prohibitive. By specifying src_block_size and tgt_block_size in the TransformerEncoderBlock constructor, the layer automatically switches to the block_sparse_attention.MultiHeadAttention implementation, which scales sub-quadratically with sequence length rather than O(n²).

Can I combine RMSNorm with pre-normalization in the same encoder block?

Yes. Set both use_rms_norm=True and norm_first=True in the constructor. This configuration—common in modern LLM architectures like Llama—applies RMSNorm before each sub-layer. The RMSNorm class (lines 71-78) computes normalization without mean subtraction, offering slight computational savings compared to standard LayerNorm while maintaining training stability in deep pre-norm stacks.
