How to Set Up High Availability for the Feast Registry

To achieve high availability for the Feast registry, configure a SQL-backed registry using PostgreSQL or MySQL, deploy multiple replicas via the Feast Operator's FeatureStore custom resource, and expose them through a Kubernetes ClusterIP Service that load-balances requests across all pods.

The feast-dev/feast repository provides a feature store platform where the registry maintains critical metadata about entities, data sources, and feature views. Setting up high availability for the Feast registry ensures your feature store remains operational during node failures, rolling updates, or traffic spikes, eliminating the single-point-of-failure risk inherent in the default file-based implementation.

Why the Default Registry Cannot Scale

By default, Feast stores the registry in a single file (registry.db) that operates as a single-writer system. Every update rewrites the whole file, so this approach cannot support concurrent modifications from multiple instances. Running several replicas against the same file leads to race conditions and lost or corrupted metadata, making it unsuitable for production high-availability deployments.
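To make the single-writer limitation concrete, here is a minimal Python sketch of a lost update. It does not use Feast: a JSON file stands in for registry.db, and the helper names (load, save, simulate_lost_update) are hypothetical. Two writers each read the whole file, add one entry, and rewrite the whole file.

```python
import json
import os
import tempfile


def load(path):
    """Read the whole 'registry' file (a JSON stand-in for registry.db)."""
    with open(path) as f:
        return json.load(f)


def save(path, registry):
    """Rewrite the whole file, replacing whatever was there before."""
    with open(path, "w") as f:
        json.dump(registry, f)


def simulate_lost_update():
    """Interleave two read-modify-write cycles against one file."""
    path = os.path.join(tempfile.mkdtemp(), "registry.json")
    save(path, {"feature_views": []})

    # Both writers read the same starting state before either one writes.
    snapshot_a = load(path)
    snapshot_b = load(path)

    # Writer A adds its feature view and rewrites the file.
    snapshot_a["feature_views"].append("driver_stats")
    save(path, snapshot_a)

    # Writer B rewrites the file from its stale snapshot.
    snapshot_b["feature_views"].append("customer_profile")
    save(path, snapshot_b)

    return load(path)["feature_views"]


if __name__ == "__main__":
    print(simulate_lost_update())
```

Only the second writer's entry survives, because its rewrite replaces the first writer's change. A transactional database avoids this by applying each change as its own atomic write.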

High Availability Architecture Components

Achieving true high availability requires three coordinated components: a transactional database backend, a multi-replica Kubernetes deployment, and a stable service endpoint.

SQL Registry Backend

The foundation of an HA deployment is the SQL Registry, which persists metadata in a relational database rather than a local file. According to docs/reference/registries/sql.md, this backend supports atomic writes from any replica and eliminates the "rewrite-whole-file" bottleneck. A SQLAlchemy-supported database such as PostgreSQL or MySQL provides the transactional consistency needed for concurrent access.

The implementation in go/internal/feast/registry/mysql_registry_store.go (and similar database-specific stores) handles concurrent persistence, ensuring that multiple registry pods can safely read and write metadata simultaneously without conflicts.

Multi-Replica Deployment with the Feast Operator

The Feast Operator manages registry scaling through the FeatureStore custom resource (CR). When you specify spec.replicas greater than 1, the operator automatically:

  1. Generates a Kubernetes Deployment with the requested replica count
  2. Sets the strategy to RollingUpdate to prevent downtime during version changes
  3. Creates a HorizontalPodAutoscaler (HPA) if spec.services.scaling.autoscaling is configured

As documented in infra/website/docs/blog/scaling-feast-feature-server.md, the operator enforces persistence validation at admission time. If you attempt to enable replicas > 1 while using a file-based backend, the request is rejected immediately with a clear error message.

Kubernetes Service Load Balancing

The operator creates a ClusterIP Service named feast-registry that provides a stable DNS endpoint (feast-registry.<namespace>.svc). This service load-balances gRPC and REST traffic across all registry pods. Clients using the Python SDK (sdk/python/feast/infra/registry/registry.py) connect to this service name transparently, requiring no code changes when scaling replica counts up or down.
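As a rough illustration of why clients are unaffected by the replica count, the sketch below builds the Service address from the naming pattern above. The helper names are hypothetical, the port is an assumption you must replace with whatever your Service exposes, and the sketch assumes the client uses Feast's remote registry type.

```python
def registry_endpoint(namespace, port, service="feast-registry"):
    """Build the in-cluster DNS address for the registry Service."""
    if not namespace:
        raise ValueError("namespace must be non-empty")
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return f"{service}.{namespace}.svc:{port}"


def remote_registry_config(namespace, port):
    """Client-side registry section that targets the Service, not a pod."""
    return {
        "registry_type": "remote",
        "path": registry_endpoint(namespace, port),
    }


if __name__ == "__main__":
    # Namespace and port are example values, not operator defaults.
    print(remote_registry_config("feast", 6570))
```

The address contains no pod name or replica index, so it stays the same whether the Deployment runs two pods or ten.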

Step-by-Step Configuration

Follow these steps to transition from a single-file registry to a highly available deployment.

Step 1: Configure the SQL Registry Backend

Update your feature_store.yaml to use a database-backed registry instead of a local file:

project: my_project
provider: aws
online_store: redis
offline_store: file

registry:
  registry_type: sql
  path: postgresql://postgres:mysecret@db-host:5432/feast
  cache_ttl_seconds: 60
  sqlalchemy_config_kwargs:
    echo: false
    pool_pre_ping: true

The path parameter accepts any SQLAlchemy-compatible connection string. The pool_pre_ping: true setting makes the connection pool test each connection before handing it out, so stale connections left over from a database restart or failover are replaced instead of surfacing as query errors.
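Connection strings are URLs, so special characters in the user name or password must be percent-encoded. The following sketch assembles the path value from environment variables rather than hardcoding the secret. The variable names (FEAST_DB_USER and so on) and the function name are hypothetical, chosen only for illustration.

```python
import os
from urllib.parse import quote


def build_registry_path(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables.

    The variable names are illustrative assumptions, not Feast settings.
    """
    # safe="" percent-encodes every reserved character, including "/" and "@".
    user = quote(env["FEAST_DB_USER"], safe="")
    password = quote(env["FEAST_DB_PASSWORD"], safe="")
    host = env["FEAST_DB_HOST"]
    port = env.get("FEAST_DB_PORT", "5432")
    database = env.get("FEAST_DB_NAME", "feast")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


if __name__ == "__main__":
    example = {
        "FEAST_DB_USER": "postgres",
        "FEAST_DB_PASSWORD": "p@ss/word",
        "FEAST_DB_HOST": "db-host",
    }
    print(build_registry_path(example))
```

Without the encoding step, a password containing "@" or "/" would be split at the wrong place when the URL is parsed.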

Step 2: Deploy Static Replicas

Create a FeatureStore custom resource with a fixed replica count for immediate high availability:

apiVersion: feast.dev/v1
kind: FeatureStore
metadata:
  name: prod-feast
spec:
  feastProject: my_project
  replicas: 3
  services:
    onlineStore:
      persistence:
        store:
          type: postgres
          secretRef:
            name: feast-data-stores
    registry:
      local:
        persistence:
          store:
            type: sql
            secretRef:
              name: feast-data-stores

The replicas: 3 field tells the operator to maintain three registry pods simultaneously. The local service type indicates the operator manages the registry server deployment directly, as described in infra/feast-operator/docs/api/markdown/ref.md.

Step 3: Enable Autoscaling for Dynamic HA

For environments with variable load, configure the HorizontalPodAutoscaler instead of static replicas:

apiVersion: feast.dev/v1
kind: FeatureStore
metadata:
  name: autoscaled-feast
spec:
  feastProject: my_project
  services:
    scaling:
      autoscaling:
        minReplicas: 2
        maxReplicas: 10
        metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 70
    onlineStore:
      persistence:
        store:
          type: postgres
          secretRef:
            name: feast-data-stores
    registry:
      local:
        persistence:
          store:
            type: sql
            secretRef:
              name: feast-data-stores

This configuration maintains a minimum of two replicas for baseline availability while scaling up to ten pods based on CPU utilization. The HPA adds capacity during traffic spikes without over-provisioning resources during quiet periods.
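To see how a 70% CPU target turns into a pod count, the sketch below applies the scaling formula from the Kubernetes HPA documentation, desired = ceil(current * currentMetric / targetMetric), and clamps the result to the configured bounds. It is a simplification: the real controller also applies a tolerance band and stabilization windows, which are omitted here. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization, min_replicas, max_replicas):
    """Approximate the HPA replica calculation for a utilization target."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # The HPA never scales outside the minReplicas/maxReplicas range.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two pods at 140% average CPU against a 70% target.
    print(desired_replicas(2, 140, 70, min_replicas=2, max_replicas=10))
    # Four mostly idle pods shrink back toward the floor.
    print(desired_replicas(4, 20, 70, min_replicas=2, max_replicas=10))
```

With the values from the manifest above, two pods running at twice the target double to four, and the pod count never drops below two or rises above ten.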

Validation and Safety Mechanisms

The Feast Operator implements admission-time validation to prevent misconfigurations. If you specify replicas > 1 or enable autoscaling while using a file-based persistence layer (SQLite, DuckDB, or local registry.db), the operator rejects the CR with a validation error referencing the safety check in infra/feast-operator/docs/api/markdown/ref.md.
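The rule can be checked locally before applying a manifest. The sketch below is a hypothetical pre-flight check that mirrors the rule as described here. It is not the operator's code, and the type names in FILE_BASED_TYPES are assumptions based on the backends named above.

```python
# Persistence types treated as file-based for this illustration only.
FILE_BASED_TYPES = {"file", "sqlite", "duckdb"}


def scaling_errors(replicas, autoscaling_enabled, persistence_types):
    """Return a list of problems that would make scaling unsafe."""
    scaled = replicas > 1 or autoscaling_enabled
    if not scaled:
        return []
    offenders = sorted(
        t for t in persistence_types if t.lower() in FILE_BASED_TYPES
    )
    if not offenders:
        return []
    return [
        "scaling requires database-backed persistence; "
        "file-based types found: " + ", ".join(offenders)
    ]


if __name__ == "__main__":
    print(scaling_errors(3, False, ["sql", "postgres"]))
    print(scaling_errors(3, False, ["file", "postgres"]))
```

An empty list means the combination is safe to scale. A single replica with file-based persistence also passes, since the restriction applies only when more than one pod would share the file.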

Additionally, the RollingUpdate deployment strategy ensures zero-downtime deployments. When updating the registry image or configuration, Kubernetes terminates old pods only after new pods pass health checks, maintaining continuous availability for SDK and UI clients.
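The arithmetic behind a rolling update can be sketched as follows, assuming the Kubernetes defaults of 25% for both maxSurge and maxUnavailable. Whether the operator overrides these defaults is not stated here, so treat the percentages as assumptions. Kubernetes rounds the surge up and the unavailable count down.

```python
import math


def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Compute pod-count bounds during a RollingUpdate rollout."""
    # maxSurge percentages round up; maxUnavailable percentages round down.
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


if __name__ == "__main__":
    print(rolling_update_bounds(3))
```

For three replicas this gives a surge of one and no allowed unavailability, so all three pods stay in service while a fourth is started, which is why a small fixed replica count can still be updated without a gap.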

Summary

  • Use a SQL-backed registry (PostgreSQL/MySQL) to enable concurrent writes and eliminate the single-writer limitation of file-based storage.
  • Deploy multiple replicas via the Feast Operator's FeatureStore CR, using either a static replica count or HPA autoscaling.
  • Rely on the ClusterIP Service (feast-registry) for transparent load balancing across all registry pods without client reconfiguration.
  • Validate persistence compatibility before deployment: the operator blocks HA configurations with incompatible file-based backends to prevent data corruption.

Frequently Asked Questions

Can I use S3 or GCS for a highly available registry?

Cloud object stores such as S3 and GCS support concurrent readers and can serve as registry storage, but they do not provide the concurrent atomic writes that a multi-writer HA deployment requires. For high availability with multiple registry replicas, use a SQL database backend (PostgreSQL, MySQL) that supports transactional consistency and concurrent modifications.

Why can't I use SQLite or DuckDB for HA deployments?

SQLite and DuckDB implementations rely on file-based storage that requires exclusive locks for writes. When multiple registry pods attempt to modify the metadata simultaneously, these backends cannot reconcile concurrent changes, leading to race conditions and potential data corruption. The Feast Operator explicitly blocks replicas > 1 configurations when detecting these storage types.

How does the Feast Operator prevent downtime during updates?

The operator configures the registry Deployment with a RollingUpdate strategy that creates new pods before terminating old ones. Health checks ensure new replicas are ready to serve traffic before the controller removes legacy pods. Because the ClusterIP Service routes requests only to ready pods, clients keep a working endpoint while registry versions or configuration change.

Do clients need to change configuration when scaling registry replicas?

No. Clients, including the Python SDK (sdk/python/feast/infra/registry/registry.py), the Feast UI, and other services, continue using the same registry endpoint defined in feature_store.yaml. The Kubernetes Service (feast-registry.<namespace>.svc) abstracts the underlying pod topology, automatically routing requests to healthy replicas regardless of the current replica count or autoscaling state.
