How to Set Up High Availability for the Feast Registry
To achieve high availability for the Feast registry, configure a SQL-backed registry using PostgreSQL or MySQL, deploy multiple replicas via the Feast Operator's FeatureStore custom resource, and expose them through a Kubernetes ClusterIP Service that distributes client connections across all pods.
The feast-dev/feast repository provides a feature store platform where the registry maintains critical metadata about entities, data sources, and feature views. Setting up high availability for the Feast registry ensures your feature store remains operational during node failures, rolling updates, or traffic spikes, eliminating the single-point-of-failure risk inherent in the default file-based implementation.
Why the Default Registry Cannot Scale
By default, Feast stores the registry in a single file (registry.db) and rewrites that whole file on every change, which makes it a single-writer system. File-based persistence (registry.db, and likewise the SQLite or DuckDB files used by the online and offline stores) requires exclusive access for writes and cannot support concurrent modifications across multiple instances. If several replicas share such a file, their read-modify-write cycles race and can overwrite or corrupt each other's changes, which makes this architecture unsuitable for production high-availability deployments.
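The lost-update problem behind the single-writer limitation can be reproduced deterministically. The sketch below is a toy model, not Feast code: it assumes a JSON file standing in for registry.db, and the two "replicas" are interleaved by hand rather than run concurrently.

```python
import json
import tempfile
from pathlib import Path


# Toy stand-in for a whole-file registry (JSON here; the real registry.db differs).
# Every change is "read the whole file, modify it in memory, write the whole file".
def read_registry(path: Path) -> dict:
    return json.loads(path.read_text())


def write_registry(path: Path, registry: dict) -> None:
    path.write_text(json.dumps(registry))


def simulate_two_writers() -> dict:
    """Interleave two hypothetical replicas by hand and return the final file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "registry.json"
        write_registry(path, {"feature_views": []})

        # Both replicas read the same snapshot...
        snapshot_a = read_registry(path)
        snapshot_b = read_registry(path)

        # ...each applies its own change in memory...
        snapshot_a["feature_views"].append("driver_stats")
        snapshot_b["feature_views"].append("customer_profile")

        # ...and each rewrites the whole file, so the last write wins.
        write_registry(path, snapshot_a)
        write_registry(path, snapshot_b)
        return read_registry(path)


if __name__ == "__main__":
    # driver_stats is silently lost: {'feature_views': ['customer_profile']}
    print(simulate_two_writers())
```

No error is raised; replica A's change simply disappears, which is why adding locks around a shared file is not enough for multi-writer HA.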
High Availability Architecture Components
Achieving true high availability requires three coordinated components: a transactional database backend, a multi-replica Kubernetes deployment, and a stable service endpoint.
SQL Registry Backend
The foundation of an HA deployment is the SQL Registry, which persists metadata in a relational database rather than a local file. According to docs/reference/registries/sql.md, this backend supports atomic writes from any replica and eliminates the "rewrite-whole-file" bottleneck. A server-based, SQLAlchemy-supported database such as PostgreSQL or MySQL provides the transactional consistency needed for concurrent access.
The implementation in go/internal/feast/registry/mysql_registry_store.go (and similar database-specific stores) handles concurrent persistence, ensuring that multiple registry pods can safely read and write metadata simultaneously without conflicts.
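To contrast with the whole-file approach, the sketch below stores one row per object and writes each change as a single-row upsert in its own transaction. It is an illustration only: the table and column names are a simplified, hypothetical schema (not Feast's actual SQL registry tables), and the in-memory sqlite3 database merely stands in for PostgreSQL or MySQL so the example runs anywhere. It does not exercise real concurrency; it shows why row-level writes do not clobber unrelated objects.

```python
import sqlite3


# Simplified, hypothetical schema: one row per feature view instead of one file.
# sqlite3 in memory is only a runnable stand-in for PostgreSQL or MySQL here.
def create_registry(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE feature_views ("
        " project TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " proto BLOB,"
        " PRIMARY KEY (project, name))"
    )


def apply_feature_view(conn: sqlite3.Connection, project: str, name: str, proto: bytes) -> None:
    # Each write touches only its own row and commits in its own transaction.
    # PostgreSQL accepts the same ON CONFLICT upsert; MySQL uses ON DUPLICATE KEY UPDATE.
    with conn:
        conn.execute(
            "INSERT INTO feature_views (project, name, proto) VALUES (?, ?, ?) "
            "ON CONFLICT (project, name) DO UPDATE SET proto = excluded.proto",
            (project, name, proto),
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_registry(conn)
    # Two replicas apply different feature views; neither overwrites the other.
    apply_feature_view(conn, "my_project", "driver_stats", b"v1")
    apply_feature_view(conn, "my_project", "customer_profile", b"v1")
    print(conn.execute("SELECT name FROM feature_views ORDER BY name").fetchall())
```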
Multi-Replica Deployment with the Feast Operator
The Feast Operator manages registry scaling through the FeatureStore custom resource (CR). When you specify spec.replicas greater than 1, the operator automatically:
- Generates a Kubernetes Deployment with the requested replica count
- Sets the deployment strategy to RollingUpdate to prevent downtime during version changes
- Creates a HorizontalPodAutoscaler (HPA) if spec.services.scaling.autoscaling is configured
As documented in infra/website/docs/blog/scaling-feast-feature-server.md, the operator enforces persistence validation at admission time. If you attempt to enable replicas > 1 while using a file-based backend, the request is rejected immediately with a clear error message.
Kubernetes Service Load Balancing
The operator creates a ClusterIP Service named feast-registry that provides a stable DNS endpoint (feast-registry.<namespace>.svc). This Service spreads gRPC and REST connections across all ready registry pods; because balancing happens per connection, a long-lived gRPC channel stays on one pod until it reconnects. Clients using the Python SDK (sdk/python/feast/infra/registry/registry.py) connect to this service name transparently, requiring no code changes when scaling replica counts up or down.
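To confirm from inside the cluster that the Service endpoint accepts connections, a plain TCP check is often enough. This is a minimal standard-library sketch; the helper name is hypothetical, and it only verifies that some ready pod behind the Service accepts a connection, not that the registry answers requests correctly.

```python
import socket


def registry_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timeout
        return False
```

From a pod in the cluster you might call registry_reachable("feast-registry.<namespace>.svc", port), substituting your namespace and the port your registry Service actually exposes (kubectl get svc lists it).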
Step-by-Step Configuration
Follow these steps to transition from a single-file registry to a highly available deployment.
Step 1: Configure the SQL Registry Backend
Update your feature_store.yaml to use a database-backed registry instead of a local file:
```yaml
project: my_project
provider: aws
online_store: redis
offline_store: file
registry:
  registry_type: sql
  path: postgresql://postgres:mysecret@db-host:5432/feast
  cache_ttl_seconds: 60
  sqlalchemy_config_kwargs:
    echo: false
    pool_pre_ping: true
```
The path parameter accepts any SQLAlchemy-compatible connection string. The pool_pre_ping: true setting makes SQLAlchemy test each pooled connection before use and transparently replace stale ones, which reduces errors after a database restart or failover.
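Because the path carries both the database dialect and, in this example, an inline password, a quick pre-deployment check can catch an HA-incompatible registry URL. The sketch below uses only the standard library; the helper name and the PostgreSQL/MySQL allow-list are assumptions for illustration, not part of Feast's API.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of shared, server-based dialects suitable for HA.
SERVER_DIALECTS = ("postgresql", "mysql")


def describe_registry_url(url: str) -> dict:
    """Split a SQLAlchemy-style URL and flag settings that matter for HA."""
    parts = urlsplit(url)
    # SQLAlchemy URLs may carry a driver suffix, e.g. "mysql+pymysql".
    dialect = parts.scheme.split("+", 1)[0]
    return {
        "dialect": dialect,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "shared_server": dialect in SERVER_DIALECTS,
        "password_inline": parts.password is not None,
    }


if __name__ == "__main__":
    print(describe_registry_url("postgresql://postgres:mysecret@db-host:5432/feast"))
    print(describe_registry_url("sqlite:///registry.db"))  # shared_server is False
```

A file-based SQLite URL reports shared_server as False, and password_inline highlights credentials that live in feature_store.yaml itself.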
Step 2: Deploy Static Replicas
Create a FeatureStore custom resource with a fixed replica count for immediate high availability:
```yaml
apiVersion: feast.dev/v1
kind: FeatureStore
metadata:
  name: prod-feast
spec:
  feastProject: my_project
  replicas: 3
  services:
    onlineStore:
      persistence:
        store:
          type: postgres
          secretRef:
            name: feast-data-stores
    registry:
      local:
        persistence:
          store:
            type: sql
            secretRef:
              name: feast-data-stores
```
The replicas: 3 field tells the operator to maintain three registry pods simultaneously. The local service type indicates the operator manages the registry server deployment directly, as described in infra/feast-operator/docs/api/markdown/ref.md.
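If you template the CR for several environments, it can be convenient to generate it from code. This hypothetical helper rebuilds the Step 2 manifest as a Python dict with the same fields shown above; the function names and the make_cr.py filename are illustrative assumptions, and the output is JSON, which kubectl accepts just like YAML.

```python
import json


def _db_persistence(store_type: str, secret_name: str) -> dict:
    # DB-backed persistence block, as used by both services in the Step 2 example.
    return {"store": {"type": store_type, "secretRef": {"name": secret_name}}}


def feature_store_manifest(name: str, project: str, replicas: int, secret_name: str) -> dict:
    """Build the static-replica FeatureStore CR from Step 2 as a plain dict."""
    return {
        "apiVersion": "feast.dev/v1",
        "kind": "FeatureStore",
        "metadata": {"name": name},
        "spec": {
            "feastProject": project,
            "replicas": replicas,
            "services": {
                "onlineStore": {"persistence": _db_persistence("postgres", secret_name)},
                "registry": {"local": {"persistence": _db_persistence("sql", secret_name)}},
            },
        },
    }


if __name__ == "__main__":
    manifest = feature_store_manifest("prod-feast", "my_project", 3, "feast-data-stores")
    # Hypothetical usage: python make_cr.py | kubectl apply -f -
    print(json.dumps(manifest, indent=2))
```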
Step 3: Enable Autoscaling for Dynamic HA
For environments with variable load, configure the HorizontalPodAutoscaler instead of static replicas:
```yaml
apiVersion: feast.dev/v1
kind: FeatureStore
metadata:
  name: autoscaled-feast
spec:
  feastProject: my_project
  services:
    scaling:
      autoscaling:
        minReplicas: 2
        maxReplicas: 10
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
    onlineStore:
      persistence:
        store:
          type: postgres
          secretRef:
            name: feast-data-stores
    registry:
      local:
        persistence:
          store:
            type: sql
            secretRef:
              name: feast-data-stores
```
This configuration keeps at least two replicas for baseline availability and scales up to ten pods, adding replicas when average CPU utilization rises above the 70% target. The HPA adds capacity during traffic spikes without over-provisioning resources during quiet periods.
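The HPA's scaling decision follows the Kubernetes formula desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), clamped to minReplicas and maxReplicas. The sketch below applies it to the values above. It is simplified: it ignores stabilization windows, unready pods, and scaling-rate policies, and it assumes the controller's default 10% tolerance. Note that a Utilization target is measured against the pods' CPU requests, so the registry container needs CPU requests set.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_util: float,
    target_util: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA replica calculation for a single Utilization metric."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two pods averaging 140% CPU against a 70% target double to four pods.
    print(desired_replicas(2, 140, 70, min_replicas=2, max_replicas=10))
```

The clamp is what keeps two replicas alive even when the registry is almost idle, and what caps a spike at ten pods.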
Validation and Safety Mechanisms
The Feast Operator implements admission-time validation to prevent misconfigurations. If you specify replicas > 1 or enable autoscaling while using a file-based persistence layer (SQLite, DuckDB, or a local registry.db), the operator rejects the CR with a validation error. This safety check is documented in infra/feast-operator/docs/api/markdown/ref.md.
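The same rule can be run as a pre-flight check in CI before a CR ever reaches the cluster. This is a hypothetical sketch, not the operator's implementation: the function name and messages are invented, and it assumes that any persistence block without a store key is file-based (the default).

```python
def scaling_errors(spec: dict) -> list[str]:
    """Return reasons a scaled FeatureStore spec is unsafe; an empty list means OK."""
    services = spec.get("services", {})
    scaled = spec.get("replicas", 1) > 1 or bool(
        services.get("scaling", {}).get("autoscaling")
    )
    if not scaled:
        return []  # a single replica may use file-based persistence

    # Services configured in this CR; the registry check applies to the "local" type.
    configured = {
        "onlineStore": services.get("onlineStore"),
        "offlineStore": services.get("offlineStore"),
        "registry": services.get("registry", {}).get("local"),
    }
    errors = []
    for name, service in configured.items():
        if service is None:
            continue  # service not configured in this CR
        # Assumption: anything without a "store" block is file-based.
        if "store" not in service.get("persistence", {}):
            errors.append(f"{name}: replicas > 1 or autoscaling requires DB-backed persistence")
    return errors


if __name__ == "__main__":
    unsafe = {
        "replicas": 3,
        "services": {"registry": {"local": {"persistence": {"file": {"path": "registry.db"}}}}},
    }
    print(scaling_errors(unsafe))
```

Both example CRs in this guide pass the check, because their onlineStore and registry blocks use store persistence.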
Additionally, the RollingUpdate deployment strategy ensures zero-downtime deployments. When updating the registry image or configuration, Kubernetes terminates old pods only after new pods pass health checks, maintaining continuous availability for SDK and UI clients.
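To see why a rolling update keeps three replicas serving, here is the arithmetic Kubernetes applies to percentage-based maxSurge (rounded up) and maxUnavailable (rounded down). The sketch assumes the Deployment uses the Kubernetes defaults of 25% for both; the source does not state which values the operator sets.

```python
def rollout_bounds(replicas: int, max_surge_pct: int = 25, max_unavailable_pct: int = 25) -> tuple[int, int]:
    """Return (max total pods during rollout, min available pods) for percentage settings."""
    # Integer arithmetic avoids float rounding: ceil for surge, floor for unavailable.
    surge = (replicas * max_surge_pct + 99) // 100
    unavailable = (replicas * max_unavailable_pct) // 100
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    # With 3 replicas: at most 4 pods exist, and all 3 stay available.
    print(rollout_bounds(3))
```

With three replicas, maxUnavailable rounds down to zero, so Kubernetes must bring a new pod up and ready before it removes an old one.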
Summary
- Use a SQL-backed registry (PostgreSQL/MySQL) to enable concurrent writes and eliminate the single-writer limitation of file-based storage.
- Deploy multiple replicas via the Feast Operator's FeatureStore CR, using either a static replicas count or HPA autoscaling for dynamic scaling.
- Rely on the ClusterIP Service (feast-registry) for transparent load balancing across all registry pods without client reconfiguration.
- Validate persistence compatibility before deployment: the operator blocks HA configurations with incompatible file-based backends to prevent data corruption.
Frequently Asked Questions
Can I use S3 or GCS for a highly available registry?
Cloud object stores such as S3 and GCS support concurrent readers and can serve as registry storage, but they do not support concurrent atomic writes required for multi-writer HA deployments. For true high availability with multiple registry replicas, you must use a SQL database backend (PostgreSQL, MySQL) that supports transactional consistency and concurrent modifications.
Why can't I use SQLite or DuckDB for HA deployments?
SQLite and DuckDB implementations rely on file-based storage that requires exclusive locks for writes. When multiple registry pods attempt to modify the metadata simultaneously, these backends cannot reconcile concurrent changes, leading to race conditions and potential data corruption. The Feast Operator explicitly blocks replicas > 1 configurations when detecting these storage types.
How does the Feast Operator prevent downtime during updates?
The operator configures the registry Deployment with a RollingUpdate strategy that creates new pods before terminating old ones. Readiness checks ensure new replicas are ready to serve traffic before the controller removes old pods. This approach, combined with the ClusterIP Service routing connections only to ready pods, ensures zero-downtime deployments even when updating registry versions or configuration.
Do clients need to change configuration when scaling registry replicas?
No. Clients, including the Python SDK (sdk/python/feast/infra/registry/registry.py), the Feast UI, and other services, continue using the same registry endpoint defined in feature_store.yaml. The Kubernetes Service (feast-registry.<namespace>.svc) abstracts the underlying pod topology, automatically routing requests to healthy replicas regardless of the current replica count or autoscaling state.