Feast Batch Processing Engines: Local, Spark, Ray, Snowflake, and AWS Lambda
Feast supports five production-ready batch compute engines (local, Spark, Ray, Snowflake, and AWS Lambda) for materializing features. You select one through the batch_engine field in feature_store.yaml, and Feast resolves it through the BATCH_ENGINE_CLASS_FOR_TYPE registry in sdk/python/feast/repo_config.py.
Feast materializes offline feature data to online stores using a pluggable batch materialization engine architecture. The open-source feast-dev/feast repository provides built-in compute engines that scale from local development to distributed serverless environments. You configure your preferred engine through the batch_engine (or batch_engine_config) field in feature_store.yaml, which Feast resolves to concrete implementations via an internal class registry defined in sdk/python/feast/repo_config.py.
How Feast Selects Batch Compute Engines
Feast uses a registry-based lookup mechanism to map shorthand engine names to concrete compute classes. In sdk/python/feast/repo_config.py (lines 46‑53), the BATCH_ENGINE_CLASS_FOR_TYPE dictionary defines the following mappings:
- local → feast.infra.compute_engines.local.compute.LocalComputeEngine
- spark.engine → feast.infra.compute_engines.spark.compute.SparkComputeEngine
- ray.engine → feast.infra.compute_engines.ray.compute.RayComputeEngine
- snowflake.engine → feast.infra.compute_engines.snowflake.snowflake_engine.SnowflakeComputeEngine
- lambda → feast.infra.compute_engines.aws_lambda.lambda_engine.LambdaComputeEngine
When the FeatureStore initializes, it calls get_batch_engine_config_from_type() to validate the configuration against the corresponding *EngineConfig Pydantic model (e.g., SparkEngineConfig). If no batch_engine is specified, RepoConfig.__init__ (lines 42‑45) defaults to the local engine, ensuring that development environments work out-of-the-box.
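The lookup and default described above can be sketched in a few lines of standard-library Python. This is an illustration only, not Feast's code: the dictionary contents are copied from the mappings listed in this article, while resolve_batch_engine is a hypothetical helper name, and the handling of string versus mapping values is an assumption based on the two configuration shapes shown later (batch_engine: local and batch_engine: with a nested type).

```python
from typing import Any, Dict, Tuple, Union

# Mapping copied from this article's description of BATCH_ENGINE_CLASS_FOR_TYPE.
BATCH_ENGINE_CLASS_FOR_TYPE: Dict[str, str] = {
    "local": "feast.infra.compute_engines.local.compute.LocalComputeEngine",
    "spark.engine": "feast.infra.compute_engines.spark.compute.SparkComputeEngine",
    "ray.engine": "feast.infra.compute_engines.ray.compute.RayComputeEngine",
    "snowflake.engine": "feast.infra.compute_engines.snowflake.snowflake_engine.SnowflakeComputeEngine",
    "lambda": "feast.infra.compute_engines.aws_lambda.lambda_engine.LambdaComputeEngine",
}


def resolve_batch_engine(
    batch_engine: Union[None, str, Dict[str, Any]],
) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: return (class_path, engine_options).

    Accepts the three shapes a batch_engine value can take in YAML:
    missing (None), a bare type string, or a mapping with a "type" key.
    """
    if batch_engine is None:
        # No engine configured: fall back to the local engine.
        engine_type, options = "local", {}
    elif isinstance(batch_engine, str):
        engine_type, options = batch_engine, {}
    else:
        options = dict(batch_engine)  # copy so the caller's dict is untouched
        engine_type = options.pop("type", "local")

    try:
        return BATCH_ENGINE_CLASS_FOR_TYPE[engine_type], options
    except KeyError:
        raise ValueError(f"Unknown batch engine type: {engine_type!r}") from None


if __name__ == "__main__":
    print(resolve_batch_engine(None))
    print(resolve_batch_engine({"type": "ray.engine", "ray_address": "ray://my-ray-head:10001"}))
```

Keeping the registry as strings rather than imported classes means an engine's dependencies (Spark, Ray, and so on) only need to be installed when that engine is actually selected.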
Local Compute Engine
The Local Compute Engine runs materialization in-process on the driver machine using feast.infra.compute_engines.local.compute.LocalComputeEngine. This engine is ideal for development, unit testing, and small-scale workloads that do not require distributed processing.
Because the engine executes within the same Python process as the Feast client, it avoids the overhead of cluster scheduling and network serialization. The engine implements the abstract ComputeEngine interface defined in feast/infra/compute_engines/base.py, specifically overriding the materialize method to iterate over batch sources locally.
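To make the idea of in-process materialization concrete, here is a conceptual sketch: it walks over offline rows in a single Python process and keeps the most recent value per entity in an in-memory dictionary that stands in for the online store. The row shape, the function name materialize_in_process, and the dictionary store are all assumptions made for illustration; this is not how LocalComputeEngine is implemented.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Assumed row shape for this sketch: (entity_key, event_timestamp, feature_value)
Row = Tuple[str, datetime, float]


def materialize_in_process(rows: Iterable[Row], online_store: Dict[str, float]) -> None:
    """Write the latest feature value per entity key into online_store."""
    latest_seen: Dict[str, datetime] = {}
    for entity_key, event_ts, value in rows:
        # Only overwrite when this row is newer than what we already stored.
        if entity_key not in latest_seen or event_ts > latest_seen[entity_key]:
            latest_seen[entity_key] = event_ts
            online_store[entity_key] = value


if __name__ == "__main__":
    offline_rows = [
        ("driver_1", datetime(2024, 1, 1), 0.7),
        ("driver_2", datetime(2024, 1, 2), 0.5),
        ("driver_1", datetime(2024, 1, 3), 0.9),  # newer value for driver_1
    ]
    store: Dict[str, float] = {}
    materialize_in_process(offline_rows, store)
    print(store)
```

Everything happens in one loop on one machine, which is why this style suits development and small datasets but not large backfills.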
Spark Compute Engine
The Spark Compute Engine (spark.engine) leverages feast.infra.compute_engines.spark.compute.SparkComputeEngine to execute materialization as a distributed Spark job. This engine supports stand-alone Spark clusters, Amazon EMR, Google Dataproc, and Databricks runtimes.
Spark engine configuration accepts a spark_conf dictionary that passes directly to the Spark session builder, allowing you to set the master URL, deploy mode, and application-specific properties. The engine reads the batch source into a Spark DataFrame, applies any on-demand transformations, and writes the results to the configured online store in parallel.
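The "write in parallel" step can be pictured without a cluster. The sketch below splits rows into partitions and writes each partition from its own worker, using a standard-library thread pool as a stand-in for Spark executors. It assumes each entity key appears once in the input, and the function names and dictionary-based online store are hypothetical; Spark's actual partitioning and Feast's online store writers are not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Dict, List, Tuple

# Assumed row shape for this sketch: (entity_key, feature_value)
Row = Tuple[str, float]


def split_into_partitions(rows: List[Row], num_partitions: int) -> List[List[Row]]:
    """Round-robin the rows into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def write_partitions_in_parallel(
    rows: List[Row], online_store: Dict[str, float], num_partitions: int = 4
) -> int:
    """Write every partition from its own worker; return the number of rows written."""
    lock = Lock()  # guards the shared in-memory store

    def write_partition(partition: List[Row]) -> int:
        for entity_key, value in partition:
            with lock:
                online_store[entity_key] = value
        return len(partition)

    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return sum(pool.map(write_partition, partitions))


if __name__ == "__main__":
    rows = [(f"driver_{i}", i * 0.1) for i in range(10)]
    store: Dict[str, float] = {}
    print(write_partitions_in_parallel(rows, store), "rows written")
```

In a real distributed engine each partition lives on a different worker and opens its own connection to the online store, so no shared lock is involved.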
Ray Compute Engine
The Ray Compute Engine (ray.engine) utilizes feast.infra.compute_engines.ray.compute.RayComputeEngine to parallelize materialization across a Ray cluster. This provides fine-grained task scheduling without the overhead of a full Spark stack, making it suitable for Python-native feature transformations.
Configuration requires a ray_address parameter (e.g., ray://my-ray-head:10001) and optional runtime environment specifications. The Ray engine distributes the materialization workload across available cluster nodes, using Ray’s actor and task primitives to scale horizontally while maintaining low-latency Python execution.
Snowflake Compute Engine
The Snowflake Compute Engine (snowflake.engine) uses feast.infra.compute_engines.snowflake.snowflake_engine.SnowflakeComputeEngine to push computation directly into Snowflake’s native compute layer. Instead of exporting data to an external engine, Feast generates SQL queries that execute within Snowflake warehouses.
This approach minimizes data movement and leverages Snowflake’s elastic compute resources. The engine requires warehouse and role parameters in its configuration, and it writes materialized features directly from Snowflake tables to the online store without intermediate extraction.
AWS Lambda Compute Engine
The AWS Lambda Compute Engine (lambda) invokes feast.infra.compute_engines.aws_lambda.lambda_engine.LambdaComputeEngine to execute materialization as a serverless function. This is optimal for lightweight, event-driven batch jobs that run infrequently or require automatic scaling without persistent infrastructure.
The engine accepts function_name, region, and optional payload size or timeout tuning parameters. When materialization triggers, Feast packages the batch job context and invokes the specified Lambda function, which performs the data extraction and online store write within the AWS ecosystem.
Configuration Examples
Below are minimal feature_store.yaml configurations for each engine. All examples assume a FileSource batch source read through the dask offline store; adjust the offline_store configuration to match your environment (BigQuery, Snowflake, etc.).
Local Engine Configuration
project: my_project
provider: local
registry: data/registry.db
online_store:
  type: sqlite
offline_store:
  type: dask
batch_engine: local
Spark Engine Configuration
project: my_project
provider: aws
registry: s3://my-bucket/registry.db
online_store:
  type: redis
  connection_string: "localhost:6379"
offline_store:
  type: dask
batch_engine:
  type: spark.engine
  spark_conf:
    master: "spark://my-spark-cluster:7077"
    deploy_mode: "cluster"
Ray Engine Configuration
project: my_project
provider: gcp
registry: gs://my-bucket/registry.db
online_store:
  type: bigtable
offline_store:
  type: dask
batch_engine:
  type: ray.engine
  ray_address: "ray://my-ray-head:10001"
Snowflake Engine Configuration
project: my_project
provider: snowflake
registry: s3://my-bucket/registry.db
online_store:
  type: snowflake.online
  account: "<account>"
  user: "<user>"
  password: "<pwd>"
offline_store:
  type: dask
batch_engine:
  type: snowflake.engine
  warehouse: "FEAST_WH"
  role: "FEAST_ROLE"
AWS Lambda Engine Configuration
project: my_project
provider: aws
registry: s3://my-bucket/registry.db
online_store:
  type: dynamodb
offline_store:
  type: dask
batch_engine:
  type: lambda
  function_name: "feast-materialize"
  region: "us-east-1"
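Before deploying one of these files, it can help to check that the batch_engine block carries the options its engine expects. The sketch below does that with plain dictionaries. It is a stand-in for the Pydantic validation Feast performs, not a copy of it: the option names come from this article's engine descriptions, treating function_name and region as required for lambda is an assumption, and missing_options is a hypothetical helper.

```python
from typing import Any, Dict, List

# Options each engine is described as needing in this article.
# Assumption: function_name and region are treated as required for "lambda".
REQUIRED_OPTIONS: Dict[str, List[str]] = {
    "local": [],
    "spark.engine": [],
    "ray.engine": ["ray_address"],
    "snowflake.engine": ["warehouse", "role"],
    "lambda": ["function_name", "region"],
}


def missing_options(batch_engine: Dict[str, Any]) -> List[str]:
    """Return the expected option names absent from a parsed batch_engine mapping."""
    engine_type = batch_engine.get("type", "local")
    if engine_type not in REQUIRED_OPTIONS:
        raise ValueError(f"Unknown batch engine type: {engine_type!r}")
    return [name for name in REQUIRED_OPTIONS[engine_type] if name not in batch_engine]


if __name__ == "__main__":
    # The dict mirrors what a YAML parser would produce for the batch_engine block.
    incomplete = {"type": "snowflake.engine", "warehouse": "FEAST_WH"}
    print(missing_options(incomplete))
```

A check like this fits naturally into CI, so a misconfigured engine is caught before the first materialization run.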
Summary
- Feast provides five built-in batch compute engines: local, spark.engine, ray.engine, snowflake.engine, and lambda, mapped through BATCH_ENGINE_CLASS_FOR_TYPE in sdk/python/feast/repo_config.py.
- The Local engine runs in-process for development, while Spark and Ray provide distributed processing for large-scale materialization.
- The Snowflake engine executes queries natively inside Snowflake warehouses, minimizing data transfer overhead.
- The Lambda engine enables serverless materialization for lightweight, event-driven workloads.
- All engines implement the ComputeEngine interface from feast/infra/compute_engines/base.py, ensuring a consistent materialize method signature across implementations.
Frequently Asked Questions
What is the default batch compute engine in Feast?
If you omit the batch_engine field in feature_store.yaml, Feast defaults to the local engine. This behavior is defined in RepoConfig.__init__ within sdk/python/feast/repo_config.py (lines 42‑45), which assigns the local compute engine class when no engine type is specified.
How do I add a custom batch compute engine to Feast?
You can extend the pluggable registry by adding a new entry to BATCH_ENGINE_CLASS_FOR_TYPE in sdk/python/feast/repo_config.py, mapping a custom type string to a fully-qualified class path. Your custom class must inherit from the abstract ComputeEngine base class in feast/infra/compute_engines/base.py and implement the materialize method signature.
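The shape of such an extension can be sketched with a simplified stand-in for the base class. The ComputeEngine class below is a toy written for this example, with a deliberately reduced materialize signature; the real abstract class in feast/infra/compute_engines/base.py has its own constructor and method signatures, which you should read before implementing. EchoComputeEngine and the echo.engine type string are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class ComputeEngine(ABC):
    """Toy stand-in for Feast's abstract base class (simplified signature)."""

    @abstractmethod
    def materialize(self, feature_view_name: str) -> str:
        """Materialize one feature view and return a status message."""


class EchoComputeEngine(ComputeEngine):
    """Hypothetical custom engine that only reports what it would do."""

    def materialize(self, feature_view_name: str) -> str:
        return f"materialized {feature_view_name}"


# Toy registry keyed by type string, mirroring the idea of BATCH_ENGINE_CLASS_FOR_TYPE.
CUSTOM_ENGINES: Dict[str, Type[ComputeEngine]] = {"echo.engine": EchoComputeEngine}

if __name__ == "__main__":
    engine = CUSTOM_ENGINES["echo.engine"]()
    print(engine.materialize("driver_hourly_stats"))
```

Because materialize is abstract, a subclass that forgets to implement it cannot be instantiated, which surfaces the mistake at startup rather than midway through a job.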
What are the performance differences between Ray and Spark engines in Feast?
The Ray engine excels at fine-grained, Python-native task parallelism with lower overhead for small-to-medium datasets, while the Spark engine optimizes for large-scale JVM-based data processing with robust fault tolerance. Choose Ray for Python-centric feature logic requiring dynamic scaling, and Spark for massive ETL workloads that benefit from Spark SQL and mature cluster management.
Can the Snowflake engine materialize features from non-Snowflake sources?
No, the snowflake.engine is designed specifically to execute materialization queries within Snowflake’s compute layer against Snowflake-hosted batch sources. If your offline store is BigQuery or Redshift, you should use the local, spark.engine, or ray.engine compute engines instead, as these can read from diverse offline stores and write to any supported online store.