How Feast Handles Point-in-Time Joins to Prevent Data Leakage During Training
Feast prevents data leakage by executing point-in-time joins that attach only feature values known at or before each training row's event timestamp, leveraging TTL-aware SQL templates or Pandas asof-merges to guarantee temporal correctness.
Feast (feast-dev/feast) prevents data leakage in machine learning training pipelines by building training datasets with point-in-time joins. Each feature value joined to an entity row reflects only data that was available at that row's event timestamp. This temporal correctness is enforced across several layers, from timestamp inference to provider-specific query execution.
Understanding Point-in-Time Joins in Feast
Point-in-time (PIT) joins are the mechanism by which Feast attaches historical feature values to entity rows without using future information. For each row in the training dataset, Feast retrieves feature values where the feature timestamp is less than or equal to the entity's event timestamp, while respecting the Time-To-Live (TTL) window defined on the FeatureView.
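The rule described above can be illustrated with a minimal Pandas sketch. This is not Feast's implementation; the function name point_in_time_join, the column name feature_timestamp, and the sample data are assumptions made for illustration. It uses pandas.merge_asof with direction="backward" for the "at or before" condition and tolerance for the TTL window.

```python
from typing import Optional

import pandas as pd


def point_in_time_join(entity_df, feature_df, key, ttl: Optional[pd.Timedelta]):
    """Attach the latest feature row at or before each entity timestamp."""
    # merge_asof requires both frames to be sorted by their time columns.
    left = entity_df.sort_values("event_timestamp")
    right = feature_df.sort_values("feature_timestamp")
    return pd.merge_asof(
        left,
        right,
        left_on="event_timestamp",
        right_on="feature_timestamp",
        by=key,
        direction="backward",  # never look at rows after the event timestamp
        tolerance=ttl,         # None means no lower bound (no TTL)
    )


# Hypothetical sample data for a single driver.
entity_df = pd.DataFrame({
    "driver_id": [1001, 1001, 1001],
    "event_timestamp": pd.to_datetime(
        ["2023-01-01 12:00", "2023-01-02 12:00", "2023-01-05 12:00"]
    ),
})
feature_df = pd.DataFrame({
    "driver_id": [1001, 1001],
    "feature_timestamp": pd.to_datetime(["2023-01-01 10:00", "2023-01-02 18:00"]),
    "trips_today": [5, 9],
})

joined = point_in_time_join(entity_df, feature_df, "driver_id", pd.Timedelta(days=2))
print(joined[["event_timestamp", "feature_timestamp", "trips_today"]])
```

In this sample, the second entity row receives the value from 2023-01-01 10:00 rather than the later 18:00 row, and the third row gets no value because the most recent feature row is older than the two-day TTL.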
The Four-Layer Architecture of PIT Joins
The implementation spans four critical layers in the offline store infrastructure:
1. Entity Timestamp Inference
The infer_event_timestamp_from_entity_df function in sdk/python/feast/infra/offline_stores/offline_utils.py (lines 28-44) automatically detects which column contains the event time. If the default event_timestamp column is missing, Feast infers the datetime column or raises a FeastEntityDFMissingColumnsError.
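The inference idea can be sketched as follows. This is a simplified, standalone approximation and not the code in offline_utils.py: the function name infer_timestamp_column, the rule of accepting exactly one datetime column, and the use of ValueError are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

DEFAULT_TS_COL = "event_timestamp"  # default column name mentioned in the text


def infer_timestamp_column(entity_df: pd.DataFrame) -> str:
    """Return the column to treat as the event time (illustrative only)."""
    if DEFAULT_TS_COL in entity_df.columns:
        return DEFAULT_TS_COL
    # Otherwise, look for datetime-typed columns.
    candidates = [
        col for col in entity_df.columns
        if is_datetime64_any_dtype(entity_df[col])
    ]
    if len(candidates) == 1:
        return candidates[0]
    # Assumption: zero or several datetime columns is treated as an error.
    raise ValueError(
        f"Cannot infer event timestamp column; datetime candidates: {candidates}"
    )


orders = pd.DataFrame({
    "driver_id": [1001],
    "order_time": pd.to_datetime(["2023-01-02 12:00"]),
})
print(infer_timestamp_column(orders))
```

Note that a column read from CSV without date parsing has dtype object, so it would not be picked up by a dtype-based check like this one.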
2. Query Context Construction
For each FeatureView, get_feature_view_query_context in offline_utils.py (lines 101-177) builds a FeatureViewQueryContext containing:
- Join keys mapped from entity columns
- TTL converted to seconds
- The source's timestamp_field
- Time window boundaries: maximum entity timestamp and minimum timestamp (entity timestamp minus TTL)
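The fields listed above can be pictured with a small sketch. The class QueryContextSketch and the helper build_context are hypothetical names, not Feast's FeatureViewQueryContext; the sketch assumes the lower boundary is the earliest entity timestamp minus the TTL and that a missing TTL is stored as 0 seconds.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

import pandas as pd


@dataclass
class QueryContextSketch:
    """Hypothetical stand-in for a per-FeatureView query context."""
    join_keys: List[str]
    ttl_seconds: int  # 0 means "no TTL configured" in this sketch
    timestamp_field: str
    min_event_timestamp: Optional[pd.Timestamp]
    max_event_timestamp: pd.Timestamp


def build_context(entity_df, join_keys, timestamp_field, ttl: Optional[timedelta]):
    event_ts = entity_df["event_timestamp"]
    ttl_seconds = int(ttl.total_seconds()) if ttl else 0
    # Lower bound: earliest entity timestamp minus TTL; unbounded without a TTL.
    min_ts = (
        event_ts.min() - pd.Timedelta(seconds=ttl_seconds) if ttl_seconds else None
    )
    return QueryContextSketch(
        join_keys=list(join_keys),
        ttl_seconds=ttl_seconds,
        timestamp_field=timestamp_field,
        min_event_timestamp=min_ts,
        max_event_timestamp=event_ts.max(),
    )


entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2023-01-10", "2023-01-20"]),
})
ctx = build_context(entity_df, ["driver_id"], "feature_timestamp", timedelta(days=2))
print(ctx)
```

Precomputing the overall window lets a query prune source rows that cannot match any entity row before the per-row join runs.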
3. SQL and Pandas Rendering
The build_point_in_time_query function (lines 84-124 in offline_utils.py) renders the actual join logic. For BigQuery, Redshift, Snowflake, and other SQL-based stores, it generates templated SQL using LATERAL joins or MAX-OVER windows. For local execution, it falls back to Pandas asof-merges. This is utilized within each store's get_historical_features implementation, such as in bigquery.py (lines 235-272).
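To make the SQL side concrete, here is a small self-contained sketch that runs a point-in-time query against an in-memory SQLite database. It is not one of Feast's templates: the table names, the sample rows, and the choice of ROW_NUMBER() (instead of the LATERAL or MAX-OVER forms mentioned above) are assumptions chosen so the example runs with Python's standard library. It requires a SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (driver_id INTEGER, event_timestamp TEXT);
CREATE TABLE feature (driver_id INTEGER, feature_timestamp TEXT, trips_today INTEGER);
INSERT INTO entity VALUES (1001, '2023-01-02 12:00:00');
INSERT INTO feature VALUES
  (1001, '2023-01-01 10:00:00', 5),  -- older than the TTL window
  (1001, '2023-01-02 09:00:00', 7),  -- inside the window
  (1001, '2023-01-02 18:00:00', 9);  -- after the event: must be ignored
""")

TTL_SECONDS = 86400  # one day, illustrative

PIT_SQL = """
WITH candidates AS (
  SELECT e.driver_id, e.event_timestamp, f.trips_today,
         ROW_NUMBER() OVER (
           PARTITION BY e.driver_id, e.event_timestamp
           ORDER BY f.feature_timestamp DESC
         ) AS rn
  FROM entity e
  LEFT JOIN feature f
    ON f.driver_id = e.driver_id
   AND f.feature_timestamp <= e.event_timestamp
   AND f.feature_timestamp >= datetime(e.event_timestamp, :ttl_modifier)
)
SELECT driver_id, event_timestamp, trips_today
FROM candidates
WHERE rn = 1
"""

# The modifier shifts the event timestamp back by the TTL to get the lower bound.
rows = conn.execute(PIT_SQL, {"ttl_modifier": f"-{TTL_SECONDS} seconds"}).fetchall()
print(rows)
```

The join condition carries both bounds, and the window function keeps only the most recent qualifying feature row per entity row.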
4. Provider Facade
The public API entry point FeatureStore.get_historical_features in feature_store.py (lines 1242-1286) delegates to Provider.get_historical_features in provider.py (lines 48-60), which orchestrates the offline store execution and returns a RetrievalJob.
How Feast Executes Point-in-Time Joins
The complete workflow follows these steps:
- Validation: Feast validates the entity dataframe using assert_expected_columns_in_entity_df, ensuring required columns exist.
- Timestamp Detection: If not explicitly named event_timestamp, the system infers the timestamp column via infer_event_timestamp_from_entity_df.
- Context Building: For every referenced FeatureView, get_feature_view_query_context calculates the valid time window using the entity timestamp range and TTL.
- Query Generation: build_point_in_time_query creates store-specific SQL or Pandas logic that filters source rows where feature_timestamp <= entity_timestamp and feature_timestamp >= (entity_timestamp - TTL).
- Execution: The offline store executes the generated query, returning a dataset where each row contains entity columns joined only with historically valid features.
This process is documented in detail in docs/getting-started/concepts/point-in-time-joins.md.
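The two filter conditions from the query generation step can be expressed as a reusable check, which is handy for auditing a joined dataset. The helper valid_feature_mask and the column names are hypothetical; this is a sketch of the stated conditions, not a Feast API.

```python
from typing import Optional

import pandas as pd


def valid_feature_mask(df: pd.DataFrame, ttl: Optional[pd.Timedelta]) -> pd.Series:
    """True where a feature row is temporally valid for its entity row."""
    # Upper bound: the feature must be known at or before the event.
    upper_ok = df["feature_timestamp"] <= df["event_timestamp"]
    if ttl is None:
        return upper_ok
    # Lower bound: the feature must not be older than the TTL.
    lower_ok = df["feature_timestamp"] >= df["event_timestamp"] - ttl
    return upper_ok & lower_ok


pairs = pd.DataFrame({
    "event_timestamp": pd.to_datetime(["2023-01-02 12:00"] * 3),
    "feature_timestamp": pd.to_datetime([
        "2023-01-02 09:00",  # valid: in the past and within the TTL
        "2023-01-02 18:00",  # invalid: after the event (would leak)
        "2022-12-25 09:00",  # invalid: older than the TTL
    ]),
})
mask = valid_feature_mask(pairs, pd.Timedelta(days=1))
print(mask.tolist())
```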
Practical Implementation Examples
Basic Historical Feature Retrieval
from feast import FeatureStore
import pandas as pd
# Entity dataframe must contain entity keys and a timestamp column
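# parse_dates ensures the timestamp column is loaded as datetimes rather than strings
entity_df = pd.read_csv("entity_df.csv", parse_dates=["event_timestamp"])  # Columns: driver_id, event_timestamp, label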
store = FeatureStore(repo_path=".")
# Feast automatically performs point-in-time joins
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:trips_today",
        "driver_hourly_stats:earnings_today",
    ],
).to_df()
Handling Custom Timestamp Columns
# If your timestamp column is named differently, rename it before passing to Feast
# The CSV contains 'event_time' instead of 'event_timestamp'; parse it as a datetime on load
entity_df = pd.read_csv("entity_df.csv", parse_dates=["event_time"])
training_df = store.get_historical_features(
    entity_df=entity_df.rename(columns={"event_time": "event_timestamp"}),
    features=["driver_hourly_stats:trips_today"],
).to_df()
SQL-Based Entity DataFrames
sql = """
SELECT driver_id, order_timestamp AS event_timestamp, label
FROM my_warehouse.orders
WHERE order_timestamp BETWEEN '2023-01-01' AND '2023-02-01'
"""
training_df = store.get_historical_features(
    entity_df=sql,
    features=["driver_hourly_stats:trips_today"],
).to_df()
Summary
- Feast prevents data leakage through point-in-time joins that restrict feature values to those available at each entity's event timestamp.
- The implementation relies on infer_event_timestamp_from_entity_df in offline_utils.py for timestamp detection and get_feature_view_query_context for window calculation.
- SQL generation via build_point_in_time_query creates store-specific queries using LATERAL joins or window functions.
- The FeatureStore.get_historical_features method in feature_store.py provides the public interface, delegating to provider-specific offline stores.
Frequently Asked Questions
What is data leakage in feature engineering?
Data leakage occurs when information from outside the training dataset is used to create the model, resulting in overly optimistic performance metrics. In temporal data, this happens when future feature values are joined to past entity rows. Feast's point-in-time joins prevent this by strictly enforcing that only data available at or before the event timestamp is retrieved.
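A small sketch shows what such leakage looks like in practice. The data and column names are invented for illustration; the naive approach joins each entity row to the most recent feature row per driver, ignoring when the event happened.

```python
import pandas as pd

entity_df = pd.DataFrame({
    "driver_id": [1001],
    "event_timestamp": pd.to_datetime(["2023-01-02 12:00"]),
})
feature_df = pd.DataFrame({
    "driver_id": [1001, 1001],
    "feature_timestamp": pd.to_datetime(["2023-01-01 10:00", "2023-01-03 10:00"]),
    "trips_today": [5, 42],
})

# Naive join: take each driver's latest feature row regardless of event time.
latest = feature_df.sort_values("feature_timestamp").groupby("driver_id").tail(1)
naive = entity_df.merge(latest, on="driver_id")

# Flag rows where the joined feature was recorded after the event it describes.
naive["leaked"] = naive["feature_timestamp"] > naive["event_timestamp"]
print(naive[["event_timestamp", "feature_timestamp", "trips_today", "leaked"]])
```

Here the training row for 2023-01-02 picks up a value recorded on 2023-01-03, which the model could never have seen at prediction time.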
How does Feast determine which timestamp column to use?
By default, Feast expects an event_timestamp column in the entity dataframe. If absent, the infer_event_timestamp_from_entity_df function in sdk/python/feast/infra/offline_stores/offline_utils.py (lines 28-44) attempts to infer the datetime column automatically. If inference fails, Feast raises a FeastEntityDFMissingColumnsError.
What happens if no TTL is set on a FeatureView?
If no Time-To-Live (TTL) is specified, Feast does not apply a lower bound to the time window when scanning for features. This means the join will consider all historical feature values up to the entity timestamp, potentially scanning larger datasets but ensuring no future leakage occurs.
Does Feast support point-in-time joins for real-time predictions?
Point-in-time joins are primarily used for batch historical retrieval during training. For real-time predictions, Feast uses the online store to fetch the latest feature values via get_online_features, which does not perform point-in-time joins but rather retrieves the current state of features.