
How Feast Handles Point-in-Time Joins to Prevent Data Leakage During Training

March 1, 2026, feast-dev/feast

Feast prevents data leakage by executing point-in-time joins that attach only feature values known at or before each training row's event timestamp, leveraging TTL-aware SQL templates or Pandas asof-merges to guarantee temporal correctness.

Feast (feast-dev/feast) prevents data leakage in machine learning training pipelines by building training datasets with point-in-time joins. Each feature value joined to an entity row reflects only data that was available at that row's event timestamp. This temporal correctness is enforced across several layers of the offline store, from timestamp inference to provider-specific query execution.

Understanding Point-in-Time Joins in Feast

Point-in-time (PIT) joins are the mechanism by which Feast attaches historical feature values to entity rows without using future information. For each row in the training dataset, Feast retrieves feature values where the feature timestamp is less than or equal to the entity's event timestamp, while respecting the Time-To-Live (TTL) window defined on the FeatureView.
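
To make the rule concrete, here is a minimal sketch of a point-in-time join using pandas.merge_asof. It is not Feast's own code; the dataframes, the feature_timestamp column name, and the six-hour TTL are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical entity rows: one training example per (driver_id, event_timestamp).
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1001],
        "event_timestamp": pd.to_datetime(["2023-01-01 12:00", "2023-01-03 12:00"]),
    }
)

# Hypothetical feature rows, stamped with the time each value became known.
feature_df = pd.DataFrame(
    {
        "driver_id": [1001, 1001, 1001],
        "feature_timestamp": pd.to_datetime(
            ["2023-01-01 10:00", "2023-01-01 13:00", "2023-01-05 09:00"]
        ),
        "trips_today": [3, 7, 12],
    }
)

ttl = pd.Timedelta(hours=6)  # assumed TTL for this sketch

# direction="backward" only looks at feature rows at or before the entity
# timestamp; tolerance discards matches older than the TTL.
training_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("feature_timestamp"),
    left_on="event_timestamp",
    right_on="feature_timestamp",
    by="driver_id",
    direction="backward",
    tolerance=ttl,
)
print(training_df[["driver_id", "event_timestamp", "trips_today"]])
```

The first entity row picks up the 10:00 value (3) and ignores the 13:00 value, which lies in its future. The second row gets a missing value because the newest earlier feature row is older than the TTL.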

The Four-Layer Architecture of PIT Joins

The implementation spans four critical layers in the offline store infrastructure:

1. Entity Timestamp Inference

The infer_event_timestamp_from_entity_df function in sdk/python/feast/infra/offline_stores/offline_utils.py (lines 28-44) automatically detects which column contains the event time. If the default event_timestamp column is missing, Feast infers the datetime column or raises a FeastEntityDFMissingColumnsError.
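
The inference logic described above can be sketched roughly as follows. The helper name infer_timestamp_column and the use of ValueError are assumptions for this illustration and do not reproduce Feast's implementation or its error type.

```python
import pandas as pd

DEFAULT_TIMESTAMP_COL = "event_timestamp"


def infer_timestamp_column(entity_df: pd.DataFrame) -> str:
    """Hypothetical helper: pick the event-time column of an entity dataframe."""
    # Prefer the conventional column name when it is present.
    if DEFAULT_TIMESTAMP_COL in entity_df.columns:
        return DEFAULT_TIMESTAMP_COL

    # Otherwise fall back to a datetime-typed column, if it is unambiguous.
    datetime_cols = [
        col
        for col in entity_df.columns
        if pd.api.types.is_datetime64_any_dtype(entity_df[col])
    ]
    if len(datetime_cols) == 1:
        return datetime_cols[0]

    # Zero or several candidates: refuse to guess (ValueError is a stand-in here).
    raise ValueError(
        f"Expected a '{DEFAULT_TIMESTAMP_COL}' column or exactly one datetime "
        f"column, found {datetime_cols}"
    )


df = pd.DataFrame(
    {"driver_id": [1001], "event_time": pd.to_datetime(["2023-01-01 12:00"])}
)
print(infer_timestamp_column(df))
```

Refusing to guess when several datetime columns exist is a deliberate choice in this sketch: picking the wrong column would silently shift the join boundary.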

2. Query Context Construction

For each FeatureView, get_feature_view_query_context in offline_utils.py (lines 101-177) builds a FeatureViewQueryContext containing:

Join keys mapped from entity columns
TTL converted to seconds
The source's timestamp_field
Time window boundaries: maximum entity timestamp and minimum timestamp (entity timestamp minus TTL)
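
As a rough illustration of what such a context holds, the sketch below models the four items in the list above with a dataclass. The class name, field names, and build_context helper are hypothetical and only approximate the shape of FeatureViewQueryContext.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class QueryContextSketch:
    """Hypothetical stand-in for the per-FeatureView query context."""

    join_keys: Dict[str, str]                 # feature view key -> entity column
    ttl_seconds: int                          # 0 means "no TTL" in this sketch
    timestamp_field: str                      # timestamp column of the source
    min_event_timestamp: Optional[datetime]   # earliest entity time minus TTL
    max_event_timestamp: datetime             # latest entity time


def build_context(
    join_keys: Dict[str, str],
    ttl: Optional[timedelta],
    timestamp_field: str,
    entity_ts_min: datetime,
    entity_ts_max: datetime,
) -> QueryContextSketch:
    ttl_seconds = int(ttl.total_seconds()) if ttl else 0
    # Without a TTL there is no lower bound on the source scan.
    min_ts = entity_ts_min - timedelta(seconds=ttl_seconds) if ttl_seconds else None
    return QueryContextSketch(
        join_keys=join_keys,
        ttl_seconds=ttl_seconds,
        timestamp_field=timestamp_field,
        min_event_timestamp=min_ts,
        max_event_timestamp=entity_ts_max,
    )


ctx = build_context(
    join_keys={"driver_id": "driver_id"},
    ttl=timedelta(days=1),
    timestamp_field="event_timestamp",
    entity_ts_min=datetime(2023, 1, 10),
    entity_ts_max=datetime(2023, 1, 20),
)
print(ctx.ttl_seconds, ctx.min_event_timestamp, ctx.max_event_timestamp)
```

The two boundaries let a store prune the source table to the window that could possibly match before it runs the row-level join.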

3. SQL and Pandas Rendering

The build_point_in_time_query function (lines 84-124 in offline_utils.py) renders the actual join logic. For BigQuery, Redshift, Snowflake, and other SQL-based stores, it generates templated SQL using LATERAL joins or MAX-OVER windows. For local execution, Feast falls back to Pandas asof-merges. Each store's get_historical_features implementation calls this function, for example in bigquery.py (lines 235-272).
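
The SQL side can be illustrated with a small, self-contained query run against an in-memory SQLite database. This is not one of Feast's templates: the table names, the integer timestamps, and the scalar subquery (used here instead of a LATERAL join or window function) are assumptions made to keep the sketch runnable anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (driver_id INTEGER, event_ts INTEGER);
    CREATE TABLE features (driver_id INTEGER, feature_ts INTEGER, trips_today INTEGER);
    INSERT INTO entity VALUES (1001, 100), (1001, 500);
    INSERT INTO features VALUES (1001, 90, 3), (1001, 110, 7), (1001, 600, 12);
    """
)

TTL_SECONDS = 50  # assumed TTL; timestamps are plain epoch-style integers

# For each entity row, take the newest feature row inside the window
# [event_ts - ttl, event_ts]. Rows after event_ts are never eligible.
PIT_QUERY = """
SELECT e.driver_id,
       e.event_ts,
       (SELECT f.trips_today
          FROM features f
         WHERE f.driver_id = e.driver_id
           AND f.feature_ts <= e.event_ts
           AND f.feature_ts >= e.event_ts - :ttl
         ORDER BY f.feature_ts DESC
         LIMIT 1) AS trips_today
  FROM entity e
 ORDER BY e.event_ts
"""

rows = conn.execute(PIT_QUERY, {"ttl": TTL_SECONDS}).fetchall()
print(rows)  # [(1001, 100, 3), (1001, 500, None)]
```

The second entity row returns NULL because its only earlier feature rows fall outside the TTL window, and the row at 600 lies in its future.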

4. Provider Facade

The public API entry point FeatureStore.get_historical_features in feature_store.py (lines 1242-1286) delegates to Provider.get_historical_features in provider.py (lines 48-60), which orchestrates the offline store execution and returns a RetrievalJob.

How Feast Executes Point-in-Time Joins

The complete workflow follows these steps:

Validation: Feast validates the entity dataframe using assert_expected_columns_in_entity_df, ensuring required columns exist.
Timestamp Detection: If not explicitly named event_timestamp, the system infers the timestamp column via infer_event_timestamp_from_entity_df.
Context Building: For every referenced FeatureView, get_feature_view_query_context calculates the valid time window using the entity timestamp range and TTL.
Query Generation: build_point_in_time_query creates store-specific SQL or Pandas logic that filters source rows where feature_timestamp <= entity_timestamp and feature_timestamp >= (entity_timestamp - TTL).
Execution: The offline store executes the generated query, returning a dataset where each row contains entity columns joined only with historically valid features.
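
The validation step at the start of this workflow can be approximated with a few lines of pandas. The helper below is a hypothetical stand-in, not Feast's assert_expected_columns_in_entity_df, and it assumes the join keys and timestamp column name are already known.

```python
from typing import Iterable, List

import pandas as pd


def missing_entity_columns(
    entity_df: pd.DataFrame,
    join_keys: Iterable[str],
    timestamp_col: str = "event_timestamp",
) -> List[str]:
    """Hypothetical check: return required columns absent from the entity dataframe."""
    expected = list(join_keys) + [timestamp_col]
    return [col for col in expected if col not in entity_df.columns]


entity_df = pd.DataFrame(
    {"driver_id": [1001], "event_timestamp": pd.to_datetime(["2023-01-01 12:00"])}
)
print(missing_entity_columns(entity_df, join_keys=["driver_id"]))
print(missing_entity_columns(entity_df, join_keys=["driver_id", "customer_id"]))
```

Failing early on missing columns is cheaper than discovering the problem after a warehouse query has already been submitted.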

This process is documented in detail in docs/getting-started/concepts/point-in-time-joins.md.

Practical Implementation Examples

Basic Historical Feature Retrieval

from feast import FeatureStore
import pandas as pd

# The entity dataframe must contain the entity keys and a datetime timestamp column.
# parse_dates makes pandas load event_timestamp as datetimes rather than strings.
entity_df = pd.read_csv(
    "entity_df.csv",  # Columns: driver_id, event_timestamp, label
    parse_dates=["event_timestamp"],
)

store = FeatureStore(repo_path=".")

# Feast automatically performs point-in-time joins
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:trips_today",
        "driver_hourly_stats:earnings_today",
    ],
).to_df()

Handling Custom Timestamp Columns


# If your timestamp column is named differently, rename it before passing to Feast.
# This file contains 'event_time' instead of 'event_timestamp'.
entity_df = pd.read_csv("entity_df.csv", parse_dates=["event_time"])

training_df = store.get_historical_features(
    entity_df=entity_df.rename(columns={"event_time": "event_timestamp"}),
    features=["driver_hourly_stats:trips_today"],
).to_df()

SQL-Based Entity DataFrames

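# A SQL string is only accepted by SQL-backed offline stores (for example BigQuery).
# The query aliases its timestamp column to event_timestamp.
sql = """
SELECT driver_id, order_timestamp AS event_timestamp, label
FROM my_warehouse.orders
WHERE order_timestamp BETWEEN '2023-01-01' AND '2023-02-01'
"""

training_df = store.get_historical_features(
    entity_df=sql,
    features=["driver_hourly_stats:trips_today"],
).to_df()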

Summary

Feast prevents data leakage through point-in-time joins that restrict feature values to those available at each entity's event timestamp.
The implementation relies on infer_event_timestamp_from_entity_df in offline_utils.py for timestamp detection and get_feature_view_query_context for window calculation.
SQL generation via build_point_in_time_query creates store-specific queries using LATERAL joins or window functions.
The FeatureStore.get_historical_features method in feature_store.py provides the public interface, delegating to provider-specific offline stores.

Frequently Asked Questions

What is data leakage in feature engineering?

Data leakage occurs when information from outside the training dataset is used to create the model, resulting in overly optimistic performance metrics. In temporal data, this happens when future feature values are joined to past entity rows. Feast's point-in-time joins prevent this by strictly enforcing that only data available at or before the event timestamp is retrieved.

How does Feast determine which timestamp column to use?

By default, Feast expects an event_timestamp column in the entity dataframe. If absent, the infer_event_timestamp_from_entity_df function in sdk/python/feast/infra/offline_stores/offline_utils.py (lines 28-44) attempts to infer the datetime column automatically. If inference fails, Feast raises a FeastEntityDFMissingColumnsError.

What happens if no TTL is set on a FeatureView?

If no Time-To-Live (TTL) is specified, Feast does not apply a lower bound to the time window when scanning for features. This means the join will consider all historical feature values up to the entity timestamp, potentially scanning larger datasets but ensuring no future leakage occurs.
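
The effect of an absent TTL can be shown with a small pure-Python sketch. The function latest_value_as_of and its sample data are hypothetical, and treating a missing or zero TTL as "no lower bound" is an assumption of this sketch.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def latest_value_as_of(
    rows: List[Tuple[datetime, float]],
    as_of: datetime,
    ttl: Optional[timedelta] = None,
) -> Optional[float]:
    """Hypothetical lookup: newest value at or before as_of, optionally TTL-bounded."""
    # No TTL (None or zero) means the window has no lower bound.
    lower = as_of - ttl if ttl else None
    eligible = [
        (ts, value)
        for ts, value in rows
        if ts <= as_of and (lower is None or ts >= lower)
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[0])[1]


history = [
    (datetime(2023, 1, 1), 1.0),
    (datetime(2023, 6, 1), 2.0),
    (datetime(2024, 1, 1), 3.0),  # lies in the future of the lookup below
]
as_of = datetime(2023, 12, 1)

print(latest_value_as_of(history, as_of))                          # 2.0
print(latest_value_as_of(history, as_of, ttl=timedelta(days=30)))  # None
```

In both calls the 2024 value is excluded, so the upper bound that prevents leakage holds with or without a TTL. Only the lower bound changes.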

Does Feast support point-in-time joins for real-time predictions?

Point-in-time joins are primarily used for batch historical retrieval during training. For real-time predictions, Feast uses the online store to fetch the latest feature values via get_online_features, which does not perform point-in-time joins but rather retrieves the current state of features.
