Point-in-Time Join (PIT Join) in Feature Engineering

Compiled by: archimekai

Published by archimekai · 2025-08-12 08:01:58

Table of Contents

Related Systems and Code
Related Literature
- The Hopsworks Feature Store for Machine Learning
- Optimizing Data Pipelines for Machine Learning in Feature Stores
Related Techniques
- temporal join

Related Systems and Code

Azure Managed Feature Store

The underlying offline storage for Azure Managed Feature Store is Azure Data Lake Storage Gen2 (ADLS Gen2).
https://learn.microsoft.com/en-us/azure/machine-learning/offline-retrieval-point-in-time-join-concepts

Feathr-ai

https://github.com/feathr-ai/feathr/blob/main/docs/concepts/point-in-time-join.md

Spark-PIT

https://github.com/Ackuq/spark-pit

Databricks Tempo as-of join

https://databrickslabs.github.io/tempo/about/user-guide.html#as-of-join
watch_accel_tsdf.asofJoin(phone_accel_tsdf, right_prefix="phone_accel")
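
To make the as-of join semantics concrete, here is a minimal pure-Python sketch. It is not Tempo's implementation or API: the function `asof_join`, the row layout (`key`, `ts`, `value`) and the sample data are all hypothetical, and it assumes that a right-side row with exactly the same timestamp as the left row counts as a match. For each left row it attaches the latest right row with the same key whose timestamp is not later than the left timestamp, and prefixes the attached columns in the spirit of `right_prefix`.

```python
from bisect import bisect_right


def asof_join(left_rows, right_rows, right_prefix="right"):
    """Attach to each left row the latest right row with the same key and ts <= left ts.

    Hypothetical helper for illustration only; rows are dicts with
    "key", "ts" and "value" fields.
    """
    # Group the right side by key, with each group sorted by timestamp.
    by_key = {}
    for row in sorted(right_rows, key=lambda r: r["ts"]):
        by_key.setdefault(row["key"], []).append(row)

    joined = []
    for left in left_rows:
        candidates = by_key.get(left["key"], [])
        timestamps = [r["ts"] for r in candidates]
        # Index of the last right row whose ts <= left ts, or -1 if there is none.
        idx = bisect_right(timestamps, left["ts"]) - 1
        match = candidates[idx] if idx >= 0 else None
        out = dict(left)
        out[f"{right_prefix}_ts"] = match["ts"] if match else None
        out[f"{right_prefix}_value"] = match["value"] if match else None
        joined.append(out)
    return joined


# Made-up sample data: integer timestamps keep the example easy to follow.
watch = [
    {"key": "u1", "ts": 10, "value": 0.5},
    {"key": "u1", "ts": 3, "value": 0.1},
    {"key": "u2", "ts": 5, "value": 0.9},
]
phone = [
    {"key": "u1", "ts": 4, "value": 1.0},
    {"key": "u1", "ts": 9, "value": 2.0},
    {"key": "u1", "ts": 12, "value": 3.0},
]

result = asof_join(watch, phone, right_prefix="phone_accel")
for row in result:
    print(row)
```

The row at `ts=10` picks up the phone reading from `ts=9` and never the one from `ts=12`, which is the property that keeps future data out of a training row. Rows with no earlier reading for their key get `None`.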

Databricks Feature Engineering in Unity Catalog

Point-in-time feature joins: https://docs.databricks.com/aws/en/machine-learning/feature-store/time-series
The offline store is a Delta table.
online store

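from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()
# user_features_df DataFrame contains the following columns:
# - user_id
# - ts
# - purchases_30d
# - is_free_trial_active
# Create a time series feature table: the timestamp column is part of the primary key.
fe.create_table(
  name="ml.ads_team.user_features",
  primary_keys=["user_id", "ts"],
  timeseries_columns="ts",
  features_df=user_features_df,
)
feature_lookups = [
  # Point-in-time lookup: matches on u_id and takes the latest feature row
  # at or before ad_impression_ts.
  FeatureLookup(
    table_name="ml.ads_team.user_features",
    feature_names=["purchases_30d", "is_free_trial_active"],
    lookup_key="u_id",
    timestamp_lookup_key="ad_impression_ts",
    # Optional: ignore feature values older than 7 days
    # (requires `from datetime import timedelta`).
    # lookback_window=timedelta(days=7),
  ),
  # Regular lookup without a timestamp key.
  FeatureLookup(
    table_name="ml.ads_team.ad_features",
    feature_names=["sports_relevance", "food_relevance"],
    lookup_key="ad_id",
  )
]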

# raw_clickstream DataFrame contains the following columns:
# - u_id
# - ad_id
# - ad_impression_ts
training_set = fe.create_training_set(
  df=raw_clickstream,
  feature_lookups=feature_lookups,
  exclude_columns=["u_id", "ad_id", "ad_impression_ts"],
  label="did_click",
)
training_df = training_set.load_df()
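
The point-in-time lookup above can be approximated locally with pandas, which may help when reasoning about what ends up in a training row. This is only a sketch under stated assumptions, not what Databricks runs internally: the sample data is made up, the DataFrame and column names are borrowed from the example above, and a 7-day `tolerance` stands in for the commented-out `lookback_window`.

```python
import pandas as pd

# Made-up feature rows, keyed by (user_id, ts).
user_features_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-01"]),
    "purchases_30d": [1, 4, 7],
})

# Made-up label rows, keyed by (u_id, ad_impression_ts).
raw_clickstream = pd.DataFrame({
    "u_id": ["u1", "u1", "u2"],
    "ad_impression_ts": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-20"]),
    "did_click": [0, 1, 0],
})

# merge_asof requires both sides to be sorted by their time column.
training_df = pd.merge_asof(
    raw_clickstream.sort_values("ad_impression_ts"),
    user_features_df.sort_values("ts"),
    left_on="ad_impression_ts",
    right_on="ts",
    left_by="u_id",
    right_by="user_id",
    direction="backward",            # only feature rows at or before the impression
    tolerance=pd.Timedelta(days=7),  # stand-in for a 7-day lookback window
)

print(training_df[["u_id", "ad_impression_ts", "purchases_30d", "did_click"]])
```

With this data the two `u1` impressions pick up the feature values from 2024-01-01 and 2024-01-10 respectively, while the `u2` impression gets a missing value because its only feature row is 19 days old, outside the 7-day window.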

Delta Lake primary keys

https://docs.databricks.com/aws/en/tables/constraints#declare-primary-key-and-foreign-key-relationships
Primary and foreign keys are informational only
Informational key constraints can improve performance by supporting query optimizations.
Databricks does not enforce key constraints.
It is the user’s responsibility to check whether a constraint is satisfied. Relying on a constraint that is not satisfied can lead to incorrect query results.
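
Because the key constraint is not enforced, a feature table declared with the primary key `["user_id", "ts"]` can still contain duplicate keys, and a point-in-time lookup over it would then be ambiguous. Below is a minimal sketch of such a check in pandas; the helper `find_duplicate_keys` and the sample data are hypothetical, and on a real Delta table the same check would more likely be written as a Spark or SQL aggregation.

```python
import pandas as pd


def find_duplicate_keys(df, key_columns):
    """Return every row whose key_columns combination occurs more than once.

    Hypothetical helper for illustration; an empty result means the
    uniqueness part of the declared primary key holds for this DataFrame.
    """
    duplicated = df.duplicated(subset=key_columns, keep=False)
    return df[duplicated]


# Made-up data: (u1, 2024-01-01) appears twice with different feature values.
user_features_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"]),
    "purchases_30d": [1, 2, 7, 8],
})

violations = find_duplicate_keys(user_features_df, ["user_id", "ts"])
print(violations)
```

Running a check like this before writing to the feature table is one way to take on the responsibility the documentation assigns to the user.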