Point-in-Time Join (PIT Join) in Feature Engineering
Compiled by: archimekai
Related systems and code
Azure Managed Feature Store
The underlying offline storage for Azure Managed Feature Store is Azure Data Lake Storage Gen2 (ADLS Gen2).
https://learn.microsoft.com/en-us/azure/machine-learning/offline-retrieval-point-in-time-join-concepts
Feathr-ai
https://github.com/feathr-ai/feathr/blob/main/docs/concepts/point-in-time-join.md
Spark-PIT
https://github.com/Ackuq/spark-pit
Databricks tempo as-of join
https://databrickslabs.github.io/tempo/about/user-guide.html#as-of-join
watch_accel_tsdf.asofJoin(phone_accel_tsdf, right_prefix="phone_accel")
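An as-of join attaches to each left row the most recent right row at or before the left timestamp. The sketch below illustrates that semantics with pandas `merge_asof` rather than tempo's own API; the table and column names (`labels`, `features`, `event_ts`, `purchases_30d`) are made up for illustration.

```python
import pandas as pd

# Hypothetical label (left) and feature (right) tables.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 12:00", "2024-01-01 11:00"]
    ),
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(
        ["2024-01-01 09:00", "2024-01-01 11:00", "2024-01-01 11:30"]
    ),
    "purchases_30d": [3, 4, 7],
})

# merge_asof requires both frames to be sorted by their time key.
pit = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("ts"),
    left_on="event_ts",
    right_on="ts",
    by="user_id",          # match rows of the same user only
    direction="backward",  # never look at feature rows from the future
)
print(pit)
```

User 2's only feature row is written after the label event, so that label gets no feature value. This is the leakage a PIT join is meant to prevent.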
Databricks Feature Engineering in Unity Catalog
Point-in-time feature joins: https://docs.databricks.com/aws/en/machine-learning/feature-store/time-series
The offline store is a Delta table.
online store
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# user_features_df DataFrame contains the following columns:
# - user_id
# - ts
# - purchases_30d
# - is_free_trial_active
fe.create_table(
    name="ml.ads_team.user_features",
    primary_keys=["user_id", "ts"],
    timeseries_columns="ts",
    features_df=user_features_df,
)

feature_lookups = [
    # Time series table: matched on u_id, as of ad_impression_ts.
    FeatureLookup(
        table_name="ml.ads_team.user_features",
        feature_names=["purchases_30d", "is_free_trial_active"],
        lookup_key="u_id",
        timestamp_lookup_key="ad_impression_ts",
        # Optional FeatureLookup argument (needs `from datetime import timedelta`):
        # lookback_window=timedelta(days=7)
    ),
    # No timestamp_lookup_key: a plain key lookup on ad_id.
    FeatureLookup(
        table_name="ml.ads_team.ad_features",
        feature_names=["sports_relevance", "food_relevance"],
        lookup_key="ad_id",
    ),
]

# raw_clickstream DataFrame contains the following columns:
# - u_id
# - ad_id
# - ad_impression_ts
training_set = fe.create_training_set(
    df=raw_clickstream,
    feature_lookups=feature_lookups,
    exclude_columns=["u_id", "ad_id", "ad_impression_ts"],
    label="did_click",
)
training_df = training_set.load_df()
Delta Lake primary key
https://docs.databricks.com/aws/en/tables/constraints#declare-primary-key-and-foreign-key-relationships
Primary and foreign keys are informational only
Informational key constraints can improve performance by supporting query optimizations.
Databricks does not enforce key constraints.
It is the user’s responsibility to check whether a constraint is satisfied. Relying on a constraint that is not satisfied can lead to incorrect query results.
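Because the key constraint is not enforced, it may be worth checking uniqueness of the declared key before relying on it in a PIT join. A minimal sketch in pandas, assuming a hypothetical `user_features` table keyed on (`user_id`, `ts`):

```python
import pandas as pd

# Hypothetical feature rows; (user_id, ts) is the declared primary key.
user_features = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "ts": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01"]
    ),
    "purchases_30d": [3, 4, 7, 8],
})

key_cols = ["user_id", "ts"]
# keep=False marks every row that shares its key with another row.
violations = user_features[user_features.duplicated(subset=key_cols, keep=False)]
key_is_unique = violations.empty
print(violations)
```

Here user 2 has two rows for the same timestamp, so a join on this key would be ambiguous about which feature value to return.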
Flint temporal join
https://github.com/twosigma/flint?tab=readme-ov-file#temporal-join-functions
leftTSRdd.leftJoin(rightTSRdd, tolerance = "1day")
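The tolerance bounds how far back in time the join may look for a match. The sketch below shows that idea with the Python standard library, not Flint's API; `asof_lookup` is a hypothetical helper, and treating the boundary as inclusive is an assumption of this sketch.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def asof_lookup(times, values, t, tolerance):
    """Return the latest value with timestamp in [t - tolerance, t], else None.

    `times` must be sorted ascending and aligned with `values`.
    """
    i = bisect_right(times, t) - 1  # index of the last timestamp <= t
    if i < 0 or t - times[i] > tolerance:
        return None  # nothing in the past, or the match is too old
    return values[i]


times = [datetime(2024, 1, 1, 12), datetime(2024, 1, 3)]
values = [10, 20]
one_day = timedelta(days=1)

hit = asof_lookup(times, values, datetime(2024, 1, 2), one_day)   # 12 hours back
miss = asof_lookup(times, values, datetime(2024, 1, 5), one_day)  # 2 days back
```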
Related literature
The Hopsworks Feature Store for Machine Learning
Optimizing Data Pipelines for Machine Learning in Feature Stores
Liu, R., Park, K., Psallidas, F., Zhu, X., Mo, J., Sen, R., … & Camacho-Rodríguez, J. (2023). Optimizing Data Pipelines for Machine Learning in Feature Stores. Proceedings of the VLDB Endowment, 16(13), 4230-4239.
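First, it optimizes the data layout through horizontal partitioning, which reduces the amount of data read during a PIT join.
Second, it reuses results to further reduce computation.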
it is evident that the layout of the source datasets on the storage system (e.g., HDFS, object stores, and data warehouses) significantly affects the performance of PIT joins. In particular, partitioning horizontally on the time dimension allows the compute engine to skip reading large portions of data, making it a highly effective strategy for feature source dataset layout.
Could the feature table be partitioned by minute?
Other strategies such as sharding (or bucketing in HDFS and object stores [55]) both label and feature source datasets could also be used;
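The time-partitioning idea in the excerpt above can be sketched as follows. This is a toy model in plain Python, not the paper's implementation: it assumes daily partitions and a bounded lookback window, and the helper names `partition_by_day` and `partitions_to_read` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def partition_by_day(rows):
    """Group (ts, value) feature rows into per-day partitions."""
    parts = defaultdict(list)
    for ts, value in rows:
        parts[ts.date()].append((ts, value))
    return parts


def partitions_to_read(label_ts, lookback):
    """Days that can hold a feature row in [label_ts - lookback, label_ts]."""
    day = (label_ts - lookback).date()
    days = []
    while day <= label_ts.date():
        days.append(day)
        day += timedelta(days=1)
    return days


rows = [(datetime(2024, 1, d, 8), d) for d in range(1, 11)]  # ten daily rows
parts = partition_by_day(rows)

# A label at Jan 10 09:00 with a 2-day lookback touches 3 of 10 partitions.
needed = partitions_to_read(datetime(2024, 1, 10, 9), timedelta(days=2))
scanned = [row for day in needed for row in parts.get(day, [])]
```

The granularity is a free choice in this sketch. Finer partitions, such as per minute, prune more precisely but multiply the number of partitions to manage.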
two other state-of-the-art PIT join algorithms: Early stop sort-merge PIT join and Union PIT join [11, 49].
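The general idea behind a union-style PIT join can be sketched in pandas: union the label and feature rows, sort by key and time, then carry the latest feature value forward onto each label row. This is a simplified illustration under stated assumptions, not the implementation from [11, 49]; the tables are made up, and feature values are assumed to be non-null so that a forward fill is safe.

```python
import pandas as pd

# Hypothetical label and feature tables sharing the key and time columns.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 12:00", "2024-01-01 11:00"]
    ),
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(
        ["2024-01-01 09:00", "2024-01-01 11:00", "2024-01-01 11:30"]
    ),
    "purchases_30d": [3.0, 4.0, 7.0],
})

# 1. Union both tables; label rows carry no feature value yet.
unioned = pd.concat(
    [features.assign(is_label=0), labels.assign(is_label=1)],
    ignore_index=True,
)
# 2. Sort by key and time; on equal timestamps feature rows (0) come first.
unioned = unioned.sort_values(["user_id", "ts", "is_label"])
# 3. Within each key, carry the latest feature value forward onto label rows.
unioned["purchases_30d"] = unioned.groupby("user_id")["purchases_30d"].ffill()
# 4. Keep only the label rows, now enriched with point-in-time features.
result = (
    unioned[unioned["is_label"] == 1]
    .drop(columns="is_label")
    .reset_index(drop=True)
)
print(result)
```

No join condition is evaluated here; the work is a sort plus a single ordered pass per key.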
Related techniques
temporal join