Compiled by: archimekai

Related systems and code

Azure Managed Feature Store

The underlying offline storage for Azure Managed Feature Store is Azure Data Lake Storage Gen2 (ADLS Gen2).
https://learn.microsoft.com/en-us/azure/machine-learning/offline-retrieval-point-in-time-join-concepts

Feathr-ai

https://github.com/feathr-ai/feathr/blob/main/docs/concepts/point-in-time-join.md

Spark-PIT

https://github.com/Ackuq/spark-pit

Databricks tempo as of join

https://databrickslabs.github.io/tempo/about/user-guide.html#as-of-join
watch_accel_tsdf.asofJoin(phone_accel_tsdf, right_prefix="phone_accel")
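
To make the as-of join semantics concrete, here is a minimal pure-Python sketch. It does not use tempo: the function name `asof_join`, the tuple layout, and the sample rows are all hypothetical, and it assumes a right-hand row matches when its timestamp is at or before the left-hand timestamp.

```python
from bisect import bisect_right


def asof_join(left, right):
    """For each (key, ts) in left, attach the latest right value
    with the same key and right_ts <= ts (None if there is none)."""
    # Group right-hand rows by key, keeping timestamps sorted.
    by_key = {}
    for key, ts, value in sorted(right, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for key, ts in left:
        times, values = by_key.get(key, ([], []))
        i = bisect_right(times, ts)  # count of right rows with right_ts <= ts
        joined.append((key, ts, values[i - 1] if i > 0 else None))
    return joined


# Hypothetical sample data: (key, ts) on the left, (key, ts, value) on the right.
watch = [("u1", 10), ("u1", 25), ("u2", 5)]
phone = [("u1", 8, 0.1), ("u1", 20, 0.2), ("u2", 7, 0.9)]
result = asof_join(watch, phone)
```

The row for u2 gets no match because its only right-hand row is in the future relative to the left-hand timestamp; avoiding that kind of leakage is the point of a point-in-time join.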

Databricks Unity Catalog feature engineering

Point-in-time feature joins: https://docs.databricks.com/aws/en/machine-learning/feature-store/time-series
The offline store is a Delta table.
online store

from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()
# user_features_df DataFrame contains the following columns:
# - user_id
# - ts
# - purchases_30d
# - is_free_trial_active
fe.create_table(
  name="ml.ads_team.user_features",
  primary_keys=["user_id", "ts"],
  timeseries_columns="ts",
  features_df=user_features_df,
  # Note: lookback_window=timedelta(days=7) is an argument of FeatureLookup, not create_table.
)
feature_lookups = [
  FeatureLookup(
    table_name="ml.ads_team.user_features",
    feature_names=["purchases_30d", "is_free_trial_active"],
    lookup_key="u_id",
    timestamp_lookup_key="ad_impression_ts"
  ),
  FeatureLookup(
    table_name="ml.ads_team.ad_features",
    feature_names=["sports_relevance", "food_relevance"],
    lookup_key="ad_id",
  )
]

# raw_clickstream DataFrame contains the following columns:
# - u_id
# - ad_id
# - ad_impression_ts
training_set = fe.create_training_set(
  df=raw_clickstream,
  feature_lookups=feature_lookups,
  exclude_columns=["u_id", "ad_id", "ad_impression_ts"],
  label="did_click",
)
training_df = training_set.load_df()

Delta Lake primary key

https://docs.databricks.com/aws/en/tables/constraints#declare-primary-key-and-foreign-key-relationships
Primary and foreign keys are informational only.
Informational key constraints can improve performance by supporting query optimizations.
Databricks does not enforce key constraints.
It is the user’s responsibility to check whether a constraint is satisfied. Relying on a constraint that is not satisfied can lead to incorrect query results.
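
Since the key constraint is not enforced, it may be worth checking it before relying on it in a point-in-time lookup. The following is a minimal sketch in plain Python; `find_duplicate_keys` and the sample rows are hypothetical and only illustrate the check, not any Databricks API.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical feature rows; (user_id, ts) is meant to be the primary key.
rows = [
    {"user_id": "u1", "ts": 1, "purchases_30d": 3},
    {"user_id": "u1", "ts": 2, "purchases_30d": 4},
    {"user_id": "u1", "ts": 2, "purchases_30d": 5},  # violates the key
]
duplicates = find_duplicate_keys(rows, ["user_id", "ts"])
```

An empty result means the declared key actually holds for the data that was checked.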

Flint temporal join

https://github.com/twosigma/flint?tab=readme-ov-file#temporal-join-functions
leftTSRdd.leftJoin(rightTSRdd, tolerance = "1day")
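
The effect of a tolerance can be sketched in plain Python. This is not Flint code: `left_join_with_tolerance` and the sample rows are hypothetical, and treating both ends of the tolerance window as inclusive is an assumption.

```python
from datetime import datetime, timedelta


def left_join_with_tolerance(left, right, tolerance):
    """For each left row, attach the most recent right value whose
    timestamp lies in [left_ts - tolerance, left_ts], else None."""
    joined = []
    for left_ts, left_val in left:
        best = None
        for right_ts, right_val in right:
            in_window = left_ts - tolerance <= right_ts <= left_ts
            if in_window and (best is None or right_ts > best[0]):
                best = (right_ts, right_val)
        joined.append((left_ts, left_val, best[1] if best else None))
    return joined


# Hypothetical sample data: (timestamp, value) pairs.
left = [(datetime(2024, 1, 3), "a"), (datetime(2024, 1, 10), "b")]
right = [(datetime(2024, 1, 2, 12), 1.0)]
joined = left_join_with_tolerance(left, right, timedelta(days=1))
```

Row "a" picks up the right-hand value from 12 hours earlier, while row "b" gets None because the only candidate is more than one day old.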

Related literature

The Hopsworks Feature Store for Machine Learning

Optimizing Data Pipelines for Machine Learning in Feature Stores

Liu, R., Park, K., Psallidas, F., Zhu, X., Mo, J., Sen, R., … & Camacho-Rodríguez, J. (2023). Optimizing Data Pipelines for Machine Learning in Feature Stores. Proceedings of the VLDB Endowment, 16(13), 4230-4239.

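First, it optimizes the data layout through horizontal partitioning, which reduces the amount of data that has to be read during a PIT join.
Second, it reduces computation further by reusing results.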

it is evident that the layout of the source datasets on the storage system (e.g., HDFS, object stores, and data warehouses) significantly affects the performance of PIT joins. In particular, partitioning horizontally on the time dimension allows the compute engine to skip reading large portions of data, making it a highly effective strategy for feature source dataset layout.
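
A rough sketch of why time partitioning lets the engine skip data: if feature rows are grouped by day, a batch of labels only needs the partitions that overlap its time range. The names below are hypothetical, timestamps are plain epoch seconds, and the bounded lookback window is an assumption; without such a bound, every earlier partition would remain a candidate and only later ones could be skipped.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def partition_by_day(feature_rows):
    """Group feature rows by the day index of their timestamp."""
    partitions = defaultdict(list)
    for row in feature_rows:
        partitions[row["ts"] // SECONDS_PER_DAY].append(row)
    return partitions


def partitions_to_read(partitions, label_min_ts, label_max_ts, lookback_seconds):
    """Day partitions that can contain a match for labels in
    [label_min_ts, label_max_ts], given a bounded lookback window."""
    first_day = (label_min_ts - lookback_seconds) // SECONDS_PER_DAY
    last_day = label_max_ts // SECONDS_PER_DAY
    return sorted(day for day in partitions if first_day <= day <= last_day)


# Hypothetical data: one feature row per day for 30 days.
features = [
    {"user_id": "u1", "ts": day * SECONDS_PER_DAY + 60, "value": day}
    for day in range(30)
]
partitions = partition_by_day(features)
needed = partitions_to_read(
    partitions,
    label_min_ts=20 * SECONDS_PER_DAY,
    label_max_ts=21 * SECONDS_PER_DAY + 3600,
    lookback_seconds=2 * SECONDS_PER_DAY,
)
```

In this example only 4 of the 30 day partitions have to be read.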

Could the feature table be partitioned by minute?

Other strategies such as sharding (or bucketing in HDFS and object stores [55]) both label and feature source datasets could also be used;
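
As a small illustration of the bucketing idea, the sketch below hashes both datasets on the join key into the same number of buckets, so that bucket i of the labels only ever has to be joined with bucket i of the features. The function names, the bucket count, and the use of CRC32 as the hash are assumptions made for the example, not something the paper specifies.

```python
import zlib


def bucket_of(key, num_buckets):
    """Stable bucket index for a string key (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets):
    """Split rows into buckets by their first field, the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


# Hypothetical data: labels are (key, ts), features are (key, ts, value).
labels = [("u1", 10), ("u2", 5), ("u3", 7)]
features = [("u1", 8, 0.1), ("u3", 1, 0.3), ("u2", 2, 0.2)]

NUM_BUCKETS = 4
label_buckets = bucketize(labels, NUM_BUCKETS)
feature_buckets = bucketize(features, NUM_BUCKETS)
```

Because both sides use the same hash and bucket count, a given key always lands in the same bucket index on both sides, and each pair of buckets can be joined independently.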

two other state-of-the-art PIT join algorithms: Early stop sort-merge PIT join and Union PIT join [11, 49].
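
As I understand the union approach, label and feature rows are unioned, sorted by key and timestamp, and the most recent feature value is carried forward onto each label row. The sketch below is my own plain-Python reading of that idea, not the implementation from the cited works; `union_pit_join`, the tuple layout, and the sample data are hypothetical.

```python
def union_pit_join(labels, features):
    """labels: (key, ts, label); features: (key, ts, value).
    Returns (key, ts, label, latest feature value with feature_ts <= ts)."""
    # Tag rows so that, on equal timestamps, features sort before labels.
    unioned = [(key, ts, 0, value) for key, ts, value in features]
    unioned += [(key, ts, 1, label) for key, ts, label in labels]
    unioned.sort(key=lambda r: (r[0], r[1], r[2]))

    joined = []
    current_key, last_feature = None, None
    for key, ts, tag, payload in unioned:
        if key != current_key:
            # New key: forget the feature value carried from the previous key.
            current_key, last_feature = key, None
        if tag == 0:
            last_feature = payload  # remember the newest feature seen so far
        else:
            joined.append((key, ts, payload, last_feature))
    return joined


# Hypothetical sample data.
labels = [("u1", 10, 1), ("u1", 3, 0), ("u2", 5, 1)]
features = [("u1", 4, "a"), ("u1", 10, "b"), ("u2", 9, "c")]
joined = union_pit_join(labels, features)
```

A single sort followed by one linear pass replaces a per-label search, which is what makes this formulation a natural fit for window functions in a distributed engine.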

Related techniques

temporal join
