Fabio Buso
VP Engineering
Moritz Meister
Software Engineer
Jim Dowling
CEO
Davit Bzhalava
Head of Data Science
TLDR; Hopsworks Feature Store interacts with the new Python and Scala/Java Software Development Kit (SDK) available in Hopsworks 2.0. The new SDK builds on our extensive experience working with Enterprise customers and users with Enterprise requirements. With this new release, we consolidate multiple libraries with a single one. We named it `HSFS` (HopsworkS Feature Store) and in this blog post we will be looking at some of the improvements and key features that the new SDK brings.
Today, we’re introducing the new Hopsworks Feature Store API. Rebuilt from the ground up, today’s release includes the first set of new endpoints and features we’re launching so developers can help the world connect to the public conversation happening on Twitter.
If you can’t wait to check it out, visit Hopsworks.ai to get started. If you can, then read on for more about what we’re building, what’s new about the Feature Store API v2.
The Hopsworks Feature Store was first released at the end of 2018, and it included a new type of Feature Store API based on the FeatureGroup (DataFrame). When designing the Hopsworks Feature Store, we looked at existing feature stores (Michelangelo by Uber and Zipline by AirBnb) that provided Domain Specific Language (DSL) APIs to their Feature Stores - you declaratively define features, and the DSL is then executed to ingest feature data into the Feature Store. However, we were building a general-purpose Feature Store and we knew that history has not been kind to DSLs - they have their day in the sun, but general purpose frameworks and languages win out in the long-run. So, we went with the FeatureGroup (DataFrames in Spark or Pandas) as the way to ingest and export features to/from the Hopsworks Feature Store. Since then, other Feature Stores have followed our approach, such as Feast that introduced FeatureSets in 0.3 almost 1 year later and Spark support almost 2 years later).
However, our API still encouraged developers to think of features as existing in a flat namespace. With great customers, such as PaddyPower, we rethought and redesigned our Feature Store API to consider other practical problems we encountered, such as making breaking changes to FeatureGroup schemas without breaking existing feature pipelines, handling feature naming conflicts, and redesigning a minimal single client library (from 2 libraries previously) that can run in either a (Py)Spark or Python environment. That library is now released as our new Feature Store API and is called HSFS.
HSFS provides a DataFrame API to ingest data into the Hopsworks Feature Store. You can also retrieve feature data in a DataFrame, that can either be used directly to train models or materialized to file(s) for later use to train models
The idea of the Feature Store is to have pre-computed features available for both training and serving models. The key functionality required to generate training datasets from reusable features are: feature selection, joins, filters and point in time queries. To enable this functionality, we are introducing a new expressive Query abstraction with HSFS that provides these operations and guarantees reproducible creation of training datasets from features in the Feature Store.
The new joining functionality is heavily inspired by the APIs used by Pandas to merge DataFrames. The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions.
If a data scientist wants to modify a new feature that is not available in the Feature Store, she can write code to compute the new feature (using existing features or external data) and ingest the new feature values into the Feature Store. If the new feature is based solely on existing feature values in the Feature Store, we call it a derived feature. The same HSFS APIs can be used to compute derived features as well as features using external data sources.
# create a query
feature_join = rain_fg.select_all()
.join(temperature_fg.select_all(), on=["date", "location_id"])
.join(location_fg.select_all())
td = fs.create_training_dataset("rain_dataset",
version=1,
label=”weekly_rain”,
data_format=”tfrecords”)
# materialize query in the specified file format
td.save(feature_join)
# use materialized training dataset for training, possibly in a different environment
td = fs.get_training_dataset(“rain_dataset”, version=1)
# get TFRecordDataset to use in a TensorFlow model
dataset = td.tf_data().tf_record_dataset(batch_size=32, num_epochs=100)
# reproduce query for online feature store and drop label for inference
jdbc_querystring = td.get_query(online=True, with_label=False)
When using HSFS to create a training dataset, the features, their order and how the feature groups are joined is also saved as metadata. This metadata is then used at serving time to build a JDBC query that is executed by a client (along with the feature group primary key values) to request a feature vector from the online Feature Store for a specific model.
When we started building the Hopsworks Feature Store we wanted to create a flat namespace for features to make it easier for data scientists to pick the features they wanted. But as soon as our customers started having tens of thousands of features in the feature store, feature naming conflicts became commonplace.
Features like created_on, customer_id, account_id, are often used in many different unrelated Feature Groups. The same user might, for example, have a different account_id for each service the company provides.Imagine an e-commerce use case - most likely you will have multiple `created_on` or `revenue`-named features.
In HSFS we solved this challenge by requiring users to work with the feature group abstraction - users have to specify which features they need from which feature group. The feature groups abstraction also allows the query planner to intelligently identify which feature to use when joining features from different feature groups.
Hopsworks has offered support for Apache Hudi for over one year. Apache Hudi is the key component to make time travel, upserts, and ACID updates possible on feature data. Time travel enables data engineers and data scientists to retrieve a previous snapshot of the feature data for debugging and auditing purposes.
In HSFS, we added support for complex time travel queries, allowing users to create training datasets by retrieving different feature groups at different points in time. The different times are stored as metadata alongside the training datasets. This enables users to easily reproduce the creation of training datasets with historical feature values.
The new SDK improves support for Apache Hudi by integrating directly with the feature groups and joining APIs. Apache Hudi is also now available for the Python APIs. The APIs hide the complexity of dealing with Apache Hudi options. At the same time the high level APIs will allow us to implement support for additional formats such as Delta Lake.
Provenance for feature data is a key new functionality available as part of HSFS. Hopsworks Feature Store implicity tracks dependencies between feature groups, training datasets and models. Provenance allows users to know, at any given moment, what features are the most widely used and also what features are not used anymore and could potentially be removed from the feature store. Provenance also enables users to traverse from a model to the application used to train it, the input training dataset used, and the features and feature group snapshots used to create the training dataset.
In this new release we also allow users to create and attach user-defined labels to feature groups and training datasets. Labels and tags enable users to build a custom metadata catalog for their Feature Store, that (using Elasticsearch) supports low-latency free-text search over potentially thousands of features, descriptions, labels and tags.
While the Hopsworks feature store plays a key role in defining governance around feature data, improving execution speed when it comes to experimenting, building and deploying models, the ultimate customers of the feature stores are the data scientists and machine learning engineers.
To make their interaction with the feature store as smooth as possible, we added support retrieving training dataset as a TensorFlow Dataset format using tf.Data. This allows data scientists to efficiently read training datasets in their TensorFlow code.
Alongside existing support for (Py)Spark, this release also brings supports for feature engineering pipelines built using pure Python programs. With HSFS you are able to upload feature data to the feature store from your SageMaker notebook, KubeFlow, Jupyter notebook, or even your local machine.
This is in addition to the existing capabilities of exploratory data analysis with the feature store from your notebook, and the creation and reading of training datasets.
You can get started immediately by creating a new cluster on https://hopsworks.ai, the library is already pre-installed in your environment so you can get started straight away.
The documentation is available and you can walk through one of the many example notebooks that are available here.