Building a Feature Store around Dataframes and Apache Spark

A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, making it easier to transform and validate the training data that is fed into machine learning systems. Feature Stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain-specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general-purpose Feature Store needs a general-purpose platform for feature engineering, feature selection, and feature transformation.

In this talk, we describe how we built a general-purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how our platform, Hopsworks, integrates with Spark-based platforms such as Databricks. Using the Feature Store from Databricks, we will show how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc.) on a file system of choice (S3, HDFS). We will also show the potential of Koalas for making feature engineering even easier on PySpark. Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that can each run at its own cadence.
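
To make the two-stage factoring concrete, here is a minimal sketch in plain Python. It is not the Hopsworks API and does not use Spark: the class `ToyFeatureStore`, its methods, the feature group names, and the sample values are all hypothetical, and dictionaries stand in for dataframes. It only illustrates the idea that one stage registers features while another stage discovers them, selects them, and derives train/test data.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, float]


class ToyFeatureStore:
    """Hypothetical in-memory stand-in for a feature store (not the Hopsworks API)."""

    def __init__(self) -> None:
        # feature group name -> {entity key -> {feature name -> value}}
        self._groups: Dict[str, Dict[int, Row]] = {}

    def register(self, group: str, rows: Dict[int, Row]) -> None:
        """Feature engineering stage: publish a group of features keyed by entity."""
        self._groups[group] = rows

    def list_features(self) -> List[str]:
        """Discovery: names of all features currently registered."""
        names = set()
        for rows in self._groups.values():
            for row in rows.values():
                names.update(row)
        return sorted(names)

    def select(self, features: List[str]) -> Dict[int, Row]:
        """Join the requested features across groups on the entity key."""
        joined: Dict[int, Row] = {}
        for rows in self._groups.values():
            for key, row in rows.items():
                for name in features:
                    if name in row:
                        joined.setdefault(key, {})[name] = row[name]
        # Inner join: keep only entities that have every requested feature.
        return {k: r for k, r in joined.items() if len(r) == len(features)}


def train_test_split(
    data: Dict[int, Row], test_fraction: float, seed: int
) -> Tuple[Dict[int, Row], Dict[int, Row]]:
    """Shuffle entity keys reproducibly and split them into train and test sets."""
    keys = sorted(data)
    random.Random(seed).shuffle(keys)
    n_test = int(len(keys) * test_fraction)
    test = {k: data[k] for k in keys[:n_test]}
    train = {k: data[k] for k in keys[n_test:]}
    return train, test


# Stage 1 (data engineering): runs whenever new source data arrives.
store = ToyFeatureStore()
store.register(
    "customer_profile",
    {1: {"age": 34.0}, 2: {"age": 51.0}, 3: {"age": 27.0}, 4: {"age": 45.0}},
)
store.register(
    "customer_activity",
    {1: {"n_orders": 5.0}, 2: {"n_orders": 1.0}, 3: {"n_orders": 9.0},
     4: {"n_orders": 3.0}, 5: {"n_orders": 2.0}},
)

# Stage 2 (data science): runs whenever a model needs training data.
selected = store.select(["age", "n_orders"])
train, test = train_test_split(selected, test_fraction=0.25, seed=42)
```

Because the second stage reads only what the store already holds, it can be rerun with a different feature selection or split without recomputing the features, which is the sense in which the two stages run at different cadences.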

Logical Clocks AB are the makers of Hopsworks, a data-intensive AI platform with a Feature Store.

Spark + AI Summit North America 2020
