Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Show me the code; how we linked notebooks to features

>
Product Updates
11/23/2021
>
Jim Dowling

TLDR; This blog introduces a new feature in the Hopsworks UI - feature code preview - the ability to view the notebook used to create a Feature Group or Training Dataset. The new feature works for both notebooks run on Hopsworks and notebooks run on Databricks. This new functionality enables Data Scientists to easily inspect the code used to compute features and training data.

Introduction

A key capability of feature stores is that they enable users to understand their features; where features come from - both the source of their data and the code used to create the features. Since Hopsworks version 1.2, users have been able to see which compute jobs run in the Hopsworks platform create which Feature Groups. However, an increasing number of users are running their feature engineering code on Jupyter notebooks and on external platforms, such as Databricks. To this end, we have introduced new functionality in the Feature Store UI, where you can view the notebook code used to create features or training data directly in the Feature Registry.

This new quick feature code view will help data scientists become more productive when they are in the EDA (exploratory data analysis) phase of their work. They will not have to navigate to a source code repository or Jobs UI to understand how a feature is computed. This new capability is part of our ongoing effort to make the Hopsworks Feature Store the most open and modular feature store on the market. Not only do we support the widest number of platforms for feature engineering (Python, Spark, SQL), but we also support the widest number of data sources, where you can store features computed from data stored in many different data platforms, including data warehouses, databases, object stores, graph databases, message buses, and free-text search databases. You do not have to bring your data for AI to a single data warehouse to be able to use our feature store. Keep the data in place, and centralize the features computed from it.

How you compute features for Hopsworks

Hopsworks supports general-purpose programming languages and frameworks for computing features, including Python/Pandas, Apache Spark (PySpark, Scala/Java, SQL), and Apache Flink. You can work with your existing workflow orchestrator tool (Hopsworks comes with Airflow built-in, but you can use an external Airflow, Dagster, Azure Data Factory, Jenkins and many more).

Hopsworks does not force you to define your features in a domain specific language (DSL). We meet you where you like to work - write your feature pipeline as a program in any Python/Spark/SQL environment. Extracting the feature computation code from a DSL is a straightforward task. However, as we support general purpose compute frameworks for feature computation, we had to solve the more general problem of storing and attaching the code for the notebook to the features in the feature store.

A code snippet in Python is shown below that uses the hsfs API (Hopsworks Feature Store) to write a dataframe to a feature group. First, we read data into a Pandas dataframe from a bucket in s3, then we perform feature engineering on the dataframe (elided for brevity), and the dataframe is finally saved to the feature group - an ACID update on the Hudi table for the offline feature store.

Notebook Snapshots and Time-Travel in Hudi-enabled Feature Groups

The feature group shown in the code above stores its historical feature values in an offline feature store - a Hudi table. Hudi is a file format that extends Parquet to give it ACID properties and time-travel capabilities. Our new feature code preview functionality works as follows. Every time we update a feature group, we snapshot the notebook (Jupyter in Hopsworks or Databricks notebook), and store it alongside the Hudi commit. 

Feature Code Snapshot Capabilities

The ability to easily view the code used to compute a given snapshot for a Feature Group enables new capabilities in Hopsworks:

  • Debugging: if you discover a bug in your features, you can easily check the code used to create those features. For example, if you discovered a particular data ingestion run that polluted a feature group, it is now trivial to inspect the code that performed that ingestion.
  • Reproducibility: you can reproduce the data used for a given update to a feature group by inspecting the program used to make that update along with the source data;
  • Auditing and governance: you can travel back in time to see what code was used to create feature data in a given time period.

The new feature code preview functionality will be available in Hopsworks 2.5, due for release before the end of 2021.

Today we are pleased to announce the release of two new RonDB; 21.10.1, a beta release of RonDB 21.10 with 4 new features and a maintenance release for 21.04.

Mikael Ronström

Head of Data

Read

The first release of RonDB is now available. RonDB 21.04.0 brings new documentation, community support, 3x improved performance, bug fixes and new features.

Mikael Ronström

Head of Data

Read

RonDB enables users to have full control over the assignment of threads to CPUs, how the CPU locking is to be performed and how the thread should be scheduled.

Mikael Ronström

Head of Data

Read

RonDB is a managed key-value store with SQL capabilities. It provides the best low-latency, high throughput, and high availability database available today.

Mikael Ronström

Head of Data

Read

HopsFS-S3: cloud-native distributed hierarchical file system that has the same cost as S3, but has 100X the performance of S3 for file move/rename operations.

Mahmoud Ismail

Software engineer

Read

Hopsworks is the world's first Enterprise Feature Store along with an advanced end-to-end ML platform.

Theofilos Kakantousis

VP of Product

Read

Hopsworks is now available as a managed platform for Amazon Web Services (AWS) and Microsoft Azure with a comprehensive free tier.

Steffen Grohsschmiedt

Head of Cloud

Read

Hopsworks Feature Store API was rebuilt from the ground based on our extensive experience working with enterprise users requirements.

Fabio Buso

VP Engineering

Read

Integrate AWS SageMaker with Hopsworks to manage, discover and use features for creating training datasets and for serving features to operational models.

Fabio Buso

VP Engineering

Read

Get started with Hopsworks.ai to effortlessly launch and manage Hopsworks clusters in any AWS account that also integrates with Databricks and AWS SageMaker.

Steffen Grohsschmiedt

Head of Cloud

Read

The Hopsworks 1.x series has updates to the Feature Store, a new Experiments framework for ML, and connectivity with external systems (Databricks, Sagemaker).

Theofilos Kakantousis

VP of Product

Read

Hopsworks now supports AMD GPUs and ROCm for deep learning, enabling developers to train deep learning models on AMD GPU hardware using TensorFlow.

Robin Andersson

Software engineer

Read

Hopsworks is replacing Horovod with Keras/TensorFlow’s new CollectiveAllReduceStrategy , a part of Keras/TensorFlow Estimator framework.

Robin Andersson

Software engineer

Read

Hopsworks is a data platform that integrates popular platforms for data processing such as Apache Spark, TensorFlow, Hops Hadoop, Kafka, and many others.

Jim Dowling

CEO

Read

Datasets used for deep learning may reach millions of files. The well known image dataset, ImageNet, contains 1m images, and its successor, the Open Images...

Jim Dowling

CEO

Read