Sysbench is a tool to benchmark and test open source databases. We have integrated Sysbench into the RonDB installation. This makes it extremely easy to run benchmarks with RonDB. This blog will describe the use of these benchmarks in RonDB. These benchmarks were executed with 1 cluster connection per MySQL Server. This limited the scalability per MySQL Server to about 12 VCPUs. Since we executed those benchmarks we have increased the number of cluster connections per MySQL Server to 4 providing scalability to at least 32 VCPUs per MySQL Server.
As preparation to run those benchmarks we have created a RonDB cluster using the Hopsworks framework that is currently used to create RonDB clusters. In these tests all MySQL Servers are using the c5.4xlarge VM instances in AWS (16 VCPUs with 32 GB memory). We have a RonDB management server using the t3a.medium VM instance type. We have tested using two different RonDB clusters, both have 2 data nodes. The first test is using the r5.4xlarge instance type (16 VCPUs and 128 GB memory) and the second test uses the r5n.8xlarge (32 VCPUs and 256 GB memory). It was necessary to use r5n class since we needed more than 10 Gbit/second network bandwidth for the OLTP RW test with 32 VCPUs on data nodes.
In the RonDB documentation you can find more details on how to set up your own RonDB cluster in either our managed version (currently supporting AWS) or using our open source shell scripts to set up a cluster (currently supporting GCP and Azure).
The graph below shows the throughput results from the larger data nodes using 8-12 MySQL Server VMs. Now in order to make sense of these numbers we will explain a bit more about Sysbench and how you can tweak Sysbench to serve your purposes for testing RonDB.
The Sysbench OLTP RW benchmark consists of 20 SQL queries. There is a transaction, this means that the transaction starts with a BEGIN statement and it ends with a COMMIT statement. After the BEGIN statement follows 10 SELECT statements that selects one row using the primary key of the table. Next follows 4 SELECT queries that select 100 rows within a range and either uses SELECT DISTINCT, SELECT … ORDER BY, SELECT or SELECT sum(..). Finally there is one INSERT, one DELETE and 2 UPDATE queries.
In Pseudo code thus:
Repeat 10 times: SELECT col(s) from TAB where PK=pk
SELECT col(s) from TAB where key >= start AND key < (start + 100)
SELECT DISTINCT col(s) from TAB where key >= start AND key < (start + 100)
SELECT col(s) from TAB where key >= start AND key < (start + 100) ORDER BY key
SELECT SUM(col) from TAB where key >= start AND key < (start + 100)
INSERT INTO TAB values (....)
UPDATE TAB SET col=val WHERE PK=pk
UPDATE TAB SET col=val WHERE key=key
DELETE FROM TAB WHERE PK=pk
This is the standard OLTP RW benchmark.
Now I will describe some changes that the Sysbench installation in RonDB can handle. To understand this we will start by showing the default configuration file for Sysbench.
# Software definition
# Storage definition (empty here)
# MySQL Server definition
# NDB node definitions (empty here)
# Benchmark definition
In this configuration file we provide the pointer to the RonDB binaries, we provide the type of benchmark we want to execute, we provide the password to the MySQL Servers, we provide the number of threads to execute in each step of the benchmark. There is also a list of IP addresses to the MySQL Servers in the cluster and finally we provide the number of instances of Sysbench we want to execute.
This configuration file is created automatically by the managed version of RonDB. This configuration file is available in the API nodes you created when you created the cluster. They are also available in the MySQL Server VMs if you want to test running with a single MySQL Server colocated with the application.
The default setup will run the standard Sysbench OLTP RW benchmark with one sysbench instance per MySQL Server. To execute this benchmark the following steps are done:
Log in to the API node VM where you want to run the benchmark from. The username is ubuntu (in AWS). Thus log in using e.g. ssh ubuntu@IP_address. The IP address is the external IP address that you will find in AWS where your VM instances are listed.
After successfully being logged in you need to log into the mysql user using the command:
sudo su - mysql
Move to the right directory
Execute the benchmark
bench_run.sh --default-directory /home/mysql/benchmarks/sysbench_multi
As you will discover there is also a sysbench_single, dbt2_single, dbt2_multi directory. These are setup for different benchmarks that we will describe in future papers. sysbench_single is the same as sysbench_multi but with only a single MySQL Server. This will exist also on MySQL Server VMs if you want to benchmark from those. Executing a benchmark from the sysbench machine increases latency since it represents a 3-tiered setup whereas executing sysbench in the MySQL Server represents a 2-tiered setup and thus the latency is lower.
If you want to study the benchmark in real-time repeat Step 1, 2 and 3 above and then perform the following commands:
tail -f oltp_rw_0_0.res
This will display the output from the first sysbench instance that will provide latency numbers and throughput of one of the sysbench instances.
When the benchmark has completed the total throughput is found in the file:
The configuration for sysbench_multi is found in:
Thus if you want to modify the benchmark you can edit this file.
In order to modify this benchmark, first you can decide on how many SELECT statements to retrieve using the primary key that should be issued. The default is 10. To change this add the following line in autobench.conf:
This will change such that instead 5 primary key SELECTs will be issued for each transaction.
Next, you can decide that you want those primary key SELECTs to retrieve a batch of primary keys. In this case the SELECT will use IN (key1, key2,,, keyN) in the WHERE clause. To use this set the number of keys to retrieve per statement in SB_USE_IN_STATEMENT. Thus to set this to 100 add the following line to autobench.conf.
This means that if SB_POINT_SELECTS is set to 5 and SB_USE_IN_STATEMENT is set to 100 there will be 500 key lookups performed per Sysbench OLTP transaction.
Next, it is possible to set the number of range scan SELECTs to perform per transaction. So to e.g. disable all range scans we can add the following lines to autobench.conf.
Now, it is also possible to modify the range scans. I mentioned that the range scans retrieves 100 rows. The number 100 is changeable through the configuration parameter SB_RANGE_SIZE.
The default behaviour is to retrieve all 100 rows and send them back to the application. Thus no filtering. We also have an option to perform filtering in those range scans. In this case only 1 row will be returned, but we will still scan the number of rows specified in SB_RANGE_SIZE. This feature of Sysbench is activated through adding the following line to autobench.conf:
Finally it is possible to remove the use of INSERT, DELETE and UPDATEs. This is done by changing the configuration parameter SYSBENCH_TEST from oltp_rw to oltp_ro.
There are many more ways to change the configuration of how to run Sysbench, but these settings are enough for this paper. For more details see the documentation of dbt2-0.37.50, also see the Sysbench tree
In our benchmarking reported in this paper we used 2 different configurations. Later, we will report more variants of Sysbench testing as well as other benchmark variants.
The first is the standard Sysbench OLTP RW configuration. The second is the standard benchmark but adding SB_USE_FILTER=”yes”. This was added since the standard benchmark becomes limited by the network bandwidth using r5.8xlarge instances for the data node. This instance type is limited to 10G Ethernet and it needs almost 20 Gb/s in networking capacity with the performance that RonDB delivers. This bandwidth is achievable using the r5n instances.
Each test of Sysbench creates the tables and fills them with data. To have a reasonable execution time of the benchmark each table will be filled with 1M rows. Each sysbench instance will use its own table. It is possible to set the number of rows per table, it is also possible to use multiple tables per sysbench instance. Here we have used the default settings.
The test runs are executed for a fairly short time to be able to test a large variety of test cases. This means that it is expected that results are a bit better than expected. To see how results are affected by running for a long time we also ran a few select tests where we ran a single benchmark for more than 1 hour. The results are in this case around 10% lower than the numbers of shorter runs. This is mainly due to variance of the throughput that is introduced by the execution of checkpoints in RonDB. Checkpoints consume around 5-10% of the CPU capacity in the RonDB data nodes.
In all tests set up here we have started the RonDB cluster using the Hopsworks infrastructure. In all tests we have used c5.4xlarge as the VM instance type for MySQL Servers. This VM has 16 VCPUs and 32 GB of memory. This means a VM with more or less 8 Intel Xeon CPU cores. In all tests there are 2 RonDB data nodes, we have tested with 2 types of VM instances here, the first is the r5.4xlarge which has 16 VCPUs with 128 GB of memory. The second is the r5n.8xlarge which has 32 VCPUs and 256 GB of memory. In the Standard Sysbench OLTP RW test the network became a bottleneck when using r5.8xlarge. These VMs can use up to 10 Gb/sec, but in reality we could see that some instances could not go beyond 7 Gb/sec, when switching to r5n.8xlarge instead this jumped up to 13Gb/sec immediately, so clearly this bottleneck was due to the AWS infrastructure.
To ensure that the network bottleneck was removed we switched to using r5n.8xlarge instances instead for those benchmarks. These instances are the same as r5.8xlarge except that they can use up to 25 Gb/sec in network bandwidth instead of 10Gb/sec.
The first test we present here is the standard OLTP RW benchmark. When we run this benchmark most of the CPU consumption happens in the MySQL Servers. Each MySQL Server is capable of processing about 4000 TPS for this benchmark. The MySQL Server can process a bit more if the responsiveness of the data node is better, this is likely to be caused by the CPU caches being hotter in that case when the response comes back to the MySQL Server. Two 16 VCPU data nodes can in this case handle the load from 4 MySQL Servers, adding a 5th can increase the performance slightly, but not much. We compared this to 2 data nodes using 32 VCPUs and in principle the load these data nodes could handle was doubled.
The response time was very similar in both cases, at extreme loads the larger data nodes had more latency increases, most likely due to the fact that we got much closer to the limits of what the network could handle.
The top number here was 34870 at 64 threads from 10 MySQL Servers. In this case 95% of the transactions had a latency that was less than 19.7 ms, this means that the time for each SQL query was below 1 millisecond. This meant that almost 700k SQL queries per second were executed. These queries reported back to the application 14.5M rows per second for the larger data nodes, most of them coming from the 4 range scan queries in the benchmark. Each of those rows are a bit larger than 100 bytes, thus around 2 GByte per second of application data is transferred to the application (about 25% of this is aggregated in the MySQL when using the SUM range scan).
Given that Sysbench OLTP RW is to a great extent a networking test we also wanted to perform a test that performed a bit more processing, but reporting back a smaller amount of rows. We achieved this by setting SB_USE_FILTER=”yes” in the benchmark configuration file. This means that instead of each range scan SELECT reporting back 100 rows, it will read 100 rows and filter out 99 of them and report only 1 of the 100 rows. This will decrease the amount of rows to process down to about 1M rows per second. Thus this test is a better evaluator of the CPU efficiency of RonDB whereas the standard Sysbench OLTP RW is a good evaluator of RonDBs ability to ship tons of rows between the application and the database engine.
At first, we wanted to see the effect the number of MySQL servers had on the throughput in this benchmark. We see the results of this in the image above. We see that there is an increase in throughput going from 8 to 12 MySQL Servers. However the additional effect of each added MySQL Server is diminishing. There is very little to gain going beyond 10 MySQL Servers. The optimal use of computing resources is most likely achieved around 8-9 MySQL Servers.
Adding additional MySQL servers also has an impact on the variability of the latency. So probably the overall best fit here is to use about 2x more CPU resources on the MySQL Servers compared to the CPU resources in the RonDB data nodes. This rule is based on this benchmark and isn’t necessarily true for another use case.
The results with the smaller data nodes, r5.4xlarge is the red line that used 5 MySQL Servers in the test.
The rule definitely changes when using the key-value store APIs that RonDB provides. These are at least 100% more efficient compared to using SQL.
A key-value store needs to be a LATS database (low Latency, high Availability, high Throughput, Scalable storage). In this paper we have focused on showing Throughput and Latency. Above is the graph showing how latency is affected by the number of threads in the MySQL Server.
Many applications have strict requirements on the maximum latency of transactions. So for example if the application requires response time to be smaller than 20 ms than we can see in the graph that we can use around 60 threads towards each MySQL Server. At this number of threads r5.4xlarge delivers 22500 TPS (450k QPS) and r5n.8xlarge delivers twice that number, 45000 TPS (900k QPS).
The base latency in an unloaded cluster is a bit below 6 milliseconds. This number is a bit variable based on where exactly the VMs are located that gets started for you. Most of this latency is spent in latency on the networks. Each network jump in AWS has been reported to be around 40-50 microseconds and one transaction performs around 100 of those network jumps in sequence. Thus almost two-thirds of the base latency comes from the latency in getting messages across. At higher loads the queueing waiting for the message to be executed becomes dominating. Benchmarks where everything executes on a single computer has base latency around 2 millisecond per Sysbench transaction which confirms the above calculations.
TLDR: Hopsworks is the Data-Intensive AI platform with a Feature Store for building complete end-to-end machine learning pipelines. This tutorial will show an overview of how to work with Jupyter on the platform and train a state-of-the-art ML model using the fastai python library. Hopsworks provides Jupyter as a service in the platform, including kernels for writing PySpark/Spark and pure Python code. With an intuitive service to install Python libraries covered in a previous blog and access to a Jupyter notebook, getting started with your favourite ML library requires little effort in Hopsworks.
Jupyter provides an integrated development environment (IDE) allowing users to seamlessly mix code with visualization and comments, making it not just handy for prototyping, but also for visualization and educational purposes. In recent years, it has become a favourite tool employed by the many, used for data wrangling, data mining, statistical modeling, visualization, machine learning and the notebooks themselves scheduled to run in production at tech leaders such as Netflix.
The Hopsworks platform ships with Jupyter as one of the integrated components. Having already been pre installed and configured to work with Spark and PySpark kernels, in addition to the Python kernel, getting started with writing your notebooks and scheduling them in production on a Hopsworks cluster is straightforward. The Hopsworks installation also includes a Miniconda environment with the most popular libraries you can find in a data scientists toolkit, such as TensorFlow, PyTorch and scikit-learn.
In this tutorial we will describe how to work with a Jupyter notebook in the Hopsworks platform. As an example, we will demonstrate how to install fastai, a library that provides high-level components for building and training ML models to get state-of-the-art deep learning results, clone a set of notebooks from a git repository and train a model on a P100 GPU.
To follow this tutorial you should have a Hopsworks instance running on https://hopsworks.ai. You can register for free, without providing credit card information, and receive USD 4000 worth of free credits to get started. The only thing you need to do is to connect your cloud account.
The first step in the tutorial is to install the fastai library. To get started, navigate to the Python service to install the fastai library from PyPi as shown in the example below. There are many different approaches to installing the library but in this instance we install the latest version of the fastai and nbdev package from PyPi, required to run the first notebook in the fastai course.
In the Jupyter service page there are three different modes that can be configured.
Firstly, there is a Python tab, in which configuration for the Python kernel, such as Memory/Cores/GPUs is set and optionally a git repository can also be configured that should be cloned when Jupyter starts up. This is the kernel that we are going to use in this tutorial.
Secondly, in the Experiments tab the PySpark kernel is configured. If you want to enable all the features in the plattform regarding, experiment tracking, hyperparameter optimization, distributed training. See HopsML for more information on the Machine Learning pipeline.
Thirdly, for general purpose notebooks, select the Spark tab and run with Static or Dynamic Spark Executors on Spark or PySpark.
The image below shows the configuration options set for the Python kernel. As working with larger ML models can be memory intensive make sure you are configuring the Memory for the kernel to be at least 8GB, then set GPUs to 1 to allocate a GPU that should be accessible for the kernel and set the git configuration to clone the fastai git repository https://github.com/fastai/fastai.git to get access to the notebooks.
Once the configuration has been entered for the Python kernel, press the button on the top that says JupyterLab to start the Notebook Server. Keep in mind that it may take some time as resources need to be allocated for the Notebook Server and to clone the git repository. The image below demonstrates the process of starting Jupyter.
The Jupyter Notebook Server will now have been allocated a GPU which you can use in the Python kernel. To check the type and specifications of the GPU, open a new terminal inside Jupyter and run nvidia-smi. We can see that in this instance we have access to a P100 NVIDIA GPU.
Now you’re all set to start following the course material that fastai provides. To make sure the GPU is being utilized you can leave a terminal window open and run nvidia-smi -l 1, which will print out the GPU utilization every second while you are running the training in the notebook.
In the example below, the first notebook lesson1-pets.ipynb in the fastai course is executed.
Hopsworks is available both on AWS and Azure as a managed platform. Visit hopsworks.ai to try it out.
TLDR; A new class of hierarchical distributed file system with scaleout metadata has taken over at Google, Facebook, and Microsoft that provides a single centralized file system that manages the data for an entire data center, scaling to Exabytes in size. The common architectural feature of these systems is scaleout metadata, so we call them scaleout metadata file systems. Scaleout metadata file systems belie the myth that hierarchical distributed file systems do not scale, so you have to redesign your applications to work with object stores, and their weaker semantics. We have built a scaleout metadata file system, HopsFS, that is open-source, but its primary use case is not Exabyte storage, rather customizable consistent metadata for the Hopsworks Feature Store. Scaleout metadata is also the key technology behind Snowflake, but here we stick to file systems.
Google, Microsoft, and Facebook have been pushing out the state-of-the-art in scalable systems research in the last 15 years. Google has presented systems like MapReduce, GFS, Borg, and Spanner. Microsoft introduced CosmosDB, Azure Blob Storage and federated YARN. Facebook has provided Hive, Haystack, and F4 systems. All of these companies have huge amounts of data (Exabytes) under management, and need to efficiently, securely, and durably store that data in data centers. So, why not unify all storage systems within a single data center to more efficiently manage all of its data? That’s what Google and Facebook have done with Colossus and Tectonic, respectively. The other two scaleout metadata file systems covered here, ADLSv2 and HopsFS, were motivated by similar scalability challenges but, although they could be, they are typically not deployed as data center scale file systems, just as scalable file systems for analytics and machine learning.
First generation hierarchical distributed file systems (like HDFS) were not scalable enough in the cloud, motivating the move to object stores (like S3) as the cloud-native storage service of choice. However, the move to object stores is not without costs. Many applications need to be rewritten as the stronger POSIX-like behaviour of hierarchical file systems (atomic move/rename, consistent read-after-writes) has been replaced by weakened guarantees in object stores. In particular, data analytics frameworks traditionally rely on atomic rename to provide atomic update guarantees for updating columnar data stores. The lack of atomic rename in S3 has been one of the motivations for the introduction of new columnar store frameworks for analytics and ML, such as Delta Lake, Apache Hudi, and Apache Iceberg that provide ACID guarantees for updating tables over object stores. These frameworks add metadata to files in the object store to provide the ACID guarantees, but their performance lags behind systems built on mutable scaleout metadata, underpinning columnar data stores such as Snowflake and BigQuery.
Hierarchical file systems typically provide well-defined behaviour (a POSIX API) for how a client can securely create, read, write, modify, delete, organize, and find files. The data in such file systems is stored in files as blocks or extents. A file is divided up into blocks, and distributed file systems spread and replicate these blocks over many servers for improved performance (you can read many blocks in parallel from different block servers) and high availability (failure of a block server does not cause the file system to go down, as replicas of that block are still available on other block servers).
However, the data about what files, directories, blocks, and file system permissions are in the system have historically been stored in a single server called the metaserver or namenode. We call this data about the file system objects metadata. In file systems like HDFS, the namenode stores its metadata in-memory to improve both latency and throughput in the number of metadata operations it can support per second. Example metadata operations are: create a directory, move or rename a file or directory, change file permissions or ownership. Operations on files and some operations on directories (such as `rm -rf`) require both updates to metadata and to the blocks stored on the block servers.
As the size of data under management by distributed file systems increased, it was quickly discovered that metadata servers became a bottleneck. For example, HDFS could scale to, at a push, a Petabyte, but not handle more than 100K reads/sec and only a few thousand writes/sec.
It has long been desired to re-architect distributed file systems to shard their metadata across many servers to enable them to support (1) larger volumes of metadata and (2) more operations/second. But it is a very hard problem. Read here about the contortions Uber applies to get its HDFS’ namenode to scale instead of re-designing a scaleout metadata layer from scratch.
When sharding the state of the metadata server over many servers, you need to make decisions about how to do it. Google used its existing BigTable key-value store to store Colossus’ metadata. Facebook, similarly, chose the ZippyDB key-value store for Tectonic. Microsoft built their own Replicated State Library - Hekaton Ring Service (RSL-HK) to scale-out ADLS’ metadata. The RSL-HK ring architecture combines Paxos-based metadata with Hekaton (in-memory engine from SQL Server). HopsFS used NDBCluster (now RonDB) to scale out its metadata.
The capabilities of these underlying storage engines are reflected in the semantics provided by the higher level file systems. For example, Tectonic and (probably) Colossus do not support atomic move of files from any directory to any other directory. Their key-value stores do not support agreement protocols across shards (only within a shard). So, at the file system level, you introduce an abstraction like a file system volume (Tectonic calls them tenants), and users then know they can perform atomic rename/move within that volume, but not across volumes. Google solves this problem at a higher layer for structured data with Spanner by implementing two-phase commit transactions to ensure consistency across shards. In contrast, RSL-HK Ring by Microsoft and RonDB by Logical Clocks support cross-shard transactions that enable both ADLSv2 and HopsFS to support atomic rename/move between any two paths in the file system.
To put this in database terms, the consistency models provided by the scaleout metadata file systems are tightly coupled to the capabilities provided by the underlying metadata store. If the store does not support cross-partition transactions - consistent operations across multiple shards, you will not get strongly consistent cross-partition file system operations. For example, if the metadata store is a key-value store, where each shard typically maintains strongly consistent key-value data using Paxos. But Paxos do not compose - you cannot run Paxos between two shards that themselves maintain consistency using Paxos. In contrast, RonDB supports 2-phase commit (2PC) across shards, enabling strongly consistent metadata operations both within shards and across shards.
Once a scaleout metadata storage layer is in place, stateless services can be used to provide access control and implement background maintenance tasks like maintaining the durability and availability of data, disk space balancing, and repairing blocks.
We can see that Hadoop File System APIs are still popular, as they model the contents of a filesystem as a set of paths that are either directories, symbolic links, or files, but address the challenge of scalability by restricting the POSIX-like semantics with append-only writers (there is no support for writing at random offsets in files).
With a scaleout metadata file system, you can have many more concurrent clients, leading to the well-known problem of hotspots - overloaded read/writes that are handled by a single shard. For example, Tectonic, ADLS, and HopsFS all ensure that objects (files/directories) in a directory are co-located in the same shard for efficient low latency directory listing operations. However, if the directory contains millions of files, such an operation can overload the threads responsible for handling operations on that shard. HopsFS and Tectonic randomly spread independent directories across shards to prevent hotspots, while ADLS supports range partitioning. Another well-known technique from object-stores, like S3, is used by ADLS - paged enumeration of directories. This requires clients to perform many iterative operations to list all objects in a large directory, but enables client quotas to kick in throttle and clients before they overload a shard.
Blocks are a logical unit of storage that hides the complexity of raw data storage and durability from the upper layers of the filesystem. In earlier generations of distributed file systems, such as HDFS, full replicas of blocks were stored at different data nodes to ensure high availability of file blocks. However, object stores and scaleout metadata file systems have eschewed full replicas and instead ensure high availability of file blocks using Reed-Solomon (RS) Coding. RS-encoded blocks provide higher availability guarantees and lower storage overhead, but with the disadvantage of more CPU and network bandwidth required to recover lost blocks. Given the continued growth in network bandwidth and available CPU cycles, this tradeoff is favorable.
There is a general trend towards smaller blocks, enabling faster recovery of failed blocks and faster availability of blocks to readers, but the cost is the need for more metadata storage capacity and higher available throughput in ops/sec at the metadata service.
Both Colossus and Tectonic provide rich clients that can customize the types of blocks and RS coding needs, depending on the workload needed by the client. For example, blob storage requires frequent appends and is handled differently from writing tabular data. Although neither Tectonic or Colossus discussed the block sizes they support, it is safe to assume that they support blocks all the way down to a few MBs in size. ADLSv2 stores its block data in Azure Blob Storage (ABS). HopsFS, the managed service on Hopsworks, also stores its blocks as objects in object storage (S3 on AWS and ABS on Azure). On premises, HopsFS stores its blocks replicated as fixed-size files replicated across data nodes.
When your ambition is to store data for the entire data center, you need to support many different storage technologies with different cost/storage trade-offs. As such, Colossus, HopsFS, ADLSv2, and Tectonic all support storing data in tiers: magnetic disks, SSDs, NVMe, in-memory. Among these systems, HopsFS has unique support for storing small files in the scaleout metadata layer for higher performance operations on small files.
HopsFS takes a different approach to using scaleout metadata. Instead of using it, primarily, to build exascale file systems, it provides a principled architecture for easily extending metadata for files and directories. In particular, this is useful in the domain of machine learning where we have both artifacts (feature data, training data, programs, models, log files) that are typically stored as files and metadata (experiments, hyperparameters, tags, metrics, etc ) that are stored in a metastore (often a relational database). HopsFS unifies the artifact store and metastore, and even enables polyglot storage and querying of metadata in both RonDB (SQL) and Elasticsearch (free-text search). This simplifies operations and provides new free-text search capabilities compared to existing ML metastores (TFX, MLFlow). The same approach enabled us to be the first to release an open-source Feature Store for ML based on Hopsworks. When building our Feature Store, instead of needing to build a separate artifact store (file system) and metastore (database) and write complex protocols to ensure the consistency of both stores, we had a single consistent storage system, where artifacts can be easily extended with consistent metadata that can be queried using free-text search. Features can be annotated with statistics and tags in metadata, training data
Some important requirements for extensible file system metadata are that it:
Even though we first heard about Colossus’ architecture in 2009 and its name in 2012, Google has been surprisingly secretive about the lowest layer of their scalable storage and compute architecture. However, after the release of Tectonic (coincidence?) in early 2021, Google released more details on Colossus in May 2021.
Colossus’ metadata storage service is BigTable, which does not support cross-shard transactions. We assume this means that Colossus lacks atomic rename, a hole that is filled for tabular data (at least) by Spanner, which supports cross-shard transactions.
In Colossus, file system clients connect to curators to perform metadata operations, who, in turn, talk to BigTable. Custodians perform file system maintenance operations, and “D” services provide block storage services, where clients read/write blocks directly from/to “D” servers.
Different clients of Colossus can store their data on different volumes (metadata shards). Atomic rename is possible within a volume, but not across volumes.
Tectonic was first announced as a file system at USENIX Fast 2021, and it unifies Facebook’s previous storage services (federated HDFS, Haystack, and others) to provide a data-center scale file system.
Similar to Colossus, Tectonic stores its metadata in a key-value store, but in this case in ZippyDB. As ZippyDB lacks cross-partition transactions, cross-namespace file system operations are not supported. That is, you cannot atomically move a file from one volume (metadata shard) to another. Often, such operations are not needed, as all the data for a given service can fit in a single namespace, and there are no file system operations between different applications. There are separate stateless services to manage the name space, blocks, files, and file system maintenance operations.
Azure Data Lake Storage (ADLS) was first announced at Sigmod 2017 and it supports Hadoop distributed file system (HDFS) and Cosmos APIs. It has since been redesigned as Azure Data Lake Gen 2 (ADLSv2) that provides multi-protocol support to the same data using the Hadoop File System API, the Azure Data Lake Storage API and the Azure Blob storage API. Unlike Colossus and Tectonic, it is available for use as a service - but only on Azure.
The most recent information about ADLS’ architecture is the original paper describing ADLS from 2017 - no architecture has been published yet for ADLSv2. However, ADLS used RSL-HK to store metadata and it has a key-value store (ring) with shards using state machine replication (Paxos) and with transactions across shards, al in an in-memory engine (“It implements a novel combination of Paxos and a new transactional in-memory block data management design.”).
HopsFS was first announced at USENIX Fast 2017 and provides a HDFS API. HopsFS is a rewrite of HDFS and it supports multiple stateless namenode (metadata servers), where the leader performs file system maintenance operations, and a pluggable metadata storage layer.
HopsFS provides a DAL API to support different metadata storage engines. Currently the default engine for HopsFS is RonDB (a fork of NDB Cluster, the storage engine for MySQL Cluster), a scalable key-value store with SQL capabilities. RonDB can scale to handle hundreds of millions of transactional reads per second and 10s of millions of transactional writes per second and it provides both a native key-value API and a SQL API via a MySQL Server. RonDB also provides a CDC (change-data-capture) API to allow us to automatically replicate changes in metadata to Elasticsearch, providing a free-text search API to HopsFS’ metadata (including its extended metadata). Metadata can be queried using any of the 3 APIs: the native key-value API for RoNDB, the SQL API, or using free-text search in Elasticsearch.
HopsFS scales the Namespace Layer with RonDB and Stateless Namenodes, while the block layer is cloud object storage.
The journey from a stronger POSIX-like file system to a weaker object storage paradigm and back again has parallels in the journey that databases have made in recent years. Databases made the transition from strongly consistent single-host systems (relational databases) to highly available (HA), eventually consistent distributed systems (NoSQL systems) to handle the massive increases in data managed by databases. However, NoSQL is just too hard for developers, and databases are returning to strongly consistent (but now scalable) NewSQL systems, with databases such as Spanner, CockroachDB, SingleSQL, and NDB Cluster.
The scaleout metadata file systems, introduce here, show that distributed hierarchical file systems are completing a similar journey, going from strongly consistent POSIX-compliant file systems to object stores (with their weaker consistency models), and back to distributed hierarchical file systems that are have solved the scalability problem by redesigning the file system around a mutable, scaleout metadata service.
TLDR; Maggy is an open-source framework that simplifies writing and maintaining distributed machine learning programs. By encapsulating your training logic in a function, the same code can be run unchanged with Python on your laptop or distributed using PySpark for hyperparameter tuning, data-parallel training, or model-parallel training. With the arrival of GPU support in Spark 3.0, PySpark can be now used to orchestrate distributed deep learning applications in TensorFlow and PySpark. We are pleased to announce we have now added support for Maggy on Databricks, so training machine learning models with many workers should be as easy as running Python programs on your laptop.
Machine Learning is going to be distributed, as Andrew Ng calls it, Data-centric AI. Data volumes are constantly increasing, and models are known to improve their prediction accuracy predictably - so companies can often know what return they will get in terms of more accurate models by simply investing in collecting more training data. One of the main challenges for developers is to switch from programming for a single-machine to programming for a distributed cluster.
That's why we developed Maggy (maggy.ai), an open source Python library that introduces a new unified framework for writing core ML training logic as oblivious training functions. By encapsulating your training logic in a Python function (we call it an oblivious training function), the same code can be run unchanged in Python or PySpark. You don’t need to rewrite your code to do hyperparameter tuning, data-parallel training, or model-parallel training.
Maggy can be extremely useful in the case you want to build a distributed ML solution but you have no prior experience of distributed computing. It is enough to just refactor your training code to wrap it inside a function (this is good development practice, in case you didn’t know) and let Maggy do the distribution magic for you.
Furthermore, it is not important whether you use Tensorflow or PyTorch as Maggy supports both of them. Scikit-Learn and XGBoost are also supported, and are used, in particular, for parallelizing hyperparameter tuning over many workers.
In the following example, we use Maggy to train a Neural Network on the Iris dataset.
The first thing we have to do is to create a cluster on your Databricks workspace. Maggy has been tested on the Databricks Runtime version 7.4 ML. Choose the number of workers you need and you are good to go.
If you don't know how to create a cluster, please follow the tutorial on this page. If it is your first time on Databricks, make sure to get familiar with the platform and prices.
Once you created the cluster, you have to install Maggy. In order to do this, just navigate to your cluster and click Libraries, then click on Install New and PyPi, write 'maggy' as Package and Install.
Once you have installed Maggy, do the same thing with TensorFlow version 2.4 or higher. For example, on the Package field you can write 'tensorflow==2.4'.
Now we can open the Databricks notebook and write our first Maggy program.
In order to use Maggy on your workflow, we need to do the following:
We are now going to present an example on how to implement this workflow. We are going to distribute the training of a (very) simple machine learning model on the iris dataset.
You can use this notebook as reference.
First of all, we have to wrap our ML model in a class, the class has to be an implementation of tf.keras.Model.
It's important to note that we are not instantiating the class, we need to pass the class to Maggy, not an instance of it.
In this example we define a class called NeuralNetwork. It is a superclass of tf.keras.Sequential, an implementation of tf.keras.Model. Make sure that your class implements tf.keras.Model. Finally, we define our ML model in the init function.
First, we need to upload the Iris dataset on Databricks, on your databricks platform, click on Data on the sidebar and then Create Table, and upload Iris.csv that can be downloaded here.
In order to use Maggy, we have to pass the training set, test set and optionally a function for data processing to the configuration file.
The training and test sets can be:
We now wrap the code containing the logics of your experiment in a function.
For HPO, we have to define a function that has the HPs to be optimized as parameters. Inside the function we simply put the training logic as we were training our model in a single machine using Tensorflow. Maggy will run this function multiple times using different parameters for you, as we will see in section 3a.
The training function provides the instruction to run the training and evaluation of your model, given the data passed in the configuration. You just need to wrap the instructions you implemented and eventual hyperparameters (for example the values to pass in the model constructor). The training function has to return a value or a list of values that corresponds to the evaluation results.
It's important to note that in the HPO function we did not pass the model as a parameter while we did that in our oblivious training function. This is because, when using Maggy for distributed training, the library has to patch some functions of the model class.
The next step is to create a configuration instance for Maggy. Since in this example we are using Maggy for hyperparameter optimization and distributed training using TensorFlow, we will use OptimizationConfig and TfDistributedConfig.
OptimizationConfig contains the information about the hyperparameter optimization.
We need to define a Searchspace class that contains the hyperparameters we want to optimize. In this example we want to search for the optimal number of layers of the neural network from 2 to 8 layers.
OptimizationConfig contains the following parameters:
Our HPO function and configuration class are now ready, so we can go on and run the HPO experiment. In order to do that, we run the lagom function, passing our training function and the configuration object we instantiated during the last step.
If you are wondering what lagom means, Lagom is a swedish word representing some cultural aspects of balance and equilibrium, in english could be translated as "just the right amount" or "less is more".
This function will print the HPO summary. As you can see, there are several values returned, we are most interested in the 'best_config' dictionary, it contains the parameters for which the model performed the best.
Now it's time to run the final step of our ML program. Let's initialize the configuration class for the distributed training. First, we need to define our hyperparameters, we want to take the best hyperparameters from the HPO.
TfDistributedConfig class has the following parameters:
hparams: the model and dataset parameters. In this case we also need to provide the 'train_batch_size' and the 'test_batch_size', these values represent the subset sizes of the sharded dataset. It's value is usually the dataset_size/number_workers but can change depending on your needs.
Finally, let's run the distributed training using the lagom function.
Maggy will run the distributed training using the number of workers and resources available to the cluster defined. Finally, it will prompt the test results. The training log can be found in the Spark UI on Databricks.
Maggy is open-source and everyone can contribute to the project. Feel free to experiment with the library and contact us for any questions or issues. You can reach out to us via GitHub or the hopsworks community. You can also give us a star on GitHub to let us know you appreciate our work.
In this article we saw how to train a ML model in a distributed fashion without reformatting our code, thanks to Maggy. Maggy is available on hopsworks.ai. If you want to know more about how to develop your ML projects faster, you may want to check out this article.
TLDR: Hopsworks is the Data-Intensive AI platform with a Feature Store for building complete end-to-end machine learning pipelines. This tutorial will show an overview of how to install and manage Python libraries in the platform. Hopsworks provides a Python environment per project that is shared among all the users in the project. All common installation alternatives are supported such as using the pip and conda package managers, in addition to libraries packaged in a .whl or .egg file and those that reside on a git repository. Furthermore, the service includes automatic conflict/dependency detection, upgrade notifications for platform libraries and easily accessible installations logs.
The Python ecosystem is huge and seemingly ever growing. In a similar fashion, the number of ways to install your favorite package also seems to increase. As such, a data analysis platform needs to support a great variety of these options.
The Hopsworks installation ships with a Miniconda environment that comes preinstalled with the most popular libraries you can find in a data scientists toolkit, including TensorFlow, PyTorch and scikit-learn. The environment may be managed using the Hopsworks Python service to install or libraries which may then be used in Jupyter or the Jobs service in the platform.
In this blog post we will describe how to install a python library, wherever it may reside, in the Hopsworks platform. As an example, this tutorial will demonstrate how to install the Hopsworks Feature Store client library, called hsfs. The library is Apache V2 licensed, available on github and published on PyPi.
To follow this tutorial you should have an Hopsworks instance running on https://hopsworks.ai. You can register for free, without providing credit card information, and receive USD 4000 worth of free credits to get started. The only thing you need to do is to connect your cloud account.
The first step to get started with the platform and install libraries in the python environment is to create a project and then navigate to the Python service.
When a project is created, the python environment is also initialized for the project. An overview of all the libraries and their versions along with the package manager that was used to install them are listed under the Manage Environment tab.
The simplest alternative to install the hsfs library is to enter the name as is done below and click Install as is shown in the example. The installation itself can then be tracked under the Ongoing Operations tab and when the installation is finished the library appears under the Manage Environment tab.
If hsfs would also have been available on an Anaconda repo, which is currently not the case, we would need to specify the channel where it resides. Searching for libraries on Anaconda repos is accessed by setting Conda as the package location.
If a versioned installation is desired to get the hsfs version compatible with a certain Hopsworks installation, the search functionality shows all the versions that have been published to PyPi in a dropdown. Simply pick a version and press Install. The example found below demonstrates this.
Many popular Python libraries have a great variety of builds for different platforms, architectures and with different build flags enabled. As such it is also important to support directly installing a distribution. To install hsfs as a wheel requires that the .whl file was previously uploaded to a Dataset in the project. After that we need to select the Upload tab, which means that the library we want to install is contained in an uploaded file. Then click the Browse button to navigate in the file selector to the distribution file and click Install as the following example demonstrates.
Installing libraries one by one can be tedious and error prone, in that case it may be easier to use a requirements.txt file to define all the dependencies. This makes it easier to move your existing Python environment to Hopsworks, instead of having to install each library one by one.
The file may for example look like this.
This dependency list defines that version 2.2.7 of hsfs should be installed, along with version 2.9.0 of imageio and the latest available release for mahotas. The file needs to be uploaded to a dataset as in the previous example and then selected in the UI.
A great deal of python libraries are hosted on git repositories, this makes it especially handy to install a library during the development phase from a git repository. The source code for the hsfs package is as previously mentioned, hosted on a public github repository. Which means we only need to supply the URL to the repository and some optional query parameters. The subdirectory=python query parameter indicates that the setup.py file, which is needed to install the package, is in a subdirectory in the repository called python.
After each installation or uninstall of a library, the environment is analyzed to detect libraries that may not work properly. As hsfs depends on several libraries, it is important to analyze the environment and see if there are any dependency issues. For example, pandas is a dependency of hsfs and if uninstalled, an alert will appear that informs the user that the library is missing. In the following example pandas is uninstalled to demonstrate that.
When new releases are available for the hsfs library, a notification is shown to make it simple to upgrade to the latest version. Currently, only the hops and hsfs libraries are monitored for new releases as they are utility libraries used to interact with the platform services. By clicking the upgrade text, the library is upgraded to the recommended version automatically.
Speaking from experience, Python library installations can fail for a seemingly endless number of reasons. As such, to find the cause, it is crucial to be able to access the installation logs in a simple way to find a meaningful error. In the following example, an incorrect version of hsfs is being installed, and the cause of failure can be found a few seconds later in the logs.
Hopsworks is available both on AWS and Azure as a managed platform. Visit hopsworks.ai to try it out.
TLDR; The first release of RonDB is now available. RonDB is currently the best low-latency, high availability, high throughput, and high scalability (LATS) database available today. In addition to new documentation and community support, RonDB 21.04.0 brings automated memory configuration, automated thread configuration, improved networking handling, 3x improved performance, bug fixes and new features. It is available as a managed service on the Hopsworks platform and it can also be installed on-premises with the installation scripts or binary tarball.
RonDB 21.04.0 is based on MySQL NDB Cluster 8.0.23. The first release includes the following new features and improvements:
Our new documentation walks through how to use RonDB as a managed service on the Hopsworks platform and to install it on-premises with automated scripts or binary tarball. We have also added new documentation of the RonDB software.
Welcome to the RonDB community! Feel free to ask any question and our developers will try to reply as soon as possible.
NDB has historically required setting a large number of configuration parameters to set memory sizes of various memory pools. RonDB 21.04.0 introduces automatic memory configuration taking away all these configuration properties. RonDB data nodes will use the entire memory available in the server/VM. You can limit the amount of memory it can use with a new configuration parameter TotalMemoryConfig.
For increased performance and stability, RonDB will automate thread configuration. This feature was introduced in NDB 8.0.23, but in RonDB it is the default behaviour. RonDB makes use of all accessible CPUs and also handles CPU locking automatically. Read more about automated thread configuration on our latest blog.
RonDB will perform better under heavy load by employing improved heuristics in thread spinning and in the network stack.
In NDB 8.0.23 and earlier releases of NDB one could only set NoOfReplicas at the initial start of the cluster. The only method to increase or decrease replication level was to perform a backup and restore. In RonDB 21.04.0 we introduce Active and Inactive nodes, making it possible to change the number of replicas without having to perform an initial restart of the cluster.
We have integrated a number of benchmark tools to assess the performance of RonDB. Benchmarking RonDB is now easy as we ship Sysbench, DBT2, flexAsynch and DBT3 benchmarks along with our binary distribution. Support for ClusterJ benchmarks are expected to come in upcoming product releases.
A new addition to the ClusterJ API was added that releases data objects and Session objects to a cache rather than releasing them fully. In addition an improvement of the garbage collection handling in ClusterJ was handled. These improvements led to a 3x improvement in a simple key lookup benchmark.
There are three ways to use RonDB 21.04.0:
TLDR; Maggy is an open source framework that lets you write generic PyTorch training code (as if it is written to run on a single machine) and execute that training distributed across a GPU cluster. Maggy enables you to write and debug PyTorch code on your local machine, and then run the same code at scale without having to change a single line in your program. Going even further, Maggy provides the distribution transparent use of ZeRO, a sharded optimizer recently proposed by Microsoft. You can use ZeRO to improve your memory efficiency with a single change in your Maggy configuration. You can try Maggy in the Hopsworks managed platform for free.
Deep learning has seen a surge in activity with the availability of high level frameworks such as PyTorch to build and train models. A few lines of code in a notebook are sufficient to create powerful classifiers from scratch. However, both the data and model sizes to achieve state-of-the-art performance are ever increasing, so that training on your local GPU becomes a hopeless endeavour.
Enter distributed training. Distributed training allows you to train the same model on multiple GPUs on different shards of your data to speed up training times. In the ideal case, training on 4 GPUs simultaneously should reduce your training time by 75%. In distributed training, each GPU computes a forward and backward pass over its own batch of the data. For the model update, the computed gradients are shared and averaged between the nodes. This way, all models update their parameters with the same combined gradient and stay in sync. This additional communication step introduces additional overhead of course, which is why ideal scaling is never truly achieved.
So if distributed training is such a great tool to accelerate training, why is its use still uncommon among normal PyTorch users? Because it is too tedious to use! A dummy example for starting distributed training might look something like this.
Going even further, you would need to launch your code on all of your nodes and take care of graceful shutdowns and collecting the results. This is where Maggy comes in. Maggy allows you to launch your PyTorch training script without any changes on Spark clusters. It takes care of the training processes for each node, the resource isolation and node connections.
Next we will explore what is needed to run distribution transparent training on Maggy as well as the restrictions that still exist with the framework.
First of all, Maggy requires its experiment to be configured for distributed training. In the most common use case this means passing your model, hyperparameters and your training/test set. Configuring is as easy as creating a config object. Hyperparameters, train and test set are optional and can also be directly loaded in the training loop. If your training loop consists of more than one module such as in training GANs with a Generator and Discriminator or Policy gradient methods in RL, you can also pass a list of modules.
Maggy’s API requires the training function to follow a unified signature. Users have to pass their model class, its hyperparameters and the train and test set to the training function.
If you want to load your datasets on each node by yourself, you can also omit passing the datasets in the config. In fact, this is highly recommended when working with larger dataset objects. Additionally, every module used in the training function should be imported within that function. Think of your training function as completely self contained. Last but not least, users should use the PyTorch DataLoader (as is best practice anyways). Alternatively, you can also use Maggy’s custom PetastormDataLoader to load large datasets from Petastorm parquet files. When using the latter, users need to ensure that datasets are even, that is they should have the same number of batches per epoch on all nodes. When using PyTorch’s DataLoader, you do not have to care about this. So to summarize, your training function needs to
It’s time to combine all the elements we introduced so far in a complete example of distributed training with Maggy. In this example, we are going to create some arbitrary training data, define a function approximator for scalar fields,write our training loop and launch the distributed training.
In order to not rely on specific datasets, we are going to create our own dataset. For this example, a scalar field should suffice. So first of all we randomly sample x and y and compute some function we want our neural network to approximate. PyTorch’s TensorDataset can then be used to form a proper dataset from this data.
Next up we define our function approximator. For our example a standard neural network with 3 layers suffices, although in real applications you would of course train much larger networks.
At the heart of every PyTorch program lies the training loop. Following the APIs introduced earlier, we define our training function as follows.
As you can see, there is no additional code for distributed training. Maggy takes care of all the necessary things.
All that remains now is to configure Maggy and run our training. For this, we have to create the config object and run the lagom function.
After running the training on 4 nodes, we can see that our approximator has converged to a good estimate of our scalar field. Of course, this would also be possible on a local node. But with more complex models and larger training sets such as the ImageNet dataset, distributed learning becomes necessary to leverage your workloads.
Maggy is open-source and documentation is available at maggy.ai. Give us a star or get in touch if you have more questions. Maggy is also available for all Hopsworks users in the managed platform on AWS or Azure. You can get started for free (no credit card required).
TLDR; Machine learning models are only as good as the data (features) they are trained on. In Enterprises, data scientists can often train very effective models in the lab - when given a free hand on which data to use. However, many of those data sources are not available in production environments due to disconnected systems and data silos. An AI-powered product that is limited to the data available within its application silo cannot recall historical data about its users or relevant contextual data from external sources. It is like a jellyfish - its autonomic system makes it functional and useful, but it lacks a brain. You can, however, evolve your models from brain-free AI to Total Recall AI with the help of a Feature Store, a centralized platform that can provide models with low latency access to data spanning the whole enterprise.
Jellyfish are undoubtedly complex creatures with sophisticated behaviour - they move, mate, and munch. They eat and discard waste from the same opening. Yet they have no brain - their autonomic system suffices for their needs. The biggest breakthroughs in AI in recent years have been enabled by deep learning, which requires large volumes of data and specialized compute hardware (e.g., GPUs). However, just like a jellyfish, recent successes in image processing and NLP with deep learning required no brain - no working memory, history, or context. Much of deep learning today is Jellyfish AI. We have made incredible progress in identifying objects in images and translating natural language, yet such deep learning models typically only require the immediate input - the image or the text - to make their predictions. The input signal is information-rich. These image and NLP models seldom require a 'brain' to augment the input with context or memories. Google translate doesn’t need to know the historical enmity between the Scots and the Irish in whether its spelt Whisky or Whiskey. Jellyfish AI is impressive - the input data is information rich and models can learn fantastically advanced behaviour from labeled examples. All “knowledge” needed to make predictions is embedded in the model. The model does not need to have working memory (e.g., it doesn’t need to know the user has clicked 10 times on your website in the last minute).
Now compare using AI for image classification or NLP to building a web application that will use AI to interact with a user browsing a website. The immediate input data your application receives from your web browser are clicks on a mouse or a keyboard. The input signal is information-light - it is difficult to train a useful model using only user clicks. However, large Internet companies collect reams of information about users from many different sources, and transform that user data into features (information-rich signals that are ready to be used for either training models or making predictions with models). Models can then combine the click features with historical features about users and contextual features to build information-rich inputs to models. For example, you could augment the user’s action with everything you know about a user’s history and context to increase the user's engagement with the product. The feature store for machine learning (ML) stores and serves these features to models. We believe that AI-powered products that can easily access historical and contextual features will lead the next wave of AI in the Enterprise, and those products will need a feature store for ML.
A frequent source of tension in Enterprises is between “naive” data scientists and “street-wise” ML engineers. Motivated by good software engineering practices, many ML engineers believe that ML models should be self-contained and tension can arise with data scientists who want to include features in their models that are “obviously not available in the production system”.
However, data scientists are tasked with building the best models they can to add to the bottom line - engage more users, increase revenue, reduce costs. They know they can train better models with more data and more diverse sources of data. For example, a data scientist trying to predict if a financial transaction is suspected of money laundering or not might discover that a powerful feature is the graph of financial transfers related to this individual in the previous day/week/month. They can reduce false alerts of money launder by a factor of 100*, reducing the costs of investigating the false alerts, saving the business millions of dollars per year. The data scientist hands the model over the wall to the ML engineer who dismisses the idea of including the graph-based features in the production environment, and tension arises when communicating what is possible and what is not possible in production. The data scientist is crestfallen - but need not be.
The Feature Store is now the de facto Enterprise platform for storing historical and contextual features for AI-powered products. The Feature Store is, in effect, the brain for AI-powered products, the three-eyed Raven that enables the model to access the history and state of the whole Enterprise, not just the local state in the application.
Feature Stores enable applications or model serving infrastructure to take information-light inputs (such as a cookie identifying a user or a shopping cart session) and enrich it with features harvested from anywhere in the Enterprise or beyond to build feature vectors capable of making better predictions. And as we know from Deep Learning, model accuracy improves predictably with more features and data, so there will be an increasing trend towards adding more and more features to models to improve their accuracy. Andrew Ng has recently been advocating this approach that he calls data-centric development, instead of the more traditional model-centric development. Another noticeable trend in large Enterprises is building faster and more scalable Feature stores that can supply those features within the time budget available to the AI-powered product. But AI is going to revolutionize Enterprise software products, so how do we make sure our AI-enabled products are not just Jellyfish AI?
How do we avoid limiting AI-enabled products to only using the input features collected by the application itself? Models will benefit from access to all data that the Enterprise has collected about the user, product, or its context. A potential source of friction here, however, is the dominant architectural preference for microservices and data stove-pipes. Models themselves are being deployed as microservices in model-serving infrastructure, like KFServing or TensorFlow Serving or Nvidia Triton. How can we give these models access to more features?
Without a Feature Store, applications could contact microservices or databases to compute or retrieve the historical and contextual features (data), respectively. Computing the features in the application itself is an anti-pattern as it duplicates the feature engineering code - that code should already exist to generate the training data for the model. Re-implementing feature engineering logic in applications also introduces the risk of skew between the features computed in the application and the features computed for training. If serving and training environments use the same programming language, they could avoid non-DRY code by reusing a versioned library that computes the features. However, even if feature engineering logic is written in Python in both training and serving, it may use PySpark for training and Python for serving or different versions of Python. Versioned libraries can help, but are not a general solution to the feature skew problem.
The Feature Store solves the training/serving skew problem by computing the features once in a feature pipeline. The feature pipeline is then reused to (1) create training data, and (2) save those pre-engineered features to the Feature Store. The serving infrastructure can then retrieve those features when needed to make predictions. For example, when an application wants to make a prediction about a user, it would supply the user’s ID, shopping cart ID, session ID, or location to retrieve pre-engineered features from the Feature Store. The features are retrieved as a feature vector, and the feature vector is sent to the model that makes the prediction. The Feature Store service for retrieving feature vectors is commonly known as the Online Feature Store. The logic for retrieving features from the Online Feature Store can also be implemented in model serving infrastructure, not just in applications. The advantage of looking up features in serving infrastructure is that it keeps the application logic cleaner, and the application just sends IDs and real-time features to the model serving infrastructure, that in turn, builds the feature vector, sends it to the model for prediction, and returns the result to the application. Low latency and high throughput are important properties for the online feature store - the faster you can retrieve features and the more features you can include in a given time budget, the more accurate models you should be able to deploy in production. To quote DoorDash:
“Latency on feature stores is a part of model serving, and model serving latencies tend to be in the low milliseconds range. Thus, read latency has to be proportionately lower”
So, to summarize, if you want to give your ML models a brain, connect them up to a feature store. For Enterprises building personalized services, the featurestore can enrich their models with a 360 degree Enterprise wide view of the customer - not just a product-specific view of the customer. The feature store enables more accurate predictions through more data being available to make those predictions, and this ultimately enables products with better user experience, increased engagement, and the product intelligence now expected by users.
* This is based on a true story.
This blog introduces how RonDB handles automatic thread configuration. It is more technical and dives deeper under the surface of how RonDB operates. RonDB provides a configuration option, ThreadConfig, whereby the user can have full control over the assignment of threads to CPUs, how the CPU locking is to be performed and how the thread should be scheduled.
However, for the absolute majority of users this is too advanced, thus the managed version of RonDB ensures that this thread configuration is based on best practices found over decades of testing. This means that every user of the managed version of RonDB will get access to a thread configuration that is optimised for their particular VM size.
In addition RonDB makes use of adaptive CPU spinning in a way that limits the power usage, but still provides very low latency in all database operations. Adaptive CPU spinning improves latency by up to 50% and in most cases more than 10% improvement.
RonDB 21.04 uses automatic thread configuration by default. This means that as a user you don’t have to care about the configuration of threads. What RonDB does is that it retrieves the number of CPUs available to the RonDB data node process. In the managed version of RonDB, this is the full VM or bare metal server available to the data node. In the open source version of RonDB, one can also limit the amount of CPUs available to RonDB data nodes process by using taskset or numactl when starting the data node. RonDB retrieves information about CPU cores, CPU sockets, and connections to the L3 caches of the CPUs. All of this information is used to set up the optimal thread configuration.
LDM threads house the data, query threads handle read committed queries, tc threads handle transaction coordination, receive threads handle incoming network messages, send thread handle the sending of network messages, and main threads handle metadata operations, asynchronous replication and a number of other things.
LDM thread is a key thread type. The LDM thread is responsible for reading and writing data. It manages the hash indexes, the ordered indexes, the actual data, and a set of triggers performing actions for indexes, foreign keys, full replication, asynchronous replication. This thread type is where most of the CPU processing is done. RonDB has an extremely high number of instructions per cycle compared to any other DBMS engine. The LDM thread often executes 1.25 instructions per cycle where many other DBMS engines have reported numbers around 0.25 instructions per cycle. This is a key reason why RonDB has such a great performance both in terms of throughput and latency. This is the result of the design of data structures in RonDB that are CPU cache aware and also due to the functional separation of thread types.
Query thread is a new addition that was introduced in NDB Cluster 8.0.23. In NDB this is not used by default, RonDB enables the use of query threads by default in the automatic thread configuration. The query threads run the same code as the LDM threads and handles a subset of the operations that the LDM can handle. A normal SELECT query will use read committed queries that can be executed by the query threads. A table partition (sometimes referred to as a table fragment or shard) belongs to a certain LDM thread, thus only this LDM thread can be used for writes and locked reads on rows in this table partition. However for read committed queries, the query threads can be used.
To achieve the best performance RonDB uses CPU locking. In Linux, it is quite common that a thread migrates from one CPU to another CPU. If the thread migrates to a CPU belonging to a different CPU core, the thread will suffer a lot of CPU cache misses immediately after being migrated. To avoid this, RonDB locks threads to specific CPU cores. Thus, it is possible to migrate the thread, but only to another CPU in a CPU core that shares the same CPU caches.
Query threads and LDM threads are organised into Round Robin groups. Each Round Robin group consists of between 4 and 8 LDM threads and the same amount of query threads. All threads within one Round Robin group share the same CPU L3 cache. This ensures that we retain the CPU efficiency even with the introduction of these new query threads. This is important since query threads introduce new mutexes and the performance of these are greatly improved when threads sharing mutexes also share CPU caches. The query thread chosen to execute a query must be in the same Round Robin group as the data owning LDM thread is.
Query threads make it possible to decrease the amount of partitions in a table. As an example, we are able to process more than 3 times as many transactions per second using a single partition in Sysbench OLTP RW compared to when we only use LDM threads. Most key-value stores have data divided into table partitions for the primary key of the table. Many key-value stores also contain additional indexes on columns that are not used for partitioning. Since the table is partitioned, this means that each table partition will contain each of those additional indexes. When performing a range scan on such an index, each table partition must be scanned. Thus the cost of performing range scans increases as the number of table partitions increases. RonDB can scale the reads in a single partition to many query threads, this makes it possible to decrease the number of table partitions in RonDB. In Sysbench OLTP RW this improves performance by around 20% even in a fairly small 2-node setup of RonDB.
In addition query threads ensure that hotspots in the tables can be handled by many threads, thus avoiding the need to partition even more to handle hotspots.
At the same time a modest amount of table partitions increases the amount of writes that we can perform on a table and it makes it possible to parallelise range scans which will speed up complex query execution significantly. Thus in RonDB we have attempted to find a balance between overhead and improved parallelism and improved write scalability.
The cost of key lookups is not greatly affected by the number of partitions since those use a hash lookup and thus always go directly to the thread that can execute the key lookup.
RonDB locks LDM threads and query threads in pairs. There is one LDM thread and one query thread in each such LDM group, we attempt to lock this LDM Group to one CPU core. LDM Groups are organised into Round Robin Groups.
A common choice for a scheduling algorithm in an architecture like this would be to use a simple round robin scheduler. However such an algorithm is too simple for this model. We have two problems to overcome. The first is that the load on LDM threads is not balanced since we have decreased the number of table partitions in a table. Second writes and locked reads can only be scheduled in an LDM thread. Thus it is important to use the Read Committed queries to achieve a balanced load. Since LDM threads and query threads are locked onto the same CPU core it is ok for an LDM thread to be almost idle and we will still be efficient since the query thread on this CPU core will be very efficient.
When a query can be scheduled to both an LDM thread and the query threads in the same Round Robin group the following two-level scheduling algorithm is used.
We gather statistics about CPU usage of threads and we also gather queue lengths in the scheduling queues. Based on this information we prioritise selecting the LDM thread and the query thread in the same LDM group. However, if required to achieve a balanced use of the CPU resources in the Round Robin group we will also schedule read committed queries to any query thread in the Round Robin group of the LDM thread. The gathered CPU usage information affects the load balancer with a delay of around 100 milliseconds. The queue length information makes it possible to adapt to changing load in less than a millisecond.
Given that we use less table partitions in RonDB compared to other solutions, there is a risk of imbalanced load on the CPUs. This problem is solved by two things. First, we use a two-level load balancer on LDM and Query threads. This ensures that we will move away work from overloaded LDM threads towards unused query threads. Second, since the LDM and Query threads share the same CPU core we will have access to an unused CPU core in query threads that execute on the same CPU core as an LDM thread that is currently underutilized. Thus, we expect that this architecture will achieve a balanced load on the CPU cores in the data node architecture.
LDM and query threads use around 50-60% of the available CPU resources in a data node.
The tc threads receive all database operations sent from the NDB API. They take care of coordinating transactions and decide which node should take care of the queries. They use around 20-25% of the CPU resources. The NDB API selects tc threads in a node using a simple round robin scheme.
The receive threads take care of a subset of the communication links. Thus, the receive thread load is usually fairly balanced but can be a bit more unbalanced if certain API nodes are more used in querying RonDB. The communication links between data nodes in the same node group are heavily used when performing updates. To ensure that RonDB can scale in this situation these node links use multiple communication links. Receive threads use around 10-15% of the CPU resources.
The send threads assist in sending networking messages to other nodes. The sending of messages can be done by any thread and there is an adaptive algorithm that assigns more load for sending to threads that are not so busy. The send threads assists in sending to ensure that we have enough capacity to handle all the load. It is not necessary to have send threads, the threads can handle sending even without a send thread. Send threads use around 0-10% of the CPUs available.
The total cost of sending can be quite substantial in a distributed database engine, thus the adaptive algorithm is important to balance out this load on the various threads in the data node.
The number of main threads supported can be 0, 1 or 2. These threads handle a lot of the interactions around creating tables, indexes and any other metadata operation. They also handle a lot of the code around recovery and heartbeats. They are handling any subscriptions to asynchronous replication events used by replication channels to other RonDB clusters.
RonDB is based on NDB Cluster. NDB was focused on being a high-availability key-value store from its origin in database research in the 1990s. The thread model in NDB is inherited from a telecom system developed in Ericsson called AXE. Interestingly in one of my first jobs at Philips I worked on a banking system developed in the 1970s, this system had a very similar model compared to the original thread model in NDB and in AXE. In the operating system development time-sharing has been the dominant model since a long time back. However the model used in NDB where the execution thread is programmed as an asynchronous engine where the application handles a state machine has a huge performance advantage when handling many very small tasks. A normal task in RonDB is a key lookup, or a small range scan. Each of those small tasks is actually divided even further when performing updates and parallel range scans. This means that the length of a task in RonDB is on the order of 500 ns up to around 10 microseconds.
Time-sharing operating systems are not designed to handle context switches of this magnitude. NDB was designed with this understanding from the very beginning. Early competitors of NDB used normal operating system threads for each transaction and even in a real-time operating system this had no chance to compete with the effectiveness of NDB. None of these competitors are still around competing in the key-value store market.
The first thread model in NDB used a single thread to handle everything, send, receive, database handling and transaction handling. This is version 1 of the thread architecture, that is also implemented in the open source version of Redis. With the development of multi-core CPUs it became obvious that more threads were needed. What NDB did here was introduce both a functional separation of threads and partitioning the data to achieve a more multi-threaded execution environment. This is version 2 of the thread architecture.
Modern competitors of RonDB have now understood the need to use asynchronous programming to achieve the required performance in a key-value store. We see this in AeroSpike, Redis, ScyllaDB and many other key-value stores. Thus the industry has followed the RonDB road to achieving an efficient key-value store implementation.
Most competitors have opted for only partitioning the data and thus each thread still has to execute all the code for meta data handling, replication handling, send, receive and database operations. Thus RonDB has actually advanced version 2 of the thread architecture further than its competitors.
All modern CPUs use both a data cache and an instruction cache. By combining all functions inside one thread, the instruction cache will have to execute more code. In RonDB the LDM thread only executes the operation to change the data structures, the tc thread only executes code to handle transactions and the receive thread can focus on the code to execute network receive operations. This makes each thread more efficient. The same is true for the CPU data cache, the LDM thread need not bother with the data structures used for transaction handling and network receive. It can focus the CPU caches on the requirements for database operations which is challenging enough in a database engine.
A simple splitting of data into different table partitions makes sense if all operations towards the key-value store are primary key lookups or unique key lookups. However most key-value stores also require performing general search operations as part of the application. These search operations are implemented as range scans with search conditions, these scale not so well with a simple splitting of data.
To handle this, RonDB introduces version 3 of the thread architecture that uses a compromise where we still split the data, but we introduce query threads to assist the LDM threads in reading the data. Thus RonDB can handle hotspots of data and require fewer number of table partitions to achieve the required scalability of the key-value store.
Thoughts on a v4 of the thread architecture have already emerged, so expect this development to continue for a while more. This includes even better handling of the higher latency to persistent memory data structures.
Finally, even if a competitor managed to replicate all of those features of RonDB, RonDB has another ace in the 3-level distributed hashing algorithm that makes use of a CPU cache aware data structure.
All of those things combined makes us comfortable that RonDB will continue to lead the key-value store market in terms of LATS: lowest Latency, highest Availability, the highest Throughput and the most Scalable data storage. Thus, being the best LATS database in the industry.
Online feature stores are the data layer for operational machine learning models - the models that make online shopping recommendations for you and help identify financial fraud. When you train a machine learning model, you feed it with high signal-to-noise data called features. When the model is used in operation, it needs the same types of features that it was trained on (e.g., how many times you used your credit card during the previous week), but the online feature store should have low latency to keep the end-to-end latency of using a model low. Using a model requires both retrieving the features from the online feature store and then sending them to the model for prediction.
Hopsworks has been using NDB Cluster as our online feature store from its first release. It has the unique combination of low latency, high availability, high throughput, and scalable storage that we call LATS. However, we knew we could make it even better as an online feature store in the cloud, so we asked one of the world’s leading database developers to do it - the person who invented NDB, Mikael Ronström. Together we have made RonDB, a key-value store with SQL capabilities, that is the world’s most advanced and performant online feature store. Although NDB Cluster is open-source, its adoption has been hampered by an undeserved reputation of being challenging to configure and operate. With RonDB, we overcome this limitation by providing it as a managed service in the cloud on AWS and Azure.
The main requirements from a database used as an online feature store are: low latency, high throughput for mixed read/write operations, high availability and the ability to store large data sets (larger than fit on a single host). We unified these properties in a single muscular term LATS:
RonDB is not without competition as the premier choice as an online feature store. To quote Khan and Hassan from DoorDash, it should be a low latency database:
To that end, Redis fits this requirement as it is an in-memory key-value store (without SQL capabilities). Redis is open source (BSD Licence), and it enjoys popularity as an online feature store. Doordash even invested significant resources in increasing Redis’ storage capacity as an online feature store, by adding custom serialization and compression schemes. Significantly, similar to RonDB, it provides sub-millisecond latency for single key-value store operations. There are other databases that have been proposed as online feature stores, but they were not considered in this post as they have significantly higher latency (over one order-of-magnitude!), such as DynamoDB, BigTable, and SpliceMachine.
As such, we thought it would be informative to compare the performance of RonDB and Redis as an online feature store. The comparison was between Redis open-source and RonDB open-source (the commercial version of Redis does not allow any benchmarks). In addition to our benchmark, we compare the innards of RonDB’s multithreading architecture to the commercial Redis products (since our benchmark identifies CPU scalability bottlenecks in Redis that commercial products claim to overcome).
In this simple benchmark, I wanted to compare apples with apples, so I compared open-source RonDB to the open-source version of Redis, since the commercial versions disallow reporting any benchmarks. In the benchmark, I deliberately hobble the performance of RonDB by configuring it with only a single database thread, as Redis is “a single-threaded server from the POV of command execution”. I then proceed to describe the historical evolution of RonDB’s multithreaded architecture, consisting of three different generations, and how open-source Redis is still at the first generation, while commercial Redis products are now at generation two.
Firstly, for our single-threaded database benchmark, we performed our experiments on a 32-core Lenovo P620 workstation with 64 GB of RAM. We performed key-value lookups. Our experiments show that a single-threaded RonDB instance reached around 1.1M reads per second, while Redis reached more than 800k reads per second - both with a mean latency of around 25 microseconds. The throughput benchmark performed batch reads with 50 reads per batch and had 16 threads issuing batch requests in parallel. Batching reads/writes improves throughput at the cost of increased latency.
On the same 32-core server, both RonDB and Redis reached around 600k writes per second when performing SET for Redis and INSERT, UPDATE or DELETE operations for RonDB. For high availability, both of those tests were done with a setup using two replicas in both RonDB and in Redis.
We expected that the read latency and throughput of RonDB and Redis would be similar since both require two network jumps to read data. In case of updates (and writes/deletes), Redis should have lower latency since an update is only performed on the main replica before returning. That is, Redis only supports asynchronous replication from the main replica to a backup replica, which can result in data loss on failure of the main node. In contrast, RonDB performs an update using a synchronous replication protocol that requires 6 messages (a non-blocking version of two-phase commit). Thus, the expected latency is 3 times higher for RonDB for writes.
A comparison of latency and throughput shows that RonDB already has a slight advantage in a single-threaded architecture, but with its third-generation multithreaded architecture, described below, RonDB has an even bigger performance advantage compared to Redis commercial or open-source. RonDB can be scaled up by adding more CPUs and memory or scaled out, by automatically sharding the database. As early as 2013, we developed a benchmark with NDB Cluster (RonDB’s predecessor) that showed how NDB could handle 200M Reads per second in a large cluster of 30 data nodes with 28 cores each.
The story on high availability is different. A write in Redis is only written to one replica. The replication to other replicas is then done asynchronously, thus consistency can be seriously affected by failures and data can be lost. An online feature store must accept writes that change the online features constantly in parallel with all the batched key reads. Thus handling node failures in an online feature store must be very smooth.
Given that an online feature store may need to scale to millions of writes per second as part of a normal operation, this means that a failed node can cause millions of writes to be lost, affecting the correctness and quality of any models that it is feeding with data. RonDB has transactional capabilities that ensure that transactions can be retried in the event of such partial failures. Thus, as long as the database cluster is not fully down, no transactions will be lost.
In many cases the data comes from external data sources into the online Feature Store, so a replay of the data is possible, but an inconsistent state of the database can easily lead to extra unavailability in failure situations. Since an online feature store is often used in mission-critical services, this is clearly not desirable.
RonDB updates all replicas synchronously as part of writes. Thus, if a node running the transaction coordinator or a participant fails, the cluster will automatically fail over to the surviving nodes, a new transaction coordinator will be elected (non-blocking), and no committed transactions will be lost. This is a key feature of RonDB and has been tested in the most demanding applications for more than 15 years and tested thousands of times on a daily basis.
Additionally it can be mentioned that in a highly available setup, in a cloud environment RonDB can read any replica and still see the latest changes whereas Redis will have to read the main replica to get a consistent view of the data and this will, in this case, require communicating across availability zones which can easily add milliseconds to latency for reads. RonDB will automatically setup the cluster such that applications using the APIs will read replicas that are located in the same availability zone. Thus in those setups RonDB will always be able to read the latest version of the data and still deliver data at the lowest possible latency. Redis setups will have to choose between delivering consistent data with higher latency or inconsistent data with low latency in this setup.
Redis only supports in-memory data - this means that Redis will not be able to support online Feature Stores that store lots of data. In contrast, RonDB can store data both in-memory and on-disk, and with support for up to 144 database nodes in a cluster, it can scale to clusters of up to 1PB in size.
For our single-threaded benchmark, we did not expect there to be, nor were there, any major differences in throughput or latency for either read or write operations. The purpose of the benchmark was to show that both databases are similar in how efficiently they use a single CPU. RonDB and Redis are both in-memory databases, but the implementation details of their multithreaded architectures matters for scalability (how efficiently they handle increased resources), as we will see.
Firstly, “Redis is not designed to benefit from multiple CPU cores. People are supposed to launch several Redis instances to scale out on several cores if needed.” For our use-case of online feature stores, it is decidedly non-trivial to partition a feature store across multiple redis instances. Therefore, commercial vendors encourage users to pay for their distributions that introduce a new multithreaded architecture to Redis. We now chronicle the three different generations of threading architectures underlying RonDB, from its NDB roots, and how they compare to Redis’ journey to date.
Open-source Redis has practically the same thread architecture implemented as in the first version of NDB Cluster from the 1990s, when NDB was purely an in-memory database. The commercial Redis distributions have sinced developed multithreaded architectures very similar to the second thread architecture used by later versions of NDB cluster. However, RonDB has evolved into a third version of the NDB cluster thread architecture. Let’s dive into the details of these 3 generations of threading architectures.
The first generation thread architecture is extremely simple - everything is implemented in one thread. This means that this thread handles receive on the socket, handling transactions, handling the database operation, and finally sending the response. This works better with fast CPUs with large caches - more Intel, less AMD. Still, handling all of this in one thread is efficient and my experiments on a 32-core Lenovo P620 workstation, we can see that RonDB reached 1.1M reads per second and Redis a bit more than 800k reads per second in a single thread with a latency of around 25 microseconds. This test used batch reads with 50 reads per batch and had 16 threads issuing batches in parallel. For INSERT, UPDATE or DELETE operations, both RonDB and Redis reached around 600k writes per second, with Redis performing SET operations (it does not have a SQL API). This benchmark was done with a setup using two replicas in both RonDB and in Redis.
Around 2008, the trend towards an increased number of CPUs per socket accelerated and today a modern cloud server can have more than 100 CPUs. With NDB, we decided it was necessary to redesign its multithreading architecture to scale with the available number of CPUs. This resulted in our second generation thread architecture, where we partitioned database tables and let each partition (shard) be managed by a separate thread. This takes care of the database operation. However, as NDB is a transactional distributed database, it was also necessary to design a multithreaded architecture for socket receive, socket send, and transaction handling. These are harder to scale-out on multi-core hardware.
Both NDB and the commercial solutions of Redis implement their multithreaded architecture using message passing. However, the commercial Redis solution uses Unix sockets for communication, whereas NDB (and RonDB) send messages through memory in a lock-free communication scheme. RonDB has dedicated threads to handle socket receive, socket send, and transaction handling. To further reduce latency, network send is adaptively integrated into transaction handling threads.
RonDB has, however, taken even more steps in its multithreaded architecture compared to commercial Redis solutions. RonDB now has a new third generation thread architecture that improves the scalability of the database - it is now possible for more than one thread to perform concurrent reads on a partition. RonDB will ensure that all these read-only threads that can read a partition all share the same L3 cache to avoid any scalability costs. This ensures that hotspots can be handled by multiple threads in RonDB. This architectural advancement decreases the need to partition the database into an increasingly larger number of partitions, and prevents problems such as over-provisioning due to hotspots in DynamoDB. The introduction of read-only threads is important for applications that also need to perform range scans on indexes that are not part of the partition key. It also enables the key-value store to handle secondary index scans and complex SQL queries in the background, while concurrently handling primary key operations (reads/writes for the online Feature Store).
Finally, the third generation thread architecture enables fine-grained elasticity; it is easy to stop a database node and bring it up again with a higher or lower number of CPUs, thus making it easy to increase database throughput or decrease its cost of operation. Hotspots can be mitigated with read-only threads without resorting to over-provisioning. RonDB also has coarse-grained elasticity, with the ability to add new database nodes without affecting its operation, known as online add node.
There are numerous advantages of RonDB compared to Redis; the higher availability and the ability to handle much larger data sets are the most obvious ones. Both are open-source, but Redis has the advantage of a reduced operational burden as it is not at its core a distributed system. However, in the recent era of managed databases in the cloud, operational overhead is no longer a deciding factor when choosing your database. With RonDB’s appearance as a managed database in the cloud, the operational advantages of Redis diminish, paving the way for RonDB to gain adoption as the highest performance online feature store in the cloud.
TLDR: The Hopsworks Feature Store is an open platform that connects to the largest number of data stores, and data science platforms with the most comprehensive API support - Python, Spark (Python, Java/Scala). It supports Azure ML Studio Notebooks or Designer for feature engineering and as your data science platform. You can design and ingest features and you can browse existing features, along with creating training datasets as either DataFrames or as files on Azure Blob storage.
Azure ML is becoming an increasingly popular data science platform to train, deploy, and manage machine learning models. Azure ML also supports AutoML, and enables you to automate and run pipelines at scale.
The Hopsworks Feature Store is the leading open-source feature store for machine learning. It is provided as a managed service on Azure (and AWS) and it includes all the tools needed to store, retrieve and track features that will be used when training and serving ML models. The Hopsworks Feature Store integrates with many cloud platforms and storage services, one of which is Azure ML Studio.
In this blog post, we show how you can connect to the Hopsworks Feature Store from Azure ML, how to ingest features from a Pandas dataframe, combine different features in order to create a training dataset, and finally save that training dataset to Azure Blob storage, from where the data can be read in Azure ML to train and validate a machine learning model.
In order to follow this tutorial, you need:
Connecting to the Feature Store from Azure ML requires setting up a Feature Store API key for authentication. In Hopsworks, click on your username in the top-right corner (1) and select Settings to open the user settings. Select API keys. (2) Give the key a name and select the job, featurestore, dataset.create and project scopes before (3) creating the key. Copy the key into your clipboard for the next step.
To access the Feature Store from Azure Machine Learning, open a Python notebook and proceed with the following steps to install the Hopsworks Feature Store client called HSFS:
Note that we are installing the latest version at the time of writing this (2.1.4) - you should always install the latest minor version that corresponds to the version of your Hopsworks Feature Store. So in this case our Hopsworks instance is running version 2.1.
Furthermore, for Python clients (such as Azure ML), it is important to install HSFS with the `[hive]` optional extra. Spark clients do not need this.
After successfully installing HSFS, you should be able to connect to the Feature Store from your Azure ML notebook (note: you might need to restart the kernel, if you had HSFS previously installed):
Make sure to replace the [UUID] with the one of the DNS of your Hopsworks instance, the [project-name] with the Hopsworks project that contains your feature store. And the [api-key] with the key created in Step 1. Please note that it’s not good practice to store the Api Key in your notebook- instead you should store the key safely in a permissions protected file and use the “api_key_file” argument to pass the filename to the connection method.
Once you are connected you can get a handle to the feature store with `connection.get_feature_store()`. If the project you have connected to also contains a shared feature store (it is possible to have a feature store from another project shared with the project you are using), you can also get a handle on the shared feature store using the connection object.
You can simply upload some data in your favourite file format to the Azure ML workspace or you configure a Hopsworks Storage Connector to cloud storage or a database. The Storage Connector safely stores endpoints and credentials to external stores or databases, making it easier for Data Scientists to retrieve data from them.
If you opted to upload the data as CSV files, as shown below, simply read it into a pandas dataframe:
Now, we can perform some feature engineering based on the pandas dataframe. We would like to predict the weekly sales of a department, so let’s create our target feature by selecting the last week available for each department:We can create this as a feature group, also containing the `is_holiday` feature, since, this information will be available at prediction time, there is no risk of data leakage.
We can create this as a feature group, also containing the `is_holiday` feature, since, this information will be available at prediction time, there is no risk of data leakage.
We can create this as a feature group, also containing the `is_holiday` feature, since, this information will be available at prediction time, there is no risk of data leakage.
By clicking the hyperlink in the logs underneath the notebook cell, you can follow the progress of your ingestion job in Hopsworks.
Let’s now create a few simple features based on the historical sales of each department:
And again, we finish by creating a feature group with this dataframe and saving it to the feature store:
Note: If you have existing feature engineering notebooks that you would like to reuse with the Hopsworks Feature Store, it should be enough to simply add the two calls (create the Feature Group, and save the dataframe to it) in order to ingest your features to the Feature Store. No other changes are required in your existing programs and you can still use your favourite Python libraries for feature engineering.
With these two feature groups we can move to the next step to create a training dataset. Since we did not disable statistics computation, you can head to the Hopsworks Feature Store and inspect the pre-computed statistics over the newly created feature groups.
HSFS comes with an expressive Join API and Query Planner that allows users to join, filter and explore feature groups in order to create training datasets.
Assuming, you start with a new Jupyter Notebook, the first commands you need to run are to get handles to the previously created feature groups:
Note that we explicitly supply the (schema) version for the feature group (version=1), so that other developers can update the feature groups safely in higher numbered versions of the feature group.
With our two feature group objects, we would like to join the target feature with our historical features, but only select the departments for our training dataset that have a full history of 142 weeks available:
As you can see, feature group joins work similarly to pandas dataframe joins. In this case we can omit the join-key since both feature groups have the same primary key, however, for more advanced joins there is always the possibility to specify the join key from each group as well as the join type (left, inner, right, outer, etc) manually.
Hopsworks Feature Store supports a variety of storage connectors to materialize your training dataset to different cloud storage systems. If you have previously configured an Azure Data Lake Storage connector, you can now use it as the destination for your training dataset:
Similar to feature groups, you can now create the training dataset in your favourite file format, matching the machine learning library you are planning to use - for example, choose ‘tfrecord’ for TensorFlow. The Feature Store will make sure to track all metadata related to your training dataset, even if the training dataset is created outside of Hopsworks.
To retrieve the training dataset in your training environment you can simply get a handle to the dataset and its location, to pass it subsequently to your reader utilities:
This video shows how you can leverage the Hopsworks online feature store to compute and ingest features and make them available to operational models making real-time predictions, with low latency and preventing skew between the training and serving features.
Hopsworks provides a Python environment per project that is shared among all the users in the project. All common installation alternatives are supported, , in addition to libraries packaged in a .whl or .egg file and those that reside on a git repository.
This video shows how you can leverage the Hopsworks online feature store to compute and ingest features and make them available to operational models making real-time predictions, with low latency and preventing skew between the training and serving features.
Connecting Hopsworks to your organisation’s AWS account is the first step towards using the Feature Store: 1. Connect your AWS account, 2. Create an instance profile, 3. Create a S3 bucket, 4. Create a SSH key, 5. Enable permissions for Hopsworks to access
Connecting Hopsworks to your organisation’s Azure account is the first step towards using the Feature Store: 1. Connect your Azure account, 2. Create and configure a storage, 3. Add a ssh key to your resource group, 4. Enable permissions for Hopsworks to access
A feature store is a data platform that manages and governs your features for machine learning - both for training and serving. You have access to and can reuse previously engineered features available within the entire organisation, avoiding the need to write a feature engineering pipeline for every model put in production.
The Hopsworks Feature Store is available today on Azure as both a managed platform (www.hopsworks.ai) and a custom Enterprise installation. It manages your features for training and serving models in a cluster under your control inside your organisation’s existing cloud account.
This video goes into depth on the scale-out metadata architecture in Hopsworks and it is used to enable Impliciit Provenance for Machine Learning Pipelines. It was a Guest Lecture at Boston University in October 2020, on a course taught by John (Ioannis) Liagos.
Financial institutions invest huge amounts of resources in both identifying and preventing money laundering and fraud. Most existing systems for identifying money laundering are rules-based that are not capable of detecting ever changing schemes. Consequently, these systems generate too many false-positive alerts, taking time and money to run down.
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general purpose Feature Store needs a general purpose feature engineering, feature selection, and feature transformation platform.
Distributed deep learning offers many benefits – faster training of models using more GPUs, parallelizing hyperparameter tuning over many GPUs, and parallelizing ablation studies to help understand the behaviour and performance of deep neural networks. With Spark 3.0, GPUs are coming to executors in Spark, and distributed deep learning using PySpark is now possible. However, PySpark presents challenges for iterative model development – starting on development machines (laptops) and then re-writing them to run on cluster-based environments.