Today we are pleased to announce the release of two new RonDB releases.
The source code of RonDB is found in https://github.com/logicalclocks/rondb.
RonDB information is found through the RonDB website http://rondb.com.
Documentation of RonDB is found at http://docs.rondb.com.
RonDB 21.04.1 is a maintenance release of RonDB 21.04 that contains 3 new features and 18 bug fixes. RonDB 21.04 is a long-term support version that will be supported until 2024.RonDB 21.10.1 is a beta release of RonDB 21.10 that is based on RonDB 21.04.1 and contains an additional 4 features and 2 bug fixes. RonDB 21.10.1 improves throughput in the DBT2 benchmark by 70% compared to RonDB 21.04.1 and improves Sysbench benchmarks by about 10%.
Release Notes for 21.04.1 is found here.
You can download a binary tarball of RonDB 21.04.1 here.
You can download a binary tarball of RonDB 21.10.1 here.
Hopsworks brings support for scale-out AI with the ExtremeEarth project which focuses on the most concerning issues of food security and sea mapping.
The Hopsworks Feature Store abstracts away the complexity of a dual database system, unifying feature access for online and batch applications.
In this blog we performed a set of microbenchmarks. In particular, we compare RonDB with ScyllaDB for instruction cache on separating threads.
This tutorial gives an overview of how to work with Jupyter on the platform and train a state-of-the-art ML model using the fastai python library.
HopsFS is an open-source scaleout metadata file system, but its primary use case is not Exabyte storage, rather customizable consistent metadata.
Learn how to train a ML model in a distributed fashion without reformatting our code on Databricks with Maggy, open source tool available on Hopsworks.
The first release of RonDB is now available. RonDB 21.04.0 brings new documentation, community support, 3x improved performance, bug fixes and new features.
Use open-source Maggy to write and debug PyTorch code on your local machine and run the code at scale without changing a single line in your program.
RonDB shows higher availability and the ability to handle larger data sets in comparison with Redis, paving the way to be the fastest key-value store available.
Learn how to design and ingest features, browse existing features, create training datasets as DataFrames or as files on Azure Blob storage.
RonDB is a managed key-value store with SQL capabilities. It provides the best low-latency, high throughput, and high availability database available today.
Connect the Hopsworks Feature Store to Amazon Redshift to transform your data into features to train models and make predictions.
Hopsworks now supports dynamic role-based access control to indexes in elasticsearch with no performance penalty by building on Open Distro for Elasticsearch.
Hopsworks is the world's first Enterprise Feature Store along with an advanced end-to-end ML platform.
Hopsworks is now available as a managed platform for Amazon Web Services (AWS) and Microsoft Azure with a comprehensive free tier.
A data warehouse is an input to the Feature Store. A data warehouse is a single columnar database, while a feature store is implemented as two databases.
Feature store as a new element and AI tool in the machine learning systems for the automotive industry.
Try out Maggy for hyperparameter optimization or ablation studies now on Hopsworks.ai to access a new way of writing machine learning applications.
Learn how to integrate Kubeflow with Hopsworks and take advantage of its Feature Store and scale-out deep learning capabilities.
How ExtremeEarth Brings Large-scale AI to the Earth Observation Community with Hopsworks, the Data-intensive AI Platform
Introducing the feature store which is a new data science tool for building and deploying better AI models in the gambling and casino business.
This is a guide to file formats for ml in Python. The Feature Store can store training/test data in a file format of choice on a file system of choice.
Datasets used for deep learning may reach millions of files. The well known image dataset, ImageNet, contains 1m images, and its successor, the Open Images...
When you train deep learning models with lots of high quality training data, you can beat state-of-the-art prediction models in a wide array of domains.
If you are employing a team of Data Scientists for Deep Learning, a cluster manager to share GPUs between your team will maximize utilization of your GPUs.
A Feature Store stores features. We go through data management for deep learning and present the first open-source feature store now in Hopsworks' ML Platform.
TLDR; Machine learning models are only as good as the data (features) they are trained on. In Enterprises, data scientists can often train very effective models in the lab - when given a free hand on which data to use. However, many of those data sources are not available in production environments due to disconnected systems and data silos. An AI-powered product that is limited to the data available within its application silo cannot recall historical data about its users or relevant contextual data from external sources. It is like a jellyfish - its autonomic system makes it functional and useful, but it lacks a brain. You can, however, evolve your models from brain-free AI to Total Recall AI with the help of a Feature Store, a centralized platform that can provide models with low latency access to data spanning the whole enterprise.
Jellyfish are undoubtedly complex creatures with sophisticated behaviour - they move, mate, and munch. They eat and discard waste from the same opening. Yet they have no brain - their autonomic system suffices for their needs. The biggest breakthroughs in AI in recent years have been enabled by deep learning, which requires large volumes of data and specialized compute hardware (e.g., GPUs). However, just like a jellyfish, recent successes in image processing and NLP with deep learning required no brain - no working memory, history, or context. Much of deep learning today is Jellyfish AI. We have made incredible progress in identifying objects in images and translating natural language, yet such deep learning models typically only require the immediate input - the image or the text - to make their predictions. The input signal is information-rich. These image and NLP models seldom require a 'brain' to augment the input with context or memories. Google translate doesn’t need to know the historical enmity between the Scots and the Irish in whether its spelt Whisky or Whiskey. Jellyfish AI is impressive - the input data is information rich and models can learn fantastically advanced behaviour from labeled examples. All “knowledge” needed to make predictions is embedded in the model. The model does not need to have working memory (e.g., it doesn’t need to know the user has clicked 10 times on your website in the last minute).
Now compare using AI for image classification or NLP to building a web application that will use AI to interact with a user browsing a website. The immediate input data your application receives from your web browser are clicks on a mouse or a keyboard. The input signal is information-light - it is difficult to train a useful model using only user clicks. However, large Internet companies collect reams of information about users from many different sources, and transform that user data into features (information-rich signals that are ready to be used for either training models or making predictions with models). Models can then combine the click features with historical features about users and contextual features to build information-rich inputs to models. For example, you could augment the user’s action with everything you know about a user’s history and context to increase the user's engagement with the product. The feature store for machine learning (ML) stores and serves these features to models. We believe that AI-powered products that can easily access historical and contextual features will lead the next wave of AI in the Enterprise, and those products will need a feature store for ML.
A frequent source of tension in Enterprises is between “naive” data scientists and “street-wise” ML engineers. Motivated by good software engineering practices, many ML engineers believe that ML models should be self-contained and tension can arise with data scientists who want to include features in their models that are “obviously not available in the production system”.
However, data scientists are tasked with building the best models they can to add to the bottom line - engage more users, increase revenue, reduce costs. They know they can train better models with more data and more diverse sources of data. For example, a data scientist trying to predict if a financial transaction is suspected of money laundering or not might discover that a powerful feature is the graph of financial transfers related to this individual in the previous day/week/month. They can reduce false alerts of money launder by a factor of 100*, reducing the costs of investigating the false alerts, saving the business millions of dollars per year. The data scientist hands the model over the wall to the ML engineer who dismisses the idea of including the graph-based features in the production environment, and tension arises when communicating what is possible and what is not possible in production. The data scientist is crestfallen - but need not be.
The Feature Store is now the de facto Enterprise platform for storing historical and contextual features for AI-powered products. The Feature Store is, in effect, the brain for AI-powered products, the three-eyed Raven that enables the model to access the history and state of the whole Enterprise, not just the local state in the application.
Feature Stores enable applications or model serving infrastructure to take information-light inputs (such as a cookie identifying a user or a shopping cart session) and enrich it with features harvested from anywhere in the Enterprise or beyond to build feature vectors capable of making better predictions. And as we know from Deep Learning, model accuracy improves predictably with more features and data, so there will be an increasing trend towards adding more and more features to models to improve their accuracy. Andrew Ng has recently been advocating this approach that he calls data-centric development, instead of the more traditional model-centric development. Another noticeable trend in large Enterprises is building faster and more scalable Feature stores that can supply those features within the time budget available to the AI-powered product. But AI is going to revolutionize Enterprise software products, so how do we make sure our AI-enabled products are not just Jellyfish AI?
How do we avoid limiting AI-enabled products to only using the input features collected by the application itself? Models will benefit from access to all data that the Enterprise has collected about the user, product, or its context. A potential source of friction here, however, is the dominant architectural preference for microservices and data stove-pipes. Models themselves are being deployed as microservices in model-serving infrastructure, like KFServing or TensorFlow Serving or Nvidia Triton. How can we give these models access to more features?
Without a Feature Store, applications could contact microservices or databases to compute or retrieve the historical and contextual features (data), respectively. Computing the features in the application itself is an anti-pattern as it duplicates the feature engineering code - that code should already exist to generate the training data for the model. Re-implementing feature engineering logic in applications also introduces the risk of skew between the features computed in the application and the features computed for training. If serving and training environments use the same programming language, they could avoid non-DRY code by reusing a versioned library that computes the features. However, even if feature engineering logic is written in Python in both training and serving, it may use PySpark for training and Python for serving or different versions of Python. Versioned libraries can help, but are not a general solution to the feature skew problem.
The Feature Store solves the training/serving skew problem by computing the features once in a feature pipeline. The feature pipeline is then reused to (1) create training data, and (2) save those pre-engineered features to the Feature Store. The serving infrastructure can then retrieve those features when needed to make predictions. For example, when an application wants to make a prediction about a user, it would supply the user’s ID, shopping cart ID, session ID, or location to retrieve pre-engineered features from the Feature Store. The features are retrieved as a feature vector, and the feature vector is sent to the model that makes the prediction. The Feature Store service for retrieving feature vectors is commonly known as the Online Feature Store. The logic for retrieving features from the Online Feature Store can also be implemented in model serving infrastructure, not just in applications. The advantage of looking up features in serving infrastructure is that it keeps the application logic cleaner, and the application just sends IDs and real-time features to the model serving infrastructure, that in turn, builds the feature vector, sends it to the model for prediction, and returns the result to the application. Low latency and high throughput are important properties for the online feature store - the faster you can retrieve features and the more features you can include in a given time budget, the more accurate models you should be able to deploy in production. To quote DoorDash:
“Latency on feature stores is a part of model serving, and model serving latencies tend to be in the low milliseconds range. Thus, read latency has to be proportionately lower”
So, to summarize, if you want to give your ML models a brain, connect them up to a feature store. For Enterprises building personalized services, the featurestore can enrich their models with a 360 degree Enterprise wide view of the customer - not just a product-specific view of the customer. The feature store enables more accurate predictions through more data being available to make those predictions, and this ultimately enables products with better user experience, increased engagement, and the product intelligence now expected by users.
* This is based on a true story.
This blog introduces how RonDB handles automatic thread configuration. It is more technical and dives deeper under the surface of how RonDB operates. RonDB provides a configuration option, ThreadConfig, whereby the user can have full control over the assignment of threads to CPUs, how the CPU locking is to be performed and how the thread should be scheduled.
However, for the absolute majority of users this is too advanced, thus the managed version of RonDB ensures that this thread configuration is based on best practices found over decades of testing. This means that every user of the managed version of RonDB will get access to a thread configuration that is optimised for their particular VM size.
In addition RonDB makes use of adaptive CPU spinning in a way that limits the power usage, but still provides very low latency in all database operations. Adaptive CPU spinning improves latency by up to 50% and in most cases more than 10% improvement.
RonDB 21.04 uses automatic thread configuration by default. This means that as a user you don’t have to care about the configuration of threads. What RonDB does is that it retrieves the number of CPUs available to the RonDB data node process. In the managed version of RonDB, this is the full VM or bare metal server available to the data node. In the open source version of RonDB, one can also limit the amount of CPUs available to RonDB data nodes process by using taskset or numactl when starting the data node. RonDB retrieves information about CPU cores, CPU sockets, and connections to the L3 caches of the CPUs. All of this information is used to set up the optimal thread configuration.
LDM threads house the data, query threads handle read committed queries, tc threads handle transaction coordination, receive threads handle incoming network messages, send thread handle the sending of network messages, and main threads handle metadata operations, asynchronous replication and a number of other things.
LDM thread is a key thread type. The LDM thread is responsible for reading and writing data. It manages the hash indexes, the ordered indexes, the actual data, and a set of triggers performing actions for indexes, foreign keys, full replication, asynchronous replication. This thread type is where most of the CPU processing is done. RonDB has an extremely high number of instructions per cycle compared to any other DBMS engine. The LDM thread often executes 1.25 instructions per cycle where many other DBMS engines have reported numbers around 0.25 instructions per cycle. This is a key reason why RonDB has such a great performance both in terms of throughput and latency. This is the result of the design of data structures in RonDB that are CPU cache aware and also due to the functional separation of thread types.
Query thread is a new addition that was introduced in NDB Cluster 8.0.23. In NDB this is not used by default, RonDB enables the use of query threads by default in the automatic thread configuration. The query threads run the same code as the LDM threads and handles a subset of the operations that the LDM can handle. A normal SELECT query will use read committed queries that can be executed by the query threads. A table partition (sometimes referred to as a table fragment or shard) belongs to a certain LDM thread, thus only this LDM thread can be used for writes and locked reads on rows in this table partition. However for read committed queries, the query threads can be used.
To achieve the best performance RonDB uses CPU locking. In Linux, it is quite common that a thread migrates from one CPU to another CPU. If the thread migrates to a CPU belonging to a different CPU core, the thread will suffer a lot of CPU cache misses immediately after being migrated. To avoid this, RonDB locks threads to specific CPU cores. Thus, it is possible to migrate the thread, but only to another CPU in a CPU core that shares the same CPU caches.
Query threads and LDM threads are organised into Round Robin groups. Each Round Robin group consists of between 4 and 8 LDM threads and the same amount of query threads. All threads within one Round Robin group share the same CPU L3 cache. This ensures that we retain the CPU efficiency even with the introduction of these new query threads. This is important since query threads introduce new mutexes and the performance of these are greatly improved when threads sharing mutexes also share CPU caches. The query thread chosen to execute a query must be in the same Round Robin group as the data owning LDM thread is.
Query threads make it possible to decrease the amount of partitions in a table. As an example, we are able to process more than 3 times as many transactions per second using a single partition in Sysbench OLTP RW compared to when we only use LDM threads. Most key-value stores have data divided into table partitions for the primary key of the table. Many key-value stores also contain additional indexes on columns that are not used for partitioning. Since the table is partitioned, this means that each table partition will contain each of those additional indexes. When performing a range scan on such an index, each table partition must be scanned. Thus the cost of performing range scans increases as the number of table partitions increases. RonDB can scale the reads in a single partition to many query threads, this makes it possible to decrease the number of table partitions in RonDB. In Sysbench OLTP RW this improves performance by around 20% even in a fairly small 2-node setup of RonDB.
In addition query threads ensure that hotspots in the tables can be handled by many threads, thus avoiding the need to partition even more to handle hotspots.
At the same time a modest amount of table partitions increases the amount of writes that we can perform on a table and it makes it possible to parallelise range scans which will speed up complex query execution significantly. Thus in RonDB we have attempted to find a balance between overhead and improved parallelism and improved write scalability.
The cost of key lookups is not greatly affected by the number of partitions since those use a hash lookup and thus always go directly to the thread that can execute the key lookup.
RonDB locks LDM threads and query threads in pairs. There is one LDM thread and one query thread in each such LDM group, we attempt to lock this LDM Group to one CPU core. LDM Groups are organised into Round Robin Groups.
A common choice for a scheduling algorithm in an architecture like this would be to use a simple round robin scheduler. However such an algorithm is too simple for this model. We have two problems to overcome. The first is that the load on LDM threads is not balanced since we have decreased the number of table partitions in a table. Second writes and locked reads can only be scheduled in an LDM thread. Thus it is important to use the Read Committed queries to achieve a balanced load. Since LDM threads and query threads are locked onto the same CPU core it is ok for an LDM thread to be almost idle and we will still be efficient since the query thread on this CPU core will be very efficient.
When a query can be scheduled to both an LDM thread and the query threads in the same Round Robin group the following two-level scheduling algorithm is used.
We gather statistics about CPU usage of threads and we also gather queue lengths in the scheduling queues. Based on this information we prioritise selecting the LDM thread and the query thread in the same LDM group. However, if required to achieve a balanced use of the CPU resources in the Round Robin group we will also schedule read committed queries to any query thread in the Round Robin group of the LDM thread. The gathered CPU usage information affects the load balancer with a delay of around 100 milliseconds. The queue length information makes it possible to adapt to changing load in less than a millisecond.
Given that we use less table partitions in RonDB compared to other solutions, there is a risk of imbalanced load on the CPUs. This problem is solved by two things. First, we use a two-level load balancer on LDM and Query threads. This ensures that we will move away work from overloaded LDM threads towards unused query threads. Second, since the LDM and Query threads share the same CPU core we will have access to an unused CPU core in query threads that execute on the same CPU core as an LDM thread that is currently underutilized. Thus, we expect that this architecture will achieve a balanced load on the CPU cores in the data node architecture.
LDM and query threads use around 50-60% of the available CPU resources in a data node.
The tc threads receive all database operations sent from the NDB API. They take care of coordinating transactions and decide which node should take care of the queries. They use around 20-25% of the CPU resources. The NDB API selects tc threads in a node using a simple round robin scheme.
The receive threads take care of a subset of the communication links. Thus, the receive thread load is usually fairly balanced but can be a bit more unbalanced if certain API nodes are more used in querying RonDB. The communication links between data nodes in the same node group are heavily used when performing updates. To ensure that RonDB can scale in this situation these node links use multiple communication links. Receive threads use around 10-15% of the CPU resources.
The send threads assist in sending networking messages to other nodes. The sending of messages can be done by any thread and there is an adaptive algorithm that assigns more load for sending to threads that are not so busy. The send threads assists in sending to ensure that we have enough capacity to handle all the load. It is not necessary to have send threads, the threads can handle sending even without a send thread. Send threads use around 0-10% of the CPUs available.
The total cost of sending can be quite substantial in a distributed database engine, thus the adaptive algorithm is important to balance out this load on the various threads in the data node.
The number of main threads supported can be 0, 1 or 2. These threads handle a lot of the interactions around creating tables, indexes and any other metadata operation. They also handle a lot of the code around recovery and heartbeats. They are handling any subscriptions to asynchronous replication events used by replication channels to other RonDB clusters.
RonDB is based on NDB Cluster. NDB was focused on being a high-availability key-value store from its origin in database research in the 1990s. The thread model in NDB is inherited from a telecom system developed in Ericsson called AXE. Interestingly in one of my first jobs at Philips I worked on a banking system developed in the 1970s, this system had a very similar model compared to the original thread model in NDB and in AXE. In the operating system development time-sharing has been the dominant model since a long time back. However the model used in NDB where the execution thread is programmed as an asynchronous engine where the application handles a state machine has a huge performance advantage when handling many very small tasks. A normal task in RonDB is a key lookup, or a small range scan. Each of those small tasks is actually divided even further when performing updates and parallel range scans. This means that the length of a task in RonDB is on the order of 500 ns up to around 10 microseconds.
Time-sharing operating systems are not designed to handle context switches of this magnitude. NDB was designed with this understanding from the very beginning. Early competitors of NDB used normal operating system threads for each transaction and even in a real-time operating system this had no chance to compete with the effectiveness of NDB. None of these competitors are still around competing in the key-value store market.
The first thread model in NDB used a single thread to handle everything, send, receive, database handling and transaction handling. This is version 1 of the thread architecture, that is also implemented in the open source version of Redis. With the development of multi-core CPUs it became obvious that more threads were needed. What NDB did here was introduce both a functional separation of threads and partitioning the data to achieve a more multi-threaded execution environment. This is version 2 of the thread architecture.
Modern competitors of RonDB have now understood the need to use asynchronous programming to achieve the required performance in a key-value store. We see this in AeroSpike, Redis, ScyllaDB and many other key-value stores. Thus the industry has followed the RonDB road to achieving an efficient key-value store implementation.
Most competitors have opted for only partitioning the data and thus each thread still has to execute all the code for meta data handling, replication handling, send, receive and database operations. Thus RonDB has actually advanced version 2 of the thread architecture further than its competitors.
All modern CPUs use both a data cache and an instruction cache. By combining all functions inside one thread, the instruction cache will have to execute more code. In RonDB the LDM thread only executes the operation to change the data structures, the tc thread only executes code to handle transactions and the receive thread can focus on the code to execute network receive operations. This makes each thread more efficient. The same is true for the CPU data cache, the LDM thread need not bother with the data structures used for transaction handling and network receive. It can focus the CPU caches on the requirements for database operations which is challenging enough in a database engine.
A simple splitting of data into different table partitions makes sense if all operations towards the key-value store are primary key lookups or unique key lookups. However most key-value stores also require performing general search operations as part of the application. These search operations are implemented as range scans with search conditions, these scale not so well with a simple splitting of data.
To handle this, RonDB introduces version 3 of the thread architecture that uses a compromise where we still split the data, but we introduce query threads to assist the LDM threads in reading the data. Thus RonDB can handle hotspots of data and require fewer number of table partitions to achieve the required scalability of the key-value store.
Thoughts on a v4 of the thread architecture have already emerged, so expect this development to continue for a while more. This includes even better handling of the higher latency to persistent memory data structures.
Finally, even if a competitor managed to replicate all of those features of RonDB, RonDB has another ace in the 3-level distributed hashing algorithm that makes use of a CPU cache aware data structure.
All of those things combined makes us comfortable that RonDB will continue to lead the key-value store market in terms of LATS: lowest Latency, highest Availability, the highest Throughput and the most Scalable data storage. Thus, being the best LATS database in the industry.
Online feature stores are the data layer for operational machine learning models - the models that make online shopping recommendations for you and help identify financial fraud. When you train a machine learning model, you feed it with high signal-to-noise data called features. When the model is used in operation, it needs the same types of features that it was trained on (e.g., how many times you used your credit card during the previous week), but the online feature store should have low latency to keep the end-to-end latency of using a model low. Using a model requires both retrieving the features from the online feature store and then sending them to the model for prediction.
Hopsworks has been using NDB Cluster as our online feature store from its first release. It has the unique combination of low latency, high availability, high throughput, and scalable storage that we call LATS. However, we knew we could make it even better as an online feature store in the cloud, so we asked one of the world’s leading database developers to do it - the person who invented NDB, Mikael Ronström. Together we have made RonDB, a key-value store with SQL capabilities, that is the world’s most advanced and performant online feature store. Although NDB Cluster is open-source, its adoption has been hampered by an undeserved reputation of being challenging to configure and operate. With RonDB, we overcome this limitation by providing it as a managed service in the cloud on AWS and Azure.
The main requirements from a database used as an online feature store are: low latency, high throughput for mixed read/write operations, high availability and the ability to store large data sets (larger than fit on a single host). We unified these properties in a single muscular term LATS:
RonDB is not without competition as the premier choice as an online feature store. To quote Khan and Hassan from DoorDash, it should be a low latency database:
To that end, Redis fits this requirement as it is an in-memory key-value store (without SQL capabilities). Redis is open source (BSD Licence), and it enjoys popularity as an online feature store. Doordash even invested significant resources in increasing Redis’ storage capacity as an online feature store, by adding custom serialization and compression schemes. Significantly, similar to RonDB, it provides sub-millisecond latency for single key-value store operations. There are other databases that have been proposed as online feature stores, but they were not considered in this post as they have significantly higher latency (over one order-of-magnitude!), such as DynamoDB, BigTable, and SpliceMachine.
As such, we thought it would be informative to compare the performance of RonDB and Redis as an online feature store. The comparison was between Redis open-source and RonDB open-source (the commercial version of Redis does not allow any benchmarks). In addition to our benchmark, we compare the innards of RonDB’s multithreading architecture to the commercial Redis products (since our benchmark identifies CPU scalability bottlenecks in Redis that commercial products claim to overcome).
In this simple benchmark, I wanted to compare apples with apples, so I compared open-source RonDB to the open-source version of Redis, since the commercial versions disallow reporting any benchmarks. In the benchmark, I deliberately hobble the performance of RonDB by configuring it with only a single database thread, as Redis is “a single-threaded server from the POV of command execution”. I then proceed to describe the historical evolution of RonDB’s multithreaded architecture, consisting of three different generations, and how open-source Redis is still at the first generation, while commercial Redis products are now at generation two.
Firstly, for our single-threaded database benchmark, we performed our experiments on a 32-core Lenovo P620 workstation with 64 GB of RAM. We performed key-value lookups. Our experiments show that a single-threaded RonDB instance reached around 1.1M reads per second, while Redis reached more than 800k reads per second - both with a mean latency of around 25 microseconds. The throughput benchmark performed batch reads with 50 reads per batch and had 16 threads issuing batch requests in parallel. Batching reads/writes improves throughput at the cost of increased latency.
On the same 32-core server, both RonDB and Redis reached around 600k writes per second when performing SET for Redis and INSERT, UPDATE or DELETE operations for RonDB. For high availability, both of those tests were done with a setup using two replicas in both RonDB and in Redis.
We expected that the read latency and throughput of RonDB and Redis would be similar since both require two network jumps to read data. In case of updates (and writes/deletes), Redis should have lower latency since an update is only performed on the main replica before returning. That is, Redis only supports asynchronous replication from the main replica to a backup replica, which can result in data loss on failure of the main node. In contrast, RonDB performs an update using a synchronous replication protocol that requires 6 messages (a non-blocking version of two-phase commit). Thus, the expected latency is 3 times higher for RonDB for writes.
A comparison of latency and throughput shows that RonDB already has a slight advantage in a single-threaded architecture, but with its third-generation multithreaded architecture, described below, RonDB has an even bigger performance advantage compared to Redis commercial or open-source. RonDB can be scaled up by adding more CPUs and memory or scaled out, by automatically sharding the database. As early as 2013, we developed a benchmark with NDB Cluster (RonDB’s predecessor) that showed how NDB could handle 200M Reads per second in a large cluster of 30 data nodes with 28 cores each.
The story on high availability is different. A write in Redis is only written to one replica. The replication to other replicas is then done asynchronously, thus consistency can be seriously affected by failures and data can be lost. An online feature store must accept writes that change the online features constantly in parallel with all the batched key reads. Thus handling node failures in an online feature store must be very smooth.
Given that an online feature store may need to scale to millions of writes per second as part of a normal operation, this means that a failed node can cause millions of writes to be lost, affecting the correctness and quality of any models that it is feeding with data. RonDB has transactional capabilities that ensure that transactions can be retried in the event of such partial failures. Thus, as long as the database cluster is not fully down, no transactions will be lost.
In many cases the data comes from external data sources into the online Feature Store, so a replay of the data is possible, but an inconsistent state of the database can easily lead to extra unavailability in failure situations. Since an online feature store is often used in mission-critical services, this is clearly not desirable.
RonDB updates all replicas synchronously as part of writes. Thus, if a node running the transaction coordinator or a participant fails, the cluster will automatically fail over to the surviving nodes, a new transaction coordinator will be elected (non-blocking), and no committed transactions will be lost. This is a key feature of RonDB and has been tested in the most demanding applications for more than 15 years and tested thousands of times on a daily basis.
Additionally it can be mentioned that in a highly available setup, in a cloud environment RonDB can read any replica and still see the latest changes whereas Redis will have to read the main replica to get a consistent view of the data and this will, in this case, require communicating across availability zones which can easily add milliseconds to latency for reads. RonDB will automatically setup the cluster such that applications using the APIs will read replicas that are located in the same availability zone. Thus in those setups RonDB will always be able to read the latest version of the data and still deliver data at the lowest possible latency. Redis setups will have to choose between delivering consistent data with higher latency or inconsistent data with low latency in this setup.
Redis only supports in-memory data - this means that Redis will not be able to support online Feature Stores that store lots of data. In contrast, RonDB can store data both in-memory and on-disk, and with support for up to 144 database nodes in a cluster, it can scale to clusters of up to 1PB in size.
For our single-threaded benchmark, we did not expect there to be, nor were there, any major differences in throughput or latency for either read or write operations. The purpose of the benchmark was to show that both databases are similar in how efficiently they use a single CPU. RonDB and Redis are both in-memory databases, but the implementation details of their multithreaded architectures matters for scalability (how efficiently they handle increased resources), as we will see.
Firstly, “Redis is not designed to benefit from multiple CPU cores. People are supposed to launch several Redis instances to scale out on several cores if needed.” For our use-case of online feature stores, it is decidedly non-trivial to partition a feature store across multiple redis instances. Therefore, commercial vendors encourage users to pay for their distributions that introduce a new multithreaded architecture to Redis. We now chronicle the three different generations of threading architectures underlying RonDB, from its NDB roots, and how they compare to Redis’ journey to date.
Open-source Redis has practically the same thread architecture implemented as in the first version of NDB Cluster from the 1990s, when NDB was purely an in-memory database. The commercial Redis distributions have sinced developed multithreaded architectures very similar to the second thread architecture used by later versions of NDB cluster. However, RonDB has evolved into a third version of the NDB cluster thread architecture. Let’s dive into the details of these 3 generations of threading architectures.
The first generation thread architecture is extremely simple - everything is implemented in one thread. This means that this thread handles receive on the socket, handling transactions, handling the database operation, and finally sending the response. This works better with fast CPUs with large caches - more Intel, less AMD. Still, handling all of this in one thread is efficient and my experiments on a 32-core Lenovo P620 workstation, we can see that RonDB reached 1.1M reads per second and Redis a bit more than 800k reads per second in a single thread with a latency of around 25 microseconds. This test used batch reads with 50 reads per batch and had 16 threads issuing batches in parallel. For INSERT, UPDATE or DELETE operations, both RonDB and Redis reached around 600k writes per second, with Redis performing SET operations (it does not have a SQL API). This benchmark was done with a setup using two replicas in both RonDB and in Redis.
Around 2008, the trend towards an increased number of CPUs per socket accelerated and today a modern cloud server can have more than 100 CPUs. With NDB, we decided it was necessary to redesign its multithreading architecture to scale with the available number of CPUs. This resulted in our second generation thread architecture, where we partitioned database tables and let each partition (shard) be managed by a separate thread. This takes care of the database operation. However, as NDB is a transactional distributed database, it was also necessary to design a multithreaded architecture for socket receive, socket send, and transaction handling. These are harder to scale-out on multi-core hardware.
Both NDB and the commercial solutions of Redis implement their multithreaded architecture using message passing. However, the commercial Redis solution uses Unix sockets for communication, whereas NDB (and RonDB) send messages through memory in a lock-free communication scheme. RonDB has dedicated threads to handle socket receive, socket send, and transaction handling. To further reduce latency, network send is adaptively integrated into transaction handling threads.
RonDB has, however, taken even more steps in its multithreaded architecture compared to commercial Redis solutions. RonDB now has a new third generation thread architecture that improves the scalability of the database - it is now possible for more than one thread to perform concurrent reads on a partition. RonDB will ensure that all these read-only threads that can read a partition all share the same L3 cache to avoid any scalability costs. This ensures that hotspots can be handled by multiple threads in RonDB. This architectural advancement decreases the need to partition the database into an increasingly larger number of partitions, and prevents problems such as over-provisioning due to hotspots in DynamoDB. The introduction of read-only threads is important for applications that also need to perform range scans on indexes that are not part of the partition key. It also enables the key-value store to handle secondary index scans and complex SQL queries in the background, while concurrently handling primary key operations (reads/writes for the online Feature Store).
Finally, the third generation thread architecture enables fine-grained elasticity; it is easy to stop a database node and bring it up again with a higher or lower number of CPUs, thus making it easy to increase database throughput or decrease its cost of operation. Hotspots can be mitigated with read-only threads without resorting to over-provisioning. RonDB also has coarse-grained elasticity, with the ability to add new database nodes without affecting its operation, known as online add node.
There are numerous advantages of RonDB compared to Redis; the higher availability and the ability to handle much larger data sets are the most obvious ones. Both are open-source, but Redis has the advantage of a reduced operational burden as it is not at its core a distributed system. However, in the recent era of managed databases in the cloud, operational overhead is no longer a deciding factor when choosing your database. With RonDB’s appearance as a managed database in the cloud, the operational advantages of Redis diminish, paving the way for RonDB to gain adoption as the highest performance online feature store in the cloud.
TLDR: The Hopsworks Feature Store is an open platform that connects to the largest number of data stores, and data science platforms with the most comprehensive API support - Python, Spark (Python, Java/Scala). It supports Azure ML Studio Notebooks or Designer for feature engineering and as your data science platform. You can design and ingest features and you can browse existing features, along with creating training datasets as either DataFrames or as files on Azure Blob storage.
Azure ML is becoming an increasingly popular data science platform to train, deploy, and manage machine learning models. Azure ML also supports AutoML, and enables you to automate and run pipelines at scale.
The Hopsworks Feature Store is the leading open-source feature store for machine learning. It is provided as a managed service on Azure (and AWS) and it includes all the tools needed to store, retrieve and track features that will be used when training and serving ML models. The Hopsworks Feature Store integrates with many cloud platforms and storage services, one of which is Azure ML Studio.
In this blog post, we show how you can connect to the Hopsworks Feature Store from Azure ML, how to ingest features from a Pandas dataframe, combine different features in order to create a training dataset, and finally save that training dataset to Azure Blob storage, from where the data can be read in Azure ML to train and validate a machine learning model.
In order to follow this tutorial, you need:
Connecting to the Feature Store from Azure ML requires setting up a Feature Store API key for authentication. In Hopsworks, click on your username in the top-right corner (1) and select Settings to open the user settings. Select API keys. (2) Give the key a name and select the job, featurestore, dataset.create and project scopes before (3) creating the key. Copy the key into your clipboard for the next step.
To access the Feature Store from Azure Machine Learning, open a Python notebook and proceed with the following steps to install the Hopsworks Feature Store client called HSFS:
Note that we are installing the latest version at the time of writing this (2.1.4) - you should always install the latest minor version that corresponds to the version of your Hopsworks Feature Store. So in this case our Hopsworks instance is running version 2.1.
Furthermore, for Python clients (such as Azure ML), it is important to install HSFS with the `[hive]` optional extra. Spark clients do not need this.
After successfully installing HSFS, you should be able to connect to the Feature Store from your Azure ML notebook (note: you might need to restart the kernel, if you had HSFS previously installed):
Make sure to replace the [UUID] with the one of the DNS of your Hopsworks instance, the [project-name] with the Hopsworks project that contains your feature store. And the [api-key] with the key created in Step 1. Please note that it’s not good practice to store the Api Key in your notebook- instead you should store the key safely in a permissions protected file and use the “api_key_file” argument to pass the filename to the connection method.
Once you are connected you can get a handle to the feature store with `connection.get_feature_store()`. If the project you have connected to also contains a shared feature store (it is possible to have a feature store from another project shared with the project you are using), you can also get a handle on the shared feature store using the connection object.
You can simply upload some data in your favourite file format to the Azure ML workspace or you configure a Hopsworks Storage Connector to cloud storage or a database. The Storage Connector safely stores endpoints and credentials to external stores or databases, making it easier for Data Scientists to retrieve data from them.
If you opted to upload the data as CSV files, as shown below, simply read it into a pandas dataframe:
Now, we can perform some feature engineering based on the pandas dataframe. We would like to predict the weekly sales of a department, so let’s create our target feature by selecting the last week available for each department:We can create this as a feature group, also containing the `is_holiday` feature, since, this information will be available at prediction time, there is no risk of data leakage.
We can create this as a feature group, also containing the `is_holiday` feature, since, this information will be available at prediction time, there is no risk of data leakage.
We can create this as a feature group, also containing the `is_holiday` feature, since, this information will be available at prediction time, there is no risk of data leakage.
By clicking the hyperlink in the logs underneath the notebook cell, you can follow the progress of your ingestion job in Hopsworks.
Let’s now create a few simple features based on the historical sales of each department:
And again, we finish by creating a feature group with this dataframe and saving it to the feature store:
Note: If you have existing feature engineering notebooks that you would like to reuse with the Hopsworks Feature Store, it should be enough to simply add the two calls (create the Feature Group, and save the dataframe to it) in order to ingest your features to the Feature Store. No other changes are required in your existing programs and you can still use your favourite Python libraries for feature engineering.
With these two feature groups we can move to the next step to create a training dataset. Since we did not disable statistics computation, you can head to the Hopsworks Feature Store and inspect the pre-computed statistics over the newly created feature groups.
HSFS comes with an expressive Join API and Query Planner that allows users to join, filter and explore feature groups in order to create training datasets.
Assuming, you start with a new Jupyter Notebook, the first commands you need to run are to get handles to the previously created feature groups:
Note that we explicitly supply the (schema) version for the feature group (version=1), so that other developers can update the feature groups safely in higher numbered versions of the feature group.
With our two feature group objects, we would like to join the target feature with our historical features, but only select the departments for our training dataset that have a full history of 142 weeks available:
As you can see, feature group joins work similarly to pandas dataframe joins. In this case we can omit the join-key since both feature groups have the same primary key, however, for more advanced joins there is always the possibility to specify the join key from each group as well as the join type (left, inner, right, outer, etc) manually.
Hopsworks Feature Store supports a variety of storage connectors to materialize your training dataset to different cloud storage systems. If you have previously configured an Azure Data Lake Storage connector, you can now use it as the destination for your training dataset:
Similar to feature groups, you can now create the training dataset in your favourite file format, matching the machine learning library you are planning to use - for example, choose ‘tfrecord’ for TensorFlow. The Feature Store will make sure to track all metadata related to your training dataset, even if the training dataset is created outside of Hopsworks.
To retrieve the training dataset in your training environment you can simply get a handle to the dataset and its location, to pass it subsequently to your reader utilities:
We are pleased to introduce RonDB, the world’s fastest key-value store with SQL capabilities, available now in the cloud. RonDB is an open source distribution of NDB Cluster, thus providing the same core technology and performance as NDB, but as a managed platform in the cloud. RonDB is the best low-latency, high throughput, and high availability database available today. In addition, it also brings large data storage capabilities. RonDB is a critical component of the infrastructure needed to build feature stores, and, in general, real-time applications.
RonDB is an open source distribution of NDB Cluster which has been used as a key-value store for applications for almost 20 years. The name RonDB is a synthesis of its inventor, Mikael Ronström, and its heritage NDB. RonDB is offered as a managed version of NDB Cluster, providing the fastest solution targeting the traditional applications that use NDB Cluster such as those used in telecom, finance, and gaming. RonDB provides a solution as the data layer for operational AI applications. RonDB gives online models access to reliable, low-latency, high-throughput storage for precomputed features.
RonDB is ideal for developers who require a real-time database that can (1) be quickly deployed (RonDB provides a possibility to set up an RonDB cluster in less than a minute), (2) handle millions of database operations per second, and (3) be elastically scaled up and down to match the current load, reducing operational costs.
A key-value store associates a value with a unique identifier (a key) to enable easy storage, retrieval, searching, and updating of data using simple queries - making them ideal for organisations with large volumes of data that needs to be constantly retrieved and updated.
Distributed key-value stores, such as RonDB, are optimal for organisations with heavy read/write workloads on large data sets, thus being a good fit for businesses operating at scale. For example, they are a critical component of the infrastructure needed to build feature stores and many different types of interactive and real-time applications.
RonDB can be used to develop interactive online applications where storing, retrieving and updating data is essential, but there is also the possibility to search for data in the same system. RonDB provides efficient access using a partition key through either native APIs or SQL access.
RonDB supports distributed transactions enabling applications to perform any type of data modification. RonDB enables concurrent complex queries using SQL as part of the online application. Through its efficient partitioning scheme, RonDB can scale nodes to support hundreds of CPUs per node and then scale to hundreds of data nodes (tens of thousands of CPUs in total).
RonDB supports writing low latency, high availability SQL applications using its full SQL capabilities, providing a MySQL Server API.
RonDB is the fastest key-value store currently available and the only one with general purpose query capabilities, through its SQL support. The performance of key-value stores are evaluated based on low Latency, high Availability, high Throughput, and scalable Storage (LATS).
RonDB has shown the fastest response times: it can respond in 100-200 microseconds on individual requests and in less than a millisecond on batched read requests; it can also perform complex transactions in a highly loaded cluster within 10 milliseconds. Moreover, and perhaps even more importantly, RonDB has predictable latency.
NDB Cluster can provide Class 6 Availability; in other words, an NDB system is operational 99.9999% of the time, thus no more than 30 seconds of downtime per year. RonDB brings the same type of availability now to the cloud, thus ensuring that RonDB is always available. RonDB allows you to place replicas within a cloud region on different availability zones, ensuring that failure of availability zones does not result in downtime for RonDB.
RonDB scales to up to hundreds of millions of read or write ops/seconds. A single RonDB instance on a virtual machine can support millions of key value operations per second, and, without downtime, can be scaled up to a large cluster that supports hundreds of millions of key value operations per second.
RonDB offers large dataset capabilities, being able to store up to petabytes of data.Individual RonDB data nodes can store up to 16TBytes of in-memory data and many tens of TBytes of disk data. Given that RonDB clusters can scale up to 144 data nodes, an RonDB cluster can be scaled to store petabytes of data.
The combination of predictable low latency, the ability to handle millions of operations per second, and a unique high availability solution makes RonDB the leading LATS database, that is now also available in the cloud.
RonDB is the foundation of the Hopsworks Online Feature Store. For those unfamiliar with the concept of feature store, it is a data warehouse of features for machine learning. It contains two databases: an online database serving features at low latency to online applications and an offline database storing large volumes of historic feature values.
The online feature store provides the data (features) that underlie the models used in intelligent online applications. For example, it helps inform an intelligent decision about whether a financial transaction is suspected of money laundering or not by providing the real-time information about the user’s credit history, recent card usage, location of the transaction, and so on. Thus, the most important requirements from the feature store are quick and consistent responses (low latency), handling large volumes of key lookups (high throughput), and being always available (high availability). In addition an online feature store should be able to scale as the business grows. RonDB is the only database available in the market that meets all these requirements.
We’re not releasing RonDB publicly just yet. Instead, we have made it available as early access. If you think your team could benefit from RonDB, please contact us!
TLDR: Hopsworks Feature Store provides an open ecosystem that connects to the largest number of data storage, data pipelines, and data science platforms. You can connect the Feature Store to Amazon Redshift to transform your data into features to train models and make predictions.
Amazon Redshift is a popular managed data warehouse on AWS. Companies use Redshift to store structured data for traditional data analytics. This makes Redshift a popular source of raw data for computing features for training machine learning models.
The Hopsworks Feature Store is the leading open-source feature store for machine learning. It is provided as a managed service on AWS and it includes all the tools needed to transform data in Redshift into features that will be used when training and serving ML models.
In this blog post, we show how you can configure the Redshift storage connector to ingest data, transform that data into features (feature engineering) and save the pre-computed features in the Hopsworks Feature Store.
To follow this tutorial users should have an Hopsworks Feature Store instance running on https://hopsworks.ai. You can register for free with no credit-card and receive 4000 USD of credits to get started.
Users should also have an existing Redshift cluster. If you don't have an existing cluster, you can create one by following the AWS Documentation
The first step to be able to ingest Redshift data into the feature store is to configure a storage connector. The Redshift connector requires you to specify the following properties. Most of them are available in the properties area of your cluster in the Redshift UI.
There are two options available for authenticating with the Redshift cluster. The first option is to configure a username and a password. The password is stored in the secret store and made available to all the members of the project.
The second option is to configure an IAM role. With IAM roles, Jobs or notebooks launched on Hopsworks do not need to explicitly authenticate with Redshift, as the HSFS library will transparently use the IAM role to acquire a temporary credential to authenticate the specified user.
In Hopsworks, there are two different ways to configure an IAM role: a per-cluster IAM role or a federated IAM role (role chaining). For the per-cluster IAM role, you select an instance profile for your Hopsworks cluster when launching it in hopsworks.ai, and all jobs or notebooks will be run with the selected IAM role. For the federated IAM role, you create a head IAM role for the cluster that enables Hopsworks to assume a potentially different IAM role in each project. You can even restrict it so that only certain roles within a project (like a data owner) can assume a given role.
With regards to the database driver, the library to interact with Redshift *is not* included in Hopsworks - you need to upload the driver yourself. First, you need to download the library from here. Select the driver version without the AWS SDK. You then upload the driver files to the “Resources” dataset in your project, see the screenshot below.
Then, you add the file to your notebook or job before launching it, as shown in the screenshots below.
Hopsworks supports the creation of (a) cached feature groups and (b) external (on-demand) feature groups. For cached feature groups, the features are stored in Hopsworks feature store. For external feature groups, only metadata for features is stored in the feature store - not the actual feature data which is read from the external database/object-store. When the external feature group is accessed from a Spark or Python job, the feature data is read on-demand using a connector from the external store. On AWS, Hopsworks supports the creation of external feature groups from a large number of data stores, including Redshift, RDS, Snowflake, S3, and any JDBC-enabled source.
In this example, we will define an external feature group for a table in Redshift. External feature groups in Hopsworks support “provenance” in the Hopsworks Web UI, you can track which features are stored on which external systems and how they are computed. Additionally HSFS (the Python/Scala library used to interact with the feature store) provides the same APIs for external feature groups as for cached feature groups.
An external (on-demand) feature group can be defined as follow:
When running `save()` the metadata is stored in the feature store and statistics for the data are computed and made available through the Hopsworks UI. Statistics helps data scientists in the quest of building better features from raw data.
On-demand feature groups can be used directly as a source for creating training datasets. This is often the case if a company is migrating to Hopsworks and there are already feature engineering pipelines in production writing data to Redshift.
This flexibility provided by Hopsworks allows users to hit the ground running from day 1, without having to rewrite their pipelines to take advantage of the benefits the Hopsworks feature store provides.
On-demand feature groups can also be joined with cached feature groups in Hopsworks to create training datasets. This helper guide explains in detail how the HSFS joining APIs work and how they can be used to create training datasets.
If, however, Redshift contains raw data that needs to be feature engineered, you can retrieve a Spark DataFrame backed by the Redshift table using the HSFS API.
You can then transform your `spark_df` into features using feature engineering libraries. It is also possible to retrieve the dataframe as a Pandas dataframe and perform the feature engineering steps in Python code.
You can then save the final dataframe containing the engineered features to the feature store as a cached feature group:
Storing feature groups as cached feature groups within Hopsworks provides several benefits over on-demand feature groups. First it allows users to leverage Hudi for incremental ingestion (with ACID properties, ensuring the integrity of the feature group) and time travel capabilities. As new data is ingested, new commits are tracked by Hopsworks allowing users to see what has changed over time. On each commit, statistics are computed and tracked in Hopsworks, allowing users to understand how the data has changed over time.
Cached feature groups can also be stored in the online feature store (`online_enabled=True`), thus enabling low latency access to the features using the online feature store API.
Get started today on Hopsworks.ai by configuring your Redshift storage connector and run this notebook.
TLDR: The need for an open-source alternative to Elasticsearch has recently become more evident; platforms that bundle Open Distro for Elasticsearch are able to future-proof open-source support for free-text search and Elasticsearch. In this post, we describe how Hopsworks leverages the authentication and authorization support in Open Distro for Elasticsearch to make free text search a project-based multi-tenant service in Hopsworks. More concretely, Hopsworks now supports dynamic role-based access control (RBAC) to indexes in elasticsearch with no performance penalty by building on Open Distro for Elasticsearch (ODES).
In January 2021, Elastic switched from the Apache V2 open-source license for both Elasticsearch and Kibana to a non open-source license to Server Side Public License (SSPL).
Hopsworks is an open-source platform that includes Open Distro for Elasticsearch (a fork of Elasticsearch) and Kibana.
In Hopsworks, we use Elasticsearch to provide free-text search for AI assets (features, models, experiments, datasets, etc). We also make Elasticsearch indexes available for use by programs run in Hopsworks. As we interpret it, the latter functionality means we contravene the licensing terms of the SSPL:
“If you make the functionality of the Program or a modified version available to third parties as a service... (license conditions apply)”
Luckily, we recently made the switch from Elasticsearch to Open Distro for Elasticsearch, supported by AWS, which is Apache v2 licensed.
Open Distro for Elasticsearch supports Active Directory and LDAP for authentication and authorization. Using the Security plugin, you can use RBAC to control the actions a user can perform. A role defines the cluster operations and index operations a user can perform, including access to indices, and even fine-grained field and document level access. RBAC allows an administrator to define a single security policy and apply it to all members of a department. But individuals may be members of multiple departments, so a user might be given multiple roles. With dynamic role-based access control you can change the set of roles a user can hold at a given time.
For example, if a user is a member of two departments - one for accessing banking data and another one for accessing trading data, with dynamic RBAC, you could restrict the user to only allow her to hold one of those roles at a given time. The policy for deciding which role the user holds could, for example, depend on what VPN (virtual private network) the user is logged in to or what building the user is present in. In effect, dynamic roles would allow the user to hold only one of the roles at a time and sandbox her inside one of the domains - banking or trading. It would prevent her from cross-linking or copying data between the different trading and banking domains.
Hopsworks implements a dynamic role-based access control model through its project-based multi-tenant security model. Every Project has an owner with full read-write privileges and zero or more members. A project owner may invite other users to his/her project as either a Data Scientist (read-only privileges and run jobs privileges) or Data Owner (full privileges). Users can be members of (or own) multiple Projects, but inside each project, each member (user) has a unique identity - we call it a project-user identity. For example, user Alice in Project A is different from user Alice in Project B - (in fact, the system-wide (project-user) identities are ProjectA__Alice and ProjectB__Alice, respectively).
As such, each project-user identity is effectively a role with the project-level privileges to access data and run programs inside that project. If a user is a member of multiple projects, she has, in effect, multiple possible roles, but only one role can be active at a time when performing an action inside Hopsworks. When a user performs an action (for example, runs a program) it will be executed with the project-user identity. That is, the action will only have the privileges associated with that project. The figure below illustrates how Alice has a different identity for each of the two projects (A and B) that she is a member of. Each project contains its own separate private assets. Alice can use only one identity at a time which guarantees that she can’t access assets from both projects at the same time.
Hopsworks enables you to host sensitive data in a shared cluster using a project-based access control security model (an implementation of dynamic role-based access control). In Hopsworks, a project is a secure sandbox with members, data, code, and services. Similar to GitHub repositories, projects are self-service: users manage membership, roles, and can securely share data assets with other projects. This project-based multi-tenant security model enables users to host both sensitive and shared data in a single Hopsworks cluster - you do not need to manage and pay for separate clusters.
An important aspect of project-based multi-tenancy is that assets can be shared between projects - sharing does not mean that data is duplicated. The current assets that can be shared between projects are: files/directories in HopsFS, Hive databases, feature stores, and Kafka topics. For example, in the figure below there are three users (User1, User2, and User3) and two projects (A and B). User1 is a member of project A, while User2 and User3 are members of project B. All three users (User1, User2, User3) can access the assets shared between project A and project B. As sharing does not mean copying, the access control rules for the asset are updated to give users in the other project read or write permissions on the shared asset.
Project-user identity is primarily based on a X.509 certificate issued internally by Hopsworks. Access control policies, however, are implemented by the platform services (HopsFS, Hive, Feature Store, Kafka), and for Elasticsearch Open Distro, permissions are managed using an open-source Hopsworks project-based authorizer plugin.
The following PySpark code snippet, available as a notebook when you run the Spark Tour on Hopsworks, shows how to read from an index that is private to a project from PySpark. There is also an equivalent Scala/Spark notebook.
In Hopsworks, we use Public Key Infrastructure (PKI) with X.509 certificates to authenticate and authorize users. Every user and every service in a Hopsworks cluster has a private key and an X.509 certificate. Hopsworks projects also support multi-tenant services that are not currently backed by X.509 certificates, including Elasticsearch. Open Distro for Elasticsearch supports authentication and access control using JSON Web Tokens (JWT). Similar to application X.509 certificates, Hopsworks’ resource manager (HopsYARN) issues a JWT for each submitted job and propagates it to running containers. Using the JWT, user code can then securely make calls to Elasticsearch indexes owned by the project. The JWT is rotated automatically before it expires and invalidated by HopsYARN once the application has finished.
For every Hopsworks project, a number of private indexes can be created in Elasticsearch: an index for real-time logs of applications in that project (accessible via Kibana), an index for ML experiments in the project, and an index for provenance for the project’s applications and file operations. Elastic indexes are private to the project - they are not accessible by users that are not members of the project. This access control is implemented as follows: when a request is made on Elasticsearch using a JWT token, our authorizer plugin extracts the project-specific username from the JWT token, which is of the form:
The index names have the following form:
Our plugin checks if a project-specific user is allowed to read/write an index by checking that the prefix (ProjectA) of both the user and the index match one another. We plan to add support for sharing elasticsearch indexes between projects by storing a list of projects allowed to perform read and write operations, respectively, on the indexes belonging to a project.
In Hopsworks, services communicate with each other using their own certificate to authenticate and encrypt all traffic. Each service in Hopsworks, that supports TLS encryption and/or authentication, has its own service-specific X.509 certificate, including all services in the ELK Stack (Elasticsearch, Kibana, and Logstash). Service certificates contain the Fully Qualified Domain Name (FQDN) of the host they are installed on and the login name of the system user that the process runs as. They are generated when a user provisions Hopsworks, and they have a long lifespan. Service certificates can be rotated automatically in configurable intervals or upon request of the administrator.
Hopsworks can be integrated with Kubernetes by configuring it to use one of the available authentication mechanisms: API tokens, credentials, certificates, and IAM roles for AWS’ managed EKS offering. Hopsworks can run users’ jobs on Kubernetes that have project-specific security material, X.509 certificates and JWTs, materialized to the launched Pods so user code can securely access services in Hopsworks, such as Open Distro for Elasticsearch. That is, Kubernetes jobs launched from within a project in Hopsworks are only allowed to access those Elasticsearch indexes that belong to that project.
In this post, we gave an overview of Hopsworks project-based multi-tenant security model and how we use Hopsworks projects and JWT tokens to make Open Distro for Elasticsearch a multi-tenant service.
TLDR; Many developers believe S3 is the "end of file system history". It is impossible to build a file/object storage system on AWS that can compete with S3 on cost. But what if you could build on top of S3 a distributed file system with a HDFS API that gives you POSIX goodness and improved performance? That’s what we have done with a cloud-native release of HopsFS that is highly available across availability zones, has the same cost as S3, but has 100X the performance of S3 for file move/rename operations, and 3.4X the read throughput of S3 (EMRFS) for the DFSIO Benchmark (peer reviewed at ACM Middleware 2020).
S3 has become the de-facto platform for storage in AWS due to its scalability, high availability, and low cost. However, S3 provides weaker guarantees and lower performance compared to distributed hierarchical file systems. Despite this, many developers erroneously believe that S3 is the end of file system history - there is no alternative to S3, so just re-write your applications to account for its limitations (such as slow and inconsistent file listings, non atomic file/dir rename, closed metadata, and limited change data capture (CDC) support). Azure has built an improved file system, Azure Data Lake Storage (ADLS) V2, on top of Azure Blob Storage (ABS) service. ADLS provides a HDFS API to access data stored in a ABS container, giving improved performance and POSIX-like goodness. But, until today, there has been no equivalent to ADLS for S3. Today, we are launching HopsFS as part of Hopsworks.
Hierarchical distributed file systems (like HDFS, CephFS, GlusterFS) were not scalable enough or highly available across availability zones in the cloud, motivating the move to S3 as the scalable storage service of choice. In addition to the technical challenges, AWS have priced virtual machine storage and inter-availability zone network traffic so high that no third party vendor could build a storage system that offers a per-byte storage cost close in price to S3.
However, the move to S3 has not been without costs. Many applications need to be rewritten as the stronger POSIX-like behaviour of hierarchical file systems (atomic move/rename, consistent file listings, consistent read-after-writes) has been replaced by weakened guarantees in S3. Even simple tasks, such as finding out what files you have, cannot be easily done on S3 when you have enough files, so a new service was introduced to enable you to pay extra to get a stale listing of your files. Most analytical applications (e.g., on EMR) use EMRFS, instead of S3, which is a new metadata layer for S3 that provides slightly stronger guarantees than S3 - such as consistent file listings.
The journey from a stronger POSIX-like file system to a weaker object storage paradigm and back again has parallels in the journey that databases have made in recent years. Databases made the transition from strongly consistent single-host systems (relational databases) to highly available (HA), eventually consistent distributed systems (NoSQL systems) to handle the massive increases in data managed by databases. However, NoSQL is just too hard for developers, and databases are returning to strongly consistent (but now scalable) NewSQL systems, with databases such as Spanner, CockroachDB, SingleSQL, and MySQL Cluster.
In this blog, we show that distributed hierarchical file systems are completing a similar journey, going from strongly consistent POSIX-compliant file systems to object stores (with their weaker consistency models, but high availability across data centers), and back to distributed hierarchical file systems that are HA across data centers, without any loss in performance and, crucially, without any increase in cost, as we will use S3 as block storage for our file system.
HopsFS is a distributed hierarchical file system that provides a HDFS API (POSIX-like API), but stores its data in a bucket in S3. We redesigned HopsFS to (1) be highly available across availability zones in the cloud and (2) to transparently use S3 to store the file’s blocks without sacrificing the file system’s semantics. The original data nodes in HopsFS have now become stateless workers (part of a standard Hopsworks cluster) that include a new block caching service to leverage faster local VM storage for hot blocks. It is important to note that the cache is a global cache - not a local worker cache found in other vendor’s Spark workers - that includes secure access control to the cache. In our experiments, we show that HopsFS outperforms EMRFS (S3 with metadata in DynamoDB for improved performance) for IO-bound workloads, with up to 20% higher performance and delivers up to 3.4X the aggregated read throughput of EMRFS. Moreover, we demonstrate that metadata operations on HopsFS (such as directory rename or file move) are up to two orders of magnitude faster than EMRFS. Finally, HopsFS opens up the currently closed metadata in S3, enabling correctly-ordered change notifications with HopsFS’ change data capture (CDC) API and customized extensions to metadata.
At Logical Clocks, we have leveraged HopsFS’ capabilities to build the industry’s first feature store for machine learning (Hopsworks Feature Store). The Hopsworks Feature Store is built on Hops Hive and customized metadata extensions to HopsFS, ensuring strong consistency between the offline Feature Store, the online Feature Store (NDB Cluster), and data files in HopsFS.
We compared the performance of EMRFS instead of S3 with HopsFS, as EMRFS provides stronger guarantees than S3 for consisting listing of files and consistent read-after-updates for objects. EMRFS uses DynamoDB to store a partial replica of S3’s metadata (such as what files/directories are found in a given directory), enabling faster listing of files/dirs compared to S3 and stronger consistency (consistent file listings and consistent read-after-update, although no atomic rename) .
Here are some selected results from our peer-reviewed research paper accepted for publication at ACM/IFIP Middleware 2020. The paper includes more results than shown below, and for writes, HopsFS is on-average about 90% of the performance of EMRFS - as HopsFS has the overhead of first writing to workers who then write to S3. HopsFS has a global worker cache (if the block is cached at any worker, clients will retrieve the data directly from the worker) for faster reads and the HopsFS’ metadata layer is built on NDB cluster for faster metadata operations.
*Enhanced DFSIO Benchmark Results with 16 concurrent tasks reading 1GB files. For higher concurrency levels (64 tasks), the performance improvement drops from 3.4X to 1.7X.
**As of November 2020, 3500 ops/sec is the maximum number of PUT/COPY/POST/DELETE per second per S3 prefix, while the maximum number of GET/HEAD requests per prefix is 5500 reads/sec. You can increase throughput in S3 by reading/writing in parallel to different prefixes, but this will probably require rewriting your application code and increasing the risk of bugs. For HopsFS (without S3), we showed that it can reach 1.6m metadata ops/sec across 3 availability zones.
In our paper published at ICDCS, we measured the throughput of HopsFS when deployed in HA mode over 3 availability zones. Using a workload from Spotify, we compared the performance with CephFS. HopsFS (1.6M ops/sec) reaches 2X the throughput of CephFS (800K ops/sec) when both are deployed in full HA mode. CephFS, however, does not currently support storing its data in S3 buckets.
HopsFS is available as open-source (Apache V2). However, cloud-native HopsFS is currently only available as part of the hopsworks.ai platform. Hopsworks.ai is a platform for the design and operation of AI applications at scale with support for scalable compute in the form of Spark, Flink, TensorFlow, etc (comparable to Databricks or AWS EMR). You can also connect Hopsworks.ai to a Kubernetes cluster and launch jobs on Kubernetes that can read/write from HopsFS. You connect your cluster to a S3 bucket in your AWS account or on Azure to a Azure Blob Storage bucket. You can dynamically add/remove workers to/from your cluster, and the workers act as part of the HopsFS cluster - using minimal resources, but reading/writing to/from S3 or ABS on behalf of clients, providing access control, and caching blocks for faster retrieval.
TLDR; Hopsworks 2.0 is now generally available. Hopsworks is the world's first Enterprise Feature Store along with an advanced end-to-end ML platform. This version includes improvements to the User Interface, client APIs and improved integration with Kubernetes. Hopsworks 2.0 also comes with upgrades for popular ML frameworks such as TensorFlow 2.3 and PyTorch 1.6.
In addition to bug and documentation fixes, Hopsworks 2.0 includes the following new features and improvements:
Detailed release notes are available at the Hopsworks GitHub repository.
The Feature Store API is now more intuitive by following a Pandas-like style programming style for feature engineering and works with abstractions such as feature groups and training datasets. Allowing for an improved workflow. Solving the biggest pain points of professional users working with Big Data and AI: being able to organize and discover data with as few steps as possible. An all new API documentation page is accessible here .
This release brings the new ability to enrich Feature Groups abstractions by attaching metadata in the form of tags and then allowing free-text search on these, available across the feature store to all users.
Hopsworks is built from the ground-up with support for implicit provenance. What that means is data engineers and data scientists that develop machine learning pipelines, from data ingestion, to feature engineering, to model training and serving, get lineage information for free as the data flows and is transformed through the pipeline. Users can then navigate from a model served in production back to the ML experiment that created the model along with its training dataset and finally all the way back to the feature groups and ingestion pipelines the training dataset was created from.
Up till this moment in Hopsworks’ journey, its Jobs service provided users with the means to run and monitor applications and do feature engineering using popular distributed processing frameworks such as Apache Spark/PySpark and Apache Flink. Hopsworks now further increases its reach to Feature Store users by enabling them to run such feature engineering jobs by using pure Python programs, detangling them for example from the need to run their code within a PySpark context/cluster. Developers can now do feature engineering with Pandas or their library of choice and still benefit from Jobs capabilities such as include them in Airflow data and ML pipelines as well as run Python Jupyter notebooks as jobs, simplifying the process of including notebooks in development and production pipelines. User guide is available here.
Sharing datasets across projects of a Hopsworks instance is now made simpler by using a fine-grained access control mechanism. Users can now select from a list of options that enable a dataset to be shared with different permissions for each project and for each role users have in the target project (Data Owner, Data Scientist). In addition, these new dataset sharing semantics now make it easier to share feature stores with different permissions across projects. For more details please check the user guide here.
Hopsworks developers can now use GitLab for code version control, in addition to the existing GitHub support. JupyterLab in Hopsworks comes shipped with the Git plugin so developers can work with Git directly from within the JupyterLab IDE. Further details are available in the user-guide here.
Hopsworks 2.0 is shipped with the latest and the greatest frameworks of the data engineering and data science Python world. Every Hopsworks instance now comes pre-installed with Pandas version 1.1.2, TensorFlow 2.3.1, PyTorch 1.6.0 and many more.
A plethora of new feature engineering and machine learning examples are now available at our examples GitHub repository. Data engineers can now follow examples that demonstrate how to work from the Feature Store with Redshift or Snowflake. Also new end-to-end and parallel experiments with deep learning and TensorFlow 2.3 are available.
Visit http://hopsworks.ai to get started with a free instance of Hopsworks and explore the new Hopsworks Feature Store and data science support of the Hopsworks 2.0 series and don’t forget to join our growing community!
TLDR; Hopsworks is now available as a managed platform for Amazon Web Services (AWS) and Microsoft Azure with a comprehensive free tier. Hopsworks is the only currently available cloud-native Enterprise Feature Store and it also includes a Data Science platform for developing and operating machine learning models in production.
You can get started with Hopsworks for free today. We offer free Hops Credits for you to start engineering and managing your features. You have access to all functionality of the Hopsworks Feature Store alongside all main features of Hopsworks platform, including compute elasticity (add/remove workers when needed), storage on S3/Azure Blob Storage, EKS, ECR and GPUs, as well as integrations with external Data Science and Data Lake/Warehouse platforms. Get started now!
Hopsworks is the world's first Enterprise Feature Store for machine learning, including an end-to-end ML platform for developing and operating AI applications. Hopsworks includes a new and improved Feature Store API, integration with Kubernetes, Sagemaker, Databricks, Redshift, Snowflake, Kubeflow, and many more. Hopsworks also comes with upgrades for popular ML frameworks, such as TensorFlow, PyTorch, and Maggy.
All details about the 2.0 release can be found in the release blog post: Hopsworks 2.0: The Next Generation Platform for Data-Intensive AI with a Feature Store.
Hopsworks has a managed platform for AWS and we are pleased to now announce full support for Microsoft Azure. Hopsworks is now cloud-independent, allowing customers to seamlessly move from AWS to Azure or vice-versa, without the fear of being locked in to a single vendor. Enterprises can now manage features for training and serving models at scale on Hopsworks, while maintaining control of their data inside their organisation’s own cloud accounts. Check out how to get started on AWS or Microsoft Azure.
As Hopsworks is a modular platform with integrations to third party platforms, organisations can combine our Feature Store with their existing data stores and data science platforms. Hopsworks natively integrates with Databricks on both AWS and Azure, Amazon Sagemaker, Kubeflow and other ML platforms. Check out how to get started with Databricks on AWS and Azure, Amazon Sagemaker, Kubeflow and other Data and ML platforms.
The general available version of Hopsworks adds support for pay-per-use compute, allowing you to add and remove compute nodes for feature engineering and machine learning jobs as needed to scale with your demand and minimize cost when not needed. We support different types of instances allowing you add or remove exactly the right amount of resources (CPU, GPU, memory) that you need to run your application. You can, for example, add GPU nodes to run a large-scale distributed training job while at the same time adding or removing non-GPU nodes for feature engineering workloads. Hopsworks supports running jobs on a wide variety of frameworks: Spark, Python, Flink, TensorFlow, PyTorch, Scikit-Learn, XGBoost, and more.
Hopsworks on AWS added support for using GPU instances for running ML workloads, scaling to multiple nodes and GPUs. GPU nodes can be added or removed at any point in time to scale with your workload and safe cost when not needed. In addition, you can utilize EKS for GPU-accelerated workloads in notebooks and Python jobs.
The Feature Store stores its historical data on HopsFS (a next generation HDFS) that now uses both S3 and Azure Blob storage as its backing store, reducing costs and enabling storage to scale independently of the size of a Hopsworks cluster. As a customer, you only have to provide your bucket and an IAM role/Managed identity with sufficient rights when creating a cluster. All data is kept private and stored inside your cloud account. HopsFS provides both performance improvements (3.4X read throughput vs S3), POSIX-like semantics, and open metadata compared to S3.
Hopswoksi on AWS integrates with EKS and ECR allowing you to run applications as containers on EKS from Hopsworks. Some services in Hopsworks are offloaded to EKS, such as running Jupyter/JupyterLab notebooks and model serving workloads. Hopsworks users run applications on EKS that can securely callback and use services in Hopsworks. In particular, there is special support for Python programs, where Data Scientists can install python libraries in their Hopsworks project, and Hopsworks transparently builds the Docker image that is run as a Python job on Kubernetes. Even Jupyter notebooks on Hopsworks can be run and scheduled as jobs on Kubernetes, without users having to write and manage Dockerfiles. An additional benefit of EKS is that the cluster can be set up with autoscaling allowing your workloads to scale dynamically with demand. Check out Integration with Amazon EKS and Amazon ECR to get started with EKS for Hopsworks.
Hopsworks supports data-parallel feature engineering jobs in Spark, PySpark, and Flink, as well as pure Python feature engineering with Pandas, Feature Tools, and any Python library. Feature engineering pipelines typically end at the Feature Store, and can be orchestrated using Airflow on Hopsworks (or an external Airflow cluster). Hopsworks has additional support for data validation with Great Expectations and Deequ. After the Feature Store, ML pipelines create train/test datasets, train models, analyze, validate and deploy models. Hopsworks provides experiment tracking, Tensorboard, distributed ML (with open-source Maggy), model analysis with the What-if-Tool. Again, ML training pipelines can be orchestrated using Airflow on Hopsworks (or an external Airflow cluster).
Enterprise Hopsworks now enables fine-grained access to AWS resources using AWS role chaining (federated IAM roles). Administrators can now define a mapping between AWS roles and project-specific user roles in Hopsworks. This enables both fine-grained access and auditing of users when they use AWS services from Hopsworks. Hopsworks users can either set a default role to be assumed by their jobs and notebooks to enable access to their AWS resources, or assume a role manually from within their programs by using the Python and Java/Scala SDKs (guide).
The enterprise version of Hopsworks supports fully-automated backups and upgrades at the click of a button, making the operation of Hopsworks cluster a non-issue for your DevOps team.
The enterprise version of Hopsworks supports single sign-on to Hopsworks clusters with Active Directory/LDAP integration. This enables your enterprise to manage access to the Feature Store and Hopsworks with your existing enterprise Identity Provider. Hopsworks also supports OAuth-2 as an Identity Provider.
Hopsworks supports billing organizations and user management, enabling your teams to share access to your Hopsworks account.
We are continuously working on improving the Hopsworks experience. Some of the upcoming features are: