When you invest money in machine learning (ML), you typically start by investing in people. You hire data scientists, data engineers, and ML engineers to transform your data into insights that can help you both reduce costs and increase revenue. However, if you do not manage the ML assets you create (the feature engineering jobs, the feature data, the models, and the CI/CD pipelines), the cost of each ML project will remain roughly constant - every new project will start over from scratch and your ML readiness will grow more slowly than that of the leading companies in your field. That is because the leading ML companies have all invested in building a data platform for ML (aka a feature store).
In this blog, we present a cost-benefit analysis of a feature store for ML, identifying some of the cost reductions and productivity improvements it brings:
At the end of this blog, we provide our Feature Store ROI Calculator so you can estimate the return on investment of integrating a feature store in your ML development and operations.
The cost of performing the first few ML projects will not be substantially reduced if a company has a feature store. However, as more engineered features become available in the feature store, they can be reused by different teams in many different ML pipelines. As it has been estimated that 80% of the effort of ML projects is feature engineering, the reuse of features leads to substantial reductions in the cost of both developing and maintaining ML projects. With a well-populated feature store, organizations can expect to be able to productionize many more models at much reduced cost with fewer data scientists.
As 80% of the effort of ML projects is typically feature engineering, the availability of ready-made features in the feature store enables organizations to release models in significantly less time than if no feature store is available. The feature store also saves time by reducing the need for exploratory data analysis (EDA), as feature distributions and descriptive statistics are precomputed and available in the feature store. On top of this, there is an improved division of labor: data engineers, who are more skilled at writing feature pipelines that ingest and transform raw data from backend databases, data warehouses, and data lakes, take over that work, freeing data scientists to develop more and better models.
A feature store gives an immediate 50% reduction in the cost of maintaining feature engineering pipelines for online applications, as only one feature pipeline is needed to fill both the online and offline feature stores, not two. Without a feature store, features are computed (and often implemented) twice: once to serve features to the online application (performing model inference) and once to build train/test datasets for training models. Without a feature store, you can expect increased operational costs to ensure the consistency of both implementations of the features (serving and training). This consistency problem is technical debt that can be paid down ahead of time by having a feature store.
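The training/serving consistency problem can be sketched in a few lines of pure Python: with a feature store, one transformation function feeds both the offline (training) path and the online (serving) path, so the two implementations can never drift apart. The function and data below are illustrative, not the Hopsworks API.

```python
import math

def engineer_features(raw: dict) -> dict:
    """One shared transformation, used by BOTH the offline and online paths."""
    return {
        "amount_log": math.log1p(raw["amount"]),
        "is_weekend": int(raw["weekday"] >= 5),
    }

# Offline path: build training rows from historical records.
history = [{"amount": 120.0, "weekday": 6}, {"amount": 15.0, "weekday": 2}]
training_rows = [engineer_features(r) for r in history]

# Online path: the SAME function computes features at inference time,
# so training/serving skew cannot creep in.
online_row = engineer_features({"amount": 120.0, "weekday": 6})
```

Without a feature store, the offline path (often Spark/SQL) and the online path (often application code) each implement `engineer_features` independently, and keeping them consistent is the operational cost described above.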
“Data is biased... But learning algorithms themselves are not biased... Bias in data can be fixed.”
When features have not been battle-tested and validated, there is a risk that features will either reveal sensitive information or models will introduce biased predictions (for example, predictions on some slices of the data will perform differently from others). Models that produce different prediction results based on the race or gender of users are particularly high-risk for consumer companies.
A feature store, integrated into an ML pipeline, can provide early warning of anomalies in training and serving data. One mechanism is to automate the identification and notification of feature drift - anomalous changes in the values or distribution of feature values. The feature store also enables data scientists to more easily build more extensive experiments, analyzing models and linking the performance or bias of models to individual features from the feature store.
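One simple mechanism for this kind of feature-drift warning is a two-sample Kolmogorov-Smirnov test that compares a feature's distribution at training time with its distribution in live serving traffic. The sketch below is a minimal pure-Python illustration; a production system would use a statistics library and calibrated alert thresholds.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the two empirical CDFs (0 = identical distributions)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

baseline = [0.1, 0.2, 0.2, 0.3, 0.4]   # feature values seen at training time
serving = [0.8, 0.9, 0.9, 1.0, 1.1]    # feature values seen in live traffic
drift_alert = ks_statistic(baseline, serving) > 0.5  # illustrative threshold
```

Because the feature store already holds precomputed feature statistics, the baseline distribution is available without re-reading the training data.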
The Hopsworks Feature Store enables teams to work effectively together, sharing outputs and assets at all stages in ML pipelines. In effect, our Hopsworks Feature Store:
To summarize, a way to look at the value that a feature store can bring is shown in the table below.
You can download our Feature Store ROI Calculator spreadsheet or contact our sales team with the form below to have an estimated ROI if you integrate a feature store into your ML pipelines.
TLDR; Feature stores are the new cool kids in the neighbourhood of data engineering and AI (artificial intelligence). Hyperscale AI companies (such as Uber and Netflix) have built their own feature stores to solve the problems of reusing, governing, and securing access to features (data for AI) in a shared platform. Hopsworks is a modular open-source platform, developed by Logical Clocks, for managing data for AI (a standalone Feature Store), computing features (Spark, Python), and training models. In this post, we introduce the project-based multi-tenancy security model in Hopsworks for users, data, and programs. We describe how our project-based multi-tenant security model is, in effect, a form of dynamic role-based access control with zero performance overhead.
Hopsworks.ai is a SaaS version of Hopsworks, currently available on AWS. Hopsworks clusters can be run with an IAM profile, providing them with an identity in AWS with permission policies that capture what operations the Hopsworks cluster is authorized to perform in AWS - such as read/write data in S3 buckets. Hopsworks can also be used from Data Science and Feature Engineering platforms (Databricks, Sagemaker, KubeFlow, EMR) using API keys exported from Hopsworks.
This post is concerned primarily with the internal security model in Hopsworks that enables you to host sensitive data in a shared cluster, providing powerful access control and self-service capabilities. The benefit of Hopsworks' project-based multi-tenancy model is that you can host many users and feature stores (and other projects) on a single cluster, with self-service access to different feature stores. The advantage of our security model is that you can host production, staging, and development feature stores in a single cluster - you do not need to manage and pay for separate clusters.
Role-based access control (RBAC) is a well-known security model that enables administrators to give a group of users the same access rights to selected resources. With roles, an administrator at a company could define a single security policy and apply it to all members of a department. But individuals may be members of multiple departments, so a user might be given multiple roles. Dynamic role-based access control means that, based on some other policy, you can change the set of roles a user can hold at a given time. For example, if a user has two different roles - one for accessing banking data and another for accessing trading data - dynamic RBAC could restrict the user to holding only one of those roles at a given time. The policy for deciding which role the user holds could, for example, depend on what VPN (virtual private network) the user is logged in to or what building the user is present in. In effect, dynamic roles would allow her to hold only one of the roles at a time and sandbox her inside one of the domains - banking or trading. It would prevent her from cross-linking or copying data between the different trading and banking domains.
Hopsworks implements a dynamic role-based access control model through a project-based multi-tenant security model. Inspired by GDPR, in Hopsworks a Project is a sandboxed set of users, data, and programs (where data can be shared in a controlled manner between projects). Every Project has an owner with full read-write privileges and zero or more members.
A project owner may invite other users to his/her project as either a Data Scientist (read-only privileges and run jobs privileges) or Data Owner (full privileges). Users can be members of (or own) multiple Projects, but inside each project, each member (user) has a unique identity - we call it a project-user identity. For example, user Alice in Project A is different from user Alice in Project B - (in fact, the system-wide (project-user) identities are ProjectA__Alice and ProjectB__Alice, respectively). As such, each project-user identity is effectively a role with the project-level privileges to access data and run programs inside that project. If a user is a member of multiple projects, she has, in effect, multiple possible roles, but only one role can be active at a time when performing an action inside Hopsworks. When a user performs an action (for example, runs a program) it will be executed with the project-user identity. That is, the action will only have the privileges associated with that project. The figure below illustrates how Alice has a different identity for each of the two projects (A and B) that she is a member of. Each project contains its own separate private assets. Alice can use only one identity at a time which guarantees that she can’t access assets from both projects at the same time.
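The project-user identity scheme described above can be sketched in a couple of lines; the naming convention is the one given in the text, while the function name is illustrative.

```python
def project_user_identity(project: str, username: str) -> str:
    """System-wide identity: one distinct identity per (project, user) pair."""
    return f"{project}__{username}"

# Alice holds two possible roles, but only one can be active at a time,
# so she can never access assets from both projects simultaneously.
alice_in_a = project_user_identity("ProjectA", "Alice")
alice_in_b = project_user_identity("ProjectB", "Alice")
```

Because the two identities are distinct principals to every backend service, sandboxing falls out of ordinary per-user access control rather than requiring a separate policy engine.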
An important aspect of Project based multi-tenancy is that assets can be shared between projects - sharing does not mean that data is duplicated. The current assets that can be shared between projects are: files/directories in HopsFS, Hive databases, feature stores, and Kafka topics. For example, in the figure below there are three users (User1, User2, and User3) and two projects (A and B). User1 is a member of project A, while User2 and User3 are members of project B. All three users (User1, User2, User3) can access the assets shared between project A and project B. As sharing does not mean copying, the access control rules for the asset are updated to give users in the other project read or write permissions on the shared asset.
As we will see later on, project-user identity is based on an X.509 certificate issued internally by Hopsworks. Access control policies, however, are implemented by the platform services:
When a user authenticates with Hopsworks, they are logged into the platform with a Hopsworks user identity. This user identity is needed to be able to construct the project-user identity - it is the “user” part of the project-user identity. In Hopsworks, a user-identity is mapped to a global Hopsworks role (independent of project membership) - a normal user or an administrator. A normal user can search for assets, update her profile, generate API keys, and change to/from projects. Administrators have access to user, project, storage, and application management pages, system monitoring and maintenance services. They can activate or block users, delete Projects, manage Project quotas, promote normal users to administrators, and so on. It’s important to mention here that a Hopsworks administrator cannot view the data inside a project - even if they are allowed to delete a project.
A user interacts with Hopsworks through the web application and they don’t necessarily realize that the web application is a facade to a modular distributed system. In the background we run HopsFS - our next-generation HDFS-on-S3 filesystem, HopsYARN - a cluster management system and scheduler, Apache Hive, Elasticsearch, (optionally Kafka and Airflow), and other logging and monitoring services.
A fundamental principle in every distributed system is that processes exchange messages over the network or through shared state (such as a filesystem or database). When communication is performed by message passing, it is imperative that we protect the messages from adversaries reading or modifying their content, and that we validate the identity of the caller. Hadoop traditionally uses Kerberos and GSSAPI to authenticate and authorize users and to encrypt data in transit. While Kerberos (Active Directory) is widely adopted by large organizations, the administration of users and services is a painful process and the system does not scale. On top of that, the Kerberos APIs are so complicated that programming against them can be really challenging.
To avoid the pain of Kerberos (and be able to natively integrate with platforms like Kubernetes), we completely re-designed the security model for HopsFS and YARN to use certificates. We replaced Kerberos with Public Key Infrastructure (PKI) with X.509 certificates to authenticate and authorize users. Certificates enabled us to also use the well established TLS protocol to provide confidentiality and data integrity. Every user and every service in a Hopsworks cluster has a private key and an X.509 certificate.
Hopsworks supports a number of stateful and compute services that use X.509 certificates to authenticate users, applications, and services: HopsFS, HiveServer2, Kafka, YARN. These services all provide their own authorization schemes. We unified HopsFS and Hive’s authorization models by providing 2-way TLS in HiveServer2 and storage based authorization for the Hive metastore, that we ported to Hive 3.X, to delegate access control decisions for Hive to HopsFS. In Hive, tables and databases store their data files inside directories on HopsFS, so HopsFS ACLs (access control lists) authorize file system operations by Hive (read from tables, write to tables). The easy-to-understand ACLs that we expose in Hopsworks (for Hive, and datasets in HopsFS) are captured in two roles: Data Scientists can read, Data Owners can read/write. HopsFS ACLs can be customized directly in Hopsworks from version 1.4.
For Kafka, we developed a Hopsworks Authorizer plugin that authorizes operations on Kafka topics by extracting the project-user identity from the client supplied X.509 certificate. The Hopsworks Kafka Authorizer then validates that the user is a member of the project that has permissions to perform the requested action on the Kafka topic.
HopsYARN uses X.509 certificates to identify users. HopsYARN also creates and manages (including renewal) application certificates for YARN applications (such as Spark jobs). Application certificates in HopsYARN are a key feature missing currently in Kubernetes - because each application has an identity, you can track and log its actions (reading/writing files/topics/databases/etc). This is incredibly valuable for machine learning pipelines, where it enables Hopsworks to automatically gather provenance information for models, notebooks, and train/test datasets (see our USENIX OpML ‘20 paper for more details) - enabling easy reproduction of models.
Hopsworks projects also support two other multi-tenant services that are not currently backed by X.509 certificates: Elasticsearch and the Online Feature Store (MySQL Cluster (NDB)).
Hopsworks includes the Open Distro for Elasticsearch that supports authentication and access control using JSON Web Tokens (JWT). For every Hopsworks project, a number of private indexes are created in Elasticsearch: an index for real-time logs of applications in that project (accessible via Kibana), an index for ML experiments in the project, and an index for provenance for the project’s applications and file operations. Currently, we do not support sharing Elasticsearch indexes across projects - they are private to the project. MySQL Cluster is the online feature store in Hopsworks, and details on how we provide multi-tenant access to the online feature store using user credentials are provided later in this post.
As we saw in the previous section, most multi-tenant services in Hopsworks are built on X.509 certificates. Hopsworks comes with its own root Certificate Authority (CA) for signing certificates internally. The Hopsworks root CA does not sign requests directly; instead, it uses an intermediate CA to do so. To protect against a security breach, more than one intermediate CA can be made responsible for a specific domain. As shown in the above figure, there is one intermediate CA for creating API and Kubelet certificates that can be used to access an external Kubernetes cluster (more details on integration with Kubernetes are provided later in this post) and another Hopsworks intermediate CA for user, application, and service certificates inside Hopsworks.
The Hopsworks intermediate CA supports three kinds of certificates: user, application, and service certificates. They are all signed by the same Hopsworks intermediate CA but have different lifespans and attributes. In the following sections, we dig deeper into how they are issued, their lifecycle, and how they are used.
When a user in Hopsworks becomes a member of a project (either when they create their own project or are added as a member to a project), a new user is created behind the scenes. This new user uniquely identifies a user belonging to a specific Project, a project-user ID. The username of this project-user is in the form of PROJECTNAME__USERNAME.
For each project-user, we automatically generate an X.509 certificate and a private key. The certificate carries the user’s username in the Common Name field of the X.509 Subject to authenticate the user.
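A hypothetical sketch of this step using the widely used Python `cryptography` library is shown below: a certificate is issued whose Common Name carries the project-user identity. For brevity the certificate is self-signed here, whereas in Hopsworks it would be signed by the intermediate CA.

```python
from datetime import datetime, timedelta

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# Issue a certificate for the project-user ProjectA__alice (illustrative name).
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
subject = issuer = x509.Name(
    [x509.NameAttribute(NameOID.COMMON_NAME, "ProjectA__alice")]
)
cert = (
    x509.CertificateBuilder()
    .subject_name(subject)
    .issuer_name(issuer)          # self-signed in this sketch
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.utcnow())
    .not_valid_after(datetime.utcnow() + timedelta(days=1))
    .sign(key, hashes.SHA256())
)

# Any backend service can recover the project-user identity from the CN.
cn = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[0].value
```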
Certificates and private keys are stored in the database and pulled when needed to perform RPCs or REST calls (a caching mechanism is in place to avoid round-trips to the database). All private keys are stored encrypted in the database. User certificates are used for all operations that are not executed within the context of an application but still access data managed by Hopsworks - for example, when a user previews a dataset in the UI or submits an application to the scheduler.
In most cases, users perform actions within the context of an application. A user will launch a job which will read/write data from/to HopsFS, negotiate for more resources in the cluster, write to the feature store, or produce to a Kafka topic. To support fine-grained access control, lineage, and provenance, we issue a new set of cryptographic material for every application. The application certificate contains the username of the user who submitted the job and a unique identifier for the current application. With this information we are now able to log information about who created or modified a specific file and with what application, or in machine learning, we can infer which application read/wrote this feature data.
Application certificates are transparent to the user. HopsYARN is responsible for their lifecycle. When a new application is submitted to the cluster, a new X.509 certificate is generated and signed by the Hopsworks intermediate CA. The cryptographic material is shipped automatically to all containers and revoked once the application has finished. To minimize the attack window, HopsYARN periodically rotates certificates: it generates a new pair, ships it to already running containers, and revokes the previous set.
Finally, the last type of Hopsworks certificate is the one used by the services themselves. Services communicate with each other using their own certificates to authenticate and encrypt all traffic. Each service in Hopsworks that supports TLS encryption and/or authentication has its own service-specific certificate. Service certificates contain the Fully Qualified Domain Name (FQDN) of the host they are installed on and the login name of the system user who is supposed to run the process. They are generated when a user provisions Hopsworks - either with a fresh on-prem installation or by spinning up an instance on hopsworks.ai - and in general they have a long lifespan. Service certificates can be rotated automatically at configurable intervals or on request of the administrator. The following services in Hopsworks have their own X.509 certificates for encrypting their network traffic:
The following services in Hopsworks support two-way X.509 certificates for both encrypting their network traffic and client-side authentication:
So, now that we’ve outlined the building blocks of our security model, let’s see how we actually protect all data in transit. All backend services exchange messages with each other, and both servers and clients require all communication to be done over TLS with two-way authentication - that is, both parties must present a certificate.
The server will check that the client’s certificate is still valid and trusted. It will also validate the peer’s certificate against a Certificate Revocation List and, if the certificate has been revoked, it will drop the connection. As a last line of protection for project-user and application certificates, the username encoded in the message - the effective user performing an action - is validated against the Common Name field of the X.509 certificate. That way, a rogue user cannot impersonate other users. In server-to-server communication, service certificates are used. Remember the FQDN in the certificate? The receiver does a DNS lookup (the result is cached) and the answer must match the incoming IP address, so an adversary cannot add a rogue node to the cluster. Finally, the Locality (L) field of a service X.509 certificate must also match the username of an incoming RPC.
Hopsworks Certificate Authority keeps a list of revoked certificates. Each time a project is deleted, or a user is removed from a project or a job finishes or when a certificate is rotated, Hopsworks updates its CRL and digitally signs it. All backend services periodically fetch the CRL and refresh their internal data structures so that connections with revoked certificates will be dropped.
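The two checks described above - revocation against the CRL and matching the claimed user against the certificate's Common Name - can be sketched as follows. The set and function names are illustrative, not the actual Hopsworks internals.

```python
# Stand-in for the CRL that backend services periodically fetch from the CA.
REVOKED_CNS = {"ProjectB__bob"}

def accept_rpc(effective_user: str, cert_common_name: str) -> bool:
    """Drop the connection if the certificate has been revoked, or if the
    username encoded in the message does not match the certificate's CN."""
    if cert_common_name in REVOKED_CNS:
        return False  # revoked certificate: connection dropped
    return effective_user == cert_common_name  # no impersonation

accepted = accept_rpc("ProjectA__alice", "ProjectA__alice")      # True
impersonation = accept_rpc("ProjectA__mallory", "ProjectA__alice")  # False
revoked = accept_rpc("ProjectB__bob", "ProjectB__bob")           # False
```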
In Hopsworks.ai, HopsFS data can be stored in a bucket in S3. S3 has the advantages of lower cost, but it also provides encryption-at-rest for files stored in buckets. For on-premises HopsFS, data can be stored encrypted-at-rest using ZFS native encryption support. Hopsworks can centrally manage encryption keys for ZFS pools (volumes), so that encryption keys are not stored on the storage hosts. As services in Hopsworks run in user-mode in Linux, the platform can be configured to ensure that administrators are not able to read file data.
Hopsworks and Elasticsearch (Open Distro) use JWT for authentication. Similar to application X.509 certificates, HopsYARN issues a token for each submitted job and propagates it to running containers. User code can then securely make REST calls to Hopsworks API or to Elasticsearch indexes owned by the project. The JWT is rotated automatically before it expires and invalidated once the application has finished.
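To make the JWT mechanics concrete, here is a minimal HS256 JWT issued and verified with only the Python standard library. This is a sketch of how JWTs work in general, not the Hopsworks implementation; the secret and claims are invented for illustration.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_jwt(payload: dict, secret: bytes) -> str:
    """Build a signed header.payload.signature token (HS256)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token: str, secret: bytes) -> dict:
    """Check the signature and the exp claim; raise ValueError on failure."""
    header, body, sig = token.split(".")
    signing_input = f"{header}.{body}".encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    payload = json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload

secret = b"cluster-master-secret"  # illustrative only
token = issue_jwt({"sub": "ProjectA__alice", "exp": time.time() + 3600}, secret)
claims = verify_jwt(token, secret)
```

Rotation, as described above, simply means issuing a fresh token before `exp` is reached and invalidating the old one server-side.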
A general overview of the security architecture of Hopsworks with all the artifacts we’ve discussed so far is depicted in the figure below.
When an online feature store is enabled in a project in Hopsworks, a new database is created in MySQL Cluster NDB to store the online features. The credentials for accessing that database are created by Hopsworks and securely stored in the database, encrypted by a master key. Clients of the online feature store (such as external online applications that make queries on machine learning models) use a REST request to retrieve their feature store credentials with which they can read data directly from MySQL Cluster using a JDBC/TLS connection. Credentials can be rotated in MySQL Cluster.
While Hopsworks has its own security model, it needs to integrate with the security models of the environments in which it is used.
In our Hopsworks SaaS platform, www.hopsworks.ai, a Hopsworks cluster can be given an IAM role - an identity in AWS with permission policies that capture what operations the Hopsworks cluster is authorized to perform in AWS. In Hopsworks.ai, you can start a cluster by selecting an instance profile for it in the UI - the instance profile is a container for the IAM role. Users can create their own instance profile which suits their needs and security requirements. It is also possible to use AWS keys in Hopsworks.ai to define connectors to external AWS services. With either the instance profile or AWS keys, applications (Python, PySpark, Spark, and Flink) can natively read/write to services in AWS, such as S3. Similar to Hopsworks.ai SaaS platform, Enterprise Hopsworks on AWS can also be configured to run clusters with an instance profile.
Hopsworks Enterprise supports single sign-on with Kerberos, LDAP, and OAuth2 identity providers. If you deploy Hopsworks in an environment where users use Active Directory (Kerberos), LDAP, or OAuth2 for single sign-on, the authentication plugins available in Enterprise Hopsworks (SPNEGO for Kerberos, and OpenID for OAuth2) will automatically log users in when they navigate to the Hopsworks UI.
Hopsworks can be integrated with Kubernetes by configuring it to use one of the available authentication mechanisms: API tokens, credentials, certificates, and IAM roles for AWS’ managed EKS offering. Hopsworks can offload to Kubernetes some of its microservices (Jupyter notebook instances, model serving) and run users’ jobs on Kubernetes. Project specific security material such as X.509 certificates and JWTs will be propagated to launched Pods so user code can access services in Hopsworks: HopsFS, Hive, Elasticsearch, and the Hopsworks REST API.
The Hopsworks installer gives you the option to also install the open-source Kubernetes distribution alongside Hopsworks. This is convenient for environments with no existing Kubernetes installation or for testing and development. In this case, our KubernetesCA intermediate CA will issue all Kubernetes internal and etcd certificates.
Programs can also authenticate to the Hopsworks REST API using JWT or API keys. The former expire after a short period of time and the latter can be issued with limited scope, e.g., access only the Feature Store API. As such, API keys are typically used for external access to Hopsworks Feature Store by platforms such as Databricks, AWS Sagemaker, KubeFlow, and AWS EMR.
In this post, we gave an overview of how Hopsworks tackles information security and introduced our novel project-based multi-tenant security model and the multi-tenant services supported in Hopsworks. Our security model is based primarily on X.509 certificates, which come in three types: project-user certificates, application certificates, and service certificates. We also described how we address encryption in transit with TLS, the authentication methods for our web front-end based on JWT and API keys, and integration with third-party services.
TLDR; Applied machine learning in the automotive industry is commonly accepted and utilised to create new intelligent products and optimised ways of working. The amount of data that connected cars are producing is massive. This data and other automotive data can be used to build models that predict, e.g., when maintenance needs to take place, or to classify driver behaviour. This blog introduces the feature store as a new element in automotive machine learning (ML) systems and as a new data science tool and process for building and deploying better ML models in the automotive industry.
Changes in consumer behaviour and technology are disrupting traditional modes of operation. To succeed, carmakers, dealers, and other automotive ecosystem companies must adapt quickly to the changing environment, embracing challenges and opportunities by exploiting the power of data.
The current generation of vehicles are software-enabled, data-generating, connected devices, opening up opportunities for new (data) products and services. Automotive data science isn’t just about self-driving cars. Data science and machine learning technologies can help keep carmakers competitive by improving everything from research and design to manufacturing and marketing processes.
Automotive industry players have to innovate with data management to get the biggest bang for their buck from the data generated by their vehicles and customers. Acquiring, unifying, and gaining insight from data is a vital part of this innovation process. The Internet of Things and connected systems will have a significant influence on automotive innovation. Recent research says that by 2025, roughly 470 million cars will be collecting data from sensors and making it available online (PwC report, p. 9). In addition, research suggests that the availability of connected systems will be vital in winning the Millennial market and retaining its loyalty (Cars 2025, Goldman Sachs, trend 6, “The Internet of Cars”).
Given these trends, the automotive industry needs to have a data strategy for connected cars. The richest data set for vehicle-specific data is recorded on the CANBUS, and the automakers have the easiest access to that data. This access puts automakers in the best position to decide who can utilise the data and how.
New machine learning solutions are being implemented in the automotive industry each year. In the “Future Automotive Industry Structure — FAST 2030” report, seven trends are mentioned (see figure 1) to shape the automotive vertical by 2030, such as self-driving, connected vehicles, and e-mobility. Most of these trends will have elements of applied machine learning enabling the solutions that predict, recommend and classify.
Source: Oliver Wyman analysis from “Future Automotive Industry Structure — FAST 2030” report
To determine the place of machine learning in the automotive industry, we’ve listed some use cases that apply machine learning in products and solutions:
We’re used to cars notifying us with alerts and lights to check the engine or oil. Connected vehicles can do more than that: machine learning models monitor all the sensors, detecting potential problems before they occur, so drivers can get a prescriptive alert of what is going on with their vehicles.
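As a toy stand-in for such a predictive-maintenance model, the sketch below flags a sensor reading that deviates strongly from its recent history using a rolling z-score. Real systems would use learned models over many sensors; the trace, window, and threshold here are invented for illustration.

```python
import statistics

def alert_on_anomaly(readings, window=5, threshold=3.0):
    """Return the indices of readings that deviate from the preceding
    window by more than `threshold` standard deviations."""
    alerts = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mu = statistics.mean(recent)
        sigma = statistics.pstdev(recent) or 1e-9  # avoid divide-by-zero
        if abs(readings[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts

# Stable oil-temperature trace with one sudden spike at index 8.
trace = [90.0, 90.5, 89.8, 90.2, 90.1, 90.3, 89.9, 90.0, 140.0, 90.1]
alerts = alert_on_anomaly(trace)
```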
Autonomous vehicles are a much-discussed topic in the automotive industry. Most manufacturers have announced timelines for introductions of self-driving cars. Providing vehicles with artificial intelligence could make them smart enough to become driverless.
Tesla is in the lead in developing self-driving cars, and Waymo by Google has been tested on US roads for a while now. A large advantage that AI provides autonomous vehicles is the ability to learn and adjust based on new data. Moreover, all the data a vehicle collects is available to the rest of the fleet, creating network effects: as more cars collect more data, the AI used for self-driving can be trained to be more accurate. The limiting factor here is the ability to work with increasing volumes of data and to develop an AI platform that scales to process those massive volumes.
As it may still take a while before autonomous vehicles arrive, a more popular AI feature to use now is driver assistance. Mercedes-Benz and others have introduced their driver assistance packages and implement them in their newest vehicles to improve the driver’s experience.
Trying to predict the future is vital to insurance companies. Embracing AI technology improves that capability by enabling risk assessments in real time and, for example, letting customers file claims as soon as accidents occur.
The combination of insurance companies and machine learning technology has created Insurtech. Insurance companies want access to speed, acceleration, and navigation data to provide more accurate premium estimates for individual users. Usage-based insurance ML technology creates drivers’ risk profiles based on individual risk factors and then predicts drivers’ behaviour based on previous actions.
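A driver risk profile of this kind is, at its simplest, a weighted score over engineered telematics features. The sketch below is purely illustrative: the feature names and weights are invented, not taken from any real actuarial model.

```python
def risk_score(features: dict) -> float:
    """Toy usage-based-insurance score from telematics features.
    Higher means riskier; weights are illustrative only."""
    return (
        0.5 * features["hard_brakes_per_100km"]
        + 0.3 * features["pct_night_driving"]
        + 0.2 * features["avg_speed_over_limit_kmh"]
    )

safe_driver = {
    "hard_brakes_per_100km": 1,
    "pct_night_driving": 5,
    "avg_speed_over_limit_kmh": 0,
}
risky_driver = {
    "hard_brakes_per_100km": 12,
    "pct_night_driving": 40,
    "avg_speed_over_limit_kmh": 15,
}
```

In a feature store setting, features like `hard_brakes_per_100km` would be computed once from CANBUS data and reused by both the pricing model and, say, a driver-coaching model.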
Successful applied ML plays a vital role in making new initiatives and automotive data product programs a success. Data engineering is a vital and time-consuming part of this process and has to solve the complex challenge of working with local, global, and event-based data that will be consumed from different data sources and with varying cadences. Instead of increasing data engineering headcount to solve these complex problems, an alternative solution is to start sharing data features across multiple models and across multiple teams and lines of business.
Depending on where you are in your journey of managing data pipelines, integrating an ML platform, and successfully taking ML models into production in intelligent apps, data complexity and the related risks of technical debt and inefficiency will increase.
Looking at best practices of hyperscale AI companies such as Uber, Twitter, and Airbnb - they all had a common need to build a feature store, a central repository/data warehouse for machine learning. The feature store facilitates an operating model that accelerates ML-projects by making ML-features reusable, cost-effective, verifiable, governed, and searchable.
To give an idea of possible pain points in your current operations, we illustrate what happens during the model development stage of a typical ML-model lifecycle. The data scientist will build common features and features that are specific to the model. This process can cause a couple of pain points, such as:
A feature store facilitates the reuse of features across the organisation, as features are visible by search to all potential users in multiple business domains. Model development and model serving teams can discover, store, and manage their diverse feature sets. The table below highlights automotive models where common features are created and used in different models.
Uber, Twitter, Airbnb, and others that spearheaded the development of the feature store for training and serving their models have made this data layer for AI the basis for their industrialised ML deployments.
Deciding between building and buying will depend on your strategy. We recently published an article, “How to build your own Feature Store”, which we recommend if you are interested in building one yourself. In general, though, one could say: “build what differentiates you, and buy what accelerates your ML projects”. In presentations from companies that have built their own feature stores, we have seen that feature store projects can take a year or longer to complete.
If you choose to buy an enterprise-ready feature store, it is vital to work out how to integrate the feature store into your data pipelines and current machine learning operations. We wrote another article, “MLOps with a Feature Store”, that will help you understand some of the challenges of automating and integrating machine learning into your operational IT environments.
The Hopsworks Feature Store enables teams to work effectively together, sharing outputs and assets at all stages in machine learning (ML) pipelines. In effect, our Feature Store:
In addition, Logical Clocks provides a data processing and model training platform that makes it easy to do deep learning with very large datasets at scale on tens or hundreds of GPUs in parallel. We have a number of automotive customers using Hopsworks on-premise and public cloud for their deep learning projects related to autonomous vehicles.
Hopsworks is a product from Logical Clocks AB, a specialist in data for AI and large-scale distributed deep learning, funded by leading European VCs such as Inventure and Frontline Ventures. The company has offices in Stockholm, London, and Palo Alto.
Read more about the Hopsworks Feature Store or contact our sales team with the form below.
This article was originally published at Extreme Earth.
In recent years, unprecedented volumes of data have been generated in various domains. Copernicus, a European Union flagship programme, produces more than three petabytes (PB) of Earth Observation (EO) data annually from Sentinel satellites. This data is made readily available to researchers, who use it, among other things, to develop Artificial Intelligence (AI) algorithms, in particular using Deep Learning (DL) techniques suitable for Big Data. One of the greatest challenges researchers face, however, is the lack of tools that can help them unlock the potential of this data deluge and develop predictive and classification AI models.
ExtremeEarth is an EU-funded project that aims to develop use-cases that demonstrate how researchers can apply Deep Learning in order to make use of Copernicus data in the various EU Thematic Exploitation Platforms (TEPs). A main differentiator of ExtremeEarth to other such projects is the use of Hopsworks, a Data-Intensive AI software platform for scalable Deep Learning. Hopsworks is being extended as part of ExtremeEarth to bring specialized AI tools for EO data and the EO data community in general.
Hopsworks is a Data-Intensive AI platform which brings a collaborative data science environment to researchers who need a horizontally scalable solution to developing AI models using Deep Learning. Collaborative means that users of the platform get access to different workspaces, called projects, where they can share data and programs with their colleagues, hence improving collaboration and increasing productivity. The Python programming language has become the lingua franca amongst data scientists and Hopsworks is a Python-first platform, as it provides all the tools needed to get started programming with Python and Big Data. Hopsworks integrates with Apache Spark and PySpark, a popular distributed processing framework.
Hopsworks brings to the Copernicus program and the EO data community essential features required for developing AI applications at scale, such as distributed Deep Learning with Graphics Processing Units (GPUs) on multiple servers, as demanded by the Copernicus volumes of data. Hopsworks provides services that facilitate conducting Deep Learning experiments, all the way from doing feature engineering with the Feature Store to developing Deep Learning models with the Experiments and Models services, which allow researchers to manage and monitor Deep Learning artifacts such as experiments, models, and automated code versioning, and much more. The Hopsworks storage and metadata layer is built on top of HopsFS, the award-winning, highly scalable distributed file system, which enables Hopsworks to meet the extreme storage and computational demands of the ExtremeEarth project.
Hopsworks brings horizontally scalable Deep Learning for EO data close to where the data lives, as it can be deployed on the Data and Information Access Services (DIAS). DIAS provides centralised access to Copernicus data and information; combined with the AI-for-EO-data capabilities that Hopsworks brings, this makes an unparalleled data science environment available to researchers and data scientists of the EO data community.
Recent years have witnessed the performance leaps of Deep Learning (DL) models thanks to the availability of big datasets (e.g. ImageNet) and the improvement of computation capabilities (e.g., GPUs and cloud environments). Hence, with the massive amount of data coming from earth observation satellites such as the Sentinel constellation, DL models can be used for a variety of EO-related tasks. Examples of these tasks are sea-ice classification, monitoring of water flows, and calculating vegetation indices.
However, together with the performance gains come many challenges for applying DL to EO tasks, including, but not limited to:
While collecting raw Synthetic-Aperture Radar (SAR) images from the satellites is one thing, labeling those images to make them suitable for supervised DL is a time-consuming task in its own right. Should we seek help from unsupervised or semi-supervised learning approaches to eliminate the need for labeled datasets? Or should we start building tools to make annotating the datasets easier?
Given enough labeled data, we can probably build a model with satisfactory performance. But how can we justify the reasons behind why the model makes certain predictions given certain inputs? While we can extract the intermediate predictions for given outputs, can we reach interpretations that can be better understood by humans?
Managing terabytes (TB) of data that can still fit into a single machine is one thing, but managing petabytes (PB) of data that requires distributed storage and provides a good service for the DL algorithms so as not to slow down the training and serving process is a totally different challenge. To further complicate the management, what about partial failures in the distributed file system? How shall we handle them?
How can we build models that effectively use multi-modalities? For example, how can we utilize the geo-location information in an image classification model?
While we might be able to perform preprocessing and design model architectures for RGB image classification, how do these apply to SAR images? Can we use the same model architectures? How to extract useful information from multi-spectral images?
Hyperparameters are those parameters of the training process (e.g., the learning rate of the optimizer, or the size of the convolution windows) that should be manually set before training. How can we effectively train models and tune the hyperparameters? Should we change the code manually? Or can we use frameworks to provide some kind of automation?
Once training is done, we want to use our trained model to predict outcomes based on newly observed data. Often these predictions have to be made in real time or near-real time to enable quick decisions. For example, we may want to update the ice charts of the shipping routes every hour. How do we serve our DL models online to meet these real-time requirements?
A Data Science application in the domain of Big Data typically consists of a set of stages that form a Deep Learning pipeline. This pipeline is responsible for managing the lifecycle of data that comes into the platform and is to be used for developing machine learning models. In the EO data domain in particular, these pipelines need to scale to the petabyte-scale data that is available within the Copernicus program. Hopsworks provides data scientists with all the required tools to build and orchestrate each stage of the pipeline, depicted in the following diagram.
In detail, a typical Deep Learning pipeline would consist of:
Drifting icebergs pose major threats to the safety of navigation in areas where icebergs might appear, e.g., the Northern Sea Route and the North-West Passage. Currently, domain experts manually produce what is known as an “ice chart” on a daily basis and send it to ships and vessels. This is a time-consuming and repetitive task, and automating it using DL models for iceberg classification would result in the generation of more accurate and more frequent ice charts, which in turn leads to safer navigation on the concerned routes.
Iceberg classification is concerned with telling whether a given SAR image patch contains an iceberg or not. The details of the classification depend on the dataset used. For example, given the Statoil/C-CORE Iceberg Classifier Challenge dataset, the main task is to train a DL model that can predict whether an image contains a ship or an iceberg (binary classification).
The steps we took to develop and serve the model were the following:
The final step is to export and serve the model. The model is exported and saved into the Hopsworks “Models” dataset. We then use the Hopsworks elastic model serving infrastructure to host TensorFlow Serving, which can scale with the number of inference requests.
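As a rough sketch of what an inference client sends: TensorFlow Serving’s REST API accepts a JSON body with an `instances` list, and the model is addressed by name in the URL. The model name `iceberg`, the patch shape, and the helper below are illustrative assumptions, not the exact ExtremeEarth setup.

```python
import json

def build_predict_request(image_patch):
    """Build the JSON body for a TensorFlow Serving REST ':predict' call.
    'image_patch' is a nested list (rows x cols x channels) of floats."""
    return json.dumps({"instances": [image_patch]})

# A hypothetical 2x2 single-channel SAR patch:
patch = [[[0.1], [0.2]], [[0.3], [0.4]]]
body = build_predict_request(patch)

# The body would then be POSTed to the serving endpoint, e.g.
# http://<serving-host>/v1/models/iceberg:predict
```

The serving layer returns a `predictions` list in the response body, from which the binary iceberg/ship decision can be read.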
In this blog post we described how the ExtremeEarth project brings new tools and capabilities with Hopsworks to the EO data community and the Copernicus program. We also showed how we developed a practical use case using Copernicus data and Hopsworks. We keep developing Hopsworks to align it even more closely with the tools and processes used by researchers across the entire EO community, and we continue to develop our use cases with more sophisticated models using even more advanced distributed Deep Learning training techniques.
Solving challenges from regulators, cybercrime and your customer base can be hard. One of the ways of responding and acting to these challenges is by using the power of data and machine learning, for example, to identify fraud, improve user engagement, and ensure responsible gambling by identifying at-risk players.
The success of Machine Learning algorithms in a broad range of areas has led to an increasing demand for ML-platform/solutions in the gaming industry. Many operators are currently focused on managing data pipelines and evaluating or developing machine learning platforms to optimize their operations and stay competitive.
ML is core to what AI-native companies like Uber, Airbnb, and Twitter do for creating new products and redefining customer experience standards. The crucial first step in their ML-process is feature engineering, and it often is the most laborious activity in the model building lifecycle. These AI-native companies almost all have built feature stores to optimize their feature engineering processes across multiple teams and models.
We helped Paddy Power mature their digital transformation by implementing the innovative and essential feature store concept: a central repository of features (input data used to train ML models) that acts as an enterprise-wide marketplace of features for different teams with different remits. The feature store enables the reuse of common and use-case-specific ML features for predictive betting models for different sports books, anti-fraud and AML (anti-money laundering) models, and player management and responsible gambling models, where features are reused across different models.
The concept of a feature store was introduced by Uber in 2017 as part of its internal Michelangelo platform for ML. The feature store is a central place to store curated features within an organization. So what is a feature, exactly? A feature is a measurable property of some data sample. It could, for example, be the number of customer transactions over a period of time (hour, day, week), the recent performance of a horse in horse racing, or the average number of deposits and exits within the last hour. Features can be extracted directly from files and database tables, or they can be derived values, computed from one or more data sources.
Features are the fuel for AI systems, as we use them to train machine learning models so that you can make predictions using new feature values that your model has never seen before.
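As an illustration of the kind of feature mentioned above, here is a minimal sketch (plain Python, with made-up field names) of computing a customer’s transaction counts over several sliding time windows:

```python
from datetime import datetime, timedelta

def transaction_count_features(timestamps, now):
    """Count a customer's transactions over sliding windows ending at 'now'.
    'timestamps' is a list of datetime objects for past transactions."""
    windows = {"txn_count_1h": timedelta(hours=1),
               "txn_count_1d": timedelta(days=1),
               "txn_count_7d": timedelta(days=7)}
    return {name: sum(1 for t in timestamps if now - w <= t <= now)
            for name, w in windows.items()}

now = datetime(2020, 1, 8, 12, 0)
history = [now - timedelta(minutes=30),   # inside all three windows
           now - timedelta(hours=5),      # inside 1d and 7d
           now - timedelta(days=3)]       # inside 7d only
feats = transaction_count_features(history, now)
```

In a feature store, a pipeline would precompute such windowed counts so that every model needing them can reuse the same definition.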
A feature store enables the reusability of features across your gaming operations, as existing features are visible to all potential users (data engineers, data scientists, machine learning engineers, business analysts, etc). Shared features can then be used to develop models for:
The feature store supports feature enrichment, discovery, ranking, lineage and lifecycle management for features.
The feature store plays a valuable role in both training and serving models. During model training, the feature store is used to create training data in the file format of choice for data scientists. There is no need to write and run new data pipelines to make feature data available as .tfrecord, .npy, or .csv files. Data scientists can interactively generate train/test data in the file format of their choice on the storage platform of their choice (S3, HDFS, etc.).
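A file-format-agnostic sketch of that idea: split a set of feature rows into train/test sets and render each split in a chosen format (here CSV text). In Hopsworks this is handled by the Feature Store API; the helper and row schema below are purely illustrative.

```python
import csv
import io
import random

def make_train_test(rows, test_fraction=0.2, seed=42):
    """Split feature rows into train/test sets and render each as CSV text.
    Illustrative only: a real feature store also materializes .tfrecord,
    .npy, petastorm, and other formats on S3/HDFS."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)            # deterministic shuffle
    cut = int(len(rows) * (1 - test_fraction))

    def to_csv(subset):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(subset)
        return buf.getvalue()

    return to_csv(rows[:cut]), to_csv(rows[cut:])

# Hypothetical feature rows joined from the feature store:
rows = [{"txn_count_1d": i, "avg_stake": i * 1.5, "label": i % 2}
        for i in range(10)]
train_csv, test_csv = make_train_test(rows)
```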
When models are being used, the feature store provides batch applications access to large volumes of feature data, while for online model serving, the feature store provides low latency access to feature data for online applications.
Without a feature store, organizations have ad-hoc scripts and programs for feature engineering, with limited sharing of features within or between teams. Features are rewritten many times, in different ways, by different developers. Feature pipelines also need to be rewritten when new training file formats appear (petastorm, for example), and enterprises have little insight into which features are being used in the organization and which are adding the most value. Developers are also required to build infrastructure to keep offline and online feature data consistent, a non-trivial task.
Feature engineering pipelines are written and operated by data engineers; they take data from backend systems, then transform and validate it before filling the feature store with feature data. Data scientists are thus freed from heavy feature engineering and can dedicate more time to developing higher-quality models, selecting features and backfilling train/test datasets that they then use to train models. Data scientists are responsible for training and validating their models before they are deployed for either batch or online applications. ML engineers, who operate models in production, can also look up feature data in real time for applications.
The Hopsworks Feature Store enables teams to work effectively together, sharing outputs and assets at all stages in machine learning (ML) pipelines. In effect, the Feature Store:
tl;dr Deep learning is now the state-of-the-art technique for identifying financial transactions suspected of money laundering. It delivers a lower number of false positives and with higher accuracy than traditional rule-based approaches. However, even though labelled datasets of historical transactions are available at many financial institutions, supervised machine learning is not a viable approach due to the massive imbalance between the number of “good” and “bad” transactions. The answer is to go unsupervised and welcome with open arms GANs and graph embeddings.
Financial institutions invest huge amounts of resources in both identifying and preventing money laundering and fraud. Many companies have systems that automatically flag financial transactions as ‘suspect’, using a database of static rules that generate alerts when they match a transaction, see Figure below. Flagged transactions can be withheld and/or investigated by human investigators. As anti-money laundering (AML) is a pattern matching problem, and many institutions have terabytes (TBs) of labelled data for historical transactions (transactions labelled as either ‘good’ or ‘bad’), many banks and companies are investigating deep learning as a potential technology for classifying transactions as “good” or “bad”, see Figure below. However, supervised machine learning is not a viable approach due to the typical imbalance between the number of “good” and “bad” transactions, where you may have only 1 “bad” transaction for every million transactions or more.
What are the alternatives? Well, in 2019, unsupervised learning and self-supervised learning took deep learning by storm, with self-supervised solutions now the state-of-the-art systems for both Natural Language Processing (NLP), with RoBERTa by Facebook, and image classification on ImageNet, with Noisy Student and AdvProp, both by Quoc Le’s team at Google. Self-supervised learning is autonomous supervised learning that doesn’t always require labelled data, as self-supervised learning systems extract and use some form of feature or metadata as a supervisory signal. In contrast, unsupervised learning has a long history of being used for anomaly detection, and in this blog we describe how we have worked on an AML solution based on Deep Learning, and how we moved through unsupervised, self-supervised, and semi-supervised learning to arrive at our method of choice - Generative Adversarial Networks (GANs).
In the figure above, we can see the huge data imbalance between good and bad transactions found in a typical AML transaction dataset. A supervised deep learning system could be used to train a binary classifier that predicts with 99.9999% accuracy whether a transaction was involved in money laundering or not. Of course, it would always predict ‘good’, and only get wrong the very small number of ‘bad’ transactions! This classifier would, of course, be useless. The problem we are trying to solve is to predict correctly when ‘bad’ transactions are bad, while minimizing the number of ‘good’ transactions that we predict as bad. In ML terminology, we say we want to maximize true positives (miss no bad transactions) and minimize false positives (good transactions predicted as bad), see the confusion matrix below. False negatives get you in trouble with the regulator and authorities; false positives create unnecessary work and cost for your company.
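A small sketch makes the accuracy pitfall concrete (scaled down to 1 bad transaction in 10,000): a classifier that always predicts ‘good’ scores 99.99% accuracy while catching zero bad transactions.

```python
def confusion(y_true, y_pred):
    """Confusion-matrix counts, with 'bad' (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# 1 'bad' transaction per 10,000 'good' ones:
y_true = [0] * 9999 + [1]
always_good = [0] * 10000        # the useless 'always predict good' model

tp, fn, fp, tn = confusion(y_true, always_good)
accuracy = (tp + tn) / len(y_true)  # high accuracy, yet every bad txn missed
```

Maximizing accuracy alone therefore tells us nothing about the metric we actually care about: how many bad transactions we catch.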
Our solution will have to address the class imbalance problem, as you may have thousands or millions of good transactions for every one transaction that is known to be bad. Supervised machine learning algorithms, including deep learning, work best when the number of samples in each class is roughly equal. This is a consequence of the fact that most supervised ML algorithms maximize accuracy and reduce error. Unfortunately, over-sampling the minority class, under-sampling the majority class, synthetic sampling, and cost-function approaches will not solve the class imbalance problem for us, due to the sheer scale of the typical data imbalance. Other classical techniques such as one-class support vector machines or Kernel Density Estimators (KDE) will not scale well enough, as they “require manual feature engineering to be effective on high-dimensional data and are limited in their scalability to large datasets”.
Anomaly detection follows quite naturally from a good unsupervised model
- Alex Graves (Deep Mind) at NeurIPS 2018.
As AML problems are not easily amenable to supervised machine learning, much ongoing research and development concentrates on treating AML as an anomaly detection (AD) problem. That is, how can we automatically identify suspected money-laundering transactions as anomalies? The domain of AML is also challenging due to its non-stationarity: new money transfer schemes are constantly being introduced into the market, whole countries may join or leave common payment areas or currencies, and, of course, new money laundering schemes are constantly appearing. AML is a living organism, and any techniques developed need to be able to be quickly updated in response to changes in its environment.
Traditional machine learning approaches to solving anomaly detection have involved using unsupervised learning approaches, see Figure above. An unsupervised approach to AML would be to find a “compact” description of the “good” class, with “bad” transactions being anomalies (not part of the “good” class). Examples of unsupervised approaches to AD include k-means clustering and principal component analysis (PCA).
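As a minimal illustration of the PCA approach, the sketch below (synthetic data, not real transactions) scores a sample by its reconstruction error against the principal subspace learned from “good” data; samples far from that subspace get large anomaly scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# 'Good' transactions lie near a low-dimensional subspace; anomalies do not.
# Here, 2 latent factors are mixed into 3 observed feature dimensions.
normal = rng.normal(size=(500, 2)) @ np.array([[1.0, 2.0, 0.0],
                                               [0.0, 1.0, 0.5]])
X = normal - normal.mean(axis=0)        # center before PCA

# Fit PCA via SVD and keep the top-2 principal components.
_, _, vt = np.linalg.svd(X, full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    """Anomaly score: distance between x and its projection onto the
    principal subspace learned from 'good' data."""
    proj = (x @ components.T) @ components
    return float(np.linalg.norm(x - proj))

inlier = X[0]                                  # lies in the learned subspace
outlier = np.array([50.0, -40.0, 30.0])        # far from the subspace
```

Thresholding this score gives a simple unsupervised detector; the GAN-based approach described later replaces the linear subspace with a learned model of the normal-data distribution.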
However, AML is not a classical use-case for anomaly detection techniques as we typically have labelled datasets - financial institutions typically know which transactions were “good” and which transactions were “bad”. We should, therefore, exploit that labelled data when training models. That is, we have semi-supervised learning - it is not fully unsupervised learning as there is some labeled training data available.
Before we get into semi-supervised ML, we need to discuss the features that can be used to train your AML model. Firstly, you should only train your AML model on features that will be accessible to your production AML binary classifier. If you have access to features during development that will never be usable in production, don’t use them even out of curiosity - curiosity killed the cat. If you are designing an AML classifier for an online application, it is important to know that you can access the features used to train your model at very low latency in production. Typically, this requires an online feature store, such as the Hopsworks Feature Store (see the end-to-end figure later). If you are training a classifier for an offline batch application, you typically do not need to worry about low-latency access to feature data, as you will probably have feature engineering code in the batch application (or in the Feature Store) that pre-computes the features for your classifier.
The most obvious features available for AML are the features that arrive with the transaction itself: sending customer, receiving customer, sending/receiving bank, the amount of money being transferred, the date, etc. Other examples of useful features include how many transactions the sending/receiving account has executed in the last day, week, or month; whether the transaction date/time is “special” (holiday, weekend, etc.); the graph of customers connected to the sender/receiver accounts over different time windows (last hour/day/week/month/year); credit scores for accounts; and so on. Typically, these features are not available in your online application, and you do not want to rewrite your feature engineering code in your online AML classifier application, so you use the Feature Store to look up those features at runtime when the transaction arrives. When you have queried those features from the online Feature Store, you join them (in the correct order) with the transaction-supplied features and then send the complete feature vector to the network-hosted model for a prediction of whether the transaction is “good” or “bad”.
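A sketch of that assembly step. The feature names and the `lookup_online_features` helper below are hypothetical stand-ins for a real online feature store client (such as the Hopsworks Feature Store API); what matters is the join and the fixed feature order the model was trained on.

```python
# Order the model was trained on - must be preserved at inference time.
FEATURE_ORDER = ["amount", "hour_of_day", "txn_count_1d", "credit_score"]

def lookup_online_features(account_id):
    """Hypothetical stand-in for a low-latency online feature store lookup."""
    fake_store = {"acc-1": {"txn_count_1d": 7, "credit_score": 0.82}}
    return fake_store[account_id]

def build_feature_vector(txn):
    """Join transaction-supplied features with precomputed online features,
    in the exact order the model expects."""
    local = {"amount": txn["amount"], "hour_of_day": txn["hour"]}
    stored = lookup_online_features(txn["account_id"])
    merged = {**local, **stored}
    return [merged[name] for name in FEATURE_ORDER]

vec = build_feature_vector({"account_id": "acc-1", "amount": 250.0, "hour": 23})
```

The resulting vector is what gets sent to the hosted model for the good/bad prediction.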
Semi-supervised learning is a class of machine learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.
In semi-supervised learning, we use many of the techniques from unsupervised learning - you train a good model of the data, then you can see how likely something is under that model. In AML, we could use, for example, one-class novelty detection, to train one GAN on the large class of ‘good’ transactions as the novelty detector, and another GAN that supports it by enhancing the inlier samples and distorting the outliers.
Generative models involve building a model of the world that enable you to act under uncertainty. More precisely, they are able to model complex and high dimensional distributions of real-world data. We use GANs for anomaly detection by modeling the normal behavior of some training data using adversarial training and then detecting anomalies using an anomaly score. To evaluate our performance, we use Precision, Recall and F1-Score metrics.
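For reference, the evaluation metrics mentioned above, computed from scratch with the anomalous (“bad”) class as the positive class:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1-score with 'bad' (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged txns that are bad
    recall = tp / (tp + fn) if tp + fn else 0.0      # bad txns we caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
```

Unlike raw accuracy, these metrics stay informative under the extreme class imbalance typical of AML datasets.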
GANs are generally very difficult to train - there is a significant risk of mode collapse, where the generator learns to produce only a narrow range of samples and the generator and discriminator stop learning from one another. You will need to carefully consider the architecture you use. To quote Marc’Aurelio Ranzato (Facebook) at NeurIPS 2018: “the Convolutional Neural Network architecture is more important than how you train (GANs)”. You will need to dedicate time and GPU resources to hyperparameter tuning due to the sensitivity of GANs and the risk of mode collapse.
You need to have the tools and platform to do hyperparameter sweeps at scale with GANs, due to their sensitivity. Hopsworks uses PySpark and the Maggy framework to distribute hyperparameter trials to GPUs running TensorFlow/Keras/PyTorch code. Hopsworks/Maggy supports state-of-the-art asynchronous directed hyperparameter search, which can provide up to 80% more efficient use of GPUs when tuning sensitive GANs, compared to other hyperparameter tuning approaches using PySpark. Read our Maggy blog post to find out more.
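This is not Maggy’s actual API, but a minimal illustration of the trial loop such a framework distributes across GPUs: random search over a search space, with a made-up objective standing in for “train a GAN, return a validation score”.

```python
import random

def run_trial(lr, batch_size):
    """Stand-in objective: in reality this trains the model and returns a
    validation metric (e.g. discriminator F1 on held-out data). The formula
    below is fabricated purely so the sketch runs."""
    return -abs(lr - 1e-3) * 1000 - abs(batch_size - 64) / 64

# Search space: log-uniform learning rate, categorical batch size.
search_space = {"lr": lambda: 10 ** random.uniform(-5, -1),
                "batch_size": lambda: random.choice([16, 32, 64, 128])}

random.seed(7)
best_score, best_cfg = float("-inf"), None
for _ in range(50):                       # 50 sequential trials; Maggy would
    cfg = {name: sample()                 # run these asynchronously in parallel
           for name, sample in search_space.items()}
    score = run_trial(**cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg
```

Directed (asynchronous, early-stopping) search improves on this baseline by killing unpromising trials early, which is where the GPU-efficiency gains come from.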
Similar to MLflow, Hopsworks also provides an experiment service where you can manage all hyperparameter trials as experiments. The Experiments service supports TensorBoard, tabular results for trials, and any images generated during training. This makes training runs and models easier to analyse and reproduce.
If you have now successfully trained a GAN that performs well on your test dataset (keep the last year/months of your transaction data as holdout data), you can save the discriminator as a model for use on new transactions. Typically, you want further model validation - the What-If Tool is a good way to evaluate black-box models in Jupyter notebooks on Hopsworks - before you deploy your model from your model registry into production for either online or batch applications.
While there is no guarantee that GANs will magically make money laundering disappear overnight, they appear to be a much more powerful tool for correctly identifying fraudulent transactions and for minimizing the number of false alerts that need to be manually investigated. In China, GANs are already being used in production for telecom fraud detection at the receiving bank at two banks: “Two commercial banks have reduced losses of about 10 million RMB in twelve weeks and significantly improved their business reputation”. Another example, shown below, was recently presented at the Spark/AI Summit EU 2019.
There is also the issue of explainability (particularly in the EU), where it is unclear how regulators will handle the use of black-box techniques in making fraud classification decisions. The first step to take is to augment existing systems with extra input from a GAN-based system, and evaluate for a period of time before discussing how to proceed with the regulators.
Hopsworks enables automated end-to-end ML workflows: taking historical financial transactions from production backend data systems (data warehouses, data lakes, Kafka), feature engineering with Spark/PySpark, and feature storage in the Feature Store. For orchestrating such feature pipelines, Hopsworks includes Airflow as a service. Airflow is also used by data scientists to orchestrate Jupyter notebooks for (1) selecting features to create train/test data, (2) finding good hyperparameters, (3) training a model, and (4) evaluating the model before deploying it into production. Finally, online AML detection applications need to enrich their feature vectors with transaction window counts and other features not available locally. The Hopsworks online Feature Store is a low-latency, highly available database, based on MySQL Cluster, that enables online AML applications to access their pre-engineered feature data with very low latency (<10 ms). The Hopsworks Feature Store is modular, with a REST API, so all steps 1-5 in the above figure can be run either on the Hopsworks platform itself or on external platforms. You decide.
Deep learning is revolutionizing how we identify money laundering, and it brings unique challenges related to huge data volumes and massive data imbalances. At Logical Clocks, we have worked on a semi-supervised approach based on GANs, enabled by Hopsworks, which scales to process huge datasets and graph embeddings at scale using Spark. The result is state-of-the-art performance in identifying money laundering. The next phase is to put this model into production.