The Hopsworks Platform
Hopsworks is an open-source enterprise platform for developing and operating Machine Learning (ML) pipelines at scale, built around the industry’s first Feature Store for ML. You can progress from data exploration and model development in Python, using Jupyter notebooks and conda, to running production-quality end-to-end ML pipelines without having to learn how to manage a Kubernetes cluster.
Connect your datasources
Hopsworks can ingest data from the datasources you use, whether they are in the cloud, on‑premises, in IoT networks, or part of your Industry 4.0 solution.
Data Pipelines and Feature Engineering
The starting point for building ML applications is to ingest data from a variety of structured and unstructured sources, and then to wrangle and validate that data so that it can be used to build the features on which models are trained. Hopsworks supports data preparation and feature engineering in the following frameworks:
- Python programs that use Pandas for feature engineering;
- Spark or PySpark applications that process larger volumes of data faster by using more resources;
- Beam on Flink, which enables the use of TensorFlow Extended (TFX) components: TensorFlow Data Validation and TensorFlow Transform.
Feature Store
World's first publicly available Feature Store
Hopsworks’ Feature Store is a new data layer in horizontally scalable machine learning pipelines that:
- Enables features to be discovered, analyzed, and reused across applications
- Ensures consistency of feature engineering between training and model serving
- Enables time-travel queries that read historical feature values
- Ensures high-quality feature data through integration with data validation tooling
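To make the idea of a time-travel query concrete, here is a minimal sketch in plain Python. It is not the Hopsworks Feature Store API: the `ToyFeatureStore` class, its method names, and the integer timestamps are all hypothetical, chosen only to illustrate reading a feature value "as of" an earlier point in time.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyFeatureStore:
    """Hypothetical in-memory store that keeps every written value with its timestamp."""

    def __init__(self):
        # (entity_id, feature) -> parallel lists of timestamps and values, sorted by time
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id, feature, timestamp, value):
        key = (entity_id, feature)
        # Insert in timestamp order so that reads can use binary search.
        idx = bisect_right(self._times[key], timestamp)
        self._times[key].insert(idx, timestamp)
        self._values[key].insert(idx, value)

    def read_as_of(self, entity_id, feature, timestamp):
        """Return the latest value written at or before `timestamp`, or None."""
        key = (entity_id, feature)
        idx = bisect_right(self._times.get(key, []), timestamp)
        return self._values[key][idx - 1] if idx > 0 else None


store = ToyFeatureStore()
store.write("customer_1", "avg_basket", timestamp=10, value=20.0)
store.write("customer_1", "avg_basket", timestamp=20, value=35.0)

# Reading as of t=15 sees only the value written at t=10.
historical = store.read_as_of("customer_1", "avg_basket", 15)
latest = store.read_as_of("customer_1", "avg_basket", 25)
```

Because old values are never overwritten, a training job can rebuild the exact feature values that were valid when a label was recorded, which is what keeps training and serving consistent.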
Experimentation & Model Training
Tools and Frameworks loved by most
Hopsworks provides framework support for enabling Machine Learning code to:
- Easily move from development to production: Jupyter notebooks can be run directly in ML pipelines;
- Easily scale from a single container/GPU to a cluster of 10s or 100s of GPUs using the Maggy framework, which provides a unified framework for the parallel execution of Hyperparameter Optimization and Ablation Studies using PySpark;
- Easily scale from a single container/GPU to a cluster of 10s or 100s of GPUs for distributed training on TensorFlow with the CollectiveAllReduce strategy;
- Be reproducible through management of experiments, showing the output models of training runs, the notebook used to generate each model, and the conda environment used; and
- Produce and validate models that are managed by Hopsworks.
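The parallel hyperparameter optimization mentioned above relies on the fact that trials are independent of one another. The following is a rough sketch of that idea using only the Python standard library. It does not use the Maggy API or PySpark: `train_and_score` is a made-up stand-in for a real training run, and local threads stand in for cluster executors.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_and_score(params):
    """Stand-in for a training run: returns a made-up validation loss."""
    lr, layers = params["lr"], params["layers"]
    return (lr - 0.01) ** 2 + 0.001 * abs(layers - 3)


def random_search(num_trials, max_workers=4, seed=0):
    """Sample random configurations, evaluate them concurrently, return the best."""
    rng = random.Random(seed)
    trials = [
        {"lr": rng.uniform(0.001, 0.1), "layers": rng.randint(1, 6)}
        for _ in range(num_trials)
    ]
    # Trials do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(train_and_score, trials))
    best = min(range(num_trials), key=losses.__getitem__)
    return trials[best], losses[best]


best_params, best_loss = random_search(num_trials=20)
```

In a cluster setting the same structure applies, with each trial dispatched to a separate worker or GPU instead of a local thread.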
Putting ML Pipelines in Production
Hopsworks includes Airflow as an orchestration engine for managing the execution of ML pipelines.
- Notebooks or Jobs in Hopsworks can be run as stages in ML pipelines through the HopsworksJob Airflow Operator;
- Airflow provides error handling and notification support to ensure that any production problems with pipelines can be immediately discovered and acted upon;
- Pipelines are data-aware - as Hopsworks manages versioned models, you can check if the model that has been trained outperforms the currently deployed model before deploying it to production.
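The check described in the last point can be pictured as a small gate that a pipeline stage runs before deployment. This is an illustrative sketch only: the `ModelVersion` record, the `should_deploy` function, and the choice of accuracy as the metric are assumptions, not part of Hopsworks or Airflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelVersion:
    name: str
    version: int
    accuracy: float  # metric on a shared hold-out set; higher is better


def should_deploy(candidate: ModelVersion,
                  deployed: Optional[ModelVersion],
                  min_gain: float = 0.0) -> bool:
    """Deploy only if nothing is deployed yet or the candidate is strictly better."""
    if deployed is None:
        return True
    return candidate.accuracy > deployed.accuracy + min_gain


current = ModelVersion("churn", version=3, accuracy=0.91)
candidate = ModelVersion("churn", version=4, accuracy=0.93)

deploy = should_deploy(candidate, current)                   # candidate is better
deploy_strict = should_deploy(candidate, current, min_gain=0.05)  # gain is too small
```

A `min_gain` threshold above zero avoids replacing a production model for improvements that are within evaluation noise.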
Model Serving & Monitoring
Hopsworks supports the deployment and management of both batch and real-time models:
- Batch applications use models either by downloading them via the REST API or, within Hopsworks, by having Spark/Beam/Flink applications read them as files from HopsFS;
- Online applications can use models that are served in real-time from either TensorFlow Serving or a Flask application (Scikit-learn / H2O). Hopsworks Enterprise serves these models on elastic Kubernetes containers;
- Logs for real-time models (prediction requests and responses) can be stored in a Kafka topic so that streaming applications can monitor the performance/behaviour of the model in real-time;
- Online model serving provides a TLS-enabled REST API with self-service access control.
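As an illustration of what a client call to a served model might look like, the sketch below builds a prediction request in the TensorFlow Serving REST format (`POST .../v1/models/<name>:predict` with an `instances` JSON body). The host name, model name, and bearer-token header are placeholders, and the exact endpoint path and authentication scheme exposed by a Hopsworks deployment may differ.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances, token=None):
    """Build (but do not send) a TensorFlow Serving style predict request."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumed auth scheme for illustration only.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_predict_request(
    "https://serving.example.com/",   # placeholder host
    "churn",                          # placeholder model name
    instances=[[0.2, 1.5, 3.0]],
    token="my-api-token",             # placeholder credential
)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` and decoding the JSON response; using an `https` URL keeps the call on a TLS-protected connection.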
On-Premises or in the Cloud
Deploy on‑premises on your own hardware or at your preferred cloud provider. Hopsworks will provide the same user experience in the cloud or in the most secure of air‑gapped deployments.