The Hopsworks 1.x series brings many new features and improvements, ranging from services such as the Feature Store and Experiments, to enhanced support for distributed stream processing and analytics with Apache Flink and Apache Beam, to building Deep Learning pipelines with TensorFlow Extended (TFX), to code versioning support for Jupyter notebooks with Git, to all-new provenance/lineage of data across all steps of a data engineering and data science pipeline. We are also excited that Hopsworks 1.x is the backbone of the all-new Managed Hopsworks platform for AWS, Hopsworks.ai (https://www.hopsworks.ai/).
Hopsworks 1.x brings significant Feature Store improvements ranging from updated UI components to connectivity with external systems and feature discovery. Most notably:
- Being able to store training datasets in external data sinks (such as S3) while still tracking the metadata in Hopsworks
- Improved online feature store experience
- Support for Apache Hudi as a storage format for feature groups, to allow for upserts and time travel
- Pluggable storage of feature groups and training datasets by using storage connectors to external systems such as S3, JDBC, HopsFS
- UI-support for Citizen Data Scientists who can: (1) generate training datasets from the UI by selecting features; (2) generate feature groups using SQL; (3) update feature group and training dataset statistics.
- On-Demand feature groups: allow the user to define feature groups using SQL that are computed on-demand using an external JDBC connection, without having to cache the data in Hopsworks
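The idea behind on-demand feature groups can be illustrated with a minimal sketch. This is not the Hopsworks API: the `OnDemandFeatureGroup` class, the table, and the query are hypothetical, and an in-memory SQLite database stands in for the external JDBC source. The point is only that the feature group keeps its SQL definition and recomputes the features on every read instead of caching them.

```python
import sqlite3


class OnDemandFeatureGroup:
    """Hypothetical illustration: a feature group stored only as a SQL query."""

    def __init__(self, name, query):
        self.name = name
        self.query = query  # only the definition is kept, never the data

    def read(self, connection):
        # The query runs against the external source each time it is read.
        cursor = connection.execute(self.query)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


# An in-memory SQLite database stands in for the external JDBC source.
external_db = sqlite3.connect(":memory:")
external_db.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
external_db.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0)],
)

avg_spend = OnDemandFeatureGroup(
    name="avg_spend",
    query=(
        "SELECT customer_id, AVG(amount) AS avg_amount "
        "FROM purchases GROUP BY customer_id ORDER BY customer_id"
    ),
)

features = avg_spend.read(external_db)
```

Because nothing is cached, a row added to the external table after the feature group was defined is reflected the next time `read` is called.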
Users of Hopsworks Enterprise can now easily connect to the Feature Store from their Databricks notebooks and Amazon SageMaker. Documentation for connecting with these two platforms can be found at hopsworks.readthedocs.io, and many example notebooks are available in our hops-examples GitHub repository.
All-new Experiments UX and Model Registry
Hopsworks 1.x brings an all-new Experiments user experience with a revamped user interface and new functionalities. To make use of the Experiments service, Data Scientists can use the hops Python library, a rich experiment API for Data Scientists to run their Machine Learning code, whether it be TensorFlow, Keras, PyTorch or another framework with a Python API. Experiments also provide features such as automatic versioning of notebooks and the Python environment, parallel hyperparameter tuning algorithms, and managed TensorBoard. Along with the experiments API comes Maggy, an in-house built framework for asynchronous algorithms for parallel hyperparameter tuning and parallel ablation studies.
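To give a feel for what parallel hyperparameter tuning involves, here is a minimal sketch using only the Python standard library. It does not use the hops or Maggy APIs; the `train` objective, the search space, and the `parallel_search` helper are all hypothetical stand-ins. Trials run concurrently and results are collected as each one finishes, rather than waiting for a whole batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def train(learning_rate, dropout):
    # Toy objective standing in for real training code; higher is better.
    return 1.0 - (learning_rate - 0.01) ** 2 - (dropout - 0.3) ** 2


search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "dropout": [0.1, 0.3, 0.5],
}


def grid(space):
    # Yield every combination of hyperparameter values as a dict.
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def parallel_search(train_fn, space, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(train_fn, **params): params for params in grid(space)}
        # Handle each trial as soon as it completes, in whatever order that is.
        for future in as_completed(futures):
            results.append((future.result(), futures[future]))
    # Return the (metric, params) pair with the highest metric.
    return max(results, key=lambda r: r[0])


best_metric, best_params = parallel_search(train, search_space)
```

A real tuning framework would also use the early results to decide which trials to start or stop next; this sketch only shows the concurrent execution and the comparison on a user-defined metric.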
The new Experiments user interface allows users to easily track experiments and compare them using metrics exposed by the API and defined by the Data Scientists in their programs/notebooks. With one click, Hopsworks users can now launch the TensorBoard of a current or past experiment, view and export the Python Anaconda environment that the particular experiment run used, preview within Hopsworks or download the notebook used for this experiment, view the experiment’s logs, and view historical information about the underlying Spark framework such as configuration parameters and execution information.
Last but not least, Data Scientists can now manage their models in the new Model Registry service with UI support. The Model Registry lists all models developed (exported) from different experiment runs and, more importantly, makes it easy for Data Scientists to discover, search and compare models developed by other Data Scientists within the current project. Users can also easily compare the performance of different versions of the same model using the metrics supplied in their experiments. Each model provides a link to the experiment that created it, providing provenance to the code and Python environment used to create the model, enabling models to be more easily reproduced.
Tracking experiments, models and feature data that was used for developing the models is managed by Hopsworks by using the provenance capabilities of HopsFS. Hopsworks can now track operations on files that are created/deleted and models that are developed from programs that use Experiments APIs. Effectively, users in Hopsworks can now navigate from the Feature Store to train/test data to experiments (programs and Python environments) to models.
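The navigation described above amounts to walking a lineage graph. The following is a minimal sketch of that idea, not how HopsFS provenance is implemented: the `LineageGraph` class and all artifact names are hypothetical. Each artifact records what it was derived from, so starting from a model you can walk back to the experiment, the training dataset, and the feature groups.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage store: artifact -> artifacts it was derived from."""

    def __init__(self):
        self._inputs = defaultdict(list)

    def record(self, output, inputs):
        # Register that `output` was produced from `inputs`.
        self._inputs[output].extend(inputs)

    def upstream(self, artifact):
        """Return every artifact that `artifact` depends on, nearest first."""
        seen = []
        queue = list(self._inputs.get(artifact, []))
        while queue:
            current = queue.pop(0)  # breadth-first traversal
            if current not in seen:
                seen.append(current)
                queue.extend(self._inputs.get(current, []))
        return seen


graph = LineageGraph()
graph.record(
    "training_dataset:churn_v1",
    ["feature_group:customers_v1", "feature_group:purchases_v2"],
)
graph.record("experiment:run_42", ["training_dataset:churn_v1"])
graph.record("model:churn_v3", ["experiment:run_42"])

lineage = graph.upstream("model:churn_v3")
```

Here `lineage` lists the experiment first, then the training dataset, then the two feature groups, which mirrors the path from models back to the Feature Store.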
Project-based multi-tenant Elasticsearch
Hopsworks 1.x expands its unique project-based Elasticsearch multi-tenancy. Compared to previous Hopsworks versions, users of the platform now get programmatic access to Elasticsearch indices that are private to their projects (workspaces). By private, we mean that users in a Hopsworks project can only access the indices owned by that project and not any indices belonging to other projects. Programmatic access means that users can use Elasticsearch APIs from within their Hopsworks jobs. For example, Spark dataframes can be securely written to and read from Elasticsearch from within a Spark (Scala) or a PySpark (Python) notebook. Project-based multi-tenancy is implemented by integrating the open-distro security plugin, open-sourced by Amazon, with Elasticsearch OSS.
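The access rule itself is simple to state in code. The sketch below is only a model of that rule, under the assumption that each index has exactly one owning project; the `ProjectIndexRegistry` class and the project and index names are hypothetical, and in Hopsworks the enforcement is done by the security plugin, not by application code like this.

```python
class ProjectIndexRegistry:
    """Hypothetical model of project-private indices (one owning project per index)."""

    def __init__(self):
        self._owner = {}  # index name -> owning project

    def create_index(self, project, index):
        if index in self._owner:
            raise ValueError(f"index {index!r} already exists")
        self._owner[index] = project

    def check_access(self, project, index):
        # Only the owning project may touch the index; unknown indices are denied too.
        if self._owner.get(index) != project:
            raise PermissionError(f"project {project!r} may not access {index!r}")
        return True


registry = ProjectIndexRegistry()
registry.create_index("fraud_detection", "fraud_detection_scores")
registry.create_index("churn", "churn_events")


def try_access(project, index):
    try:
        return registry.check_access(project, index)
    except PermissionError:
        return False


own = try_access("churn", "churn_events")              # the project's own index
other = try_access("churn", "fraud_detection_scores")  # another project's index
```

In this model `own` is allowed and `other` is denied, which is the behaviour a job running inside the `churn` project would observe.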
Enhanced Jobs UI
Users will now notice changes in the Hopsworks Jobs UI, with a sleeker design and more functionality available from within the Jobs page. It is now possible to quickly navigate through different runs of a job and click to view logs in full-screen mode. Further information is available in the Hopsworks Jobs user guide.
New notebook services
Hopsworks 1.x adds support for working with JupyterLab as part of the notebook service offering of the platform. Users can now select their favorite notebook IDE between JupyterLab and Jupyter Notebook from within the Hopsworks Jupyter dashboard.
It is now easier than ever to get started with writing Python programs that utilize a GPU. Hopsworks Enterprise integrates with Kubernetes to enable users to allocate GPUs to the container running their Jupyter notebook. Users can now train models using either the Python kernel or the PySpark (sparkmagic) kernel.
In addition, Hopsworks 1.x brings a long-awaited feature: Git support for notebooks. Users can now set their GitHub repository in the Hopsworks Jupyter dashboard, and Hopsworks then automates the process of cloning the repository, checking out branches and pushing notebook changes back to GitHub. There is also Git plugin support in JupyterLab.
Support for Apache Flink, Apache Beam, and TensorFlow Extended (TFX)
Support for running Apache Flink has been completely re-engineered as part of the Hopsworks 1.x series. Flink is now a first-class citizen in the Hopsworks Jobs service as users can now create a new Flink job (cluster) by setting various parameters and access the Flink dashboard and Flink history server from within the Jobs UI. More information on how to use Flink from the UI or even launch Flink programs programmatically can be found in our Hopsworks-Flink docs.
Apache Beam is now also supported in beta. Full-fledged support will be added in the next 1.3 release along with the latest versions of Beam, Flink and TFX. Hopsworks supports developing and running Beam programs with the Beam Portability framework and the Flink runner. To ease development, the hops Python library provides the beam module that automates collecting logs, managing Beam related services, and distributing binaries. Examples can be found in hops-example/flink and extended documentation in Hopsworks-Beam docs.
Building on Beam support, Hopsworks now provides initial support for building ML pipelines with TensorFlow Extended (TFX) components on Beam with the Flink runner. Details on how the integration of Flink, Beam and TFX is implemented in Hopsworks are presented in our talk at the BigThings conference 2019 (a video is available), in our docs, and in our examples.
Hopsworks 1.x is the engine behind Hopsworks.ai, the platform for Data-Intensive AI in the cloud. Hopsworks.ai enables businesses or individuals to seamlessly deploy Hopsworks with the Feature Store in an AWS account. Visit https://www.hopsworks.ai/ for more information on how to quickly get started.
Release cycle and new support page
The first three releases of this series, 1.0, 1.1 and 1.2, kick off with more than 300 JIRAs including new features, improvements and bug fixes. An important change in the Hopsworks release cycle is the move to timely releases. A new release of Hopsworks will now be issued every ~6 weeks, allowing for faster availability of new features and making upgrades smoother than before.
Detailed release notes can be found at the Hopsworks GitHub repository, and important release notes with any breaking changes and upgrade notes are available on the Hopsworks version upgrades documentation page.
Last but not least, Hopsworks community support has moved under a new roof at https://community.hopsworks.ai where Hopsworks developers answer any questions you may have regarding the Hopsworks and Hopsworks.ai platforms!