Big Data and AI for Humans and Others
tl;dr Hopsworks is a data platform that integrates popular platforms for data processing such as Apache Spark, TensorFlow, Hops Hadoop, Kafka, and many others. All services provided by Hopsworks can be accessed using either a REST API or a User Interface. But the real value add of Hopsworks is that it makes Big Data and AI frameworks easier to use by introducing new concepts for collaborative data science (Projects, Users, and Datasets) and ubiquitous support for TLS certificates, opening the platform for integration with the outside world (IoT/mobile devices and external applications).
The Problem – Collaborative Data Science
The erudite Thomas Dinsmore describes the desirable properties of a collaborative data science platform, which we think aptly describe the Hopsworks platform:
“The rise of collaborative data science leads organizations to adopt open data science platforms that do the following:
- Provide a shared platform for all data science contributors
- Facilitate the use of open data science tools (such as Python and R) at scale
- Provide self-service access to data, storage, and compute
- Support a complete pipeline from data to deployment
- Include collaborative development tools
- Ensure asset management and reproducibility”
Hopsworks is a data platform that includes the most popular and scalable data storage and processing systems in the Hadoop ecosystem and beyond. In Hopsworks, all services are exposed via a single REST API, and security is managed end-to-end using TLS certificates and new abstractions required for multi-tenancy, based on Projects. For humans, Hopsworks is accessed via a user interface that lets users reach data, services, and code through a new project abstraction. A project is a sandbox containing datasets, other users, and code. Users familiar with GitHub will recognize a project as the equivalent of a GitHub repository: users manage membership of the project themselves, and also decide what code and data should be in the project. In a project, a user can have the role of “Data Owner” (the administrator) or “Data Scientist”. Data Scientists are restricted to uploading programs, running programs, and visualizing results. Data Scientists are not allowed to change membership of the project or to import/export data from the project. This enables Data Owners to manage the analysis of sensitive datasets within the confines of a project by inviting a Data Scientist into the project to carry out the analysis there. In the background, for each project that a user is a member of, Hopsworks constructs a new Hadoop identity and creates a new TLS certificate for the “project-specific user” (or projectUser, for short).
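The role model described above can be sketched in a few lines of Python. This is an illustration only, not Hopsworks code: the `Action` names, the `is_allowed` helper, and the `project__username` naming scheme for the projectUser are assumptions made for the example.

```python
from enum import Enum, auto


class Action(Enum):
    UPLOAD_PROGRAM = auto()
    RUN_PROGRAM = auto()
    VISUALIZE_RESULTS = auto()
    CHANGE_MEMBERSHIP = auto()
    IMPORT_EXPORT_DATA = auto()


# Data Scientists may only upload programs, run them, and visualize results.
# Data Owners administer the project, so they may perform every action.
ROLE_ACTIONS = {
    "Data Owner": frozenset(Action),
    "Data Scientist": frozenset(
        {Action.UPLOAD_PROGRAM, Action.RUN_PROGRAM, Action.VISUALIZE_RESULTS}
    ),
}


def is_allowed(role: str, action: Action) -> bool:
    """Return True if the given project role may perform the action."""
    return action in ROLE_ACTIONS[role]


def project_user(project: str, username: str) -> str:
    """Derive a project-specific identity (the naming scheme is assumed)."""
    return f"{project}__{username}"
```

Because the identity is derived per project, the same person gets a different projectUser, and so a different certificate, in every project they belong to.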
Share Datasets like in Dropbox
Some datasets will need to be made available in more than one project. Rather than storing a copy of the dataset in each project, which is both expensive and error-prone, a Data Owner can share a dataset from her project with another project, making it available in that project with either read-only or read-write privileges. Datasets can also be made public within an organization, so users can add them to their projects in a self-service manner.
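A minimal sketch of the sharing model, assuming a single stored dataset with a per-project permission table. The `Dataset` class and its method names are hypothetical and only illustrate the idea of sharing without copying.

```python
from dataclasses import dataclass, field

PERMISSIONS = ("read-only", "read-write")


@dataclass
class Dataset:
    name: str
    owner_project: str
    # Maps a project name to the permission it was granted.
    shared_with: dict = field(default_factory=dict)

    def share(self, project: str, permission: str = "read-only") -> None:
        """Make this dataset visible in another project without copying it."""
        if permission not in PERMISSIONS:
            raise ValueError(f"unknown permission: {permission}")
        self.shared_with[project] = permission

    def can_read(self, project: str) -> bool:
        return project == self.owner_project or project in self.shared_with

    def can_write(self, project: str) -> bool:
        if project == self.owner_project:
            return True
        return self.shared_with.get(project) == "read-write"
```

For example, after `trades.share("fx")` the `fx` project can read the dataset but not modify it, while the data itself is still stored only once.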
First Class Python Support with Conda/Pip
An important and unique feature of Hopsworks (unique among Hadoop platforms, in any case) is that each project can have its own conda environment on all hosts in the cluster. If a Data Owner enables conda for her project, a conda environment for that project is provisioned on every host in the Hops cluster. The user can then install the libraries, and the versions of those libraries, that she wants just for her project. PySpark and TensorFlow jobs run in that project will run in the project’s conda environment on every host in the cluster. The conda environment can be initialized with the desired version of Python, such as 2.7 or 3.6. Only one conda environment is supported per project.
Thomas Dinsmore again writes:
“I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters. We cross-check everything manually to make sure there are no mistakes.”
Hops supports GPUs-as-a-Resource, and Hopsworks allows users to start a job asking for GPUs for executors. YARN node labels can be used when there are different types of GPU servers.
As Hops manages GPUs-as-a-Resource, from within Hopsworks we can start TensorFlow applications with tens or even hundreds of GPUs. Hopsworks supports APIs for massively parallel experimentation (such as hyperparameter optimization) via the Hops Python API. As Hopsworks supports distributed TensorFlow (including Horovod and TensorFlowOnSpark), we make it easier to use by running TensorFlow applications from within PySpark applications.
In practice, this means you put your TensorFlow code inside a wrapper function.
Hops’ experiment API allows us to easily support hyperparameter optimization across tens or hundreds of GPUs with the following code example:
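The snippet below is a local stand-in for that pattern, not the Hops Python API itself: `grid`, `run_experiments`, and the scoring inside `wrapper` are assumptions made for illustration. It shows the two ideas involved, a wrapper function that holds the training code and a dictionary of hyperparameter lists that is expanded into one run per combination. On Hopsworks the runs would be distributed to executors with GPUs; here they simply run one after another.

```python
from itertools import product


def grid(args_dict):
    """Expand {'name': [values...]} into one dict per combination."""
    names = list(args_dict)
    combos = product(*(args_dict[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def wrapper(learning_rate, dropout):
    """Stand-in for the training code.

    Real TensorFlow code would go here and return a metric to maximize.
    The made-up score below just lets the sketch run without TensorFlow.
    """
    return -abs(learning_rate - 0.005) - abs(dropout - 0.4)


def run_experiments(fn, args_dict):
    """Call fn once per hyperparameter combination; return the best run."""
    results = [(params, fn(**params)) for params in grid(args_dict)]
    return max(results, key=lambda result: result[1])


args_dict = {"learning_rate": [0.001, 0.005], "dropout": [0.4, 0.6]}
best_params, best_score = run_experiments(wrapper, args_dict)
```

With two values for each of two hyperparameters, the wrapper is invoked four times, and each invocation is independent of the others, which is what makes this kind of search easy to parallelize.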
TensorBoard support can also be added with little extra code, enabling the user to debug her application with TensorBoard from the Hopsworks UI.
Spark is supported for both batch applications and streaming applications. There is extensive API support for launching Spark Jobs in both the Python API and the Java/Scala API. Hops’ APIs can make programming Spark applications much easier, by hiding information about the location of services (what IP are the Kafka brokers running on?), transparently distributing TLS certificates to executors, and providing configuration files for services (like Kafka):
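To make the idea concrete, here is a sketch of the kind of configuration such an API can assemble on the programmer's behalf. The `ProjectContext` class and `kafka_client_config` function are hypothetical; the property names are the standard Kafka client settings for a TLS-secured connection.

```python
from dataclasses import dataclass


@dataclass
class ProjectContext:
    """Hypothetical container for values the platform would supply."""

    broker_endpoints: list
    keystore_path: str
    truststore_path: str
    store_password: str


def kafka_client_config(ctx: ProjectContext) -> dict:
    """Build Kafka client properties for a TLS-secured connection."""
    return {
        "bootstrap.servers": ",".join(ctx.broker_endpoints),
        "security.protocol": "SSL",
        "ssl.keystore.location": ctx.keystore_path,
        "ssl.keystore.password": ctx.store_password,
        "ssl.truststore.location": ctx.truststore_path,
        "ssl.truststore.password": ctx.store_password,
    }
```

The application code never hard-codes a broker address or a certificate path; it asks for a configuration and passes the result to its Kafka producer or consumer.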
First Class Streaming Support
Now you don’t need to choose between Spark Streaming and Flink when you decide on a data platform. Hopsworks supports both Flink and Spark Streaming, through YARN. We also support Kafka-as-a-Service. Kafka topics (used as channels for producing/consuming messages) are private to projects. Users can create a topic with just a few clicks. Topics can also be shared with other projects enabling real-time communication between projects. Hopsworks also provides an Avro schema registry for Kafka topics that is accessible via Hopsworks’ REST API.
As an example of how a streaming application can be managed with Hopsworks, consider data arriving at a Kafka topic as a stream (the external application uses a certificate downloaded from Hopsworks to communicate securely with the topic and to authenticate itself). The Engineering project processes the stream, filtering and enriching the data before forwarding it to different Kafka topics for different projects, and also sends it to a sink, such as Apache Hive, for offline analysis. The FX team can then process the stream arriving at their FX topic without further help from IT, and manage access to that data themselves.
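The filter, enrich, and route steps performed by the Engineering project can be sketched in plain Python. This is not Spark Streaming or Flink code, and the event fields (`desk`, `price`, `quantity`) and topic names are invented for the example; in-memory lists stand in for the Kafka topics and the Hive sink.

```python
def enrich(event):
    """Add a derived field: the notional value of the trade."""
    return {**event, "notional": event["price"] * event["quantity"]}


def route(event):
    """Choose the destination topic from the event's desk."""
    return f"{event['desk']}-topic"


def process(stream):
    """Filter, enrich, and route events.

    Returns the per-topic output and the list of events sent to the sink.
    """
    topics, sink = {}, []
    for event in stream:
        if event.get("price") is None:  # drop incomplete events
            continue
        enriched = enrich(event)
        topics.setdefault(route(enriched), []).append(enriched)
        sink.append(enriched)  # every kept event also goes to the sink
    return topics, sink
```

Each downstream project then reads only its own topic, which is what lets a team such as FX work with its stream independently.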
Hopsworks provides UI and REST support for running Jobs. A job could be a Spark application or a Spark workflow, a TensorFlow application (parallel experimentation or distributed training), or a Flink application. Jobs can be scheduled for periodic execution or run on-demand. Spark Jobs are monitored using a Grafana UI that provides information about Spark, HDFS, and YARN resource consumption. Spark and TensorFlow Job logs can be accessed in real-time via a Kibana UI. All jobs (Spark, TensorFlow, Flink) have their logs aggregated by YARN, and these can be read after the job completes from a “Logs” dataset in the project.
Jobs can be scheduled or run from a Jobs UI that also provides access to the job logs, the Spark History Server UI, the Kibana logs for the Job, and the Grafana metrics for the Job. The Jobs UI is also used to run workflows. Hopsworks provides a REST API for composing Spark Jobs into a workflow, which can then be run as a single Spark Job.
Spark Jobs can use log4j to write logs in real-time to Elasticsearch. These logs can be viewed in real-time with Kibana from the Jobs UI. Hopsworks also makes the complete application logs available in a Logs dataset, but they are only available after the Job has completed.
Spark Jobs provide a metrics.properties file to write performance statistics to InfluxDB, and those statistics can be viewed in real-time with Grafana from the Jobs UI. This information is useful for performance debugging of Spark apps.
Each project in Hopsworks can have its own Apache Hive database, by enabling the Hive service for that project. By default, only members of the project can read/write from the Hive database. Hive databases can also be shared with other Projects, similar to how datasets are shared between Projects. Hopsworks supports both Hive-on-Tez and Hive/LLAP.
Hopsworks supports integration with Active Directory and LDAP servers. Users authenticate with Hopsworks using a Kerberos keytab or LDAP credentials. The user then receives a JWT that is used to communicate over TLS with the REST API (or via the browser and the Hopsworks UI). Hopsworks also supports native password-based authentication and 2-Factor authentication using either smartphones (Google Authenticator) or YubiKey (for more secure environments).
Hopsworks supports Jupyter Notebooks, where each projectUser can launch his/her own Jupyter Notebook server. Jupyter notebooks are stored, by default, in the “Jupyter” dataset in HopsFS, but notebooks can be opened from any dataset within the Project. We support the sparkmagic Jupyter kernel, which we use to run both Spark and TensorFlow applications (Flink is not supported by Jupyter).
Hopsworks supports Zeppelin as a Notebook. There is a single Zeppelin Server for Hopsworks, and each project can start its own interpreter. Interpreters are shared between all users in a project, making Zeppelin suitable for collaborative work on notebooks between different users. Zeppelin notebooks are stored in HopsFS and available from the “Notebooks” dataset.
Hopsworks provides a REST API for authentication, running Spark jobs, producing to Kafka topics, training and deploying TensorFlow applications, and performing all the functions that are available in the Hopsworks User Interface. This enables Hopsworks to be an embedded platform that is included as a component in other systems: truly, Big-Data-and-AI-as-a-Service. External clients can either download a certificate for accessing the REST API or access it using JWTs. All communication with the REST API is encrypted over TLS/SSL connections. Hopsworks also has a library for Android devices, allowing them to seamlessly produce data in real-time to Kafka and call inferencing functions on trained TensorFlow models.
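As a rough sketch of what a token-authenticated call from an external client could look like, the helper below builds (but does not send) an HTTPS request. The `build_request` function is hypothetical, and the host name, the path, and the use of a `Bearer` authorization scheme are assumptions for illustration; the actual endpoints are defined by the Hopsworks REST API.

```python
import json
import urllib.request


def build_request(base_url, path, jwt, payload=None):
    """Build an HTTPS request carrying a JWT; the request is not sent."""
    if not base_url.startswith("https://"):
        raise ValueError("the REST API should only be reached over TLS")
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(base_url.rstrip("/") + path, data=data)
    # Assumed scheme: the token is sent as a bearer token on every call.
    request.add_header("Authorization", f"Bearer {jwt}")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request
```

A client would pass the returned object to `urllib.request.urlopen` to perform the call. Refusing plain `http://` URLs keeps the token from ever travelling unencrypted.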
In Hopsworks, extended metadata is built into the platform. The Hopsworks UI supports free-text search for data assets using Elasticsearch. Hopsworks also provides a metadata designer tool to design or import a schema with which to describe data assets. Non-technical users can then curate data through our user interface, while the extended metadata is automatically exported to Elasticsearch (using our own ePipe system). For users, Hopsworks provides a simple unified interface for managing data assets. Hopsworks also has built-in auditing capabilities, supported by the default open-source web application server shipped with Hopsworks, Payara.
Metadata schemas can be designed with an intuitive UI.
Metadata can be added to data assets from the UI by attaching metadata schemas to them and editing the metadata in the UI. The data asset can then be discovered using free-text search (using an Elasticsearch backend).
Free-text search for data assets is supported. Public datasets are discoverable from the landing page, but search within a project can be used to discover data assets only within the scope of that Project.
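The difference between global and project-scoped search can be illustrated with an Elasticsearch query body. The query structure (`bool`, `multi_match`, `term`) is standard Elasticsearch query DSL, but the `search_body` helper and the field names (`name`, `description`, `metadata.*`, `project`) are assumptions; the index layout Hopsworks actually uses is not described here.

```python
def search_body(text, project=None):
    """Build an Elasticsearch query body for free-text asset search.

    When a project is given, results are restricted to that project.
    """
    query = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": text,
                        "fields": ["name", "description", "metadata.*"],
                    }
                }
            ]
        }
    }
    if project is not None:
        # A filter clause narrows the results without affecting scoring.
        query["bool"]["filter"] = [{"term": {"project": project}}]
    return {"query": query}
```

Calling `search_body("trades")` searches across all assets, while `search_body("trades", project="fx")` limits the results to the `fx` project.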
Hopsworks is a new data platform that makes working with large-scale data processing platforms easier for both humans, via an integrated UI, and external devices and applications, via an integrated REST API. Hopsworks is built on the new abstractions of Projects, ProjectUsers, and Datasets and it is enabled by a new distributed metadata architecture in Hops Hadoop.
- Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, Ismail et al, ICDCS 2017.