This video goes into depth on the scale-out metadata architecture in Hopsworks and how it is used to enable Implicit Provenance for Machine Learning Pipelines. It was a Guest Lecture at Boston University in October 2020, on a course taught by John (Ioannis) Liagos.
Distributed deep learning offers many benefits: faster training of models using more GPUs, parallelizing hyperparameter tuning over many GPUs, and parallelizing ablation studies to help understand the behaviour and performance of deep neural networks. With Spark 3.0, GPUs are coming to executors in Spark, and distributed deep learning using PySpark is now possible. However, PySpark presents challenges for iterative model development: code typically starts on development machines (laptops) and must then be rewritten to run in cluster-based environments.
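As a rough illustration of the "parallelizing hyperparameter tuning" idea, the sketch below evaluates a small grid of hyperparameter combinations concurrently. It is a minimal sketch under stated assumptions, not the approach presented in the talk: `train_and_score` is a hypothetical stand-in for a real training run, and a local thread pool takes the place of GPUs or Spark executors.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(learning_rate, batch_size):
    """Hypothetical stand-in for a training run; returns a validation loss."""
    # Toy objective with its minimum at learning_rate=0.01, batch_size=64.
    return (learning_rate - 0.01) ** 2 + ((batch_size - 64) / 1000) ** 2


def parallel_grid_search(learning_rates, batch_sizes, max_workers=4):
    """Evaluate every combination concurrently; return (best_params, best_loss)."""
    trials = list(product(learning_rates, batch_sizes))
    # Each trial is independent, so the trials can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(lambda t: train_and_score(*t), trials))
    best_loss, best_params = min(zip(losses, trials))
    return best_params, best_loss


best_params, best_loss = parallel_grid_search([0.1, 0.01, 0.001], [32, 64, 128])
print(best_params, best_loss)
```

Because the trials share no state, the same function could in principle be handed to a cluster scheduler instead of a thread pool; only the executor would change.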
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain-specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general-purpose Feature Store needs a general-purpose feature engineering, feature selection, and feature transformation platform.
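The register / discover / use cycle described above can be sketched with a toy in-memory registry. This is an assumption-laden illustration, not the Hopsworks Feature Store API: the `FeatureRegistry` class, its methods, and the example features are all hypothetical. The point it shows is that training and inference call the same registered transformation, which is what keeps features consistent between the two.

```python
import math


class FeatureRegistry:
    """Hypothetical in-memory registry; a real Feature Store persists features and metadata."""

    def __init__(self):
        self._features = {}

    def register(self, name, transform, description=""):
        self._features[name] = {"transform": transform, "description": description}

    def discover(self, keyword):
        """Return names of features whose name or description mentions the keyword."""
        keyword = keyword.lower()
        return sorted(
            name for name, meta in self._features.items()
            if keyword in name.lower() or keyword in meta["description"].lower()
        )

    def compute(self, names, record):
        """Apply the registered transformations to one raw record."""
        return [self._features[n]["transform"](record) for n in names]


registry = FeatureRegistry()
registry.register("log_amount", lambda r: math.log1p(r["amount"]),
                  "log-scaled transaction amount")
registry.register("is_weekend", lambda r: int(r["weekday"] >= 5),
                  "1 if the transaction was on a weekend")

# Training and serving both go through the registry, so they share one definition.
feature_names = registry.discover("transaction")
training_row = registry.compute(feature_names, {"amount": 0.0, "weekday": 6})
serving_row = registry.compute(feature_names, {"amount": 0.0, "weekday": 6})
```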
HopsFS is the first production-grade distributed hierarchical filesystem to store metadata normalized in an in-memory, shared-nothing database. In this talk, we present how HopsFS reached a performance milestone by scaling the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. We discuss the challenges in building secure multi-tenant streaming applications on YARN that are metered and easy to debug. We also show how users in Hopsworks use the ELK stack for logging their running streaming applications as well as how they use Grafana and Graphite for monitoring.
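To make "metadata normalized in a database" concrete, the sketch below stores a directory tree as rows in an inode table and resolves a path one component at a time. It is only an illustration under stated assumptions: SQLite's in-memory mode stands in for the in-memory, shared-nothing database, and the `inodes` schema and the `create` / `resolve` helpers are hypothetical, not the actual HopsFS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per file or directory; (parent_id, name) identifies a child uniquely.
conn.execute(
    "CREATE TABLE inodes ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " is_dir INTEGER NOT NULL,"
    " UNIQUE (parent_id, name))"
)
ROOT_ID = 1
conn.execute("INSERT INTO inodes VALUES (?, 0, '', 1)", (ROOT_ID,))


def create(parent_id, name, is_dir):
    """Insert a child inode and return its id."""
    cur = conn.execute(
        "INSERT INTO inodes (parent_id, name, is_dir) VALUES (?, ?, ?)",
        (parent_id, name, int(is_dir)),
    )
    return cur.lastrowid


def resolve(path):
    """Walk the path one component at a time; return the inode id or None."""
    inode_id = ROOT_ID
    for component in filter(None, path.split("/")):
        row = conn.execute(
            "SELECT id FROM inodes WHERE parent_id = ? AND name = ?",
            (inode_id, component),
        ).fetchone()
        if row is None:
            return None
        inode_id = row[0]
    return inode_id


projects = create(ROOT_ID, "projects", True)
demo = create(projects, "demo", True)
data_file = create(demo, "data.csv", False)
```

Each path component becomes an indexed lookup on `(parent_id, name)`, so metadata operations turn into ordinary database queries rather than lookups in a single server's heap.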
Enterprises and non-profit organizations often work with sensitive business or personal information that must be stored in an encrypted form due to corporate confidentiality requirements, the new GDPR regulations, and other reasons. Unfortunately, straightforward encryption doesn't work well for modern columnar data formats, such as Apache Parquet, that are leveraged by Spark for acceleration of data ingest and processing. When Parquet files are bulk-encrypted at the storage layer, their internal modules can't be extracted, leading to a loss of column / row filtering capabilities and a significant slowdown of Spark workloads. Existing solutions suffer from either performance or security drawbacks.
Apache Beam is a key technology for building scalable end-to-end ML pipelines, as it is the data preparation and model analysis engine for TensorFlow Extended (TFX), a framework for horizontally scalable Machine Learning (ML) pipelines based on TensorFlow. In this talk, we present TFX on Hopsworks, a fully open-source platform for running TFX pipelines on any cloud or on-premises. Hopsworks is a project-based multi-tenant platform for both data parallel programming and horizontally scalable machine learning pipelines.
"Methods that scale with computation are the future of AI", Richard Sutton, father of reinforcement learning. Large labelled training datasets were only one of the key pillars of the deep learning revolution; the widespread availability of GPU compute was the other. The next phase of deep learning is the widespread availability of distributed GPU compute. As data volumes increase, GPU clusters will be needed for the new distributed methods that already produce the state-of-the-art results for ImageNet and Cifar-10, such as neural architecture search