MSc projects 2018/2019
Auto-ML and Neural Architecture Search for TensorFlow
This project will involve building support for automated machine learning for TensorFlow on the Hops platform. We will start by defining a feature store, a central location for storing training data that can be used, with minimal pre-processing, to train deep neural networks. Given the availability of training data, we will then enable users to compose neural networks using an Experiment/Estimator/Evaluator model. The main thrust of the research will involve neural architecture search, where we define and implement algorithms to automatically search for good architectures and good hyperparameters. One technique we will investigate is Bayesian Parameter Optimization, which proposes the next experiment by modelling the results of previous experiments (and favoring novelty). Finally, we will automate the deployment of good models to an inferencing platform, from where the models can be consumed.
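As a rough illustration of how Bayesian optimization can propose the next experiment, the sketch below fits a small Gaussian process to the results of previous experiments and picks the candidate with the highest upper confidence bound (predicted score plus a bonus for uncertainty, which is where the preference for novelty comes from). Everything here is an assumption made for illustration: a single scalar hyperparameter, the RBF kernel and its length scale, the `kappa` exploration weight, and all function names. It is not the method the project will necessarily adopt.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(gram, ys)       # (K + noise*I)^-1 y
    v = solve(gram, k_star)       # (K + noise*I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

def suggest_next(xs, ys, candidates, kappa=2.0):
    """Return the candidate maximizing mean + kappa * std (upper confidence bound)."""
    def ucb(x):
        mean, std = gp_posterior(xs, ys, x)
        return mean + kappa * std
    return max(candidates, key=ucb)
```

Here `xs` would hold the hyperparameter values already tried and `ys` the scores they achieved (higher is better); calling `suggest_next(xs, ys, candidates)` after each finished experiment yields the next value to try.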
Hudi for HiveOnHops
Hudi, an open-source framework developed by Uber, unifies real-time queries and historical data in columnar storage. Currently, Hudi only works for Spark/Presto/Parquet. In this project, we will extend Hudi to work with the world’s most popular SQL-on-Hadoop engine, Apache Hive (3x more popular than Impala and SparkSQL; see the DB-Engines ranking). This work will be performed with the Hops team, who already have Hive-on-Hops and fast small-file performance in HopsFS (world’s fastest Hadoop platform).
Low Latency Inferencing for TensorFlow on Hopsworks
In this project, you will design and implement a prediction serving system that sits between user-facing applications and a wide range of commonly used machine learning models and frameworks, such as TensorFlow and Spark. You will investigate techniques for improving throughput and ensuring reliable millisecond latencies by introducing adaptive batching, caching, and straggler mitigation techniques. You will investigate bandit and ensemble methods to intelligently select and combine predictions and achieve real-time personalization across machine learning frameworks. You will also work on managing models using metadata and on providing search support for discovering models.
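As a minimal sketch of what adaptive batching could look like, the controller below adjusts the batch size with an additive-increase/multiplicative-decrease (AIMD) rule: it grows the batch while the observed latency stays within the latency objective and shrinks it sharply when the objective is missed. The choice of AIMD, the class name `AimdBatchSizer`, and all parameter values are assumptions for illustration; the project may well settle on a different policy.

```python
class AimdBatchSizer:
    """Hypothetical AIMD controller for the inference batch size."""

    def __init__(self, slo_ms, initial=1, step=2, backoff=0.5, max_size=512):
        self.slo_ms = slo_ms      # latency objective per batch, in milliseconds
        self.size = initial       # current batch size
        self.step = step          # additive increase when under the objective
        self.backoff = backoff    # multiplicative decrease when over it
        self.max_size = max_size

    def update(self, observed_latency_ms):
        """Feed back the latency of the last batch and return the next batch size."""
        if observed_latency_ms <= self.slo_ms:
            self.size = min(self.max_size, self.size + self.step)
        else:
            self.size = max(1, int(self.size * self.backoff))
        return self.size
```

A serving loop would call `update` with the measured latency after each batch and use the returned size for the next one, so throughput rises until the latency objective starts to bind.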
Git DFS Backend with HopsFS
Git is a popular open-source source control management (SCM) system. The default implementation of Git stores data on a local filesystem. However, very large codebases, such as those at Google and GitHub, need a scale-out filesystem at the backend to store the code. As far as we know, only Google Ketch, GitHub’s DGit, and Palantir have DFS backends for Git. However, all of these backends are closed-source.
HopsFS is the world’s most scalable implementation of the Hadoop Filesystem, and a recent feature (storing small files in the database rather than on block servers (datanodes)) means it can reduce the latency of small-file reads and writes by an order of magnitude (with reads/writes taking ~O(10ms)). Existing distributed filesystem backends for Git use two storage engines: a key-value store for mutable data, and a distributed filesystem for the immutable data. We propose using HopsFS and its small file support for both mutable data (refs and packfiles) and immutable data, simplifying the implementation and operation. This would be a world first for distributed Git backends.
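To make the small-file storage pattern concrete: Git addresses each object by the SHA-1 of a short header plus the content, and stores loose objects (zlib-compressed) under a path derived from that hash, so a repository consists largely of many small, write-once files. The sketch below computes the object ID and loose-object path for a blob. The helper names are ours, and how such paths would map onto HopsFS is left open.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(object_id: str) -> str:
    """Path of a loose object relative to the .git directory."""
    return f"objects/{object_id[:2]}/{object_id[2:]}"
```

Because the path is a pure function of the content, an object file is never modified once written, which is why only refs and similar pointers need mutable storage in the two-engine designs described above.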
Scaling out Apache Hive
Apache Hive is the most popular SQL-on-Hadoop database. However, it has a query planning bottleneck, where a single MySQL (or Postgres) server is used to generate query plans. A recent attempt at replacing MySQL with HBase proved unsuccessful. In this thesis, we will design, implement, and evaluate a scale-out architecture for Hive query planning. We will investigate master-slave architectures, where read-only queries can be load balanced over slaves while writes are handled by a master, as well as synchronous approaches, such as multi-master replication.
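The master-slave option can be sketched as a small router that sends read-only statements to the slaves in round-robin order and everything else to the master. This is an illustration under assumptions only: the class name `MetastoreRouter`, the set of verbs treated as read-only, and the use of plain strings as server handles are all hypothetical, and no real Hive or MySQL API is used.

```python
import itertools

# Statement verbs we assume never modify metadata (an illustrative choice).
READ_ONLY_VERBS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}

class MetastoreRouter:
    """Hypothetical read/write splitter for a replicated metadata database."""

    def __init__(self, master, slaves):
        self.master = master
        # Fall back to the master if no slaves are configured.
        self._read_targets = itertools.cycle(slaves or [master])

    def route(self, statement):
        """Return the server that should execute the given SQL statement."""
        words = statement.split()
        verb = words[0].upper() if words else ""
        if verb in READ_ONLY_VERBS:
            return next(self._read_targets)
        return self.master
```

With asynchronous replication a slave may serve slightly stale metadata, which is one reason the synchronous multi-master alternative is worth evaluating alongside this design.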
Feature Store for Hopsworks
Dataprep is a service provided by Google (but developed by Trifacta), described as “an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis.” We want to design and build a similar service for Hopsworks. Some of the features we could investigate include automatic detection of schemas, datatypes, possible joins, and anomalies such as missing values, outliers, and duplicates.
This project will involve developing a domain-specific language (DSL) for describing dataprep operations, and then implementing the DSL on a platform like Apache Spark to execute the dataprep operations.
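To give a feel for what such a DSL might look like, the sketch below expresses a dataprep pipeline as a list of small, composable operations and runs it over in-memory rows. The operation names (`drop_missing`, `deduplicate`, `cast`) and the `run_pipeline` helper are invented for illustration; a real implementation would translate the same description into Spark transformations rather than Python list processing.

```python
def drop_missing(column):
    """Operation: remove rows where the column is absent, None, or empty."""
    return lambda rows: [r for r in rows if r.get(column) not in (None, "")]

def deduplicate():
    """Operation: keep the first occurrence of each identical row."""
    def op(rows):
        seen, unique = set(), []
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                unique.append(row)
        return unique
    return op

def cast(column, to_type):
    """Operation: convert a column to the given type, leaving other columns as-is."""
    return lambda rows: [{**r, column: to_type(r[column])} for r in rows]

def run_pipeline(rows, operations):
    """Apply the operations in order and return the cleaned rows."""
    for operation in operations:
        rows = operation(rows)
    return rows
```

A pipeline is then just data, for example `[drop_missing("age"), deduplicate(), cast("age", int)]`, which is what makes it possible to store, inspect, or compile it for a different execution engine.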
Hopsworks on Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery. Kubernetes is the de facto standard for elastic applications in the Cloud.
In this project, we will design and implement a Kubernetes system for stateless services in Hopsworks. We will start with TensorFlow Serving and examine how we can make it scale out and scale in as the request load changes. We will also investigate Jupyter notebooks and Hopsworks. In this project, you will gain deep experience in Kubernetes and Docker. It is expected that you have experience programming in Java and in using Linux or some Unix operating system.
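One way to reason about scaling out and in is the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas * current metric / target metric). The sketch below applies that rule to a per-replica load metric such as requests per second. The function name, the replica bounds, and the choice of metric are assumptions, and the sketch leaves out the tolerance and stabilization behaviour of the real autoscaler.

```python
import math

def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    current_load and target_load are per-replica averages of the same metric,
    for example requests per second handled by each serving instance.
    """
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))
```

When the measured load per replica is above the target the result grows (scale-out), and when it is below the target the result shrinks (scale-in), always within the configured bounds.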
Population-based Parameter Tuning for Distributed Deep Learning on TensorFlow
How can we massively speed up the training of Deep Neural Networks on TensorFlow? First, we have to find good hyperparameters, and that takes time. DeepMind showed a method to do this by supporting massive numbers of parallel experiments, and using population-based optimization methods to progressively improve hyperparameters. The approach massively outperforms grid search and random search methods for finding good hyperparameters. This would be the world’s first open-source implementation of the DeepMind work and, if successful, could become a part of the Hops open-source Python library.
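The core loop of a population-based method can be sketched in a few lines: after each round, the worst members of the population copy the hyperparameters of a better member (exploit) and then perturb them (explore). The code below is a toy under stated assumptions, not a reproduction of the DeepMind implementation: `toy_score` is a made-up stand-in for validation performance, only a single learning rate is tuned, and no model weights are trained or copied, which a real system would have to do.

```python
import random

def toy_score(lr):
    """Hypothetical stand-in for validation performance; best at lr = 0.1."""
    return -(lr - 0.1) ** 2

def pbt_step(population, rng, fraction=0.25, factors=(0.8, 1.2)):
    """One exploit/explore round over a list of {'lr': ..., 'score': ...} dicts."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        parent = rng.choice(top)
        # Exploit: adopt a better member's learning rate.
        # Explore: perturb it by a randomly chosen factor.
        member["lr"] = parent["lr"] * rng.choice(factors)
    for member in population:
        member["score"] = toy_score(member["lr"])
    return population

def run_pbt(size=20, rounds=30, seed=0):
    """Run the toy loop and return the best member found."""
    rng = random.Random(seed)
    population = [{"lr": 10 ** rng.uniform(-4, 0)} for _ in range(size)]
    for member in population:
        member["score"] = toy_score(member["lr"])
    for _ in range(rounds):
        pbt_step(population, rng)
    return max(population, key=lambda m: m["score"])
```

In a real deployment each member would be a training job running in parallel, and the step would be triggered periodically during training rather than after a closed-form score.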
Hopsworks on the (IoT) Edge
This project deals with providing IoT integration support for Hopsworks. That is, you will design and develop a framework for the ingestion of IoT data from Android devices and from IoT platforms (for example, via an MQTT gateway). You will work with stream processing technology in Hopsworks, probably Apache Spark or Apache Flink, and Apache Kafka for high-throughput reliable message ingestion.