
Introducing Hops Hadoop

tl;dr Hops is a new distribution of Apache Hadoop that uncorks Hadoop’s metadata bottleneck by introducing a scale-out distributed metadata layer. Hops can be scaled out at runtime by adding new nodes at both the metadata layer and the datanode layer. Hops also solves the small files problem for HDFS by (1) adding an order of magnitude more metadata capacity, and (2) storing small files in the database alongside their metadata. Hops builds its security on TLS certificates (not Kerberos) and supports GPUs-as-a-Resource (which Apache Hadoop does not yet). Hops is fully open-source.

The Problem

Hops started life as a research project to remove the metadata bottleneck in HDFS. HDFS’ throughput is limited to about 80K filesystem operations/sec on typical read-heavy industrial workloads, such as those at Yahoo! and Spotify. For a more write-intensive workload with 20% writes, throughput drops to around 20K operations/sec. The reason is a global filesystem lock taken for writes (multiple concurrent readers are supported). This global lock also hurts client latency at scale: with only a few thousand concurrent jobs, client latencies for metadata operations are ten times lower in Hops than in HDFS. To summarize, the architectural limitations of HDFS are:

  • Small, inflexible namespace, as metadata is stored in-memory on the JVM heap
  • Low throughput due to a concurrency model where metadata operations are protected using a single global lock (single-writer, multiple readers)

The root of the problem is HDFS’ use of a single Java process, the NameNode, as its metadata server. Due to JVM garbage-collection pauses, metadata is limited to a few hundred GB in size, and the lack of support for transactions means that a single global write lock is used to protect metadata from concurrent modification.
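To make the concurrency limitation concrete, here is a minimal Java sketch of the single-writer, multiple-reader pattern described above (the class and method names are illustrative, not the actual NameNode code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the concurrency pattern described above: all
// namespace mutations are serialized behind one process-wide write lock,
// while reads may proceed concurrently.
public class GlobalNamespaceLockSketch {
  private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock(true);

  public void readOp(Runnable op) {        // e.g. getFileInfo, listStatus
    nsLock.readLock().lock();              // many readers may hold this at once
    try { op.run(); } finally { nsLock.readLock().unlock(); }
  }

  public void writeOp(Runnable op) {       // e.g. mkdir, create, delete
    nsLock.writeLock().lock();             // only ONE writer across the whole namespace
    try { op.run(); } finally { nsLock.writeLock().unlock(); }
  }
}
```

Every write operation, on any file anywhere in the namespace, contends for that single write lock, which is why write-heavy workloads hit the throughput ceiling first.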

So, why did the Hadoop vendors not re-develop the NameNode as a distributed service? Firstly, it is a non-trivial problem: HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata in a scale-out distributed database. Taking on the project would require a large investment with high risk and an uncertain outcome. And if the project succeeded? Well, then you, as a company, would not reap all the rewards, as other vendors would benefit equally from your investment. It’s a classic tragedy of the commons, where Apache Hadoop is the commons. The Hops project, in contrast, started life as an open-ended research problem with no hard deadlines and no certainty of success.

HopsFS


HopsFS is a drop-in replacement for HDFS where the metadata management is distributed among stateless NameNodes and NDB database nodes (MySQL Cluster).

At KTH, RISE SICS, and Logical Clocks AB, we developed HopsFS to address the limitations of HDFS’ architecture by redesigning the metadata service as a scale-out metadata layer with no global write lock. The metadata is stored in an external in-memory distributed database, and multiple stateless NameNodes coordinate access to it. Our default database is the open-source MySQL Cluster (NDB), which can scale to tens of TBs of metadata across up to 48 database nodes. Because the metadata lives in a database, it can easily be extended, we can efficiently subscribe to a changelog of metadata updates, and we can guarantee the integrity and consistency of the metadata using transactions and foreign keys. In the next sections, we’ll see how extended metadata allows us to re-imagine Hadoop as a project-based, multi-tenant data platform.
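Because HopsFS is a drop-in replacement, existing applications talk to it through the unchanged HDFS client API. The sketch below uses only the standard Hadoop FileSystem API; the hostname and path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HopsFsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the standard HDFS client at a HopsFS NameNode instead of an
    // HDFS NameNode; "hopsfs-namenode" is a placeholder hostname.
    conf.set("fs.defaultFS", "hdfs://hopsfs-namenode:8020");

    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(p)) {
      out.writeUTF("written through the unmodified HDFS client API");
    }
    System.out.println("exists: " + fs.exists(p));
  }
}
```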

In work published at USENIX FAST 2017, and as winner of the IEEE Scale Challenge 2017, we showed some exciting results: HopsFS enables clusters that are an order of magnitude larger and higher-throughput than HDFS. Cluster capacity increases to at least 37 times that of HDFS, and in experiments based on a workload trace from Spotify, we showed that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.

HopsFS has from 16X (Spotify workload) to 35X (20% writes workload) the throughput of HDFS

Metadata Capacity in HopsFS is at least 35X that of Apache Hadoop, enabling billions of files.


HDFS Client latency is up to 10X lower in HopsFS compared to HDFS for an increasing number of concurrent clients (X-axis)

We have also removed another HDFS bottleneck, the block reporting protocol, reducing its overhead by up to 100X.

Fixing the Small Files Problem in HDFS

In follow-on work, we addressed another problem that has long been the bane of the HDFS community: small files. In many existing clusters, small files make up a disproportionately large fraction of all files (at Yahoo! and Spotify, roughly 20% of all files are under 1 KB in size, and around 33% are under 64 KB). Our solution is to move small files into the database, tiered by size: files up to 1 KB are stored in-memory in the database, and files between 1 KB and 64 KB are stored in on-disk columns in the database. We have been running the small files feature in production since October 2018, and the performance improvements are dramatic for some workloads. The graph below shows the throughput improvements for a classic machine learning workload (the 9m images dataset, an extension to the ImageNet dataset). Writing small files yields orders-of-magnitude greater performance, and we see a healthy 4.5X throughput increase for reading files. These experiments used only 6 database nodes, and we expect performance to improve linearly with more database nodes.

Size Matters: Improving the Performance of Small Files in Hadoop. Niazi, Ronström, Haridi, Dowling. ACM/IFIP Middleware 2018.
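As a rough illustration of the tiering policy described above, here is a hypothetical sketch; the size thresholds come from the text, but the names and storage targets are invented for illustration and are not the actual HopsFS internals:

```java
// Hypothetical illustration of the size-based tiering described above.
// Thresholds come from the text; everything else is invented for illustration.
public class SmallFileTieringSketch {
  static final long IN_MEMORY_LIMIT = 1024;            // <= 1 KB: in-memory table column
  static final long ON_DISK_COLUMN_LIMIT = 64 * 1024;  // <= 64 KB: on-disk table column

  enum Tier { DB_IN_MEMORY, DB_ON_DISK, DATANODE_BLOCKS }

  static Tier tierFor(long fileSizeBytes) {
    if (fileSizeBytes <= IN_MEMORY_LIMIT)     return Tier.DB_IN_MEMORY;   // stored with the metadata
    if (fileSizeBytes <= ON_DISK_COLUMN_LIMIT) return Tier.DB_ON_DISK;    // NDB on-disk column
    return Tier.DATANODE_BLOCKS;                                          // regular block storage on datanodes
  }
}
```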

Online Scale-Out of Distributed Metadata

HopsFS’ metadata layer can be scaled out without any downtime. Both NameNodes and MySQL Cluster database nodes can be added at runtime without requiring a system restart. This means that Hops clusters can grow in size with your data.

TLS Certificates, not Kerberos

When Hadoop designed its security model in 2009, there was a debate about whether to use Kerberos or TLS/SSL for authentication (HADOOP-4487). The Hadoop community chose Kerberos due to (1) better performance, and (2) simpler user management.

More recently, however, TLS certificates have become a popular choice, as they offer improved scalability (multiple intermediate CA servers versus a single Kerberos KDC) and more flexible user management (we use certificates to implement no-overhead dynamic roles in Hops). Companies such as Google, with the Google Cloud Platform (earlier internal versions of which were the inspiration for Hadoop), and Netflix, with BLESS, now build their security on certificates.

In Hops, we chose TLS certificates because concern (1), performance, is now addressed by our distributed metadata layer, and because (2) Kerberos’ simpler approach to user identity comes with trade-offs: no support for dynamic roles or attribute-based access control, both of which are needed for a multi-tenant data platform. For example, if you have a sensitive dataset and you want a user to only analyze that dataset (and not have privileges to copy the data elsewhere or cross-link it with other datasets), then Kerberos is not enough. In Hops, we solve this problem by supporting sandboxes (we call them projects) and managing user certificates for each sandbox, allowing us to implement dynamic roles with no overhead. In a future blog post, we will delve into the details, but as a quick overview, we built on earlier work that adds TLS/SSL support to Hadoop RPC (HADOOP-1386), and we have built platform support for certificate management, revocation, and rotation, and integrated it into the Hops Enterprise platform.
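As a rough idea of what certificate-based access looks like on the client side, here is a hypothetical sketch that uses only the standard Java TLS (JSSE) APIs to load a per-project keystore and truststore; the file locations and naming scheme are placeholders, not the Hopsworks certificate-management API:

```java
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class ProjectTlsSketch {
  // Build an SSLContext from a per-project keystore/truststore. The paths and
  // naming scheme are hypothetical placeholders; this only illustrates the
  // standard JSSE wiring for presenting a client certificate over TLS.
  static SSLContext contextFor(String project, char[] password) throws Exception {
    KeyStore keyStore = KeyStore.getInstance("JKS");
    try (FileInputStream in = new FileInputStream("/certs/" + project + "__kstore.jks")) {
      keyStore.load(in, password);
    }
    KeyManagerFactory kmf =
        KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
    kmf.init(keyStore, password);

    KeyStore trustStore = KeyStore.getInstance("JKS");
    try (FileInputStream in = new FileInputStream("/certs/" + project + "__tstore.jks")) {
      trustStore.load(in, password);
    }
    TrustManagerFactory tmf =
        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    tmf.init(trustStore);

    SSLContext ctx = SSLContext.getInstance("TLS");
    ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
    return ctx;
  }
}
```

Because each project gets its own certificate, revoking a single certificate is enough to revoke a user’s access to that project without touching any other credentials.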

HopsYARN

Similar to our work on HDFS, we have been researching moving YARN’s metadata from in-memory on a Java heap to the database. Our conclusion was that the metadata should stay in-memory during processing, but be persisted to the database for recovery (instead of ZooKeeper, the default in Apache Hadoop). We use the persisted metadata to provide a quota service for CPU/memory in Hopsworks. We can assign CPU quotas to projects, and the database collects usage statistics for applications so that usage is charged to projects. All usage data is collected in the same transactions that allocate and free containers in HopsYARN.
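To illustrate the idea of charging usage in the same transaction as the allocation, here is a hypothetical JDBC sketch; the table and column names are invented and do not reflect the actual HopsYARN schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class QuotaChargeSketch {
  // Hypothetical illustration of charging container usage to a project's quota
  // in the SAME transaction that records the container allocation.
  static void allocateAndCharge(Connection conn, String containerId,
                                String project, long vcoreSeconds) throws Exception {
    conn.setAutoCommit(false);
    try (PreparedStatement alloc = conn.prepareStatement(
             "INSERT INTO yarn_containers (container_id, project) VALUES (?, ?)");
         PreparedStatement charge = conn.prepareStatement(
             "UPDATE project_quota SET remaining = remaining - ? WHERE project = ?")) {
      alloc.setString(1, containerId);
      alloc.setString(2, project);
      alloc.executeUpdate();

      charge.setLong(1, vcoreSeconds);
      charge.setString(2, project);
      charge.executeUpdate();

      conn.commit();   // either both the allocation and the charge land, or neither
    } catch (Exception e) {
      conn.rollback();
      throw e;
    }
  }
}
```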

GPUs-as-a-Resource in HopsYARN

Hops supports GPUs as a requestable, schedulable resource in HopsYARN. Apache Hadoop (as of version 3.0) does not yet support GPUs. Currently, HopsYARN only supports Nvidia CUDA-based GPUs; Nvidia is the de facto GPU vendor for deep learning systems and is supported natively by TensorFlow via libraries such as cuDNN (the CUDA Deep Neural Network library) and NCCL2 (the NVIDIA Collective Communications Library).
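For a feel of what requesting GPUs alongside memory and vCores could look like, here is a hypothetical sketch built on the standard YARN AMRMClient API; the GPU field itself is left as a commented-out assumption, since we are not reproducing the exact HopsYARN client API here:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class GpuRequestSketch {
  public static void main(String[] args) {
    // Standard YARN container request: 4 GB of memory and 2 vCores.
    Resource res = Resource.newInstance(4096, 2);

    // Hypothetical: HopsYARN additionally lets the application ask for GPUs.
    // The exact setter is an assumption made for this sketch, not a confirmed API.
    // res.setGPUs(1);

    AMRMClient.ContainerRequest request =
        new AMRMClient.ContainerRequest(res, null, null, Priority.newInstance(1));
    System.out.println("would request: " + request);
  }
}
```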

Summary

Hops stands for Hadoop Open Platform-as-a-Service. We prefer the lowercase “Hops” moniker over the shouty all-uppercase HOPS; we feel it is more appropriate given our humble, understated Swedish origins. Having said that, we’re still proud enough to shout about our steely Viking ambition and stubbornness in remaking the Big Data ecosystem for scale-out data processing and deep learning. Follow us on our journey at @hopshadoop on Twitter, at www.hops.io, and at www.logicalclocks.com.

Categories: Hops