Published by dowlingj on

Stuffed Inodes in HopsFS using tiered storage with NDB

Mikael Rönstrom is the designer of MySQL Cluster (or NDB as we prefer to call it to ensure you don’t confuse this distributed, real-time in-memory database with InnoDB), and recently he wrote about both a new, faster checkpointing algorithm in the upcoming version 7.6 as well as support for on-disk data in NDB.

HopsFS now supports storing small files (by default, files under 64KB in size) in NDB in on-disk columns. Our solution is actually a tiered storage solution, where files under 1KB in size are stored in-memory in NDB. We have a research paper under review on the details of our approach and the performance results are very good. We observed up to 66X throughput improvements for writing small files and up 4.5X throughput improvements for reading small files. For latency, we saw operational latencies on Spotify’s Hadoop workload were 7.39 times lower for writing small files and 3.15 times lower for reading small files. For real-world datasets, like the 9m images dataset (an extended version of ImageNet) that is used to training deep neural networks (CNNs) as image classifiers, we saw 4.5X improvements for reading files and 5.9X improvements for writing files (images can be written during pre-processing to generate augmented data (warped, scaled) on non-GPU servers – this reduces CPU load on GPU servers during training).

We had a poster on our small files solution at Eurosys 2017. An interesting fact from there was the distribution of files sizes seen in large Hadoop clusters at Spotify and Yahoo. As you can see below, 20% of the files in the system are less than 4 kilobytes (KBs), and these files receive 20% of all the file system operations. As a result, HDFS clients experience poor performance for small files in moderately sized/loaded clusters, as the namenode becomes a bottleneck, increasing metadata processing latency.

In our under-submission paper, we discuss how we address problems of HDFS Compatibility: changes for handling small files should not break HopsFS’ compatibility with HDFS clients (which expect the data to reside on the datanodes). We also address the problem of migrating data between different storage types: when the size of a small file that is stored in the database exceeds some threshold then the file is reliably and safely moved to the HopsFS datanodes.

Real World Workload (9m Images)

So, how does HopsFS perform on real-world workloads. In the figure below, we show the performance of HopsFS in reading and writing files for the 9m images dataset. The results are good.

HopsFS can read and write files from the 9m images 4.5 and 5.9 times faster than HDFS, respectively.

Now, we show the distribution of file sizes for the 9m images dataset. 9m images is an extended version of the ImageNet dataset, widely used to benchmark image classification solutions. This dataset is particularly interesting as it contains both large and small files, based on our definition of a small file as being one smaller than 64 KB in size and a large file being larger than 64 KB in size.

File size distribution for the 9m Images dataset

Finally, you can already benefit from our small files solution for HopsFS that has been available since HopsFS 2.8.2, released in 2017. We have been running using small files since October 2017, and we are very happy with its stability in production.

Categories: HopsHopsFSNDB