No items found.

Jim Dowling

CEO and Co-Founder

Let's keep in touch!

Subscribe to our newsletter and receive the latest product updates, upcoming events, and industry news.

More Blogs

Build your Value Case for Feature Store Implementations

Job Scheduling & Orchestration using Hopsworks and Airflow

Build Your Own pdf.ai: Using both RAG and Fine-Tuning in one Platform

Build Vs Buy: For Machine Learning/AI Feature Stores

5-minute interview Abi Aryan

Article updated on

January 18, 2022

What is Distributed File Systems (DFS) and why you need it for Deep Learning

October 17, 2018

8 min

Read

Jim Dowling

CEO and Co-Founder

Hopsworks

TL;DR

When you train deep learning models with lots of high quality training data, you can beat state-of-the-art prediction models in a wide array of domains (image classification, voice recognition, and machine translation). Distributed file systems are becoming increasingly indispensable as a central store for training data, logs, model serving, and checkpoints. HopsFS is a great choice, as it has native support for the main Python frameworks for Data Science: Pandas, TensorFlow/Keras, PySpark, and Arrow.

What is DFS?

As the name suggests, Distributed File System (DFS) is a file system that is distributed across multiple platforms or locations. Using this technology, applications can access or store isolated files just like local files, allowing programmers to access files from any network or computer. The primary function of the Distributed File System (DFS) is to enable users of physically distributed systems to share resources and data. Distributed File System (DFS) has two components: Location transparency (using the namespace component) and Redundancy (using the file replication component). By allowing shares in several locations to be logically grouped under one folder, these components provide data availability in the event of failures or heavy loads.

Prediction Performance Improves Predictably with Dataset Size

Baidu showed that the improvement in prediction accuracy (or reduction in generalization error) for deep learning models was predictable based on the amount of training data. The decrease in generalization error with increasing training dataset size follows a power-law distribution(as seen by the straight lines in the log-log graph below). This astonishing result came from a large-scale study in the different application domains of machine translation, language modeling, image classification, and speech recognition. Given that this result holds true in vastly different application domains, there is a good chance the same result holds true for your particular application domain. This result is important for companies considering investing in distributed file systems (DFS) for deep learning - if it costs $X to collect or generate a new GB of high quality training data, you can predict the improvement of prediction accuracy for your model, given the slope, Y, of the log-log graph you have observed while training.

Learning curve and model size results and trends — Figure 1. Learning curve and model size results and trends for word language models - Baidu Research

Predictable ROI in the Power-Law Region

This predictable return-on-investment (ROI) for collecting/generating more training data is slightly more complex than the one described above. You first need to collect enough training data to get beyond the “Small Data Region” in the diagram below. That is, you can only make predictions if you have enough data that you are in the “Power-Law Region”.

Training Data Set — Figure 2. Sketch of power-law learning curves - Baidu Research

You can determine this by graphing the reduction in your generalization error as a function of your training data size on a log-log scale. After you start observing the straight line on your model, calculate the exponent of your power-law graph (the slope of the graph). Baidu’s empirically-collected learning curves showed exponents in the range [-0.35, -0.07] - suggesting models learn real-world data more slowly than suggested by theory (theoretical models indicate the power-law exponent is expected to be -0.5).

Still, if you observe the power-law region, increasing your training data set size will give you a predictable decrease in generalization error. For example, if you are training an image classifier for a self-driving vehicle, the number of hours your cars have driven autonomously determines your training data size. So, going from 2m hours to 6m hours of autonomous driving should reduce errors in your image classifier by a predictable amount. This is important in giving businesses a level of certainty in the improvements they can expect when making large investments in new data collection or generation.

Need for a Distributed Filesystem (DFS)

The TensorFlow team say a distributed filesystem is a must for deep learning. Datasets are getting larger, worker GPUs need to coordinate for model checkpointing, worker GPUs need to coordinate for hyperparameter optimization, and/or model-architecture search. Your system may grow beyond a single server, or you may have different servers for serving your models from the servers you have for training your models. A distributed file system (DFS) is the glue that holds together the different stages of your machine learning workflows, and it enables teams to share GPU hardware. What is important is that the distributed file system (DFS) works with your choice of programming language and deep learning framework(s).

A distributed file system (DFS) — Figure 3. A distributed file system (DFS) is needed for managing logs, tensorboard, coordinating GPUs for experiments, storing checkpoints during training, and storing/serving models.

HopsFS is a great choice as a distributed file system (DFS), due to it being a drop-in replacement for HDFS. HopsFS/HDFS are supported in major Python frameworks: Pandas, PySpark DataFrames, TensorFlow Data, and so on. In Hops, we provide built-in HopsFS/HDFS support with the pydoop library. HopsFS has one additional feature that is aimed at machine learning workloads: improved throughput and lower latency reading/writing for small files. In a peer reviewed paper at Middleware 2018, we showed throughput improvements of up to 66X compared to HDFS for small files.

HopsFS Additional Features — Figure 4. HopsFS Additional Features

‍

Python Support in Distributed File Systems (DFS)

As we can see from the table below, the choice of distributed file system will affect what you can do.

Table Python Support — Figure 5. Python Support in Distributed File Systems (DFS)

Python Support in HopsFS

We now give some simple examples of how to write Python code to use datasets in HopsFS. Complete and up to date notebooks can be found on our examples repository in github

Pandas with HopsFS

Code Snippet

In Pandas, the only change we need to make to our code, compared to a local filesystem, is to replace open_file(..) with h.open_file(..), where h is a file handle to HDFS/HopsFS.

PySpark with HopsFS

TensorFlow Datasets with HopsFS

Code Snippet

References

Deep Learning Scaling is Predictable, Empirically by Baidu https://arxiv.org/pdf/1712.00409.pdf

Morning Blog review by Adrian Colyer https://blog.acolyer.org/2018/03/28/deep-learning-scaling-is-predictable-empirically/

Distributed Deep Learning on Hops, Spark Summit Talk

Example Jupyter Notebook for working with files in HopsFS.

What is DFS (Distributed File System)?

Interested for more?

🤖 Register for free on Hopsworks Serverless
📚 Get your early copy: O'Reilly's 'Building Machine Learning Systems' book
🐍 Learn all about the Python-Centric Feature Store
🛠️ Explore all Hopsworks Integrations
🧩 Get started with codes and examples
⚖️ Compare other Feature Stores with Hopsworks

More blogs

Explore Job Scheduling and Orchestration in Hopsworks including how simple jobs can be scheduled through the Hopsworks UI by non-technical users.

Data Engineering

Job Scheduling & Orchestration using Hopsworks and Airflow

This article covers the different aspects of Job Scheduling in Hopsworks including how simple jobs can be scheduled through the Hopsworks UI by non-technical users

Ehsan Heydari

During our latest LLM Makerspace event we demonstrated how to build your own pdf.ai LLM application using RAG and fine-tuning in one platform.

Data Engineering

Build Your Own pdf.ai: Using both RAG and Fine-Tuning in one Platform

A summary from our LLM Makerspace event where we built our own pdf.ai using RAG and fine-tuning in one platform. Follow along the journey to build a LLM application from scratch.

Jim Dowling

When deciding on whether to build versus buy a feature store platform for AI/ML, there are technological, strategic and innovative components to consider.

Build Vs Buy: For Machine Learning/AI Feature Stores

On the decision of building versus buying a feature store there are strategic and technical components to consider as it impacts both cost and technological debt.

Rik Van Bruggen

PRODUCT

RESOURCES

COMPANY

JOIN OUR MAILING LIST

Subscribe to our newsletter and receive the latest product updates, upcoming events, and industry news.

© Hopsworks 2024. All rights reserved. Various trademarks held by their respective owners.

Terms and Conditions