Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

How to build ML models with fastai and Jupyter in Hopsworks

Robin Andersson

TLDR: Hopsworks is the Data-Intensive AI platform with a Feature Store for building complete end-to-end machine learning pipelines. This tutorial will show an overview of how to work with Jupyter on the platform and train a state-of-the-art ML model using the fastai python library. Hopsworks provides Jupyter as a service in the platform, including kernels for writing PySpark/Spark and pure Python code. With an intuitive service to install Python libraries covered in a previous blog and access to a Jupyter notebook, getting started with your favourite ML library requires little effort in Hopsworks.


Jupyter provides an integrated development environment (IDE) allowing users to seamlessly mix code with visualization and comments, making it not just handy for prototyping, but also for visualization and educational purposes. In recent years, it has become a favourite tool employed by the many, used for data wrangling, data mining, statistical modeling, visualization, machine learning and the notebooks themselves scheduled to run in production at tech leaders such as Netflix.

The Hopsworks platform ships with Jupyter as one of the integrated components. Having already been pre installed and configured to work with Spark and PySpark kernels, in addition to the Python kernel, getting started with writing your notebooks and scheduling them in production on a Hopsworks cluster is straightforward. The Hopsworks installation also includes a Miniconda environment with the most popular libraries you can find in a data scientists toolkit, such as TensorFlow, PyTorch and scikit-learn.

In this tutorial we will describe how to work with a Jupyter notebook in the Hopsworks platform. As an example, we will demonstrate how to install fastai, a library that provides high-level components for building and training ML models to get state-of-the-art deep learning results, clone a set of notebooks from a git repository and train a model on a P100 GPU.


To follow this tutorial you should have a Hopsworks instance running on You can register for free, without providing credit card information, and receive USD 4000 worth of free credits to get started. The only thing you need to do is to connect your cloud account.

The tutorial requires that you have configured the Hopsworks cluster with AKS or EKS depending on your cloud provider, and that the Kubernetes nodes are equipped with GPUs. 

Step 1: Install fastai in the Python service

The first step in the tutorial is to install the fastai library. To get started, navigate to the Python service to install the fastai library from PyPi as shown in the example below. There are many different approaches to installing the library but in this instance we install the latest version of the fastai and nbdev package from PyPi, required to run the first notebook in the fastai course. 

Step 2: Configuring and starting Jupyter

In the Jupyter service page there are three different modes that can be configured. 

Firstly, there is a Python tab, in which configuration for the Python kernel, such as Memory/Cores/GPUs is set and optionally a git repository can also be configured that should be cloned when Jupyter starts up. This is the kernel that we are going to use in this tutorial.

Secondly, in the Experiments tab the PySpark kernel is configured. If you want to enable all the features in the plattform regarding, experiment tracking, hyperparameter optimization, distributed training. See HopsML for more information on the Machine Learning pipeline.

Thirdly, for general purpose notebooks, select the Spark tab and run with Static or Dynamic Spark Executors on Spark or PySpark.

The image below shows the configuration options set for the Python kernel. As working with larger ML models can be memory intensive make sure you are configuring the Memory for the kernel to be at least 8GB, then set GPUs to 1 to allocate a GPU that should be accessible for the kernel and set the git configuration to clone the fastai git repository to get access to the notebooks.

Step 3: Start the Notebook Server

Once the configuration has been entered for the Python kernel, press the button on the top that says JupyterLab to start the Notebook Server. Keep in mind that it may take some time as resources need to be allocated for the Notebook Server and to clone the git repository. The image below demonstrates the process of starting Jupyter.

Step 4: Inspecting the GPU

The Jupyter Notebook Server will now have been allocated a GPU which you can use in the Python kernel. To check the type and specifications of the GPU, open a new terminal inside Jupyter and run nvidia-smi. We can see that in this instance we have access to a P100 NVIDIA GPU.

Step 5: Start using fastai by following the course material

Now you’re all set to start following the course material that fastai provides. To make sure the GPU is being utilized you can leave a terminal window open and run nvidia-smi -l 1, which will print out the GPU utilization every second while you are running the training in the notebook.

In the example below, the first notebook lesson1-pets.ipynb in the fastai course is executed.

Get started

Hopsworks is available both on AWS and Azure as a managed platform. Visit to try it out.

From 100 to ZeRO: PyTorch and DeepSpeed ZeRO on any Spark Cluster with Maggy

Moritz Meister

TLDR; Maggy is an open source framework that lets you write generic PyTorch training code (as if it is written to run on a single machine) and execute that training distributed across a GPU cluster. Maggy enables you to write and debug PyTorch code on your local machine, and then run the same code at scale without having to change a single line in your program. Going even further, Maggy provides the distribution transparent use of ZeRO, a sharded optimizer recently proposed by Microsoft. You can use ZeRO to improve your memory efficiency with a single change in your Maggy configuration. You can try Maggy in the Hopsworks managed platform for free.

Distributed learning - An introduction

Deep learning has seen a surge in activity with the availability of high level frameworks such as PyTorch to build and train models. A few lines of code in a notebook are sufficient to create powerful classifiers from scratch. However, both the data and model sizes to achieve state-of-the-art performance are ever increasing, so that training on your local GPU becomes a hopeless endeavour. 

Enter distributed training. Distributed training allows you to train the same model on multiple GPUs on different shards of your data to speed up training times. In the ideal case, training on 4 GPUs simultaneously should reduce your training time by 75%. In distributed training, each GPU computes a forward and backward pass over its own batch of the data. For the model update, the computed gradients are shared and averaged between the nodes. This way, all models update their parameters with the same combined gradient and stay in sync. This additional communication step introduces additional overhead of course, which is why ideal scaling is never truly achieved. 

So if distributed training is such a great tool to accelerate training, why is its use still uncommon among normal PyTorch users? Because it is too tedious to use! A dummy example for starting distributed training might look something like this.

def train(args):
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = ''
    os.environ['MASTER_PORT'] = '8888'
    mp.spawn(train, nprocs=args.gpus, args=(args,))
    rank = * args.gpus + gpus
    model = args.Module()
    model = nn.parallel.DistributedDataParallel(

Going even further, you would need to launch your code on all of your nodes and take care of graceful shutdowns and collecting the results. This is where Maggy comes in. Maggy allows you to launch your PyTorch training script without any changes on Spark clusters. It takes care of the training processes for each node, the resource isolation and node connections.

Next we will explore what is needed to run distribution transparent training on Maggy as well as the restrictions that still exist with the framework.

Building blocks for distribution transparent training

Configuring Maggy

First of all, Maggy requires its experiment to be configured for distributed training. In the most common use case this means passing your model, hyperparameters and your training/test set. Configuring is as easy as creating a config object. Hyperparameters, train and test set are optional and can also be directly loaded in the training loop. If your training loop consists of more than one module such as in training GANs with a Generator and Discriminator or Policy gradient methods in RL, you can also pass a list of modules.

from maggy.experiment_config import TorchDistributedConfig

config = TorchDistributedConfig(module=models.resnet50,train_set=train_set, test_set=test_set)

Writing the training function

Maggy’s API requires the training function to follow a unified signature. Users have to pass their model class, its hyperparameters and the train and test set to the training function. 

def train(module, hparams, train_set, test_set):

If you want to load your datasets on each node by yourself, you can also omit passing the datasets in the config. In fact, this is highly recommended when working with larger dataset objects. Additionally, every module used in the training function should be imported within that function. Think of your training function as completely self contained. Last but not least, users should use the PyTorch DataLoader (as is best practice anyways). Alternatively, you can also use Maggy’s custom PetastormDataLoader to load large datasets from Petastorm parquet files. When using the latter, users need to ensure that datasets are even, that is they should have the same number of batches per epoch on all nodes. When using PyTorch’s DataLoader, you do not have to care about this. So to summarize, your training function needs to

  • Implement the correct signature
  • Import all used modules inside the function
  • Use the PyTorch DataLoader (or Maggy’s PetastormDataLoader with even Datasets)

Distributed training on Maggy - A complete example

It’s time to combine all the elements we introduced so far in a complete example of distributed training with Maggy. In this example, we are going to create some arbitrary training data, define a function approximator for scalar fields,write our training loop and launch the distributed training. 

Generate some training data

In order to not rely on specific datasets, we are going to create our own dataset. For this example, a scalar field should suffice. So first of all we randomly sample x and y and compute some function we want our neural network to approximate. PyTorch’s TensorDataset can then be used to form a proper dataset from this data.

import torch
import torch.nn as nn
import torch.nn.functional as F

coord = torch.rand((10000,2)) * 10 - 5  # Create random x/y coordinates in [-5,5]
z = torch.sin(coord[:,1]) + torch.cos(coord[:,0])  # Calculate scalar field for all points to get a dataset
train_set =[:8000,:], z[:8000])
test_set =[8000:,:], z[8000:])

Define the approximator

Next up we define our function approximator. For our example a standard neural network with 3 layers suffices, although in real applications you would of course train much larger networks.

class Approximator(torch.nn.Module):

    def __init__(self):
        self.l1 = torch.nn.Linear(2,100)
        self.l2 = torch.nn.Linear(100,100)
        self.l3 = torch.nn.Linear(100,1)
    def forward(self, x):

Writing the training loop

At the heart of every PyTorch program lies the training loop. Following the APIs introduced earlier, we define our training function as follows.

def train(module, hparams, train_set, test_set):
    import torch
    model = module()

    n_epochs = 100
    batch_size = 64
    lr = 1e-5    
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_criterion = torch.nn.MSELoss()
    train_loader =, batch_size=batch_size)
    test_loader =, batch_size=batch_size)    
    def eval_model():
        loss = 0
        for coord, z in test_loader:
            prediction = model(coord).detach()
            loss += loss_criterion(prediction, z)
        return loss

    for epoch in range(n_epochs):
        for coord, z in train_loader:
            prediction = model(coord)
            loss = loss_criterion(prediction, z)
    return eval_model()

As you can see, there is no additional code for distributed training. Maggy takes care of all the necessary things. 

Starting the training

All that remains now is to configure Maggy and run our training. For this, we have to create the config object and run the lagom function.

from maggy import experiment
from maggy.experiment_config import TorchDistributedConfig

config = TorchDistributedConfig(module=Approximator, train_set = train_set, test_set=test_set)
experiment.lagom(train, config)  # Starts the training loop

Evaluating the training

After running the training on 4 nodes, we can see that our approximator has converged to a good estimate of our scalar field. Of course, this would also be possible on a local node. But with more complex models and larger training sets such as the ImageNet dataset, distributed learning becomes necessary to leverage your workloads.

Try it for yourself

Maggy is open-source and documentation is available at Give us a star or get in touch if you have more questions. Maggy is also available for all Hopsworks users in the managed platform on AWS or Azure. You can get started for free (no credit card required).

Star us on Github
Follow us on Twitter

Welcoming AMD/ROCm to Hopsworks

Robin Andersson

With Hopsworks 1.0, we have now added support for AMD GPUs with ROCm. This enables you to take your TensorFlow programs and run them, unchanged, on Hopsworks. RadeonOpenCompute (ROCm) is an open-source framework for GPU computing that supports multi-GPU computing to scale out training and reduce the time needed to train models.

ROCm is signficant for data scientists as, until now, they have had a lack of choice in GPU hardware when training models. With the recent upstreaming of ROCm changes to TensorFlow, ROCm is now a first-class citizen in the TensorFlow ecosystem. This enable Enterprise AI platforms, such as Hopsworks, to allow TensorFlow applications to run, unchanged, on AMD GPU hardware. To further enable deep learning on many GPUs in a cluster, Logical Clocks have also added support for resource scheduling of AMD GPUs in Hopsworks clusters, with Hops YARN. With Hopsworks and AMD GPUs, developers can now train deep learning models much faster on frameworks like TensorFlow using tens or hundreds of GPUs in parallel.

To learn more, read our whitepaper ROCm in Hopsworks, see our talk and demo from the Databricks Summit 2019 or the O'Reilly AI Conference 2019.

Hopsworks adds support for AMD GPUs by adding ROCm support to YARN, Hopsworks' resource scheduler.

Why you need a Distributed Filesystem for Deep Learning

Jim Dowling

tl;dr When you train deep learning models with lots of high quality training data, you can beat state-of-the-art prediction models in a wide array of domains (image classification, voice recognition, and machine translation). Distributed filesystems are becoming increasingly indispensable as a central store for training data, logs, model serving, and checkpoints. HopsFS is a great choice, as it has native support for the main Python frameworks for Data Science: Pandas, TensorFlow/Keras, PySpark, and Arrow.

Prediction Performance Improves Predictably with Dataset Size

Baidu showed that the improvement in prediction accuracy (or reduction in generalization error) for deep learning models was predictable based on the amount of training data. The decrease in generalization error with increasing training dataset size follows a power-law distribution(as seen by the straight lines in the log-log graph below). This astonishing result came from a large-scale study in the different application domains of machine translation, language modeling, image classification, and speech recognition. Given that this result holds true in vastly different application domains, there is a good chance the same result holds true for your particular application domain. This result is important for companies considering investing in deep learning – if it costs $X to collect or generate a new GB of high quality training data, you can predict the improvement of prediction accuracy for your model, given the slope, Y, of the log-log graph you have observed while training.

[Baidu Research ]

Predictable ROI in the Power-Law Region

This predictable return-on-investment (ROI) for collecting/generating more training data is slightly more complex that the one described above. You first need to collect enough training data to get beyond the “Small Data Region” in the diagram below. That is, you can only make predictions if you have enough data that you are in the “Power-Law Region”.

[Baidu18 ]

You can determine this by graphing the reduction in your generalization error as a function of your training data size on a log-log scale. After you start observing the straight line on your model, calculate the exponent of your power-law graph (the slope of the graph). Baidu’s empirically-collected learning curves showed exponents in the range [-0.35, -0.07] – suggesting models learn real-world data more slowly than suggested by theory (theoretical models indicate the power-law exponent is expected to be -0.5).

Still, if you observe the power-law region, increasing your training data set size will give you a predictable decrease in generalization error. For example, if you are training an image classifier for a self-driving vehicle, the number of hours your cars have driven autonomously determines your training data size. So, going from 2m hours to 6m hours of autonomous driving should reduce errors in your image classifier by a predictable amount. This is important in giving businesses a level of certainty in the improvements they can expect when making large investments in new data collection or generation.

Need for a Distributed Filesystem

The TensorFlow team say a distributed filesystem is a must for deep learning. Datasets are getting larger, GPUs are disaggregated from storage, workers with GPUs need to coordinate for model checkpointing, hyperparameter optimization, and model-architecture search. Your system may grow beyond a single server, or you may have different servers for serving your models from the servers you have for training your models. A distributed filesystem is the glue that holds together the different stages of your machine learning workflows, and it enables teams to share both GPU hardware and data. What is important is that the distributed filesystem works with your choice of programming language and deep learning framework(s).

A distributed filesystem is needed for managing logs, tensorboard, coordinating GPUs for experiments, storing checkpoints during training, and storing/serving models.

HopsFS is a great choice as a distributed filesystem, due to it being a drop-in replacement for HDFS. HopsFS/HDFS are supported in major Python frameworks: Pandas, PySpark DataFrames, TensorFlow Data, and so on. In Hopsworks, we provide built-in HopsFS/HDFS support with the pydoop library. HopsFS has one additional feature that is aimed at machine learning workloads: improved throughput and lower latency reading/writing for small files. In a peer reviewed paper at Middleware 2018, we showed throughput improvements of up to 66X compared to HDFS for small files.

Python Support in Distributed Filesystems

As we can see from the table below, the choice of distributed filesystem will affect what you can do.

Python Support in HopsFS

We now give some simple examples of how to write Python code to use datasets in HopsFS. Complete notebooks can be found here.

Pandas with HopsFS

import hops.hdfs as hdfs

cols = [“Age”, “Occupation”, “Sex”, …, “Country”]
h = hdfs.get_fs()
with h.open_file(hdfs.project_path()+“/TestJob/data/census/”, “r”) as f:
train_data=pd.read_csv(f, names=cols, sep=r’\s*,\s*’,engine=‘python’,na_values=“?”)

In Pandas, the only change we need to make to our code, compared to a local filesystem, is to replace open_file(..) with h.open_file(..), where h is a file handle to HDFS/HopsFS.

PySpark with HopsFS

from mmlspark import ImageTransformer

images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1).cache()
tr = (ImageTransformer().setOutputCol(“transformed”)
   .resize(height = 200, width = 200)
   .crop(0, 0, height = 180, width = 180) )

smallImgs = tr.transform(images).select(“transformed”)“/Projects/myProj/Resources/small_imgs”, format=“parquet”)

TensorFlow Datasets with HopsFS

def input_fn(batch_sz):
files =
def tfrecord_dataset(f):
return, num_parallel_reads=32, buffer_size=8*1024*1024)
dataset = files.apply(,cycle_length=32))
dataset = dataset.prefetch(4)
return dataset