Healthcare & Pharmaceuticals

Secure storage, analytics and machine learning for sensitive data in a user-friendly platform that runs on your infrastructure or in the cloud.

Karolinska Institutet, home to Scandinavia’s largest university hospital, is one of the world’s leading medical universities, known for high-quality research and education and accounting for over 40% of the medical academic research in Sweden.

Vertical Use Case

Large-scale storage, management, and processing of genomic data, including deep learning.

SECURE, LOW-COST, AND SCALABLE GENOMIC DATA MANAGEMENT

Challenge: Data preparation, cataloging, and feature management for a massive genomic dataset containing sensitive information.

At the Karolinska Institutet’s center for cervical cancer prevention, sequencing machines have generated 800+ TBs of next-generation sequencing data, requiring both low-cost storage and secure large-scale processing by researchers.

The organization uses large-scale processing on Apache Spark and deep learning with TensorFlow to analyze these large, sensitive datasets: identifying novel viruses, performing large cohort studies, and finding genetic mutations that cause disease. However:

  • Research studies require data to be sandboxed to prevent cross-linking with data outside the study and to restrict importing or exporting data. Neither Kubernetes- nor Hadoop-based platforms support storing and processing sensitive data on a shared cluster, and providing one cluster per research study introduces excessive cost and administration overhead.
  • Infrastructure was too complex and expensive to administer without a dedicated IT operations team.
  • Researchers required a data science platform that let them do everything from small-scale analyses in Python notebooks, to large-scale processing with Spark/PySpark, to deep learning with GPUs.

Key Results

90% Cost Reduction

Cost savings from storing large volumes of data and from the compute (CPU) and graphics processing unit (GPU) resources needed to process it.

Integrated Data Science Platform

Easy collaboration between researchers when managing, sharing, and processing genomic data.

Faster Data Processing

A massively parallel processing pipeline for very large genomic datasets.

Solution: User-friendly and secure deep learning on low-cost commodity infrastructure.

Karolinska Institutet deployed Hopsworks as a secure and scalable platform for managing genomic studies. It provides a GDPR-compliant environment designed for secure collaboration between data owners and researchers in a shared cluster.

The security model is built around projects, enabling data owners to sandbox sensitive research data and grant selected researchers the ability to process it, while ensuring those researchers can neither cross-link the data with other sources nor export it from the project’s sandbox.
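The project model above can be sketched conceptually in a few lines. This is illustrative only — not the Hopsworks implementation or API — showing the core invariant: a dataset belongs to exactly one project, and only members of that project can touch it, so a researcher in one study cannot read (and therefore cannot cross-link or export) another study’s data.

```python
# Conceptual sketch of project-based sandboxing (all names hypothetical).
class Project:
    def __init__(self, name):
        self.name = name
        self.members = set()
        self.datasets = {}  # dataset name -> data, visible only to members


class Sandbox:
    def __init__(self):
        self.projects = {}

    def create_project(self, name, owner):
        p = Project(name)
        p.members.add(owner)
        self.projects[name] = p
        return p

    def add_member(self, project, user):
        self.projects[project].members.add(user)

    def put(self, project, user, dataset, data):
        p = self.projects[project]
        if user not in p.members:
            raise PermissionError(f"{user} is not a member of {project}")
        p.datasets[dataset] = data

    def read(self, project, user, dataset):
        p = self.projects[project]
        if user not in p.members:
            raise PermissionError(f"{user} is not a member of {project}")
        return p.datasets[dataset]


# A data owner shares a dataset with a researcher inside one project;
# the same researcher is denied access to every other project's data.
sandbox = Sandbox()
sandbox.create_project("cervical-study", "data_owner")
sandbox.add_member("cervical-study", "researcher")
sandbox.put("cervical-study", "data_owner", "reads", ["ACGT", "GGCC"])
```

The point of the sketch is that access control is enforced at the project boundary, not per file, which is what makes a shared multi-tenant cluster safe for sensitive data.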

Hopsworks is optimized for commodity hardware and was installed on servers in an internal data center. The cluster can be expanded simply by adding servers or storage capacity. Hopsworks provides a much lower-cost storage solution, up to petabytes of capacity, than enterprise storage racks. It also supports both Nvidia and AMD GPUs for deep learning, resulting in savings of about 90%.

Hopsworks’ user-friendly web interface enables KI researchers to run programs and to manage and access data without software administration knowledge or skills.


The key Hopsworks capabilities used were:

  • Multi-tenant security model to ensure the integrity and privacy of sensitive research data in a shared cluster;
  • Python/Jupyter notebooks for small-scale studies;
  • Spark for scalable processing of genomic data;
  • TensorFlow/PyTorch for deep learning on genomic data;
  • Custom Metadata Designer to manage genomic data and find it with free-text search;
  • Commodity hardware to cut the cost of storing large data volumes and of the many GPUs needed to process them.
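To make the notebook-to-Spark progression concrete, here is a toy example of the kind of per-read analysis a researcher might prototype in a Python notebook before scaling out. The `gc_content` function and sample reads are hypothetical illustrations, not part of any Hopsworks API; in Spark/PySpark the same map-and-aggregate pattern would run in parallel across the cluster.

```python
# Toy per-read analysis: GC content of DNA reads (illustrative only).
def gc_content(read: str) -> float:
    """Fraction of G/C bases in a DNA read."""
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read)


reads = ["ACGTGC", "GGGCCC", "ATATAT"]

# In PySpark, the equivalent distributed computation would be roughly:
#   sc.parallelize(reads).map(gc_content).mean()
mean_gc = sum(gc_content(r) for r in reads) / len(reads)
```

The notebook version and the Spark version share the same `gc_content` logic; only the execution engine changes, which is what lets small-scale prototypes grow into large-scale pipelines.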
Hopsworks: Data-Intensive Machine Learning with a Feature Store
Download the Hopsworks in Healthcare White Paper

Scaling ML with the Hopsworks Feature Store

Hopsworks is the world’s first horizontally scalable data platform for machine learning with a feature store. It helps clean data and prepare features, and it makes features reusable by other teams.

The Hopsworks Feature Store acts as an effective API between team members working on data engineering (pulling data from backend data warehouses and data lakes) and those working on data science (model building, training, and evaluation).
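A minimal in-memory sketch can show what this handoff looks like in practice. This is not the Hopsworks API (all class and method names below are invented for illustration): data engineers register named, versioned feature groups once; data scientists later look them up and join them by entity id to assemble training data, without re-running the data engineers’ pipelines.

```python
# Minimal in-memory feature-store sketch (hypothetical API, for illustration).
class FeatureStore:
    def __init__(self):
        # (group name, version) -> {entity_id: {feature name: value}}
        self._groups = {}

    def create_feature_group(self, name, version, rows):
        """Data-engineering side: publish a cleaned, versioned feature group."""
        self._groups[(name, version)] = rows

    def get_feature_group(self, name, version):
        return self._groups[(name, version)]

    def training_rows(self, group_names, version=1):
        """Data-science side: join feature groups on entity id into
        one feature row per entity, ready for model training."""
        joined = {}
        for name in group_names:
            for entity, feats in self.get_feature_group(name, version).items():
                joined.setdefault(entity, {}).update(feats)
        return joined


fs = FeatureStore()
# Engineers publish features once...
fs.create_feature_group("sample_qc", 1,
                        {"s1": {"coverage": 30}, "s2": {"coverage": 12}})
fs.create_feature_group("sample_labels", 1,
                        {"s1": {"label": 0}, "s2": {"label": 1}})
# ...and scientists reuse them for training without redoing the prep.
rows = fs.training_rows(["sample_qc", "sample_labels"])
```

The versioned group names are the "API contract" between the two teams: either side can evolve its pipeline as long as the published groups keep their names, versions, and schemas.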


Security by design: Data scientists can be given sandboxed access to sensitive data, complying with GDPR and stronger security requirements.


Scale-out deep learning: Distributed Deep Learning over 10s or 100s of GPUs for parallel experiments and distributed training.


Provenance support for ML pipelines: Enables fully reproducible models, easier debugging, and comprehensive data governance for pipelines.
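The value of provenance is easiest to see with a tiny sketch (names hypothetical, not the Hopsworks implementation): record, for each artifact, which input artifacts and which code version produced it, and you can walk the graph backwards to reproduce a model or debug a bad prediction.

```python
# Toy lineage graph for an ML pipeline (illustrative only).
lineage = {}  # artifact -> {"inputs": [...], "code_version": str}


def record(artifact, inputs, code_version):
    """Register how an artifact was produced."""
    lineage[artifact] = {"inputs": list(inputs), "code_version": code_version}


def upstream(artifact):
    """All transitive dependencies of an artifact, for audits and debugging."""
    deps = set()
    stack = list(lineage.get(artifact, {}).get("inputs", []))
    while stack:
        a = stack.pop()
        if a not in deps:
            deps.add(a)
            stack.extend(lineage.get(a, {}).get("inputs", []))
    return deps


record("features_v1", ["raw_reads"], "git:abc123")
record("model_v1", ["features_v1"], "git:def456")
```

With this graph, "which raw data and which code produced model_v1?" becomes a lookup rather than detective work, which is the governance property the bullet above describes.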


Integration with third-party platforms: Seamless integration with data science platforms such as AWS SageMaker, Databricks, and Kubeflow. Hopsworks also integrates with data lakes such as S3, Hadoop, and Delta Lake, and supports single sign-on with ActiveDirectory, LDAP, and OAuth2.

Download the Hopsworks
Feature Store
White Paper

Hopsworks at a glance

Efficiency & Performance

Feature Store
Data warehouse for ML
Distributed Deep Learning
Faster with more GPUs
HopsFS
NVMe speed with Big Data
Horizontally Scalable
Ingestion, data prep, training, serving

Development & Operations

Notebooks for development
First-class Python support
Version Everything
Code, infrastructure, data
Model Serving on Kubernetes
TF Serving, MLeap, SkLearn
End-to-End ML Pipelines
Orchestrated by Airflow

Governance & Compliance

Secure Multi-tenancy
Project-based restricted access
Encryption At-Rest, In-Motion
TLS/SSL everywhere
AI-Asset Governance
Models, experiments, data, GPUs
Data/Model/Feature Lineage
Discover and track dependencies


Book a demo

Get an introduction to Hopsworks and Hopsworks Feature Store for your Machine Learning projects together with one of our engineers.

A comprehensive walk-through:
• How Hopsworks can align with your current ML pipelines
• How to manage Features within Hopsworks feature store
• The benefits of Hopsworks Feature Store for your teams

Let us know your specific wishes and prerequisites for your personal demonstration.