Secure storage, analytics and machine learning for sensitive data in a
user-friendly platform that runs on your infrastructure or in the cloud.
Karolinska Institutet is Scandinavia’s largest medical university and one of the world's leading medical research institutions, known for its high-quality research and education and accounting for over 40% of the medical academic research in Sweden.
Large-scale storage, management, and processing of genomic data, including deep learning.
At Karolinska Institutet’s center for cervical cancer prevention, sequencing machines have generated more than 800 TB of next-generation sequencing data, which requires both low-cost storage and secure, large-scale processing by researchers.
The organisation uses large-scale processing on Apache Spark and deep learning on TensorFlow to analyze these large, sensitive datasets: identifying novel viruses, performing large cohort studies, and finding genetic mutations that cause disease.
90% Cost Reduction
Cost savings from storing large volumes of data, as well as from the compute (CPU) and graphics processing unit (GPU) resources needed to process it.
Integrated Data Science Platform
Easy collaboration between researchers when managing, sharing, and processing genomic data.
Faster Data Processing
A massively parallel processing pipeline for very large genomic datasets.
Karolinska Institutet deployed Hopsworks as a secure and scalable platform for managing genomic data and conducting research studies. Hopsworks is built around projects, providing a GDPR-compliant environment that enables researchers to collaborate securely on medical studies within a shared cluster.
Hopsworks is optimized for commodity hardware and runs in any data center. Clusters can easily be expanded by adding capacity when needed, enabling a low-cost solution that scales to petabytes of data. Similarly, Hopsworks supports both commodity and enterprise GPUs for deep learning.
Hopsworks’ user-friendly web interface lets researchers run programs and manage and access data without requiring software-administration skills.
Hopsworks’ key capabilities that we used are:
Hopsworks is the world’s first horizontally scalable data platform for machine learning with a built-in feature store. The feature store helps with cleaning data and preparing features, and it makes features reusable across teams.
The Hopsworks Feature Store acts as an effective API between team members working on data engineering (pulling data from backend data warehouses and data lakes) and those working on data science (model building, training, and evaluation).
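This handoff pattern can be sketched in plain Python. The following is a conceptual illustration only, not the Hopsworks API: the `FeatureStore` class, its methods, and the feature names are all hypothetical.

```python
# Conceptual sketch of a feature store as the contract between
# data engineering and data science. NOT the Hopsworks API; the
# FeatureStore class and feature names here are hypothetical.

class FeatureStore:
    """Registry of named, versioned feature groups keyed by a primary key."""

    def __init__(self):
        self._groups = {}

    def create_feature_group(self, name, version, primary_key, rows):
        """Data engineers publish cleaned, reusable features once."""
        index = {row[primary_key]: row for row in rows}
        self._groups[(name, version)] = index

    def get_training_data(self, name, version, keys, features):
        """Data scientists pull only the features a model needs."""
        group = self._groups[(name, version)]
        return [[group[k][f] for f in features] for k in keys]


# Data engineering side: ingest raw data and register features.
store = FeatureStore()
store.create_feature_group(
    name="sample_qc", version=1, primary_key="sample_id",
    rows=[
        {"sample_id": "s1", "read_count": 1_200_000, "gc_content": 0.41},
        {"sample_id": "s2", "read_count": 950_000, "gc_content": 0.39},
    ],
)

# Data science side: assemble a training matrix without touching raw data.
X = store.get_training_data(
    "sample_qc", 1, keys=["s1", "s2"], features=["read_count", "gc_content"]
)
print(X)  # [[1200000, 0.41], [950000, 0.39]]
```

Versioning the feature group is what makes the contract stable: data engineers can publish a new version without breaking models that were trained against the old one.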
Security by design: Data scientists can be given sandboxed access to sensitive data, complying with GDPR and stronger security requirements.
Scale-out deep learning: Distributed deep learning over tens or hundreds of GPUs for parallel experiments and distributed training.
Provenance support for ML pipelines: Enables fully reproducible models, easier debugging, and comprehensive data governance for pipelines.
Integration with third-party platforms: Seamless integration with data science platforms such as AWS SageMaker, Databricks, and Kubeflow. Hopsworks also integrates with data lakes such as S3, Hadoop, and Delta Lake, and supports single sign-on via ActiveDirectory, LDAP, and OAuth2.