An implicit provenance model can be used alongside a feature store with versioned data to build reproducible and more easily debugged ML pipelines. We provide development tools and visualization support that help developers navigate and re-run pipelines.
Maggy extends Spark's synchronous processing model to support asynchronous ML trials, enabling state-of-the-art end-to-end ML pipelines to run fully on Spark. Maggy provides programming support for defining, optimizing, and running parallel ML trials.
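The define/optimize/run trial pattern can be sketched in plain Python. This is a conceptual illustration only, not Maggy's actual API: `run_trial` and `random_search` are hypothetical names, threads stand in for Spark executors, and a toy quadratic "loss" stands in for real model training.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical objective: a toy quadratic "loss" over two hyperparameters,
# standing in for a real model-training run.
def run_trial(config):
    loss = (config["lr"] - 0.1) ** 2 + (config["dropout"] - 0.5) ** 2
    return config, loss

def random_search(searchspace, num_trials=20, seed=42):
    rng = random.Random(seed)
    # Define trials: sample each hyperparameter uniformly from its range.
    trials = [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in searchspace.items()}
        for _ in range(num_trials)
    ]
    best_config, best_loss = None, float("inf")
    # Run trials concurrently; consume each result as soon as that trial
    # finishes, rather than waiting at a synchronous barrier for all of them.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_trial, t) for t in trials]
        for fut in as_completed(futures):
            config, loss = fut.result()
            if loss < best_loss:
                best_config, best_loss = config, loss
    return best_config, best_loss

best, loss = random_search({"lr": (0.0, 1.0), "dropout": (0.0, 1.0)})
```

The key point the sketch illustrates is the asynchrony: the controller reacts to each completed trial independently, which is what lets an optimizer schedule or stop trials early instead of synchronizing on the slowest one.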
HopsFS-S3 is a hybrid cloud-native distributed hierarchical file system that is available across availability zones, costs the same as S3, but delivers 100X the performance of S3 for file move/rename operations and 3.4X the read throughput of S3 (EMRFS) on the DFSIO benchmark.
HopsFS-CL is a highly available distributed hierarchical file system with native availability-zone (AZ) awareness, built on synchronous replication protocols.
Implicit provenance allows us to capture full lineage for ML programs by instrumenting only the distributed file system and its APIs, with no changes to the ML code.
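The idea of capturing lineage without touching the ML code can be sketched as follows. All names here are hypothetical illustrations, not the Hopsworks implementation: the "ML code" calls an ordinary-looking file API, and the instrumented layer records every read and write as a side effect.

```python
import os
import tempfile

# Lineage log populated implicitly by the instrumented file API.
lineage = []

def instrumented_open(path, mode="r"):
    # Classify the operation from the open mode, record it, then delegate
    # to the real file system call. The caller sees a normal file handle.
    op = "write" if any(c in mode for c in "wa+") else "read"
    lineage.append((op, path))
    return open(path, mode)

# "ML code": completely unaware that provenance is being captured.
path = os.path.join(tempfile.gettempdir(), "model_input.csv")
with instrumented_open(path, "w") as f:
    f.write("feature,label\n")
with instrumented_open(path) as f:
    data = f.read()
```

Because the interception happens at the file-system boundary, every artifact a training run reads or writes is captured, which is what makes the lineage "full" without any annotation burden on the developer.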
The distribution-oblivious training function allows ML developers to reuse the same training function whether running in a single-host Jupyter notebook or performing scale-out hyperparameter search and distributed training on clusters.
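The reuse pattern can be sketched in plain Python under stated assumptions: `train` is a hypothetical stand-in for a real training function (returning a mock "accuracy"), and a thread pool stands in for cluster executors. The point is that the function body is identical in both settings.

```python
from concurrent.futures import ThreadPoolExecutor

# A single training function, written once. It knows nothing about how or
# where it is launched (the "distribution-oblivious" pattern).
def train(lr=0.1, layers=2):
    # Stand-in for real training: return a mock "accuracy".
    return 1.0 - abs(lr - 0.1) - 0.01 * abs(layers - 4)

# 1) Single-host use, e.g. called directly in a Jupyter notebook:
local_acc = train(lr=0.1, layers=4)

# 2) The same, unmodified function handed to a launcher for a scale-out
#    hyperparameter search (threads stand in for cluster executors here):
grid = [{"lr": lr, "layers": n} for lr in (0.01, 0.1, 0.5) for n in (2, 4)]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda cfg: (cfg, train(**cfg)), grid))
best_cfg, best_acc = max(results, key=lambda r: r[1])
```

Keeping the training function free of launch-time concerns is the design choice that lets the same notebook code move from local experimentation to cluster-scale search without edits.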
A new block reporting protocol for HopsFS that uses as little as 1/1000th of the resources of HDFS' block reporting protocol. IEEE BigDataCongress'19.
Change Data Capture paper for HopsFS (ePipe). CCGRID’19.
A demo paper describing the Hopsworks ML pipeline, presented at SysML 2019.
Describes how HopsFS supports small files by storing them in its metadata layer on NVMe disks. Middleware 2018.
IEEE Scale Prize winning submission, May 2017. Focuses on database optimizations in HopsFS' metadata layer.
The first main paper on HopsFS, published at USENIX FAST 2017.