On the Importance of Metadata for a Usable Security Model in Hadoop
Why Hadoop’s security model is wrong, and how to build an understandable and usable security model for Hadoop using Projects and TLS.
tl;dr Hops is a new distribution of Apache Hadoop that delivers a quantum leap in both the size and throughput of Hadoop clusters (see blog). Hops’ key innovation is that it moves Hadoop’s metadata to an in-memory, distributed database. In this article, we describe how we exploit the increased metadata capacity to make Hadoop easy to use: to make Hadoop for humans. We will cover the security model for our second major revolution in the Hadoop space: Hopsworks – project-based multi-tenancy that, for the first time, enables Hadoop to become a multi-tenant data platform. Users and organizations can securely co-exist and share data on the same platform. We introduce the new concepts of projects – c.f., GitHub – and shareable datasets – c.f., Dropbox.
What is Metadata?
Metadata is “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage an information resource” [wikipedia]. Metadata should make finding data and working with data easier. Hadoop systems typically have lots of data (some clusters have up to 100PB of data), so shouldn’t they also have lots of metadata? This is the story of the metadata problem in Hadoop and how we are fixing it in Hops.
The Metadata Problem in Hadoop – Diplodocus and MetaJails
Hadoop is the Diplodocus of Big Data – a dominant dinosaur that bestrides the on-premise data storage plains. But Diplodocus also happened to be the animal with the smallest ever ratio of brain volume to body mass. Hadoop, too, has a ridiculously small brain-volume/body-mass ratio. Hadoop’s brain is its metadata and Hadoop’s data blocks are its body. HDFS clusters typically have a brain/body ratio of around 0.0000023 – such as 150GB of metadata for 60PB of data at Spotify in 2016. The limited metadata we have in HDFS only concerns files, directories, ownership, and quotas, while YARN has minimal metadata about users and applications. Core Hadoop has no metadata for important enterprise features such as data provenance, version control, full-text search, or more user-friendly abstractions. This limited metadata also makes Hadoop hard to use: users can only work with low-level file and application abstractions, and there is no coherent security metadata. With Hops, we hoped that, with more metadata and new abstractions, we could make Hadoop both more user-friendly and operations-friendly.
The two main Hadoop vendors have their own answer for metadata: lock-in with us in MetaJails. For access control, Cloudera has Apache Sentry, while Hortonworks has Apache Ranger. For data governance, Cloudera has Navigator, while Hortonworks has Apache Atlas/Falcon. For SQL-on-Hadoop, Hortonworks favors Apache Hive while Cloudera pushes Apache Impala. Some metadata services are still missing entirely (data provenance for applications/users, version control, and full-text search of HDFS’ namespace).
The problem with the existing metadata services for Hadoop from Cloudera/Hortonworks is that they are eventually consistent. The metadata services run as external services, separate from the ground-truth data: the data in HDFS, the job in YARN, or the topic in Kafka.
We will elucidate the problem with an example. Apache Hive stores its metadata (its information schema, etc.) in the metastore (a frontend to a relational database), while the backing directories and datafiles for its databases/tables are stored in HDFS (and, more recently, S3). If a backing datafile/directory is removed, the metastore will not know it has been removed. To ensure that the datafile and the metadata schema are in sync, Hive could add an eventually consistent synchronization (agreement) protocol, or it could just fail without any warning that the datafile is missing (Hive does the latter). Now take the problem of providing a centralized access control system, like Ranger or Sentry, that overrides the access control schemes already present in HDFS and YARN. That is, Ranger/Sentry metadata supersedes HDFS/YARN/Hive metadata. As Ranger and Sentry access control rules are stored in a relational database, in principle, every HDFS, YARN, or Hive operation should query that database before being executed. For performance reasons, this is not feasible, so Ranger/Sentry access control rules are cached at HDFS/YARN/Hive and synchronized with the relational database (the source of ground truth) using an eventually consistent replication protocol.
From a system administrator’s perspective, the main usability problem with Ranger/Sentry is the presence of multiple access control systems, any of which could be “active” for a given resource. For example, the best practice for access control in Apache Hive is that, if a database is to be protected by Ranger, you should change its permissions in HDFS to indicate that the subtree is protected by Ranger (not HDFS):
hdfs dfs -chmod -R 000 /apps/hive
This is very confusing for administrators. File system permissions are overridden by a service that has no visibility at the filesystem command line – you have to go to a web browser, and have Ranger privileges, to see the access control rules that apply. This model is not, in our opinion, a viable, usable long-term security model for Hadoop, and it will lead to the ecosystem atrophying through poor usability compared with cloud-native alternatives, such as Kubernetes.
Strongly Consistent Metadata
What if we could just unify all this metadata in a single database, use transactions to update it consistently, and use foreign keys to ensure its integrity? This approach would not only fix the problem of non-DRY, eventually consistent metadata (Ranger/Sentry/etc.), but would also provide a principled approach for extending metadata and ensuring its correctness. This is the approach we take in Hops, and it enables us to build better abstractions for Hadoop, making Hadoop easier to use and administer. To make Hadoop for Humans.
In Hops, we start with the ground-truth metadata in HopsFS and YARN and extend it to build higher-level metadata services. For example, we migrated Apache Hive’s metastore into the same database as HopsFS (MySQL Cluster). This way, we can guarantee strong consistency for the Hive metadata by adding foreign keys from Hive’s metadata schema(s) to the backing directories/datafiles in HopsFS. That is, if you delete the backing directories/files for a Hive database, the metadata for the Hive database is automatically cleaned up – see the figure below. For the larger problem of replacing Ranger and Sentry with a usable access control model, we will see how we provide a new abstraction, called a project, that extends HDFS’ existing access control mechanisms (rather than replacing them).
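The foreign-key cleanup described above can be sketched in miniature. The snippet below uses SQLite in place of MySQL Cluster (NDB), and the table and column names (`inodes`, `hive_tbls`) are illustrative, not the actual Hops or Hive schemas – the point is only the `ON DELETE CASCADE` behavior:

```python
import sqlite3

# In-memory stand-in for the shared metadata database.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")

# Filesystem metadata (the ground truth).
db.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT)")

# Extended (Hive) metadata, linked to its backing inode by a foreign key.
db.execute("""CREATE TABLE hive_tbls (
                id INTEGER PRIMARY KEY,
                name TEXT,
                inode_id INTEGER,
                FOREIGN KEY (inode_id) REFERENCES inodes(id)
                    ON DELETE CASCADE)""")

db.execute("INSERT INTO inodes VALUES (1, '/apps/hive/warehouse/sales')")
db.execute("INSERT INTO hive_tbls VALUES (10, 'sales', 1)")

# Deleting the backing directory cascades to the Hive metadata:
db.execute("DELETE FROM inodes WHERE id = 1")
remaining = db.execute("SELECT COUNT(*) FROM hive_tbls").fetchone()[0]
print(remaining)  # 0 -- the table metadata was cleaned up automatically
```

The metadata can never dangle: it either points at a live inode or it is gone, with no synchronization protocol needed.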
In the next section, we discuss how we designed a new security model for Hadoop using extended metadata.
Several years ago, we were given a challenge by the Biobanking community:
“How can I have a sensitive dataset in Hadoop, where I can give access to Alice who is then prevented from copying that data elsewhere or cross-linking that data with non-approved data sources?”
In the above figure, we can see that, in order to restrict Alice’s access to the sensitive dataset, we need to (temporarily) remove her privileges for reading/writing the other dataset. In Hadoop, that is non-trivial and impractical. Organizations typically create a new Hadoop cluster just for the sensitive data, and when they need to share that sensitive data with other projects, they copy the data. Copying data, however, is not sharing data – it is expensive and leads to errors when the datasets diverge.
The access-control mechanism we needed can be provided with dynamic roles or attribute-based access control. We could give Alice a role with privileges for reading/writing the sensitive dataset and take away her role for reading/writing the other dataset, or we could specify attributes to achieve the same goal. However, neither of these mechanisms is available in Hadoop. We knew that integrating an attribute-based access control PEP (policy enforcement point) was impractical, as such PEPs can handle at most a couple of thousand ops/sec, while HopsFS can handle over 1M ops/sec (see funny picture). That left us with dynamic roles, which would be possible, but would need a re-imagining of Hadoop.

Our solution was to introduce the notion of a project to Hadoop: each user has a different identity for each project, and the solution adds no new overhead, so performance is not affected. The project model means that if a user has 5 projects then, from Hadoop’s perspective, there are 5 users. User Alice in Project X cannot read/write data in Project Y, even if Alice is a member of Project Y, as the Alice user in Project Y has a different identity. This lets us use HDFS’ native access control mechanisms without needing a new, higher-layer access control system like Ranger or Sentry. We ensure users can only act as a single user at a time by funneling all requests through a REST API (much like how git only lets you be active on a single branch at any given time). Each project is given its own subtree in HopsFS, under /Projects. So Projects X and Y would have private subtrees /Projects/X and /Projects/Y, respectively. But what about other services, like YARN, Hive, and Kafka? For those services, we use certificates and, where necessary, introduce our own access control layer (as we do in Kafka).
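The project model above can be illustrated with a toy sketch. It assumes a `projectname__username` composite identity and a private subtree per project; the class and function names here are illustrative, not Hops APIs:

```python
class ProjectUser:
    """One identity per (project, user) pair -- from Hadoop's
    perspective, this composite string IS the user."""
    def __init__(self, project, user):
        self.identity = f"{project}__{user}"

def can_access(project_user, path):
    # Each project owns a private subtree under /Projects; an identity
    # can only touch paths inside its own project's subtree.
    project = project_user.identity.split("__")[0]
    return path.startswith(f"/Projects/{project}/")

# The same human, Alice, gets distinct identities in Projects X and Y:
alice_x = ProjectUser("X", "alice")
alice_y = ProjectUser("Y", "alice")

print(alice_x.identity != alice_y.identity)          # True: two Hadoop users
print(can_access(alice_x, "/Projects/X/data.csv"))   # True
print(can_access(alice_x, "/Projects/Y/data.csv"))   # False: wrong identity
```

Because isolation falls out of ordinary per-user permissions on the project subtree, no extra policy-enforcement layer sits on the read/write path.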
In Hopsworks, a per-project user identity (a projectUser) is a composite of the projectname and the username: projectname__username. This approach would not scale with Kerberos, as identity creation and authentication are expensive and centralized in the KDC. Instead, we chose TLS certificates to identify projectUsers. TLS certificates are used for both authentication and encryption-in-transit in Hops. An intermediate CA (certificate authority) generates a new TLS certificate for each such projectUser, and the certificate is stored in the database along with the projectUser details. We link projectUsers to projects using foreign keys, so when a project is removed, its projectUsers are cleaned up automatically.
The project abstraction solved our problem of how to isolate users and data. But what if we want to share a dataset between two projects? Will sharing mean copying data, as it does in existing platforms? The answer is, of course, no. We introduce a new dataset abstraction (a new table in the database with a foreign key to its home project) to enable sharing a dataset between projects. Granting a different project access to a dataset involves adding that project’s projectUsers as members of the dataset’s group (each dataset has its own group name). With support for Access Control Lists (ACLs) in HopsFS, we can share datasets in a more fine-grained manner, giving different projectUsers different permissions on a shared dataset.
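The group-based sharing scheme can be sketched as follows. This is a minimal model, assuming `projectname__username` identities; the dataset and permission names are made up for illustration:

```python
# dataset name -> set of projectUser members of the dataset's group
datasets = {}
# (dataset, projectUser) -> permission ("r" or "rw"), i.e. per-user ACLs
permissions = {}

def create_dataset(name, owner):
    """A new dataset gets its own group; its owner has full access."""
    datasets[name] = {owner}
    permissions[(name, owner)] = "rw"

def share_dataset(name, project_user, perm="r"):
    """Sharing adds a projectUser from another project to the dataset's
    group -- no data is copied."""
    datasets[name].add(project_user)
    permissions[(name, project_user)] = perm

create_dataset("genomes", "X__alice")
share_dataset("genomes", "Y__bob", perm="r")   # read-only, fine-grained ACL

print("Y__bob" in datasets["genomes"])         # True: member of the group
print(permissions[("genomes", "Y__bob")])      # "r": cannot modify the data
```

Revoking access is just removing the group member, and since the dataset table carries a foreign key to its home project, deleting the project cleans up the dataset metadata too.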
Roles within a Project
Hopsworks also supports roles within a project. Each project in Hopsworks has at least one data owner, a user who can add and remove members and change the roles of members of the project. The data owner can also import/export data to/from the project and has other privileges, modelled on the data owner role in the European Union’s General Data Protection Regulation (GDPR). The other main role is the data scientist, who can only upload/edit/run programs and install Python libraries in the project. The typical workflow is that a data owner invites users into a project as data scientists to analyze her data, secure in the knowledge that the data scientists cannot download the data or any derived data, cannot copy the data to another project, and cannot cross-link the data with external data sources. This strong model of isolation is necessary for any processing of data that is identified as sensitive under GDPR.
In this project, Robin Andersson has the role “Data Scientist”. Jim Dowling has the role “Data Owner”.
TLS Certificates for Hadoop/Kafka
Hops provides support for TLS/SSL RPC in Hadoop, while Hopsworks provides support for managing certificates – creating, revoking, and rotating them. Hopsworks gives a user a different Hadoop identity for each projectUser, in the form of his/her own TLS certificate. Certificate generation can be scaled out by adding intermediate certificate authorities as needed. Certificates can be used for authenticating projectUsers with services such as Kafka/HopsFS/YARN, or for identifying users as members of projects in the system. A projectUser’s certificate encodes both the username and the projectname in the certificate’s common name, and both the Hadoop RPC server (for HopsFS and YARN) and Kafka extract the projectname and username, using them to authenticate and authorize the projectUser.
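The server-side extraction step can be sketched like this. It assumes a common name laid out as `project__user`, as described above; the helper function is illustrative, not the actual Hops RPC code:

```python
def parse_common_name(cn):
    """Split a certificate common name of the form 'project__user'
    into its project and user parts."""
    project, sep, user = cn.partition("__")
    if not sep or not project or not user:
        raise ValueError(f"not a projectUser common name: {cn!r}")
    return project, user

# A server that has validated the client's certificate chain can now
# derive both identity (who) and project scope (where) from the CN:
project, user = parse_common_name("demo__alice")
print(project, user)  # demo alice
```

Because the project is baked into the authenticated certificate, authorization decisions need no extra lookup to discover which project a request belongs to.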
Kafka is a distributed, highly available event/message broker. The basic queue holding events/messages in Kafka is called a topic. In Kafka, the preferred terminology is to produce messages to and consume messages from a topic. Kafka’s metadata for topics and the default access control lists for topics (SimpleACLAuthorizer) are stored in Zookeeper.
In Hopsworks, we wanted to extend project metadata with Kafka ACLs, so we built our own access control service for Kafka, storing the metadata in NDB. In Hopsworks, Kafka topics belong to projects. That is, you create a topic X within a project A, and topic X is private to project A. The default privileges are intuitive: only members of project A can produce to/consume from topic X. However, topics can also be shared between projects. For example, topic X could be shared with another project B, so that members of project B are then allowed to produce to and consume from topic X. You can also specify more fine-grained privileges for topics, restricting access for certain users or from certain IP addresses.
In Hopsworks, we implemented a HopsACLAuthorizer component that controls access to topics by first extracting projectUser names from the certificates of clients that are producing to/consuming from a topic, and then querying the ACLs cached in the HopsACLAuthorizer. Any changes to the ACLs in NDB are propagated to the HopsACLAuthorizer within a few seconds.
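The cache-then-refresh pattern can be sketched in a few lines. This is a toy model, not Kafka’s Authorizer API or the actual HopsACLAuthorizer: the class name, refresh interval, and ACL table layout are all illustrative:

```python
import time

class CachedTopicAuthorizer:
    """Answer authorization checks from a local cache, refreshing it
    from the backing database every few seconds -- approximating the
    'propagated within a few seconds' behaviour described above."""
    def __init__(self, load_acls, refresh_secs=5):
        self._load_acls = load_acls      # returns {(topic, projectUser): True}
        self._refresh_secs = refresh_secs
        self._cache = load_acls()
        self._loaded_at = time.monotonic()

    def authorize(self, project_user, topic):
        # Refresh the cache if it is older than the refresh interval.
        if time.monotonic() - self._loaded_at > self._refresh_secs:
            self._cache = self._load_acls()
            self._loaded_at = time.monotonic()
        return self._cache.get((topic, project_user), False)

# ACLs as they would live in the database: topicX belongs to project A.
acls = {("topicX", "A__alice"): True}
auth = CachedTopicAuthorizer(lambda: dict(acls))

print(auth.authorize("A__alice", "topicX"))  # True: project member
print(auth.authorize("B__bob", "topicX"))    # False: topic not shared with B
```

The trade-off is the usual one: the hot path never touches the database, at the cost of a bounded window during which a revoked ACL is still honored.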
Hopsworks has UI support for creating/managing topics and their schemas.
Ensuring the Consistency and Integrity of Extended Metadata
Hopsworks has a principled method for extending the metadata in HopsFS or HopsYARN. We add new tables for the extended metadata and add foreign keys to ensure its integrity. By integrity, we mean that metadata is never orphaned: if the extended metadata is for a file, the corresponding inode must be present in the database; if the extended metadata is for a project, the project must be present in the database. We encapsulate all updates to metadata in transactions to ensure consistency, so that partial updates to the metadata never happen.
At the Hopsworks REST API level, however, we provide high-level operations such as createProject and deleteProject. These operations consist of a number of steps, not all of which can be encapsulated in a database transaction. For example, when deleting a project, we also have to destroy processes/jobs belonging to the project, such as YARN jobs, Jupyter notebook servers, and Zeppelin interpreters. For project deletion, we destroy jobs/processes before the metadata is deleted transactionally, so that we are not left with orphaned jobs or notebooks running for non-existent projects (if a crash happens in the middle of a project deletion operation).
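The ordering argument above can be made concrete with a small sketch. The function and helper names are hypothetical, not the Hopsworks API; the point is only the crash-safety of the step order:

```python
def delete_project(project, kill_jobs, stop_notebooks, delete_metadata_tx):
    """Delete a project in crash-safe order.

    Step 1: destroy the non-transactional resources first. If we crash
    after this, we are left with a live project that simply has no
    running jobs -- harmless, and the deletion can be retried.
    """
    kill_jobs(project)
    stop_notebooks(project)
    # Step 2: remove all metadata atomically, in one transaction. If we
    # crash before reaching this point, no job can be orphaned against a
    # non-existent project, because the project still exists.
    delete_metadata_tx(project)

log = []
delete_project(
    "X",
    kill_jobs=lambda p: log.append(f"killed jobs for {p}"),
    stop_notebooks=lambda p: log.append(f"stopped notebooks for {p}"),
    delete_metadata_tx=lambda p: log.append(f"deleted metadata for {p}"),
)
print(log)
```

Reversing the two steps would risk the bad outcome the article describes: metadata gone, but jobs and notebooks still running against a project that no longer exists.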
Entity-Relation Diagram for Extended Metadata: Projects, Datasets, ProjectUsers, Kafka ACLs, and Jobs. Removing a project’s backing directory (in hdfs_inodes) will cascade to remove all the extended metadata for the Project.
The existing access control architectures for Hadoop (Sentry/Ranger) are layered over the base access control systems in services like HDFS, YARN, and Hive. This leads to a confusing model for administrators and high implementation complexity, due to the need to synchronize the access control rules with the metadata of the underlying services.
In Hops, we simplify the security model for Hadoop by introducing the new concepts of projects, datasets, and projectUsers. Projects are sandboxes for data and programs, and they enable the implementation of dynamic roles.