Data Encryption in Apache Hadoop with Project Rhino - Q&A with Steven Ross
Cloudera recently released an update on Project Rhino and data-at-rest encryption in Apache Hadoop. Project Rhino is a joint effort by Cloudera, Intel, and the Hadoop community to bring a comprehensive security framework for data protection to Apache Hadoop.
Data encryption in Hadoop has two aspects: data at rest (data stored on persistent storage such as disk) and data in transit (data moving from one process or system to another). Most Hadoop components provide encryption for data in transit, but encryption of data at rest has not been supported. Compliance regulations such as HIPAA, PCI DSS, and FISMA also call for data protection and encryption.
Project Rhino contributed key security features to HBase 0.98, providing transparent encryption of data at rest and cell-level fine-grained access control.
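As an illustration, HBase 0.98's transparent encryption is enabled per column family from the HBase shell, and access can be granted down to individual columns; the table, user, and column names below are hypothetical, and encryption additionally requires a key provider configured in hbase-site.xml:

```
# Encrypt the on-disk files (HFiles) of column family 'cf' with AES
alter 'patients', {NAME => 'cf', ENCRYPTION => 'AES'}

# Grant user 'alice' read access only to one column in that family
grant 'alice', 'R', 'patients', 'cf', 'diagnosis'
```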
InfoQ recently talked to Steven Ross, product manager, security at Cloudera to learn more about Project Rhino.
InfoQ: When was Project Rhino launched? What are the objectives of the project?
Steven Ross: To catalyze the development of a comprehensive security framework for data protection in Apache Hadoop, Intel launched Project Rhino in early 2013 as an initiative with several broad objectives:
- Provide encryption with hardware-enhanced performance
- Support enterprise-grade authentication and single sign-on for Hadoop services
- Provide role-based access control in Hadoop with cell-level granularity in HBase
- Ensure consistent auditing across essential Apache Hadoop components
InfoQ: Project Rhino is an umbrella project. Apache Sentry is also included in Project Rhino. What are the various projects that are part of Rhino and can you please share some details about these sub-projects?
SR: In summer 2013, Cloudera released software to open source that became the basis for the Apache Sentry project (incubating), which has garnered engagement from engineers at Oracle, IBM and now Intel. Apache Sentry provides fine-grained authorization support for both data and metadata in a Hadoop cluster and is deployed in production in a number of large enterprises.
With the strategic partnership between Cloudera and Intel, the security architects and engineers from both teams have renewed their commitment to accelerating the development of security capabilities in Apache Hadoop. The goals of Project Rhino and Apache Sentry to develop more robust authorization mechanisms in Apache Hadoop are in complete alignment; the efforts of the security experts from both companies have merged, and their work now contributes to both projects.
InfoQ: What is Apache Sentry?
SR: Apache Sentry (incubating) is a highly modular system for providing fine-grained role based authorization to both data and metadata stored on an Apache Hadoop cluster.
While Hadoop ecosystem projects feature a myriad of different native authorization systems, each must be configured separately. The flexibility of Hadoop enables the same data to be accessed through multiple ecosystem projects (such as Hive, Solr, MapReduce, and Pig). With authorization configured separately in each project, admins are likely to end up with inconsistent, overlapping policies that they must then try to keep in sync with each other.
Sentry addresses this IT administration and security challenge by providing a centralized set of policies that can be applied across many different access paths, so that IT admins can set permissions on a data set, and know that those permissions will be enforced consistently regardless of how the data is being accessed.
Technical details about Sentry:
Sentry governs access to each schema object in the Hive Metastore via a set of privileges such as SELECT and INSERT. The schema objects are common entities in data management, such as SERVER, DATABASE, TABLE, COLUMN, and URI (a file location within HDFS). Cloudera Search has its own set of privileges (e.g. QUERY) and objects (e.g. COLLECTION).
As with other RBAC systems that IT teams are already familiar with, Sentry provides for:
- Hierarchies of objects, with permissions automatically inherited by objects that exist within a larger umbrella object;
- Rules containing a set of multiple object/permission pairs;
- Groups that can be granted one or more roles;
- Users that can be assigned to one or more groups.
Sentry is normally configured to deny access to services and data by default so that users have limited rights until they are assigned to a group that has explicit access roles.
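In Sentry's file-based policy format, for example, groups map to roles and roles map to privilege strings, so a single file can express the model described above; the server, database, table, and group names below are hypothetical:

```
[groups]
# Members of the LDAP/Unix group 'analysts' get the analyst_role
analysts = analyst_role

[roles]
# analyst_role may only read one table in the 'sales' database
analyst_role = server=server1->db=sales->table=orders->action=select
```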
InfoQ: What is Advanced Encryption Standard New Instructions (AES-NI) and how is it related to project Rhino?
SR: Intel AES-NI is an instruction set extension that accelerates the Advanced Encryption Standard (AES) algorithm in hardware, speeding up the encryption of data on the Intel Xeon processor family and the Intel Core processor family.
When enabling encryption, enterprises are typically concerned about the CPU “overhead” it will require, resulting in a slowdown of data storage and retrieval operations. AES-NI performs the AES rounds in dedicated processor instructions, accomplishing encrypt and decrypt operations much faster while minimizing the load on the CPUs.
AES-NI is important for the success of encryption subprojects of Project Rhino. While Hadoop users of HDFS encryption will not be required to use Intel chips or AES-NI, those that do will see improved encrypt/decrypt performance which minimizes the performance impact of enabling encryption.
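Applications generally do not have to change to benefit: a standard JCE AES cipher, as in the minimal sketch below, is accelerated transparently by HotSpot's AES-NI intrinsics on supporting CPUs (controlled by the `-XX:+UseAES -XX:+UseAESIntrinsics` JVM flags, on by default in recent JVMs).

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

public class AesRoundTrip {
    public static void main(String[] args) throws Exception {
        // Generate a 128-bit AES key; on AES-NI capable hardware the
        // JVM executes the cipher with the hardware instructions.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        byte[] plaintext = "sensitive record".getBytes(StandardCharsets.UTF_8);

        // Encrypt
        Cipher enc = Cipher.getInstance("AES/CBC/PKCS5Padding");
        enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ciphertext = enc.doFinal(plaintext);

        // Decrypt and verify the round trip
        Cipher dec = Cipher.getInstance("AES/CBC/PKCS5Padding");
        dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] roundTrip = dec.doFinal(ciphertext);

        System.out.println(Arrays.equals(plaintext, roundTrip)); // true
    }
}
```

The same code path runs on non-Intel hardware as well; it simply falls back to a software AES implementation, which matches the interview's point that AES-NI is an optimization, not a requirement.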
InfoQ: What is the future road map of Project Rhino?
SR: Going forward, the broad objectives of Project Rhino are likely to persist, while the underlying sub-projects (usually in the form of Apache projects or JIRAs within an existing project) can be expected to evolve. After achieving the milestones in fine-grained security within HBase (mentioned above), two other subprojects now have momentum:
- Encryption for data at rest in HDFS.
- Unified Authorization - working toward a single set of access policies that are enforced regardless of how a user accesses data, whether through Hive, MapReduce or other access paths. This work is being done through the Apache Sentry project.
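The HDFS data-at-rest work (HDFS-6134, which shipped with Hadoop 2.6) exposes encryption to administrators as "encryption zones" bound to keys in the Hadoop Key Management Server; the key and path names in this sketch are hypothetical:

```
# Create a key in the Hadoop Key Management Server (KMS)
hadoop key create patientKey

# Make /data/patients an encryption zone: every file written under it
# is transparently encrypted with data keys derived from patientKey
hdfs crypto -createZone -keyName patientKey -path /data/patients
```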
Project Rhino implements subprojects that become part of Apache Hadoop (and other related Apache projects). CDH bundles Apache Hadoop and these related ecosystem projects, with all the integration work done and the entire solution thoroughly tested and fully documented.
Randy Shoup Jul 03, 2015