Big Data Security: The Evolution of Hadoop’s Security Model
One of the biggest concerns in our present age revolves around the security and protection of sensitive information. In our current era of Big Data, our organizations are collecting, analyzing, and making decisions based on analysis of massive amounts of data sets from various sources, and security in this process is becoming increasingly more important. At the same time, more and more organizations are being required to enforce access control and privacy restrictions on these data sets to meet regulatory requirements such as HIPAA and other privacy protection laws. Network security breaches from internal and external attackers are on the rise, often taking months to be detected, and those affected are paying the price. Organizations that have not properly controlled access to their data sets are facing lawsuits, negative publicity, and regulatory fines.
Consider the following eye-opening statistics:
- A study released this year by Symantec and the Ponemon Institute found that the average organizational cost of one security breach in the United States is 5.4 million dollars1. Another recent study shows that the cost of cybercrime in the U.S. economy alone is 140 billion dollars per year.
- One of the largest breaches in recent history involved Sony’s Playstation Network in 2011, and experts estimate Sony’s costs related to the breach to be somewhere between 2.7 and 24 billion dollars (a wide range, but the breach was so large, it is almost impossible to quantify).2
- Netflix and AOL have already faced (and in some cases, settled) millions of dollars in lawsuits over their management of large sets of data and their protection of personal information – even data that they had “anonymized” and released for research.3
- Beyond quantifiable costs related to security breaches (loss of customers and business partners, lawsuits, regulatory fines), organizations that have experienced such incidents report that the fallout from a data breach results in a diminished trust of the organization and a damaged reputation that could put a company out of business.4
Simply put - Without ensuring that proper security controls are in place, Big Data can easily become a Big Problem with a Big Price Tag.
What does this mean for organizations processing Big Data? The more data you have, the more important it is that you protect it. It means that not only must we provide effective security controls on data leaving our networks, but we also must control access to data within our networks. Depending on the sensitivity of the data, we may need to make certain that our data analysts have permission to see the data that they are analyzing, and we have to understand the ramifications of the release of the data and resulting analysis. The Netflix data breach alone shows us that even when you attempt to “anonymize” data sets, you may also release unintentional information – something that is addressed in the field of differential privacy.
One of the most popular platforms for Big Data processing is Apache Hadoop. Originally designed without security in mind, Hadoop’s security model has continued to evolve. Its rise in popularity has brought much scrutiny, and as security professionals have continued to point out potential security vulnerabilities and Big Data Security risks with Hadoop, this has led to continued security modifications to Hadoop. There has been explosive growth in the “Hadoop security” marketplace, where vendors are releasing “security-enhanced” distributions of Hadoop and solutions that compliment Hadoop security. This is evidenced by such products as Cloudera Sentry, IBM InfoSphere Optim Data Masking, Intel's secure Hadoop distribution, DataStax Enterprise, DataGuise for Hadoop, Protegrity Big Data Protector for Hadoop, Revelytix Loom, Zettaset Secure Data Warehouse, and the list could go on. At the same time, Apache projects, such as Apache Accumulo provide mechanisms for adding additional security when using Hadoop. Finally, other open source projects, such as Knox Gateway (contributed by HortonWorks) and Project Rhino (contributed by Intel) promise that big changes are coming to Hadoop itself.
The great demand for Hadoop to meet security requirements is resulting in ongoing changes to Hadoop, which is what I will focus on in this article.
A (Brief) History of Hadoop Security
It is a well-known fact that security was not a factor when Hadoop was initially developed by Doug Cutting and Mike Cafarella for the Nutch project. As the initial use cases of Hadoop revolved around managing large amounts of public web data, confidentiality was not an issue. For Hadoop's initial purposes, it was always assumed that clusters would consist of cooperating, trusted machines used by trusted users in a trusted environment.
Initially, there was no security model – Hadoop didn’t authenticate users or services, and there was no data privacy. As Hadoop was designed to execute code over a distributed cluster of machines, anyone could submit code and it would be executed. Although auditing and authorization controls (HDFS file permissions) were implemented in earlier distributions, such access control was easily circumvented because any user could impersonate any other user with a command line switch. Because impersonation was prevalent and done by most users, the security controls that did exist were not really effective.
Back then, organizations concerned about security segregated Hadoop clusters onto private networks, and restricted access to authorized users. However, because there were few security controls within Hadoop, many accidents and security incidents happened in such environments. Well-intended users can make mistakes (e.g. deleting massive amounts of data within seconds with a distributed delete). All users and programmers had the same level of access to all of the data in the cluster, any job could access any data in the cluster, and any user could potentially read any data set. Because MapReduce had no concept of authentication or authorization, a mischievous user could lower the priorities of other Hadoop jobs in order to make his job complete faster – or worse, kill the other jobs.
As Hadoop became a more popular platform for data analytics and processing, security professionals began to express concerns about the insider threat of malicious users in a Hadoop cluster. A malicious developer could easily write code to impersonate other users’ Hadoop services (e.g. writing a new TaskTracker and registering itself as a Hadoop service, or impersonating the hdfs or mapred users, deleting everything in HDFS, etc.). Because DataNodes enforced no access control, a malicious user could read arbitrary data blocks from DataNodes, bypassing access control restrictions, or writing garbage data to DataNodes, undermining the integrity of the data to be analyzed. Anyone could submit a job to a JobTracker and it could be arbitrarily executed.
Because of these security concerns, the Hadoop community realized that more robust security controls were needed, and as a result, a team at Yahoo! decided to focus on authentication, and chose Kerberos as the authentication mechanism for Hadoop, documented in their 2009 white paper.
The release of the .20.20x distributions of Hadoop accomplished their goals, by utilizing the following:
- Mutual Authentication with Kerberos RPC (SASL/GSSAPI) on RPC connections – SASL/GSSAPI was used to implement Kerberos and mutually authenticate users, their processes, and Hadoop services on RPC connections.
- “Pluggable” Authentication for HTTP Web Consoles - meaning that implementers of web applications and web consoles could implement their own authentication mechanism for HTTP connections. This could include (but was not limited to) HTTP SPNEGO authentication.
- Enforcement of HDFS file permissions – Access control to files in HDFS could be enforced by the NameNode based on file permissions - Access Control Lists (ACLs) of users and groups.
- Delegation Tokens for Subsequent Authentication checks - These were used between the various clients and services after their initial authentication in order to reduce the performance overhead and load on the Kerberos KDC after the initial user authentication. Specifically, delegation tokens are used in communication with the NameNode for subsequent authenticated access without using the Kerberos Servers.
- Block Access Tokens for Access Control to Data Block When access to data blocks were needed, the NameNode would make an access control decision based on HDFS file permissions and would issue Block access tokens (using HMAC-SHA1) that could be sent to the DataNode for block access requests. Because DataNodes have no concept of files or permissions, this was necessary to make the connection between the HDFS permissions and access to the blocks of data.
- Job Tokens to Enforce Task Authorization - Job tokens are created by the JobTracker and passed onto TaskTrackers, ensuring that Tasks could only do work on the jobs that they are assigned. Tasks could also be configured to run as the user submitting the job, making access control checks simpler.
- Putting it all together, this provided a significant step forward for Hadoop. Since then, a few notable modifications have been implemented:
- From “Pluggable Authentication” to HTTP SPNEGO Authentication – Although the 2009 security design of Hadoop focused on pluggable authentication, the Hadoop developer community decided that it would be better to use Kerberos consistently, since Kerberos authentication was already being used for RPC connections (users, applications, and Hadoop services). Now, Hadoop web consoles are configured to use HTTP SPNEGO Authentication, an implementation of Kerberos for web consoles. This provided some much-needed consistency.
- Network Encryption - Connections utilizing SASL can be configured to use a Quality of Protection (QoP) of confidential, enforcing encryption at the network level – this includes connections using Kerberos RPC and subsequent authentication using delegation tokens. Web consoles and MapReduce shuffle operations can be encrypted by configuring them to use SSL. Finally, HDFS File Transfer can also be configured for encryption.
Since the security redesign, Hadoop’s security model has by and large stayed the same. Over time, some components of the Hadoop ecosystem have applied their own security as a layer over Hadoop – for example, Apache Accumulo provides cell-level authorization, and HBase provides access controls at the column and family level.
Today’s Hadoop Security Challenges
There are number of security challenges for organizations securing Hadoop, and in a new book that I have written with Boris Lublinsky and Alexey Yakubovich, we dedicate two chapters to securing Hadoop – one focused on Hadoop’s capabilities, and the other focused on strategies for complementing Hadoop security.
Common security questions are:
- How do you enforce authentication for users and applications on all types of clients (e.g. web consoles and processes)?
- How do you make sure that rogue services aren’t impersonating real services (e.g. rogue TaskTrackers and Tasks, unauthorized processes presenting block IDs to DataNodes to get access to data blocks, etc?)
- How do you enforce access control to the data, based on existing access control policies and user credentials?
- How can Attribute-Based Access Control (ABAC) or Role-Based Access Control (RBAC) be implemented?
- How can Hadoop integrate with existing enterprise security services?
- How do you control who is authorized to access, modify and stop MapReduce jobs?
- How can you encrypt data in transit?
- How do you encrypt data at rest?
- How can you keep track of, and audit events & keep track of data provenance?
- What are the best network approaches for protecting my Hadoop cluster on the network?
Many of these can currently be answered by Hadoop’s current capabilities, but many of them cannot, leading to the proliferation of Hadoop security-complementing tools that we see in the industry. Just a few reasons that vendors are releasing security products that complement Hadoop are:
1. No “Data at Rest” Encryption. Currently, data is not encrypted at rest on HDFS. For organizations with strict security requirements related to the encryption of their data in Hadoop clusters, they are forced to use third-party tools for implementing HDFS disk-level encryption, or security-enhanced Hadoop distributions (like Intel’s distribution from earlier this year).
2. A Kerberos-Centric Approach – Hadoop security relies on Kerberos for authentication. For organizations utilizing other approaches not involving Kerberos, this means setting up a separate authentication system in the enterprise.
3. Limited Authorization Capabilities – Although Hadoop can be configured to perform authorization based on user and group permissions and Access Control Lists (ACLs), this may not be enough for every organization. Many organizations use flexible and dynamic access control policies based on XACML and Attribute-Based Access Control. Although it is certainly possible to perform these level of authorization filters using Accumulo, Hadoop’s authorization credentials are limited
4. Complexity of the Security Model and Configuration. There are a number of data flows involved in Hadoop authentication – Kerberos RPC authentication for applications and Hadoop Services, HTTP SPNEGO authentication for web consoles, and the use of delegation tokens, block tokens, and job tokens. For network encryption, there are also three encryption mechanisms that must be configured – Quality of Protection for SASL mechanisms, and SSL for web consoles, HDFS Data Transfer Encryption. All of these settings need to be separately configured – and it is easy to make mistakes.
Implementers requiring security capabilities that Hadoop does not provide today have had to turn to integration of third-party tools, use a vendor’s security-enhanced Hadoop distribution, or come up with other creative approaches.
Big Changes Coming
At the beginning of 2013, Intel launched an open source effort called Project Rhino to improve the security capabilities of Hadoop and the Hadoop ecosystem, and contributed code to Apache. This promises to significantly enhance Hadoop’s current offering. The overall goals for this open source effort are to support encryption and key management, a common authorization framework beyond ACLs of users and groups that Hadoop currently provides, a common token based authentication framework, security improvements to HBase, and improved security auditing. These tasks have been documented in JIRA for Hadoop, MapReduce, HBase, and Zookeeper, and highlights are shown below:
Encrypted Data at Rest - JIRA Tasks HADOOP-9331 (Hadoop Crypto Codec Framework and Crypto Codec Implementation) and MAPREDUCE-5025 (Key Distribution and Management for Supporting Crypto Codec in MapReduce) are directly related. The first focuses on creating a cryptography framework and implementation for the ability to support encryption and decryption of files on HDFS, and the second focuses on a key distribution and management framework for MapReduce to be able to encrypt and decrypt data during MapReduce operations. In order to achieve this, a splittable AES codec implementation is being introduced to Hadoop, allowing distributed data to be encrypted and decrypted from disk. The key distribution and management framework will allow the resolution of key contexts during MapReduce operations so that MapReduce jobs can perform encryption and decryption. The requirements that they have developed include different options for the different stages of MapReduce jobs, and support a flexible way of retrieving keys. In a somewhat related task, ZOOKEEPER-1688 will provide the ability for transparent encryption of snapshots and commit logs on disk, protecting against the leakage of sensitive information from files at rest.
Token-Based Authentication & Unified Authorization Framework - JIRA Tasks HADOOP-9392 (Token-Based Authentication and Single Sign-On) and HADOOP-9466 (Unified Authorization Framework) are also related. The first task presents a token-based authentication framework that is not tightly-coupled to Kerberos. The second task will utilize the token based framework to support a flexible authorization enforcement engine that aims to replace (but be backwards compatible with) the current ACL approaches for access control. For the token-based authentication framework, the first task plans to support tokens for many authentication mechanisms such as LDAP username/password authentication, Kerberos, X.509 Certificate authentication, SQL authentication (based on username/password combinations in SQL databases), and SAML. The second task aims to support an advanced authorization model, focusing on Attribute Based Access Control (ABAC) and the XACML standard.
- Improved Security in HBase - The JIRA Task HBASE-6222 (Add Per-KeyValue Security) adds cell-level authorization to HBase – something that Apache Accumulo has but HBase does not. HBASE-7544 builds on the encryption framework being developed, extending it to HBase, providing transparent table encryption.
These are major changes to Hadoop, but promise to address security concerns for organizations that have these security requirements.
In our fast-paced and connected world where Big Data is king, it is critical to understand the importance of security as we process and analyze massive amounts of data. This starts with understanding our data and associated security policies, and it also revolves around understanding the security policies in our organizations and how they need to be enforced. This article provided a brief history of Hadoop Security, focused on common security concerns, and it provided a snapshot of the future, looking at Project Rhino.
About the Author
Kevin T. Smith is the Director of Technology Solutions and Outreach for the Applied Mission Solutions division of Novetta Solutions, where he provides strategic technology leadership and develops innovative, data-focused and highly-secure solutions for customers. A frequent speaker at technology conferences, he is the author of numerous technology articles and he has authored many technology books, including the upcoming book Professional Hadoop Solutions, as well as Applied SOA: Service-Oriented Architecture and Design Strategies, The Semantic Web: A Guide to the Future of XML, Web Services, and Knowledge Management and many others. He can be reached at KSmith@Novetta.com.
Special thanks to Stella Aquilina, Boris Lublinsky, Joe Pantella, Ralph Perko, Praveena Raavicharla, Frank Tyler, and Brian Uri for their review and comment on some of the content of this article. Also - thanks to Chris Bailey for the “Abbey Road” picture of the evolving Hadoop elephant.
1 Ponemon Institute, “2013 Cost of Data Breach Study: Global Analysis”, May 2013,
2 Business Insider, “Playstation Network Crisis May Cost Sony Billions”,
3 For more information see “CNN/Money – 5 Data Breaches – From Embarrassing to Deadly”, and Wikipedia’s page on the AOL search data leak on anonymized records
4 Ponemon Institute, “Is Your Company Ready for a Big Data Breach?”, March 2013.