Hadoop Summit 2014 Day One - On the Path to Enterprise Grade Hadoop
One overarching theme of Hadoop Summit 2014 is the Hadoop platform's fast evolution toward enterprise grade, with a focus on closing gaps in reliability, security, scalability and manageability. Also evident is an increase in the number of talks that address operationalizing Hadoop in production, indicating an inflection point in its adoption.
This year's summit is well attended, with 3200 attendees, twice as many as the previous year. As expected after the release of YARN late last year, many upcoming innovations in the Hadoop platform are centered around it, most notably Apache Tez and the newly available technical preview of Apache Slider. HortonWorks announced the YARN Ready certification program to assure customers that partner applications have been integrated with HortonWorks Data Platform (HDP) through YARN. Raymie Stata, CEO of AltiScale, briefly introduced the audience to the idea of application isolation in YARN using Docker containers, which he described as critical to running the company's Hadoop as a Service at scale. Vinod Kumar Vavilapalli and Jian He of HortonWorks walked attendees through the present and future of YARN and shared these key features expected in the future:
- Operational enhancements with rolling upgrades
  - Eliminate the need for Resource Manager and Node Manager restarts to enable rolling upgrades
- Enabling more apps beyond MapReduce:
  - Long-running services, with enhancements in log handling, security (specifically access control) and Ambari for management and monitoring
  - Multi-dimensional resource scheduling, baking in finer-grained CPU resources first, followed by disk space, IOPS and networking resources
  - Fine-grained isolation through custom memory monitoring, cgroups (CPU, memory), Linux containers (Docker) and VMs
- Other features:
  - App SLAs for predictability (Microsoft contribution)
  - Node labels (AuthZ for special hardware assignment to users)
  - Node affinity/anti-affinity (explicit job-to-node affinity declaration)
  - Better online queue management
  - Web services through RESTful APIs for submitting, monitoring and killing apps
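To give a feel for the RESTful web services item in the list above, here is a minimal Python sketch that only builds the requests and does not send them. The talk described these services as planned, so the paths are an assumption: they follow the ResourceManager REST API as documented in later Hadoop 2.x releases and should be checked against the documentation for your version. The host name and application id are hypothetical.

```python
import json
from typing import NamedTuple, Optional


class RestCall(NamedTuple):
    """A request description: HTTP method, URL and optional JSON body."""
    method: str
    url: str
    body: Optional[str]


def list_apps(rm_base: str) -> RestCall:
    # Monitoring: list the applications known to the ResourceManager.
    return RestCall("GET", f"{rm_base}/ws/v1/cluster/apps", None)


def app_status(rm_base: str, app_id: str) -> RestCall:
    # Monitoring: fetch the report for a single application.
    return RestCall("GET", f"{rm_base}/ws/v1/cluster/apps/{app_id}", None)


def kill_app(rm_base: str, app_id: str) -> RestCall:
    # Killing: ask the ResourceManager to move the application to KILLED.
    body = json.dumps({"state": "KILLED"})
    return RestCall("PUT", f"{rm_base}/ws/v1/cluster/apps/{app_id}/state", body)


# Hypothetical ResourceManager address and application id.
RM = "http://rm.example.com:8088"
call = kill_app(RM, "application_1400000000000_0001")
print(call.method, call.url, call.body)
```

Keeping the request construction separate from the transport makes the calls easy to inspect before pointing them at a real cluster.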
In the same spirit, Sanjay Radia and Chris Nauroth from HortonWorks explained the state of Hadoop security in the areas of:
- Kerberos-centric approaches including delegation and block tokens for long running services
- Beyond Kerberos through Knox Gateway based SSO integration with LDAP, Siteminder and most recently OAuth
- Using Apache Knox for perimeter security and single access point to potentially multiple Hadoop clusters using REST APIs
- HDFS ACLs augment the existing HDFS POSIX permissions, going beyond the three classes (owner, group, other) to a richer model that includes named users and groups, managed with getfacl and setfacl.
- DDL constructs (GRANT/REVOKE) to manage column-level protection in HiveServer2 (no equivalent Pig construct)
- HBase cell-level authorization defined with ACLs that can be evaluated by the application either first or last in conjunction with table-level ACLs. These Apache Accumulo-style cell visibility labels enable an ABAC model.
- Centralized fine-grained security management and RBAC-based authorization in XA Secure, a recent acquisition by HortonWorks, which will be converted to a full-fledged Apache open source project
- Beyond component specific audit, centralized audits and compliance conformance controls have been introduced in HDP through XA Secure
- Data Security
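The HDFS ACL item in the list above can be illustrated with a small model of how named user and group entries extend the owner/group/other check. This is a simplified Python sketch of POSIX-style ACL evaluation as I understand it, not HDFS code; the `Acl` class, the `is_allowed` function and all user and group names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Acl:
    owner: str
    group: str
    owner_perms: str   # permission strings such as "rw-"
    group_perms: str
    other_perms: str
    mask: str = "rwx"  # upper bound for named entries and the owning group
    named_users: Dict[str, str] = field(default_factory=dict)
    named_groups: Dict[str, str] = field(default_factory=dict)


def _grants(perms: str, wanted: str, mask: str = "rwx") -> bool:
    # Effective permissions are the entry's permissions filtered by the mask.
    effective = set(perms.replace("-", "")) & set(mask.replace("-", ""))
    return set(wanted) <= effective


def is_allowed(acl: Acl, user: str, groups: Set[str], wanted: str) -> bool:
    # 1. The owner is checked against the owner entry only.
    if user == acl.owner:
        return _grants(acl.owner_perms, wanted)
    # 2. A named user entry takes precedence over any group entry.
    if user in acl.named_users:
        return _grants(acl.named_users[user], wanted, acl.mask)
    # 3. Owning group and named groups: any matching entry may grant access.
    matching = [acl.group_perms] if acl.group in groups else []
    matching += [p for g, p in acl.named_groups.items() if g in groups]
    if matching:
        return any(_grants(p, wanted, acl.mask) for p in matching)
    # 4. Everyone else falls through to the "other" entry.
    return _grants(acl.other_perms, wanted)


acl = Acl(owner="alice", group="sales",
          owner_perms="rw-", group_perms="r--", other_perms="---",
          named_users={"bruce": "rw-"}, named_groups={"audit": "r--"})
print(is_allowed(acl, "bruce", set(), "w"))      # named user entry
print(is_allowed(acl, "carol", {"audit"}, "r"))  # named group entry
print(is_allowed(acl, "dave", set(), "r"))       # falls through to other
```

Without the named entries, bruce and carol would both fall through to the "other" class and be denied, which is the limitation of the three-class model that ACLs address.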
The recent spate of security vendor acquisitions by major Hadoop distro vendors is a sign of growing security concerns among customers in both regulated and non-regulated industries.
Julian Hyde, architect at HortonWorks, envisions query optimization as the channel for using memory efficiently, through a new kind of data set that he calls Discardable In-Memory Materialized Queries (DIMMQ). Each part of the name carries meaning:
- A materialized query is a dataset whose contents are guaranteed to be the same as executing a particular query, called the defining query of the DIMMQ. Therefore any query that could be satisfied using that defining query can also be satisfied using the DIMMQ, possibly a lot faster.
- Discardable means that the system can throw it away.
- In-memory means that the contents of the dataset reside in the memory of one or more nodes in the Hadoop cluster.
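The three properties above can be sketched in a few lines of Python. This is a toy, single-process illustration of the idea, not the Optiq or HDFS design; the `DimmqCache` class and its method names are hypothetical, and it only matches queries by exact text, whereas a real optimizer would also answer other queries that the defining query can satisfy.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple
QueryFn = Callable[[], List[Row]]


class DimmqCache:
    """Toy store of discardable in-memory materialized queries."""

    def __init__(self) -> None:
        # In-memory: results live in a plain dict keyed by the defining query.
        self._materialized: Dict[str, List[Row]] = {}
        self.executions = 0  # how many times a defining query actually ran

    def query(self, defining_query: str, run: QueryFn) -> List[Row]:
        # Materialized: if present, the stored rows equal what running
        # the defining query would return, so serve them directly.
        if defining_query in self._materialized:
            return self._materialized[defining_query]
        self.executions += 1
        rows = run()
        self._materialized[defining_query] = rows
        return rows

    def discard(self, defining_query: str) -> None:
        # Discardable: dropping an entry loses no data, only speed,
        # because the rows can always be recomputed from the query.
        self._materialized.pop(defining_query, None)


def run_sales_by_region() -> List[Row]:
    # Stand-in for an expensive query over data on disk.
    return [("east", 10), ("west", 7)]


cache = DimmqCache()
q = "SELECT region, SUM(amount) FROM sales GROUP BY region"
first = cache.query(q, run_sales_by_region)   # executes the query
second = cache.query(q, run_sales_by_region)  # served from memory
cache.discard(q)                              # system reclaims the memory
third = cache.query(q, run_sales_by_region)   # recomputed on demand
print(cache.executions)
```

The point of the sketch is that discarding is always safe: correctness never depends on the cached copy, so the system is free to evict it under memory pressure.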
Work on materialized views is already underway in Apache Optiq, while discardability and in-memory management are planned extensions to HDFS.
There were some critics of the speaker lineup, including Amr Awadallah, CTO of Cloudera, who tweeted:
The Hortonworks Summit is about to start. Seriously, talk line up is ridiculously biased, this is no longer a community event #hadoopsummit
When questioned in response about Hadoop World, Amr replied:
@egwada we gave Hadoop World to O'Reilly Strata exactly to avoid the temptation of being biased like that.
What do you think: should Yahoo and HortonWorks do the same for Hadoop Summit?