Hadoop Summit 2014 Day Two - On the Path to Enterprise Grade Hadoop

The talks on the second day included technical deep dives into performance optimizations for different workloads (primarily in Hive), competing benchmarks from the distro vendors, and operational guidance covering security and governance for enterprises. Stories from the trenches continued to trickle in, and there were some interesting discussions around the business of Hadoop. Vendors in the broader partner ecosystem shared various innovations that plug gaps in the Hadoop platform and deliver on the lifecycle vision.

In his keynote session, Shaun Connolly, VP of Products, described the Hadoop platform of a couple of years ago as a bunch of loosely connected components with nothing at the center to manage them the way an operating system would. YARN changed this fundamentally and expanded the programming paradigms to support various types of workloads. To illustrate, HortonWorks demonstrated a lambda architecture application that uses Apache Kafka as event messaging middleware to transmit truck movement and traffic violation events, which Apache Storm analyzes in real time. Arun Murthy, Founder and Architect at HortonWorks, made the same point in the keynote panel session on the future of Hadoop; he envisions Hadoop as a data-oriented operating system into which one can plug special hardware resources and applications of various workload types. Aaron Davidson from Databricks demonstrated a unified data pipeline application based on Apache Spark, which underlines this broader move towards hybrid processing paradigms on a single platform.
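For readers unfamiliar with the lambda architecture mentioned above, the idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use Kafka or Storm, and the names `TruckEvent`, `LambdaPipeline`, `ingest`, `run_batch` and `violations` are hypothetical and not taken from the HortonWorks demo. An append-only list stands in for the message log, a periodically recomputed counter stands in for the batch layer, and an incrementally updated counter stands in for the real-time layer.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TruckEvent:
    """Hypothetical event shape; the demo's real schema is not given in the article."""
    truck_id: str
    kind: str        # e.g. "speeding", "lane_departure", or "normal"
    timestamp: int


class LambdaPipeline:
    """Toy lambda architecture: a batch view plus a real-time view, merged at query time."""

    def __init__(self):
        self.master_log = []         # append-only history of every event received
        self.batch_view = Counter()  # violation counts recomputed from the full log
        self.speed_view = Counter()  # violation counts seen since the last batch run

    def ingest(self, event):
        # Every event is stored; only violations update the real-time view.
        self.master_log.append(event)
        if event.kind != "normal":
            self.speed_view[event.truck_id] += 1

    def run_batch(self):
        # Recompute the batch view from scratch, then discard the real-time
        # increments that the new batch view now covers.
        self.batch_view = Counter(
            e.truck_id for e in self.master_log if e.kind != "normal"
        )
        self.speed_view.clear()

    def violations(self, truck_id):
        # The serving step merges both views to answer a query.
        return self.batch_view[truck_id] + self.speed_view[truck_id]


if __name__ == "__main__":
    pipeline = LambdaPipeline()
    pipeline.ingest(TruckEvent("truck-1", "speeding", 1))
    pipeline.ingest(TruckEvent("truck-1", "normal", 2))
    print(pipeline.violations("truck-1"))
    pipeline.run_batch()
    print(pipeline.violations("truck-1"))
```

In this sketch a query returns the same count before and after a batch run, which is the property the architecture aims for: the real-time layer covers only the gap until the next batch recomputation.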
Jim Walker, Director of Product Marketing at HortonWorks, moderated a panel discussion analyzing the Hadoop market. During the discussion, Tony Baer of Ovum Research expressed his concern over the amount of venture capital investment in Hadoop technologies:

With all this money being poured into the Hadoop market, even if it grows exponentially to add another 1000 subscriptions next year, will it suffice the investors?

Mike Gualtieri, analyst at Forrester Research, commented that he rarely hears the question of ROI for Hadoop, which underscores why he has never seen a technology trend like this. Jeff Kelly from WikiBon responded that this is because compute, storage and other infrastructure are cheap, and the potential benefit of storing all the data is valuable in the long run. However, all the panelists agreed that the best way to separate signal from noise is to speak to buyers regularly, and that the biggest non-technical hurdle to graduating Hadoop from POC to production is selling its business value to decision makers. There was also consensus that the majority of these POC-style deployments in large companies are the result of shadow IT, driven primarily by technologists aiming to beef up their resumes.

The breakout sessions included talks on Apache Tez and on performance optimizations and benchmarks for interactive SQL. Tony Baer summarized these benchmarketing wars in an excellent blog post on the relevance of SQL in the Hadoop world. He refers to the most recent benchmarks from Cloudera using Impala, from HortonWorks measuring the performance of Hive over Tez, and from Actian for its analytic platform.

On that note, HortonWorks and Cloudera are approaching enterprise security with different solutions, through a mix of vendor acquisitions and competing access control applications. On day one of the summit, HortonWorks covered its approach, which depends heavily on Apache Knox for data access and XA Secure for centralized audit and authorization. Joey Echeverria from Cloudera talked about Apache Sentry's central role in security and the future enhancements planned to address its gaps:

  • Sentry is adding support for more processing engines, primarily MapReduce, Pig and Spark.
  • Improved manageability. The current config file is complex and archaic, and it will be replaced with a database.
  • The Sentry record service is currently in the design phase. It is a distributed service that runs on the cluster and aims to grant access to the underlying files at the record level.
  • Increased focus on Project Rhino to create a framework for encryption.
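
To make the idea of record-level access concrete, here is a minimal Python sketch. The article gives no design details for the Sentry record service, so everything below is hypothetical: `RecordPolicy`, `grant` and `readable` are illustrative names and not part of Apache Sentry's API. The sketch assumes roles are granted predicates over records and that access is denied by default.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


class RecordPolicy:
    """Hypothetical record-level policy: each role maps to predicates over records."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Predicate]] = {}

    def grant(self, role: str, predicate: Predicate) -> None:
        # Allow `role` to read any record for which `predicate` returns True.
        self._rules.setdefault(role, []).append(predicate)

    def readable(self, roles: Iterable[str], records: Iterable[Record]) -> Iterator[Record]:
        # Default deny: a record is yielded only if at least one predicate
        # granted to one of the caller's roles matches it.
        predicates = [p for role in roles for p in self._rules.get(role, [])]
        for record in records:
            if any(p(record) for p in predicates):
                yield record


if __name__ == "__main__":
    policy = RecordPolicy()
    policy.grant("emea_analyst", lambda r: r["region"] == "EMEA")

    rows = [
        {"id": "1", "region": "EMEA"},
        {"id": "2", "region": "APAC"},
    ]
    print(list(policy.readable(["emea_analyst"], rows)))
    print(list(policy.readable(["intern"], rows)))
```

The point of the sketch is the contrast with file-level permissions: two users reading the same underlying file can see different subsets of its records, depending on their roles.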
