Q&A with Saumitra Buragohain on Hortonworks Data Platform 3.0

Hortonworks Data Platform (HDP) 3.0 recently became generally available. Based on Apache Hadoop 3.1, HDP 3.0 includes containerization, GPU support, Erasure Coding and NameNode Federation.

Enterprise features of the new release include Trusted Data Lake, which leverages Apache Ranger and Apache Atlas; both are installed by default with HDP 3.0. Some components have been removed, including Apache Falcon, Apache Mahout, Apache Flume, and Hue, and Apache Slider's functionality has been absorbed into Apache YARN.

InfoQ caught up with Saumitra Buragohain, senior director of Product Management at Hortonworks, regarding Hadoop in general and HDP 3.0 in particular.

InfoQ: Has Hadoop been rendered passé by the success of Spark, Kafka and other Big Data platforms? How are Hadoop in general and HDP 3.0 in particular still relevant in the enterprise, and why should developers care?

Saumitra Buragohain: Hadoop has evolved from its early days a decade ago, when it was mostly about the storage layer (Apache HDFS) and batch workloads (MapReduce). As we are on the cusp of the fourth industrial revolution, the Hadoop big data stack has evolved to include a real-time database (powered by Apache Hive 3.0), a Machine Learning & Deep Learning platform (Apache Spark, TensorFlow), Stream Processing (Apache Kafka, Apache Storm), and an Operational Data Store (Apache Phoenix, Apache HBase). Please stay tuned for our HDP 3.0 blog series! HDP 3.0 can also be deployed both on-prem and on all major cloud providers (Amazon, Azure, Google Cloud).

InfoQ: Is the real-time database in HDP 3.0 aimed at providing Spark-like functionality for interactive queries? Can you describe the implementation in some technical detail and explain how it might help data scientists who are not heavy-duty programmers?

Buragohain: The real-time database is powered by Apache Hive 3.0 and Apache Druid and provides a single SQL layer for both batch and historical datasets. Druid supports OLAP cubing, so large datasets can be queried in real time. Highlighted Apache Hive features in HDP 3.0 include:

  • Workload management for LLAP: You can now run LLAP in a multi-tenant environment without worrying about resource competition.
  • ACID v2 and ACID on by default: We are releasing ACID v2. With the performance improvements in both the storage format and the execution engine, we are seeing equal or better performance compared to non-ACID tables. Thus we are turning ACID on by default and enabling full support for data updates.
  • Hive Warehouse Connector for Spark: Hive Warehouse Connector allows you to connect Spark applications with Hive data warehouses. The connector automatically handles ACID tables.
  • Materialized view navigation: Hive’s query engine now supports materialized views. The query engine will automatically use materialized views when they are available to speed up your queries.
  • Information schema: Hive now exposes the metadata of the database (tables, columns, etc.) via Hive SQL interface directly.
  • JDBC storage connector: You can now map any JDBC.

Highlighted Druid features in HDP 3.0 include:

  • Kafka-Druid ingest: You can now map a Kafka topic into a Druid table. The events will be automatically ingested and available for querying in near real-time.

InfoQ: Containers are everywhere. Can you talk about how containers can be leveraged in HDP 3.0?

Buragohain: YARN has always supported native containers at memory and CPU granularity. We are expanding that model to support Docker containers and have added GPU support on top of memory and CPU. That means that I can now package my applications (such as Spark) with their dependencies, such as Python (whether 2.7 or 3.5) and the various Python libraries, and run them independently, in isolation from other tenants sharing the HDP 3.0 cluster.

That means that I can also run Dockerized TensorFlow 1.8 on YARN, leveraging the GPU pooling features. That also means that I can lift and shift third-party workloads and run them on HDP 3.0. So, that is the power of HDP 3.0, and we have evolved a lot from the Hadoop 1.0 of a decade ago.
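As a rough illustration of the packaging idea described above, the sketch below builds a spark-submit command that asks YARN to launch the Spark driver and executors inside a Docker image. It relies on the Hadoop 3 environment variables `YARN_CONTAINER_RUNTIME_TYPE` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE`, passed through Spark's `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` settings. The helper name, image name and script name are hypothetical, and the sketch assumes the cluster administrator has already enabled the Docker runtime on the NodeManagers; it is not taken from Hortonworks documentation.

```python
from typing import List


def docker_on_yarn_submit(image: str, app: str) -> List[str]:
    """Build a spark-submit command whose containers run in a Docker image.

    `image` and `app` are placeholders supplied by the caller (assumptions).
    """
    # Environment variables read by YARN's Docker container runtime.
    runtime_env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for name, value in runtime_env.items():
        # Apply the same runtime settings to the application master (driver) ...
        cmd += ["--conf", f"spark.yarn.appMasterEnv.{name}={value}"]
        # ... and to every executor container.
        cmd += ["--conf", f"spark.executorEnv.{name}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Hypothetical image that bundles Python 3.5 and the job's libraries.
    print(" ".join(docker_on_yarn_submit("example/spark-py35:latest", "job.py")))
```

Because the Python interpreter and libraries live in the image rather than on the cluster nodes, two tenants could submit jobs with different images (say, one built on Python 2.7 and one on 3.5) without affecting each other.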

InfoQ: Deep Learning is another trending technology and seems like there is increased synergy between HDP 3.0 and Machine Learning. Can you talk about using HDP and Deep Learning?

Buragohain: Absolutely. We have been shipping Spark as a core component of the HDP stack for years now, and it is one of the biggest workloads on HDP across our customer install base. We are now expanding to Deep Learning frameworks and support features such as GPU pooling/isolation, so that expensive GPUs can be shared as a resource between multiple data scientists.

As mentioned above, we are also supporting containerized workloads, so I can run Dockerized TensorFlow 1.8 to train deep learning models, using YARN GPU pooling and the training data stored in the HDP 3.0 data storage layer (or in cloud storage when HDP 3.0 is deployed in the cloud). You can also view our keynote demo from DataWorks Summit, where we trained an autonomous car (1/10 scale) with HDP 3.0 technologies.

InfoQ: Erasure coding is part of Hadoop 3.0. Can you talk about how it’s implemented and if there are any additional tweaks in HDP 3.0?

Buragohain: Erasure Coding is essentially RAID across nodes. Just as vendors and customers in the enterprise storage industry have adopted RAID 6 in favor of RAID 10 (mirroring), we are going through a similar shift with Hadoop data storage (Apache HDFS).

Instead of creating two more copies of the same data, we break the data into six shards and create three parity shards. Those nine shards are then stored on nine nodes. So, if three nodes go down, we still have six shards (data or parity) and can rebuild the data. That is how we provide the same failure resiliency as the three-replica approach while cutting the storage footprint in half.

By default, we use the replica approach. The customer needs to configure a directory to be erasure coded and can choose among multiple Reed-Solomon encoding schemes: RS(6,3), RS(10,4) and RS(3,2). Any data that then goes into that directory is erasure coded. We are initially supporting erasure coding for cold data; however, we are providing an optional Intel Storage Acceleration library as part of HDP 3.0 Utilities for hardware acceleration.
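The shard arithmetic in this answer can be checked with a few lines of Python. The sketch below is only a back-of-the-envelope model, not HDFS code: it treats RS(d,p) as d data shards plus p parity shards placed on distinct nodes, and the function names are made up for illustration.

```python
from fractions import Fraction


def erasure_coding_profile(data_shards: int, parity_shards: int) -> dict:
    """Storage cost and fault tolerance of an RS(data, parity) scheme."""
    total = data_shards + parity_shards
    return {
        "nodes_needed": total,                # one shard per node
        "tolerated_failures": parity_shards,  # any `parity_shards` shards may be lost
        "raw_bytes_per_byte": Fraction(total, data_shards),
    }


def replication_profile(copies: int) -> dict:
    """Same figures for plain replication with `copies` full copies."""
    return {
        "nodes_needed": copies,
        "tolerated_failures": copies - 1,     # at least one copy must survive
        "raw_bytes_per_byte": Fraction(copies, 1),
    }


if __name__ == "__main__":
    replica = replication_profile(3)
    for d, p in [(6, 3), (10, 4), (3, 2)]:
        ec = erasure_coding_profile(d, p)
        ratio = ec["raw_bytes_per_byte"] / replica["raw_bytes_per_byte"]
        print(
            f"RS({d},{p}): {float(ec['raw_bytes_per_byte']):.2f}x raw storage, "
            f"survives {ec['tolerated_failures']} lost shards, "
            f"{float(ratio):.0%} of the 3-replica footprint"
        )
```

For RS(6,3), nine shards hold six shards' worth of data, so every byte costs 1.5 bytes of raw storage, which is half of the 3 bytes that triple replication needs. Under this simple model, RS(6,3) survives the loss of any three shards, while three replicas survive the loss of two copies.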

InfoQ: What does HDP 3.0 add beyond what Hadoop 3.0 provides? What is the roadmap for HDP 3.0 and beyond?

Buragohain: Please stay tuned for our blogs. HDP 3.0 GA was announced recently, and our release notes capture the features in detail (across the real-time database, Stream Processing, the Machine Learning & Deep Learning platform, etc.). At the end of the day, we provide all the tools in the toolbox so that customers can pick and choose (vs. single-workload vendors). We are investing in HDP for the long haul, and we will have more exciting updates in 2019!

Release notes for HDP 3.0 are available from the HDP 3.0 Release Notes page.