Q&A with Microsoft's Arindam Chatterjee Discussing Azure HDInsight 4.0

Azure HDInsight 4.0, based on Apache Hadoop 3.1, has recently been released as a public preview on Azure. The major updates include: Apache Hive 3.0 Low Latency Analytical Processing (LLAP), known as Interactive Query in HDInsight, which delivers improvements for fast queries and transactions; Apache Spark with updatable tables and ACID transactions with Hive Warehouse Connector; and Apache HBase 2.0 and Apache Phoenix 5.0 performance and stability features.

Azure HDInsight is a managed service built around Apache Hadoop, Spark and Kafka for Big Data processing and analytics. In this release, which is based on Apache Hadoop 3.1 and Hortonworks Data Platform (HDP) 3.0, almost all of the components have been updated.

InfoQ caught up with Arindam Chatterjee, principal group manager at Microsoft, regarding the announcements about HDInsight at Microsoft Ignite.

He covers the advantages of the managed service over typical Infrastructure as a Service, the different types of clusters that can be created (Apache Hadoop, Spark, Kafka, Storm, etc.) and how they can be customized depending on site needs. He discusses enterprise security features based around Active Directory integration, migration of data to the latest version of the clusters, and the ability to create Spark clusters with support for both Jupyter and Zeppelin notebooks. Finally, he talks about how community work will be integrated into the product roadmap.

InfoQ: HDInsight bundles Hortonworks HDP versions as-is, correct? What does HDInsight offer beyond the typical advantages of the Platform as a Service versus Infrastructure as a Service, "PaaS versus IaaS", approach?

Arindam Chatterjee: Azure HDInsight does offer HDP from Hortonworks, which is then optimized to operate in Azure against remote stores such as Azure Storage and Azure Data Lake Storage (ADLS) Gen1 and Gen2. HDInsight is a "managed platform" where customers get the full control and extensibility they would expect with their on-premises or IaaS deployments while still benefiting from the high-availability SLAs, 24x7 monitoring and deep integration with other Azure services that they expect from a PaaS offering.

InfoQ: Although HDInsight takes more of a PaaS approach, how easy is it to customize each installation?

Chatterjee: There are several ways to customize an Azure HDInsight cluster. First, HDInsight allows customers to SSH into an HDI cluster and customize it to their requirements, e.g. by installing their own tools or fine-tuning configuration settings. Second, HDInsight allows customers to run a custom script (aka ScriptAction) when provisioning a cluster to customize it as they wish.

Last but not least, customers can choose from approximately 30 of the most popular applications in the Hadoop/Spark community on Azure Marketplace and install them on their clusters. These applications provide solutions across all aspects of a big data application, including data ingestion, machine learning, visualization, data orchestration and governance.

InfoQ: Enterprise Security is a huge requirement in many industry verticals, from finance to healthcare, that deal with Big Data. What does HDInsight offer in this area?

Chatterjee: Enterprise Security in Azure HDInsight is designed to provide comprehensive defense in depth.

  • Network isolation: Customers can isolate their HDInsight clusters within virtual networks (VNets) and configure network security group (NSG) rules to ensure that only approved users/devices can access the clusters. Further, they can use Service Endpoint security to restrict access to the data stores containing their most sensitive data.
  • Authentication: Like all Azure services, Azure HDInsight integrates with Azure Active Directory (AAD) to authenticate all access to the management portal and its functionality. For access to the actual HDI clusters, HDInsight supports Kerberos authentication against Active Directory Domain Services (ADDS). These features enable enterprise users to log in to HDI clusters using their corporate domain credentials.
  • Authorization: In addition to the standard Azure Role Based Access Control (RBAC) policies that are enforced for all management portal operations, Azure HDInsight supports Apache Ranger for fine-grained access control to Hive/HBase tables, Spark and MapReduce jobs, Kafka topics etc.
  • Data protection: With Azure HDInsight, customers use Azure Storage or ADLS Gen1 and Gen2 to store the data. Customers can leverage the encryption at rest features of these stores to protect their data. Customers can choose whether they want to manage their own encryption keys (in Azure Key Vault) or they would like to have Microsoft manage the keys on their behalf.

HDInsight ensures that any data in motion is encrypted using TLS.

InfoQ: Has Spark taken over Big Data use cases, rendering Hadoop passé? How would you compare and contrast the Azure Databricks offering with the HDInsight/Spark offering on Azure?

Chatterjee: While Apache Spark certainly has its strengths when compared to Apache Hadoop (esp. around query performance), we see both the Hadoop and Spark stacks evolving to better meet the growing demands of their user base.

Azure Databricks is a premium Spark offering that is ideal for customers who want their data scientists to collaborate easily and run their Spark-based workloads efficiently and at industry-leading performance.

Azure HDInsight brings both Hadoop and Spark under the same umbrella and enables enterprises to manage both using the same set of tools, e.g. Ambari, Apache Ranger etc. It also offers an industry-standard notebook experience with support for both Jupyter and Zeppelin notebooks. Enterprises that want this ease of manageability across all their big data workloads can choose to use HDInsight.

InfoQ: Can you comment on the migration of data from the previous versions to HDInsight 4.0? Can you recommend any best practices?

Chatterjee: We just released the preview version of Azure HDInsight 4.0 with Apache Hadoop 3.1. We are working with our early adopters to develop best practices around data and code migration from prior versions of HDInsight. Until then, customers are encouraged to review the documentation available in the open source community.

InfoQ: Can you provide a roadmap beyond HDInsight 4.0 and the plan for working with the Hadoop, Spark and other communities besides working with Hortonworks going forward?

Chatterjee: Microsoft continues to be an active participant in the broader open source community, contributing to several projects including Apache YARN and delivering innovative development and diagnosis capabilities across the most popular development tools (like Eclipse, IntelliJ, VS Code etc.). In addition, we are continuing to track emerging scenarios and innovations (like streaming, deep learning, real-time BI etc.) in the data space with the goal of delivering the most secure and cost-effective solutions to our customers.

More technical details about the latest release of HDInsight are available in the recording of the Microsoft Ignite talk.
