Elephant in the Cloud - Hadoop as a Service


Ashish Thusoo, CEO and co-founder of Qubole, recently spoke at the Enterprise Data World Conference (EDW) about "The Elephant in the Cloud", a talk on Hadoop as a Service offerings. These offerings are part of a wider trend in which big data is delivered as a service rather than as a product, and they are intended to help organizations deal with the challenges and costs of running Hadoop at scale. These cloud-based services also benefit from other properties of the cloud, such as dynamic provisioning, elasticity of compute and storage, and availability in multiple geographies.

Ashish started the discussion by saying that data now includes high volumes of interaction data, which is typically unstructured, rather than just the structured transactional data we have been processing in our applications for a long time.

The nature of analytics has also changed. Ashish talked about the "Analytic Value Escalator", which shows the transition from descriptive to prescriptive analytics:

  • Descriptive Analytics (What happened?)
  • Diagnostic Analytics (Why did it happen?)
  • Predictive Analytics (What will happen?)
  • Prescriptive Analytics (How can we make it happen?)

The cloud provides benefits such as on-demand, elastic, and adaptable infrastructure, along with highly scalable object stores and processing. Running big data platforms in the cloud disrupts the on-premises model by providing faster time to production, agility and flexibility of infrastructure, and significant cost reduction.

A Virtual Private Cloud (VPC) helps isolate access to compute and storage and supports security best practices. Security options in a VPC include encryption for data at rest and for data in transit over the network, as well as role-based access control for compute and storage.

A modern data platform needs multiple engines, such as those listed below, to address the diverse data processing use cases in a typical organization:

  • Hive for complex batch SQL
  • Spark for data science
  • Presto for interactive simple SQL
  • MapReduce for batch ETL
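
The pairing of engines and use cases above can be sketched as a simple lookup. This is only an illustration of the idea, not something shown in the talk: the `Workload` enum, the `ENGINE_FOR_WORKLOAD` table, and the `pick_engine` function are hypothetical names introduced here, and a real platform would weigh many more factors when choosing an engine.

```python
from enum import Enum


class Workload(Enum):
    """Use cases named in the list above (hypothetical enum for illustration)."""
    COMPLEX_BATCH_SQL = "complex batch SQL"
    DATA_SCIENCE = "data science"
    INTERACTIVE_SQL = "interactive simple SQL"
    BATCH_ETL = "batch ETL"


# Engine per use case, copied from the list above.
ENGINE_FOR_WORKLOAD = {
    Workload.COMPLEX_BATCH_SQL: "Hive",
    Workload.DATA_SCIENCE: "Spark",
    Workload.INTERACTIVE_SQL: "Presto",
    Workload.BATCH_ETL: "MapReduce",
}


def pick_engine(workload: Workload) -> str:
    """Return the engine the list associates with the given use case."""
    return ENGINE_FOR_WORKLOAD[workload]


if __name__ == "__main__":
    for workload in Workload:
        print(f"{workload.value}: {pick_engine(workload)}")
```

Keeping the mapping in one table means a new engine or use case is a one-line change, which is roughly the appeal of a platform that offers several engines behind a single entry point.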

Ashish also discussed a reference architecture for big data in the cloud. This model includes services such as multi-user data access, engine unification, and a cloud orchestration and portability service. He concluded the presentation by saying that a Hadoop as a Service offering is a compelling option to consider when deciding on big data infrastructure.

