Presto-as-a-Service: Interactive SQL Queries on AWS
Presto, a technology from Facebook that enables interactive SQL queries on petabytes of data, has taken a first step toward mainstream adoption. Big Data startup Qubole has launched an alpha of its Presto-as-a-Service offering, integrated with Amazon Web Services.
This new system fits into Qubole’s growing platform, the Qubole Data Service (QDS), alongside existing integrations with Hadoop, Hive and Pig. Presto lends itself well to this kind of managed service since it integrates natively with Hive, HBase and relational databases. It also seems like a natural step for Qubole’s co-founders Ashish Thusoo and Joydeep Sen Sarma, who created Hive and brought HBase to Facebook. One of the main use cases for the service appears to be querying Hive tables whose data is stored on S3, and users of QDS can start running queries on these tables in minutes. Qubole’s service is currently centered on AWS because “this is where we have seen the demand”, says Ashish. Qubole positions itself as a replacement for expensive data-warehouse systems and, as Qubole’s VP of engineering Shrikanth Shankar says, “Presto will provide great value to our users who have previously had to rely on expensive commercial technologies for fast analytics.”
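To make the S3 use case concrete, here is a minimal sketch of the two SQL statements involved: a Hive DDL statement that registers an external table over files in S3, and an aggregate query that Presto could then run against that table. The table name, columns, bucket path and helper function names are all hypothetical, and the sketch only builds the SQL text; it does not call any Qubole or Presto API.

```python
def hive_external_table_ddl(table, columns, s3_location):
    """Build Hive DDL for an external table whose data files live under an S3 prefix.

    `columns` is a list of (name, hive_type) pairs. All names are illustrative.
    """
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        f"LOCATION '{s3_location}'"
    )


def presto_count_by(table, group_col, limit=10):
    """Build an interactive-style aggregate query over the Hive table."""
    return (
        f"SELECT {group_col}, count(*) AS hits FROM {table} "
        f"GROUP BY {group_col} ORDER BY hits DESC LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical table of tab-separated web logs stored in S3.
    print(hive_external_table_ddl(
        "web_logs",
        [("ts", "STRING"), ("url", "STRING"), ("status", "INT")],
        "s3://example-bucket/logs/",
    ))
    print(presto_count_by("web_logs", "status"))
```

Because the table is external, the data stays in S3 and only the table definition lives in the Hive metastore, which is what lets a separate engine such as Presto query the same table.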
Presto is a relatively new technology in the world of Big Data. The project was started in fall 2012 at Facebook, put into production in the first half of 2013, and finally open-sourced in November 2013. The execution model behind Presto is fundamentally different from that of Hive: like other SQL query engines such as Impala from Cloudera or Shark from UC Berkeley, it does not use MapReduce. The key point is that all processing is done in memory and, as Ashish says, “higher memory instances make more sense with presto.” This is one of the main reasons why Presto is able to achieve much lower latencies than Hive, although it is not yet clear how it will compare with Hive 0.12, given the improvements brought by the Stinger project. As Ashish puts it:
“Hive is also certainly becoming faster. We have done some initial tests and will soon publish the results in a blog post.”
In terms of scalability, the fact that Presto is used at Facebook on a 300PB data warehouse should be a testament to its robustness. Other companies such as Airbnb and Dropbox have also started adopting it: “It's an order of magnitude faster than Hive in most our use cases”, says Christopher Gutierrez, manager of online analytics at Dropbox.
One side effect of Qubole’s new service is that the Presto community should grow even stronger. Qubole developers such as Siva Narayanan have said on the Presto group that they “plan on being good citizens of Presto-land and look forward to contributing fixes and features back to trunk”. With more than 2,000 stars and 350 forks on GitHub, the project is already more popular than similar, older open-source projects like Impala.