A Survey and Interview on How Hadoop Is Used Today
We are living in the era of “big data.” With growing computing power, a proliferation of electronic devices, and wider access to the Internet, more data than ever before is being transmitted and collected. Organizations are producing data at an astounding rate. Facebook alone collects 250 terabytes a day. According to Thomson Reuters News Analytics, digital data production has more than doubled from almost one zettabyte (a zettabyte is equal to 1 million petabytes) in 2009; it is expected to reach 7.9 zettabytes in 2015 and 35 zettabytes in 2020.
As organizations have begun collecting and producing massive amounts of data, they have started to recognize the advantages of data analysis, but they are also struggling to manage the massive amounts of information they have. According to Alistair Croll,
companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue …
This means that unless your business understands the data it has, it will not be able to compete with businesses that do. Businesses realize that there are tremendous benefits to be gained from analyzing Big Data in areas such as business competition, situational awareness, productivity, science, and innovation. Today, most companies see Hadoop as a main tool for analyzing their massive amounts of information and mastering the Big Data challenges.
According to the Hortonworks survey, Hadoop today is deployed by many large mainstream organizations (50% of survey respondents were from organizations with over $500M in revenues) across many industries including High-tech, Healthcare, Retail, Financial Services, Government and Manufacturing.
In the majority of cases, Hadoop does not replace existing data processing systems but rather complements them. It is typically used to tap into additional business data and to provide a more powerful analytics system, with the goal of gaining a competitive advantage through better insight into business information. 54% of respondents are utilizing Hadoop to capture new types of data, while 48% are planning to do the same. The main new data types include the following:
- Server log data, enabling IT departments to better manage their infrastructure (64% of respondents are already doing it, while 28% are planning to do it).
- Clickstream data, enabling companies to better understand how customers are using applications (52.3% of respondents are already doing it, while 37.4% are planning to do it).
- Social media data, enabling companies to understand public perception of the company (36.5% of respondents are already doing it, while 32.5% are planning to do it).
- Geo/location data, enabling analysis of travel patterns (30.8% of respondents are already doing it, while 26.8% are planning to do it).
- Machine data, enabling analysis of machine usage (29.3% of respondents are already doing it, while 33.3% are planning to do it).
According to the survey, while traditional data grows at an average rate of about 8% a year, new data types are growing at a rate of 85%+; as a result, it is virtually impossible to collect and process them without Hadoop.
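To get a feel for what those two growth rates mean in practice, the short Python sketch below compounds them year over year. Only the 8% and 85% rates come from the survey; the starting volumes (100 TB of traditional data and 10 TB of new data types) and the function names are hypothetical, chosen purely for illustration.

```python
def project(volume_tb, annual_rate, years):
    """Return the volume after compounding annual_rate for the given number of years."""
    for _ in range(years):
        volume_tb *= 1 + annual_rate
    return volume_tb


def years_until_overtake(traditional_tb, new_tb, traditional_rate, new_rate, max_years=100):
    """Return the first whole year in which new data exceeds traditional data, or None."""
    for year in range(1, max_years + 1):
        traditional_tb *= 1 + traditional_rate
        new_tb *= 1 + new_rate
        if new_tb > traditional_tb:
            return year
    return None


if __name__ == "__main__":
    # Growth rates quoted from the survey; starting volumes are assumed, not survey data.
    TRADITIONAL_RATE = 0.08
    NEW_TYPES_RATE = 0.85
    traditional_start_tb = 100.0  # hypothetical
    new_start_tb = 10.0           # hypothetical

    for year in range(0, 6):
        traditional = project(traditional_start_tb, TRADITIONAL_RATE, year)
        new = project(new_start_tb, NEW_TYPES_RATE, year)
        print(f"year {year}: traditional={traditional:8.1f} TB  new types={new:8.1f} TB")

    print("new data overtakes traditional data in year",
          years_until_overtake(traditional_start_tb, new_start_tb,
                               TRADITIONAL_RATE, NEW_TYPES_RATE))
```

Under these assumed starting points, new data types begin at a tenth of the traditional volume yet overtake it in the fifth year, which illustrates why the survey treats the new data types as the main driver of capacity planning.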
InfoQ had a chance to discuss the survey results with David McJannet, VP Marketing at Hortonworks.
InfoQ: Based on the results of the survey, it seems that Hadoop is more widely than deeply adopted. It looks like more and more people are starting to leverage Hadoop, but its usage is limited, for the most part, to storage of vast amounts of data and simple Hive/SQL queries on this data. Do you think that this trend will continue?
McJannet: I would suggest that Hadoop adoption is very deep in some industries / verticals: it is foundational in the overall data architecture of those who were early adopters and are now using it extensively. But in 2013 we have seen it truly widen which is something that jumps out from this survey taken at the Hadoop Summit.
When we think about what's driving this rapid uptick in adoption, I'd say there are at least 3 clear factors:
- A better understanding of the use cases for Hadoop. This is actually represented in the survey results which show that the 2 primary drivers are (a) to power new analytic applications based on new types of data and (b) as part of an overall architectural initiative to manage the longer term growth in data.
- A rapid evolution in the technology itself, which continues to simplify usage and enable adoption at scale. Hadoop 2 is a huge step forward in multiple respects and representative of years of work done in the broader community.
- The embrace by the ecosystem of vendors in the market. For example, the work done by Microsoft enables Excel users to connect directly to the Hortonworks Data Platform (HDP) to pull data for analysis. Or for more complex analytics often done in a tool like SAS, they've done deep engineering to connect SAS analytic tools to HDP. This makes it relatively straightforward to adopt, and in many cases the end user has no idea they are using Hadoop at all.
InfoQ: What do you see as next steps in Hadoop adoption? How would you define a role of enterprises and vendors like Hortonworks in this process?
McJannet: We see a very consistent pattern as it relates to enterprise usage: most users initially adopt Hadoop to power a new analytic application, most typically driven by a Line of Business such as marketing, or a business group. From there, after the first couple of projects are successful, the data architecture team recognizes the value of Hadoop in the overall data architecture, which then drives the next phase of adoption, often to power a "data lake" or similar concept. For Hortonworks, we see our role as enabling the Hadoop market to function:
- Rally others in the ecosystem to ensure that open source Hadoop continues to evolve in the open and for all
- Provide a truly enterprise grade platform that has been integrated and tested at scale and incorporating the most recent innovations from the open source community
- Ensure integration and interoperability with the tools and technologies that a user already has. This is why we've worked hard to ensure that HDP is certified with those technologies from HP, Microsoft, SAP, SAS, Teradata and more — in fact all of those partners today resell HDP as a component of their offerings. Generally speaking the vendors that most organizations rely on, rely on Hortonworks for Hadoop and that helps the overall market to function and mature more quickly than it otherwise would.
InfoQ: Although Hadoop provides tremendous processing power, well beyond SQL, Hive still remains the 800-pound gorilla of Hadoop usage. And with more companies providing solutions for real-time SQL queries on Hadoop data, the emphasis on SQL as the main Hadoop programming language seems to grow even more. Do you consider this to be a temporary phenomenon (low entry barrier?) or a long-term trend?
McJannet: Given the abundance of SQL skills in the world today, it is no surprise that SQL access is one of the most common ways of accessing data stored in Hadoop. To that end, Apache Hive is by far and away the dominant tool used for SQL queries in Hadoop. There are certainly some new initiatives on top of Hadoop being pushed by proprietary vendors looking to capitalize on the market, but by and large Hive is the standard and likely always will be, particularly given the work being done by Microsoft, SAP, Hortonworks and others in the Stinger initiative to continue to enhance the Speed, Scale and SQL semantics in Hive.
There are many more ways of accessing data stored in Hadoop beyond SQL, but it is probably fair to say that using SQL / Hive will be a primary approach. For example, Hive tends to be the interface used by all of the BI tools on top of Hadoop. For more complex use cases however we do see extensive use of technologies such as Pig (scripted queries) and more generally, higher-end tools that make it invisible to the end-user which interface is being used (such as R or SAS).
Over time the most common interface is likely to be a packaged application (SAS, MicroStrategy, Excel, Business Objects, Platfora, etc.) and the end-user won't need to know what is being used under the covers.
InfoQ: Do you foresee Hadoop being used for building mainstream enterprise applications? When do you think we will see these applications?
McJannet: Absolutely! History is showing us that the web companies are pioneers in adopting these generational technological shifts such as the one underway now with Hadoop. They have based their mainstream applications on Hadoop for years and we are now seeing mainstream enterprises follow a similar path.
It is also for this reason that we have such a focus on integration with the development skills people already have. Case in point: .Net developer? The .Net SDK for Hadoop is based on open source HDP. Java developer? The certification of Java Spring (the predominant framework for building Java apps) with HDP will be a huge enabler of this transition.
When? It is always difficult to make these kinds of predictions but I am reminded that generational transitions often take longer than expected and at the same time are much more profound than expected. The move to Hadoop adoption has been underway for several years now, and is truly beginning to take hold in a material way as evidenced by the growth in our customer base. As a vendor, we see an important aspect of our role as focusing on the integration of technologies and skills to facilitate it in the most timely possible manner.