Bio Ashish Thusoo used to run the teams responsible for building and operating the analytics platform at Facebook. He is also a co-creator of the Apache Hive project which has become a very popular open source data warehousing framework for large scale data analysis on data stored in Hadoop. He has deep expertise in data processing and parallel processing technologies, infrastructure and applications.
QCon is a conference that is organized by the community, for the community.The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community.QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.
1. I am here at QCon London with Ashish Thusoo, how you are doing Ashish? Ashish you worked at Facebook as an Engineering Manager and while you were there you were working on the HIVE project. Can you tell us a little bit about that?
I worked at Facebook from 2007 to 2011 and started off as I was a co-creator of HIVE there with Joydeep [Sen Sarma]. We started off that project in 2007, open-sourced it in 2008 and it’s become a significant part of the Hadoop community and the Hadoop ecosystem with a lot of people using it. My involvement kind of has been down ever since I took engineering manager role managing the entire data and engineering at Facebook in the last one and half years but I am still very proud of where HIVE is today in solving some really hard problems of scale in a really open-source, collaborative and community way.
So back in the day when we were working at Facebook, at that point, it was always a problem of scale. And it is a problem of scale in two dimensions of course one was the problem of scale in the dimension of the amount of data that is coming in. The second was the problem of scale in terms of how fast the organization wanted to move and stuff like that. So of course we started off using MapReduce because that solved the problem of scale for the data part. We know Hadoop is a great system to really scale to large and large clusters, it can hold lots of data and provide lots of computing horse power to process that data. The problem of scale on terms of how we could enable the organization to move fast was still there with Hadoop because people would have to write like Java programs and program MapReduce and all that stuff.
So we kind of said "Ok, we want to create interfaces which people are very familiar with so they don’t have to relearn things." And that was the genesis of Hive for example. It is essentially a SQL-like interface on top of Hadoop, probably a lot of people know about it. Then around it we also built, our whole vision was we wanted to democratize the access. We want to remove any intermediaries between who are the consumers of the data and their data. So we focused a lot in building a lot of self-service tools - a lot of tools around data discovery, a lot of tools around monitoring, a lot of tools around querying and running jobs on this data. And that kind of became the backbone of a lot of batch processing type of jobs that were done in Facebook.
Essentially to summarize the problem that we are always dealing with: it was scale whether it was scaling data and the infrastructure to deal with data or scaling tools to make it much easier for folks to run the stuff. Of course you know once you build all these tools you run into problems of abuse of infrastructure or I won’t say abuse but nobody does it on purpose but you give people much more powerful tools to work with. And then you have to think about how you can isolate faults or how can you isolate one user from the other of this shared infrastructure and stuff like that. So we built a lot of those things. It was a fantastic team very charged up team probably the best team that I worked with and they really pulled off some really great stuff in terms of what they built.
Back in the day when we started off, we were kind of on the traditional systems and this is generally true of how the Big Data ecosystem has also kind of evolved over the years. Initially, of course, MapReduce and Hadoop started off in the domain of search, it was probably the first domain where it kind of started. For us the problem was more on data warehousing and data processing and the systems at that time were fairly limited in what and how they would be able to scale with this stuff. And the limitations came from various different reasons. Of course there were structural, architectural limitations there, there were certain interface limitations for example, our typical RDBMS is extremely structured, there is a well-defined schema and in a dynamic environment you need a much more flexible kind of an interface wherein the schema it is more like schema-less or more like semi-structured stuff.
Those were the problems with that infrastructure and so we kind of took Hadoop into a direction into more towards data warehousing ideal kind of a direction and now if you see that is where Hadoop is finding its footprints in a lot of companies. It is not a substitute to like Big Data warehouses but it’s becoming this powerhouse engine where you could use it for doing a lot of pre-processing of data before you load it into data marts or you could even use it for doing processing of data which doesn’t fit into the typical structured processing of data so TechstanditX or analyzing different kinds of data, that has become possible with this particular technology.
For us solving those problems, even in Facebook it was very enabling technology for what it could achieve and the things that we could do with this which we would not have been able to achieve otherwise.
Mostly a Big Data architecture guy, not that involved with the NoSQL stuff. I have had interactions with the HBase folks which were more front and center in that part of the stuff so I know whole bunch of people in the HBase community. Most of my contributions though have been in the Big Data space.
Very interesting question. I think right now, as I was mentioning, first of all Big Data is becoming more and more of a prevalent problem. There are lot more companies who are hitting this big data problems for whatever reason. I mean you hear it all the time - emergence of sensors, emergence of mobile devices. It’s become so easy to instrument web applications. So of course big data is becoming more and more of a problem. Hadoop has sort of gravitated towards being the system, it’s kind of gravitated towards not replacing anything, it’s becoming more of an enabling technology as opposed to a replacing technology at this point. And the kind of things that it is enabling is all these things around semi-structured data or on large data sets, summarizing all that stuff and things like that.
It is obviously making a lot more inroads in enterprises nowadays, so in order to get there though it has to go through a maturization phase. For example Hadoop specifically I can talk about, there is lot of work which is being done on in the HDFS land to make it clearly enterprise class and things like failover, HA and all that stuff, snapshotting. There is a whole bunch of things which are on the roadmap of a few companies in this particular area. It has also got to start working well with the other systems that are existing in the enterprise so lot of standardized interfaces. People have started investing, like Microsoft recently announced that they were contributing in or they were developing an ODBC driver for Hive and Hadoop. There is already one in the open source but this will probably be a much better version.
So those kinds of things would have to happen for it to gain more and more ground in that ecosystem and that is where I think it is headed to. Of course there are definitely deficiencies in the system where it is right now but the onus is on the open-source community to really raise the bar in terms of what more use cases and what more other environments where this particular infrastructure can work. But I think fundamentally big data is here to stay and it is kind of headed to more adoption in general. I won’t call it completely becoming mainstream, but simply by seeing the evolution of where it was in 2007 and where it is now it seems to have tipped some critical mass where a lot more people are venturing out and trying this stuff. Of course that means a lot more people have this problem as well, but the technology is also getting to a point where lot of people can use it.
I think for adoption of course we all know that technology goes to early adopters and wind market and all that stuff, it is a very well-known theory of crossing the chasm or whatever, all that stuff. I think a very important part for technology to become main market is that it has to become a lot easier to use. It has a lot more education that has to happen for folks to know about this, and I think that’s already happening. But fundamentally if I was to point to one thing I think there is lot more investment needed to make it much more easier to use for folks who are used to a certain paradigm of thinking and they are now kind of shifting to this particular paradigm of thinking, so "What is parallel processing? What is MapReduce? What are the tools? How can I monitor my jobs? How can I easily bring up a cluster? Do I have to tune the heck out of the cluster? Do I have to use set end different parameters?" and things like that.
So that itself will drive a lot more adoption. Of course there are things around as I was mentioning specially as it starts getting more into enterprises a lot more peripheral things have to be built so that it kind of works with other systems in the enterprises. A lot of that work has been done. For example, last year I think in the open-source community, there was lot of focus on security and security for example is a very important part whenever you talk about data there is lot of security implications and things like that. But lot more has to be done to take it to the next level. So those are all various things that would do that.
As far as migration is concerned I don’t think Hadoop is displacing systems at this point but it is kind of enabling new technologies or rather enabling new capabilities. So I think the issue of migration becomes a little lower in the sense that it is not like "Hey, I have this system and now I want to move everything off of this system to Hadoop" and things like that. I think at this point it’s a lot more new capabilities so something that companies couldn’t do before, now they can do in a much more cost effective way with this kind of technologies. I think a lot of that is happening, so migration paths will probably be easier.
In order to displace anything Hadoop has to grow a lot, if you just do raw performance comparisons of course you see the scalability is huge but performance is not there and things like that. So I don’t think it’s kind of positioned to displace things at this point.
From the numbers that I have we talked about back when I was at Facebook, we were generating around 300- 400 terabytes of raw data every day which is a lot of data, it is like 400 one terabyte disks. So that was the raw amount of data which was coming into the system. We would compress it before storing it but there were systems which had to deal with both this amount of bandwidth of the data that is coming in and then process this data to put in forms where it could be stored in a much more compact way, in a way which is a little bit less intensive on resource utilization and things like that.
Sure, there are two things that our clusters in Facebook were always constrained on - one was disk capacity, the other was CPU. A lot of time was spent in fighting both these battles. And there was one set of optimizations that we had put in Hive to save the disk capacity part and this was some work that was done by Yong Ching when he was outside Facebook and kind of contributed back to the open source community but we kind of took it and made it mainstream which was around RC files which is a row-columnar format. Essentially it is rows stored in blocks but within a block you are storing things in a columnar format so that they can compress quite well and things like that.
That saved a lot of space. The other optimizations that we worked on or that was done by the team included putting things such as making map joins much more efficient in terms of how they would use, how much data they could hold in-memory and do processing on that. So those are just the two main ones that I can think of. Apart from that there were a lot of other things that we worked on to reduce the amount of latency and the amount of time that a Hive compiler would take to compile out queries. And a whole bunch of other things just to make the system much more efficient both in terms of latency as well as the resource utilization that goes in clusters.
It’s been great, I have had a great time at the conference and I really appreciate the opportunity to present at this conference as well as to talk to your readers in the form of this interview and I am really looking forward to what happens in the big data space over the years and will continue to be heavily involved in it in one form or the other.