Eli Collins on Hadoop


1. I am Ari Zilka, with InfoQ here talking to Eli Collins. Eli, why don’t you introduce yourself?

Hi. I’m a software engineer at Cloudera. I’m the lead on the platform team and I work on Hadoop. And I’m a committer and PMC member on the project at the Apache Software Foundation.


2. So Cloudera recently announced Cloudera CDH4. Can you walk us through what versions you guys have, for the InfoQ reader? Basically give us sort of the primer on Cloudera releases and Hadoop releases.

Yes. So this is our fourth year. We do a major release every year, and Cloudera is four years old, so CDH4 is the fourth major release of the platform. We do quarterly updates of every release, but once a year we introduce major new functionality. CDH4 recently became generally available. Every release has had a major theme to address: for CDH3 it was security, and for CDH4 it was availability.

When you look at the problems that are preventing people from going into production, that’s been one of the critical issues to address, so we really focused on availability for CDH4. And CDH4 is based on the Hadoop 2.0 series, which has got a lot of changes that have been introduced in the community over the last couple of years: HDFS Federation, MapReduce 2.0, YARN, basically the last several years of development effort.

And CDH3, which is our previous release, is based on the Hadoop 1.0 series of releases that people have been using in production for quite a while.


3. So if CDH4’s main theme is availability and it’s based on Hadoop 2.0, people basically need to upgrade to Hadoop 2.0 to get the value of CDH4. Can they run Hadoop 1.0 and Hadoop 2.0 workloads together? How does it work?

Yes. We preserve the ability to run what we call MR 1.0, or original MapReduce, jobs on CDH4, so you can move over to MR 2.0, which is the new MapReduce framework, or you can run the original CDH3 MapReduce jobs on CDH4 as well. So if you want the new HDFS high availability features, but you don’t want to move over to the new YARN resource management and scheduling framework, you don’t have to: you can use the CDH3 MapReduce with the CDH4 HDFS. Or if you want to go whole hog and move over to the new framework, you can do that as well.

MR 2.0 is still being developed and it’s way more active than MR 1.0, which is stable and has been in production for a while, so it really depends on your appetite for stability and your use case.


4. You’ve mentioned MR 1.0, MR 2.0, YARN. Can you walk us through the difference between 2.0 and 1.0?

Yes. So MR 1.0 basically refers to the original MapReduce sub-project of Hadoop. In the first version of Hadoop, the implementation of MapReduce, the scheduling, resource management, job monitoring, etc. were all implemented in one system, so the execution model of MapReduce was embedded with all the other properties of the framework, which are really not MapReduce specific. There’s been a lot of good work to basically separate those out.

So, in MR 2.0, the cluster itself really doesn’t know anything about MapReduce, it’s just scheduling and resource management and then MapReduce is something that’s submitted with your job and that enables you to support other execution frameworks aside from MapReduce, but it's also just a cleaner more scalable separation of the components in the system.
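The separation described above can be sketched in a few lines of Python. This is a toy model and not YARN's API: `ResourceManager`, `request_containers`, `release_containers` and `run_application` are hypothetical names invented for illustration. The only point is that the cluster side hands out containers without knowing what runs in them, while the execution logic arrives with the job.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResourceManager:
    """Toy cluster scheduler: counts free containers, knows nothing about MapReduce."""
    free_containers: int
    log: List[str] = field(default_factory=list)

    def request_containers(self, app_name: str, wanted: int) -> int:
        # Grant as many containers as are free, up to the requested number.
        granted = min(wanted, self.free_containers)
        self.free_containers -= granted
        self.log.append(f"{app_name}: granted {granted}")
        return granted

    def release_containers(self, count: int) -> None:
        self.free_containers += count


def run_application(rm: ResourceManager, app_name: str,
                    tasks: List[Callable[[], object]]) -> List[object]:
    """Application-side driver (hypothetical): runs tasks in waves sized by the grant."""
    results: List[object] = []
    pending = list(tasks)
    while pending:
        granted = rm.request_containers(app_name, len(pending))
        if granted == 0:
            raise RuntimeError("no containers available")
        wave, pending = pending[:granted], pending[granted:]
        results.extend(task() for task in wave)
        rm.release_containers(granted)
    return results


# The "framework" here is just a list of callables submitted with the job.
rm = ResourceManager(free_containers=2)
words = ["hadoop", "yarn", "hdfs"]
lengths = run_application(rm, "word-length", [lambda w=w: len(w) for w in words])
```

Because the scheduler only sees container counts, the same `ResourceManager` could serve a MapReduce-style driver, an MPI-style driver, or anything else that follows the request and release protocol.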


5. Let’s start with MapReduce 1.0 and Hadoop 1.0. Are there different use cases for the different versions? Maybe there aren’t. But what are some of the common use cases or patterns, the problems people solve with Hadoop?

So I’d say most of the applications fall under two broad use cases. One is data processing. There’s a huge number of applications that essentially involve taking one type of data and transforming it into another type. So if you’re doing mediation for Telco, or you’re closing out the books for transactions, or doing Click Stream Analysis or Sessionization, these sorts of problems are all about taking in a huge amount of data, transforming it, analyzing it, and writing back a higher value form of that same data. That use case is used pretty much across all the major industries and is common to both MR 1.0 and MR 2.0.

And then the other use case is Advanced Analytics. A lot of people today are using an Enterprise Data Warehouse or RDBMS, and the only problems they can solve are the ones that they can express in SQL. One of the key strengths of Hadoop is that it’s highly flexible: you can write a MapReduce job from scratch, or you can write a SQL query using Hive, or you can use Pig to write an ETL flow, or use Mahout for data mining. It really gives you a number of different ways to process the data, and that flexibility is key.

So that’s the other use case: doing Advanced Analytics that you couldn’t express in your existing system with regular queries, and that again is common to both MR 1.0 and MR 2.0. What MR 2.0 really opens the door for is running non-MapReduce execution frameworks. For example, there are people who are using MR 2.0 to run MPI jobs, or to run graph processing, or to build systems that are better for low latency analytics. Hadoop has typically been batch-oriented, so MR 2.0 is the building block that lets you do low latency OLAP-style analytics on Hadoop.
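To make "writing a MapReduce job from scratch" concrete, here is a minimal sketch of the programming model in plain Python. It does not use the Hadoop API; `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative names, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values by key, as the framework would do between map and reduce.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Collapse the values for one key into a single count.
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop runs MapReduce", "MapReduce runs on Hadoop"])
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data between them; the user-visible logic is still just these two functions.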


6. I kind of see this capability coming soon. I wonder if you see the same HBase and MR running in a YARN-based environment. Are you guys working on that right now?

Yes. The issue for that was filed in HBase I think maybe six or seven months ago, and the HBase community and the MapReduce community have been talking about it. Today most people who are running HBase are running dedicated clusters, so they’re maybe using some MapReduce to basically do ETL for HBase, doing data loading and cleansing into HBase, but not co-locating their HBase cluster with their regular Hadoop cluster.


7. Because you can’t really see the HBase data from a normal Pig job, for example?

Yes. Hive now supports HBase, so you could write a Hive query that uses HBase, but the full stack is not totally integrated. And just from an architectural standpoint, HBase runs on top of HDFS, but it pretty much assumes that it owns the whole cluster. So the key part about integrating HBase with YARN is that you could run an HBase cluster on 40% of your nodes, MapReduce jobs on 30%, and MPI on another 20%, and it really becomes a much more general purpose data operating system cluster. You’re thinking less about it as a Hadoop cluster or an HBase cluster or an MPI cluster. That’s really one of the key things that this next generation framework enables.
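The 40/30/20 split mentioned above can be illustrated with a small Python sketch. It is a deliberate simplification: it assigns whole nodes statically, whereas a resource manager such as YARN allocates resources dynamically. `partition_nodes` and the node names are hypothetical.

```python
from typing import Dict, List


def partition_nodes(nodes: List[str], shares: Dict[str, float]) -> Dict[str, List[str]]:
    """Split a node list among frameworks by fractional share; leftovers stay unassigned."""
    if sum(shares.values()) > 1.0 + 1e-9:
        raise ValueError("shares exceed 100% of the cluster")
    assignment: Dict[str, List[str]] = {}
    start = 0
    for framework, share in shares.items():
        # Round to the nearest whole node for this framework's share.
        count = int(round(share * len(nodes)))
        assignment[framework] = nodes[start:start + count]
        start += count
    assignment["unassigned"] = nodes[start:]
    return assignment


nodes = [f"node{i:02d}" for i in range(10)]
plan = partition_nodes(nodes, {"hbase": 0.4, "mapreduce": 0.3, "mpi": 0.2})
```

On a ten-node cluster this gives HBase four nodes, MapReduce three, MPI two, and leaves one node free for whatever is submitted next.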


8. One of the things you were talking about just a minute ago is that some things can’t be easily expressed in SQL; you were talking in the context of MapReduce 1.0. Do you have a simple canonical example for the InfoQ watcher that would drive home for them the power of Hadoop in the context of 'You couldn’t do this in SQL'?

Yes. So SQL is obviously a declarative language, and it’s really useful for expressing what you want your end result to be, but not how to actually perform a particular job. So if your data model doesn’t fit the relational model, or if you need to do a type of computation that just can’t be directly expressed, you can’t use SQL. An example of that is if you want to do data mining and you want to do some form of statistical regression, or sessionization, where the algorithm itself just isn’t expressible as a UDF, a User Defined Function, that you could use in SQL. That kind of analysis just isn’t expressible.

So people who have historically done that type of processing just used other systems to do it, and so…


9. So sessionization is the one I was thinking about. Can you just define that quickly?

Yes. So often in the data processing use case, you have a lot of data coming in from a source system that’s not designed to be analyzed. A classic example of this is Click Streams. Your application tier, your web tier, is just logging 'These users clicked on these things, these things were present on the page when they clicked on them, this is how long it took to get to that page' and whatnot, so it’s a record of that behavior. In the Telco space, call records are the same kind of format: this call between these people took this long.

And what you want to get to is a higher value form of that data, which is, “How do I know how much to bill this phone customer?” or “How do I know what this person was actually doing on the site based on their clicks?” So sessionization is a specific example of taking that raw data and applying Advanced Analytics to come out with higher value data about what was actually going on with the site, so you know how to make your application better or run your network better or whatnot.


10. And I guess conceptually there’s no way in SQL to express that. I can select all hits where the user ID is one primary key, but I can’t tell if that’s one session, two sessions or ten without associating some arbitrary time window to their click streams, right?

Exactly. And SQL is by design not Turing complete and you know it was designed to help you express the outcome in the tuple space of what you want, but not how to get there. And sometimes you just need to express how to get there.
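The sessionization idea discussed above, including the time window, can be written as a short procedural sketch in Python. The 30-minute inactivity timeout, the `(user_id, timestamp)` record shape and the `sessionize` name are assumptions made for illustration; real click logs carry far more fields. In MapReduce terms, grouping by user corresponds to the shuffle and the per-user loop to the reducer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Click = Tuple[str, int]  # (user_id, unix_timestamp_seconds), an assumed record shape


def sessionize(clicks: Iterable[Click], timeout: int = 1800) -> Dict[str, List[List[int]]]:
    """Group each user's click timestamps into sessions split by inactivity gaps."""
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user_id, ts in clicks:
        by_user[user_id].append(ts)

    sessions: Dict[str, List[List[int]]] = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            # A gap longer than the timeout starts a new session.
            if ts - user_sessions[-1][-1] > timeout:
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user_id] = user_sessions
    return sessions


clicks = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 4000), ("u1", 4100)]
result = sessionize(clicks)
```

Here user `u1` ends up with two sessions, because more than 30 minutes pass between the click at 600 seconds and the one at 4000 seconds, while `u2` has a single one-click session.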


11. Anything else you want to let the world know about in terms of Hadoop and Cloudera and stuff you’re working on?

Yes. So, one of the things about Hadoop, as we’ve been touching on in this discussion with the new resource management framework, is that it’s been super active over the last couple of years. If you check back on Hadoop every three months, you’re seeing major advances: availability, security, disaster recovery. We’re really seeing Hadoop grow in ways that are getting it to the point where enterprises in general can adopt this technology in production and solve real problems. And that trend just continues to accelerate, which has been really exciting to watch.

There’s also been a whole ecosystem of people building tools around Hadoop, helping particular use cases or helping system integrators become successful with deployments. For example, we have hundreds of partners that make their tools work with Hadoop so that people can adopt the solutions without a lot of overhead and integration pain. And so we’re seeing Hadoop get adopted in more and more markets and businesses, just because the wheels are really getting greased for them.


12. So what does that lead to in terms of the landscape of competitors for Hadoop? Does Hadoop compete with relational? Does it compete with only OLTP? Does it compete with NoSQL? How does it all fit together?

So we see Hadoop as complementary to relational technology. The pattern that’s been forming in customer data centers is what we call the data center operating system. Previously, you stored some data, but you couldn’t compute on it; maybe you had an ETL process that loaded it into your analytical system, but at that point you could only do what you could express in queries. What people really want is more of a mental model like an operating system, where I can store some data, I can operate on it, I can have two or three programs operate on it. I can run a set of programs over the same pieces of data, generating new data. I can query that data, but I can also store and analyze it at the same time. So it’s really more of a data operating system than a database. RDBMS is still a very key part of building that data operating system, so if you want to do low latency or OLAP queries, or you want to express something with a hundred percent SQL, like you need SQL 2009, 2012, or whatnot, RDBMS is really still a great answer to that. Or if you need to integrate Hadoop with your real time serving systems that are already written against MySQL or Oracle, Hadoop has great integration there.

So we see Hadoop as just being one of the components, along with RDBMS, in an enterprise’s overall data operating system.


13. Last question. So you see relational databases as complementary; I totally agree with that. There are several vendors who assert that the Hadoop that you and I work on every day is not implemented well and that proprietary implementations are better on the things you rattled off: DR, availability, scalability, performance. What’s your view on these proprietary implementations, without naming names?

Yes. So Hadoop is still young in its life cycle. I mean, it’s been around, and at this point Facebook has a multi-hundred-petabyte file system and Yahoo has over forty thousand computers running Hadoop. So in some ways Hadoop is really battle tested at scales that existing proprietary systems haven’t even been tested at.

But I also think there’s a grain of truth in that. Hadoop itself is still rapidly evolving and growing, unlike the relational market, which has pretty much been around for forty or so years. So they’re definitely at different ends of the spectrum. One of the things that I see at Cloudera is there’s a huge amount of investment being poured into Hadoop. There’s a number of companies that are all basically supporting Apache Hadoop, investing in not just Hadoop, but the twelve or thirteen other projects that you need with Hadoop to solve real problems. And there’s a huge ecosystem of database vendors (Oracle is a big partner of Cloudera) and hardware vendors like HP, IBM, Dell, and whatnot that are all super invested in Hadoop. So it’s kind of like Linux. When Linux started, it was originally a hobbyist OS, and it grew, and now everything from your cell phone to your supercomputers to any real database system is running Linux.

I see the same with Hadoop. The net investment that the community and existing players are putting into it pretty much makes it inevitable from a technology perspective. The places that are deficient, like high availability, we work on them and address them. DR, we work on it, we address it. We’re just going to keep turning the crank and removing all the roadblocks to make Hadoop successful for people.

Aug 17, 2012