Bio Eva Andreasson has been working with JVMs, SOA, Cloud, etc. for 10 years. At JRockit she got two patents on GC heuristics and algorithms. She also pioneered Deterministic GC which became the JRockit Real Time product. After two years as the PM for Zing at Azul Systems, she joined Cloudera in 2012 to help drive the future of distributed data processing through Cloudera's Distribution of Hadoop.
Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
Who am I? I am Eva Andreasson and I am a product manager working for Cloudera at the moment. I also have a past as a JVM developer and I love Java.
For someone who does not know what Hadoop is, it is a new way of processing data. Many people might be familiar with a database or a enterprise data warehouse and how you structure data up front to serve different queries, business insight queries you want to ask your data to serve your business.
Hadoop has come in as a very new way of thinking about how to process data. It kind of turns the world upside down. It actually allows you to store your data as is, you do not have to structure it at point of storage. You can decide later, as use cases come up, on how and what structure to apply to your data to serve various questions; meaning you do not have to scale down or lose any of your data to fit into traditional systems, you can actually store everything and decide later what you want to do with it. That is kind of the power of Hadoop. In addition, it is open source and it is linearly scalable and it is very, very cost efficient. So, it is this flexibility on the very cost efficient storage and processing framework has revolutionized the world in the data processing space.
3. Let's drill into that a little bit. So you have a couple of components; you have got a distributed file system [EVA: Yes] and then you've got HBase as well, which is kind of like a key value store, as I understand it, on top of HDFS. Is that right?
It is a column oriented key value store where you can have sparsely populated data, meaning many people use HBase, for instance, to save all your clicks on a web site. You might not click all the links, but for each visitor you save the data that visitor A clicked on all of these 10 links or visitor B clicked on 50 links, allowing you to match behavior. So you can look up a customer’s behavior on your web site and match out who else is similar, as they are visiting your web site, and maybe dynamically change the experience - present similar items that the previous customer with a similar behavior actually purchased. So, you can adapt your customer experience and do a better user experience, shorter sales cycle, and more efficient “get to the information you want on a web page”, as the visitors are there. So, it is very interesting use cases you can do with HBase.
4. If I am executing a MapReduce query on a set of files on HDFS where does that actually happen? Is the data being bought in, and is it being executed centrally, or is it being executed on all the individual nodes and then assembled afterwards? How does it work?
MapReduce is a Java framework and it follows a MapReduce algorithm, where in the mapping phase you write some code on how or what data is of relevance to solve a specific use case, a question you want to ask. So basically return all the rows from all my files that are stored in a distributed way on HDFS that contain “red sweater” or something like that. Then you get all the data that contains “red sweater”. It just maps it out, extracts that data, puts it into temporary work files that you then can apply logic to. You do the reducing step. You actually apply the structure in the mapping phase, while applying the logic in the reducing phase. So now you can count all those entries containing “red sweater” and you can do this over a petabyte of data or 100 terabytes or 1 terabyte. It does not matter because it is parallel processing and linearly scalable with as many nodes as you add to it. So the aggregation and logic in the reducing step returns the end result to you.
So it is a two-step algorithm, executed in parallel on a distributed file system and then aggregated back to you, with no need of structure up front. You decided when you designed your MapReduce job. But that is the simple way of using Hadoop. There are more advanced ways today.
5. That is quite a big shift, I think, in terms of how we think about working with data. So, if I am a large traditional enterprise, perhaps I have a big BI team and a bunch of DBAs who are very used to working with relational data, how easy is it for me to make the switch to working with something like Hadoop?
That is a very good question and I get it a lot. Working for Cloudera, we help customers do that all the time. There is a lot that has happened in the recent years. It is not just MapReduce, the batch oriented Java framework. You have a number of tools, a number of workloads that you can actually execute on the same integrated file storage. So, you have Hive which is a batch oriented query engine, you have Impala, which is a real time query engine, both supporting standard SQL and JDBC and ODBC. So what you can query, the application layer, many of the BI tools can already work on top of Hadoop and its ecosystem today.
Then you also have search, for instance free text search; we have recently integrated Solr to Cloudera’s distribution of Hadoop. Then you have other frameworks and machine learning and even running SasS workloads natively in this platform, allows you to do many of the workloads you previously did in your enterprise data warehouse, for instance, in a much more scalable and cost efficient way on Hadoop.
I would though recommend to start with something very well defined, something you know is repetitive, you have known results in your enterprise data warehouse, where you struggle from the business side perhaps with scale, you have to finish more data processing within the same fixed time window – that is a very common scenario. Take a well scoped out project and work with someone like Cloudera or some other vendor that has experience, some SI that has experience to do this kind of migration, and then learn from that. After that, you will be an expert and will see for sure the value of having Hadoop in your enterprise and you can grow incrementally. That is how our customers have been successful.
6. That is interesting. So I have got very different workloads there if I've got Impala and MapReduce and Solr all running on the same platform. How do you deal with things like resource management issues that I'm presuming that would throw up?
Yes. That is a very good question. We also get that a lot. With the new version coming out in March, we have a productionized YARN, which is a new way of resource scheduling. It is now production ready. It is going to be the default scheduling capacity within the entire platform. We have also added ways to control the resources through our production tool – Cloudera Manager – to make sure you can isolate various workloads from starving each other. We are starting with a static resource allocation for workloads that are more RAM focused, in-memory focused, such as, for instance Solr and HBase; so you can assure they can serve those real time use cases securely with having your data scientists or developer team explore with MapReduce and modeling and machine learning in the background, without affecting those running services.
But the whole idea through Cloudera Manager to control based on what user group they are and what priority they are and what resources should be firmly allocated or dynamically allocated to various use cases; that is key. That is what we provided through our production tool.
That is actually handled by ZooKeeper, which is the ... All these funny animal names in the Hadoop ecosystem. Apache Hadoop ecosystem, has a ZooKeeper - handling the fail over and the process management; a very key component that is seldom highlighted but it definitely deserves the spotlight. It handles HBase process management, it handles SolrCloud process management and it handles a lot of those fail over that you just expect from Cloudera’s distribution in production.
Come to our online free training on our website; that is a good place. Or download the quick-start VM to experiment, to wrap your head around this, because it opens the door to so many possibilities. You need that learning, you need that experimentation. You can also go and download HUE – Hadoop User Experience. It is a project launched by Cloudera, also open-source. There are a lot of “How to” videos that show different applications on how you can interact with the back-end; the various workloads on the enterprise data hub today. And ask us questions in our forums or in the open-source. There are many, many people involved; many engineers happy to help.
9. So of course you have a background with working on GC for Java and I am interested in ... We have talked a little bit about RAM intensive workloads with the search engine, with Solr and so on. Are there specific unsolved GC problems that have a significant impact on the work you are doing with Hadoop?
There are multiple questions in there. Let me try to think and dissect them. I think as we brought more real time workloads to Hadoop, to the enterprise data hub, there are more needs of in-memory workloads such as Solr and HBase and Impala. For Solr and HBase, they are very RAM intensive and they are both based on Java, which means they will rely on good garbage collection algorithms. And, as you might have heard in my talk, the biggest problem today that I see with garbage collection algorithms is the “stop-the-world” operations; especially that is evident when it comes to compaction and the way to handle fragmented heaps; and fragmented heaps are the result of a long time running Java process such as, for instance, Solr processes or HBase processes. So I definitely see GC problems coming down the road, if not already there, with these kinds of workloads.
If you look at the other workloads, like spinning-up sudden processes, applying structure to data, returning the result – it is a more dynamic, up and down spin; that is a different profile. There I do not see much garbage collection innovation helping.
But for those profiles where there are long-time running processes, we need to solve how we get all the threads to safe point, which I think is the biggest bottle neck for stop-the-world operations in garbage collectors today.
Charles: Great. OK. Thank you very much.