00:16:04 video length
Bio Ron founded Think Big Analytics to help customers leverage new data processing technologies like Hadoop and NoSQL databases and R for statistical analysis. Ron leads a team of experts in flexible data processing and analytics, helping customers rapidly develop analytic solutions that integrate unstructured and structured data.
QCon is a conference that is organized by the community, for the community. The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.
1. Hello, I am Michael Floyd and I am here at QCon San Francisco with Ron Bodkin. Ron founded Think Big Analytics to help customers build data processing solutions with Hadoop and NoSQL databases. So how are you doing, Ron?
Great. How are you Michael?
I’ve previously been the VP of engineering at Quantcast, where we were early users of Hadoop, one of the first companies to use it in production, and we processed up to 2 petabytes of data a day in a Hadoop cluster, working also with an MPP database from Greenplum. So I saw the power of that technology, and as it was starting to get traction in the enterprise I saw an opportunity to help companies really use this technology successfully.
3. We are seeing very high-profile companies: Twitter, Facebook, Yahoo, eBay, LinkedIn and Netflix are all high-profile examples of big data users, with billions of nodes of graph data. But in addition to these large data sets, these companies typically face really high request rates, on the order of a billion or more requests to their public APIs. So I guess what I am asking is, can you set the stage for us for what the archetypal company would be for a big data user?
Sure. You are right, Michael, the first wave of adoption of Hadoop and big data technologies has been in online companies, and you listed some of the best known ones, all of which are early adopters of Hadoop and are using it at massive scale. The largest example is Facebook, which has 70 petabytes of data in its Hadoop cluster and does massive-scale online analytics at petabyte scale with HBase, which is a NoSQL database built on top of Hadoop. But what we are seeing now is that there is a next wave of adoption happening beyond your classic online companies: you are seeing enterprise companies like JP Morgan Chase, and large Fortune 500 IT companies, device manufacturers, energy companies, healthcare providers and others really using the technology for their own data sets. So while it’s a very different problem space than analyzing online Web data and doing online search or advertising analysis, there are very interesting problems in a lot of domains, and people are seeing real power in these technologies.
And what we are also seeing is that when companies like Quantcast and Yahoo and Facebook adopted Hadoop, they really had no choice but to build their own in-house support and do it themselves, but things have shifted. Now there are commercial distributions of the technology that provide robust support, and there is an opportunity to have partners like Think Big Analytics come in and work with your team to build solutions successfully. So we are seeing a real shift. Also the thing that is interesting is that the enterprise is not typically starting at petabyte scale, but maybe at tens of terabytes or a hundred terabytes of data. People have data warehousing applications that are growing larger by the day, and they want to take some of that unstructured data, some of the staged data, out and process it in Hadoop before they load it into the data warehouse.
People are using Hadoop for a variety of analytics. Many of the first uses of Hadoop are complementing traditional data warehouses, as I just mentioned, where the goal is to take some of the pressure off the data warehouse, to start to process less structured data more effectively, and to do transformations and build summaries and aggregates without having to load all that data into the data warehouse. But then the next thing that happens is that once people have started doing that level of processing, they realize there is power in being able to ask questions of the data that they never thought of before. They can store all the data rather than small samples, and they can go back and have a powerful query engine, a cluster of commodity machines, that lets them dig into that raw data and analyze it in new ways. Ultimately that leads to data science: being able to do machine learning, to discover patterns in data, and to keep improving and refining them.
So the continuum runs from doing more of the same analytics but more cost effectively, to ultimately opening up whole new vistas, whole new ways of doing predictive analytics and supporting new applications on top of data that were never thought possible before.
People will identify all kinds of patterns. In the online advertising space, patterns could be "What kind of users respond to an ad?", "What time of day, what location, what context, what pages are they responding on?" In the case of support for automated devices, with data coming in from installations across the field, they could be "What are the kinds of early warnings that something is about to fail?", "What’s a good indication that it’s time to schedule a proactive service?", "What’s a case when a customer could get more value out and could be interested in purchasing some additional product or service from a company?"
So what you see is the ability to analyze patterns, predict, and find signals to do a better job of tailoring. A classic example would be Netflix and Amazon building recommendation engines, analyzing what kinds of things somebody has bought in order to predict something else that might be of interest to them.
Sure. MapReduce is a technology that was invented at Google as part of their big data environment. It really is a simplified framework for distributed computing that has enough flexibility that you can express a lot of powerful calculations with it. So what happened was when Google was building search indexes as were all their competitors in the search engine space, they wanted to go from having a specialized distributed system that did just that to say "How can we make something that is more general purpose, but that is simple enough that we can reliably scale it and deal with failures?"
And so MapReduce is a functionally inspired programming model where you divide up all the work in a parallel computation across many machines and do what is called a map operation on it, so that you do the same thing in parallel on randomly distributed bits of data. Usually you actually compute locally, near the data, on top of a distributed file system, so you don’t have to move a lot of data over the network. Then the results of that map computation get grouped together, sorted and shuffled to a reducer that is responsible for processing a range of keys, and that reducer then runs some calculation on that data. That calculation can be a classic reduction like adding things up or counting things, or it can simply be a transformation, an operation on a whole group such as transforming data and maybe adding in some subtotal, which is a common thing for an ETL type use case where you take in a set of information and emit about as much information on the other end.
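The map, shuffle and reduce steps described above can be sketched in a few lines of plain Python. This is only an illustrative, single-process sketch and not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are assumptions made for the example, and a real framework would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict

def map_phase(record):
    """Map step: emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key and sort the keys, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce step: a classic reduction that adds up the values for one key."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of records."""
    mapped = (pair for record in records for pair in map_phase(record))
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]

# Hypothetical input records, one line of text each.
records = ["big data big clusters", "data pipelines"]
counts = run_job(records)
```

Here each reducer call sees one key and all of its values, which is the grouping the shuffle provides.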
The net effect is that it’s a simple way of organizing computation that can be distributed well and reliably, and it gives you a building block for building more powerful computations. It’s often the case that MapReduce applications consist of a series of MapReduce steps where the output of one reduce step is fed into the next mapper. So you don’t just write one MapReduce program and everything is computed. Instead you typically have pipelines of multiple stages that transform data repeatedly to produce output.
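As a rough illustration of such a pipeline, the sketch below chains two stages in plain Python, feeding the output of the first reduce step into the second mapper. The helper `run_stage`, the event data and the two stage definitions are all hypothetical, chosen only to show the chaining; this is not how a Hadoop job is actually configured.

```python
from collections import defaultdict

def run_stage(records, mapper, reducer):
    """Run one map/shuffle/reduce stage in memory and return the reducer outputs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

def map_events(event):
    """Stage 1 mapper: emit one count per page view, keyed by user."""
    user, _page = event
    yield user, 1

def map_user_totals(user_total):
    """Stage 2 mapper: re-key each user's total by activity level."""
    _user, total = user_total
    yield total, 1

def sum_values(key, values):
    """Reducer shared by both stages: add up the values for a key."""
    return key, sum(values)

# Hypothetical (user, page) view events.
events = [("ann", "/home"), ("bob", "/home"), ("ann", "/buy"), ("cat", "/home")]

# Stage 1: page views per user. Stage 2: number of users per view count.
per_user = run_stage(events, map_events, sum_values)
histogram = run_stage(per_user, map_user_totals, sum_values)
```

The second stage answers a question (how many users had a given number of views) that cannot be computed until the first stage has finished, which is why multi-stage pipelines are so common.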
There is not an optimizer in classic MapReduce. It’s a low-level primitive for programming, it’s basically the machine code of the distributed environment, so you, the developer, have to write code in a way where you know the access patterns and you know how to tune your access for efficiency: what data to access when, how to do joins, what is an efficient way of representing data, how do I use things like combiners to cache results so that I have less output, how do I compress data to minimize the total I/O throughput of the system. But what’s happened is that there has emerged a set of tools, libraries and languages on top of MapReduce that do in fact have optimizers built in. So you have prevalent things like Hive, which provides a SQL interface on top of MapReduce, Pig, which is a dataflow language for composing MapReduce operations, and Java frameworks such as Cascading, Crunch and Tap, which provide a higher-level API for working with MapReduce jobs.
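To make the combiner idea concrete, here is a small plain-Python sketch, assuming a word count job over two input splits. A combiner pre-aggregates each mapper's output locally so fewer pairs have to be shuffled to the reducers. The names `map_words`, `combine` and `reduce_all` and the sample splits are illustrative assumptions, not part of any Hadoop API.

```python
from collections import Counter

def map_words(split):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_all(pairs):
    """Reducer: add up all values per key across every mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Two hypothetical input splits, each handled by its own mapper.
splits = [["a b a a", "b a"], ["a c", "c c a"]]

shuffled_plain = [pair for split in splits for pair in map_words(split)]
shuffled_combined = [pair for split in splits for pair in combine(map_words(split))]
```

Both paths give the reducer enough information to produce the same totals, but the combined path sends far fewer pairs over the network. This only works because addition can be applied in stages; a reduction that is not associative and commutative cannot be pre-aggregated this way.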
And these tools, libraries and languages all have optimizers built in, so they are getting smarter and smarter about knowing how to organize computation, how to do joins and how to minimize the stages of calculation, and ultimately a good optimizer, as we know from the database world, does a better job. Optimizing a large-scale Hadoop calculation is different from optimizing a database request. Hadoop is about batch analytics, so you want to partition data into small chunks, but you want to do efficient batch calculations, table scans over a limited set of data, and not try to do what a classic database does, where you index things and do disk seeks to minimize requests. You’re not trying to look up a handful of records in Hadoop, you are trying to process a large set of records with high throughput.
Absolutely. What we see is that people are working with big data when they have problems that are better solved with a combination. So people are not ripping out relational databases and using Hadoop. Hadoop is in no way a replacement for a relational database. Instead it’s a powerful batch analytics system that we typically see people first using to complement a data warehouse. They will be using a scalable relational database, but they want to handle unstructured data better, stage data, have a deep archive that is more cost efficient, and then load data into the data warehouse when it’s structured, aggregated and summarized.
The other thing people will do is take data that is computed in Hadoop, apply those patterns, and do some analysis and exploration to come up with better predictions. They will build models and do data science, and they will apply that at massive scale. So they will crunch data as it comes in and apply a model, and then respond in real time. They will take an algorithm, like "What are the recommendations for this user?" or, in a social network context, "Who are the people I might know?", and apply that in batch, so they precompute a list of likely people that you might know. Then when you show up on the site, it will look it up and say: "OK, here is a list, I will show you some of the people that I precomputed as likely ones that you will know." The same kinds of calculation of models and scoring are applied in a wide variety of domains. In IT security, for example, you can use Hadoop to analyze and precompute the kinds of signals that are likely to be threats, and then often use NoSQL scale-out technologies to respond in real time.
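The precompute-then-look-up pattern described above might look roughly like the following plain-Python sketch. The friends-of-friends scoring, the function names `precompute_suggestions` and `serve`, and the sample graph are assumptions made for illustration, and an in-memory dict stands in for whatever NoSQL store would serve the real-time requests.

```python
from collections import Counter

def precompute_suggestions(friends, top_n=2):
    """Batch step: rank friends-of-friends for every user by mutual friend count."""
    suggestions = {}
    for user, direct in friends.items():
        counts = Counter()
        for friend in direct:
            for candidate in friends.get(friend, set()):
                if candidate != user and candidate not in direct:
                    counts[candidate] += 1
        # Sort by descending mutual friend count, then by name for stable output.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        suggestions[user] = [name for name, _ in ranked[:top_n]]
    return suggestions

def serve(store, user):
    """Online step: a single key lookup, with no computation at request time."""
    return store.get(user, [])

# Hypothetical social graph: user -> set of direct friends.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

# The dict plays the role of the key-value store loaded by the batch job.
store = precompute_suggestions(friends)
ann_suggestions = serve(store, "ann")
```

The expensive graph traversal happens once in batch, while the request path is just a lookup by user, which is what lets the site respond in real time.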
Hadoop security has traditionally been done by really restricting access to the environment. For the first few years of Hadoop, you tended to trust the users who logged in and limited access to the environment. As it’s matured we’ve got more robust multitenant solutions, where Hadoop now has security integrated and you can have strong authentication with Kerberos, so that you can rely on knowing the identity of the user who is submitting requests to the system. It’s still good practice to firewall off access to the slave nodes in the cluster and not let other people access them, so they can’t inspect network traffic and can’t tamper with those machines, but you now have the ability to really secure access to the data and the computation based on the permissions that a user or their group has.
10. As a consultant you are meeting with a lot of companies right now. How would you characterize the companies that you are meeting with, what stage of adoption are they in? Are they past a certain point where they need your help? Are they at the very beginning? What is the typical client?
What we are seeing is that there is a range of stages of adoption where we’re engaging with customers. We’ll deal with companies that have done some exploration with Hadoop and NoSQL and are trying to get their arms around a big data strategy, to understand "What do we have now, what is a road map, how do we start to build a pilot that’s effective so we can really use this technology to move forward?" We also deal with some companies that have gone a ways with adoption and are realizing that they need help on how to take it to the next level, how to start taking advantage of some patterns and best practices.
So it varies, and what we see is that effective adoption tends to come from having a clear focus on how to build a first use case with the technology, get something out into production, and have some help from outside. This is complicated, and there are some shifts in mindset and skills compared with deploying on a classic database environment. People need to think about building data in a distributed environment, how to develop for that and how to operate these systems, and they need to start thinking about analytics on unstructured data. So there are a lot of new ideas, and having someone to work with you who has a good background, who has a structure and best practices, is really important. So that is what we are seeing as an effective way of adopting the technology.
Some of the best practices around Hadoop are to really start with a controlled, relatively simple use case that is adding value, get something into production and then iterate, building on that success. So it’s an Agile approach. We typically recommend people work in releases of about 8 weeks, so getting something stood up, building some value, and moving as quickly as you can from the basic use cases of offloading data into starting to get some additional value out of your data, to start getting the experience of analyzing unstructured data in Hadoop. But we do see that the first use case is typically people wanting to take data that was being loaded into a warehouse, take some of the pressure off the warehouse and have longer-term storage for it, so they get some economic benefits.
Absolutely. Continuous delivery is an important part of the process, and indeed having a system that allows you to work with data that doesn’t have a schema associated with it is very much an Agile approach, so that people can start playing with data, looking at it, and investing in structure as they start to get value out of the data.