That’s right Floyd. Thanks, it’s a pleasure to be here. I was previously at Quantcast where we grew an audience measurement and ad targeting business processing up to two petabytes of data a day. Out of the lessons I learned from that scale of processing and the need for constantly improving the data science, I was really excited about the opportunity to bring open source and big data technologies to the market and help people build solutions.
2. Tell us what the big data space. As an architect just trying to understand it, what’s going on?
There has been a tremendous amount of innovation. In the last few years, there has really been this sea shift that the emergence of large scale web architectures has lead to some great innovation. For a long time all of our data processing needs were being handled through a traditional relational database. The emergence of the search engines, social media and large scale advertising broke the bottleneck so that now you have a lot of companies that have innovated and built a distributed processing model on top of distributed file systems and MapReduce and high volume NoSQL and other non-traditional databases. There is a wealth of investment that’s happened. Hadoop is an open-source implementation of MapReduce and distributed files that Yahoo! invested in heavily over the last 5 years.
Quantcast and Facebook and Twitter and LinkedIn have all adopted and used in varying ways but built technologies on top of. That innovation has resulted in some great opportunities to leverage data in new ways. It’s very timely, because the data volumes are exploding. If you look at the growth of transactional data from the applications, that’s increasing, but nothing like the way that unstructured application data events, clickstreams, data about preferences and interactions, products, etc. is growing.
3. What are the various usage architectural patterns with big data right now?
There are a number of interesting patterns. One basic pattern that tends to be foundational is getting insight out of the data through large scale MapReduce or batch processing. There are also patterns around near real-time access, for quick access to information at large scale and the two fit together well in terms of how do you efficiently transfer data from a real-time environment into a batch environment and then back out.
4. Tell us more about the real-time analytics pattern.
Real-time analytics is typically concerned with being able to look up information about something in very short cycle. We can evolve read mostly patterns as well as read-write patterns. In the advertising industry, like at Quantcast, patterns of use could include looking up a profile for an individual to be able to decide in near real-time whether they are a good lookalike for an impression. It could include looking up a profile to give information about something; for Netflix it could include making a recommendation for somebody in near real-time, etc.
There is a range of scenarios where sometimes you can compute a lot of that information upfront and it doesn’t change very frequently. Often, though, a lot of analysis data mind can go into building a good profile, but then you have immediate short-term signals that say "There is something that changed right now! These users said they are not interested in this thing. They just purchased something and I need to immediately shift my response to them based on something they did recently."
A lot of times, real-time analytics are going to be built on top of a scalable high-volume data store. So key value stores and NoSQL databases have become quite popular. If you look at things like a Membase, an HBase, a Riak, a MongoDB, Cassandra - all of those have different flavors. Of course, there are a number of technologies that are vying for popularity in that space, but a common pattern there - Voldemort, for example - is that those technologies are designed to make it simple to do some really basic key look-up operations, maybe with some facility for range access. They excel by being simple, they allow better scalability and some cases they are well optimized for both read and write traffic.
One of the important requirements though is often to have a really good interface to be able to load information into those systems at scale periodically. So having a batch cycle that can compute the information every few minutes to every day where you can take hundreds of gigabytes, a terabyte of data and load it in parallel to the live serving servers without downtime and with the ability to load the data quickly.
6. Are you talking about transactional or analysis patterns there?
In this case we’re talking about transactional. So this is when you need to make a decision as a request is coming in, it’s a transactional pattern that you’re responding to a given user interaction in some form. You could make an immediate decision; maybe even record a result of an immediate decision, so it’s transactional in nature. Often it’s advisory, it’s decision making, it’s not recording the fact that you actually purchased something. So you’ll often see a traditional relational database being used with ACID properties to be sure that if somebody buys something, it’s recorded, but you’ll have a system that’s fairly reliable, but once in a while cannot make an update that’s used for decision making because the goal there is just to have good optimization properties to usually get it right.
In many cases it’s true you’ll develop a schema that’s relevant for the problem domain. Key value stores have a fairly simple overall design and the tendency is to design a key and a value that makes sense for your problem domain rather than having a standard schema.
There are definitely tradeoffs amongst different requirements, like "Do you have a need for having data that’s active in more than one data center?" and "Can you partition your traffic fairly cleanly between the data centers or are you going to be forced to have traffic that’s running to multiple data centers at the same time?" so it’s hard to shard. You have to, of course, look at performance and reliability characteristics and scalability. You have to look at the size of objects and how much read versus write traffic there is. Also, another factor that matters a lot is the amount of memory. Some systems have pretty good cacheability properties and you can serve almost everything out of cache. Some systems excel in that environment, others are more robust - it’s the case of large working sets where you can’t have that all on RAM.
9. What did you do at Quantcast? What did your solution look like?
At Quantcast we ended up building a number of our own technologies. At Think Big Analytics we’re helping different customers of ours work through looking at different, more standard open-source and commercial NoSQL databases because when you are a pioneer sometimes you’ve got to build things that you wouldn’t want to do. A few years later there is such great technologies on the market that makes a lot more sense in my mind to start by looking at what’s already available and open-source.
Big data is used for business intelligence to provide reporting and analytics; it’s used for data science where you can have machine learning, data mining and statistics. People are digging into data to come up with insight, for patterns of how to make systems better, more optimal, like the example of a recommendation engine. It’s also used for basic reporting and profile generation to slice and dice data to come out with a sort of standard information set. It also can be used for investigation purposes like if you want to run down anomalies or understand specific details of what’s going on in the system.
11. Is MapReduce the right approach for business intelligence?
I think that MapReduce is a powerful approach to a lot of these problems. MapReduce gives you a way of harnessing parallel processing so you can divide a problem up and distribute it across many machines in parallel. And you can coordinate the work so then you have a standardized model and you can have a single system that supports a whole range of different data processing applications. That’s pretty powerful because precursors to MapReduce, like the way the first search engines handled problems like this, [used] an entirely custom-distributed system [that] was extremely expensive to build and maintain and therefore you couldn’t build a lot of applications that way. It’s very promising in that way.
In terms of using MapReduce with business intelligence, I think there is a lot of power because you have the capacity to put Petabytes of data online and have access to it in the way that you want. Classically, in data warehousing this is intelligence that was needed to pre-calculate summaries, so you had certain ways you knew you wanted to look at the data and you could slice it only in the ways that you had anticipated. It was costly, if not prohibitive to recompute and to look at things in a different way. That manner was expensive to load the data and it would limit the way you could work with it. In a MapReduce environment you have more capacity to store data without a schema or with less interpretation, so that you can decide later on "I want to analyze it in a different way".
A common pattern is you load your raw data into your MapReduce environment and then you’ll do some downstream processing to put some useful intermediate results you can work with, but you still have raw data that you could process a different way if you wanted to. For example, come up with a different video detection mechanism or a different way of tokenizing key words or a different way of correlating interests. On top of MapReduce there are some interesting emerging products and technologies - you’ve got Datameer and "GOTO Metrics". As product vendors that are building interesting business intelligence capabilities you’ve got Pentaho, a Business Intelligence vendor that’s really embracing Hadoop and MapReduce.
So you are starting to get interesting capabilities for Business Intelligence natively on top of Hadoop and at the lower level you’ve got technologies like Hive that are providing a kind of data warehouse SQL-like interface on top of Hadoop. But there are still a lot of cases where organizations that want to get low latency Business Intelligence, that want to be able to really slice and dice data effectively would use a hybrid architecture, although slice through method data sets inside of Hadoop and then put it into a smaller scale database that they can use a more traditional business intelligence approach on.
That’s still powerful because they could compute those summaries rapidly based on what they wanted, rather than having to pre-decide on them. At Quantcast we would use Greenplum as a mechanism. It’s a parallel database and it could let us work with hundreds of Gigabytes of data that we would cull from Petabytes of data.
There is definitely an interesting shift in the kind of skills and people that are required to make big data systems work. When you look at it, there are some new challenges for enterprise software architects. There are some very different ideas about the kind of people, the kind of skills needed for writing software, whether you are taking software developers that came from a web development background or from traditional ETL or database programming, [there are] different skills there. Administrating and operating a cluster, a cluster administration role, is very different and especially the data scientist [role] is a new and unique one where you really have people with machine learning or a statistics background being able to look at data and tune it and come up with how to get insight out of the data.
Data science is very much a discipline that’s evolved recently, as people are starting to really work with big data and work through large volumes of data. One of the most important things about being effective as a data scientist is to be really good at looking at data and understanding the business problem. So a classic thing is I know somebody is good at that if they can look at a set of information about something specific and they can notice anomalies and they can start to think about the patterns of "What is the data telling me?" More than anything else it’s being on top of the data. Yes, it’s important to have background in statistics and yes, it’s important to know various algorithms for classification and regression, but for the data scientist, the hallmark is a person who really is into the data and can get good insights and thinks about their specific problem.
If you look at it, I think a lot of programmers and a lot of traditionally even statisticians come from a lot more of a classical Euclidian model of thinking about things in terms of abstract reasoning, like "Here is a beautiful way of processing" or "Here is an algorithm for computing some statistic on this set of data." But you really need to dig in and get dirty with the data to be effective in the data science role. Stepping back from that, there is also interesting question about how do you manage a data science function.
14. How do you manage a data science function?
Managing data science is also an emerging art or not actually an emerging art. It’s been understood at some level for a long time, but most enterprises haven’t managed this kind of applied research around data in IT or in many cases [not] in a product function where they’ve had much more traditional project management skills of relatively well-defined lower-risk activities. Applied research is hard, because you can’t schedule innovation in the same way. You can’t know what things are going to work out. If you are doing your job in an applied research department, you’ll quickly discover a lot of things that look promising are not a good idea. The basic mantra is "How quickly can you iterate, how quickly can you test out ideas? How quickly can you prove things are wrong?"
Maybe you get something good out of them and report it so in future you can build on that. That all has to be managed. You need to harness the energy with data scientists so that they are focusing on a problem that makes sense, that’s worth working on, that they’re passionate about, but that has to be tied into something that you are trying to do more generally. Just like developers often work on a problem that’s secondary to the main system and interesting and could be useful isn’t the right priority for the company, the same applies to data science. It’s really important to have a clear metrics-driven approach.
That lets data scientists be creative and think about how to solve the problem, but it gives everyone a common goal. So collaboratively coming up with the right metrics for what you are going to work on is super-important and part of that is because it gives you the freedom to not work on some things because they are not important, they’re not the real objective right now.
I think that big data is a term that applies to both large scale data, but it often refers to less structured data. Whenever you are looking at areas where either of those factors comes in, you see some significant opportunities to use some of these new big data techniques. It emerged around the Internet where often you’ll have large scale data about consumers - it could be advertising data, it could be marketing data, [or] customer interactions on a website. It could be pulling in social data to understand more about your customers and understand what’s going on in the broader community outside the four walls of your enterprise. I think you’ll see it around advertising and marketing that’s going to be a broad theme.
Large scale businesses that are interacting with consumers will often have interesting opportunities to apply these techniques. Pulling in social data and getting smarter about what’s happening in the broad world so that they know how to organize their interactions with customers. We also see this, though, on other datasets. If you look at RFID and Supply Chain where you can have large amounts of data about objects and about transportation and about real-time updates to events. Those are other areas where there are interesting opportunities to apply these techniques. If you look at trading financial instruments, if you look at modeling risk, if you look at those kinds of information problems, pulling in new information sets, variables can also be powerful techniques where these techniques apply.
In science in general, there is a lot of opportunity. Often you’ll see that science has engaged people with great data science skills have traditionally worked in very constrained computational environments where they had to distil everything down to a small set. So giving them the ability to pull from a bigger set and distil it down to the right information set for this problem is powerful and innovative and that could be exploration for energy, it could be modeling things in physics, it could be some detailed product design evolution as well.
16. In these environments data can explode, so how do you deal with that? What are the tradeoffs?
I think one of the nice things is this kind of approach of having a massive distributed file system to hold information allows you to keep a lot more data and a lot more of it in a raw form. It dramatically changes the economics of storing and accessing information, so there is a lot more potential to store raw information. At the same time there are certainly some datasets that can be truly massive. Imagine getting video feeds from all the security cameras in a large retail store or science data from remote sensors around the globe. There are going to be cases where you just can’t keep everything and in cases like that, the tradeoffs can be around keeping focused summaries that keep as much information available that you expected to be able to use or time windows of how much data you need to retain.
That being said, often times what I’ve seen is that there is even more of a proliferation of downstream derived data that came from raw source data and it’s often very fruitful to manage the downstream datasets. Having policies around how long you are going to keep data and being able to automate the deletion of still intermediate data that you could always recompute tends to be a really common situation that you need to manage in a large scale environment. As you add more and more data scientists and engineers working with data, each of them are going to tend to have their own working set of things that they want to be working on. With most of us, the tendency is to "keep it in case I’ll need it later" and it’s really important to have policies and approaches that mitigate against that tendency.
17. Looking at the environment of tools and frameworks in this space, are there any missing pieces?
There is certainly a lot of energy in some areas around big data. There is a lot of work around Hadoop and MapReduce, there are people innovating around enterprise MapReduce to improve on distributed file system properties, and to be able to give better management tools that have been traditionally some gap. There is innovation around log delivery being able to pull events into a large-scale Hadoop environment. Those are areas that are starting to be some interesting technologies emerging. The access layer is still an area that I think there have been some gaps, so there is Hive which gives you a nice SQL-like interface, but it’s not quite all of SQL and it’s still fairly immature. It’s promising, but it’s not at the point where you could just plug in something and pretend it’s a database, nor does Hadoop give you low latency.
So I think around Business Intelligence there is a lot of need. Also in terms of productive large scale management of pipelines there are some early efforts, like the Cascading Project and Pig. Both of those have some drawbacks in terms of not giving you a clean OO API that the developer can really work with. I started the project Colossal Pipe to distil some of what I’ve seen that works well in that space to make it effective for developers to be able to build MapReduce applications using POJOs and have really clean dependency management and sharding on top of it. But it’s a young space, so there are a lot of areas where there are some nascent technologies that look promising, but there is a lot of work still required to mature.
It’s an exciting time, but it’s also a very good time to get some guidance from people that have been there to think about what are the pieces that work well together and what are the tradeoffs, because everybody is innovating in different areas and nobody has got a dominant solution that’s the best in everything. I think there is also a lot of room to improve integration between, say, NoSQL databases and Hadoop. We are starting to see emerging solutions there as well but it’s still early days on that.
18. When does small data become big data?
Small and big data are a continuum, so I think that the term "big data" is a lot more about having flexible access to data. I think that’s really what’s at the heart of "big data". Some people get scared off, they say "I don’t have petabytes of data, I don’t have a big data problem" but if you are struggling with having flexible access to the data, if you are working with unstructured data that doesn’t fit well into classical database environment, you may well benefit from some of these non-traditional database processing techniques, these new, less structured, more flexible data techniques. Hadoop can be useful there, too. The term "big data" has sort of gelled around the space, but it’s a little misleading.
19. Any last words?
It’s a very exciting time in big data and it’s something that’s really important for enterprises to be thinking about "How can I leverage more data?" it’s well worth trying out Hadoop at small scale and test it out and see what it can do and think about it might do to improve your organization and how you work with things. I guess I’ll also say that we’re hiring, so I’d love to talk to people that are really interested in working more in the space.