Bio: Ville Tuulos is a researcher with Nokia Research in Palo Alto and has been working with large data sets since 1999, building solutions for statistical information retrieval. He started Disco, an Erlang/Python implementation of the Map/Reduce framework for distributed computing, now used by Nokia and others for prototyping of data-intensive software with hundreds of gigabytes of real-world data.
The Erlang Factory is an event that focuses on Erlang - the computer language that was designed to support distributed, fault-tolerant, soft-realtime applications with requirements for high availability and high concurrency. The main part of the Factory is the conference - a two-day collection of focused subject tracks with an enormous opportunity to meet the best minds in Erlang and network with experts in all its uses and applications.
I’m currently working with Nokia Research. As you know, Nokia is a global company that’s running many kinds of services as well as making handsets. What happens is that we are getting data from different services and different parts of the world, and these are actually pretty remarkable amounts of data. Now, of course the interesting question is "What can you do with that data?" and "What are the interesting questions that you can ask? What kind of new services and new insights can you derive from the data?" That’s what we do at Nokia Research. More specifically, we have been building this Map/Reduce framework, an open-source Map/Reduce implementation called Disco. Basically that’s at the core of all our activities.
Basically I started building Disco in 2007, which is already many years ago, and at that time we had the basic setup. Especially when people from research backgrounds want to do something with data, what you have is a number of servers and some data. At first you can have all your data in a single box, and then maybe you use some Python scripts and you do something with your data. Eventually the dataset grows and you decide that "This is kind of slow in a single box" or maybe you realize that "My box has multiple cores and I would like to utilize those multiple cores instead of just having a single process." Then you start maybe running two processes in parallel and now you have the problem that "I have two processes in parallel and I have two sets of results and then I have to somehow combine them in the end" and so forth.
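The ad-hoc situation described here, several processes on one box whose partial results have to be combined at the end, can be sketched in Python roughly as follows. This is an illustrative sketch only, not code from Disco or from Nokia: the word-counting task and the names `count_words`, `split_chunks`, `merge_counts` and `parallel_count` are assumptions made for the example.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(lines):
    """Count word occurrences in one chunk of lines (runs in one process)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def split_chunks(lines, n):
    """Split a list of lines into n interleaved chunks, one per process."""
    return [lines[i::n] for i in range(n)]


def merge_counts(partials):
    """Combine the per-process partial results into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_count(lines, processes=2):
    """Count words with several processes, then merge the partial results."""
    with Pool(processes) as pool:
        partials = pool.map(count_words, split_chunks(lines, processes))
    return merge_counts(partials)


if __name__ == "__main__":
    # Hypothetical toy input; real data would come from files on disk.
    data = ["a b a", "b c", "a c c"]
    print(parallel_count(data))
```

Even in this small form, the splitting and merging code is infrastructure rather than analysis, which is the overhead the speaker goes on to describe.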
Basically what people use is an ad-hoc set of scripts that take some data, merge it and so forth. Or maybe some people take the approach of building an ad-hoc distribution layer, where you have a master process and slave processes running on different servers. Eventually you realize you are spending too much time on just building the infrastructure to distribute your computation, and that’s basically what we did back in 2007. Then we realized that we actually want to do the work that we are supposed to do, which is producing the insights, the data mining, the machine learning and so forth, and not spend so much time on building the infrastructure.
Actually exactly the opposite thing happened when we started the Disco project and we have been spending time on that, but on the other hand that has really enabled us to build all kinds of things.
Yes. It’s an interesting thing that the original Map/Reduce paper was by Google and I guess it was released and published in 2004. Some time passed and I was reading the paper and it seemed at first like something that people have been really appreciating for a long time - different models for distributed computing. Like I said, originally we had this ad-hoc model, nothing special and we didn’t realize we needed anything. Finally we came to the point that we really needed something and that was the motivation for starting to build Disco, realizing that we need something more and there was the Map/Reduce framework and we had a look at Map/Reduce.
Actually, people came to us asking "Could you use Map/Reduce?" since they knew we were playing with data and we were supposed to know something about these things. Then we thought "Yes, it seems like a decent model" and that’s how the Disco project was started.
At that time there was Hadoop, an early version of Hadoop, and it was available. One thing is that we came from this background of custom ad-hoc scripts and so forth, but something that we realized early on was that the focus should really be on the data. The data is the valuable part, not the code on top of it, and we also realized that once you have the data in one place and you have the machines and everything, you have the platform, then running the analytics is kind of easy. Basically what we wanted was a system that we really know inside-out and that we can integrate with other parts of the system, since we also knew that people were not so interested in the analytics itself, but the results. The results and the stuff you build on top of them are really what matters. We wanted a system that we could easily integrate with the other parts of the system and also easily build stuff on top of.
Of course, we had a look at for example Hadoop and it was like hundreds of thousands of lines of Java code. It could have been possible to maybe start really learning what’s going on, but given that I had a background in Erlang and I happened to know Erlang, I realized "Wait a minute, Map/Reduce is all about doing distributed computing and doing it in a fault tolerant manner". And that’s the basic idea behind Erlang as well, so it’s a perfect match. So why not just piggyback on Erlang and have a simple Erlang framework that takes care of the distribution part and the fault tolerance part, and then we can use our favorite language, Python, for the data analytics part, which is orchestrated by Erlang. It seemed like the perfect combination for us.
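The split of responsibilities described here, where the analyst writes only the map and reduce functions and a framework handles everything else, can be illustrated with a single-process Python sketch. This is not Disco's API and involves no Erlang, distribution or fault tolerance; `run_job`, `map_fn` and `reduce_fn` are hypothetical names, and the in-memory sort stands in for the shuffle step a real framework would perform across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """User-written map: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """User-written reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(inputs, map_fn, reduce_fn):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    # Map phase: apply map_fn to every input record.
    pairs = [kv for record in inputs for kv in map_fn(record)]
    # Shuffle phase: sort so that equal keys are adjacent, then group them.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # Reduce phase: call reduce_fn once per key with all of its values.
    return [reduce_fn(key, (value for _, value in group)) for key, group in grouped]


if __name__ == "__main__":
    print(run_job(["a b a", "b c"], map_fn, reduce_fn))
```

In a real Map/Reduce system the body of `run_job` is the part that gets distributed and made fault tolerant, while `map_fn` and `reduce_fn` stay as small as they are here.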
It’s a funny thing: originally it was something I happened to implement over the Christmas holiday. Actually I talked to somebody before Christmas and he was saying that "There is the Hadoop and it’s really awesome" and I was saying that "Map/Reduce is really simple and especially with some language like Erlang it’s something that you can implement in just a couple of weeks." Disco came into being in a couple of weeks and after that it has been just debugging for the next years. But seriously, the first version is maybe a couple of thousand lines of code. Of course, the devil’s in the details: once you start having more and more data, how do you deal with more machines and more data? That’s what sets Disco apart from many other Map/Reduce systems in existence nowadays.
One thing is that people really like Python and of course you can use Python with Hadoop as well, but on the other hand, it’s not like a first-class citizen. There are certain things that you can do with Python, but it’s not like Python would be in the central place of the universe. With Disco the idea is that Erlang just stays there in the background taking care of the nasty stuff, which is the distribution and fault tolerance. It used to be that you could do everything else in Python, so in that sense we wanted to really emphasize the productivity part. Also, another thing that we have seen is that when you have data, and especially when you have real world data, you really don’t know what’s going on there and what’s in the data. So what you need is some approach that is extremely flexible, so you can first look at the data and realize that "It looks like this" and then you know this is ok.
There are certain rough edges and outliers and so forth and you need to handle them and all this needs to happen in a really flexible manner. Otherwise, you will end up spending lots of time building a huge piece of code that just tries to somehow model the funny things that are happening in the real world, and that’s why we felt that having that flexibility is really important. I think that was a wise decision that we made at that time.
Often people think that Map/Reduce is a silver bullet that solves everything. Of course, what happens in practice is that you have this full stack of things, and Map/Reduce in itself is just one part of the stack. Disco used to be a thin Map/Reduce layer providing the basic distribution and fault tolerance. Now, over time our needs have grown and the datasets have grown and we have more machines and so forth and of course, the community has given input. So what we have done is that we have kept adding layers below and above the Map/Reduce layer. What Disco is nowadays is that you have the storage layer that takes care of storing data and making it persistent. Typically people have log files: you have production systems that produce log files every day and you want to ingest them into your Map/Reduce cluster and store them somewhere so that they are always available for analytics.
The idea is that it should be scalable, that’s the key thing. You can easily have terabytes or even petabytes of data. On top of the storage layer you then have Map/Reduce, the computation layer, and that’s also one of the things that Disco provides. On top of that, the question is what you want to do with the results. Typically nowadays people want to build web services, something like awesome data-intensive mashups and so forth, and for that purpose we have an indexing mechanism, an index framework, so that you can take data that’s stored in the Disco distributed file system, do some processing, extract the parts of the data that you are interested in and then build indices that can be used to run data-intensive web services and build data mining tools that work on the web.
The funny thing is that I was involved in a large-scale urban game called "Manhattan Story Mashup" and for that purpose we had a number of players in Times Square in New York and they were sending photos to our system and that was back in 2005, I guess. For that purpose I built a system in Python, basically a piece of software that took the requests that were coming in and so forth, and what I ended up doing was something that kind of looks like Erlang. You have these small concurrent processes that take care of logging, handling requests and doing something in the background, where you have many things going on. Then I realized that this is nice, but I don’t get support for doing that kind of stuff in Python by nature.
Then I heard about Erlang and I took a look and I realized that this is exactly what I need, this is exactly the kind of language that I need to solve problems like this. I started to learn Erlang, and ever since, when I need to build something that’s concurrent or distributed, it’s really a perfect language for that.
At Nokia, given that we are in this mobile business, all the data that we are getting is somehow related to mobility and the fact that usually there is some location information available. What’s really interesting nowadays is that we are getting these data points from round the world that show if people are searching for something or people were lost in some city in Europe, and we get these requests coming in for different services that Nokia is providing. Now the question is "Can we improve these services and user experiences by looking at what’s going on in the data?" The challenge is that huge amounts of these requests are coming in, and what also makes a big difference is the context for all the requests that are coming in. Let’s say it makes a big difference if the person was driving on a highway and wanted to navigate to a certain location, let’s say a gas station, or maybe the person was walking in Paris and wanted to find the Eiffel Tower or something like that - so basically we want to know the context.
The interesting data analysis task becomes that you want to take a map database that shows you what’s the context for this request and then you take this request information and you do all kinds of stuff to join this information and then produce results that show that people in certain kinds of locations do certain kinds of things and stuff like that. So that’s maybe the more interesting side of things that we do with Disco. Then, there are some much more standard or typical types of things that we do, like doing business reporting and stuff like that, but it’s more the data mining and machine learning stuff that people find interesting. Also something that’s really important for us is that the results that we produce are usually in the form of something that anybody can understand. We make visualizations and we make these interactive tools - "This is what we have found out" and at the end of the day that’s really what matters.
It’s always nice to talk about big data and all kinds of things that you can do with the big data, but you really want something visible and you want something tangible and that’s what we do.
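The join mentioned in this answer, matching incoming requests against a map database to recover their context and then counting what people do in each kind of place, can be sketched as a simple in-memory lookup join in Python. Everything specific here is an assumption for illustration: the record layout, the location identifiers, the context labels and the function name `join_requests_with_context` are hypothetical and do not reflect Nokia's data or Disco's API.

```python
from collections import Counter, defaultdict


def join_requests_with_context(requests, map_db):
    """Join (location_id, query) requests with a location -> context table.

    Returns a mapping from context label to a Counter of queries seen there.
    Requests whose location is missing from map_db are grouped as "unknown".
    """
    per_context = defaultdict(Counter)
    for location_id, query in requests:
        context = map_db.get(location_id, "unknown")
        per_context[context][query] += 1
    return per_context


if __name__ == "__main__":
    # Hypothetical toy data standing in for the map database and request log.
    map_db = {"h1": "highway", "p1": "city centre"}
    requests = [
        ("h1", "gas station"),
        ("p1", "eiffel tower"),
        ("h1", "gas station"),
        ("x9", "cafe"),
    ]
    for context, queries in join_requests_with_context(requests, map_db).items():
        print(context, queries.most_common(1))
```

At the data volumes described in the interview the lookup table would not necessarily fit in one process, so the same join would more likely be expressed as a Map/Reduce job keyed on location.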
This is the first really active open-source project that I’ve been involved in, and especially now that we are leading the development of Disco, we are always very interested to hear about what’s happening in the Disco community outside Nokia. It’s so surprising to hear all kinds of things: we have just heard that people are using it for trading, analyzing all kinds of stuff that they are getting from the stock market, and people are using it for doing ad services on the web. One of the biggest groups of users is in the academic community. People in that community really like using Python and they like to focus on the data itself and then use the minimal amount of code to get the results out.
They don’t have this kind of enterprise background where things have to be really fancy and unique, to have many layers of code before it looks official. In the universities people are doing all kinds of things starting from natural language processing to machine learning and so forth and bio-informatics - that’s one of the topics.
We had this strong idea in the beginning that we want to use Python for doing the data analytics, and now, from using Disco in production and also hearing what other people are doing, we have realized that there are other tools that might be suitable for the analytics task as well. The first thing that comes up is performance: if you want to build something that’s faster than Python (Python is not the fastest language around), we have found out that there is a need for something like that. Where Disco is heading now is that we are making Disco much more language agnostic, so it’s becoming a system that basically provides distribution and fault tolerance services, and then it’s up to you: you can write the worker processes and the data analytics processes in any language you like and you get first-class support for that. So it’s not like you are living in this foreign environment and you are getting just some pieces of data into your system, but you can take your favorite language.
We have some examples - now we’re getting OCaml, which is a nice language for doing high-performance code, or Haskell would be another example, so you can actually use that for your data analysis tasks. That’s the direction we want to take: a more general-purpose Disco, and also to investigate or explore new paradigms maybe, or extend the Map/Reduce paradigm. There is an interesting kind of development that’s now going on with multicore: given that you have more cores in your system and also that the clusters are growing bigger, maybe you want to extend it a bit, so it’s not just the good old Map/Reduce.