
Erik Meijer on Big Data, Types of Data Stores and Reactive Programming


2. […] What is your interest in data, data modelling and data stores?

Sadek's full question: So I thought we could talk a bit about data and data storage. There is a lot going on and it seems like it’s kind of a mess, because people try to find new properties in data stores and come up with different ideas: sometimes they are about scalability, sometimes about other properties, and then there is still the modelling logic and so on. What is your interest in data, data modelling and data stores?

Good question. I must say I find it extremely exciting to look at the world of data today. If I had looked five or ten years back, everybody said this field was done, we knew how to deal with data. Now suddenly, because of Big Data, and I’m going to explain what that means for me, a lot of the assumptions we had don’t work anymore; the whole field has opened up and there is a tremendous amount of new development. You said it might be a little bit of chaos, but I think it’s like looking at a forest: when the forest starts to grow there is lots and lots of diversity at the beginning and it looks like chaos, but after a while things really start to emerge.

I think this is super interesting. When people talk about Big Data, which I think fuelled this movement, they often talk about volume; they think of 'big' as meaning there is lots and lots of data. Some people (and I really like this idea) say it’s not really about the volume, but also about the variety of the data and the velocity of the data. That characterisation corresponds exactly to the way I look at data, and as you and your viewers may know, I love the notion of duality; I think it is one of the most beautiful ideas in mathematics, where you can look at things and find a deep symmetry between them. Now if you look at these three Vs of Big Data, they are all dualities, so let’s go through them. Take velocity: that can be push or pull. Pull-based data is what you have in traditional databases: the data is stored somewhere, and when you need it you pull it out of the database.

What I find much more fascinating, though I’m not saying the other one isn’t interesting, is real-time data, where the data gets pushed at you. The whole world is really pushing data at us: there are GPS signals pushed at us, Twitter streams, your mouse moves; a lot of data is pushed at you, you have to do something with it, and often it ends up in a database from which you pull it out later. So push-based data is fresh data, and pull-based data is data you have archived. There is a duality there, and you can show that push and pull are mathematically dual. That is the first V, velocity.
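
[Editor's note: a minimal Python sketch of the push/pull duality described above; the names and data are illustrative, not from the interview. The two shapes are mirror images: with pull the consumer asks a source for values; with push the producer calls the consumer back.]

```python
from typing import Callable, Iterator

# Pull: the consumer is in control; it asks for each value
# (data at rest, e.g. rows stored in a database).
def pull_consumer(source: Iterator[int]) -> None:
    for value in source:                     # consumer requests values
        print("pulled", value)

# Push: the producer is in control; the consumer registers a callback
# and is called whenever a value appears (data in motion, e.g. mouse moves).
def push_producer(on_next: Callable[[int], None]) -> None:
    for value in (1, 2, 3):                  # producer decides the pace
        on_next(value)

pull_consumer(iter([1, 2, 3]))
push_producer(lambda v: print("pushed", v))
```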

The second V is volume. You could say volume is about big versus small, but the way I like to look at volume is open versus closed. If you look at a typical relational database, or really any store where people want transactions and consistency, that only works in a closed world, because you have to control all the parties in order to synchronise changes and perform a transaction. On the other hand you can assume an open world, where no single party is in control, and then immediately people talk about the CAP theorem and what happens when you have partitioning. Well, you can only have partitioning in an open world, because by definition an open world is one where there can be arbitrary delays between things. So that second V, volume, for me is really open versus closed, small versus big.

Or take 'big': a lot of people who talk about Big Data think about Hadoop and so on; I think Hadoop is actually small data. What is the biggest thing you can think of? Something that is infinite. Imagine you have an infinite stream, like a stream of tweets, and you want to do a word count over that stream. All the traditional algorithms break down, because they assume the data has finite size. So one interesting direction is to think about really big data, namely infinite data, and to do computations on it; then you get very interesting algorithms for dealing with infinite data.
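
[Editor's note: a minimal Python sketch of a word count over an infinite stream, as described above; the data is illustrative. A batch algorithm would wait for 'all' the input, which never arrives; an incremental one emits a running result per element.]

```python
from collections import Counter
from itertools import count, islice
from typing import Iterable, Iterator

def tweets() -> Iterator[str]:
    """Stand-in for an infinite stream of tweets: it never terminates."""
    for i in count():
        yield f"big data stays big {i}"

def streaming_word_count(stream: Iterable[str]) -> Iterator[Counter]:
    """Emit the running word totals after every tweet, instead of one
    final answer over a finite data set."""
    totals: Counter = Counter()
    for tweet in stream:
        totals.update(tweet.split())
        yield totals

# Observe the running counts without ever exhausting the stream.
for snapshot in islice(streaming_word_count(tweets()), 3):
    print(snapshot.most_common(2))
```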

That covers the second V, and the third V is variety, where there is also a duality: people talk about key-value stores versus column stores or relational stores. These two are also duals: if you model the relational data model, with its foreign keys and primary keys, and the key-value model using category theory, they come out dual. An informal way to look at it: a key-value store is what every computer really is; the RAM of your computer is a key-value store, you have an address, you look it up and you get the value. In a database the value has a key, and a foreign key points to the primary key of another row, so the arrows are reversed, which is one way to recognise a duality. Now we have these three Vs and we know each is a duality, and together they give you a design space, a cube of Big Data with eight corners, and at each of those corners there are interesting databases. Time is too short to go into the details, I have a CACM paper and an ACM Queue paper about this, but at each of these corners there are existing databases that fit.
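
[Editor's note: a toy Python illustration of that reversal of arrows; the records are hypothetical. In the key-value view you hold a key and follow it outward to a value; in the relational view a row holds a foreign key pointing back at another row's primary key.]

```python
# Key-value view: like RAM, address -> contents; follow references outward.
kv = {
    "customer:7": {"name": "Ada"},
    "order:1": {"customer": "customer:7", "total": 42},
}
order = kv["order:1"]                 # key -> value
owner = kv[order["customer"]]         # chase the reference to its target

# Relational view: the order row carries a foreign key that points back
# at the customer's primary key, so the arrow is reversed; to go from a
# customer to their orders you follow the foreign keys inward.
customers = [{"id": 7, "name": "Ada"}]
orders = [{"id": 1, "customer_id": 7, "total": 42}]
ada_orders = [o for o in orders if o["customer_id"] == 7]
print(owner, ada_orders)
```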

Sadek: For instance, SQL is closed, value-to-key...?

SQL, or a traditional relational database, would be pull, foreign key/primary key, and closed.


3. So can we imagine a database that is purely streams: a stream of data going into the database and streams of data coming out of it? E.g. an app pushing a stream of data in would get back a stream of the modifications that happened. Does something like that exist, and would it be interesting to have?

We have to be a little bit more precise here, because if I look at a stream that is, let’s say, pushed at me, then in some sense there is no notion of persistence; the stream just comes at me and I get notified every time a new value appears. Whether the data is persisted or not is in some sense not an interesting question, and that might sound strange at first sight, but the way I explain it is this: your mouse, for me, is a database, because every time it moves the mouse pushes new data out. Maybe in a philosophical sense the mouse stores an infinite amount of data, because by shaking it I can get the data out; but would you say that data is stored in the mouse? I don’t know; now we are getting into philosophy.

Now you can also say: I take the mouse moves and store them on disk or something; then you can either pull them out, or you can have the disk push things out. When you pull them out, that is where you can observe persistence, because you can pull them out again. But with push, too, there may be a way to observe that there is memory, and this is what we call hot versus cold in the reactive framework. If I ask for the stream again, I can see whether I get the same values or not. With the mouse you usually get different values: the mouse just moves, and when you subscribe again you pick up the stream wherever it is. But say I have a file with log data, maybe visitor logs from my website; I can store these in a traditional database or in a file system, but I can also ask for them to be pushed at me, at a certain rate, ordered by their time stamps.
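
[Editor's note: a minimal Python sketch of hot versus cold, independent of the Rx implementation; names and data are illustrative. A cold source replays its values for every subscriber; a hot source emits whether or not anyone listens, so a late subscriber misses earlier values.]

```python
from typing import Callable

# Cold: every subscription triggers its own replay from the start
# (like log records pushed from disk, ordered by time stamp).
def cold(values: list[str]) -> Callable[[Callable[[str], None]], None]:
    def subscribe(on_next: Callable[[str], None]) -> None:
        for v in values:                      # fresh iteration per subscriber
            on_next(v)
    return subscribe

# Hot: the source emits regardless of subscribers (like mouse moves).
class Hot:
    def __init__(self) -> None:
        self._subscribers: list[Callable[[str], None]] = []
    def subscribe(self, on_next: Callable[[str], None]) -> None:
        self._subscribers.append(on_next)
    def emit(self, value: str) -> None:
        for s in self._subscribers:
            s(value)

logs = cold(["visit A", "visit B"])
logs(lambda v: print("cold 1:", v))           # visit A, visit B
logs(lambda v: print("cold 2:", v))           # the same values again

mouse = Hot()
mouse.emit("move(0,0)")                       # no subscriber yet: value is gone
mouse.subscribe(lambda v: print("hot:", v))
mouse.emit("move(1,1)")                       # only this one is observed
```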

So you can turn pull-based and push-based data into each other, and you can do the same with key-value stores and foreign/primary-key stores; that is what object-relational mappers do: you program in terms of key-value, but it is stored in terms of foreign and primary keys. Being dual doesn’t mean things are not isomorphic or that you cannot map between them. So I think there is a very interesting space where you pick your data store, or let’s not call it a store, your data model or data source, based on your application, and then you move around that cube, because maybe you have to deal with a data source that sits at one corner but you want to program against it in a different way. Programmers have been doing this mostly in one dimension, variety: they have a SQL database and want to expose it in a different model, namely a key-value store, so they move along that axis. But you can do the same with push versus pull, and people commonly do this by buffering.
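
[Editor's note: a minimal Python sketch of moving along the push/pull axis by buffering, as just described; the function names are illustrative.]

```python
from queue import Queue
from typing import Callable, Iterable, Iterator

# Push -> pull: buffer the pushed values, then pull at your own pace.
def to_pull(register: Callable[[Callable[[int], None]], None]) -> Iterator[int]:
    buffer: Queue = Queue()
    register(buffer.put)              # the producer pushes into the buffer
    while True:
        yield buffer.get()            # the consumer pulls when ready

# Pull -> push: walk a pull-based source and push each value at a callback.
def to_push(source: Iterable[int], on_next: Callable[[int], None]) -> None:
    for value in source:
        on_next(value)

producer = lambda on_next: [on_next(v) for v in (1, 2, 3)]
pulled = to_pull(producer)
print(next(pulled), next(pulled))     # 1 2 -- pulled on demand
to_push([1, 2, 3], lambda v: print("pushed", v))
```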


4. […] Regarding these three axes that you described, where would you position Datomic?

Sadek's full question: If for instance I keep all the events, persisting all of them with their time stamps in a data store, then I can go back and see everything; I can even get the same stream again, because I persisted all of it. It seems to me that is more or less what Rich Hickey did with Datomic, where he stores all the facts and you can come back and get all the facts, even as a stream if you like. Regarding these three axes that you described, where would you position Datomic?

By the way, I did a similar interview with Rich Hickey about this on Channel 9; people can look at that as well. [Editor's note: http://channel9.msdn.com/posts/Expert-to-Expert-Erik-Meijer-and-Rich-Hickey-Clojure-and-Datomic] One thing I did not talk about in this characterisation is updates. A relational database is really something where you can also update or mutate the values that are stored, so the next time I ask for values I might get different ones back. What Datomic does is take the idea of persistent data structures, where things are not mutated, and make them, Rich has a special name for that and he is very precise with his words, persistent in time, also when you store them on disk. That is a dimension I don’t capture in this model, because I don’t talk about updates to the data, only about access to the data.

You can probably model that as well, because what are changes to the data? They are data themselves. So you could say that a database, no matter whether its data model mutates the data or not, could notify you when a change happens, or let you ask what the changes so far have been. That is more like an RSS feed, where you pull and ask what has changed, or where you are notified whenever something changes; but it says nothing about whether there is an underlying mutation, whether you really do a CRUD operation or do it in a different way. I’m waving my hands a little bit here, and Rich should speak about Datomic himself, but I think it’s an interesting approach, because whenever you do an update you destroy something, and I think you should never destroy anything: then you can’t reverse, it’s lost, and I don’t like to lose things.
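
[Editor's note: a minimal Python sketch of treating changes as data in an append-only log, in the spirit described above; this is not Datomic's implementation, and the fact/transaction names are hypothetical. An "update" appends a new fact instead of destroying the old one, so every past state stays queryable.]

```python
from itertools import count
from typing import Any

_tx = count(1)                        # monotonically increasing transaction ids
log: list[tuple[int, str, Any]] = []  # the change stream is itself data

def assert_fact(key: str, value: Any) -> int:
    """Record a new value for key; never overwrite, only append."""
    tx = next(_tx)
    log.append((tx, key, value))
    return tx

def value_as_of(key: str, tx: int) -> Any:
    """Replay the change stream up to transaction tx to recover that state."""
    current = None
    for t, k, v in log:
        if t <= tx and k == key:
            current = v
    return current

t1 = assert_fact("customer:7/name", "Ada")
t2 = assert_fact("customer:7/name", "Ada Lovelace")   # old fact still there
print(value_as_of("customer:7/name", t1))             # -> Ada
print(value_as_of("customer:7/name", t2))             # -> Ada Lovelace
```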

Sadek: But what you are saying is that rather than seeing a model of modifications, you’d like to see it as a source of things, where you can look at a change set instead of being notified of each change? So you can even query the changes and see what happened in the past?

I think an ideal data processing system should never destroy information, and mutation often means destroying, because you lose the previous values.


5. […] What do you think we should do? We started with Map/Reduce, but Map/Reduce is very static; you need something that is continuously map/reducing the data and giving you interesting results. So what is your take on that?

Sadek's full question: It’s kind of scary today. We talk about data and streams of data, but it seems like we have way too much data and everything happens in real time. You don’t even have the night any more: a few years ago we used to run batches at night, when you could at least analyse the data; you don’t even have that luxury now, you have to analyse data in real time. Some of the data comes as streams from other services, some of it is in your own databases, and you need to do something with it; maybe reactive was one step towards dealing with this data in real time. What do you think we should do? We started with Map/Reduce, but Map/Reduce is very static; you need something that is continuously map/reducing the data and giving you interesting results. So what is your take on that?

I would not call it scary at all, because it keeps us off the streets: full employment for us, since as new technical challenges arise we need to come up with solutions. I think it’s actually quite exciting that we have so much data, and as you said, the real-time aspect is what makes it different from what happened in the past. In traditional data warehousing you would put your data somewhere and create a very elaborate schema describing it, whereas now it’s all much more real-time: things come in, you have to process them as you go, and then decide what to do with them. There are several interesting aspects here. I already mentioned schema: the old data systems were very highly schematised, and what you got in return was that you could have all kinds of different views of your data.

If you normalise your data, as in a relational database, and build the right indexes, you can slice and dice your data in different ways; but that takes time and preparation, and hence money, so you need to know in advance which data you want to treat that way, whereas if the data arrives in real time maybe you don’t want to normalise it. For me, the most concrete example of real-time data is UI events. UI events are just as much data as the customers and orders in a database, but every time the mouse moves you want to react to it immediately and do something: move something on the screen, invoke an action because a button was clicked. So UI programming is a streaming-data problem, and people do transformations on that data, but they are small ones, because you also have to react in real time.
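
[Editor's note: a minimal Python sketch of treating UI events as a stream, as described above; the event shapes are illustrative. Each event is filtered and transformed as it arrives, rather than batched.]

```python
from typing import Iterator, Tuple

Event = Tuple[str, int, int]              # (kind, x, y): a toy UI event

def ui_events() -> Iterator[Event]:
    """Stand-in for the event loop pushing events at us."""
    yield ("move", 10, 20)
    yield ("click", 10, 20)
    yield ("move", 11, 21)

# Small per-event transformations over the stream: filter the clicks,
# project out the coordinates, and react to each one as it arrives.
clicks = (e for e in ui_events() if e[0] == "click")
positions = ((x, y) for _, x, y in clicks)

for pos in positions:
    print("button clicked at", pos)       # immediate reaction, no batch
```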

The other thing you mentioned is also interesting: what happens when you get data from different sources? They might have different schemas, they might be JSON. Say you want to combine your mouse moves with a web service and your GPS coordinates, because you are building some app for your mobile phone. Now you have all this data in different formats that you have to munge and transform. But that brings you into a field where we as developers are pretty good, because that is what we do all day: take data in one format and translate it into another.
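
[Editor's note: a minimal Python sketch of munging two differently shaped sources into one common record format; all field names here are hypothetical.]

```python
import json
from typing import Iterable, Iterator, Tuple

gps_fixes = ['{"lat": 52.37, "lon": 4.90}']         # JSON from a web service
mouse_moves = [(10, 20), (11, 21)]                  # tuples from the UI

def normalise(gps: Iterable[str],
              mouse: Iterable[Tuple[int, int]]) -> Iterator[dict]:
    """Translate both sources into one common record shape."""
    for raw in gps:
        fix = json.loads(raw)
        yield {"source": "gps", "x": fix["lon"], "y": fix["lat"]}
    for x, y in mouse:
        yield {"source": "mouse", "x": x, "y": y}

for record in normalise(gps_fixes, mouse_moves):
    print(record)
```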

Sadek: It seems we forgot about this. For some period of enterprise development, or whatever we want to call it, programming stopped being about transforming data; it became about playing with classes rather than with the data. Now we are back to transforming data a lot, from JSON to HTML to XML to whatever, and all of that is data transformation in disguise; but there was a period when it was very static.

That is an interesting observation, because if you look at a typical Java or C# programmer, or even JavaScript: when you create an instance of a class, you are creating the data out of nothing, well, out of some bytes the memory manager allocates, but you are not transforming existing data. It is a transformation too, but a trivial one, whereas what you want now is for data to come in in one format and be exposed in another. On the other hand, maybe we haven’t really forgotten it, because people have been writing parsers and such things for a long time. But I think creating objects or values by calling new is going away, because the thing is: don’t new them up, they come to you; they are either pushed at you, or you ask for them to come from somewhere else. That is an interesting conclusion: the retirement of the new operator.


6. Now you need more power, not in structuring your classes or your architecture, but rather in structuring your data, and that is a very dynamic, very runtime thing. […]

Sadek's full question: Now you need more power, not in structuring your classes or your architecture, but rather in structuring your data, and that is a very dynamic, very runtime thing. It seems like functional programming hits the right spot there. When we talk about concurrency it’s very interesting, because we have multicore and functional programming is easier to reason about; but now another value of functional programming is that transforming data is trivial using functions, let’s call it that way, and much harder in other paradigms. Do you think this is true?

That is true, that is very true. Look at what functional programming is good at, and I’m talking about functional programming without worrying about whether it’s pure or not: what functional programming really tries to do in this context is to be composable. You want to compose things, and the things you want to compose are transformations. A transformation takes an input to an output; then you have another transformation that takes that output as its input, and you compose them into a new transformation. In some sense, in a pure functional language the composition of transformations is at the core of the paradigm, whereas in object-oriented programming you could say decomposition of objects is at the core: there you talk about data modelling and defining classes in terms of other classes, whereas in functional programming you talk about creating functions out of other functions.

So that is an interesting switch: now, in some sense, the data is given, and you want to create functions that transform it into something else, instead of prescribing the data with a class, where you say: “OK, I have a customer, I say new Customer and poof, I have it.” Now something comes in, and maybe I want to view it as a customer by picking out certain values and doing some computation.
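
[Editor's note: a minimal Python sketch of composing transformations, and of viewing incoming data as a customer rather than constructing one with new; the record shape is hypothetical.]

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

def compose(f: Callable[[A], B], g: Callable[[B], C]) -> Callable[[A], C]:
    """Composing two transformations yields a new transformation."""
    return lambda x: g(f(x))

# The data is given; we don't 'new' a Customer up, we *view* the incoming
# record as one by picking out values and computing on them.
raw = {"first": "ada", "last": "lovelace", "orders": [42, 7]}

pick_name = lambda r: (r["first"], r["last"])
titlecase = lambda parts: " ".join(p.title() for p in parts)
as_customer_name = compose(pick_name, titlecase)

print(as_customer_name(raw))              # -> Ada Lovelace
```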


7. […] Programming is becoming interesting again, because we have to do all these manipulations, which are much more interesting than just reasoning about a very static structure?

Sadek's full question: By picking some values out of JSON, parsing another thing that is maybe CSV or some other format, and combining it with a database. So it seems like it’s getting, as you said, maybe not scary but rather exciting again; programming is becoming interesting again, because we have to do all these manipulations, which are much more interesting than just reasoning about a very static structure?

That is true, and the other thing you are saying, which I think is very true, is that people often have a very static view of types. You think of a customer as an instance of this class Customer with these and these fields, but, and again I like to think in terms of mathematical structures, my favourite example here is the natural numbers. I have never seen a natural number; I know they exist somewhere in some platonic space, but there are many concrete syntactic representations that denote a natural number. I can write the natural number 5 as four strokes and then a stripe through them, the way children do; that is a representation denoting the number 5, which I never see directly. Or I can have a bit pattern that denotes the number 5, or in my Java code I have the character that looks like a 5.

They all represent the same abstract mathematical thing, and I think it’s the same with data: the class is not the truth about the data, it is just another thing that represents the abstract notion you want to manipulate. With functional programming that becomes much clearer, because with objects it is easy to believe that the customer object is the customer. That is not true; I have never seen a customer that looks like that, it is only one representation of the customer. So maybe this whole development also brings more mathematical ideas back into programming, which I think would really help, because mathematics is all about abstractions: you have operations on abstract things, people define theories over abstract ideas, and then you can manipulate them. We should move programming much more in that direction. Natural numbers are not defined by a class; they are an abstract idea, you can have different operations on them and do interesting things with them, and they can surface in different concrete forms.
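
[Editor's note: a tiny Python illustration of the point above; three concrete representations all denoting the same abstract number.]

```python
tally = "1111/"        # four strokes and a stripe, as children write it
bits = 0b101           # a bit pattern in memory
char = "5"             # the character in source code

# Many syntactic forms, one denotation.
assert len(tally) == bits == int(char) == 5
```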


8. In the same way, if I’m getting tweets, a tweet can be represented in the same app in different forms, depending on what I want to do with it and how I want to look at it?

Exactly, and I would go even further: that physical tweet, that byte stream that comes from Twitter, may denote something much more abstract. If I get a tweet from you, maybe it denotes your emotions, or the fact that you are here and hence your location, but it can also denote your mood or your state of mind. The physical tweet is just a representation of this much more abstract thing, and when you are processing the data you have to think in terms of processing those abstract notions; you are not processing tweets, you are processing what the tweet really denotes.


9. This is extremely interesting, and it’s really exciting that we are moving towards these kinds of needs, where we need this kind of power; several languages already have it because they integrate functional programming aspects. Is programming going to become interesting again for developers?

Maybe we are going back to the time of denotational semantics. If you are an old guy like me: when people started to think about programming languages, they used denotational semantics. People read papers by [Dana] Scott or [Christopher] Strachey about denotational semantics, and you would take a program and ask, “What mathematical thing does this program denote?” Then you would talk about the semantic domain of the program. That is beautiful, but there are certain problems with it; for example, when you deal with concurrency the mathematics becomes quite complicated, so people moved to operational semantics, where they model things in terms of abstract machines. But I think what we are discovering here, as we speak, is that with this data processing we are going to work with something like a denotational semantics of the data we are processing: you look at a piece of data as if it were a program, you ask what abstract concept it really denotes, and then you manipulate that.

Sadek: Very interesting. Thank you, Erik, for being here and for accepting the interview!

It was my pleasure!

Jan 04, 2013
