Dave McCrory on Data Services


1. [...] Dave was name-checked the other day on one of the keynotes about coining the term data gravity. What is that?

Chris's full question: Hi, this is Chris Swan, one of the cloud editors at InfoQ. I'm here with Dave McCrory at QCon New York 2014. So Dave gave a presentation yesterday on his work at Warner Music and the services that he built there, and that's going to be available separately, so I'm not going to ask Dave about that stuff today. I'm going to start by asking him about data gravity. So Dave was name-checked the other day on one of the keynotes about coining the term data gravity. What is that?

Data gravity is an effect that I identified originally on Amazon Web Services, but then realized that it actually affects everything: data centres, clouds, wherever you're actually using computer resources and storing data. Effectively you have data gravity come into play as data accumulates. You have applications that interact with that data. Every interaction generally creates more data, and by having more data you end up with the application being advantaged by being closer to that data.

If you think about having to go across a network, you are limited by latency, or how quickly you can go across the network, and bandwidth, how much data you can transfer at one single time across that network. You are more advantaged the closer you are to the data, because you have lower latency and higher bandwidth. So you can move more, faster. That advantage means that applications benefit from being closer to the data, so you get a gravitational effect. Because of that gravitational effect, and because data then tends to grow, the likelihood of more applications wanting to consume that same data increases.

So you have this mass of data that's growing and the attraction increasing for more applications to be closer to it. That was why, when EC2 East used to go down, we would see so many things go down entirely: so many applications and services lived there, because that's where the data was, along with the applications that depended on that data. The same effect occurs even inside a data centre. Things that are more advantaged are put on systems that are directly tied to storage, say, even in a storage area network. The things that have higher data demands, say traditionally a relational database, would be attached to the SAN so that they can get to the data faster; it's simply across a storage area network instead of across, say, the Internet.
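
As a rough illustration of the latency and bandwidth point above, the sketch below estimates how long it takes to move the same data over a nearby storage network and over a wide-area link. The model (latency per round trip plus size divided by bandwidth) and all of the link figures are assumptions chosen for illustration, not measurements from the interview.

```python
def transfer_seconds(size_bytes, latency_s, bandwidth_bytes_per_s, round_trips=1):
    """Rough transfer time: latency per round trip plus time on the wire."""
    return round_trips * latency_s + size_bytes / bandwidth_bytes_per_s


GB = 10**9

# Hypothetical links; every number here is an assumption for illustration.
san = {"latency_s": 0.0005, "bandwidth_bytes_per_s": 1 * GB}     # ~0.5 ms, ~1 GB/s
wan = {"latency_s": 0.08, "bandwidth_bytes_per_s": 0.0125 * GB}  # ~80 ms, ~100 Mbit/s

data = 100 * GB
san_time = transfer_seconds(data, **san)
wan_time = transfer_seconds(data, **wan)
print(f"SAN: {san_time:,.1f} s, WAN: {wan_time:,.1f} s")
```

Under these assumed figures the application sitting next to the data moves the same 100 GB far sooner, which is the pull that the gravity analogy describes.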


2. Yesterday you were talking about data services and some of the yak shaving you needed to do, in order to provide data services onto the side of Cloud Foundry. Do you think there is

Absolutely. I think we are just starting to see the rise of microservices, these small services that give you specific capabilities. In my case, we had to deal with trying to abstract a lot of what we call the data primitives, so the different storage solutions that we chose, whether it be a graph database, a relational database, a queue, any type of thing that could persist data, we had to wrap in services, and as we did, there was an increase in complexity. That complexity ended up causing quite a few headaches early on until we figured out how we could deal with it. That included quite a bit of trying to deal with common shared services. When you have common shared services you have to be very careful, because you can end up with many permutations.

Those permutations cause quite a bit of complexity. We also had to deal with services dedicated to specific data tasks. We had lots and lots of problems with legacy data, data from systems that had been around for even 30 years and you would have all sorts of odd notations and really bits and pieces of data that, unless you are an expert and you have been using that system for the last 30 years, you couldn't just simply guess, based on context. So we would have to use services to help us clean up all of that data.

So those are specialized services for that, and this meant that we had a lot of services to keep track of. I think we are at the very beginning of seeing what services are really going to do in the enterprise, but I don't think there are a lot of good solutions for dealing with services. There are things like service catalogs, and there are ways of trying to manage services by tracking them as applications and trying to version them in source code control, like Git, but there is still no real sane way to deal with a large number of services. We managed through process more than anything else.


3. We've seen companies, like Joyent, perhaps respecting data gravity and offering services that are kind of taking the application to the data rather than dragging the data to the application. Do you think that's a trend, that is going to repeat?

I do. I think that is going to happen quite a bit. We have seen it with Joyent, and we see a little bit of it with Hadoop and YARN. I think we will see a lot more of it, because there are so many trends that are going to force it. We will see many more storage solutions making intelligent decisions about whether it makes more sense to move the data to the application or the application to the data, and it all depends. There's no magic bullet that will work in every scenario. Sometimes it makes sense to move the application to the data, especially if it's an incredibly voluminous amount of data and you have processing power close to it. At the same time, if you have a very small amount of data and a complex application that has high processing demands, it makes more sense to ship the data to the application.
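
The trade-off described in this answer can be sketched as a small placement rule. The function name, its parameters and the rule itself (ship whichever payload is smaller, and only ship the application when there is processing power next to the data) are assumptions for illustration; a real scheduler would weigh many more factors.

```python
def what_to_move(data_bytes, app_bytes, compute_near_data):
    """Return 'application' or 'data': the side that is cheaper to ship.

    Assumed heuristic: ship the smaller payload, but only ship the
    application if there is processing power close to the data.
    """
    if compute_near_data and app_bytes < data_bytes:
        return "application"
    return "data"


MB, GB, TB = 10**6, 10**9, 10**12

# A very large data set with compute next to it: send the code to the data.
big_data = what_to_move(data_bytes=50 * TB, app_bytes=200 * MB, compute_near_data=True)

# A tiny data set and a large, compute-hungry application: ship the data.
small_data = what_to_move(data_bytes=10_000, app_bytes=2 * GB, compute_near_data=True)

# No processing power near the data: the data has to travel regardless of size.
no_compute = what_to_move(data_bytes=50 * TB, app_bytes=200 * MB, compute_near_data=False)

print(big_data, small_data, no_compute)
```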

Chris: Talking of moving things, you recently moved to Basho as CTO. Tell us about what's going on there.

Sure. Basho has been around for a while in the NoSQL space. Several of the previous members of the management team decided to move on, primarily because they had been at this for a long time and wanted to do something else. The investors decided that they wanted some new blood in the company and they hired Adam Wray and myself to come in and try and change things a bit. Basho itself has many notable customers right now and is getting close to its 2.0 release, which has been in the works for about a year now. We should be in release candidate within the next week. So we are very close to our 2.0 release.

Chris: So tell us about what's going to be in 2.0. The release candidate's imminent, so give us a rundown of what we can expect to see.

We have some advances in a technology we call AAE, Active Anti-Entropy. That's a capability that ensures that you have data integrity when you're trying to store multiple copies of your data, which is something that Riak does for performance and availability purposes. We have something that was originally code-named Yokozuna. It's actually Solr, which includes Lucene, running within Riak, allowing us to have better search capabilities. We had originally implemented our own search, and it didn't offer the level of robustness that our customers wanted. So instead of trying to duplicate Solr or Elasticsearch on our own, we chose to use Solr and implement it in Riak. So we now run a JVM with Solr inside it, in conjunction with Riak itself. We also have some performance and scalability improvements that we've made. There are quite a few highly requested customer enhancements to Riak itself, including the way we handle something else, called CRDTs. CRDTs are ways of dealing with data without having to make every single change across an entire cluster. Instead of making actual data changes, you do set operational changes. So if you do an add, remove, add, instead of doing that by actually transferring the data, deleting the data and transferring the data again, you are simply sending the equivalent of the operations, not the data.
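
The add, remove, add example can be sketched with a small set CRDT. What follows is a minimal observed-remove set that keeps tombstones so operations can arrive in any order; it is an illustrative assumption about how such a type can work, not Riak's implementation or its client API, and the class and method names are made up for this example.

```python
import uuid


class ORSet:
    """Observed-remove set: replicas exchange small operations, not the data set."""

    def __init__(self):
        self.added = {}       # element -> set of unique tags from add operations
        self.removed = set()  # tags cancelled by remove operations (tombstones)

    def add(self, element):
        """Apply an add locally and return the operation to send to other replicas."""
        op = ("add", element, frozenset({uuid.uuid4().hex}))
        self.apply(op)
        return op

    def remove(self, element):
        """Apply a remove that cancels only the adds this replica has seen so far."""
        live = self.added.get(element, set()) - self.removed
        op = ("remove", element, frozenset(live))
        self.apply(op)
        return op

    def apply(self, op):
        """Apply a local or remote operation; repeating or reordering is harmless."""
        kind, element, tags = op
        if kind == "add":
            self.added.setdefault(element, set()).update(tags)
        else:
            self.removed.update(tags)

    def value(self):
        """Elements that still have at least one add not cancelled by a remove."""
        return {e for e, tags in self.added.items() if tags - self.removed}


a, b = ORSet(), ORSet()
ops = [a.add("track-42"), a.remove("track-42"), a.add("track-42")]
for op in reversed(ops):  # deliver to the second replica out of order on purpose
    b.apply(op)
print(a.value() == b.value())
```

Each operation carries only an element and a few tags, so the replicas converge on the same set without shipping the stored data back and forth.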


4. Sounds cool. So what do you see coming next with Riak, once you're done with this 2.0 release? I don't know whether you used Riak in some of your previous roles, but what would you have wanted from Riak that you now kind of want to do with it?

Sure. We had explored Riak at several of my previous roles. At the time when we were looking deeply, we needed the capabilities that were going to be in 2.0, including the search capability. The capabilities that are there now in 2.0 are pretty attractive. We are looking at doing several things. There will be a 2.0.1 and a 2.1; 2.0.1 because no one ships perfect software out of the gate. So that will include fixes for whatever problems we come across as we're out in the wild, and in 2.1 there will be lots of performance enhancements and probably one or two new features, but not many. We are really focusing on upping the performance and scalability of Riak more than anything else. And then following the 2.1 release we will be looking at new storage types that we can involve.

So that might be a graph capability, or it might be a queuing capability, or an in-memory capability. We're not exactly sure. We have been anxious to get feedback from our customers, but our customers have wanted to see 2.0 get out of the door first, so we have been really focused on that. But I would say you'll see quite a few significant features like that. We have a push to add many of these different constructs for data so that customers can get more out of Riak than simply key-value, which is Riak, and object storage, which is Riak CS. Riak CS is effectively object storage that's built on top of our key-value storage.


5. So switching topics a little, we had an open session yesterday in the cloud track at QCon. One of the topics was trust. How do you think we are going to move forward into building more trustable services and getting better trust around the data that we retrieve from those?

I think it's going to be difficult. There were a couple of other people that had opinions on how we can handle trust. The one that I liked the most was probably the idea of short-lived trust, or trust that might be established but that you're only using for a specified amount of time. The idea of infinite trust on many things, including certificates and all sorts of things around what we do with encryption and such, I think leaves us vulnerable to a lot of problems. When you look at the Internet of Things overall, short-lived trust makes sense. Long-running trust just leaves you open to more and more vulnerabilities, which I think are going to become a bigger and bigger threat. You know, I mentioned yesterday that I wouldn't be so fearful if someone was able to take over my toaster, but if they're able to take over medical equipment or something else, that's an entirely different problem.

And the levels of trust and security that I would want to see would be very different between the two. I don't know exactly how we will handle trust. The idea of using your mobile device is an interesting one; that is one way. I think the ideal way is some type of combination of factors to create trust and some type of time to live for that trust before it has to be renewed. I don't think there is any magic bullet just yet on trust. It's something that really ends up being a much more difficult problem than it appears.
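
A combination of factors plus a time to live can be sketched as a signed token that simply stops verifying once it expires. Everything named here (the `issue` and `is_trusted` functions, the token layout, the demo key and the two-factor threshold) is a hypothetical illustration, not a scheme proposed in the interview and not a production design.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical signing key; a real system would manage keys securely


def _sign(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue(subject, factors_passed, ttl_seconds, now=None, required_factors=2):
    """Grant trust only when enough factors were verified, and only for ttl_seconds."""
    if factors_passed < required_factors:
        raise PermissionError("not enough factors to establish trust")
    now = time.time() if now is None else now
    payload = f"{subject}|{int(now + ttl_seconds)}"
    return f"{payload}|{_sign(payload)}"


def is_trusted(token, now=None):
    """Valid only if the signature matches and the time to live has not run out."""
    now = time.time() if now is None else now
    try:
        subject, expires, signature = token.rsplit("|", 2)
        expiry = int(expires)
    except ValueError:
        return False  # malformed token
    expected = _sign(f"{subject}|{expires}")
    return hmac.compare_digest(signature.encode(), expected.encode()) and now < expiry


token = issue("meter-17", factors_passed=2, ttl_seconds=300, now=1000)
print(is_trusted(token, now=1100), is_trusted(token, now=1300))
```

Once the time to live passes, the holder has to go through the factors again to get a fresh token, which is the renewal step the answer describes.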


6. You touched upon the Internet of Things there. We talk a lot about cloud scale, but are we facing another scaling challenge with the Internet of Things and how are we going to deal with that?

I think we are facing a challenge with the scale of the Internet of Things. Look at the number of things that even 10 years ago we wouldn't have guessed would have a network address and would be feeding us data. The Internet of Things is a huge explosion in the amount of data and in what you want to do with that data. From some of the things that I've seen, there's a renewed look at IPv6. We've been able to put that off with all sorts of network address translation and other magic. But at some point we will run out of those options, and especially with the Internet of Things that will accelerate dramatically, and we'll end up having to move to IPv6. That fixes the problem, at least for quite a while.

What ends up happening, though, is you are going to have a lot of data flowing over many, many networks. If I think about some of our customers that are looking at Internet of Things solutions, there are things like meters, smart meters and such, running from homes. If you imagine 10 million homes all having a meter, and that meter is reporting data every 15 seconds up to a system, how do you deal with the amount of data? How do you act on that data? Do you do processing at the meter? Do you do processing back in one central location? By the way, I think the answer is both. I think you do processing out at the edge for some problems and I think you do an entirely different set of processing at a central hub for a different set of problems. But again, that makes things more complex. Even our cars are arguably going to be getting more and more in on the act.

I have a car that has OnStar today that sends a monthly report of what its health is, and I get an email saying your car has these things: green on oil, green on this. I think that's just the beginning. I think we'll see all sorts of things that tie the Internet of Things to our mobile devices. We are starting to see a little bit of that, but not at the scale I think we will. It's going to create all sorts of challenges, even in housing the data. How long can you store data if you have millions and millions of devices generating data every 15 seconds, or every 5 seconds or something like that? How long is long enough to keep that data? Do you normalize the data? Do you just store it in perpetuity? I think there are going to be some challenges, especially as we try and figure out what is valuable data versus just keeping things to keep them, which seems to be the current trend in the enterprise right now: we'll just store everything forever. That usually doesn't work out too well.
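
To make the smart-meter example concrete, the arithmetic can be worked through in a few lines. The home count and the 15-second interval come from the answer above; the size of a reading and the one-year retention period are assumptions for illustration only.

```python
HOMES = 10_000_000          # from the interview
INTERVAL_S = 15             # from the interview
BYTES_PER_READING = 100     # assumption: size of one stored reading
RETENTION_DAYS = 365        # assumption: keep one year of raw readings

SECONDS_PER_DAY = 86_400

readings_per_second = HOMES / INTERVAL_S
readings_per_day = HOMES * (SECONDS_PER_DAY // INTERVAL_S)
bytes_per_day = readings_per_day * BYTES_PER_READING
bytes_retained = bytes_per_day * RETENTION_DAYS

print(f"{readings_per_second:,.0f} readings per second")
print(f"{readings_per_day:,} readings per day")
print(f"{bytes_per_day / 10**12:.2f} TB per day")
print(f"{bytes_retained / 10**15:.2f} PB retained")
```

That works out to roughly 667 thousand readings per second and 57.6 billion readings per day. At the assumed 100 bytes each, that is about 5.76 TB of raw data per day and a little over 2 PB for a year, before replication or indexes, which is why the questions about edge processing and retention matter.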


7. Another common theme of discussions on the cloud track yesterday was containers. We talked a lot about containers in terms of, you know, disrupting the virtual machine model and the runtime substrate aspects of it. What do you think of the data aspects of the shift to containers?

I think the data aspect with regards to containers is an interesting one, if we talk about arguably the hottest container technology right now, which is Docker. From what I understand, and I haven't played with Docker enough to know all of the details of its strengths and weaknesses, it's difficult to create persistent data with something like Docker unless you're storing it over the network. When you have these splits in containers and you're trying to have a persistent disk mounted into a container, that's something that has been solved, but we're not at the point where it's mature. If you're trying to move data around, we're just starting to see some good solutions on the networking side, so that it's not incredibly complex to, say, network containers on different physical machines together.

All of these things have to be solved and mature before data is really going to explode in containers. But I think containers add a lot of flexibility to the mix. They add flexibility and they add performance; previously you got the flexibility with virtual machines, but they were much heavier weight. They were more complex to deal with than being able to spin up a container. Provisioning time for a container, you know, can be a second. You can't do that with a virtual machine. At least not with any virtual machines that I've seen.

Those are the types of things that I think are going to cause people to be interested in getting data into containers, as long as you can persist the data. I think that's the fly in the ointment. If all of the containers act idempotently then you put yourself at risk, even if you keep the data in multiple containers: if they disappear you've lost all your data, versus persisting it to disk. So I think that's an interesting aspect with containers. I do think the future is containers. I've looked at several of the different cloud platforms, and most of them are embracing containers, and when I speak to enterprise customers, they are already experimenting with containers, either for continuous integration or for simple testing. I don't see that slowing down, I see it speeding up.

Chris: Well, I think we're both going to be looking forward to that. So it's been great having you at the show, Dave, and thanks very much for stopping by this morning and answering a few questions.

Thank you.

Jul 28, 2014