
Rich Hickey and Justin Sheehy on Datastores, NoSQL and CAP

Interview with Rich Hickey and Justin Sheehy by Sadek Drobi on Aug 03, 2012 |
13:03

Bio: Rich Hickey is the author of the Clojure programming language and the designer of Datomic. Justin Sheehy is the CTO of Basho Technologies, the company behind the creation of Webmachine and Riak.


   

1. I’m Sadek Drobi. I’m here at QCon New York with Justin Sheehy and Rich Hickey. Justin, you work on Riak, and Rich, you recently created Datomic. Both of them are data stores or databases, so what are the problems that we are trying to solve at the database level?

Justin: I think we’re solving really different problems, and I’ll let Rich talk about the ones that Datomic solves. With Riak, we’re solving problems that I think have been very much at the bottom, very infrastructural problems for developers and operations around building very highly available, very scalable systems: the sort of things that used to require a huge amount of custom engineering if you needed them.

Riak gives you a data layer below your application that you can use like a database or, in some cases, maybe beneath another database. It gives you easy scalability when it comes to throughput or capacity, gives you a really highly available system, and is all about predictable performance and very repeatable latency characteristics.

Rich: Datomic is trying to solve a bunch of problems. One of them is how you move forward and take advantage of highly distributed systems like Riak while still having transactional semantics when that’s important for your company, and still having a rich data model, including query capabilities, when that’s important. So we build on top of storage services and storage systems so that you can leverage their flexibility and scalability while adding a data model that gives you control over time and real query capability.

   

2. You mentioned transactions and consistency, whereas the model of something like Riak is probably based on eventual consistency, which is not always consistent but eventually consistent. So how could you build something like Datomic on top of a data store like Riak, which is eventually consistent? Do these two things fit together?

Rich: They do. The trick is that Datomic separates the writing, or transactional, part of your system from the reading part. Because it’s a database that uses the storage system in an immutable manner, you can make different decisions about availability and scalability on the write, or processing, side than you do on the read side.

And so on the write side, we have a kind of traditional model of scalability, which is a single server with a high-availability backup kind of model; but on the read side, we’re highly compatible with the data stores in terms of their distribution. In fact, it’s a sweet spot: because we don’t change the data, we don’t get into some of the consistency scenarios that you would have with a distributed data store.

Justin: There are a lot of systems, Datomic being one example, that can easily have transactional and highly consistent semantics on top of Riak. Eventual consistency doesn’t mean, "Oh, it’s sometimes consistent." It means that if you want the system to remain very highly available for writes (we consider read availability sort of the boring, easy part), then you get to choose to allow things to be stale sometimes in order to stay available. It’s not, "Oh, sometimes the database isn’t consistent." It’s that during exactly the times when another distributed database would have to become completely unavailable, Riak gives you the choice of instead having slightly, temporarily divergent consistency in order to retain availability.

Now, the system you’re using on top of Riak may have components, like Datomic’s one-at-a-time transactor for its write path, that can impose other constraints and give you that transactionality.

Riak doesn’t disagree with that at all. It allows you, if you want to, to sometimes not add those constraints, but it certainly doesn’t fight them. It doesn’t say, "You can’t do consistent things here." It’s the other way around. It says, "If you’d rather sometimes not impose that constraint on yourself in the interest of availability, you’re allowed to."
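
To make the trade-off Justin describes concrete, here is a minimal sketch in Python of a toy replicated store. It is not Riak's API or implementation: the names `ToyCluster`, `Replica`, `heal` and the `w` parameter are hypothetical, and the single global version counter is a simplifying assumption. It only shows that a write can stay available during a partition if you accept fewer acknowledgements, at the cost of a temporarily stale read on the cut-off node.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One hypothetical storage node holding (version, value) per key."""
    name: str
    reachable: bool = True
    data: dict = field(default_factory=dict)


class ToyCluster:
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.clock = 0  # simplistic global version counter (an assumption)

    def write(self, key, value, w=1):
        """Accept the write only if at least `w` replicas are reachable."""
        targets = [r for r in self.replicas if r.reachable]
        if len(targets) < w:
            raise RuntimeError("unavailable: too few replicas reachable")
        self.clock += 1
        for r in targets:
            r.data[key] = (self.clock, value)
        return len(targets)

    def read(self, key, node):
        """Read from one named replica; the answer may be stale."""
        replica = next(r for r in self.replicas if r.name == node)
        return replica.data.get(key, (0, None))[1]

    def heal(self):
        """End the partition and copy the newest version to every replica."""
        for r in self.replicas:
            r.reachable = True
        keys = {k for r in self.replicas for k in r.data}
        for k in keys:
            newest = max(
                (r.data.get(k, (0, None)) for r in self.replicas),
                key=lambda entry: entry[0],
            )
            for r in self.replicas:
                r.data[k] = newest
```

With three replicas and one cut off, a write with `w=2` is still accepted while a write with `w=3` is refused; a client that can only reach the cut-off node sees the old value until `heal()` runs. Choosing `w` per write is the "you're allowed to" choice in this sketch.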

   

3. So reads are consistent, but does it avoid the n-1 problem? If I write something and then try to read it, I could read an old copy, right? Could I get into this problem even if each read is consistent by itself?

Rich: It would be more likely if the database semantics on top were update in place, but the semantics of Datomic are not update in place. Since we don’t go back to a segment that we put in storage and update it, there isn’t a third possibility. The two possibilities are: it’s not there, or it’s there and it’s consistent. The "it’s there and it’s not yet updated" possibility goes away because we don’t do that. We don’t use that aspect of the storage that we work on top of.
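
The two-outcome argument can be illustrated with a small Python sketch. It is an assumption-laden toy, not Datomic's storage format: `Replica`, `write_segment` and `replicate` are hypothetical names, and using a fresh UUID per segment is just one way to guarantee that a key is never reused. Because a new fact always lands under a new key, a lagging replica can answer "not there" or return the complete segment, but never an older version of the same key.

```python
import uuid


class Replica:
    """Hypothetical storage node holding segments keyed by a unique id."""

    def __init__(self):
        self.segments = {}

    def get(self, key):
        # None means "not there yet"; anything else is a complete segment.
        return self.segments.get(key)


def write_segment(replica, value):
    """Store a new immutable segment under a fresh key and return the key."""
    key = str(uuid.uuid4())
    replica.segments[key] = tuple(value)
    return key


def replicate(source, target):
    """Copy segments the target is missing; existing keys are never rewritten."""
    for key, value in source.segments.items():
        target.segments.setdefault(key, value)
```

A "change" in this model is a second call to `write_segment`, which produces a second key; the first segment stays exactly as it was on every replica that already has it.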

   

4. Of course, everyone knows SQL, and then we started to have the NoSQL movement and, in a way, schema-less data storage. It seems like we lost some of the good properties of structured databases in exchange for something more relaxed. So the first question: why did we, in some cases, abandon the rich schema, or for instance the tuple algebra or the relational algebra, for more document-based or blob-based kinds of databases? What kind of things can we put in the database? How much structure should we put in the database, and how can it help in constructing business logic? The question has several parts, but it’s the same problem: what should I put in the database, how much structure should I put in the database, and how much should I put in my own application?

Justin: The first thing I’d like to address is NoSQL versus SQL and the relational model. I think there’s a big red herring in there, which is that NoSQL is not about not wanting a relational model, and it’s not about not wanting any particular query capability. It’s a poor name, but it’s the one we’ve got. If anything, what I think it really reflects is that we’ve got this new movement against the architectural monoculture that we’ve had in databases for these past couple of decades, where, until the past few years, we saw almost every project out there using databases that were all basically the same. The fact that they mostly used SQL isn’t the interesting part; it’s that they made all the other choices the same too, about data storage and about the semantics of reads and writes.

And so NoSQL isn’t a very useful category; some things are more structured than others; some are more highly available than others; so it’s not describing a technology choice; it’s describing that people are actually building and using different architectures of databases which used to be true and then sort of stopped being true for a couple of decades; and it’s just the resurgence of that.

And that said, I think that there’s a lot of room for that kind of expressiveness; for instance, Datomic uses a Datalog-like approach for querying.

I actually find that to be at least as approachable and interesting and rich as a SQL-driven model; but again, NoSQL doesn’t mean that relational is a bad way to think about data, it’s an incredibly powerful way to think about data; it’s that both in our query languages and the rest of our database architecture, there’s not a single choice that really is right for all cases.

Rich: I think Datomic is just a realization of what Justin just said: you have these different architectural choices with distribution, and that doesn’t necessarily imply that you’re going to have to give up transactionality or rich query or rich representational capabilities. So Datomic is specifically oriented around providing exactly that, retaining as much of that model as you can while mapping to a system of systems, which I think is more important than the fact that there are now alternatives. It was a monoculture, but it was also a monolithic solution. Now you can say, "I’d like to combine this model for data on top of this storage engine because each one has characteristics that are important to me." You should be able to do that; that’s the whole point of having decoupled systems of services, and I think that’s really the most exciting part of the future.

So we’re going to see a wide variety of storage models, information models and query models. They will evolve independently and be able to be combined to solve application problems.

Justin: I think we didn’t give something up in terms of expressivity or anything; we just acknowledged that sometimes the things you want to give up and the things you want to get will vary. But that doesn’t even mean you have to lose that expressivity; as Rich and company are doing with Datomic, there are other ways to solve those problems.

   

5. Can you each give a very simple technical outline? For Riak, how does it scale and why? And for Datomic, how does it achieve transactional semantics while still being scalable? I’d start with Justin.

Justin: Riak scales by using a bunch of mechanisms that are now pretty well known, given the past couple of years. It’s a Dynamo-like system, much like Amazon’s Dynamo or LinkedIn’s Voldemort: a shared-nothing system that uses a consistent-hashing-like technique for object location. The nice thing about that is there’s no central point that you have to ask to locate data.

Any given part of the system can locally calculate where all the replicas of a given piece of data are; and so it gets its scalability through that kind of automatic distribution of data and automatic routing to the right data.

A lot of those same techniques are used for high availability; it turns out you want to use the same techniques for both of those problems.
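
The idea that any node can locally calculate where the replicas of a key live can be sketched with a generic consistent-hash ring in Python. This is an illustration of the general technique, not Riak's actual ring or partition-claim scheme: the class name `HashRing`, the `vnodes` parameter and the choice of SHA-1 are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(text: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node owns several points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]
        self._node_count = len(set(nodes))

    def replicas(self, key: str, n: int = 3):
        """Walk clockwise from the key's position, collecting n distinct nodes."""
        wanted = min(n, self._node_count)
        start = bisect.bisect_right(self._hashes, _hash(key))
        found = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in found:
                found.append(node)
                if len(found) == wanted:
                    break
        return found
```

Two processes that build a `HashRing` from the same node list get the same answer from `replicas("some-key")` without talking to each other or to any coordinator, which is the "no central point" property. The same list of replicas also gives a natural place to look when the first node is down, which hints at why the technique serves availability as well as scalability.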

Rich: On the Datomic side, we take advantage of the fact that there are trade-offs and there are choices, and we say it is a real trade-off: if you want arbitrary write scalability, you’re going to compromise transactionality. It’s a perfectly happy place to be to say, "I value transactionality more than arbitrary write scalability, but I still want arbitrary read scalability." That’s another point in the matrix, and that’s the point that Datomic pursues. So the scalability of transactions is a single-server scalability model, and the scalability of query and read is the new distributed model, both of which can be made arbitrarily scalable and, in the case of query, also elastic.

Every new process that comes up with the embedded query engine is new query capability, and you can add and remove those as well; it’s very elastic.
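
The split Rich describes, one serialized write path and any number of readers, can be sketched in Python as follows. This is a toy under stated assumptions, not Datomic's API: `Transactor`, `QueryPeer`, `snapshot` and `facts_as_of` are hypothetical names, and the in-memory tuple stands in for immutable data held in a storage service.

```python
import threading


class Transactor:
    """Hypothetical single writer: applies transactions one at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_tx = 1
        self._log = ()  # immutable tuple of (tx_id, fact) entries

    def transact(self, facts):
        with self._lock:  # all writes are serialized through this one point
            tx_id = self._next_tx
            self._next_tx += 1
            self._log = self._log + tuple((tx_id, f) for f in facts)
            return tx_id

    def snapshot(self):
        # Tuples are immutable, so any number of readers can share this value.
        return self._log


class QueryPeer:
    """Hypothetical read-side process; peers can be added or removed freely."""

    def __init__(self, transactor):
        self._db = transactor.snapshot()

    def facts_as_of(self, tx_id):
        """Return the facts recorded up to and including transaction tx_id."""
        return [fact for t, fact in self._db if t <= tx_id]
```

Each `QueryPeer` works from a value that never changes underneath it, so adding readers adds query capacity without adding coordination; only `transact` is a bottleneck, which is the single-server side of the trade-off.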

Justin: I can even see that for some people who need that power and also need write scalability for some uses in their application, it’s quite foreseeable that they might use, say, Datomic on top of Riak and accept the single-writer limitation for the large portion of their data that they want those semantics for. For some other data that needs either extremely high write scalability or extremely high write availability, they might write directly to the underlying Riak cluster and use that interface for the data that doesn’t need Datomic’s properties and doesn’t want to make that compromise.

It’s not like you have to make all the same choices for everything your application does.

Rich: I completely agree with that. That’s what I expect most customers to do: use the underlying storage, like Riak, for its own properties when that makes sense. The beautiful thing is that having every separate system have its own storage, its own backup storage and its own distribution storage is really a tremendous waste of resources. So you can say, "Look, I can make one decision about storage, the storage I trust and whose properties I like, and then I can put different kinds of transactional semantics over it and use a hybrid model," because there are certainly cases where going directly to something like Riak would be the correct thing to do.
