Bio: Rich is the author of Clojure and designer of Datomic (http://datomic.com/) and has over 20 years of experience in various domains. Rich has worked on scheduling systems, broadcast automation, audio analysis and fingerprinting, database design, yield management, exit poll systems, and machine listening, in a variety of languages.
QCon is a conference that is organized by the community, for the community. The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.
Datomic is a new database. It's designed to give you the scalability you get with some of the new approaches to distributed storage and combine it with the power you used to get from traditional database servers, so you maintain query capability, joins and transactions, but get scalable reads through distributed storage.
One of the objectives of a lot of new systems is to decentralize things in order to get greater scaling and have more flexibility. But there is actually a lot to like about traditional databases. I think relational algebra is a powerful tool, and I think transactions are powerful, so we retain transactions and we retain a relational-like model, which we expose through a query language called Datalog that has the same power as relational algebra. So we value those things quite a bit, but you want to move away from having one server or a cluster of servers do all the work, because it's very difficult to scale that, and in an attempt to scale that, people are giving up some capabilities.
So what we do is we move storage out of the database server and put it into a service, and we move query out of the server and put that into the applications, so applications can service queries directly by reading from the storage service. That leaves the server component to do only transaction handling. We actually call it a Transactor, because it does not service reads and it does not service queries, and therefore it can be a lot faster at doing what it does, which is transactions. And then you get elastic compute, because you can bring peers up and down as application demand increases or decreases. That is powerful because it's elastic; it's not like pre-configuring replicas or shards. We think that storage as a service is the future. This first iteration we've built on top of DynamoDB, but we expect to support other similar kinds of services.
And that's again a lot easier than configuring your own sharding or replicas or managing your own storage, and Dynamo is an example of something that is already widely distributed and highly reliable. So we're trying to get the best of both worlds; it's not a rejection of the past, it's actually quite an embracing of that.
So it's not actually a bottleneck, because once you start putting databases and database servers up on the cloud you have an issue with storage that survives the instance, so you tend not to use the storage on the device; you use network-accessible storage. EBS would be an example of that when trying to run a traditional database on Amazon. Something like Dynamo is extremely fast: those are SSDs with very low latency, and they are actually competitive with a local disk seek. And you get control over how much bandwidth you want to get, so it's a knob, and you decide how much you want to spend and how much speed you want to get. There are a couple of tricks to making query efficient in the applications.
The first is you have to treat your data immutably so that is what we do and that means that, once an application has seen a segment of data it can retain it, because it doesn’t need to try to synchronize with any changes made to it, because changes are never made to it. The other thing is what you choose to put in storage, so if you look at a traditional database, what gets put on the disk are blocks of a B-tree, and that makes it efficient to search because the B-tree is a tree and has wide branching factors. We did the same thing: we basically take a tree representation of the indexes and put that into storage.
So what's being pulled from storage are blocks of indexes with the same kinds of efficiencies as B-trees, so those are exactly the same services that any query engine would want to have from storage. In fact, coming from SSD arrays on multiple machines, it's quite competitive with, and in many ways actually better than, physical storage.
You just pull your working set as you require it. It is a tree, so you will probably have cached the top of it relatively quickly, and once you have done that, any piece of data that you have never seen before is one read away. That is how it's architected, and those reads, again if you are all up in EC2, are very fast.
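The caching argument above can be made concrete with a small sketch. This is not Datomic code: `SegmentStore`, `PeerCache` and `lookup` are hypothetical names, and the two-level tree is a toy stand-in for the real index structure. It only illustrates why immutable segments can be cached without invalidation, and why a lookup costs one storage read once the top of the tree is cached.

```python
class SegmentStore:
    """Hypothetical stand-in for a storage service holding immutable index segments."""

    def __init__(self, segments):
        self._segments = dict(segments)
        self.reads = 0  # counts round trips to storage

    def get(self, key):
        self.reads += 1
        return self._segments[key]


class PeerCache:
    """Keeps every segment it has seen; segments never change, so nothing is invalidated."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def segment(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]


def lookup(cache, root_key, target):
    """Walk from the root to a leaf, fetching each segment through the cache."""
    node = cache.segment(root_key)
    while node["kind"] == "branch":
        # Pick the first child whose upper bound covers the target,
        # falling back to the last child.
        child = node["children"][-1][1]
        for upper, key in node["children"]:
            if target <= upper:
                child = key
                break
        node = cache.segment(child)
    return target in node["values"]


store = SegmentStore({
    "root": {"kind": "branch", "children": [(10, "leaf-a"), (20, "leaf-b")]},
    "leaf-a": {"kind": "leaf", "values": [1, 5, 10]},
    "leaf-b": {"kind": "leaf", "values": [12, 20]},
})
cache = PeerCache(store)

found_first = lookup(cache, "root", 5)    # reads the root and leaf-a
reads_after_first = store.reads
found_second = lookup(cache, "root", 12)  # root is cached; only leaf-b is read
reads_after_second = store.reads
```

In this toy run the second lookup costs a single storage read, because the root is already in the peer's cache.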
Yes, they are both based around logic, but Datalog is actually a subset of Prolog oriented around query, and it has a couple of good characteristics when compared with Prolog, because it's not trying to be a general-purpose programming language; it is only trying to satisfy queries. The order of clauses doesn't matter, which it does in Prolog, and it's amenable to set-at-a-time implementation, which is what we want: a relational engine would similarly work with whole sets of data, as opposed to Prolog, which is oriented towards producing one result at a time. But it is similar; if you look at it, it would seem superficially similar.
The other thing that it shares with Prolog is rules. If you look at Datalog initially it looks like patterns, data patterns with variables, and it makes a lot of sense, and the joins are all implicit. But if you find yourself doing a set of queries with similar subcomponents all the time, you can turn that into something called a rule, and a rule would be analogous to a SQL view. The beautiful thing about rules and running on the client is that, unlike a view, a rule is not something you have to install on a server; you can make one up ad hoc, on the fly, for one particular part of the application. The other nice thing about rules is that they can be recursive, so unlike trying to do recursive SQL, it's actually quite straightforward to do recursive joins and queries in Datalog. I think it's a good language for application programmers, now that you are bringing it into the application.
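To illustrate what a recursive rule buys you, here is a minimal sketch of naive bottom-up, set-at-a-time evaluation of a classic "ancestor" rule. It is a toy under stated assumptions, not Datomic's engine or syntax; the `ancestors` function and the `parent` facts are made up for the example.

```python
def ancestors(parent_facts):
    """Naive bottom-up evaluation of the recursive rule:

        ancestor(x, y) :- parent(x, y).
        ancestor(x, z) :- parent(x, y), ancestor(y, z).

    Works on whole sets at a time and stops when nothing new is derived.
    """
    ancestor = set(parent_facts)
    while True:
        derived = {
            (x, z)
            for (x, y) in parent_facts
            for (y2, z) in ancestor
            if y == y2
        }
        if derived <= ancestor:  # fixpoint reached
            return ancestor
        ancestor |= derived


parents = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
result = ancestors(parents)
ann_descendants = sorted(z for (x, z) in result if x == "ann")
```

Because the rule is evaluated as a fixpoint over sets, the order in which the clauses or facts are listed does not affect the result.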
It is very, very simple and easy to learn. It is free of most syntax and does not have a lot of complex rules. Say "I want to find someone who likes pizza and who likes something else". You just say "X likes pizza and X likes cars" and that is a join, a self-join in that case. This is what you do: you write little patterns and it makes the variables match. And that unification of logic variables is where all the joins happen. So it's very straightforward, it's very declarative, and our experience is that it's a breeze for anyone to learn, whether they are savvy about SQL or not.
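The "X likes pizza and X likes cars" example can be sketched as a tiny pattern matcher. This is a toy written for illustration, not Datomic's query engine; the `?x` variable convention, the function names and the facts are assumptions. It shows how reusing one variable across two patterns produces the join.

```python
def is_var(term):
    """Variables are strings starting with '?'; everything else is a constant."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, bindings):
    """Extend bindings so that pattern matches fact, or return None if it cannot."""
    result = dict(bindings)
    for term, value in zip(pattern, fact):
        if is_var(term):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, facts):
    """Return every set of variable bindings that satisfies all patterns."""
    results = [{}]
    for pattern in patterns:
        extended = []
        for bindings in results:
            for fact in facts:
                new_bindings = match(pattern, fact, bindings)
                if new_bindings is not None:
                    extended.append(new_bindings)
        results = extended
    return results


facts = [
    ("fred", "likes", "pizza"),
    ("fred", "likes", "cars"),
    ("ethel", "likes", "pizza"),
    ("ethel", "likes", "sushi"),
]

# "X likes pizza and X likes cars": the shared ?x is the self-join.
answers = query([("?x", "likes", "pizza"), ("?x", "likes", "cars")], facts)
names = sorted(b["?x"] for b in answers)
```

Swapping the two patterns yields the same answers, which matches the point that clause order does not change the meaning of a query.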
You can consider it that way. Logic and unification were sort of the birthplace of pattern matching, so yes, you can make analogies to that.
So the data model in Datomic is a very atomic model: we store entities, attributes and values. If you're trying to make things immutable and you want to keep track of time, then you move away from what I would call place-oriented programming, which is what you do when you have tables or documents: there is a place for this fact to go. Instead you say "I want to keep track of all the facts ever", and when you do that, you need to be able to say "A lot of things aren't facts"; that is why we want to change our databases, because somebody moves and has a new address. So the way to do that without using places and going and replacing the address is to say "This was Fred's address at this point in time" and later you say "This is Fred's address at that point in time".
Both are facts: one was true for the interval between the two, and one has been true since then. So each fact is atomic in being an entity, attribute, value and time, and we call that a "datom". The time we actually encode by remembering the transaction you were part of; because transactions are serialized, your transaction designates when you happened relative to everything else. That way we can encode the time of day on the transaction. Transactions are entities like any others, so you can encode other facts about the transaction, for instance who enacted it, or what the provenance of the incoming data was. In this way you can, for instance, very easily find out what else was said when this was said, that is, find the other things said in the same transaction. So that adds the time element to the data, and it allows you to keep all the data.
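As a rough illustration of facts that carry their transaction, the sketch below models each fact as a tuple of entity, attribute, value, transaction and an added/retracted flag. The field names, the integer transaction ids and the `tx/author` attribute are assumptions made for the example, not Datomic's schema.

```python
from collections import namedtuple

# Hypothetical shape of a fact: entity, attribute, value, transaction, added-or-retracted.
Fact = namedtuple("Fact", ["e", "a", "v", "tx", "added"])

facts = [
    # Transaction 1001: who enacted it, plus the facts it asserted.
    Fact(1001, "tx/author", "alice", 1001, True),
    Fact("fred", "address", "12 Elm St", 1001, True),
    Fact("fred", "email", "fred@example.com", 1001, True),
    # Transaction 1002: Fred moves. The old address is retracted, not overwritten.
    Fact(1002, "tx/author", "bob", 1002, True),
    Fact("fred", "address", "12 Elm St", 1002, False),
    Fact("fred", "address", "9 Oak Ave", 1002, True),
]


def said_together(facts, e, a, v):
    """Return the other facts recorded in the transaction that asserted (e, a, v)."""
    txs = {f.tx for f in facts if (f.e, f.a, f.v) == (e, a, v) and f.added}
    return [f for f in facts if f.tx in txs and (f.e, f.a, f.v) != (e, a, v)]


others = said_together(facts, "fred", "address", "9 Oak Ave")
authors = [f.v for f in others if f.a == "tx/author"]
```

Because the transaction is itself an entity with attributes, "who said this" falls out of the same lookup as "what else was said at the same time".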
You don't actually tend to do that. What you can do is much higher-level things. What we present to your application when you want to issue a query is a connection, and that seems familiar, but instead of sending a query over to a remote machine, you are going to ask the connection for the value of the database - "When? - Now". With that value in hand you can ask as many queries as you want and you are not going to see any other changes; you will have a stable view. We can take that value of a database and say "I have this value of the database, I would like to see it as of some point in time in the past".
When you do that you get another value of the database which only shows you things from before that point, so you just ask the database value for what it looked like last month and then you issue the same queries. That keeps you from actually doing what you said, which is putting the transaction data into your queries and putting the T everywhere. Basically, instead you ask the database for a view either at a prior point in time or over a window of time and issue the same queries against it, which is very powerful.
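The idea of asking a database value for an earlier view, and then running the same query unchanged, can be sketched as follows. The `Db` class, its `as_of` and `value` methods and the integer transaction numbers are hypothetical and deliberately simplified (single-valued attributes only); they are not the Datomic API.

```python
class Db:
    """Toy immutable database value: a tuple of facts plus an optional time bound."""

    def __init__(self, facts, as_of_t=None):
        self._facts = tuple(facts)  # (entity, attribute, value, t, added), in t order
        self._as_of_t = as_of_t

    def as_of(self, t):
        """Return another database value that ignores everything after t."""
        return Db(self._facts, t)

    def value(self, e, a):
        """Value of attribute a on entity e, as seen by this database value."""
        current = None
        for (fe, fa, fv, t, added) in self._facts:
            if self._as_of_t is not None and t > self._as_of_t:
                continue
            if (fe, fa) != (e, a):
                continue
            if added:
                current = fv
            elif current == fv:
                current = None
        return current


def current_address(db):
    """The query itself never mentions time; the database value carries it."""
    return db.value("fred", "address")


history = [
    ("fred", "address", "12 Elm St", 1, True),
    ("fred", "address", "12 Elm St", 2, False),
    ("fred", "address", "9 Oak Ave", 2, True),
]

now = Db(history)
earlier = now.as_of(1)

address_now = current_address(now)
address_earlier = current_address(earlier)
```

The same `current_address` function answers both questions; only the database value passed in differs.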
Absolutely, it's a giant persistent data structure, durable now.
It's persistent in both senses of the word; "persistent and durable" is, I think, the better distinction.
How does the app talk to the storage engine? That happens implicitly from issuing queries, so it's just a back end. When you ask for the database value we supply something that, as you ask for parts of the database, will pull any parts that you have not seen yet.
Right, so when you want to either record a new fact, or update a fact, which is basically recording a new fact with the same attribute, or retract a fact, you can combine those operations into a transaction along with functional transformations of the existing data, and that gets sent to the Transactor. The Transactor serializes all that activity. It takes what you want to have happen, so your assertions, retractions and transformations (we call those "data functions"), applies them to the database within the transaction, which is sort of "the now" of the database globally, sends the result to storage in a log-like fashion, and returns that your transaction was accepted.
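A minimal sketch of that flow, under loudly stated assumptions: the `Transactor` class below, its operation names (`add`, `retract`, `fn`) and its in-memory log are invented for illustration and are not Datomic's transaction format. It shows assertions, retractions and a data function being applied one transaction at a time, with the outcome appended to a log.

```python
class Transactor:
    """Toy transactor: applies transactions one at a time and appends results to a log."""

    def __init__(self):
        self.log = []      # append-only entries: (t, op, entity, attribute, value)
        self.current = {}  # (entity, attribute) -> latest value
        self.t = 0

    def transact(self, tx_data):
        staged = dict(self.current)
        entries = []
        t = self.t + 1
        for op, e, a, arg in tx_data:
            if op == "fn":
                # A data function computes the new value from the value
                # as it stands inside this transaction.
                op, arg = "add", arg(staged.get((e, a)))
            if op == "add":
                staged[(e, a)] = arg
            elif op == "retract":
                if staged.get((e, a)) != arg:
                    raise ValueError("nothing to retract")
                del staged[(e, a)]
            else:
                raise ValueError(f"unknown op {op!r}")
            entries.append((t, op, e, a, arg))
        # Commit only after every operation has succeeded.
        self.current, self.t = staged, t
        self.log.extend(entries)
        return t


transactor = Transactor()
t1 = transactor.transact([("add", "acct-1", "balance", 100)])
t2 = transactor.transact([
    ("fn", "acct-1", "balance", lambda old: old + 50),
    ("add", "acct-1", "owner", "fred"),
])
balance = transactor.current[("acct-1", "balance")]
```

Because the data function runs inside the serialized transaction, it sees the balance exactly as it is at that point, rather than a value the application read earlier.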
Periodically, in the background, those tree-like indexes are created, perhaps once an hour; that is actually triggered by the amount of data that you have. In the interim the Transactor and the peers accumulate new changes in memory. So it has logged the data, which is in storage, and it has built a little persistent data structure in memory which represents what has happened since the index was made in storage.
The other job of the Transactor is to reflect changes back out to the connected peers so everybody sees the novelty right away, and they are effectively doing a merge-join between what has happened since the last indexing job and what is in storage in the index. That gives you a view of the world on all the peers that seems to be current. It is also efficient compared with trying to keep an index updated live on disk or in storage, which is not a good approach.
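The merge between the stored index and the in-memory novelty can be sketched with a standard sorted merge. This is a simplified illustration, not Datomic's implementation: the `merged_view` function and the tuple shapes are assumptions, and the sketch assumes a fact is never both retracted and re-asserted within the same batch of novelty.

```python
import heapq


def merged_view(indexed, novelty):
    """Combine a sorted stored index with sorted in-memory novelty.

    indexed: sorted list of (entity, attribute, value)
    novelty: sorted list of (entity, attribute, value, added)
    """
    retracted = {(e, a, v) for (e, a, v, added) in novelty if not added}
    asserted = [(e, a, v) for (e, a, v, added) in novelty if added]
    # heapq.merge lazily merges already-sorted inputs without re-sorting them.
    merged = heapq.merge(indexed, asserted)
    return [fact for fact in merged if fact not in retracted]


indexed = [
    ("ethel", "likes", "sushi"),
    ("fred", "address", "12 Elm St"),
]
novelty = [
    ("fred", "address", "12 Elm St", False),
    ("fred", "address", "9 Oak Ave", True),
]

view = merged_view(indexed, novelty)
```

The stored index is never modified here; the current view is computed by layering the recent changes over it.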
One of the nice things about Datomic is that it's very well compartmentalized. The Transactor and the storage service collectively appear to the application engine as a service, and what we provide is an entire implementation of that engine in memory, which comes inside the peer JAR, so it's just an ordinary Java library. You can get started by just getting the peer library and popping up a memory database, and the entire API works exactly the same there. The next thing we have as a deliverable is an appliance, a virtual box appliance, that again emulates the entire system, so it has an embedded Transactor and embedded memory-based storage.
For more involved development you would start up one of those, and that would persist beyond your individual process and allow you to connect multiple processes. And finally you can run the production system, which is currently in limited availability. That means running your Transactor and your applications up on EC2 on AWS and utilizing Dynamo. That is the full system, which is really what you would use for production, and you would also use it for tests when you want to test in that mode, as opposed to individual developer use.
So today you can get the Peer Library and you can get the Developer Appliance and start using the API, because again your same application moves between those with just a change of the connection string.
I actually think that applications are suffering quite a bit now from being relatively "dumb" clients of "smart" servers. We are all struggling to scale those servers, and we are not doing as much interesting work because we are afraid of burdening the servers with queries. So I really would like to enable richer, more intelligent applications that have more decision-making capability. I mean, that is the mission, so I think Datomic is really good for that.
The other thing Datomic does, which we have not talked about yet, is bring declarative programming into applications. That query engine that you can use against the database you can also use against in-memory data structures, ordinary Java collections. You can issue a query against the result of System.getProperties, so there is now this integration between declarative programming and the rest of your program. A lot of the things I have talked about in the past are about using more declarative programming, isolating yourself from how things work and concentrating on what you are trying to accomplish. We're trying to deliver that by putting this query engine in the app.
So if you would value that, that would be a key point. Obviously if you are looking for scalability without a lot of configuration, that is a big benefit: you get storage as a service, and you can scale your peers the way you ordinarily do with any kind of load-balancer-driven dynamic scaling, so you get all those scaling properties you might look for in a NoSQL solution from using this as well. So I think it's a good fit for a lot of applications. I think the place we would not target would be something that is extremely write-heavy, if you are looking for arbitrary write scaling, because that is where you get into that trade-off. As soon as you say "I want arbitrary write scaling" you lose transactions and often joins and a bunch of other characteristics. You get eventual consistency.
So I think we are looking for people who want a lot of those benefits, but don't want to give up transactions and consistency. The other area where I think people often look for alternatives to relational databases is flexibility of representation, and this datom notion is sort of the ultimate flexibility in representation. It's even better than documents, because what people are finding with documents is that there is a lot of flexibility up front, but eventually you've stored a lot of documents in a particular structure.
And now it's actually a very difficult thing to change how that structure works, much more so even than changing a relational database, because at least there you could use views to insulate yourself from a change in the storage. When you use documents you can't do that, but when you use datoms you can. Obviously it's easy to make datoms look like a tree, like a document or JSON, but you have not married any structure. So I think it's a great fit for people who are looking for flexible representations.
It is a commercial product, and we have a couple of things available for people. First we have developer tools, which are free of charge, and then there is a free tier, which you will be familiar with if you've used Amazon: a certain number of hours of use are free. So there are a thousand hours of free use, and we expect many small applications to fit under that free tier and use Datomic free of charge for small things and pet projects, the kind of things where people would normally want to try out something new before they start using it at work, for instance. And finally, obviously we cannot offer Amazon for free, but we make the service that we offer free for applications that are open-source. So if your application is open-source, we won't charge you for our stuff.
17. So this query engine that I put in my application, can I just use that for anything? Because you mentioned that the query engines operate on arbitrary data, so I just can use that in my application or how does that tie in?
It is meant for use with Datomic and it’s not redistributable, but other than that you can use it, sure.
18. Talking about the query engine, you mentioned that I can use it on arbitrary data in my memory, so you can basically use your Datalog implementation on any kind of data like LINQ or other systems, is that right?
It does fill kind of a similar role to LINQ, I mean there is the "I" part of LINQ which is integration with the language which obviously I can’t change Java, so we get you as close as we can, but there is an integration of query into your application, and I think there is a lot of power to that.
The thing is that actually it's not Datalog source code, it's Datalog data structures. I eat my own dog food: I said today you should use data, and we do. All of the interface points of Datomic are data-driven, so for instance a transaction is not a string, it is not a syntax of strings, it's actually a list of lists. And you can make transaction data in your program by just writing it however you make a list in your program. Similarly, queries are data structures and query results are data structures. So there is a syntax, and you can represent it with strings and in text files and read it, but you can also construct it directly, programmatically, if you want to.
What gets read ends up as normal Java collections, so the definition of what you pass to the transaction engine, for instance, is a java.util.List of Lists, and similarly for queries. It's the same idea - programming with data has a lot of power, and so we are also bringing that to Java programs.
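To show what "transactions and queries are data" makes possible, here is a small sketch that builds both programmatically from ordinary collections, with no string concatenation. The helper names, the `db/add` operation string and the dictionary shape of the query are illustrative assumptions, not the exact Datomic formats.

```python
def tx_data_for(entity_id, attributes):
    """Turn a plain mapping into transaction data: one inner list per assertion."""
    return [
        ["db/add", entity_id, attr, value]
        for attr, value in sorted(attributes.items())
    ]


def likes_all_query(things):
    """Build a query as data: one pattern clause per thing the person must like."""
    where = [["?x", "likes", thing] for thing in things]
    return {"find": ["?x"], "where": where}


tx = tx_data_for("fred", {"likes": "pizza", "address": "9 Oak Ave"})
q = likes_all_query(["pizza", "cars"])

# Since the query is plain data, extending it is ordinary list manipulation.
q["where"].append(["?x", "address", "9 Oak Ave"])
```

Adding a clause is a list append rather than string surgery, which is the contrast with building SQL text by hand.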
Well, because they are data structures Clojure data structures work right out of the box.
It’s really just data structures, so Clojure is a language that has direct support for writing all the data structures that we use, so you don’t need to do anything.
The API is definitely a Java API and that is what drives everything; the Clojure API is a wrapper for it, but the principle is there for everyone. The principle is that the API is written in terms of data structures. I think it will be novel for Java programmers initially, but everybody who has written Java SQL code that builds transaction strings using string builders knows how awkward it is to programmatically manipulate SQL, and this is very easy to manipulate programmatically, even from Java.
24. It's awkward to manipulate SQL strings; it is much easier to work with data structures which is what we do as programmers, ideally. This is going to be very useful; we have a lot of data lying around in our databases. It’s a great innovation.
I think people have a lot of interesting solutions combining all of it, querying their native Java data structures, utilizing the in-memory database, for instance for scratch information, and then interacting with the durable system, combining all of those techniques in a single application. And the other thing you do is you can join a database and a memory object in the same query, so it is wide open.
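As a rough picture of joining a database and an in-memory object in one query, the sketch below joins stored facts against a plain dictionary on a shared value. Everything here is hypothetical (the `join` function, the facts and the `shipping_days` mapping); it only illustrates the shape of such a join, not how Datomic evaluates it.

```python
def join(db_facts, attribute, in_memory):
    """Join (entity, attribute, value) facts with an in-memory mapping on value."""
    return sorted(
        (e, v, in_memory[v])
        for (e, a, v) in db_facts
        if a == attribute and v in in_memory
    )


db_facts = [
    ("fred", "city", "Paris"),
    ("ethel", "city", "Oslo"),
    ("fred", "likes", "pizza"),
]

# Scratch data that lives only in the application's memory.
shipping_days = {"Paris": 2, "Lima": 9}

rows = join(db_facts, "city", shipping_days)
```

Only entities whose city also appears in the in-memory mapping survive the join, just as they would in a join between two stored relations.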