
The Datomic Information Model

Posted by Rich Hickey on Feb 01, 2013

Datomic is a new database designed as a composition of simple services. It strives to strike a balance between the capabilities of the traditional RDBMS and the elastic scalability of the new generation of redundant distributed storage systems.

Motivations

Datomic seeks to accomplish the following goals:

  • Provide for a sound information model, eschewing update-in-place
  • Leverage redundant, scalable storage systems
  • Provide ACID transactions and consistency
  • Enable declarative data programming in applications

Datomic considers a database to be an information system, where information is a set of facts, and facts are things that have happened. Since one can't change the past, this implies that the database accumulates facts rather than updating places, and that while the past may be forgotten, it is immutable. Thus if someone 'changes' their address, Datomic stores the fact that they have a new address; the old fact is not replaced, it is simply retracted as of that point in time. This immutability leads to many important architectural benefits and opportunities. In a previous article, I covered the Datomic architecture. This article will focus on the information model and programming experience.

Traditional databases (and many new ones!) focus on 'now' - the set of facts currently true, but, in doing so, they lose information. Businesses are finding increasing value in historical information, and there are very few reasons not to preserve it. This is not merely a matter of keeping history around, as with backups or logs, but making it available to support the active decision making process. It is necessary for a business to know your current address in order to ship you something, but 'which customers move frequently, and where from?' might be very interesting to their marketing or product development teams. Ditto supplier price histories etc. They don't want to have to restore backups or replay logs in order to find out.

It's interesting to consider why keeping active history is even in question. After all, before computers we kept records by accretion, and, as the adage goes, 'accountants don't use erasers'. I'll conjecture that early computing systems simply didn't have the capacity (or no one could afford it). But that presumption deserves rethinking, given the million-fold increase in capacity during the past 25 years. What developer eschews revision control systems like git because their codebase will no longer fit on a floppy?

A database is a database in large part due to the leverage it provides over the data. Otherwise, it is just a storage system. This leverage usually comes from a combination of organizing the data (e.g. via indexes) and query systems which leverage that organization. Developers are getting interesting and ever more capacious distributed and redundant storage systems at their disposal, but often with decreasing leverage. Datomic seeks to work atop these storage systems to take advantage of their scalability, storing organized information in them and putting leverage back in the hands of developers.

Structure and Representation

Every database has a fundamental unit at the bottom of its model, e.g. a relation, row or document. For Datomic, that unit is the atomic fact, something we call a Datom.

A Datom has the following components:

  • Entity
  • Attribute
  • Value
  • Transaction (database time)
  • Add/Retract

This representation has obvious similarities to the Subject/Predicate/Object data model of RDF statements. However, without a temporal notion or proper representation of retraction, RDF statements are insufficient for representing historical information. Being oriented toward business information systems, Datomic adopts the closed-world assumption, avoiding the challenges of universal naming, open-world, shared semantics etc of the semantic web. A Datom is a minimal and sufficient representation of a fact.

Having an atomic unit at the bottom of the model ensures that representations of novelty (e.g. transactions) are only as big as the new facts themselves. Contrast this with resubmitting an entire document in order to update part of it, or the brittleness of delta schemes which attempt to avoid that.

Datoms constitute a single, flat, universal relation, and there is no other structural component to Datomic. This is important, as the more structural components you have in your model the more rigidity you get in your applications. For instance, in a traditional relational database, each relation must be named, and you need to know those names in order to locate your data. Worse, arbitrary join tables need to be created in order to model, e.g. many-to-many relations, and the names for these fabrications must be known as well. Extreme effort must be applied to provide a set of logical views in order to isolate applications from the physical structural decisions, but those views are no less numerous or specific. Document stores are even more structurally rigid, as the hierarchy within your documents is hard-coded throughout your applications, with few if any view-like tools to provide indirection from the structure.

Schemas

All databases have schemas. The only differences lie in how much they support (or require) making the schema explicit. In the case of Datomic, attributes must be defined before they are used.

Attributes are entities themselves, with attributes for the following (among others):

  • name
  • data type of values
  • cardinality (attributes can be many-valued)
  • uniqueness
  • indexing properties
  • component nature (your foot is a component of you, but your mother is not)
  • documentation

There are no constraints on the attributes that can be applied to entities, thus entities are open and sparse. Attributes can be shared across entities, and namespaces can be used to avoid collisions. The following specifies a :person/name attribute:

{:db/ident       :person/name,
 :db/valueType   :db.type/string,
 :db/cardinality :db.cardinality/one,
 :db/doc         "A person's name"}

Schema, like all interaction with Datomic, is represented by data, the above being a representation of a map in edn format. There is no DDL.

With these simple primitives of datoms and sparse, (possibly) multi-valued attributes, one can represent row-like tuples, hierarchical document-like entities, column-store-like columns, graphs etc.
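For instance, a many-valued attribute is declared with the same map shape shown above; the :person/aliases attribute below is hypothetical, but :db.cardinality/many is how Datomic expresses multi-valuedness:

```
{:db/ident       :person/aliases,
 :db/valueType   :db.type/string,
 :db/cardinality :db.cardinality/many,
 :db/doc         "Alternate names a person goes by"}
```

An entity can then accumulate any number of :person/aliases values, giving it a document-like multi-valued field without any auxiliary join table.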

Transactions

At their most basic level, transactions in Datomic are simply lists of assertions and retractions submitted and accepted into the database atomically. A basic transaction is just a list of datoms:

[[:db/add entity-id attribute value]
 [:db/add entity-id attribute value]...]

Again, all interaction with Datomic is represented by data, the above being a representation of a list of lists in edn format, each inner list representing a datom in [op entity attribute value] order. If you want to submit several facts about the same entity you can use a map instead:

[{:db/id entity-id,
  attribute value,
  attribute value}
 ...]

While it is necessary to express them as text in an article, it is quite important to the design of Datomic that transactions are actually ordinary data structures (i.e. j.u.Lists, j.u.Maps, arrays etc) you can build in your language. The primary interface to Datomic is data, not strings, not DML.

Notice how you do not specify the transaction part of the datoms. It will be filled in by the transactor. That said, transactions are themselves entities and a transaction can assert facts about the transaction itself, such as metadata about provenance, external time, the originating process etc.
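As a sketch, reified transactions make such metadata just one more datom in the same transaction; here :audit/source is a hypothetical attribute, and #db/id[:db.part/tx] is the tempid that resolves to the current transaction entity:

```
[[:db/add entity-id attribute value]
 [:db/add #db/id[:db.part/tx] :audit/source "import-batch-7"]]
```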

Of course, not every transformation can be expressed merely as assertions or retractions without devolving into last-one-wins races and conflicts. Thus Datomic supports the notion of database functions. These are functions written in an ordinary programming language (e.g. Java or Clojure) that get installed into the database (submitted as data via transactions, of course). Once installed, a database function 'call' can be part of a transaction:

[[:db/add entity-id attribute value]
 [:my/giveRaise sally-id 100]
 ...]

When used as part of a transaction, a database function is considered a transaction function, and gets passed an additional first argument which is the in-transaction value of the database itself. Thus the function can issue queries etc. A transaction function must return transaction data. Whatever data it returns replaces it in the transaction. This process is repeated until all transaction functions have returned simple add/retracts. Thus in the transaction above, the giveRaise function might look up Sally's current salary, find it to be 45000, and return an assertion about the new value, making the resulting transaction data look like this:

[[:db/add entity-id attribute value]
 [:db/add sally-id :employee/salary 45100]
 ...]

Since :employee/salary is cardinality one, adding this fact about Sally's salary implicitly retracts the prior fact. Because transaction functions run atomically and serially within transactions, they can be used to perform arbitrary, conflict-free transformations. You can read more about database functions in the documentation.
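As a sketch of what installing such a function might look like in Clojure (assuming the datomic.api namespace is aliased as d, and reusing the article's :employee/salary attribute; treat the details as illustrative rather than definitive):

```clojure
;; illustrative only: install :my/giveRaise as data
{:db/ident :my/giveRaise
 :db/doc   "Adds amount to entity e's :employee/salary"
 :db/fn    (d/function
             '{:lang   "clojure"
               :params [db e amount]
               ;; look up the current salary in the in-transaction db value,
               ;; then return plain transaction data
               :code   (let [cur (or (:employee/salary (d/entity db e)) 0)]
                         [[:db/add e :employee/salary (+ cur amount)]])})}
```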

Connections and Database Values

On the write side, things seem pretty ordinary. You obtain a connection to a database using a URI that includes information about how to reach storage, and, via storage, how to talk to the current transactor. Transactions are issued by calling the transact function on the connection, passing transaction data as described above.

On the read side, things are quite different. In a traditional database, reading and querying is also a function of the connection. You pass a query over the connection, it reaches the server where it is run in the (usually unreproducible) context of the current database state, subject to the limits of the query language embedded in the server, competing for resources and synchronization with all other users, including writers.

By contrast, in Datomic the only read operation on a connection is db(), and it doesn't actually reach out over the wire at all. Instead, the connection is continually being fed enough information that it can immediately deliver the value of the database for use as an immutable object in your application. Thus all consumption of the data, querying etc. happens locally (the engine will transparently reach out to storage to obtain data as needed). Note that the entire database is not kept on each application server peer, just the most recent novelty and pointers to the rest in storage. Nor does any 'snapshotting' operation occur. While it feels to the application and query engine that the database is in hand, the realization is quite lightweight: just a few references to persistent data structures, in memory and in storage. Extensive caching happens under the hood.

Query

In Datomic, query is not a function of a connection, and is not even a function of a database. Instead, query is a stand-alone function that takes one or more data sources as arguments. These data sources can be database values or ordinary data collections, or any combination thereof. This is a big benefit of freeing query from running within the context of a database.

The Datomic peer library comes with a query engine based upon Datalog. Datalog is a declarative query language based upon logic, with a pattern-matching flavor well suited to querying datoms and in-memory collections.

The basic form of query is:

{:find [variables...] :where [clauses...]}

Or, this alternative (easier to type) list form:

[:find variables... :where clauses...]

Again, these are just text representations of data structures that you could build programmatically - queries are data, not strings, although strings are accepted and turned into data when supplied.

If you had a database containing these datoms (where sally, fred and ethel are stand-ins for their entity ids):

[[sally :age 21]
 [fred :age 42]
 [ethel :age 42]
 [fred :likes pizza]
 [sally :likes opera]
 [ethel :likes sushi]]

We could ask a query like this:

;;who is 42?
[:find ?e :where [?e :age 42]]

And get this result:

[[fred], [ethel]]

:where clauses match positionally, and for database sources, each datom matches as if it were a tuple of [entity attribute value transaction]. You can elide any trailing portions (transaction in this case). Symbols beginning with ? are variables, and the result will contain a tuple of the variables' values for each source tuple that matches.
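Against the same sample data, a clause can also match on the entity position and bind the value:

```
;;what does sally like?
[:find ?x :where [sally :likes ?x]]
```

which returns [[opera]].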

Joins are implicit, and occur whenever you use a variable more than once:

;;which 42-year-olds like what?
[:find ?e ?x
 :where [?e :age 42]
        [?e :likes ?x]]

which returns:

[[fred pizza], [ethel sushi]]

The API for query is a function called q:

Peer.q(query, inputs...); 

where inputs can be databases, collections, scalars etc. Queries can also utilize (recursive) rules, and call your own code. You can find more information about query in the documentation.
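For example, additional inputs are named by an :in clause, where $ conventionally names the database source; this sketch parameterizes the age from the earlier query:

```
;;who is ?age years old?
[:find ?e
 :in $ ?age
 :where [?e :age ?age]]
```

invoked as Peer.q(query, db, 42), yielding the same [[fred], [ethel]] result.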

Putting it all together:

//connect
Connection conn = Peer.connect("a-db-URI");
//grab the current value of the database
Database db = conn.db();
//a string for now, because Java doesn't have collection literals
String query = "[:find ?e :where [?e :likes pizza]]";
//who likes pizza?
Collection result = Peer.q(query, db);

Same query, different basis

Things start to get interesting when we leverage the fact that the db has all the historical information:

//who liked pizza last week?
Peer.q(query, db.asOf(lastTuesday));

The asOf method of a database returns a view of that database as of a prior point in time, specified by date-time or transaction. Note how we haven't gone back to the connection, nor changed the query. If you've ever rolled your own timestamps, you know a temporally-qualified query is usually much different than one for 'now'. There is a corresponding since method as well.
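A sketch of the complementary call, again with the same query value and a different basis:

```java
// who has come to like pizza since last Tuesday?
// since returns a view containing only datoms asserted after that point
Peer.q(query, db.since(lastTuesday));
```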

//what if we added everyone from Brooklyn?
Peer.q(query, db.with(everyoneFromBrooklyn));

The with method takes transaction data and returns a local value of the database with that data added. No transaction is issued over the connection. Thus you can do speculative, what-if queries, or check transaction data before issuing it. There is also a filter method which returns the database filtered by some predicate. Again, we haven't touched the connection, db or query.

What if we want to test the query without setting up a database? We can simply supply data in the same shape:

//test the query without a database
Peer.q(query, aCollectionOfListsWithTestData);

Again, the query is unchanged, but actually runs. Contrast that with mocking a database connection.

So far all of the techniques have worked with a specific point in past or future time. But many interesting analyses will want to look across time:

//who has ever liked pizza?
Peer.q(query, db.history());

The history method returns a view containing all datoms across time. It can be combined with asOf, since etc. This query happens to work as-is, but often time-crossing queries will differ, performing aggregation etc.

Queries can take more than one data source, and thus can easily cross databases, or use different views of the same database. Being able to pass collections to queries is like parameterized statements on steroids.

Different queries (or participants), same basis

Database values are immutable, so you can do a non-transactional, multi-step calculation and know nothing has changed. Similarly, the basis point of a database can be obtained and passed to another process, which can then get a database value in the same state. Thus different queries, separated by process or time, can work with the exact same basis.
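In Java this might be sketched as follows (basisT is the peer API's accessor for a database value's basis point; pairing it with asOf to reconstruct the view is an illustration, not the only way):

```java
// capture the basis point of the value this calculation used
long t = db.basisT();
// ...hand t to another process holding a connection to the same database...
Database sameBasis = conn.db().asOf(t);  // a value with the same basis
```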

Direct index access

Finally, the database values offer a high-performance API for iterative access to the underlying sorted datoms from the (immutable) indexes. This is the raw material from which other query approaches can be built. For instance, via this API you can query Datomic databases using Clojure's Prolog-like core.logic library.
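A sketch of that API (the datoms method and the Datom accessors here follow the peer library's naming; entityId is assumed to be bound):

```java
// walk, in sorted order, every datom about one entity via the EAVT index
for (Datom d : db.datoms(Database.EAVT, entityId)) {
    System.out.println(d.e() + " " + d.a() + " " + d.v() + " " + d.tx());
}
```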

Summary

I hope this has given you a feel for the nature of Datomic's information model and some of its details. Treating the database as a value is very different and powerful, and I think we are all still discovering the possibilities! You can learn more from the Datomic documentation.

About the Author

Rich Hickey, the author of Clojure and designer of Datomic, is a software developer with over 25 years of experience in various domains. Rich has worked on scheduling systems, broadcast automation, audio analysis and fingerprinting, database design, yield management, exit poll systems, and machine listening, in a variety of languages.


Clojure/West - Portland, OR - March 18-20th by Alex Miller

If you're interested in hearing more from Rich Hickey about Clojure, check out the upcoming Clojure/West! clojurewest.org

Clojure/West is a 3 day conference in Portland, Oregon and features Rich Hickey, the creator of Clojure, and Matthew Flatt, Racket developer, as keynotes. You can find the full schedule at clojurewest.org/schedule - it includes talks on macros, data readers, logic programming, using and debugging with Emacs, Hadoop querying with Cascalog, building REST services, ClojureScript, functional design, DSLs, and much more!

Question by peter lin

The concept of triple like RDF has been around a long time. The big issue I see with RDF and "triple inspired" approaches is the cost of rebuilding a complex object. Say I have a customer object that has phone numbers, addresses and other info.

When the application needs the entire customer object, the cost of reconstructing it from all the pieces adds overhead. Take it a bit further. Say I'm building a temporal database for an auto insurance policy. A commercial auto policy might have 200-1000 cars for a taxi service. That means the object graph could have 5000 objects. Using a traditional ORM approach, the number of rows a system would need to load would be 5000 objects x 10 versions = 50,000 objects. Once it has those objects it has to reconstruct each version. Obviously, this is slow and CPU intensive.

If I want to load the last 10 versions of a policy with Datomic, how many queries would it take to reconstruct those 10 policy records? If I understand Datomic correctly it should be something like

sum(datom for each object) + d = rows returned

where d is the number of changes from a starting time.

Using ACORD schema as an example, the number of datoms might be as low as 80 or as high as several hundred. Usually, the bulk of the data is vehicle, coverage and endorsement. If I use 40 as the base number of datoms, 200 cars, 50 datoms for vehicle/coverage/endorsement fields and 30 changes in those 10 versions, I get this:
40 + (200 x 50) + 30 = 10,070 rows of data. As the number of vehicles grows, the number of queries grows rapidly. How does Datomic address this challenge?

How long would it take Datomic to reconstruct those 10 versions? I was planning on doing this experiment later this summer.

thanks.

Re: Question by Alexander Kiel

Hi Peter,

I think it's better to ask such questions in the Datomic Google Group.

Alex

Re: Question by peter lin

thanks for the suggestion. It's probably best I do the experiment myself and gather data.

InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.