Exploring the Architecture of the NuoDB Database, Part 1
Traditionally, relational databases were designed for scale-up architectures. In other words, to handle more load you get a bigger box. A few years ago this meant that to support a scale-out architecture you either ditched SQL or applied tricks like sharding and explicit active-passive replication. If you wanted a real ACID programming model against a flexible, logical database you were out of luck. This tension is what inspired the NewSQL movement and captures the heart of what NuoDB is all about.
NuoDB is a relational database designed to be cloud-scale. What does that mean? It’s a true SQL service: it has all the properties of ACID transactions, standard SQL language support and real relational logic that you know (and hopefully) love today. It’s also designed from the start to scale the way you want a cloud service to scale.
I won’t bore you here with some long definition of “cloud scale.” If you’re really interested, I’ve gone into more detail then you probably want over on our technical blog in an opinion piece on what it means to be cloud scale. The short version is that you need a scale-out model (of course) but I think that you also need something that’s agile, easy to work with, automatable, secure and highly available. The rest of this piece is motivated by that point of view.
Note that NuoDB is “just” a piece of software. That means we run on Linux, Mac, Windows or Solaris whether it’s on your laptop, a private cluster or a public cloud. You can use it in Amazon Web Services or Google Compute Engine, integrate us with Open Stack or run as a local Windows service on your laptop. The software is flexible, so you can get testing and developing wherever you like and later decide on where and how you want to deploy.
What I’m going to cover in this article is what NuoDB is, how it was architected to solve today’s class of challenges and what problems you can solve with it. By the end of this piece you’ll be familiar with the key concepts and architectural differences of NuoDB. You’ll also understand something of the practical deployment and management features and be ready to get scaling on your own NuoDB database.
The simplest way to approach NuoDB is to think of it as a three-tiered architecture. There’s an administrative tier, a transactional tier and then a third tier for storage. We’ll come back to the administrative layer later. For now let’s focus on the second and third layers.
Splitting the transactional and storage tiers is part of the key to making a relational system scale. Traditionally, a SQL database is an exercise in synchronizing the on-disk representation of data (as pages) with an in-memory B-tree structure. This tight coupling is efficient but highly sensitive to IOPS and therefore very hard to scale out. By separating these roles we have an architecture that can scale out without being nearly as sensitive to things like disk throughput.
In NuoDB durability is a completely separate task from transactional processing, meaning that you can scale these tiers and handle failure independently. You want to scale out transactional throughput? You can do it without adding more disks. You want independent archives to protect you against data center failures? You can do it without affecting transactional performance. This separation not only helps the system scale, it makes it much easier to scale on-demand and make provisioning decisions based on your requirements.
The transactional layer is responsible for Atomicity, Consistency and Isolation, but knows nothing about Durability. That means it’s an in-memory tier, so it runs fast and any piece of it can fail or be shut down at any point without loss of data or consistency. Because it’s in-memory it’s also a caching layer (specifically, it’s an on-demand cache, which I’ll come back to soon). You don’t layer any additional caching logic on top of a NuoDB database.
The storage layer, obviously, is where the D in ACID is enforced. It’s an always active, always consistent source of all the data. It’s responsible for making data durable on commit, and providing access to data when there’s a miss in the transactional cache.
Note that the meaning of “commit” in NuoDB is tunable. You have the flexibility to trade-off performance and high availability in part because of these separated layers. To understand how and why you would tune commit protocols, I need to explain what actually makes up these layers.
The two database tiers are made up of processes running across any number of hosts. There’s a single executable that can be run in one of two modes: a Transaction Engine (TE) or a Storage Manager (SM). All processes are peers, with no single coordinator or point of failure, and there’s no special configuration required at any of the hosts. By default, all peers are mutually authenticated and communicate over encrypted sessions.
Each of the TEs accepts SQL client connections and handles queries. The caches are kept in the process space of the TE. SMs and TEs communicate with each other over a simple peer-to-peer coordination protocol. When a TE takes a miss on its local cache, it can get the object it needs from any of its peers, which typically means going to another TE if there’s one with the object in-cache since that’s faster than asking an SM to provide something from durable storage.
This simple, flexible process model makes bootstrapping, scale-out and migration very easy. For instance, suppose you want the simplest possible NuoDB database. You’d start a single TE and SM on the same host. At this point you have a full ACID database running, but it’s all on one host. This is great for testing on your laptop, but not so much for real deployment.
One thing you could do now is install the software on a second host and send it a message to start a new TE. The new TE will mutually authenticate with the existing processes, load a few root elements into its cache and then report that it’s ready to start taking on transactional load. The whole process from the time the message is sent to the time the TE is ready to do work is typically less than 100ms. You’ve just doubled the transactional throughput of your database and increased your resiliency to failure by running TEs on two separate hosts.
Great, but you’ve still only got one point of durability. To address that, you could bring up a third host and send it a message to start a second SM. That SM will automatically synchronize with the running system and then when it’s ready, start participating actively in the database. You’ve now got two independent points of durability for your database. Pretty easy too.
Finally, let’s say you want to get the first TE and SM running on different hosts. You guessed it: bring a fourth host online, start either a TE or SM on that new host and when it’s ready, shut down the original TE or SM (whichever you started on the new host). What you’ve just done is live migration of database components with no loss of availability. No down-time. You’ve also configured a fully redundant database, since any host can fail and you still have a full archive of data and access to transactional processing.
All of this works so easily and quickly because of the simple process-based, peer-to-peer model that NuoDB is built on. It also works because of the simple on-demand caching scheme and the format of the data that’s actually being cached and shared. No, it’s not actually SQL structure. It’s something we call Atoms.
Everything is an Atom
While it’s true that NuoDB is a relational database, that’s not what the internal architecture looks like. The front end of a TE is something that speaks SQL and knows how to optimize transactions. Behind that layer, all of the logic operates on objects we call Atoms. Atoms are essentially self-coordinating objects that represent specific types of information (e.g. data, indexes, schemas, etc.). Even the internal meta-data is stored as Atoms.
Atoms are chunks of data, but they shouldn’t be thought of like traditional pages in a relational database. In a way, the Atoms are really the peers in our network coordinating between their instances in cached state and marshalling or unmarshalling at the storage layer. They contain arbitrary chunks of data, where the size of an Atom is chosen to help maximize efficiency of communication, number of objects in-cache, complexity of tracking changes, etc.
In addition to database content, there are also Atoms that represent catalogs. A catalog is how NuoDB resolves other Atoms. Essentially, it’s a distributed and self-bootstrapping lookup service. When a TE starts up, it needs to get exactly one Atom, something called the Master Catalog. This is the root Atom from which all other Atoms can be found. In the previous example where a second TE fetches an Atom from the first TE’s cache, it knows it can do this because the catalog tells it where the needed Atom can be found. It’s this simple bootstrapping mechanism that makes it easy and quick to bring a new TE online.
Atoms are more than just a nice way of organizing data. They greatly simplify internal communications and caching because we don’t have to think about specific SQL structure. Everything is contained in the same generic structure and identified in a consistent way.
This view of internal state also helps with consistency of the database as a whole. Because meta-data and catalog data is stored in the same Atom structure as database data, all changes are happening in the context of the same transaction. So if we’re changing some meta-data or catalog data in the context of a SQL operation, either all of the changes happen or none happen. We don’t run the risk of making some inconsistent change that skews the data and state.
Finally, treating all data as Atoms makes durability considerably easier. Because the contents of a database are simply a collection of named objects, our durability layer is just a key-value store. In theory, it could be any store. Out of the box in the 1.1 version of NuoDB, we support any filesystem, the Amazon S3 interface or Hadoop HDFS. There will be support for more storage systems in future versions. Within a single database you can even mix-and-match storage mechanisms, meaning you could (for instance) be using both a local filesystem and an S3 service to store your database.
Multi-Version Concurrency Control
The Atom structure is a powerful, simplifying component of the architecture that helps us to scale, but without some way of handling conflict and enforcing consistency NuoDB couldn’t support ACID semantics. For this, Multi-Version Concurrency Control (MVCC) is used. Unlike global lock-managers or distributed transaction coordinators, MVCC works by treating data as versioned, and treating the database overall as an append-only set of updates.
MVCC has a lot of nice properties when it comes to scaling out a distributed database. First, when data is changed what we’re really doing is creating a new, pending version of that data (it’s “pending” until the transaction commits). Within the cache there can be multiple pending versions as well the current “canonical” version so nothing is ever changed in-place. This makes rollback trivial (the transaction just never committed so the pending change is dropped) which in turn means that update messages can be highly optimistic.
Not only can these messages be optimistic, but in NuoDB they’re typically asynchronous. That is, when a transaction is trying to make a change it will send out the associated messages (more on this below) right away and then continue. If the TE hears back that the change is allowed before the transaction is ready to commit, then no cost for the messaging is seen. Otherwise the transaction blocks, but only once it’s come to the end and needs to figure out if it’s allowed to commit. This pairing of asynchronous and optimistic behavior, along with liberal message batching allows NuoDB to stay ahead of the expected latency spikes and other unpredictable network behavior in cloud environments.
A second advantage of versioning is that it provides a clear model for visibility. In its default mode, NuoDB provides snapshot isolation. In other words, you get a consistent view of the data as it existed at the moment that your transaction started. We also support read-committed isolation and select-for-update as a SQL operation, but I feel that snapshot isolation is what you really want in a distributed database.
In practice, what this means is that a transaction on one TE can read some data and then a transaction on another TE can modify that data without conflict. As long as everything else the first transaction interacts with is consistent relative to the version of this object, then consistency is maintained. MVCC gives the system knowledge of all the pending and committed versions so an always-consistent view of the system is maintained with a minimum of global coordination messages and conflict.
What does need to be mediated is write-write conflict. For that, NuoDB picks some host to act as the Chairman for the object. It’s a fancy name, but really all it’s doing is acting as a tie-breaker when two transactions both want to update the same object. Only TEs that have a given object in their cache can act as Chairman for that object, so in the case where an object is only cached in one TE, all mediation is local. When a TE shuts down, fails or drops an object from its cache, there’s a known way to pick the next Chairman with no communications.
Finally, MVCC provides one more key benefit to the performance of the system. SMs not only maintain the archive of the database but they also can (and we recommend that you do) maintain a journal. Because NuoDB is a versioned system, the journal is just an append-only set of diffs, which tend to be very small. This makes journaling very efficient and means that the cost of adding a journaled SM doesn’t have a strong overall impact to the running time of a typical transaction.
Just as update messages can be sent out optimistically, so too can updates be written to a journal before a transaction commits. If a transaction doesn’t commit then we’ll note that, and when we play through the journal the associated update won’t be included in the batch change to the archive. So it’s not just the in-memory coordination that can be optimistic but the durability of your data as well.One important detail to note is that versioning is done on records within Atoms, not on Atoms themselves. Atoms could contain many records and would be too coarse-grained to effectively minimize conflict.
In Part 2 of this article we will take a look at how the transaction system is implemented, the role of the administrative layer, how all components work together and what to expect in the future.
About the Author
Seth Proctor is CTO of NuoDB, Inc. and has 15+ years of experience in the research, design and implementation of scalable systems. That experience includes work on networks, languages, operating systems, security, databases, and distributed environments; all of which are integral to the NuoDB product. Prior to NuoDB, Seth worked at Nokia on its internal private cloud architecture. Before that he was at Sun Microsystems working in the research labs and collaborating with product groups and universities.
Seth holds eight patents for his cutting edge work across several technical disciplines. He has an additional five patents awaiting approval, all related to achieving greater database efficiency and end-user agility in NuoDB decentralized database deployments. You can contact him here.
NuoDB vs. Riak
Re: NuoDB vs. Riak
From an architectural point of view the consistency and caching models are also fairly different. NuoDB is an always-consistent system (Riak is eventually consistent) meaning that there's no burden put on the application to understand how to mitigate or reconcile conflicts. The caching model in NuoDB is on demand as apposed to an explicit, replicated cache structure in Riak. Both valid designs, just different approaches.
NuoDB was designed to provide relational programming, but scale out like a KV-store and support flexible schemas. From that we can work across multiple data centers without any explicit configuration about how/where to do replication and without the usual k-safe caching bottlenecks. I would argue that this makes the operations model easier. So personally, I think the key differences are in the programming model, the operations model and the way that consistency and durability are treated in the architecture.
While it’s true that NuoDB is a relational database, that’s not what the internal architecture looks like.
Francisco Jose Peredo Noguez
There is no such thing as a "relational internal architecture". A database is relational, if one can talk to the database using a relational language (BTW, SQL is only an incomplete implementation of the relational model).
¿But internally? There is no such thing! The relational model does not prescribe how you build your database.
So, maybe NeoDb has a an architecture that is very different from the architecture of other pseudorelational databases (like Oracle or SqlServer) it is not right to say that its internal architecture does not "look relational". A database is relational if it looks that way from the outside, and its internal implementation has nothing to do with that.
Myths: Thanks for breaking them, just don't create new ones!
Francisco Jose Peredo Noguez
Re: While it’s true that NuoDB is a relational database, that’s not what th
What I was getting at is the internal structure of data. In a page-based system you're dealing with data internally that's still known to map to SQL structure. In our system the internal structure has no direct mapping. We could replace SQL with something document or graph based. The translation between internal structure and SQL syntax happens way up in the stack.
So, internally, we're treating data as generic objects whereas as another database might be thinking about that data as rows or columns. I was just trying to call out how far up the stack you have to get before you "know" that the data is actually used for SQL as opposed to some other model.
Re: CAP theorem
CAP is a useful way to think about tradeoffs. It does a great job of capturing problems that you need to think about when you're weighing operations challenges. What we've tried to do at NuoDB is give you a simple way to understand your exposure and adjust accordingly. I'd say what we do is make availability a lot easier and more robust, but again I'll save all my thoughts on this for some place with more space to write :)
Re: While it’s true that NuoDB is a relational database, that’s not what th
Francisco Jose Peredo Noguez
I didn't say we have a "relational internal architecture" ...
No, but you did say: "While it’s true that NuoDB is a relational database, that’s not what the internal architecture looks like." and that can be interpreted as:
"NuoDB is a relational database, and its internal architecture does not look relational, unlike the internal architecture of other relational databases"
but sorry this wasn't more clear.
Don't worry. Maybe it is me that is misunderstanding, I apologize if that is the case.
What I was getting at is the internal structure of data. In a page-based system you're dealing with data internally that's still known to map to SQL structure.
The thing is... first, the concept of "page" is not a relational concept. There are no "pages" in relational algebra.
Second, SQL is only pseudorelational, and very flawed
And third, there is no rule requiring your internal data representation to "map to SQL" to have a relational database. (It might make sense to map it to relational algebra but only from a high level perspective)
In our system the internal structure has no direct mapping.
As all should have been built. Great work!
We could replace SQL with something document or graph based.
How about a proper implementation of the relational model? As in TheThirdManifesto by Hugh Darwen and C.J. Date
The translation between internal structure and SQL syntax happens way up in the stack.
Again, great work, that is the way all databases should be implemented...
So, internally, we're treating data as generic objects whereas as another database might be thinking about that data as rows or columns.
Great approach!. Although using a true relational language on top of this could allow you to do some query plan optimization based on properties derived from relational algebra that could get you even better performance (and that are currently impossible to implement with SQL due to its incomplete implementation of the relational model)
I was just trying to call out how far up the stack you have to get before you "know" that the data is actually used for SQL as opposed to some other model.
Ok, I get it. Any chance you will open an API so that a third party can plug-in other languages (such as a D?)