Moving from Transactions to Streams to Gain Consistency
When systems grow more complex, with each large database split into several smaller ones and perhaps supplemented by derived databases for particular use cases like full-text search, the challenge becomes keeping all this data in sync, Martin Kleppmann stated in his presentation at the recent QCon London conference.
The biggest problem when working with many databases is that they are not independent of each other. Pieces of the same data are stored in different forms, so when data is updated, every database holding a piece of that data must be updated as well. The most common way of keeping data in sync is to make it the responsibility of application logic, often through independent writes to each database. This is a fragile solution: in failure scenarios, e.g. after a network failure or a server crash, the application may fail to update some of the databases and end up with inconsistencies between them. Kleppmann notes that this is not the kind of inconsistency that will eventually correct itself, at least not until the same data is written again:
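The fragile dual-write pattern can be illustrated with a minimal sketch (the store and function names here are hypothetical, not from Kleppmann's talk): the application writes the same record to two stores independently, and a failure between the two writes leaves them inconsistent.

```python
class FlakyStore:
    """Hypothetical in-memory store that can simulate a crash mid-write."""
    def __init__(self):
        self.data = {}
        self.fail_next = False

    def write(self, key, value):
        if self.fail_next:
            self.fail_next = False
            raise ConnectionError("network failure during write")
        self.data[key] = value

primary = FlakyStore()
search_index = FlakyStore()  # e.g. a derived full-text index

def update_user(key, value):
    # Two independent writes: no atomicity across the stores.
    primary.write(key, value)
    search_index.write(key, value)  # never happens if this write fails

update_user("user:1", "Alice")

search_index.fail_next = True
try:
    update_user("user:1", "Alicia")  # second write fails mid-way...
except ConnectionError:
    pass

# ...leaving the two stores permanently inconsistent:
print(primary.data["user:1"])       # "Alicia"
print(search_index.data["user:1"])  # "Alice"
```

No retry or recovery step runs here, so the mismatch persists until the same key happens to be written again, which is the "perpetual inconsistency" the quote below refers to.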
This is not eventual consistency; this is more like perpetual inconsistency
The traditional solution is to use transactions, which give us atomicity, but Kleppmann notes that this only works within a single database; with two different data stores it is not feasible. Distributed transactions (a.k.a. two-phase commit) can cross multiple storage systems, but for Kleppmann they have challenges of their own, like poor performance and operational problems.
Looking back at the problem, Kleppmann notes that a very simple solution is to order all writes in the system sequentially and ensure that everybody then reads them in the same order. He compares this to deterministic state machine replication, where, given the same starting state, a given input stream will always produce the same state transitions, no matter how many times it is run.
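The idea can be sketched in a few lines (a simplified illustration, not code from the talk): as long as each operation is applied deterministically, replaying the same ordered stream from the same starting state always yields the same result.

```python
def apply(state, op):
    """Apply one write deterministically to a replica's state."""
    kind, key, value = op
    new_state = dict(state)
    if kind == "set":
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(key, None)
    return new_state

# One totally ordered stream of writes, shared by every reader.
stream = [
    ("set", "user:1", "Alice"),
    ("set", "user:2", "Bob"),
    ("set", "user:1", "Alicia"),
    ("delete", "user:2", None),
]

def replay(stream):
    state = {}  # same starting state on every replica
    for op in stream:
        state = apply(state, op)
    return state

# Replaying the stream any number of times yields the same state.
assert replay(stream) == replay(stream) == {"user:1": "Alicia"}
```

The determinism requirement is what rules out the independent-writes approach above: if each database applied updates in its own order, their states could diverge.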
With a leader database (the master) where all writes are also stored as a stream in the order they are handled, one or more follower databases can read the stream and apply the writes in exactly the same order. This enables them to update their own data and eventually become a consistent copy of the leader. For Kleppmann this is a very fault-tolerant solution: each follower keeps track of its position in the stream, so after a network failure or crash it can simply resume from the last saved position.
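The offset-tracking behaviour can be sketched as follows (class and variable names are hypothetical): a follower applies log entries in order, remembers how far it has read, and after a restart resumes from that checkpoint rather than replaying the whole stream.

```python
class Follower:
    """Hypothetical follower that consumes the leader's write stream."""
    def __init__(self):
        self.state = {}
        self.position = 0  # offset of the next log entry to apply

    def catch_up(self, log):
        # Apply only entries not yet seen, in exactly the log's order.
        for key, value in log[self.position:]:
            self.state[key] = value
            self.position += 1  # durably checkpointed in a real system

leader_log = [("user:1", "Alice"), ("user:2", "Bob")]
follower = Follower()
follower.catch_up(leader_log)

# Simulate a crash and restart: the saved checkpoint and state survive.
saved_position, saved_state = follower.position, dict(follower.state)
follower = Follower()
follower.position, follower.state = saved_position, saved_state

# More writes arrive at the leader; the follower resumes mid-stream.
leader_log.append(("user:1", "Alicia"))
follower.catch_up(leader_log)
assert follower.state == {"user:1": "Alicia", "user:2": "Bob"}
assert follower.position == 3
```

This is essentially the consumer-offset model Kafka uses, which is one reason Kleppmann points to it as a fit for this architecture.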
Kleppmann mentions Kafka as one tool for implementing the above scenarios. He is currently working on an implementation, Bottled Water, which extracts data changes from PostgreSQL and relays them to Kafka. The code is available on GitHub.
A presentation about developing with Kafka was recently published on InfoQ.