Trading Consistency for Scalability in Distributed Architectures
One of the key aspects of the system architect's role is to weigh up conflicting requirements and decide on a solution, often by trading off one aspect against another. As systems become larger and more complex, more and more of the conventional wisdom about how applications should be built is being challenged. At last year's QCon conference in London in March, for example, Dan Pritchett gave a talk about eBay's architecture. One of the main take-away points from his presentation, which got a lot of subsequent coverage, was that eBay don't use transactions, trading easy data consistency for considerable improvements in the overall scalability and performance of their system.
Following on from his talk, InfoQ spoke to Dan Pritchett to get some more information:
Why doesn't eBay use transactions, or how might one decide against application level transactions?
It's not that we don't use transactions. We just don't use transactions that span physical resources because of the dependencies it creates across multiple components. The components can be the application server and database (e.g. client managed transactions) because a client failure can tie up database resources for longer than we can tolerate. We also don't use distributed transactions because making an application dependent upon multiple databases brings down the effective availability of the client. Instead we opt to design for the lack of transactions and build in failure modes that allow the client to succeed even in the event of database availability issues.
Application level transactions will always be somewhat problematic. Anytime you ask developers to manage resource life cycles, you will face the bugs that come about when that management goes wrong. Transactions are not much different than memory, and we see the general trend in languages toward removing the responsibility of memory management from developers due to life cycle issues. Declarative transactions, such as those in EJBs, are [a] sledgehammer approach to simplifying transaction management, assuming that every database operation behind the bean is of equal importance.
Deciding whether to use transactions or not really comes down to your scalability and availability goals. If your application needs to reach hundreds of transactions per second, you're going to find that distributed transactions won't cut it. If you want to add another digit after the 3rd 9 of availability, you're not going to be able to assume that all database commits have to complete within the context of the web page or in some cases at all. Unfortunately there are no simple formulas for when to back away from application level transactions. Instead, as an architect, you have to decide when one constraint on the system requires you to relax another.
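The availability argument in this answer can be made concrete with a little arithmetic. The following is a minimal sketch, not eBay's own model: it assumes that a request needs every listed database to be up and that the databases fail independently. The function names are hypothetical.

```python
def combined_availability(availabilities):
    """Availability of a request that needs every listed resource to be up.

    Assumes the resources fail independently of one another.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# A distributed transaction that spans three databases, each 99.9% available.
three_dbs = combined_availability([0.999, 0.999, 0.999])
single_db = combined_availability([0.999])
```

Under these assumptions, three databases at 99.9% each give a combined availability of roughly 99.7%, which is about 26 hours of expected downtime per year instead of just under 9 hours for a single database. Every additional resource inside the transaction moves the client further from "another digit after the 3rd 9".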
How have you built your own atomicity for things like 'place bid'?
Place bid by itself is an interesting problem as it's less about atomicity and more about not blocking any bidder at the critical last few seconds of the auction. It turns out that it is quite simple if you decide to compute the high bidder and bid price at display time, not bid time. Bids are inserted into a separate child table which is a low contention operation. Each time the item is displayed, all bids are retrieved and the business rules for determining the high bidder are applied.
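The approach described in this paragraph, insert-only bids with the high bidder computed at display time, can be sketched as follows. This is an illustration using an in-memory SQLite database; the table layout, the function names and the simplified business rule (highest amount wins, the earlier bid wins a tie) are assumptions for the example, not eBay's schema or rules.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bids ("
        " bid_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " item_id INTEGER NOT NULL,"
        " bidder TEXT NOT NULL,"
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def place_bid(conn, item_id, bidder, amount_cents):
    # Insert-only: no shared "current high bid" row is updated or locked,
    # so concurrent bidders do not contend with each other.
    with conn:
        conn.execute(
            "INSERT INTO bids (item_id, bidder, amount_cents) VALUES (?, ?, ?)",
            (item_id, bidder, amount_cents),
        )


def high_bid(conn, item_id):
    # Display time: fetch every bid for the item and apply the rule.
    rows = conn.execute(
        "SELECT bid_id, bidder, amount_cents FROM bids WHERE item_id = ?",
        (item_id,),
    ).fetchall()
    if not rows:
        return None
    # Assumed rule: highest amount wins; the earlier bid (lower bid_id) wins a tie.
    _, bidder, amount_cents = min(rows, key=lambda row: (-row[2], row[0]))
    return bidder, amount_cents
```

The design choice is that writes never touch the item row, so the cost of deciding who is winning is paid by readers rather than by bidders in the last seconds of the auction.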
The real question behind your question is how do we achieve consistency? To achieve consistency in large scale systems, you have to give up on ACID and instead use BASE: Basically Available, Soft state, Eventually consistent.
Relax the need for the data to be consistent at the end of each client request and you now have opened the window to eliminate distributed transactions and use other mechanisms to reach a consistent state. For example, in the above bid case, we also update view tables that are organized by bidder for quick display in the My eBay page. This is done using a pair of asynchronous events. One relies upon an in-memory queue, as we prefer very low latency between when a bid is placed and when it appears in My eBay. However, in-memory queues are unreliable, so we also capture a bid event using a server-side transaction with the bid operation. The bid event is processed as a recovery mechanism should the in-memory queue operation fail. The bidder view table is therefore decoupled and not always consistent with the state of the bid table. But that's a tolerance that we can afford and allows us to avoid ACID compliance between the bid and bid view tables.
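The pair of asynchronous events described above can be sketched roughly as follows. This is a simplified illustration, not eBay's implementation: the table names, the function names and the bounded `queue.Queue` standing in for an unreliable in-memory queue are all assumptions, and for brevity the three tables live in one SQLite database. The point is the ordering: the bid and its event commit together in a single local transaction, the view is updated later, and the view update is idempotent so the recovery path can safely repeat work.

```python
import queue
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bids (bid_id INTEGER PRIMARY KEY, item_id INTEGER,
                           bidder TEXT, amount_cents INTEGER);
        CREATE TABLE bid_events (bid_id INTEGER PRIMARY KEY,
                                 processed INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE bidder_view (bid_id INTEGER PRIMARY KEY, bidder TEXT,
                                  item_id INTEGER, amount_cents INTEGER);
    """)
    return conn


def place_bid(conn, fast_queue, item_id, bidder, amount_cents):
    # One local transaction: the bid and its durable event commit together.
    with conn:
        cur = conn.execute(
            "INSERT INTO bids (item_id, bidder, amount_cents) VALUES (?, ?, ?)",
            (item_id, bidder, amount_cents))
        bid_id = cur.lastrowid
        conn.execute("INSERT INTO bid_events (bid_id) VALUES (?)", (bid_id,))
    try:
        fast_queue.put_nowait(bid_id)  # low-latency path, allowed to fail
    except queue.Full:
        pass  # the durable event covers the lost message
    return bid_id


def apply_to_view(conn, bid_id):
    # Idempotent: applying the same bid twice leaves the view unchanged.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO bidder_view"
            " (bid_id, bidder, item_id, amount_cents)"
            " SELECT bid_id, bidder, item_id, amount_cents"
            " FROM bids WHERE bid_id = ?", (bid_id,))
        conn.execute(
            "UPDATE bid_events SET processed = 1 WHERE bid_id = ?", (bid_id,))


def drain_fast_queue(conn, fast_queue):
    # Fast path: apply whatever made it into the in-memory queue.
    while True:
        try:
            bid_id = fast_queue.get_nowait()
        except queue.Empty:
            return
        apply_to_view(conn, bid_id)


def recover(conn):
    # Recovery path: apply every event the fast path did not process.
    pending = conn.execute(
        "SELECT bid_id FROM bid_events WHERE processed = 0").fetchall()
    for (bid_id,) in pending:
        apply_to_view(conn, bid_id)
    return len(pending)
```

Between a committed bid and the moment either path applies it, `bidder_view` lags behind `bids`. That window is the inconsistency the answer describes as tolerable.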
What advice do you have for other architects on large scale systems?
The simplest advice is that scaling in the large is not adding resources to an architecture designed to scale in the small. You have to break away from conventional patterns, such as ACID and distributed transactions. Be willing to look for opportunities to relax constraints that conventional wisdom states can't be relaxed.
For a couple of simple axioms, design for everything to be split, and think BASE, not ACID.
Amazon CTO Werner Vogels, also speaking at QCon, provided some further background on the tradeoffs, citing Eric Brewer's CAP Theorem. This theorem, described in a presentation to the PODC Conference in 2000 which also covered ACID vs BASE, states that for the three properties of shared-data systems - data consistency, system availability and network partitioning tolerance - only two can be achieved at the same time. In other words, a system that is not tolerant to network partitions can achieve consistency and availability using common techniques such as transactions. However, for large distributed systems such as those of Amazon and eBay, network partitions are a given. The consequence of this is that an architect dealing with very large distributed systems has to decide whether to relax requirements for consistency or availability. Both options put some onus on the developers, who need to be aware of the characteristics of the architecture they are working with. If you've chosen to relax consistency, for example, then the developer needs to decide how to handle a situation where a write to the system is not immediately reflected in a corresponding read. As Windows Live program manager Dare Obasanjo puts it in his blog:
We follow similar practices in some aspects of the Windows Live platform and I've heard developers complain about the fact that the error recovery you get for free with transactions is left in the hands of application developers. The biggest gripes are always around rolling back complex batch operations.
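To illustrate the kind of decision this leaves with developers, here is a toy model of a write that is not yet visible on a lagging replica, together with one possible way to handle it: remember the version returned by the write and fall back to the primary when the replica is behind. The classes and the `read_at_least` helper are hypothetical and are not taken from any of the systems discussed. Falling back to the primary is itself a trade-off, since the read then depends on the primary being reachable.

```python
class Primary:
    def __init__(self):
        self.data = {}  # key -> (version, value)
        self.log = []   # replication log of (key, version, value)

    def write(self, key, value):
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)
        self.log.append((key, version, value))
        return version

    def read(self, key):
        return self.data.get(key, (0, None))


class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self):
        # Apply every log entry that has not been applied yet.
        for key, version, value in self.primary.log[self.applied:]:
            self.data[key] = (version, value)
        self.applied = len(self.primary.log)

    def read(self, key):
        return self.data.get(key, (0, None))


def read_at_least(replica, primary, key, min_version):
    """Read from the replica unless it is older than the caller's own write."""
    version, value = replica.read(key)
    if version >= min_version:
        return value, "replica"
    # The replica has not seen the write yet: fall back to the primary.
    version, value = primary.read(key)
    return value, "primary"
```

A caller that has just written version `v` of a key calls `read_at_least(replica, primary, key, v)`. Until `catch_up()` has run the value comes from the primary, and afterwards it comes from the replica.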
It is interesting to observe that many of the large web sites seem to have reached the same conclusion independently. Whilst smaller systems, with only a few nodes, don't have to concern themselves with these sorts of trade-offs yet, the sorts of problems that eBay and Amazon are addressing are likely to start appearing in enterprise systems as they also scale to larger and larger audiences.