New-age Transactional Systems - Not Your Grandpa's OLTP
John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.
The content has been bookmarked!
There was an error bookmarking this content! Please retry.
Posted by Sadek Drobi on May 31, 2008
Based on a number of conversations that have occurred around Google App Engine, Todd Hoff has outlined a set of principles that are instrumental for optimizing the use of distributed storage systems such as BigTable.
Todd starts with defining the perimeter of BigTable’s use. Given different tradeoffs it induces, Big Table only adds value if one wants to build an application that a) needs to “scale to huge numbers of users” and b) has a limited proportion of updates to reads. Todd also emphasizes that to “optimize for read speed and scalability”, the conceptual approach should be radically different from the one used with relational databases and may first appear rather counter-intuitive and even risky.
Relational world is based on error prevention; and normalization is used as a tool to remove duplication and prevent update anomalies. To scale data should be duplicated instead of being normalized. This path was choosen by Flickr as the decision was made “to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation” because “if your unit of scalability is the user shard there is no separate relation space”. Hence, even though denormalization goes against what Todd Hoff would call relational data ethics, it is an integral part in BigTable data paradigm.
Given that, Todd outlines some other principles to keep in mind for an optimized use of BigTable storage system:
Since “in BigTable data can be anywhere […] the average retrieval time can be relatively high”. You trade speed against scalability
To maximize concurrent reads, the solution is to denormalize, i.e. to “store entities so they can be read in one access rather than performing a join requiring multiple reads” and to “duplicate the attributes and store them where they need to be used.”
"[...] Your application can scale as large as it needs to simply by running on more machines. All scalability bottlenecks have been removed."
To improve queries speed, data’s format should be as close as possible to the format it is to be used. Hence, Hoff advocated for trading “SQL sets for application based entities”. It is important to highlight however that “this isn’t the same as an object oriented database”. The behavior is not bound to the entity but provided by applications and “multiple applications can read the same entities yet implement very different behaviors”.
This allows to “minimize the work needed at read time” and prevents “applications from iterating over huge data” which is inefficient.
Instead of normalizing and creating a lot of small entities, one should “create larger entities with optional parts so you can do one read and then determine what’s present at run time”
To keep data consistent across multiple entities in a denormalized context, schemas have to be “defined in code because it’s only code that can track all the relationships and maintain correctness.”
It helps updating the database in little increments.
Given that “the number of updates that can be performed in one query is quite limited” Todd suggests “to perform updates in smaller batches driven by an external CPU.”
“Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.”
Since “maintaining large lists is relatively inefficient” one should tend “to minimize the number of items in a list as much as possible.”
Todd advises to show only a limited number of most recent values from an attribute because “large queries don't scale”.
One should “avoid the global counter, i.e. an entity that keeps track of a count and is updated or read on every request.”
“All writes to an entity group are sequential”, hence is it preferable to ”use small, localized groups”.
Todd Hoff gives more insights on each of these principles and illustrates some of them with an example from an GQL thread.
Why NoSQL? A primer on Managing the Transition from RDBMS to NoSQL
Getting Started with Stratos - an Open Source Cloud Platform
Fair Trade Software Licensing - A Guide to Neo4j Licensing Options
John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.
Kevlin Henney examines code samples to see what can be learned from them starting from the premise that one won’t write great code unless he knows how to read it.
Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.
Michael Snoyman presents Yesod, a web framework written in Haskell and containing a web server, templating, ORM, libraries (templating, gravatar, etc.).
Richard Kreuter and Kyle Banker on how to avoid classical RDBMS transactional systems by using compensation mechanisms, transactional messaging or transactional procedures.
Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.
One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.
InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.
No comments
Watch Thread Reply