InfoQ

News

Principles and Guidelines for an Optimized Use of BigTable

Posted by Sadek Drobi on May 31, 2008 06:53 PM

Community
Architecture
Topics
Database Design ,
Performance & Scalability
Tags
Scalability ,
Database

Based on a number of conversations that have occurred around Google App Engine, Todd Hoff has outlined a set of principles that are instrumental for optimizing the use of distributed storage systems such as BigTable.

Todd starts with defining the perimeter of BigTable’s use. Given different tradeoffs it induces, Big Table only adds value if one wants to build an application that a) needs to “scale to huge numbers of users” and b) has a limited proportion of updates to reads. Todd also emphasizes that to “optimize for read speed and scalability”, the conceptual approach should be radically different from the one used with relational databases and may first appear rather counter-intuitive and even risky.

Relational world is based on error prevention; and normalization is used as a tool to remove duplication and prevent update anomalies. To scale data should be duplicated  instead of being normalized. This path was choosen by Flickr as the decision was made “to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation” because “if your unit of scalability is the user shard there is no separate relation space”. Hence, even though denormalization goes against what Todd Hoff would call relational data ethics, it is an integral part in BigTable data paradigm.

Given that, Todd outlines some other principles to keep in mind for an optimized use of BigTable storage system:

 

  • Assume slower random data access rather than fast sequential access.

Since “in BigTable data can be anywhere […] the average retrieval time can be relatively high”. You trade speed against scalability

  • Group data for concurrent reads

To maximize concurrent reads, the solution is to denormalize, i.e. to “store entities so they can be read in one access rather than performing a join requiring multiple reads” and to “duplicate the attributes and store them where they need to be used.”

  • Disk and CPU are cheap so stop worrying about them and scale.

"[...] Your application can scale as large as it needs to simply by running on more machines. All scalability bottlenecks have been removed."

  • Structure data around how it will be used.

To improve queries speed, data’s format should be as close as possible to the format it is to be used. Hence, Hoff advocated for trading “SQL sets for application based entities”. It is important to highlight however that “this isn’t the same as an object oriented database”. The behavior is not bound to the entity but provided by applications and “multiple applications can read the same entities yet implement very different behaviors”.

  • Compute attributes at write time.

This allows to “minimize the work needed at read time” and prevents “applications from iterating over huge data” which is inefficient.

  • Create large entities with optional fields.

Instead of normalizing and creating a lot of small entities, one should “create larger entities with optional parts so you can do one read and then determine what’s present at run time”

  • Define schemas in models.

To keep data consistent across multiple entities in a denormalized context, schemas have to be “defined in code because it’s only code that can track all the relationships and maintain correctness.”

  • Hide updates using Ajax.

It helps updating the database in little increments.

  • Puts are Precious

Given that “the number of updates that can be performed in one query is quite limited” Todd suggests “to perform updates in smaller batches driven by an external CPU.”

  • Design By Explicit Cost Model

“Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.”

  • Place a many-to-many relation in the entity with the fewest number of elements.

Since “maintaining large lists is relatively inefficient” one should tend “to minimize the number of items in a list as much as possible.”

  • Avoid unbounded queries.

Todd advises to show only a limited number of most recent values from an attribute because “large queries don't scale”.

  • Avoid contention on datastore entities.

One should “avoid the global counter, i.e. an entity that keeps track of a count and is updated or read on every request.”

  • Avoid large entity groups.

“All writes to an entity group are sequential”, hence is it preferable to ”use small, localized groups”.

Todd Hoff gives more insights on each of these principles and illustrates some of them with an example from an GQL thread.

No comments

Watch Thread Reply

Educational Content

Bindings, Platforms, and Innovation

This presentation focuses on the Internet and separating myth from fact, history from the future, and the mundane from the imaginative. Bob Frankston presents a vision of what could and should be.

Orchestrating Long Running Activities with JBoss / JBPM

This article explores the use of JBoss and jBPM to implement design solutions that effectively address the issue of orchestrating long running activities.

Neo4j - The Benefits of Graph Databases

This presentation covers the use of graph databases as an optimal solution for data that is difficult to fit in static tables, rapidly evolving data or data that has a lot of optional attributes.

Realistic about Risk: Software development with Real Options

This session introduces Real Options and shows how it can help in running your project. Real Options is a decision-making process that can be used to manage risk.

Communication Flexibility Using Bindings

This article discusses the use of bindings on services and references (including the instance of non-configured bindings) as the means to implement SCA communications in a Web and SOA environment.

Writing DSLs in Groovy

After a short introduction to DSLs, Scott Davis plays with the keyboard showing how to approach the creation of a DSL by typing working snippets of Groovy code that get executed.

Scaling Agile with C/ALM (Collaborative Application Lifecycle Management)

IBM Rational and InfoQ present, Scaling Agile with C/ALM, an eBook showing organizations how to become “finely tuned software delivery machines” by enabling team integration and scaling.

Concurrent Programming with Microsoft F#

Amanda Laucher presents a real life enterprise application written in F#. She shows actual code snippets, explaining design decisions and suggesting how to use some of the F# constructs.