Posted by Sadek Drobi on May 31, 2008 06:53 PM
Based on a number of conversations that have occurred around Google App Engine, Todd Hoff has outlined a set of principles that are instrumental for optimizing the use of distributed storage systems such as BigTable.
Todd starts by defining the scope of BigTable’s use. Given the tradeoffs it involves, BigTable only adds value if one wants to build an application that a) needs to “scale to huge numbers of users” and b) has a low ratio of updates to reads. Todd also emphasizes that to “optimize for read speed and scalability”, the conceptual approach should be radically different from the one used with relational databases and may at first appear rather counter-intuitive and even risky.
The relational world is built on error prevention, and normalization is used as a tool to remove duplication and prevent update anomalies. To scale, data should be duplicated instead of being normalized. This path was chosen by Flickr, where the decision was made “to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation” because “if your unit of scalability is the user shard there is no separate relation space”. Hence, even though denormalization goes against what Todd Hoff would call relational data ethics, it is an integral part of the BigTable data paradigm.
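The Flickr decision quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `shards` dictionary is a hypothetical in-memory stand-in for per-user shards, and `add_comment` and `comments_for` are invented names, not part of Flickr's or Google's code.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for per-user shards: user id -> list of comments.
shards = defaultdict(list)

def add_comment(commenter, commentee, text):
    """Write the same comment into both users' shards (deliberate duplication)."""
    comment = {"from": commenter, "to": commentee, "text": text}
    shards[commenter].append(dict(comment))  # copy stored with the author
    shards[commentee].append(dict(comment))  # copy stored with the recipient

def comments_for(user):
    """One read against a single shard; no join across users is needed."""
    return shards[user]

add_comment("alice", "bob", "Nice photo!")
```

The write does twice the work so that each user's comments can be read from that user's shard alone. The cost is that any later edit to the comment has to touch both copies.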
Given that, Todd outlines some other principles to keep in mind for optimized use of the BigTable storage system:
Since “in BigTable data can be anywhere […] the average retrieval time can be relatively high”. You trade speed for scalability.
To maximize concurrent reads, the solution is to denormalize, i.e. to “store entities so they can be read in one access rather than performing a join requiring multiple reads” and to “duplicate the attributes and store them where they need to be used.”
"[...] Your application can scale as large as it needs to simply by running on more machines. All scalability bottlenecks have been removed."
To improve query speed, data should be stored in a format as close as possible to the one in which it will be used. Hence, Hoff advocates trading “SQL sets for application based entities”. It is important to highlight, however, that “this isn’t the same as an object oriented database”. Behavior is not bound to the entity but provided by applications, and “multiple applications can read the same entities yet implement very different behaviors”.
This makes it possible to “minimize the work needed at read time” and prevents “applications from iterating over huge data”, which is inefficient.
Instead of normalizing and creating a lot of small entities, one should “create larger entities with optional parts so you can do one read and then determine what’s present at run time”.
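A minimal sketch of an entity with optional parts might look as follows. The `profile` dictionary and the `address` and `preferences` field names are assumptions made for illustration; the source does not name any particular entity.

```python
# Hypothetical user profile stored as one entity.
# "address" and "preferences" are optional parts; this user has only the first.
profile = {
    "name": "alice",
    "address": {"city": "Paris"},
}

def describe(entity):
    """Inspect one already-fetched entity and report which optional parts it carries."""
    parts = [key for key in ("address", "preferences") if key in entity]
    return f"{entity['name']}: {', '.join(parts) or 'no optional parts'}"

summary = describe(profile)
```

The whole profile arrives in a single read, and the presence check happens in application code rather than through additional lookups.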
To keep data consistent across multiple entities in a denormalized context, schemas have to be “defined in code because it’s only code that can track all the relationships and maintain correctness.”
It helps to update the database in small increments.
Given that “the number of updates that can be performed in one query is quite limited”, Todd suggests “to perform updates in smaller batches driven by an external CPU.”
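The batching idea can be sketched as below, assuming a plain dictionary in place of the datastore. The function names, the batch size of 100, and the `store` contents are all hypothetical; the loop in `update_in_batches` plays the role of the external driver.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def update_in_batches(store, keys, change, size=100):
    """Apply `change` to every key, one small batch per round trip."""
    round_trips = 0
    for batch in batches(keys, size):
        for key in batch:
            store[key] = change(store[key])
        round_trips += 1  # in a real system, one bounded request per batch
    return round_trips

# Hypothetical data: 250 entities, each holding a single integer value.
store = {f"user{i}": 0 for i in range(250)}
trips = update_in_batches(store, list(store), lambda v: v + 1, size=100)
```

Each batch stays under whatever per-request limit applies, so a large migration becomes a series of small, bounded operations.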
“Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.”
Since “maintaining large lists is relatively inefficient”, one should try “to minimize the number of items in a list as much as possible.”
Todd advises showing only a limited number of the most recent values of an attribute because “large queries don't scale”.
One should “avoid the global counter, i.e. an entity that keeps track of a count and is updated or read on every request.”
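One commonly used alternative to a single global counter is to split the count across several shards and sum them only when the total is needed. The sketch below illustrates that general technique with an in-memory list; it is not taken from Todd Hoff's text, and the shard count and function names are assumptions.

```python
import random

NUM_SHARDS = 8  # assumed shard count, chosen arbitrarily for the sketch
counter_shards = [0] * NUM_SHARDS

def increment():
    """Add one to a randomly chosen shard so writes spread across entities."""
    counter_shards[random.randrange(NUM_SHARDS)] += 1

def total():
    """Sum the shards; done on demand rather than on every request."""
    return sum(counter_shards)

for _ in range(1000):
    increment()
```

Because each increment touches only one of the shards, no single entity absorbs every write. Reading the total costs one read per shard, which is acceptable when totals are needed far less often than increments.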
“All writes to an entity group are sequential”, hence it is preferable to “use small, localized groups”.
Todd Hoff offers more insight into each of these principles and illustrates some of them with an example from a GQL thread.