Cloud Foundry: Design and Architecture
Derek Collison discusses the goals, the design premises and patterns employed in creating the architecture of Cloud Foundry, VMware’s open source PaaS, unveiling internal architectural details.
Posted by Ron Bodkin on Jul 18, 2011
Ed Anuff of usergrid presented on indexing techniques at Cassandra SF 2011. While Cassandra 0.7 and later have built-in secondary indexes, Anuff said they don't work well for high-cardinality values, require at least one equality comparison, and return unsorted results. Anuff presented patterns for alternative indexing, including wide rows and tables that use Cassandra 0.8.1's new composite comparators to overcome these limitations, and cautioned against the use of super-columns.
Anuff said that in early versions of Cassandra, super-columns were typically used for alternative indexing, but he advised to "use with caution," noting that many projects have moved away from super-columns because of performance problems and limitations such as not sorting subcolumns and not being able to doubly nest super-columns. He observed that column families in Cassandra have sort orders and comparators because they have been used as a way to implement secondary indexing.
Anuff explained that native secondary indexes are implemented with each index as a separate hidden Column Family. Nodes index the rows that they store, and a query is sent to all the nodes, distributing the work. He said that Cassandra 0.8.1 uses indexes for equality operations, and that range operations are performed in memory by the coordinator node. These characteristics limit their application, and native indexes can only be used with data types that Cassandra understands natively.
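The query path described above can be approximated with a small in-memory model. This is only a sketch under stated assumptions, not Cassandra's implementation: the `Node` class, the `query` function, and the sample column names are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class Node:
    """Hypothetical storage node: its own rows plus a local index on one column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # row key -> {column: value}
        self.index = defaultdict(set)  # indexed value -> row keys (the "hidden" index)

    def put(self, key, columns):
        # Drop the stale index entry before overwriting the row.
        old = self.rows.get(key)
        if old is not None and self.indexed_column in old:
            self.index[old[self.indexed_column]].discard(key)
        self.rows[key] = dict(columns)
        if self.indexed_column in columns:
            self.index[columns[self.indexed_column]].add(key)

    def find_equal(self, value):
        # Equality lookups are answered from the node-local index.
        return {k: self.rows[k] for k in self.index.get(value, ())}


def query(nodes, equals_value, range_column, low, high):
    """Send the equality lookup to every node, then range-filter in memory."""
    candidates = {}
    for node in nodes:
        candidates.update(node.find_equal(equals_value))
    # The result is a set: like the native indexes, it carries no sort order.
    return {
        key
        for key, cols in candidates.items()
        if range_column in cols and low <= cols[range_column] <= high
    }
```

The equality term narrows the candidates through the index, while the range term is only a filter over rows that were already fetched, which is why at least one equality comparison is needed.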
Anuff said that newcomers to Cassandra often wonder why a row would need up to 2 billion columns. He argued that columns are the basis of indexing, organizing, and relating items in Cassandra, and that "if your data model has no rows with over a hundred columns, you're either doing something wrong or you shouldn't be using Cassandra." Wide rows can be used to model fairly large collections, such as recording a table of departments in a company like so:
departments = {
"Engineering" : {"137acd" : null, "e245116" : null, ... },
"Sales" : { "334762" : null, "17a632" : null, ... },
...
}
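As a rough illustration of the layout above, a wide row can be modeled as a set of column names kept in comparator order, with the column name itself carrying the data and the value left empty. The `WideRow` class and its methods are hypothetical names for this sketch and do not correspond to any Cassandra client API.

```python
import bisect


class WideRow:
    """Hypothetical wide row: column names are kept sorted, values may be empty."""

    def __init__(self):
        self._names = []   # column names in sorted (comparator) order
        self._values = {}  # column name -> value

    def insert(self, name, value=None):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start="", count=100):
        """Return up to `count` column names at or after `start`."""
        i = bisect.bisect_left(self._names, start)
        return self._names[i:i + count]


departments = {"Engineering": WideRow(), "Sales": WideRow()}
for user_key in ("e245116", "137acd"):
    departments["Engineering"].insert(user_key)
for user_key in ("334762", "17a632"):
    departments["Sales"].insert(user_key)
```

Reading a department's members is then a single-row slice, for example `departments["Engineering"].slice(count=10)`.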
Anuff pointed out these advantages to using wide rows:
However, the wide row approach only works for keeping primary keys, rather than providing a lookup mechanism. In general, Anuff said that wide rows are limited to use for 1:1 mappings (i.e., where each value appears only once in a row). For example, consider a column family for groups that indexes entries by last name, where the same last name can appear more than once in a group. For such cases Anuff recommended using composite keys, which have built-in support in Cassandra 0.8.1 through two new comparators. CompositeType is a base comparator, used with one column family per index, in which the user specifies the component types and their order. The DynamicCompositeType comparator supports other cases, where users want just one column family holding many different indexes, with every row potentially holding a different index with different orderings of different values. Anuff noted that DynamicCompositeType is used for generated indexes in the JPA implementation in the Hector project, which is one of the Java clients for Cassandra and one that Anuff contributes to.
Composite keys can look like this:
User_Keys_By_Last_Name = {
"Engineering" : {"anderson", 1 : "ac1263", "anderson", 2 : "724f02", ... },
"Sales" : { "adams", 1 : "b32704", "alden", 1 : "1553bd", ... },
...
}
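Python tuples compare component by component, which makes them a convenient stand-in for a composite comparator. The sketch below is an in-memory analogy only, not a Cassandra client call. As a simplifying assumption it uses the user key as the second component of the column name instead of the numeric discriminator shown above, and the function names are hypothetical.

```python
import bisect


def insert_entry(row, last_name, user_key):
    """Insert a (last_name, user_key) composite column, keeping the row sorted."""
    entry = (last_name, user_key)
    i = bisect.bisect_left(row, entry)
    if i == len(row) or row[i] != entry:
        row.insert(i, entry)


def slice_by_last_name(row, start, end):
    """Return user keys whose last name lies in [start, end], in column order."""
    # A one-component tuple sorts before every composite sharing that prefix.
    lo = bisect.bisect_left(row, (start,))
    # end + "\0" is the smallest string greater than end.
    hi = bisect.bisect_left(row, (end + "\0",))
    return [user_key for _, user_key in row[lo:hi]]


engineering = []
insert_entry(engineering, "anderson", "ac1263")
insert_entry(engineering, "anderson", "724f02")
insert_entry(engineering, "adams", "b32704")
```

Because the last name is the leading component, both an exact match and a range over last names become contiguous slices of a single sorted row, and two users sharing a last name occupy distinct columns.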
Anuff noted that it's easy to query these composite indexes, but that updating them is tricky because you need to remove the old values and insert the new values. In general, he said that reading before writing can be an issue with Cassandra. Rather than doing locking (e.g., with ZooKeeper), Anuff presented a technique that uses three Column Families. For example, in a table with a Users Column Family and an Indexes Column Family, there will be a third Column Family, Users_Index_Entries. Updates first read the previous index values from this column family to avoid concurrency issues, and both it and Users use timestamped columns to avoid the need for locking. Sample code for how to implement this technique can be found in Anuff's GitHub project CassandraIndexedCollections as well as in the slides for this presentation.
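The update sequence can be sketched with three in-memory dictionaries standing in for the three Column Families. This is not the code from CassandraIndexedCollections. The column layouts, the `set_indexed` and `lookup` helpers, and the counter used in place of client timestamps are all assumptions made for illustration.

```python
import itertools

_clock = itertools.count(1)  # assumed stand-in for client-supplied timestamps

users = {}                # user key -> {column: (value, timestamp)}
indexes = {}              # index row key -> set of (value, user key, timestamp)
users_index_entries = {}  # user key -> set of (column, value, timestamp)


def set_indexed(user_key, column, value, index_row):
    """Update one indexed column without taking a lock."""
    ts = next(_clock)
    entries = users_index_entries.setdefault(user_key, set())
    index = indexes.setdefault(index_row, set())

    # 1. Read the earlier index entries for this column from the entries CF,
    #    not from the Users row, which may already have been overwritten.
    previous = {e for e in entries if e[0] == column}

    # 2. Remove the stale columns from the index and from the entries CF.
    for _, old_value, old_ts in previous:
        index.discard((old_value, user_key, old_ts))
    entries.difference_update(previous)

    # 3. Write the new value to all three column families.
    entries.add((column, value, ts))
    index.add((value, user_key, ts))
    users.setdefault(user_key, {})[column] = (value, ts)


def lookup(index_row, value):
    """Return the user keys currently indexed under `value`."""
    return sorted(k for v, k, _ in indexes.get(index_row, ()) if v == value)
```

Calling `set_indexed("ac1263", "last_name", "anderson", "Engineering")` and then again with `"andrews"` leaves only the `"andrews"` entry in the index, because the second call finds and removes the entry recorded by the first.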
Ron Bodkin is the Founder of Think Big Analytics, which builds big data solutions using Hadoop and NoSQL.