CassandraSF2011: Progress and Futures

Jonathan Ellis, CTO of DataStax and project chair for Apache Cassandra, gave the keynote at Cassandra SF 2011. Major accomplishments for the project in the last year include better support for multi-data center deployments, optimized read performance, integrated caching, and improved client APIs, including CQL, a SQL-like query language. The feature freeze for Cassandra 1.0 is expected in October 2011, with the release emphasizing polish: more efficient database repair, storage compression, optimized performance and an expanded CQL language.

Over 450 people came to Cassandra SF 2011, about triple the number from last year. In reviewing major features from Cassandra 0.7, Ellis noted that integrated caching is important because it allows for cache coherence. He said that modern JVMs don't handle GC well on heaps larger than 8 GB, so the project has implemented a row cache that uses native memory to keep the cache off heap, avoiding GC issues. The off-heap cache stores data as serialized rows, which Ellis said is typically 4-8 times more compact than storing Java objects. Ellis said that Cassandra's read performance has improved 100% by using a memory-mapped architecture to avoid copying data on the read path, allowing very high performance for applications whose working sets fit in RAM. Ellis cited three design choices for high performance in Cassandra:
  • it uses log-structured storage: writes are buffered in memory and then streamed out sequentially rather than issued as random writes
  • it has a concurrent engine: there are no table or row locks, and updates are implemented with compare-and-swap, which is needed to support large rows for materialized views
  • it can be tuned for eventual or full consistency, including options to succeed when there's a quorum among local nodes
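
The first design choice above can be illustrated with a toy model. The following is only a minimal sketch of the general log-structured idea, not Cassandra's storage engine: the class name, the `memtable_limit` threshold and the in-memory "segments" standing in for on-disk files are all assumptions made for illustration.

```python
import bisect


class TinyLogStructuredStore:
    """Toy log-structured store: writes land in memory, flushes are sequential."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # in-memory write buffer
        self.segments = []                    # immutable sorted runs, oldest first

    def put(self, key, value):
        # A write only touches memory; nothing is updated in place.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One pass over the sorted keys stands in for a streaming disk write.
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # The newest flush wins, so scan segments from most recent to oldest.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None
```

In this sketch an overwrite never modifies an earlier segment; the newer value simply shadows the older one at read time, which is why such designs need a later merge step to reclaim space.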

Cassandra 0.7 was released in January 2011 and included:

  • the ability to create column families without manually restarting nodes
  • expiring columns to allow automatic deletion of old data
  • secondary indexes that are now built-in (but see also the limitations from Ed Anuff's presentation on indexing at the conference reported by InfoQ)
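
The idea behind expiring columns can be sketched as a value that carries its own time-to-live. This is a hypothetical illustration only: the `ExpiringColumn` class, its field names and the `read_column` helper are assumptions, and the talk summary does not describe how Cassandra actually stores or purges expired data.

```python
import time


class ExpiringColumn:
    """A column value with an optional time-to-live (hypothetical model)."""

    def __init__(self, value, ttl_seconds=None, now=None):
        self.value = value
        self.written_at = time.time() if now is None else now
        self.ttl_seconds = ttl_seconds  # None means the column never expires

    def is_live(self, now=None):
        now = time.time() if now is None else now
        if self.ttl_seconds is None:
            return True
        return now < self.written_at + self.ttl_seconds


def read_column(column, now=None):
    # An expired column is treated as if it had been deleted.
    return column.value if column.is_live(now) else None
```

The optional `now` argument exists only so the behavior can be checked with fixed clock values, for example writing at time 1000 with a 60-second TTL and reading at 1030 and 1060.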

Cassandra 0.8 was released in June 2011 and featured:

  • CQL - a simplified SQL variant, providing a higher level interface for client applications
  • Counters - the ability to atomically increment columns
  • Automatic tuning of memory for memtables: Ellis said that in previous versions it was easy to overallocate memory, resulting in JVM crashes, but it's now feasible to have hundreds or thousands of column families
  • Bulk load interface

In a subsequent tech talk, CQL creator Eric Evans of Rackspace acknowledged that CQL would probably not support more advanced SQL idioms like nested queries or joins, because Cassandra can't support them efficiently. He added that CQL could support aggregate functions (like min and sum) once Cassandra supports coprocessors.

Ellis noted the following features for Cassandra 1.0:
  • CQL 1.1: will add support for compound columns and prepared statements
  • Compression: Highly variable row sizes make compression more challenging for Cassandra. 1.0 will support both multiple rows per compressed block and multiple blocks per row.
  • Compaction: Cassandra will generalize the approach of Google's LevelDB so that at most one SSTable per level might hold data for a given key. The worst case for a merge then drops to log(n) SSTables from the current worst case of n, where n is the number of SSTables used to represent a column family.
  • Repair optimization: the current implementation can transmit and store excess data, possibly exhausting disk. In 1.0 this will be optimized.
  • Read optimization: SSTables will be sorted by the maximum (client-provided) timestamp to allow early termination of merges when the newest values of requested columns have been found
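
The read optimization in the last item can be sketched as follows. This is a simplified model under stated assumptions, not Cassandra's code: the `SSTable` dataclass, its `rows` layout of key to (timestamp, value), and the `read_newest` function are hypothetical names, and a single value per key stands in for a set of requested columns.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    rows: dict  # key -> (timestamp, value); hypothetical in-memory stand-in

    @property
    def max_timestamp(self):
        # Newest timestamp stored anywhere in this table.
        return max((ts for ts, _ in self.rows.values()), default=float("-inf"))


def read_newest(sstables, key):
    """Return (value, tables_examined) for the newest version of key."""
    best = None      # (timestamp, value) of the newest version seen so far
    examined = 0
    # Visit tables from newest max timestamp to oldest.
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        # If the version already found is at least as new as anything this
        # table (or any later one) can contain, the merge can stop early.
        if best is not None and best[0] >= table.max_timestamp:
            break
        examined += 1
        hit = table.rows.get(key)
        if hit is not None and (best is None or hit[0] > best[0]):
            best = hit
    return (best[1] if best else None), examined
```

The second element of the result shows how many tables were consulted. When a recently written key lives in the newest table, the loop stops after one table; a missing key still forces a look at every table in this sketch.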

Ellis said that beyond 1.0, Cassandra will focus on ease of use for developers. He also noted the availability of Brisk (described by InfoQ previously), which allows analytics on real-time data without ETL, and mentioned Solandra, a clustered Solr built on Cassandra. Ellis said that these are the first two examples of broader data projects built atop Cassandra, a trend he expects to continue.
