CassandraSF2011: Progress and Futures
Jonathan Ellis, CTO of DataStax and project chair for Apache Cassandra, delivered the keynote at Cassandra SF 2011. Major accomplishments for the project in the last year include better support for multi-data center deployments, optimized read performance, integrated caching, and improved client APIs, including the SQL-like query language CQL. The feature freeze for Cassandra 1.0 is expected in October 2011, with an emphasis on polish: efficient database repair, storage compression, optimized performance and an expanded CQL language.
- Cassandra uses log-structured storage: writes are buffered in memory and then flushed as streaming writes rather than random writes
- it has a concurrent engine: there are no table or row locks, and updates are implemented with compare-and-swap; this is needed to support large rows for materialized views
- it can be tuned for eventual or full consistency, including options to succeed when there is a quorum among local nodes
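The log-structured write path in the first point above can be illustrated with a small sketch. This is a toy model, not Cassandra's code: the class `TinyLogStructuredStore`, its `flush_threshold` parameter and the tab-separated file format are all hypothetical names chosen for illustration. It shows writes being buffered in memory and then written out in one sequential pass to a new immutable file, so existing files are never updated in place.

```python
import os
import tempfile


class TinyLogStructuredStore:
    """Toy write path: buffer in memory, then flush sequentially to disk."""

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.memtable = {}   # in-memory buffer of recent writes
        self.segments = []   # paths of immutable on-disk files, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        path = os.path.join(self.directory, f"segment-{len(self.segments)}.txt")
        # One sequential pass in key order; existing files are never modified.
        with open(path, "w") as f:
            for key in sorted(self.memtable):
                f.write(f"{key}\t{self.memtable[key]}\n")
        self.segments.append(path)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newer segments shadow older ones, so scan from newest to oldest.
        for path in reversed(self.segments):
            with open(path) as f:
                for line in f:
                    k, v = line.rstrip("\n").split("\t", 1)
                    if k == key:
                        return v
        return None


with tempfile.TemporaryDirectory() as tmp:
    store = TinyLogStructuredStore(tmp)
    for i in range(4):
        store.put(f"k{i}", f"v{i}")
    store.put("k0", "updated")  # overwrite lands in memory, not in the old file
    segment_count = len(store.segments)
    latest_k0 = store.get("k0")
    flushed_k1 = store.get("k1")
```

In this sketch an overwrite never touches the flushed file; the newer value simply shadows the older one at read time, which is why such designs also need a background merge step (compaction).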
Cassandra 0.7 was released in January 2011 and included:
- the ability to create column families without manually restarting nodes
- expiring columns to allow automatic deletion of old data
- secondary indexes that are now built-in (but see also the limitations from Ed Anuff's presentation on indexing at the conference reported by InfoQ)
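As a rough illustration of what expiring columns mean for an application, the sketch below models a row whose columns can carry a time-to-live. It only demonstrates the visible behavior (a column stops being returned once its TTL has passed); the class `ExpiringColumns`, its methods and the injectable `clock` are hypothetical and say nothing about how Cassandra implements expiry internally.

```python
import time


class ExpiringColumns:
    """Toy row whose columns can carry a time-to-live (TTL) in seconds."""

    def __init__(self, clock=time.time):
        self.clock = clock   # injectable so the example is deterministic
        self.columns = {}    # name -> (value, expires_at or None)

    def insert(self, name, value, ttl=None):
        expires_at = None if ttl is None else self.clock() + ttl
        self.columns[name] = (value, expires_at)

    def get(self, name):
        entry = self.columns.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self.clock() >= expires_at:
            del self.columns[name]  # lazily drop the expired column
            return None
        return value


now = [1000.0]  # fake clock, advanced by hand below
row = ExpiringColumns(clock=lambda: now[0])
row.insert("session_token", "abc123", ttl=60)
row.insert("username", "alice")  # no TTL: never expires
before_expiry = row.get("session_token")
now[0] += 61
after_expiry = row.get("session_token")
permanent = row.get("username")
```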
Cassandra 0.8 was released in June 2011 and featured:
- CQL - a simplified SQL variant, providing a higher level interface for client applications
- Counters - the ability to atomically increment columns
- Automatic tuning of memory for memtables: Ellis said that in previous versions it was easy to over-allocate memory, resulting in JVM crashes, but it's now feasible to have hundreds or thousands of column families
- Bulk load interface
In a subsequent tech talk, CQL creator Eric Evans of Rackspace acknowledged that CQL would probably not support more advanced SQL idioms like nested queries or joins, because Cassandra can't support them efficiently. He noted, however, that CQL could support aggregate functions (like min and sum) once Cassandra supports coprocessors.
The following are planned for Cassandra 1.0:
- CQL 1.1: will add support for compound columns and prepared statements
- Compression: Highly variable row sizes make compression more challenging for Cassandra. 1.0 will support compressing both multiple rows per block and multiple blocks per row.
- Compaction: Cassandra will generalize the approach of Google's LevelDB so that at most one SSTable per level might hold data for a given key, giving worst-case merges of log(n) SSTables instead of the current worst case of n, where n is the number of SSTables used to represent a column family.
- Repair optimization: the current implementation can transmit and store excess data, possibly exhausting disk space. In 1.0 this will be optimized.
- Read optimization: SSTables will be sorted by the maximum (client-provided) timestamp to allow early termination of merges when the newest values of requested columns have been found
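The read optimization in the last point above can be sketched as follows. This is a simplified model under stated assumptions, not Cassandra's implementation: `SSTable` here is a hypothetical in-memory stand-in holding `(timestamp, value)` pairs, and `read_columns` is an illustrative helper. Tables are visited from the highest maximum timestamp downward, and the merge stops as soon as every requested column already has a value newer than anything the remaining tables could contain.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    name: str
    columns: dict  # column name -> (timestamp, value)

    @property
    def max_timestamp(self):
        return max(ts for ts, _ in self.columns.values())


def read_columns(sstables, wanted):
    """Return ({column: value}, names of the tables actually read)."""
    newest = {}  # column -> (timestamp, value)
    tables_read = []
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        found_all = all(c in newest for c in wanted)
        # Strict comparison: on a timestamp tie we conservatively keep reading.
        if found_all and all(newest[c][0] > table.max_timestamp for c in wanted):
            break  # no remaining table can hold a newer value
        tables_read.append(table.name)
        for c in wanted:
            if c in table.columns:
                if c not in newest or table.columns[c][0] > newest[c][0]:
                    newest[c] = table.columns[c]
    return {c: v for c, (_, v) in newest.items()}, tables_read


tables = [
    SSTable("old", {"email": (10, "a@old.example"), "name": (11, "Al")}),
    SSTable("mid", {"email": (20, "a@mid.example")}),
    SSTable("new", {"email": (30, "a@new.example"), "name": (31, "Alice")}),
]
values, tables_read = read_columns(tables, ["email", "name"])
```

With this data only the table named `new` is read: both requested columns are found there with timestamps above the maximum timestamp of every other table, so the older tables are skipped.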
Beyond 1.0, Ellis said that Cassandra will focus on ease of use for developers. Ellis also noted the availability of Brisk (described by InfoQ previously) to allow analytics on real-time data without ETL. He also mentioned Solandra, a clustered Solr built on Cassandra. Ellis said that these are the first two examples of broader data projects built atop Cassandra, a trend he expects to continue.