Practical Cassandra: A Developer's Approach - Book Review and Interview
The book covers all aspects of the Cassandra application development lifecycle, including installation, data modeling, the query model (CQL), performance tuning, and monitoring. The authors also discuss the different drivers used to access Cassandra database tables, with code examples in commonly used languages like Java, C#, Ruby, and Python.
The book also covers several case studies providing examples where the Cassandra database is being used in various applications. Case studies include applications used at Ooyala, Hailo, and eBay.
InfoQ spoke with the authors about the book, the Cassandra data model, design considerations, and how Cassandra handles concurrency and versioning of datasets.
InfoQ: Can you define the Cassandra NoSQL database? There is some confusion about whether it is a columnar (column-based) data store or a row-based column family data store.
Bradberry & Lubow: Cassandra is a row-based column family data store. This means that the data is stored in terms of “rows” rather than “columns”, and all rows that fall within the same “partition”, as defined by the user, are stored together in order on disk. This guarantees that all rows for a given partition will exist on the same node(s), which makes sequential scanning of data possible and very performant. The term “Column Family” comes from the Google BigTable white paper that influenced the Cassandra storage model.
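As a rough illustration of how a partition key can pin all of its rows to the same node(s), here is a minimal Python sketch of a token ring. The Ring class, the token function, the node names, and the use of MD5 as the hash are all assumptions made for this example; a real cluster uses its configured partitioner and replication strategy.

```python
import hashlib
from bisect import bisect_right


def token(partition_key):
    """Map a partition key to an integer token.

    MD5 is only a stand-in here (an assumption for the sketch).
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy ring: each node owns one token derived from its name."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        self.tokens = sorted((token(name), name) for name in nodes)

    def replicas(self, partition_key):
        """Return the nodes that hold every row of this partition."""
        keys = [t for t, _ in self.tokens]
        # First node whose token is greater than the key's token,
        # wrapping around to the start of the ring.
        start = bisect_right(keys, token(partition_key)) % len(self.tokens)
        count = min(self.replication_factor, len(self.tokens))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(count)
        ]


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
# Every row written under the partition key "user:42" maps to the same nodes.
print(ring.replicas("user:42"))
```

Because the placement depends only on the partition key, every lookup for that key resolves to the same replica set, which is what lets the rows of one partition sit together on disk.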
InfoQ: What are some design considerations when using a column family data store like Cassandra?
Bradberry & Lubow: Model your queries, not your data. All Cassandra tables have a primary key. This key is composed of two distinct parts, the “partition columns” and the “clustering columns”. The partition columns determine which nodes the data will live on. The clustering columns determine the order of the data within a given partition. When querying your data you must, at the very least, specify the partition columns. This means that you must know something about the data before querying it; there are no ad-hoc searches in Cassandra. If you have data that needs to serve two different queries, then you must create a separate table that stores the data in a way that is retrievable for each query pattern, which leads us into:
Duplicating data is not a bad thing. Cassandra was built on the foundation that “disk space is cheap”. Keeping this in mind, duplication of data doesn’t have to be seen as bad. As database administrators, we have all been taught that we must not duplicate data, and that we must therefore “normalize” duplicated data into different tables that then get joined and cobbled together to provide the view of the data we want. While this saves disk space, it is vastly inefficient in many newer data models, especially at a time when the cost per gigabyte of storage can go below 4 cents.
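The two ideas above (one table per query pattern, and duplicating data across those tables) can be sketched in a few lines of Python. This is a toy in-memory model, not Cassandra itself: the Table class, the posts_by_author and posts_by_tag tables, and the save_post helper are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class Table:
    """Toy table: rows are grouped by a partition column and kept
    sorted by a clustering column within each partition."""

    def __init__(self, partition_column, clustering_column):
        self.partition_column = partition_column
        self.clustering_column = clustering_column
        self._partitions = defaultdict(list)

    def insert(self, row):
        partition = self._partitions[row[self.partition_column]]
        partition.append(dict(row))
        partition.sort(key=lambda r: r[self.clustering_column])

    def select(self, partition_value):
        """A read must name the partition; there is no ad-hoc search."""
        return list(self._partitions.get(partition_value, []))


# One table per query pattern; the same row is written to both.
posts_by_author = Table("author", "posted_at")
posts_by_tag = Table("tag", "posted_at")


def save_post(author, tag, posted_at, title):
    row = {"author": author, "tag": tag, "posted_at": posted_at, "title": title}
    posts_by_author.insert(row)
    posts_by_tag.insert(row)


save_post("alice", "cassandra", 2, "Tuning notes")
save_post("bob", "cassandra", 1, "Getting started")
save_post("alice", "python", 3, "Driver tips")

print([r["title"] for r in posts_by_author.select("alice")])
print([r["title"] for r in posts_by_tag.select("cassandra")])
```

Each write costs two inserts and some extra storage, but each read becomes a single lookup of one partition that is already in the order the query needs.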
InfoQ: In the book, you wrote a chapter on Data Modeling. What is the typical data modeling process when using a column family database, and how does the process compare with modeling against an RDBMS?
Bradberry & Lubow: Referring to the above answer, RDBMS data modeling is designed for, amongst other things, efficient storage of data, going back to the days when storage wasn’t “cheap.” While Cassandra data is modeled for efficiency as well, it is modeled for efficiency of retrieval, not the number of bytes on disk.
InfoQ: Can you discuss the cluster and failover model in Cassandra? How does the peer-to-peer model compare with the master/slave model that other databases support? What are the pros and cons of each model?
Bradberry & Lubow: Cassandra doesn’t use a failover model per se. When one talks about failover, that has to do more with master slave relationships where the database primary needs to alert applications that the authoritative read/write source (the master) is located elsewhere. Since Cassandra doesn’t operate under the master/slave relationship rules, there is no failover required (in the traditional sense). When a driver connects to the cluster, it gets a list of all available nodes in the cluster and can write and read from any of those nodes. The responsibility is placed on the node to locate the data (if it exists) and the responsibility is placed on the driver to know if a node it is trying to connect to is currently available.
The advantage of this model is that it generally allows for much higher availability. A more highly available system generally has better uptime, which is good for the app, good for clients, etc. The disadvantage of this model is that the driver must be aware of the state of every node in the cluster that it may need to connect to. This can make for a complex driver and put a bit of extra load on the application servers.
There are also advantages to the master/slave model. It’s much easier to scale reads out horizontally in the master/slave model: just add a few slaves and your read capacity increases. While a similar approach is available with Cassandra, when you add additional nodes, your write capacity also increases. Scaling also tends to happen more vertically in master/slave systems. You can replace the master (and typically its hot standby as well) with beefier machines to scale vertically, whereas to scale a Cassandra cluster vertically you would typically replace every node in the datacenter (not necessarily the ring) in order to keep Cassandra performing happily.
One of the main advantages (or disadvantages, depending on your perspective) of the Cassandra approach is the fact that the data is eventually consistent. There are ways of ensuring a higher level of consistency, such as requiring more nodes to acknowledge that a write took place, or requiring more nodes to verify that the data read is consistent with the latest round of updates for that object.
InfoQ: How does Cassandra support concurrency and versioning of data to provide scalable data access?
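The trade-off between how many nodes must acknowledge a write and how many must answer a read can be expressed as simple arithmetic. The sketch below assumes the usual replica-overlap rule (a read is guaranteed to touch at least one replica that has the latest acknowledged write when write acknowledgements plus read responses exceed the replication factor). The function names are hypothetical, and the model ignores timing and failure details.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_acks, read_acks):
    """True when every read set must share a replica with every write set."""
    return write_acks + read_acks > replication_factor


rf = 3
print(quorum(rf))                                        # majority of 3 replicas
print(read_overlaps_write(rf, quorum(rf), quorum(rf)))   # quorum write + quorum read
print(read_overlaps_write(rf, 1, 1))                     # single-replica write and read
```

With a replication factor of 3, quorum writes paired with quorum reads overlap, while writing and reading with a single replica each do not, which is the eventually consistent case described above.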
Bradberry & Lubow: When it comes to versioning of the data, Cassandra opts for a type of last-write-wins (LWW) approach. This was done to achieve a simpler model than a standard vector clock implementation, where the decision about the correct data to store is pushed back to the client. The idea of LWW works in Cassandra because the data model breaks rows up into smaller chunks (columns) and allows timestamp resolution of data for each of the columns. This approach extends to schema versioning because the schema itself is stored in Cassandra. Therefore, when multiple schema updates are made, they are applied in the order they are received. The same can be said for how concurrency is handled. When many mutations that modify the same object are sent to Cassandra at once (or at approximately the same time), the LWW approach is applied to each individual column to decide which mutation “wins.”
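A minimal Python sketch of per-column last-write-wins follows. The merge function and the row layout (a dict mapping each column name to a (timestamp, value) pair) are assumptions made for illustration. On equal timestamps this sketch simply keeps the existing value, which is a simplification and not a description of Cassandra's actual tie-breaking.

```python
def merge(current, mutation):
    """Merge a mutation into a row, column by column.

    Both arguments map column name -> (timestamp, value). For each
    column, the entry with the higher timestamp wins.
    """
    merged = dict(current)
    for column, (timestamp, value) in mutation.items():
        if column not in merged or timestamp > merged[column][0]:
            merged[column] = (timestamp, value)
    return merged


row = {"name": (1, "Ann"), "email": (1, "ann@old.example")}
mutation_a = {"email": (5, "ann@new.example")}
mutation_b = {"name": (3, "Anne"), "email": (4, "anne@mid.example")}

# The result is the same regardless of the order the mutations arrive in.
print(merge(merge(row, mutation_a), mutation_b))
print(merge(merge(row, mutation_b), mutation_a))
```

Because resolution happens per column, mutation_b still updates the name even though its email value loses to the newer one in mutation_a.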
Another important concept in providing concurrency is being able to choose the consistency level (CL) at which data is written and read. By giving certain data types within your application a higher priority for accuracy (writing and reading at a higher CL), you can allow your system to run more concurrent operations while performing fewer taxing operations on the cluster.
InfoQ: What are the limitations of a column family DB compared to Relational and other NoSQL databases?
Bradberry & Lubow: There is no ad-hoc querying. If you want to query your data, you need to store it in a way that is easy to retrieve. There are no aggregate functions, and there are very limited ordering capabilities. The few methods there are for handling data in this fashion (like secondary indexes or the CQL count method) tend to become unusable once the data reaches any significant size.
InfoQ: You also discussed performance tuning techniques in the book. Can you talk about some of these techniques and how they help with scalability and performance of applications?
Bradberry & Lubow: The methods discussed in the book are basically broken down into 3 parts: JVM tuning, Cassandra option tuning, and system (or kernel) level tuning. Without going into specific details since they are all listed in the book, it still makes sense to discuss the reasoning behind each of the parts.
Starting with the lowest level, the kernel contains the core set of instructions that the operating system uses to decide how it is going to handle its workload. So here we can configure the priority for disk operations, network operations, CPU operations, etc. Since this is a network-heavy database node, we’ll want to give the kernel instructions supporting that application. In other words, we may want to configure the system to give a higher priority to network traffic because that is how our application accesses the database. Or, if our application is consistently running complex queries, we may want to give a higher priority to the IO subsystem.
The second part, moving up the stack, is the JVM. It’s the container that runs Cassandra. There are a huge number of tunables in the JVM for everything from memory to garbage collection. Depending on the way you use Cassandra, you can easily make a huge impact on performance by adjusting the knobs the JVM gives you. You can generally be smarter about garbage collection than the defaults once you know your usage patterns. You can also adjust memory usage throughout the entire JVM once you see your usage patterns. This brings back the important point of only tuning knobs once you understand your usage patterns and how they affect the system.
The third part at the top of the stack is Cassandra itself. There are also many things here that can be done from knobs in Cassandra itself to the way data is stored and accessed (like compression or bloom filter settings). At a high level, these settings can have an impact on how nodes store data on disk (compression), how they communicate with each other to determine health (gossip, failure detection, encryption), and how nodes allocate and clean memory within the JVM. All these settings can massively impact the system. Improperly setting up memory usage can bring a Cassandra server to its knees and can even affect the entire cluster.
What all tuning really comes down to is the ability to know your system and the way the application functions and what it should be optimized for. Once you have made the performance goals of your application clear, knowing what needs to be tuned at the different levels of the infrastructure becomes a lot more obvious.
InfoQ: Are there any interesting features in CQL3?
Bradberry & Lubow: CQL3 provides a familiar and easy way to query Cassandra. It provides a binary protocol that allows efficient communication with Cassandra. The protocol gives the clients the ability to be notified when a node enters/leaves the ring and the ability for the client to discover the entire ring without having to have knowledge of all the host names. The binary protocol also gives the ability to prepare CQL statements which reduces the amount of time required to parse the incoming CQL. In addition to the binary protocol, CQL3 introduces Collections which give the client the ability to store Set, List and Map types in addition to the basic types that Cassandra offers.
InfoQ: What are the tools available for developers when using Cassandra?
Bradberry & Lubow: The DataStax docs are invaluable, as are OpsCenter and DataStax DevCenter. There are also the user forums, a mailing list, and a healthy set of people who follow Cassandra on StackOverflow. If you need answers in a more timely manner, there are many folks who use Cassandra in #cassandra on Freenode IRC.
InfoQ: Are there any features in Cassandra you would like to see added in future releases of the product?
Bradberry & Lubow: We would really love to see data streaming, that is, being able to run large queries and have the data start streaming to the client as it is read. Also, idempotent counters would be awesome. Being able to see the status of a set of repairs would be great too.
Russell and Eric also talked about how to approach NoSQL database solutions for different use cases:
Bradberry & Lubow: It is important to note that not all NoSQL is the same. The use cases that make Cassandra a great choice are not the same use cases that make MongoDB a great choice. The question “should I go with X or Y?” should only be asked after an in-depth look into your use case and how it would fit into a database system. 99% of use cases out there are satisfied with a simple PostgreSQL installation.
About the Book Authors
Russell Bradberry is the primary author of the NodeJS Cassandra driver Helenus. As Principal Architect at SimpleReach, he is responsible for architecting and building out highly scalable data solutions. He has delivered a wide range of products, including a real-time bidding ad server, a rich media ad management tool, a content recommendation system, and most recently, a real-time social intelligence platform. He is a DataStax MVP for Apache Cassandra.
Eric Lubow, CTO of SimpleReach, builds highly-scalable distributed systems for processing social data. He began his career building secure Linux systems. Since then he has worked on building and administering various types of ad systems, maintaining and deploying large scale web applications, and building email delivery and analytics systems. He is also a DataStax MVP for Apache Cassandra.