Book Excerpt and Interview: Amazon SimpleDB Developer Guide
A new book by Prabhakar Chaganti and Rich Helms, Amazon SimpleDB Developer Guide, provides a step-by-step guide to developing applications for Amazon SimpleDB in several programming languages, including Java, PHP, and Python.
Relational databases were not designed to be distributed, which leads to scalability and cost issues. As a result, the popularity of “NoSQL” databases has skyrocketed.
This practical book aims to explain how to use SimpleDB in applications. It quickly leads the reader through the differences between relational databases and SimpleDB, the implications of using SimpleDB, its strengths and limitations, and the ways to overcome those limitations. The book also explains how to combine SimpleDB with Amazon S3 to work with large binary files while keeping their metadata accessible through SimpleDB. It describes the use of a cache to avoid excessive SimpleDB requests, which both improves performance and reduces SimpleDB usage costs. Finally, the book describes batch operations, which take advantage of SimpleDB’s support for concurrency and parallel operations. Throughout the book, there is an emphasis on demonstrating key concepts with practical examples for Java, PHP, and Python developers.
Packt Publishing provided InfoQ readers with an excerpt that compares SimpleDB with traditional RDBMSs. The excerpt is chapter 3 of the book: SimpleDB versus RDBMS. InfoQ spoke with the book's authors to learn about the motivations behind the book and their experience with SimpleDB.
InfoQ: The book starts by giving a great explanation of the SimpleDB metaphor and the ways it simplifies interacting with the database. It would also be very helpful to introduce the landscape of “NoSQL” databases and show where SimpleDB fits.
Our work was in SimpleDB and how it supports high-scalability. We didn’t focus on looking at the other “NoSQL” databases in the market.
InfoQ: Can you provide (or point to) a quick comparison of different popular “NoSQL” databases with some suggestions on when these databases are most applicable, for example key/value vs. documents, vs. column store?
Vineet Gupta has written up a comparison of all the popular NoSQL databases in his blog.
InfoQ: Can you provide the pros and cons of private vs. public clouds?
This is a pretty heated debate right now. We feel they both have their uses. Security seems to be the major differentiator between the two currently. In the case of a public cloud, you essentially need to trust the cloud provider, and that certainly seems to be a concern for enterprises. The clear advantage of cloud services is paying only for what you use, rather than paying up front for a system sized to support anticipated need.
InfoQ: Can you describe a class of applications for which eventual consistency does not create issues? Can you provide any design/code practices to deal with it?
Eventual consistency is a cost of having a large master/slave topology. In SimpleDB you can force a transaction to route via the master, but the cost is that you are dealing with a single machine. When you read from any machine, you are not guaranteed the absolute latest data, but you do get a quick response even with a large number of concurrent queries. Our experience is that the data is consistent in about four seconds. The question is whether four-second-old data is an acceptable trade-off for extreme scalability.
In the chapter on storing files in S3 and the information about the files in SimpleDB, the user would probably not be querying for their updated file list within four seconds.
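The trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class ToyReplicatedStore, its single replica, and the explicit `now` argument are hypothetical and do not reflect how SimpleDB is built internally. The four-second lag is simply the figure the authors report from experience.

```python
class ToyReplicatedStore:
    """Toy model of one master and one lagging replica (hypothetical)."""

    def __init__(self, lag_seconds=4.0):
        self.lag_seconds = lag_seconds
        self._master = {}    # always holds the latest write
        self._replica = {}   # receives each write after the lag
        self._pending = []   # (visible_at, key, value) awaiting replication

    def put(self, key, value, now):
        """Write to the master and schedule the write for the replica."""
        self._master[key] = value
        self._pending.append((now + self.lag_seconds, key, value))

    def get(self, key, now, consistent=False):
        """Read from the master if consistent, else from the replica."""
        if consistent:
            return self._master.get(key)
        self._replicate(now)
        return self._replica.get(key)

    def _replicate(self, now):
        """Apply every pending write whose lag has elapsed, in write order."""
        still_pending = []
        for visible_at, key, value in self._pending:
            if visible_at <= now:
                self._replica[key] = value
            else:
                still_pending.append((visible_at, key, value))
        self._pending = still_pending


store = ToyReplicatedStore(lag_seconds=4.0)
store.put("file-list", ["a.jpg"], now=0.0)
early = store.get("file-list", now=1.0)                        # replica is still stale
forced = store.get("file-list", now=1.0, consistent=True)      # routed via the master
later = store.get("file-list", now=5.0)                        # replica has caught up
```

In this model a read one second after the write misses the new value unless it is routed via the master, while a read after the lag sees it from the replica.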
InfoQ: Can you provide additional explanation for your statement about SimpleDB advantage:
“The lack of object-to-relational mapping that is common for an RDBMS allows your structured data to map more directly to your underlying application code and reduce the application development time”?
An example we were thinking of is the telephone number in the next question. Rather than a linked table for multiple numbers in a record, you just use multiple attribute/value pairs.
InfoQ: Can you give an example where SimpleDB’s support for multiple values for a given attribute is useful?
A simple example from the book is tracking telephone numbers. As people add numbers, there is no need to build in support for multiple numbers. You could store a delimited string of numbers in the field, but this would limit searching to a LIKE %...% match, which forces a table scan.
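The difference between the two layouts can be sketched in plain Python. This is an in-memory illustration, not SimpleDB client code: the contact data, the attribute names, and both helper functions are hypothetical. Against SimpleDB itself, the multi-valued layout would allow an equality query along the lines of `select * from contacts where phone = '555-0123'`, assuming a domain named contacts.

```python
# An item modeled as attribute name -> set of values (hypothetical data).
contact = {
    "name": {"Alice Example"},
    "phone": {"555-0100", "555-0123"},
}

# The same contact with all numbers packed into one delimited string.
contact_delimited = {
    "name": "Alice Example",
    "phone": "555-0100;555-0123",
}


def has_value(item, attribute, value):
    """Exact match against any one of the attribute's values."""
    return value in item.get(attribute, set())


def has_value_delimited(item, attribute, value):
    """Substring match, the only option for a delimited string."""
    return value in item.get(attribute, "")


exact_hit = has_value(contact, "phone", "555-0123")
partial_multi = has_value(contact, "phone", "555-01")
partial_delimited = has_value_delimited(contact_delimited, "phone", "555-01")
```

With separate values, a partial number such as 555-01 does not match, whereas the substring search over the delimited string reports a match for it. Adding another number is just one more value in the set.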
InfoQ: In chapter 6 you have a great explanation of SimpleDB queries, including the ability to limit the number of results. Does SimpleDB support data pagination or cursors?
No. SimpleDB does not have the concept of cursors. It does break response data into chunks limited by data volume. The first buffer is delivered with a pointer (Next Token) to the balance of the results. The system does not perform the query again. It just delivers the next buffer.
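A client typically wraps this behavior in a loop that keeps requesting the next buffer until no token comes back. The sketch below assumes a hypothetical callable, select_page, that takes a query expression and a token and returns a pair of (items, next_token). The in-memory fake stands in for the service so the loop can be run without an AWS account, and its offset-based token is an invention of this example.

```python
def select_all(select_page, expression):
    """Yield every item by following the next token until it runs out.

    select_page is a hypothetical callable:
    (expression, next_token) -> (items, next_token or None).
    """
    next_token = None
    while True:
        items, next_token = select_page(expression, next_token)
        yield from items
        if next_token is None:
            break


def make_fake_select_page(rows, page_size):
    """In-memory stand-in for the service; the token is just an offset."""
    def select_page(expression, next_token):
        start = int(next_token) if next_token is not None else 0
        end = start + page_size
        token = str(end) if end < len(rows) else None
        return rows[start:end], token
    return select_page


rows = ["item-%d" % i for i in range(7)]
fake = make_fake_select_page(rows, page_size=3)
collected = list(select_all(fake, "select * from files"))
```

Because select_all is a generator, the caller can stop early and the remaining buffers are never requested.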
InfoQ: Does SimpleDB support indexes to improve performance of queries?
Yes. SimpleDB automatically indexes your data as it is added to a domain. All fields are automatically indexed to enable searching. This is also where multiple values come into play. In the telephone example above, the program could search on any of the multiple phone numbers quickly. In an RDBMS, a separate table joined to the master table would be needed to enable quick queries on any phone number.
InfoQ: In chapter 9 of the book you write: “Caching can help alleviate both the issue of making extra requests to SimpleDB and the issue with eventual consistency.” Typically caching is used to improve scalability and performance of the underlying database. How do these two come together?
If a user is searching for their own updates, the cache would have their latest data. This could help with eventual consistency issues. We agree scalability and performance are typically why caching is used.
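The idea can be sketched as a thin layer that updates the cache on every write. This is a minimal illustration under stated assumptions: CachedDomain and LaggingStore are hypothetical names, a plain dict stands in for memcached, and the fake store deliberately hides writes until sync() is called in order to mimic replication lag. A real deployment would also need cache expiry, which is omitted here.

```python
class LaggingStore:
    """Fake store whose reads do not see writes until sync() (hypothetical)."""

    def __init__(self):
        self._written = {}
        self._visible = {}
        self.reads = 0  # counts requests that reach the store

    def put(self, key, attrs):
        self._written[key] = attrs

    def get(self, key):
        self.reads += 1
        return self._visible.get(key)

    def sync(self):
        """Make all earlier writes visible, as if replication finished."""
        self._visible.update(self._written)


class CachedDomain:
    """Writes go to both the store and the cache; reads try the cache first."""

    def __init__(self, store, cache=None):
        self.store = store
        self.cache = {} if cache is None else cache

    def put(self, key, attrs):
        self.store.put(key, attrs)
        self.cache[key] = attrs

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        attrs = self.store.get(key)
        if attrs is not None:
            self.cache[key] = attrs
        return attrs


store = LaggingStore()
domain = CachedDomain(store)
domain.put("user-1", {"files": ["a.jpg"]})
direct = store.get("user-1")     # the store itself is still stale
cached = domain.get("user-1")    # the cache already has the user's update
```

The user who made the update reads it back from the cache straight away, and that read never reaches the store, which is how one mechanism addresses both the extra requests and the consistency lag.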
InfoQ: In your book, one of the measures you propose to improve the security of memcached is to put it behind a firewall. Do you consider this measure sufficient, considering that the majority of system breaches originate from inside the company?
Putting it behind your firewall will remove the threat from the external malicious users. If an inside employee has full access to the cache area then they probably have access to other areas of exposure. If internal security is a key concern, just encrypt the cache.
InfoQ: When describing the logic of using caching with SimpleDB, you assume that there is only one cache instance for a given SimpleDB domain. Is this a safe assumption, especially for a multinational company?
This is just one example of using a cache with SDB. It is entirely possible to use a cluster of cache servers instead if your application needs dictate it.
InfoQ: When describing caching, instead of a read/write-through cache you propose independent reads and updates to both the cache and SimpleDB in the client code. What is the rationale for such an approach?
The examples in the book were written to explain the mechanics of caching with SimpleDB. In the PHP interface there are even comments on adding caching to the API. In a real application, caching would usually be done either in the API or in a layer between the API and the application.
InfoQ: Can you explain whether using the SimpleDB batch APIs affects the cost of its usage, and what this impact is?
Batch is cheaper than doing single calls. This is covered in the batch put discussion of box usage in Chapter 8. There are no studies that I know of that give an equation for how much cheaper the batch APIs are.
InfoQ: It is not immediately clear whether, in the case of parallel processing (multithreading) of SimpleDB interactions, every thread requires an independent connection. Does this impact the cost of SimpleDB usage?
Every thread will require an independent connection. As such, each thread has an associated cost. Multithreading is a performance technique. It does not affect costs.
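Before items can be sent in batches they have to be grouped. The helper below is a small sketch of that step only; the function name chunked and the sample items are hypothetical, and the default of 25 assumes the per-call item limit documented for BatchPutAttributes. It makes no claim about how much box usage a batch saves.

```python
def chunked(items, size=25):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# 60 hypothetical items become three calls instead of 60 single puts.
items = [("item-%d" % i, {"color": "red"}) for i in range(60)]
batches = chunked(items)
batch_sizes = [len(batch) for batch in batches]
```

Each batch would then be passed to whatever batch put call the chosen client library exposes.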
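One common way to give every thread its own connection is thread-local storage. The sketch below is an assumption-laden illustration, not the book's code: FakeConnection stands in for a real SimpleDB client, and get_connection, put_item, and put_all are hypothetical helpers. The created list exists only so the number of connections can be inspected afterwards.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()
created = []                      # every connection ever opened
created_lock = threading.Lock()


class FakeConnection:
    """Stand-in for a real SimpleDB client connection (hypothetical)."""

    def put(self, item):
        return item


def get_connection():
    """Return this thread's connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = FakeConnection()
        _local.conn = conn
        with created_lock:
            created.append(conn)
    return conn


def put_item(item):
    """Store one item using the calling thread's own connection."""
    return get_connection().put(item)


def put_all(items, workers=4):
    """Store items in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(put_item, items))


results = put_all(list(range(20)), workers=4)
connections_opened = len(created)
```

Each worker thread opens a connection once and reuses it for every item it handles, so the number of connections is bounded by the number of workers rather than the number of items.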