CloudCrowd - A 100% Ruby Cloud Solution
One year ago, The New York Times and ProPublica filed a grant proposal in the Knight News Challenge 2009 competition. DocumentCloud was awarded $719,500 and given the mission of building a document-based application that makes it easy to organize and examine documents. With highly expensive processing tasks to run in parallel, DocumentCloud decided to implement its own cloud solution in 100% Ruby: CloudCrowd.
DocumentCloud primarily uses CloudCrowd for PDF processing, but it can be used for other expensive tasks such as:
- Generating or resizing images
- Running text extraction or OCR on PDFs
- Encoding video
- Migrating a large file set or database
- Web scraping
As described in the CloudCrowd architecture documentation:
CloudCrowd is not intended for large volume of small jobs, but rather for a moderate volume of gigantic, expensive jobs.
CloudCrowd was inspired by the MapReduce framework. The architecture is based on a central server with worker daemons doing the real processing. It ships with a REST-JSON API and a web console to monitor what is happening. CloudCrowd uses Amazon S3 as file storage but alternative storage can be plugged in as desired.
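To make the central-server-plus-workers idea concrete, here is a minimal in-process sketch. It is not CloudCrowd code: the class name `TinyCentralServer` is hypothetical, threads stand in for worker daemons, and there is no HTTP, database, or S3 involved. It only shows the shape of the design, in which one coordinator holds the queue of work units and several workers drain it in parallel.

```ruby
# Hypothetical, simplified model of the architecture described above.
# A real deployment would use separate daemons talking over HTTP.
class TinyCentralServer
  def initialize(worker_count)
    @worker_count = worker_count
  end

  # Queue one work unit per input, let the workers drain the queue,
  # and return a hash of input => result. The block is the "real
  # processing" each worker performs. Inputs must not be nil or false,
  # because a nil pop is used as the stop signal.
  def run(inputs)
    queue = Queue.new
    inputs.each { |input| queue << input }
    queue.close # pop returns nil once the closed queue is empty

    results = {}
    lock = Mutex.new

    workers = Array.new(@worker_count) do
      Thread.new do
        while (input = queue.pop)
          output = yield(input)
          lock.synchronize { results[input] = output }
        end
      end
    end

    workers.each(&:join)
    results
  end
end
```

Calling `TinyCentralServer.new(3).run(%w[a.pdf b.pdf]) { |name| name.upcase }` hands each file name to whichever worker thread is free and collects the results once every worker has finished.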
InfoQ had the chance to catch up with the author Jeremy Ashkenas. Jeremy is a fresh hire at DocumentCloud and is also the author of Ruby-Processing.
Jeremy Ashkenas: So, CloudCrowd actually uses the RightScale AWS gem. It uses S3 as the distributed data store for all results (both intermediate and final). To go off on a tangent, it would be an interesting future direction, and a wonderful contribution, to support autoscaling of a CloudCrowd cluster by launching more EC2 instances via the RightScale gem. The central server has all the information needed to decide to autoscale -- it knows the number of workers, their locations, their status, and the size of the work queue. It's just a matter of determining the algorithm for instance launching, and making sure that new instances have all requisite dependencies installed.
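The autoscaling decision described in this answer can be sketched as a small pure function. This is only one possible heuristic, not something CloudCrowd ships: the method name, its parameters, and the defaults (`workers_per_instance`, `max_new_instances`) are all assumptions made for illustration, and the actual launching of EC2 instances is left out entirely.

```ruby
# Hypothetical heuristic: decide how many extra instances to launch
# from figures the central server already tracks (queue size and
# worker status). All names and defaults here are assumptions.
def instances_to_launch(queue_size:, busy_workers:, total_workers:,
                        workers_per_instance: 4, max_new_instances: 10)
  idle_workers = total_workers - busy_workers

  # Work units that no currently idle worker could pick up right away.
  backlog = queue_size - idle_workers
  return 0 if backlog <= 0

  # Enough instances to cover the backlog, capped to avoid runaway cost.
  needed = (backlog.to_f / workers_per_instance).ceil
  [needed, max_new_instances].min
end
```

With 8 workers that are all busy and 30 queued work units, this sketch asks for 8 new instances; with 6 idle workers and only 3 queued units, it asks for none.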
As for a comparison to Nanite, CloudCrowd is trying to be a simple, comprehensible tool that Ruby hackers can easily understand and customize. It uses technologies that most Ruby developers are familiar with, like ActiveRecord and standard ActiveRecord databases, and HTTP and S3 for communication and scaling. All of this is eminently debuggable and hackable. The minimal CloudCrowd action is a simple Ruby class that defines a "process" method, performing the parallel part of the computation and saving it to S3. It should be thought of as an alternative to industrial-strength systems like Hadoop and Nanite. No Erlang, AMQP, or RabbitMQ required. It doesn't promise to scale up to extreme sizes, but should hold up pretty well for most uses. A lot is going to depend on how effectively you can parallelize your problem.
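The "simple Ruby class that defines a process method" can be illustrated with a stand-alone sketch. It deliberately does not load the CloudCrowd gem, so it should not be read as the gem's actual API: `MemoryStore` is a hypothetical in-memory stand-in for S3, and `WordCountAction`, its constructor arguments, and the path layout are invented for the example.

```ruby
# Hypothetical stand-in for S3: keeps "files" in a hash keyed by path.
class MemoryStore
  def initialize
    @files = {}
  end

  # Store the data under the given path and return that path.
  def save(path, data)
    @files[path] = data
    path
  end

  def read(path)
    @files.fetch(path)
  end
end

# Hypothetical action: handles one input, which is the unit of work
# that can run in parallel with other inputs.
class WordCountAction
  def initialize(input_name, input_text, store)
    @input_name = input_name
    @input_text = input_text
    @store = store
  end

  # Do the work for this one input, save the result to the store,
  # and return the path it was saved under.
  def process
    count = @input_text.split.size
    @store.save("word_count/#{@input_name}.txt", count.to_s)
  end
end
```

Because each instance touches only its own input and writes to its own path, many such actions can run side by side without coordinating with each other.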
InfoQ: Why did you use Ruby? Did you think about other alternatives, such as Erlang?
Jeremy Ashkenas: I used Ruby because that's what we're planning to use to glue the rest of DocumentCloud together. Note that the actual heavy lifting in many of the example actions isn't performed in Ruby itself, but by shelling out to the appropriate tools. Ruby's great as a glue language, and the actual image processing, PDF conversion, and video encoding can be handed off to GraphicsMagick, Tesseract, and FFmpeg, as is appropriate.
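The "glue language" approach of shelling out can be sketched with Ruby's standard `Open3` library. The helper names `run_tool` and `resize_command` are hypothetical, and the sketch assumes GraphicsMagick's `gm` binary is on the PATH when the resize command is actually run. It is not taken from CloudCrowd's own actions.

```ruby
require "open3"

# Run an external tool, passing the program and its arguments as
# separate strings so that no shell interpolation takes place.
# Returns the tool's standard output, or raises if it exits non-zero.
def run_tool(*command)
  stdout, stderr, status = Open3.capture3(*command)
  raise "#{command.first} failed: #{stderr.strip}" unless status.success?
  stdout
end

# Hypothetical helper: build the argument list for resizing an image
# to a given width with GraphicsMagick's `gm convert`.
def resize_command(input, output, width)
  ["gm", "convert", input, "-resize", "#{width}x", output]
end
```

On a machine with GraphicsMagick installed, `run_tool(*resize_command("page.png", "thumb.png", 200))` would hand the pixel work to `gm`, leaving Ruby to do only the orchestration and error handling.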