InfoQ

CloudCrowd - A 100% Ruby Cloud Solution

Posted by Sebastien Auvray on Sep 14, 2009

Sections: Architecture & Design, Development, Operations & Infrastructure
Topics: Cloud Computing, Ruby, Architecture
Tags: CloudCrowd, RightScale, Nanite

One year ago, The New York Times and ProPublica filed a grant proposal in the Knight News Challenge 2009 competition. DocumentCloud was awarded $719,500 and given the mission to build a document-based application that makes it easy to organize and examine documents. With highly expensive processing tasks to run in parallel, DocumentCloud decided to implement its own cloud solution in 100% Ruby: CloudCrowd.

DocumentCloud primarily uses CloudCrowd for PDF processing, but it can be used for other expensive tasks such as:

  • Generating or resizing images
  • Running text extraction or OCR on PDFs
  • Encoding video
  • Migrating a large file set or database
  • Web scraping

As described in the CloudCrowd architecture documentation:

CloudCrowd is not intended for large volume of small jobs, but rather for a moderate volume of gigantic, expensive jobs.

CloudCrowd was inspired by the MapReduce framework. The architecture is based on a central server with worker daemons doing the real processing. It ships with a REST-JSON API and a web console to monitor what is happening. CloudCrowd uses Amazon S3 as file storage but alternative storage can be plugged in as desired.
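To give a feel for what talking to a REST-JSON API like this might look like, here is a minimal sketch that builds a job request in Ruby. The field names (`action`, `inputs`, `options`), the `word_count` action, the helper `build_job_params`, and the server URL in the comment are all assumptions made for illustration; check the CloudCrowd documentation for the actual request format.

```ruby
require 'json'

# Hypothetical helper: builds the form parameters for a job request.
# The field names ('action', 'inputs', 'options') are assumptions and
# may not match the real API.
def build_job_params(action, inputs, options = {})
  job = { 'action' => action, 'inputs' => inputs, 'options' => options }
  { 'job' => JSON.generate(job) }
end

params = build_job_params(
  'word_count',                              # hypothetical action name
  ['http://example.com/documents/one.txt']   # placeholder input URL
)
puts params['job']

# Submitting the job would then be a plain HTTP POST to the central
# server (the URL below is a placeholder, not a documented endpoint):
#   require 'net/http'
#   Net::HTTP.post_form(URI('http://localhost:9173/jobs'), params)
```

The sketch only builds and prints the payload, so it runs without a server. The point is that a job is plain JSON over HTTP, which any client can produce.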

InfoQ had the chance to catch up with the author Jeremy Ashkenas. Jeremy is a fresh hire at DocumentCloud and is also the author of Ruby-Processing.

InfoQ: How does CloudCrowd compare to RightScale Gems or Nanite? Why did you create your own solution?

Jeremy Ashkenas: So, CloudCrowd actually uses the RightScale AWS gem. It uses S3 as the distributed data store for all results (both intermediate and final). To go off on a tangent, it would be an interesting future direction, and a wonderful contribution, to support autoscaling of a CloudCrowd cluster by launching more EC2 instances via the RightScale gem. The central server has all the information needed to decide to autoscale -- it knows the number of workers, their locations, their status, and the size of the work queue. It's just a matter of determining the algorithm for instance launching, and making sure that new instances have all requisite dependencies installed.

As for a comparison to Nanite, CloudCrowd is trying to be a simple, comprehensible tool that Ruby hackers can easily understand and customize. It uses technologies that most Ruby developers are familiar with, like ActiveRecord and standard ActiveRecord databases, and HTTP and S3 for communication and scaling. All of this is eminently debuggable and hackable. The minimal CloudCrowd action is a simple Ruby class that defines a "process" method, performing the parallel part of the computation and saving it to S3. It should be thought of as an alternative to industrial-strength systems like Hadoop and Nanite. No Erlang, AMQP, or RabbitMQ required. It doesn't promise to scale up to extreme sizes, but should hold up pretty well for most uses. A lot is going to depend on how effectively you can parallelize your problem.
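The "simple Ruby class that defines a process method" might look roughly like the sketch below. So that the example runs on its own, it uses a stand-in base class, `Sketch::Action`, with a fake `save` helper; neither is CloudCrowd's real API, and the `WordCount` action is purely illustrative.

```ruby
# Stand-in for the library's base class so the sketch is self-contained.
# Everything in this module is hypothetical, not CloudCrowd's real code.
module Sketch
  class Action
    attr_reader :input

    def initialize(input)
      @input = input
    end

    # Placeholder for "save the result to S3": returns a description of
    # what would be stored instead of uploading anything.
    def save(name, contents)
      { location: "s3://example-bucket/#{name}", contents: contents }
    end
  end
end

# A minimal action: one class, one `process` method handling one input.
class WordCount < Sketch::Action
  def process
    count = input.split.size
    save('word_count.txt', count.to_s)
  end
end

result = WordCount.new('the quick brown fox').process
puts result[:contents]
```

Each worker would run `process` on one input independently, which is why the amount of speed-up depends on how well the problem splits into separate inputs.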

InfoQ: Why did you use Ruby? Did you think about other alternatives, such as Erlang?

Jeremy Ashkenas: I used Ruby because that's what we're planning to use to glue the rest of DocumentCloud together. Note that the actual heavy lifting in many of the example actions isn't performed in Ruby itself, but by shelling out to the appropriate tools. Ruby's great as a glue language, and the actual image processing, PDF conversion, and video encoding can be handed off to GraphicsMagick, Tesseract, and FFmpeg, as is appropriate.
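As an illustration of Ruby acting as glue, the sketch below builds a GraphicsMagick resize command and shows one way to run it with the standard library's `Open3`. The helper names `resize_command` and `run_tool` and the file names are hypothetical; this is not taken from CloudCrowd's source.

```ruby
require 'open3'

# Builds the argument list for a GraphicsMagick resize.
# Passing arguments as an array avoids shell quoting problems.
def resize_command(input_path, output_path, geometry)
  ['gm', 'convert', input_path, '-resize', geometry, output_path]
end

# Runs an external tool and returns its combined stdout and stderr,
# raising if the tool exits with a non-zero status.
def run_tool(command)
  output, status = Open3.capture2e(*command)
  raise "#{command.first} failed: #{output}" unless status.success?
  output
end

command = resize_command('page.png', 'page_thumb.png', '200x200')
puts command.join(' ')

# Actually running it requires GraphicsMagick on the PATH:
#   run_tool(command)
```

The same pattern would apply to other command-line tools such as Tesseract or FFmpeg, with only the argument list changing.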
