Skynet, A New Ruby MapReduce

by Sebastien Auvray on Jan 29, 2008
The MapReduce design pattern for distributing data processing was introduced by Google in 2004, along with a C++ implementation. A new Ruby implementation, Skynet, has now been released by Adam Pisoni.

Skynet is an adaptive, self-upgrading, fault-tolerant, and fully distributed system with no single point of failure.

There are two key differences between Google's design paper and Skynet:
If a worker dies or fails for any reason, another worker will notice and pick up that task. Skynet also has no special ‘master’ servers, only workers which can act as a master for any task at any time. Even these master tasks can fail and will be picked up by other workers.
Skynet is very easy to use and set up, which is the real strength of the MapReduce concept. Skynet also extends ActiveRecord with MapReduce features such as distributed_find.

Another MapReduce-like library, Starfish, has existed in Ruby for about a year and a half. After reading about Peter Cooper's mixed feelings about Starfish, InfoQ caught up with Adam Pisoni to ask about Skynet's features and how it compares with Starfish.

How do you compare Skynet with Starfish?
I looked at Starfish before developing Skynet and decided it wasn't robust enough for what I needed. Starfish is a simple system with a great many limitations in terms of scalability and control. Also there is some question as to how well starfish actually distributes tasks since Ruby actually can't marshal and send code blocks over the wire, only references to it. So if you say "run block X on machine Y", machine Y will merely call back for that code to be run on the machine where the block was defined. In that sense, I'm not sure how it's distributed.

The part of Starfish I'm still confused about, and even exchanged some emails with the author about, is how it deals with actual code distribution with DRb. In Starfish you just provide a block of code to use for the map. It turns that block into a DRb object and sends a reference to that object over to a worker. That worker is supposed to execute that code locally... but Ruby DRb does not allow this. The code is always executed on the machine it was compiled on. So as long as all your workers are running on the same machine it would work, but as soon as you try running workers on another machine it will only appear as though the code is being sent to that machine, when in fact the code is really executed on the source.
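The marshaling limitation described above can be checked with a minimal sketch using only Ruby's built-in Marshal module. The helper name marshalable? is hypothetical and is not part of Starfish, Skynet or DRb; it only shows that plain data can be serialized while a Proc cannot, which is why only a reference to a block can cross the wire.

```ruby
# Hypothetical helper (not part of Starfish, Skynet or DRb):
# returns true when Marshal can serialize the object, false otherwise.
def marshalable?(object)
  Marshal.dump(object)
  true
rescue TypeError
  false
end

data_slice = [1, 2, 3]           # plain data, like a slice of map_data
map_block  = proc { |n| n * 2 }  # a block of code, like a map function

puts marshalable?(data_slice)    # plain data can be serialized
puts marshalable?(map_block)     # a Proc cannot
```

Because the block itself cannot be serialized, a remote worker can only hold a handle to it, and calling that handle runs the code in the process that defined it.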

The other big limitation of Starfish is that you cannot run a job asynchronously. So, for example, if you want an action on a web page to kick off a map/reduce process, you can't just start a Starfish job and let it run while you move on. Whoever kicks off a Starfish job has to wait for that job to finish.
In Starfish you write these little apps with the code you want to run built into them. Unless I'm mistaken, you can't run multiple types of MR jobs on the same machine. Skynet is a general MR system that can be running many jobs of many types, i.e. a lot of different code.

Can you tell us about the strength of Skynet?
Skynet is built on top of a message queue, and you can choose which message queue to use based on your scalability requirements. It currently supports TupleSpace and MySQL. We use MySQL because it is far more scalable than TupleSpace.
You can then create jobs with total flexibility over how Skynet will distribute and execute them. The most common case at Geni is to run a job asynchronously (which Starfish can't do). So you create a new MR job, call run on it, and it immediately returns. In the background your job is added to the queue and gets worked on by workers. You can later retrieve the results by calling results on the job object.
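The run-now, fetch-results-later pattern can be sketched in plain Ruby. This is a toy under stated assumptions and is not Skynet's API: ToyJob, its constructor arguments and the use of a local thread in place of a queue and remote workers are all hypothetical.

```ruby
# Toy stand-in for an asynchronous map/reduce job. NOT Skynet's API:
# ToyJob and its methods are hypothetical names used for illustration,
# and a local thread stands in for the message queue and workers.
class ToyJob
  def initialize(map_data, mapper, reducer)
    @map_data = map_data
    @mapper   = mapper
    @reducer  = reducer
    @worker   = nil
  end

  # Starts the work in the background and returns immediately.
  def run
    @worker = Thread.new do
      mapped = @map_data.map { |item| @mapper.call(item) }
      @reducer.call(mapped)
    end
    self
  end

  # Blocks until the background work has finished, then returns its result.
  def results
    raise "call run first" unless @worker
    @worker.value
  end
end

square = lambda { |n| n * n }
add_up = lambda { |values| values.inject(0) { |sum, v| sum + v } }

job = ToyJob.new([1, 2, 3, 4], square, add_up)
job.run
# ... the caller is free to do other work here ...
total = job.results
puts total
```

The point of the design is that the caller, for example a web request handler, does not have to wait for the job before moving on.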

Skynet also supports failover. Workers watch each other. If one worker fails to complete a task in time, another worker can pick it up and try to complete it. Skynet also supports streaming map_data, meaning it can work with very large data sets, too large to put into a single data structure.
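One way timeout-based failover could work is sketched below. The Task struct, the TaskBoard class and the integer clock are hypothetical and say nothing about how Skynet actually stores or monitors tasks; the sketch only illustrates the rule "if a task is not finished in time, another worker may take it".

```ruby
# Hypothetical lease-based task table; not Skynet's actual data model.
Task = Struct.new(:id, :owner, :taken_at, :done)

class TaskBoard
  def initialize(tasks, timeout)
    @tasks   = tasks
    @timeout = timeout
  end

  # A worker claims the first unfinished task that is either unowned or
  # has been held by its owner for longer than the timeout.
  def claim(worker, now)
    task = @tasks.find do |t|
      !t.done && (t.owner.nil? || now - t.taken_at > @timeout)
    end
    return nil unless task
    task.owner    = worker
    task.taken_at = now
    task
  end
end

board = TaskBoard.new([Task.new(1, nil, nil, false)], 10)
board.claim("worker-a", 0)           # worker-a takes the task, then dies
early = board.claim("worker-b", 5)   # still within the timeout: nothing to claim
late  = board.claim("worker-b", 11)  # timeout elapsed: worker-b takes over
```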

What is map_data streaming?
Most of the time when you want to run a map_reduce job you have to provide an array of data you want split up and executed in parallel. What if your array is too large to fit into memory? In this case you provide Skynet with an Enumerable instead of an array. Skynet knows to call :next or :each on that object and start spinning off map_tasks for each item. In this way, no one ever tries to create some massive data structure all at once.
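The streaming idea can be illustrated with Ruby's standard Enumerator. The helper each_map_task and the slice size are hypothetical and are not taken from Skynet; the sketch only shows that tasks can be produced one slice at a time without ever building the full array.

```ruby
# Hypothetical helper (not Skynet's API): yields map tasks one slice at a
# time from any Enumerable, without materializing the whole data set.
def each_map_task(map_data, slice_size)
  return enum_for(:each_map_task, map_data, slice_size) unless block_given?
  map_data.each_slice(slice_size) { |slice| yield slice }
end

# An Enumerator that produces a million ids on demand instead of
# holding them all in memory.
ids = Enumerator.new do |yielder|
  1.upto(1_000_000) { |id| yielder << id }
end

# Only the first two slices are ever generated here.
first_tasks = each_map_task(ids, 3).first(2)
p first_tasks
```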

Any other features you want to talk about?
There are many more features, but the last feature I'd mention is how well Skynet integrates with your existing applications, including Rails apps. It even comes with an extension to ActiveRecord that will let you execute some task on all of your models in a distributed manner. We use this functionality at Geni to run particularly complex migrations that involve executing some Ruby code on millions of models.
> Model.distributed_find(:all, :conditions => "id > 20").each(:somemethod)
As long as you're running Skynet, it will execute :somemethod on each model, but in a distributed manner (on as many workers as you have). It does this without instantiating the models before distributing them, or even fetching all the ids ahead of time. So it can work on infinitely large data sets.
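One way to distribute work over models without fetching every id first is to hand out id ranges. The sketch below is an assumption for illustration only: id_batches is a hypothetical helper, and the interview does not say that distributed_find works this way.

```ruby
# Hypothetical helper, not part of Skynet or ActiveRecord: lazily splits an
# id range into [low, high] bounds that could each be sent to a worker,
# which would then load only the models inside its own range.
def id_batches(min_id, max_id, batch_size)
  return enum_for(:id_batches, min_id, max_id, batch_size) unless block_given?
  low = min_id
  while low <= max_id
    high = [low + batch_size - 1, max_id].min
    yield [low, high]
    low = high + 1
  end
end

# Ids above 20, up to an assumed maximum id of 50, in batches of 12.
batches = id_batches(21, 50, 12).to_a
p batches
```

Each pair of bounds is small enough to send over a queue, so the coordinating process never needs to hold the models or the full id list in memory.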

What feedback have you had from users?
There are a few people starting to use it, but it's definitely in the early stages. Release 0.9.2 is a pretty significant release with a lot of rewrites, performance improvements, and feature enhancements. We've submitted a proposal to give a talk on Skynet at RailsConf, but haven't heard back yet. We're also planning on creating a screencast on how to use Skynet.


Performance by Matan S

Any comment about performance? This is usually a problem in Ruby applications.

MapReduce or Actors by Alex Popescu

If my understanding is not completely wrong, this sounds like an Actors implementation.

Also, the following comment is a bit confusing to me:
"Also there is some question as to how well starfish actually distributes tasks since Ruby actually can't marshal and send code blocks over the wire, only references to it."

I don't know what Google's implementation looks like, but I am having a hard time understanding how C++ would distribute code blocks and execute them on various machines.

./alex
--
.w( the_mindstorm )p.

Re: Performance by Adam Pisoni

The complaint about performance is leveled against almost all dynamic languages, including Java. In almost all cases, the critics have a point that these languages are slower than C, but they miss the point that the performance tradeoff is made consciously, for two reasons. The first is the belief that engineering resources are more valuable than computer resources. Obviously this argument has a limit, which brings us to the second argument: Ruby is being used in plenty of large scale production environments where performance is important. Ruby is not likely slower than any other interpreted language, be it Java or Perl. A map/reduce framework written in Ruby will be slower than one written in C, but not necessarily slower than one written in Java.

Re: MapReduce or Actors by Adam Pisoni

Good point, and to be honest I'm not sure how they do this internally at Google. I guess there's just an assumption made that you should be able to pass code blocks in a dynamic language... though this is a poor assumption. I actually haven't given up implementing this in Skynet, though I still question how much utility it has. In a dynamic language like Ruby, you tend to write very little code that relies on a great deal of other code. How much code would you really want to send with each data slice? How much code would that code rely on? How would you know whether that code is on the worker machines? Given all of this ambiguity, it seems the idea of passing code has limited real world value. That said, having a system that is self-upgrading is very useful. Skynet is self-upgrading in a rudimentary way, but we have big ideas for how it might upgrade more intelligently in the future.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.