InfoQ

Need to Scale Fast? Just Re-Architect it!

Posted by Gavin Terrill on Jun 13, 2008 01:00 AM

Community: Architecture, Ruby
Topics: Performance & Scalability
Tags: Memcache

While geeks talk about the perils of the "/. effect" (being Slashdotted), it turns out that the most trafficked site on the Internet is in fact the Yahoo! home page. Lukas Biewald tells the story of how his team's site, FaceStat, had to scale fast when it was hit with 100,000 visitors after being featured on the Yahoo! home page:

Sunday morning I was sitting in my kitchen reading the newspaper and thinking about brunch, when Chris called me, saying he had a bunch of voicemails on his home phone in Iowa claiming something about our web app FaceStat being down. FaceStat is a site we made to show off our Dolores Labs crowdsourcing technology, and has had a small loyal following since we made it live about a month ago. I checked the site and it gave me a 500 error — only 1 in 10 requests seemed to get me an actual page — so I logged into our app server and saw the disk was full. The log file had grown to 20 GB! I deleted it and asked my friend Zuzka to check and see if we’d been Slashdotted.
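The 20 GB log file that filled the disk is the kind of failure a size cap can prevent. The following is a minimal sketch, not what the FaceStat team did: it assumes Ruby's standard `Logger`, and the helper name `bounded_logger` and its default sizes are hypothetical choices for illustration.

```ruby
require "logger"

# Hypothetical helper: a logger whose files rotate by size, so total disk use
# stays bounded instead of growing until the disk is full.
def bounded_logger(path, keep_files: 5, max_bytes: 10 * 1024 * 1024)
  # Logger.new(logdev, shift_age, shift_size): with an integer shift_age,
  # the log is rotated once it exceeds shift_size bytes, and only a fixed
  # number of rotated files is kept; older ones are overwritten.
  Logger.new(path, keep_files, max_bytes)
end

# Example (path is an assumption):
#   log = bounded_logger("log/production.log")
#   log.info("request served")
```

With the defaults above, disk use for this log stays in the region of 50 MB, whatever the traffic.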

The first response was to create a static page, but they found that this couldn't stem the tide:

Unbelievably, our webserver (nginx) couldn’t even reliably show that static page… Brendan discovered that we were exceeding the system’s open file limit — set at 100,000 — because connections were counting as open files.
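Because each socket consumes a file descriptor, the open-file limit caps concurrent connections as well as files. The sketch below shows one way to inspect that limit from Ruby on a Unix-like system. The helper names and the `reserve` figure are assumptions for illustration; the quote does not say how the team diagnosed or raised the limit.

```ruby
# Sketch: report how close a process could get to its open-file limit.
# Sockets count against RLIMIT_NOFILE just like regular files do.

# Returns the soft and hard limits on open file descriptors for this process.
def open_file_limits
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  { soft: soft, hard: hard }
end

# Hypothetical helper: true when the expected number of simultaneous
# connections, plus a reserve for log files, database sockets and so on,
# fits under the given soft limit.
def fits_under_limit?(expected_connections, soft_limit, reserve: 1_000)
  expected_connections + reserve <= soft_limit
end

limits = open_file_limits
puts "soft limit: #{limits[:soft]}, hard limit: #{limits[:hard]}"
puts "room for 100,000 connections? #{fits_under_limit?(100_000, limits[:soft])}"
```

A check like this only reports the limit. Raising it is a matter of system and web server configuration, which depends on the setup.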

The team then went about getting more capacity from their hosting provider, adding more caching, and removing features:

While Brendan worked on setting up boxes I started ripping out every database intensive feature of our system and Chris added more caching… Around 1 AM we were back online and looking pretty stable.

On Monday the load continued, so the team added memcached and monitoring tools, and moved the database to a larger machine:

So now it’s Tuesday night and the site seems to be cranking along under 50x the load that used to work on one box. We have 6 app servers and a big database machine. I’m really impressed what awesome hackers Chris and Brendan are and what amazing tools are available these days. Slicehost has scaled up as fast as we’ve needed them to. Amazon’s S3 serves all the images, and while the latency isn’t great, we never could have dealt with the bandwidth issues on our own. Capistrano lets us deploy and rollback everywhere; git with github lets us all hack frantically on the same codebase then merge and deploy. God keeps all the servers running, and memcached has given us great caching with very little pain (mostly… :) ). 
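The caching the team describes typically follows a cache-aside pattern: look in the cache first, and only run the expensive query on a miss. The sketch below illustrates the idea with an in-process Hash standing in for memcached. The class `TinyCache`, the key name, and the fake query are all hypothetical and are not FaceStat's code or the memcached client API.

```ruby
# Stand-in for a memcached client: an in-process Hash with expiry times.
class TinyCache
  # clock is injectable so expiry can be exercised without sleeping.
  def initialize(clock: -> { Time.now.to_f })
    @store = {}
    @clock = clock
  end

  # Returns the cached value for key if it has not expired; otherwise runs
  # the block, stores its result for ttl seconds, and returns it.
  def fetch(key, ttl:)
    entry = @store[key]
    return entry[:value] if entry && entry[:expires_at] > @clock.call

    value = yield
    @store[key] = { value: value, expires_at: @clock.call + ttl }
    value
  end
end

# Hypothetical expensive query; counts how often the "database" is hit.
db_hits = 0
expensive_stats = lambda do
  db_hits += 1
  { average_rating: 4.2 }
end

cache = TinyCache.new
3.times { cache.fetch("stats:face:42", ttl: 60) { expensive_stats.call } }
puts "database hits: #{db_hits}" # prints "database hits: 1"
```

The time-to-live trades freshness for load: on a write-heavy site, a short TTL keeps cached statistics from drifting too far from the database.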

Lukas describes the lessons learned over the 3 days:

It’s one thing to code scalably and grow slowly under increasing load, but it’s been a blast to crazily rearchitect a live site like FaceStat in a day or two. I figure at this point we’ve been on the number 1 (or 2) page on the internet, so there’s no bigger instant spike in traffic that could happen to us… Some lessons I learned for the next time this happens:

(1) Monitor the site better. We had exception handling emailing us, but there were so many exceptions that I didn’t really look at them and I wasn’t online. It wouldn’t have made sense to scale our site to handle this kind of load in advance, but it’s unfortunate we had to rely on random people deciding to lookup Chris’s email address to call his home phone number to yell at him…

(2) Don’t be afraid to put up an error page. We had lots of excited users emailing us when we had a page up saying we were down and explaining why. We had lots of angry users emailing us when the site was up but with intolerable lag or crashing intermittently. I think wishful thinking caused us to put up the site an hour or two before it was ready.

(3) A statically generated homepage is a very good thing and memcached is awesome.
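Lesson (1) describes a common failure mode: one email per exception becomes noise as soon as errors spike. One possible remedy, sketched below under stated assumptions, is to count exceptions in a sliding window and send a single alert when the rate crosses a threshold. The class name, threshold, and notifier are hypothetical; the article does not say which monitoring tools the team adopted.

```ruby
# Counts exceptions in a sliding time window and sends at most one alert per
# window once the count reaches a threshold, instead of one email per error.
class ExceptionRateMonitor
  def initialize(threshold:, window_seconds:, notifier:)
    @threshold = threshold
    @window = window_seconds
    @notifier = notifier # any callable, e.g. something that pages a human
    @timestamps = []
    @last_alert_at = nil
  end

  # Record one exception observed at time `now` (seconds, as a Float).
  def record(now = Time.now.to_f)
    @timestamps << now
    # Drop observations that have fallen out of the window.
    @timestamps.reject! { |t| t <= now - @window }
    return unless @timestamps.size >= @threshold
    # Stay quiet if an alert already went out within the last window.
    return if @last_alert_at && now - @last_alert_at < @window

    @last_alert_at = now
    @notifier.call("#{@timestamps.size} exceptions in the last #{@window}s")
  end
end

alerts = []
monitor = ExceptionRateMonitor.new(threshold: 100, window_seconds: 60,
                                   notifier: ->(msg) { alerts << msg })
500.times { |i| monitor.record(i * 0.1) } # 500 errors over 50 seconds
puts alerts.size # prints 1
```

In this run, 500 errors produce one message rather than 500, which is the difference between an alert someone reads and an inbox someone ignores.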

Brendan O'Connor talks about the technology behind the FaceStat application in a follow-up article:

Yes, we’re pretty much using Rails. We actually use an offshoot called Merb — which is a bit more efficient — on top of Thin. We find that a Rails-like platform is invaluable for rapidly prototyping a new site, especially since we started FaceStat as a pure experiment with no idea whether people would like it or not, and with a very different feature set in mind compared to what it later became. And it’s invaluable that Chris on our team is such a Ruby expert :).

However, the high-level platform really doesn’t matter compared to overall architecture: how we use the database (postgres), how much we cache (memcached/merb-cache), how we distribute load, how we deploy new systems (xen/slicehost), etc. It hasn’t been trivial since FaceStat is write-heavy and performs fairly complex statistical calculations, and various issues remain. But we are serving many users at nearly 100x our old load, so something must be going right — at least for now!


    On Scalability

    Jun 13, 2008 2:40 AM by Talip Ozturk

    Places for scalability stories like this one:
    1. High Scalability blog
    2. Werner Vogels' blog
    3. Cloud computing group
    4. If you live in New England
    5. InfoQ scalability
    6. Grid Today

    -talip
    Hazelcast- distributed data structures for Java
    http://jroller.com/talipozturk


    The Yahoo Effect

    Jun 13, 2008 12:07 PM by Oren Hurvitz

    I wrote my thoughts about their experience in The Yahoo Effect. In short, it's usually not cost-effective to prepare for massive traffic spikes in advance. You can either buy a lot of hardware, which may go unused, or design your application to use Cloud Computing; but that's often not a viable alternative for existing applications.


    Re: The Yahoo Effect

    Jun 13, 2008 3:12 PM by Jason Carreira

    Then maybe the answer is to make it easier to deploy on a cloud? The right answer is very rarely to do it wrong the first time...
