Need to Scale Fast? Just Re-Architect it!

Posted by Gavin Terrill on Jun 13, 2008

Sections: Development, Architecture & Design
Topics: Performance & Scalability, Architecture, Ruby
Tags: Memcache

While geeks talk about the perils of the "/. effect" (the Slashdot effect), it turns out that the most trafficked page on the Internet is in fact the Yahoo! home page. Lukas Biewald tells the story of how his team's site, FaceStat, had to scale fast when it was hit with 100,000 visitors after being featured there:

Sunday morning I was sitting in my kitchen reading the newspaper and thinking about brunch, when Chris called me, saying he had a bunch of voicemails on his home phone in Iowa claiming something about our web app FaceStat being down. FaceStat is a site we made to show off our Dolores Labs crowdsourcing technology, and has had a small loyal following since we made it live about a month ago. I checked the site and it gave me a 500 error — only 1 in 10 requests seemed to get me an actual page — so I logged into our app server and saw the disk was full. The log file had grown to 20 GB! I deleted it and asked my friend Zuzka to check and see if we’d been Slashdotted.
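
The full disk in this account came from a single log file growing without bound. As a minimal sketch of one way to cap log growth, assuming a Python application and the standard library's RotatingFileHandler (FaceStat itself ran on Merb, so this is illustrative and not what the team did; the logger name, file name, and size limits are hypothetical):

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_capped_logger(path, max_bytes, backups):
    """Return a logger whose files on disk stay within roughly
    max_bytes * (backups + 1) bytes in total."""
    logger = logging.getLogger("capped-demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # When the active file would exceed max_bytes it is renamed to
    # app.log.1, app.log.2, ... and the oldest backup is discarded.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory):
    """Sum the sizes of every file in the log directory."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


log_dir = tempfile.mkdtemp()
logger = make_capped_logger(os.path.join(log_dir, "app.log"), max_bytes=1000, backups=2)
for i in range(500):
    logger.info("request %d failed with a 500", i)

print("files kept:", sorted(os.listdir(log_dir)))
print("bytes on disk:", total_log_bytes(log_dir))
```

With a cap like this, a flood of errors overwrites old log data instead of filling the disk, at the cost of losing the oldest entries.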

The team's first response was to put up a static page, but even that could not stem the tide:

Unbelievably, our webserver (nginx) couldn’t even reliably show that static page… Brendan discovered that we were exceeding the system’s open file limit — set at 100,000 — because connections were counting as open files.
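
Because every client connection consumes a file descriptor, the open-file limit puts a ceiling on concurrent connections. The following is a rough sketch, assuming a Unix host and Python's standard resource module, of how to inspect the per-process limit and estimate headroom. The quote does not say whether the limit they hit was per-process or system-wide, and the reserve of 64 descriptors is a hypothetical figure:

```python
import resource  # Unix-only standard library module


def current_nofile_limits():
    """Return the (soft, hard) open-file limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def descriptor_headroom(soft_limit, open_connections, reserved=64):
    """Descriptors left after client sockets and a small reserve
    (log files, upstream sockets, and so on; 64 is an assumed value).

    A negative result means new connections would be refused because
    the process cannot open any more descriptors.
    """
    return soft_limit - reserved - open_connections


soft, hard = current_nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
# The limit quoted above was 100,000; 120,000 connections is a made-up load.
print("headroom:", descriptor_headroom(100_000, 120_000))
```

A negative headroom under expected peak load suggests raising the limit or shedding connections before the spike arrives, not during it.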

The team then went about getting more capacity from their hosting provider, adding more caching, and removing features:

While Brendan worked on setting up boxes I started ripping out every database intensive feature of our system and Chris added more caching… Around 1 AM we were back online and looking pretty stable.

On Monday the load continued, so the team added memcached and monitoring tools, and moved the database to a larger machine:

So now it’s Tuesday night and the site seems to be cranking along under 50x the load that used to work on one box. We have 6 app servers and a big database machine. I’m really impressed what awesome hackers Chris and Brendan are and what amazing tools are available these days. Slicehost has scaled up as fast as we’ve needed them to. Amazon’s S3 serves all the images, and while the latency isn’t great, we never could have dealt with the bandwidth issues on our own. Capistrano lets us deploy and rollback everywhere; git with github lets us all hack frantically on the same codebase then merge and deploy. God keeps all the servers running, and memcached has given us great caching with very little pain (mostly… :) ). 
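
The caching described here is commonly done with a cache-aside pattern: look in the cache first, and run the expensive query only on a miss. The sketch below assumes that pattern; the article does not show FaceStat's actual code. TinyCache is a hypothetical in-memory stand-in for a memcached client, and the key name and statistics are invented for illustration:

```python
import time


class TinyCache:
    """In-memory stand-in for a memcached client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def cached(cache, key, ttl_seconds, compute):
    """Cache-aside lookup: return the cached value, or compute and store it.

    Note: a computed value of None is never cached in this sketch.
    """
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, ttl_seconds)
    return value


calls = []


def expensive_stats():
    """Pretend database query; records each time it actually runs."""
    calls.append(1)
    return {"faces_rated": 42}


cache = TinyCache()
first = cached(cache, "stats:home", 60, expensive_stats)
second = cached(cache, "stats:home", 60, expensive_stats)
print("database hits:", len(calls))
```

The second lookup is served from the cache, so the database is queried once per expiry window regardless of how many visitors request the page.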

Lukas describes the lessons learned over the three days:

It’s one thing to code scalably and grow slowly under increasing load, but it’s been a blast to crazily rearchitect a live site like FaceStat in a day or two. I figure at this point we’ve been on the number 1 (or 2) page on the internet, so there’s no bigger instant spike in traffic that could happen to us… Some lessons I learned for the next time this happens:

(1) Monitor the site better. We had exception handling emailing us, but there were so many exceptions that I didn’t really look at them and I wasn’t online. It wouldn’t have made sense to scale our site to handle this kind of load in advance, but it’s unfortunate we had to rely on random people deciding to lookup Chris’s email address to call his home phone number to yell at him…

(2) Don’t be afraid to put up an error page. We had lots of excited users emailing us when we had a page up saying we were down and explaining why. We had lots of angry users emailing us when the site was up but with intolerable lag or crashing intermittently. I think wishful thinking caused us to put up the site an hour or two before it was ready.

(3) A statically generated homepage is a very good thing and memcached is awesome.
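
Lesson (3) can be illustrated with a small sketch: render the homepage once, write it to disk, and let the web server serve the file without touching the application. This is one possible approach and not the team's implementation. The function names, the page markup, and the file layout are all hypothetical:

```python
import html
import os
import tempfile


def render_homepage(top_faces):
    """Build the homepage markup from precomputed data."""
    items = "".join(f"<li>{html.escape(name)}</li>" for name in top_faces)
    return f"<html><body><ul>{items}</ul></body></html>"


def publish_static_page(markup, target_path):
    """Write the page atomically so a half-written file is never served."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(markup)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
        os.replace(tmp_path, target_path)  # atomic rename over the old page
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


site_root = tempfile.mkdtemp()  # stands in for the web server's document root
index_path = os.path.join(site_root, "index.html")
publish_static_page(render_homepage(["alice", "bob"]), index_path)
print(open(index_path, encoding="utf-8").read())
```

A scheduled job could call publish_static_page every few minutes, so homepage traffic costs one render per interval instead of one per visitor.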

Brendan O'Connor talks about the technology behind the FaceStat application in a follow-up article:

Yes, we’re pretty much using Rails. We actually use an offshoot called Merb — which is a bit more efficient — on top of Thin. We find that a Rails-like platform is invaluable for rapidly prototyping a new site, especially since we started FaceStat as a pure experiment with no idea whether people would like it or not, and with a very different feature set in mind compared to what it later became. And it’s invaluable that Chris on our team is such a Ruby expert :).

However, the high-level platform really doesn’t matter compared to overall architecture: how we use the database (postgres), how much we cache (memcached/merb-cache), how we distribute load, how we deploy new systems (xen/slicehost), etc. It hasn’t been trivial since FaceStat is write-heavy and performs fairly complex statistical calculations, and various issues remain. But we are serving many users at nearly 100x our old load, so something must be going right, at least for now!


    On Scalability

    by Talip Ozturk

    Places for scalability stories like this one:

    1. High Scalability blog

    2. Werner Vogels' blog

    3. Cloud computing group

    4. If you live in New England

    5. InfoQ scalability

    6. Grid Today




    -talip

    Hazelcast- distributed data structures for Java


    jroller.com/talipozturk


    The Yahoo Effect

    by Oren Hurvitz

    I wrote my thoughts about their experience in The Yahoo Effect.
    In short, it's usually not cost-effective to prepare for massive traffic spikes in advance. You can either buy a lot of hardware, which may go unused, or design your application to use Cloud Computing; but that's often not a viable alternative for existing applications.


    Re: The Yahoo Effect

    by Jason Carreira

    Then maybe the answer is to make it easier to deploy on a cloud? The right answer is very rarely to do it wrong the first time...
