Need to Scale Fast? Just Re-Architect It!
While geeks talk about the perils of the "/. effect", the most trafficked site on the Internet is in fact the Yahoo! home page. Lukas Biewald tells the story of how his team's site, FaceStat, had to scale fast when it was hit with 100,000 visitors after being featured there:
Sunday morning I was sitting in my kitchen reading the newspaper and thinking about brunch, when Chris called me, saying he had a bunch of voicemails on his home phone in Iowa claiming something about our web app FaceStat being down. FaceStat is a site we made to show off our Dolores Labs crowdsourcing technology, and has had a small loyal following since we made it live about a month ago. I checked the site and it gave me a 500 error — only 1 in 10 requests seemed to get me an actual page — so I logged into our app server and saw the disk was full. The log file had grown to 20 GB! I deleted it and asked my friend Zuzka to check and see if we’d been Slashdotted.
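The full disk described above came from a log file that grew without bound. One common guard is size-based log rotation. The following is a minimal sketch using Python's standard `logging.handlers.RotatingFileHandler`; it is not FaceStat's actual setup (the site ran on Merb), and the function name, logger name, and size limits are illustrative assumptions.

```python
import logging
import logging.handlers


def build_capped_logger(path, name="app_sketch", max_bytes=50 * 1024 * 1024, backups=5):
    """Return a logger whose files rotate at a size cap.

    Disk use stays near max_bytes * (backups + 1), because the oldest
    backup is deleted whenever a new rotation happens.
    All names and defaults here are hypothetical.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # Avoid duplicating every record into the root logger's handlers.
    logger.propagate = False
    return logger
```

With the defaults above, the log directory would hold at most six files of roughly 50 MB each, rather than one file that grows until the disk fills.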
The team's first response was to put up a static page, but even that couldn't stem the tide:
Unbelievably, our webserver (nginx) couldn’t even reliably show that static page… Brendan discovered that we were exceeding the system’s open file limit — set at 100,000 — because connections were counting as open files.
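Because each network connection consumes a file descriptor, the per-process open file limit caps how many simultaneous connections a server can hold. As a rough illustration, assuming a Unix-like system, a process can read its own limit through Python's standard `resource` module and compare it with its current usage. The helper names and the 90% threshold below are hypothetical, and this is not how nginx itself is configured.

```python
import resource


def descriptor_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def descriptor_usage(open_descriptors, soft_limit):
    """Return the fraction of the soft limit that is currently in use."""
    if soft_limit == resource.RLIM_INFINITY:
        return 0.0  # no limit, so usage is effectively zero
    if soft_limit <= 0:
        return 1.0  # nothing can be opened at all
    return open_descriptors / soft_limit


def should_shed_load(open_descriptors, soft_limit, threshold=0.9):
    """True when usage is at or above the (assumed) warning threshold."""
    return descriptor_usage(open_descriptors, soft_limit) >= threshold
```

For example, `should_shed_load(95_000, 100_000)` is true: at 95,000 open connections against the 100,000 limit mentioned above, the server would be within 10% of refusing new ones.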
The team then set about getting more capacity from their hosting provider, adding more caching, and removing features:
While Brendan worked on setting up boxes I started ripping out every database intensive feature of our system and Chris added more caching… Around 1 AM we were back online and looking pretty stable.
On Monday the load continued, so the team added memcached and monitoring tools, and moved the database to a larger machine:
So now it’s Tuesday night and the site seems to be cranking along under 50x the load that used to work on one box. We have 6 app servers and a big database machine. I’m really impressed what awesome hackers Chris and Brendan are and what amazing tools are available these days. Slicehost has scaled up as fast as we’ve needed them to. Amazon’s S3 serves all the images, and while the latency isn’t great, we never could have dealt with the bandwidth issues on our own. Capistrano lets us deploy and rollback everywhere; git with github lets us all hack frantically on the same codebase then merge and deploy. God keeps all the servers running, and memcached has given us great caching with very little pain (mostly… ).
Lukas describes the lessons learned over the 3 days:
It’s one thing to code scalably and grow slowly under increasing load, but it’s been a blast to crazily rearchitect a live site like FaceStat in a day or two. I figure at this point we’ve been on the number 1 (or 2) page on the internet, so there’s no bigger instant spike in traffic that could happen to us… Some lessons I learned for the next time this happens:
(1) Monitor the site better. We had exception handling emailing us, but there were so many exceptions that I didn’t really look at them and I wasn’t online. It wouldn’t have made sense to scale our site to handle this kind of load in advance, but it’s unfortunate we had to rely on random people deciding to lookup Chris’s email address to call his home phone number to yell at him…
(2) Don’t be afraid to put up an error page. We had lots of excited users emailing us when we had a page up saying we were down and explaining why. We had lots angry users emailing us when the site was up but with intolerable lag or crashing intermittently. I think wishful thinking caused us to put up the site an hour or two before it was ready.
(3) A statically generated homepage is a very good thing and memcached is awesome.
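The third lesson, serving a pre-rendered page from a cache instead of hitting the database on every request, is usually implemented as a cache-aside lookup with an expiry time. Here is a minimal sketch of that pattern, with an in-memory dictionary standing in for memcached. The `TTLCache` class, the `cached_page` helper, and the 60-second default are assumptions for illustration, not FaceStat's code.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a memcached-style cache with expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock  # injectable so expiry can be tested

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)


def cached_page(cache, key, render, ttl_seconds=60):
    """Return the cached page, calling render() only on a miss."""
    page = cache.get(key)
    if page is None:
        page = render()
        cache.set(key, page, ttl_seconds)
    return page
```

With a 60-second expiry, the expensive render runs at most about once a minute per key, however many visitors arrive in between.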
Brendan O'Connor talks about the technology behind the FaceStat application in a follow-up article:
Yes, we’re pretty much using Rails. We actually use an offshoot called Merb — which is a bit more efficient — on top of Thin. We find that a Rails-like platform is invaluable for rapidly prototyping a new site, especially since we started FaceStat as a pure experiment with no idea whether people would like it or not, and with a very different feature set in mind compared to what it later became. And it’s invaluable that Chris on our team is such a Ruby expert :).
However, the high-level platform really doesn’t matter compared to overall architecture: how we use the database (postgres), how much we cache (memcached/merb-cache), how we distribute load, how we deploy new systems (xen/slicehost), etc. It’s hasn’t been trivial since FaceStat is write-heavy and performs fairly complex statistical calculations, and various issues remain. But we are serving many users at nearly 100x our old load, so something must be going right — at least for now!
The Yahoo Effect
In short, it's usually not cost-effective to prepare for massive traffic spikes in advance. You can either buy a lot of hardware, which may go unused, or design your application to run on cloud computing infrastructure, but the latter is often not a viable option for existing applications.