Foursquare's MongoDB Outage

| by Ron Bodkin Follow 0 Followers on Oct 15, 2010. Estimated reading time: 8 minutes |

Foursquare recently suffered a total site outage for eleven hours. The outage was caused by unexpected uneven growth in their MongoDB database that their monitoring didn't detect. The system outage was prolonged when an attempt to add a partition didn't work due to fragmentation, and required taking the database offline to compact it. This article provides more details on what happened, why the system failed, and the responses of foursquare and 10gen to the incident.

Foursquare is a location-based social network which has been growing rapidly - reaching three million users in August. On October 4th, Foursquare experienced an eleven hour outage, because of capacity problems due to that growth. Nathan Folkman, Foursquare's Director of Operations, wrote a blog entry that both apologized to their users and provided some technical details about what happened. Subsequently 10gen's CTO Eliot Horowitz posted a more detailed post mortem to the mongodb users mailing list. 10gen develops and supports MongoDB, and provides support for Foursquare. This post mortem generated a lively discussion, including more technical details from Foursquare's engineering lead, Harry Heymann.

Basic System Architecture

The critical system that was affected was Foursquare's database of user check-ins. Unlike many history databases, where only a small fraction of data needs to be accessed at any given time 10gen's CEO Dwight Merriman told us that "For various reasons the entire DB is accessed frequently so the working set is basically its entire size." Because of this, the memory requirements for this database were the total size of data in the database.If the database size exceeded the RAM on the machine, the machine would thrash, generating more I/O requests than the four disks could service. In response to our questions he clarified that "a very high percentage of documents were touched very often. Much higher than you would expect."

Initially the database ran on a single EC2 instance with 66 GB of RAM. About two months ago, Foursquare had almost run out of capacity and migrated to a two-shard cluster, with each shard having 66 GB of RAM and replicating data to a slave. After this migration, each shard held approximately 33 GB of data. Data in the shards was partitioned into "200 evenly distributed chunks by user id" so all data for a given user is held in a single shard.


The Outage

As the system continued to grow, the partitions grew in an imbalanced fashion.  Horowitz notes:

It’s easy to imagine how this might happen: assuming certain subsets of users are more active than others, it’s conceivable that their updates might all go to the same shard.

Chunks get split by MongoDB as they grow past 200 MB, into two 100 MB. The end result was that when the total system grew past 116 GB of data, one partition was 50 GB in size but another one had exceeded the 66 GB of RAM available on the machine, causing unacceptable performance for requests that hit that shard.

In an attempt to fix the system, the team added a third shard to the database, aiming to transfer 5% of the system data so all shards would fit within RAM. They transferred only 5% of data because, Horowitz notes "we were trying to move  as little as possible to get the site back up as fast as possible." However, this didn't relieve the performance problem on the full shard. As Horowitz notes:

...we ultimately discovered that the problem was due to data fragmentation on shard0. In essence, although we had moved 5% of the data from shard0 to the new third shard, the data files, in their fragmented state, still needed the same amount of RAM. This can be explained by the fact that Foursquare check-in documents are small (around 300 bytes each), so many of them can fit on a 4KB page.  Removing 5% of these just made each page a little more sparse, rather than removing pages altogether.

The data that was migrated was sparse because it was small and because "Shard key order and insertion order are different. This prevents
data from being moved in contiguous chunks."  In order to fix the performance problem, they had to compact the overfull shard. MongoDB currently only allows offline compaction of a shard. This took four hours, which was a function of the volume of data to compact and "the slowness of EBS volumes at the time." When this completed, they size of the shard had shrunk by 5% and they were able to bring the system back online, after an eleven hour outage. No data was lost during the incident.

Follow Up

After the system was restored, they set up additional shards and distributed data evenly. To fix fragmentation they compacted the partitions on each of the slaves, then switched the slaves to be masters and compacted the masters. The system ended up using about 20 GB on each partition.

Horowitz noted that

The 10gen team is working on doing online incremental compaction of both data files and indexes.  We know this has been a concern in non-sharded systems as well.  More details about this will be coming in the next few weeks.

Horowitz notes:

The main thing to remember here is that once you’re at max capacity,
it’s difficult to add more capacity without some downtime when objects
are small.  However, if caught in advance, adding more shards on a
live system can be done with no downtime

The foursquare team also responded by promising to improve communication, operational procedures, and according to Folkman:

we’re also looking at things like artful degradation to help in these situations. There may be times when we’re overloaded in the future, and it would be better if certain functionalities were turned off rather than the whole site going down, obviously.

Heymann noted:

Overall we still remain huge fans of MongoDB @foursquare, and expect to be using it for a long time to come.


Community Response

In response to the post-mortem a number of questions were raised:

  1. Nat asked:
    Can repairDatabase() leverage multiple cores? Given that data is broken down into chunks, can we process them in parallel? 4 hours downtime seems to be eternal in internet space.

    Horowitz responded:
    Currently no.  We're going to be working on doing compaction in the
    background though so you won't have to do this large offline
    compaction at all.
  2. Alex Popescu asked:
    is there a real solution to the chunk migration/page size issue?

    Horowitz responded:
        Yes - we are working on online compaction would mitigate this.
  3. Suhail Doshi asked:
     I think the most obvious question is: How do we avoid maxing out our
     mongodb nodes and know when to provision new nodes?
     With the assumptions that: you're monitoring everything.
     What do we need to look at? How do we tell? What happens if you're a
     company with lots of changing scaling needs and's tough
     to capacity plan when you're a startup.

    Horowitz answered:
    This is very application dependent.  In some cases you need all your
    indexes in ram, in other cases it can just be a small working set.
    One good way to determine this is figure out how much data you
    determine in a 10 minute period.  Indexes and documents.  Make sure
    you can either store that in ram or read that from disk in the same
    amount of time.
  4. Nat also asked about back pressure: "It seems that when data grows out of memory, the performance seems to degrade significantly."

    To which Roger Binns added:
    What MongoDB needs to do is decrease concurrency under load so that existing
    queries can complete before allowing new ones to add fuel to the fire.
    There is a ticket for this:

There was also some discussion of whether solid state drives would have helped improve performance, but there was no conclusive statement about how they might have impacted performance. One also wonders why the user id partitions grew in such an imbalanced fashion. Partitioning by a has of user id would presumably have been nearly balanced - so it's likely that a biased partitioning (e.g., putting older users on one shard) resulted in the imbalanced data growth.

Monitoring and Future Directions

We interviewed 10gen's CEO Dwight Merriman to better understand some issues in this case. In general, we asked how to best monitor large scale MongoDB deployments, to which he replied that there are many monitoring tools, noting that Munin is commonly used and that it has a MongoDB plugin. We asked:

From the description, it seems like you should be able to monitor the resident memory used by the mongo db process to get an alert as shard memory ran low. Is that viable?

If the database is bigger than RAM, MongoDB, like other db's, will tend to use all memory as a cache.  So using all memory doesn't indicate a problem.  Rather, we need to know when working set is approaching the size of ram.  This is rather hard with all databases. One good way is to monitor physical I/O closely and watch for increases.

Merriman agree that in the case of Foursquare, where all the database needs to be held in RAM, monitoring resident memory or just total database size would be sufficient to detect problems in advance. This implies that it would have been fairly simple to identify the problem before a shard filled up. It seems like that no one expected the imbalanced growth so whatever monitoring was in place was inadequate to identify the problem.

We asked what changes in development priorities resulted from the experience, to which Merriman replied that they will be working on background compaction features sooner. In addition, Horowitz had noted that MongoDB should "degrade much more gracefully. We'll be working on these enhancements soon as well." Merriman noted that MongoDB would be allowing re-clustering objects to push inactive objects into pages that are on disk, and that they believe that for them memory mapped files work well. Horowitz elaborated that

The big issue really is concurrency.  The VM side works well, the problem is a read-write lock is too coarse.
1 thread that causes a fault can a bigger slowdown than it should be able to. We'll be addressing this in a few ways: making yielding more intelligent, real intra-collection concurrency.

Rate this Article

Adoption Stage

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Cache behaviour by Martin Probst

What I don't get is the description of the problem. So their data set went above their cache size simply by adding new users/entries. Shouldn't this see a gradual degradation of performance then? I.e., if you have 95% of your data in cache, you'll only see slow responses for the 5% of the queries that need to do a page fault, and then that slowly increases as more data is added? Why did they experience such a sudden and hard degradation?

That sounds a lot like some bug in the caching layer.

MongoDB is web scale by Cameron Purdy

Re: Cache behaviour by Seun Osewa

Two responses:
(1) It appears they weren't monitoring things properly.
(2) Actually, swapping is not gradual like that. when you run just over capacity, everything goes to hell. You get x100 degradation almost instantly.

Re: Cache behaviour by Ron Bodkin

Seun - you are right. On #2, once the working set exceeds RAM the machine will "thrash" (as described in the article) and performance will fall apart.

Martin - it's surprising that they have all the data in working set. 10gen can't say why that is (and foursquare didn't answer my email to ask). 10gen did explicitly say that "a very high percentage of documents were touched very often. Much higher than you would expect." So basically they are saying the data couldn't be cached, it was all required to be in RAM.

I suspect they were surprised because they *expected* the shards to be balanced, so they only monitored total data size, not size per shard (so they monitoring the wrong thing).

Re: Cache behaviour by Nati Shalom

To be more accurate i'll point to the recent stanford research on that subject: The case for RAM Cloud

"even a 1% miss ratio for a DRAM cache costs a factor of 10x in performance. A caching ap-proach makes the deceptive suggestion that “a few cache misses are OK” and lures programmers into con-figurations where system performance is poor."

If the entire set of data needs to be served out of memory it might have been better to use In-Memory-Data-Grid then disk based db with LRU based cache.

Nati S

No CAP by Nati Shalom

I've just posted more of my personal thoughts on this subject: NoCAP.


Re: Cache behaviour by Martin Probst

Well, yes and no. I mean, if their performance requirements are that every single request must be served under say 1 ms, then those requests that need to page something in pay 10 ms for the disk seek, and they are out.

But then again it sounds like a really strange application if you constantly have to access _all_ of your data. I don't know anything about Foursquare's implementation, but shouldn't e.g. old checkins after some time get "cold"? I think in normal applications the app only accesses a fraction of all the data really frequently (i.e., the latest checkins, with historic data being aggregated and seldom touched).

Also, you'd typically get bad response times for the fraction of the requests that need to touch data moved out of the cache. Which, again, could be a problem, but at least it doesn't kill your application instantly (you only lose some requests).

So this only makes sense if more or less every request they issue has to touch all of the data, which means all requests will cause page faults and thus compete for the disk _and_ also run slow. Maybe they have a really good reason why they need to do that, but that just sounds like bad application design. Which wouldn't really surprise me; from what I've read I got the impression that many of the NoSQL users just don't understand database technology ("what do you mean, you have to create an index?") and then switch to something else because clearly, databases don't scale ...

Re: Cache behaviour by Ron Bodkin

It depends - if enough requests are blocked waiting for disk then you'll get queueing and soon throughput drops. I don't believe MongoDB separates requests that can be served from RAM from those that require paging and has different queues. That would be a good optimization in a case like this, but I don't think it has it.

I do wonder about touching all the old records - seems like that should be done by a batch process in Hadoop rather than an online process.

Re: Cache behaviour by Thomas V

But then again it sounds like a really strange application if you constantly have to access _all_ of your data. I don't know anything about Foursquare's implementation, but shouldn't e.g. old checkins after some time get "cold"? I think in normal applications the app only accesses a fraction of all the data really frequently (i.e., the latest checkins, with historic data being aggregated and seldom touched).

Wild guess: since MongoDB is document-oriented and since they apparently shard by user, retrieving a single user might very well end up pulling all of their history as child objects. If a large proportion of your users are active at any given time, you're toasted.

IIRC a document can have references to other objects as attributes (as opposed to children stored inside the parent object), but I don't know if that allows you to store a user and its dependant object on the same shard.

Nothing more than a couple of suppositions, though :)

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

9 Discuss