Foursquare recently suffered a total site outage lasting eleven hours. The outage was caused by unexpectedly uneven growth in their MongoDB database, which their monitoring didn't detect. The outage was prolonged when an attempt to add a partition failed due to fragmentation, requiring the database to be taken offline for compaction. This article provides more details on what happened, why the system failed, and the responses of Foursquare and 10gen to the incident.
Foursquare is a location-based social network which has been growing rapidly - reaching three million users in August. On October 4th, Foursquare experienced an eleven hour outage, because of capacity problems due to that growth. Nathan Folkman, Foursquare's Director of Operations, wrote a blog entry that both apologized to their users and provided some technical details about what happened. Subsequently 10gen's CTO Eliot Horowitz posted a more detailed post mortem to the mongodb users mailing list. 10gen develops and supports MongoDB, and provides support for Foursquare. This post mortem generated a lively discussion, including more technical details from Foursquare's engineering lead, Harry Heymann.
Basic System Architecture
The critical system that was affected was Foursquare's database of user check-ins. Unlike many history databases, where only a small fraction of data needs to be accessed at any given time, this database is accessed almost in its entirety: 10gen's CEO Dwight Merriman told us that "For various reasons the entire DB is accessed frequently so the working set is basically its entire size." Because of this, the memory requirement for this database was the total size of the data it held. If the database size exceeded the RAM on the machine, the machine would thrash, generating more I/O requests than the four disks could service. In response to our questions he clarified that "a very high percentage of documents were touched very often. Much higher than you would expect."
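The constraint described above reduces to simple arithmetic: the working set equals the total database size, so the database must fit in RAM. A minimal sketch of that check (a hypothetical helper, not Foursquare's actual tooling):

```python
def fits_in_ram(working_set_gb: float, ram_gb: float) -> bool:
    """If the working set (here, the entire database) exceeds RAM,
    every cache miss becomes a disk read and the machine starts to thrash."""
    return working_set_gb <= ram_gb

# Foursquare's shard: 66 GB of RAM, working set == total data size.
assert fits_in_ram(60, 66)       # healthy: everything stays memory-resident
assert not fits_in_ram(67, 66)   # past 66 GB the shard starts paging
```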
Initially the database ran on a single EC2 instance with 66 GB of RAM. About two months ago, Foursquare had almost run out of capacity and migrated to a two-shard cluster, with each shard having 66 GB of RAM and replicating data to a slave. After this migration, each shard held approximately 33 GB of data. Data in the shards was partitioned into "200 evenly distributed chunks by user id" so all data for a given user is held in a single shard.
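The range-partitioning scheme described above can be sketched roughly as follows. The function names and the even chunk-to-shard assignment are assumptions for illustration; MongoDB's actual balancer assigns chunks dynamically:

```python
NUM_CHUNKS = 200

def chunk_for_user(user_id: int, max_user_id: int) -> int:
    """Range-partition user ids into 200 contiguous chunks, so that all of
    a given user's check-ins live in one chunk (and therefore one shard)."""
    return min(user_id * NUM_CHUNKS // (max_user_id + 1), NUM_CHUNKS - 1)

def shard_for_chunk(chunk: int, num_shards: int = 2) -> int:
    # Hypothetical even split: half the chunk ranges on each shard.
    return chunk * num_shards // NUM_CHUNKS

# Early user ids land in low chunks on shard 0; later ids on shard 1.
assert shard_for_chunk(chunk_for_user(10, 2_999_999)) == 0
assert shard_for_chunk(chunk_for_user(2_900_000, 2_999_999)) == 1
```

Note that under this scheme a user's shard is fixed by their id; if one id range grows faster than another, nothing rebalances the data.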
The Outage
As the system continued to grow, the partitions grew in an imbalanced fashion. Horowitz notes:
Chunks get split by MongoDB as they grow past 200 MB, into two 100 MB chunks. The end result was that when the total system grew past 116 GB of data, one partition was 50 GB in size but the other had exceeded the 66 GB of RAM available on the machine, causing unacceptable performance for requests that hit that shard.
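The splitting behavior can be modeled with a simplified sketch. The key point it illustrates is that splitting subdivides data within a shard but does not by itself move data between shards, so one shard's total can keep growing:

```python
SPLIT_THRESHOLD_MB = 200

def grow_chunk(chunks: list, index: int, added_mb: float) -> list:
    """Add data to one chunk; split any chunk that passes 200 MB into
    two halves of roughly 100 MB each (a simplified model of MongoDB's
    chunk splitting -- note the shard's total size is unchanged)."""
    chunks[index] += added_mb
    if chunks[index] > SPLIT_THRESHOLD_MB:
        half = chunks[index] / 2
        chunks[index] = half
        chunks.insert(index + 1, half)
    return chunks

# A chunk that grows past 200 MB becomes two ~100 MB chunks:
assert grow_chunk([150], 0, 60) == [105.0, 105.0]
```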
In an attempt to fix the system, the team added a third shard to the database, aiming to transfer 5% of the system data so all shards would fit within RAM. They transferred only 5% of data because, Horowitz notes "we were trying to move as little as possible to get the site back up as fast as possible." However, this didn't relieve the performance problem on the full shard. As Horowitz notes:
The data that was migrated was sparse, both because it was small and because "Shard key order and insertion order are different. This prevents data from being moved in contiguous chunks." In order to fix the performance problem, they had to compact the overfull shard. MongoDB currently only allows offline compaction of a shard. This took four hours, a function of the volume of data to compact and "the slowness of EBS volumes at the time." When this completed, the size of the shard had shrunk by 5% and they were able to bring the system back online, after an eleven hour outage. No data was lost during the incident.
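Why a sparse migration barely shrinks the shard's memory footprint can be shown with a toy page model (a sketch, assuming documents are laid out on fixed-size pages in insertion order): a page stays resident as long as any live document remains on it.

```python
def resident_pages_after_migration(num_docs: int, docs_per_page: int,
                                   moved: set) -> int:
    """A page is freed only when *every* document on it has been moved
    away; `moved` is the set of migrated document ids (insertion order)."""
    pages = 0
    for start in range(0, num_docs, docs_per_page):
        page_docs = range(start, min(start + docs_per_page, num_docs))
        if any(d not in moved for d in page_docs):
            pages += 1
    return pages

# Moving 5% of documents in shard-key order (scattered across pages)
# frees almost nothing -- every page still holds live documents:
scattered = set(range(0, 1000, 20))
assert resident_pages_after_migration(1000, 10, scattered) == 100

# Moving the same 5% in insertion order frees whole pages:
contiguous = set(range(50))
assert resident_pages_after_migration(1000, 10, contiguous) == 95
```

This is why offline compaction, which rewrites the data densely, was needed to actually reclaim the space.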
Follow Up
After the system was restored, they set up additional shards and distributed data evenly. To fix fragmentation they compacted the partitions on each of the slaves, then switched the slaves to be masters and compacted the masters. The system ended up using about 20 GB on each partition.
Horowitz notes:
it’s difficult to add more capacity without some downtime when objects are small. However, if caught in advance, adding more shards on a live system can be done with no downtime
The Foursquare team also responded by promising to improve communication and operational procedures.
Community Response
In response to the post-mortem a number of questions were raised:
- Nat asked:
Can repairDatabase() leverage multiple cores? Given that data is broken down into chunks, can we process them in parallel? 4 hours downtime seems to be eternal in internet space.
Horowitz responded:
Currently no. We're going to be working on doing compaction in the background though so you won't have to do this large offline compaction at all.
- Alex Popescu asked:
is there a real solution to the chunk migration/page size issue?
Horowitz responded:
Yes - we are working on online compaction, which would mitigate this.
- Suhail Doshi asked:
I think the most obvious question is: How do we avoid maxing out our mongodb nodes and know when to provision new nodes? With the assumption that you're monitoring everything.
What do we need to look at? How do we tell? What happens if you're a company with lots of changing scaling needs and features...it's tough to capacity plan when you're a startup.
Horowitz answered:
This is very application dependent. In some cases you need all your indexes in ram, in other cases it can just be a small working set. One good way to determine this is to figure out how much data you touch in a 10 minute period - indexes and documents. Make sure you can either store that in ram or read that from disk in the same amount of time.
- Nat also asked about back pressure: "It seems that when data grows out of memory, the performance seems to degrade significantly."
To which Roger Binns added:
What MongoDB needs to do is decrease concurrency under load so that existing queries can complete before allowing new ones to add fuel to the fire. There is a ticket for this:
http://jira.mongodb.org/browse/SERVER-574
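Horowitz's capacity-planning rule of thumb above (measure the data touched in a 10 minute window; make sure it fits in RAM or can be read from disk in that same window) can be sketched as a quick calculation. The helper name and disk-throughput figure are illustrative assumptions:

```python
def can_keep_up(touched_gb_per_10min: float, ram_gb: float,
                disk_mb_per_s: float) -> bool:
    """Rule of thumb: the data (documents + indexes) touched in a
    10-minute window must either fit in RAM, or be readable from disk
    within that same 10 minutes."""
    window_seconds = 10 * 60
    disk_gb_per_window = disk_mb_per_s * window_seconds / 1024
    return (touched_gb_per_10min <= ram_gb
            or touched_gb_per_10min <= disk_gb_per_window)

# 66 GB of RAM comfortably covers a 40 GB working window:
assert can_keep_up(40, 66, 100)
# 70 GB touched, 66 GB RAM, 100 MB/s disk: only ~58 GB readable in 10 min.
assert not can_keep_up(70, 66, 100)
```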
There was also some discussion of whether solid state drives would have helped improve performance, but there was no conclusive statement about how they might have impacted performance. One also wonders why the user id partitions grew in such an imbalanced fashion. Partitioning by a hash of user id would presumably have been nearly balanced, so it's likely that a biased partitioning (e.g., putting older users on one shard) resulted in the imbalanced data growth.
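The contrast between hash and range partitioning above can be demonstrated with a small sketch (hypothetical helpers; MongoDB of that era supported only range-based shard keys):

```python
import hashlib
from collections import Counter

def shard_by_hash(user_id: int, num_shards: int) -> int:
    """Hash partitioning spreads old and new users alike across shards."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

def shard_by_signup_order(user_id: int, num_shards: int, max_id: int) -> int:
    """Range partitioning by a monotonically growing user id: early
    users land on shard 0, later cohorts on higher shards."""
    return user_id * num_shards // (max_id + 1)

# If activity is skewed toward one cohort (say, the earliest users),
# range partitioning piles it all onto one shard; hashing spreads it out.
active = range(100)                         # the earliest 100 users
by_range = Counter(shard_by_signup_order(u, 2, 9999) for u in active)
by_hash = Counter(shard_by_hash(u, 2) for u in active)
assert by_range[0] == 100                   # all activity on shard 0
assert 0 < by_hash[0] < 100                 # spread across both shards
```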
Monitoring and Future Directions
We interviewed 10gen's CEO Dwight Merriman to better understand some issues in this case. In general, we asked how to best monitor large scale MongoDB deployments, to which he replied that there are many monitoring tools, noting that Munin is commonly used and that it has a MongoDB plugin. We asked:
From the description, it seems like you should be able to monitor the resident memory used by the mongo db process to get an alert as shard memory ran low. Is that viable?
Merriman replied:
If the database is bigger than RAM, MongoDB, like other db's, will tend to use all memory as a cache. So using all memory doesn't indicate a problem. Rather, we need to know when the working set is approaching the size of RAM. This is rather hard with all databases. One good way is to monitor physical I/O closely and watch for increases.
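Merriman's suggestion of watching physical I/O for increases could be turned into a simple trend alert. A minimal sketch, assuming IOPS samples collected at regular intervals by some external monitoring agent (the function and thresholds are hypothetical):

```python
def io_trend_alert(iops_samples: list, window: int = 5,
                   growth_factor: float = 2.0) -> bool:
    """Flag when recent physical I/O is well above its earlier baseline,
    a sign the working set may be outgrowing RAM."""
    if len(iops_samples) < 2 * window:
        return False                      # not enough history yet
    baseline = sum(iops_samples[:window]) / window
    recent = sum(iops_samples[-window:]) / window
    return baseline > 0 and recent / baseline >= growth_factor

# Flat I/O: no alert.  Steadily climbing I/O: alert.
assert not io_trend_alert([100, 110, 95, 105, 100, 102, 98, 104, 101, 99])
assert io_trend_alert([100, 105, 98, 102, 100, 180, 220, 260, 310, 400])
```

In Foursquare's case, where the working set was the whole database, simply alerting on per-shard data size approaching RAM would have been even simpler.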
Merriman agreed that in the case of Foursquare, where the entire database needs to be held in RAM, monitoring resident memory or just total database size would be sufficient to detect problems in advance. This implies that it would have been fairly simple to identify the problem before a shard filled up. It seems that no one expected the imbalanced growth, so whatever monitoring was in place was inadequate to identify the problem.
We asked what changes in development priorities resulted from the experience, to which Merriman replied that they will be working on background compaction features sooner. In addition, Horowitz had noted that MongoDB should "degrade much more gracefully. We'll be working on these enhancements soon as well." Merriman noted that MongoDB would allow re-clustering objects to push inactive objects into pages that stay on disk, and that they believe memory mapped files work well for them. Horowitz elaborated:
The big issue really is concurrency. The VM side works well; the problem is a read-write lock is too coarse.
One thread that causes a fault can cause a bigger slowdown than it should be able to. We'll be addressing this in a few ways: making yielding more intelligent, and real intra-collection concurrency.
Community comments
Cache behaviour
by Martin Probst,
What I don't get is the description of the problem. So their data set went above their cache size simply by adding new users/entries. Shouldn't this see a gradual degradation of performance then? I.e., if you have 95% of your data in cache, you'll only see slow responses for the 5% of the queries that need to do a page fault, and then that slowly increases as more data is added? Why did they experience such a sudden and hard degradation?
That sounds a lot like some bug in the caching layer.
MongoDB is web scale
by Cameron Purdy,
nosql.mypopescu.com/post/1016320617/mongodb-is-...
;-)
Re: Cache behaviour
by Seun Osewa,
Two responses:
(1) It appears they weren't monitoring things properly.
(2) Actually, swapping is not gradual like that. When you run just over capacity, everything goes to hell. You get 100x degradation almost instantly.
Re: Cache behaviour
by Ron Bodkin,
Seun - you are right. On #2, once the working set exceeds RAM the machine will "thrash" (as described in the article) and performance will fall apart.
Martin - it's surprising that they have all the data in working set. 10gen can't say why that is (and foursquare didn't answer my email to ask). 10gen did explicitly say that "a very high percentage of documents were touched very often. Much higher than you would expect." So basically they are saying the data couldn't be cached, it was all required to be in RAM.
I suspect they were surprised because they *expected* the shards to be balanced, so they only monitored total data size, not size per shard (so they were monitoring the wrong thing).
Re: Cache behaviour
by Nati Shalom,
To be more accurate I'll point to the recent Stanford research on that subject: The Case for RAMClouds.
If the entire set of data needs to be served out of memory, it might have been better to use an In-Memory Data Grid than a disk-based db with an LRU-based cache.
Nati S
GigaSpaces
No CAP
by Nati Shalom,
I've just posted more of my personal thoughts on this subject: NoCAP.
Enjoy...
Re: Cache behaviour
by Martin Probst,
Well, yes and no. I mean, if their performance requirements are that every single request must be served under say 1 ms, then those requests that need to page something in pay 10 ms for the disk seek, and they are out.
But then again it sounds like a really strange application if you constantly have to access _all_ of your data. I don't know anything about Foursquare's implementation, but shouldn't e.g. old checkins after some time get "cold"? I think in normal applications the app only accesses a fraction of all the data really frequently (i.e., the latest checkins, with historic data being aggregated and seldom touched).
Also, you'd typically get bad response times for the fraction of the requests that need to touch data moved out of the cache. Which, again, could be a problem, but at least it doesn't kill your application instantly (you only lose some requests).
So this only makes sense if more or less every request they issue has to touch all of the data, which means all requests will cause page faults and thus compete for the disk _and_ also run slow. Maybe they have a really good reason why they need to do that, but that just sounds like bad application design. Which wouldn't really surprise me; from what I've read I got the impression that many of the NoSQL users just don't understand database technology ("what do you mean, you have to create an index?") and then switch to something else because clearly, databases don't scale ...
Re: Cache behaviour
by Ron Bodkin,
It depends - if enough requests are blocked waiting for disk then you'll get queueing and soon throughput drops. I don't believe MongoDB separates requests that can be served from RAM from those that require paging and has different queues. That would be a good optimization in a case like this, but I don't think it has it.
I do wonder about touching all the old records - seems like that should be done by a batch process in Hadoop rather than an online process.
Re: Cache behaviour
by Thomas V,
Wild guess: since MongoDB is document-oriented and since they apparently shard by user, retrieving a single user might very well end up pulling all of their history as child objects. If a large proportion of your users are active at any given time, you're toasted.
IIRC a document can have references to other objects as attributes (as opposed to children stored inside the parent object), but I don't know if that allows you to store a user and its dependent objects on the same shard.
Nothing more than a couple of suppositions, though :)