Scaling Out the Most Popular Social Game, FarmVille
With 83.75 million monthly active users, FarmVille is the most popular game on Facebook and one of the most popular web-based games on the Internet. To scale out, the application runs in the cloud, caches extensively, can turn off some functionality during peak times, and relies on performance monitoring and management.
Launched in June 2009, FarmVille had its first million users after 4 days, and 10 M after 60 days, according to Luke Rajlich, a developer working for Zynga, the game’s creator. With over 80 M monthly active players, FarmVille manages to engage over 20% of all Facebook users and over 1% of the world’s population. Scaling out at these proportions and in such a short time requires certain hardware and software solutions.
InfoQ interviewed Luke Rajlich to find out some architectural details. First of all, the application runs in the cloud on virtualized Linux servers, so it can request and receive additional computing power pretty easily. The application runs on a basic LAMP stack, where P stands for PHP. The application uses caching extensively:
We are basically an Object Oriented, MVC application with a custom written DB/Cache interface. We heavily rely on caching, specifically memcache, to support our workload. As well, we have a horizontally sharded database.
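The combination of a cache layer in front of a horizontally sharded database can be sketched roughly as follows. This is only an illustration, not Zynga's code: the class and method names are hypothetical, plain dictionaries stand in for memcache and the MySQL shards, and the hash-modulo shard choice is an assumption (the interview does not say how users are mapped to shards).

```python
import zlib


class ShardedStore:
    """Toy cache-aside store over N database shards.

    Dictionaries stand in for memcache and for each MySQL shard.
    """

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]
        self.cache = {}
        self.db_reads = 0  # counts how often we had to go to a shard

    def shard_for(self, user_id):
        # Stable hash, so a given user always lands on the same shard.
        return zlib.crc32(str(user_id).encode()) % len(self.shards)

    def load(self, user_id):
        # Serve from cache when possible; fall back to the owning shard.
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        state = self.shards[self.shard_for(user_id)].get(user_id)
        if state is not None:
            self.cache[user_id] = state
        return state

    def save(self, user_id, state):
        # Write to the owning shard and refresh the cached copy.
        self.shards[self.shard_for(user_id)][user_id] = state
        self.cache[user_id] = state
```

Because each user's data lives on exactly one shard, capacity could in principle be added by adding shards, although a real system would also need a plan for moving existing users when the shard count changes.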
To handle traffic spikes, the application relies on adding extra capacity at short notice:
Architecturally, we are able to add capacity quickly since the application workload can be partitioned at any layer (load balancer, web server, memcache, database). In addition, we have very specific and formulaic procedure for adding capacity at any given layer. Thus, the execution of adding capacity is easily managed and can be executed quickly. We additionally run on a virtualized environment, thus we can add capacity without directly provisioning additional hardware, which significantly cuts down the time from when we make the decision to add capacity to when we actually have the necessary hardware available. We additionally have adopted configuration tools, such as puppet, that reduce the overhead required to add additional hardware. The difficult part that remains is knowing and finding which part of the application breaks in terms of performance first. In order to accommodate that concern, we have invested time into the aforementioned service degradation as well we spend a considerable amount of time working on application performance monitoring.
The game has a number of components, and when there is a performance bottleneck “we can effectively turn off the less important functionality we use on the platform” in order to alleviate the demands on the application:
There are a number of other components [beside the game itself], such as friend ladders, gift requests, etc. We can strip those elements away from the game so that the basic parts of the game aren't as impacted by the performance of those components. This is crucially important as our game is primarily a timing based game where users come back to the game at a certain time to perform certain actions. Those specific actions have a big user experience impact when we have downtime, thus we want to avoid that happening for users.
A user's state contains a large amount of data which has subtle and complex relationships. For example, in a farm, objects cannot collide with each other, so if a user places a house on their Farm, the backend needs to check that no other object in that user's farm occupies an overlapping space. Unlike most major sites like Google or Facebook, which are read heavy, FarmVille has an extremely heavy write workload. The ratio of data reads to writes is 3:1, which is an incredibly high write rate. A majority of the requests hitting the backend for FarmVille in some way modifies the state of the user playing the game. To make this scalable, we have worked to make our application interact primarily with cache components.
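The collision check described above can be illustrated with a small sketch. It assumes farm objects are axis-aligned rectangles on an integer grid, which the interview does not state; the `Placed` type and the `overlaps` and `try_place` functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placed:
    """A rectangular object on the farm grid (assumed representation)."""
    x: int
    y: int
    width: int
    height: int


def overlaps(a, b):
    # Two rectangles overlap only if they intersect on both axes.
    # Rectangles that merely share an edge do not count as overlapping.
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def try_place(farm, new_obj):
    # Reject the placement if any existing object occupies the same space.
    if any(overlaps(new_obj, existing) for existing in farm):
        return False
    farm.append(new_obj)
    return True
```

Each placement both reads the whole farm and, on success, modifies it, which gives a sense of why so many requests end up writing user state.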
The traffic between FarmVille and the Facebook platform peaks at about 3 Gb/s, so the application needs to be able to turn off some calls to the platform to avoid blocking the communication links:
The amount of traffic between FarmVille and the Facebook platform is enormous: at peak, roughly 3 Gigabits/sec of traffic go between FarmVille and Facebook while our caching cluster serves another 1.5 Gigabits/sec to the application. Additionally, since performance can be variable, the application has the ability to dynamically turn off any calls back to the platform. We have a dial that we can tweak that turns off incrementally more calls back to the platform. We have additionally worked to make all calls back to the platform avoid blocking the loading of the application itself. The idea here is that, if all else fails, players can continue to at least play the game.
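One way to picture the "dial" described above is a single level that progressively disables platform calls, least important first. The sketch below is an assumption about how such a dial might work, not a description of Zynga's implementation; `PlatformDial`, the call names, and the shed levels are all hypothetical.

```python
class PlatformDial:
    """Disables platform calls incrementally as the dial level rises."""

    def __init__(self, shed_at):
        # shed_at maps a call name to the dial level at which it is turned off.
        # Less important calls get lower numbers and are shed first.
        self.shed_at = dict(shed_at)
        self.level = 0  # 0 means every call is enabled

    def set_level(self, level):
        self.level = max(0, level)

    def is_enabled(self, call_name):
        return self.level < self.shed_at[call_name]

    def call(self, call_name, fn, fallback=None):
        # Skip the remote call entirely when it is shed, so the game
        # can keep loading with a fallback value instead of blocking.
        if not self.is_enabled(call_name):
            return fallback
        return fn()
```

For example, with `PlatformDial({"friend_ladder": 1, "gift_requests": 2, "profile_lookup": 3})`, setting the level to 1 would skip only the friend ladder call, while level 3 would skip all three.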
For performance monitoring and management “we use nagios for alerting, munin for monitoring, and puppet for configuration. We heavily utilize internal stats systems to track performance of the services the application uses, such as Facebook, DB, and Memcache. Additionally, when we see performance degradation, we profile a request's IO events on a sampled basis.”
As a side note, according to Inside Social Games analyst Justin Smith, Zynga, the company behind FarmVille, made $490 M in revenue last year and expects to make $835 M this year.
What is the technology stack
What is the tech stack? What kind of servers host the app? Which cloud vendor - public or private?
Re: What is the technology stack
The article specifies the stack: "The application runs on a basic LAMP stack, where P stands for PHP." M = MySQL.
They do run in a cloud, I suppose a public cloud. They have given very few details about their relationship with Facebook or their cloud provider.