Collaboration: At the Extremities of Extreme
Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.
The content has been bookmarked!
There was an error bookmarking this content! Please retry.
Posted by Abel Avram on May 18, 2010
Steve Huffman, co-founder of Reddit, shares the main lessons he learned scaling Reddit from a small web application to a large social website.
Reddit was started by Steve Huffman and Alexis Ohanian in 2005 and consisted of running the web application, the application server and the database, all on one machine. Over the time, Reddit has grown to 7.5M users/month and 270M page views/month. Huffman told in a presentation what he learned along the way and talked about many mistakes they made and how they corrected them.
1. Crash Often
They had lots of crashes in the beginning and Huffman used to sleep with his laptop next to him, waking every few hours to check on the website to see if it is up and running. The solution found was to use Supervise, a daemon which restarts crashed applications. That lead to an interesting approach to running an application and on its design: if the app has memory leaks or consumes too much memory, they simply killed it and restarted it from scratch. This was not a definitive solution, but a temporary one until the application would be fixed based on what was inside the log files.
2. Separation of Services
Huffman suggests grouping similar processes together on the same machine or group of machines to avoid context switching all the time which consumes resources. He also considers as a good practice to use a database machine for similar types of data in order to avoid swapping the indices cache all the time and move other types to other machines.
Huffman strongly suggests avoiding threads which in Python are “the kiss of death, a recipe for slowness”. If various tasks are assigned to separate processes rather than threads, it is easy to move those processes to different machines when the demand raises and more resources are needed. The only problem with that is inter-process communication, but otherwise it is better because the architecture can grow smoother.
3. Open Schema
Every new feature required a schema update which became more painful as the DB grew. Adding a new column to a 10 million rows table can take a long time especially if you have a backup copy and a replication one. They had days when they ran without having a backup because they built the replica in that time.
The solution was to use an open schema or an entity-attribute value, a key-value store. Now there are two tables for each data type:

Things can be Users, Links, Comments, etc. all sharing the same schema. The Data table is comprised of a huge number of rows but only 3 columns: an ID, a Key and a Value. The result of this new schema was that adding new features does not involve changing the schema and does not require creating new tables. Also, there are no more joins in the DB, so the DB can be easily distributed.
4. Keep it Stateless
One goal for any web application is for any of its app server to be able to handle any request. That is very easy to obtain when you have only one machine (obviously), but it may get complicated using multiple servers and cached application state. Each server added increased cache redundancy and difficulty in accessing cached data.
The solution in this case was to switch to memcache and to stop using state on any app server they had. One of the immediate benefits was the fact that a crashed application server would not affect any other server. Also, scaling is done easily, simply by adding more servers.
It is important to separate the cache server from other servers to avoid resource contention.
5. Memcache for Everything
Reddit started to use memcache for about everything: database data, session data, rendered pages, memorizing internal functions, pre-computed pages, global locking. They also used memcachedb for data persistence.
6. Store Redundant Data
A recipe for slowness is to “keep data normalized until you need it.” When the user needs some data presented in a specific format, taking the raw data and processing it on spot may delay the response to the point that the user gives up waiting on it. The solution is to keep in memory and disk all the formats used to display the data. While this approach has some disk and memory impact, it has the advantage of a quick response to the user’s request.
For Reddit, the key for speed is “pre-compute everything and dump it on memcache.”
7. Work Offline
When an user makes a request, the system needs to do what is absolutely necessary to return a proper response, and everything else is derogated to queued jobs that perform things offline. Examples of works done offline include: pre-computing listings, fetching thumbnails, detecting cheating, removing spam, computing awards, and updating the search index. When an user votes on a link, he does not need to wait until all the indexes and listings are updated, and those tasks can be done after a response is returned to the user.
The blue arrows in the following picture represent activities performed on user’s request, while the pink arrows are activities performed offline:

Getting Started with Stratos - an Open Source Cloud Platform
Why NoSQL? A primer on Managing the Transition from RDBMS to NoSQL
Mobile and the New Two-Tiered Web Architecture
Agile Practices to Improve Project Management Organization (PMO) Effectiveness
18 agile and lean practices for effective software development governance
As a side note, Reddit (along with other high profile websites) has recently dumped Memcache for Cassandra.
See www.infoq.com/news/2010/03/Digg-Reddit-NoSQL-Ca...
Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.
Michael Snoyman presents Yesod, a web framework written in Haskell and containing a web server, templating, ORM, libraries (templating, gravatar, etc.).
Richard Kreuter and Kyle Banker on how to avoid classical RDBMS transactional systems by using compensation mechanisms, transactional messaging or transactional procedures.
Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.
One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.
InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.
Alex Papadimoulis discusses ugly code, where it comes from, how to avoid it, and how to get rid of it.
John Davies examines Visa’s architecture and shows how enterprises have architected complex integrations incorporating Hadoop, memcached, Ruby on Rails, and others to deliver innovative solutions.
1 comment
Watch Thread Reply