Scaling Twitter to New Peaks
For many of us, Twitter has become an essential communications utility. People and businesses use Twitter every day in broader and deeper ways, so we all have an interest in how well Twitter scales. Earlier this month Twitter experienced and seamlessly handled a new peak load of 143,199 tweets per second, roughly 25 times its current steady state of 5,700 tweets per second. Raffi Krikorian, VP of Platform Engineering at Twitter, reported the new record and took some time to review the engineering changes they've made to scale to this new level of traffic.
Three years ago, peaks of 2,000 tweets per second from activity around the 2010 World Cup caused major stability problems for Twitter and a realization that they needed to re-architect their systems. A subsequent engineering review found that Twitter had the world's largest Ruby on Rails installation, that everything was in one codebase, and that both the application and the engineering team were monolithic. Their MySQL storage system had reached its limits, hardware was not fully utilised and repeated "optimizations" were ossifying the codebase. Krikorian reports that Twitter came out of their review with some big aims: to reduce the number of machines by 10x, to move to a loosely coupled service-oriented architecture with cleaner boundaries and more cohesion, and to be able to launch new features faster with smaller empowered teams.
Twitter moved to the JVM and away from Ruby. They had hit the limits of Ruby's process-level concurrency model and needed a programming platform that provided higher throughput and better use of hardware resources. Rewriting their codebase on the JVM yielded a better than 10x performance improvement, and they now push 10-20K requests/sec/host.
Twitter's largest architectural change was moving to a service-oriented architecture focussing on their "core nouns" of tweet, timeline and user services. Their development approach relies on "design by contract", where interface definitions are agreed up front and then teams work independently on the implementation. The services are autonomous and self-contained, and that is reflected in the new engineering team structure. An asynchronous RPC platform, Finagle, was developed to handle concurrency, failover and load balancing in a standard manner across all engineering teams.
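To illustrate the "design by contract" idea, here is a minimal sketch in Python rather than Twitter's own JVM stack. It does not use Finagle or any Twitter API: the names `UserService`, `InMemoryUserService` and `render_byline` are hypothetical, invented for this example. The point is only that a caller can be written against an agreed asynchronous interface while another team supplies the implementation.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict


class UserService(ABC):
    """Hypothetical contract, agreed up front between teams."""

    @abstractmethod
    async def get_screen_name(self, user_id: int) -> str:
        """Return the screen name for a user id."""


class InMemoryUserService(UserService):
    """One possible implementation; callers depend only on the contract."""

    def __init__(self, users: Dict[int, str]) -> None:
        self._users = users

    async def get_screen_name(self, user_id: int) -> str:
        return self._users[user_id]


async def render_byline(users: UserService, user_id: int) -> str:
    # Written against the interface, not against any concrete service.
    name = await users.get_screen_name(user_id)
    return f"@{name}"


def demo() -> str:
    service = InMemoryUserService({1: "twittereng"})
    return asyncio.run(render_byline(service, 1))
```

Swapping `InMemoryUserService` for a networked implementation would not require any change to `render_byline`, which is the property the contract-first approach is after.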
The new architecture is reflected in the organization of Twitter's engineering teams. The services and their teams are autonomous and self-contained. Each team owns its interfaces and its problem domain. No one needs to be an expert across the whole system, and not everyone has to worry about scaling Tweets. Critical capabilities are abstracted behind APIs that make them accessible to everyone who needs them.
But even with a less monolithic architecture, says Krikorian, persistence remains a huge bottleneck. Twitter's single-master MySQL database has been replaced with a distributed framework of sharded, fault-tolerant databases using Gizzard.
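As a rough illustration of what sharding means here, the sketch below routes each key to one of several independent stores by hashing it. This is a generic, simplified technique and not Gizzard's API or its actual partitioning scheme; `shard_for` and `ShardedStore` are hypothetical names, and replication and fault tolerance are left out entirely.

```python
import hashlib
from typing import Dict, List, Optional


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Hypothetical key-value store split across in-memory 'shards'."""

    def __init__(self, shard_count: int) -> None:
        # Each dict stands in for a separate database host.
        self._shards: List[Dict[str, str]] = [{} for _ in range(shard_count)]

    def put(self, key: str, value: str) -> None:
        self._shards[shard_for(key, len(self._shards))][key] = value

    def get(self, key: str) -> Optional[str]:
        return self._shards[shard_for(key, len(self._shards))].get(key)
```

Because no single store holds every key, write load spreads across hosts instead of funnelling through one master.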
Reinforcing a common theme for scaling large systems, observability and statistics are key tools for managing the system and provide concrete data to support optimization efforts. Twitter's development platform incorporates tools which make it very easy for developers to provide request tracing and statistical reporting.
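The kind of low-friction statistical reporting described above might look something like the following sketch, in which a decorator records a call count and latency for any function it wraps. This is an assumption about the general shape of such tooling, not Twitter's own libraries; `Stats`, `timed` and `fetch_timeline` are hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps


class Stats:
    """Hypothetical in-process registry of counters and latencies."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings_ms = defaultdict(list)

    def timed(self, name):
        """Decorator that counts calls and records elapsed milliseconds."""
        def decorator(fn):
            @wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    # Recorded even if the wrapped call raises.
                    elapsed_ms = (time.perf_counter() - start) * 1000.0
                    self.counters[name] += 1
                    self.timings_ms[name].append(elapsed_ms)
            return wrapper
        return decorator


stats = Stats()


@stats.timed("timeline.fetch")
def fetch_timeline(user_id):
    # Stand-in for a real service call.
    return [f"tweet for user {user_id}"]
```

A developer only has to add one line above a function to have it show up in the statistics, which is the ease-of-use property the article highlights.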
The final element in Twitter's scaling story is the effort put into their runtime configuration and testing environment. Testing Twitter at "Twitter scale" can really only be done in production. Deploying new features could also require a challenging level of coordination across teams. So Twitter has developed a mechanism called Decider that decouples deployment from release: features are deployed in an "off" setting and then switched on either in a binary fashion (all at once), or gradually for a percentage of operations.
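A percentage-based switch of this kind can be sketched as follows. This is not Decider's actual implementation or interface; `FeatureSwitch`, `set_percent` and `is_enabled` are hypothetical names, and hashing the request id into one of 100 buckets is just one assumed way to pick a stable subset of operations.

```python
import zlib


class FeatureSwitch:
    """Hypothetical runtime switch: features are off until turned on."""

    def __init__(self):
        self._percent = {}

    def set_percent(self, feature, percent):
        """Enable a feature for 0 (off) to 100 (fully on) percent of requests."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent[feature] = percent

    def is_enabled(self, feature, request_id):
        # Unknown features default to 0 percent, i.e. deployed but off.
        percent = self._percent.get(feature, 0)
        # A stable hash keeps the same request id in the same bucket.
        bucket = zlib.crc32(f"{feature}:{request_id}".encode("utf-8")) % 100
        return bucket < percent
```

Setting a feature to 100 corresponds to the binary "all at once" case, while intermediate values give a gradual rollout. Since a request's bucket never changes, raising the percentage only adds requests to the enabled set and never removes any.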
The overall result for Twitter today is that it is more scalable, more resilient and more agile than before. Traffic volumes are breaking new records and new features can be rolled out without significant disruption. Krikorian finishes his blog post by urging readers to keep an eye on @twittereng for more details about Twitter's re-architecture.
Stephanie Davis (nee Stewart) Dec 21, 2014