The Infrastructure Behind Twitter: Scaling Networking, Storage and Provisioning
The Twitter Engineering team has recently provided an insight into the evolution and scaling of the core technologies behind the custom data center infrastructure that powers the social media service. Core lessons shared included: architect beyond the original specifications and requirements, and make quick and bold changes if traffic trends toward the upper end of designed capacity; there is no such thing as a “temporary change or workaround” - workarounds are technical debt; focus on providing the right tool for the job, which means legitimately understanding all possible use cases; and documenting internal and community best practices has been a “force multiplier”.
The social networking and online news service Twitter was created in 2006, when hardware from physical enterprise vendors “ruled the data center”, according to a recent post on the Twitter Engineering Blog. In the ten years since the launch, the rapid growth of Twitter’s user base has presented many engineering challenges at the hardware layer. Although Twitter has a "good presence" within public cloud, they have primarily invested in their own private infrastructure; in 2010 Twitter migrated from third-party colocated hosting to their own private data center infrastructure, which over the following six years has been “continually engineered and refreshed [...] to take advantage of the latest open standards in technology and hardware efficiency”.
By late 2010 Twitter had finalised their first in-house network architecture, which was designed to address the scale and service issues encountered with the existing third-party colocated infrastructure. Initially, the use of deep-buffer Top of Rack (ToR) switches and carrier-grade core network switches allowed Twitter to support previously unseen transactions per second (TPS) volumes generated by global events such as the 2014 FIFA World Cup. However, over the next several years the Twitter infrastructure team added network points of presence (PoPs) across five continents, and in 2015 Twitter moved from a traditional hierarchical data center network topology to a Clos network using Border Gateway Protocol (BGP) for routing. Highlights of the Clos approach include a smaller 'blast radius' for a single device failure, horizontal bandwidth scaling capabilities, and higher route capacity due to lower CPU overhead.
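The smaller 'blast radius' of the Clos approach can be sketched numerically: traffic is spread across many spine switches, so one failed device removes only a small, proportional share of inter-rack bandwidth, whereas losing one switch in a redundant two-core hierarchical design removes half. A minimal illustration (the switch counts below are hypothetical, not Twitter's actual fabric):

```python
# Sketch: fraction of inter-rack bandwidth lost when one device fails,
# comparing a traditional hierarchical core with a Clos spine layer.
# Switch counts are hypothetical, purely for illustration.

def bandwidth_lost_fraction(num_devices: int, failed: int = 1) -> float:
    """With equal-cost paths spread over num_devices, losing `failed`
    devices removes a proportional share of aggregate bandwidth."""
    return failed / num_devices

hierarchical_core = 2   # a typical redundant pair of carrier-grade cores
clos_spines = 16        # hypothetical spine count in a Clos fabric

print(bandwidth_lost_fraction(hierarchical_core))  # 0.5 -> half the bandwidth gone
print(bandwidth_lost_fraction(clos_spines))        # 0.0625 -> small blast radius
```

Adding spine switches also shows the horizontal-scaling property: aggregate bandwidth grows linearly with spine count while the per-failure impact shrinks.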
The Twitter engineering team has learned a number of important lessons from the scaling of their network infrastructure:
- Architect beyond the original specifications and requirements, and make quick and bold changes if traffic trends toward the upper end of designed capacity.
- Rely on data and metrics to make the correct technical design decisions, and ensure those metrics are understandable to network operators. This is particularly important in hosted and cloud environments.
- There is no such thing as a “temporary change or workaround”. In most cases, workarounds are technical debt.
Storage and messaging represent 45% of Twitter’s infrastructure footprint, and the infrastructure team provides multiple services to internal users, including: Hadoop clusters, Twitter’s Manhattan (Apache Cassandra-esque) distributed database clusters, FlockDB graph storage clusters (utilising Gizzard and sharded MySQL), Blobstore clusters, Twemcache and ‘Nighthawk’ sharded Redis caching clusters, DistributedLog messaging clusters (for processing by Heron), and relational stores (MySQL, PostgreSQL and Vertica). In 2010, Cassandra was also added as a storage solution for metrics, but usage has been deprecated internally since the launch of Manhattan in April 2014.
Nearly all of Twitter’s main caching has been migrated from bare metal to their massive-scale deployment of the open source Apache Mesos cluster management system (Twitter is one of the lead contributors to the Mesos codebase). The primary exception to the deployment to Mesos is ‘Haplo’, the cache for Tweet timelines that is implemented using a customised version of Redis. Scale and performance are the primary challenges for caching as Twitter runs hundreds of clusters with an aggregate packet rate of 320M packets/s, delivering over 120GB/s to clients. In order to meet high-throughput and low-latency service level objectives (SLOs), engineers continually measure the performance of systems and look for efficiency optimizations. For example, Twitter engineers have created an open source cache performance testing tool, rpc-perf, and use this to get a better understanding of how cache systems perform under various load scenarios.
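The kind of measurement rpc-perf performs can be illustrated with a small sketch that drives requests through a cache client and reports latency percentiles. The `FakeCache` stand-in, request counts, and timings below are hypothetical illustrations, not rpc-perf's actual implementation:

```python
import random
import time

# Sketch of percentile-based cache latency measurement, in the spirit of
# tools like rpc-perf. FakeCache is a hypothetical stand-in for a real client.

class FakeCache:
    def get(self, key: str) -> str:
        time.sleep(random.uniform(0.0001, 0.001))  # simulated network round trip
        return f"value-for-{key}"

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile over collected latency samples."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, int(p / 100 * len(ranked)))
    return ranked[idx]

def run_load(cache: FakeCache, requests: int = 200) -> dict:
    latencies = []
    for i in range(requests):
        start = time.perf_counter()
        cache.get(f"key-{i % 50}")   # hot keyspace of 50 keys, for illustration
        latencies.append(time.perf_counter() - start)
    return {"p50": percentile(latencies, 50), "p99": percentile(latencies, 99)}

stats = run_load(FakeCache())
print(f"p50={stats['p50'] * 1000:.2f}ms p99={stats['p99'] * 1000:.2f}ms")
```

Measuring p99 rather than only the mean is what surfaces the tail-latency behaviour that matters for low-latency SLOs.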
Core lessons learned when implementing storage and caching included:
- Additional storage engines (LSM, b+tree etc.) have been adopted within Twitter's Manhattan distributed database in order to better serve specific traffic patterns, and storage layers have been protected from abuse by sending a back pressure signal and enabling query filtering.
- Focusing on providing the right tool for the job means legitimately understanding all possible use cases. A "one size fits all" solution rarely works; avoid building shortcuts for corner cases, as there is nothing more permanent than a temporary solution.
- Moving to Mesos was a "huge operational win" for Twitter, as this allowed the codification of infrastructure configurations and enabled programmatic deploys that preserved cache hit-rate and avoided causing issues within the persistent storage tier.
- Caches are logically partitioned by users, Tweets, timelines, etc. and in general, every cache cluster is tuned for a particular need.
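The back-pressure and query-filtering protections described above can be sketched as a thin wrapper in front of a storage engine. The threshold, the pending-work model, and the blocked-pattern rule here are hypothetical illustrations, not Manhattan's actual mechanism:

```python
from collections import deque

# Sketch: protecting a storage layer by filtering abusive query shapes and
# signalling back pressure when too much work is pending. All thresholds
# and filter rules are hypothetical.

class BackPressureError(Exception):
    """Signal to callers that they should slow down and retry later."""

class ProtectedStore:
    def __init__(self, max_pending: int = 100):
        self.pending = deque()
        self.max_pending = max_pending
        self.blocked_patterns = {"full_scan"}  # query shapes rejected outright

    def query(self, kind: str, key: str) -> str:
        if kind in self.blocked_patterns:
            raise ValueError(f"query kind {kind!r} is filtered")
        if len(self.pending) >= self.max_pending:
            raise BackPressureError("storage overloaded, back off")
        self.pending.append(key)
        try:
            return f"row-for-{key}"   # stand-in for the real storage engine
        finally:
            self.pending.popleft()

store = ProtectedStore()
print(store.query("point_get", "user:42"))  # served normally
```

Raising an explicit back-pressure signal, rather than silently queueing, lets well-behaved clients shed load before the persistent tier degrades.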
Twitter uses Puppet for all configuration management and post-kickstart package installation of their systems, and they have over 100 committers to this code per month, over 500 modules, and over 1,000 roles. There are three branches, which Puppet refers to as environments, and these enable controlled testing, canarying, and eventual promotion of changes to the production environment. The highest-impact changes to the growing Puppet codebase have been code linting, style-check hooks, documentation of best practices, and holding regular office hours:
- With over 100 Puppet committers throughout the organization, documenting internal and community best practices has been a "force multiplier".
- Having a single document to reference has improved the quality and speed at which code can be shipped.
- Holding regular office hours for assistance (sometimes by invite) has allowed for 1:1 help, whereas tickets and chat channels don’t offer high enough communication bandwidth or convey the larger picture of what is trying to be accomplished.
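The branch-per-environment workflow maps naturally onto Puppet's directory environments, where each environment directory tracks one branch of the module codebase. The layout and environment names below are assumed for illustration; the post does not name Twitter's actual branches:

```
# Hypothetical layout using Puppet directory environments;
# each directory tracks one Git branch of the Puppet codebase.
/etc/puppetlabs/code/environments/
├── testing/      # fast-moving changes, verified on test hosts
├── canary/       # promoted changes, run on a small slice of production
└── production/   # fully rolled-out configuration
```

Promoting a change is then a merge from one branch to the next, which keeps the testing-canary-production progression auditable in version control.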
Additional information on scaling the infrastructure that powers the Twitter platform can be found in the Twitter Engineering Blog post “The Infrastructure behind Twitter: Scale”. A complementary earlier blog post, “The infrastructure behind Twitter: Efficiency and Optimization”, focuses on the evolution of Twitter's private Platform as a Service (PaaS).