Interview with Raffi Krikorian on Twitter's Infrastructure
Raffi Krikorian, Vice President of Platform Engineering at Twitter, gives an insight on how Twitter prepares for unexpected traffic peaks and how system architecture is designed to support failure.
InfoQ: Hi, Raffi. Would you please introduce yourself to the audience and the readers of InfoQ?
Raffi: Sure. My name is Raffi Krikorian. I'm the Vice President of Platform Engineering at Twitter. We're the team that runs basically the backend infrastructure for all of Twitter.
InfoQ: With the help of "Castle in the Sky," Twitter created a new peak TPS record. How does Twitter deal with the unpredictable peak traffic?
Raffi: Sure. So what you're referring to is the Castle in the Sky event is what we call it internally, and that was a television show that aired in Tokyo. We did our new record of around 34,000 tweets a second came into Twitter during that event. Normally, Twitter experiences something on order of 5,000 to 10,000 tweets a second, so this is pretty far out of our standard operating bounds. I think it says a few things about us. I think it says how twitter reacts to the world at large, like things happen in the world and they get reflected on Twitter.
So the way that we end up preparing for something like this is really years of work beforehand, like this type of events could happen at any time without real notice. So what we end up doing is we do load tests against the Twitter infrastructure. We'd run those on the order of every month -- I don't know what the exact schedule is these days -- and then we do analyses of every single system at Twitter.
So when we build architecture and we build systems at Twitter, we look at the performance of all those systems on a weekly basis to really understand what do we think the theoretical capacity of this system looks like right now on a per service basis, and then we try to understand what the theoretical capacity looks like for overall. So from that, we can decide: (1) do we have the right number of machines in production at any given time or do we need to buy more computers? (2) we can have a rational conversation on whether or not the system is operating efficiently.
So if we have certain services, for example, that can only take half the number of requests a second as other services, we should go look at those and understand architecturally, are they performing correctly or do we need to make a change.
So for us, the architecture to get to something like the Castle in the Sky event is a slow evolutionary process. We make a change, we see how that change reacts and how that change behaves in the system, and we look and we make a decision on the slow rolling basis of whether or not this is acceptable to us, and we make a tradeoff, like do we buy more machinery or do we write new software in order to withstand this?
So while we never have experienced a Castle in the Sky-like event before, some of our load tests have pushed us to those limits before so we were comfortable knowing when it happened in real life. We're like "Yes, it actually worked."
InfoQ: Are there any emergency plans in Twitter? Will you guys do some practice at usual time, such as shut down some servers or switches?
Raffi: Yeah. So we do two different things basically as our emergency planning, maybe three, it depends how you look at it. Every system is carefully documented on what would it take to turn it on, what would it take to turn it off, so we have what we call runbooks for every single system so we understand what we would do in an emergency. We've already thought through the different types of failures. We don't believe we thought through everything, but at least the most common ones we think we've documented and we understand what we need to do.
Two, we're always running tests against production, so we have a good understanding of what the system would look like when we hit it really hard so we can practice. So like we hit it really hard, teams on call, they might get a page or something might happen or a pager might go off, so we can try to decide whether or not we do need to do something differently and how to react to that.
And third, we've taken some inspiration from Netflix. And Netflix has what they call their Chaos Monkey which proactively kills machines in production. We have something similar to that within Twitter so we can make sure that we didn't accidentally introduce a single point of failure somewhere. So we can randomly kill machines within the data center and make sure that the service doesn't see a blip while that's happening.
All this requires us to have really good transparency into what the success rate of all the different systems are. So we have a massive board. It's a glass wall with all these graphs on it so we can really see what’s going on within Twitter. And then when these events happen, we can see in an instant like whether or not something is changing, whether it would be traffic to Twitter or whether it's a failure within a data center so that we can react to it as quickly as we can.
InfoQ: How to isolate the broken module in the system? When something goes wrong, what's your reaction at the first moment?
Raffi: Yeah. So the way that Twitter is architected these days is that a failure should stay relatively constrained to the feature that the failure occurred in. Now, of course, the deeper you get down the stack, the bigger the problem becomes. So if our storage mechanisms all of a sudden have a problem, a bunch of different systems would exhibit a behavior of something going wrong. For example, if someone made a mistake on the website, it won't affect the API these days.
So the way that we know that something is going wrong again is just being able to see the different graphs of the system, and then we have alerts set up over different thresholds on a service-by-service basis. So if the success rate of the API fell below some number, a bunch of pagers immediately go off, there's always someone on call for every single service at Twitter and they can react to that as quickly as they can.
Our operations team and our network command center will also see this, and they might try some really rudimentary things the equivalent is like should we turn it off and on again and see what happens on that, while the actual software developers on a second track try to really understand what is going on and wrong with the system. So operations is trying to make sure the site comes back as quickly as it can. Software development is trying to understand what actually went wrong, and do we have a bug that we need to take care of.
So this is how we end up reacting to this. But, like I said, the architecture at Twitter keeps failure fairly well constrained. If we think it's going to propagate or we think that, for example, the social graph is having a problem, it's only being seen in this particular feature, the social graph team will then start immediately notifying everyone else just in case they should be on alert for something going wrong.
It is very much one of our strengths these days, I like to say jokingly, is emergency management, like what do you do in a case of a disaster because it could happen at any time and my contract to the world is that Twitter will be up so you don't have to worry about it.
InfoQ: The new architecture helps a lot in stability and performance. Could you give us a brief introduction of the new architecture?
Raffi: Sure. So when I joined Twitter a couple of years ago, we ran the system on what we call the monolithic codebase. So everything you had to do with the software Twitter was in one codebase that anyone could deploy, anyone could touch, anyone could modify. So that sounds great. In theory, that's actually excellent. It means that every developer in Twitter is empowered to do the right thing.
In practice however, there's a balancing act that developers then need to understand how everything actually works in order to make change. And in practical realities, the concern I would have is that the speed at which Twitter is writing new code, people don't give deep thought into the right – in just places that they haven't seen before. I think this is standard in the way developers write software. It's like I don't understand what I fully need to do to make this change, but if I change just this one line it probably gets the effect I want. I'm not saying that this is a bad behavior. It's a very prudent and expedient behavior. But this means that there is technical debt that's being built up when you do that.
So what we've done instead is we've taken this monolithic codebase and broken it up into hundreds of different services that comprise Twitter. This way we can have actual real owners for every single piece of business logic and every single piece of functionality at Twitter. There's actually a team who one of their jobs is to manage photos for Twitter. There's another team who one of their jobs is to manage the URLs for Twitter so that there are subject matter experts now throughout the company, and you could consult them when you want to make a feature change that would change something where URLs work, for example.
So since we've broken it up in all these different ways, we now have subject matter experts but this also allows things that we've spoken about: isolation for failure, also isolation for feature development. If you want to make a change to the way tweets work, you only have to change a certain number of systems. You don't have to change everything in Twitter anymore so we can have real good isolation both for failure and for development.
InfoQ: You have a system called Decider, what's the role of Decider in the system?
Raffi: Sure. So Decider is one of our runtime configuration mechanisms at Twitter. What I mean by that is that we can turn off features and software in Twitter without doing a deploy. So every single service at Twitter is constantly looking to the Decider system as to what are the current runtime values of Twitter right now. How that practically maps is I could say the discover homepage, for example, has a Decider value that wraps it, and that Decider value tells discover whether it's on or off right now.
So I can deploy discover into Twitter and have it deployed in the state that Decider says it should be off. So this point we don't get an inconsistent state. The discover, for example, or any feature at Twitter runs across many machines. It doesn't run on one machine, so you don't want to get in the inconsistent state where some of the machines have the feature and some of them don't. So we can deploy it off using Decider and then when is on all the machines that we want it to be on, we can turn it on atomically across the data center by flipping a Decider switch.
This also gives us the ability to do a percentage-based control. So I can say actually now that it's on all of the machines, I only want 50% of users to get it. I can actually make that decision as opposed to it being a side effect of the way that things are being deployed in Twitter. So this allows us to really have a runtime control over Twitter without having to push code. Pushing code is actually a dangerous thing, like the highest correlation to failure in a system like ours, not just Twitter but any big system, is software development error. So this way we can actually deploy software in a relatively safe way because it's off. Turn it on really slowly, purposely, make sure it's good and then ramp it up as fast as I want.
InfoQ: How does Twitter push code online? Would you please share the deployment process with us? For example, how many different stages, you choose daily pushing or weekly pushing or both?
Raffi: Sure. So Twitter deployment, because we have this services architecture, is really actually up to the control of every single individual team. So the onus is on the team to make sure that when they're deploying code everyone that could probably be affected by it should know that you're doing it, and the network control center should also know what you're doing just so they have a global view of the system. But it's really up to every single team to decide when and if they want to push.
On average, I would say teams have a bi or tri-weekly deploy schedule. Some teams deploy every single day; some teams only deploy once a month. But the deployment process looks about the same to everybody which is: you deploy into a developing environment. This is so developers can hack on it really quickly, make changes, look at the product manager, look at the designer, make sure it does the right thing. Then we deploy into what we call a "Canary system" within Twitter, which means that it's getting live production traffic but we don't rely on its results just yet. So it's just basically loading it off to make sure it handles it performantly, and we can look at the results that would have returned and compare it and manually inspect it to make sure that it did what we thought it would do given live traffic.
Our testing scenarios may not have covered all the different edge cases that the live traffic gets, so it's one way we learn to understand what the real testing scenarios should look like. Then after we go into Canary, then we deploy at Dark, then we slowly start to ramp it up to really understand what it looks like at the scale. And that ramp up could take anywhere from a day to a week actually, like we've had different products that we've ramped to a hundred in the course of a week or two. We've added different products that we've ramped up to 100% in the course of minutes.
So again, it's really up to the team. And each team is responsible for their feature, is responsible for their service. So it's their call on how they want to do it, but those stages of development, canary, dark reading, ramp up by Decider is the pattern that everyone follows.
InfoQ: There are huge amounts of data in Twitter. You must have some special infrastructure (such as Gizzard and Snowflake) and methods to store the data, even processing them in real-time.
Raffi: Yeah. So that's really two different questions I think. So there is how do we ingest all this data that's coming into Twitter because Twitter is a real-time system, like I measure the latency for a tweet to get delivered in milliseconds to Twitter users. And then there's the second question of you have a lot of data. What do we do with all that data?
So the first one you're right; we have systems like Snowflake, Gizzard and things like that to handle tweet ingest. Tweets are only one piece of data that comes into Twitter, obviously. We have things like favorites. We have retweets. We have people sending direct messages. People change their avatar images, their background images and things like that. So all people click on URLs; people load web pages. These are all events that are coming into Twitter.
So we begin to ingest all this and log them so we can do analysis. It's a pretty hard thing. We actually have different SLAs depending on what kind of data comes in. So tweets, we measure that in milliseconds. In order to get around database locking, for example, we developed Snowflake that can generate unique IDs for us incredibly quickly and do it decentralized so that we don't have a single point of failure in generating IDs for us.
We have Gizzard which handles data flowing in and sharding it as quickly as possible so that we don't have hot spots on different clusters in the system, like it actually tries to probabilistically spread the load so that databases don't get overloaded by the amount of data coming in. Again, tweets go through very fast on a system.
Logs, for example, like people are clicking on things, people view tweets, have their SLA measured in minutes as opposed to milliseconds. So those go into completely different pipeline. Most of it is based around Scribe these days. So those just slowly trickle through, get aggregated, get collected and get jumped into HDFS so we can do a later analysis of them.
For long-term retention, all of the data, whether it be real-time or not, ends up in HDFS and that's where we run massive like MapReduce jobs and Hadoop jobs to really understand what's going on in the system.
So we try to achieve a balance of what needs to be taken care of right now especially given the onslaught of data we have and then where do we put things because this unclogged data accumulates very fast. Like if I'm generating 400 million tweets a day and Twitter has been running for a couple of years now, you can imagine the size of our corpus. So HDFS handles all that for us so then we can run these big mass of MapReduce jobs off them.
InfoQ: Twitter is an amazing place for engineers, what's the growing path of an engineer in Twitter? Especially, how to become a successful geek like you, would you please give us some advice?
Raffi: Well, I can't say I'm a successful engineer since I don't write software anymore these days. I started at Twitter as an engineer, and I've risen into this position of running a lot of engineering these days.
Twitter has a couple of different philosophies and mentalities around it, but we have a career path for engineers which basically involves tackling harder and harder and harder problems. We would like to say that it doesn’t actually matter how well the feature you built does. In some cases, it does. But really like what's the level of technical thought and technical merit you've put into the project you work on.
So growth through Twitter is done very much in a peer-based mechanism. So for example, to talk very concretely about promotions. To be promoted from one level to the next level at Twitter requires consensus -- not consensus but requires a bunch of engineers at that higher level to agree that yes, you've done the work needed in order to get to this level at Twitter.
To help with that, managers make sure projects go to engineers that are looking for big challenges. Engineers can move between teams. They're not stuck on the tweet team, for example, or a timeline team. If an engineer says, "I want to work on the mobile team because that's interesting. I think there's career growth for me. In fact, my job as a person that manages a lot of this is to make that possible." So you can do almost whatever you want within Twitter. I tell you what my priorities are in running engineering and what the company's priorities are in either user growth or money or features you want to build. And then engineers should flow to the projects that they think they can make the biggest impact on.
And then on top of that, I run a small university within Twitter that we call Twitter University. It's a group of people whose whole job is training. So for example, if an engineer wants to join the mobile team but they are a back-end Java developer, we're like "Great. We've created a training class so you can learn Android engineering or iOS engineering and you can take a one weeklong class that will get you to the place that you've committed to that codebase and then you can join that team for real." So this gives you a way to sort of expand your horizons within Twitter and a way to safely decide whether or not you want to go and try something new.
So we invest in our engineers because honestly they're the backbone of the company. The engineers build the thing that we all thrive on within Twitter and the world uses, so I give them as many opportunities as I can in order to try different things and to geek out in lots of different ways.
About the Interviewee
Raffi Krikorian is Vice President of Platform Engineering at Twitter. His teams manage the business logic, scalable delivery, APIs, and authentication of Twitter's application. His group helped create the iOS 5 Twitter integration as well as the "The X Factor" + Twitter voting mechanism.
Anatole Tresch Mar 03, 2015