Bio: Serkan Piantino is an engineering manager at Facebook and is the head of Facebook’s new engineering office in New York City. He manages the teams that power Facebook Messages and Chat. His previous experience includes building the infrastructure and ranking algorithms for News Feed, managing development of new versions of the Facebook home page, and leading development of Timeline.
Software is changing the world, and QCon aims to empower software development by facilitating the spread of knowledge and innovation in the enterprise software development community. To achieve this, QCon is organized as a practitioner-driven conference designed for people influencing innovation in their teams: team leads, architects, project managers, and engineering directors.
Sure, so as you said, my name is Serkan Piantino. I work at Facebook; I am a software engineer. Today I’ll be talking about some of the work we did in building the infrastructure that powers News Feed, and most recently I’ve been heading up the opening of our New York engineering office here in the city.
Yes, something like that.
3. I’m used to dealing with large-scale systems that have maybe a few million users, in those low numbers. When you plan and work on something as large-scale as a Facebook feature, what kind of numbers do you use for reference?
So I think it’s interesting: with the fundamental design of things, even a million users can be hard to scale, because you face the same problems. You can’t do a lot of things on one host; you can’t join tables or do some of the other things that people can do at smaller scales.
So even a million users, depending on how active they are and how difficult the types of queries you’re doing are, can be a lot.
The thing that I think gets very different when you’re talking about massive scale is the value of efficiency: how much it’s worth to shave a couple of milliseconds off a query, or how much you have to think about the real nitty-gritty of how everything is going to work; and that’s basically the way we design systems. These days, what we’re looking at are the basic timings you get on hardware: it’s a couple of nanoseconds to access RAM; it’s 8 milliseconds to move a disk head somewhere.
You know, all those fundamental parameters of the servers you’re going to be getting and the hardware that you have. We try to use those to figure out, given the different things the servers are going to do and the different configurations, how many of them we will need.
And the goal with our software is to keep it almost out of the way, so that we’re actually getting the performance of the underlying hardware; that’s basically where we start.
4. Makes sense. Are you, like any other traditional large-scale system, trying to minimize disk seeks and taking RAM for granted; or do you think about RAM access timings as something very important too?
So typically, yeah, RAM is orders of magnitude faster than moving a drive head, so it is basically free.
Interestingly, though, News Feed is mostly in RAM, and so we spend a lot of time thinking about things like page tables, cache entries, even NUMA architectures, if people are familiar with that; so News Feed is one of the rare cases where we’re squeezing performance out of silicon and not so much out of spinning disks. When you add spinning disks to the equation, they kind of drown out all the other numbers, and the number of seeks ends up being the only thing that you really think about.
It’s definitely not exponential; it definitely kind of flattens out. I mean, when we were first designing the current version of News Feed back in 2007, we were designing it for, I think, 100 million users, and at the time I think we had 25 million; so we knew that we were going to be scaling.
And since then, it’s tripled in query volume every year. But there’s not a lot of engineering that goes into that: we built a system that’s fundamentally scalable, and it has scaled, and that’s kind of what’s great about it.
So I definitely think it diminishes in terms of difficulty; once you get past 100 million users, you’re kind of already there.
As I said, efficiency becomes more important, in that the gains, in raw dollars, of making something more efficient get higher; and also more of your problems move out of the class I call ‘episodic problems’ (a server goes down, you have to deal with it, and maybe you didn’t have a plan because it had never happened before) and into the realm of constant problems: when you have tiers of thousands of machines, servers are always going down, you’re always having network issues, pieces of infrastructure are always failing.
And so they become steady-state errors: things you’re facing day-to-day that you can track better.
And it’s actually easier to fix the steady-state stuff that you can track and know than it is to fix the stuff that happens once in a while, because you don’t get a lot of data, and it’s usually a fire drill when it actually happens.
So in some subtle ways, it can actually get easier over time as you scale things out; if the core of your stuff can scale to some large number, it can probably scale to some really large number.
6. Do you design for constant failure today? Do you treat it as a kind of positive feedback, where a server going down is just a normal thing that happens all the time at very large numbers of servers?
Absolutely; I mean, that’s probably half of the equation. First, you figure out how things are going to work, and whether that plan is going to be able to scale and replicate and whether that’s going to work; and then the second thing is: what are we going to do when stuff goes wrong?
So the failure scenarios: not only thinking about what one machine should do in the failure case, or what one type of request should do in the failure case, but thinking about all of the emergent behavior. If you start having some network spottiness and retries exacerbate that problem, pretty soon you’re overwhelming switches.
You have all these things that emerge from the local behavior of one server, and you need to think about those: what’s desirable, and how you can get the behavior you want out of the code you’ve written, basically.
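One common way to damp the retry-storm behavior described here is capped exponential backoff with jitter, so that clients that all failed at the same instant spread their retries out instead of retrying in lockstep. This is a sketch of the general technique, not anything Facebook-specific:

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 0.05,
                   cap: float = 2.0) -> list[float]:
    """Capped exponential backoff with 'full jitter': attempt n waits a
    random time in [0, min(cap, base * 2**n)]. The randomness keeps
    thousands of clients that failed together from hammering a recovering
    switch or server in synchronized waves."""
    return [random.uniform(0, min(cap, base * (2 ** attempt)))
            for attempt in range(max_retries)]
```

In a real client, each delay would be slept before the corresponding retry, and the loop would stop as soon as a request succeeds.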
I think the biggest choice we made was the first one, which was to put everything in memory and query it all live; and that was a great decision.
There are certainly things we’ve innovated along the way that made our jobs a lot easier, and some of that I’ll be talking about in my talk. There’s not a ton that we totally scrapped. It’s surprisingly not that much code, and while we’ve changed it a number of times, it pretty closely resembles what it used to look like; so I think we did a pretty good job.
8. You talk a lot about memory. Do you think the computers that will run the system when you’re at several billion users will have only RAM and no disk? Are they going to look fundamentally different in ten years to support this kind of large-scale system?
Definitely. I mean, we’re even seeing in the consumer market that SSDs are taking over, and they have drastically different problems to solve to be effective and drastically different performance characteristics; and in the server market you have the same thing: flash that doesn’t have this random-seek problem; it’s very fast; it comes in huge sizes; and the price is decreasing; so that’s going to be exciting.
You’re probably also going to have things moving to be more and more parallel, so concurrency is probably going to be something that people are talking about not just at the billion-user scale but all the way down to much earlier stages.
Surprisingly, the most expensive thing we have is powering our servers; it’s a lot of megawatts, or whatever we use. So I think efficiency is going to be another big part of it; but then, you know, I’m sure there will be a ton of innovation that we can’t predict.
I hope that’s the case. I hope we solve some of the energy problems we have in the country and in the world. Maybe that’s it; maybe efficiency becomes less important. Actually, that’s not true: even if power gets cheaper, cooling is never going to be that efficient; it’s a basic law of thermodynamics. And even putting the pocketbook aside, it always pays to make things more efficient: you can make them smaller; you can put them in more places; so I think that would be the trend regardless.
Yeah, I think there are a few things we do that are important. First of all, we have a set of tools, a growing but usually pretty small set, that we consider trusted: we will use these; we know exactly what they do; we can count on them.
So we’re sort of hesitant to deploy brand-new technologies, because there are plenty of ways that can go wrong; that’s the first thing. The second thing is how important it is to be able to answer simple questions about your service, even at smaller scales: if you can’t readily plot a histogram of latencies for your queries, if you can’t understand what your memory is being used by, if you can’t get cache hit rates.
These are simple metrics that you need to be able not just to pull up but to know off the top of your head; and often, when you work at that scale, they are leading indicators of things going wrong. You know, you have retry mechanisms and …
So anyway, these things are really important: make sure you have great monitoring and great graphs and that everything is real-time; that’s going to help as you scale and help you reason about the direction to go.
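The kinds of always-available metrics being described (latency histograms, cache hit rates) can be kept with very simple counters. This sketch uses log-scale buckets as an illustrative design choice; it is not a description of Facebook’s actual monitoring stack:

```python
import math
from collections import Counter

class LatencyHistogram:
    """Log-bucketed latency histogram: bucket 0 holds samples under 1 ms,
    bucket 1 under 2 ms, bucket 2 under 4 ms, doubling each time."""
    def __init__(self) -> None:
        self.buckets: Counter = Counter()
        self.count = 0

    def record(self, latency_ms: float) -> None:
        bucket = 0 if latency_ms < 1 else int(math.log2(latency_ms)) + 1
        self.buckets[bucket] += 1
        self.count += 1

class HitRate:
    """Running cache hit rate, cheap enough to keep one per tier."""
    def __init__(self) -> None:
        self.hits = 0
        self.total = 0

    def record(self, hit: bool) -> None:
        self.total += 1
        self.hits += int(hit)

    def rate(self) -> float:
        return self.hits / self.total if self.total else 0.0
```

The point of keeping the counters this cheap is that they can run on every request in production, so the histogram and hit rate are always current rather than computed after the fact.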
The third thing I will say, and it’s often overlooked when people build systems, is that you need to optimize not just for the number of servers you have or how long things will take, but also for the developer experience.
If you can build something where a developer can take your code, make a change to it, and immediately see the effect, without having to kick off some MapReduce job to recalculate everything, it’s just such a win for the overall company that it ends up being almost more important than the number of servers you save.