Transcript
Yue: I'm going to talk to you about a type of platform and maybe not the most common kind you run into. This is a case study of how we tried to solve a problem but ended up accidentally almost doing platform engineering. This is about my tenure at Twitter. I was at Twitter for almost 12 years, so this is mostly the second half of it. Now I'm no longer at Twitter.
Debugging a Generator (Charles Steinmetz)
I want to start with a story. I grew up in China, so I read this story in Chinese. There's this wizard of an electrical engineer and scientist, Charles Steinmetz. He was originally German, and when he came to the U.S. he became a wizard figure in his local town. One of the most famous anecdotes about him is that he was asked by Henry Ford to debug a problem with a generator. He said, I don't need tools, give me some pen and paper. He just sat there and thought. On the second night, he asked for a ladder, climbed up the generator, made a chalk mark on its side, and said, "Replace the plate here, and also sixteen turns of coil." The rest of the engineers did as he said, and the generator performed to perfection. Later, he sent a bill to Ford asking for $10,000. I looked it up: $10,000 back then is about $200,000 in today's purchasing power. For two days' worth of work, that's not so bad. Ford was a little bit taken aback and asked, why are you asking for so much money? This was his explanation: making the chalk mark, $1; knowing where to make the chalk mark, the rest of it. Ever since then, whenever I see performance engineering stories, usually on Hacker News, this is usually how it goes: here is a core loop, and we changed that one line in this loop, and suddenly our performance went up by 300%. It never fails to make the front page, and everybody has a great discussion about what this strange AVX vectorized instruction is, or how you fit things into L1 or L2 cache. Everybody thinks, if only I knew where to find those one-line changes, I could be Hacker News famous, and I could make a lot of impact. That tends to be the popular conception of what performance engineering is: you know exactly where to look, you make that magical touch, and then things automatically get better. Is that true? Should we basically hire a room of wizards, set them loose, and have them go around holding chalk and marking all over the place?
Background
I was actually at QCon, speaking at QCon SF in 2016. Back then I was talking about caching. I spent my first half at Twitter mostly managing and developing caches. I weathered a dozen user-facing incidents, and I spent lots of time digging in the trenches. Shortly after giving that talk, in 2017, I founded the performance team at Twitter, called IOP. That went on until last year. Generally, I like systems, especially distributed systems, and I like software and hardware integration. Performance is an area that allows me to play with all of them. That is what I have spent my time working on.
Why We Need Performance Engineering (More than Ever)
I want to talk to you a little bit about performance engineering. If you want to start pivoting into performance engineering, this is the year to do it. How many companies represented here have issued some cost-effectiveness or efficiency mandate? This is one reason to care about performance. If that's on your mind, here's what you can do. Number one, I want to start by saying a little bit about why we actually need performance engineering now, more so than ever. There was a wonderful talk by Emery Berger, who is a professor at the University of Massachusetts. In his talk, called "Performance Matters," he pointed out that performance engineering used to be easy because computers got faster about every 18 months. If you're a performance engineer, you just sit on the beach, drink your cocktail, and buy the next batch of machines in 18 months. I'm like, that's wrong. If it's that easy, then you don't need it. In fact, what happens is that people don't really hire performance engineers until they're extremely large companies; the rest of the engineering department just sits on the beach, drinks their cocktails, waits 18 months, and then buys the next generation of computers. Now the moment has finally come when you actually have to put in work. Why is that? This is a slide about all kinds of hardware features. It's intentionally busy, starting from things like NUMA nodes and PCIe accelerators like GPUs and other devices, which have been on the market for at least 15 years. There are all these terms that represent certain kinds of specialized technologies that you may or may not have heard of. There's CXL, there's vectorization, programmable NICs, RDMA, whatever. The point is, there's a lot going on in hardware engineering, because we have all these pesky laws of physics getting in the way. There's the power cap, there's the thermal cap. You cannot just make a computer run faster, make the same simple things run faster, which would be the best. Instead, you do all these tricks. The heterogeneous evolution of the hardware means it's getting really difficult to write a good program, or the right kind of program, to actually take advantage of it. If you think programming a many-core computer is difficult, just wait until you have to learn 20 different new technologies to write a simple program. It's really getting out of control.
On the other hand, we software engineers are not doing ourselves any favors. If you look at a modern application, it's highly complex. On the left is a joke from XKCD, but it's very much true: you have this impossibly tall stack, nobody knows what your dependency tree really is, and nobody really wants to know. If you zoom out and say, let me look at my production, your production is not even one of those tall stacks, your production is a bunch of those tall stacks. No matter how complex a graph you draw to represent the relationships between the services, reality is usually much worse. On the bottom right is a graph that is actually quite old by now, a 5-year-old graph from when Twitter was simpler. This is the call graph between all services. You can see the number of edges and how they connect with each other. It's an enormous headache. The problem here is that things are complex. If Steinmetz were alive today and had to solve the software equivalent of a generator problem, I think what he would be facing is these 6-foot-tall generators packed into a 40-foot-long corridor, and then an entire warehouse of them. That's not the end: you have three such warehouses connected to each other, and to generate any electricity at all, you have to power up all three rooms, and they are constantly bustling. When you have that level of complexity, basically, there is no room for magic, because what good is magic against systems of that complexity? The answer is no, we cannot do performance engineering purely by relying on knowing everything and being able to hold the state in our heads, because reality is far more challenging than that.
The answer, basically, is treating complexity with the tools that are designed to handle complexity. One of the languages for complexity is systems. This is something that has been mentioned over and over again. I think systems thinking is not mythical: a system is essentially a lot of things, and those things are connected. Systems thinking is essentially building a model to describe the relationships between the different parts. If you see the fundamental reality through that lens, then what is performance? Performance is merely a counting exercise on top of those relationships. You're counting how many CPUs, how much memory bandwidth, or how many disks you have. You are counting how often they are used, and to what extent they are used. The key here is that you need to have the system model in place, so you have the basic structure. Then we need to count these resources at the right granularity. If you care about request latency, for example, and request latency tends to be on the level of milliseconds, then you need to count at those time granularities. Otherwise, the utilization would not reflect what the request is experiencing. All of these things will change over time, and therefore this exercise has to go on continuously.
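To make the granularity point concrete, here is a minimal sketch, with made-up numbers, of how a 50-millisecond spike at 100% CPU is invisible when you average over a 10-second window but obvious at a 10-millisecond one:

```python
# Illustrative sketch: a 50 ms CPU spike is washed out at a 10 s
# sampling interval but obvious at a 10 ms one. All numbers are made up.

spike_start_ms, spike_len_ms = 4_000, 50   # one 50 ms burst at 100% CPU
baseline_util = 0.10                       # 10% CPU the rest of the time

def cpu_util(ms):
    """Instantaneous utilization at a given millisecond."""
    in_spike = spike_start_ms <= ms < spike_start_ms + spike_len_ms
    return 1.0 if in_spike else baseline_util

def sample(window_ms, total_ms=10_000):
    """Average utilization over consecutive windows of window_ms."""
    out = []
    for start in range(0, total_ms, window_ms):
        vals = [cpu_util(ms) for ms in range(start, start + window_ms)]
        out.append(sum(vals) / len(vals))
    return out

coarse = sample(window_ms=10_000)  # one 10 s sample: ~0.105, spike invisible
fine = sample(window_ms=10)        # 10 ms samples: several windows read 1.0
print(f"coarse max: {max(coarse):.3f}, fine max: {max(fine):.3f}")
```

The coarse reading says the machine is about 10% busy; only the fine-grained samples show that, for 50 milliseconds, requests saw a fully saturated CPU.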
Performance as a system property allows us to draw parallels with other system-level properties, which I think are fairly well understood. For example, there's security. The intuitive way of thinking about security is that the weakest link determines the overall system security. It's like an AND relationship: anything that fails means the whole system fails. Availability, on the other hand, is much more forgiving: if any part of the system works, the system is available. Performance is somewhere in the middle. Performance can loosely be seen as the sum of all parts. You add up the cost of your parts, or you add up the latency of all your steps. This means that when we approach performance, we can often trade off. We can say a dollar saved is a dollar saved; it doesn't matter whether it's saved here or there, because in the end they all add up together. Performance also links to some higher-level properties, or business-level concepts that we care about. Businesses generally care about reliability; they want their service to be available. Performance is a multiplier of that: better performance means you can handle more with the same resources. Cost, on the other hand, is inversely related to performance: if performance is lower, then you have to spend more money to get the same kind of availability. Through these links, performance ties itself to the overall business objectives.
Performance engineering implies that things can be made better, in particular, that performance can be made better. Can we make that claim? I really like the book called "Understanding Software Dynamics." What it establishes in the first chapter is that there is a limit. There is a baseline: how long should this take? For example, you can never violate the laws of physics. You cannot run faster than what the CPU can execute according to its clock. If there's a delay because of PCIe, you cannot transfer data any faster than the physical limitation of those links. All of these boundaries and constraints tell us what you can expect if you do everything right. The other aspect is predictability. If you keep measuring things over and over, over time you get a distribution. You can see how bad things are at the tail, how bad things are in the worst cases. Those tend to have an effect when you have a very large infrastructure. You can also understand the consistency: if you measure over time, this is the behavior you can characterize in aggregate. The TL;DR is this: if you design something really well for performance, good performance engineers and good software engineers tend to converge on the same design, or designs that look similar. If you know what good performance looks like, you can measure what you actually get. If what you actually get is different from what is good, that delta is the room for optimization. Because such a limit exists and because we know how to measure against it, we can do optimization. Another thing I want to mention is that there is a structure to performance engineering. Just like building a house, the most important thing is getting the structure right. After that you want to do the plumbing, because once you put the drywall on, you cannot go back and change the plumbing very easily. There's an analogy here: if you design a program, how you communicate between different threads and between different parts of your program is really important, because those decisions are hard to change. The kinds of ideas we associate with performance engineering anecdotes, like changing an inner loop, are actually the least important, because those are the easiest to change. You can always change one line of code very quickly, but you cannot change the structure of a program. When you do performance engineering, focus on the things that cannot be changed very easily or quickly. Then, eventually, you get to those local optimizations.
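To illustrate the "how long should this take" exercise, here is a back-of-the-envelope sketch; the bandwidth figure and the measured time are assumptions for illustration, not numbers from the talk:

```python
# Back-of-the-envelope lower bound for scanning data in memory.
# All hardware numbers below are assumed, illustrative values.

data_bytes = 1 * 1024**3          # 1 GiB to scan
mem_bandwidth = 25 * 1024**3      # assume ~25 GiB/s of usable memory bandwidth

best_case_s = data_bytes / mem_bandwidth  # ~0.04 s if purely memory-bound
measured_s = 0.40                         # what we (hypothetically) observed

headroom = measured_s / best_case_s
print(f"lower bound: {best_case_s * 1e3:.0f} ms, measured: {measured_s * 1e3:.0f} ms")
print(f"roughly {headroom:.0f}x away from the speed-of-light estimate")
```

If the measurement is ten times the physical lower bound, that gap is the room for optimization; if it is already close to the bound, the structure of the program, not the inner loop, is what has to change.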
Performance Engineering at Scale
Now we say we know how to do performance engineering, at least in concept. This talk is about how to do performance engineering at Twitter scale. How do we do it at scale? We basically follow the blueprint: we want to build a model of the system, and we want to do counting on that system. What actually happens when we build that model and do the counting at scale is that we translate what would otherwise be a reliability or systems wizard looking into an isolated piece of code into something else. If you want to do it at scale, it gets translated into a data engineering problem. Let's use this as an example. Think about modern software as a local runtime. Software engineers are very good at abstraction. We have all these layers: we have our application code, which sits on top of a bunch of libraries; it may have a runtime like the JVM; then we have the operating system; underneath it we have the hardware and then even the network. All these layers of abstraction provide the structure of how we think about software in terms of hierarchy. Once we have this mental model in mind, we can collect metrics, because in the end, what we care about is how resources are used. You can count these things: at the bottom you have your hardware resources, but they get packaged into all these kernel-level syscalls and other low-level functionality. Then you can count those to get higher-level units you can tally, and so on and so forth until you get all the way to the top of the application.
What we did is essentially say, these are the data we need, and these are the data that apply universally. Now we're essentially in the domain of data engineering. In data engineering, you care about two things. One is where you get the data and what data you're getting. Number two is, what are you doing with that data? How are you treating it? That's exactly what we did. When it comes to data generation, or signal generation, remember, performance is about counting resources at the right granularity. One thing that is unique about performance is that the granularity performance looks at is much finer, generally speaking, than general observability. If you get metrics once every minute, or once every 10 seconds, that's considered a fairly ok interval. A lot of the things about performance happen on the level of microseconds or milliseconds. Often, you need to collect signals at those levels. If you have a spike that lasts only 50 milliseconds, you want to be able to see it. That required us to write our own samplers that give us all the low-level telemetry at very high frequency and very low overhead. We heavily used eBPF, which has low overhead and also lets us look into the guts of the layers of abstraction and get pretty much any visibility we want. We also have a project called long-term metrics, which synthesizes pretty much all levels of metrics. You can correlate request-response latency with very low-level resource utilization. You can see, was this request slowed down due to a spike in CPU usage, or something else?
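Rezolus itself is not shown here, but as a rough sketch of what a high-frequency sampler might do, here is a Linux-only example that reads /proc/stat every 10 milliseconds and reports the per-second maximum utilization, so a short spike survives aggregation (the intervals and reporting period are illustrative assumptions):

```python
# Rough sketch of a high-frequency CPU sampler (not Rezolus itself):
# read /proc/stat every 10 ms and keep the per-second *maximum* utilization,
# so a 50 ms spike still shows up after aggregation. Linux-only.
import time

def read_cpu_times():
    """Return (busy_jiffies, total_jiffies) from the aggregate 'cpu' line."""
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]   # idle + iowait
    total = sum(fields)
    return total - idle, total

def sample_loop(interval_s=0.01, report_every=100):
    prev_busy, prev_total = read_cpu_times()
    window_max = 0.0
    for i in range(report_every * 5):      # run for roughly 5 seconds
        time.sleep(interval_s)
        busy, total = read_cpu_times()
        dt = total - prev_total
        if dt > 0:
            window_max = max(window_max, (busy - prev_busy) / dt)
        prev_busy, prev_total = busy, total
        if (i + 1) % report_every == 0:     # once per second
            print(f"max 10 ms CPU utilization this second: {window_max:.2%}")
            window_max = 0.0

if __name__ == "__main__":
    sample_loop()
```

A production sampler does this for many more sources, at lower overhead, and exports histograms rather than printing, but the idea is the same: sample fast, aggregate in a way that preserves the extremes.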
We are a performance team; we're not explicitly a platform team. The point of building all this data infrastructure is to solve real problems. One of the ways we use our data is that we looked at everybody's GC and concluded what the right distribution of GC intervals is and how often GC should be running. Then we generated a set of instructions and a UI that tell people how to tune their GC automatically. It computes the settings for everybody, and you can see the result. We also have fleet health reports showing how many bad hosts are becoming outliers and slowing everything down. We also have utilization reports telling people where their utilization is, where it should be, and how much room they have to improve. One of the unintended uses is that people started using our datasets for things that may or may not have anything to do with performance engineering. Capacity engineers decided they wanted to validate their numbers using the data we have. Service owners saw what we do when it comes to optimizing certain services, and they said, we know how to do this, it's a simple query, we can copy yours. They started doing it on their own and actually saved quite a bit in some cases. Just because we have the data in a queryable, SQL-ready state, several teams started to prefer the data that we produce, because it's just easier to use. That is localized data engineering for performance. We took a bit of a shortcut here, as you can see: we did not really work out the mapping between these layers. That's because Twitter applications tend to be very homogeneous. We have libraries like Finagle and Finatra which allow us to cut corners, because all applications actually look alike. If you want to understand a heterogeneous set of services, you may need to do things like tracing and profiling, so the relationships between the different layers become more obvious.
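As a simplified sketch of the GC-interval reasoning described above (the formula, target interval, and numbers are illustrative assumptions, not Twitter's actual tuning rules): for a generational collector, the minor-GC interval is roughly the young-generation size divided by the allocation rate, which is already enough to suggest a setting.

```python
# Simplified sketch of GC-interval reasoning for a generational collector:
# minor GCs fire roughly every (young-gen size / allocation rate).
# The target interval and observed numbers below are illustrative assumptions.

def suggest_young_gen_mb(alloc_rate_mb_per_s, target_gc_interval_s):
    """Young-gen size needed so minor GC runs about once per target interval."""
    return alloc_rate_mb_per_s * target_gc_interval_s

observed_alloc_rate = 200.0   # MB/s, e.g. derived from GC logs or metrics
current_young_gen = 512.0     # MB, the service's current setting

current_interval = current_young_gen / observed_alloc_rate        # ~2.6 s
suggested = suggest_young_gen_mb(observed_alloc_rate, target_gc_interval_s=10.0)

print(f"current minor-GC interval: ~{current_interval:.1f} s")
print(f"suggested young gen for a ~10 s interval: {suggested:.0f} MB")
```

Once the observed allocation rates sit in a queryable dataset, a report like this can be generated for every service automatically, which is the kind of instruction-plus-UI output described above.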
Then we zoom out. We have all these local instances of applications, but they don't tell the end-to-end story. When you have a very complex application with relationships between services, we also need to understand how they talk to each other. This is often the domain of tracing, or just distributed systems understanding. Here we take the same approach: we ask, what is the system here? The system here is that you have all these services, and they talk to each other. When they call each other, that's when they are connected. We capture these edges, and then, same thing, we do counting on top of that. Counting here comes in very different flavors. You can do counting with time: how much time is spent on this edge? You can do counting on resources: how many bytes, or what kinds of attributes are propagated. Anything you can think of can be applied to the edge. The most important thing is the structure, which is this tree-like graph on top of which all the interactions happen. We did a very similar data treatment. First, we said, let's improve the signal quality. Twitter started the Zipkin project, which later fed into what became the OpenTelemetry standard. All of Twitter's services came with tracing. Tracing data is particularly fraught with all kinds of data problems. Some of them are inevitable, like clock drift. Some are just due to the fact that everybody can do whatever they want, so you end up with issues like a missing field or garbage data in a field. All of those require careful validation and data quality engineering to iron out. It took numerous attempts to get to a state where we could trust the data. On top of that, we could then build a very powerful trace aggregation pipeline. What it does is collect all the traces and, instead of looking at them one by one, because nobody can ever enumerate them, put them together and build indices. Indices really are the superpower when it comes to databases. We have indices of the traces themselves. We have indices of the edges. We have indices of the mapping between the trees and the edges. All of this data is queryable with SQL.
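As a minimal sketch of what aggregating traces into SQL-queryable edge indices can look like (the span schema, service names, and numbers are hypothetical; a real pipeline also has to handle clock drift, missing fields, and enormous volume):

```python
# Minimal sketch of trace aggregation: derive service-to-service edges from
# spans and index them in SQL. Schema, services, and numbers are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spans (
    trace_id TEXT, span_id TEXT, parent_id TEXT,
    service TEXT, duration_ms REAL
);
INSERT INTO spans VALUES
    ('t1', 'a', NULL, 'timeline', 120.0),
    ('t1', 'b', 'a',  'user',      30.0),
    ('t1', 'c', 'a',  'tweets',    70.0),
    ('t1', 'd', 'c',  'cache',      5.0),
    ('t2', 'e', NULL, 'timeline',  95.0),
    ('t2', 'f', 'e',  'tweets',    60.0);

-- Edge index: one row per (caller, callee) with call counts and latency stats.
CREATE TABLE edges AS
SELECT parent.service            AS caller,
       child.service             AS callee,
       COUNT(*)                  AS calls,
       AVG(child.duration_ms)    AS avg_ms
FROM spans child
JOIN spans parent
  ON child.trace_id = parent.trace_id AND child.parent_id = parent.span_id
GROUP BY parent.service, child.service;
""")

for row in conn.execute("SELECT * FROM edges ORDER BY calls DESC"):
    print(row)
```

Once the edges exist as a table, questions like "who calls this service, and how often per top-level request" become one query instead of a manual trawl through individual traces.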
What we were able to do are things like: how do services depend on each other? This is a screenshot of what we call a service dependency explorer. It tells you who calls whom to execute a particular high-level request: if you're visiting the Twitter home timeline, how many services are called along the way, and for each call at the highest level, how many downstream calls are necessary? You instantly get an idea of both the connectivity and the load amplification between services. We also built a model, which turned into a paper called LatenSeer, that essentially allows you to do causal reasoning about who is responsible for the latency you're seeing at the high level. The latency propagation, or latency critical path, is the product. What this allows you to do is ask, what if? What if I migrate this service to a different data center? What if I make it 10% faster, does it make a difference overall? All these properties can be studied. We did lots of analysis on demand, because aggregated tracing can often answer questions that no other dataset can give you answers for. One unintended use of this dataset: the data privacy team wanted to know what sensitive information is in the system and who has access to it. They realized the only thing they were missing was how services connect to each other. The dataset that we built mostly to understand how performance propagates through the system ended up being the perfect common ground. They threw away all our counting and replaced it with the property they care about, but essentially they kept the structure of the system in place. This allowed them to do their privacy engineering with very little upfront effort.
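LatenSeer itself is far more sophisticated, but a toy sketch of the what-if idea might look like the following, assuming each service calls its children in parallel and then adds its own work; the call tree and numbers are hypothetical:

```python
# Toy sketch of "what if" latency reasoning on a call tree (not LatenSeer).
# Assumes each service calls its children in parallel, then adds its own work.
# Service names and numbers are hypothetical.

TREE = {
    "timeline": {"own_ms": 20, "children": ["user", "tweets"]},
    "user":     {"own_ms": 30, "children": []},
    "tweets":   {"own_ms": 50, "children": ["cache"]},
    "cache":    {"own_ms": 5,  "children": []},
}

def end_to_end(service, speedup=None):
    """Latency of a node = its own time (scaled by any speedup) + slowest child."""
    speedup = speedup or {}
    node = TREE[service]
    own = node["own_ms"] * speedup.get(service, 1.0)
    children = [end_to_end(c, speedup) for c in node["children"]]
    return own + (max(children) if children else 0.0)

baseline = end_to_end("timeline")                  # 20 + max(30, 55) = 75 ms
what_if = end_to_end("timeline", {"user": 0.5})    # make the user service 2x faster
print(f"baseline: {baseline} ms, with user 2x faster: {what_if} ms")
# user sits off the critical path (tweets + cache = 55 ms > user = 30 ms),
# so speeding it up makes no difference end to end.
```

That is the essence of the what-if question: optimizing a service that is off the critical path does nothing for the top-level latency, and the aggregated trace structure is what tells you which one that is.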
In summary, we basically spent the majority of our time doing data engineering, which is certainly not what I was thinking about when I started, but it made our actual work much easier. We were the number one, primary dogfooding customer of our own platform. We were able to do all of these things. Each item I list here at the top is probably on the order of tens of millions of dollars, no big deal. We were able to finish each task in a few months. All of that was possible because we were using the data that we had carefully curated. Any question we had was no more than a few queries away. This, essentially, is the power of platform engineering. Towards the end, we started building even more platforms. As I said, we cheated on treating the software runtime as a system, because we just read the source code and understood what's going on. That's not a sustainable way of doing it. We very much intended to do program profiling, get that data, treat it very much the same way we were treating other data, and turn it into a dataset that anybody can look into. We were also thinking about a performance testing platform: whatever change you want, you can push it through the system, and it would tell you immediately what the difference is for any of the settings you care about. A lot of these things did not happen because we all got kicked out of Twitter.
How Do Performance Engineers Fit In?
That was the technical side of things. To fit in this track, you have to talk about things that are not just technical, like, how do performance engineers fit into the broader organization if this is the sort of thing we do? The short answer is we don't fit in. I think the best analogy may be from "Finding Dory," the sequel to "Finding Nemo": there's this octopus that is escaping from the aquarium. I sometimes think of the performance engineering team as the octopus, because we are dealing with system properties here. Whatever we do can take us to any part of the system, and may have us talking to any team in any part of the organization. Whenever executives or management try to box performance engineering into a particular part of the organization, which is very hierarchical, we somehow find something to grab on to and sling ourselves out of the box they made for us. If you are interested in performance engineering, or really any of the system properties like reliability or security, and you care about having a global impact, you have to put a lot of thought into navigating the hierarchical org, which is, I think, fundamentally incompatible with the way these properties work. You have to take a first-principles approach, figure out the fundamentals, and then be very clever about it. I think the fundamentals are actually very straightforward. With performance engineering, you either line up with the top line, which is making more money, or you line up with the bottom line, which is spending less money. Sometimes you need to pick which one is more valuable. Sometimes you will switch between them, but you need to know which one you are dealing with. The value has to be associated with one of these. It will affect questions like, who is your customer, and who is your champion, because you need to know who you're helping. The decision-making structure is also really important. If you have a top-down structure of decision making, you find the highest touchpoint you can, and that person gives all the commands on your behalf; then your life is easy. If you have a bottom-up organization, then you need to grow a fan base, essentially, people who like using your thing, and they will advocate for you on your behalf. Constantly think about how to align incentives, how to convince others to do your bidding. One last thing: how to get promoted and recognized, because that's not obvious. If you're doing this type of work, you are not the same as the majority of the rest of the engineering org. Your story has to be tight.
I have some other thoughts. One thing that was a little controversial early on was that people said, if you have a performance engineering team, does that mean people won't do performance work elsewhere? Are you going to do all the performance work? I think the answer is absolutely no. In any engineering org, there will be lots of often very good engineers who are willing and capable of doing excellent performance engineering work, but who won't join your team for various reasons. I think the answer is always: you let them. Not only do you let them, you help them. The symbolic significance of having a performance engineering team cannot be overstated, because it means that the org values this type of work. Also, nobody ever shoots the messenger who delivers the news of victory. You can just take out your bullhorn and give everybody credit, and you will still come out as a hero. Be a focal point where people can come to you in the right context and say, let's talk about performance; they contribute their ideas, they even do the work for you, and then you give them credit and make them look good. I think this is a great way of ending up with a culture that values this kind of collaborative approach, and that also values this type of systems thinking and global optimization in general. There's also a bit of a funny boom-and-bust cycle going on with performance engineering, which is the opposite of the general business cycle. When the business is good, everything is growing, nobody cares about cost because you're just floating in money. Nobody cares about performance, and that's ok; nobody gets fired when the company is doing amazing. The reverse can be true: the company is doing really poorly and everybody is stressed out. At this point you can be the hero. You can say, I know how to make us grow again, I know how to save money. Now everybody wants to work with you. Take advantage of this reverse boom-and-bust cycle. Do your long-term investment when nobody is paying attention to you, but have all the plans ready for when the business is in need, and you can be there, ready to make an impact felt throughout the company.
How It Actually Happened
I presented the things we did and how we fit in, all nice and neat, as if I had it figured out. As everybody can probably guess, the birth of anything is never this neat. It's very messy. This is what actually happened. What actually happened is that back in 2017 I had no idea any of this would happen, or how I would go about it. All I had was an itch. I had just had my first baby. I was constantly sleep deprived and holding a baby in my arms like 16 hours every day. In that dazed state I started thinking about performance, drifting into these thoughts, thinking about performance in a very deep, philosophical way. If you don't believe this is philosophical, just replace performance with happiness and you'll see my point. I spent months with it. Then finally a vision emerged. I figured out the ideal performance engineer, who: knows hardware, is good at software, has operations experience, can do analytics, can speak business, and can deal with all the kinds of people you run into up and down the org. Might as well walk on water at that point. A unicorn, and not just any unicorn, a rainbow unicorn.
Coming back to reality, in engineering speak, you need a team. Not just any team, but a diverse team, because it's very hard for anybody to check even three of those boxes. You have all these fields to cover. Everybody needs to be good at something. It's very difficult to hire for, or optimize for, six different attributes. My rule of thumb was simple: I need to find people who are much better than me at at least some of these things. I just kept going at it, and it worked out ok. When we first started back in 2017, it was like four people. They were all internal. We were all SRE-adjacent. We all had this mindset that we need to think about system properties. We started doing Rezolus, which is the performance telemetry work. That was the one concrete idea we had; otherwise, we were just figuring things out. We did a lot of odd jobs, like talking to a different team and asking, what's bothering you? Is there anything else we can do? Someone says, can you install the GPU driver? We're like, ok, maybe. Sometimes we ran into small opportunities. When I say small opportunities, I mean a couple million dollars here and there. When you have an operational budget of like $300 million a year, you will find million-dollar opportunities under the couch. We also did lots of favors, not necessarily things that have anything to do with performance engineering. I was at some point in charge of infrastructure GDPR work, I don't even know why. It's something that nobody wanted to do, and I had the bandwidth to do it. I did it, and everybody was happy that someone did it. Just proving that you're helpful, valuable. Survival mode, I would say. We made lots of mistakes. Certain ideas didn't hold, and we moved on. We wrote those off, saying, don't do this again. Over time, we figured out what our vision is and what our methodology is, and we wrote it down. Once we had a little bit of trust and people saw us as helpful, we got enough small wins to start ramping up. This is where a lot of the foundational work started to happen, with the tracing work and the long-term metrics work, and all of these happened because we ended up hiring people who were extremely good at those things. They brought in their vision; I didn't tell them what to do, I had no idea. They told me, and I'm like, "This makes great sense, please." Then we dogfooded: we used our own products, like Rezolus, to debug performance incidents. We were also looking forward, looking at all the crazy hardware things happening out there and seeing if any of those would eventually pan out. Speculative investments. We went out and continued to talk to everybody: can we help you? Now that we have a little bit of tooling and a bit of insight, we can answer some of those questions. We did all of those things.
That paid off. At the beginning of the pandemic, there was a big performance and capacity crunch. At that point, people started noticing us, and they felt the need to come talk to us. That's when the consulting model started to flip: instead of us going out to reach people, people were coming to us. We tried to make them happy, and we tried to make our effort impactful. As our primary datasets matured and we started building our own products on top of our platform, people started to use those products, and people started to understand how they could use the same data to do their own thing, which may or may not have had to do with performance. Then we got invited to more important projects, mostly to look after the capacity and performance aspects of things. People want to tick a box saying, performance-wise, this is fine. We were often there for it. All of this made me think, this is really wonderful. People know us. We put a lot of effort into intentionally branding ourselves: we are the performance people, we are the systems people, we can help you answer performance-adjacent questions. We really put effort into talks and publishing and even writing papers, stuff like that. The team got quite big, and we got all kinds of talent that we didn't use to have.
Finally, this would have been a happy ending if we had actually gotten to execute it. We matured, so we figured out which functions are closely adjacent to performance, which is things like capacity and a lot of the fleet health work. The idea was not to grow the performance team linearly, but instead to start having a cluster of teams that work really well together. The team actually got a little bit smaller, because that's easier to work with. Nonetheless, in 2022 everybody was feeling the crunch of the economic downturn. That's what I was talking about: when there's a drought, the performance team can be the rainmaker, because you're the only one sitting on a reserve of opportunities, and you can really make things better for the business. We went with it. All of our pent-up projects were greenlighted. For three months it was glorious, and then Elon Musk showed up.
Lessons Learned
What are the lessons? Especially for a team that deals with the entire organization, I think the technical and social considerations are equally important. Several other speakers have talked about this. I have these principles, or methodologies, for performance engineering, and I think they have stood the test of time. One is, we need a pipeline of opportunities. Some of them may not pan out. Some of them may turn out to be not as useful as we had hoped. If you have a lot of opportunities, because you've surveyed the entire system and measured everything, then eventually you will have a very reliable, consistent pipeline that keeps churning out opportunities you can go after. For performance teams, usually the best place to look is low-level infrastructure, which affects everybody, or things that take up like 25%, 30%, 40% of your entire infrastructure. You know what those are. They're the elephants in the room; they're very obvious. Finally, if we want to scale, then it's really about creating platforms and products that allow other people to do similar things. Those things may not be worth your dedicated team's time, but a local engineer can do them in their spare time, and it turns out to be a very decent win.
I think the other organizational lesson is: design the team to fit the organizational structure. Is it top-down? Is it bottom-up? Do it accordingly; don't copy what we did if your organization is different. Also, outreach is serious work, treat it as such. Adoption is serious work, not just building a product. All of these things are work, in every sense of that word. Also, people make work happen. It's really true. Everybody's strengths and personality are different, and we have to respect that. I think there's a huge difference between people who are at the p99 of the talent pool and the p50. When you find such a person, really cherish them, because they're very hard to come by. Seek diversity, both in skills and in perspective, because when we do this kind of diverse work, that is really what's needed to deliver. Embrace chance. What I mean by that is, had we found a profiling expert, we probably would have built the profiling platform first, and then maybe the metrics platform later. Because the talent was interested in those things, we let them lead us down the path of least resistance. It's ok; we can always come back later and make up for things we didn't build in the first place. Don't have a predetermined mindset saying, this has to happen first. As long as everybody is going in the right direction, the particular path is not that important. Finally, it's what my kid's teachers tell them all the time: be kind to each other. Software engineering is really a social enterprise. To succeed in the long term, we need to be helpful, be generous, make friends, and make it into a culture; then everything will be just so much more enjoyable for everybody.
IOP Systems after Twitter
What happened to IOP after Twitter? A bunch of my co-workers found really excellent jobs at all kinds of companies. A number of us banded together and started our own thing. What we're doing is not so different from what we used to do at Twitter, but now our goal is different. We want to do something for the industry; we want to build something that everybody can use.
Questions and Answers
Participant 1: [inaudible 00:44:36]. What was one of the things that was most surprising on that journey that [inaudible 00:44:47]?
Yue: I think what was a little unexpected is just how valuable understanding the structure of the service and the structure of the distributed system is. Like the index: this service talks to that service; just the fact that a connection of any kind exists between those two, or that they are two steps away. That information. It's not obvious that this is useful for performance. Only after several other steps do we get value out of it. If I had to go blindly into a new problem, or go seeking some new desirable property, I would just say, let's understand the system, regardless of the goal. Now this has become actionable to me. At the time, it was somewhat accidental.
Participant 2: You mentioned that you guys are using eBPF for [inaudible 00:45:59]. What kind of system metrics are you looking for? [inaudible 00:46:06]
Yue: eBPF is like a little bit of a C program that doesn't have loops and has very limited functionality, and you can execute it in the kernel. It's like god mode, but with a limited domain. The metrics we get from eBPF, we use to gather things we couldn't get otherwise. If you want to get the I/O size distribution, like, I'm going to the disk or I'm going to the SSD, am I getting 4 bytes, or am I getting 16 kilobytes? That distribution is quite important, because it tells you the relationship between the number of I/Os and the bandwidth they consume. One thing we realized later on is that even the things you can get traditionally, like the metrics you have in /proc, CPU metrics, sometimes memory metrics, can all be obtained with eBPF as well, for far less cost. There's a reason there are a lot of recent papers at the USENIX ATC conference, or USENIX's OSDI conference, where most of what they did is do a thing using eBPF in kernel space, and suddenly everything is quantitatively better. We are in this transition of doing as much of what we want using eBPF as we can. eBPF is written such that you have this basic data structure in kernel space and you do your count increments there. If you want to report anything, you usually have a companion user-space program to pull the data out and then send it to your stats agent. We do the standard thing, where there's this user-space and kernel-space duality. That's how we do it.
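To make that concrete, here is a sketch along the lines of BCC's classic bitehist example: an eBPF histogram of block I/O sizes maintained in kernel space and read by a companion user-space program. It requires root and the BCC toolkit, and tracepoint availability can vary by kernel version.

```python
#!/usr/bin/env python3
# Sketch of an eBPF I/O-size histogram, along the lines of BCC's bitehist
# example: count block-layer request sizes in a kernel-space log2 histogram,
# then read it from user space. Requires root and BCC; the block_rq_issue
# tracepoint and its fields can vary across kernel versions.
from bcc import BPF
from time import sleep

bpf_text = """
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(block, block_rq_issue)
{
    // args->nr_sector is the request size in 512-byte sectors.
    dist.increment(bpf_log2l(args->nr_sector * 512 / 1024));
    return 0;
}
"""

b = BPF(text=bpf_text)
print("Tracing block I/O sizes... Ctrl-C to stop.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

# User-space side of the duality: pull the kernel histogram and print it.
b["dist"].print_log2_hist("kbytes")
```

The increments happen entirely in kernel space; the user-space half only reads the aggregated histogram, which is why the overhead stays low even at high event rates.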
Participant 3: You mentioned that [inaudible 00:48:11], dependency graph. How will you eventually store such data, and where? How do you store traces?
Yue: We have a real-time data pipeline doing these things. There's the hidden assumption that things are partitioned by time: traces from the same day are going to be stored closer together than traces from yesterday and today. Beyond that, I think there's no obvious standard for how you would organize them that would answer every type of query. We basically didn't bother. We have five different indices depending on what level of information you care about. Once you get to the right index, generally what you do is actually scan most of it to filter out the ones you want. The queries are not designed to be fast. I think the value is that they can answer questions no other data source can answer, so people put up with the 5-minute delay to execute those queries.