
Adventures in Performance: Efficiency Analysis of Large-scale Compute


Summary

Thomas Dullien discusses how language design choices impact performance, how Google's monorepo culture and Amazon's two-pizza-team culture impact code efficiency, and why statistical variance is an enemy.

Bio

Thomas Dullien started his career in the field of computer security under the pseudonym "Halvar Flake" and spent roughly 20 years in that field. After selling his company to Google, he spent seven years at Google. He then switched fields to efficient computation and cloud economics, started a company that shipped the first multi-runtime profiler, and joined Elastic when that company was acquired.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Thomas Dullien: My name is Thomas Dullien. Welcome to my talk, Adventures in Performance, where I'll talk about all the interesting things I've learned in recent years after switching my career from spy-versus-spy security work to performance work. Why would anybody care about performance? First off, there are three macro economic trends conspiring to make performance and efficiency important again.

The first one is the death of the economic version of Moore's law, which means that new hardware no longer comes with the commensurate speedup it used to.

The second one is the move from on-premise software to SaaS: the cost of computing, which used to be borne by the customer, is now borne by the vendor of the service or digital good. That means inefficiency cuts straight into gross margins, and through that into company valuations.

Lastly, the move from on-premise hardware to cloud: in a pay-as-you-go scheme, if you find a clever way of optimizing your computing stack, you realize the savings literally the next day.

All of these things are coming together to push performance from a somewhat nerdy niche subject back into the limelight. Efficiency is becoming really important again, and with the cost of AI and so forth, this is only going to continue over the next years. Here's a diagram of single-core speedup. Here's a diagram that shows the relative cost per transistor over the different process nodes in chip manufacturing. We can see that the cost per transistor is just not falling at the rate it used to. Lastly, I mentioned gross margins and company valuation. You can see in this slide that there's a reasonably linear relationship between gross margins and company valuations for SaaS businesses, which means improving your gross margins will add a multiple of your revenue to your overall company valuation.

Why Care About Performance? (Personal Reasons)

Aside from the business reasons, what are the personal reasons why anybody would care about performance? For me, throughout my entire career, it has been difficult to align what I find technically interesting with what is economically viable, and to align both of those with my values.

One of the things that I really like about performance work is, it tends to be technically interesting, because it's full stack computer science. It tends to be economically viable because if I make things more efficient, I save people money, and they're usually willing to pay for that. Lastly, it's ideologically aligned because I'm working on abundance, meaning I'm trying to generate the same amount of output for society with fewer inputs, which is something that aligns with my personal value set. I prefer not to work on scarcity, and I prefer not to work on human versus human zero-sum games.

My Path: From Spy vs. Spy Security to Performance Engineering

My personal path goes from essentially spy versus spy security to performance engineering. The interesting thing here is both are full stack computer science, meaning in both cases, you get to look at the high-level design of systems, you get to look at the low-level implementation details. All of that is relevant for the overall goal, meaning security or insecurity on the one hand, and performance on the other hand. In both situations, you end up analyzing very large-scale legacy code bases. In the security realm, if you find a security problem, you've got to pick your path. You've got the choice of selling the security problem that you found to the highest bidder, which then risks getting somebody killed and dismembered in some embassy somewhere. Or you need to be prepared to be the bearer of bad news to an organization and tell them, can you please fix this?

Usually, nobody is happy about this, because security is a cost center and interferes with the rest of the business. The advantage of doing performance work is that it has all the technical interestingness of security work, but if you find a problem, you can fix it, and the resulting code will run faster, cheaper, and eat less energy. It's much less of a headache than security work.

Here's a map with a very convoluted path. This slide is here to tell you that the rest of this talk is going to be a little bit all over the place. Meaning, I've gathered a whole bunch of things that I've learned over the years, but I haven't necessarily been able to extract a very clear narrative yet. I'll share what I've learned, and I'll ask forgiveness that the overall direction of this talk is a much less clear narrative than I would like it to be.

Let's talk about the things I've learned. There are four broad categories in which I learned lessons over the last years. One is the importance of history when you do performance work. Then there is a category of things I learned about organizational structure and organizational incentives as they relate to performance work. There are a bunch of technical things I learned, of course. And there are a few mathematical things that I learned and had underappreciated when I started out doing performance work. We'll walk through all of them.

Historical Lessons

We'll start with the first lesson, which is: the programming language you're most likely using was designed for computers that are largely extinct. The computers you're using these days are not the computers this language was designed for, and to some extent it is misdesigned for today's computers. I'll harp a little bit on Java as a programming language. The interesting thing is, if you look at the timeline of the development of Java, the Java project was initiated in 1991 at Sun and released in 1995, which means it started prior to the first mention of the term "memory wall".

The memory wall is the discrepancy between increases in CPU or logic speed over the years and the speed with which we can access memory in DRAM. In 1994, people first observed that the speedup growth rates for memory and for logic were not the same, that this would lead to a divergence, and that the divergence would cause problems down the line. By the time Java was released, this was not really an issue yet. Now it's a couple of decades later, and the memory wall rules everything when it comes to performance work. If you have to hit DRAM, you're looking at 100 to 200 cycles. You can do a lot of computation in 100 to 200 cycles on a modern superscalar CPU, so hitting memory is no longer a cheap thing.

This, interestingly, led to a few design decisions that were entirely reasonable when Java was first designed but turned out to be suboptimal in the long run. The biggest of these is the fact that it's really difficult in Java to do an array of structs. In C, it's very easy to create an array of structs that have equal stride and are all contiguous in memory, next to each other.

Out of the box, Java has no similar construct, and it is not impossible but reasonably difficult to get a similar memory layout. If you look at an object array, instead of having all the data structures consecutive in memory, you have an array of pointers to the actual objects. Traversing that array implies one extra memory dereference per element, which is bound to cost you an extra 100 to 200 cycles depending on what's in cache and what isn't.
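To make the layout difference concrete, here is a minimal sketch of my own (not from the talk, and in Go simply to keep all the examples in one language): a contiguous slice of structs behaves like the C-style array of structs described above, while a slice of pointers behaves like a Java object array, with one extra dereference per element.

```go
package main

import "fmt"

type Point struct {
	X, Y, Z float64
}

// sumContiguous walks a slice of structs: elements sit next to each other
// with a fixed stride, so the hardware prefetcher can stream them in.
func sumContiguous(pts []Point) float64 {
	s := 0.0
	for i := range pts {
		s += pts[i].X
	}
	return s
}

// sumPointers walks a slice of pointers, which is the shape a Java object
// array has: every element access is an extra dereference that may miss cache.
func sumPointers(pts []*Point) float64 {
	s := 0.0
	for _, p := range pts {
		s += p.X
	}
	return s
}

func main() {
	contiguous := make([]Point, 1_000_000)
	pointers := make([]*Point, 1_000_000)
	for i := range pointers {
		contiguous[i].X = 1
		pointers[i] = &Point{X: 1}
	}
	fmt.Println(sumContiguous(contiguous), sumPointers(pointers))
}
```

On a large enough array, the second loop turns a linear scan into a chain of potential cache misses, which is exactly the 100-to-200-cycle penalty described above.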

A number of assumptions were baked into the language when it was designed. One was that traversing a large linked graph structure on the heap is a very reasonable thing to do, and garbage collection is nothing but traversing a large graph. Clearly, the calculus for this changes once you have to face something like the memory wall.

The other assumption was that dereferencing a pointer does not come with a significant performance hit, meaning more than perhaps one or two cycles. These were all entirely correct assumptions in 1991 and they're entirely wrong today. That's something that you end up paying for in many surprising parts of infrastructure. Why do I mention this?

The reason I mention this is that it means software is usually not designed for the hardware on which it runs: not only is your application not designed for the hardware on which you're running, your programming language wasn't designed for the hardware on which you're running either. Software tends to outlive hardware by many generations.

Which brings me to the next topic: your database and your application are certainly not designed for the computers that exist nowadays, and are likely designed for computers that are also extinct. I'll talk a little bit about spinning disks and NVMe drives, because they're very different animals and people don't appreciate just how different they are. If you look at a spinning disk, you've got 150 to 180 IOPS, which means you can read from perhaps 90 or 100 different places on the disk per second. Seeks are very expensive.

You physically have to move a hard disk arm over a spinning platter of metal. Data layout on the disk, where precisely the data is physically located, matters because seek times depend on the physical distance the drive head has to travel. Latency for seeking and reading is in the multiple-millisecond range, which means you can afford very few seeks and reads if you need to react to a user clicking on something. Most importantly, there is very little internal parallelism: only a few seeks in the I/O queue are actually useful, because the drive can usually only read one piece of data at a time.

The last thing to keep in mind is that multiple threads contending for the same hard disk will ruin performance, because if multiple threads are trying to read two different data streams, the end result is that the hard disk has to seek back and forth between those two areas of the disk. That's just terrible for throughput.

If you look at a modern SSD, a modern NVMe drive, you've got 170,000 IOPS, which is more than 1,000 times more. It's still 11 times more than Amazon's Elastic Block Storage. It's an entirely different category, three orders of magnitude more than a spinning disk. Seeking is really cheap. NVMes have internal row buffers, which means you may actually get multiple random accesses for free. They have near-instant writes, because writes don't actually hit the flash right away but are buffered by internal DRAM.

The latency for a seek and read is under 0.3 milliseconds, but these drives also require a significant amount of parallelism to actually make use of all those IOPS. Because each read has a fixed latency, you have to have many requests in flight at the same time, often on the order of 32 or 64 or even more I/O requests, to really hit peak performance on an NVMe drive.
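The napkin math behind that queue depth is just Little's law. Here is a tiny sketch with round numbers of my own, not figures from the slides:

```go
package main

import "fmt"

func main() {
	const targetIOPS = 170_000.0  // roughly what a single NVMe drive can deliver
	const latencySeconds = 0.0003 // ~0.3 ms per individual read

	// Little's law: requests in flight = throughput x latency.
	inFlight := targetIOPS * latencySeconds
	// A thread doing strictly blocking reads issues one request per latency period.
	perThreadIOPS := 1.0 / latencySeconds

	fmt.Printf("requests in flight to saturate the drive: ~%.0f\n", inFlight)
	fmt.Printf("IOPS from a single blocking thread: ~%.0f\n", perThreadIOPS)
}
```

With these assumed numbers you land at roughly 50 concurrent requests, which is consistent with the "32 or 64 or even more" figure above.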

That leads me to the way many storage systems are misdesigned for an era that is no longer a reality. What I've seen repeatedly is storage systems that feature a fixed-size thread pool, sized approximately to the number of cores, to minimize contention of threads for cores, but also to minimize contention on the actual underlying hard disk. What I've also seen is that you combine the fixed-size thread pool with mmap: you map files from disk into memory and rely on the operating system and the page cache to make sure data gets shuffled between disk and memory. You rely on reasonably large read-ahead, because you want to get a lot of data into RAM quickly, since seeking is so terribly slow on a spinning disk. These things make some sense provided you're on a spinning disk.

The reality is that on a modern system, you'll find the strange situation where you hit the system, and it doesn't max out the CPU, it doesn't max out the disk, it just sits there, because you end up exhausting your thread pools while all the threads sit waiting for page faults to be serviced. The issue is that a page fault is inherently blocking and takes 0.3 milliseconds to handle end-to-end, so the thread resumes after 0.3 milliseconds, which means a single thread can only generate about 3,000 IOPS. But you need 170,000 to hit the maximum capacity of the disk, which means that to saturate all the IOPS the drive can give you, you would need 256 threads hitting page faults constantly, all the time.

The upshot of this is that if you do any blocking I/O, given the internal parallelism of modern NVMe drives, your thread pools are almost certainly going to be sized too small. The end result will be a system that is slow without ever maxing out CPU or the disk. It will just spend most of its time idle.
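One way out, sketched below purely as an illustration (the file name, block size, and concurrency level are invented), is to replace blocking mmap access from a small thread pool with explicit positional reads issued from enough concurrent workers to keep the device queue full:

```go
package main

import (
	"log"
	"os"
	"sync"
)

func main() {
	f, err := os.Open("data.bin") // hypothetical data file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	const (
		concurrency = 64   // enough in-flight reads to keep an NVMe drive busy
		blockSize   = 4096 // one read per 4 KiB block
		blocks      = 1 << 16
	)

	var wg sync.WaitGroup
	work := make(chan int64)

	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, blockSize)
			for off := range work {
				// ReadAt is a positional read: there is no shared file offset,
				// so the workers do not serialize on a seek pointer.
				if _, err := f.ReadAt(buf, off); err != nil {
					return
				}
			}
		}()
	}

	for i := int64(0); i < blocks; i++ {
		work <- i * blockSize
	}
	close(work)
	wg.Wait()
}
```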

Modern SSDs are real performance beasts, and you need to think carefully about the best way to feed them. What about cloud SSDs? Cloud SSDs are an interesting animal, because cloud-attached storage is a different beast from both spinning disks and physically attached drives. Spinning disks are high latency, low concurrency. A local NVMe drive is low latency, high concurrency. Network-attached storage is reasonably high latency, because you've got the round trip to the other machine, but it can have essentially almost arbitrary concurrency.

It's interesting that very few database systems are optimized to operate in the high-latency, near-limitless-concurrency paradigm. What's interesting about us as a software industry is that we seem to expect the same code base, the same architecture, and the same data structures that are useful in one scenario to be useful in all three. We're really asking the same software to operate on three vastly different storage infrastructures. That's perhaps not a reasonable thing to do.

Technical Lessons

Another important technical lesson I've learned: libraries dominate apps when it comes to CPU cycles. What I mean by this is that the common libraries in any large organization are going to dominate the heaviest application in terms of the amount of time spent in them. Imagine you're Google, or Facebook, or Spotify, or anybody else that runs a large fleet with many services: the cost of the biggest service is going to be nowhere close to the aggregate cost of the garbage collector, or your allocator, or your most popular compression library, because these libraries end up in every single service, whereas the services themselves are fragmented all over the place, meaning you'll have dozens or hundreds or thousands of them.

In almost every large organization I've seen, once you start measuring what is actually driving cost in terms of compute, it is almost always a common library that eclipses the most heavyweight application. At Google, while I was still there, a significant fraction of the overall fleet, I think 3% or 4%, was spent on a single loop in FFmpeg. That single loop was later moved into hardware, into a specialized accelerator. The point is that if you start profiling across a fleet of services, it's very clear that your Go garbage collector is going to be bigger than any individual application.

That's quite interesting, because it means there's room for something that I've long wanted to have. My previous startup would have liked to build it if we had continued existing in the way we assumed we would. The dream was to be a global profiling Software as a Service, which means we would have data about precisely which open source software is eating how many cycles globally.

Then you could do performance bug bounties on GitHub, going around and saying, "This is an FFmpeg loop; we estimate that the global cloud cost, or the global electricity cost, of this loop is $20 million. If you manage to optimize it by a fraction of a percent, you'll earn $50,000." The interesting thing is you could create something where individual developers are happy because they get paid to optimize code, and the overall world is happy because things get cheaper and faster. Unfortunately, we're not there. Perhaps somebody else will pick this up and run with it.

Another important technical thing I learned, but had underestimated previously, is that garbage collection is a pretty high tax in many environments. People spend more money on the garbage collector than they initially think. The reason for this is, first of all, that common libraries dominate individual apps, and garbage collection is part of every single app, given a particular runtime. Garbage collection is also reasonably expensive, because traversing graphs on the heap is bad for data locality, which makes it heavier than many people think.

It is very common in any infrastructure to see 10% to 20% of all CPU cycles in garbage collection, with some exceptional situations where you see 30% spent on garbage collection. At that point, you should start optimizing some of the code. I was surprised, before I started this, I wouldn't have bet that garbage collection is such a large fraction of the overall compute spend globally.

One thing I also found extremely surprising is that, in one of the chats I had with our performance engineers, somebody told me, "Whenever we need to reduce CPU usage fleetwide, what we do is memory profiling." As a C++ person, I heard that, I was like, that is extremely counterintuitive. Why would you do memory profiling in order to reduce CPU usage?

The answer to this is, in any large-scale environment where all your languages are garbage collected, the garbage collector is going to be your heaviest library. If you reduce memory pressure, you're automatically going to reduce pressure on CPU across the entire fleet. They would focus on memory profiling their core libraries to put less stress on the garbage collector, and that would overall lower their CPU usage and billing. That was a surprising thing that I had not anticipated.
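As a concrete illustration of that habit (my own sketch, not something from the talk), in Go you can dump an allocation profile and then chase the call sites that allocate the most, since every byte allocated is future work for the garbage collector:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	doWork()

	f, err := os.Create("allocs.pprof") // output file name is made up for the example
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The "allocs" profile records cumulative allocations since program start.
	// Inspect it with: go tool pprof -sample_index=alloc_space allocs.pprof
	if err := pprof.Lookup("allocs").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}

// doWork is a deliberately allocation-heavy stand-in for real library code.
func doWork() {
	var sink []string
	for i := 0; i < 1_000_000; i++ {
		sink = append(sink, string(make([]byte, 16)))
	}
	_ = sink
}
```

The hot allocation sites in such a profile are the places where trimming memory pressure also trims fleet-wide CPU spent in the collector.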

It turns out that pretty much everybody becomes an expert at tuning garbage collectors, if they are tunable. It's also interesting that a lot of high-performance Java folks become experts at avoiding allocations altogether. I just saw a paper from the Cassandra folks where they praised a new data structure partly because it could be implemented with only a few very large allocations that get reused between runs. You also see Go engineers become surprisingly adept at reading the output of the compiler's escape analysis to determine whether certain values are allocated on the stack or on the heap.
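Here is a small, hypothetical example of that escape-analysis habit: build it with `go build -gcflags=-m` and the compiler reports which values escape to the heap, where they become garbage-collector work; values it says nothing about stay on the stack.

```go
package main

import "fmt"

type buffer struct {
	data [4096]byte
}

// stackOnly never lets the buffer leave the function, so the compiler can
// keep it on the stack (no escape message in the -m output).
func stackOnly() int {
	var b buffer
	b.data[0] = 1
	return int(b.data[0])
}

// escapes returns a pointer, which forces the buffer onto the heap
// ("moved to heap: b" in the -m output) and makes it the GC's problem.
func escapes() *buffer {
	var b buffer
	b.data[0] = 1
	return &b
}

func main() {
	fmt.Println(stackOnly(), escapes().data[0])
}
```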

What's interesting there is that the big pitch for garbage collection was: don't worry about memory management, the garbage collector will do it for you. It's interesting to see that, in the end, if you care about performance, you do still need to worry about memory management. You may use a garbage collector, or you may use a language that provides safety without a garbage collector. Either way, you will not escape having to reason about memory management.

Organizational Lessons

One of the big lessons I took away is that your org chart and the way you organize your IT matter quite a bit for your ability to translate code changes into savings. The reason I say this is that pretty much all large organizations have both a vertical and a horizontal structure, but the emphasis differs. Google, for example, is a very vertical organization. You've got the infrastructure team. You've got a big monorepo. You've got very prescriptive ways of doing things.

You've got a very prescriptive tech stack, where Google essentially tells the engineers: if you need a key-value store, use Bigtable. For every either-or decision, there's a clearly prescribed solution. All of it lives in one big monorepo. There are big services that are shared across the entire company, which are the vertical stripes in this diagram. Then every project, which is a horizontal stripe, picks and chooses from these different vertical stripes and assembles something from them.

Then there are other organizations that are much more horizontally oriented, where you have separate teams, two-pizza teams or whatever, that may have a lot of freedom in what they choose. They may choose their own database. They may choose their own build environment. They may choose their own repository, and so forth. While this is excellent for rapid iteration on a product and for keeping the teams unencumbered by what some people would perceive as red tape, it's not necessarily ideal for realizing savings across the board.

What I've observed is that vertical organizations are much better at identifying performance hogs, which are usually a common library, then fixing that library and reaping the benefits across the entire fleet. If you do not have an easy way to do that, or at least a centralized repository of your common artifacts, it's much harder. Because if you identify, for example, that you really should swap one allocator for another, or that you have a better compression library, you now have to walk around the organization trying to get a large number of teams to make a change in their repositories. That is much more strenuous, which means the amount of work you have to do to realize the same savings is much higher.

Another really surprising thing I learned was that companies cannot buy something that has a net negative cost; companies are really reluctant to buy savings. When we started on this entire journey, we initially thought we would do cut-of-the-savings work, meaning we would offer to look at your tech stack and work with you to improve it for free, but we would like a cut of the savings realized over the next couple of months. It turns out that while this looks really sensible from a technical and economic perspective, it is almost impossible to pull off in the real world, largely because neither the accounting nor the legal department is set up to do anything like this. Accounting doesn't really know how to budget for something that has a guaranteed net negative cost but, at the same time, an outflow of money of unknown size at some point.

The legal department cannot put a maximum dollar value on the contract and is worried about arguments over what the actual savings are going to be. The experience was that, after a couple of months of trying to do this and succeeding only with tiny players, a friend of mine who has done professional services for 20-plus years told me, "You may be able to do this if you're Bain and you've played golf with the CEO for 20 years, but as a startup, this is really not something that big companies will sign up for."

For me, as a technical person, it was somewhat counterintuitive that large enterprises cannot just buy savings, because they're very happy to buy products in the hope of realizing savings. A contract that guarantees them savings in return for a cut of the savings is something too unusual to actually be purchased. That was definitely a big lesson for me, and a surprising one.

Another organizational thing I learned, and found surprising and interesting, is the tragedy of the commons around profile-guided optimization and compilation time. What's happening here is that, essentially, nobody likes long compilation times. The Go developers, for example, are very keen on keeping Go compilation times really low. The people who build your upstream Linux packages have limited computational resources, and they don't really want to spend a lot of time compiling your code either.

This at one point led to a situation where the Debian upstream Python was compiled without profile-guided optimization enabled, which meant that everybody running a Debian-derived Linux distribution, Ubuntu and so forth, was paying 10% to 15% in extra performance for every line of Python executed globally, largely because the people building the upstream package didn't want to incur the extra time it takes to do the PGO-optimized Python build, which takes something like an hour. What we see here is a tragedy of the commons. For many common libraries, if the library runs on 1,000 cores for a year, increasing its performance by 1% is worth 10 core-years, meaning you could spend a lot of time compiling that library if you can eke out a 1% performance improvement.

There's an argument to be made that it would make global economic sense to pick certain really heavyweight libraries and apply the most expensive compile-time optimizations you can possibly dream up to them. It doesn't matter if it ends up compiling for two weeks, because the global impact of the 1% saving will be much higher than a week of hardcore computation to compile it. Instead, we end up with this tragedy of the commons where nobody has an incentive to speed up the global workload.
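For a feel of the mechanics, here is a hedged sketch in Go, which has shipped profile-guided optimization in recent releases; the workload and file handling are invented, and the same principle applies to CPython or C libraries built with PGO:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// 1. Record a representative CPU profile while running production-like load.
	f, err := os.Create("default.pgo") // the file name `go build -pgo=auto` looks for
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	hotLoop()

	// 2. Drop default.pgo next to the main package and rebuild:
	//      go build -pgo=auto ./...
	//    The extra build time is paid once; the speedup is paid back on every
	//    core-hour the binary runs afterwards.
}

// hotLoop is a stand-in for whatever the profile shows to be hot.
func hotLoop() {
	sum := 0
	for i := 0; i < 1_000_000_000; i++ {
		sum += i
	}
	_ = sum
}
```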

This is made worse by the fact that on x86, everybody compiles for the wrong microarchitecture, because the upstream packages are all compiled for the lowest common denominator of architectures. Almost certainly, your cloud instance has a newer microarchitecture, and you can often get a measurable speedup by rebuilding a piece of software precisely for your uArch. The issue is that Linux distributions don't necessarily support microarchitecture-specific packages.

Cloud instances don't map cleanly to microarchitectures either. What we have here is a global loss of CPU cycles and a global waste of electricity, caused by the fact that code is almost always compiled for slightly the wrong CPU. It's interesting if you think about all the new Arm server chips: perhaps one of their advantages is that, in general, your code will be compiled for the right microarchitecture.

Mathematical Lessons

Mathematically speaking, the biggest thing I learned is that, even for me as a trained mathematician, benchmarking performance is a statistical nightmare. Every organization that cares about performance should have benchmarks as part of its CI/CD pipeline. These benchmarks should, in theory, be able to highlight performance regressions introduced by a pull request. In theory, organizations should always have an awareness of which changes lead to performance deterioration over time, and so forth.

In theory, answering the question, does this change make my code faster, should be easy because it's classical statistical hypothesis testing. In practice, it is actually extremely rare to see an organization that runs statistically sound benchmarks in CI/CD, because there are so many footguns and traps involved with it.

The very first thing I underappreciated initially is that variance in performance is your enemy. If you're trying to make a decision like "does this code change make my code faster?" and your performance has extremely high variance, you need many more measurements to actually make that decision.

The upshot is that if you tolerate high variance in your performance to begin with, any benchmarking run to determine whether you've improved overall performance is going to take longer, because you need more repetitions before you've got enough data to make that determination. Another problem you run into is the non-normality of all the distributions you encounter in performance work. I sometimes look at the distributions I see in practice and want to yell at a statistician: does this look normal to you?

The main issue is that performance work almost always deals in really unusual distributions that are difficult to model parametrically. You're very quickly at the point where the statisticians tell you: if that is really the case, we have to say goodbye to parametric testing and resort to nonparametrics, which are, as a friend of mine who is a statistics professor calls them, methods of last resort. Unfortunately, in performance, the methods of last resort are usually the only resort you've got.
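To make that concrete with tooling most Go shops already have (my example, not the speaker's): express the comparison as an ordinary benchmark, collect many repetitions before and after a change, and let benchstat judge the difference; its significance test is nonparametric (historically a Mann-Whitney U test), exactly the kind of method of last resort described above.

```go
// codec_test.go (hypothetical). Typical workflow:
//
//   go test -bench=Decode -count=20 > old.txt
//   (apply the change)
//   go test -bench=Decode -count=20 > new.txt
//   benchstat old.txt new.txt
package codec

import "testing"

// decode is a stand-in for the function whose change is being evaluated.
func decode(b []byte) int {
	n := 0
	for _, c := range b {
		n += int(c)
	}
	return n
}

func BenchmarkDecode(b *testing.B) {
	payload := make([]byte, 64*1024)
	for i := 0; i < b.N; i++ {
		decode(payload)
	}
}
```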

Another problem you run into is that the CPU internals you face when you run your benchmarks mean your tests are not identically distributed in time. Modern CPUs have sticky state: your branch predictor is trained by the particular code paths taken, so your benchmarks will vary in performance based on whether they run for the third time or for the fifth time. That's very difficult to get rid of. One of the solutions is random interleaving of benchmark runs, where you do one benchmark run, then run a different one, and so forth. You still have to contend with things like your CPU clocking up and down, and sometimes architectural things.

There was an entire generation of CPUs where, if any of the cores switched to using vector instructions, all the other cores would be clocked down. You have all these noisy things that destroy your independence across time. Then you've got all the mess created by ASLR, caches, noisy neighbors on cloud instances, and so forth. Depending on the address space layout of your code, you may get almost 10% noise in your performance measurements just from an unlucky layout. You can have noisy neighbors on cloud instances maxing out the memory bandwidth of the machine and stalling your code. There can be all sorts of trouble that you did not anticipate.

If you start controlling for all of these, meaning you run on a single-tenant, bare-metal machine, you disable any frequency boosting, and so forth, and you try to really control the variance of your measurements, you end up in a situation where your benchmarking setup becomes increasingly different from your production setup. Then the question becomes: if your benchmarking measurement is really not representative of production, what are you benchmarking for?

The end result is that almost nobody has statistically reliable benchmarks that show improvement or regression in CI/CD, because in many situations, running enough experiments on each commit to establish that a change actually makes something faster, with a meaningful confidence interval, is prohibitive: you need too many samples. This doesn't seem to bother anyone. In the end, there are a few people who have done fantastic work on this. MongoDB have written a great article about their struggles with change point detection.

ClickHouse have written a great article about how they control for all the side effects and noise on the machine. The trick is relatively clever: they run the A and B workloads on the same machine at the same time, arguing that at least the noise is going to be the same for both runs. If you really want to get into nonparametric statistics for the sake of performance work, there's a fantastic blog by Andrey Akinshin, who works at JetBrains. I can very much recommend it. It's heavy, but it's great.

Concrete Advice - What Are the Takeaways from All These Lessons?

After all of this, what's the concrete advice from all of these anecdotes? What are the takeaways from all of the lessons I learned? On the technical side, it is crucially important as a performance engineer to know your napkin math. Almost all performance analysis, and every performance murder mystery, begins by identifying a discrepancy between what your napkin math and your intuition tell you should happen, and what actually happens in the real system. "This should not be slow" is the start of most adventures.

A surprising number of developers have relatively poor intuition for what a computer can do. Knowing your napkin math will help you figure out how long something should take approximately.
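As one illustration of the kind of napkin math meant here (round numbers of my own, not the speaker's), it helps to be able to estimate things like the following in your head or in a few lines of code:

```go
package main

import "fmt"

func main() {
	const (
		memBandwidthGBs = 20.0  // assumed streaming bandwidth for one core, GB/s
		dramLatencyNs   = 100.0 // assumed cost of one cache-missing pointer dereference
	)

	// Scanning 1 GB sequentially at the assumed bandwidth:
	fmt.Printf("scan 1 GB sequentially: ~%.0f ms\n", 1.0/memBandwidthGBs*1000)

	// Chasing 100 million dependent pointers that all miss cache:
	fmt.Printf("100M dependent DRAM hits: ~%.0f s\n", 100e6*dramLatencyNs/1e9)

	// If the real system is 100x slower than either estimate,
	// that discrepancy is where the murder mystery starts.
}
```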

Another important thing on the technical side is that you will, as a performance engineer, have to accept that tooling is nascent and disjoint. I ended up starting the company because I needed a simple fleet-wide CPU profiler. You are going to be fighting with multiple, poorly supported command-line tools to get the data you want. The other big takeaway is: always measure. Performance is usually lost in places that nobody expects. Performance issues are almost always murder mysteries; it's very rarely the first suspect that ends up being the perpetrator. Measuring is very scientific, very empirical in a way, and I quite like that about the work.

Another thing is that there's a lot of low-hanging fruit. Most environments have 20% to 30% in relatively easy wins sitting on the table. The reality is that, in a large enough infrastructure, that is real money. That's real, demonstrable technical impact.

On the organizational side, if you were to task me with introducing a program to improve overall cost efficiency, one of the most important things to do is to try to establish a north-star metric. For a digital goods provider, that means working towards something that approximates the cost per unit served. Let's say you're McDonald's. McDonald's has a pretty clear idea of what the input costs for its hotdogs are and what the output costs are, and it can drive down the input costs.

Similarly for software: if you're a music streaming service, you would want to know the cost of streaming one song; if you're a movie streaming service, the cost of streaming a movie, and so forth. Identifying a north-star metric that is a unit cost is really the most effective step you can take. Once you have that, things will fall into place somewhat magically, I think. What I've observed is that dedicated teams can have pretty good results at a certain organizational size. Before that, just doing a hackathon week can be pretty effective. During a hack week, you focus on identifying and fixing low-hanging fruit. Usually, the results of such a hack week can then be used as leverage, or as an argument for why a dedicated team is a sensible thing to do.
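A toy sketch of what such a north-star unit-cost metric boils down to (the numbers are invented for the example):

```go
package main

import "fmt"

func main() {
	// Hypothetical monthly figures for a music streaming service.
	monthlyComputeDollars := 480_000.0
	songsStreamed := 1_200_000_000.0

	// North-star metric: infrastructure cost per 1,000 songs served,
	// tracked over time as the efficiency program progresses.
	costPerThousandStreams := monthlyComputeDollars / songsStreamed * 1000
	fmt.Printf("cost per 1,000 streams: $%.2f\n", costPerThousandStreams)
}
```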

If you've established a north-star metric, the return on investment of such a team, or of a hack week, is going to be measurable, very visible, and very easily communicated to business decision makers. Which lastly leads me to the fact that very few areas in software engineering have such clear success metrics as performance work. I like the clarity, the game-like nature of it, because you get to really measure the impact you have. I also predict that the organizational importance of performance and efficiency is only going to grow over time. We're facing a period of dear money.

It's just going to be harder to get investment capital, and so forth. Margins do matter these days. I would not be surprised if, over the next years, efficiency of compute becomes a board-level topic on par with security and other areas.

On the mathematical side, my advice is that, unfortunately, the methods of last resort are most likely going to be your first resort. Reading up on nonparametric statistics is a good idea. Because you are in a nonparametric place, you will need to deal with the fact that your tests are going to have relatively low statistical power, which means you'll need a certain number of data points before you can draw any firm conclusions. That means there is a cost for scientific certainty in terms of compute. You need more benchmarking runs, more data points, and you need to weigh carefully when that is warranted and when it isn't.

Historically speaking, the advice would be: always keep yourself up to date on drastic changes in the geometry of computers. A thousandfold increase in IOPS; a change in the cost of a previously expensive operation, like integer multiplication, which used to be very expensive and is nowadays so cheap that you can easily use it in hash functions; changes in the state of the art for data compression, if you look at zstd versus zlib. These are the changes that have drastic multi-year impact on downstream projects. You can still get reasonable performance improvements by switching your compressor to something else, and you can still get performance improvements by making use of certain structures that are nowadays cheap, and so forth.

Keeping yourself up to date on what has recently changed in computing is a very useful skill, because odds are the software was written for a machine where a lot of things were true that are no longer true. Being aware of this will always find you interesting optimization opportunities. The other thing to keep in mind is that code and configuration outlive hardware, often by decades. Code is strangely bimodal, in that it either gets replaced relatively quickly or it lives almost forever.

Ninety percent of the code that I write is going to be gone in 5 years, and 10% will still be there in 20 years. You should assume that pretty much any tuning parameter in any code base that hasn't been updated in 3 years is likely to be wrong for the current generation of hardware.

Technical Outlook

Where are we going? Where should we be going? If I could dream, what tools would I want? I'll dream a bit and wish things that I don't yet have. The one thing that I've observed is that diagnosing performance issues requires integrating a lot of different data sources at the moment. You use CPU profilers and memory profilers. You use distributed tracing. You use data from the operating system scheduler to know about threads.

Getting all that data into one place and synchronizing it on a common timeline, and then visualizing it, all of that is still pretty terrible. It's disjoint. It's not performant. It's generally janky. There's a lot of Bash scripting involved.

What tools do I wish I had? There are different tools for different problems. For global CO2 reduction, I really wish I had the global profiling Software-as-a-Service database with a statistically significant sample of all global workloads, because then we could do open source performance bug bounties, and we could all leave this room with a negative CO2 balance for the rest of our lives.

That would have more impact than many individual decisions we can take. For cost accounting, I would really like to have something that links the usage of cloud resources, such as CPU, I/O, and network traffic, back to lines of code, and have that integrated with a metric about units served so we can calculate the cost breakdown. Literally: for serving this song to a user, these are the lines of code that were involved, these are the areas that caused the cost, and this is the cost per song.

For latency analysis, I would really like to have a tool that combines CPU profiles and distributed tracing and scheduler events, all tied together into one nice UI, and deployable without friction. In the sense that you would like to have a multi-machine distributed trace for a request that you send. Then, for each of the spans within this request, you would like to know what code is executed, and which parts of the time is the CPU actually on core, and which parts of the time is it off core across multiple machines.

Have all of that visualized in one big timeline so we can literally tell you, out of these 50 milliseconds of latency, this is precisely where we're spending the time on which machine, doing what.

Lastly, I would really like to have Kubernetes cluster-wide causal profiling. There's some work on causal profiling; there's a tool called Coz, which is really fascinating. These things are still very much focused on individual services, and they're also not quite causal enough yet, in the sense that they can't tell me if my latency is due to hitting too many page faults. What I would really like is something that you install on an entire Kubernetes cluster, or whatever other cluster software you use, and then you can ask a question like: this is a slow request sent to the cluster, why is it slow? Why is it expensive?

Then have some causal model of, this is expensive because machine X over here took too much time servicing this RPC call, because it took too much time servicing this page fault, for example. I'm not sure how to get there. We've had pretty dazzling advances in large language models in recent months that can now tell me something non-trivial about my code. Perhaps we will get to something that can tell me something non-trivial about the performance of my software.

One big hurdle with all performance tooling I've observed is deployment friction. Most production environments need some sort of frictionless deployment, ideally deploying on the underlying node with no instrumentation of the software necessary. The moment any team outside of your Ops team has to lift a finger, your tool becomes drastically less useful.

What's Next?

What is next in regards to performance work? I've been wandering around the landscape for a bit. I've learned a bunch of things. I'm excited to see what comes next, but I have no idea what that will be.

 


 

Recorded at:

Aug 11, 2023
