
The Joy of Building Large Scale Systems


Summary

Suhail Patel discusses the art and practice of building systems from core principles with a focus on how this can be done in practice within teams and organisations.

Bio

Suhail Patel is a Staff Engineer at Monzo focused on building the Core Platform. His role involves building and maintaining Monzo's infrastructure which spans nearly two thousand microservices and leverages key infrastructure components like Kubernetes, Cassandra, Etcd and more.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Patel: I'm going to start my talk with my conclusion. I believe many of the systems that we build today really don't take full advantage of the hardware that is available to them. Sometimes this is by design. Much of the learnings and lessons and infrastructure that we've built up over time came from a different world, a world of hardware that had different tradeoffs. Sometimes it can be organizational pressures. Your ops team maybe don't want to change to a particular new piece of technology. I was talking to Sid Anand, and he was talking about whether I should maybe reframe this title as the joy of building large scale systems and the pain of operating them. You're not alone. We're all practitioners in the room. I want to go a little bit into how things have evolved in the industry, and things to get excited about with the hardware of today.

Choose Boring Technology

I'm going to be talking about various systems and services and design patterns, and things that are going to sound new and novel and often associated with complexity. I really do subscribe to the boring technology club. If you are a startup, if you're doing less than one request per second, there is really no reason to go through a lot of the struggle at this point in the lifecycle with all the ins and outs of building a cutting edge, blazing fast production system, because you've got other things to worry about, such as how to survive in business. You can get your favorite cloud provider or some vendor to run it for you for a couple hundred dollars. You can really focus on building your business with plenty of options to scale further. This talk is more for the people who are in that mid-tier, so not in like the hyperscalers. If you're Amazon or Google or Netflix, then you've probably heard all of this before. This is for the vast majority of people who are in the mid-tier, where you have some business need that is being served by software, but it's not being done in the most efficient manner. Think of this talk as a little bit of a retrospective of where we've been in the industry and where systems are heading in the future.

Profile

My name is Suhail. I am a staff engineer at a company called Monzo, based in the UK. I work in the platform group. We provide all of the underlying infrastructure and libraries and tooling, so that engineers can really focus on building banking software, and not have to worry about how their systems are running, and whether they're reliable, and whether they're up, and whether the database is going to be backed up, and things like that. If you folks have ever worked with Heroku, or similar platforms, that's the magical experience that we try to emulate, but in an opinionated manner suitable for a regulated environment such as banking.

Make Money Easy

Monzo is quite popular in the UK. We have over 7 million customers. I think we are the seventh or sixth largest bank in the UK. Our unique selling point is that we have an app and we have no physical branches. All of our branches are on GitHub. We actually have an API, which we're mandated to have, which is /branches. That API has an empty response. It quite literally says all of our branches are on GitHub. It was the easiest API that we've ever had to write. Scales to infinity. It's fantastic. We don't have to keep it up to date. We wrote it basically on day one when the company started, and I haven't had to change it since, which is really fantastic. We power all of the banking features through an online app. If you've heard of Chime here in the U.S., we're very similar. I think we have nicer looking cards, in my opinion. That coral one really does stick out. We are fully licensed and regulated just like all the other big players in the UK, and also here in the U.S., so we are under the exact same constraints. We can't just move fast for the sake of moving fast. With 7 million customers, you can't play around with people's money.

Monzo's Infra Integrations

Our whole philosophy at Monzo is to build infrastructure using composable microservices. We have over 2500 microservices in production powering all different parts of our banking backend. A lot of people gasp at that: that's so many services, especially because we only have a few hundred engineers. We are a consumer-facing retail bank with direct integrations with payment providers like MasterCard. In the UK, we have the Faster Payments scheme, which is similar to ACH, and all these various different schemes that provide different payment integrations in the UK. We have integrations with the vast majority of them. On the screen are all the different services that we develop just to power one of these many schemes. For example, when you tap your Monzo card, which is integrated with the MasterCard network, all of these services need to get involved in order to process that payment. This includes things like maintaining a banking ledger, making sure that you are a valid customer, making sure that your card is valid and hasn't expired, making sure that you're not using it for nefarious purposes, and making sure that we fight financial crime. We are under the same regulations and restrictions as most of the other consumer banks, and we want to make sure that a payment is legitimate, and that it is actually the customer that is making that particular payment.

I mention all of these, because all of these systems and services need to talk to databases and queues and other core abstractions. You can imagine that those core abstractions and core systems get a lot of request volume, and have a very high throughput. These things need to be fast. We don't want you to be waiting at Whole Foods forever, watching your card machine with a spinner while it processes that particular payment. We want you to be in and out as quickly as possible. The retailers also want the same. In actual fact, MasterCard only give you a couple of seconds before they will make a decision on your behalf. That decision that MasterCard makes is usually a much poorer decision than one you would make internally. It is in our interest as a bank to make sure that we make the best decision for our customers, because it also saves us money in the long run. That time constraint means that we need to be fast.

B-Trees

Speaking about speed in the base layer, let's go a little bit back to the core abstractions portion and get into the nitty-gritty about why all of this matters. If you've ever opened a computer science course or an algorithms book, you will have come across a B-Tree data structure. B-Trees are typically used as index implementations in many popular databases. If you've ever defined an index using the CREATE INDEX command, it's almost certainly going to be implemented as a B-Tree or some variant of a B-Tree, and represented in your database system and on disk. I want to present this slide just so that you can keep it in mind. We will be revisiting this in a couple slides. In 2012, there was a publication on latency numbers that programmers should know. I've put some of those numbers here on the screen. We'll revisit these in a little while to see how they have changed. Keep in mind, for example, the last number there, the cost of a disk seek. These numbers are a little old. Throughout this talk, I want to talk about how some of these have changed in order of magnitude as hardware has advanced, and how we should be building our systems to be able to take advantage of these orders of magnitude changes. It's still important to keep in mind now.

If you're familiar with a binary tree, a B-Tree is very similar: you've got sorting of elements, which means that you can traverse the tree in order. I've got lots of boxes here with numbers on them, and it all looks very complex. Imagine that they are references to a primary key or ID or something like that: 82 might reference a row which is my record, and 83 might be Sarah's, and so on. It's effectively a key-value lookup mechanism. Similar to a binary tree, doing a search, you traverse the tree in order. Similar to doing a binary search in a large list, you start at the head and you make your way down to the node that you are trying to find. You can find almost any record by traversing the tree. In this particular example, you can find any record in four comparisons, which is pretty good going. Four comparisons, you could do that relatively quickly. Imagine all of these items corresponded to a data record. The numbers are an identifier to that record. You're going to want an efficient representation, so you don't have to traverse every particular element to get to a particular record.
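To make that lookup concrete, here is a minimal sketch of a B-Tree search in Rust. The in-memory node layout and the example records are invented for illustration; real database B-Trees page their nodes to and from disk and carry a lot more machinery.

    // Minimal sketch of a classic B-Tree lookup over an invented in-memory
    // node layout. Each node holds sorted keys, a value per key, and
    // keys.len() + 1 children when it is an internal node.
    struct Node {
        keys: Vec<u64>,
        values: Vec<String>,
        children: Vec<Node>, // empty for leaf nodes
    }

    fn search<'a>(node: &'a Node, key: u64) -> Option<&'a String> {
        match node.keys.binary_search(&key) {
            // Key found in this node: one comparison pass per level of the tree.
            Ok(i) => node.values.get(i),
            // Not here: descend into the child covering the gap, or give up at a leaf.
            Err(i) => node.children.get(i).and_then(|child| search(child, key)),
        }
    }

    fn main() {
        // Tiny two-level tree, loosely mirroring the slide's example records.
        let tree = Node {
            keys: vec![50],
            values: vec!["Sarah".to_string()],
            children: vec![
                Node { keys: vec![10, 30], values: vec!["Ann".into(), "Bob".into()], children: vec![] },
                Node { keys: vec![82, 83], values: vec!["my record".into(), "Sam".into()], children: vec![] },
            ],
        };
        println!("{:?}", search(&tree, 82)); // prints Some("my record")
    }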

Imagine that you want to insert an item. For example, we have that sub-tree highlighted in orange there, and we want to add a few more records that align with that particular area of the tree. We're getting a few inserts to our database, and we want to add a few more records. We've added two more items there. Maybe these are two new customers in banking parlance, or something like that, 86 and 87, which is all fine. This tree is now looking a little bit imbalanced. We've got quite a few things clustered at the leaf there. We're going to have to do a little bit of rebalancing. Most of our leaves there have two or three records. This one is clustered with five records, which is not ideal. One of the properties of B-Trees is that they need to self-balance when you've inserted a bunch of items into a leaf, so we're going to juggle these items around to maintain an optimum distribution. Many implementations will try to prioritize keeping the height at a constant level, so you don't have to do too many comparisons. We can rejuggle this tree around to make sure that we still keep the same height, while still maintaining a good distribution within the leaves. Here is our rebalanced tree. We've shuffled a few of the things around. All of the items in red have been reshuffled, so that we have a more even distribution of the load. We no longer have five records clustered in one leaf. We've had to do quite a few different operations in order to make that happen.

This is a spinning platter hard drive. There's a mechanical head that whizzes around and makes clunking noises. It's all fun and games, except when you put a magnet to it and all of your data is erased, which I may have done by accident quite a few times when you whack some hard drives in a drawer. When B-Trees were originally designed, they were a data structure that lived on the hard drive, as opposed to, for example, in RAM, where it would be much more volatile. You want to keep these in a persisted medium. That allowed the B-Tree to grow as your data grew, because RAM wasn't all too massive back in the day. When you were running low on RAM, you still wanted the ability to look up things in your B-Tree as your dataset grew. However, if your data is written over time, your B-Tree isn't going to live in a contiguous place on your spinning platter hard drive. You don't want your B-Tree to be splattered all over your hard drive, but that's what you end up with, because your B-Tree grows organically as you add records over time. Different blocks are going to correspond to different elements of your B-Tree. That means every time you need to do a change or a lookup in your particular B-Tree, for example, if you remember our search example from the previous slide with four comparisons, that head needs to move to different places and wait for the spinning platter to spin around and get into the right place before you can read that bit of data. That's a couple of milliseconds right there, when you're accumulating it all and aggregating it all, just getting the read-write head into position. Imagine you need to do that a couple of times within your B-Tree. Imagine the tree has grown so that top to bottom is 10 comparisons, for example; that can accumulate really quickly.

The era of hard drives might have passed. I don't think anyone is running systems with spinning rust nowadays. If you are, then you should definitely look into NVMe SSDs. They're great. If you're on cloud storage, like EBS, each time you need to do a particular operation, that translates into an I/O operation, and you're going to be billed for that. There's a fixed capacity that you need to provision upfront so that they can provide you the quality of service. This stuff still does matter. If you were to choose a different algorithm, you might be able to do all of these lookups in one I/O operation. Again, a little bit of an age division here. I don't know if folks are old enough to remember the days of disk defragmentation? The general principle, especially in the days of spinning platter hard drives, was that you want to lay out your data to minimize the amount of seeking you need to do, and to minimize the number of operations you need to do with your read-write head. If you lay it out in a linear manner, you can take most advantage of the hardware. It's why defragmenting your hard drive made your computer feel much faster. It wasn't just moving those colored boxes around to look all pretty. It did make your computer feel faster, because it laid your data out again in a manner which was much more efficient for reading, which is what dominates a lot of our systems nowadays when you look at read-write ratios.

Random and Sequential I/O

There has been some research, even in the age of solid-state drives, SSDs. SSDs are much more efficient reading sequential data than doing random access. The above chart is from 2009. The premise for optimizing for reads and sequential reads very much exists today. On many SSDs and NVMe drives, you can read 1 kilobyte, or 4 kilobytes of data in a single I/O operation. That seek operation is much quicker. If your bytes are scattered all over the place, and if you've got a couple bytes here and a couple of bytes there and a couple bytes everywhere, then you're still going to have to do a bunch of I/O operations even on an SSD. That's going to accumulate much more than just doing a single I/O operation where you can read a bunch of kilobytes of data all in one go. Ultimately, that reduces the performance of your application or your database. Because you have to remember that when these database systems are running, you're doing millions, if not billions of these operations. That accumulates very quickly.
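As a rough way to see this effect for yourself, here is a hedged micro-benchmark sketch in Rust that reads the same number of 4-kilobyte blocks once sequentially and once at scattered offsets. The file name and sizes are made up, and a serious benchmark would drop the page cache, use O_DIRECT, and repeat runs, but even this simple version tends to show the gap being described.

    use std::fs::File;
    use std::io::Read;
    use std::os::unix::fs::FileExt; // positional reads for the random case
    use std::time::Instant;

    // Rough illustration only: compare sequential vs random 4 KB reads.
    // Assumes a file called testdata.bin of at least BLOCKS * BLOCK bytes.
    fn main() -> std::io::Result<()> {
        const BLOCK: usize = 4096;
        const BLOCKS: u64 = 100_000;
        let mut buf = vec![0u8; BLOCK];

        // Sequential: read blocks in file order; prefetching does most of the work.
        let mut f = File::open("testdata.bin")?;
        let t = Instant::now();
        for _ in 0..BLOCKS {
            f.read_exact(&mut buf)?;
        }
        println!("sequential: {:?}", t.elapsed());

        // Random: same number of blocks, but at scattered offsets.
        let f = File::open("testdata.bin")?;
        let t = Instant::now();
        for i in 0..BLOCKS {
            // Cheap pseudo-random block index; a real test would use a proper RNG.
            let offset = (i.wrapping_mul(2654435761) % BLOCKS) * BLOCK as u64;
            f.read_exact_at(&mut buf, offset)?;
        }
        println!("random:     {:?}", t.elapsed());
        Ok(())
    }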

Disks Are Really Fast Nowadays

When a lot of the algorithms in most of the modern systems we run today were designed, we really did have a completely different set of tradeoffs. To read an item from disk, really, we're talking milliseconds, and with SSDs that reduced by an order of magnitude to sub-millisecond. I remember when I got my first SSD on my machine, I was like, everything is so blazing quick. Now with NVMe SSDs, that has gone even further, down into the microseconds. Again, with throughput, a hard drive, you'd be able to saturate it at like 200 megabytes per second. Nowadays with NVMe drives, you can breeze past that, multiple gigabytes a second. There was a talk from the folks at ClickHouse where they were able to analyze over a billion records at a couple of gigabytes per second, which meant that you could search through a couple of billion records in under 1 second, which is pretty remarkable. I saw an NVMe drive a couple of days back which could reach over 7 gigabytes per second. Some of these drives are getting wickedly quick. They're getting so quick that they're looking at adding water cooling to the drives, because they're generating so much heat. That's the level of absurdity that we're getting to: we can barely keep them cool at the speed that we're reading and writing to them.

All of this is fine. This is effectively our next Moore's Law, almost, because many of these improvements we can take advantage of without much work in our systems. For example, this is some research from the folks at Samsung. They do have a bias because they make some of these drives. You can see how there is a massive difference, an order of magnitude, in the number of transactions that we're able to run per minute for this workload across different types of drives. NVMe drives absolutely breeze past SAS hard drives. We can get a lot of the benefits of added CPU or network headroom or faster disk speeds just by deploying our existing applications on more modern hardware. This was a MySQL workload, where they were testing some transactions. This might sound a little bit contradictory to the B-Tree example that I gave a little bit earlier, because clearly, by doing nothing, we've still been able to realize speed gains of a massive order of magnitude, thanks to the much higher throughput that we see here. But with a few tweaks to our software, we can realize even further orders of magnitude of gains, and take much better advantage of the speed of the hardware.

There was a really interesting research article that I came across on I/O performance. This is the level of absurdity, again, that we're getting into: a bunch of folks have realized that software applications are difficult to change, so what they're going to do is implement some of this in device drivers for disks, because disks nowadays are little mini-computers anyway. When you're doing random writes, they translate those random write calls into sequential calls at the device layer. They have their own separate lookup table, and they try to optimize for doing sequential writes rather than random writes. This matters because it helps reduce wear and tear on NVMe drives and SSDs. Even if you're running in the cloud, you've still got to worry about this. Because eventually, if you wear out your SSD, you're going to get a very nice notification from AWS saying your instance is being retired, which means that you have made the hardware sad, and it's now time for you to exit. This matters because in their benchmarking they achieved a 5% to 11% increase in write transaction throughput for common file systems. I think 11% was for ext4. By, again, just putting this into the device driver, they're able to realize further speed gains. We can take advantage of these, again, for free.

CPUs Are Getting Faster

Shifting away from the world of hard disks and into the world of CPUs, we've continued to see 10% to 15% year-on-year gains in single-threaded performance. This is a far cry from the Moore's Law that was in effect for the last couple of decades, but we're still able to extract significant performance out of a single thread. It's not just about adding more cores and more threads to our systems. It's not all about CPU clock frequency either. The caches on CPUs, for example, have gotten much larger too. I've plotted different types of EC2 instances, for example, and I've plotted their L1 and L2 cache sizes for their various compute optimized instances. The leap might not look that dramatic, but every time we've doubled these sizes, the probability of a cache hit goes up, which can dramatically reduce the number of CPU cycles spent. The cost of an L1 cache hit versus an L2 or L3 cache hit can make a dramatic difference in how quickly your application performs. An L1 cache hit is around 200 times faster than going to main memory.
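A crude way to feel that difference is to touch the same amount of memory with a cache-friendly access pattern and then with a cache-hostile one. The sketch below is mine, not from the talk, and the sizes are arbitrary, but the sequential pass typically runs several times faster than the strided one purely because of cache behavior.

    use std::time::Instant;

    // Same amount of work, two access patterns: sequential (cache friendly)
    // versus a large stride (cache hostile). Sizes are arbitrary.
    fn main() {
        const N: usize = 16 * 1024 * 1024; // 16M u64s = 128 MB
        let data = vec![1u64; N];

        // Sequential: cache lines and the hardware prefetcher do most of the work.
        let t = Instant::now();
        let mut sum = 0u64;
        for &x in &data {
            sum = sum.wrapping_add(x);
        }
        println!("sequential sum={} in {:?}", sum, t.elapsed());

        // Strided: jump roughly 4 KB between touched elements, defeating the cache.
        let t = Instant::now();
        let mut sum = 0u64;
        let stride = 512; // 512 * 8 bytes = 4 KB
        for start in 0..stride {
            let mut i = start;
            while i < N {
                sum = sum.wrapping_add(data[i]);
                i += stride;
            }
        }
        println!("strided    sum={} in {:?}", sum, t.elapsed());
    }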

Even as an industry, for a long time we were focused on x86 based processors. Now many providers are providing Arm based CPUs. Some of this, you may argue, is for competitive advantage, maybe trying to get rid of the Intel-AMD foothold in the industry. There may be other perverse incentives. Ultimately, there is some benefit for us as consumers. I remember when I got the M1 laptop; I still don't believe to this day that it has a fan, because I've never heard it. It does have one, I've opened the screws and it's there, but I've never heard it. My Intel machine, it'd be running away. It'd be nice and toasty up on the stage, even presenting these slides. I've personally not used these Arm based CPUs in data centers, but the folks at Honeycomb have reported a 30% improvement in price to performance for their server fleet. All they had to do was to tweak their CI to build artifacts for Arm and take advantage of this. Thirty percent is a pretty remarkable number. Imagine you're able to go to your CFO and say, there's a potential way we can reduce our cloud spend by 30%. I'm sure they'd be elated.

Networks Are Getting Faster

There is one principle about distributed systems that has remained true and probably will never change, which is that the network in general has not gotten more reliable. If you're building a system that spans multiple machines and they need to do some coordination, there's still a heap of work that you need to encode in your systems to handle network disruptions. Networks have, however, gotten significantly faster. Our original latency numbers assumed a 1-gigabit network, which was perfectly fine back in the day; it was completely reasonable about a decade ago. Most providers now give you 10 to 25 gigabit networks even on the bog-standard generation of instances, and our networking software has also gotten much quicker. There has been a significant focus on hardware offload and bypassing the traditional OS networking stack through things like eXpress Data Path and a bunch of other technologies, because CPUs can't keep up with the amount of processing that needs to happen at these super-fast networking speeds. You really don't have time to dawdle when you're getting tens of millions of packets per second. While you can definitely sustain tens of gigabits per second, I'm afraid there is one thing that has remained true, especially if you're on the cloud, which is the pesky bandwidth egress charges. They haven't become any cheaper. You just get to spend more money quicker. If you do need a high throughput interface, it is available to you.

Latency Numbers

Going back to our latency numbers, our L1 and L2 caches have gotten much larger. The likelihood of a cache hit, skipping a trip to main memory, has gotten much higher as well. We didn't have a chance to dive into optimizations in memory and RAM. We're seeing massive iterations in the DDR spec, which bring three to four times higher throughput and bandwidth. This is especially important for multi-core systems. Memory is also getting much larger. It's now common to get hundreds of gigabytes of RAM, and you can get machines with multiple terabytes of RAM. For many of us, we can fit our entire dataset into main memory, which is pretty remarkable. It almost negates the need for disk access in the first place, apart from as a persistent storage medium. Drives as well have gotten remarkably quicker, which has removed a massive bottleneck, especially if you need to read something off disk.

The Free Lunch Is Over

In 2005, there was an article titled "The Free Lunch Is Over," which I think is a very remarkable title, by Herb Sutter. The article talked about the slowing down of Moore's Law, and how the dramatic increases in CPU clock speed were coming to an end. When you read it in retrospect, and I'm going to bring you some choice quotes, it's almost as if he had a crystal ball into the future. Take a second to digest that quote there. No matter how fast processors get, software consistently finds new ways to eat up that extra speed. If you make a CPU 10 times as fast, software will find 10 times as much to do. Our applications haven't gotten any faster. If you open an app today, it's spinners galore while waiting for things to load. We're not waiting for things like networking and other associated systems. This should be quick. You're just writing a few rows to a database, it should be quick. Why is it so slow? Gone are the days where you could see the capital expenditure of racking your own servers and data centers. We've become accustomed to a world of infinite compute, especially because we're running in the cloud. We've taken advantage of it by writing distributed systems, but they have not scaled massively well. We've just shifted our complexity into running many more machines.

You may have often heard the phrase, let's just throw more machines at the problem, or let's just make the machine bigger, rather than looking at why our software is so slow for the machine that it is currently running on. We've often erred towards scaling upwards and outwards, because that is an easy fix. You just pay your favorite vendor a bit more money, and you can get a faster instance, probably up and running in minutes. Herb predicted in 2005 that the next frontier would be in the world of optimizing software, and that as practitioners we would be racing towards that frontier. That frontier was declared to be concurrency, for example. Herb had a great quote, which is that engineers had finally grokked object-oriented programming, and now concurrency would be the thing that we had to grok as an industry. Herb's thesis was that engineers would need to become very proficient at efficiency and performance optimizations to identify places where concurrency could make applications faster. Herb predicted that languages and paradigms geared towards heavy optimization work would find new life, and that would be the next frontier. I think in today's era of software, that realization is more true than ever. I think we have had a gap for the last 15 years, because in the meantime, cloud has come along and the cost of getting a server is now pennies.

Why does all of this matter? Cumulatively, we use approximately 1% of worldwide power on data centers. Some may argue that 1% is not that large of a number compared to many other industries. For example, when you put it in comparison to the monetary or the societal value that is being added, it is not that significant. Our demand for compute is ever increasing, and there is no reason why we should be wasteful.

Thread-Per-Core Design

There was a really fascinating report that I read a while back, which stated that 50% of greenhouse gas emissions from data centers are due to infrastructure and software inefficiency. That figure at first glance seems really high, but think about the software that runs in your organization: when you account for provisioning buffers, and serialization overhead, and runtime overhead, and operating system overheads, and virtualization overhead, the inefficiency does accumulate. We can see how that 50% comes about. Hardware has gotten much faster, and we've got a lot more throughput. What can we do in the world of software to take advantage of some of these optimizations? None of these ideas are very revolutionary. In our multi-threaded applications, we typically follow the shared everything architecture. We have our data in memory, and we work on different parts of the shared address space at the same time. Naturally, if two cores need to access the same piece of data, say one request is writing a particular record while another needs to read from that record, you're going to have to use some synchronization primitive. With a thread-per-core design, by contrast, your application is bound to a particular CPU thread, and you partition your data space so that each core is handling a different part of the address space. This pattern requires a bit of effort to encode, because your application needs to be aware of the partitioning and needs to partition the data space itself. You've got to have some form of request steering, so that two requests for the same record at the same time go to the same core. While work on that record will now run serially, removing the synchronization overhead reduces the number of CPU context switches.
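Here is a minimal sketch of that partitioning idea in Rust, using plain threads and channels. The key names and the hashing scheme are invented for illustration, and a real thread-per-core runtime would also pin each thread to a core and run its own I/O; the point here is just that each shard of data is owned by exactly one thread, so no locks are needed around the data itself.

    use std::collections::HashMap;
    use std::sync::mpsc;
    use std::thread;

    // Sketch of the shared-nothing idea: each worker thread owns a disjoint
    // shard of the keyspace, and requests are steered to the owning worker by
    // hashing the key, so the data itself needs no mutex. (A real system
    // would also pin each thread to a CPU core.)
    enum Request {
        Put(String, String),
        Get(String, mpsc::Sender<Option<String>>),
    }

    fn main() {
        let shards = 4;
        let mut senders = Vec::new();

        for _ in 0..shards {
            let (tx, rx) = mpsc::channel::<Request>();
            senders.push(tx);
            thread::spawn(move || {
                // Owned by exactly one thread: no synchronization needed.
                let mut store: HashMap<String, String> = HashMap::new();
                for req in rx {
                    match req {
                        Request::Put(k, v) => { store.insert(k, v); }
                        Request::Get(k, reply) => { let _ = reply.send(store.get(&k).cloned()); }
                    }
                }
            });
        }

        // "Request steering": the same key always hashes to the same shard.
        let shard_for = |key: &str| (key.bytes().map(u64::from).sum::<u64>() % shards as u64) as usize;

        senders[shard_for("account:42")]
            .send(Request::Put("account:42".into(), "12.30".into()))
            .unwrap();
        let (reply_tx, reply_rx) = mpsc::channel();
        senders[shard_for("account:42")]
            .send(Request::Get("account:42".into(), reply_tx))
            .unwrap();
        println!("{:?}", reply_rx.recv().unwrap()); // Some("12.30")
    }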

It's not only about data, though; CPUs need to do context switching for things like interrupts. For example, when network packets come in and raise CPU interrupts, you don't want to be handling those on the same core where you're running your application, because you'd then have to do some context switches. Being able to have some core affinity can really speed up your applications and reduce the amount of cache evictions that you're constantly doing. In Linux, you can use features like IRQ affinity to pin network interrupt handling to particular cores, which is really handy. It's not very complicated to set up. You can find some commands on the internet, maybe a bash script that you can curl in, if you're feeling so inclined. There was a paper in 2019 that measured the impact of doing this on tail latency for Memcached and Sphinx, pinning interrupts to particular CPU cores using IRQ affinity. They found a 71% reduction in tail latencies, which is, again, pretty significant. This model isn't revolutionary at all; it's maybe over a decade old. Systems like NGINX have been doing this for ages, and with massive benefit. As an industry, though, we haven't really encoded this into our applications. This programming can be quite fiddly. You need to get deep, for example, into the networking stack and your application stack, and build a model for thread-based message passing. There's some really fantastic work in the world of open source. We have systems like Seastar, from the folks at ScyllaDB, and they have a really nice tutorial on how to get started. If you're a C++ user, definitely check that out.
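For illustration, steering an interrupt is ultimately just writing a CPU mask into the kernel's /proc interface. The sketch below is a hedged example: the IRQ number and mask are made up, it needs root, and on a real machine you would look the device's IRQs up in /proc/interrupts, or let tooling like irqbalance or your NIC vendor's scripts do it for you.

    use std::fs;

    // Hedged sketch: pin a (hypothetical) NIC interrupt to CPUs 0 and 1 by
    // writing a hex CPU mask to its smp_affinity file. Requires root, and
    // the IRQ number 41 is made up; check /proc/interrupts for real ones.
    fn main() -> std::io::Result<()> {
        let irq = 41;
        fs::write(format!("/proc/irq/{}/smp_affinity", irq), "3\n")?; // 0x3 = CPUs 0 and 1

        // Read the mask back to confirm the kernel accepted it.
        let mask = fs::read_to_string(format!("/proc/irq/{}/smp_affinity", irq))?;
        println!("IRQ {} now allowed on CPU mask {}", irq, mask.trim());
        Ok(())
    }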

io_uring

Even in the world of I/O, there have been some significant advancements, namely io_uring. The world of asynchronous I/O in Linux has typically been quite complicated. You could get into a world where, for example, buffers were full, or the I/O you were doing wasn't quite matching what the file system was expecting, or you'd filled your disk request queue, which meant that your asynchronous call had now accidentally become synchronous. There was a lot of memory overhead as well with asynchronous I/O. io_uring provides a new interface. At the core, it looks quite simple. There are two queues. You submit your requests into the submission queue, and you read your results from the completion queue. You've essentially got two ring buffers, and you keep going round and round. On the surface, it looks quite simple.
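As a concrete taste of the two-queue model, here is a minimal read using the Rust io-uring crate, closely following its documented example. The file name is arbitrary and error handling is trimmed, so treat it as a sketch rather than production code.

    use io_uring::{opcode, types, IoUring};
    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    // Push one read onto the submission queue, wait, then reap the result
    // from the completion queue. Requires a reasonably recent Linux kernel.
    fn main() -> std::io::Result<()> {
        let mut ring = IoUring::new(8)?; // submission and completion queues, 8 entries
        let file = File::open("data.bin")?;
        let mut buf = vec![0u8; 4096];

        // Describe the read and place it on the submission queue.
        let read_op = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as u32)
            .build()
            .user_data(42); // tag so the completion can be matched to this request
        unsafe {
            ring.submission().push(&read_op).expect("submission queue full");
        }

        // Tell the kernel and wait for at least one completion.
        ring.submit_and_wait(1)?;

        // Pull the result off the completion queue.
        let cqe = ring.completion().next().expect("completion queue empty");
        println!("request {} returned {}", cqe.user_data(), cqe.result());
        Ok(())
    }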

This has had a dramatic effect, though, on the industry. This is now all merged into the mainline Linux kernel. Here's some analysis that was done. Here are two charts of random reads and random writes at a 4-kilobyte block size. I was talking about random reads and writes in the context of B-Trees a little bit earlier. If we're using the standard Linux async I/O interface, there's a cap on the number of operations we can do before we are starved. Look at that chart on the bottom there: io_uring compared to standard Linux I/O is an order of magnitude difference. Just over the last couple of weeks, libuv, which is a widely used library for async I/O, added io_uring support. I love this comment from the pull request author, which is, "Performance looks great. An 8x increase in throughput has been observed." It's always important to be wary about benchmarks, because benchmarks can tell whatever story you want them to tell. Any multiplier improvement is a great improvement for users of libuv. This is work that is happening in the community right now. Even io_uring is continuing to evolve. There's an active effort, for example, to make network I/O go via io_uring with zero copy, to extract even more performance.

Systems Programming Languages

Moving away from specific features in specific languages and the kernel, the world of programming languages itself has seen very rapid development. To really take advantage of development close to the metal, you'd usually have to resort to using C or C++, which is difficult and error prone. You've got memory leaking everywhere. It is difficult. I can see why engineers would want to avoid that. With the rapid rise of programming languages like Rust, or even Zig, which is a really cool programming language, the world of systems programming has become much safer. The tools around these languages have become really delightful, and the barrier to entry is much lower compared to C and C++. Speaking on Rust specifically, I'm no Rust fanboy, but I have a very strong admiration for it. It's not a niche technology for hobbyists anymore. Organizations like Amazon, and Google, and Microsoft have gone all in, and they're using it in core components like Windows, like EC2, like S3, and more. If you do have the opportunity to check this out, have a go with Rust. It's a really fantastic programming language. For example, I've worked with Python, and I've written C modules in Python. I found it really difficult to get right and package. Rust has some really nice Python bindings with PyO3, and the ecosystem maturity and the developer tooling are really quite phenomenal. You can get up and running really quickly, which means you don't have to worry about the fiddly parts of packaging and building wheels and eggs and stuff like that in Python. The ergonomics are much better.

There was some research in 2017 on the energy efficiency of different programming languages. They used a data processing task for this particular test, and then ranked different programming languages for energy and time usage. It's interesting to see how different programming languages compare. For example, Python is around 75 times slower than writing something in C or Rust. That is just such a massive delta. Many moons ago, I used to work for a company that dealt with public transit data, so think buses and trains, and all other forms of public transit. I'm going to give a pretty benign example. We had millions of timestamps that we'd need to parse and process, and this was happening on the hot code path. These were all ISO 8601 timestamps, so there was no variation to worry about. We were just using naive Python date parsing libraries, the stuff that's included in the standard library. We changed this out to a third-party C-based library, and saw a 10x speedup. A 10x speedup in this hot path just for parsing a date string. That's how I really got interested in this particular space of compute efficiency. It's quite a minor change when you look at it from the outset, but when you're doing this processing millions, if not billions, of times per second at runtime, if you truly understand the constraints of your system, you can make some wickedly fast improvements.

I had a go over the last couple of weeks and months at porting this library to Rust, just to practice what I preach, and to be able to come up here and say, actually, you should really do this to get up and running. Again, if you've written C modules in Python, you've probably recognized how difficult it is to get right. There are a lot of footguns: managing the global interpreter lock in C is an absolute nightmare. Marshalling and translating objects between Python and C is an absolute nightmare. Dealing with PyObject all over the place, again, a massive nightmare. Then you've got to figure out the magic incantation of your set of tools and your makefile to get it all right. It's just an absolute nightmare. In Rust, for example, there's this really fantastic tool called maturin, which has a super nice UX. You install it using pip, you call maturin new, and you have a package up and running. Then you write your code. This is the only bit of code in this particular presentation. We're not going to do a Rust 101 or anything like that. Hopefully, the code is relatively easy to follow. We've got a library declared there, an ISO 8601 library; that's the name of the module.

We've got a particular function that you can call from Python, which is the parseDateTime function. All we need to do is fill in the middle. If you have a particular timestamp, I'm sure pretty much all of us can figure out how to parse it: you extract all the number components and put them into the Python representation, while keeping information about the time zone. There is some implementation there. The code is pretty benign. It's just walking through the string, parsing it into its numerical components, and building up the Python datetime object. If we get some malformed input, we can raise a Python exception with stack trace information as well. Once we have all of that implementation, we can use maturin, for example, to build a Python wheel. Again, this is really simple. You run this one command. I've not had to touch a makefile. I've not had to touch setup.py or distutils. It's really easy. This is ready to go on PyPI, or you've got a wheel that you can distribute in your wheelhouse and send to your other engineers so they can use it as a library. We've not had to do any heavy lifting here. If you open a Python interpreter, you can quite literally import that module, call the function, and you are ready to go. Now you are using Rust and Python seamlessly. Again, we've not had to do any heavy lifting.
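The talk's slides aren't reproduced here, so below is a hedged sketch of what such a PyO3 function might look like. The module name r_iso8601, the function name, and the simplified tuple return type are my own choices rather than the speaker's code, the parser ignores fractional seconds and offsets, and the module signature follows older PyO3 releases; exact signatures vary between PyO3 versions.

    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;

    // Parse the fixed-width part of "YYYY-MM-DDTHH:MM:SS" into numeric fields.
    // A fuller version would validate separators, handle fractional seconds
    // and offsets, and build a real datetime.datetime object.
    fn field(ts: &str, range: std::ops::Range<usize>) -> PyResult<u32> {
        ts.get(range)
            .and_then(|s| s.parse().ok())
            .ok_or_else(|| PyValueError::new_err(format!("malformed timestamp: {ts}")))
    }

    #[pyfunction]
    fn parse_datetime(ts: &str) -> PyResult<(u32, u32, u32, u32, u32, u32)> {
        Ok((
            field(ts, 0..4)?,   // year
            field(ts, 5..7)?,   // month
            field(ts, 8..10)?,  // day
            field(ts, 11..13)?, // hour
            field(ts, 14..16)?, // minute
            field(ts, 17..19)?, // second
        ))
    }

    // Exposed to Python as: import r_iso8601; r_iso8601.parse_datetime("2023-12-07T10:30:00")
    #[pymodule]
    fn r_iso8601(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(parse_datetime, m)?)?;
        Ok(())
    }

With maturin develop or maturin build, a module like this compiles into a wheel that Python can import directly, which is the workflow being described above.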

Inflection Points

I'm not advocating that we stop everything that we're doing now and switch to writing all of our software in C, or Rust, or Zig, or what have you. There is an inflection point, and typically it takes a long time for organizations to realize this. To start with, we optimize for development and organizational cost. We want to get something out there as quickly as possible, and see if it will generate revenue or what have you. While we are doing that, there comes an inflection point where the system's runtime cost is growing linearly, if not exponentially. There is a runtime cost that we need to consider, especially as our systems continue to scale. Hardware has changed really quite significantly, and there has been a big focus in the world of software on new frameworks and tools to take advantage of it. Let's go through a few of those very quickly.

Even in some of our mature ecosystems, in Java, for example, there has been a big focus on language features. The one that I've been most excited about is the ZGC garbage collector. It gives sub-millisecond pause times. Garbage collection is a significant time consumer within applications. It can affect your tail latencies, and that can make your systems feel slower. ZGC gives sub-millisecond pause times no matter how large your heap size is, even if it's multiple terabytes. This isn't some distant reality; this is production-ready today, with LTS support in Java 17. If you're running, for example, a Cassandra, or a Kafka, or any of these systems that rely on Java, or if you're writing Java applications, systems, and services yourself, definitely check out ZGC. It is production-ready. We're running it in production, and we've found sub-millisecond pause times. I thought I'd never see it. It's absolutely phenomenal.

We're very close to a world where we have basically nearly pauseless Java. Again, there are even basic optimizations that we can do in our applications. This is an example from our production workload. This is an optimization we did to the JMX exporter, which we run alongside Kafka. This is an exporter that exports Prometheus metrics, so metrics into our telemetry system. We used tools like Java Flight Recorder and Mission Control to find out where the CPU was being spent. We found that a significant amount of CPU was being spent in a code path which was just doing a bunch of redundant string comparisons. We put in a bit of a HashMap, we traded off a little bit of memory, I think a couple of megabytes, and we were able to halve our Kafka CPU utilization. That means we have much less garbage being created, and much more CPU headroom to be able to scale our actual Kafka application, rather than spending that time in redundant metrics collection.
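The actual fix was in the Java JMX exporter, but the shape of the optimization is easy to show. Here is an illustrative sketch in Rust of the same trade: cache the result of repeated string matching in a HashMap, spending a little memory to skip the comparisons on every subsequent sample. The rule patterns and metric names are invented.

    use std::collections::HashMap;

    // Instead of re-running the same pattern matching for every sample of
    // every metric, remember the outcome per metric name.
    struct RuleMatcher {
        rules: Vec<String>,                    // patterns to match metric names against
        cache: HashMap<String, Option<usize>>, // metric name -> index of matching rule
    }

    impl RuleMatcher {
        fn matching_rule(&mut self, metric: &str) -> Option<usize> {
            if let Some(hit) = self.cache.get(metric) {
                return *hit; // cache hit: no string comparisons at all
            }
            // Slow path: do the comparisons once, then remember the answer.
            let found = self.rules.iter().position(|r| metric.starts_with(r.as_str()));
            self.cache.insert(metric.to_owned(), found);
            found
        }
    }

    fn main() {
        let mut matcher = RuleMatcher {
            rules: vec!["kafka_server_".into(), "kafka_network_".into()],
            cache: HashMap::new(),
        };
        // The second call for the same metric name is answered from the cache.
        println!("{:?}", matcher.matching_rule("kafka_server_messages_in_total"));
        println!("{:?}", matcher.matching_rule("kafka_server_messages_in_total"));
    }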

Flight Recorder comes embedded with the JDK from Java 11 onwards, so there's no reason not to have that available. Node.js, for example, has a dedicated performance team. They're doing some fantastic work to tune performance for the underlying APIs that engineers rely on. For example, reading an ASCII encoded file has seen a large throughput jump in Node 18. Even Python is getting a massive speedup as well. There's a small task force being run under the Microsoft umbrella, and led by the BDFL of Python, Guido van Rossum, to tackle some of the performance bottlenecks in Python. You can take advantage of all of this as well. There's a little bit of work for you to do if you're on Python 3, in handling maybe a deprecated library or two, but you can introduce this in your programming tool chain as well and take advantage of these optimizations in production.

At the company that I work for, we write all of our systems in Go. This is something that we expose to all of our engineers: we give them access to all of the profiling tools. These profiling tools run with a very small amount of overhead in your production application. There's no reason why engineers shouldn't be able to access these tools and take advantage of them to understand where their systems are spending CPU time. We even have incentives, for example Slack channels where different leaders come in and applaud efforts to improve optimization. Of course, there is an incentive for us because it means that we get to reduce our costs. It also means that we have a bit more mechanical sympathy for the computers and the infrastructure that we're running on. Ultimately, that leads to less complexity in our systems. The least complex systems are the ones that run on the minimal number of machines.

Even if you have an existing application where, for example, you might not be able to add profiling tools, in the world of the kernel, with technologies like BPF and eBPF, you can leverage kernel-level tracing, assuming that you're on Linux 4.1 and above. If you're not, then please definitely upgrade. My favorite Swiss Army knife is bpftrace. I even like to use it alongside other language-specific profilers, because it gives really good insight into syscalls, into the file system, and into a bunch of other areas. All of the different arrows on this slide are different tools that you can use to probe different parts of your operating system stack, so there is literally no excuse. We can go into every single particular area of the kernel and our application stack to get a solid understanding of where CPU time and compute time and hardware time is being spent.

This is Jackie Stewart, who coined the term mechanical sympathy. In the context of motor racing, it was about caring about and understanding man and machine to extract the best performance. It's a combination of human and machine coming together. With the shifts in hardware and software in general, there is a cultural shift happening in how we write software, even rewriting core software to take better advantage of the hardware. There has been some incredible work from Facebook, for example, adding a new storage engine to MySQL with MyRocks. We can all play a part in that shift by evaluating the systems that we run, and understanding how much power there is in modern computing.

Generative AI has captured the minds of the masses. One massive discussion point right now is the sheer cost of compute used to train and serve these models to customers. There's a lot of low-hanging fruit in this particular space as well. Llama is a model by the folks at Meta, and there are some open source implementations in C++ that a bunch of folks are building. There was a really excellent pull request that I came across where they introduced mmap, which is very old technology. Probably, a lot of people with systems-level thinking are like, "Yes, that sounds so obvious." It led to a 100x improvement in load times, and half as much memory consumed at runtime. This is relatively modern software; Llama was written relatively recently. There is a lot of low-hanging fruit, even in systems that are being written today. Even old tricks like mmap, memory alignment, and thread-per-core can still yield massive benefits.
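For anyone who hasn't reached for it before, the mmap idea is simply to map the file into the process's address space and let the page cache fault pages in on demand, instead of copying gigabytes of weights into freshly allocated buffers. Here is a hedged Rust sketch using the memmap2 crate; the file name is made up.

    use memmap2::Mmap;
    use std::fs::File;

    // Map a (made up) model file instead of reading it into a buffer: pages
    // are faulted in lazily, nothing is copied up front, and read-only
    // mappings of the same file can be shared between processes.
    fn main() -> std::io::Result<()> {
        let file = File::open("model.bin")?;
        // Safety: the mapping assumes nobody truncates the file underneath us,
        // which is acceptable for an illustration.
        let weights: Mmap = unsafe { Mmap::map(&file)? };

        // The mapping behaves like a &[u8] with no upfront copy.
        println!("mapped {} bytes, first byte: {:?}", weights.len(), weights.first());
        Ok(())
    }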

Most of the systems that we build are glorified data processors. We take some data in one end, we transform it a little, and we plonk it out the other end, ad infinitum. There's a lot of low-hanging fruit in that layer as well. If you do a lot of JSON parsing, there's a really fantastic library that has been introduced that uses SIMD, Single Instruction, Multiple Data, to make JSON parsing even quicker. There are a lot of screws being turned, a lot of magnifying glasses being placed on various different components of our application stack, to see where we can extract more performance.

Summary

Let's go back to my statement at the beginning of the talk. Many of the systems we rely on were built for an era where hardware was different, where there were bottlenecks in hardware: your disk couldn't get any faster, or your network was constrained. The bottleneck over a decade ago was in network interfaces and hard drives and main memory. Today, hardware has gotten so quick that the bottleneck is firmly, without a shadow of a doubt, in the software that we write, and hardware continues to accelerate even quicker. These optimizations are now staring us right in the face. In many of our organizations, we put up faux barriers. Maybe you have an ops person who's very stubborn and says, "We must use Java 8. Java 8 only until the day that I die." We have the perfect audience to lead and influence a lot of these changes. We're leaving all of these optimizations on the table for organizational reasons, which is quite silly. These improvements are in production and LTS releases. There is very little reason to be stuck on old Java or Python or Node versions. There are small things that we can do and introduce to our organizations to yield massive improvements.

We've seen that there are tools out there that allow us to keep pace. These are features being introduced at every level of the stack: data serialization, network hops, kernel-level features, programming languages, even new features for existing programming languages. There are optimizations being done at all levels of the stack, but it is up to us to take advantage of them. For example, if you're using older libraries or older language versions, there is some work that we need to do in order to get to the latest and greatest. These are things that are production-ready, and there is legitimately no excuse not to start tackling some of them. Hopefully, that will allow us to build faster, more efficient, richer applications. I long for the day where you never ever have to look at a spinner. I think that will be an absolutely glorious day. By keeping pace, we can use less compute time, less energy, and fewer resources for the same level of output. We can build richer systems as a result.

 


 

Recorded at:

Dec 07, 2023
