Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Understanding CPU Microarchitecture to Increase Performance

Understanding CPU Microarchitecture to Increase Performance



Alex Blewitt presents the microarchitecture of modern CPUs, showing how misaligned data can cause cache line false sharing, how branch prediction works and when it fails, and how to read CPU specific performance monitoring counters and use that in conjunction with tools like perf and toplev to discover where bottlenecks in CPU heavy code live.


Alex Blewitt has been working with Java since its first release, and has worked on JVM projects at Goldman Sachs and Credit Suisse, where he was the JCP representative until 2016. Before moving to Santander in 2020 he worked at Apple on Swift, and has authored books in Swift and Eclipse plugin development. He writes for InfoQ about Java and JVM topics.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


This is the talk on Understanding CPU Microarchitecture For Maximum Performance. What we're going to be talking about is really what happens inside a CPU, what goes on in the bits and bobs in there, how does that hook in with the rest of the system, how does the memory subsystem work, how does caching work, and so on. We're going to look at the tools that we have available for being able to do analysis of how we can make our CPU run faster.

We're going to be focusing on this performance pyramid. We're going to be talking about the instructions we use, the way the memory works, the way the CPU works, really at the top part of it here – this is what this talk covers. (The presentation, by the way, will be available afterwards; I'll send out a tweet that'll also be on the QCon website.) There are other things that you can do for performance. Specifically, if you're looking at the performance of a distributed system, fix your distributed system first, fix your system architecture first, fix the algorithms first. Really, the top level is the last couple of percent of that particular process so other QCon talks are available.

Computers have been getting really quite complicated. Back when we were just talking about 6502 and the BBC, Apple and other such systems – Commodore 64 was mine – you had a single processor, it just did one thing and it did it very well. These days, server processors come in multi-socket configurations; and these multiple sockets are connected to multiple memory chips and there's a communication bandwidth path between them. This talk I'm going to be focusing really on Intel specifics and on Linux as far as operating systems concerned: some of the ideas will be Linux and Intel-specific but the idea is we'll apply to other operating systems on other platforms as well.

You can have dual socket servers; in this case, you've got two sockets talking to each other. You can get four-socket configurations and you can get eight-socket configurations. Each one of these sockets is connected to a bunch of RAM chips that then is local to that socket, and the other RAM chips – whilst accessible – are further away and therefore slightly slower. This is known as Non-Uniform Memory Architecture (NUMA). Pretty much any serious server-side system is a non-uniform memory architecture these days.

Inside the Socket

What happens inside the chip? it turns out that if you go down another level, you see the same pattern. This is what the Broadwell chips look like, with a ring bus around for communicating backwards and forwards across the cores. Each one of these little squares, for those of you standing at the back, are the individual cores themselves with cache and processing associated with it.

The 18-core die looks a little bit weird because it's got a bidirectional pump between the left-hand side and the right-hand side. If you're going from a core down at the bottom to one over the far side you then have increased traffic to be able to get over there and, therefore, a slower delay as well. The same thing is true of the 24-core as well.

These sockets and these cores are really getting quite complicated and importantly although we think of machines as being a Von Neumann architecture that you can just reference any memory and get the result back; the time it takes to get that back can vary dramatically depending on where that data is being loaded from.

The current generation of Intels have moved over to a mesh-like architecture. For things like the Cascade and Skylake systems, they have a mesh for being able to move sideways. This gives more paths to be able to get from one place to another and you can theoretically do it in less time. They come in a 10, an 18, and a 28-core variety for being able to move forwards. Each one of these chips has got bi-directional memory ports going out either side and so you can actually partition these things into a left and right Sub-NUMA clusters and being able to say, "I want things to run on this half and talk to this bank of memory," whilst you have another set of processes that run on this side and talk to that bank of memory.

In fact, there's been a recent release of the Cascade(Lake) 56 core ‘die’ – actually it's a package rather than a die because what they've done is they've taken two of the existing dies, stuck them next to each other, and actually put in the pipeline on the same piece of Silicon that goes in.

Cache Is King

Things are getting complicated. If we drill down further into the cores then we see why, because we've got different levels of cache inside each one of these processors as well. They use a dollar sign because it's a play on the word ‘cash’. For Skylake systems you've got a register file – which is the number of in-flight variables, if you like, registers that are used to hold data – for Cascade and Skylake systems at 180 integers and 168 floating point values. These can usually be accessed in one clock cycle, which is half a nanosecond if you're running at two gigahertz – or slightly less if you're running a little bit faster.

Those, in turn, delegate to a Level 1 cache, which is split into an instruction and a data half. The idea for splitting it is so that when you're processing large amounts of data, your data isn't pushing your program out of the space. In particular, most of the time, you're just reading from the instruction cache, whereas the data is a two-way street. That access time is about 4 cycles.

That then delegates back to a Level 2 cache, which is shared between them; typically 12-15 cycles – something along those lines, depending on the architecture. Of course, these have different sizes. In the case of Skylake's and Cascade's systems, it's about a megabyte at the Level 2 cache that's specific to that particular core.

If you want to talk to external memory, there's a Level 3 cache as well – and that's shared across all cores on the same die that you're loading it from. There's usually a 16-megabyte or something of that size that is stored on the die itself and each core can access memory from there. Of course, the time it takes to access it is really a function of how local it is to that data.

One thing I'll also point out, the Level 3 cache on the Intel chips is non-inclusive at the L3 layer, but inclusive at the L2 layer. What that means is if you've got some data in the L1 it will also be in the L2 but it doesn't actually have to be in the L3. AMD have just launched a new chip that's got an absolutely massive L3 cache inside there: that has a number of performance advantages and we'll probably see Intel coming with bigger L3 caches in the future as well.

Thanks for the Memory

Of course, these delegate out and load data from RAM. In this case, DRAM depending on how far away it is, could be anywhere from 150 to 300 cycles. It gets a little bit imprecise talking about cycles there, because it's a function of both the process of cycle speed and also the memory speed as well.

In fact, if you use a program called lstopo, it will show you how your computer looks.

This was taken on my laptop:iIt's a single-core system with a bunch of cache split over various different levels inside there – and it's actually reporting a Level 4 cache. It isn't really a cache as such, it's memory shared between the GPU and the CPU on my particular machine: that's because it happens to be a laptop.

Actually, we're seeing Level 4 cache turning up and, in particular, we're seeing non-volatile RAM coming online at some of the Level 4 caches as well. It'll be interesting to see how that evolves. The cores that are shown at the bottom, we've got a four-core processor with hyperthreading available.

Translation Lookaside Buffer

But there's more than just the memory cache: when people talk about memory caches, they usually think of this Level 1, Level 2, Level 3 combination, but there's a bunch of others that are inside there. One of them which is very important is called the Translation Lookaside Buffer, or TLB. The TLB is used to map the virtual addresses to where the physical addresses are on the system. The reason why this is important is because every time you do a process change or potentially every time you go into the kernel and back out again, you need to update what those tables are.

Those tables will have a listing essentially that says, "If you see an address that begins with 8000, then map it to this particular RAM chip, which is going to be somewhere on the system. If you see something beginning with ffff, then map it somewhere else." This happens for every address you look up and so, therefore, it needs to be quite fast. Specifically; if you have something in that cache, great, you can access memory quickly; if it's not in that cache, it's going to take you a while.

Page Tables

That's because it has to do what's called a page table walk. Each process in your operating system has a page table, and that page table is stored in the CR3 register – but each time you do a process switch, it's changed over to something else. Essentially, this is the map that your process is running on that particular machine. It's a tree, so I've demonstrated this as a two-level hierarchy here for being able to step between them, but actually, on modern processors, it's four levels deep, and on Ice Lake, which is Intel's next generation, it'll be a five-level deep page structure.

In particular, this level of page structure can give you a certain amount of memory space. At the moment, four-level page tables will take 47 bits, 48 bits' worth of space for the virtual addresses; five levels will bring that up to 57 bits, which means you can address far more virtual memory than we need. Sixty-four terabytes should be enough for anyone.

(Huge) Pages

These memory pages are split into 4k sizes. Now, this 4k size made sense back in the 386 days when virtual memory came along, but it's not really great for systems that are taking hundreds of megabytes’ or hundreds of gigabytes' worth of space. You can change the level of granularity that this mapping happens from a 4k size to a huge page. A huge page basically means something that isn't a 4k size.

Most Intel systems will have two different sizes for this; they'll have a two-megabyte and a one-gigabyte support. It's under operating system control as to which one of those it uses, and each CPU will have flags to say which one it is. Different architectures have the same idea of large pages as well, but they'll work in slightly different ways or have slightly different sizes by default.

The purpose of using huge pages is so that the TLB doesn't need to store as many pointers. If you've just got one giant page for your process, then all of your lookups go through that one entry in the table and you can load this fairly quickly. It's good from that point of view; but it does have some downsides. It might be slightly more complex to set up and use: and if you're using Hugetblfs, then you need to configure it.

Hugetblfs is a bit of a pain. This was the first thing that came out in Linux to be able to support large pages. What you'd have to do was to be able to specify ahead of time how many large pages you wanted, and what the size was going to be, and then you had to be in a certain permission group and then your application would then decide to use these things. Generally speaking, people tried it, people didn't like it, people stopped using it.

Transparent Huge Pages

There was something called Transparent Huge Pages instead. Now, this has been slightly more successful, but not without its pain points. Transparent Huge Pages says: when you ask for a page, instead of giving back a default 4k one, then give them back a two-megabyte one or a one-gigabyte one depending on how it's configured. However, most applications were written to assume that when you did an allocation of a page you just get a 4k size back. They'd only write 4k's worth of data and you've ended up allocating two megabytes worth of continuous physical space and you're only using a small fraction of it.

Transparent Huge Pages when it first came out and just giving everyone large pages by default didn't really work. There are several configuration options you can do. One of them is in the hugepages/enabled is to specify something that will work if you use madvise. Madvise is a call that you can say, "Yes, I'd like to use Huge Pages, please," and you can specify that in your code. If you don't do that, you get the small page, if you do, you get a big page.

One of the problems that happen for high performing systems was that you would ask for a large page and the operating system will go, "I don't have a large page yet. Standby while I go and get one," and it would then assemble a whole bunch of little pages and it would take some time, which is not good if you have a low latency system.

A relatively new option that was added to this (within the last few releases of Linux) is the defer option. What the defer option will do is it will say, "Ok, I'm going to ask something. I would prefer a large page but if you don't have one then that's fine, I'll just take a bunch of small pages and you can fix it again afterwards." That has – on the whole – reduced the issue of the blocking that you would see. You'll still see a bunch of blog posts and Stack Overflow answers that say, "Don't use huge pages." Give this a try with madvise and with defer and see what happens.

Cache Lines

While the operating system deals with memory in the unit of a page size whether that's 4k or two-megabytes, the actual processor is dealing with memory at a cache line size. A cache line size is, at the moment, about 64 bytes. Now, I say about 64 bytes. It's exactly 64 bytes for the Intel processors you're using on your laptops, but it may well be 128 bytes in the future. Don't assume that it's going to 64 bytes.

Particularly, if you're working with mobile devices or ARM architectures, ARM has got something called big.LITTLE and you end up with processors with different size cache lines inside them. So be aware that different ones exist. For Intel servers mostly you're looking at 64 bytes. By the time you're watching this in two or three years time on InfoQ, it'll probably be 128 bytes. You heard it here first.

When you load memory from the process and iterate through it, what will happen is the CPU will notice that you're reaching out to memory and then start getting the data in. When we're striding through memory, the processor’s memory subsystem will automatically start fetching memory for you.

If you can arrange your processes to iterate through memory in a linear form, great, you're going to be able to go through them. Bouncing around randomly like when you're traversing an object heap, not so good for the memory system.

It does notice when you're doing other striding information as well. If you're striding through every 32 bytes or something, the memory pre-fetcher will notice that, and just load every other line for you.

There is something that you can use in compilers __builtin_prefetch – which ends up being a prefetch instruction under the covers – that can request that you're going to be looking at some memory soon so please make it available. Only use this if you've got the data to show that it makes sense. You're mostly going to make the wrong decisions about it, not because you can't make the right decisions but because you'll either request it too early and it will push out stuff that you were using, or you'll request it too late and you'll have already used the data by the time that you need it. It's something that you can use for tweaking but not something I'd recommend jumping to as a first point.
Trying to organize your memory structure so that you can process through it linearly is going to be the way that you can improve performance at that layer.

False Sharing

You can also have something in cache lines called false sharing. False sharing is when you've got something of a cache line size and you've got two threads being able to read and write inside it. Although they're writing to different locations, different variables, in computer language speak, if they're reading and writing into the same cache line then there's going to be a bit of tug of war between two cores.

If you've got those two cores on the other sides of a 56-core package or in another socket then you're going to get contention inside here. You're either going to get data loss if you haven't used synchronization primitives [NB not specific to false sharing, just in general] or you're going to get bad performance as they're fighting over the exclusive ownership for that particular cache line.

To avoid this, if you are doing processing with multiple threads and you're going to be reading and writing a lot of data then have them separated by a couple of cache lines. I say a couple of cache lines, because the load fill buffer when it loads things it'll load in a couple of cache lines at a time. Although you're not reading and writing that section you might find that you're still treading on each other's toes. The Oracle HotSpot compiler has sun.misc.Contented [or jdk.internal.vm.annotation.Contented] that pads with 128 bytes of blank space when you're trying to read and write something so that it avoids this particular problem. Of course, that number will change as the cache line size changes as well.

Memory Performance Strategies

In order to take the best advantage of the memory subsystem, try and design your data layout and your data structures so that they fit in within an appropriate amount of space. In other words, if your hot data set that you're processing and using a lot can fit inside the L1 cache, great that's fine, you'll be able to process it very quickly indeed. Or you can buy an Ice Lake chip which has got 48k's worth of Level 1 instead of 32k. If it doesn't, see if it can fit in the Level 2 and see if it can fit into the Level 3. These kind of things are visible from JavaScript or Java or Python or whatever you're dealing with, and you can see if you measure performance as you round the sizes up that you get these step changes as you overflow out of one layer of cache into the next one.

Consider how you structure your data as well. If you've got data in a set of arrays, it may be easier to pivot and think of it as a set of arrays each with one field in – because let's say you're processing images: you may not need to process the alpha value, but you might want to process the red, green, and blue. If you have them as an array of reds, an array of blues, and an array of greens, then you'll be able to get a better performance than if you iterate through red, green, blue, red, green, blue, in memory.

Also, consider using thread-local or core-local data structures as well. If you're doing some map-reduce operation over a whole bunch of data, treat it as a distributed system. In a map-reduce job you'd fire it out, you'd get a load of results, and then you'd merge them back in at the last step. Treat that the same with your cores as well. Get one core to calculate and do something, another core to calculate and do something else and then bring those results and then add them up rather than trying to fight over some shared variable in memory.

Consider compressing data. This happened in the JVM recently when we had compressed strings. Instead of having each character taking up two bytes worth of space, it was compressed down to only taking one-byte amount of space. That was a performance win because although it costs a little bit of extra instructions to expand and contract on demand as it's loaded in, actually the fact that you're shifting less data both in and out of the caches and also for the garbage collector to process meant that it was a performance win. You can use other compression strategies. HDRHistogram is a good way of compressing a lot of data down in a logarithmic form. Depending on your application you'll be able to think of something as well.

Pinning Memory and Threads

You can also pin where memory and threads live. If you're dealing with a massively scalable system and you've got a massively scalable problem to solve, then try and pin threads to certain places to be able to process them. If you're dealing with something like a Netty benchmark where you're consuming events over a network socket, then pin your worker thread so that one worker lives on this core, another worker lives on that core, another worker lives on a different core. That way you'll get the best performance because you'll never get them processed or swapped around the place.

If you are doing that, you might need to tell the Linux kernel to stay away from a subsection of those cores using isolcpus which is a boot time option that you can say you want the Linux kernel to do its housekeeping just on a subset of your cores. You can also use taskset to say when you start off a program like a logging daemon or something that you want to create a CPU set that your application is going to use and other applications are going to be run elsewhere. These things you can specify as a Linux sysadmin to be able to do that.

There's also numactl and libnuma which you can use in order to be able to control programmatically where you're going to allocate large chunks of memory. Again, you can use the (sub)NUMA clusters on the processors to be able to decide where that goes.

Inside the Core

We've talked a lot about memory, what about the actual brains of the operation, what about the CPU? The CPU or the core is split into two parts, the frontend and the backend. This isn't like frontend development, it doesn't run JavaScript in there. The frontend's job is to take a bunch of instructions in x86 format, decode them, figure out what they are, and then spit out micro-operations or μops because U is easier to type than μ on the keyboard for the backend to process.


We get a bunch of bytes that come in off the memory system that goes into the predecoder which says, "This is where this instruction ends. This is where this instruction ends," goes into an instruction decode and then converts that into micro-ops.

We're going to look at just this increment one here, because when you increment something in memory, what it really means is you're going to load from memory, going to do an addition, and then you're going to write it back to memory. That increment corresponds to three micro-ops.

Generally speaking, you'll have a one-to-one type relationship between the micro-ops in the backend. If you're using complex addressing modes where complex means you're adding an offset or you're multiplying some value then there might be a few other micro-ops that get generated as well. Ultimately, the point of this is to spit out a bunch of micro-ops for the backend to do.

At this point, we're all operating in order. The instructions come in order, the μops come out in order that we want them to execute as well. There's a few things that go on the frontend. One of them is the μop cache. As you do the decoding from Intel instructions which are fairly complicated into a set of these micro-operations which are internal but slightly easier, they can be cached so the next time you see it, it'll spit out the same things.

Importantly, we've got a loop stream decoder which uses loops so that when you're running through a very tight loop it will just serve the decoded μops rather than going through this parsing process. It'll shave off a few elements in the pipeline. (Most pipelines on Intel processors are between 14 and 19 deep depending on what it's doing.)

Branch Predictors

One of the things that impacts it more than anything else is the branch predictor.

The branch predictor's job is to figure out where we are going next. It's like the sat nav for the CPU. It's right most of the time. It probably has a better rating than me when I'm trying to do the driving navigations. It figures out where you're trying to go, it assumes whether or not the branch is taken, and then it starts serving the instructions so that by the time the pipeline flows through, the instruction's all ready and waiting to go.

Sometimes, that will fail. In that case, the parsing and the decoding is thrown away, it resets to where it should be, and then you have what's called a pipeline stall or a bubble, where it waits for the instructions to go through and then catch up again. If you can reduce bad speculation on the branches, you're going to get better performance in your application.

The branch prediction dynamically adapts to what your code is actually doing. This is also something that's visible from the high level as well. If you've got a Java program or a JavaScript program that's iterating through a bunch of arrays and maybe adding up, say, positive numbers into one counter and negative numbers into another counter, the branch predictor, figuring out which of the elements to go down, is going to be confused on random data. You'll get reasonable performance, but the branch predictor won't be able to help you, because it'll be right about 50% of the time.

If you're dealing with a sorted data set, in other words, you see all the negative numbers first and then all the positive numbers afterwards, the branch predictor is going to be on top of the game, and it will give you its best performance because it will assume once it's seen the first few negative numbers that they're all going to be negative until it changes over when you've got a bit of slow performance and then the positive numbers will go on from there.

The branch predictor, which decides whether you go down something, is only half of the story. There's also a branch target predictor as well. This target predictor says where it is that you're going to go. Now, in a lot of cases, the branch [target] predictor is going to know exactly because you're going to jump to a specific memory location. You're jumping to the system exit function or you're jumping to this particular routine at the start of the loop again.

Sometimes, though, you're jumping to the value of a register and that register may not have been computed yet. It will take a punt and think actually what it's trying to do is that it will load this data value in, but sometimes that might be wrong and you need to rewind and do things again.

You'll see the branch target predictor being confused a lot when you iterate through object-oriented code whether that's C++, whether that's your own object orientation, whether it's the JVM or something. That's because in order to figure out where to go, it has to be able to load what the class is, look in the class word, and then figure out from the class word where the vTable is and then jump to the implementation in the vTable. Those few jumps are going to confuse the branch target predictor.

One of the few things you can do to speed this up is that you can have a check to say, "If this looks like an X class, jump to the X class implementation. Otherwise, fall back to dynamic dispatch." That's something that you can implement very cheaply and very quickly. In fact, the JIT and JVM does this by using monomorphic and bimorphic dispatch for the common cases and some of the non-Oracle ones do more than just two.

That works in C++ as well. If you've got a dynamic dispatch call like if you're implementing a VFS... I saw something a while ago that said a Linux VFS operation had been sped up by a factor of three or something simply because they said, "If you're using the X2 filing system, then delegate this to the X2 implementation instead." Some of the recent mitigations that have been put in the Linux kernel to avoid things like Spectre and Meltdown and so on have actually decreased performance of Linux over time. By putting your own monomorphic dispatch, you can then gain some of that speed up back again.

Here, I think Martin has highlighted this a few times in the past: Inlining is the master optimization because when you do inlining you suddenly know much more about where it is that you're going but you also often lose an indirect call as well. Actually, losing those function calls is a good way of optimizing performance because you then don't have to worry about the branch predictor or the target predictor at the CPU from being able to do things.


We've figured out where we're going, we've got a whole bunch of μops, we've got this nice stream of them coming through. What happens next? Then it goes over to the backend, and so the backend's job is to take all of these μops, do the calculations, and then spit out any side effects to memory. In this particular case, we're loading from a location and memory, we're incrementing that, and then we are writing that value back again.

Now, at this point, we don't have to worry about things like eax, esi, rsi, and so on because we've all changed them to temporary registers. The x86 ISA has these registers inside. The core has a lot more of these things. It says, "For this particular instruction, we're going to stick the temporary variable which was eax here into, say, R99." You're going to pick one of the ones that's free inside this.

As processors evolve, as this register file gets larger, we can deal with more and more in-flight data. That's important because inside the backend, once we've done the allocation, we are now off to the races. All of these μops are competing for availability on the internal processor at the backend to actually do their work. As soon as that data dependencies are there, they'll then be scheduled, executed, and the results will go in.

Intel processors and most other server-side processors these days are out of order because, at this point, any of those micro-ops can happen at a time. Importantly, as far as the performance is concerned, this backend is capable of running multiple μops simultaneously. The frontend can dispatch to the backend four μops per cycle for Intel and Cascade Lake systems. For Ice Lake systems it can issue five μops per cycle. If they aren't competing for slots that they need to run on, then you can have a number of those μops running in parallel as well.

In this particular case, our internal ‘processor’ has got a number of ports, and those ports are the execution units. They are all able to deal with arithmetic, they are all able to do logical operations. Some of them are going to be able to divide, some of them are going to be able to multiply. There are different implementations for the integer and floating-point sides.

All of these will be able to take up 256 bit values inside them. Port 5 will be able to handle 512 bits on its own and you can combine ports 0 and 1 to have a 512-bit operation. If you are dealing with 512 bits, you've essentially got two paths, either port 5 or port 0 and 1. You don't get to decide what happens, this is the CPU doing it for you, but you can at least execute those two things in parallel. There's a bunch of other ones that you need to use as well for address generation, loading, storing and so on.

In this particular case, they get allocated to their appropriate ports, we figure out what the value is that comes in from the load, we figure out that the register now has value 2A. The increment is now ready to execute because its dependency has been met and so that does the update, and then once that's there, the write then flows out at the end.

At this point, when it goes out the door, we're back to in-order again. Lots of stuff happened, we may have made lots of mistakes, we may have knackered our cache up, and we can see that externally through Spectre and Meltdown, but between the μops going into the backend and the μops coming out of it or the writes going out to it, we've been out of order through that whole time.


How do we know what's going on inside to be able to change things? I'm sure you're all aware with perf. Perf is a general Linux performance tool that can interrogate counters that are kept inside the backend and the frontend of the core itself. There's tools like record which will allow you to trace the execution of a program; annotate and report to tell you what the results actually mean from the binary file; and a stat for being able to interrogate performance counters themselves.

Perf record, when you run it, will take an application, will then generate a perf data file for all the results. You can then schlep that file off to another machine for analysis if you want to or you can do data processing on the host depending on what you're doing.

When you do recordings in perf, it will capture the stack trace at each point, and the stack trace is then going to be where you are so you can amalgamate it and get the results back again like profilers that you're used to at the high end will do.


However, at this level, the profiler is subject to what's called skid. Skid is you say, "I want to start recording here," but actually the process is still doing things and by the time it notices that you want to do a change, it's actually got to back up and then say, "We're here now, so I guess that's what you meant."

There are some precision flags, the ‘:p’s that you can add to it for more precise values and they will add more and more overhead but will get you more and more closer to the precise value of what's happened. Generally speaking, if you're trying to identify bottlenecks in your system, then you'll be able to get a rough overview, first of all, by running perf without it, and then as you hone down and narrow down where the issues are, you may then want to start turning up the precision to get exact values of it.


When you have core branches, what will happen is the perf program will try and figure out what the backtrace is by walking back the stack. If you've got code that's dealing with the branch pointer inside there, it will be able to follow that back from the stack. Quite a lot of programs are compiled without the branch pointer support because back in the 32-bit days, we didn't have many registers so it was useful to be able to use the frame pointer for something else.

These days, probably less of a reason not to have it in there, but if you're dealing with a server with debug symbols it can use the dwarf format to be able to figure out how to walk the stack back again. If you’ve ever seen things about incomplete stack traces it's probably because you don't have the buffer pointers in there and you don't have the debug symbols to be able to track it back. However, for running code, the perf command supports something called the call-graph. What the call-graph option will do is say which flavor of backtracing do you want to use.

LBR is Intel's Last Branch Record. What happens is as the Intel processor is jumping around, it records where it's been. If you take a snapshot of that every few call stacks then you can build up a complete picture of where you've gone even if you don't have the debugging symbols or the frame pointers inside there. That will give you accurate values. There's also something on even newer Intel processors called Intel Processor Trace. Intel Processor Trace does the same thing but with a much lower overhead. There's a couple of Linux weekly articles down at the bottom which you can find when you look at the slides later.

Processor Stats

As well as perf record, there's also perf stat. Perf stat will give you an idea of how much work your program is actually doing, and it will read the counters from the processor to be able to tell you that. It'll tell you how many instructions there are, how many branches you've taken. In this case, the branch misses are about 5% branch misses, so it means the branch predictor is working about 95% of the time which is nice. Those are the kinds of things which you could inspect.

One thing to look for is the instructions per cycle or IPC. The IPC says how efficiently you're doing work. You can have a program that's running at 100% CPU and has an IPC of 0.5 that will run dog slow and you can have the same program running at 100% CPU and get an IPC of 2.5 and be running five times faster. 100% times five is 100%. It's a great speed up if you can find it.

Depending on which processor you're running, the IPC is going to be in a region below one which means that we have potentially other issues that we need to investigate or something closer to four which is the maximum type number that you'll see out of these systems. I think IceLake might start to reach into the five IPC range, but larger numbers are better in that particular case; and less than one means that you probably need to look into find out what's going on. These things read from what are called performance counters.

Performance Counters

Performance counters are model-specific registers that each CPU has, and as it goes through and executes instructions it will increment this counter. You get some for free: the number of branches; branch misses; the number of instructions executed, and so on. There are programmatic ones as well. If you want to count how many TLB cache misses you're doing, how many page walks you're doing, how many executions you're running on port five on the processor itself, you can actually configure these counters to be able to give you answers to those values.

They've got names – perf -list will tell you what they all are. If you don't know what the name is but you know what the code is, you can put in something random and that will tell you something as well. Of course, I don't know what that is. I picked it from somewhere but I can't remember what it is. That's because usually when you're doing analysis you need to have a process to follow.

Top-down Microarchitecture Analysis Method

There's something called a Top-down Microarchitecture Analysis Method or TMAM which was created by Ahmed Yasin. What it says is, "Let's take the performance of our application and figure out where the bottlenecks are.” In the big diagram I had before with the frontend and the backend, that's the separation that we have. Are we failing to serve the number of μops on the frontend, are we blocked from the backend of doing some processing? Are we retiring them? That doesn't mean going off to live somewhere in the country, that means actually going to work. Or is it bad speculation because we've simply no idea where we're going?

Each one of these has got various different perf counters that you can enable to be able to say where you are. Now, again, this is great if you've got it and it mentions it in the Intel software optimization manual. What it really boils down to is saying: have we ever allocated a μop? If we have allocated a μop, does it retire? If it retires then great, that's useful work done.. If not, it falls down to the bad speculation route. If the μop isn't allocated is that because we've stalled on the frontend or on the backend?

It all sounds very complicated but as long as you can spell top-down, great, because perf does it for you. If you run perf -topdown it will run system-wide and it will give you an overview in which buckets your system is running in. In this case, we're not timing sleep, we're just using that as a holding point for being able to specify what's going on. In this particular case, I've got a six-core system running on a single socket. (It's probably three cores and three hyper threads.) The retiring rate is somewhere between 15% and 35%.

In other words, we've got between three and four times that we could optimize this to go faster. If this retiring is like 100% job done, it's beer o'clock. Most of the time you'll find this is a smaller number and you can optimize this number by figuring out why your program's running slower, and then taking steps to be able to optimize it. In this particular case, it's highlighting in red that we may be frontend or backend bound across a whole bunch of processes.

To find out something more useful than that, (this will give you a snapshot of your system), you really want to find what your process is doing.


Andi Kleen has written something called Toplev. The Toplev tool will go through the process and give you ideas of where you can start looking. When you first run it, it downloads some configuration files from Intel's website, for the process that you're running out so that you know what performance counters it has available to it.

If you're deploying this on an air gapped server, you can download these configuration files ahead of time and then deploy them with it. There's documentation that shows you how to do that.

Essentially, this is a very fancy frontend to perf. That “cpu/event=0x3c,umask=0x0/” I don't know what that is, but Toplev does and that will then be able to generate a perf command. There's an option that you can use to even say what perf command would you run so that you can then copy and paste it and run it somewhere else. This will do the work for you.

The other nice thing Toplev will do is if you have a repeatable workload, you're repeatedly running a particular process then you can use this in no-multiplex mode. We talked about multiplexing before: if you've got two symbols and you don't have the counters for them, it will record one and then it will record the other, then it will record one and then it'll record the other, and then essentially just double the count afterwards. In no-multiplex mode, it runs the entire program counting this counter, then it runs again counting the other counter. That will give you a much more accurate review of what's happening in your system, assuming that it's repeatable.

You can then run it with -l1. It will show you the same thing as perf top-down did but for that particular process only. Here, we're creating a 16-megabyte random data file, and inside there we're looking at a single-threaded no-multiplexing level one for base64 encoding this just because base64 is something pretty much everyone has and you can play around with this example at home afterwards.

When we run it, it will do some calculations – and one of the things it will tell us is that "By the way, did you know this program is backend bound?" In other words, we're getting the stuff in the memory, we're doing the μop allocations, the branch predictor's not failing dismally, we're passing it to the backend, and then the backend isn't going through it as quickly as we would like.

If you run it through -l2 it will say that it's core bound ,and if you run it again it will say it's to do with ports utilization. What this means is that we're generating a whole bunch of instructions, they are running on one of the ports inside that diagram that I showed you several slides ago, but because there aren't any other ports available to do that, it's all being bottlenecked on this port. For example, if we were doing a bunch of divisions, we would expect everything just to be running on port one because that's the only one that's doing integer division. Different processors will give you different performance primarily because Intel keep adding execution units.

Cascade Lake and Skylake are pretty much the same thing. (One of them is just a slightly optimized way of laying down the silicon.) Ice Lake is then adding new ports inside there and new functionality. You'd expect this program to run faster on Ice Lake simply because they've changed the internals of the CPU.

There are, of course, different options that you can come out from there. You have to drill further to find out where the process is happening. If it's frontend bound then you need to look into the memory and be able to pull things out.

If it's backend bound, maybe there's something you can do with your algorithm. I'm not really proposing to jump in and fix base64 live on stage, but my guess is that it's doing a bunch of multiplies and those multiplies are being executed on a particular port. If we found another way of doing it then perhaps it would be slightly faster if we did it that way. One of those ways is vectorization and I'll mention that a little bit later.

Code Layout

One of the other things that can affect performance of your program is actually the code layout itself. If you've got a program and you've got some ‘if’ logic such as: if an error has occurred, if a NullPointerException has occurred or whatever, there's two different ways of laying down that code in memory. You can either have an error test that then jumps to the good case afterwards, and otherwise effectively just follow through into the error case in which you jump something else. Or you can do it the other way around where the normal case follows through and the bad case jumps down.

As far as the branch predictor is concerned, as long as you don't error out it's always going to guess the right branch then it's always going to do the right thing. However, if you've got cache lines involved in this and you've loaded the instructions in one particular cache line or several cache lines it's going to be good if we've already loaded the code that we want to be able to run. If you could punt your error code so that it lives somewhere else instead of running inside, you will get better performance just because the memory subsystem is going to have less work to do.

One way of doing this is to use a __builtin_expect function, and this is used in the Linux kernel. They've got macros of likely and unlikely to be able to use __builtin_expect. If you say __builtin_expect(error,1), in other words, we expect that's the default case of doing it, the code will be laid out like that. Back in the old days, this used to emit an instruction to tell the branch predictor which way to go. (That hasn't been used since my hair wasn't gray.) If you wanted to specify the good case you can then specify __builtin_expect no and it will come in and lay it out like that instead.

If you're using profile-guided optimization, this is the thing the profile will learn for you. If you've got a CI pipeline that's doing performance tests, gathering profile information, and then applying that, this is already job done – it is doing things in this particular case. If you aren't, this could be something that will enable your good path code or hot path code to run slightly faster.

Loop Alignment

The other thing, and I mentioned it a little bit before, was something called loop stream decoder. In fact, about an hour ago, I saw a tweet going past of someone saying, "Why does this run faster on a particular alignment?" and there's a thread on Reddit that they pointed to. Intel has got something called the loop stream detector, so when you get to the end of a loop, it'll jump back to the top and then go around the next loop. That's basically what all loops like in code. The only difference between these two loops is that it just so happens that they are in different places of memory and that's all.

If you've got a loop and it starts on a 32-byte boundary, then the Intel loop stream decoder will serve the instructions from the μop cache that we talked about earlier. The μop cache can hold something like 1500 μops. As long as your loop isn't too big, if you're serving things straight from the μop cache, then you will be executing that loop faster without having to do any extra work yourself. If it's slightly less aligned, then you'll go through the ordinary process which is to read the instructions, convert them into μops, dispatch those μops and so on, and it will be a little bit slower. Over large amounts of data or large number of times processing it, it can make a difference.

There is a flag in LLVM that you can specify to say, "Ok, for the targets of where we're jumping to, align all of them to 32-byte boundaries because, hey, you're then going to not have this problem." The align-all-function says, whenever you are creating a function, make the function start on a 32-byte boundary. The align-all-nofallthru-blocks is just a very complicated way of saying all of the branch directions to the start of the loop need to be a on 32-byte boundary. There was another option which was align-all-blocks and that says throw everything on a 32-byte boundary and that's overkill – because it's only the ones where you're not worried about it. It's only the ones where you need to jump to the start of the loop where they need to be 32-byte aligned.

This is probably not going to help you;iIt may be able to help you and it's useful to know about, but it's almost certainly not something that's going to give you for free. I believe that the JIT will automatically align functions on a 32-byte boundary as well for this particular reason.

The important thing of mentioning this as a code layout perspective is because recompiling your code can give different performance profiles. There was a blog post by Denis Bakhvalov, where he was talking about having a function and just by adding another function in the code completely uncalled, but just the presence of that function affected the alignment and then saw a drop in performance. When you're doing performance measurement, you need to be aware that just recompiling your code can shuffle the code layout which can give different performance profiles to deal with things like this. In that particular case, these finds are useful because then it will at least give you consistent performance as far as the loop stream decoder is concerned.

Facebook have written a tool called BOLT which I think stands for the Binary Optimization Layout Transformer. What it will do is it will load in your program code, parse it, essentially reconstitute the basic blocks that it started with, and then defrag them. (Anyone remember defragging disks? That's maybe a last millennium thing.) Basically, what it will do is it will run your code, figure out where the hotspots are. The same thing that a profile-guided optimizer will tell you as well. Figure out where those hotspots are and then reorder all of the blocks so that all of the hotspots live in a smaller amount of memory. This doesn't change the instructions themselves, they're not running any faster, it's just saying all of the hot code where before maybe it sat in, say, 10 pages of memory now fits into one or two pages of memory. Therefore, you get much better utilization of the cache and the pages that are bought in.

There's an arXiv paper that they published and there's a repository on the bottom there which you can follow from the slides.


Another thing that I mentioned is the parser, SIMD parser for processing vectorization. This is something that Daniel Lemire's written and a few other people as well, being able to do a vectorized processing of parsing JSON files.

Typically, when you parse JSON it ends up being in a switch loop. Is this character an open brace? If it is, then go down this particular path because we're doing an object, if it's a quote, we're doing a string, and so on.

It turns out that instead of reading things a byte at a time, you read can 64 bytes at a time. Then, you can get much better throughput for being able to parse your JSON object and generating the object as a result. He and a few others worked on this approach for generating the JSON parser and compared it with the known existing best ones in the open-source. They saw a speedup of around two to three times depending on which benchmark it was that they were using.

There's a couple of reasons why it's faster. One is that instead of processing one character at a time, we're processing four characters at a time. The second thing is that they've got a mechanism for avoiding branches. One of the things that vector operations will give you, is a way of doing masking operations then being able to combine things together because, as we talked about previously, the branch predictor's job guessing whether you went down a branch or not is made significantly easier when there isn't a branch to go down. This particular case was branch-free and, therefore, was a lot faster for doing processing.


We've talked about a lot of different ways that we can do analysis of programs and some of the techniques and some of the ideas that you can use to get programs faster. If you're dealing with memory, you want to try and use cache-aligned memory or cache-aware data structures so that things are based on 64-byte chunks because that's realistically what's happening under the covers or page size chunks. Data structures like B-Trees which have historically been used a lot in databases and on disk formats are something which work relatively well for the way memory is laid out, and so you consider that too.

Compress data to make sure that you can compress and decompress data on the fly inside the processor because the cost of schlepping the memory around is usually where the performance problems lie.

If you can, avoid random memory access. Random memory access is the thing that will kill memory lookup performance at run time.

Have a look at whether huge pages can help. Certainly, you can get 10% or 15% speedup on some types of applications simply by enabling huge pages. Databases tend to be very sensitive to this and you should follow those instructions to say whether or not this is a good thing or a bad thing.

Configure how you source your memory by looking at the libnuma and the other controls that allow you to specify where your processes are run. The more local you can get your data to the program, the better it's going to be.

As far as the CPU is concerned, each CPU is its own network data center. In the same way that you've been thinking about distributed processing systems, being able to move data, and summarize them across multiple network machines, think of that as what's happening inside your server as well. If you can get it so that your data is closer, then it will be processed much faster than if the data has to come from somewhere else.

In particular, the branch speculation and memory cache misses are pretty costly. Earlier on there was someone who mentioned that when you're thinking of complexity of programs, don't think of it as the number of instructions that you run, but how long you can run without a cache miss and really count complexity of algorithms in the number of cache misses that you hit.

Look at branch-free and lock-free algorithms. (We didn't really talk about lock-free because when you have locks, you generally don't have something that's CPU intensive because it's locked waiting on something else.) If you can use lock-free and branch-free programs then you're going to get better performance out of it because you're not going to be contending and the branch predictor is not going to get in your way.

Then, use performance counters with the help of perf or Toplev to be able to indicate exactly where your program's slowing down. Is it the pulling in memory, is it the cache misses that we're seeing, is it the fact that we're waiting for this particular dispatch port?

Use vectorization where you can. Most compilers will give you auto-vectorization for free with things like loop unrolling to be able to give you the best possible performance, but it may be the case that dropping out into your own vectorized code or vectorized assembly makes sense for particular types of operations. Things like the JVM will use vector code for doing object array copy, for example, when it knows it can just copy whole swathes of data with vector instructions instead of as a byte by byte loop.

I've compiled all of the references that I used into the presentation on this slide, so if you don't want to download the slide and be able to use it then you can take a photo of this and you'll be able to find it from there.

I've also created a section of links to other people's blogs and other items that you might be able to use as well including some of the ones which I've mentioned within this presentation. Those are good things for following up.

My links to presentations are down here at the bottom. That's my blog at the top and my Twitter feed where I shall post a link to the slide. If you're impatient you can go to Speaker Deck now. I've already published it and I'll send out the link as soon as I finish speaking. There's some other links to things like my GitHub repo and other narrated videos that I have done in the past.


See more presentations with transcripts


Recorded at:

Mar 25, 2020