
“Quantum” Performance Effects: beyond the Core



Sergey Kuksenko talks about how (and how much) CPU microarchitecture details can influence application performance. Can it be visible to end users? How do you avoid misjudgment when estimating code performance? A CPU is huge (in complexity, not in size), which is why the talk is limited to the parts located outside of the computational core (mostly caches and memory access).


Sergey Kuksenko works as a Java Performance Engineer at Oracle. His primary goal is making the Oracle JVM faster, digging into the JVM runtime, JIT compilers, class libraries, etc. His favorite area is the interaction of Java with modern hardware, which he has been working on since 2005, when he was at Intel in the Apache Harmony performance team.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

[Please note that this presentation contains strong language]


My name is Sergey Kuksenko. I am from Oracle, and I'm working on Java performance. Today I want to tell you a little bit, not about Java, but about an idea which is quite important from my point of view: how performance depends on how we deal with hardware, on what happens inside our CPU, inside the microarchitecture of that CPU. A lot of people forget about that. We are thinking about classes, clouds, microservices, and neglecting microarchitecture details led to such things as Spectre and Meltdown, for example, not such a long time ago.

I won't talk about the cores of the CPU; that's another area. I will talk a little bit about what is on the same CPU chip beyond the cores: mostly how we work with memory, and how it affects the performance of our applications.

Safe Harbor Statement

That is my favorite slide, definitely favorite. I didn't invent it, but there should be a prize for the guys who invented it. That slide means that I can say whatever I want, and nothing will happen to me. So, I've been working on Java performance for a long time at Oracle, and in different situations.

System under Test

I will show you a lot of examples, real practical examples. Most of them are taken from real applications, from real use cases. I anonymized them and cleaned up all the other stuff, and all the performance numbers which I will show you were measured on that particular laptop, that one. It's a Haswell, quite an old CPU, more than three years old, but it's enough. For most examples it doesn't matter which Java version I'm using — I'm from Oracle and Java, so everything will be done in Java — and where it matters, I will specify whether I am using Java 8 or Java 11. So a couple of questions for the audience. Who is writing in Java here? Cool. I need to collect statistics about Java usage. Who is using Java 11? No brave men, except for one? Okay, [Java] 10? [Java] 9? Eight? Seven? Below? Okay, I will not go below Java 8. All the examples which I will show, you can take from that address and play with them. They also require JMH, our micro-benchmark harness, which was created for making different kinds of micro-benchmarks.

Demo 1: How to Copy 2 MB

The first task. I'm just a simple engineer sitting in some company, doing some experiments and playing. I did a very simple test: I just copy two megabytes of data from one place to another. That is the code. And I want to check which works faster — the classical, typical Java arraycopy, or my super-optimized reversecopy, which does the copying backwards. Which of these methods works faster, would you guess? Any ideas?

Participant: Second one?

Kuksenko: Why? Let's vote: who thinks that arraycopy is faster? And who thinks that reversecopy is faster? Let's get the numbers. Okay, it's average time in microseconds. What do developers typically do in that situation, when they see those numbers? They draw conclusions. And the only conclusion you could extract from that example is that I know how to make a faster copy than some Oracle engineers. Honestly, it's a wrong conclusion. It's not true. Let's move further. Let's take this situation: you share that code with your team, "Guys, I know how to make data copying faster." Here are my results, that's what I got: arraycopy is more than two times slower than reversecopy. Bob, on some Mac Pro machine, got comparable results, and much faster. Another person from the team checked it: Alice said she doesn't see that difference either, but she's using JDK 11. And at some moment, you make a slight modification to your code: you copy not two megabytes of data, but two megabytes minus 32 bytes, just a bit less, and you get those numbers. Now the situation is completely different: reversecopy is two and a half times slower than arraycopy. What's happening here?
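For reference, the two copy variants being compared look roughly like this — a sketch with illustrative names and a correctness check, not the speaker's exact JMH benchmark:

```java
// Sketch of the two copy variants from Demo 1. SIZE and method names
// are illustrative reconstructions from the talk.
public class CopyDemo {
    static final int SIZE = 2 * 1024 * 1024; // 2 MB

    // The "classical, typical" variant: an intrinsified library call.
    static void arrayCopy(byte[] src, byte[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    // The "super optimized" variant: a plain loop copying backwards.
    static void reverseCopy(byte[] src, byte[] dst) {
        for (int i = src.length - 1; i >= 0; i--) {
            dst[i] = src[i];
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[SIZE];
        byte[] dst = new byte[SIZE];
        java.util.Arrays.fill(src, (byte) 7);
        reverseCopy(src, dst);
        System.out.println(java.util.Arrays.equals(src, dst)); // prints "true"
    }
}
```

Both variants produce identical results; the whole point of the demo is that their speed difference comes from the hardware, not the code's logic.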

We needed some investigation, and we found one thing: macOS doesn't support large pages at all, and a lot of these effects happen due to large pages, which on Ubuntu are turned on by default via transparent huge pages. That describes the difference with Bob. The difference with Alice: she already moved to JDK 11, and you are still using JDK 8. In JDK 8 the default garbage collector is ParallelOld; starting from JDK 9, the default garbage collector is G1, and G1 explains that difference. If you execute with G1 turned on, you don't see such a huge difference in the numbers. What do you do at that point? You draw conclusions: that large pages are garbage, and G1 is a very, very cool GC. But those are also wrong conclusions. I will return to the explanation of this example at the end of the presentation.

Demo 2: How much data?

Let's check the second example. You did some refactoring of your code. You had a class with some functionality; at some moment, all the real action, all the functionality used in that code, was moved to other places, and only this code remained. What do you do here? It's obvious that the class should be eliminated, and you just use plain byte arrays. That's all — a very simple, obvious refactoring. And one frequent operation remains: we have several chunks of data, and we want to count how much data we have in total. In this example I have 256 arrays of data, quite large, of different sizes, and I want to count the total — this is how I counted the amount of my data before. After refactoring, I don't need that class; I just use byte arrays, and the counting is the same. So which version works faster?
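The before/after shapes of this refactoring can be sketched like this (illustrative names; the real class had more functionality before it was hollowed out):

```java
// Sketch of the before/after shapes in Demo 2.
public class CountDemo {
    // Before refactoring: a wrapper class holding the bytes.
    static class Chunk {
        final byte[] data;
        Chunk(byte[] data) { this.data = data; }
        int size() { return data.length; }
    }

    static long totalBefore(Chunk[] chunks) {
        long total = 0;
        for (Chunk c : chunks) total += c.size();
        return total;
    }

    // After refactoring: plain byte arrays, one level of indirection fewer.
    static long totalAfter(byte[][] arrays) {
        long total = 0;
        for (byte[] a : arrays) total += a.length;
        return total;
    }
}
```

Both loops compute the same total; intuitively the second should be at least as fast, which is what makes the measured slowdown so surprising.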

Participant: [inaudible 00:08:07]

Kuksenko: Why more? Where am I allocating more memory? I eliminated one level of indirection; my access should be faster, I think. Explain.

Participant: [inaudible 00:08:32].

Kuksenko: I'm allocating the same amount of data, similar to the other case — we can talk about that after the presentation. So, I measure this performance on Java 8. It's quite strange: the array of bytes looks slower. From my previous experiments I know that G1 is a very cool GC, so I turn on G1, and I get that result. Also, I remember that I had some issues with large pages, so I turn off large pages on my laptop, and I get that result. Not so awful, but again, the straightforward access to the byte arrays, with one level of indirection fewer, gives worse numbers. What do people do when they get to that place? They draw conclusions. And the normal conclusion in that situation is that large pages are awful, G1 is awful, Oracle is awful, etc. Again, it's a wrong conclusion. I will talk about this example a little bit later — but I promise, no more delayed examples.

Caches Everywhere

What am I talking about here? I'm here to talk about how the CPU accesses memory. The problem is that access to memory is very, very slow, and the guys who produce hardware invented a lot of stuff for us, like caches. This is a typical picture of any kind of CPU; they're all similar. We have the L1 cache, which is small and quite fast; the L2 cache, which is a little bit larger; the L3 cache, our last level cache, which is quite large and shared between several cores; and then access to memory. There are some systems where an L4 cache exists, but they're still quite exotic.

And a few numbers about this particular CPU's caches, which will be useful for us: L1 is 32 kilobytes, 8-way associative, and an access costs four or five cycles; L2 and L3 are larger and slower. The size of a cache line is 64 bytes.

Demo 3: Memory Access Cost

What do we think about when we think about access to memory? The cost of memory access. I did a very simple benchmark to show it: I walk through a linked list — quite a large one — varying the size of the data set, and measure the cost of each memory access. What can you see here? It's a chart, and you can see that caches work pretty well. The red lines are the boundaries between the L1 cache, L2 cache, and L3 cache, and it works really nicely as long as we fit into our caches.
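The pointer-chasing idea behind this kind of benchmark can be sketched as follows (an illustrative reconstruction; node contents and sizes are assumptions, and real timing would need JMH):

```java
// A minimal pointer-chasing walk in the spirit of Demo 3: nodes are
// linked in shuffled order so the hardware cannot predict the next address.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PointerChase {
    static class Node {
        Node next;
        long payload; // padding/payload to control node size
    }

    // Builds a cycle of `count` nodes visited in shuffled (random) order.
    static Node buildShuffledCycle(int count, long seed) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < count; i++) nodes.add(new Node());
        Collections.shuffle(nodes, new Random(seed));
        for (int i = 0; i < count; i++) {
            nodes.get(i).next = nodes.get((i + 1) % count);
        }
        return nodes.get(0);
    }

    // Walks `steps` links; each hop is a dependent load, so its latency is
    // dominated by which cache level (or memory) the node happens to sit in.
    static Node walk(Node start, int steps) {
        Node n = start;
        for (int i = 0; i < steps; i++) n = n.next;
        return n;
    }
}
```

Timing `walk` for growing node counts reproduces the staircase chart: the latency per hop jumps each time the working set outgrows L1, L2, and then L3.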

Who knows what hardware prefetching is? The key idea is that there is some magical block — a large part of our CPUs — which knows that access to memory is very slow, and that it may make sense to load data earlier, before we actually need it, to hide the load latency somewhere and provide better speed. Let's check, to understand how prefetching could work.

Different Mix

This is the same chart I showed you before — access to random areas of memory — but I changed it to a [inaudible 00:12:40] scale to show the smaller sizes. After that, I try some kinds of regular access to check whether prefetching helps. If I stride by four lines — by line, I mean cache line, 64 bytes — I'd rather say there is no effect. If I stride by three lines, there is a large difference: the prefetchers start working and helping, and compared with random access to memory it's more than two and a half times faster. A two-line stride is four times faster, and if I do sequential access, it's really up to 12 times faster. That's good. So the key idea is: you have only two ways to improve the performance of your memory access. Either you shrink your working data set — eat less memory — or you avoid random access and eat your memory sequentially: don't run between different tables with different dishes. Do it in order, and it will work faster.
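The strided walks being compared can be sketched like this (illustrative; a real measurement would time each stride with JMH):

```java
// Strided reads over a large array, as in the "Different Mix" chart:
// a 64-element (one cache line per byte[] stride) pattern defeats the
// prefetchers, while stride 1 lets them stream ahead.
public class StrideDemo {
    static long readWithStride(byte[] data, int stride) {
        long sum = 0;
        // Visit every element exactly once, but in strided order:
        // one pass per residue class modulo `stride`.
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum; // same total either way; only the access order differs
    }
}
```

The point of keeping the total identical is that any timing difference between `stride = 1` and larger strides is purely a memory-access effect.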

Demo 4: To Split or Not to Split?

Let's consider the following example. Who knows what Unsafe is in Java? It's a hidden class, already almost eliminated from Java, used to do some tricky things. And this example is quite tricky: I'm just reading and writing a long value at different offsets in memory. I used a 0 offset at 4K alignment as a baseline, and here are the numbers, in nanoseconds, for the cost of reading and writing my long value. So what did I get here? I got two things, which are called page split and line split — the basic effects of working with misaligned data. Line split is the problem when you're trying to load data which is not located inside one cache line, but spans two subsequent cache lines. And it costs: the cost may not be large, but it may be noticeable. Page split is another issue, when you hit the boundary of a page of memory, and it can really cost you many cycles.
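Demo 4 uses Unsafe; the same kind of misaligned long access can be shown with a plain ByteBuffer (a stand-in for illustration, not the speaker's exact code):

```java
// Misaligned long reads/writes via ByteBuffer's absolute accessors.
// Offsets 57..63 within a 64-byte line straddle two cache lines (line
// split); the same thing at a 4 KB boundary is a page split.
import java.nio.ByteBuffer;

public class SplitDemo {
    static long readLongAt(ByteBuffer buf, int offset) {
        return buf.getLong(offset); // absolute read, any byte offset
    }

    static void writeLongAt(ByteBuffer buf, int offset, long value) {
        buf.putLong(offset, value); // absolute write, any byte offset
    }
}
```

Sweeping `offset` from 0 to 63 and timing each access is what produces the spikes at the line and page boundaries described above.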


The key question here is why we are talking about this. Java doesn't have misaligned data. And it's true, Java doesn't have misaligned data, but Java has misaligned operations. Because we have Unsafe — a lot of people are using it for memory libraries, etc. — or its new replacement API, VarHandles. We may get misaligned access when working with buffers, byte buffers or any kind of buffers. And also, all models of CPU have a lot of vector instructions, like SSE and AVX, which obviously become wider and wider with each new CPU model. And it would not be a good idea for us not to use those instructions, because they provide new capabilities. Of course Java and HotSpot use them very heavily, and there are two sources of that: intrinsified functions, such as System.arraycopy, Arrays.fill, String.indexOf, which have very tuned code generation for each particular platform; and in recent years HotSpot has achieved very good results in automatic vectorization — you just write a plain loop, and the code generated for that loop is vectorized with those SIMD instructions.


Let's consider this kind of example: a simple usage of the Arrays.fill operation. There are probably few people here writing in Java who never use Arrays.fill, who don't use arrays at all. With a little trick, to isolate the effect, I aligned the array to a page boundary and just played with different offsets to check how it works. What did I get? I got a curve, and I see the score differs. At offsets -32 and 0 it's the perfect case; everything there is fine. And at the other spikes, I got problems with page-split and line-split situations, so it costs us. It's a situation where you don't think about memory layout, you don't think about misaligned access, but you have it in your application. And the obvious question here is why we should care about SIMD operations at all, if they give us misaligned access. Maybe we should avoid misaligned access. Okay, it's possible to turn off all SIMD code generation inside HotSpot, and we can do it. In that situation, you get a pretty straight line without any spikes, but it's slower overall. That is why we have to use SIMD instructions, even with these kinds of performance effects.

Demo 5: Matrix Transpose

Let's consider the next example. Who knows what this code is doing? It's just a simple matrix transpose. Who uses such code in production? Who uses matrices in production? It's a key question I get whenever I show this kind of example: who cares about matrices? We are enterprise engineers, etc. But I'd rather say machine learning is growing nowadays, and there is a lot of matrix multiplication in machine learning. So, a simple example: I'm doing a matrix transpose with this code, taking different sizes of matrix — I'm not showing the size right now — and measuring it. Let's try to guess, starting from the largest matrix. Here is one number, here is another. Let's guess what should be in the N+1 place — I'm not asking for the exact number, just a range of values. And a second question: could you guess what N is? We show N, N+1, N+2, N+3 around a power of 2. Which one? Who said N+3? Yes: arrays in Java have a header, and the header of a double array is 16 bytes — two doubles. That is why we have the spike at 254. Without the header, we'd get that spike on a power-of-2 boundary.
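The straightforward transpose under discussion is essentially this (an illustrative sketch):

```java
// The simple matrix transpose from Demo 5.
public class TransposeDemo {
    static double[][] transpose(double[][] m) {
        int n = m.length;
        double[][] t = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // Reads run along a row (sequential); writes jump a whole
                // row apart, so for sizes near a power of two the writes
                // keep landing in the same cache sets.
                t[j][i] = m[i][j];
            }
        }
        return t;
    }
}
```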

Let's start thinking about that. What are we doing when performing a matrix transpose? We take sequentially allocated elements of memory and move them to different memory locations. First of all, we take some data from one cache line and put it into a different cache line, but that may be okay. The real problem is called cache associativity. Who knows what cache associativity is? Not zero. Let me describe it shortly. For example, we have the L1 cache, 32 kilobytes. We put memory into that cache, but in reality we can't put our data from memory into just any place in that cache. Our L1 cache is 8-way associative; in reality, that means there are only eight places in that cache where a given cache line may be placed. That's all. Our L3 on this machine is 12-way associative: there are 12 places where the data for a particular memory address can be placed, no more. Of course, there are such things as fully associative caches, but they're very, very expensive from a hardware point of view; that is why they are not feasible to make on a CPU. There has to be some kind of balance.

Cache Associativity

How it works, in a little more detail: the address completely determines in which cache set our cache line resides. If we look at our physical address, the lowest 6 bits are just the byte offset inside the cache line, and they don't matter. In reality, in memory, there is no such thing as bytes: when we operate with memory, the minimum amount of data we read or write is a full cache line; one-byte, two-byte, four-byte accesses are just simulated on top of the cache line. For the L1 cache, we take bits 6 to 12 of the address and use them to identify into which set of 8 entries we put our data. For the last level cache, the index is a little bit larger.
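The set-index selection just described can be written out for this L1 cache (32 KB, 8-way, 64-byte lines, hence 64 sets; the exact bit range is my derivation from those numbers, not code from the talk):

```java
// Picking the L1 set index out of an address: the low 6 bits are the
// offset within the 64-byte line, the next 6 bits select one of the
// 32 KB / 8 ways / 64 bytes = 64 sets.
public class SetIndex {
    static int l1SetIndex(long addr) {
        return (int) ((addr >> 6) & 0x3F); // 6 index bits for 64 sets
    }
}
```

Note that addresses 4096 bytes apart produce the same index (e.g. `l1SetIndex(0) == l1SetIndex(4096)`), which is exactly the critical-stride effect discussed next.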

And there is such a thing as the critical stride. If pieces of your data are allocated at a distance equal to the critical stride, which you can calculate using that formula, they will hit the same index set. So if you need to read 20 or so different cache lines whose addresses differ by 4K, you don't have enough places inside your cache for them — we have only eight places in the L1 cache for that data. If you need to read them 20 times, you have to drop out previous data, put the newly read data into the cache, and read from memory constantly, and that describes the problem. In the case of the critical stride, the size of the data is 254 elements plus header, so we hit a difference of 4K when doing the transpose, and for the line on the right side of the slide there is simply not enough room in the cache for that data. And this is the explanation of example number two — how much data.
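The critical-stride formula mentioned above works out like this for the cache parameters quoted in the talk:

```java
// Critical stride = cache size / associativity: addresses that differ
// by exactly this amount map to the same cache set.
public class CriticalStride {
    static int criticalStride(int cacheSizeBytes, int ways) {
        return cacheSizeBytes / ways;
    }

    public static void main(String[] args) {
        // L1: 32 KB, 8-way -> 4096-byte critical stride
        System.out.println(criticalStride(32 * 1024, 8)); // prints "4096"
    }
}
```

So with only 8 ways, any ninth cache line whose address is a multiple of 4 KB away from the others must evict one of them.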

Why did we get such huge, very bad performance with G1? All our arrays were quite large, and in G1 they were aligned to a one-megabyte boundary. And if they were aligned to a one-megabyte boundary, they were also aligned for 256K, 32K, and 4K. It means that in this example we had problems with associativity in all the caches: L1, L2, and L3. We don't have enough places for all the data, we have to read it from memory constantly, and that is why we got 30,000 nanoseconds for such a simple operation.

In the case of ParallelOld, the memory layout is different, but there are still some problems. We don't have problems with L2 and L3 here, but we do have problems with L1. Some of the allocations hit the same index set; here I emphasized with the red numbers that sometimes I have 10 arrays aligned at the same critical stride, and 10 is too many: we have only 8 places in the cache for that data.

Demo 6: Walking Threads

There is an interesting effect. I have a multithreaded application with two similar threads. The two threads are just working and reading data from memory, doing absolutely the same job. The only difference is that the data for each thread is independent: each thread has its own data. Theoretically, they shouldn't have any influence on each other, and it looks okay in some situations. If I use a very small amount of data, like 128 kilobytes per thread, it's quite okay: both threads work at the same speed.

I increase that; it's not five nanoseconds as before — my data doesn't fit into L2 any more, it hits L3. But if I increase the amount further, I get a very interesting effect: one thread works much faster than the other. That happens when we don't have enough last level cache, as here: the last level cache on this machine is three megabytes, and I have four megabytes of combined data. What's happening? When caches load new data, they choose which previous data to evict, using something similar to least-recently-used. It's not a true, honest LRU — it's a pseudo-LRU — but it works similarly. In that situation, one thread works a little bit faster than the other. If that thread works a little bit faster, it passes a little bit more data through itself; if it passes a little bit more data through itself, it passes a little bit more data through the last level cache, which is shared between the two cores.

If that thread has more data in the cache, it works faster; the other thread starves: it doesn't have enough data in the cache, it has to read data from memory, and it works slowly. And because it works slowly, its data is marked as recently used less and less, and the imbalance grows. So the thread that works a little bit faster becomes faster and faster, just because it has more data in the last level cache, and the situation becomes worse and worse. That effect is not infinite: experimentally, it shows up when the amount of used data is between the size of the last level cache and about two and a half times that size. Beyond that, the cache is not enough for either thread, both constantly load from memory, and everything becomes equally bad for all of them.

Demo 7: Bytes Histogram

Here is an example which should be taken into account when you are working with memory performance. Quite simple code: we just want to calculate the frequency of bytes — how frequently each byte value occurs in our data. It's called a finite-state [inaudible 00:31:10] task, a very common task for archivers and compression. For this example, the random byte array is 16K in size, and the average time is shown in microseconds. And here is the question: what if the data is not random? If there are some bytes, like space or the letter A, which occur more frequently than others, what will I get in terms of performance? Here is a chart where the percentage of duplicates increases, and you can see that the performance of the application becomes worse and worse (it's on the image there) the more repeated data we have.
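The histogram loop in question is essentially this (an illustrative reconstruction):

```java
// The byte-frequency histogram from Demo 7: each input byte increments
// one of 256 counters.
public class Histogram {
    static int[] countBytes(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            // Repeated byte values mean back-to-back read-modify-write of
            // the same counter slot, which is where store forwarding bites.
            counts[b & 0xFF]++;
        }
        return counts;
    }
}
```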

The problem here is caused by the store buffer. What is a store buffer, and why do we need it? Here is the CPU, and our L1 cache. The distance between the L1 cache and the CPU is four or five clocks, four-five CPU cycles. That's quite a lot for a modern CPU. For loads, we rarely can do anything about it: typically, if we're loading something from the L1 cache, we need that data right now; we have to put it into a register to operate on it, so we have to wait until the data reaches the CPU. With writes to memory — store operations — the situation is different. Typically, we write something to memory and don't need it right away, which means the CPU may continue executing the following instructions: out-of-order execution, etc. We don't want the CPU to be delayed on that store operation, so we need a separate unit which will do it for us. It's called the store buffer. A store operation enters the store buffer and travels through it for those four-five clocks until it reaches the L1 cache.

Store Buffer

And the CPU, in the meantime, can execute other instructions; that's fine. What happens in this situation: we perform a store to address A. I'm not showing values here — who cares about values? The key thing here is addresses. If I load value B, there is a hardware block that checks that there isn't any pending write to address B, and I can execute that load B without any dependence on store A. But what if I load from the same address A? In this situation, we detect that a write to address A exists in the store buffer. What can you do? There are only two options. First, we could wait until our value reaches the L1 cache, and then read it from the cache — but that's a delay, it's expensive. Or there is a block called store forwarding: an operation which doesn't wait until the value reaches the cache; it takes the value from the store buffer and returns it to that subsequent load. Fortunately, store forwarding is faster than reading from the L1 cache, but it's not zero cost. It costs something. And in our chart we got that spike when we increased duplicates because, with a lot of duplicates, we have a lot of back-to-back operations on the same address: we read and write the same address, and that costs.

Just as an example of how you could avoid it: I made versions using two tables, collecting our histogram — our frequencies — into two tables, and then four tables. And here are the charts of what I achieved. If I collect into different tables, I don't get so many collisions between reads and writes to the same address when data repeats frequently. So it's also a very common effect from [inaudible 00:36:12]
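The multi-table trick can be sketched like this, here with two tables (illustrative; the talk also tried four):

```java
// Alternate between independent counter tables so consecutive increments
// rarely hit the same slot, then merge at the end.
public class SplitHistogram {
    static int[] countBytes2(byte[] data) {
        int[] t0 = new int[256];
        int[] t1 = new int[256];
        for (int i = 0; i < data.length; i++) {
            // Even positions go to one table, odd positions to the other,
            // so a run of identical bytes touches two different slots.
            if ((i & 1) == 0) t0[data[i] & 0xFF]++;
            else              t1[data[i] & 0xFF]++;
        }
        for (int k = 0; k < 256; k++) t0[k] += t1[k];
        return t0;
    }
}
```

The merged result is identical to the single-table histogram; only the pattern of same-address read-after-write is changed.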

Demo 8: Bytes < = > Int

And maybe the last example: how to split an integer into four bytes, and how to assemble an integer from four bytes. How would you do that? [inaudible 00:36:34] Not a lot of people: if they need to perform that task, they will maybe do some hacker tricks — shifts, masks, etc. But if you go to Stack Overflow — I checked — there are approximately 250 questions in the Java section of Stack Overflow about how to split an int into bytes, and vice versa. And 85% of the answers are quite simple: don't worry about all those shifts, just use ByteBuffers. Okay, it's quite simple code. Let's use it the same way: don't worry about shifts, let's use a ByteBuffer, writing to memory and reading from memory. Nobody wants to know how ByteBuffer works. What do you think is faster: to split an integer into bytes, or to gather an integer from bytes? Or do they have equal speed? It turns out that splitting an integer into pieces is faster than collecting an integer from pieces. It's a typical situation in life: to split something, to disassemble something, is much easier than to assemble it again. Why?
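The ByteBuffer-style split and gather look roughly like this (an illustrative sketch of the Stack Overflow-style answer, not the speaker's exact benchmark):

```java
// Split an int into bytes and gather it back via a ByteBuffer.
import java.nio.ByteBuffer;

public class IntBytes {
    static byte[] split(int value) {
        ByteBuffer buf = ByteBuffer.allocate(4);
        buf.putInt(0, value);                 // one 4-byte store...
        byte[] out = new byte[4];
        for (int i = 0; i < 4; i++) {
            out[i] = buf.get(i);              // ...then four 1-byte loads:
        }                                     // store forwarding can serve each one
        return out;
    }

    static int gather(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.allocate(4);
        for (int i = 0; i < 4; i++) {
            buf.put(i, bytes[i]);             // four 1-byte stores...
        }
        return buf.getInt(0);                 // ...then one 4-byte load:
    }                                         // no hardware gathers from 4 stores
}
```

The comments mark the asymmetry explained next: the split's loads each fit inside the earlier store, while the gather's load overlaps four separate stores.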

Here is the first situation. We have an int — one store operation in our store buffer — and four byte loads covering those four bytes. That's fine: in this situation, store forwarding works, because we can simply extract each byte from the stored data. And here is the reverse situation: we have four different byte stores, and theoretically the hardware could gather our int from those different bytes, but no hardware does that. It's called a store forwarding fail: because there is a conflict on addresses — we are loading from the same memory location we are writing to — we have to wait until the data reaches the L1 cache. That explains the difference. And the rules for store forwarding failure are quite simple: it fails when you read something that overlaps, or doesn't fit inside, the previously stored data.

Back to arraycopy

And let's go back to the first example. What happens with our copying of two megabytes of data? The first thing I want to do is check the assembly code — maybe the code generation is very different. I did it; I got the assembly, and the two versions are only a little bit different. What can I see here? First of all, who knows what VMOVDQU is? It's an AVX instruction which moves data between AVX registers and memory; this particular operation works with 32 bytes. So in both cases — arraycopy, where [inaudible 00:40:20] assembly is generated for my particular platform, and reversecopy, obtained from a plain for loop — I got automatic vectorization: vectorized code which moves data in 32-byte pieces from one place to another. In each iteration I do load, store, load, store, and [inaudible 00:40:47]. The reversed direction is the only difference, and that alone cannot explain the difference in performance of our examples.

What about Memory Layout?

The problem is the memory layout. Here is an example of the memory layout I got using the ParallelOld GC and the G1 GC. As you remember, all the problems I got here were in the case of ParallelOld; G1 provided [inaudible 00:41:12] performance for both methods, as should be expected. The real problem is that we allocated 2 megabytes of memory, plus a 16-byte header: that is the difference between the addresses of our data when we are copying. And we are hitting a well-known problem called 4K-aliasing. It happens when we read and write data whose addresses differ by 4K (or a multiple of it). The problem is that, dealing with the store buffer, we have to check that a load operation doesn't conflict with a previous store operation, and the hardware uses only the 12 lower bits of the address to prove that there is or isn't a conflict. If the lower 12 bits of the addresses are the same, it's treated as a conflict; in that case, we have to wait until the store passes through the whole store buffer, and that causes a delay. So here are our operations, a trace of the execution: load 32 bytes from array A, store them to array B; load the next 32 bytes, store the next 32 bytes. We know that array B, in this case, is array A plus 2 megabytes, so you could write the addresses like here, or just take the lowest 12 bits.
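The low-12-bits conflict check described above is easy to demonstrate with a bit of arithmetic (illustrative addresses, not real ones):

```java
// 4K-aliasing: the hardware compares only the low 12 bits (the 4 KB
// page offset) of load and store addresses, so addresses exactly 2 MB
// apart look like a conflict even though they never truly overlap.
public class Aliasing {
    static boolean mayAlias(long addrA, long addrB) {
        return (addrA & 0xFFF) == (addrB & 0xFFF); // low 12 bits match?
    }

    public static void main(String[] args) {
        long a = 0x10000L;
        long b = a + 2 * 1024 * 1024;       // 2 MB apart -> false conflict
        System.out.println(Aliasing.mayAlias(a, b)); // prints "true"
    }
}
```

This is why the extra 16-byte header matters so much: it shifts the low 12 bits of one address, turning the false conflict on or off.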

And here we see that there are conflicts — not between the load and store of the same iteration, but between the store of the previous iteration and the load of the next iteration. Their address ranges intersect: we operate with 32 bytes at a time, and that causes a 4K-aliasing conflict. We always have to wait until our data reaches the cache, and that causes the performance problems of arraycopy. And that is why, when I decreased my data by just 32 bytes, I got a different situation: now arraycopy works faster, while reversecopy is still stuck with 4K-aliasing, with a very large difference.

So here is copying 1 kilobyte of data in different situations. Here everything is fine, it's the normal case; here we have misaligned access; and here we have 4K-aliasing — and there is a difference between those numbers. That's not all: the situation can be much worse. One axis is different alignments of the source, the other is different alignments of the destination, and, for example, on the diagonal it's 4K-aliasing, on the sides it's page split, cache line split, and all the other performance issues. And there is another issue which I don't know how to explain, because I didn't find any explanation on the hardware vendor's pages. It happens only when large pages are turned on: when the difference between addresses equals 1 megabyte, copying my data becomes slow. I honestly don't know what it is.

So just from this first example, you get a lot of alignment-related issues that influence performance: cache line splits, page splits, 4K-aliasing, and all those things. The only thing I want to say is: if you want to work with memory, measure memory performance, and solve these problems, you have to know what is beneath your software, inside your hardware. That's all from me.

Questions and Answers

Participant 1: In the case of store forwarding, is it actually slower than fetching from L1?

Kuksenko: No, it's faster than fetching from L1: if there is no conflict, the result is simply bypassed from the store buffer without delay. The cost of an access to L1 is typically five clocks; the cost of a successful store forward is two clocks.

Participant 1: So in the case that you just wrote something and it's in the store buffer, then wouldn't it still be better than if you hadn't written it recently?

Kuksenko: No. The effect of store forwarding shows up on the load side. If you load from L1, you pay the cost of an L1 load. If you hit in the store buffer, the forwarding itself [inaudible 00:46:48] is faster, it's two clocks. But your load can't be performed before that store. An independent load could be executed out of order, before the store. In this situation, no: you have to wait until that store is retired, plus two clocks.
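The distinction in this answer is between a load that depends on a pending store and one that does not. A minimal sketch of the two shapes (the class and method names are mine, and the latency numbers from the talk apply to one Intel generation; this code only illustrates the dependency pattern, not the timing):

```java
public class StoreForwarding {

    static final int[] buf = new int[64];

    // Dependent pattern: the load reads the slot just written, so it
    // cannot complete before the store. Best case it is forwarded from
    // the store buffer (~2 clocks per the talk); worst case, e.g. on a
    // 4K-alias false conflict, it waits for the store to retire.
    static int storeThenLoadSame(int v) {
        buf[0] = v;
        return buf[0];
    }

    // Independent pattern: the load targets a different slot, so an
    // out-of-order CPU is free to execute it before the store.
    static int storeThenLoadOther(int v) {
        buf[0] = v;
        return buf[1];
    }

    public static void main(String[] args) {
        buf[1] = 7;
        System.out.println(storeThenLoadSame(42));   // 42
        System.out.println(storeThenLoadOther(42));  // 7
    }
}
```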

Participant 2: I'm trying to understand the practical implications of it. A lot of times as engineers, we don't have full control on where the software is going to be deployed, what the underlying system is going to be, and a lot of times when there are migrations, such as from the physical data center to the cloud, we don't always get to say what the underlying machine and hardware details are going to be. So in that case, what are the things that we can do to write software that will, on an average, perform better than what we would write otherwise, regardless of knowing the precise underlying hardware details?

Kuksenko: I'm not saying it should be used that way. I have done a lot of eliminating such things from user code: people writing their own version of reversecopy and the like. These come from corner cases; there are examples that in some corner case magically work faster than, for example, the standard Java library, and people believe that happens every time. My point is: don't reinvent the wheel. If you believe your code is faster, and you can prove it is definitely faster in all cases, just bring it to us, it's OpenJDK. But the much more typical situation is that such examples proliferate through applications; nobody remembers where they arose, and people still keep using them. So I am not showing a practical usage; I am telling you to avoid impractical usage. It's an explanation of issues that happened in real life, when people came to me asking, "What the fuck?" about such situations. That's all. Don't micro-optimize.

Participant 3: So what he did was replicate the problem on particular hardware. If you notice a problem in your cloud deployment, and you see that after migrating you're having more expensive usage of your environment, you would want to see what happened between the first deployment and the second. So you try your use cases. You may not go down to the microarchitecture-level detail just yet; you may find some other things at the top. But if you don't find anything, then maybe that's the way: try to replicate it on your own hardware.

Kuksenko: Just as general advice: don't attempt such deep performance optimizations until you know what happens beneath.

Participant 4: Does the JVM attempt to mitigate some of these issues? You know, detect some of the patterns with repeated arrays of various strides, insert extra padding, that sort of thing? Say, "Oh, here are two 512-byte arrays, or 4-kilobyte arrays, maybe stick in an extra 12 bytes?"

Kuksenko: At least we have found a very tricky code generation for arraycopy which is free from that. Not fully free, but it significantly mitigates the issue. And practice shows that the generic code is, in the average case, much faster than versions where we try to insert explicit checks of that alignment. What we want to do, and what we are trying to do, is to generate different code which may also suffer from the issue, but suffers less.
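The padding idea from the question can be sketched by hand. This is emphatically not what HotSpot does (the answer above says it relies on better code generation instead), and in Java the GC controls actual addresses, so the offset here only illustrates the concept; the helper names are mine:

```java
import java.util.Arrays;

public class PaddedCopy {

    // One cache line of ints: shifting the destination's data region by
    // this much changes every store address by 64 bytes relative to the
    // matching load, breaking an exact 4K-modulo coincidence.
    static final int PAD = 64 / Integer.BYTES;

    static int[] copyWithPad(int[] src) {
        int[] dst = new int[src.length + PAD];
        // write starting at index PAD instead of 0
        for (int i = 0; i < src.length; i++) dst[i + PAD] = src[i];
        // hand back only the logical contents
        return Arrays.copyOfRange(dst, PAD, PAD + src.length);
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        System.out.println(Arrays.toString(copyWithPad(src)));
        // prints [1, 2, 3, 4]
    }
}
```

The cost of this trick is exactly what the answer warns about: the extra checks and copies are paid on every call, while 4K-aliasing bites only at unlucky offsets, so generic code usually wins on average.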

Participant 3: But there are no patterns; it's not smart enough to have a pattern. Because some of these, as you can tell, depend on your load/store buffer size; others on cache line sizes. Intel and AMD do have a standard 64-byte cache line, but some other architectures don't have a standard cache line size, and sometimes it's more beneficial to go by their cache line. So then, again, it's the microarchitectural differences that come into play.

Participant 5: I know you talked, you said we weren't going to go into the cloud too much, but then you showed that cool L3 example about starving other threads. And do you have any tips on how to be a good neighbor or be a bad neighbor, I guess, in a multi-tenant environment, where I'm sharing a CPU with some other code, and how to play nicely with the L3 cache, or play poorly, I guess, if I want to have some fun?

Kuksenko: Unfortunately, I don't know anything about nice neighbors; I know a lot about bad neighbors. That was an example of symmetric neighbors: they do the same work and fight for the same resource. It won't save you. There is, for example, another situation, where you have a very memory-intensive application and a very compute-intensive thread. It looks like they don't compete for compute resources; they just work simultaneously. But there is a problem in that situation. When they run on the same machine and the memory-intensive application captures the whole last-level cache, the compute-intensive application should, in theory, compute without problems inside its own cache. But if the whole last-level cache is taken by the memory-intensive one, the cached code of the compute-intensive application will also be evicted, and the compute-intensive application begins to work slower because its code is no longer cached. There are a lot of such collisions. If you really want very high performance, and very high utilization of everything, don't have neighbors.




Recorded at:

Feb 16, 2019