
Comparison of Performance of Multiple CPU Architectures


Summary

Matthew Singer and Jeff Balk discuss similarities and differences among multiple high-performing CPU architectures.

Bio

Matthew Singer is a Senior Staff Hardware Engineer @twitter. Jeff Balk is a Senior Hardware Engineer @twitter.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Singer: I'm Matt Singer. My pronouns are he and him. I'm a senior staff hardware engineer at Twitter.

Balk: I'm Jeff Balk. My pronouns are he, him. I'm a senior hardware engineer at Twitter.

 

Performance and Architecture Can Refer to Different Things

Singer: Let's kick things off by talking about how performance and architecture can refer to different things. One of the things this can refer to is the core performance or the core architecture. Really, this is about whether or not each core or thread you're running meets performance expectations. Another dimension is the CPU or socket architecture: in each package from each one of these vendors, do you get enough cores, enough memory bandwidth, and enough I/O bandwidth for your workload? Then you can go another level on top of this and talk about the system level, where you can arrange multiple packages from each server vendor and make even bigger servers. There you have to think about whether there are going to be issues from the fact that you're now dealing with non-uniform memory. Lastly, when you're talking about building your own fleet, what's the performance per watt? This becomes a lot more important as your server numbers grow. Whenever you purchase these servers, you're going to be able to fit a certain number of them into certain power footprints, and you're going to be paying for the electricity to run them, so it can become important. One thing I want to point out about performance in terms of this presentation is that we're not going to address machine learning workloads. We wouldn't have time, and a lot of machine learning workloads are a subspecialty unto themselves.

We're going to talk about the three different publicly available architectures from AMD, Intel, and Arm. We're going to discuss some notable differences in performance. In this case, we used SPECjbb 2015 and OpenJDK 17 to gauge both throughput and latency response. We chose SPECjbb because we use it internally often to model against our other workloads that run on the JDK, which are predominantly written in either Scala or Java. We used OpenJDK 17 because it's the most recent long-term support release.

SPECjbb 2015 Quick Start

One of the reasons we really find SPECjbb useful is that it produces two different metrics that characterize the system differently. The first is called max jOps. This is great at measuring the maximum amount of work we can get done on a server. Then it produces a second metric called critical jOps, where it monitors the workload against a latency SLA. It actually takes a geometric mean across several different latency SLAs, but it gives you a good idea of how well a server can respond when you expect it to respond within a certain amount of time. Let's talk a little bit about how we test. The heap size can have a pretty big impact on the scores you get from SPECjbb, so we keep a consistent heap size for every thread, core, or vCPU. There are also some changes in the way the SPECjbb workload behaves if you have different numbers of groups, so we prefer to only compare runs that have the same number of groups. We use the same system tuning on all the runs, with one worker thread for every vCPU. When we break the heap up into smaller portions using higher group counts, we enable compressed references for any heaps under 128 gigs.
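As a rough sketch of what that per-group tuning looks like in practice (the heap value is illustrative, assuming an 8 vCPU backend at roughly 3.2 gigs of heap per vCPU, a figure we come back to later; these are not our exact command lines):

# Illustrative JVM launch for one SPECjbb backend running on 8 vCPUs:
# fixed heap sized per vCPU, compressed references on because the heap is under 128 GB
java -Xms26g -Xmx26g -XX:+UseCompressedOops -jar specjbb2015.jar ...

The SPECjbb group and backend arguments are elided here; the heap sizing and compressed-reference flags are the point.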

The full config has been added at the end of our presentation. You're welcome to look at it and compare and contrast with what you see done in other public runs. I would like to mention that because there are so many ways to configure SPECjbb, it's hard to compare runs that are configured differently from one another. We discourage you from doing a simple apples-to-apples comparison against other numbers you may find published.

Bare Metal - One Socket Comparison

First, we're going to talk about four different bare metal servers. We wanted to do this comparison because it lets us talk broadly about differences in core counts available out in the market. It also lets us talk a little bit about performance per watt, which we can't do on any cloud instances. We're going to be looking at a high core count CPU from AMD, and another high core count CPU from Ampere Computing, which is an Arm architecture core. We're also going to be looking at two smaller core count parts: another Arm CPU from Ampere, and an Intel Xeon processor with an Ice Lake-based core. This first, very simple test run is a one group config in SPECjbb, where we see some results that probably aren't very surprising. The highest core count part, from AMD, ranked top in both of the SPECjbb metrics. The best comparison with that AMD part, the EPYC 7713, is against the Altra Max 128-core part. That part performs well, not at the same level, but it's a very strong contender from the Arm marketplace.

Then the next two parts on the graph are similar core counts, although the Ampere part does have eight more cores than Intel's part. We see excellent performance here as well. I'd like to point out that the Intel 8352V part that we're using isn't near the top-shelf part available from Intel; it is the part we use in Twitter's fleet. There are differences in core counts and differences in speeds, so I want to look at the performance per thread rather than just the overall CPU performance. Here we see a really different ordering of how the different manufacturers' parts behaved. The Ampere 80-core part, on a per thread basis, was the highest on all these metrics versus the other parts. The next dimension I like to talk about, because we maintain our own fleet, is the performance per watt. When we look at performance per watt, we see a different ordering again. We see the AMD EPYC part ranked high on both metrics in terms of performance per watt. The Ampere parts are interesting, in that we have two parts that deliver nearly the same performance per watt, but vary wildly in performance per thread. Ampere made a decision to create one part designed to excel at high thread count workloads that don't need a lot of shared caching, whereas the smaller core count part behaves better on this benchmark, because this benchmark has a lot of shared cache needs. The Intel part is unfortunately at the lower end of performance per watt here. There are other parts in Intel's SKU stack that we think would behave a little differently, so we'd encourage you to look at some other results.
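One way to read those derived metrics (our per-watt numbers use the CPU's rated TDP, as we mention later; the figures below are purely hypothetical, not results from our runs):

perf per thread = max jOps / vCPU count, e.g. 1,000,000 max jOps / 128 threads ≈ 7,800 jOps per thread
perf per watt = max jOps / rated TDP, e.g. 1,000,000 max jOps / 225 W ≈ 4,400 jOps per watt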

GC Times, AMD and Altra Max

Another metric that's pretty key to Twitter's analysis of a part is garbage collection times in Java. It's really important for us to make sure these components minimize garbage collection pauses, because that's something our users would see if applications pause for a long time rather than serving traffic. We see something here with the AMD part, which is the green line that we'll see a couple of times throughout the presentation: a lot more jitter in the GC pause times versus some of the other competitors. One of the unique things we saw on the Altra Max 128-core part is, near the end of the test run, quite a large spike in the garbage collection times. We're currently attempting to root cause this. The benchmark continues to run and serve traffic during these extended pause times. We didn't see this in any of the other group configs on this part, so it's pretty unique to the situation, and it doesn't seem to be anything intrinsic to the architecture.

Bare Metal Summary

In terms of bare metal, I think there are a few things to consider. Remember that, with the exception of the Ampere M128-30 that we tested, the other CPUs weren't the top end CPUs, and we didn't benchmark 2-socket AMD or Ampere systems. We do benchmark some 2-socket Intel systems later in the presentation, but not here, because we wanted to talk only about one group test results. We're going to talk more about some of the higher bin stack CPUs, because we know that's what's being used in some of the cloud instances. I'll point out, notably, that Intel's Xeon 8380 would have had more cores and would run at a higher clock frequency, and AMD's EPYC 7763 would have the same number of cores but run at a higher clock rate, as would Ampere's Q80-33 part. For a broader list of bare metal configurations, AnandTech recently published an article comparing a lot of these CPUs, including the three I just mentioned.

Small (64 - 80 vCPU) Instances - Configuration Comparison

Balk: Let's take a look at the configuration comparison. This is the list of instances on the cloud, as well as a couple on bare metal, which we're using to compare against the cloud. We have a range of instances: Ice Lake on Amazon; Cascade Lake CPUs on Google, which is Intel's prior generation, because on Google the newer instances are still in a preview state; and AMD Rome on Google. On Amazon, we also have their new Arm Graviton2 processor. For bare metal comparison, I wanted to highlight that we also have the Ampere Altra, which is an Arm architecture part.

Here's a set of results for 64 thread instances. This data is all taken with two groups. You can see the maximum jOps, which is in blue. For maximum jOps, you can see that the best performance was seen with the Amazon m6g.16xlarge, which is their Graviton2 instance. In terms of critical jOps, that instance also performed very well, but the highest was the Amazon m6i, which is Intel. For 80 thread instances, we're now bringing the bare metal Ampere Altra into the picture. What we can see here is that, in terms of maximum jOps, Google's n2d-standard-80 instance performed highest, which is an AMD EPYC instance. In terms of critical jOps, Google's n2-standard-80 instance performed highest, and this is an Intel Cascade Lake instance.

Here's a plot of garbage collection times. The x-axis is time along the duration of the SPECjbb test, and the y-axis is milliseconds for each GC event. You can see the pattern that Matt mentioned earlier, where, in dark green, the Google AMD EPYC CPU shows GC jitter; otherwise, GC times are relatively stable and close together.

Large (128 - 144 vCPU) Instances - Configuration Comparison

Next up is a comparison of large instances, between 128 and 144 cores or threads. Here's a quick comparison of the configs that were tested. We have the Ampere Altra Max on bare metal and the AMD EPYC on bare metal. Then, in the cloud, we have a Google AMD Rome instance and an Amazon m6i, which is Ice Lake. In our comparison of the 128 and 144 thread results, we can see that the Amazon m6i, which is an Intel Ice Lake CPU, showed both the highest maximum jOps and critical jOps. We also saw another tier of performance that was equivalent between the AMD EPYC and the Google n2d, which is also AMD. Here's a plot of garbage collection for these 128 and 144 core instances. You can see again, in dark green and light green, that we continue to see the garbage collection jitter with the AMD EPYC instances, both on bare metal and in the cloud.

Here's a summary of some selected instances with their per thread performance. You can see that the best maximum jOps and critical jOps per thread performance is the Intel m6i on Amazon. You can see that behind it is the Amazon Graviton2, the Arm processor.

Architectural Cache Differences - Case Study (AMD EPYC)

Singer: I think one of the biggest architectural performance differences that we've seen on these products has to do with the various cache architectures in use. We're going to highlight that by doing some testing with Java compressed references. One of the things to understand about compressed references is that Java will attempt to use 32-bit references on 64-bit systems in order to get better cache utilization by shrinking those pointers. The expansion of that pointer can be done pretty quickly, although obviously there's going to be some work. Overall, if the heap is small enough that you don't run into problems with object alignment causing things to space too far apart, you'll get a benefit from it. The first scores I'd like to draw your attention to are the lower two groups of bars on this chart.

This is an 8 group test with compressed references enabled, and a test with compressed references, also sometimes called compressed oops, turned off. We see a 5% to 6% reduction in performance by turning off compressed oops. This is a good time for me to note that we changed the benchmark configuration at this point to an 8 group test, because we wanted to shrink the heaps down to the point where compressed references make sense. That's why we've shifted to 8 groups. In this particular case, the EPYC 7713, as well as, I think, all of the other 64-core Milan parts, has 8 separate L3 caches, due to the way the chip is assembled in a multi-chip package. Each one of these L3 caches is connected to the others through the I/O die, but it is relatively expensive to look something up through one of those links on one of the other dies. If we take advantage of that, take these 8 backends, leave the system in an NPS1, or one NUMA node per socket, config, but pin each one of the backends into one of the core complexes that has its own L3 cache, then with no other effort we get a 10% to 12% boost in performance over not doing that. When we compare whether or not compressed references make a difference now that we've really optimized the cache usage, that's what you'll see in the top two comparisons. Having optimized the cache usage through this pinning, we now see a very small, 1% or so, difference in the maximum throughput. We see a similar 4% to 6% drop in critical jOps performance, and that's pretty consistent across all of the platforms and architectures we've tested.
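A minimal sketch of that pinning, assuming a 64-core Milan part where each 8-core core complex shares one L3 cache; the core numbering and heap sizes here are illustrative and the SPECjbb arguments are elided, so check your own system's cache topology before reusing it:

# Run one SPECjbb backend per L3 cache domain, leaving the system in NPS1.
# Verify which cores (and SMT siblings) share an L3 via lscpu or
# /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list.
taskset -c 0-7,64-71   java -Xms26g -Xmx26g -XX:+UseCompressedOops -jar specjbb2015.jar ...
taskset -c 8-15,72-79  java -Xms26g -Xmx26g -XX:+UseCompressedOops -jar specjbb2015.jar ...
# ...and so on for the remaining six backends, one per core complex.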

Case Study: Arm

Let's compare that to the Arm parts we've tested. We wanted two similarly sized components here, so we're comparing the behavior of the Altra Q80 to the Amazon m6g instance. On both of these, when we toggle the compressed references on or off, we again see a 4% to 6% difference in max jOps and critical jOps. Both of these parts have a smaller portion of L3 cache per thread, which is why we think we see this drop on both of them pretty consistently.

Case Study: Intel Ice Lake

If we pivot towards Ice Lake, Intel's cache architecture is unified into one L3 cache per die, like we had on Arm, but the cache is a lot larger per core and per thread. What we see here is that this caching architecture is really not very sensitive at all to having these compressed references when you look at maximum throughput, whether we ran it on bare metal with the 8352V or on Amazon's m6i at a similar thread count. We actually get a little bit of a performance bump by turning them off on the Intel processor, which was a distinct difference. This probably means that Intel's caching architecture really didn't get the benefit of having the compressed pointers or references; rather, it was more work to have to expand those compressed references, which is why, when we turned it off and no longer had to perform that expansion, we got a little bit of a performance boost. Like on all of the other architectures, we see a 5% to 6% decrease in critical jOps performance by turning off compressed references.

Wrap-up

We think it's a really exciting time in the industry, because there are three high-performing competitive architectures available. Understanding some of these differences will help you fit your application properly to the right part or the right instance, wherever you run it.

Questions and Answers

Printezis: What was the thing that surprised you the most out of these numbers?

Balk: Certainly the garbage collection spike we saw with the Altra Max processor.

Singer: I think overall, the thing that surprised me the most is how different the Arm architecture cores are today versus in the not too distant past. The gap is closing between all the vendors right now, which I think is a great competitive motivator for everybody.

Printezis: I agree. If I may say what surprised me most as well: I was always under the impression that Arm architecture was great for power consumption, but maybe was not quite on par, throughput-wise, with Intel and AMD, and they seem to be basically very similar now. It's great times we live in.

Compressed references: typically, the object alignment in the JVM is 8 bytes, so 32 gigs is the limit for compressed references. You mentioned 128 gigs, so I assume you also increased the object alignment. Do you still see benefits with that?

Singer: We do see benefit. We stopped trying any compressed references at anything bigger than 128 gigs. Yes, we did see a benefit even at 128 gigs.

Printezis: Any reason why you picked OpenJDK 17? Was it just you thought, might as well get the latest and greatest?

Singer: It really was about making sure that the Arm support was as mature as possible. We would assume that OpenJDK 11 might be a little bit under-optimized for that. At least with this benchmark, using several different JDK versions while we were getting ready to do the testing, we didn't see any major changes in performance here.

Printezis: Can you show details on the compressed reference object alignment?

Singer: The logic we used is at the very end of the slides. If we go beyond 32 gigs, we bumped the alignment to 16 bytes. If we go beyond 64 gigs, we bumped the alignment to 32 bytes.
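For reference, here's how that logic maps onto JVM flags as we understand it; this is a sketch of the sizing rule, not the exact command lines from the runs, and the heap sizes are only examples:

# up to 32 GB heap: default 8-byte object alignment is enough for compressed references
java -Xmx30g -XX:+UseCompressedOops ...
# up to 64 GB heap: bump object alignment to 16 bytes
java -Xmx60g -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=16 ...
# up to 128 GB heap: bump object alignment to 32 bytes
java -Xmx120g -XX:+UseCompressedOops -XX:ObjectAlignmentInBytes=32 ...
# beyond 128 GB: compressed references are left off

With 32-bit references, the addressable heap is 2^32 times the object alignment, which is where the 32, 64, and 128 gig thresholds come from.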

Printezis: Any comparisons with Azure cloud architecture, and do you have any plans to do so?

Singer: We don't have any plans to do that. We chose the cloud architectures typically because we already had access to those at Twitter. It made the most sense.

Printezis: What heap sizes were used for the various groups?

Singer: We used 3.2 gigs of heap for every thread or vCPU we tested. On a 32 thread backend, we would have a slightly over 100 gig heap.

Printezis: Have you tested on smaller machines? We're using 16-core machines from GCP, much smaller than the small one.

Singer: We haven't tested that.

Printezis: No interest for our purposes at Twitter.

Singer: We tend to utilize bigger instances and multi-tenant them.

Printezis: Which architecture are you using now after this comparison?

Singer: I can't speak to anything other than what you would find publicly, looking on Google.

Printezis: Have you measured or are you planning to also include energy consumption?

Singer: We could obviously only talk about energy consumption on those first four systems we looked at, which were bare metal. In that particular comparison, we only used the CPU's rated TDP to generate the performance per watt numbers. We were testing a single socket on all of them. They would all have a similar memory population and similar drive population, and they're all in a relatively small TDP range, so we think that the added power for the entire system, on top of the CPU, would be quite similar. I don't think we would find anything really different if we were to look at the entire system power consumption. Again, there are so many variables in the way a system can be designed; I don't think we wanted to get into ensuring that the systems we were using were absolutely the most power or thermally optimized.

Printezis: Can you share the raw power for the platform, just like you share max jOps and DRAM? I think you also mentioned how big the memory that you were using was.

Singer: I can share that. The SoC power is in the introductory slide, where we compare the configurations in the bare metal section. Most of these systems would have been loaded with approximately 512 gigs of DRAM per CPU. In the case of the Altra Max system, it was loaded with a terabyte of DRAM. Even in that case, we don't use all of that DRAM during the comparison; we limit the heap size, so, like I mentioned, 3.2 gigs of heap per thread.

Printezis: What is the specified power, what was measured when running the benchmark?

Balk: We didn't measure power during the benchmark run. At least, I'm not clear on whether that might affect the critical and max jOps, so we don't have that.

Printezis: Again, for the longest time, we had Intel. Now we have Intel, AMD, and several Arm options. What do you think is going to happen in the future? I'm not asking specific details about what specific companies are going to do, but do you think we're going to have even more architectures, are we going to have only those three? Any predictions?

Singer: I think the other architecture we see coming is RISC-V. I think it would be quite a bit into the future before we'd see that developed. One of the most important things is the software ecosystem around it. Arm cores for the data center have been there for a while, but it's really only now that I think we're seeing the software ecosystem come to par with the hardware ecosystem. I think that could take such a long time for another architecture.

Printezis: Of course, if you're writing against the JVM, the JVM will do most of that for you.

You can't say much about price considerations. I assume it just varies widely, how you're going to get it, what deals you're going to get.

Singer: Everybody is negotiating a different price with the vendors they're working with. I certainly wouldn't want to be quoting somebody else's pricing.

When we were doing all of our experimentation and monitoring what was happening, this particular workload doesn't really hit floating point math or any vector instructions, and we really don't see any evidence that the part would be throttling below what the vendor had rated it for, for all-core turbo frequency and so on. If you start getting into vector math, there's a good chance that you'll have some differences in power and you'll drop your core frequency, but we don't see that here.

 


 

Recorded at:

Sep 29, 2022
