Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Scaling up Performance Benchmarking

Scaling up Performance Benchmarking



Anil Kumar and Monica Beckwith share application architecture decisions, observations points, etc. which can be applied when architecting, deploying and analyzing real production applications. They also share how application level, as well as system level metrics, change for various application architecture decisions as well as high-level JVM analysis to correlate tail latencies.


Java Champion Monica Beckwith is a Java VM Performance Architect at Arm. Anil Kumar is Performance Architect for Scripting Languages Runtimes at Intel.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Kumar: Let me start with the first thing. Nowadays, these are very exciting times and in various discussions over last five years, we are seeing startups going from a very small number of users to very large [numbers of] users, and those are also fluctuating. Sometimes, they are very small and sometimes, they are large numbers. What is happening? There are many technologies in last over five to six years which have come in, and if you look at the base you have Intel Scalable Processors, and all of those are, there is a lot of caching, a lot of core and they automatically scale. Then, on it you need to scale these servers and you have these infrastructures like Kubernetes, you have VMs, you have elastic cloud. And add to it, if most of the folks here are from the Java Virtual Machine side, you have four giant thread pools which autoscale in them. And now over it, we start writing frameworks which scale. Then we start deployment options. Earlier it used to be one monolithic application which you need to handle, and the deployment time is very long and then you have microservices concept. You have the latest things which are going, Function as a Service or Serverless. And on the top of it, you're seeing autoscaling.

And I was just thinking over the last few weeks and other things, all these things are going to magically scale, work, and solve all the problems. It will result in this. I gave it little more thought and many discussions with other folks. What we are finding is in the beginning when you have a startup and other thinking, we are very focused on the functionality, we are very focused on security, we are very focused on just make the logic work, it is working. And then suddenly, the user demands come in and we say, "Oh, there's a cloud, let's just scale it up." What results is on the left side, you start scaling up something which is barely working. Where there are not very few users, the whole node support and the scale out is available, all those tools are available and they just scale out something which really is supporting very few users.

It's only after a few months when your CFO checks the bill you are paying, you realize, "Okay, what can we do here? We can't sustain it." And when you start looking into different pieces of it, that if you want improvement, you need to just work; like a symphony of the orchestra, they all need to work together, if you want to deliver. And that is where the magical moment comes in, that even if you can save 1%, you can actually save a lot of dollars. Particularly, just talk to anyone at Netflix, anyone. Even actually they will go to a fraction of a second to save on different things.

With that in mind, the agenda for today when we were just discussing with Monica [Beckwith] on this topic, we can't go over the Function as a Service, those are two different topics of it, not so much serverless. But microservices is all within our range; there are so many decisions we did. So even though we are not specifically going on to microservices, but the lessons could be applied. With regards to the rest of the talk we will share, first five minutes quickly about the background of the benchmark and the architecture, what it does. Then a bit into the telemetry part, how we are observing different things, the decision making. And when I showed all these things, the benchmark is so great, of course, the first reaction for Monica was, "I don't trust it, let me check it myself." She has been poking it with different data points and how the behavior is. So I think most of the hard part will go into sharing with you, when she poked around, how was the behavior? How was the telemetry or how the different decisions were made, is there tracking? So she would share that very interesting data.

And then at the end, the takeaway is what you can apply from these different decisions about architecture, re-scaling, scaling out to your application environment. And we would be glad in the question and answer, you want to work later. We are both part of the standard body which developed the benchmark, which changes things. We would be glad to take the input and work on the next one. We are actually working on the next one to look for.

Architecture: Modelling Back-End of E-Commerce Enterprise

First into the architecture and modelling. This is an e-commerce website backend which would be similar to many other cases where it is not doing metrics multiplication. It is doing very similar to banking or commerce or some other case, where a user request comes in, you have a supermarket, more like a Costco model where a user needs to log in first with their ID or so. So you have a supermarket which has inventory. You have suppliers. Supplier is only the interface, it is to order or resupply things, not the full mechanism there because that one needs to be handled at the supplier side. And then you have the headquarters which has the customer data about the ID and the payment information. It stores the receipts once the transaction is done and you could run the data analytics on that data in the headquarters. You can issue coupons and you can send some other information to the customer. So that is the kind of model it is exercising.

Now, when we thought about scale up within this benchmark, you can scale up within the supermarket, you could have much larger inventory. You could actually scale up the number of supermarkets itself. Then on the supplier side, you could scale up the interfaces for the supplier. The supplier is a bit lightweight because it is only dealing with the interface. The supply comes in, you make a request to the supplier, [it] comes in. So it's not managing any large inventory or so, it's the interface. On the other hand, the headquarters has a lot of data, you can have a bigger number of customers, you can have a lot of receipts collected there, you can run the data mining transactions on a much bigger number of receipt, etc. So there's a lot of scaling up within one instance itself you can check.

Now, let's talk about the scale out. Scale out is more similar to - if you're familiar with Costco or Walmart, you need to maintain your data within U.S region for U.S. If it is Canada, it’s a different country rule, you need to maintain a totally separate structure there. And if it is some other country, then you need to maintain that part too. And you need communication among them because you can just go from one region to other and you can still do shopping, it's not like you need to become member again.

So the idea is that types from the scale out are mostly independent, but there could be customers and suppliers which could go from one region to another. I'm able to go over from U.S to Canada, able to shop in the Costco, I'm able to return the item from Canada I bought there in the U.S store. So we try to do a very similar concept here, which does the communication among them because that is very important for a scale out. In very rare cases you are able to scale out something which is totally independent, where you're networking overhead does not increase. There can be cases, but most of the time, you have to do the extra orchestration, you have to do the extra synchronization and we wanted to simulate that part in the workload.

As now you're understanding what this e-commerce side workload is doing, how does it help you? How does it, in your case, match? We were debating that part. Over time we have seen that, as an application development or as your deployment, you have full control on your application. Most of the time you don't have control into the VMs, OSs, or the hardware part of it. Many times you deploy, the results are not good, you are on a 2:00 a.m. call, and is the problem on your side of the application, or is the problem in the infrastructure? And that is where it can help you to at least do this step one. It is a very well known workload, a benchmark published. You know what the expected numbers are, if your infrastructure is dealing with the Intel servers, or dealing with the ARMS-based servers, or dealing with the IBM, or SPARK. You have the approximate numbers. You could also, in your performance testing on that infrastructure, do these runs and have some approximate idea.

Now, before you run your application on that, you should try to give a benchmark and if you're not receiving the expected numbers, something is not right in the configuration on the infrastructure itself, before you start dealing in your application. So it is a very good tool to estimate just where the issue is. Is it underneath infrastructure or in your application? If this application stresses the compute memory and the CPU core scaling, all those, it addresses a lot of networking as you're seeing the transactions needed to be done. It has some logging, but it does not do very heavy on the storage side.

For that part you may have to add some storage components to test, because, if you are logging or abound, we have seen many times that, if you did not map your logging file correctly in the virtualization or other environment, you could have terrible performance. So, this workload will not be able to cover that part because it only does the logging and that is not in the critical part but, otherwise, you can have a pretty good confidence level about the rest of the infrastructure performance. And later you will see some scaling data.

Let me talk about what we are doing with regards to telemetry. I mentioned the telemetry is very important and the response times are very important, so what we are doing for that part? You will see here from the start to the end that it’s usually the customer SLA you want to see. And for that part, we have the end-to-end response time in this one. The problem is with just one end-to-end. How do you find where the problem is when the time increases? Just one end-to-end is not enough. We do only the exhibitor time, we do the little out, the queue which sends the response, as well as the queue which accepts the transaction coming in. We also do the request sending part which would be more like a load balancer. The request shedding your load balancer and you're sending it to the rest of the infrastructure equal in time.

You have to do the timestamp at several of these points. And why it is important? Because you want to run these things at a very low load number of users or load level, and you want to run it at a very high [load], and it's couple of points in the middle. What you will notice out of these different components is which one is suddenly increasing a lot. And that gives you the idea that, " I need to redesign this part of my architecture here because it is not scaling with the load." And we have many examples where things were perfectly fine up to 30%, 40% of the load, they fall apart as you stress more, and I have mentioned those a couple of examples.

This is a typical run of the benchmark; the part you're not seeing in the big first part here out of the graph is the 10 minutes. In the beginning, there is a phase where it is calibrating, it is determining the approximate high bound of the system. Then, once the high bound is determined, it takes 1% of that value and starts distressing the load. So the X-axis in this graph has the load. The load is increasing as you see, almost from the 0 to 160 injection rate per second, the operations. The Y-axis is the response time and response time is almost, you can see from one millisecond going to 10 seconds, it's a logarithmic graph, and why? Because you care most of the time for SLA between zero to one second. After that, you really don't care, you are in the zone where you don't want to be. So you really want to see how the time is increasing from zero to one. Actually, it's mostly 0 to 100 millisecond, that's why the logarithm it is so amazing. You could see the 1 millisecond to 10 millisecond and then 1 second.

Most of the time, you will see this part of the process in e-commerce and others. Actually, these pieces should respond within 100 millisecond nowadays. It used to be 300, 400 milliseconds, but now you're doing so much, the aggregate is two in your infrastructure that pretty much 100 millisecond in the upper bound. So you will notice here that the 10 millisecond or 100 millisecond range is only up to 20% to 30% of the load point. After that, on the right side where your system can take the maximum load in some situations, on the right side the red line is the maximum load, but the response time is so bad you don't want to be in that area.

On the other side, the SLA-type meeting is more on the yellow side, which is almost two thirds of the part, that is where you want to be. And Monica will share some data later on how you can make that yellow line go more towards right, where your SLA is at much higher throughput meeting. Because it's no good if your system is really good but your SLA point is barely at the low side. You can't use infrastructure in that case, in any meaningful way. So this workload lets you stress.

And the important point is because we are busy in developing the business logic, deploying and scale out, we actually never tested for these situations. How are my resources utilizing the graph? I've overlapped here how is the CPU utilization as I'm increasing the load. We have never been able to go in that deep to see, when I'm increasing the load and the CPU utilization or the resource utilization equivalent is increasing, if the load is also increasing. One example could be you increase the number of users - try to increase it. System is busy if you're only monitoring resource busy. It will look busy but not deliver you anything. So you really need to correlate your resource utilization, increasing the number of users and your SLA timings.

Now about this workload, coming back to what you can use it for. The hardware focus, we had for this workload when we designed it, is definitely standalone systems or blade servers, a whole rack of servers, because the benchmark scale out and you can use it on the whole server. It will also work when you have an SGI-type system where hundreds of CPUs are just an OS image, one large OS image. There are some applications you want in that case, because the whole memory and everything is showing as one image. The one area it does not work is when you want to have the offloading part to graphics all the FPGAs, etc. That one area does not work, because we don't have a compute area in this decimal transactions area. It is not the genomics or it is not your artificial intelligence whether you're doing metrics, multiplication, and other things. So it's not a suitable candidate for offloading GPU or that kind of workload.

I earlier mentioned regarding scale up that, when we were designing this benchmark, we had different thinking for scaling; it must scale, that was the requirement. And the different components. Even in your software you will notice the modules, thread pool, queues, data structure, and the communication happening among them with telemetry.

Scale up: Modules and Thread Pools

Let's look at the couple of lessons we found when we were working in this area. Scale up with regards to modularity. Modularity is, in this case, supermarkets. You have inventory, you're doing something, in your case you may be doing things, and then you have the headquarters which you're doing things. They're sharing data, the transaction is going through them. The benchmark has the capability to do them separately in separate JVMs. Now, you have to do the serialization and deserialization cost and you can see the impact of it on your response time when you separate out as well as in your footprint due to the data sharing part earlier.

When you talk within these modules you want to implement the full joint. I mentioned you have autoscaling in the full joint variety scales. In the beginning, we did the simple implementation of autoscale, and what we found after some load, was that the performance was terrible. It took a couple of days or so to figure it out. What was happening was, if you give it more load on the autoscaling part, then it keeps increasing number of threads. And on CPU when the threads are going, we started monitoring the threads, and 100,000 threads were being created because we kept giving it more load and your user requests were coming in. And even after 10,000 threads on 100 core system, due to context switch out, the performance will start tanking down. Once the system goes in undefined state, you have difficulty shutting down those threads, too. So it's not a very good way of autoscaling.

Another feature in the full joint thread pool is called bounding. You can have autoscaling up to a bounding value so that you can set your VM. If your VM is four core, make it no more than 100. If your VM is much bigger, you may want to have some idea about what kind of VM you are going to deploy and then you set the thread pool value. Even though, as performance engineers, we want to be totally away from the infrastructure, if you really want this thing to scale up, you have to have some idea about those things.

Scale up: Queues and Data Structures

A bit more on the queue and the data structure. Let's say we are in the e-commerce side. You have the two kind of user requests coming in, one just searching the product, another one has in the card and ready to check out. When there's two requests coming in, you definitely want to prioritize them separately. You don't want to put them in the same request because you don't want to make the payout part suffer due to so many users coming in or searching. So, in a similar way, when we designed the supermarket inventory and other things, we learned this lesson, that once you have thread pool doing the work, how do you create in that queue the mechanism so you can separate or make a different priority for these requests? And the way we have done them, is you can create certain tiers in the beginning part of the request when the search is done, you put them in the different tier or different thread pool to do the request. So that is why you need to design queues so you can handle the different variety.

And on the data structure part, also, you want to design your data structure, for example, a supermarket inventory. If I start attacking it with a lot of users, no one is able to do anything, because that would happen very often. So that is also what we [found is happening] with certain kind of concurrent structures, different kind of buffer arrays in JVM area, there are many. Only some of them are very good at scaling and handling high locking things. So there is some experience which we could share in white paper, I think.

Scale up: Scale up + Communication + Telemetry

Now, the last part here I have the scale out part. So far the thing we were talking about is just scale up. Now, when we want to go scale out, one of the very interesting decisions you will find … I talked about supermarket, I talked about supplier, I talked about the headquarters. When you want to scale out, do you want to scale out just on a cluster? You did that one and you [would] just replicate the cluster on the left side, where you have node 1, node 2, node 3? Or do you want to scale out more on little to the right, where you have all supermarkets in one node, all suppliers in another, and all headquarters in another? Or do you want to create a mix? Or the supermarket is cache and memory sensitive and going to consume those resources?

On the other hand, the headquarters is storage sensitive. I'm storing the things and I'm getting things out. So here those two map together. What you will find is what things look simple with respect to how you want to deploy your microservices and this going into your microservices, too. Do you keep your microservices neat? They are on one cluster and you replicate them, or are two different kinds of microservices going to do better on some kind of node so you can leverage the whole infrastructure in that case more effectively? Because usually, you pay for the infrastructure. Once you took that node with the storage and memory, you are paying for it whether you're using it or not. So in a scale out, that topology part is very important.

I think up to this part, I described to you what this benchmark does, what we thought of the scale up, how we thought about the scale out, and how we look at the different telemetry and data points. So the main takeaway here is you need to think of this thing from the very beginning when you're designing. It's okay in the beginning because you're in the execution mode, you are in the startup mode, but once this thing really scales out, you have to start thinking in parallel, otherwise you will scale out something which each node is doing almost nothing, or very little. Now I'll hand over to Monica who has done her poking around into these different features and she has the data.

Beckwith: Thanks, Anil. I've worked with SPECjbb for a long time as well, and like I mentioned earlier when Anil did bring this up, I thought that it would be a good idea for me to test his theory. So it's a good thing that this benchmark can do all those things but, I want to make sure that as a performance engineer, I want to put that to test. So I started with the problem statement which is also the title of this talk - scale up performance benchmarking. But scaling up is not easy as he mentioned. For performance engineers, scale up means a lot of things. So what do we do when we are trying to scale our system?

What Do We Do?

The first thing we try to do is scale-out, which is kind of easier. And why would you choose scale out? Commodity hardware is easily available. So that's one of the things you would do. Second thing you would do, is because like how he mentioned. You have nodes in scale out. So you could tune one and then you can just replicate it for the other ones. Also, another thing about tuning, is tuning is very easy at the application level. So if you look at the stack, your hardware, software stack, if you apply more tuning, more smarts at the application level, you get the most bang for your buck. As it gets deeper and we have lots of experts that will talk about that, as it gets deeper into the jet, the runtime or memory optimizations, those do well but they won't give you as much bang for your buck as tuning at application level.

When we scale out, and Anil mentioned it earlier, what are the things to consider? Orchestration, networking are the things to consider. So that's what I try to do with this particular benchmark. The experiment here is to have 0% remote and then compare that to a 50% remote traffic, okay? And those are the options that I used. And this particular case, because I'm doing scale out, so I'm using two JVMs. So basically, consider that as node 0 and node 1, as at the JVM level. So what did I observe? On the left you see 0% remote, and then to the right of that is 50% remote. And I did see that max-jOPS got lower. Basically, that's the throughput metric. So that's a bandwidth your system can sustain, and that became lower.

Also, the network traffic affected your SLA. So critical jOPS is also a throughput metric but it has these SLAs, System Level Agreements, so 99th percentile and it's like a geo mean of 99th percentile, 90th percentile, 95th percentile. It's like the throughput that you can deliver within promises. Even that suffered. Actually, that suffered more. If you look at the percentage drop, it's more. So your SLA is suffering. And these graphs have been explained to you by Anil, so I'm not going to dive in deeper.

But whenever you see these highlights, what I'm trying to highlight is the 99th percentile. The future dots is what I'm trying to highlight. If you compare the one on the left to the right, you'll see that it's getting closer to, I think, it's 100 milliseconds. Yes, it's closer to 100 milliseconds. That means, that your SLAs are depending on where you are, and we're trying to be more close to the critical jOPS which is the yellow line, your SLAs may or may not be met, if your SLA was, say, 100 milliseconds which is what typical SLA is at. So network traffic does play an important role and we have been able to show that with respect to this benchmark.

Now, going back to our problem statement which is scaling up. We're still trying to scale up and we still think the scaling up is not easy. How do we approach scale up problem. When I thought about, there are brave hearts that attempted. There're many times that I have attempted it because we have this nice bulky heavy system that we invested in and we want to make sure that we're utilizing it to the max. So what are the different ways we can do scale up? You can do scale up at the hardware level. And then you can do scale up at the software level as well.

I talked to various customers and they mentioned that, "Oh, we want to have our injection rate." Which could be clients, which would be users, which could be transactions, but usually it's the same thing. We want to increase our injection rate, we want to add more clients to the system. We have invested in this nice BIFI system, so we want to make sure that we can push more. I again said, hardware and software level you could. So there are multiple ways we can approach it. I'm just going to consider these three scenarios and we'll dive deeper into these scenarios.

One of the hardware scenarios is, of course, add more memory. You have a BIFI system, you could add more memory and because it's a manager on time, that equates to increasing your heap. At the software, maybe choose a better algorithm. Being aware of what I'm trying to do, the different GC algorithms that are tuned for throughput latency footprint. Maybe choosing then or choosing middle of the road, because our SLAs are really important. And I have an experiment for that as well. The other way, again, I have already invested in the system and I want to drive my CPU to its maximum capacity. I don't want to operate at 50% utilization. I want to go farther. If I want to do that at software, I have shown how the full joint pool scheduler works here. Basically, adding more thread pools to that. I have an experiment showing that as well.

When doing scale-up, we want to make sure that our SLAs are met. So that is something to consider. When you are looking at your data, you should keep SLAs in mind and that's what I've highlighted here as well. So I'm breaking down those things into three scenarios like I mentioned. First one, increase injection rate, increase heap. Scenario two, again, increase injection rate, but look at a different GC algorithm. And scenario three, optimize the SMT usage. So the particular test system that I had, system of the test, is SMT capable. So I was trying to optimize that. Usage of that. So kind of add user-smart task scheduling pool.

Scenario 1: Increase Injection Rate, Increase Heap

The first one, increase injection rate but keep your SLAs under check. Let's go to the first line first. There are different injection rates I started with. I started with 10K which is kind of lower and then I went up to 50K. So 10, 30, and 50 are the three injection rates that you'll see here. While I'm doing this, remember my goal is to kind of add more memory so I can bump up my injection rate, which means heap here. So I'm comparing 10 gigs and 30 gigs. Looking over here, what you see is again, every time I have the oval, it's highlighting the 99th percentile. So you see that all my 99th percentile is above 10 milliseconds on a 10 gig heap. And then on the right, you see that it's below 10 milliseconds. So as I'm adding heap to this equation, and this is with injection rate fixed. Then I can already see adding more heap can give me little more space. So now I can bump up my injection rate. That's what I'm going to do next.

Now I bumped up my injection rate to 30K. So I'm happy that I could add more heap which helped me bump up my injection rate to 30K. And what do I see now? I see that not only my 99th percentile, but also my 90th percentile is better now. So you see? Everything seems better. Now I know that if I tune it even further, I could definitely achieve better SLAs for this particular case. So there is a chance. With a little bit tuning, I could go to 50K injection rate as well.

That's what I did next. I went to 50K. Again, with 10 gigs, I'm way off the chart. If you look at that, that's about one second. So my 99th percentile at 10 gigs with 50K injection is seeing one second or more. But with 30 gigs, it's actually less than one second, but it's still crossing my 100 milliseconds SLA line. I think at this particular time tuning will help which also means that may be choosing an optimal algorithm will help. Since I have a little bit background on GC algorithm, so what I did is I'm going to go to scenario two and I'm going to show you how GC affects that.

Scenario 2: Increase Injection Rate, Choose Different Garbage Collector Algorithm

Scenario two. I decided to choose a better GC algorithm for this particular tuning. So what you see on this is a GC histo. I'm not going to go into the GC algorithms in open JDK HotSpot but these are the two main ones. G1GC is now the default since Java 9 and prior to that Parallel GC was the default. So Parallel GC was what you had seen all the previous graphs with. But I changed to G1GC. So what you see and I'm sorry because yellow doesn't show well on white, I guess, but those yellow lines are full compactions. They're parallel threads but full compactions. You have full compactions, if you look at the time in milliseconds, it's about 4.5 seconds. So in your worst case, you have really bad full compactions going on with that garbage collector. What G1GC does, it tries to avoid full compactions.

Unfortunately, this particular one, this is out-of-the-box, is still seeing two space exhausted which is basically evacuation failures, but it's able to reduce the full compaction sizes. If I tune this further, then it'll show you even better tuning and it'll drop down your GC overhead. My background is GC performance and GC tuning and all these things. When I look at these, the first thing I look at is GC overhead. That's an extra. The same pass time gram, the GC histogram, is shown here to your left, you would see that's the GC overhead. For G1GC, the overhead is way less than what we see for Parallel GC. It's actually less than 15% for G1GC. So that is still more. To me, 15% is still high. But if you compare that with Parallel GC, it's 21%. Parallel GC is overhead, is 21%. GC overhead is basically the time spent in GC and not in your application. Your application is not able to do what it's supposed to do, so your throughput is dropping. So if I tune this further, of course, I'm going to get lower overhead and better throughput for my application.

Another thing I look at as my max. Remember I showed you that the full GC was 4.5 seconds? That's what it's showing here. So full GC, full Parallel GC is way up high compared to what full GC was for G1GC. The way G1 mitigated it is based on the algorithm. It does have evacuation failures which can be tuned out, but right now just out-of-the-box G1GC shows to be a better candidate. So being aware of my GC algorithms I can tell you that, we can switch to G1GC and see better scores, better throughput numbers under SLA constraints.

Scenario 3: Optimize CPU Cores/SMT Usage, Optimize Task Schedule

Moving on to scenario three, talking about the task scheduler. Again, SLA constraints are something that I'm looking for right now. So let me explain this again. It's the same graph that we've seen but on the left you see that the pool size is equal to your SMT threads. And on the right, I have doubled the pool size. I have multiple numbers, I've also one quarter to one half of the SMT threads. But for now, I think this sends the message across clearly. So not only does my max-jOPS become better, so from 49 it goes up to 52 which, again, max-jOPS is your sustained bandwidth. What your system can sustain, that's what max-jOPS shows up here. Max-jOPS increases, but also my bandwidth under SLAs, under response time constraints is also increasing. So it goes from 24 to 25.

Now, if anybody has worked with SPECjbb, they can tell you how amazing is that. Because, sometimes your max-jOPS are the only ones that will increase and critical-jOPS does not. And sometimes critical-jOPS increases, but max-jOPS is still the same. If you go to the web page and look at the results, you'll see people submitting scores where their critical-jOPS is high and then people submitting score where their max-jOPS is high. They just keep on submitting because never do you see both of them actually increasing.

Again, I've highlighted the area where 99th percentile is shown. And as you see on the left, the 99th percentile is almost the same as on the right, but you see that because we are doing more work on the right, more pools are allocated, so your jOPS are getting better. So that's why you see more dots and more data points, so to speak.

Now, one thing I wanted to highlight again is your absolute worst case. On the right, right around the max-jOPS line, the red line, I have put in two ovals. So you see how the 99th percentile goes up like that for both of them, but for the one on the right, it doesn't even cross the upper threshold that I’ve marked in blue. So those are the things. If in case sometimes you're designing a system, with time outs, you have four different highly optimized nodes, but they need a communication or some kind of time out or responsiveness criteria, so keep that max response time under check. Maybe that's why you will not go to maximizing your scheduler, because you may not want to cross that max threshold that you have, the max time out that you had kept for your system.

I think we're pretty close to being able to summarize all these. You know “if”-s and “but”-s. But the truth of the matter is that if you understand that you want to check your SLA constraints, if you understand that, then scaling can be easy, scaling up can be easy. If we're able to check SLA constraints, when we are increasing the heap or choosing the right algorithm or optimizing CPU usage by optimizing the thread scheduler that we're using, then definitely scale up can be easy. For conclusion, Anil, do you want to come up and let's conclude together? I think we're just pretty much done. We have so much information here and the way we were deciding on conclusion was that scale up and scale out are two different things.

Kumar: I think we summarize on the scale-up part that the most important thing you will find is the telemetry and the correlation. If you are not measuring something, you can't correlate it. So you have to put certain telemetry points and you don't have to do on all the transactions, you can do on certain transactions, and a probe-type approach because it can be very heavy to do on all of them. And then you have to have the estimate performance. What is estimate performance for few users and what is estimation when system is being used, the node being used because I've talked to so many people who have no idea if their node is being fully used or not, particularly on-premise case. Cloud case you're getting the bills, so you know it, how much you're consuming. On-premise, the infrastructure is there. You're going to get into trouble as soon as you start to move the deployment into cloud from on-premise. So you have to have the some scaling up idea, system utilization SLA, very important. And your footprint also, you need to watch for it because either way you get in trouble there. Scale out.

Beckwith: I was going to add something too. As performance engineers, you will find so many talks out there. And there's definitely a talk by me as well. We always call it balancing. Footprint, latency, and throughput. We always favor, unfortunately footprint is like the third wheel. But when you're scaling up, footprint does matter because you're paying for it, remember? So scaling up footprint is an important consideration. And then when I showed you the 10 gig and 30 gig, at a point, it may not matter depending on how much of injection rate you're thinking of. So you may still want to stick with 10 gigs for the bill for that one.

Kumar: Yes. For example, on the heap part, memory is pretty expensive so you want to utilize it as well with all the thread pool and things. For a scale out, again, telemetry correlation, there's no replacement for it. And also, for the scale out, you want to check your cost. Are you supporting more users with extra cost once you're sure the scale up is good? I think those are the main conclusions: make sure your scale up is there and when you're scaling out, you are getting your money worth for more number of users as you're paying the cost. The only way you do it, through telemetry and correlation. Those are the very important points. If you achieve that, I think then you scale up, scale out and you can go on vacation in Hawaii and monitor from there - that was the idea. Thank you.

Questions and Answers

Participant 1: Based on what you guys have done with experiments, do you guys have a fit for applications, like, "Oh such and such applications should always use such and such deployments in their DCs, in their cloud platforms"?

Kumar: I think for most of the social media type and e-commerce type, particularly when you're doing the check in or you are checking sites or text or string-based computations, certain things definitely fit. For example, you could have more core or more deployment or microservices type, it will work pretty good depending on how much data you need to go in between. I think 1 to 10 gig networking between them fits the traditional enterprise scenario even in the scale out case.

But if you have new use cases of AI or ML or Genome or the health services, if it is more computation bound, for example, you have the users. Let's say you are a lending service, so you have a lot of customer data about their paying, they're paying the installment, the different market situation going on trend, you have all the data, but I don't think anyone is thinking of analyzing it together. If you put a machine learning algorithm on it, then the same deployment will not work. You might better use a machine learning deployment for it, where you're finding a pattern on it, a situation certain customers are not able to pay back.

That information is very important because you make money for all the customer analysis also. And you are able to sell them better if you understand the trend. For example, in the landing tiers, I was just talking a day before with someone, if you are identifying a pattern and market is down and different things are happening, you could be more careful on your users. A happy user comes back to your service. So that ML part looks like a different deployment. And the other piece, if you're into genome service or the healthcare service, definitely now you have the FPGA. You have the other accelerators in the cloud, where they are providing the frameworks where you could deploy. Before, you had to do everything yourself.

Now the compute-bound, genome or health types, they might require a different deployment. There are definitely these categories where they go fit into. If it is a mix, then you may want to probably separate them. You don't want to do just a compute-bound that you could have done on the regular systems. You would still be found, if you still think this is going to change very fast because if something is changing, you probably want to wait a bit before going into ASIC. Because once you pay and you wrote there, it's a lot of pain to go back put it to general case. So you may want to wait a bit and then decide for those scenario. If they are stable, then now let me find those things. Is that, Monica, your thing?

Beckwith: I think that's good. And it also depends on how you have a profile of your application and you kind of play that as the base of your benchmark design. I want to have these components in my benchmark design. For example, he mentioned strings, and strings is a big part that actually SPECjbb highlights. And in the future, what we're thinking of, is also when he was talking about jbb runtime, we want to think about, stressing certain use cases, such that we call it functions or something and then we're going to stress those functions.

So what really matters at that time is what a unit of work can this particular function do, and then design the benchmark based on that. This particular benchmark as you can think of, it’s not monolithic but in the correct usage of the word, it is to the extent that it does a bunch of things and then you have to kind of scale it down to find your use case. It's very capable of trying multiple use cases as he highlighted, but you could narrow it down to your particular use cases. I don't think every application out there is going to be able to use all of the use cases that SPECjbb has to offer.

Kumar: We are looking for new use cases particularly if you are in more challenging environment, please, you have our handle for email. Because we want to create the new benchmark now, a new use case, and we are open for open-source type developments or propriety development, because so many times you're not comfortable with sharing due to competitive reasons and other reasons. We are open with both, and we have a very good track record of keeping the things under NDA, if that's what anyone wants to discuss.

Participant 2: Thank you for your talk. I was trying to do some comparison between Parallel GC as well as G1GC and trying to see which one was better for my use case. My application was using a tomcat web server. What I saw was that the performance was better in default GC, the Parallel GC, but the resource utilization was better in G1GC. What kind of use case do you have where you say that G1GC is not better compared to Parallel GC, so that we can think about it and we can transfer it?

Beckwith: That's what I was trying to show here - SLA constraints. I think G1GC thrives when you have to throw so many different variations of a workload to it, because it'll try to adapt itself. Anything that needs a lot of tuning, you're probably chasing only 5%. Five percent may matter to you, but probably it doesn't matter to many people. I'm not going to talk about that because if you can tune Parallel GC to just have young collections and you never have a full collection, so you can do all those tuning tricks and I've spent a lot of time on those. But if you have a bearing workload, so something that you're building a cash first and then you're doing transactions, so I think for that kind of workload, G1GC is better suited than Parallel GC.

Kumar: In this workload we have in the supermarket inventory, you have lot of objects with the concurrent hash map being stored. You have the receipts which get stored. It's a different kind of data structure in the headquarters. So you can increase and decrease each one and understand. We have one scenario where G1GC was actually not working well. So that's the benefit of the benchmark, it is just not standard form, you can increase or decrease things.


See more presentations with transcripts


Recorded at:

Feb 15, 2019