
Brewing Java Applications in Sigma Managed Clusters


Summary

Kingsum Chow talks about the challenges of large-scale software deployments in the data center for one of the largest online shopping events. Chow covers evaluating and estimating software performance at scale, and optimizing software for resource management.

Bio

Kingsum Chow is currently a Chief Scientist at Alibaba System Software Hardware Co-Optimization. Since 1996, he has been working on performance modeling and analysis of software applications. He has been issued more than 20 patents and presented more than 80 technical papers.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

I'm very happy to see a lot of faces here, even though it is pretty late in the evening. It has been a long day. I hope to do a few things here, but before I start the talk, I want to tell you about my history and why I am here. Since I graduated in 1996 - to some of you that is a long time ago - I have been searching for something fun to do. I call that the performance data playground. What I want to do is really understand the relationship between software performance and everything that is used to run the software.

The Search for a Performance Data Playground

First, I started by joining the Intel CPU performance team. I stayed with the CPU for about three years, then I moved to work for what was, at that time, Intel Online Services. Some of you might remember, that was the beginning of the dot-com bubble. At the end of 2001, I joined Intel as a performance engineer. I had no idea what a performance engineer was, but I needed to get a job. Some of you might have had that experience. So I started out assuming there was not much to it: you just measure, get a few numbers, say something about your measurements, and you're done.

Never did I realize I would learn so much from my last 15 years at Intel. I worked with Appeal before the JRockit team was acquired by BEA in 2002. In 2005, I was asked to help Siebel, the CRM company. Now, you know, CRM does not stand for Siebel anymore, it's Salesforce, but at that time it was Siebel. After one year of optimizing Siebel software running on the JVM, they got acquired by Oracle. In the meantime, I worked with the BEA WebLogic team optimizing performance, again running on the JVM. In 2008, they were acquired by Oracle. Around 2007, due to a special collaboration between Intel and Sun optimizing the Sun HotSpot JVM, I met many people working on Sun HotSpot, including Monica. We optimized the Sun JVM on Intel hardware, and in 2010, they were acquired by Oracle.

To make my life easy, in 2010 I worked with Oracle as an Intel engineer, optimizing Java and cloud. I enjoyed my work a lot, but something was missing. While I was optimizing software for many different companies, I did not have a chance to change the software. I could suggest changes, but I did not work for those companies, so I could not make the changes directly.

Due to that itch, I was looking for a company that would let me play with a large amount of data so that I could figure out what to do about the software-hardware interaction. I was very fortunate that Alibaba, out of the blue, just picked me and said, "Well, come over and do something." So I have been working with Alibaba happily since 2016. I am actually based in Hangzhou. I am very pleased that I have a chance to be here, thanks to Monica and many people here, to share some of the work I did before Alibaba, and some of the work I did with Alibaba.

Peak Transactions Per Second for Singles' Day China

This is actually the most exciting week to be working at Alibaba. The other way to look at it is, this is the most stressful week to be working at Alibaba. A few days from now, on November 11, we will have a day called Singles' Day in China. Have any of you heard about Singles' Day? Wow, only a few people. Singles' Day is the largest online shopping event in the world. On that day last year, $25 billion of merchandise was sold. That is more than the three biggest online shopping days in America combined - by a lot, almost 40% more.

How do we do that? That's one question. The other question is, how did we get from almost no activity many years ago to the level of activity we needed to sustain last year? Looking at the graph here, on the y-axis is the peak number of transactions per second. We're not talking about the average over the day; we're talking about the number of transactions per second we need to handle for a short period of time during the day. And guess what, we need to do that. The orange line shows the number of transactions per second for all the e-commerce activities. The blue line is the number of payments we need to handle.

Long story short, over the last 7 years, from 2011 to 2017, there has been about a 100-times increase in volume in both kinds of transactions. Our job is to figure out how to handle this huge number of transactions. One solution is to buy 100 times more hardware; that would make certain companies very rich, but we are looking for other solutions. Can we buy less hardware and still sustain the growth?

Alibaba Infrastructure – Lots of Apps on Lots of Servers

What exactly is happening in Alibaba is that we have four layers of software, and this is probably not very different from other cloud providers. Among these four layers, including the hardware layer at the bottom, there are two layers that are of particular interest today. The lower yellow bar is the system software layer as we define it in Alibaba. The definition of system software might be different in other companies. For us, the system software includes both the OS and the JVM. We package the applications using containers, and we schedule the applications using a cluster management system. We're going to talk more about the layers of software.

But I want to emphasize one thing: because Java is relatively easy to program, and we have lots of Java programmers in the world, we have lots of Java applications as a result. Guess how many Java developers we have in Alibaba? Anybody want to guess? Nobody? It's not 28, it's not even 4,000. Ten thousand - close enough, that's about what the number is. We have 10,000 Java developers because nowadays, when you hire developers from college, you don't get a lot of C programmers, and you don't get a lot of C++ programmers, but you do get a lot of Java developers. And they write a lot of Java code, some good, some bad; we need to handle both.

All of you are good Java programmers, but believe me, there are some bad Java programmers in the world. They create a lot of applications, roughly 100,000 applications. In the e-commerce world, we launch new opportunities all the time, we change our applications all the time, we don't keep an application running for months. Sometimes an application runs for weeks and then we replace it - lots of changes. Because we need to handle a large quantity of activity, we have lots of instances of Java programs running at the same time.

At one point we measured about one million Java instances. So in a nutshell, what you can see here is that we have a very large presence of Java programs, and we need to scale these Java programs across a large number of servers.

OpenJDK at Alibaba

And thanks to OpenJDK, we have been using it for many years. In 2011 we were using OpenJDK 6, then we moved to JDK 7, and we settled on JDK 8 recently. This is a very interesting year in that we are considering upgrading to JDK 11.

For those of you who are very familiar with the JDK changes, you may realize that this is actually a challenge for us. But I'm not going to go into detail about why it is a challenge for us today; it would be a very, very long story. So instead of talking about the challenges of JDK 11, I really want to thank Oracle for open sourcing the JDK. Not only are we using the JDK as it is; more importantly, we can customize OpenJDK for our needs.

AJDK: Customize OpenJDK at Scale

I would like to highlight four areas of things we do and plan to contribute back to the OpenJDK community. I'm going to touch on these four areas, and some of them might be interesting to you as well. The reason we work on these four areas is to address large-scale Java programs. One of the things we need to address is that we have a lot of smaller Java programs we need to run in the data center, and it may not be cost effective to put each of these applications in its own JVM. I'm not sure you have this problem, but we have many tiny programs.

We found a way to package these applications together in a single JVM; we call it multi-tenancy. We are not the first to invent this technology. Other companies, like IBM, have tried this, and they got it kind of working. The reason we got it working is that we control the 10,000 developers who work for us. We can say: you shall do this, you shall not do this. So we make multi-tenancy work because we have a fair amount of control and all the applications are developed by us.

The second thing, which we call Wisp, addresses a special problem we face. We have many Java threads running in our Java applications, and some of these threads, when you leave them to the OS scheduler, can be very slow; they incur a lot of context switches, as mentioned in some of the earlier talks today. So instead of having the OS schedule the threads, we create green threads, or lightweight fibers, within the JVM, and we handle the thread scheduling ourselves. Again, we make this special for our environment because we control all the e-commerce applications. We plan to generalize it and work with the open source community, like Project Loom, and we'll find a way to get them to work together.
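
As a rough illustration of the general idea (this is a hypothetical sketch, not Wisp's actual API, which lives inside AJDK), the effect is similar to multiplexing many small tasks onto a few carrier threads instead of giving each task its own OS thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CarrierThreadSketch {
    public static void main(String[] args) {
        int tasks = 10_000;

        // Thread-per-task model: 10,000 OS threads, and the OS scheduler
        // pays for all of their context switches.
        // ExecutorService perTask = Executors.newFixedThreadPool(tasks);

        // Multiplexed model: the same 10,000 tasks are queued and run on a
        // handful of carrier threads; scheduling decisions happen in user
        // space, which is the general idea behind green threads and fibers
        // (Wisp, Project Loom).
        ExecutorService carriers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            carriers.submit(() -> {
                // A small, mostly-waiting unit of work for task `id`.
                return Integer.toHexString(id).length();
            });
        }
        carriers.shutdown();
    }
}
```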

The third thing was actually heavily discussed in the previous talk by Microsoft: warmup. We also want to speed up the launch of applications and make them ready to use. Other companies have solutions such as AOT and ReadyNow; we address our own need by creating a list of methods based on a previous execution of the application and using that list at launch.
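
A minimal sketch of this kind of warmup, assuming a file of class names recorded during a previous run (this illustrates the general approach only; it is not AJDK's actual mechanism, and the file name is made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class WarmupLoader {
    // Hypothetical file produced by a previous run: one fully qualified
    // class name per line, ordered by how early the class was first used.
    private static final String CLASS_LIST = "/tmp/warmup-classes.txt";

    public static void main(String[] args) throws IOException {
        List<String> classNames = Files.readAllLines(Paths.get(CLASS_LIST));
        for (String name : classNames) {
            try {
                // Loading and initializing these classes up front moves class
                // loading (and some JIT warmup) out of the latency-critical path.
                Class.forName(name);
            } catch (ClassNotFoundException e) {
                // The recorded list may be stale; skip classes that no longer exist.
            }
        }
        // ... then start the real application.
    }
}
```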

Last but not least, we have JET, Java Event Tracing, which is for diagnostics. If an application goes wrong, we want to trace how the memory is being utilized. This is not a JVM-focused talk, so I just want to share why we are doing these things and how we are addressing the problems. What is more important to Alibaba is the next slide.

Performance Evaluation 2.0

Our problem is not that we don't have enough smart people; our problem is that we have too many smart people in the company. You may have the same problem. When you have too many smart people in a company, every one of them will claim that they have found a solution to cure all the problems you face. Quickly we end up with about 1,000 or 2,000 solutions. So our problem is actually how to evaluate all these great ideas in a way that lets us make the right decision.

To do that, we look at benchmark data, as Monica and Neil mentioned earlier this morning. We do look at benchmark data to see roughly where things are. But in addition to that, we also need to evaluate the performance of our online systems. On the right side of the slide, we highlight a few key components of the online systems we need to run. The orange box is Taobao. Taobao is a service very similar to eBay: a bunch of small merchants or users selling things to other consumers.

Next to that is Tmall. Tmall is somewhat similar to Amazon. It is a platform that enables many merchants to sell their goods to consumers. It's not exactly like Amazon because we don't sell things ourselves; we enable other merchants to sell them. The blue box, Cainiao, is the logistics company; they handle the shipment of the things that are being purchased. That is followed by DingDing, and by many other services.

SPEED (System Performance Estimation, Evaluation, and Decision)

When we have a lot of services running in the data center, we need to figure out a way to evaluate the performance. So we came up with a simple name, SPEED. SPEED stands for System Performance Estimation, Evaluation and Decision. We monitor a lot of servers - if not all servers - and how they are running in the data center. We pull in the data from these servers so we have some idea of what's going on.

The reason we don't have an exact idea is that sometimes we know a transaction is getting into a server, but we do not know exactly what the request is. Those things are confidential - even we don't know about them - but we know there is a request going into a certain service. So by adding up all the requests across all the servers, we have some idea of how busy a certain server is in the data center.

Now, when we try to optimize how the software systems run in the data center, we keep track of the previous baseline. So before a change, we look at how the system is being utilized, then we apply the change. In the data center, when we apply the change, we don't change everything; we change a small portion of the servers, a very small portion. And then we evaluate whether something good is happening or something bad is happening.

A change can be a software change; very often it's just a software change. Sometimes it's a configuration change for running a certain software application. Sometimes it is a new JVM feature we're testing, to see whether it is actually doing something good in the data center. And we do it in the live data center. So some of our transactions might be going through the test system, and some might be going through the regular systems. Then we need to make a decision based on the data.

Performance Evaluation at Scale

Now, remember I said earlier that we run thousands and thousands of applications. And when we make a change, sometimes we affect multiple applications at the same time. We came up with something very simple. This is not anything innovative; it is a simple formula that we want to apply to all workloads running in the data center. We call it RUE, Resource Usage Effectiveness.

For every application, we define something called work done, and we measure the resource usage of that application. We take the ratio of these two numbers: we divide the resource usage by the work done, and we look at whether the number is lower or higher. If the number is lower, we are using fewer resources per transaction. If the number is higher, we're using more. And for us, using fewer resources is good.
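
As a simple illustration (the numbers below are hypothetical, not Alibaba's actual metrics), comparing a baseline and a candidate configuration reduces to comparing two ratios of resource usage to work done:

```java
public class RueExample {
    // RUE = resource usage / work done; lower is better.
    static double rue(double resourceUsage, double workDone) {
        return resourceUsage / workDone;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: resource units consumed and transactions served.
        double baseline  = rue(776_000, 1_000);  // before the change
        double candidate = rue(716_000, 1_000);  // after the change

        // Negative means fewer resources per transaction, i.e. an improvement.
        double relativeChange = (candidate - baseline) / baseline;
        System.out.printf("RUE change: %.1f%%%n", relativeChange * 100);
    }
}
```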

And sometimes, if we find a solution that can save us an average of 1%, we will be so delighted that we'll jump to the roof. One percent of a large number of servers can pay for multiple teams. But those savings are hard to come by. I'm going to show you a real example later, but before I get there, I want to talk about CPU utilization.

CPU Utilization

Do we have a CPU utilization expert here? By that I mean someone who, when I say a system is 34% CPU utilized, knows exactly what that means. All of you are very humble. Let me ask you this question; this is not a trick question, and you don't have to know a lot of math - it's a simple question. Let's say you own a data center and you have one million servers, a small number, a million servers. You collect all the data, and you find out that your data center is 50% CPU utilized, based on sar, Task Manager, whatever tools you have. And you happen to be running on Intel processors; most likely you are.

Now let's say you figure out that this is the peak demand you have experienced, and you have very strong programmers: there are no software scaling problems, no JVM problems, no interference, no problems at all. Would you cut your servers by half? Why not? Oh, there's a guy from Intel? We don't dare to cut it by half for many reasons, and one of the reasons is the definition of CPU utilization.

Before I jump in there, I want to share with you some common mistakes, and I have probably made many of these mistakes myself. The first one we already covered: keep CPU utilization low or else bad things happen - and bad things can be anything. And you might have a coworker who comes to you and says, "Oh, I just optimized the last bottleneck in the software, and there's no performance improvement." Maybe that's you, and maybe that's your coworker.

The other four cases happened in the last three years. I'm going to go through them one by one, and I will have detailed examples for some of them. I encountered a very smart engineer. He told me, "Due to my very smart code change - I changed one line - my application is now running 50% better using half of the CPU. One line changed, one line." So I thought, well, he's really smart. And we have other people developing tools to pick up telemetry data, performance data. Sometimes you encounter teams saying, "Well, use my tool, use my tool; it has zero overhead, zero."

I'm going to show you an example of what that means. And sometimes I encounter an engineer saying, "Well, that's not my fault, I did not change anything, the performance just went downhill." You might have those friends as well. We want to figure out what all these things mean. And last but not least, there are people saying, "Hardware options are already optimized, you should not change them. Don't touch it, it's good." I'm going to dispute most of these, maybe all of them.

Hardware Mechanisms of Intel HT Technology

But first, let's follow the thought of our Intel participant here. Intel introduced a very good technology - I'm not bashing this technology. It's very good, it improves performance, no doubt about it. It is called hyper-threading, and some other processors are coming out with SMT as well. With hyper-threading, when you turn it on, what actually happens is that a single core of the CPU behaves like two logical CPUs. If you look at the diagram at the bottom, instead of running one software thread, you can run two software threads on the two logical CPUs.

Things run faster for many reasons. One of the reasons is that if one thread is waiting for a memory access, the other thread can proceed, so you can hide some of the latency to those resources. That is very good. One way to picture how it works is the diagram there: you have two threads, the black thread and the green thread. Without hyper-threading they run sequentially; looking at the diagram on the left, along the vertical line, they consume about 10 cycles to complete a task. Once you turn on hyper-threading, the two threads can share the resources within the core, and together they complete the task within seven cycles.

This is just an illustration; sometimes you get more of a performance increase, sometimes you get less. We get a performance increase - that is not the problem. The problem is when we look at the CPU utilization of the systems. This is a Windows screenshot. The left diagram shows the summary of the CPU utilization, and you can also look at the breakdown of the four logical CPUs; they vary.

I want to count the CPUs from zero to three to match the Intel terminology. CPU zero on the left is running below 50%, CPU one, the second chart, is running at about 50%, CPU two is running below 50%, and CPU three is running at more than 50%. So on average they are at 50%, and that's what the number shows.

Now, what happens in the data center is that we aggregate all these numbers, sometimes at a per-container or per-machine level. And when you report a number that way, all we get is the chart on the left - the overall number. And we need to make a decision based on the overall number. So is there any problem with that? When you have four logical CPUs instead of the two logical CPUs you had before turning on hyper-threading, the implication of CPU utilization is actually different.

On the top, you have two cores running two threads; that means you didn't turn on hyper-threading. And when these two logical CPUs are busy, naturally, we are running at 100% CPU utilization, as indicated by the blue dots over the gray area. Once you turn on hyper-threading, you might have many different scenarios; I'm just showing two of them here. In the scenario in the lower left corner, we have 2 green threads running on 4 logical CPUs, so that's 50% CPU utilization. And in the lower right corner, we also have 50% CPU utilization, because we are also running 2 green threads.

But these two 50% CPU utilizations are different. The problem is that when you pull a number from a large data set in a data center, you may not realize the difference. So are they really different? I'm not sure you all bought into that, but they are really different.
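
A simplified way to see why the two 50% readings are not equivalent (a rough sketch that ignores the many other factors affecting hyper-threading): the usual metric counts busy logical CPUs, but the remaining headroom depends on how the busy threads are spread across physical cores.

```java
public class HyperThreadingUtilization {
    public static void main(String[] args) {
        int physicalCores = 2;
        int logicalPerCore = 2;                        // hyper-threading on
        int logicalCpus = physicalCores * logicalPerCore;

        // Scenario A: 2 busy threads on 2 different cores.
        // Scenario B: 2 busy threads sharing 1 core (the other core idle).
        int busyThreadsA = 2, busyCoresA = 2;
        int busyThreadsB = 2, busyCoresB = 1;

        // Both scenarios report the same logical-CPU utilization...
        System.out.printf("Reported utilization A: %d%%, B: %d%%%n",
                100 * busyThreadsA / logicalCpus,
                100 * busyThreadsB / logicalCpus);

        // ...but the fraction of physical cores already occupied differs,
        // so the real remaining capacity is very different.
        System.out.printf("Cores in use A: %d%%, B: %d%%%n",
                100 * busyCoresA / physicalCores,
                100 * busyCoresB / physicalCores);
    }
}
```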

Experiment

I was going to run experiments, but I have already run them, so I'm just going to show you the numbers. I'm using an older benchmark called SPECjbb2005, and I ran it with two different configurations.

In the first configuration I'm running with 32 warehouses; in the second configuration I'm running with 64 warehouses. We don't really need to know what a warehouse means; what we need to know is that in the first configuration we are running on half of the logical CPUs, and in the second configuration we are running on all the logical CPUs.

Experiment Results

So, long story short, I have the performance numbers. The blue bars show the run using 32 logical CPUs, and the red bars the run using 64 logical CPUs. Just to make sure they are doing about the same thing, I look at the throughput on the left side. The throughput is indicated by the number on top of the blue or red bars: 1.199 million ops per second versus 1.235 million - they are pretty close, the red one slightly higher. But when you look at the CPU utilization without looking at the detailed breakdown, the overall CPU utilization you get is 50% for the blue and almost 100% for the red.

In other words, it appears that when you are running with 64 logical threads, you're consuming a lot more CPU but doing about the same work. To double-check that, in the lower right corner we track the number of machine instructions that got executed during the run. We determined that the number of machine instructions is 7.58 for the blue and 7.84 for the red - they are similar. So the whole experiment just shows that if we use the simple average CPU utilization, it might be misleading, and the maximum error is 100%. The good news is you probably will not get worse than a 100% error.

Now, this is our concern. Back to our original question, when you see your data center is 50% CPU utilized, we don't know which 50% utilized it is. In one case, we will not have extra capacity, in the other case, we have extra capacity to do something else. So in our work, we spend a lot of time looking at resource utilization. Resource utilization is costing us money, power, and space.

"My Tool Has Zero Overhead"

When we try to obtain the data, we need tools that do not consume a lot of overhead. So naturally, we have a lot of friends who will say that they have a tool that doesn't cost us any overhead.

There are many interesting ways to measure whether a tool has overhead. One of them is illustrated here. This tool claimed to have zero overhead because they ran the workload without the tool and with the tool and said, "Well, look, the performance is exactly the same number of transactions per second." So far, so good. I'm pretty sure some of you have caught the problem, so I'm going to show you the rest of the data. We picked up other data to check whether the tool is indeed doing what it is supposed to. We picked up the CPU utilization - it doubled - and other utilization numbers got worse as well. So naturally, we refused this tool; we said the overhead is too high. When I monitor a lot of systems in the data center, I have only one simple rule: the overhead should be less than 1%, or else I don't know how to use the tool for monitoring.
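
A minimal sketch of the point being made (the numbers are hypothetical): throughput alone is not enough, so the overhead check has to look at resources per transaction, which is the same RUE comparison as before.

```java
public class ToolOverheadCheck {
    public static void main(String[] args) {
        // Hypothetical measurements without and with the monitoring tool.
        double txPerSecWithout = 10_000, cpuUtilWithout = 0.30;
        double txPerSecWith    = 10_000, cpuUtilWith    = 0.60;

        // Same throughput, so a throughput-only comparison claims "zero overhead"...
        System.out.println("Throughput delta: "
                + (txPerSecWith - txPerSecWithout) + " tx/s");

        // ...but CPU per transaction has doubled, which is the real overhead.
        double cpuPerTxWithout = cpuUtilWithout / txPerSecWithout;
        double cpuPerTxWith    = cpuUtilWith / txPerSecWith;
        double overhead = (cpuPerTxWith - cpuPerTxWithout) / cpuPerTxWithout;
        System.out.printf("CPU-per-transaction overhead: %.0f%%%n", overhead * 100);
    }
}
```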

Now, I'm not saying the same requirement applies to profiling and optimization, when you need to take a profile, capture a stack, look at the stack, look at the hotspots. During those times we can afford a slightly higher overhead. But monitoring means we are running a tool all the time on all the servers, so it had better be lightweight.

"Not My Fault. I Did Not Change Anything"

One day somebody told me that the performance had dropped, and it happens to be SPECjbb2005. Because I cannot show our own workload numbers, I picked a benchmark number to illustrate the point. The interesting thing here is that there is about a 25% drop in performance; if you look at the number on the right, the performance dropped from about 1.8 million to 1.49 million.

And the engineer said, "Nothing's changed," and initially we believed that nothing had changed - okay, something must be wrong with the hardware, and so on. Some of you might be familiar with this: if the engineer happens to cross the heap size threshold that triggers compressed oops, there is a significant change in performance. And in fact, that was the case. The engineer thought they were doing us a favor: "I used more heap; I even fixed the initial heap size and the maximum heap size, as recommended by all the experts." You laugh. And right now, discovering any of these problems is a painful manual process for us.
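
For context, this is a well-known HotSpot behavior rather than anything Alibaba-specific: 64-bit HotSpot can use compressed ordinary object pointers (oops) only while the heap fits under roughly 32 GB, so growing -Xmx past that threshold silently switches to full 64-bit references and can cost significant performance. A quick way to check which mode a JVM is actually in:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CompressedOopsCheck {
    public static void main(String[] args) {
        // HotSpot-specific MXBean, available on OpenJDK/Oracle JDK 8 and later.
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        System.out.println("UseCompressedOops = "
                + hotspot.getVMOption("UseCompressedOops").getValue());
        System.out.println("Max heap = "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
        // Run once with -Xmx31g and once with -Xmx33g to see the flag flip.
    }
}
```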

Hardware Options Are Already Optimized

Now, last but not least: hardware options are already optimized. This is an experiment we did on Intel Broadwell processors. We were looking at whether we could improve performance without changing anything. So we don't need to ask the [inaudible 00:34:39] to make a change, we don't need to ask the operator to change the deployment - nothing, just run. We were looking at whether we could improve the performance of memory-intensive Java programs; we do have a bunch of those.

And we did all these experiments. Some of you are familiar with NUMA - you don't have to raise your hands, I know you are all very tired. We did experiments focusing on whether we can speed up memory access. We picked two options; in fact, at that time, there were probably just two options. One is NUMA: we localize the memory per socket. The other one is COD, Cluster-on-Die, a feature on Broadwell that lets us further divide a socket into two sub-NUMA nodes.

And we did a bunch of experiments. Long story short, we improved the performance by 25%. Again, not a single line of change in the source code. We are not software developers; we are not insistent that the performance gain has to come from us. As long as there is a performance gain, we are happy. I'm not sure how you feel about that - maybe some of you are software developers and feel like it has to come from you.

Simpson's Paradox

I want to walk through this example very slowly, because this is super critical in our data center. Simpson's Paradox - have you heard about that? This is a real question. If you have heard about it, raise your hand and I'll ask you to explain. Nobody. Have you heard about the lawsuits that happen regularly in America, suing either schools or companies for discriminating against different groups of individuals? Those lawsuits are based on numbers. The most famous case happened many years ago at Berkeley.

The lawsuit against Berkeley claimed that Berkeley admissions favored male candidates over female candidates, and it was explained by Simpson's Paradox. But instead of going through that case, I was lucky enough to find a data set - this is real data - and I spent a lot of effort to hide what it is. As long as you cannot deduce what it is, that's good; it means I can go through the data set. So what actually happened is that we have three groups of applications we care about. They are called group one, group two, and group three - very meaningful names.

These three groups are all e-commerce applications, running together in a cluster of servers, and we have lots of servers running there. Somebody decided there is a very cool JVM fix that can improve the performance of these applications - very smart JVM developers, and we have some of them here. And guess what, when we make a change in software, we cannot change everything at once; people like to test it first, to see whether it is working.

On the left side, the left two columns are the baseline case: what we see when we run the software without the change. The middle two columns are after the upgrade; we have some number of instances running there as well. Now, we do not spend time configuring the change manually. All we do is say there is a change, and we launch a number of machines based on whatever selection criteria apply during the online process.

They decided they would launch 1% of the instances with the change and 99% without the change. Based on the criteria and on how the machines received the requests, this is the breakdown. The first column indicates that 99% of the instances are running the baseline, meaning no code change. The third column indicates that 1% of the instances have the change. The number after the 99% is 776. That number is the amount of resources we spend per transaction before the change. The number after the 1%, 716, is the amount of resources we spend per transaction after the change.

We are reducing the amount of resources we use from 776 to 716; we are improving the performance by 8%. So everybody is happy and says, "Go for the change." What we did was say, well, hold on, let's look at the three application groups. For the first group, we are comparing 1289 before the change with 1484 after the change; that is a 15% increase in resource utilization per transaction. It's a pretty large increase.

In the second group, there is a small increase, within the noise of the measurement. In the third group, at the bottom, there is a 19% increase in utilization. Now, these are real numbers, and this is a real problem in statistics. When we look at the individual groups, sometimes what we see is different from the overall trend; that was named Simpson's Paradox many years ago. It is a well-known problem, and we happen to see it in our performance data as well.

If this is a controlled experiment - let's say you're running a benchmark, you control the transactions, you control a lot of things - this should not be a problem for you, because you can control everything. In a very large data center, where we cannot control everything about how the requests are distributed in a live system, we do have to pay attention to it.

So in summary, if you are dealing with a large amount of data - not from your benchmark, but from a real data center - do not just believe the summary, the overall numbers; they might be misleading.
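
As a small, self-contained illustration of how this can happen (the weights and RUE values below are made up, not the talk's real data): if the 1% test group happens to receive a different mix of traffic across the application groups than the 99% baseline does, every group can get worse while the blended average looks better.

```java
public class SimpsonsParadoxDemo {
    // Overall RUE as a traffic-weighted average over application groups.
    static double overallRue(double[] trafficShare, double[] ruePerGroup) {
        double total = 0;
        for (int i = 0; i < ruePerGroup.length; i++) {
            total += trafficShare[i] * ruePerGroup[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical baseline: the cheap group A carries 90% of the traffic.
        double[] baselineShare = {0.90, 0.10};
        double[] baselineRue   = {500, 3000};

        // Test instances: every group got 10% WORSE, but the selection
        // criteria sent the test machines almost exclusively group-A traffic.
        double[] testShare = {0.99, 0.01};
        double[] testRue   = {550, 3300};

        System.out.printf("Overall baseline RUE: %.1f%n",
                overallRue(baselineShare, baselineRue));   // 750.0
        System.out.printf("Overall test RUE:     %.1f%n",
                overallRue(testShare, testRue));           // 577.5 -> looks "better"
    }
}
```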

Resource Management in Alibaba

Everything I have mentioned is about CPU utilization, about looking at performance numbers, about things we can do. How we actually do that is captured in this slide: resource management in Alibaba.

This is a very busy slide; we can focus on just a few things on it, and then you will have a pretty good idea of what kind of problems we are dealing with. On the top row, you can see that we have two kinds of applications. The blue ones are the LRAs, Long Running Applications; you can conclude from that that we run our Java applications for a long time.

We also have something called batch jobs. You can think of those as some of the big data operations - MapReduce, the typical stuff that people are having fun with. In general, long-running applications have stricter response-time criteria, as we have tighter SLAs; for batch jobs, we have looser SLAs. To utilize all our servers, we try to mix and match both kinds of applications on our servers. Sometimes the demand from the latency-critical applications is high, and sometimes it is not.

We package these applications using something called PouchContainer. In fact, Alibaba open sourced PouchContainer last year. I don't have a link here, but you can search for PouchContainer, and if you have nothing else to do, you can download it and play with it. The reason we have our own container dates back to around 2011: we needed to build up container technology for our own needs, and at that time there were no other container technologies around - at least, none that were easily accessible. And since we have already built our container technology and it is serving us well, we stay with it.

Effort from Sigma for 11/11

We also have our own orchestration layer, the Sigma scheduler. We schedule the applications on our servers. Sigma is the scheduler that saves us on Singles' Day. On this chart, from left to right: initially, we allocate servers for either latency-critical or batch jobs. Look at the second bar: we have a huge demand to do a lot more processing - that is the light yellow bar in the middle. And we squeeze everything down by monitoring the resource utilization and packing the applications together.

There are many considerations when we pack the applications together, but we package them together and they run very well. The performance improvement is very significant; depending on how you measure it, you can call it a 100% or 50% improvement, and things are great.

Alibaba Cluster Data (v2017)

But we don't believe we have done the best possible; we think we have just improved things a little bit. We open sourced our data set in 2017 (last year). We asked everybody to look at our data to see if there are ways to optimize the scheduling of the applications. That will help them, that will help us, that will help everybody.

Selected Publications

And we have strong support in the research community. Five papers were published, and one of them was just published last month as a best paper at OSDI. So we are very pleased that our data is useful. But we also received a lot of feedback about things missing from our data set.

Alibaba Cluster Data (v2018)

Good news for all of you: just a couple of days ago, we released our 2018 version. This data set covers 4,000 machines over a period of 8 days. It is two orders of magnitude bigger than the first data set. We also provide the dependency graph for the batch jobs, so you can use the dependency specification when scheduling the applications.

We are looking for ways to improve the efficiency of the data center. We believe all of you can help us, because all of you will look at the data from a different angle and may discover something we never thought about. And since it has just been released, I hope that you will be the first to play with it. If you have any questions, you can come to me or go to the website; there is a Q&A and an email address, so you can ask questions there as well. With that, I would like to thank all of you for staying so late, and I'm going to open it up for questions.

Questions and Answers

Participant 1: Going all the way back to the beginning, I was wondering what your JDK update cadence is going to be going forward with the new release cycles?

Kingsum: The bulk of our applications will need to run on JDK 8, because of a lot of third-party dependencies. We want to move to JDK 11 at some point; I cannot give a specific time frame. For the time being, we focus on how to get JDK 8 to work for us. Something I did not mention in the talk is that we are backporting a bunch of patches to JDK 8. Some of the patches should be available to the general public. In fact, we are trying to find a way to do that for the newly open sourced Java Flight Recorder, JFR. Because we are running on JDK 8, we have the need to port it to JDK 8 from - I think it was released in JDK 11.

We have already done the work. The problem is actually the license agreement for JDK 8, because if you do it on JDK 8 it might be considered a commercial feature, and we can't do that. So we're trying to find a way to work around that. We have other things we are doing as well, and some other constraints. For the other projects we're doing, it's mostly about the overlap with existing projects, like Project Loom for our lightweight thread scheduling - that's mostly about the overlap. We are also doing warmup plus AppCDS, and AppCDS is taking a certain direction, so we need to figure out how to work together. Expect something exciting next year.

Participant 2: Any benchmarking been done on the everyday tools that we use as part of the Java community, say any of the Apache frameworks or Guava that are being used in Alibaba?

Kingsum: We don't do those benchmarks - it's not that benchmarking is not good. One reason we don't do benchmarks that much is that our code changes too fast. Also, sometimes the problem doesn't show up unless you're running in a mixed environment. Let me give you a drastic example. Sometimes the problem has nothing to do with the JVM. The worst problems I have seen usually have nothing to do with the JVM itself; it is the interaction of a JVM in a container that happens to be running on a server with another container that is doing something fishy. We need to take care of those problems, and they cannot be detected by benchmarking.

We do use a lot of benchmarks from SPEC - we use SPECjvm, we use SPECjEnterprise, we look at SPECjbb and the new version coming up - to make sure things are cool. Yes, we look at those things, but we are not creating benchmarks. We hope to have a system where we can evaluate the movement of applications in a data center - things move around, and we want to know their impact on the overall performance. We want to get there, and if any of you have ideas on how to do that, I would like to learn from you.

Participant 3: You mentioned the multi-tenancy on JVM, and then you mentioned the PouchContainers. Can you describe more about the strategy of packing the applications together, and then packaging them to containers? Or what is the strategy there?

Kingsum: I'm being recorded, so I need to say things carefully. There are a lot of very lightweight applications, and we need to package them. If you package these lightweight applications by running them in multiple JVM instances, that means you're going to JIT a bunch of code over and over again for these lightweight applications. And these lightweight applications are maybe used 1% of the time. You spend [inaudible 00:52:12] and you occupy a lot of memory as a result of [inaudible 00:52:16]. Even if they coexist in the same container, you have that problem, so that's why we have multi-tenancy.

Now, the container is a very good medium for us to schedule a lot of different applications, including Java applications and non-Java applications. We package them looking at the memory quota and the CPU quota, and sometimes we even look at the distribution of CPUs so that we can optimize it. I hope that answers your question a little bit; I can take it offline if you have further questions about the strategy.

 


 

Recorded at:

Feb 17, 2019
