Tuning a Runtime for Both Productivity and Performance

Summary

Mei-Chin Tsai and Jared Parsons talk about how Microsoft’s .NET team designed the runtime environment to balance convenience, fast startup, serviceability, low latency, and high throughput. For example, services such as JIT compilation, TypeSystem, garbage collection all provide convenience but come at a cost. The challenges presented are common to many environments.

Bio

Mei-Chin Tsai is Principal Group Software Engineer Manager at Microsoft. Her team owns the C#/VB compilers and the .NET runtime (often referred to as the CLR). Many .NET innovations have succeeded under her supervision, such as .NET Native (a pure ahead-of-time compiler) and low-allocation APIs (Span and Memory). Jared Parsons is Principal Developer Lead on the C# Language Team at Microsoft.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Mei-Chin Tsai: My name is Mei-Chin. My team owns .Net Language and Runtime.

Jared Parsons: So I'm Jared Parsons. I'm part of the .Net Language and Runtime team. Specifically, I own the C# compiler implementations.

Tsai: Today we are here to share how we tune our runtime for both productivity and performance.

What is a Runtime?

I often find myself in a situation where I have to explain what the runtime is, and it is actually really hard to explain the runtime without a whiteboard and without 30 minutes. So the short answer that I usually give people is: think about the runtime as your translator. When you write your code, either in C# or Java, [inaudible 00:00:40] compile once, it runs everywhere. That is because, when your portable code comes to the device and starts to run, that VM is there to do the translation.

If you think about the languages that the VM has to translate to, there are actually quite a few. For platforms, we have Linux and Windows. For architectures, you have ARM32, ARM64, X64, X86. Our runtime is very sensitive to this environment, and that is why it got isolated out from all the platform dependencies and architecture dependencies. If you compare a runtime to a translator, then think about how you define a good translator and a bad translator. (No, you are taking pictures.) A good translator: first of all, a correct translation is a must. An incorrect translation is a bug. It's not a feature. That's a [inaudible 00:01:49]. A good translator has to do a good job of translating in a fast manner, in a smooth manner, and also with elegance.

So this talk is divided into three parts. In the first part, we would like to talk about tuning for startup and throughput. In the second, I would like to walk you through the latency case study of Bing. The third is the conclusion and takeaways.

Services of Runtime to Execute Code

Before I jump into the startup and throughput, I want to do a one-minute breakdown of what the VM is doing for you. Look at the code on the screen on your right-hand side. In simple code like that, you have a base class, you have a myClass, you have a myField, you have a myFunction. When your portable code is running, there are many components in the runtime that support it.

Three components are relevant to today's talk: TypeSystem, the Just-In-Time compiler, and the GarbageCollector, and many of you may already be very familiar with them. The TypeSystem is responsible for answering, when you create an object instance, how big that object will be and where each field will live. In this particular case, it would be how big myClass is, and then, when you reference myField, which offset inside the object it will be at.

When you invoke myFunction, myFunc, then, because it's virtual, the TypeSystem answers which slot in the vtable layout you're actually invoking.
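The three questions described here (object size, field offset, vtable slot) can be illustrated with a toy model in Python. This is only a sketch: `TypeInfo`, `HEADER_SIZE`, the fixed 8-byte field size, and the `baseField` name are assumptions made for the example and do not reflect the CLR's real layout rules.

```python
from dataclasses import dataclass, field
from typing import Optional

HEADER_SIZE = 8   # assumed size of a per-object header, illustrative only
FIELD_SIZE = 8    # assume every field occupies 8 bytes


@dataclass
class TypeInfo:
    name: str
    base: Optional["TypeInfo"] = None
    fields: list = field(default_factory=list)           # fields declared on this type
    virtual_methods: list = field(default_factory=list)  # virtuals declared or overridden here

    def all_fields(self):
        # Base-class fields come first, so a derived object starts with its base layout.
        inherited = self.base.all_fields() if self.base else []
        return inherited + self.fields

    def instance_size(self):
        return HEADER_SIZE + FIELD_SIZE * len(self.all_fields())

    def field_offset(self, field_name):
        return HEADER_SIZE + FIELD_SIZE * self.all_fields().index(field_name)

    def vtable(self):
        # Start from the base table; an override reuses its slot, a new virtual appends one.
        table = list(self.base.vtable()) if self.base else []
        names = [method for method, _ in table]
        for method in self.virtual_methods:
            if method in names:
                table[names.index(method)] = (method, self.name)
            else:
                table.append((method, self.name))
                names.append(method)
        return table

    def vtable_slot(self, method):
        return [m for m, _ in self.vtable()].index(method)


base_class = TypeInfo("BaseClass", fields=["baseField"], virtual_methods=["myFunc"])
my_class = TypeInfo("MyClass", base=base_class, fields=["myField"], virtual_methods=["myFunc"])

print(my_class.instance_size())          # how big is MyClass? -> 24
print(my_class.field_offset("myField"))  # where does myField live? -> 16
print(my_class.vtable_slot("myFunc"))    # which slot does myFunc use? -> 0
```

In this model the derived class has to load its base before it can answer anything, which is the same cascading the talk describes later for startup.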

JIT is short for Just-In-Time compiler. It consults the TypeSystem to generate code and to make sure that your program executes correctly. The GarbageCollector comes into [inaudible 00:03:34] play when you're allocating under memory pressure, or allocating too much, and there is garbage that needs to be cleaned up for your program to continue running. These three components actually work closely together.

Now here comes my very first question. Simple code like this. It is just a Main function, right? I insert one line of code: Console.WriteLine, "Hello World!" This is the infamous, or famous, Hello World! console application. Can I ask the audience to make a wild guess at how many functions need to be JIT-ed in order to run this code? Anybody want to raise a hand? My team members should know. Yes, Monica?

Monica: 28.

Tsai: 28?

Parsons: Close.

Tsai: Close. Does anybody else want to make a wild guess?

Man: I would put zero.

Parsons: That would be great.

Tsai: That is the goal. That would be the goal. Anybody want to make a wild guess? 28 was very close. It's just off by 10.

Parsons: It's just off by an order of magnitude there.

Tsai: It actually needed to JIT 243 methods on .Net Core 2.1. Depending on the SKU of .NET you are running, the number could be different. So you see, you wrote Main, and you call Console.WriteLine. Those you kind of intuitively know are there. But there are many things being dragged in: String, StringBuilder, what is needed to set up the runtime, and CultureInfo, the globalization support that you may need.

So my next question. Console.WriteLine is not an interesting application. How about a simple Hello World! Web API application like this? I only show you a snippet of the program because the [inaudible 00:05:18] generated by template. The template is [inaudible 00:05:20] start a server, a [inaudible 00:05:23] server, run the server, and then we are sending Hello World! to that webpage, and [inaudible 00:05:29] that application. Anybody want to guess how many methods need to be JIT-ed to run this Hello World! Web API?

Monica: 28?

Parsons: Again, close.

Tsai: Monica, is 28 your favorite number?

Monica: It is.

Man: Zero?

Tsai: That would be the goal. Anybody want to make a wild guess?

Man: Thousands.

Tsai: Thousands. I wish I had a prize for you. So if you look at this screen - I was told not to move around, so I'm going to stay here - if you look at this table here, this measurement was actually done on an Intel i7 machine, 3.4 gigahertz. So it was quite a beefy machine. 4417 methods got JIT-ed.

And the startup actually took 1.38 seconds. That is a long time. So what was happening here? In order to JIT those 4000 methods, JIT [inaudible 00:06:36] system [inaudible 00:06:37] system [inaudible 00:06:38] types, and the JIT continues to consult the type system. Many questions were asked, like: What's my field offset? What is my interface slot? What is the visibility of this field? Is this function virtual or not? Oh my gosh, I need to load my derived class. Before that, I need to load my base. So a lot of cascading is happening.

I have been a JIT dev lead. I went to my JIT team, and well, I shouldn't say do, but he said it's okay to say do. You are in a [inaudible 00:07:00] of startup. I remember my JIT team told me, "It's really not my fault. I'm only one third of the problem."

So I said, "Well, so whose fault is that?" They told me Type System. It was really so. Two years later, I became the type system lead. I went to my type system guy, who is sitting here as well. I said, "David, you are too slow. You are taking up most of the startup time."

David told me, "It's really not my fault. I'm only one third." I'm not stupid. I should be able to do math. If every one of us is one third, where on earth is the other one third? It was actually happening in that interface. So we went back to the JIT and said, "You're asking too many questions. Couldn't you cache your state?"

Then the JIT came back and said, "Well, why don't you answer the question faster? Why don't you cache your state?"

So regardless of whose fault it is, now I'm a dev manager, so the JIT and the type system are both underneath me. I go back to my engineer, Jared. "You know, Jared, 1.38 seconds to run the Hello World! app is actually not acceptable. No real world application would be able to run successfully on .Net. Solve the problem. It doesn't matter whose problem it is. Just make it go away."

Precompile on Targeted Device

Parsons: It's nice to get put on the spot. As Mei-Chin noted, the problem here, it's not the steady state execution of this application. Once it's up and running, it's very fast and very efficient. The problem is getting us from the initial start to that execution state.

So if we take a step back and look at the whole picture here, what's happening is we're running the JIT on every execution of the application, and it's fairly wasteful. After all, we're executing the same code on the same machine in the same scenario every single time. The JIT output for every one of those executions is going to be identical. Why are we doing this? Why not instead run the JIT one time, let's save the results somewhere, and then just execute that saved code next time?

So Ngen is a tool that we built to solve this exact problem. It effectively goes up to your application, loads it up, JITs all the methods, stores the result in a machine wide cache. In the future, whenever the runtime is executing your application, it's going to essentially load that cache instead of going through the JIT.

This means, in the ideal scenario, you JIT zero methods, because we will simply use this cache to execute the application, and that means we can get execution performance on par with even native applications like C++. So if we compare the Ngen numbers here, what you see is that we have dramatically improved startup. We've gone from 1.38 seconds for startup to 0.48.

Tsai: Okay. Not bad.

Parsons: Half second. So you will notice, though, that there are still some methods that are being JIT-ed here. It's not a clean slate. The reason why is that even though we could JIT everything ahead of time, there are certain patterns that are just better done in the JIT. For instance, generic virtual methods. It's much easier to kind of execute those at runtime than go through all of the expansions and write those out. So even when we do NGEN, there will be some level of JIT-ing. But we've kind of solved the startup performance problem, so we're in a pretty good place.

Fragility

There is kind of one downside of this approach. The values that we're storing in this cache are a bit fragile. That's because the JIT makes a lot of assumptions about the applications that it's executing. For instance, it assumes that types aren't going to add, delete or reorder virtual members, and this means that it can do optimizations like hardcoding virtual table offsets. It also assumes that the contents of methods aren't really going to change, and that allows it to do inlining within and across assemblies. It assumes that the actual data structures used within the CLR are going to remain consistent, and these are all pretty reasonable assumptions for a JIT to make. After all, it's generating code at application execution time, and you're usually not changing method bodies as you're running the application.

But this does mean that other kinds of events can invalidate these results. For instance, if the application is redeployed with different content, or if a library you're working with gets updated, or even when a Windows Update runs and changes the .Net framework. Any of these will essentially invalidate those cached results. That's kind of a consequence of using the JIT for a scenario it wasn't exactly designed for.

To be clear, there are no safety issues here. The applications aren't going to suddenly start crashing because some of these things change. The runtime is resilient to this type of event. When it's storing the cached results, it will actually write out enough information to know if these assumptions have been invalidated, and if so, it will simply fall back to the JIT and not use any caching. This type of deployment event, it seems like it's going to be pretty rare. I mean, how often does Windows Update run? Hint: we designed this before Patch Tuesday was invented.
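The "write out enough information to detect invalidation, then fall back to the JIT" behaviour can be sketched as follows. This is a toy model, not Ngen's actual format: `NativeImageCache`, `fingerprint`, the dependency names, and the version strings are all hypothetical.

```python
import hashlib


def fingerprint(dependencies):
    """Hash the versions of everything the cached code made assumptions about."""
    text = ";".join(f"{name}={version}" for name, version in sorted(dependencies.items()))
    return hashlib.sha256(text.encode()).hexdigest()


class NativeImageCache:
    """Toy stand-in for a machine-wide cache of precompiled method bodies."""

    def __init__(self):
        self.entries = {}  # app name -> (fingerprint, {method name: compiled code})

    def store(self, app, dependencies, compiled_methods):
        self.entries[app] = (fingerprint(dependencies), dict(compiled_methods))

    def load(self, app, dependencies):
        entry = self.entries.get(app)
        if entry is None or entry[0] != fingerprint(dependencies):
            return None  # missing or stale: the caller must fall back to the JIT
        return entry[1]


def run(app, methods, dependencies, cache):
    """Return how many methods had to be JIT-compiled for this run."""
    cached = cache.load(app, dependencies) or {}
    jitted = 0
    for method in methods:
        if method not in cached:
            jitted += 1  # pretend to JIT the method here
    return jitted


cache = NativeImageCache()
methods = ["Main", "Configure", "HandleRequest"]
deps_v1 = {"framework": "4.7.1", "app": "1.0"}   # hypothetical versions
cache.store("hello", deps_v1, {m: f"native code for {m}" for m in methods})

print(run("hello", methods, deps_v1, cache))    # cache valid: 0 methods JIT-ed
deps_v2 = {"framework": "4.7.2", "app": "1.0"}  # e.g. after a framework update
print(run("hello", methods, deps_v2, cache))    # cache stale: 3 methods JIT-ed
```

Nothing breaks when the fingerprint no longer matches; the run is simply slower until the cache is rebuilt, which is the tradeoff discussed next.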

And even when this does happen, it's not going to break anything. The customer's app is just going to slow down for a little while. Eventually Ngen will kick back in, re-cache all of the results, and the performance will speed back up again. So this doesn't seem like a real world problem. It seems like something that's more like an engineering detail. I think we're good.

Tsai: Well, as you noticed, our engineer is very well aware of the tradeoff that he made in the solution. At that point in time, it was actually a good tradeoff. But as .Net became more and more popular, more and more people deployed their [inaudible 00:13:31] with Ngen to speed up the startup. For example, Visual Studio, or a bunch of, like, Office applications. When Patch Tuesday happens, we occasionally, actually it is not that occasionally, receive complaints from customers who say, "Oh my gosh, after the Windows update, my machine is out for 30 minutes, it's not responsive." We went back to our customers and said, "You know, that's a choice that we made with this solution."

You want performance? We need time to repair, and that's the time when we do the repair. So Jared lived happily with that smiley emoji for some time. I asked if I could snap a picture of him smiling. He said no. So here's the emoji.

Parsons: Well, you have to find a picture of me smiling.

Tsai: I was trying to take a picture. We share the same team room. He can see.

The World Changes on You

Tsai: So Jared's happiness did not last forever. Otherwise we would have ended this talk in 15 minutes. There was evidence that this solution would be a blocker for us in adopting new scenarios and new workloads. The first sign was devices where battery life matters. Before the Windows phone first came out, they approached us and said, "You cannot run anything hot on the device," and we looked at them and said, "Really?" and they said, "Yes."

HoloLens, wearables and many others, even laptops today: the battery life is actually a guarantee, right? You do not want a laptop that you can only use for two hours before you need to plug in. The second set of scenarios that started to show up, that shows a solution [inaudible 00:15:06] it actually as modern servers. With a lot of the server workloads that we have, they want to build the image once. They want to deploy it on millions of servers. They do not want each server to need 20 minutes for us to generate Ngen images before they launch the application and wait for requests.

The third one was also in the last job. It's actually security. Security has come into play. We were asked that all executables on disk must be signed. We are generating executable Ngen images on the device, and they are not signed. How do we know that, between when we compile it and when we deploy it, it has not been tampered with? We had no answer.

The last one is actually that we are going to Linux. When we go to Linux, do we even have a place to plug in our elevated services to do the repair? Do we have a 2:00 a.m. slot to do the repair? Do we have that window or not? The answer is probably not. So I went back to my capable engineer: I'm sorry, you have to deliver a different solution.

Compile Once at Build Lab

Parsons: Well, if it was perfect the first time, I wouldn't be employed. So I guess that little engineering detail is a real world problem after all. So what we need then is a code generation strategy that is just not going to be as fragile as our current one. Using the JIT directly with all of its optimizations is probably not going to work. So our new approach needs to remove all of the code that makes these types of assumptions, so it's not fragile to individual libraries or the underlying framework being updated.

The good news is, all of these assumptions are pretty much there to generate more optimized code. There's nothing fundamental to the strategy that is fragile here. So in lieu of these assumptions, we can avoid these optimizations. Or, instead of hardcoding offsets, we can just emit code that asks the runtime to give us the answer directly. So, for instance, instead of generating code that has hardcoded virtual table layouts, we can just ask the runtime, "Hey, can you look that result up dynamically for me?"

CrossGen is a tool that we wrote for this. The idea here is the generated code will be a little bit less performant, but it will be a lot more version resilient. Because it's machine and version resilient, we can actually run this tool at the same time that we're building the rest of the application. So this means, generally, if you're going to be investing in something like signing, signing is part of your build process. So companies can then choose to sign their build output and their ahead of time generated output at the exact same place. They don't have to move certificates for their deployment on machines, and anyone who has ever had to deal with certificates will definitely appreciate that. Then this build can then be deployed to a number of servers, and we'll be good to go. There we go.

So I wanted to take a look real quick at what this kind of change means for some real world scenarios. One optimization we've mentioned a few times now is virtual table layouts. So on any object hierarchy, every single type can choose to override a virtual method that's defined on one of its parent types or interfaces.

A good example of this is ToString or Equals. I'm sure we've all overridden those at some point in our lives. So when executing such a method, the runtime has to decide which ToString is going to get executed based on the runtime type of the object, not the static type that's actually there in code. Generally this is done by means of a virtual table. Essentially, every type has an associated table of virtual method addresses, and based on the hierarchy, the runtime will know, for instance, that ToString is located at the second slot in the table and Equals at the first slot. When the runtime wants to invoke a virtual function, it will essentially generate code that goes from the instance of the object to its concrete type and that type's associated virtual table. It will then just call into a specific offset in that table, and that's how virtual dispatch works.

So you can see here we're actually executing a simple virtual method in the ahead of time strategy. Those little bolded lines there are essentially the virtual table dispatch. We're essentially jumping from the object, grabbing the table, and then you'll see there's that hardcoded offset there at the bottom. It's probably hard to read in the back, 20H, and that's the runtime saying, "I know that the ToString method is at the 20H offset in the virtual table, so just call into that, and we're good to go."

So this also kind of shows the fragility here. The virtual table is laid out generally in the order that you define virtual methods. If you happen to add a method, reorder them or delete them, these offsets change. If this happened during deployment and we executed this code, that 40H could mean ToString now calls Equals, or it could just be executing random memory. So this is why this kind of strategy is fragile to changes.
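The fragility of a baked-in slot can be reproduced with a toy model. The sketch below assumes a vtable is just a Python list laid out in declaration order; `build_vtable`, `TO_STRING_SLOT`, and the two library versions are hypothetical and stand in for precompiled machine code.

```python
def build_vtable(cls_methods):
    """Lay out virtual methods in declaration order, as the talk describes."""
    return [impl for _, impl in cls_methods]


# Version 1 of a hypothetical library type: Equals in slot 0, ToString in slot 1.
v1_methods = [
    ("Equals", lambda obj: "Equals called"),
    ("ToString", lambda obj: "ToString called"),
]

# Precompiled caller: the slot for ToString was resolved once and baked in.
TO_STRING_SLOT = [name for name, _ in v1_methods].index("ToString")


def precompiled_call_to_string(obj, vtable):
    return vtable[TO_STRING_SLOT](obj)  # no name lookup, just an index


# Version 2 of the library reorders the declarations.
v2_methods = [
    ("ToString", lambda obj: "ToString called"),
    ("Equals", lambda obj: "Equals called"),
]

print(precompiled_call_to_string(object(), build_vtable(v1_methods)))  # ToString called
print(precompiled_call_to_string(object(), build_vtable(v2_methods)))  # Equals called
```

The caller never changed, yet after the library update it silently invokes the wrong method, which is why cached code of this kind has to be invalidated.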

So on the right we have the newer solution, and what this does is remove all of our hardcoded offsets. Instead, we essentially grab the runtime type of the particular object, and then we hand it to the runtime. We say, "Could you please invoke ToString for us?" Then the runtime can do its internal math to find the proper offset and jump to it. Now, this code is actually a little more complicated than that, because there is some logic that allows the runtime to write back the result of that dynamic lookup to the calling code. So the next time we come through here we don't even go through the runtime. We can invoke the result of the lookup directly. So the second time through here, the performance is going to be roughly on par with what we had before.
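A rough sketch of a call site that asks the runtime once and then remembers the answer is shown below. The `Runtime` and `CallSite` classes are hypothetical, and keying the remembered target on the receiver's exact type is an assumption of this model; the talk does not say how the real write-back is validated.

```python
class Runtime:
    """Toy runtime that resolves virtual methods by name at run time."""

    def __init__(self):
        self.lookups = 0

    def resolve(self, obj, method_name):
        self.lookups += 1
        return getattr(type(obj), method_name)


class CallSite:
    """A call site that starts unresolved and patches itself after the first call."""

    def __init__(self, runtime, method_name):
        self.runtime = runtime
        self.method_name = method_name
        self.cached_type = None
        self.cached_target = None

    def invoke(self, obj):
        if type(obj) is not self.cached_type:
            # Slow path: ask the runtime, then write the answer back into the call site.
            self.cached_target = self.runtime.resolve(obj, self.method_name)
            self.cached_type = type(obj)
        return self.cached_target(obj)  # fast path: call the remembered target


class Animal:
    def to_string(self):
        return "Animal"


class Dog(Animal):
    def to_string(self):
        return "Dog"


runtime = Runtime()
site = CallSite(runtime, "to_string")
results = [site.invoke(Dog()) for _ in range(3)]
print(results, runtime.lookups)  # three calls, one runtime lookup
```

Because the method is looked up by name rather than by a baked-in slot, reordering the declarations in `Dog` or `Animal` would not change the result; only the first call pays for the indirection.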

Simple HelloWorld Web API Sample

Now, you notice the table went from two rows to about six here, so there's a little bit to talk about. The first thing we want to look at, there's actually two runtimes listed now. We have the desktop .Net runtime and the CoreCLR one. The reason we did this is because Ngen is a tool that only works comprehensively on the desktop runtime. CrossGen is something that only works comprehensively on CoreCLR.

So you can't really compare these directly. Instead, if you want to know what the change is between the runtimes, you basically take the best case and worst case scenario in both environments and say, relatively speaking, how much better have I made things compared to the other world?

If you look here, the change we made on desktop .Net is an improvement of roughly two thirds. Now, when we look at CoreCLR, there are actually a couple of other rows here. You'll see that this still has Ngen listed there, even though we just spent a couple of minutes telling you why Ngen was a bad idea. Ngen is really only a bad idea because it's fragile to dependencies changing. Well, the good news is the runtime has no dependencies. It's the runtime. It depends on itself, and only itself. So you can actually run Ngen and all of its crazy optimizations on your core runtime library, and then use CrossGen essentially on everything else.

You will still have this very version resilient deployment strategy. There are all kinds of different ways you can blend this. But the good news is, what really matters here are the top and bottom numbers. So with the JIT we were at about one second on CoreCLR, and now we've gone to about 0.26 seconds. So we've managed to improve startup by about three quarters. So that's even better than the desktop one, where we were only able to improve it by about two thirds. So totally nailed it.
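The two fractions quoted here follow from the startup times as simple arithmetic. The desktop figures (1.38 s and 0.48 s) are the ones given earlier in the talk; the CoreCLR figures are read as roughly 1.00 s and 0.26 s, which is an assumption based on the stated "one second" and the three-quarters improvement.

```python
def improvement(before_seconds, after_seconds):
    """Fraction of startup time removed."""
    return (before_seconds - after_seconds) / before_seconds


desktop = improvement(1.38, 0.48)  # desktop .NET: JIT vs. Ngen
coreclr = improvement(1.00, 0.26)  # CoreCLR: assumed ~1.00 s vs. ~0.26 s
print(f"desktop: {desktop:.0%}, CoreCLR: {coreclr:.0%}")  # desktop: 65%, CoreCLR: 74%
```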

Tsai: As you can see, our engineer is wearing that smiley emoji again. Being a dev manager and being part of the team as well, I [inaudible 00:23:25] his technical detail. He just told me he introduced indirection. He just told me he removed optimizations, and he told me he improved startup. Is this really, like, a perfect world, with no major regression elsewhere? Jared, do you mind checking the other metrics that we track for performance?

How About Throughput?

Parsons: That's not good. Pretty much up until now, we've been talking about startup. So what does this do to throughput? Well, what you're seeing here is a JSON serialization benchmark that we have. You'll see that the best number here is the JIT, and that's what you would expect. The JIT is our highest quality code output on the CLR, so it should have the best throughput. But when we look at CrossGen, it looks like we dropped just a bit here on CoreCLR. That's because, as Mei-Chin said, we've introduced a lot of indirections. We've removed a lot of cool optimizations. So what we've done is move our performance problem: we've made startup great, but we've now sacrificed our throughput. We've essentially moved the problem from one place to the other.

Code Generation Technology Choices

Let's take a step back and look at the code generation technologies that are available to us, and see if we can find a solution. We've talked about CrossGen a lot today. It's going to be great for creating fast startup times, but it's going to produce suboptimal throughput code. An interpreter is where there is no need to do code generation at all. You don't have to run the JIT. The runtime can just read and execute the IL directly. This can have shockingly fast startup times.

For example, in one experiment, we found the fastest way to compile Hello World! with the C# compiler was to use Mono's interpreter. It beat even our Ngen test for perf. The first time one of my devs ran that experiment and told me about it, I told him he should re-measure, because I was convinced he was wrong. But we did some more measurements. We found out that, yes indeed, that particular scenario is fantastic for an interpreter. That's not a hard and fast rule. There's some give and take on which will be better, but in general, interpreters are really excellent for startup scenarios.

Even so, interpreters, they're not really an option for us right now. We have a couple of prototypes. Like I said, we have the Mono interpreter, we have an interpreter for CoreCLR. But they are prototype quality. They're not something that's been rigorously tested. Additionally, we haven't put the work into them to have good diagnostics. So, for instance, we have essentially no debugging support for them.

The good news is, though, the JIT kind of comes in two flavors - minimum and maximum optimizations. The minimum optimization version shares a lot of properties with the interpreter. It's very fast to generate its code, the code quality is pretty low, and in some ways we can think of this as a substitute interpreter for the CLR. The reason we actually have this mode at all is for debugging. When you hit F5 in Visual Studio, this is what you're getting. You're getting our minimum optimization version. We don't collapse any locals, we don't do any inlining, because we want to provide the most fantastic debugging experience possible.

The maximum throughput one is essentially the normal JIT, when you just run your application normally. But looking at the spectrum of options we have available here, no one thing is going to solve all of our problems.

Tiered Compilation

What we're having to look to now is tiered compilation, and tiered compilation is something that lets us blend these technologies. Up until now, the CLR has only been able to take a method and generate code for it one time. That meant that you had to make a decision for your application. Do I value startup, portability or throughput?

So what we've done is, we started to evolve the runtime to allow code for a method to be generated multiple times. This creates, if you will, a versioning story for the generated code of a method. So doing this, we can start in the initial version by generating code as fast as possible, sacrificing a little bit of throughput for speed on startup. Then, as we detect the application moving to a steady state, we can start replacing active method bodies with higher quality code. The runtime itself, when it starts to run a method, is just going to pick: what's the latest piece of generated code I have for this method body? Let's execute that.

This means that, as the JIT is replacing these method bodies, the runtime is going to start picking them up and executing them, and that can lead to some pretty interesting scenarios. If you consider, for example, a deeply recursive function, one that's essentially going down a tree, or a list of some nature, as that method is executing in the low quality startup code, the runtime can decide, hey, that's an important method. Let's make that one a little bit faster. It can generate some code, and the next level of recursion will actually pick up that new method body. So on a given stack, the same method can end up having two different generated bodies on it, or really N. It's kind of fun.

Even further, we can actually additionally blend this with CrossGen. We can use CrossGen, generate the initial method bodies for a good chunk of our application, and then on startup, the runtime will just use those if available. If not, it will use the low quality JIT, and then as the application moves to steady state, we can pick the hot methods. We can swap them out with high quality code. We'll be good to go.

This is a visualization of what's happening here. When the runtime is executing, if the method's been CrossGen-ed, it will just use the CrossGen code. If not, it will use the minimum optimization JIT, and that's how our application is going to run. But as the runtime detects that things are getting hot, like, this method is something that's important to quality there, it can start swapping all of these out with the optimized JIT version.
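The selection policy in this visualization can be modelled as a per-method list of generated bodies where the latest one always wins. This is a minimal sketch: the `Method` class and the strings standing in for generated code are hypothetical, and the decision that a method is hot is left to the caller.

```python
class Method:
    """Toy model of a method whose generated code can be replaced over time."""

    def __init__(self, name, precompiled=None):
        self.name = name
        self.versions = []  # generated bodies, oldest first
        if precompiled is not None:
            self.versions.append(("crossgen", precompiled))

    def current_code(self):
        if not self.versions:
            # Nothing precompiled: generate low-quality code quickly.
            self.versions.append(("min-opt JIT", f"quick code for {self.name}"))
        return self.versions[-1]  # the runtime always runs the latest version

    def install_optimized(self):
        # Called once the runtime has decided that the method is hot.
        self.versions.append(("optimized JIT", f"optimized code for {self.name}"))


parse = Method("Parse", precompiled="precompiled code for Parse")
render = Method("Render")

print(parse.current_code()[0])   # crossgen
print(render.current_code()[0])  # min-opt JIT

render.install_optimized()
print(render.current_code()[0])  # optimized JIT
```

Older versions stay in the list, which mirrors the recursion case described above: frames already on the stack keep running the body they started with while new calls pick up the latest one.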

Heuristic of the Tiering

But one of the questions, though, is how do we determine when a method has transitioned from startup to steady state? There's no real definitive answer here. Every application is different. There's no "I have hit my steady state" API call that anyone makes. So we have to use some kind of heuristic here. There are a couple of options we looked at. The simplest one is, just pick a hit count. Say, after a method has executed a certain number of times, it is now hot. Let's go. This is a pretty good metric.

Startup code tends to be executed a small number of times. For instance, you've probably only parsed your config file once. If you're parsing your config file 30 times, you have other problems, and we should have a talk. Other options include things like using a sampling profiler to look for hot methods, or using profile guided optimizations from previous versions to give the runtime a lot of hints on what to do.

At the moment, though, what we've settled on is just using a simple hit count, going and saying that once this method has been executed 30 times, let's go. That's done well on all the platforms we've tested.
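The hit-count heuristic can be sketched as a wrapper that counts executions and swaps in the better body once the count reaches 30. The `TieredMethod` class and the two `square` functions are hypothetical stand-ins for low-quality and optimized generated code, and swapping the body synchronously is a simplification made for this sketch.

```python
HOT_THRESHOLD = 30  # the hit count mentioned in the talk


def slow_square(x):
    """Stand-in for quickly generated, low-quality code."""
    total = 0
    for _ in range(abs(x)):
        total += abs(x)
    return total


def fast_square(x):
    """Stand-in for optimized code produced once the method is hot."""
    return x * x


class TieredMethod:
    def __init__(self, tier0, tier1, threshold=HOT_THRESHOLD):
        self.code = tier0
        self.tier1 = tier1
        self.threshold = threshold
        self.calls = 0

    def __call__(self, *args):
        result = self.code(*args)
        self.calls += 1
        if self.calls == self.threshold:
            # The method has now executed `threshold` times: treat it as hot
            # and let later calls use the optimized body.
            self.code = self.tier1
        return result


square = TieredMethod(slow_square, fast_square)
results = [square(7) for _ in range(40)]
print(square.code.__name__, square.calls, set(results))  # fast_square 40 {49}
```

Code that runs only a handful of times during startup, such as config parsing, never reaches the threshold and so never pays for the optimizing compile.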

Measure Again

So when we measured this tiered JIT-ing solution, we see we've gotten back to the same throughput as before. Exactly the same throughput as before. That might look suspicious. You might think we're cheating here, if you [inaudible 00:30:56] that. But remember that both scenarios are ideally executing in the optimized JIT code at steady state. So the results should be identical. If the results were different, that would mean that we probably screwed up our tiering and we weren't correctly identifying our hot methods. The other good news is, we lost none of the startup gains from before, because CrossGen is still preferred, if it's there. So now we've found this nice little sweet spot where we get the fast startup and the good steady state perf.

Recap on Codegen Journey

I think we've pretty much got to where we want to be. Tiering is really hitting the sweet spot for us, and if we look back and recap how we got here, the optimizing JIT produces really good code, but it's poor for startup. As we pushed on that, we ended up having to work on this tool. Ngen really helped us with our startup scenario, and the Ngen tool was good. I mean, we essentially used Ngen at Microsoft for basically a decade. It's fantastic for desktop workloads. It's fantastic for line of business applications and many server technologies. But as we've moved to the Cloud and people are doing high-scale deployments, the cheese has moved a little bit and it's just not sufficient anymore. CrossGen fixes our fragility, it helps us with these mass deployments, but it's not great for our throughput.

Tiering, though, is the sweet spot. It's where we can now take all of these technologies, and we can use them where they're strong and ignore them where they're weak. What's really nice is that, looking forward, tiering opens a lot more doors for us, because a JIT is always a tradeoff: I need to get the code for this method generated as fast as possible, but it also needs to be of sufficient quality that the next time I execute, the customer won't get mad at me. With tiering, though, we don't have to make that tradeoff anymore. We can choose to basically get the application executing, get it to steady state, and then we could just, for instance, take a background thread and say, "Let's spend some time really optimizing these hot paths and swapping them out later."

It also opens the door for us to do some more speculative code generation. For instance, opportunistically de-virtualizing calls. If we see that a particular type is always a specific instance, why not just generate direct calls? Why generate the indirection at all? This is something that Java has done essentially since its inception, and they get massive performance wins from this. Java kind of needs to do this, though, because the default in Java is to have virtual methods. So de-virtualization is super key for them to have good performance. In .Net, you kind of have a mix of virtual and non-virtual, so it's not as key. But we can get similar wins by taking advantage of this type of optimization. So looking forward, super happy. The future is really bright here, and we really feel we have a lot of room to work with now.
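
A rough picture of speculative de-virtualization is a cheap type check that guards a direct (or inlined) call, with the ordinary virtual call kept as the fallback. The Python below is a hand-written sketch under that assumption; the `Shape` classes and both functions are hypothetical, and a real JIT would emit the guard in machine code based on profile data.

```python
class Shape:
    def area(self):
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3 * self.radius * self.radius  # crude pi keeps the arithmetic exact


def total_area_virtual(shapes):
    # Baseline: every call goes through dynamic dispatch.
    return sum(s.area() for s in shapes)


def total_area_guarded(shapes):
    # Speculated version: assume profiling said the receiver is almost always Square.
    total = 0
    for s in shapes:
        if type(s) is Square:          # cheap type guard
            total += s.side * s.side   # direct call, here inlined by hand
        else:
            total += s.area()          # guard failed: fall back to virtual dispatch
    return total


shapes = [Square(2), Square(3), Circle(1)]
```

The fallback branch is what makes the speculation safe: if the guess about the type is wrong, the result is still correct, just slower.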

Latency Case Study

Tsai: I think what Jared is saying is that he still has a job. And his job is going to be there for a long time, because if you think about it, CrossGen was actually introduced in .Net Core. But we have not really shipped it to application developers yet, because there are some deficiencies that we still need to work through. I believe that we are already using it on our framework, and we're probably looking to [inaudible 00:34:27] in probably the [inaudible 00:34:28] timeframe. Tiering is already in a preview stage with .Net Core 2.1 and .Net Core 2.2. So feel free to play with it. At the end of the slides, I have a link to a blog post, so if you haven't tried .Net Core and you want to play with it, you can. But we haven't turned tiering on by default yet, because he showed you a good case. We are hitting the good case 99% of the time. There is still that 1% of not-so-good cases. Jared, you have a lot of work still ahead of you, your job is secure.

(Now, I would like to move on to the second half. I think we are running short of time, so I'm going to quickly move through my slide deck. You didn't time it. You wrote too much last night.)

Parsons: It wouldn't be a JIT talk if you didn't do everything just in time.

Tsai: Yes. So the second half is going to be pretty short. I want to show you a latency case study. What is acceptable latency? (It's okay, you can continue to show that and I will explain what that video is.) If you ask different customers what acceptable latency is, you are going to get back many different answers. I asked Bing, what's acceptable latency? His favorite query is Katy Perry. When I type Katy Perry on the search page, within one second, if the content comes back - it could be an article, it could be a video - within one second it's okay. I said, "That doesn’t sound too bad."

HoloLens, when they approached us, told us they were building this AR and they needed 60 frames per second. We did the math. Oh, 16 milliseconds per frame. That means [inaudible 00:36:08 - 00:36:11] less than 10 milliseconds, I'm good. They looked back: no. 16 milliseconds including frame rendering. So you’d better not take 10 milliseconds. So that's a very different workload there.
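
As a quick check on the math: the time budget for one frame is 1000 milliseconds divided by the frame rate, so 60 frames per second leaves roughly 16.7 milliseconds per frame. The pause length below is a hypothetical figure used only to show how little is left for rendering.

```python
def frame_budget_ms(frames_per_second):
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / frames_per_second


budget = frame_budget_ms(60)      # about 16.67 ms at 60 frames per second
pause_ms = 10.0                   # hypothetical runtime pause inside one frame
remaining = budget - pause_ms     # what is left for everything else, rendering included

print(f"budget: {budget:.2f} ms, left after a {pause_ms:.0f} ms pause: {remaining:.2f} ms")
```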

What is Acceptable Latency?

Then you look at multiplayer real-time online gaming, which is actually the video clip that we are showing. (Can you pause it and then play again?) So this shows what the user experience will be when a pause happens at a very inconvenient time. (Start from the beginning.) So here you are. You are playing your game happily. Explosions happening. We're shooting at each other, and somehow, just silence. What happened? The explosion continued. Oh, my gosh. Do you know what that is? That was a GC pause.

I know my GC architect is going to kill me if I don't explain to you, this is certainly not a bug in our GC. Actually, the application was using the wrong flavor of GC. Just like the Java VM has a bunch of different flavors of GC, so do we. We have workstation GC, we have server GC. For this particular application, server GC should be used, and when he was porting to .Net Core, server GC was not yet enabled. So that just shows you that, when a pause happens in a runtime, it can be quite annoying. That was the demo.

Bing's Migration to .Net Core 2.1

The case study that I would like to share is actually Bing's front end migrating to .Net Core 2.1. They came back to us super excited, because their internal server latency had actually improved 34%. They asked, "What did you do? What dark magic did you put into 2.1?" How do we clock this kind of improvement? I can tell you, nothing is magic. Everything is hard work.

Three Prongs of Tuning

We tune the runtime determinism, and we build performance features that application developers can use to improve their applications. And there are data-driven, targeted optimizations that we do that were so micro, it was like sweeping the floor. But the results aggregate together. That's how they see that 34% improvement.

Tuning the Runtime Determinism

There are many factors that can contribute to determinism, but you all know a couple of them. One is your GC. The other one is your JIT: when your method body has not been JIT-ed yet, you pay for that in the response time of the first request. What Bing did is, actually, they migrated to our server GC. That is actually awesome. When we were migrating to Linux, we actually did not have the fundamental support from the platform to implement our server GC.

On Windows, Windows has an API for us, GetWriteWatch. That's kind of a dirty bit, so you can monitor which pages are being dirtied in between GC phases. That enables the GC to not do useless work. If you know a page has not been touched, then maybe you don't need to worry about updating the state.

So we implemented a Software Write Watch in order to enable us to have a concurrent server GC on Linux. For the JIT latency, Bing actually eagerly deployed CrossGen even before we shipped it; they were essentially our guinea pig. That just shows how desperate they were.
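
The dirty-page idea behind a software write watch can be illustrated with a toy heap that records which pages were written between two collector phases. This is a conceptual sketch only: `WriteWatchedHeap`, its methods, and the 4096-byte page size are assumptions made for the example, and it says nothing about how the runtime actually implements its write watch.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class WriteWatchedHeap:
    """Toy heap that remembers which pages were written since the last reset."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._dirty = set()   # page numbers touched since the last reset

    def write(self, offset, payload):
        if offset < 0 or offset + len(payload) > len(self._data):
            raise ValueError("write outside the heap")
        if not payload:
            return
        self._data[offset:offset + len(payload)] = payload
        first = offset // PAGE_SIZE
        last = (offset + len(payload) - 1) // PAGE_SIZE
        self._dirty.update(range(first, last + 1))   # every write marks its pages

    def get_and_reset_dirty_pages(self):
        # The collector asks: which pages changed since I last looked?
        pages = sorted(self._dirty)
        self._dirty.clear()
        return pages


heap = WriteWatchedHeap(4 * PAGE_SIZE)
heap.write(10, b"abc")                 # stays inside page 0
heap.write(PAGE_SIZE - 1, b"xy")       # straddles pages 0 and 1
first_phase = heap.get_and_reset_dirty_pages()
heap.write(3 * PAGE_SIZE, b"z")        # only page 3 is touched afterwards
second_phase = heap.get_and_reset_dirty_pages()
```

Pages that never show up in the dirty list are the ones the collector can skip revisiting, which is the "not do useless work" point above.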

Performance Features that Enable Building a Leaner Framework

The second category of work is actually performance features that enable building a leaner framework. We usually have customers come in and tell us, "GC pause is not acceptable," and I hate to tell you this, but he's one of the customers. The Roslyn compiler is written in C#.

Why are we [inaudible 00:40:03] the same organization? Why? Before you report it to me, he should come to me. Your GC pause [inaudible 00:40:08] we look at his heap, and we tell him, "Jared, you are allocating too much. Remove your allocations. GC pause will be better," and Jared walks away and says, "C# really sucks. Or .Net really sucks."

The truth is, when you are trying to tell your customers or developers, "Allocate less," there are only so many things they can do to allocate less. You must provide features for them to be able to allocate less. So we observed [inaudible 00:40:34] data, and then we found that there are features we actually can build, especially for those people that really, really worry about the pause and the allocation pattern. Span<T> and Memory<T> are our low-allocation APIs. If you are just passing a slice of data around, they allow you to pass that slice of data around without copying. That actually reduces the allocation.

So we built these features, and we went through our framework to find where we could use them. We used them in our own framework, and Bing used them in their application. So you can see this kind of tiered effort, everybody had to be in play. The runtime had to be in play, with runtime-enabled functionality. The framework had to be in play, had to be a good citizen. The application had to be in play. Understand your application pattern, understand your allocation pattern, and figure out what's the right solution for you. Everybody had to be in that performance game. (Can you see? I'm not reading through my notes, because I only have three minutes now.)
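
The difference between copying a slice and passing a view over the same memory can be shown with Python's built-in `memoryview`, which is used here only as an analogy for the idea; it is not the .NET API, and the request line is made-up sample data.

```python
payload = bytearray(b"GET /index.html HTTP/1.1")

copied = payload[4:15]               # slicing a bytearray allocates a new buffer
view = memoryview(payload)[4:15]     # a view over the same bytes, nothing is copied

payload[5] = ord("I")                # change the original buffer in place

print(bytes(copied))   # the copy still holds the old bytes
print(bytes(view))     # the view reflects the change, because it shares the memory
```

Because the view shares storage with the original buffer, handing it to another function costs no extra allocation for the data itself.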

Data Driven Targeted Framework Optimization

We have a lot of data. We can look into MSBuild, we can look into Roslyn. In fact, all our internal partners - Bing, Exchange [inaudible 00:41:46] - they send data to us. We can look into the heap. We can actually look into the allocation pattern. We can look at workload trends. And then we can see, “This function is being called a lot.”

So we did some small targeted optimizations. The first one is String.Equals. This is actually a popular function. We have had SIMD for a long time, but we never used SIMD in our framework. Why not? So we applied SIMD to String.Equals, and the second optimization is actually [inaudible 00:42:21]. Our JIT cannot see through [inaudible 00:42:23] Well, that's [inaudible 00:41:25] see through [inaudible 00:42:27]. Make the [inaudible 00:42:28] inlinable. Being inlinable also enables further optimization of the JIT-ed code.

Then we enabled the de-virtualization of EqualityComparer<T>.Default. That is actually a very commonly invoked function as well, and there's no magic. We just marked the function as intrinsic, to tell the JIT it's special and to do extra work on it.

The fourth one is actually my favorite one. We improved the performance of String.IndexOfAny for two and three character searches. Why did we choose two and three? Why didn't we choose one? Why didn't we choose four? Because from all the data we have coming in, we found this function in various workloads. Many of the searches in MSBuild, forward slash, backward slash, are trying to find two characters. Maybe Roslyn was also trying to find some sort of delimiters.

Or in a webpage, where you are trying to parse subitems, you are looking for the brackets. So then we looked into it and said, "Now, what can we do?" In 2.1, we actually manually unrolled the loop, with a special case for searching for two characters and a special case for searching for three characters. All together, those three categories of work are how Bing was able to see the 34% gains [inaudible 00:43:44 - 00:43:45]

Interestingly enough, while I was working through the slides and looking at the code, I found out that the String.IndexOfAny code had changed. We found out, actually, SIMD is even better. So now if you go to the CoreCLR repo and try to find this method, you are going to find it's actually using SIMD. So you see there is tons of hard work put in there.
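
Special-casing the common two- and three-character searches can be sketched as follows. This Python function is a hypothetical illustration of the idea of a dedicated path per target count; it is not the framework's code, and `index_of_any` is a name invented for the example.

```python
def index_of_any(text, targets):
    """Return the index of the first character of text found in targets, or -1."""
    if len(targets) == 2:
        # Dedicated path: two direct comparisons per character, no inner loop.
        a, b = targets
        for i, ch in enumerate(text):
            if ch == a or ch == b:
                return i
        return -1
    if len(targets) == 3:
        # Dedicated path: three direct comparisons per character.
        a, b, c = targets
        for i, ch in enumerate(text):
            if ch == a or ch == b or ch == c:
                return i
        return -1
    # General path for any other number of targets.
    wanted = set(targets)
    for i, ch in enumerate(text):
        if ch in wanted:
            return i
    return -1


path_hit = index_of_any("src/app\\main.cs", "/\\")   # forward or backward slash
markup_hit = index_of_any("a<b>c", "<>&")            # three delimiters
```

The dedicated paths pay off only because the data showed that two and three targets dominate real workloads, which is the point of choosing them from measurements rather than guessing.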

Conclusion

These two slides I'm going to skip; they connect and dive into the work there, which I already covered. I want to go to the conclusion. It's actually the takeaways. As you can tell, there's no silver bullet for performance. You always have to be data driven and measure. And you have to design for performance and tune for performance. Performance is hard. Performance is ongoing, and performance is always a priority. But you must understand your requirements ahead of time. You have to monitor and revalidate, and be prepared if the situation changes. Many of you are not building a runtime. Many of you are building larger scale applications, and maybe your workload changes. Maybe you become popular, maybe you become less popular. But always be monitoring.

Questions and Answers

Woman 1: First of all, I would like to say, thank you very much for this presentation. It's just amazing to find so many similarities with the Java Virtual Machine and the string, SIMD, so that's a classic.

Parsons: It's like we're both managed languages trying to execute on a runtime.

Tsai: What a surprise.

Man: I mean, it's great to see you finally trying to catch up with the JVM.

Tsai: What are you talking about? We didn't choose a different approach.

Man: Actually, I'm interested in the port to Linux, because the CLR was highly coupled to Windows before, which is why it performed so well. How did you manage that situation? Did you actually decouple from Windows to port, or did you just do a full port that coupled in to Linux?

Tsai: The initial port is actually not too hard. Actually, the main person who did the porting is Sergei. My team member came to travel with me. They are worried for me. He did the port, and he did it in about maybe six months. Then after that, you are finding all the performance issues. All the horrible performance issues, they come to me. For example, GetWriteWatch is one of them. We couldn't even implement concurrent GC. Then you run into a situation where the Ngen images were, a lot of the time, loading into a different address space. He was a poor soul, and he was coming to me and saying, "Oh my gosh, look at the performance."

Parsons: Core CLR is not the first venture. I mean, remember we did have Silverlight some time back, which did work cross platform. That was probably the first time we had to decouple from Windows and work cross platform. So we weren't exactly starting from scratch on Core CLR. We did have some prior work to lean on and understand what went well and what went wrong.

Tsai: So I would like to say, yes and no. We know how to get there, but getting there and performing well is hard. And that was the last push, for about a year or two. We are doing the performance measurement, and we are trying to figure out what is a reasonable goal. Because after all, you're going into new territory, remember? You want to understand your goal. So is Windows our goal, or is Linux our goal? On Linux, what is an acceptable goal and what is a great goal? They are all gradually being defined, and we are still working on performance.

Woman: And go check out Core CLR on Linux. It's out there, it's open sourced. So go check it out. Check out the performance, and provide feedback.

Tsai: So if you guys are interested, come and check us out on the CoreCLR repo. Give CrossGen a try. Give JIT a try. Early feedback is welcome.


Recorded at:

Feb 16, 2019