

Performance vs. New Features: It Doesn’t Have to Be a Zero-Sum Game


Summary

Dmitry Vyazelenko explores implementing CRC checksums for a durable log while trying to retain respectable performance. Vyazelenko highlights how a new feature can amplify the call to revisit performance of an overall design.

Bio

Dmitry Vyazelenko is the Founder of Safepoint. He is a software developer, conference speaker and an organizer of JCrete and JAlba conferences, passionate about concurrency and performance.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Vyazelenko: Who here trusts their file system? A quick show of hands if you believe in your file system? We have 5 people. Do you know all the config options that go with it? Do you configure journaling and other means to make sure this thing is reliable and crash tolerant? Let me first scare you. We're going to look at a paper from 2014, "All File Systems Are Not Created Equal," by Pillai et al., who did a study of six common Linux file systems: ext2, ext3, ext4, Btrfs, ReiserFS, and XFS. They looked at the basic properties. What do file systems provide? It's your basic means to store something on a disk. You have the file system abstraction. What guarantees do they have? You basically have the POSIX standard, and file systems implement its APIs. What does that really mean? They went on to study the properties that the file systems have, and whether those properties hold under different error conditions. They decided to stress test those file systems. They built an application called Block Order Breaker, which they then used to inject errors into the file system. What they found was this. They studied the properties of the file systems along two different axes. One is atomicity of content operations, like writing a single block of data or writing multiple blocks of data. The other is atomicity of operations: can you do an atomic rename of a file, or atomically append to the end of a file? What they found, across all the tested configurations, was that every file system exhibits at least one error, on at least one of the axes. They had about 16 different configurations. Some of them failed horribly on almost all of them, or on all of them. That's the first finding.

Your file systems, the common ones, even the ones that are defaults in operating systems like Linux, are not reliable, or at least not reliable in the configurations they used. The data journaling setups here have just one error, under multi-block append; the rest is good. You can get some level of guarantees and reliability from them. Prepare to be surprised. That was just one part of the study. That's the basic proof; this is the building block. People don't use their file systems directly. You interact with applications. You build applications on top of them. How does your application handle errors? For that they built another program called ALICE, which injects errors into the application. They took a couple of tools like Git and Mercurial, a couple of databases, and subjected them to fault injection at the file system level. What happens if certain writes are reordered, or fail, so you get an error from a syscall? How does the application behave? What is the end result? The end result was quite bad, because they found multiple errors across all of the applications. The failure modes ranged from silently ignoring errors, or silently not giving you the data, to data corruption and data loss. Imagine you have Git or something and then you lose your data. Your commits and stuff disappear. That's horror story number one.

If that still doesn't scare you, and you still believe we can write stuff to disk reliably, let's look at another paper, because 2014 was 6 years ago. I'm sure all the problems have been fixed. People are thinking about Btrfs maturing, ZFS, and whatnot. This research, "Redundancy Does Not Imply Fault Tolerance," puts a different spin on it. We have distributed systems, and distributed data storage systems in particular. They have a cluster of nodes for redundancy. You have multiple copies of stuff. You have clusters with leaders and followers. There's been lots of research going on. There is jepsen.io. If you experience network partitioning, what happens to the distributed system? People are actually addressing those problems.

What did these guys do? It's like: what happens to our distributed system, which is highly available and multi-node, if we have just one single file system error? Just one such fault. They studied eight distributed systems and found lots of problems; basically, all of them failed. Some of them failed catastrophically, where a single corrupted node can infect the cluster, take down your entire system, and make it unavailable. Those were, of course, particular versions that were tested. I'm sure that most, if not all, of the problems they found in their research have already been addressed. What they found is that, although a distributed system has redundancy at its core, because you have multiple nodes running, the advantage of using a distributed system to handle file system errors is not there. Those systems don't have protocols to say, "This thing is corrupted. I'll just blow up. This node is going down." Or, "I have this error and I'm the leader. I'm going to infect all the followers." They came up with a list of five things that they think systems are doing badly, and where research should be directed. If you have a fault-tolerant distributed system, maybe you should also include protocols for recovery of corrupt data. Especially if it's metadata, or stuff that sometimes lives in one single copy, and then that thing goes bad.

"I'm not writing distributed systems. My thing is a website. I'm a simple backend developer, it doesn't affect me. I'm using ZFS." For that, I'll leave you with this quote: "Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me." Don't use it. That gives you the problem statement. Writing to disk is unreliable. You have something written, you have distributed systems, and the whole thing might crash and burn. Basically, we're talking about two different problems here. One problem: how do we detect such problems? How do you know that something is corrupt? How do you know the data is corrupt? The second problem: how do you recover from it? I'm going to touch on the first one, the detection bit. For recovery from corruption, you basically look into distributed systems, into a cluster of things, and build protocols on top of the cluster: Where do I get my data? Which node has good data? How do I know that the data is good again?

CRC

How do you detect those things? It's of course CRC, Cyclic Redundancy Check. We are back to the basics. It's an error-detecting code technique. You run a checksum over a chunk of data. You attach this checksum to the data. Then you send the data over, for example. Then you read the data and do the checksum again using the same algorithm. If the checksum matches, you know there was no tampering with it, the data didn't [inaudible 00:09:05]. The same works with persistent storage. You write it to disk. At some point you run a recovery checksum validator. Or a simple program: the database starts up, reads from disk. Is it still valid? Is the stuff still good?
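The round trip just described can be sketched with the JDK's built-in `java.util.zip.CRC32`. This is a minimal illustration of the technique, not Aeron's actual code:

```java
import java.util.zip.CRC32;

// Minimal sketch of the checksum round trip: compute a CRC-32 over a
// chunk of data on the write side, store it alongside the data, then
// recompute on the read side and compare.
public class ChecksumRoundTrip {
    public static int checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return (int) crc.getValue(); // CRC-32 is a 32-bit value
    }

    public static void main(String[] args) {
        byte[] payload = "hello, durable log".getBytes();
        int stored = checksum(payload);          // written next to the data

        // ... later, on recovery or replay ...
        System.out.println("intact = " + (checksum(payload) == stored));

        payload[3] ^= 0x01;                      // simulate a flipped bit on disk
        System.out.println("after corruption = " + (checksum(payload) == stored));
    }
}
```

A mismatch only tells you the chunk is bad; deciding what to do about it is the recovery problem discussed above.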

I want to do CRC. Just doing CRC and describing it with hand waving is all good, but it's not that useful. Let's say you are building a distributed system, or a part of it, using a basic building block called a log: an append-only sequence of records. You're always continuously writing to the same thing. When you write the log, you want to checksum the log. Then you're replaying the log, because you need to replicate it to another machine, or you want to compute something over it because it's an event-driven architecture. You want to verify the CRC. I'm going to talk about a thing that I've done for Aeron. I've been adding CRC support to Aeron Archive, where it's basically adding CRC to the log, and then to the replay part when you read from it.
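One way to picture a checksummed log is a simplified record layout of `[length][crc][payload]`, with replay walking the records and re-checking each one. This is a conceptual sketch only; Aeron's real frame layout is different:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Simplified append-only log with per-record checksums. Each record is
// [length:int][crc:int][payload]. An illustration of the idea, not
// Aeron's actual frame layout.
public class ChecksummedLog {
    static int crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return (int) c.getValue();
    }

    public static void append(ByteBuffer log, byte[] payload) {
        log.putInt(payload.length);
        log.putInt(crc(payload));
        log.put(payload);
    }

    // Replay all records, verifying each checksum; returns the record count.
    public static int replay(ByteBuffer log) {
        int records = 0;
        while (log.remaining() >= 8) {
            int length = log.getInt();
            int stored = log.getInt();
            byte[] payload = new byte[length];
            log.get(payload);
            if (crc(payload) != stored) {
                throw new IllegalStateException("corrupt record " + records);
            }
            records++;
        }
        return records;
    }

    public static void main(String[] args) {
        ByteBuffer log = ByteBuffer.allocate(1024);
        append(log, "event-1".getBytes());
        append(log, "event-2".getBytes());
        log.flip();
        System.out.println("replayed " + replay(log) + " records");
    }
}
```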

CRC in Java 8

Let's say you want to do CRC in Java. In particular, Aeron targets Java 8 as its minimum. Who runs Java 8 in production? 85% of people. The rest are either non-Java people or run something else, maybe Java 6 or 7. I don't think anybody runs 11. Maybe there are a couple. Let's say I want to do CRC in Java; obviously, you have to go to the java.util.zip package. It's obvious to everybody that that's where you find those things. I don't know how the people who write core-libs in Java actually come up with these things, but things end up where they are. You're looking at this Java-ish interface, because it gives away that this thing has state. You create one, you update it, then you get the value. Then you have a reset method. You can reuse it. Maybe. Maybe not. What is puzzling is that it only works for arrays and the primitive int, or a single byte, 4 bytes, whatever that int is. It is up for interpretation. Is it really 4 bytes or just 1? I think it's 1. Anybody who works with Java NIO knows that the core abstraction of Java NIO is actually ByteBuffers. There are no ByteBuffers in this interface in Java 8, and ByteBuffer was added back in 1.4. Am I missing something?
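The statefulness being described looks like this in practice: `update()` accumulates into a running value, `getValue()` reads it, and `reset()` makes the instance reusable. A small sketch of those semantics:

```java
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// The Java 8 Checksum interface is stateful: update() accumulates into a
// running value, getValue() reads it, and reset() makes the same
// instance reusable for the next message.
public class ChecksumReuse {
    // Reset, then feed every chunk into the running checksum.
    public static long crcOf(Checksum crc, byte[]... chunks) {
        crc.reset();
        for (byte[] chunk : chunks) {
            crc.update(chunk, 0, chunk.length);
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        Checksum crc = new CRC32();
        long incremental = crcOf(crc, "hello ".getBytes(), "world".getBytes());
        long oneShot = crcOf(crc, "hello world".getBytes()); // same instance, reused
        System.out.println(incremental == oneShot); // updates accumulate
    }
}
```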

Digging Deeper

Let's dig deeper. If you go to CRC-32 and Adler-32, there are two implementations of this interface in the whole JDK. You find that CRC-32 has a method taking a ByteBuffer, if you squint a bit and read the signature closely. Unlike the byte array version, which has an offset and a length, so you can point at a chunk of it, with the ByteBuffer you start at whatever the current position is, and by the end of the update call, your ByteBuffer's position has moved. It mutates your data structure. This thing is supposed to read some bytes out of it. Why not take an offset and a length like you do for the byte array? This is Java 8. The method was added, and both implementations got the same method with the same signature. This is Java 8: we have default methods. Why not put it on the interface as a default method? You're doing the same work twice, declaring the same thing. Why not declare it once?

Of course, I'm ranting, this is Java 8. If you talk to Oracle people, they say six-month releases, so use Java 14. Of course we use Java 14. This is the interface in Java 14. We have our update(ByteBuffer) method there since Java 9. We have a convenience method for the byte array the other way around: it goes from 0 to the length. There is still no ByteBuffer variant with an offset and a length, which means if you want to use it, you end up doing something like this. First, you need to get your CRC-32 from somewhere, call update, and then getValue. CRC-32 stands for a 32-bit value, so it's 4 bytes, and you have to basically downcast your value back to int, because this is what you're going to use and attach. You're not going to attach 8 bytes of stuff which only has 4 bytes of the thing in it. Then you have to rewind the position. In a normal Java day at the office, this is the story.
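The "normal Java day at the office" dance being described, sketched out: `update(ByteBuffer)` consumes from the current position to the limit and advances the position, so the caller has to downcast the 32-bit value and restore the position afterwards.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// update(ByteBuffer) consumes from position to limit and moves the
// position, so the caller must downcast the value and rewind.
public class ByteBufferDance {
    public static int checksum(ByteBuffer buffer) {
        CRC32 crc = new CRC32();
        int position = buffer.position();
        crc.update(buffer);              // moves the position to the limit!
        buffer.position(position);       // rewind so the data can be reused
        return (int) crc.getValue();     // CRC-32 is 4 bytes; getValue() is a long
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.wrap("some frame bytes".getBytes());
        int crc = checksum(buffer);
        // position was restored, so a second pass sees the same bytes:
        System.out.println(checksum(buffer) == crc);
    }
}
```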

How About?

What if the world were a bit different? What if we had something like a static method for CRC-32? Or even the same but with an address. What if your ByteBuffer is not on heap? It's off heap, it has an address. You could do this. Someone might object that different checksum algorithms are very sensitive to the initial value. CRC-32, for example, starts with 0. CRC-32C starts with a special constant, which is basically -1, that is, all bits set to 1. I get it. You can document it. It's not an ideal consumable thing, but as a building block, this is amazing. Maybe I'm imagining stuff. I'm living in la-la land. But if you look inside CRC-32, this is exactly what you see. That's exactly the API that is there, but it's not available to you, because you're not supposed to use it. It's just for the JDK guys. The implementation is private, but it's also native, because it's going to be intrinsified. If you look at the updateByteBuffer signature, the first argument is not called crc. It's called adler, because it's a copy-paste from another class, which has exactly the same method. It gets even better, because in JDK 9, they fixed it: it is now renamed to alder, D and L swapped. Probably to obfuscate the fact that it was copied. This is CRC-32.
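The wished-for building block might look something like the following. This is a hypothetical API, not anything in the JDK: an offset/length ByteBuffer variant that leaves the caller's position alone, faked here on top of the public classes (the real private intrinsified method additionally takes a running crc value, which the public CRC32 class cannot be seeded with):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical static helper: checksum an arbitrary offset/length slice
// of a ByteBuffer without mutating the caller's position. Sketched on
// top of the public API; the JDK-internal intrinsic has a similar shape
// but is private.
public class Crc32Helper {
    public static int crc32(ByteBuffer buffer, int offset, int length) {
        ByteBuffer slice = buffer.duplicate();  // independent position/limit
        slice.position(offset).limit(offset + length);
        CRC32 checksum = new CRC32();
        checksum.update(slice);                 // consumes the duplicate only
        return (int) checksum.getValue();
    }

    public static void main(String[] args) {
        ByteBuffer frames = ByteBuffer.wrap("headerPAYLOADtrailer".getBytes());
        int crc = crc32(frames, 6, 7);          // checksum just "PAYLOAD"
        System.out.printf("%08x, position still %d%n", crc, frames.position());
    }
}
```

The `duplicate()` keeps the caller's buffer untouched, which is exactly the property the real `update(ByteBuffer)` lacks.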

In JDK 9, they added another algorithm, CRC-32C, which is a way better checksum algorithm. It doesn't have the failing conditions that CRC-32 does. It detects more problems. It's a better thing to use. If you look inside, we see the same picture. The version with a single byte disappears, because it's handled on the Java side. We have two HotSpotIntrinsicCandidates, so two methods that are intrinsified. Intrinsified means that the code as written there is not what runs. If you open JDK 9 and beyond, and you open this class, you'll see huge Java methods that implement the Slicing-by-8 algorithm that Intel engineers wrote a couple of years back. That is the Java implementation, but it is not what is going to run on x86 platforms. On x86, this whole thing will be replaced with specialized code that HotSpot spits out, which uses the CRC32 instructions that Intel first came up with, and AMD implemented as well, to make this more efficient. These are our building blocks. As I alluded to when showing the native method: if you don't want to allocate on your hot path, you end up going via method handles and calling the private API, because there is no other way in Java. It's shameful to admit. On the other hand, there is no other way.

Baseline

Let's look at the implementation part. We have this thing, but before we touch any system, we have to baseline it. I have done it on the example of Aeron. Aeron is a messaging system: UDP, multicast, unicast, and IPC transports. It has an archive module, so you can have a subscriber that gets the data and moves it elsewhere for archiving and journaling reasons. What have I done? I ran the archive module before I touched the code. This is everything on my laptop. On real systems, this thing goes up to more than 100 million a second. This is a throughput test. There are two parts to it. One is writing to disk, so recording events. Small events are really nasty. Writing small, 32-byte messages is a real stress test, because you have 32 bytes of payload, and Aeron adds another 32 bytes of header on top. You're writing 64 bytes. You think you're writing 32; you actually doubled it. That's the nastiest one. Writing megabytes to disk is very easy; writing bytes to disk is where the real pain points are. Even this setup can write about 40 million messages a second, and I can replay off the disk about 35 million messages a second. Does anybody find this picture disturbing? Is something wrong with the picture?

Participant: Actually, your reading should be a little faster.

Vyazelenko: Exactly. When you're replaying, you are reading off the disk, not writing to it. Why is this slower? That's an interesting point. This is JDK 8. Of course, let's take the other LTS release, which is 11, run it, and see why you want to upgrade to 11. This is the absolute reason why you want to upgrade: yes, you will get one more million out of writing to disk, but our replay just went down the drain. Asking the question, why? One part of it is that we use private APIs. Aeron uses a lot of Unsafe to deal with native buffers, patching selectors, the network stack, and doing stuff, because there is no other way. What was done in JDK 9 is that Unsafe as we knew it was split into two Unsafes. There is sun.misc.Unsafe, and there is the JDK-internal Unsafe. There are now two Unsafes, and only the internal one is getting new features. Now there's this duality. If you profile with async-profiler, on 8 there are no Unsafe calls: they all disappear. On 11, they become virtual, and you see sun.misc.Unsafe all of a sudden popping up and delegating to the JDK-internal Unsafe. Somehow the call stacks get deeper, and inlining is not happening. I have no idea; I didn't go really deep on that. This is the fact. Then you explain: why don't you upgrade? Why would I upgrade? Is my incentive to make my application slower? That's the Java releases story. That's all nice.

Initial CRC Support

We take our CRC-32, plug it in, and we see the following picture. CRC-32 slows things down substantially. This is expected. This is the premium you pay: you want to be sure that you can detect data corruption, so you have to pay for it. Maybe. Maybe not that much. What about writing to disk? We're already writing bigger chunks. We have to go through the chunk; probably we cannot do much. Replay looks interesting, because replay takes quite a hit. What's going on there? First, let's look at 11, because 11 is better. JDK 11 has a really nice CRC-32C. If you look closer, you will see that this thing is actually faster than CRC-32. Which do you think is faster, CRC-32 or CRC-32C?

Participant: It should be the same.

Vyazelenko: It depends. Just looking at this, it felt really weird. For this use case, you probably want to go with CRC-32C. Actually, if I were running on JDK 9 plus, just for the quality of the checksum algorithm, I'm going to choose CRC-32C, even if it might not be the best in all cases.

CRC-32 vs. CRC-32C

I've done a test comparing CRC-32 with CRC-32C. That's why I said it depends: depending on the size of the input, one or the other wins. On smaller sizes, below about 92 bytes, CRC-32C is the winner, by about a factor of two at the beginning. Then all of a sudden they swap, to the point where the other one is three times faster at 512 bytes. It's quite interesting. You start digging deeper and asking: am I getting the right thing? Is my understanding correct of where this is going? These are different algorithms, and they are not substitutable. You cannot just have CRC-32 on JDK 8, migrate to 9, and then read the same checksum using the other algorithm. It's not going to work. Once you stick with one, you have to live with it, or have an update process that replaces all your checksums. Those are all possibilities.
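The incompatibility is easy to demonstrate: the two algorithms use different polynomials, so the same input produces different checksums (a small sketch; `CRC32C` requires JDK 9 or later):

```java
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// CRC-32 and CRC-32C use different polynomials, so the checksums they
// produce for the same input are not interchangeable.
public class Crc32VsCrc32c {
    public static int of(Checksum checksum, byte[] data) {
        checksum.update(data, 0, data.length);
        return (int) checksum.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes();
        // Standard check values for this input:
        System.out.printf("CRC-32  = %08x%n", of(new CRC32(), data));  // cbf43926
        System.out.printf("CRC-32C = %08x%n", of(new CRC32C(), data)); // e3069283
    }
}
```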

We have to go for the private API because of the intrinsics. Just to show you how bad the alternative is, you see this graph? This is the same graph, and on top of it I added a hand-rolled Slicing-by-8 implementation: essentially what the JDK has, as a static method, but without any JDK magic, plain Java as implemented in the paper. If you don't realize how bad that is, this is log scale. There is an order of magnitude difference between the two. Look at 2K: at 2K it's more than 10x, about a 13x performance difference. If you try to be a good citizen, and not touch any private API, just go plain Java, no dependencies, no breakages between releases, you can't. You really want what the platform provides, because that's the whole idea: run on the platform. We tried. It didn't work.
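For reference, a plain-Java, table-driven CRC-32 looks roughly like this. This is a single-table variant for brevity (Slicing-by-8 extends the same idea to eight tables, consuming 8 bytes per iteration); it is correct, but as the graph shows, nothing like this can keep up with the intrinsified JDK version that HotSpot replaces with hardware CRC instructions:

```java
import java.util.zip.CRC32;

// Plain-Java table-driven CRC-32 (single lookup table). Matches
// java.util.zip.CRC32 bit for bit, but runs without any intrinsics.
public class TableDrivenCrc32 {
    private static final int[] TABLE = new int[256];
    static {
        for (int n = 0; n < 256; n++) {
            int c = n;
            for (int k = 0; k < 8; k++) {
                // Reflected CRC-32 polynomial 0xEDB88320
                c = (c & 1) != 0 ? 0xEDB88320 ^ (c >>> 1) : c >>> 1;
            }
            TABLE[n] = c;
        }
    }

    public static int crc32(byte[] data) {
        int crc = 0xFFFFFFFF;                     // initial value: all bits set
        for (byte b : data) {
            crc = TABLE[(crc ^ b) & 0xFF] ^ (crc >>> 8);
        }
        return ~crc;                              // final XOR
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes();
        CRC32 jdk = new CRC32();
        jdk.update(data, 0, data.length);
        System.out.println(crc32(data) == (int) jdk.getValue());
    }
}
```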

How Replay Works

Let's get back to the replay. Replay should be faster: read off disk, and do whatever. What is replay doing that is so costly? This is pseudo-Java; it's not exactly the code. Replay was doing something along the lines of: read a chunk of data from disk, about 2 MB. Then process it in so-called frames, or fragments. In this particular case, for every message, for every 64 bytes, you basically have one loop iteration. Then you do occasional verification of the checksum, because that thing is optional. If you're running on enterprise-grade SSDs, or whatnot, and you didn't enable checksums, you're not going to take that hit. Then we do the publication. The publication is done with zero-copy semantics. TryClaim is an operation that claims space in the publication buffer, saying, "I'm going to need that much. Can I claim it?" This is an atomic operation. This is very important, because it has memory fences. TryClaim has memory fences because it's concurrent: there can be multiple things publishing through the same publisher. Then at the end we copy the bytes. We already have the chunk, so we directly copy it into the destination and call commit. That's a second memory fence. We have two concurrent operations going on here. Those prevent the loop from being unrolled. Those prevent some nice inlining and other optimizations. This is where the cost is. What do you end up doing? You need to somehow amortize the cost. If you think about it, there's a cost on the replayer side, which prepares this publication. There is the publisher side, the other side of it, which reads out of this shared buffer and can progress. Because we are doing one frame at a time, the other guy is spinning there, waiting for that one frame, then another frame. They also end up contending on the cache line for that frame: a cache line is 64 bytes, and so is the frame. We're contending for that cache line. That's really bad.

Let's Fix Replay #1

The first idea to fix that was: let's do batching, because batching is good. Can we do it a bit differently? Can we publish a bunch of bytes into the publication using the existing API? The idea was, instead of publishing one frame at a time, you grab the first frame (not shown here; it's called handleOnStart) and don't commit it. You reserve the space, but don't commit it. Committing means writing a special field there, allowing the publisher to proceed. On that first frame, the publisher keeps spinning. Then we write as many frames as fit in the MTU size, which in this case was 4K, or something like that. Until we reach the 4K, we commit every following frame, but nobody is spinning on those. For the replay thread, it's exactly the same: exactly the same number of instructions, exactly the same number of memory fences and concurrent operations, but we do better on the publisher side. Did it really make any difference? I'm going to show just the JDK 11 result, because on 8 it was the same. On JDK 11, it made a difference. It unlocked some optimizations. The code had a different shape, so the JIT managed to compile it differently, and we got about 1 million more out of this thing, which is roughly 5%. That's not where we want to be. We're not beating the recording.
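The effect being chased here can be pictured with a much-simplified model, which is a hypothetical sketch of the idea only, not Aeron's tryClaim/commit API: frames become visible to a consumer when a volatile "committed" position advances past them, and advancing it once per batch instead of once per frame amortizes the ordered-store (memory fence) cost.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of per-frame vs batched commits: the consumer may
// read up to the volatile "committed" position. Committing per batch
// pays one ordered store per batch instead of one per frame.
public class BatchedCommit {
    public final byte[] buffer = new byte[1 << 16];
    public final AtomicLong committed = new AtomicLong(); // consumer reads up to here
    public long commitCount;                              // ordered stores paid for

    public void publishPerFrame(byte[][] frames) {
        long pos = committed.get();
        for (byte[] frame : frames) {
            System.arraycopy(frame, 0, buffer, (int) pos, frame.length);
            pos += frame.length;
            committed.lazySet(pos);                // one ordered store per frame
            commitCount++;
        }
    }

    public void publishBatched(byte[][] frames) {
        long pos = committed.get();
        for (byte[] frame : frames) {
            System.arraycopy(frame, 0, buffer, (int) pos, frame.length);
            pos += frame.length;
        }
        committed.lazySet(pos);                    // one ordered store per batch
        commitCount++;
    }
}
```

In the batched variant the consumer sees the same bytes in the end, but it can only start on the batch once the single commit lands, which is exactly the trade-off the talk describes.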

Let's Fix Replay #2

Let's try to fix it again. Here comes the bit that I want you to take away. Sometimes, just looking at the problem makes you reconsider the whole design. What are we doing here? We are reading a chunk of bytes, then doing some mangling with it, and using the API that we already have to send this chunk of bytes. What you want to do is read this chunk and just send it over. Of course, it's not that easy, because the publisher side of things is doing something to the messages. It's preparing them to be sent over the network or over IPC. There's a lot going on there: recording, session IDs, and stuff that Aeron needs to function. Our processing becomes a bit different. We still have a loop to go through the whole thing, because we still need to do CRC; the whole point of this exercise was to do CRC in the first place. Then we do a step called prepareFrame, where we do some of the bookkeeping that was living in the publisher. This is deep down inside of Aeron. This is not a super public API thing. Once you know this is your application, you can do tricks. You can ask: what is the problem we're trying to solve?

We basically read this chunk of memory; everything is in cache, everything is nice there. We go through the hot loop, through the same thing, without touching anything else. Then the chunk is ready to send. At the end, we have this new API called offerBlock. Off you go. Of course, there's error handling and lots of other things involved, but this is the whole idea. Instead of going with these little things, sometimes you have to flip things the other way around, drill a hole through your system, and put in an API that is not very safe but solves the problem. You're tailoring a solution for the problem at hand. You stop and think: what is the problem at hand? How can we solve it? Go with it.

If you look at what happened next (of course, we need to measure things), finally, replay is faster than record. It makes sense: reading from disk is faster than writing to disk, plus the processing and sending. What is also interesting is that the CRC path also got quite a bit faster. We went from 34 million messages replayed without CRC, and 21 million with CRC, to 49 million and 30 million. The CRC bit got 25% faster; the other one got way faster, more than 50% faster. That's JDK 8.

Let's look at JDK 11. Quite impressive. If you go back, we're doing the same 49 million as before, but now for replay with CRC-32C. That's where the difference between CRC-32 and CRC-32C comes into play, where I told you that on small messages it's about a factor of two. You can see just how many more messages: we got another 30%. You can't directly compare it to JDK 8; basically, you get this picture. If I remember correctly, from the results before, we had 26 million replay when we started on JDK 11. Which was the hilarious "nothing works on JDK 11, don't upgrade" type of thing. We are back at the same level as JDK 9. Again, because we amortized all the costs. All those tryClaim and commit operations that we had: say I have a million frames, I had 2 million of those operations; now there are far fewer. All 2 million of those were going via Unsafe. These Unsafe calls disappear. In the hot loop, in the hot path, we don't have those Unsafe calls; we only pay when we need to release the batch. That, basically, was the journey of figuring out how to do it, and how to do it in a performant way.

Conclusion

What I suggest you do is always stay curious. Keep digging, trying to figure out how the stuff underneath you works. Sometimes you're going to break the rules. Look inside your platform. Figure out how you can access functionality that's not exposed. Maybe there are alternatives. Maybe you can solve it another way. If not, try to write your own. If that fails, you still have the escape route of digging through things. What I wanted to show as well is that this is possible even in a mature product, in a project of any size, basically. If you keep adding to it or evolving it, it doesn't have to become a pile of things that just falls apart. Take one quality aspect, performance. Things get slower because nobody cares. Eventually, the thing is decommissioned and we write a new system, which repeats the pattern. If you really try to solve problems, you can actually do both. You can add your feature and basically have your cake and eat it too. Or even improve the performance. In this case, replay was supposed to be on par with recording, yet there was a 20% difference. The priorities on the project were in different places, so nobody actually looked into why that was. My model of the world and my understanding of what this thing is didn't match the empirical results. Then you eventually find out. The need to change that code led to rewriting the algorithm and bringing it back to where it should be. We end where we started: we have a nice new feature, and it performs well too.

Participant: The crucial point was that you try to avoid these commit actions in between.

Vyazelenko: Both. Not just commit but also claiming. Basically, avoiding the expensive operation, so amortizing the costs.

Participant: [inaudible 00:35:47].

Vyazelenko: That was also batching. In the end, what mattered was removing those concurrent calls, removing the volatile writes, or whatever this thing is underneath. If you think about it, the first approach was also doing batching, because it was releasing the whole batch to the publisher, so the publisher saw all the bytes in one chunk. At the beginning, when we started, neither side had the batch effect. The first solution got the batching for the sender: the sender could just sweep through all the bytes directly. The second solution brought the batching to both sides, and that amortized the costs. You need to know where the costs are to figure this stuff out. In this particular case, if you run a profiler, the cost will be all over the place. You will not find the bottleneck. It helps that Martin knows the system well; you chat with him, and he helps you out: what are the leverage points, where are the problems? It doesn't show up as a hotspot, because the cost is spread across multiple calls done in a loop. You see, it's probably this loop. What's costly in this loop? What's the problem there?

 


 

Recorded at:

Sep 08, 2020
