
Novel Algos and Optimizations in JCTools Concurrent Queues


Summary

Nitsan Wakart walks through several examples of optimizations, tradeoffs and implementation details from the JCTools library. In addition, he explores the driving forces behind some of JCTools' novel algorithms and their applicability.

Bio

Nitsan Wakart is an experienced performance engineer with decades of programming experience ranging from finance to commercial JVM implementations. He is a blogger, public speaker, open source contributor, instructor, JUG organizer and Java Champion, and the lead developer on the JCTools project, the concurrency library of choice for Netty, DSE and many others.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

[Note: please be advised this transcript contains strong language]

Transcript

Wakart: My name is Nitsan Wakart, or as pronounced by various spell checkers, Nutsack Walmart on occasion. I get this, "You look familiar" thing a lot, so if I look familiar, it might be from the "Fast and the Frustrated" or the "Fast and the Irritable: the Sequel", or "The Pacifier", one of those. More seriously, I'm the main contributor to JCTools, which is probably why I'm up here talking about it. I've been working as a performance engineer for a while and I'm everywhere. I'm also the Cape Town JUG organizer. If you're ever on holiday in Cape Town and you've run out of things to do, which sounds remote, but maybe, you can come and talk at our wonderful Java user group. So let's move on.

JCTools is a kind of complement to the collections you get with the JDK. It's more specialized collections, particularly focused on queues. Cliff Click also contributed his non-blocking hash map to JCTools, which is why I need to fix a bug in it this week. But there are bits and pieces in there and there's quite a good community around it now. So it's a healthy open-source project that's been going pretty well for the last few years. We're going to talk about optimizations as someone who's new to the codebase might view them. So this is more of a once-you've-opened-your-IDE view, where you look at the classes and you have this what-the-fuck moment. I hope to explain some of that terror away, or at least justify some of the other bits.

We're going to cover bits in the code and we're going to talk about one algorithm that is new and exciting and novel, in the sense that I came up with it and it might not work. But it does seem to work for quite a lot of people and offers an interesting compromise between two design approaches. So from our a la carte menu, we're going to cover layout, we're going to talk about Unsafe, the benefits of specializing for different concurrency use cases, and bits and bobs that are perhaps non-trivial for people looking at the codebase. Have you heard about Unsafe? Great. Love that. So we won't have to explain why that's there.

Layout

So one of the first things you'll see in pretty much all of the classes is a bunch of useless fields that nobody ever accesses, and that in fact would not be accessible from any code following them because they all have the same names. So why do we have all these fields that nobody ever uses and what are they good for? You're reflecting on your experience, you're still happy, but what is this shit? One way to find out is measuring. Measuring is always good. So you delete all these fields and then you measure before and after, which is a good way to find out why something is there or not there.

If we measure before and after, we'll find that there's quite a big difference with and without those padding fields. They don't serve a functional purpose as such, but they do seem to give us a bunch of performance. Why is that? This is where memory layout comes into the picture. Usually, when we program in Java, we don't really think about object layout or memory layout. It's all in the magic that is the managed memory system, so we tend to not worry about it too much. But on occasion, memory layout can be quite a big issue for us, as demonstrated by the benefit we see here.

So Java object layout is not part of the language spec, as such. It is part of the language spec that fields are aligned. But otherwise, they can be in any order that the JVM feels like. So most of this is not set in stone. The one thing that is a fact across all JVM implementations to this day is that a subclass's fields will be laid out after its parent class's fields. The reason for that is that it's very practical to do it like that, because then the parent class's fields are always at the same offset in every subclass. You never have two subclasses where you need to worry about, "Oh, for this class the offset is A and then for that class the offset is B."
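JCTools leans on exactly that guarantee to pin its padding in place. A rough sketch of the pattern, with illustrative names rather than the actual JCTools class hierarchy, looks something like this:

```java
// Sketch only: padding via inheritance. Subclass fields are always laid out after
// parent fields, so the otherwise-useless long fields in each "Pad" class keep the
// hot indices on separate cache lines. Reusing the same field names across the pad
// classes is fine, because no code ever reads them.
abstract class PadProducerIndex {
    long p00, p01, p02, p03, p04, p05, p06, p07;
    long p10, p11, p12, p13, p14, p15, p16, p17;    // roughly two cache lines of padding
}

abstract class ProducerIndexField extends PadProducerIndex {
    protected volatile long producerIndex;          // mutated by the producer thread
}

abstract class PadConsumerIndex extends ProducerIndexField {
    long p00, p01, p02, p03, p04, p05, p06, p07;
    long p10, p11, p12, p13, p14, p15, p16, p17;
}

abstract class ConsumerIndexField extends PadConsumerIndex {
    protected volatile long consumerIndex;          // mutated by the consumer thread
}
```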

But the field order within a class can change and this is not a hypothetical thing. It does change. There's a blog post that Marvin wrote a while back about how it changed under his feet between JDK 7 and JDK 8, so you can't rely on the order in which you declared the fields in your class. You have to fix them in place, and the best method that we know of at this time to fix them in place is by ordering the inheritance. And we can use a tool called JOL, Java Object Layout, to find the actual layout of objects. So when we use that, we get this. Monica, can you read that?

Monica: Yes.

Wakart: Great. What does this say?

Monica: I can read the object patterns.

Wakart: Let's break it up. So JOL very usefully points out what kind of JVM it's running on and whether compressed oops are on or off. That will impact the size of references, it'll impact the size of the object header, and then it gives us something that is less beautiful than this, but essentially boils down to: what are the offsets of the different fields? Where do they come from? What are their types, what are their sizes, etc. The only thing I've done here is I took the padding fields and I collapsed them all together and gave them the joint size rather than giving you a breakdown, which is why we have such a big screen to look at.
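For reference, dumping a layout like that only takes a couple of lines with JOL (the org.openjdk.jol:jol-core dependency); the queue class used here is just an example of something to inspect:

```java
import org.jctools.queues.MpscArrayQueue;
import org.openjdk.jol.info.ClassLayout;

public class LayoutDump {
    public static void main(String[] args) {
        // Prints the VM details, field offsets, sizes and padding for the class
        System.out.println(ClassLayout.parseClass(MpscArrayQueue.class).toPrintable());
    }
}
```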

The first thing in this list is the object header. So the object header is 12 bytes that we actually use and 4 bytes that we don't use. There's a four-byte waste in there because we need the next fields to be aligned. Nothing to do with inheritance at that stage. So altogether there are 16 bytes there, and then following that we have the padding. Here we have 120 bytes of padding. First of all, how many bytes of padding do we need and why do we need that particular number? The last thing I wanted to point out here, and we'll get to why we need the padding in a second- the instance size as a result of this padding has exploded to 536 bytes, 480 of those are just padding. So memory-wise, it's quite an expensive thing to do to pad everywhere and everything. But it does seem to win us some performance. Why does it win us some performance? The main reason we bother with padding in JCTools is false sharing. Who knows what false sharing is? Excellent, great. I'll only half explain it then.

The other reason we might want to consider field ordering and memory layout is because we want to minimize load misses. Just a quick check because I had a really disappointing conversation last week. I want to avoid it. Who knows about CPUs? Who's heard about CPUs? Have we all heard about CPUs? Great. Who heard about memory? We're all good on memory. Who knows that there's a cache system? Awesome. So we're all on the same page here, I don't have to start from scratch. So load misses are cache load misses.

The last reason we might want to worry about layout is when we go off-heap. When we have to worry about accessing memory off-heap, we don't get the benefits of the JVM doing all the hard work of aligning everything, and then it's on us. So if you're writing any off-heap data structures, you need to worry about the alignment and you need to worry about alignment for atomicity. If you don't worry about alignment and you write a long to an off-heap buffer, for instance, there's no guarantee that that value is written atomically. That is as opposed to writing a long into a field in any of your classes, because there the JVM guarantees that everything will be aligned and all the writes will be atomic. You won't see half a long value separate from the other half, whereas if you're writing anything off-heap that's your problem. You need to worry about it.

The allocations off-heap used to be page-aligned, then they became not page-aligned, and there's a whole back and forth on that. It changes by the Java version; from Java 8 onwards they're not guaranteed to be page-aligned, and there's a flag you could set but you probably shouldn't. So if you're going to allocate anything off-heap you need to worry about how to deal with alignment. Otherwise you might want to worry about alignment because you're using off-heap memory to deal with some kind of OS resource that requires that you have some cache-line alignment or page alignment or sector alignment because you're doing IO. Again, it's all on you. If you need to figure out how to work out alignment, there are lots of examples in Aeron. There are some examples in JCTools, and there are some examples on my blog. It's not super difficult, just something you might not have been aware of. On some platforms, if you do unaligned memory access you might crash your process, so it's really recommended you take care of this.
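As a rough illustration of what "it's on you" means, here is a minimal sketch of cache-line-aligning an off-heap allocation with Unsafe; the 64-byte line size and all the names are assumptions for the example, not JCTools or Aeron code:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class AlignedAlloc {
    static final int CACHE_LINE = 64;   // assumed line size for the example

    static Unsafe unsafe() throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }

    public static void main(String[] args) throws Exception {
        Unsafe u = unsafe();
        // Over-allocate by one cache line, then round up to the next 64-byte boundary
        long raw = u.allocateMemory(1024 + CACHE_LINE);
        long aligned = (raw + CACHE_LINE - 1) & ~(long) (CACHE_LINE - 1);
        u.putLong(aligned, 42L);        // aligned, so the 8-byte write is atomic
        System.out.printf("raw=%x aligned=%x value=%d%n", raw, aligned, u.getLong(aligned));
        u.freeMemory(raw);
    }
}
```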

So some people know what false sharing is, some people don't. We'll go through this pretty quickly. We have a cache line. I did all the animation myself, so if you think this is terrible it's my fault. So we have the object header. We have two fields we really care about. We have red thread and green thread. They're two different threads. They're accessing different fields. Notionally, they're not sharing anything, but because we have the cache system that handles memory in units of cache lines and these two fields are sharing a cache line, we're going to have a problem where each CPU, each thread, will try to claim the cache line whenever it wants to change it and invalidate the copy that the other thread is using.

While we might think that red and green are happening separately, in actual fact, the line is either red, it belongs to one thread, or it's green, it belongs to the other thread. Whenever either thread mutates it, the other thread's copy will get invalidated. This is all done by the magic of cache coherency, which is a big topic I won't be tackling, but you can look it up. There are a couple of articles on my blog and lots of articles elsewhere.

So we pad to get away from this problem- mostly this picture is padding- but we pad all these values away from one another so that when we change the producer index here, the consumer index isn't affected. We don't get a cache miss on the consumer thread, and vice-versa: when the consumer thread changes the consumer index, the producer thread is not affected by that.

Notionally, I said that the atomic unit of memory we're dealing with is a cache line. Because of adjacent cache line pre-fetching, there's an amplification of the false sharing effect on x86, so in JCTools we pad by two lines. You could pad by less and you would see some regression in performance here and there. Another important observation here, which caught me by surprise: we need to pad between fields that we mutate, like the p-index and c-index that are part of my data structure, but we also need to pad between our fields and the object header if there's another thread trying to read that object header.

And the reason another thread might need data from that object header is, for instance, if you're accessing this object via an interface, then the code generator needs to check the type of that object. That means it needs to look in the object header. If the object header keeps getting invalidated by what we're writing into p-index, then we're going to have trouble. We're going to have false sharing between the reading of the type and the updates on the p-index.

What are the alternatives we have to this terrible method of having lots of fields we don't want and doing a lot of inheritance that's very verbose? We can rely on field order, which doesn't work, so don't do that. We can use @Contended, which is an annotation that came in JDK 8. It requires you to enable a flag that makes it available to non-JDK code. There's a fairly good chance that it will get taken away or ignored by different runtimes, or it might become deprecated in the future because people are regretting letting us mortals have such access. So I wouldn't necessarily recommend it. Or you can go off-heap, which gives you a lot more granular control over memory layout, where you can know what addresses you're accessing and decide exactly what you'd like.
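For comparison, a sketch of what the @Contended route looks like; this assumes the JDK 8 sun.misc.Contended annotation (it moved to jdk.internal.vm.annotation in JDK 9+) and only has an effect for non-JDK code when the JVM is started with -XX:-RestrictContended:

```java
import sun.misc.Contended;

class ContendedIndices {
    // Fields in different contention groups get padded apart by the JVM itself
    @Contended("producer") volatile long producerIndex;
    @Contended("consumer") volatile long consumerIndex;
}
```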

We wouldn't have to do this if we had something like this, so this is a suggestion, but we don't have it. In fantasy land this is what I would like: I can say upfront what alignment I want, and then this is what happens in C when you have a byte array that is a fixed size. The data is inlined in the data structure, so if I allocate 120 bytes of data here, it's not a pointer to an array. It's actually 120 bytes of padding inside my data structure. So this doesn't work, but wouldn't it be nice if it did.

Unsafe

Unsafe. Unsafe became this hot issue with the release of JDK 9, JDK 10 and 11, because OpenJDK developers were playing this game of, "We're going to take it away," and everybody was panicking, because everybody is using it, and then they say, "Okay, you can have it because we want Java to work", and then, "Oh, we're going to take it away a little bit." And then, "Okay, but by default, you all have access to it". And then it came to nothing, but on the plus side, it made everybody aware of Unsafe and how bad it is to use it.

If we look at the code, we have something like this, and it does look perfectly safe, so I don't know what they're on about. But more seriously, Unsafe gives you access to the JVM with no holds barred. So if you told the JVM, "I need to write to this bit of memory and that's, whatever, 120 bytes after the end of my class, put a long there whether it's aligned or not," it will do it because it's Unsafe. You can just do whatever you like. From a JVM maintainer's point of view, someone comes to you and says the JVM crashed in a way that no Java code should be able to crash it, it must be a JVM bug. And you're like, "Oh, wow. I don't know how to reproduce it." And then three months later it turns out you were messing around and, I don't know, putting objects off-heap and expecting the GC to just work.
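To make "something like this" concrete, this is roughly the shape of the Unsafe usage in the queues: resolve a field offset once, then use ordered and volatile accessors instead of declaring the field volatile. The class and method names here are illustrative, not the actual JCTools code:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

class ProducerIndexHolder {
    private static final Unsafe UNSAFE;
    private static final long P_INDEX_OFFSET;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            P_INDEX_OFFSET = UNSAFE.objectFieldOffset(
                ProducerIndexHolder.class.getDeclaredField("producerIndex"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private long producerIndex;    // plain field; ordering is handled through Unsafe

    long lvProducerIndex() {       // "load volatile"
        return UNSAFE.getLongVolatile(this, P_INDEX_OFFSET);
    }

    void soProducerIndex(long v) { // "store ordered", the lazySet equivalent
        UNSAFE.putOrderedLong(this, P_INDEX_OFFSET, v);
    }
}
```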

So yes, I totally understand why Unsafe is a problem for the JDK maintainers. On the other hand, why do we need it? Well, it's definitely better than a poke in the eye. The argument traditionally made is that it's used for performance, and the last item, which might be a surprise to some, is that we might be using it for compatibility. So how can an unofficial API offer us compatibility?

Let's start with that. Unsafe just works. Unsafe is available on all the JVMs. It's the unofficial API that is more guaranteed than actual APIs. So it's in JDK 6, 7, 11. If you've used 9 and 10, I'm not sure what's wrong with you, but yes. It's in all the versions you would use and some that are completely hypothetical. Then the question would be, shouldn't we be using VarHandles? Because that's the new thing that came with Java 11 that's supposed to take away a lot of the use cases, at least for JCTools. So let's compare and at least make measurements that back this decision.

First of all, Unsafe versus Atomic Field Updaters, or Atomic FU as referred to lovingly. With Unsafe, we have this benchmark that measures perhaps something relevant. Who knows? It measures sending a hundred messages between threads and waiting for the consumer thread to observe that it received all these messages and then notify the producer thread back. It's a burst of messages being sent to another thread, which seems like a useful scenario for using a concurrent queue.
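Not the actual JMH benchmark, but a rough sketch of the scenario being measured, using JCTools' SPSC queue: a burst of 100 messages goes to the consumer thread, which acknowledges back once it has seen them all:

```java
import org.jctools.queues.SpscArrayQueue;

public class BurstRoundTrip {
    public static void main(String[] args) throws InterruptedException {
        SpscArrayQueue<Integer> data = new SpscArrayQueue<>(1024);
        SpscArrayQueue<Integer> ack = new SpscArrayQueue<>(4);

        Thread consumer = new Thread(() -> {
            int seen = 0;
            while (seen < 100) {
                if (data.poll() != null) seen++;   // spin until the whole burst is drained
            }
            ack.offer(seen);                       // notify the producer we're done
        });
        consumer.start();

        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            while (!data.offer(i)) { }             // spin if the queue is momentarily full
        }
        while (ack.poll() == null) { }             // wait for the consumer's acknowledgement
        System.out.println("burst round trip: " + (System.nanoTime() - start) + " ns");
        consumer.join();
    }
}
```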

If I'm using my Unsafe implementation, it takes 950 nanoseconds to do this. If I'm doing it with the Atomic Field Updaters on Java 7, it takes roughly 35, 40% more. So there's definitely a win here. Things improved a bit with JDK 8 update 101. The reason they improved is because a lot of people were saying, "We're using Unsafe because of performance," and the JDK people are saying, "Well, use this," and they say, "Well, it's not as performant." So they improved the performance for Atomic Field Updaters in JDK 8, and then it regressed a bit in JDK 11. So that's great. If this is minor enough for you to say, "Okay, I care about performance but not that much," maybe you should be using Atomic Field Updaters; it's perfectly fine to go with that.
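As a sketch of what the Atomic Field Updater approach looks like, roughly: the index stays a real volatile field of the class, and the updater gives you CAS and ordered stores over it. The names here are illustrative, not JCTools code:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

class AfuProducerIndex {
    private static final AtomicLongFieldUpdater<AfuProducerIndex> P_INDEX =
        AtomicLongFieldUpdater.newUpdater(AfuProducerIndex.class, "producerIndex");

    private volatile long producerIndex;   // the field stays inline in the object

    boolean casProducerIndex(long expect, long next) {
        return P_INDEX.compareAndSet(this, expect, next);
    }

    void soProducerIndex(long v) {
        P_INDEX.lazySet(this, v);          // ordered store, no full fence
    }
}
```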

Then we have VarHandles. With Unsafe on JDK 11, we have the same result. With the VarHandle, we have the 20% regression that we see here. Now, mind you, there's some unfairness in these comparisons. In particular, when I'm using Unsafe, it's no-holds-barred Java, so I can use elbows and knees, headbutts. But more importantly, I don't need to check array bounds when I'm writing into an array. So one of the things that is happening here is that my Unsafe code is enjoying the fact that it never needs to load the length of the array. It never has to do an array bounds check. It can just blindly write into an array because I know what I'm doing. I don't need no fricking bounds checks, as they say. So it's benefiting from that. It's also a lot more direct code, as these things go.

Now, behold what happens here. If I'm using Unsafe, I always get the same result. It works on all the versions. Performance is pretty much the same everywhere, and I don't need to worry about anything until they actually take Unsafe away. If I'm using Atomic Field Updaters, sometimes it's a lot slower, sometimes it's just a little bit slower, and sometimes it became slower just to help varHandles look better. So we're in the situation where you know there's this really bad solution that gives you all the stability from an API perspective and from performance perspective. So don't go with that because that's not safe. You should never use it.

What's the Unsafe versus the alternatives summary? Before JDK 11 you have volatile fields. You can use those. You can use the atomic kind of abstractions: AtomicBoolean, AtomicLong, AtomicReferenceArray, and you can use those to construct your concurrent queues or concurrent data structures, or you can use Atomic Field Updaters, which let you access inline fields, so your actual class fields, with atomic access. After JDK 11, maybe you should move to VarHandles. But if you do and you're a library writer, then you have to endure the hell that is multi-release JARs.

That's something that as far as I'm aware, kind of works, but there's a really entertaining thread from Charles Nutter trying to make it work for JRuby. I was very happy I never gave it a go as I was reading through his hilarious adventures. So I think just sit back and wait for it to pass. I don't know. As a library maintainer, it's a pain in the ass. Using varHandles at the moment does not win you any performance. It's slightly worse, but safe. You need to go through a lot of pain to make it work through all the versions of Java, so I'm not totally sold on the whole thing.

Finally, if you're off-heap, there's really no alternative. So if you're doing something like Aeron, if you're doing an off-heap IPC queue, if you're doing an off-heap map, you're going to have to use Unsafe to do concurrency. There's no good alternative for doing concurrency off-heap at the moment. So you're stuck with that.

Concurrency Specialization

Onwards to specialization. This is the part where I tell you I don't pretend to know more than the JDK people. I don't. Doug Lea is a great mind in our field, and the people who work on the JDK are very bright people. So there's no reason why they can't write any kind of code that I can write. They definitely could, but they have different requirements. So the JDK generally goes wide. They go very generic. They go for multi-producer, multi-consumer scenarios, whereas my approach would be, if you don't know what you're doing you shouldn't be using this. They can't have that luxury. They would say, "Well, it has to not break. Maybe it will slow down, but it can't become unusable just because you used it in the wrong way, perhaps trivially so."

They use a lock-based queue. The queues in JCTools are lockless. Who knows what the difference is between wait-free, lock-free, and lockless? Who knows what a lock is? Awesome. You know so much. So, in JCTools, we have a variety of options. This is like when you get tired of writing multi-producer, multi-consumer, single-consumer. So multi-producer multi-consumer, or single-producer multi-consumer, etc. You have all the options pretty much covered, especially for array-backed queues. You can have array-backed queues or linked queues or a combination of both, and you have a range between lockless and wait-free that you can pick from, and we'll cover those definitions in a second. Importantly, you have zero-allocation options, which are unavailable with the JDK.

Generic versus specialized access, in the concurrency sense and the memory barrier sense, if you watched Monica's talk, not that she's paying attention. So with VarHandles, we have three types of loads, and before VarHandles we only had two. We have a plain load that can be reordered freely and widely. We have opaque loads, which are new and exciting, so I'll get to them in a second. We have volatile loads that have very clear semantics, so we use volatile loads to establish happens-before relationships. They have load-load barriers and so on. Opaque is a new thing where you say, "I want you to read this from memory. I don't care about the ordering, but I really want you to read this from memory."

Think of the classic concurrency bug where you have a thread in a while loop, while isRunning, and isRunning is being read but isn't volatile. The JVM is free to hoist the load out of the loop, and now you're stuck in that loop forever, even if another thread sets the value. An opaque load is a way of saying, "I just need you to read this from memory. I don't particularly care in what way you do this, as long as it's done". So we have a range of options here, and we have a range of ways to do stores. We have plain stores. Again, they can be freely reordered. We have ordered stores. Who knows what lazySet is? Who's seen the method lazySet? Laziness is a virtue.

LazySet is a badly named method. What it boils down to, under the hood, is an ordered store, or a store barrier. So it's the release barrier that Monica was talking about. Then we have the volatile store, which is the expensive option in the volatile arsenal. That means that no loads or stores can be reordered around it. And then we have two other primitives here that we use. We use compare and swap. I put them separately because compare and swap- it's actually compareAndSet in the JDK- is a way of setting a value conditionally: we actually read it, or we notionally read it, we compare it, and then we set the value if a condition is met. And then we have getAndAdd, where we atomically add to a value and get the previous value, and variations on this theme. x86 has instruction support for CAS and getAndAdd. Monica [Beckwith] was talking about the support they have on ARM for these, which is different.
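A hedged sketch of how those flavours map onto the VarHandle access modes on JDK 9+; the class and field here are illustrative, not the JCTools code:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class AccessModes {
    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup().findVarHandle(AccessModes.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private long value;   // not volatile; the VarHandle picks the access mode per call

    long plainLoad()        { return value; }                          // freely reorderable
    long opaqueLoad()       { return (long) VALUE.getOpaque(this); }   // must hit memory, no ordering
    long volatileLoad()     { return (long) VALUE.getVolatile(this); } // full happens-before semantics

    void plainStore(long v)    { value = v; }
    void orderedStore(long v)  { VALUE.setRelease(this, v); }          // the lazySet / release store
    void volatileStore(long v) { VALUE.setVolatile(this, v); }         // the expensive one

    boolean cas(long expect, long next) { return VALUE.compareAndSet(this, expect, next); }
    long getAndAdd(long delta)          { return (long) VALUE.getAndAdd(this, delta); }
}
```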

The important thing about these two is that they give me a strong barrier, so the kind of barrier they provide is similar to a volatile store, with the extra functionality that they offer. If we have a single-writer queue and we know it's a single-writer queue, we can use plain and volatile loads. We need to use volatile loads to see what our counterpart is doing, what the producer is doing, or the consumer. We need to establish happens-before relationships, which generally means we need a volatile load somewhere in there. And we can use ordered stores. We don't need to use volatile stores. We don't need to use compare and set. We can get by on weaker concurrency primitives, which means we can do things cheaper.
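To show what that buys in code, here is a stripped-down single-producer single-consumer sketch built on those rules: each side reads its own index with a plain load, elements are published with ordered stores (lazySet), and the consumer's volatile array read is what establishes the happens-before edge. It is illustrative only; the real JCTools SpscArrayQueue adds padding, index caching and lookahead on top of this:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

class TinySpscQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private long producerIndex;   // only ever touched by the producer thread
    private long consumerIndex;   // only ever touched by the consumer thread

    TinySpscQueue(int capacityPowerOfTwo) {
        buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(E e) {                       // producer thread only
        int offset = (int) (producerIndex & mask);
        if (buffer.get(offset) != null) {
            return false;                      // slot not yet consumed: queue is full
        }
        buffer.lazySet(offset, e);             // ordered store publishes the element
        producerIndex++;                       // plain store, nobody else reads it
        return true;
    }

    E poll() {                                 // consumer thread only
        int offset = (int) (consumerIndex & mask);
        E e = buffer.get(offset);              // volatile load: sees the producer's publish
        if (e == null) {
            return null;                       // nothing visible here yet
        }
        buffer.lazySet(offset, null);          // ordered store releases the slot back
        consumerIndex++;
        return e;
    }
}
```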

If we have multi-writers and we don't want to use locks, again, if only one side of this equation is multi, let's say we have a multi-producer single-consumer which is quite typical. Then on the consumer side, I can get by with the plain loads as well for the consumer data, but we are definitely going to need some volatile loads in the mix. And on the store side, we're going to have to use something that is either get and add, or compare and swap depending on the algorithm.

What does that specialization buy us? So here we're comparing different implementations with different requirements, still running through the same scenario. So there are two threads communicating. If the scenario for all of them is a single producer and a single consumer, but I know about it up front so I can use the single-producer single-consumer queue, I get the wonderful performance of 950 nanoseconds per op.

If I need to accommodate multiple producers- in the algorithm that I'm using here, I'm using compare and set- the cost goes up to six and a half microseconds. Multi-producer multi-consumer is actually faster in this scenario, which is funny in a way, if you find that kind of thing humorous. It happens because the consumer is slower. Because the consumer is slower, it doesn't hold down the producer quite as much. So because the consumer is roughly the same speed as the producer in the multi-consumer multi-producer case, the balance works out as faster performance for this use case. Then we have the alternatives from the JDK, which are ConcurrentLinkedQueue and ArrayBlockingQueue. LinkedBlockingQueue gives something ridiculous.

So we can see that they're very far behind in terms of performance, 2X or much worse. In case anybody's thinking, "Oh, that's not so bad," this is a forgiving use case. If you look at the throughput numbers for these queues, they're like 50 times worse. But the throughput workload, I feel, is often a lot less representative. This is closer to what you might encounter. Obviously specializing is good. What we can do beyond that is play with the guarantees that we offer. This is where I have to explain the difference between lockless, lock-free and wait-free. So everybody knows what locks are. Lockless is without them. Great, right? Already we're gaining knowledge here. We're halfway there. So you know locks. You know lockless. There are just two more.

Lock-free is where there's a guarantee that if I block any one thread, other threads can still make progress. So if I have a lock and then I suspend the thread that's holding the lock, everybody's screwed, because the lock is held and the thread is suspended. It's never coming back. The lock is never coming back. Everybody's going to wait for that lock. Everybody's excluded. If I'm using something like an atomic instruction, like compare and set, I don't have this problem. If I suspend one thread and it never comes back, that's okay. It's an atomic instruction. One of the threads is guaranteed to succeed, and the others will have to retry. There's typically a retry loop around the access that's governed by compare and set.
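The lock-free pattern being described usually looks like a retry loop around an atomic instruction, roughly like this sketch (illustrative names, not the actual JCTools code); if the winning thread gets suspended afterwards, other producers can still claim the following slots:

```java
import java.util.concurrent.atomic.AtomicLong;

class SlotClaimer {
    private final AtomicLong producerIndex = new AtomicLong();

    long claimSlot() {
        for (;;) {                              // the retry loop around the CAS
            long current = producerIndex.get();
            if (producerIndex.compareAndSet(current, current + 1)) {
                return current;                 // we own this slot
            }
            // lost the race: another producer claimed it first, so retry
        }
    }
}
```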

We can see here that the single-producer single-consumer is wonderful. It's wait-free. What's wait-free? Wait-free means that any of the operations will complete in a finite number of instructions. In the single-producer single-consumer case, the producer or the consumer is going to fail or succeed in a finite number of instructions. There's no retry loop in there that would make it into a lock-free algorithm. Now, you will notice that we have two alternatives in JCTools for the multi-producer single-consumer and the MPMC. The reason for that is that it's actually the same algorithm. It's actually the same data structure, but the JDK API for queues requires that if you can't find anything in the queue, you also have to check if the queue is empty. There's no way for you to say, "I tried, it didn't work, try again later." If you failed to find something in the queue, you have to check if the queue is empty. It's a strong guarantee.

It's part of the queue API, so I didn't want to break that. I added another method you can use to do a relaxed offer or a relaxed poll. And that, as we can see, evens out things. If I'm using the lock-free alternative to the queue API, I can get down to 5 microseconds versus 6.5, or 5 microseconds versus 5 and a half for the MPMC. So removing those guarantees gives us some extra performance that we can gain. And there's something to be said there about what kind of guarantees you want to put into your APIs, and how much that is going to dictate the performance you're going to have.
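A small usage sketch of that relaxed API: the relaxed methods live on JCTools' MessagePassingQueue interface, which the array queues implement, and they drop the java.util.Queue requirement of distinguishing "empty" from "not visible yet":

```java
import org.jctools.queues.MessagePassingQueue;
import org.jctools.queues.MpscArrayQueue;

class RelaxedDrain {
    public static void main(String[] args) {
        MessagePassingQueue<Runnable> q = new MpscArrayQueue<>(1024);
        q.offer(() -> System.out.println("hello"));

        Runnable task;
        // relaxedPoll may return null without proving the queue is actually empty
        while ((task = q.relaxedPoll()) != null) {
            task.run();
        }
    }
}
```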

Finally, I mentioned that there are no allocation-free alternatives in the JDK. And when we have linked queues, it's obvious that there is going to be allocation. But how much does ArrayBlockingQueue allocate when we put stuff in it and take stuff out? Why does it allocate at all? We allocated the array that it's using, so it's not allocating cells to wrap and put into the queue. Why does it allocate anything? Any guesses? Not you, fuck off. Somebody else. The Irish man. Locks. Locks are queues, or at least in Java they are. The implementation of a lock in the JDK has a waiting list. The waiting list is a linked list. So if you're waiting on a lock, you allocate a cell so that you can put your thread into it so that you can wait.

Any queue you use from the JDK, you're going to end up allocating. If we look at the allocation rate per operation, which is sending a hundred messages, we'll see that LinkedBlockingQueue is allocating cells. Each cell is 24 bytes to wrap the element we put in the queue, and it's allocating a little bit more because it needs to allocate for the lock. We have ConcurrentLinkedQueue, which doesn't hit the lock because it's lockless, and then we have ArrayBlockingQueue, which is allocating half as much as LinkedBlockingQueue, but it's still allocating quite a bit. If we use an array-backed queue from JCTools, we don't allocate anything, because we don't hit a lock and we don't play stupid games. And we pre-allocated everything.

Helping the Compiler

Let's do this really quickly. Helping the compiler. Who cares, right? Why should we help it anyway? We have this bit of functionality and we want to do a modulo operation on the size, but we're clever and we know that size is some power of two. If we know that size is some power of two, we can use the bit-wise trick of masking with size minus one. I'm not going to go through the fingers demonstration, where I get to give you the fingers part of it, because it's fun but we don't have time. So we have two options. One is full of bit wisdom and one isn't.

It turns out that doing the bit-wise operation is really worthwhile and gives you a huge boost. It's a trick used all over the JDK and in data structures that cycle around arrays. But are all bit-wise tricks worthwhile? So let's have a look at another bit-wise trick. I need to compute the address inside the array of an element. I have some base quantity I'm adding, then I'm calculating the index by doing the modulo on the consumer index in this case. Then I can either multiply by the scale of each element, so the size of each element, or I can shift, because I know the reference is either four bytes or eight bytes. Instead of multiplying by a power of two, I can shift.

Who thinks option A, which is the top one, is faster than option B? Who supports A? Who supports B? Who's confused and lonely? So there's no difference. There's no difference because the compiler is clever enough to understand what we understood: that if you multiply by eight, I can just shift it for you. We don't need to worry about it. But if that's the case, if it's so clever, why did we have to explain the modulo trick to it? Why did we have to do this? Why can't it just figure out that size is a power of two as well, and then do the modulo trick? It can, but only if it's a constant. If it's a final field then that's not good enough. If it's a final field, as we'll see shortly, the JVM does not respect finals half as much as they should be respected. If it's just a final field and not a static final field, the JVM can't make that strong assumption where it would just figure out a lot of bit-wise tricks for you. So you're going to have to hand-hold the JVM in certain cases, but you don't have to in all cases, which makes for a confusing reality.
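A small sketch of both tricks and of the constant distinction being made; the names are illustrative:

```java
class OffsetMath {
    static final int CAPACITY = 1024;        // static final: a true constant to the JIT
    static final int MASK = CAPACITY - 1;
    static final int ELEMENT_SHIFT = 3;      // 8-byte references, so shifting by 3 equals multiplying by 8

    // The trick we have to spell out ourselves: for powers of two, index % CAPACITY == index & MASK
    static long slotIndex(long index) {
        return index & MASK;
    }

    // The trick the compiler does for us anyway: multiplying by a constant power of two becomes a shift
    static long slotOffset(long index, long arrayBaseOffset) {
        return arrayBaseOffset + (slotIndex(index) << ELEMENT_SHIFT);
    }
}
```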

Finally, we have another form of abstraction. We can have option A, which is doing the whole computation straight inline in the code, or we can hoist it out into a method, call the method, and do the computation there. So who thinks option A is faster than option B? Option A being doing it directly and not calling anything. Who thinks option B is faster? Who thinks they're the same? Good for you. So once the compiler gets to this, they're the same. If the compiler for whatever reason fails to inline calcElementOffset, then yes, it would be slow. But the typical case is that that method is trivial in size and it will get inlined almost always. And the JVM is really good at inlining stuff. So having small methods is generally a good idea. You don't have to manually inline stuff everywhere for performance reasons, so don't do that.

This one is something that caught me out and I was sorely disappointed about this. Buffer is a final field, and if you look at the code in JCTools, which I hope you all go home and offer patches for stuff and everything, you will see a lot of this upfront loading and reusing of values. One of the reasons we do that everywhere is because when you do a volatile read, as Monica explained beautifully only 50 minutes ago, you have to reload everything afterwards because it's a load barrier. Loads that happen after a volatile read, happen after a volatile read. They have to be redone. So if I have a value for a buffer at the top and then I do a volatile read here, then I'm going to have to reload the field here. I'm going to have two loads instead of one.

If there's no volatile read, then the compiler is perfectly sane and would take care of this. But because there's a volatile read I have to hoist it up. You have to do it even if the field is final and you told the JVM that it will never change because the JVM doesn't trust you. It doesn't trust you because of people like Spring, and I will explain that later if anybody is interested.
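A small sketch of that hoisting pattern; the field names are illustrative, not the JCTools code:

```java
class HoistedLoad {
    private final Object[] buffer = new Object[1024];
    private volatile long producerIndex;

    Object loadElement() {
        final Object[] localBuffer = buffer;    // load the final field once, before the volatile read
        long pIndex = producerIndex;            // volatile read: field loads after it must be redone
        // localBuffer is a local variable, so it is unaffected by the load barrier above
        return localBuffer[(int) (pIndex & (localBuffer.length - 1))];
    }
}
```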

A Novel Algorithm

We have five minutes for the novel algorithm, which I'm sure you'll all just understand because it's so intuitive. But if you don't, we can dive into the code in the break. Generally speaking, in concurrent queues we have linked queues and we have array queues. Array queues you pre-allocate, and if you're not using a lock then there's no reason why you should allocate anything else. You keep reusing the same buffer. Linked queues are found in a lot of the literature, because they make for nice data structures for concurrent algorithms. But you do have to allocate a cell per element, traditionally. You don't do any pre-allocation, and you can do bounded and unbounded variations. You don't have to do unbounded, but people typically do. Unbounded queues are still bounded because you have a finite amount of memory. That's worth remembering. But they're notionally unbounded.

I came across this use case with actors. In particular, I was talking to some of the guys who are writing Akka, and they have tons and tons and tons of actors. Each actor has a mailbox. Actors need multi-producer single-consumer queues, because they have lots of other actors running and giving them messages, but any actor will only run on one thread. So that's their use case. Some actors are very busy. Most actors do nothing. So if you have a million actors and the queues are mostly empty, you don't want to pre-allocate huge queues for them, even if they have a bounded size that is very large, because then you run out of memory. But then again, some of them are very busy so you don't want to keep allocating cells all the time either.

How do you find a middle ground? You can do linked array queues. Linked array queues are kind of a compromise in between. You allocate chunks that contain multiple cells, and you link the chunks together instead of linking individual cells. The nice thing about the algorithm I ended up with is that as long as you stay within a single chunk, so if you say a chunk is 128 elements, you don't need to keep allocating because you never fall out to the next chunk. You can stay in that chunk forever. If you fall out of your chunk, you have to allocate, and then you can do resizing on the fly, which is one of the variations.

The challenges here are- well, doing any lock-free/lockless algorithm is a challenge, or doing it correctly anyway. If you don't care about correctness, not a problem. Actually doing it with locks is super easy because if you hit any edge condition, you lock. Nobody can change the state and so on. But doing it in a lock-free way is hard. Then you need to worry about linking the new array. How do you communicate the need to jump between arrays and so on between the consumer and the producer? How do you handle the sizing of the chunks? There are different options for that.

So what I ended up with, and by no means do I say this is the best option, but it was an option- one minute to understand it, take it all in very quickly now. I stole the parity bit. Stealing the parity bit means all the indices are effectively multiples of two. If the producer index is odd, that's a way to signal that I'm allocating a new chunk and linking it in. So this algorithm is not lock-free. In particular, it's lockless, but if the allocating thread is suspended, everybody will be screwed, just a note. But otherwise it's great. I had to fix up all the multi-producer single-consumer code to work with only even indices, and then use a CAS to block out everyone with the parity bit and release them again. That worked pretty well. Thank you. I thought it was good.
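A very rough sketch of the parity idea just described, not the actual JCTools implementation: indices normally advance in steps of two, and a producer that needs to link a new chunk CAS-es the index to an odd value, which holds the other producers off until the new chunk is published and the index is even again:

```java
import java.util.concurrent.atomic.AtomicLong;

class ParityResizeSketch {
    private final AtomicLong producerIndex = new AtomicLong();   // even = normal, odd = resize in progress

    long claim(boolean currentChunkFull) {
        for (;;) {
            long pIndex = producerIndex.get();
            if ((pIndex & 1) == 1) {
                continue;                                        // someone is linking a new chunk: spin
            }
            if (!currentChunkFull) {
                if (producerIndex.compareAndSet(pIndex, pIndex + 2)) {
                    return pIndex >> 1;                          // claimed a slot; real index is pIndex / 2
                }
            } else if (producerIndex.compareAndSet(pIndex, pIndex + 1)) {
                // We won the right to allocate: link the next chunk here, store the element,
                // then restore an even index to let the other producers through again.
                // If this thread is suspended before the line below, everybody spins,
                // which is why the algorithm is lockless but not lock-free.
                producerIndex.set(pIndex + 2);
                return pIndex >> 1;
            }
        }
    }
}
```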

Then there's a special element that is used for indicating that the consumer needs to jump ahead. The algorithm is online and makes for riveting reading, and I'm sure you'll all go back and read it. So the performance, in brief, is great. It's slightly slower than the array-backed MPSC, but only marginally so. It was adopted as the backing queue for [inaudible 00:49:56] and the Akka guys, who it was kind of a present for, didn't pick it up, but a lot of other people using Akka did pick it up and tried it out and are very happy with it. It comes in several variations that might fit different occasions, so take that into consideration if you're going to adopt it.

In closing, thank you very much, and I hope this explains some of the oddities in the code. I hope it encourages some people to actually have a look at the code and maybe contribute, because that would be nice. Thank you very much.

 


 

Recorded at:

Apr 17, 2019
