Bio Gil Tene is CTO and co-founder of Azul Systems. His pet focus areas include system responsiveness and latency behavior. Gil is a frequent speaker at technology conferences worldwide, and an official JavaOne Rock Star. Gil pioneered the Continuously Concurrent Compacting Collector (C4) that powers Azul's continuously reactive Java platforms.
Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
1. Hi. I’m here with Gil Tene at London QCon 2016. Gil, you have a talk tomorrow that’s talking about hardware transactional memory. I wonder if you can just give a summary of what hardware transactional memory is for our listeners.
Well, hardware transactional memory is hardware-assisted transactional memory. But as a feature set, the exciting thing is that it’s coming to a server near you in the very near future. This year, we expect the mass commodity servers that will ship to include this feature and it is for the first time ever that that’s been the case. What hardware transactional memory features do, is they allow you to run long blocks of code speculating that they are executing atomically and transactionally against other things in the same system. And if they successfully complete that, you can run code that will avoid serializing on locks and other things, for example, if you can successfully do it.
The hardware transactional memory needs to be coupled with software that uses it well. But the interesting promise is for libraries, things like POSIX locks and JVM synchronized blocks and locks to be able to use the capability of the hardware so that existing software will transparently be able to use it without changes.
The architectural set of instructions that is in Intel’s architecture that is aimed that this is called TSX, and it’s been in the works for probably a decade. It was first announced and documented targeting the Haswell processors, and those shipped a couple of years ago. But in the first implementation, apparently, there was some problem because that feature was actually turned off late with a firmware fix to Haswells to disable it because it wasn’t good and stable for the CPUs. The Broadwell chipset – actually, the higher-end Haswells that recently shipped as well – but primarily, the Broadwell chipset, and later ones that are coming and already shipping from Intel include the feature set.
The Broadwells and Skylakes and the rest have been around in desktops and single-socket servers for a year and some now, but two-socket servers are -- those are the mass of commodity servers, say, what you will get to run on in Amazon or in other places; those are just now coming. I would say this is the first time that we expect the actual mass availability of the feature set to be there for regular programmers.
Yeah, I have a lot of experience using it and actually designing it, using it, figuring out what it’s good for and what it’s not good for. At Azul, we used to build our own hardware. Today, what we do is build JVMs for regular Linux x86 servers and other platforms too. But our beginnings were in custom hardware with custom chips and interesting massive systems with hundreds of CPUs and hundreds of gigabytes back in the mid 2000s.
Our designs included hardware transactional memory from the start and we shipped three generations of our Vega systems that included that capability – including a JVM that was able to use it, to speculatively execute synchronized blocks concurrently. So we know very well what it does, what it can do, what is required to make it run well, and we’ve actually had working JVMs that did it in the past. But when we transitioned from custom hardware that we built our stuff onto moving to pure software on x86, we had to leave hardware transactional memory behind because it just wasn’t available on anything other than on our hardware.
At the time, we thought we may never have it, but then when Intel announced a few years ago that they’re doing it, we were really excited. And now that it’s actually going to be available in servers, we can start turning on features that we were only able to do on custom hardware seven and eight years ago.
Well, it’s sort of a mix. Our hardware was massively parallel, a tremendous number of fairly weak cores. Our Vega 3 systems had over 800 cores in them. But each was in the order of a gigahertz and much fewer concurrently issued instructions per cycle. So if you look at today’s, say four-socket x86 servers, they have more power than we had in our servers back then. Of course there’s also a few years’ difference and Moore's law that’s working on their side.
So I would say from an absolute performance point of view, x86 was actually, or is now a step-up given the years that have passed. But from a concurrency point of view, we were able to keep workloads concurrent, in some cases (and in many cases) across a much larger numbers of cores where to us, a 200 core system was common case. In the x86 world that’s just coming like we’re about to cross the hundred in the common case for cores. At those levels, keeping that much workload concurrent so that you’re not just keeping a few cores busy and the rest are waiting on locks, there comes a real challenge and speculative concurrent execution becomes interesting and important.
So I think that we didn’t feel the loss simply because in the x86 world, there just weren’t that many cores. But we’re starting to see the same bottlenecks over and over again, I think. There is some very interesting potential for the use of hardware transactional memory for scaling code that isn’t designed lock-free but wants the capability or to benefit from similar to lock-free performance in many cases.
5. When you hit a synchronized Java block and it can speculatively assume that it has the lock, how do you deal with cases where another thread might have been executing in that same synchronized block? How do you prevent the wrong effects going out to main memory and then how do you recover from that and the one that’s chosen to fail?
So when you apply hardware transactional memory towards lock speculation – there are other uses for it too – but when you apply it towards that, what you usually are aiming to do is to leverage the fact that data contention tends to be a lot lower in frequency than lock contention. The reason we even have locks in software is to protect data from races. And the reason we have to have the locks in place is because we can’t detect whether the data collides. So we have to prevent it from colliding.
That means that by definition, data contention on the same piece of data cannot be higher in frequency than lock contention, or you have a bug, if this is not a loud contention. In practice, the act of putting a lock together is very pessimistic. You do it in case there’s a collision because you can’t do anything about it, so you prevent the collision. And depending on your data structure and use case, there might be three orders of magnitude of frequency between them.
So imagine a very large common dictionary or hash map that lots and lots of threads are using and putting things in like a cache. If you actually grab a lock on this whole thing, that’s extremely pessimistic. There are many uses of this data structure that don’t have data collisions that are problematic. Two readers, who cares, right? And if you have a reader and a writer and the writer is writing here in this bucket and then the reader over here, well, that’s also concurrent. A queue can run that concurrently. And even two writers not in the same bucket, it could have run concurrently.
You’re protecting things in case two of these collide in a way that there’s no way for the software to detect. What we can do with hardware transactional memory is say, instead of grabbing the lock to prevent the collisions, we can use hardware transactional memory to act like we grab the lock, track all the operations we do under the lock, and if we get to the end of the lock without any conflict in data happening, then commit because this was allowed, and you can run a hundred different threads all doing read operations and they’ll not collide with each other in a conflicting way or some wrote here and some read there. And again, no collisions.
As long as the actual execution doesn’t have actual data contention, you will get concurrent effects. If you have actual data contention, then you really did want that lock, and then you go and grab the lock and back off to a non-transactional execution. So it’s not that you can use hardware transactional memory to concurrently update a counter. If all you’re doing is updating one word by many, many threads, that has to be atomic in some way. You could do that with atomic operations, you could do it with locks. If you’re updating seven counters across the large thing, you don’t have an atomic operation to do it with, so you grab a lock to do it.
And there’s nothing that’s going to help you there if everybody is updating the same seven counters all the time. They do contend. But if you are looking at a much more sparse kind of data piece like large data sets, hash maps, linked lists, other things, then you can actually either avoid the need for very fine grained locking or for various non-locking concurrent algorithms to some degree by using coarse-grained locking and hardware transactional memory to blow through the coarse-grained part in the most case.
6. How does not committing the visible effects back work? Do the effects exist in the cache but presumably, where other cores can’t see that changed data until it’s committed? Or is it that it just doesn’t go all the way back out to main memory?
The answer to that is actually very implementation dependent, but let’s talk specifically about the two that I understand well, or I believe I do, Vega and x86. I helped design the Vega 1 and used it, and the x86 I believe I understand the mechanisms fairly well. The simple way to think of what is done is that the hardware tracks the transactional state in the CPU and its L1 cache. Between those, the state of what has been read within a transaction and what was updated during the transaction is held and its not visible outside of the L1 cache of that one CPU.
That usually includes the L1 cache itself and say the store buffer of the CPU. But the L1 cache is probably the simple place to think about this. The really cool thing about the hardware transactional memory types that I’m talking about, which are the ones where you implicitly execute instructions in a transactional mode because you’re in a transactional mode. So you say, “Enter transactional mode.” From now on, all stores and loads just mean “Do it transactionally rather than not.” And then when you finish, they go back to normal.
In that world, the normal cache coherency schemes – the protocols that are already in place – do everything we needed to do this well. So outside of that CPU’s L1 cache or the first cache line that the store reference spills into, which is usually L1, you don’t need to change anything in the system.
Fundamentally, the cache tracks three or four important states. People often call it a MESI cache protocol. There are lots of variants, but you have the state where you don’t have it. It’s not in your cache. That’s an easy one that’s called invalid. Then you have the state where you have a copy of it, but somebody else might also have it. That’s called shared. Then you have the state where you have a piece of data and you know that you’re the only one in the world that has it, that’s exclusive. And then there’s a sub-state of that where I have my own copy and I have changed it since I last got it; that’s modified. Those four states Modified, Exclusive, Shared and Invalid are the basic MESI states for caches. And given just those four states, we can do safe transactional execution.
The basic way you do this is once you enter a transaction, you track every word of cache that was actually read from during the transaction, speculative read, and you track any word that was modified speculatively during that transaction with special state. And the simple rule of thumb is if you lose track of any of those, then you fail the transaction. But the interesting similar corollary is if you don’t lose track of any of those and you’ve reached the commit point, you can safely commit and nobody will know the difference.
So if you look at how you might lose track of things, there are a handful of very simple ways that could happen. When I say lose track of, I mean that this cache line that was used speculatively needs to leave my cache or change state in the cache in a way they don’t like. But it needs to somehow propagate outside of my cache. So for example, if I had been reading something and it’s in my shared state, I don’t have it exclusive but I read it. So it’s shared speculative, then as long as nobody else modified it everything is okay.
The nice thing is for anyone to modify this memory that I have in my cache in a shared state, they first need to get it in exclusive state in their cache, which means they have to yank it out of my cache before they can do that. The act of yanking of that out of my cache will invalidate my transaction. So if that didn’t happen, I know nobody has modified anything I read. In reverse, if I’ve modified anything, then I have it exclusive in my cache, which means that if anybody wants to read it, they first have to move it out of my exclusive mode, which will fail the transactions.
Those are the two fundamental things of how others might collide, but there’s another very important set. The cache is only so big. It only has so many sets. I’m may try and look at a gigabyte of memory in that transaction. At some point, I may run out of cache space. I may need to evict the cache line to make room for another, and that will be losing track of a cache line, and I cannot do that safely. In fact, that is the most common, probably, reason for actual transactional failure. It’s capacity-based rather than actual data collision that will cause those.
Alex: And presumably, this means that all of the transactions have to start and finish on one core. So you’ll never going to see a situation in which you have a thread from the operating system running on one core and then being scheduled somewhere else if it’s trying to do things transactionally.
Yes, that’s very true. How that’s implemented? Again varies, Vega and x86 do this in different ways; but specifically in x86, there are many operations that will fail a transaction when they happen. There are many that won’t, which is the important thing. You could pretty much run very far with normal user code without failing a transaction. You can call methods and come back, functions come back from them. You could do math and floating point and vector operators. None of these will fail your stuff.
But if you ever take an interrupt, if you ever change from the user mode ring to the kernel mode ring, you make a system call. Any of those will absolutely fail a transaction if it’s in flight. And the fundamental reason for that is once you do an operation like IO or non-temporal memory access or a system call, you can’t take that back and you can’t let that happen within the transaction if you can’t abort it. So the transaction will abort if it reaches a point where if it reaches an operation that cannot be reverted. So as long as you’re running normal user mode code and just manipulating memory, you’re safe because that can be taken back. A context switch, an interrupt, a reason to switch from one to another, will be a reason to abort if it happens in the middle of a transaction.
Alex's full question: So there presumably needs to be something that will come round. It's not a case you’re just spinning on and retrying all the time. Is there a different path of logic that would have to happen in the case of the transaction failing because you simply can’t fit all the memory inside the cache?
Yes, absolutely. So in the field of transactional memory in general (pretty wide academic field) there are many different ways to do this and suggested ways to do this. But the actual practical implementations are speculative transactions, not guaranteed to commit and move forward transactions. And because of that, you can always have sort of a live fail situation where this transaction may never complete.
A simple example of that on an x86 would be this: the L1 data cache of current x86 chips are 32 kilobytes 8-way associative. That means that you could get a set of nine different words in memory that cannot fit in the cache without evicting one of them because they happen to align in addresses. You need nine sets and you only have eight. If you ever try to run the code on that data, it will always fail. It cannot complete the transaction. And there are many other examples of like, if you did a system call, it’s never going to succeed.
So that means that whenever you run speculative execution under a transaction, you need to be able to fall back to a non-speculative mode or you are not guaranteed forward progress. So in the current implementations that I know of in this hardware, you always need the non-speculative mode, and the speculative mode is an attempt to run it faster, better, more concurrently. And how that’s done, whether it’s two separate pieces of code or the same code that also runs in different modes, that varies.
So fundamentally, when you enter the transaction, you hope to succeed. If you abort the transaction, software can tell that it aborted. Nothing happened except the jump to the abort, and then you could figure out if you want to try again, how many times you want to try again, should you even try this the next time, and there are lots of very interesting heuristics and adaptations on how to do this well.
Yes. That is the common one. The fallback is grab an actual lock, and when somebody grabs an actual lock, that prevents all other transactions from running because they conflict with it. So it basically means stop doing that, just do the one at a time thing. It’s actually a lot of the interesting software side research and implementation and this has to do with how to do this well. So for example, in the Azul JVMs that we have had working for years on our Vega platforms, we can adapt every lock instance separately to whether or not to even try to speculate. And that adaptation is very good. So if a lock is on data that never actually contends but the lock itself also never contends, it turns out that it actually just cheaper to grab the lock because it’s never going to contend, just a CAS operation which is similar to the cost of that transaction. We call those thin locks in Java.
But if the lock is contended, historically, in Java if the lock was contended, you block. If you can’t grab it, you wait. You could spin wait. You could back off to a semaphore, but you end up blocking behind the lock. We now have this third option where if the lock is contended, maybe we should speculate. And the question of whether or not you should do this is an interesting balance. You don’t want to do this to every lock because it’s actually faster to grab the non-contended locks, and 95% of locks are non-contended.
But you also don’t want to always grab it when it’s contended because maybe you’ll never succeed; in a system call you don’t want to grab it. So if it’s in the Goldilocks range, not too cold, not too hot, then it’s worth speculating because you win, and the more cores you have, the more win you have.
Alex's full question: It’s sort of like branch protection, isn’t it, when you go over something many times or even some of the optimizations that the JIT will do speculatively; if the object that you’ve seen in this is X all of the time then you keep coming back to it, then you can just kind of assume X and then fail to slower part or it’s not?
Yeah, it's interesting. You’ll see a lot of different attempts of implementations that try to use different forms of guessing. So for example, HotSpot actually, has a variant today that will use this feature when it comes out, and it tends to speculate on at the code point and to decide across all locks at that code point whether or not it’s worth speculating here in the code or on a class, for example. Where in our JVMs, what we’ve learned is that it’s worthwhile speculating on the lock instance because the same code point, especially in common case library code, might deal with some data structures of exactly the same type that are not contended, some that are always contended and some that are in the middle.
So a code point alone for some very small piece of code could choose right for one or the other but not for all. If you make the choice in a per instance of the lock perspective, then your specific hash table or specific hash map or linked list or whatever you’re locking on will learn for that instance whether it’s worthwhile to do. And the good news is the learning happens very quickly. It’s not something that takes a while to stabilize, and it stabilizes very fast. It tends to be very effective and you can get some very impressive (at least micro benchmarks) that show a locked, shared dictionary running concurrently at a hundred threads instead of one. Really cool.
So the Zulu builds of OpenJDK will have exactly the same features as HotSpot. So those are in our current code. Zing will have support for using TSX (x86 TSX) in coming versions. It’s not in the current version. We’re actually waiting for the actual real chips to come out and gain the tweaking there because the tweaking is – it’s interesting. Getting the heuristics right and what success to failure ratio to go for and how long to wait to stabilize. You want to collect actual data on what’s right, and we’re still in the mode of collecting that.
As I said, we have a lot of experience from doing this in Vega systems. And there, it took us probably a year plus to get to a stable state. And when I mean stable, I don’t mean crashing. We’re not crashing, but the actual goal with transactional memory is to make it never hurt. If it sometimes is really good but sometimes hurts to you, you tend to not turn it on, because many things don’t benefit from it. But if you could get it to a state where it’s never a negative, then you have it on by default, and we reached that with Vega. We expect to reach that with Zing in the coming year or so.
It’s more than likely. PowerPC variants from IBM already have semi-equivalent features. SPARC had designs that would have had it in it that didn’t quite go through. There are newer chips that have some features of hardware assisted transactional memory. Not all hardware transactional memory is the same. But of the type of, like TSX and Vega and say, to my understanding what Power has in it, we’re seeing that coming up a lot. You will probably see this more in server-oriented chips in environments that expect to have tens of cores or more because the wins are really there.
For every cycle you save in serialized code, you gain one cycle of performance for every CPU you have. So if you have a hundred CPUs, this thing is worth 50 times as much as when you have two. And that’s why I think you tend towards server heavy platforms being the actual focus now. The day where one laptop has 120 cores, this should be used for everything. But we’re not there yet.
Alex: Maybe one day.
Well, I mean my laptop actually has 16 virtual cores where eight of them were – actually, eight virtual cores. That’s a nice number but a few months and years, we’ll get there.
Alex: Something to look forward to. Gil Tene, thank you very much.