Let's Talk Locks!


Summary

Kavya Joshi explores when and why locks affect performance, delves into Go’s lock implementation as a case study, and discusses strategies one can use when locks are actually a problem.

Bio

Kavya Joshi works as a software engineer at Samsara - a start-up in San Francisco. She particularly enjoys architecting and building highly concurrent, highly scalable systems.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Joshi: I'm Kavya [Joshi] and I'm here today to talk with you about locks, specifically, lock internals and some performance. I take it everybody here is familiar with multi-threaded programming, so you're all familiar with locks.

We all use locks, but locks have somewhat of a big, bad, scary reputation, and for good reason. We've all probably worked with, or heard about, systems where locking caused performance problems.

Locks Are Slow, But They Are Used Everywhere

Here's an example from one of our production services. This is from Samsara, where I work, and this is a graph from a service that's a critical component of our read pipelines. We care a lot about the latency of this service. In this graph, from pretty recently, we see the latency went up by about 10X, and we traced this down to lock contention. We know that with locks it's better to be cautious; better yet, try not to use them at all.

That said, this production service is written in Go, and the Go runtime uses locks extensively under the hood, as does your favorite operating system scheduler, your favorite memory allocator, so really, locks are everywhere. So, what is it about locks that causes them to have this fascinating spectrum of effects on the performance of our programs? In what scenarios do they do well? When do they cause a problem? What can we do about it? These are the questions we'll answer today. To do that, we'll first talk about lock internals, we'll then talk locking performance, and then finally we'll touch upon smart locking strategies.

First things first, before we begin, an important note. Everything about locks, the lock implementation and performance is specific to the hardware, the instruction set, the operating system, and the specific language implementation, so, today, we'll assume a standard SMP system, so we have multiple cores that share a memory bus. We'll assume x86-64, running modern Linux, and we'll look at the lock implementation in Go.

The reason we choose Go is because it's a modern programming language. It has a sophisticated concurrency model, and so it has a sophisticated lock implementation. For those of you who are not Go programmers, don't worry, I got you.

A Brief Go Primer

All you need to know about Go for this talk today is that the unit of concurrency in Go is the goroutine, and goroutines are conceptually very similar to regular threads. You use them just like you use threads in a language like Java or C++, but the big difference is that they are user space. They are run and managed entirely by the Go runtime and not by the operating system. They're created and scheduled by the runtime. The details don't matter right now.

Like with regular threads, when you have two goroutines accessing a shared memory location, that access needs to be synchronized. You can synchronize it using Go's lock implementation, the implementation we'll be looking at today, the sync.Mutex. The Mutex is a blocking non-recursive lock, so, there are no try acquire semantics. If a goroutine tries to acquire a lock and it can't get it, it will block.
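For illustration (this sketch is not from the talk, just a minimal example of the API being described), protecting a shared task queue with a sync.Mutex looks like this:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu    sync.Mutex
		tasks []string
	)

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer goroutine: appends a task while holding the lock.
	go func() {
		defer wg.Done()
		mu.Lock()
		tasks = append(tasks, "task-1")
		mu.Unlock()
	}()

	// Reader goroutine: reads the queue while holding the lock.
	go func() {
		defer wg.Done()
		mu.Lock()
		fmt.Println("tasks:", tasks)
		mu.Unlock()
	}()

	wg.Wait()
}
```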

Let’s Build a Lock!

With that out of the way, let's turn our attention and talk about lock internals, or better yet, let's just build a lock. Where do we start? Let's start with what we know we want, and then we'll work backwards. We know we want locks to give us mutual exclusion. That is, say we have this simple example program: we have two goroutines that access a shared task queue. T1 is the reader, it's going to read tasks from the task queue, and T2 is the writer that puts items into the queue. We know we want whatever lock construct we build to make it so that T1 and T2 can never access that task queue at the same time.

How do we do this? We just use a flag. The way this would work is, our flag is going to track whether the task queue is already being accessed or whether it's free. If it's free, it's going to be zero; if it's being used, it's going to be one. The way the code would work is, you have the reader, it checks to see if the flag is zero, so if the task queue is free, then it sets flag to indicate that the task queue is in use, it does its thing, it unsets flag. If flag is one, however, it means the other goroutine is accessing the task queue, so it's simply going to loop. It's going to try again. Pretty simple. The writer code looks exactly the same.
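To make that protocol concrete, here is a hypothetical Go sketch of it (not from the talk, and deliberately broken in exactly the ways the next paragraphs explain, so don't use it):

```go
package naive

// flag tracks whether the task queue is in use: 0 = free, 1 = in use.
// This is the naive, intentionally incorrect version the talk is about
// to take apart: flag++ is not atomic, and nothing prevents reordering.
var flag int32
var tasks []string

func reader() {
	for {
		if flag == 0 { // task queue looks free
			flag++     // claim it (a read-modify-write, NOT atomic)
			_ = tasks  // ... read tasks ...
			flag--     // release it
			return
		}
		// flag == 1: the writer has the queue; loop and try again.
	}
}

func writer(task string) {
	for {
		if flag == 0 {
			flag++
			tasks = append(tasks, task)
			flag--
			return
		}
	}
}
```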

Does this work? Are we done? Can we go home? No, this doesn't work for a couple of reasons that both come down to what finally gets run on the hardware. The first problem is with this line, this flag++. This is compiled down in x86 to the increment instruction which is run by the processor in three steps. This is a read-modify-write instruction, so, the variable is read from memory into a local CPU register, it's modified, here it's incremented, and then it's written back to memory.

Our single instruction has resulted in two memory accesses. Why is this a problem? Well, what this means is that it's totally possible for the read and write of one thread's read-modify-write instruction to be interleaved with the other thread's read. It's totally possible for T2's read to start after T1's flag++ and still see the old value of flag. It'll see flag equals zero, which is no bueno.

This is an example of a non-atomic operation; all this means is that other processors can see its effects half complete. Operations can be non-atomic because they use many CPU instructions, so they are compiled into many instructions under the hood, like a load or store on a large data structure, or because they compile down to a single CPU instruction that results in many memory accesses, like the example we just saw. The opposite of a non-atomic operation is an atomic operation. In x86, loads and stores that are naturally aligned, so that they fall within a single cache line, are guaranteed to be atomic. These guarantees come from the fact that x86 is cache coherent. It guarantees that if your data sits within a single cache line, then all the CPU cores have a consistent view. This is cache coherency: they have a consistent view of a single cache line. Atomic is good and atomic is what we want, but using a flag is not atomic.

There's another problem. The second problem has to do with this block of code: setting the flag, reading tasks, and unsetting the flag again. The problem here is one we're familiar with thanks to the recent Spectre and Meltdown, which is that memory operations get reordered. They can be reordered by the compiler; over here, we see that the flag = 0 is reordered to happen before reading tasks. The processor can also reorder memory operations. This is an example of a store-load reordering, where the load, the read of tasks, is hoisted to happen before the write, and the reason to do this is to hide write latency. Writing is expensive because of cache coherency, because of the cache invalidations that have to happen, and so this is something that x86 does: it starts the read while the write is in progress, to hide the write's latency.

The compiler and the processor reorder memory operations all the time to speed things up. There are some rules around what they can and can't reorder; the only cardinal rule is sequential consistency for single-threaded programs. They can reorder things as long as it doesn't change the order as visible to a single-threaded program. There are other rules in the case of the compiler. This is captured by the language's memory model. Go, C++, and I think Java too, guarantee that if your multi-threaded program is data-race free, so if you've used synchronization constructs in all the right places, then memory operations will not be reordered. Your program remains sequentially consistent, even as a multi-threaded program.

That was compiler reordering. As for processor reordering, the rules around that are captured by the hardware's memory model. In the case of x86, for example, this is a form of relaxed consistency called total store ordering. What this says is, most reorderings of memory operations are not allowed, they're not valid, but store-load reordering, like the one we just saw, is totally valid.
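A classic litmus test makes this store-load reordering observable. This sketch is not from the talk; it is deliberately racy (the race detector will rightly complain), and the exact counts depend on your hardware and compiler, but on a typical x86 machine it reports a non-zero number:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	reorderings := 0
	for i := 0; i < 100000; i++ {
		var x, y, r1, r2 int32
		var wg sync.WaitGroup
		wg.Add(2)
		go func() { x = 1; r1 = y; wg.Done() }() // store x, then load y
		go func() { y = 1; r2 = x; wg.Done() }() // store y, then load x
		wg.Wait()
		// Under sequential consistency at least one goroutine must see
		// the other's store, so r1 == 0 && r2 == 0 would be impossible.
		// With store-load reordering (store buffering), it happens.
		if r1 == 0 && r2 == 0 {
			reorderings++
		}
	}
	fmt.Println("store-load reorderings observed:", reorderings)
}
```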

Our flag is not going to work because it's not atomic and it doesn't give us memory ordering guarantees. Ok, no problem. Why don't we just build a construct that gets us atomicity and gets us memory ordering? Not easy, but, luckily for us, the hardware provides. The instruction set exposes special hardware instructions that give us atomicity and give us memory ordering guarantees. In x86, an example of an instruction that gives us atomicity is the exchange instruction. Instructions that preserve memory ordering are called memory fences or memory barriers; examples are the memory fence, the store fence, and the load fence.

The really powerful tool x86 gives us, though, is the lock instruction prefix. The lock prefix is an instruction prefix; you can apply it to certain memory operations like loads, and stores, and increments. This gets us both atomicity and memory ordering: the order of memory operations is preserved. In fact, this lock instruction prefix is what powers atomic operations in most programming languages.

This sounds useful. Can we use this to solve our conundrum? Yes, we can. The lock prefixed compare and exchange instruction, or the atomic compare-and-swap is exactly what we need. This instruction conditionally updates a variable, so, only if it's equal to a certain value does it update it to another value, but this read-modify-write is atomic, which is exactly what we want.

Let's take this atomic compare-and-swap, and try and build our lock with it. Back to our example program. Here, all our operations on the flag variable, we're going to make them atomic operations. We're going to use the atomic compare-and-swap to check if flag is zero, and if it's zero, to set it to one. We're going to use atomic operations everywhere else. If that atomic compare-and-swap fails, we're just going to loop around and try again.

That makes sense, but does this work? Actually, it does work. This is a simplified implementation of a spinlock, and spinlocks are used extensively all through the Linux kernel. What does this cost us? That's what we're interested in. Let's just measure it.
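In Go, a minimal version of this spinlock can be sketched with the sync/atomic package (a teaching sketch, not from the talk; a real implementation would at least yield or back off while spinning, for exactly the reasons discussed next):

```go
package spin

import "sync/atomic"

// SpinLock is a simplified spinlock: flag == 0 means free, 1 means held.
type SpinLock struct {
	flag uint32
}

func (l *SpinLock) Lock() {
	// Atomic compare-and-swap: flip flag from 0 (free) to 1 (held).
	// If it fails, just loop around and try again (busy waiting).
	for !atomic.CompareAndSwapUint32(&l.flag, 0, 1) {
	}
}

func (l *SpinLock) Unlock() {
	// Atomic store back to 0 releases the lock.
	atomic.StoreUint32(&l.flag, 0)
}
```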

Before we go onto measuring it though, an important note is that these atomic compare-and-swaps, they are the quintessence of any lock implementation as we'll see throughout this talk.

Here we have a simple microbenchmark. This is in C; we're going to perform an atomic store 10 million times in a tight loop, and measure the time it takes per operation. We see that with one thread, so when there's no contention, it takes about 10 nanoseconds. This is about 10 times as much as a regular operation.
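The talk's microbenchmark is in C; a rough Go equivalent, covering both the single-threaded case and the contended case described next, could look like this (absolute numbers will differ by machine):

```go
package bench

import (
	"sync/atomic"
	"testing"
)

var flag int32

// Run with: go test -bench=AtomicStore -cpu=1,12
// -cpu=1 approximates the uncontended case, -cpu=12 the contended one.
func BenchmarkAtomicStore(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomic.StoreInt32(&flag, 1)
		}
	})
}
```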

In the contended case, when we have 12 threads, it takes about 12 times as much, which is exactly what we'd expect, because these atomic operations effectively serialize our threads. We have this construct, it gives us mutual exclusion, it gives us atomicity, it prevents memory reordering, and it's pretty inexpensive. Should we ship it? No. There is one big problem, and the big problem is that we have this thread just spinning, just busy waiting. This is wasteful, it burns CPU cycles, and it takes precious CPU time away from other threads that could be running.

You know what would be really nice? It would be really nice if, when our thread did this compare-and-swap and it failed, rather than spinning, we could just put it away. If we could just put it to sleep and resume it when the value of flag has changed and it can try again, that sounds nice. Lucky for us, the operating system gives us a way to do just that. In Linux, you have these futexes, and futexes provide both an interface and a mechanism for user space, for programs, to request the kernel to sleep and wake threads. The interface piece of this is the futex system call, and the mechanism is kernel-managed wait queues.

Let's see how these futexes work in practice. We're going to extend our flag variable to be able to take on another value: zero if the task queue is free, one if it's locked, and two if there is a thread waiting on the value of flag to change. Here in our reader, we do the atomic compare-and-swap. Assume it fails; first we're going to change flag to two, to indicate that a thread is going to sleep, and then we issue the futex system call to tell the kernel to suspend us, to tell the kernel, "We want to be woken up once the value of flag changes." The kernel takes care of that part, and then once we're resumed, we're going to try the compare-and-swap again.

Switching to kernel land, let's see what the kernel does. The kernel needs to do two things. The first thing is, it needs to store away the information that we have T1 waiting on the flag variable, so it can be resumed in the future. The way the Linux kernel does this is, we generate a key from the user space address - we need to do this because we're in the kernel now. This is stored away in a wait queue entry that holds the thread and the key. If we had one wait queue per user space address, that would be incredibly wasteful. What actually happens is, we use a hash table: the key is hashed, and we have a wait queue per hash bucket. What this means is, you can have two user space addresses that hash to the same hash bucket, and those entries would be stored in the same wait queue. Not a problem, because the entries store the key as well.

Now that we've stored away this information, to resume the thread later, we can put the thread to sleep. This is what the kernel does, it deschedules the calling thread. On the writer side, say the writer comes along, it finishes what it's doing, it's going to set the flag to unlocked, and then it's going to issue a futex system call to tell the kernel to wake up a thread that was waiting on flag. The kernel does its thing, it finds the right hash bucket, it walks the wait queue, and it wakes up the first thread that was waiting on flag.
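Pulling the user-space side of this together, here is a hedged Go sketch of a three-state futex lock along the lines the talk describes, in the spirit of the classic design from Ulrich Drepper's "Futexes Are Tricky". The futexWait/futexWake helpers are my own thin wrappers (an assumption, not anything from the talk) over the raw futex syscall via golang.org/x/sys/unix; this is Linux-only and for illustration, not production code:

```go
//go:build linux

package fmutex

import (
	"sync/atomic"
	"unsafe"

	"golang.org/x/sys/unix"
)

// state: 0 = unlocked, 1 = locked, 2 = locked with waiters.
type FutexMutex struct {
	state uint32
}

func (m *FutexMutex) Lock() {
	// Fast path: uncontended compare-and-swap from 0 to 1.
	if atomic.CompareAndSwapUint32(&m.state, 0, 1) {
		return
	}
	// Contended: mark "waiters present" and sleep until the state changes.
	// If the swap returns 0 we actually grabbed the lock and can proceed.
	for atomic.SwapUint32(&m.state, 2) != 0 {
		futexWait(&m.state, 2)
	}
}

func (m *FutexMutex) Unlock() {
	// If there may be waiters (state was 2), ask the kernel to wake one.
	if atomic.SwapUint32(&m.state, 0) == 2 {
		futexWake(&m.state, 1)
	}
}

// futexWait asks the kernel to sleep us, but only if *addr still equals val.
func futexWait(addr *uint32, val uint32) {
	unix.Syscall6(unix.SYS_FUTEX, uintptr(unsafe.Pointer(addr)),
		uintptr(unix.FUTEX_WAIT_PRIVATE), uintptr(val), 0, 0, 0)
}

// futexWake asks the kernel to wake up to n threads waiting on addr.
func futexWake(addr *uint32, n int) {
	unix.Syscall6(unix.SYS_FUTEX, uintptr(unsafe.Pointer(addr)),
		uintptr(unix.FUTEX_WAKE_PRIVATE), uintptr(n), 0, 0, 0)
}
```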

This is pretty nice, this is really convenient. This implementation we saw was an extremely simplified futex implementation. Futexes are notoriously tricky, but conceptually, this is how they work. What they give us is this nice, lightweight primitive to build synchronization constructs like locks. In fact, the pthread Mutex that you use in C and Java, they all use variants of this futex.

Our question again is, this is nice and all, but what is this upgrade going to cost us? We measure it; here is our microbenchmark. This is again in C; we're measuring the cost of a lock-unlock pair for a pthread Mutex. We see that in the uncontended case, with one thread, it's again about 10 nanoseconds, which is the cost of an atomic compare-and-swap, which is what we'd expect - it's just that atomic compare-and-swap that succeeded. In the contended case though, with 12 threads, the cost goes up to one microsecond, and this comes from the context switching cost, from that syscall, from switching into the kernel, from the waiting. That's why it goes up to a whole microsecond.

At this point, it's worth asking if it makes sense to sleep rather than spin. Really, it comes down to an amortized cost argument. It makes sense to spin if the lock duration, the time for which we're going to hold the lock, is going to be short, but the trade-off is, while spinning, we're not doing any useful work, we're burning CPU cycles. At some point, it makes sense to pay the cost of that thread context switch, to put the thread to sleep and run other threads so we can do useful work.

This insight, this tradeoff, is captured in hybrid futexes. The way hybrid futexes work is, the thread does that compare-and-swap; if it fails, it first spins a small fixed number of times, and if it still doesn't get the flag, only then is the thread suspended. These hybrid futexes are pretty clever, and a variant of the pthread Mutex uses a hybrid futex. Go has an internal futex implementation, and it is a hybrid futex.
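Continuing the earlier futex sketch (and reusing its FutexMutex type and futexWait helper, so this fragment is not self-contained), the hybrid idea only changes the slow path: spin a small, fixed number of times before asking the kernel to put the thread to sleep. The spin count of 100 here is arbitrary:

```go
// LockHybrid: spin briefly, then fall back to sleeping in the kernel.
func (m *FutexMutex) LockHybrid() {
	// Fast path: uncontended acquire.
	if atomic.CompareAndSwapUint32(&m.state, 0, 1) {
		return
	}
	// Spin for a bit, hoping the current holder releases the lock soon.
	for i := 0; i < 100; i++ {
		if atomic.CompareAndSwapUint32(&m.state, 0, 1) {
			return
		}
	}
	// Still contended: mark "waiters present" and sleep via the futex.
	for atomic.SwapUint32(&m.state, 2) != 0 {
		futexWait(&m.state, 2)
	}
}
```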

At this point, we've evolved from spinlocks to futexes to hybrid futexes. Are we done now? Well, we could be. If we were talking about a language like Java or C++ that has a native threading model, we would be. What about a language that uses user space threads, like Go with its goroutines? Can we do even better? The whole thing about user space threads is that they run on top of regular threads. The Go runtime takes care of multiplexing and scheduling goroutines on top of regular operating system threads, but the goroutines themselves run entirely in user space. Context switching between goroutines is cheap, it's fast, it takes tens of nanoseconds rather than the one microsecond we just saw.

Why does this matter? This gives us an opportunity to do something clever. This gives us an opportunity to block the goroutine rather than the underlying operating system thread. If a goroutine tries to acquire a Mutex and it can't, we can totally put the goroutine to sleep without putting the underlying thread to sleep so we don't have to pay that expensive thread switching cost. This is clever and this is exactly what the Go runtime does. The Go runtime has a semaphore implementation. The semaphore is conceptually very similar to the futex we just saw, except it's used to sleep and wake up goroutines, not threads.

Let's see how this works. In our program, when the compare-and-swap fails, the Go runtime is going to add the goroutine to a wait queue, except, in this case, the wait queue is managed in user space. The wait queue looks very similar to the futex wait queue: we have a hash table and we have a wait queue per hash bucket. The details of the wait queues themselves are slightly different, but the details don't matter today. Once the goroutine has been added to the wait queue, the Go runtime deschedules the goroutine by calling into the Go scheduler. The goroutine is suspended, but the underlying thread keeps running, and it just picks up other goroutines to run instead. When the writer comes along and finishes its thing, the Go runtime walks the wait queue, finds the first goroutine that was waiting on the flag variable, and reschedules it onto an operating system thread so the goroutine can run.

This is clever, we've found a way to avoid that heavy thread context switch cost. At this point, we could almost be done, but we're not, and we're not for a couple of reasons. There are a couple of problems with this implementation, and they both come down to this - they come down to the fact that once a goroutine is woken up, so a goroutine tried to acquire a lock, it failed, it was put on a wait queue. Once it's resumed, it has to compete, it has to do that compare and swap, it has to compete with other incoming goroutines that are also trying to do the compare and swap. It's totally possible it's going to lose again, and it's going to be put to sleep again. In fact, not only is it possible, it's very likely this will happen because the other goroutines, the incoming goroutines, they're already scheduled onto threads, they're already running on the CPU, versus the goroutine that we just woke up, there's a scheduling delay before it's put onto a thread and run. It's likely that it will lose.

This is problematic for a couple of reasons. First of all, we're going to unnecessarily wake and sleep, we're going to pay that goroutine context switching costs over and over again. Secondly, this can cause goroutine starvation. This, in turn, means that there's like a high scheduling tail latency which can show up in our application's performance.

What does the Go runtime do about it? Well, the Go runtime adds another layer of implementation, another layer of sophistication, around the semaphore, and this nicely wrapped up package is the sync.Mutex that our applications use. Finally, we can talk about the sync.Mutex. The sync.Mutex is a hybrid lock, so if the compare-and-swap fails, the goroutine first spins a fixed number of times, and if that fails, it uses a semaphore to go to sleep and resume. But it's not a simple hybrid lock; it has an additional layer of implementation, additional state tracking, that fixes the two problems we just saw.

The details don't matter today, if you're curious, ask me afterwards, but, yes, this is our sync.Mutex, we have arrived.

Our question now is, what does this cost us? What does this fancy Mutex get us performance-wise? Well, let's see. In the uncontended case, it takes what we'd expect, about 10 nanoseconds, the cost of an atomic compare-and-swap. In the contended case, it seems to take about a microsecond again. That's curious, so let's dig into it more. Let's break down that contended case performance, compare it to the pthread Mutex implementation, and see what's going on.
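If you want to reproduce this kind of measurement yourself, Go's testing package makes it straightforward (a sketch; the exact numbers depend on your hardware and GOMAXPROCS):

```go
package bench

import (
	"sync"
	"testing"
)

var mu sync.Mutex

// Run with: go test -bench=Mutex -cpu=1,12
// -cpu=1 gives the uncontended cost of a Lock/Unlock pair,
// -cpu=12 gives the contended cost.
func BenchmarkMutexLockUnlock(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			mu.Unlock()
		}
	})
}
```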

We see that initially, as the number of goroutines increases, as the contention increases, there's a significant difference between Go's performance and C's performance. Eventually, once the concurrency level, the number of goroutines, gets high enough, that difference gets smaller and Go's performance starts to converge with C's performance. Why does that happen? We expect Go to do better than C because of this fancy Mutex implementation, because of goroutines, but at a high enough concurrency level, that doesn't hold true anymore.

I'll let you in on a secret. Remember Mutexes use semaphores, and semaphores have that hash table? What happens if you have two goroutines both waiting on variables that hash to the same hash bucket, and they need to be added to the same wait queue? It's almost like those hash buckets need to be synchronized; we need per-hash-bucket locks. It turns out these per-hash-bucket locks are futexes. What happens at high enough concurrency is, we're seeing thread contention. We have threads contending on getting the hash bucket futex, and so we're paying the thread context switching cost.

Wait a minute, the futexes also had a hash table, so do the futexes also need per-hash-bucket locks? Yes, they do, and the futexes use spinlocks. We've come full circle: we have Mutexes, which use semaphores, which use futexes, which use spinlocks. It's locks all the way down. You know what this reminds me of? Have you all watched the movie "Inception"? It's like that, but with locks. It's totally wild.

Let’s Analyze Its Performance!

At this point, we're done, it's time to move on. I could stand here and talk about lock internals all day, but let's turn our attention to talking about performance.

We've seen that there's a huge disconnect between locking performance in the uncontended and the contended case. We knew that from our programs, but now we understand why. What our microbenchmarks don't tell us though is, how our application's performance degrades with concurrency. As contention increases, we know performance is going to get worse, but how much worse is it going to get? Because, really, as application developers and systems builders, what we'd like to know is, "As I increase the number of threads in my program, how is my program's performance going to change?" We care about this for throughput considerations, "If I have a target throughput, how many extra threads should I add while keeping response time the same?" Or perhaps you care about latency, so, "As I change the number of threads, how much more can I speed up my program? How can I decrease the latency of my program?"

To answer these questions, we turn to the theory. The theory says, use Amdahl's Law, maybe. Amdahl's Law, you're probably familiar with, it basically tells us how much speedup you can get by adding threads, by increasing the concurrency, that speedup depends on the workload. It depends on how much of the workload is serial versus parallel. That's the formula for Amdahl's law, it doesn't mean much to me, let's just measure it. In our simple experiment, we are going to create different workloads. The workloads are pretty much the same except they have a different serial fraction, so a different fraction that's done holding the lock and a different parallel fraction. We're going to scale up the number of goroutines and see how that affects our performance.
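For reference (the slide itself is not reproduced in this transcript), the standard form of Amdahl's Law for the speedup with N threads, where p is the parallelizable fraction of the work, is:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
```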

This is the Amdahl's Law graph. It tells us that, when our workload is mostly serial, as the number of threads increases, we expect the program to get faster up to a point and then flatline, but that line is going to be a lot lower. We're going to start to flatline a lot earlier than if we had a highly parallelizable program, which is what we'd expect. Amdahl's Law is one way of getting at this. Amdahl's Law is good and all, but Amdahl's Law doesn't account for an additional cost factor in the performance of our programs. Our programs don't just exhibit contention, they also exhibit a coordination or a cross-talk penalty. Amdahl's Law doesn't capture that, but a performance model that does capture that is the Universal Scalability Law.

The Universal Scalability Law or the USL, tells us that the scalability of our programs depends on both contention, on how much contention there is in the system, but also on how much coordination there is in the system. The graph the USL gives us looks like this. Don't let the formula scare you, breaking it down, all this tells us is, in a system with no contention penalty and no coordination penalty, we expect linear scaling, we expect that straight line. As we increase the number of threads, we expect the throughput of our program to increase. If there is contention though, we get the Amdahl's Law graph. The throughput increases up to a point and then it flatlines. If we have both contention and coordination, then we actually start to see retrograde scalability.
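Again for reference, the Universal Scalability Law is usually written as below, where lambda is the throughput of a single thread, sigma models the contention penalty, and kappa models the coordination (crosstalk) penalty:

```latex
X(N) = \frac{\lambda N}{1 + \sigma (N - 1) + \kappa N (N - 1)}
```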

The way to use the Universal Scalability Law is, you get performance measurements from your application. You measure its throughput or its response time at different concurrency levels, and then you can use R; R has a USL package. You can use other software to extend this, to graph it, to predict, as you continue to increase the number of threads in your program, what's going to happen to its performance. This, for example, is the same experiment plotted under the Universal Scalability Law using the R package.

Let’s Use It, Smartly!

With that, we now have a couple of tools, a couple of performance models we can use to answer the question of contention and application performance. Let’s touch upon a few smart strategies we can use in the case we start to see contention. What can we do about it to reduce contention?

First things first, profile. We've seen that locks have this vast spectrum of performance characteristics. Locks might not even be the problem, so profile your application. If you're working with Go, you can use the Go mutex contention profiler; if you're running on Linux, there are several tools available to analyze this. The resources are in the slides: there's perf lock, you can write an eBPF script, you can use DTrace or SystemTap, depending on what you like to do. But yes, first profile.

If contention is actually the problem, a few strategies are: one, don't use a lock. This sounds snarky, but what it really means is, try to remove the need for synchronization from the hot paths. For example, try to do more thread-local work. In our producer-consumer example, try copy-on-write, or try to buffer work, to batch work. Another way to do this is to use atomic operations instead; for example, the Go runtime scheduler has these run queues, and the run queues are totally lock free. Lock-free data structures and lock-free programming are available to you. Those are some things you could do here.
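As a small illustration of that last point (not from the talk): a shared counter doesn't need a Mutex at all, since a single atomic add, one of those lock-prefixed instructions from earlier, does the job:

```go
package counter

import "sync/atomic"

// Counter is safe for concurrent use without a Mutex:
// each Inc is a single atomic add on the shared word.
type Counter struct {
	n int64
}

func (c *Counter) Inc()        { atomic.AddInt64(&c.n, 1) }
func (c *Counter) Load() int64 { return atomic.LoadInt64(&c.n) }
```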

The second strategy is, if you do have to use locks, use more granular locks, so you're holding the lock for a shorter time. For example, you can partition your data. If you're partitioning data, make sure there's no false sharing, that is, that the locks don't fall on the same cache line, or you'll have cache line bouncing. Again, you can measure whether or not there is false sharing by using something like perf. You can also use per-processor locks; the Linux scheduler and the Go runtime scheduler do this.
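A common shape for partitioned data in Go is a sharded map: hash the key to pick a shard, and give each shard its own Mutex, so goroutines touching different shards don't contend. This is a sketch, not from the talk; the shard count, hash, and padding size are arbitrary choices:

```go
package sharded

import "sync"

const numShards = 16

type shard struct {
	mu sync.Mutex
	m  map[string]int
	_  [48]byte // rough padding so adjacent shards tend not to share a cache line
}

// ShardedMap partitions the data so goroutines working on
// different keys usually take different locks.
type ShardedMap struct {
	shards [numShards]shard
}

func NewShardedMap() *ShardedMap {
	sm := &ShardedMap{}
	for i := range sm.shards {
		sm.shards[i].m = make(map[string]int)
	}
	return sm
}

// shardFor spreads keys across shards with a simple FNV-1a style hash.
func (sm *ShardedMap) shardFor(key string) *shard {
	var h uint32 = 2166136261
	for i := 0; i < len(key); i++ {
		h ^= uint32(key[i])
		h *= 16777619
	}
	return &sm.shards[h%numShards]
}

func (sm *ShardedMap) Set(key string, val int) {
	s := sm.shardFor(key)
	s.mu.Lock()
	s.m[key] = val
	s.mu.Unlock()
}

func (sm *ShardedMap) Get(key string) (int, bool) {
	s := sm.shardFor(key)
	s.mu.Lock()
	v, ok := s.m[key]
	s.mu.Unlock()
	return v, ok
}
```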

Then finally, you can do less serial work. This speaks to the same thing: move work out of the critical section. This is the graph we began with; in this case, we used the Go mutex profiler to determine that the lock was in fact the problem, and we restructured the code so we were holding the lock for less time.

 


 

Recorded at:

Aug 15, 2019


Community comments

  • Question on Go sync performance

    by Vijay Teja Gottipati,


    Looks like the reason go performance is converging with pthread performance is due to the hashtable contention inside go and then the kernel. Can we not increase the number of locks or size of go's hashtable to reduce the contention?
