I have been working on runtimes and garbage collectors for several years – 20-30 years – I have written several papers, hold a bunch a patents, and now I am working on the Go team trying to make sure that their Runtime and their garbage collector is world class. That is what I am here to talk about today.
All GCs basically have three phases. The first one locates all of the pointers that go into the heap that the program itself has direct access to. Typically these are in the global variables, or they are on the stack, or they are in registers. So that is the first phase, that is the root scan phase or the stack scan phase. The second phase is typically the “mark” phase; somehow you have to follow all of the pointers that go into the heap and then transitively follow the pointers you find and what they point to until you have located all of the objects in the heap that are reachable from those original pointers that the application has access to. Once you have done that, you obviously know that if you haven’t reached an area of memory or an object in memory then you know that it is free and it can be reused. That process is the third process. You sweep through the memory or through the heap, locating all the unmarked objects you did not reach in the previous phase, and then you just simply reuse them for allocation.
The GC does have some stop the world pauses. They are fairly small. That said, they are there for very pragmatic reasons. The actual algorithms that we use don't require them. We know how to, and we have proven that we can, get rid of those stop the world phases. But the current GC, just to be honest with people, does have these short stop the world phases. That is where 1.5 is right now.
Charles: You are presumably using write barriers, are you, somewhere in the process?
We are using write barriers.
Charles: Is that at the scan stage?
No, it is actually during the “mark” phase. The write barriers – when you do a concurrent collector, the issue here is that the mutators or the application program, if you will, is writing pointers to the heap at the same time the garbage collector is trying to trace them through the heap during the “mark” phase. The write barrier is the piece of code that makes sure that the garbage collector knows that the application is writing pointers to some places, otherwise you might lose track of what is going on. So, it is integral. The interesting thing about the 1.5 Go GC is that we can turn our write barriers on and off. So, when the concurrent GC is running, we turn our write barriers on. When the concurrent GC isn’t running, the write barriers are turned off. A quick check to see if they are turned off and if so you just install the pointer and move on. One of the great speed ups that we found in 1.5 is being able to turn the write barriers off during times when we don’t actually need them.
We start the process by flipping the write barrier on, we then go through the “mark” phase, then at the end of the “mark” phase we make the write barrier much more aggressive. That way, there are fewer and fewer things that can hide from the garbage collector. One of the interesting problems we had with 1.5 was that pointers would hide during the “mark” phase and we would not discover them until we stopped the world and then did the “mark termination” phase. When we make the write barrier more aggressive near the end of the “mark” phase, we found more objects and then we could shorten considerably the actual “mark termination phase” which is a stop the world phase. Obviously, if the world stops, you do not need write barriers, it finishes all the marking, and then we can turn the write barriers off because all that is left is the “sweep” phase which does not need a write barrier and we move on, right.
Charles: Is it a generational collector?
It is not a generational collector. Generational collectors are interesting beasts. They were around in a time of single CPU machines, single core machines, and time actually meant something. Today, if you’ve got 4, 8, or 16 cores, each of these cores has its own clock. So, “young” and “old” starts to mean different things. Multiply that by the number of Go routines we have, hundreds of thousands, all with a different clock and all of the sudden “young” means something different. So I am not sure that generational collecting is what we want moving forward. So, we did not use it here. We could have, its certainly one of the things that can be put into our system moving forward, we just didn’t need it but maybe the applications moving forward are going to need it. We don’y know. Maybe they are going to need some other, more inventive, nicer algorithms going forward. We shall see.
Charles: That is quite a profound difference from how Java works, for instance. Not just the generational thing, but the fact that you are talking about many thousands of Go routines, which are effectively light-weight threads, as I understand it, while as with Java that would not be a thing that you would see.
No. We spent a lot of time optimizing our threads to be light-weight, and that is something that Java chose not to do, which is perfectly reasonable. But you have to remember that Java was born in a single CPU per machine world. So there are some really good historical reasons why they want with generational collection - it was the right choice, and one reason why they are still there today.
5. How is the coordinator implemented in 1.5?
Brilliantly. By a colleague of mine, right? The coordinator has a lot of troubles, it is responsible for a lot of things. The first thing it has to do is to decide when to start a concurrent collector, a collection cycle. And if it chooses too late, then all the advantages of a concurrent collector are dissipated because you are spending all your time during the cycle, trying to keep the heap small. If it starts too early, then you are wasting your time and the write barrier is turned on. So it is really important to get the right math and the right implementation to know when to turn the GC cycle on. And that is only the start.
Once you are inside the GC cycle, you have a lot of problems. One of the problems that you have is that you may have an application thread or an application Goroutine that is doing a lot of allocation, and it becomes clear that if you keep doing this many allocations we will not get the GC done and keep the heap size where you want it. So our coordinator, which we call our pacer, needs to somehow slow down that application thread so that it can meet its deadlines, the GCs deadlines, and it does that quite cleverly. It says “OK. You want more space, but before we give you this space, you have to do two things: first, you have to check to see if the GC has made enough progress that you can just take the space, or you have to stop and you have to help the GC out enough so that there is enough credit, if you will, for you to go and do the allocation.” So the GC can get assists from the application threads, or Goroutines. The goroutines are slowed down by helping the GC, and the GC is sped up by getting the assist from the mutator, and all this has to be coordinated by the pacer. Ask the application thread to do too much – not good; Ask the application thread to not do enough – not good. You want everything to happen just as you had predicted in the first place, and if you find out that you ended up having to have the mutators assist too much, then on the next cycle there is a feedback loop that resets when you should start. So it is a complex, mathematically rigorous feedback control structure that we have to control our GCs.
Charles: As I understand it, that is implemented as a single Go routine. Is that correct?
No.
Charles: Oh, OK.
You can think of it as a distributed state machine where we just basically move from one to another state and sort of whoever finds himself in a state, finds that the GC should be moved to another state, it does that. So it is not centralized into a single Go routine. At one point, maybe it was. But certainly for 1.6 we have corrected that single point bottleneck, if you will, and we have taken care of that.
Charles: OK. That will remove some of the delays that you used to get in phase transitions, for instance, back in the 1.4 days.
That and some of the problems in 1.5 – that has been redone for 1.6 and so a lot of those problems, which I really do not think that anybody actually saw, have been fixed and its made our system more robust and easier to live with.
Well, they want to operate at zero, and we let them go into the red, and now we no longer let them go into the red.
Charles: They are both running in black now.
They are both running in black. Part of the pacer problem is trying to make sure that first of all that they are running in the black, but second that they are running as close to zero as possible because, as I said, you get too much credit, you waste energy. So getting that right is really the job of the pacer.
7. Do you use pre-fetching during “mark” at all?
Pre-fetching is important in a concurrent collector. Now the problem with pre-fetching obviously or the problem with the garbage collector is that it is a cache smashing machine. As it wanders through the heap, it really does do a good job of smashing the cache. So using the pre-fetcher, we go back and forth whether it is good or bad: Does it further smash the cache? Should we use a non-temporal prefetch so that we reduce the number of ways that happen to be smashed by the cache? And so forth and so on. Right now, I think what is on tip today does not do a pre-fetch. That may change if we get different data. But we have done pre-fetches in the past and I am sure we will do them in the future, but you build and measure, build and measure, build and measure, and not just the GC, you have to think about pre-fetches and not how much it speeds up the GC, but whether is slows down the mutator. So, it is not a simple “We are bright enough to know where we are going to go, we will just pre-fetch it.” It is much more complex than that.
Charles: I think I am correct in saying that the main contributor to pause time in 1.5 was in the mark termination phase.
In 1.5 – yes, it was.
8. Could you talk a little bit about why that was and how that is being addressed in 1.6?
In 1.5 we had a lot to do and there wasn’t much time and we ran out of time. What we had done with the “mark termination” phase is that if things were complex, we just missed something. It was easier in the stop the world to do something. That is when we just put it in the “mark termination” phase. There were two things in the “mark termination” phase that had a per span or per page cost. As we increased the size of the heap, because of these two issues: one was doing nothing but keeping a counter and the other was dealing with finalizers. You increase the heap from, say, 20GB to 50GB, then you ended up doubling the amount of pause times that you would have in the “mark termination” phase. So for 1.6, which is due out in three months, we moved all of those problems, all those per page, if you will, problems into the concurrent phase. That is how we cleaned that up. Now, we are seeing situations where we can have good, solid, under 10ms things with 250GB heaps, which is another step above what we had even considered for 1.5 when we were really talking about 25GB heaps. Well, you work on the machine that is on your desk until you’ve got that problem fixed, and then you go and find another machine and if it is a big machine, they do not put it on your desk. But that is the problem [you’re working on]. Again, the algorithm – we know how to get rid of all of these things and it is just a matter of picking them off as we have the resources and we have gotten all the tough ones now and real flat out there at 250GB heaps – flat pause times.
We are discovering them, its an issue of discovering them, not actually executing them. So with the finaliser you have to find the finalizer, you then have to trace what it belongs to, and we know how to do that concurrently. We just did not do it concurrently until 1.6.
Charles:
Right. And there was some issues in root marking, I think as well, that I believe have also been addressed.
There were lots of issues. For example, if you have a very steep stack, say several megabytes, and that stack is running, then we had to rescan it, the whole stack, and so we had to go in there and say “Well, we have scanned the lower 90% and then we have to remember that we have already scanned that part. Then we only have to scan the stuff that actually got changed since we scanned it at the beginning of the GC cycle,” and so that was tricky because when you return into an area of the stack where you thought you had scanned, you have to go and fix up a lot of stuff so that we know that we have to scan that stack. So, yes, that was another problem, but that was fixed also. Again, we know how to do this. But, we will just have to knock them off as people have applications that cause problems, we will go and get rid of the those sources of latency.
No. The sweeper is actually...I like to think of it as it was already concurrent, before 1.5. So it wasn’t an issue that affected latency. So we did not focus in 1.5 because it did not affect latency and 1.5 is really a latency-oriented improvement. What it does do is it tends to go over this map of bits that says which objects in the heap are marked and which aren’t. From that map, it run through, looks at the bits, creates a free list, it then turns the free list over to the allocator which starts peeling things off of the free list and then when the next GC cycle starts, it just has all of those bits just flipped back to zero and we do it again.
Now clearly, if you think for half a second, you do not actually need to build this free list. You can just have the allocator, instead of finding an empty, non-set bit and creating a free list, you can just slow it down until it finds a bit and then it just goes ahead and allocates at that location. It is a little bit complicated, but basically that is the high bit, if you will, of where the allocator is going to go, the sweeper and allocator will be formed into one in 1.7 and that way the sweeper will actually seem to completely disappear and it will become one with the allocator.
Charles: Very interesting.
That is 1.7 work. On 1.6 – still no work there. I mean we have prototyped it, but it is not going to make the cut. It's closed, the 1.6 is now closed. We had a code freeze or a feature freeze on it and we are just fixing bugs at this point.
Charles: That’s really interesting, Rick. Thank you very much indeed for your time.
Oh, I was glad to be here. That was great. Thanks.