
Case for Defaulting to G1 Garbage Collector in Java 9

Posted by Monica Beckwith on Sep 29, 2015. Estimated reading time: 11 minutes

I have introduced and discussed the Garbage First Garbage Collector here on InfoQ in a couple of previous articles - G1: One Garbage Collector To Rule Them All and Tips for Tuning the Garbage First Garbage Collector.

Today I would like to discuss JEP 248, the proposal to make G1 the default GC, targeted for OpenJDK 9. As of OpenJDK 8, the default GC has been the throughput GC (also known as Parallel GC) and, more recently, ParallelOld GC (ParallelOld means that both -XX:+UseParallelGC and -XX:+UseParallelOldGC are enabled). Anyone wanting to use a different garbage collection algorithm has to enable it explicitly on the command line. For example, to employ G1 GC, you would select it with -XX:+UseG1GC.
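To see which collector a given JVM actually ended up with, one option is to ask the JVM itself. The following is a minimal sketch, not part of the JEP; the class name ActiveCollectors is made up for illustration, and it relies only on the standard java.lang.management API.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Reports which garbage collectors the running JVM has registered. */
final class ActiveCollectors {

    /** Returns the names of the collector beans exposed by this JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Bean names depend on the collector and the JVM build,
        // so nothing is hard-coded here.
        for (String name : collectorNames()) {
            System.out.println(name);
        }
    }
}
```

Running it once without any GC flag and once with -XX:+UseG1GC shows the difference. HotSpot typically reports names such as PS Scavenge and PS MarkSweep for Parallel GC, and G1 Young Generation and G1 Old Generation for G1, but treat the exact names as build-specific.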

The proposal to set G1 GC as the default GC for OpenJDK 9 has been a major source of community concern. It has given rise to a few amazing discussions and eventually led to an update of the original proposal, adding a clause that provides the ability to revert to Parallel GC as the default.

So, Why G1 GC?

You may be familiar with the software optimization tradeoff: Software can be optimized for latency, throughput or footprint. The same is true for GC optimizations, and is reflected in the various popular GCs. You could also focus on two of those three, but trying to optimize for all three is enormously difficult. OpenJDK HotSpot GC algorithms are geared towards optimizing one of the three - for example, Serial GC is optimized to have minimal footprint, Parallel GC is optimized for throughput and (mostly) Concurrent Mark and Sweep GC (commonly known as CMS) is optimized for minimizing GC induced latencies and providing improved response times. So, why do we need G1?

G1 GC comes in as a long term replacement for CMS. CMS in its current state has a pathological issue that will lead it to concurrent mode failures, eventually leading to a full heap compacting collection. You can tune CMS to postpone the currently single threaded full heap compacting fallback collection, but ultimately it can’t be avoided. In the future, the fallback collection could be improved to employ multiple GC threads for faster execution; but again, a full compacting collection can’t be avoided.

Another important point is that even for the well seasoned GC engineer, maintenance of CMS has proven to be very challenging; one of the goals for the active HotSpot GC maintainers has been to keep CMS stable.

Also, CMS GC, Parallel GC and G1 GC are all implemented with different GC frameworks. The cost of maintaining three different GCs, each using its own distinct GC framework, is high. It seems to me that G1 GC’s regionalized heap framework, where the unit of collection is a region and various such regions can make up the generations within the contiguous Java heap, is where the future is heading - IBM has their Balanced GC, Azul has C4, and most recently there is the OpenJDK proposal called Shenandoah. It wouldn’t be surprising to see a similar regionalized heap-based implementation of a throughput GC, which could offer the throughput and adaptive sizing benefits of Parallel GC. So potentially the number of GC frameworks used in HotSpot could be reduced, thereby reducing the cost of maintenance, which in turn enables more rapid development of new GC features and capabilities.

G1 GC became fully supported in OpenJDK 7 update 4, and since then it’s been getting better and more robust with massive help from the OpenJDK community. To learn more about G1, I highly recommend the earlier mentioned InfoQ articles, but let me summarize a few key takeaways:

  • G1 GC provides a regionalized heap framework.
    • This helps provide immense tunability to the generations, since now the unit of collection (a region) is smaller than the generation itself. Increasing or decreasing the generation size is as simple as adding or removing a region from the free regions list. Note: Even though the entire heap is contiguous, the regions in a particular generation don’t have to be contiguous.
  • G1 GC is designed on the principle of collecting the most garbage first.
    • G1 has distinct collection sets (CSet) for young and mixed collections (for more information please refer to this article). For mixed collections, the collection set is comprised of all the young regions and a few candidate old regions. The concurrent marking cycle helps identify these candidate old regions, and they are effectively added to the mixed collection set. The tuning switches available for the old generation in G1 GC are more direct, many more in number, and provide more control than the limited size tuneables offered in Parallel GC or the size and 'initiation of marking' threshold settings offered in CMS. The future that I envision here is an adaptive G1 GC that can predictively optimize the collection set and the marking threshold based on the stats gathered during the marking and collection cycles.
    • An evacuation failure in G1 GC is also, so to speak, a “tunable”. Unlike CMS, fragmentation in G1 does not accumulate over time and lead to expensive collections and concurrent mode failures. In G1, fragmentation is minimal and controlled by tunables. Some fragmentation is also introduced by very large objects that don't follow the normal allocation path. These very large objects (also known as 'humongous objects') are allocated directly out of the old generation into regions known as 'humongous regions'. (Note: To learn more about humongous objects and humongous allocations please refer to this article). But when these humongous objects die, they are collected and thus the fragmentation dies with them. In its current state, G1 can still at times be a bit of a heap region and heap occupancy tuning nightmare, especially when you are trying to work with restricted resources; but again, making the G1 algorithm more adaptive would mean that the end user does not encounter these failures.
  • G1 GC is scalable!
    • The G1 GC algorithm is designed with scalability in mind. Compare this with ParallelGC and you have something that scales with your heap size and load without much of a compromise in your application’s throughput.
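The "most garbage first" idea behind the mixed collection set can be sketched as a toy model. This is only an illustration under simplifying assumptions, not G1's actual selection code: the Region type, the chooseMixedCSet method and the maxOldRegions cap are all hypothetical, and the real collector also weighs pause time goals and liveness thresholds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of "garbage first" selection for a mixed collection set. */
final class MixedCSetSketch {

    /** Hypothetical region record: an id, whether it is young, and its reclaimable bytes. */
    static final class Region {
        final int id;
        final boolean young;
        final long garbageBytes;

        Region(int id, boolean young, long garbageBytes) {
            this.id = id;
            this.young = young;
            this.garbageBytes = garbageBytes;
        }
    }

    /**
     * Builds a mixed collection set: every young region, plus up to
     * maxOldRegions old candidates, taking those with the most garbage first.
     */
    static List<Region> chooseMixedCSet(List<Region> heap, int maxOldRegions) {
        List<Region> cset = new ArrayList<>();
        List<Region> oldCandidates = new ArrayList<>();
        for (Region r : heap) {
            if (r.young) {
                cset.add(r);          // young regions are always part of the set
            } else {
                oldCandidates.add(r); // old regions compete on reclaimable space
            }
        }
        // Order old candidates so the regions with the most garbage come first.
        oldCandidates.sort(Comparator.comparingLong((Region r) -> r.garbageBytes).reversed());
        int limit = Math.min(maxOldRegions, oldCandidates.size());
        cset.addAll(oldCandidates.subList(0, limit));
        return cset;
    }

    public static void main(String[] args) {
        List<Region> heap = new ArrayList<>();
        heap.add(new Region(0, true, 900_000));
        heap.add(new Region(1, false, 100_000));
        heap.add(new Region(2, false, 700_000));
        heap.add(new Region(3, false, 400_000));
        for (Region r : chooseMixedCSet(heap, 2)) {
            System.out.println("region " + r.id + (r.young ? " (young)" : " (old)"));
        }
    }
}
```

With the sample heap above and a cap of two old regions, the set contains young region 0 followed by old regions 2 and 3; region 1 holds the least garbage and is left for a later collection.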

Why Now?

The proposal is targeted for OpenJDK 9. OpenJDK 9 general availability is targeted for September of 2016, which is still a year away. The hope is that the OpenJDK community members that choose to work with early access builds and release candidates are the ones that can test the feasibility of G1 GC as the default GC and also help with providing timely feedback and even provide code changes.

Also, the only end users that are impacted are those who do not set an explicit GC today; those who set an explicit GC on the command line are not impacted by this change. Users who do not set a GC explicitly will get G1 GC instead of Parallel GC, and if they want to continue to use Parallel GC, they merely have to set -XX:+UseParallelGC (the current default that enables parallel GC threads for young collection) on their JVM command line. Note: Since the introduction of -XX:+UseParallelOldGC in JDK 5 update 6, all recent builds also enable -XX:+UseParallelOldGC when you set -XX:+UseParallelGC on the JVM command line, so parallel GC threads are employed for full collections as well. Hence, if you are working with builds later than JDK 6, setting either of these command line options will offer the same GC behavior as you had previously.
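If you are unsure whether a deployment sets a GC explicitly or relies on the default, you can check the effective flag values from inside the JVM. The sketch below assumes a HotSpot-based JVM, since it uses the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean; the class name GcFlagCheck and the "unavailable" fallback are my own choices for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Prints the value of GC selection flags on a HotSpot-based JVM. */
final class GcFlagCheck {

    /** Returns the flag's current value, or "unavailable" if this JVM does not expose it. */
    static String flagValue(String flagName) {
        try {
            HotSpotDiagnosticMXBean hotspot =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (hotspot == null) {
                return "unavailable"; // no HotSpot diagnostic bean on this JVM
            }
            return hotspot.getVMOption(flagName).getValue();
        } catch (IllegalArgumentException e) {
            return "unavailable"; // flag (or bean) unknown to this JVM build
        }
    }

    public static void main(String[] args) {
        // The three flags discussed in this article.
        String[] flags = {"UseG1GC", "UseParallelGC", "UseParallelOldGC"};
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

A flag that a particular build does not know about is reported as unavailable rather than failing, which keeps the check usable across JDK versions.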

When Would You Choose G1 GC over Parallel GC?

As mentioned in this article, Parallel GC doesn’t do incremental collection, hence it ends up sacrificing latency for throughput. For larger heaps, as the load increases, the GC pause times will often increase as well, possibly compromising your latency related service level agreements (SLAs).

G1 may help deliver your response time SLAs with a smaller heap footprint, since G1’s mixed collection pauses should be considerably shorter than the full collections in Parallel GC.

When Would You Choose G1 GC over CMS GC?

In its current state, a tuned G1 can and will meet the latency SLAs that a CMS GC can’t due to fragmentation and concurrent mode failures. Worst case pause times with mixed collections are expected to be better than the worst case full compaction pauses that CMS will encounter. As mentioned earlier, one can postpone but not prevent the fragmentation of a CMS heap. Some developers working with CMS have come up with workarounds to combat the fragmentation issue by allocating objects in similar sized chunks. But those are workarounds that are built around CMS; the inherent nature of CMS is that it is prone to fragmentation and will need a full compacting collection. I am also aware of companies like Google who build and run their own private JDK built from OpenJDK sources with specific source code changes to help their needs. For example in an effort to reduce fragmentation, a Google engineer has mentioned that they have added a form of incremental compaction to their (private) CMS GC’s remark phase and have also made their CMS GC more stable (see: http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-July/019534.html).

Note: Incremental compaction comes with its own costs. Google probably added incremental compaction after weighing the benefits to their specific use-case.

Why Did The JEP Become Such A Hot Topic?

Many OpenJDK community members have voiced their concern over whether G1 is ready for prime time. Members have provided their observations on their experience with G1 in the field. Ever since G1 was fully supported, it has been touted as a CMS replacement. But the community has concerns that with this JEP it now feels like G1 is in fact replacing Parallel GC, not CMS. Hence, it is widely believed that while there may be data comparing CMS to G1 (due to businesses migrating from CMS to G1) there is not sufficient data comparing Parallel GC (the current default) to G1 (the proposed default). Also, field data seems to indicate that most businesses are still using the default GC, and so will definitely observe a change in behavior when G1 becomes the default GC.

There have also been observations that G1 has showcased some important (albeit very hard to reproduce) issues of index corruption and such issues need to be studied and rectified before G1 is made the default.

There are also others who ask whether we still need a single default GC that is not based on “ergonomics”. For example, since Java 5, if your system was identified as a “server-class” system, the default JVM would change to the server VM instead of the client VM (ref: http://docs.oracle.com/javase/7/docs/technotes/guides/vm/server-class.html).

Summary

After much back and forth, Charlie Hunt, Performance Architect at Oracle, finally summarized and proposed the following plan moving forward (Note: The excerpt below is referenced from here: http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-June/018804.html):

  • “Make G1 the default collector in JDK 9, continue to evaluate G1 and enhance G1 in JDK 9
  • Mitigate risk by reverting back to Parallel GC before JDK 9 goes “Generally Available” (Sept 22, 2016 [1]) if warranted by continuing to monitor observations and experiences with G1 in both JDK 9 pre-releases and latest JDK 8 update releases.
  • Address enhancing ergonomics for selecting a default GC as a separate JEP if future observations suggests it’s needed.” 

Also, Staffan Friberg of Java SE Performance team at Oracle urged the community to help gather data points for key metrics. I have paraphrased Staffan’s message for conciseness:

  • the startup time: to ensure that the G1 infrastructural complexity doesn’t introduce much delay at the Java Virtual Machine (JVM) initialization;
  • the throughput: G1 is going head to head with the throughput GC. G1 also has pre and post write barriers. The throughput metric is the key to understanding how much overhead the barriers impose on the application;
  • the footprint: G1 has remembered sets and collection sets that do increase the footprint. Data gathered from the field should provide enough information to understand the impact of the increased footprint;
  • the out-of-box performance: businesses that go with the default GC often also go with the out-of-box performance provided by that GC. Hence it is important to understand the out-of-box performance of G1. Here the GC ergonomics and adaptiveness play an important part.
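For readers who want to contribute such data points, a rough first cut can be taken from inside the application itself. The sketch below is only one possible approach and not a method Staffan prescribed; the class name GcDataPoints and its helper methods are assumptions for illustration. It uses accumulated GC time relative to JVM uptime as a coarse throughput proxy and committed Java heap as a coarse footprint proxy.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Collects coarse GC data points from the running JVM. */
final class GcDataPoints {

    /** Sums the approximate accumulated collection time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Fraction of elapsed time attributed to GC; 0.0 when uptime is not positive. */
    static double gcTimeFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gcMillis = totalGcMillis();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        System.out.println("uptime (ms): " + uptime);
        System.out.println("accumulated GC time (ms): " + gcMillis);
        System.out.println("GC time fraction: " + gcTimeFraction(gcMillis, uptime));
        // Java heap only; native GC structures such as remembered sets are not included.
        System.out.println("heap committed (bytes): " + heap.getCommitted());
    }
}
```

Comparing these numbers for the same workload under -XX:+UseParallelGC and -XX:+UseG1GC gives a first impression only. The committed heap figure does not capture G1's remembered sets, which live outside the Java heap, so a footprint comparison also needs process-level memory measurements.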

Staffan also helped identify that the business applications that currently employ the default GC algorithm will be the ones impacted by the change in the default GC. Similarly, scripts that don’t specify a GC, or interfaces that specify just the Java heap and generation sizes on the command line, will be impacted by the change in the default GC algorithm.

Acknowledgement

I would like to extend my gratitude to Charlie Hunt for his review of this article.

About the Author

Monica Beckwith is a Java Performance Consultant. Her past experiences include working with Oracle/Sun and AMD; optimizing the JVM for server class systems. Monica was voted a Rock Star speaker @JavaOne 2013 and was the performance lead for Garbage First Garbage Collector (G1 GC). You can follow Monica on twitter @mon_beck


Simple equation by Cameron Purdy

Parallel GC works great up to a heap of a certain size (up to 2GB range).

G1 GC works great starting with a heap of a certain size (maybe 2GB+ range). It also happens to be the only 64-bit GC in OpenJDK/Oracle JDK.

Heaps are only getting larger -- already headed into the terabytes!

So, where are you going to put your investment if you only have so many brilliant GC engineers? You could split those engineers across many different GC implementations and deal with the complexity of supporting, maintaining, and even improving all of them. Or you could focus them all on one implementation, i.e. _the_ implementation.

Which one would you choose? One optimized for single threaded applications and a 16MB heap? One optimized for a 512MB heap? Or one optimized for gigabytes of heap?

If some other company wants to shoulder the cost of developing and supporting all of those other GC implementations, then they should step forward with a large yearly donation. It's pretty obvious why Oracle would want to focus their limited resources on one single GC algorithm that can handle today's application workloads better (i.e. better than Parallel or CMS), and has a fighting chance of handling tomorrow's workloads as well.

There's a complexity cost and a performance cost and a quality cost of not making a decision to focus on one GC algorithm. Java has been paying that "tax" for years, and it's time to stop paying it.

Kudos to Oracle and the Java team for making a difficult but necessary decision to move forward.

Is this really the right time? by Ben Evans

First of all, thank you Monica for contributing this. I thought you did a decent job of explaining the issue & pretty balanced coverage.

I do disagree on quite a few points though:

1) In reality, it's only with the advent of Java 8u25 (and maybe Java 7u71) that G1 has actually been reliable enough for full on production use. That gives us less than 18 months in the field across the full range of workloads - that's a huge amount to test, and I would argue, too much.

2) I do find it worrying that G1 was initially proposed as a "better CMS". It has never been able (& still can't) beat CMS on any workload that requires genuinely short STW pauses. I have been unable to reliably tune G1 to less than a 100ms STW - and that's not really "low-pause" by the standards of financial trading and other industries where 5-10ms is often all the STW time an app can get away with.

3) Heap size. Sure - there are an increasing number of apps that need very large heaps. But the average heap size I see is still <2G.

4) So I have to disagree with Cameron - I've seen a few cherry-picked examples that G1 outperforms CMS on very specific workloads, but no examples of G1 outperforming Parallel.

I've tried to reach out to Oracle about this - to have a conversation about what numbers actually lead to the conclusion that G1 is ready for prime time. I've asked if Oracle can provide customer testimonials or references that document a transition to G1. Nothing. So, as a non-Oracle employee, the conclusion that seems inescapable is that the studies haven't been done, and the data doesn't exist. Not very encouraging when the clock is ticking on the JDK 9 release.

Monica, anything you can do to encourage Oracle to prove me wrong, and release data that shows the results of a full application migration to G1 would be very useful. No-one opposed to the JEP feels that G1 is a bad collector, or that longer-term, it isn't the right choice. The concerns are purely about timing, and evidencing the decision, one way or another.

Re: Is this really the right time? by Monica Beckwith

Thanks, Ben. I think your requests are valid and I have reached out to a few of my contacts at Oracle to help provide numbers to support the decision. When I worked for Oracle, we did various performance tests that covered a vast range of heap sizes. So, I would think that the information is there.

Regards,
Monica

Re: Is this really the right time? by Monica Beckwith

Forgot to mention that about 2 years ago, we gave this presentation @J1: www.slideshare.net/mobile/MonicaBeckwith/garbag...

This highlights cases where G1 was beneficial.

Re: Is this really the right time? by Cameron Purdy

Ben -

Regardless of performance, the sooner the combinatorial complexity is removed from the GC support matrix, the better. The engineering resources are simply stretched too thin at present. Focusing the resources on a single GC algorithm makes sense; attempting to support many combinations of GC algorithms does not make sense.

You are right that G1 generally did not out-perform Parallel for small heaps. It is unclear whether that will ever happen on small heaps. When I was at Oracle, I did argue for a default implementation that would automatically select either Parallel or G1 depending on (among other things) heap size, but the JVM group felt that was a bad idea.

However, even a year ago now, the latest G1 performance surpassed the old Parallel GC performance on many (most?) of the validation and performance tests that Oracle has been running on its own suite of software -- and trust me when I say that is a lot of testing with a lot of different workloads. (Ironically, some of the G1 improvements also helped improve the Parallel collector performance, so the "finish line" in some cases kept moving.)

As far as your concern that G1 has "only" been stable in the field for 18 months now, remember that the switch to G1 as a default will happen with Java 9, which is still a year away. That's still a lot of time to optimize, tune, and validate G1.

As far as asking Oracle to prove you wrong, that is a tough request. Oracle just doesn't have the resources to maintain Serial and Parallel and CMS and G1 and the various combinations of options that the JVM currently supports for GC. That also means that they're not going to use their limited resources in the GC field to "prove you wrong". The sooner that they simplify the set of supported GC options and put their resources behind a single implementation, the better off Java will be. Maybe G1 (the only 64 bit collector) is the wrong horse to bet on, but the "even wronger" horse to bet on is all of them.

You can't hedge all your bets, or you will always be losing. If you want "plan A" to work, then there is no "plan B".

Peace,

Cameron.

G1 vs C4 comparison by Alex Koturanov

To put it in perspective, how does G1 GC performance compare to the industry-leading (but commercial) C4 collector from Azul Zing?
