Xml and More: JDK 8: Is Tuning MaxNewSize in G1 GC a Good Idea?

For a throughput collector such as Parallel GC, setting MaxNewSize (or -Xmn) often helps performance. Is the same true for G1 GC?^[1] The short answer is yes/no: it depends (confession: I said "no" because I haven't done enough tests to exclude it).

In this article, we demonstrate one case in which setting MaxNewSize for G1 actually hurts the application's performance. Incidentally, after benchmarking G1 in JDK 7, another author claimed:^[14]

There is little to gain and much to lose from setting the new generation size explicitly.

Live Data Size

To tune the heap size of a Java application, it's important to know its live data size. To learn how to estimate it, read [2]. The benchmark we use has a live data size of around 1400 MB. With this benchmark, we want to know how G1 performs with MaxNewSize set to 400 MB and the Java heap set to 2 GB.

We normally use the following settings when tuning G1:

-Xms2g -Xmx2g -XX:+UseG1GC

plus some options for mixed GC (read [3]). For this experiment, we added the MaxNewSize setting via -Xmn, which sets both the initial and the maximum young generation size:

-Xms2g -Xmx2g -XX:+UseG1GC -Xmn400m

This turned out to hurt performance a lot. Note that -Xmn400m is the optimal setting for our benchmark when Parallel GC is used.
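Before comparing runs, it can help to confirm which young generation options a JVM was actually started with. The following is a minimal sketch, not part of the original experiment: the class name YoungGenFlagCheck and the method pinsYoungGen are hypothetical, and the prefix list covers only the standard HotSpot options -Xmn, -XX:NewSize, -XX:MaxNewSize and -XX:NewRatio.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports whether the JVM arguments fix or bound
// the young generation size explicitly.
final class YoungGenFlagCheck {
    // Standard HotSpot options that size the young generation.
    private static final String[] SIZING_PREFIXES = {
        "-Xmn", "-XX:NewSize=", "-XX:MaxNewSize=", "-XX:NewRatio="
    };

    static boolean pinsYoungGen(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            for (String prefix : SIZING_PREFIXES) {
                if (arg.startsWith(prefix)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM, e.g. -Xms2g -Xmx2g -XX:+UseG1GC -Xmn400m
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        // Only detects an explicit flag; it does not detect G1 chosen by default.
        boolean g1Requested = jvmArgs.contains("-XX:+UseG1GC");
        if (g1Requested && pinsYoungGen(jvmArgs)) {
            System.out.println("G1 is running with an explicit young generation size.");
        } else {
            System.out.println("No explicit young generation size detected for G1.");
        }
    }
}
```

Running this class with the same options as the application under test prints which of the two configurations above is in effect.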

Concurrent Cycle in G1 GC

G1 has four main operations:^[5]

A young collection
A background, concurrent cycle
A mixed collection
If necessary, a full GC

For the necessary background on G1 GC, read [5-9].

G1 is a concurrent collector. Its concurrent cycle has multiple phases—some of which stop all application threads (denoted by STW) and some of which do not:

Initial-mark (STW)

Denoted by:

[GC pause (G1 Evacuation Pause) (young) (initial-mark)
[GC pause (Metadata GC Threshold) (young) (initial-mark)

The G1 GC marks the roots during this phase. This phase is piggybacked on a normal (STW) young garbage collection.

Root Region Scan

Denoted by:

[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end

The G1 GC scans survivor regions of the initial mark for references to the old generation and marks the referenced objects.
This phase runs concurrently with the application (not STW) and must complete before the next STW young garbage collection can start.

Concurrent marking

Denoted by:

[GC concurrent-mark-start]
[GC concurrent-mark-end

The G1 GC finds reachable (live) objects across the entire heap.
This phase happens concurrently with the application, and can be interrupted by STW young garbage collections.

Remark (STW)

Denoted by:

[GC remark

This phase is an STW pause that completes the marking cycle.

It finds objects that were missed by the concurrent mark phase because Java application threads updated an object after the concurrent collector had finished tracing it.

G1 GC drains SATB buffers, traces unvisited live objects, and performs reference processing.

Cleanup (STW)

Denoted by:

[GC cleanup 1936M->1931M(2048M)

This phase prepares for the next concurrent collection by clearing data structures.

In this final phase, the G1 GC performs the STW operations of accounting and RSet scrubbing.
During accounting, the G1 GC identifies completely free regions and mixed garbage collection candidates. The cleanup phase is partly concurrent when it resets and returns the empty regions to the free list.

Concurrent cleanup

Denoted by:

[GC concurrent-cleanup-start]
[GC concurrent-cleanup-end

In this phase, G1 reclaims regions which were found empty during marking.

It adds the empty regions to the free list; that is, thread-local free lists are merged into the global free list.
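The phases above can be summarized as a small lookup from log marker to phase. This is only a sketch for reading logs: the class G1CyclePhases, the enum and the classify method are hypothetical names, and the marker strings are the ones quoted in this section. Note that a line containing concurrent-mark-abort is also classified as CONCURRENT_MARK here.

```java
// Hypothetical helper: maps a JDK 8 G1 log line to a concurrent-cycle phase.
final class G1CyclePhases {
    enum Phase {
        INITIAL_MARK("(initial-mark)", true),
        ROOT_REGION_SCAN("[GC concurrent-root-region-scan-", false),
        CONCURRENT_MARK("[GC concurrent-mark-", false),
        REMARK("[GC remark", true),
        CLEANUP("[GC cleanup", true),
        CONCURRENT_CLEANUP("[GC concurrent-cleanup-", false);

        final String marker;        // substring that identifies the phase in the log
        final boolean stopTheWorld; // true if application threads are paused

        Phase(String marker, boolean stopTheWorld) {
            this.marker = marker;
            this.stopTheWorld = stopTheWorld;
        }
    }

    // Returns the phase whose marker appears in the line, or null if none does.
    static Phase classify(String logLine) {
        for (Phase phase : Phase.values()) {
            if (logLine.contains(phase.marker)) {
                return phase;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String line = "[GC cleanup 1936M->1931M(2048M)";
        Phase phase = classify(line);
        System.out.println(phase + " STW=" + phase.stopTheWorld);
    }
}
```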

G1 GC uses the Snapshot-At-The-Beginning (SATB) algorithm, which takes a snapshot of the set of live objects in the heap at the start of a marking cycle. Young garbage collections are allowed during the concurrent cycle and are triggered when eden fills up (note that initial-mark is implemented using a young collection cycle).

The set of live objects after the marking cycle is composed of the live objects in the snapshot and the objects allocated since the start of the marking cycle. The G1 GC marking algorithm uses a pre-write barrier to record and mark objects that are part of the logical snapshot.
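To illustrate the idea of a pre-write barrier, here is a toy model in plain Java. It is not HotSpot's implementation: the class SatbSketch, its Node type and its methods are all hypothetical, and the model uses a single thread and a single reference field. It shows only the principle that the old value of a reference is recorded before it is overwritten while marking is active, so an object that was reachable in the snapshot still gets marked.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of an SATB pre-write barrier (illustrative only).
final class SatbSketch {
    static final class Node {
        Node next;       // the only reference field in this model
        boolean marked;  // set once the marker has visited the node
    }

    private boolean markingActive;
    private final Deque<Node> satbBuffer = new ArrayDeque<>();

    void startMarking() {
        markingActive = true;
    }

    // Pre-write barrier: remember the value about to be overwritten.
    void writeNext(Node holder, Node newValue) {
        Node oldValue = holder.next;
        if (markingActive && oldValue != null) {
            satbBuffer.add(oldValue);
        }
        holder.next = newValue;
    }

    // Stand-in for the remark step: drain the buffer and mark what it reaches.
    void drainSatbBuffer() {
        while (!satbBuffer.isEmpty()) {
            Node node = satbBuffer.poll();
            while (node != null && !node.marked) {
                node.marked = true;
                node = node.next;
            }
        }
        markingActive = false;
    }

    public static void main(String[] args) {
        SatbSketch gc = new SatbSketch();
        Node a = new Node();
        Node b = new Node();
        a.next = b;               // b is reachable when the snapshot is taken
        gc.startMarking();
        gc.writeNext(a, null);    // the application unlinks b during marking
        gc.drainSatbBuffer();
        System.out.println("b marked: " + b.marked);
    }
}
```

In this model b is treated as live for the current cycle even though it was unlinked, which matches the snapshot semantics described above.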

After the concurrent cycle, we expect to see:^[5]

The eden regions before the marking cycle have been completely freed and new eden regions have been allocated
Old regions could be more occupied because of the promotion of live objects from young regions
Some old regions are identified to be mostly garbage and become candidates in later mixed or old collection cycles

Diagnosis

Setting -Xmn400m caused G1 to regress for the following reason:

The option forces G1 to use a young generation of up to 400 MB, which leaves G1 only about 250 MB (2048 MB - 400 MB - 1400 MB = 248 MB) of breathing room for shuffling live objects around.

This small amount of free space has a negative impact on G1's concurrent cycles: most marking cycles could not be completed in time.
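The headroom arithmetic can be written down as a small calculation. This is a minimal sketch using the figures from this article (2048 MB heap, 400 MB young generation, roughly 1400 MB of live data); the class HeapHeadroom and the method headroomMb are hypothetical names, and the model ignores region granularity and other G1 overheads.

```java
// Hypothetical helper: space left once the young generation and live data are accounted for.
final class HeapHeadroom {
    static long headroomMb(long heapMb, long youngMb, long liveDataMb) {
        return heapMb - youngMb - liveDataMb;
    }

    public static void main(String[] args) {
        // Figures from the benchmark described in this article.
        long headroom = headroomMb(2048, 400, 1400);
        System.out.println("Headroom: " + headroom + " MB"); // prints "Headroom: 248 MB"
    }
}
```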

Before setting -Xmn400m on the command line, we found only one instance of an aborted marking cycle, which is denoted by:

[GC concurrent-mark-abort]

After setting it, the marking cycle was aborted in 105 of 150 instances. The lesson learned is that for a heap as tight as our benchmark's, setting -Xmn400m for G1 is not a good idea. Without MaxNewSize set, G1's ergonomics adjust the young generation size dynamically, and it turns out that G1 does a better job in this case.
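Counting aborted marking cycles in a GC log can be automated. The sketch below assumes that each marking cycle appears as one [GC concurrent-mark-start] line and each abort as one [GC concurrent-mark-abort] line, which is how the markers are quoted in this article; the class MarkingAbortCounter and its methods are hypothetical, and the log file path is taken from the first command-line argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: ratio of aborted to started marking cycles in a G1 log.
final class MarkingAbortCounter {
    static int count(List<String> logLines, String marker) {
        int hits = 0;
        for (String line : logLines) {
            if (line.contains(marker)) {
                hits++;
            }
        }
        return hits;
    }

    static double abortRatio(List<String> logLines) {
        int started = count(logLines, "[GC concurrent-mark-start]");
        int aborted = count(logLines, "[GC concurrent-mark-abort]");
        return started == 0 ? 0.0 : (double) aborted / started;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a GC log file (an assumption of this sketch)
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Started: " + count(lines, "[GC concurrent-mark-start]"));
        System.out.println("Aborted: " + count(lines, "[GC concurrent-mark-abort]"));
        System.out.println("Abort ratio: " + abortRatio(lines));
    }
}
```

If the 150 instances above correspond to 150 start markers, 105 aborts would give a ratio of 0.7.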

Differences between Parallel GC and G1 GC

Note that -Xmn400m is the optimal setting for our benchmark if Parallel GC is used. When we set MaxNewSize to be 400MB for G1, our benchmark regressed. So, what's the difference between G1 GC and Parallel GC?

G1 GC was initially designed to replace CMS^[10] and to offer relatively lower and more predictable pause times. This differs from the design of Parallel GC, which aims for higher throughput. To gain higher throughput, Parallel GC tries to:

size the young generation as large as possible and let the heap run into a full GC

In our 4-hour experiments, we usually see around 50 full GCs with Parallel GC

However, when tuning G1 for performance, we want to avoid full GCs (see [11] for how)

Which means more frequent full GC is expected

Note that Parallel GC's Full GC pause time is shorter than G1's in the current JDK 8 releases

To conclude, if you upgrade the JDK or switch garbage collectors,^[7] always re-evaluate the VM settings that were previously optimal.

Acknowledgement

Parts of this article are based on feedback from Thomas Schatzl. However, the author assumes full responsibility for the content.

References

1. Garbage First Garbage Collector Tuning
2. JRockit: How to Estimate the Size of Live Data Set
3. g1gc logs - basic - how to print and how to understand
4. g1gc logs - Ergonomics - how to print and how to understand
5. Java Performance: The Definitive Guide (Strongly recommended)
6. Garbage First Garbage Collector Tuning - Oracle
7. Our Collectors by Jon Masamitsu
8. Understanding Garbage Collection
9. HotSpot VM Performance Tuning Tips
10. Understanding CMS GC Logs
11. G1 GC: Tuning Mixed Garbage Collections in JDK 8
12. G1 GC Glossary of Terms
13. Learn More About Performance Improvements in JDK 8
14. Benchmarking G1 and other Java 7 Garbage Collectors
15. HotSpot Virtual Machine Garbage Collection Tuning Guide
16. Getting Started with the G1 Garbage Collector
17. Garbage-First Garbage Collector (JDK 8 HotSpot Virtual Machine Garbage Collection Tuning Guide)
18. Other JDK 8 articles on Xml and More
19. Tuning that was great in old JRockit versions might not be so good anymore
    - Trying to bring over each and every tuning option from a JR configuration to an HS one is probably a bad idea.
    - Even when moving between major versions of the same JVM, we usually recommend going back to the default (just pick a collector and heap size) and then redoing any tuning work from scratch (if even necessary).


Sunday, October 12, 2014
