Insights on Oracle & Tech: Generational Garbage Collector

Showing posts with label Generational Garbage Collector. Show all posts

Friday, January 10, 2014

JRockit: Parallel vs Concurrent Collectors

The mark and sweep algorithm is the basis of all commercial garbage collectors in JVMs today.^[1,5,12] There are various implementations of mark and sweep algorithm in JRockit, which use different garbage collection strategies.

Unless you really know what you are doing, the –XgcPrio flag is the preferred way to tell JRockit what garbage collection strategies to run.^[1] However, if you want further control over GC behavior, a more fine grained garbage collection strategy can be set from the command line using the –Xgc flag.

In this article, we will examine four Generational Garbage Collectors (Generational GCs) in JRockit, which uses nursery. Note that JRockit refers to the young generations^[2] as nurseries.

Generational Garbage Collection Strategies

The garbage collection strategy is defined by nursery, mark strategy and sweep strategy. The nursery can either be present or not present. The mark and sweep phases can either be concurrent or parallel.

Here we only cover Generational GCs that use the nursery. For example, we will ignore "–Xgc:singlecon"—or single generational concurrent—in this discussion. As shown below, JRockit supports four different garbage collection strategies. They differ only in the using either concurrent or parallel in its mark and sweep phases.

-Xgc: option	Mark	Sweep
genconcon or gencon	Concurrent	Concurrent
genconpar	Concurrent	Parallel
genparpar or genpar (default)	Parallel	Parallel
genparcon	Parallel	Concurrent

GC Mode

Running JRockit with –Xverbose:gc will output plenty of verbose information on what the JVM memory management system is doing. This information includes garbage collections, where they take place (nurseries or old space), changes of GC strategy, and the time a particular garbage collection takes. For example, we have specified the following options:

-Xverbose:gc -Xverbosedecorations=level,module,timestamp,millis,pid

to find the GC mode of each strategy uses. As you can see, different strategies are designed for different purposes:

Throughput, or
Low lantency (or short pausetimes)

-Xgc: option	GC Mode
genconcon or gencon	Garbage collection optimized for short pausetimes, strategy: Generational Concurrent Mark & Sweep.
genconpar	Garbage collection optimized for short pausetimes, strategy: Generational Concurrent Mark, Parallel Sweep.
genparpar or genpar	Garbage collection optimized for throughput, strategy: Generational Parallel Mark & Sweep.
genparcon	Garbage collection optimized for short pausetimes, strategy: Generational Parallel Mark, Concurrent Sweep.

Stopping the World (STW)

Stopping the world means that collectors are halting all executing Java threads to run a collection cycle, thus guaranteeing that new objects are not allocated and objects do not suddenly become unreachable while the collector is running. This has the obvious disadvantage that the application can perform no useful work while a collection cycle is running.

Even though an algorithm such as mark and sweep may run mostly in parallel, it is still a complicated problem that references may change during actual garbage collections. If the Java application is allowed to run, executing arbitrary field assignments and move instructions at the same time as the garbage collector tries to clean up the heap, synchronization between the application and the garbage collector is needed. Stopping the world for short periods of time is necessary for most garbage collectors and this is one of the main sources of latency and non-determinism in a runtime.

Parallel vs Concurrent Collectors

The mark-and-sweep algorithm consists of two phases:^[5]

Mark phase

In which, it finds and marks all accessible objects (or live objects).
Marking is very parallelizable and large parts of a mark phase can also be run concurrently with executing Java code.

Sweep phase

In which, it scans through the heap and reclaims all the unmarked objects.
Sweeping and compaction (JRockit uses partial compaction to avoid fragmentation), however, tend to be more troublesome for parallelization, even though it is fairly simple to compact parts of the heap while others are swept, thus achieving better throughput.

Many stages of a mark and sweep garbage collection algorithm can be made to run concurrently with the executing Java application. Marking is the most critical of the stages, as it usually takes up around 90 percent of the total garbage collection time. The computational complexity of mark and sweep is both a function of the amount of live data on the heap (for mark) and the actual heap size (for sweep).

Parallel collectors require stop-the-world pause for the whole duration of major collection phases (mark or sweep), but employ all available cores to compress pause time. Parallel collectors usually have better throughput, but they are not a good fit for pause critical applications.

Concurrent collectors try to do most work concurrently (though they also do it in parallel on multi-core systems), stopping the application only for short duration. Note that the concurrent collection algorithm in JRockit is fairly different from both HotSpot's concurrent collectors (CMS^[3] and G1^[4] ).

Conclusions

In memory management, the time spent in GC is detrimental to an application's performance. So, minimizing the time spent in GC seems to be a good thing to achieve (or is it?). Collectors use simple stop-the-world strategy consumes less CPU time because it is simple and requires less bookkeeping. However, simple stop-the-world collectors halt all Java threads and they are not a good fit for pause critical applications. Note that lantency is caused by not spending CPU cycle executing Java code.

Optimizing for low lantencies is basically a matter of avoiding stopping the world and let the Java application run as much as possible. However, performing GC and executing Java code concurrently requires a lot more bookkeeping and thus, the total time spent in GC will be longer. Also, if the garbage collector gets too little total CPU time and it can't keep up with the allocation frequency, the heap will fill up and an OutOfMemoryError^[6-8] will be thrown by the JVM. Therefore, another key to low latency is to keep heap usage and fragmentation at a proper level.

To conclude, the tradeoff in memory management is between maximizing throughput and maintaining low latencies. In the real world, we can't achieve both at the same time.

References

Oracle JRockit- The Definitive Guide
Understanding Garbage Collection (HotSpot)
Understanding CMS GC Logs (HotSpot)
Garbage First Garbage Collector Tuning (HotSpot)
Mark-and-Sweep Garbage Collection
How to Debug Native OutOfMemory in JRockit (Xml and More)
Diagnosing OutOfMemoryError or Memory Leaks in JRockit (XML and More)
Diagnosing Java.lang.OutOfMemoryError (XML and More)
JRockit: All Posts on "Xml and More"
JRockit R27.8.1 and R28.3.1 versioning

Note that R28 went from R28.2.9 to R28.3.1—these are just ordinary maintenance releases, not feature releases. There is zero significance to the jump in minor version number.

JRockit: Out-of-the-Box Behavior of Four Generational GC's (Xml and More)

Sunday, April 7, 2013

Understanding CMS GC Logs

In this article, we will examine the following topics:

What's Mark and Sweep algorithm?
What's CMS Collector in HotSpot?
What's the format of CMS logs?

Mark and Sweep (MS) Algorithm

The mark-and-sweep algorithm consists of two phases^[3]: In the first phase, it finds and marks all accessible objects. The first phase is called the mark phase. In the second phase, the garbage collection algorithm scans through the heap and reclaims all the unmarked objects. The second phase is called the sweep phase. The algorithm can be expressed as follows:

for each root variable r
    mark (r);
sweep ();

The computational complexity of mark and sweep is both a function of the amount of live data on the heap (for mark) and the actual heap size (for sweep).

Mark
- Add each object in the root set to a queue
- For each object X in the queue Mark X reachable
- Add all objects referenced from X to the queue
- Marking is the most critical of the stages, as it usually takes up around 90 percent of the total garbage collection time.
  - Fortunately, marking is very parallelizable and large parts of a mark phase can also be run concurrently with executing Java code.
Sweep
- For each object X on the heap,
- Sweeping (or compaction which is not included in MS algorithm), however, tend to be more troublesome to be made to run concurrently with the executing Java program.

Concurrent Mark Sweep (CMS) Collector

CMS (Concurrent Mark Sweep) is one of garbage collectors implemented in HotSpot JVM. It is enabled using:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC

Note that there are two different implementations of parallel collectors for the young generation:

UseParNewGC
UseParallelGC

UseParNewGC should be used with CMS, and UseParallelGC should be used with the throughput collector. Also that there is another collector named iCMS,^[4] which is a variant of CMS. You should not use iCMS anymore because it is (or will be) deprecated.

CMS is designed to be mostly concurrent, requiring just two quick stop-the-world pauses per old space garbage collection cycle. These two phases are the initial mark phase (single-threaded) and the remark phase (multithreaded). It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.

CMS does not perform compaction in normal CMS cycle. However, an old generation overflow will trigger a stop-the-world compacting garbage collection.

Par new generation (or Young Gen)
- Young generation space is further split into
- On CMS, it would have defaulted to a maximum size 665 MB if you didn't explicitly set it.
- Survivors from the young generation are evacuated to
Concurrent mark-sweep generation (or Tenured Gen)
- Does in-place de-allocation (which can lead to heap fragmentation)
- Is managed by free lists, which need synchronization
- Concurrent marking phase
- Concurrent sweeping phase

Enabling Logging

We have used the following HotSpot VM options to generate the log file:

-Xloggc:/gc_0.log
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintReferenceGC

CMS GC Logs

Let's take a look at some of the CMS logs generated with 1.7.0_10:

2715.210: [GC 2715.210: [ParNew2715.282: [SoftReference, 0 refs, 0.0000120 secs]2715.282: [WeakReference, 2433 refs, 0.0004150 secs]2715.283: [FinalReference, 1947 refs, 0.0074410 secs]2715.290: [PhantomReference, 0 refs, 0.0000080 secs]2715.290: [JNI Weak Reference, 0.0000040 secs]: 310460K->27980K(318912K), 0.0804180 secs] 3828421K->3561073K(4158912K), 0.0805140 secs] [Times: user=0.36 sys=0.00, real=0.08 secs]

Young generation (ParNew) collection. Young generation capacity is 318912K and after the collection its occupancy drops down from 310460K to 27980K. This collection took 0.08 secs.

2715.294: [GC [1 CMS-initial-mark: 3533092K(3840000K)] 3564157K(4158912K), 0.0418950 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]

Beginning of tenured generation collection with CMS collector. This is initial Marking phase of CMS where all the objects directly reachable from roots are marked and this is done with all the mutator threads stopped.

Capacity of tenured generation space is 3840000K and CMS was triggered at the occupancy of 3533092K

2715.337: [CMS-concurrent-mark-start]

Start of concurrent marking phase.

In Concurrent Marking phase, threads stopped in the first phase are started again and all the objects transitively reachable from the objects marked in first phase are marked here.

2717.428: [CMS-concurrent-mark: 2.091/2.091 secs] [Times: user=6.34 sys=0.08, real=2.09 secs]

Concurrent marking took total 2.091 econds cpu time and 2.091 seconds wall time that includes the yield to other threads also.

2717.428: [CMS-concurrent-preclean-start]

Start of precleaning.

Precleaning is also a concurrent phase. Here in this phase we look at the objects in CMS heap which got updated by promotions from young generation or new allocations or got updated by mutators while we were doing the concurrent marking in the previous concurrent marking phase. By rescanning those objects concurrently, the precleaning phase helps reduce the work in the next stop-the-world “remark” phase.

2717.428: [Preclean SoftReferences, 0.0000020 secs]2717.428: [Preclean WeakReferences, 0.0066370 secs]2717.434: [Preclean FinalReferences, 0.0011100 secs]2717.436: [Preclean PhantomReferences, 0.0003480 secs]2717.475: [CMS-concurrent-preclean: 0.045/0.048 secs] [Times: user=0.19 sys=0.00, real=0.04 secs]

Concurrent precleaning took 0.045 secs total cpu time and 0.048 wall time.

2717.475: [CMS-concurrent-abortable-preclean-start]

2717.876: [GC 2717.876: [ParNew2717.958: [SoftReference, 0 refs, 0.0000140 secs]2717.958: [WeakReference, 2683 refs, 0.0004340 secs]2717.959: [FinalReference, 1972 refs, 0.0069420 secs]2717.966: [PhantomReference, 0 refs, 0.0000080 secs]2717.966: [JNI Weak Reference, 0.0000050 secs]: 311500K->28150K(318912K), 0.0900310 secs] 3844593K->3577211K(4158912K), 0.0901370 secs] [Times: user=0.41 sys=0.01, real=0.09 secs]

2719.157: [CMS-concurrent-abortable-preclean: 1.582/1.681 secs] [Times: user=4.16 sys=0.07, real=1.68 secs]

2719.157: [GC[YG occupancy: 171027 K (318912 K)]2719.157: [Rescan (parallel) , 0.0680250 secs]2719.225: [weak refs processing2719.226: [SoftReference, 0 refs, 0.0000070 secs]2719.226: [WeakReference, 9940 refs, 0.0010950 secs]2719.227: [FinalReference, 10877 refs, 0.0226430 secs]2719.249: [PhantomReference, 0 refs, 0.0000050 secs]2719.249: [JNI Weak Reference, 0.0000110 secs], 0.0238550 secs]2719.249: [scrub string table, 0.0057930 secs] [1 CMS-remark: 3549060K(3840000K)] 3720087K(4158912K), 0.0989920 secs] [Times: user=0.40 sys=0.01, real=0.10 secs]

In the above log, after a preclean, 'abortable preclean' starts. After the young generation collection, the young gen occupancy drops down from 328912K to 28150K When young gen occupancy reaches 171027 K which is 52% of the total capacity, precleaning is aborted and 'remark' phase is started.

Note that in young generation occupancy also gets printed in the final remark phase.

Stop-the-world phase. This phase rescans any residual updated objects in CMS heap, retraces from the roots and also processes Reference objects. Here the rescanning work took 0.0680250 secs and weak reference objects processing took 0.0010950 secs. This phase took total 0.0238550 secs to complete.

2719.257: [CMS-concurrent-sweep-start]

Start of sweeping of dead/non-marked objects. Sweeping is concurrent phase performed with all other threads running.

2722.945: [CMS-concurrent-sweep: 3.606/3.688 secs] [Times: user=8.79 sys=0.12, real=3.69 secs]

Sweeping took 3.69 secs.

2722.945: [CMS-concurrent-reset-start]

Start of reset.

2722.973: [CMS-concurrent-reset: 0.028/0.028 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]

In this phase, the CMS data structures are reinitialized so that a new cycle may begin at a later time. In this case, it took 0.03 secs.

This was how a normal CMS cycle runs. Note that, in our experiment, we didn't see the "concurrent mode failure". So, we didn't discuss it here. For more information of it, read [1].

References

Understanding CMS GC Logs
Java GC, HotSpot's CMS and heap fragmentation
Mark-and-Sweep Garbage Collection
Really? iCMS? Really?
Oracle JRockit--The Definitive Guide
JRockit: Parallel vs Concurrent Collectors (Xml and More)
The Unspoken - CMS and PrintGCDetails (Jon Masamitsu's Weblog)
The Unspoken - Phases of CMS (Jon Masamitsu's Weblog)

Friday, September 14, 2012

HotSpot Performance Option — SurvivorRatio

As shown in [1], -XX:SurvivorRatio has been classified as a performance option on HotSpot VM.

In this article, we will discuss how this option works and how it can impact an application's performance.

Survivor Spaces

The young generation space in all HotSpot garbage collectors is subdivided into an eden space and two survivor spaces. Eden space is where new Java objects are allocated. One of the survivor spaces is labeled the “from” survivor space, and the other survivor space is labeled the “to” survivor space.

If during a minor garbage collection, the “to” survivor space is not large enough to hold all of the live objects being copied from the eden space and the “from” survivor space, the overflow will be promoted to the old generation space. Overflowing into the old generation space may cause the old generation space to grow more quickly than desired and result in an eventual stop-the-world compacting full garbage collection.

If you are interested to learn more on garbage collector, read this excellent book— "Java Performance."

-XX:SurvivorRatio

For the following discussions, we will use the following HotSpot command options:

-server -Xms2560m -Xmx2560m -Xmn512m -XX:SurvivorRatio=4

In other words, we have set Young Generation size to be 512MB and the Survivor Ratio to be 4. Since we didn't specify which Garbage Collector to be used, the default GC (i.e., -XX+UseParallelGC ) is selected.

When you set the SurvivorRatio to be 4, it means that:

The ratio of eden/survivor space size will be 4. The default value is 8.

Below we will see how the setting of SurvivorRatio will impact the sizes of survivor spaces.

If InitialSurvivorRatio or MinSurvivorRatio were not specified, but the SurvivorRatio has been set, their values will be set to:

SurvivorRatio + 2

Then the calculation is as follows (Note that young_gen_size is the value specified with -Xmn):


  size_t survivor_size = young_gen_size / InitialSurvivorRatio;

  eden_size = size - (2 * survivor_size);

So, for SurvivorRatio = 4, different spaces are derived as follows:


 InitialSurvivorRatio = SurvivorRatio + 2 = 6
 survivor_size = 512m / 6 = 87360K  (Note that young_gen_size = 512m)

 eden = young_gen_size - 2 * survivor_size = 512m - 2 * 
87360K  = 349568K

 young_gen_total(as reported in the gc print below) = eden + 1*survivor = 436928K

How to Verify?

When we started the application, we have specified the following GC-related print options:

-Xloggc:/<your_path>/logs/gc_0.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC

After the application (i.e., a benchmark in our case) ends, you can find the following printouts at the end of gc_0.log:

Heap
PSYoungGen total 436928K, used 64296K [0x00000007e0000000, 0x0000000800000000, 0x0000000800000000)

eden space 349568K, 15% used [0x00000007e0000000,0x00000007e3571608,0x00000007f5560000)
from space 87360K, 10% used [0x00000007f5560000,0x00000007f5eb8cd0,0x00000007faab0000)
to space 87360K, 0% used [0x00000007faab0000,0x00000007faab0000,0x0000000800000000)
ParOldGen total 2097152K, used 1457708K [0x0000000760000000, 0x00000007e0000000, 0x00000007e0000000)

object space 2097152K, 69% used [0x0000000760000000,0x00000007b8f8b2a8,0x00000007e0000000)
PSPermGen total 393216K, used 246405K [0x0000000748000000, 0x0000000760000000, 0x0000000760000000)

object space 393216K, 62% used [0x0000000748000000,0x00000007570a14c0,0x0000000760000000)

As you can see it:

eden space / from space = 349568K / 87360K = 4
PSYoungGen total = 349568K + 87360K = 436928K

Why It's a Performance Option?

Given there is limited physical memory, you can increase heap to a certain limited size. After that, you can tune survivor ratio to see if short lived objects can be allowed a longer time period to die in the young generation (or if they can be promoted less directly into the old generation), which could help overall response time.

Bottom line: larger survivor spaces allow short lived objects a longer time period to die in the young generation and this helps application's response time (in other words, there will be less long-paused full garbage collections).

Acknowledgement

Some of the writings here are based on the feedback from Jon Masamitsu and Scott Oaks. However, the author would assume the full responsibility for the content himself.

Cross Column

Friday, January 10, 2014

JRockit: Parallel vs Concurrent Collectors

Generational Garbage Collection Strategies

GC Mode

Stopping the World (STW)

Parallel vs Concurrent Collectors

Conclusions

References

Sunday, April 7, 2013

Understanding CMS GC Logs

Mark and Sweep (MS) Algorithm

Concurrent Mark Sweep (CMS) Collector

Enabling Logging

CMS GC Logs

References

Friday, September 14, 2012

HotSpot Performance Option — SurvivorRatio

Survivor Spaces

-XX:SurvivorRatio

How to Verify?

Why It's a Performance Option?

Acknowledgement

References