Cross Column

Showing posts with label Generational Garbage Collector. Show all posts
Showing posts with label Generational Garbage Collector. Show all posts

Friday, January 10, 2014

JRockit: Parallel vs Concurrent Collectors

The mark and sweep algorithm is the basis of all commercial garbage collectors in JVMs today.[1,5,12] There are various implementations of mark and sweep algorithm in JRockit, which use different garbage collection strategies.

Unless you really know what you are doing, the –XgcPrio flag is the preferred way to tell JRockit what garbage collection strategies to run.[1] However, if you want further control over GC behavior, a more fine grained garbage collection strategy can be set from the command line using the –Xgc flag.

In this article, we will examine four Generational Garbage Collectors (Generational GCs) in JRockit, which uses nursery.  Note that JRockit refers to the young generations[2] as nurseries.


Generational Garbage Collection Strategies


The garbage collection strategy is defined by nursery, mark strategy and sweep strategy. The nursery  can either be present or not present. The mark and sweep phases can either be concurrent or parallel.

Here we only cover Generational GCs that use the nursery. For example, we will ignore "–Xgc:singlecon"—or single generational concurrent—in this discussion. As shown below, JRockit supports four different garbage collection strategies. They differ only in the using either concurrent or parallel in its mark and sweep phases.
-Xgc: option
Mark
Sweep
genconcon or gencon 
ConcurrentConcurrent
genconpar 
ConcurrentParallel
genparpar or genpar (default)
ParallelParallel
genparcon 
ParallelConcurrent

GC Mode


Running JRockit with –Xverbose:gc will output plenty of verbose information on what the JVM memory management system is doing. This information includes garbage collections, where they take place (nurseries or old space), changes of GC strategy, and the time a particular garbage collection takes. For example, we have specified the following options:
-Xverbose:gc -Xverbosedecorations=level,module,timestamp,millis,pid
to find the GC mode of each strategy uses. As you can see, different strategies are designed for different purposes:
  • Throughput, or
  • Low lantency (or short pausetimes)
-Xgc: option
GC Mode
genconcon or gencon 
Garbage collection optimized for short pausetimes, strategy: Generational Concurrent Mark & Sweep.

genconpar 
Garbage collection optimized for short pausetimes, strategy: Generational Concurrent Mark, Parallel Sweep.
genparpar or genpar 
Garbage collection optimized for throughput, strategy: Generational Parallel Mark & Sweep.

genparcon 
Garbage collection optimized for short pausetimes, strategy: Generational Parallel Mark, Concurrent Sweep.


Stopping the World (STW)


Stopping the world means that collectors are halting all executing Java threads to run a collection cycle, thus guaranteeing that new objects are not allocated and objects do not suddenly become unreachable while the collector is running. This has the obvious disadvantage that the application can perform no useful work while a collection cycle is running.

Even though an algorithm such as mark and sweep may run mostly in parallel, it is still a complicated problem that references may change during actual garbage collections. If the Java application is allowed to run, executing arbitrary field assignments and move instructions at the same time as the garbage collector tries to clean up the heap, synchronization between the application and the garbage collector is needed. Stopping the world for short periods of time is necessary for most garbage collectors and this is one of the main sources of latency and non-determinism in a runtime.



Parallel vs Concurrent Collectors


The mark-and-sweep algorithm consists of two phases:[5]
  1. Mark phase
    • In which, it finds and marks all accessible objects (or live objects).
    • Marking is very parallelizable and large parts of a mark phase can also be run concurrently with executing Java code.
  2. Sweep phase
    • In which, it scans through the heap and reclaims all the unmarked objects.
    • Sweeping and compaction (JRockit uses partial compaction to avoid fragmentation), however, tend to be more troublesome for parallelization, even though it is fairly simple to compact parts of the heap while others are swept, thus achieving better throughput.
Many stages of a mark and sweep garbage collection algorithm can be made to run concurrently with the executing Java application. Marking is the most critical of the stages, as it usually takes up around 90 percent of the total garbage collection time. The computational complexity of mark and sweep is both a function of the amount of live data on the heap (for mark) and the actual heap size (for sweep).

Parallel collectors require stop-the-world pause for the whole duration of major collection phases (mark or sweep), but employ all available cores to compress pause time. Parallel collectors usually have better throughput, but they are not a good fit for pause critical applications.

Concurrent collectors try to do most work concurrently (though they also do it in parallel on multi-core systems), stopping the application only for short duration. Note that the concurrent collection algorithm in JRockit is fairly different from both HotSpot's concurrent collectors (CMS[3] and G1[4] ).

Conclusions


In memory management, the time spent in GC is detrimental to an application's performance. So, minimizing the time spent in GC seems to be a good thing to achieve (or is it?). Collectors use simple stop-the-world strategy consumes less CPU time because it is simple and requires less bookkeeping. However, simple stop-the-world collectors halt all Java threads and they are not a good fit for pause critical applications.  Note that lantency is caused by not spending CPU cycle executing Java code.

Optimizing for low lantencies  is basically a matter of avoiding stopping the world and let the Java application run as much as possible. However, performing GC and executing Java code concurrently requires a lot more bookkeeping and thus, the total time spent in GC will be longer. Also, if the garbage collector gets too little total CPU time and it can't keep up with the allocation frequency, the heap will fill up and an OutOfMemoryError[6-8] will be thrown by the JVM. Therefore, another key to low latency is to keep heap usage and fragmentation at a proper level.

To conclude, the tradeoff in memory management is between maximizing throughput and maintaining low latencies. In the real world, we can't achieve both at the same time.

Sunday, April 7, 2013

Understanding CMS GC Logs

In this article, we will examine the following topics:
  • What's Mark and Sweep algorithm?
  • What's CMS Collector in HotSpot?
  • What's the format of CMS logs?

Mark and Sweep (MS) Algorithm


The mark-and-sweep algorithm consists of two phases[3]: In the first phase, it finds and marks all accessible objects. The first phase is called the mark phase. In the second phase, the garbage collection algorithm scans through the heap and reclaims all the unmarked objects. The second phase is called the sweep phase. The algorithm can be expressed as follows:

for each root variable r
    mark (r);
sweep ();

The computational complexity of mark and sweep is both a function of the amount of live data on the heap (for mark) and the actual heap size (for sweep).
  • Mark
    • Add each object in the root set to a queue
      • Typically, the root set contains all objects that are available without having to trace any references[5], which includes all Java objects on local frames in whatever methods the program is executing when it is halted for GC. This includes:
        • Everything we can obtain from the user stack and registers in the thread contexts of the halted program. 
        • Global data, such as static fields.
    • For each object X in the queue Mark X reachable
      • mark bit is typically associated with each reachable object. 
    • Add all objects referenced from X to the queue
    • Marking is the most critical of the stages, as it usually takes up around 90 percent of the total garbage collection time. 
      • Fortunately, marking is very parallelizable and large parts of a mark phase can also be run concurrently with executing Java code. 
  • Sweep
    • For each object X on the heap, 
      • If the X not marked, garbage collect it
    • Sweeping (or compaction which is not included in MS algorithm), however, tend to be more troublesome to be made to run concurrently with the executing Java program.

Concurrent Mark Sweep (CMS) Collector


CMS (Concurrent Mark Sweep) is one of garbage collectors implemented in HotSpot JVM.  It is enabled using:
  • -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

Note that there are two different implementations of parallel collectors for the young generation:
  • UseParNewGC
  • UseParallelGC 

UseParNewGC should be used with CMS, and UseParallelGC should be used with the throughput collector.  Also that there is another collector named iCMS,[4] which is a variant of CMS.  You should not use iCMS anymore because it is (or will be) deprecated.

CMS is designed to be mostly concurrent, requiring just two quick stop-the-world pauses per old space garbage collection cycle.   These two phases are the initial mark phase (single-threaded) and the remark phase (multithreaded).  It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.

CMS does not perform compaction in normal CMS cycle.  However, an old generation overflow will trigger a stop-the-world compacting garbage collection.
  • Par new generation (or Young Gen)
    • Young generation space is further split into 
      • Eden 
      • Survivor spaces
    • On CMS, it would have defaulted to a maximum size 665 MB if you didn't explicitly set it.
    • Survivors from the young generation are evacuated to 
      • Other survivor space, or
      • Old generation space
  • Concurrent mark-sweep generation (or Tenured Gen)
    • Does in-place de-allocation (which can lead to heap fragmentation)
    • Is managed by free lists, which need synchronization
    • Concurrent marking phase
      • Two stop-the-world pauses
        • Initial mark
          • Marks reachable (live) objects
        • Remark
          • Unmarked objects are deducted to be unreachable (dead)
    • Concurrent sweeping phase
      • Sweeps over the heap
      • In-place de-allocated unmarked (dead) objects
      • End of concurrent sweeping--all unmarked objects have been de-allocated

Enabling Logging


We have used the following HotSpot VM options to generate the log file:
  • -Xloggc:/gc_0.log 
  • -verbose:gc 
  • -XX:+PrintGCDetails 
  • -XX:+PrintGCTimeStamps 
  • -XX:+PrintReferenceGC 

CMS GC Logs


Let's take a look at some of the CMS logs generated with 1.7.0_10:

2715.210: [GC 2715.210: [ParNew2715.282: [SoftReference, 0 refs, 0.0000120 secs]2715.282: [WeakReference, 2433 refs, 0.0004150 secs]2715.283: [FinalReference, 1947 refs, 0.0074410 secs]2715.290: [PhantomReference, 0 refs, 0.0000080 secs]2715.290: [JNI Weak Reference, 0.0000040 secs]: 310460K->27980K(318912K), 0.0804180 secs] 3828421K->3561073K(4158912K), 0.0805140 secs] [Times: user=0.36 sys=0.00, real=0.08 secs]
Young generation (ParNew) collection. Young generation capacity is 318912K and after the collection its occupancy drops down from 310460K to 27980K. This collection took 0.08 secs.

2715.294: [GC [1 CMS-initial-mark: 3533092K(3840000K)] 3564157K(4158912K), 0.0418950 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]

Beginning of tenured generation collection with CMS collector. This is initial Marking phase of CMS where all the objects directly reachable from roots are marked and this is done with all the mutator threads stopped.

Capacity of tenured generation space is 3840000K and CMS was triggered at the occupancy of 3533092K

2715.337: [CMS-concurrent-mark-start]
Start of concurrent marking phase.

In Concurrent Marking phase, threads stopped in the first phase are started again and all the objects transitively reachable from the objects marked in first phase are marked here.

2717.428: [CMS-concurrent-mark: 2.091/2.091 secs] [Times: user=6.34 sys=0.08, real=2.09 secs]
Concurrent marking took total 2.091 econds cpu time and 2.091 seconds wall time that includes the yield to other threads also.

2717.428: [CMS-concurrent-preclean-start]
Start of precleaning.

Precleaning is also a concurrent phase. Here in this phase we look at the objects in CMS heap which got updated by promotions from young generation or new allocations or got updated by mutators while we were doing the concurrent marking in the previous concurrent marking phase. By rescanning those objects concurrently, the precleaning phase helps reduce the work in the next stop-the-world “remark” phase.

2717.428: [Preclean SoftReferences, 0.0000020 secs]2717.428: [Preclean WeakReferences, 0.0066370 secs]2717.434: [Preclean FinalReferences, 0.0011100 secs]2717.436: [Preclean PhantomReferences, 0.0003480 secs]2717.475: [CMS-concurrent-preclean: 0.045/0.048 secs] [Times: user=0.19 sys=0.00, real=0.04 secs]

Concurrent precleaning took 0.045 secs total cpu time and 0.048 wall time.


2717.475: [CMS-concurrent-abortable-preclean-start]

2717.876: [GC 2717.876: [ParNew2717.958: [SoftReference, 0 refs, 0.0000140 secs]2717.958: [WeakReference, 2683 refs, 0.0004340 secs]2717.959: [FinalReference, 1972 refs, 0.0069420 secs]2717.966: [PhantomReference, 0 refs, 0.0000080 secs]2717.966: [JNI Weak Reference, 0.0000050 secs]: 311500K->28150K(318912K), 0.0900310 secs] 3844593K->3577211K(4158912K), 0.0901370 secs] [Times: user=0.41 sys=0.01, real=0.09 secs]

2719.157: [CMS-concurrent-abortable-preclean: 1.582/1.681 secs] [Times: user=4.16 sys=0.07, real=1.68 secs]

2719.157: [GC[YG occupancy: 171027 K (318912 K)]2719.157: [Rescan (parallel) , 0.0680250 secs]2719.225: [weak refs processing2719.226: [SoftReference, 0 refs, 0.0000070 secs]2719.226: [WeakReference, 9940 refs, 0.0010950 secs]2719.227: [FinalReference, 10877 refs, 0.0226430 secs]2719.249: [PhantomReference, 0 refs, 0.0000050 secs]2719.249: [JNI Weak Reference, 0.0000110 secs], 0.0238550 secs]2719.249: [scrub string table, 0.0057930 secs] [1 CMS-remark: 3549060K(3840000K)] 3720087K(4158912K), 0.0989920 secs] [Times: user=0.40 sys=0.01, real=0.10 secs]
In the above log, after a preclean, 'abortable preclean' starts. After the young generation collection, the young gen occupancy drops down from 328912K to 28150K When young gen occupancy reaches 171027 K which is 52% of the total capacity, precleaning is aborted and 'remark' phase is started.
Note that in young generation occupancy also gets printed in the final remark phase.
Stop-the-world phase. This phase rescans any residual updated objects in CMS heap, retraces from the roots and also processes Reference objects. Here the rescanning work took 0.0680250 secs and weak reference objects processing took 0.0010950 secs. This phase took total 0.0238550 secs to complete.

2719.257: [CMS-concurrent-sweep-start]
Start of sweeping of dead/non-marked objects. Sweeping is concurrent phase performed with all other threads running.

2722.945: [CMS-concurrent-sweep: 3.606/3.688 secs] [Times: user=8.79 sys=0.12, real=3.69 secs]
Sweeping took 3.69 secs.

2722.945: [CMS-concurrent-reset-start]
Start of reset.

2722.973: [CMS-concurrent-reset: 0.028/0.028 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
In this phase, the CMS data structures are reinitialized so that a new cycle may begin at a later time. In this case, it took 0.03 secs.

This was how a normal CMS cycle runs.  Note that, in our experiment, we didn't see the "concurrent mode failure".  So, we didn't discuss it here.  For more information of it, read [1].

References

  1. Understanding CMS GC Logs
  2. Java GC, HotSpot's CMS and heap fragmentation
  3. Mark-and-Sweep Garbage Collection
  4. Really? iCMS? Really?
  5. Oracle JRockit--The Definitive Guide
  6. JRockit: Parallel vs Concurrent Collectors (Xml and More)
  7. The Unspoken - CMS and PrintGCDetails (Jon Masamitsu's Weblog)
  8. The Unspoken - Phases of CMS (Jon Masamitsu's Weblog)

Friday, September 14, 2012

HotSpot Performance Option — SurvivorRatio

As shown in [1], -XX:SurvivorRatio has been classified as a performance option on HotSpot VM.

In this article, we will discuss how this option works and how it can impact an application's performance.

Survivor Spaces


The young generation space in all HotSpot garbage collectors is subdivided into an eden space and two survivor spaces.  Eden space is where new Java objects are allocated.  One of the survivor spaces is labeled the “from” survivor space, and the other survivor space is labeled the “to” survivor space.

If during a minor garbage collection, the “to” survivor space is not large enough to hold all of the live objects being copied from the eden space and the “from” survivor space, the overflow will be promoted to the old generation space. Overflowing into the old generation space may cause the old generation space to grow more quickly than desired and result in an eventual stop-the-world compacting full garbage collection.

If you are interested to learn more on garbage collector, read this excellent book— "Java Performance."


-XX:SurvivorRatio


For the following discussions, we will use the following HotSpot command options:
  • -server -Xms2560m -Xmx2560m -Xmn512m -XX:SurvivorRatio=4
In other words, we have set Young Generation size to be 512MB and the Survivor Ratio to be 4.  Since we didn't specify which Garbage Collector to be used, the default GC (i.e., -XX+UseParallelGC ) is selected.

When you set the SurvivorRatio to be 4, it means that:
  • The ratio of eden/survivor space size will be 4. The default value is 8.
Below we will see how the setting of SurvivorRatio will impact the sizes of  survivor spaces.

If InitialSurvivorRatio or MinSurvivorRatio were not specified, but the SurvivorRatio has been set, their values will be set to:
  • SurvivorRatio + 2
Then the calculation is as follows (Note that young_gen_size is the value specified with -Xmn):

  size_t survivor_size = young_gen_size / InitialSurvivorRatio;
  eden_size = size - (2 * survivor_size);


So, for SurvivorRatio = 4, different spaces are derived as follows:

 InitialSurvivorRatio = SurvivorRatio + 2 = 6
 survivor_size = 512m / 6 = 87360K  (Note that young_gen_size = 512m)
 eden = young_gen_size - 2 * survivor_size = 512m - 2 *  87360K  = 349568K
 young_gen_total(as reported in the gc print below) = eden + 1*survivor = 436928K

How to Verify?


When we started the application, we have specified the following GC-related print options:
  • -Xloggc:/<your_path>/logs/gc_0.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC 

After the application (i.e., a benchmark in our case) ends, you can find the following printouts at the end of gc_0.log:

Heap
PSYoungGen total 436928K, used 64296K [0x00000007e0000000, 0x0000000800000000, 0x0000000800000000) 
  eden space 349568K, 15% used [0x00000007e0000000,0x00000007e3571608,0x00000007f5560000)
  from space 87360K, 10% used [0x00000007f5560000,0x00000007f5eb8cd0,0x00000007faab0000)
  to space 87360K, 0% used [0x00000007faab0000,0x00000007faab0000,0x0000000800000000)
ParOldGen total 2097152K, used 1457708K 
[0x0000000760000000, 0x00000007e0000000, 0x00000007e0000000) 
  object space 2097152K, 69% used [0x0000000760000000,0x00000007b8f8b2a8,0x00000007e0000000)
PSPermGen total 393216K, used 246405K 
[0x0000000748000000, 0x0000000760000000, 0x0000000760000000) 
  object space 393216K, 62% used [0x0000000748000000,0x00000007570a14c0,0x0000000760000000)

As you can see it:
  • eden space / from space = 349568K / 87360K = 4
  • PSYoungGen total =  349568K +  87360K = 436928K

Why It's a Performance Option?


Given there is limited physical memory, you can increase heap to a certain limited size.  After that, you can tune survivor ratio to see if short lived objects can be allowed a longer time period to die in the young generation  (or if they can be promoted less directly into the old generation), which could help overall response time.

Bottom line: larger survivor spaces allow short lived objects a longer time period to die in the young generation and this helps application's response time (in other words, there will be less long-paused full garbage collections).

Acknowledgement


Some of the writings here are based on the feedback from Jon Masamitsu and Scott Oaks. However, the author would assume the full responsibility for the content himself.

References

  1. Java HotSpot VM Options
  2. The Fault with Defaults
  3. Garbage Collector Ergonomics
  4. Java Performance by Charlie Hunt and Binu John

© Travel for Life Guide. All Rights Reserved.

Analytical Insights on Health, Culture, and Security.