Friday, January 10, 2014

JRockit: Parallel vs Concurrent Collectors

The mark and sweep algorithm is the basis of all commercial garbage collectors in JVMs today.[1,5,12] There are various implementations of mark and sweep algorithm in JRockit, which use different garbage collection strategies.

Unless you really know what you are doing, the –XgcPrio flag is the preferred way to tell JRockit what garbage collection strategies to run.[1] However, if you want further control over GC behavior, a more fine grained garbage collection strategy can be set from the command line using the –Xgc flag.

In this article, we will examine four Generational Garbage Collectors (Generational GCs) in JRockit, which uses nursery.  Note that JRockit refers to the young generations[2] as nurseries.


Generational Garbage Collection Strategies


The garbage collection strategy is defined by nursery, mark strategy and sweep strategy. The nursery  can either be present or not present. The mark and sweep phases can either be concurrent or parallel.

Here we only cover Generational GCs that use the nursery. For example, we will ignore "–Xgc:singlecon"—or single generational concurrent—in this discussion. As shown below, JRockit supports four different garbage collection strategies. They differ only in the using either concurrent or parallel in its mark and sweep phases.
-Xgc: option
Mark
Sweep
genconcon or gencon 
ConcurrentConcurrent
genconpar 
ConcurrentParallel
genparpar or genpar (default)
ParallelParallel
genparcon 
ParallelConcurrent

GC Mode


Running JRockit with –Xverbose:gc will output plenty of verbose information on what the JVM memory management system is doing. This information includes garbage collections, where they take place (nurseries or old space), changes of GC strategy, and the time a particular garbage collection takes. For example, we have specified the following options:
-Xverbose:gc -Xverbosedecorations=level,module,timestamp,millis,pid
to find the GC mode of each strategy uses. As you can see, different strategies are designed for different purposes:
  • Throughput, or
  • Low lantency (or short pausetimes)
-Xgc: option
GC Mode
genconcon or gencon 
Garbage collection optimized for short pausetimes, strategy: Generational Concurrent Mark & Sweep.

genconpar 
Garbage collection optimized for short pausetimes, strategy: Generational Concurrent Mark, Parallel Sweep.
genparpar or genpar 
Garbage collection optimized for throughput, strategy: Generational Parallel Mark & Sweep.

genparcon 
Garbage collection optimized for short pausetimes, strategy: Generational Parallel Mark, Concurrent Sweep.


Stopping the World (STW)


Stopping the world means that collectors are halting all executing Java threads to run a collection cycle, thus guaranteeing that new objects are not allocated and objects do not suddenly become unreachable while the collector is running. This has the obvious disadvantage that the application can perform no useful work while a collection cycle is running.

Even though an algorithm such as mark and sweep may run mostly in parallel, it is still a complicated problem that references may change during actual garbage collections. If the Java application is allowed to run, executing arbitrary field assignments and move instructions at the same time as the garbage collector tries to clean up the heap, synchronization between the application and the garbage collector is needed. Stopping the world for short periods of time is necessary for most garbage collectors and this is one of the main sources of latency and non-determinism in a runtime.



Parallel vs Concurrent Collectors


The mark-and-sweep algorithm consists of two phases:[5]
  1. Mark phase
    • In which, it finds and marks all accessible objects (or live objects).
    • Marking is very parallelizable and large parts of a mark phase can also be run concurrently with executing Java code.
  2. Sweep phase
    • In which, it scans through the heap and reclaims all the unmarked objects.
    • Sweeping and compaction (JRockit uses partial compaction to avoid fragmentation), however, tend to be more troublesome for parallelization, even though it is fairly simple to compact parts of the heap while others are swept, thus achieving better throughput.
Many stages of a mark and sweep garbage collection algorithm can be made to run concurrently with the executing Java application. Marking is the most critical of the stages, as it usually takes up around 90 percent of the total garbage collection time. The computational complexity of mark and sweep is both a function of the amount of live data on the heap (for mark) and the actual heap size (for sweep).

Parallel collectors require stop-the-world pause for the whole duration of major collection phases (mark or sweep), but employ all available cores to compress pause time. Parallel collectors usually have better throughput, but they are not a good fit for pause critical applications.

Concurrent collectors try to do most work concurrently (though they also do it in parallel on multi-core systems), stopping the application only for short duration. Note that the concurrent collection algorithm in JRockit is fairly different from both HotSpot's concurrent collectors (CMS[3] and G1[4] ).

Conclusions


In memory management, the time spent in GC is detrimental to an application's performance. So, minimizing the time spent in GC seems to be a good thing to achieve (or is it?). Collectors use simple stop-the-world strategy consumes less CPU time because it is simple and requires less bookkeeping. However, simple stop-the-world collectors halt all Java threads and they are not a good fit for pause critical applications.  Note that lantency is caused by not spending CPU cycle executing Java code.

Optimizing for low lantencies  is basically a matter of avoiding stopping the world and let the Java application run as much as possible. However, performing GC and executing Java code concurrently requires a lot more bookkeeping and thus, the total time spent in GC will be longer. Also, if the garbage collector gets too little total CPU time and it can't keep up with the allocation frequency, the heap will fill up and an OutOfMemoryError[6-8] will be thrown by the JVM. Therefore, another key to low latency is to keep heap usage and fragmentation at a proper level.

To conclude, the tradeoff in memory management is between maximizing throughput and maintaining low latencies. In the real world, we can't achieve both at the same time.

No comments: