
Showing posts with label Garbage Collector.

Sunday, November 2, 2014

G1 GC: What Is "to-space exhausted" in GC Log?

As shown in [1], you can enable basic garbage collection (GC) event printing using two options:
  • -XX:+PrintGCTimeStamps (or -XX:+PrintGCDateStamps) 
  • -XX:+PrintGCDetails
This additional printing incurs minimal overhead while providing useful information for understanding the health of your running garbage collector (e.g., the Garbage First (G1) GC).
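For example, the options above can be combined on a single command line (a hypothetical invocation; MyApp and gc.log are placeholders):

```
java -XX:+UseG1GC -Xloggc:gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCDetails MyApp
```

Note that these flags belong to the pre-JDK 9 logging scheme; JDK 9 and later replace them with unified logging (-Xlog:gc*).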

In this article, we will discuss the significance of to-space exhausted events in your GC logs.

What Is "to-space exhausted"?


When you see to-space exhausted messages in GC logs, it means that G1 does not have enough memory for survivor objects, tenured objects, or both, and the Java heap cannot be expanded further since it is already at its maximum.[5]  This event shows up during evacuation pauses (mixed GC's in our case), when G1 tries to copy live objects from source regions to destination regions.  Note that G1 avoids heap fragmentation through compaction, which is done as objects are copied from source regions to destination regions.

Example message:

5843.494: [GC pause (G1 Evacuation Pause) (mixed)... (to-space exhausted), 0.1145790 secs]

When that happens, the GC pause time will be longer (since both the RSet and CSet[5] need to be regenerated), and sometimes a Full GC may be required to resolve the issue.

A Case Study of "to-space exhausted" Events


Using a benchmark with a large live data size,[3] a "to-space exhausted" event can be triggered if there is not enough breathing room for shuffling live data objects in the Java heap.  For example, we have a test case whose live data set occupies 68% of the Java heap at run time.  In a 4-hour run, we saw the following sequence of events:

5843.494: [GC pause (G1 Evacuation Pause) (mixed)... (to-space exhausted), 0.1145790 secs]
5848.831: [GC pause (G1 Evacuation Pause) (mixed)... (to-space exhausted), 0.0847160 secs]
5856.640: [GC pause (G1 Evacuation Pause) (mixed)... 0.0864060 secs]
5866.182: [GC pause (G1 Evacuation Pause) (mixed)... 0.0821220 secs]
5875.550: [GC pause (G1 Evacuation Pause) (mixed)... 0.0549090 secs]
5882.353: [GC pause (G1 Evacuation Pause) (mixed)... 0.0827630 secs]
5896.331: [GC pause (G1 Evacuation Pause) (mixed)... 0.0632660 secs]
5901.966: [GC pause (G1 Evacuation Pause) (mixed)... 0.5356580 secs]
5902.813: [GC pause (G1 Evacuation Pause) (mixed)... (to-space exhausted), 1.3480320 secs]
5904.162: [Full GC (Allocation Failure) ... 5.0413590 secs]

A series of "to-space exhausted" events eventually led to a Full GC.  As discussed in [4], adding -Xmn400m on the command line of the same test case added insult to injury: instead of seeing 3 "to-space exhausted" events, we now saw 20 of them.  This is because the setting further cut down G1's breathing room.



How to Deal with "to-space exhausted" Events?


"to-space exhausted" events happen when G1 runs out of space (either survivor space or old space) to copy live objects to.  When that happens, sometimes G1's ergonomics can work around the issue by dynamically resizing heap regions.  If not, it eventually leads to a Full GC event.

When "to-space exhausted" events happen, it could mean you have allocated a very tight heap for your application.  The easiest fix, if possible, is to increase your heap size. Alternatively, it is sometimes possible to tune G1's mixed GC to get by (see [4]). As shown in the section above, we have a high live data occupancy (i.e., 68%) in the Java heap.  However, after tuning mixed GC, we can get by with only a couple of Full GC's in a 4-hour run, which yields a satisfactory result.
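As a sketch of the first approach (the flag values here are illustrative assumptions, not recommendations), you could grow the heap and also enlarge G1's to-space reserve, which sets aside extra free heap for evacuation:

```
java -XX:+UseG1GC -Xms8g -Xmx8g -XX:G1ReservePercent=20 MyApp
```

-XX:G1ReservePercent (default 10) increases the amount of heap kept in reserve as "to" space, at the cost of some usable heap.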

Note that a mixed GC cleans up both young and old regions (if it does not take any old region, it is a young(-only) GC). G1's mixed GC's are meant as a replacement for Full GC's: instead of doing all the work at once, the idea is to accomplish the same task in steps, each of which is much smaller than a Full GC. Because G1 does not try to reclaim all garbage in one cycle, as a Full GC does, a mixed GC uses less time.  A Full GC, however, is a compacting event that can work around the "to-space exhausted" issue without needing extra space.

Acknowledgement


Some writings here are based on feedback from Thomas Schatzl and Yu Zhang. However, the author assumes full responsibility for the content.

References

  1. g1gc logs - basic - how to print and how to understand
  2. g1gc logs - Ergonomics - how to print and how to understand
  3. JRockit: How to Estimate the Size of Live Data Set
  4. JDK 8: Is Tuning MaxNewSize in G1 GC a Good Idea?
  5. Garbage First Garbage Collector Tuning
    • Note that the "to-space overflow" message was removed and only "to-space exhausted" remains.
    • RSet (Remembered Set)—Tracks object references into a given region. There is one RSet per region in the heap. The RSet enables the parallel and independent collection of a region. The overall footprint impact of RSets is less than 5%.
    • CSet (Collection Set)—Are the set of regions that will be collected in a GC. All live data in a CSet is evacuated (copied/moved) during a GC. Sets of regions can be Eden, survivor, and/or old generation. CSets have a less than 1% impact on the size of the JVM. 
  6. HotSpot Virtual Machine Garbage Collection Tuning Guide
  7. Garbage-First Garbage Collector Tuning
  8. Getting Started with the G1 Garbage Collector
  9. Garbage-First Garbage Collector (JDK 8 HotSpot Virtual Machine Garbage Collection Tuning Guide)
  10. Other JDK 8 articles on Xml and More
  11. Concurrent Marking in G1

Saturday, January 11, 2014

JRockit: Out-of-the-Box Behavior of Four Generational GC's

In [2], we have presented four Generational GCs in JRockit:
  • -Xgc:gencon
  • -Xgc:genpar
  • -Xgc:genconpar
  • -Xgc:genparcon
In this article, we will examine their Out-of-the-box (OOTB)[7] behaviors, especially on adaptive (or automatic) memory management in JRockit.

Adaptive Memory Management


JRockit was the first JVM to recognize that adaptive optimizations based on feedback could be applied to all subsystems in the runtime,[1] which include:
  • Code Generation
  • Memory Management
  • Threads and Synchronization

In this article, we will focus mainly on the memory management of R28. In R28, JRockit adaptively modifies many aspects of the garbage collection, but to a lesser extent than R27.[1]

Adaptive optimizations based on runtime feedback work in this way:[1] In the beginning, changes are fairly frequent, but after a warm-up period of steady-state behavior, the idea is that the JVM should settle upon an optimal algorithm. If, after a while, the steady-state behavior changes from one kind to another, the JVM may once again switch to a more optimal strategy.

So, JRockit may heuristically change garbage collection behavior at runtime, based on feedback from the memory system by doing:
  • Changing GC strategies
  • Automatic heap resizing
  • Getting rid of memory fragmentation at the right intervals
  • Recognizing when it is appropriate to "stop the world"
  • Changing the number of garbage collecting threads

-Xverbose:gc flag


To investigate the OOTB behavior of the four Generational GC's mentioned above (see also [2]), we have used:
  • -Xverbose:gc flag
Typically, the log shows things such as garbage collection strategy changes and heap size adjustments, as well as when a garbage collection takes place and for how long.
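For example, a JRockit run with the verbose GC log enabled might be started as follows (MyApp and the heap sizes are placeholders):

```
java -Xverbose:gc -Xgc:gencon -Xms2g -Xmx2g MyApp
```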

OOTB Behavior


The OOTB behavior of the four Generational GC's was investigated using one of our benchmarks, which has the following characteristics (see also [3,4]):
  • High churning rate
  • Allocating large objects

Here are the Average Response Time (ART; on relative scale) of four different GC strategies:
  • -Xgc:genpar
    • Baseline
  • -Xgc:gencon
    • -3.21%
  • -Xgc:genconpar
    • -36.56%
  • -Xgc:genparcon
    • -17.60%
Our tests show that the default throughput GC (i.e., genpar) performs best, while the low-pause-time GC's lag behind. There are a couple of reasons:
  • Large live data set
    • Our benchmark has a large live data size (i.e., 1,471,425 KB) and we have assigned it a relatively tight heap space (i.e., 2 GB).
  • The concurrent sweeping phase cannot keep up with the workload generated by the marking phase
    • This is manifested by the following facts:
      • An emergency parallel sweep was requested for both genconcon and genparcon
      • But not for genconpar

Changing GC Strategies at Runtime


The mark-and-sweep algorithm consists of two phases:[1,5,8]
  1. Mark phase
    • In which, it finds and marks all accessible objects (or live objects)
  2. Sweep phase
    • In which, it scans through the heap and reclaims all the unmarked objects
In the GC log of both gencon and genparcon, we have seen the following messages:

gencon
  • [INFO ][memory ][Thu Jan 9 06:02:12 2014][1389247332488][25536] [OC#6] Changing GC strategy from: genconcon to: genconpar, reason: Emergency parallel sweep requested.

genparcon
  • [INFO ][memory ][Wed Jan 8 21:15:00 2014][1389215700143][24163] [OC#1] Changing GC strategy from: genparcon to: genparpar, reason: Emergency parallel sweep requested.


Possible reasons:
Sweeping and compaction (JRockit uses partial compaction to avoid fragmentation) tend to be more troublesome to parallelize. Allowing Java threads to run concurrently during the sweep phase makes sweeping run longer and slower because more bookkeeping and/or synchronization is needed. Also, using fewer GC threads introduces the risk that the garbage collector cannot keep up with the growing set of dead objects.

Conclusions


Ideally, an adaptive runtime would never need tuning at all, as the runtime feedback alone would determine how the application should behave for any given scenario at any given time. However, the computational complexity of the mark-and-sweep algorithm is a function of both the amount of live data on the heap (for mark) and the actual heap size (for sweep). Depending on the amount of live data on the heap and the system configuration, the OOTB behavior of the chosen Generational GC may or may not be able to keep up with garbage collection.

It is often argued that automatic memory management can slow down execution for certain applications to such an extent that it becomes impractical. This is because automatic memory management can introduce a high degree of non-determinism to a program that requires short response times, and it adds bookkeeping overhead; for example, changing garbage collection strategies requires an Old GC. To avoid this, some manual tuning may be needed to get good application performance.

Before attempting to tune the performance of an application, it is important to know where the bottlenecks are. That way no unnecessary effort is spent on adding complex optimizations in places where it doesn't really matter.

References

  1. Oracle JRockit: The Definitive Guide
  2. JRockit: Parallel vs Concurrent Collectors (Xml and More)
  3. JRockit: A Case Study of Thread Local Area (TLA) Tuning (Xml and More)
  4. JRockit: Thread Local Area Size and Large Objects (Xml and More)
  5. Mark-and-Sweep Garbage Collection
  6. JRockit: All Posts on "Xml and More"
  7. The Unspoken - The Why of GC Ergonomics (Jon Masamitsu's Weblog)
  8. Mark and Sweep (MS) Algorithm (Xml and More)

Sunday, April 7, 2013

Understanding CMS GC Logs

In this article, we will examine the following topics:
  • What's Mark and Sweep algorithm?
  • What's CMS Collector in HotSpot?
  • What's the format of CMS logs?

Mark and Sweep (MS) Algorithm


The mark-and-sweep algorithm consists of two phases[3]: In the first phase, it finds and marks all accessible objects. The first phase is called the mark phase. In the second phase, the garbage collection algorithm scans through the heap and reclaims all the unmarked objects. The second phase is called the sweep phase. The algorithm can be expressed as follows:

for each root variable r
    mark (r);
sweep ();

The computational complexity of mark and sweep is both a function of the amount of live data on the heap (for mark) and the actual heap size (for sweep).
  • Mark
    • Add each object in the root set to a queue
      • Typically, the root set contains all objects that are available without having to trace any references[5], which includes all Java objects on local frames in whatever methods the program is executing when it is halted for GC. This includes:
        • Everything we can obtain from the user stack and registers in the thread contexts of the halted program. 
        • Global data, such as static fields.
    • For each object X in the queue Mark X reachable
      • A mark bit is typically associated with each reachable object.
    • Add all objects referenced from X to the queue
    • Marking is the most critical of the stages, as it usually takes up around 90 percent of the total garbage collection time. 
      • Fortunately, marking is very parallelizable and large parts of a mark phase can also be run concurrently with executing Java code. 
  • Sweep
    • For each object X on the heap, 
      • If X is not marked, garbage collect it
    • Sweeping (and compaction, which is not part of the basic MS algorithm), however, tends to be more troublesome to run concurrently with the executing Java program.
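As an illustration only, the two phases above can be sketched as a toy Java program (the Node class and the object graph here are invented for this sketch; a real collector of course works on the JVM heap directly):

```java
import java.util.*;

// Toy illustration of the two-phase mark-and-sweep algorithm.
// The "heap" is a list of nodes; each node may reference other nodes.
public class MarkSweepDemo {
    static class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;
        Node(String name) { this.name = name; }
    }

    // Mark phase: breadth-first traversal from the root set.
    static void mark(Collection<Node> roots) {
        Deque<Node> queue = new ArrayDeque<>(roots);
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            if (n.marked) continue;
            n.marked = true;
            queue.addAll(n.refs);
        }
    }

    // Sweep phase: scan the whole heap, reclaim unmarked nodes,
    // and clear mark bits for the next cycle.
    static List<String> sweep(List<Node> heap) {
        List<String> reclaimed = new ArrayList<>();
        Iterator<Node> it = heap.iterator();
        while (it.hasNext()) {
            Node n = it.next();
            if (!n.marked) { reclaimed.add(n.name); it.remove(); }
            else { n.marked = false; }
        }
        return reclaimed;
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c");
        a.refs.add(b);                               // a -> b; c is unreachable
        List<Node> heap = new ArrayList<>(List.of(a, b, c));
        mark(List.of(a));                            // root set = {a}
        System.out.println(sweep(heap));             // only "c" is reclaimed
    }
}
```

The cost structure discussed next follows directly from this sketch: mark visits only live objects, while sweep scans the entire heap.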

Concurrent Mark Sweep (CMS) Collector


CMS (Concurrent Mark Sweep) is one of the garbage collectors implemented in the HotSpot JVM.  It is enabled using:
  • -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

Note that there are two different implementations of parallel collectors for the young generation:
  • UseParNewGC
  • UseParallelGC 

UseParNewGC should be used with CMS, and UseParallelGC should be used with the throughput collector.  Note also that there is another collector named iCMS,[4] which is a variant of CMS.  You should not use iCMS anymore because it is (or will be) deprecated.

CMS is designed to be mostly concurrent, requiring just two quick stop-the-world pauses per old space garbage collection cycle.   These two phases are the initial mark phase (single-threaded) and the remark phase (multithreaded).  It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.

CMS does not perform compaction in a normal CMS cycle.  However, an old generation overflow will trigger a stop-the-world compacting garbage collection.
  • Par new generation (or Young Gen)
    • Young generation space is further split into 
      • Eden 
      • Survivor spaces
    • With CMS, the young generation would have defaulted to a maximum size of 665 MB if you didn't explicitly set it.
    • Survivors from the young generation are evacuated to 
      • Other survivor space, or
      • Old generation space
  • Concurrent mark-sweep generation (or Tenured Gen)
    • Does in-place de-allocation (which can lead to heap fragmentation)
    • Is managed by free lists, which need synchronization
    • Concurrent marking phase
      • Two stop-the-world pauses
        • Initial mark
          • Marks reachable (live) objects
        • Remark
          • Unmarked objects are deduced to be unreachable (dead)
    • Concurrent sweeping phase
      • Sweeps over the heap
      • De-allocates unmarked (dead) objects in place
      • At the end of concurrent sweeping, all unmarked objects have been de-allocated

Enabling Logging


We have used the following HotSpot VM options to generate the log file:
  • -Xloggc:/gc_0.log 
  • -verbose:gc 
  • -XX:+PrintGCDetails 
  • -XX:+PrintGCTimeStamps 
  • -XX:+PrintReferenceGC 
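Combined on one command line together with the CMS selection flags from the previous section (MyApp is a placeholder; the log path is from our setup), this looks like:

```
java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Xloggc:/gc_0.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC MyApp
```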

CMS GC Logs


Let's take a look at some of the CMS logs generated with 1.7.0_10:

2715.210: [GC 2715.210: [ParNew2715.282: [SoftReference, 0 refs, 0.0000120 secs]2715.282: [WeakReference, 2433 refs, 0.0004150 secs]2715.283: [FinalReference, 1947 refs, 0.0074410 secs]2715.290: [PhantomReference, 0 refs, 0.0000080 secs]2715.290: [JNI Weak Reference, 0.0000040 secs]: 310460K->27980K(318912K), 0.0804180 secs] 3828421K->3561073K(4158912K), 0.0805140 secs] [Times: user=0.36 sys=0.00, real=0.08 secs]
Young generation (ParNew) collection. Young generation capacity is 318912K and after the collection its occupancy drops down from 310460K to 27980K. This collection took 0.08 secs.

2715.294: [GC [1 CMS-initial-mark: 3533092K(3840000K)] 3564157K(4158912K), 0.0418950 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]

Beginning of the tenured generation collection with the CMS collector. This is the initial marking phase of CMS, in which all objects directly reachable from the roots are marked; it is done with all mutator threads stopped.

The capacity of the tenured generation space is 3840000K, and CMS was triggered at an occupancy of 3533092K.

2715.337: [CMS-concurrent-mark-start]
Start of concurrent marking phase.

In the concurrent marking phase, the mutator threads stopped in the initial mark phase are started again, and all objects transitively reachable from the objects marked in the first phase are marked here.

2717.428: [CMS-concurrent-mark: 2.091/2.091 secs] [Times: user=6.34 sys=0.08, real=2.09 secs]
Concurrent marking took a total of 2.091 seconds of CPU time and 2.091 seconds of wall time, which includes yields to other threads.

2717.428: [CMS-concurrent-preclean-start]
Start of precleaning.

Precleaning is also a concurrent phase. In this phase, we look at the objects in the CMS heap that were updated by promotions from the young generation, by new allocations, or by mutators while concurrent marking was in progress. By rescanning those objects concurrently, the precleaning phase helps reduce the work of the next stop-the-world "remark" phase.

2717.428: [Preclean SoftReferences, 0.0000020 secs]2717.428: [Preclean WeakReferences, 0.0066370 secs]2717.434: [Preclean FinalReferences, 0.0011100 secs]2717.436: [Preclean PhantomReferences, 0.0003480 secs]2717.475: [CMS-concurrent-preclean: 0.045/0.048 secs] [Times: user=0.19 sys=0.00, real=0.04 secs]

Concurrent precleaning took 0.045 seconds of CPU time and 0.048 seconds of wall time.


2717.475: [CMS-concurrent-abortable-preclean-start]

2717.876: [GC 2717.876: [ParNew2717.958: [SoftReference, 0 refs, 0.0000140 secs]2717.958: [WeakReference, 2683 refs, 0.0004340 secs]2717.959: [FinalReference, 1972 refs, 0.0069420 secs]2717.966: [PhantomReference, 0 refs, 0.0000080 secs]2717.966: [JNI Weak Reference, 0.0000050 secs]: 311500K->28150K(318912K), 0.0900310 secs] 3844593K->3577211K(4158912K), 0.0901370 secs] [Times: user=0.41 sys=0.01, real=0.09 secs]

2719.157: [CMS-concurrent-abortable-preclean: 1.582/1.681 secs] [Times: user=4.16 sys=0.07, real=1.68 secs]

2719.157: [GC[YG occupancy: 171027 K (318912 K)]2719.157: [Rescan (parallel) , 0.0680250 secs]2719.225: [weak refs processing2719.226: [SoftReference, 0 refs, 0.0000070 secs]2719.226: [WeakReference, 9940 refs, 0.0010950 secs]2719.227: [FinalReference, 10877 refs, 0.0226430 secs]2719.249: [PhantomReference, 0 refs, 0.0000050 secs]2719.249: [JNI Weak Reference, 0.0000110 secs], 0.0238550 secs]2719.249: [scrub string table, 0.0057930 secs] [1 CMS-remark: 3549060K(3840000K)] 3720087K(4158912K), 0.0989920 secs] [Times: user=0.40 sys=0.01, real=0.10 secs]
In the above log, after a preclean, 'abortable preclean' starts. After the young generation collection, the young gen occupancy drops from 311500K to 28150K. When the young gen occupancy reaches 171027K, which is about 54% of the total capacity (318912K), precleaning is aborted and the 'remark' phase is started.
Note that the young generation occupancy also gets printed in the final remark phase.
Remark is a stop-the-world phase. It rescans any residual updated objects in the CMS heap, retraces from the roots, and also processes Reference objects. Here the rescanning work took 0.0680250 secs, and weak reference processing took 0.0238550 secs. The whole remark phase took 0.0989920 secs to complete.

2719.257: [CMS-concurrent-sweep-start]
Start of sweeping of dead/non-marked objects. Sweeping is a concurrent phase, performed with all other threads running.

2722.945: [CMS-concurrent-sweep: 3.606/3.688 secs] [Times: user=8.79 sys=0.12, real=3.69 secs]
Sweeping took 3.69 secs.

2722.945: [CMS-concurrent-reset-start]
Start of reset.

2722.973: [CMS-concurrent-reset: 0.028/0.028 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
In this phase, the CMS data structures are reinitialized so that a new cycle may begin at a later time. In this case, it took 0.03 secs.

This is how a normal CMS cycle runs.  Note that, in our experiment, we didn't see a "concurrent mode failure", so we don't discuss it here.  For more information, see [1].
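As a small aid for reading such logs, here is a hedged sketch of a parser that pulls the leading timestamp and the final pause duration out of pause lines shaped like the ones above (the log grammar varies across JDK versions and collectors, so this is illustrative, not a full parser; the class and method names are invented):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the leading timestamp and the final pause duration from a
// CMS/ParNew pause line such as:
//   2715.294: [GC [1 CMS-initial-mark: 3533092K(3840000K)] 3564157K(4158912K), 0.0418950 secs]
public class CmsLogLine {
    private static final Pattern TS = Pattern.compile("^(\\d+\\.\\d+): ");
    // Matches ", 0.0418950 secs]" but not "real=0.04 secs]" in the [Times: ...] tail.
    private static final Pattern SECS = Pattern.compile(", (\\d+\\.\\d+) secs\\]");

    // Returns {timestamp, pauseSeconds}, or null if the line doesn't match.
    static double[] parse(String line) {
        Matcher t = TS.matcher(line);
        if (!t.find()) return null;
        double pause = -1;
        Matcher s = SECS.matcher(line);
        while (s.find()) pause = Double.parseDouble(s.group(1)); // keep the last match
        return pause < 0 ? null : new double[] { Double.parseDouble(t.group(1)), pause };
    }

    public static void main(String[] args) {
        String line = "2715.294: [GC [1 CMS-initial-mark: 3533092K(3840000K)] "
                + "3564157K(4158912K), 0.0418950 secs]";
        double[] r = parse(line);
        System.out.println("timestamp=" + r[0] + " pause=" + r[1]);
    }
}
```

Feeding every line of a log through such a parser makes it easy to plot pause times over time and spot outliers like the long remark pauses discussed above.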

References

  1. Understanding CMS GC Logs
  2. Java GC, HotSpot's CMS and heap fragmentation
  3. Mark-and-Sweep Garbage Collection
  4. Really? iCMS? Really?
  5. Oracle JRockit--The Definitive Guide
  6. JRockit: Parallel vs Concurrent Collectors (Xml and More)
  7. The Unspoken - CMS and PrintGCDetails (Jon Masamitsu's Weblog)
  8. The Unspoken - Phases of CMS (Jon Masamitsu's Weblog)

Friday, October 26, 2012

HotSpot VM Performance Tuning Tips

In some cases, it may be obvious from benchmarking that parts of an application need to be rewritten using more efficient algorithms[1]. Sometimes it may just be enough to provide a more optimal runtime environment by tuning the JVM parameters.

In this article, we will show you some of the HotSpot VM performance tuning tips.

What to tune?


You can tune HotSpot performance on multiple fronts:
  • Code generation[8,10]
  • Memory management
In this article, we will focus more on memory management (or garbage collector).  The goals of tuning garbage collector include:
  • To make a garbage collector operate efficiently, by
    • Reducing pause time or
    • Increasing throughput
  • To avoid heap fragmentation
    • Different garbage collectors use different compaction approaches to eliminate fragmentation
  • To make it scalable for multithreaded applications on multiprocessor systems
In this article, we will cover the following tuning options:
  1. Client VM or Server VM
  2. 32-bit VM or 64-bit VM
  3. GC strategy
  4. Heap sizing
  5. Further tuning

Client vs. Server VM


The HotSpot Client JVM has been specially tuned to reduce application startup time and memory footprint, making it particularly well suited for client environments. On all platforms, the HotSpot Client JVM is the default.

The Java HotSpot Server VM is similar to the HotSpot Client JVM except that it has been specially tuned to maximize peak operating speed. It is intended for long-running server applications, for which the fastest possible operating speed is generally more important than having the fastest startup time. To invoke the HotSpot Server JVM instead of the default HotSpot Client JVM, use the -server parameter; for example,
  • java -server MyApp
In [7], the authors mention a third HotSpot VM runtime named tiered.  If you are using Java 6 Update 25, Java 7, or later, you may consider using the tiered server runtime as a replacement for the client runtime.  For more details, read [8,10].

32-Bit or 64-Bit VM


The 32-bit JVM is the default for the HotSpot VM. The choice of using a 32-bit or 64-bit JVM is dictated by the memory footprint required by the application along with whether any third-party software used in the application supports 64-bit JVMs and if there are any native components in the Java application. All native components using the Java Native Interface (JNI) in a 64-bit JVM must be compiled in 64-bit mode.

Running 64-bit VM has the following advantages[7]:
  • Larger address space
  • Better performance in two fronts
    • 64-bit JVMs can make use of additional CPU registers
    • Help avoid register spilling
      • Register spilling occurs when there is more live state (i.e., variables) in the application than the CPU has registers.
and one disadvantage:
  • Increased width for oops
    • Results in fewer oops being available on a CPU cache line and as a result decreases CPU cache efficiency. 
    • This negative performance impact can be mitigated by setting:
      • -XX:+UseCompressedOops VM command line option
Note that client runtimes are not available in 64-bit HotSpot VMs.   See [11] for more details.

GC Strategy


JVM performance is usually measured by its GC's effectiveness.  Garbage collection (GC) reclaims the heap space previously allocated to objects no longer needed. The process of locating and removing those dead objects can stall your Java application while consuming as much as 25 percent of throughput.

The Java HotSpot virtual machine includes five garbage collectors.[27] All the collectors are generational.
  • Serial Collector
    • Both young and old collections are done serially (using a single CPU), in a stop-the-world fashion.
    • The old and permanent generations are collected via a mark-sweep-compact collection algorithm. 
      • The sweep phase “sweeps” over the generations, identifying garbage. The collector then performs sliding compaction, sliding the live objects towards the beginning of the old generation space (and similarly for the permanent generation), leaving any free space in a single contiguous chunk at the opposite end.
    • When to use
      • For most applications that are run on client-style machines and that do not have a requirement for low pause times
    • How to select
      • In the J2SE 5.0 and above, the serial collector is automatically chosen as the default garbage collector on machines that are not server-class machines. On other machines, the serial collector can be explicitly requested by using the -XX:+UseSerialGC command line option.
  • Parallel Collector (or throughput collector)
    • Young generation collection
      • Uses a parallel version of the young generation collection algorithm utilized by the serial collector
      • It is still a stop-the-world and copying collector, but performing the young generation collection in parallel, using many CPUs, decreases garbage collection overhead and hence increases application throughput.
    • Old generation collection
      • Uses the same serial mark-sweep-compact collection algorithm as the serial collector
    • When to use
      • For applications run on machines with more than one CPU and do not have pause time constraints, since infrequent, but potentially long, old generation collections will still occur. 
      • Examples of applications for which the parallel collector is often appropriate include those that do batch processing, billing, payroll, scientific computing, and so on.
    • How to select
      • In the J2SE 5.0 and above, the parallel collector is automatically chosen as the default garbage collector on server-class machines. On other machines, the parallel collector can be explicitly requested by using the -XX:+UseParallelGC command line option.
  • Parallel Compacting Collector
      • Young generation collection
        • Use the same algorithm as that for young generation collection using the parallel collector.
      • Old generation collection
        • The old and permanent generations are collected in a stop-the-world, mostly parallel fashion with sliding compaction
        • The collector utilizes three phases (see [5] for more details):
          • Marking phase
          • Summary phase
          • Compaction phase
      • When to use
        • For applications that are run on machines with more than one CPU and applications that have pause time constraints.
      • How to select
        • If you want the parallel compacting collector to be used, you must select it by specifying the command line option -XX:+UseParallelOldGC.
    • Concurrent Mark-Sweep (CMS) Collector[5,6]
        • Young generation collection
          • The CMS collector collects the young generation in the same manner as the parallel collector. 
        • Old generation collection
          • Most of the collection of the old generation using the CMS collector is done concurrently with the execution of the application. 
          • The CMS collector is the only collector that is non-compacting. That is, after it frees the space that was occupied by dead objects, it does not move the live objects to one end of the old generation. 
            • To minimize the risk of fragmentation, CMS does statistical analysis of object sizes and keeps separate free lists for objects of different sizes.
        • When to use
          • For applications that need shorter garbage collection pauses and can afford to share processor resources with the garbage collector while the application is running. (Due to its concurrency, the CMS collector takes CPU cycles away from the application during a collection cycle.)
          • Typically, applications that have a relatively large set of long-lived data (a large old generation), and that run on machines with two or more processors, tend to benefit from the use of this collector. 
          • Compared to the parallel collector, the CMS collector decreases old generation pauses—sometimes dramatically—at the expense of slightly longer young generation pauses, some reduction in throughput, and extra heap size requirements.
        • How to select
          • If you want the CMS collector to be used, you must explicitly select it by specifying the command line option -XX:+UseConcMarkSweepGC.  
          • If you want it to be run in incremental mode, also enable that mode via the  –XX:+CMSIncrementalMode option.
            • This feature is useful when applications that need the low pause times provided by the concurrent collector are run on machines with small numbers of processors (e.g., 1 or 2).
          • ParNewGC is the parallel young generation collector for use with CMS.  To choose it, you can specify the command line option  ‑XX:+UseParNewGC.
      • Garbage First (G1) Garbage Collector[31]
        • Summary
          • Differs from CMS in the following ways
            • Compacting
              • Reduce fragmentation and is good for long-running applications 
            • Heap is split into regions
              • Easy to allocate and resize
          • Evacuation pauses 
            • For both young and old regions
        • Young generation collection
          • During a young GC, survivors from the young regions are evacuated to either survivor regions or old regions
            • Done with stop-the-world evacuation pauses
            • Performed in parallel
        • Old generation collection
          • Some garbage objects in regions with very high live ratio may be left in the heap and be collected later
          • Concurrent marking phase
            • Calculates liveness information per region
              • Empty regions can be reclaimed immediately
              • Identifies best regions for subsequent evacuation pauses
            • Remark is done with one stop-the-world pause, while the initial mark is piggybacked on an evacuation pause
            • No corresponding sweeping phase
            • Different marking algorithm than CMS
          • Old regions are reclaimed by
            • Evacuation pauses
              • Using compaction
              • Where most reclamation is done
            • Remark (when totally empty)
        • When to use
          • The G1 collector is a server-style garbage collector, targeted for multi-processor machines with large memories. It meets garbage collection (GC) pause time goals with high probability, while achieving high throughput. 
        • How to select
          • If you want the G1 garbage collector to be used, you must explicitly select it by specifying the command line option -XX:+UseG1GC.  
      Note that the difference between:
      • -XX:+UseParallelOldGC
      and
      • -XX:+UseParallelGC
      is that -XX:+UseParallelOldGC enables both a multithreaded young generation garbage collector and a multithreaded old generation garbage collector, that is, both minor garbage collections and full garbage collections are multithreaded. -XX:+UseParallelGC enables only a multithreaded young generation garbage collector. The old generation garbage collector used with -XX:+UseParallelGC is single threaded. 

      Using -XX:+UseParallelOldGC also automatically enables -XX:+UseParallelGC. Hence, if you want to use both a multithreaded young generation garbage collector and a multithreaded old generation garbage collector, you need only specify -XX:+UseParallelOldGC.

      Note that the above distinction between -XX:+UseParallelOldGC and -XX:+UseParallelGC is no longer true in JDK 7.  In JDK 7, the following three settings are equivalent:

      • Default
      • -XX:+UseParallelGC
      • -XX:+UseParallelOldGC
      They all use multithreaded collectors for both young generation and old generation.

      Heap Sizing


      If a heap size is small, collection will be fast but the heap will fill up more quickly, thus requiring more frequent collections. Conversely, a large heap will take longer to fill up and thus collections will be less frequent, but they may take longer.

      Command line parameters that divide the heap between new and old generations usually cause the greatest performance impact.  If you increase the new generation's size, you often improve the overall throughput; however, you also increase footprint, which may slow down servers with limited memory.

      For more details, you can read [12-15].
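At run time you can sanity-check the heap sizes the JVM actually settled on via java.lang.Runtime (a small sketch; the class name HeapSize is ours):

```java
public class HeapSize {
    // Maximum heap the JVM will attempt to use (-Xmx, or a platform default).
    public static long maxBytes()   { return Runtime.getRuntime().maxMemory(); }
    // Heap currently committed by the JVM; grows up to maxBytes().
    public static long totalBytes() { return Runtime.getRuntime().totalMemory(); }
    // Unused portion of the committed heap.
    public static long freeBytes()  { return Runtime.getRuntime().freeMemory(); }

    public static void main(String[] args) {
        System.out.printf("max=%dMB total=%dMB free=%dMB%n",
                maxBytes() >> 20, totalBytes() >> 20, freeBytes() >> 20);
    }
}
```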

      Further Tuning


      HotSpot's default parameters are effective for most small applications that require fast startup and a small footprint.  More often than not, however, you will find the default settings are not good enough and your Java applications need further tuning.  As shown in [16], there are many VM options exposed that can be further tuned by brave souls.  I'm not going to discuss such tunings in this article, but I'll keep posting articles with VM tunings on this blog.  Stay tuned!

      References

      1. Java Performance Tips
      2. Pick up performance with generational garbage collection
      3. Java HotSpot™ Virtual Machine Performance Enhancements
      4. Oracle JRockit
      5. Memory Management in the Java HotSpot™ Virtual Machine
      6. Understanding GC pauses in JVM, HotSpot's CMS collector
      7. Java Performance by Charlie Hunt and Binu John
      8. Performance Tuning with Hotspot VM Option: -XX:+TieredCompilation
      9. Java Tuning White Paper
      10. A Case Study of Using Tiered Compilation in HotSpot
      11. HotSpot VM Binaries: 32-Bit vs. 64-Bit
      12. HotSpot Performance Option — SurvivorRatio
      13. A Case Study of java.lang.OutOfMemoryError: GC overhead limit exceeded
      14. Understanding Garbage Collection
      15. Diagnosing Java.lang.OutOfMemoryError
      16. What Are the Default HotSpot JVM Values?
      17. Understanding Garbage Collector Output of Hotspot VM
      18. On Stack Replacement in HotSpot JVM
      19. Professional Oracle WebLogic Server by Robert Patrick, Gregory Nyberg, and Philip Aston
      20. Sun Performance and Tuning: Java and the Internet by Adrian Cockroft and Richard Pettit
      21. Concurrent Programming in Java: Design Principles and Patterns by Doug Lea
      22. Capacity Planning for Web Performance: Metrics, Models, and Methods by Daniel A. Menascé and Virgilio A.F. Almeida
      23. Java Performance Tuning (Michael Finocchiaro)
      24. Diagnosing Heap Stress in HotSpot
      25. Introduction to HotSpot JVM Performance and Tuning
      26. Tuning the JVM (video)
        • Frequency of Minor GC dictated by:
          • Application object allocation rate
          • Size of Eden
        • Frequency of object promotion dictated by:
          • Frequency of minor GCs (tenuring)
          • Size of survivor spaces
        • Full GC Frequency dictated by
          • Promotion rate
          • Size of old generation
      27. JEP 173: Retire Some Rarely-Used GC Combinations
      28. G1 GC Glossary of Terms
      29. Learn More About Performance Improvements in JDK 8 
      30. Java SE HotSpot at a Glance
      31. Garbage-First Garbage Collector (JDK 8 HotSpot Virtual Machine Garbage Collection Tuning Guide)
      32. Tuning that was great in old JRockit versions might not be so good anymore
        • Trying to bring over each and every tuning option from a JR configuration to an HS one is probably a bad idea.
        • Even when moving between major versions of the same JVM, we usually recommend going back to the default (just pick a collector and heap size) and then redoing any tuning work from scratch (if even necessary).

      Thursday, August 30, 2012

      Java Performance Tips

      Every little thing adds up. The overall performance of an application depends on the individual performance of its components. Here we list some Java programming tips to help your application perform well:

      Avoid creating duplicate objects
      Why:
      • Fewer objects means less GC.
      • An application with a smaller footprint performs better.
      How:
      • Reuse a single object instead of creating a new functionally equivalent object each time it's needed.
        • (Do)
          • String s = "No longer silly";
        • (Don't)
          • String s = new String("silly");
      • Use static factory methods in preference to constructors on immutable classes that provide both.
      • Reuse mutable objects that are never modified once their values have been computed (within a static initializer).
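The reuse advice above can be seen directly in code (a small sketch; the identity comparisons here illustrate instance reuse, not a recommended equality test):

```java
public class ReuseDemo {
    public static void main(String[] args) {
        String a = "No longer silly";   // literal: reused from the string pool
        String b = "No longer silly";
        System.out.println(a == b);     // same pooled instance

        String c = new String("silly"); // forces a new, functionally identical object
        System.out.println(c == "silly");      // distinct instance
        System.out.println(c.equals("silly")); // same contents

        // Static factories can return cached instances where constructors cannot.
        System.out.println(Boolean.valueOf(true) == Boolean.TRUE);
    }
}
```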
      Avoid circular references
      Why:
      • Groups of mutually referencing objects which are not directly referenced by other objects and are unreachable can thus become permanently resident.
      How:
      • You can use strong references for "parent-to-child" references, and weak references for "child-to-parent" references, thus avoiding cycles.
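A minimal sketch of the strong "parent-to-child" / weak "child-to-parent" pattern, using java.lang.ref.WeakReference (the Parent/Child classes are illustrative):

```java
import java.lang.ref.WeakReference;

class Parent {
    private final Child child;                  // strong parent-to-child reference
    Parent() { this.child = new Child(this); }
    Child getChild() { return child; }
}

class Child {
    private final WeakReference<Parent> parent; // weak child-to-parent back-reference
    Child(Parent p) { this.parent = new WeakReference<>(p); }
    Parent getParent() { return parent.get(); } // null once the parent is collected
}

public class WeakBackRefDemo {
    public static void main(String[] args) {
        Parent parent = new Parent();
        // While the parent is strongly reachable, the back-reference still resolves.
        System.out.println(parent.getChild().getParent() == parent);
    }
}
```

Because the back-reference is weak, the pair forms no strong cycle: once the Parent is otherwise unreachable, both objects can be collected.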
      Caching frequently-used data or objects
      Why:
      • One of the canonical performance-related use cases for Java is to supply a middle tier that caches data from back-end database resources.
        • (Warning) Like all cases where objects are reused, there is a potential performance downside: if the cache consumes too much memory, it will cause GC pressure.
      How:
      • Code should use a PreparedStatement rather than a Statement for its JDBC calls.
      • Reuse JDBC connections because connections to a database are time-consuming to create.
      Use the == operator instead of the equals(Object) method
      Why:
      • == operator performs better.
        • For example, for String comparison, the equals( ) method compares the characters inside a String object. The == operator compares two object references to see whether they refer to the same instance.
      How:
      • a.equals(b) if and only if a == b
        • For example, use static factory method to return the same object from repeated invocations.
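As a sketch of such a static factory, a canonicalizing pool guarantees that repeated invocations with equal arguments return the same instance, so == and equals() agree (the Symbol class is hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Instances are canonicalized by the static factory: at most one Symbol
// exists per name, so reference comparison (==) is a valid equality test.
public final class Symbol {
    private static final Map<String, Symbol> POOL = new ConcurrentHashMap<>();
    private final String name;

    private Symbol(String name) { this.name = name; }

    public static Symbol of(String name) {
        return POOL.computeIfAbsent(name, Symbol::new);
    }

    public String name() { return name; }

    public static void main(String[] args) {
        System.out.println(Symbol.of("x") == Symbol.of("x")); // same canonical instance
    }
}
```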
      Eliminate obsolete object references
      Why:
      • Reduced performance with more garbage collector activity
      • Reduced performance with increased memory footprint
      How:
      • Null out references once they become obsolete
      • (Do)
        • public Object pop() {
           if (size == 0)
             throw new EmptyStackException();
           Object result = elements[--size];
           elements[size] = null;  // Eliminate obsolete reference
           return result;
          }
      • (Don't)
        • public Object pop(){
           if (size == 0)
             throw new EmptyStackException();
           return elements[--size];
          }
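Assembled into a runnable class (the array-doubling growth policy here is illustrative), the corrected version looks like:

```java
import java.util.Arrays;
import java.util.EmptyStackException;

public class LeanStack {
    private Object[] elements = new Object[16];
    private int size = 0;

    public void push(Object e) {
        if (size == elements.length)
            elements = Arrays.copyOf(elements, 2 * size);
        elements[size++] = e;
    }

    public Object pop() {
        if (size == 0)
            throw new EmptyStackException();
        Object result = elements[--size];
        elements[size] = null;  // eliminate the obsolete reference
        return result;
    }

    public static void main(String[] args) {
        LeanStack s = new LeanStack();
        s.push("a");
        s.push("b");
        System.out.println(s.pop()); // b
    }
}
```

Without the nulling line, a popped element would stay reachable through the backing array and could not be collected until it was overwritten by a later push.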
      Avoid finalizers
      Why:
      • Objects waiting for finalization have to be tracked separately by the garbage collector.
      • There is also call overhead when the finalize method is invoked.
      • Finalizers are unsafe in that they can resurrect objects and interfere with the GC.

      Avoid reference objects
      Why:
      • As with finalizers, the garbage collector has to treat soft, weak, and phantom references specially.
      • Although all of these can provide great aid in, for example, simplifying a cache implementation, too many live Reference objects will make the garbage collector run slower.
      • A Reference object is usually an order of magnitude more expensive than a normal object (strong reference) to bookkeep.
      • To find out the number of references and the time it takes to process them in HotSpot, add the following JVM option:
        • -XX:+PrintReferenceGC
      Avoid object pooling
      Why:
      • Object pooling contributes both to more live data and to longer object life spans.
      • Note that a large amount of live data is a GC bottleneck, and the GC is optimized to handle many objects with short life spans.
      • Also, allocating fresh objects instead of keeping old objects alive will most likely be more beneficial to cache locality.
      • However, performance in an environment with many large objects, for example large arrays, may occasionally benefit from object pooling.
      Choose good algorithms and data structures
      Why:
      • Consider a queue implementation in the form of a linked list.
        • Even if your program never iterates over the entire linked list, the garbage collector still has to.
        • Bad cache locality can ensue because payloads or element wrappers aren't guaranteed to be stored next to each other in memory. This will cause long pause times as, if the object pointers are spread over a very large heap area, a garbage collector would repeatedly miss the cache while doing pointer chasing during its mark phase.
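For a queue, for instance, java.util.ArrayDeque stores its elements in one contiguous array, giving the collector fewer nodes to trace and better cache locality than a LinkedList (a small sketch):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueChoice {
    public static void main(String[] args) {
        // ArrayDeque backs the queue with a contiguous array: no per-element
        // node objects for the collector to trace, unlike a linked list.
        Queue<Integer> queue = new ArrayDeque<>();
        queue.add(1);
        queue.add(2);
        System.out.println(queue.poll()); // 1 (FIFO order preserved)
        System.out.println(queue.poll()); // 2
    }
}
```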
      Avoid System.gc()
      Why:
      • There is no guarantee from the Java language specification that calling System.gc() will do anything at all. But if it does, it probably does more than you want or doesn't do the same thing every time you call it.
      Avoid too many threads
      Why:
      • The number of context switches grows proportionally to the number of fairly scheduled threads, and there may also be hidden overhead here.
        • For example, a native thread context on a system such as Intel IA-64 processor is on the order of several KB.
      Avoid contended locks
      Why:
      • Contended locks are bottlenecks, as their presence means that several threads want to access the same resource or execute the same piece of code at the same time. 
      Avoid unnecessary exceptions
      Why:
      • Handling exceptions takes time and interrupts normal program flow.
      • The authors of [1] have seen cases of customer applications throwing tens of thousands of unnecessary NullPointerExceptions every second as part of normal control flow. Once this behavior was rectified, performance gains of an order of magnitude were achieved.

      Avoid large objects
      Why:
      • Large objects sometimes have to be allocated directly on the heap and not in thread local areas (TLA).
        • Large objects on the heap are bad in that they contribute to fragmentation more quickly.
      • Large object allocation in VMs (for example, in JRockit) also contributes to overhead because it may require taking a global heap lock on allocation.
      • An overuse of large objects leads to full heap compaction being done too frequently, which is very disruptive and requires stopping the world for large amounts of time.


      Tuning Guidelines


      As stated in [6], here are the guidelines for tuning your applications:
      You should take an overall system approach to ensure that you cover all facets of the environment in which the application runs. The overall system approach begins with the external environment and continues drilling down into all parts of the system and application. Taking a broad approach and tuning the overall environment ensures that the application will perform well and that system performance can meet all of your requirements.

      References

      1. Oracle JRockit
      2. Effective Java by Joshua Bloch
      3. hashCode() and equals() in Java
      4. HotSpot VM Performance Tuning Tips
      5. Java Performance by Charlie Hunt, Binu John, David Dagastine
      6. Professional Oracle WebLogic Server by Robert Patrick, Gregory Nyberg, and Philip Aston
      7. Sun Performance and Tuning: Java and the Internet by Adrian Cockroft and Richard Pettit
      8. Concurrent Programming in Java: Design Principles and Patterns by Doug Lea
      9. Capacity Planning for Web Performance: Metrics, Models, and Methods by Daniel A. Menascé and Virgilio A.F. Almeida
      10. Big-O complexities of common algorithms (Cheat Sheet)


      Monday, January 16, 2012

      Understanding Garbage Collection

      Every Java programmer will encounter
      • java.lang.OutOfMemoryError
      sooner or later.  The most frequently offered advice includes:
      • Try increasing MaxPermSize first
      • Increase the maximum Java heap size to 512m
      In this article, we will show what the Java heap space or MaxPermSize is.  Using HotSpot as our JVM, we will introduce you to the following topics:
      Garbage Collector

      A garbage collector is responsible for
      • Allocating memory
      • Ensuring that any referenced objects remain in memory 
      • Recovering memory used by objects that are no longer reachable from references in executing code. 

      The process of locating and removing those dead objects can stall your Java application while consuming as much as 25 percent of throughput.
      The GC provided by the HotSpot VM is a generational GC[9,16,18]: memory is divided into generations, that is, separate pools holding objects of different ages.  The design is based on the Weak Generational Hypothesis:
      1. Most newly-allocated objects die young
      2. There are few references from old to young objects

      This separation into generations has proven effective at reducing garbage collection pause times and overall costs in a wide range of applications.  Hotspot VM divides its heap space into three generational spaces:
      1. Young Generation
        • When a Java application allocates Java objects, those objects are allocated in the young generation space
        • Is typically small and collected frequently
        • The garbage collection algorithm chosen for a young generation typically puts a premium on speed, since young generation collections are frequent. 
      2. Old Generation
        • Objects that are longer-lived are eventually promoted, or tenured, to the old generation
        • Is typically larger than the young generation, and its occupancy grows more slowly
        • The old generation is typically managed by an algorithm that is more space efficient, because the old generation takes up most of the heap and old generation algorithms have to work well with low garbage densities.
      3. Permanent Generation
        • Holds VM and Java class metadata as well as interned Strings and class static variables
        • Note that PermGen will be removed from Java heap in JDK 8.
      Minor vs Full GC

      A garbage collection occurs when any one of those three generational spaces is considered full and there is some request for additional space that is not available.  There are two types of garbage collection activities: minor and full.   When the young generation fills up, it triggers a minor collection in which the surviving objects are moved to the old generation. When the old generation fills up, it triggers a full collection which involves the entire object heap.

      Minor GC
      • Occurs when the young generation space does not have enough room
      • Tends to be short in duration relative to full garbage collections
      Full GC
      • When the old or permanent generation fills up, a full collection is typically done.
      • A Full GC can also be started explicitly by the application using System.gc()
      • Takes a longer time depending upon the heap size.  However, if it takes longer than 3 to 5 seconds, then it's too long[1].
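You can observe collection activity programmatically through the collector MXBeans (a sketch; the class name GcCounts is ours). Note that System.gc() is a request, not a command:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounts {
    // Sum of collections performed so far across all collectors;
    // a count of -1 means the figure is unavailable and is skipped.
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            if (c > 0) total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        System.gc();  // a request only: the spec does not guarantee a collection
        long after = totalCollections();
        System.out.println("collections before=" + before + " after=" + after);
    }
}
```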
      Full garbage collections typically have the longest duration and as a result are the number one reason for applications not meeting their latency or throughput requirements.   The goal of GC tuning is to reduce the number and frequency of full garbage collections experienced by the application.  To achieve it, we can approach from two sides:
      • From the system side
        • You should use as large a heap size as possible without causing your system to "swap" pages to disk.  Typically, you should use 80 percent of the available RAM (not taken by the operating system or other processes) for your JVM[1].
        • The larger the Java heap space, the better the garbage collector and application perform when it comes to throughput and latency.
      • From the application side
        • A reduction in object allocations, more importantly, object retention helps reduce the live data size, which in turn helps the GC and application to perform.
        • Read this article: Java Performance Tips [19].
      OutOfMemoryError

      The dreaded OutOfMemoryError is something that no Java programmer ever wishes to see. Yet it can happen, particularly if your application involves a large amount of data processing, or is long lived.
      The total memory size of an application includes:
      • Java heap size
      • Thread stacks
      • I/O buffers
      • Memory allocated by native libraries
      If an application runs out of memory and JVM GC fails to reclaim more object spaces, an OutOfMemoryError exception will be thrown.  An OutOfMemoryError does not necessarily imply a memory leak. The issue might simply be a configuration issue, for example if the specified heap size (or the default size if not specified) is insufficient for the application.
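The heap versus non-heap split can be inspected via MemoryMXBean (a sketch; the class name MemoryFootprint is ours). This reports only JVM-managed memory; thread stacks and native-library allocations are not included:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class MemoryFootprint {
    // Usage of the Java heap (young + old generations).
    public static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }
    // Usage of non-heap areas such as class metadata and the code cache.
    public static MemoryUsage nonHeap() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
    }

    public static void main(String[] args) {
        System.out.println("heap used:     " + heap().getUsed());
        System.out.println("non-heap used: " + nonHeap().getUsed());
    }
}
```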

      JVM Command Options

      Whether running a client or server application, if your system is running low on heap and spending a lot of time with garbage collection you'll want to investigate adjusting your heap size. You also don't want to set your heap size too large and impact other applications running on the system.   

      GC tuning is non-trivial.  Finding the optimal generation space sizes involves an iterative process[3,10,12].  Here we assume you have successfully identified the optimal heap space sizes for your application.  Then you can use the following JVM command options to set them:

      

      GC Command Line Options
      • -Xms: Sets the initial and minimum Java heap size.  For example, -Xms512m (note that there is no "=").
      • -Xmx: Sets the maximum Java heap size.
      • -Xmn: Sets the initial, minimum, and maximum size of the young generation space.  Note that the size of the old generation space is implicitly set based on the size of the young generation space.
      • -XX:PermSize=<n>[g|m|k]: Sets the initial and minimum size of the permanent generation space.
      • -XX:MaxPermSize=<n>[g|m|k]: Sets the maximum size of the permanent generation space.


      As a final note, ergonomics for servers was first introduced in Java SE 5.0[13]. It has greatly reduced application tuning time for server applications, particularly with heap sizing and advanced GC tuning. In many cases, specifying no tuning options at all is the best tuning you can do when running on a server.

      References
      1. Tuning Java Virtual Machines (JVMs)
      2. Diagnosing Java.lang.OutOfMemoryError
      3. Java Performance by Charlie Hunt and Binu John
      4. Java HotSpot VM Options
      5. GCViewer (a free open source tool)
      6. Comparison Between Sun JDK And Oracle JRockit 
      7. Java SE 6 Performance White Paper 
      8. FAQ about Garbage Collection
      9. Hotspot Glossary of Terms 
      10. Memory Management in the Java HotSpot Virtual Machine White Paper
      11. Java Hotspot Garbage Collection 
      12. FAQ about GC in the Hotspot JVM  (with good details) 
      13. Java Heap Sizing: How do I size my Java heap correctly? 
      14. Java Tuning White Paper 
      15. JRockit JVM Heap Size Options
      16. Pick up performance with generational garbage collection
      17. Which JVM?
      18. Memory Management in the Java HotSpot™ Virtual Machine
      19. Java Performance Tips
      20. A Case Study of java.lang.OutOfMemoryError: GC overhead limit exceeded
      21. HotSpot—java.lang.OutOfMemoryError: PermGen space
