Saturday, March 31, 2012

A Case Study of java.lang.OutOfMemoryError: GC overhead limit exceeded

When you see the following exception:
  • java.lang.OutOfMemoryError: GC overhead limit exceeded
it means the garbage collector is taking an excessive amount of time and recovering very little memory in each run.
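
By default, the HotSpot parallel collector raises this error when roughly 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered (thresholds controlled by -XX:GCTimeLimit and -XX:GCHeapFreeLimit). The check can be disabled with -XX:-UseGCOverheadLimit, but doing so usually only delays a plain "java.lang.OutOfMemoryError: Java heap space". As a rough illustration (not taken from this incident), a program that keeps every allocated object reachable tends to hit this error when run with a deliberately small heap, because each collection recovers almost nothing:

  // Illustrative only: run with a small heap, e.g. "java -Xmx64m MemoryHog".
  // Depending on timing, the JVM may report "GC overhead limit exceeded"
  // or "Java heap space".
  import java.util.HashMap;
  import java.util.Map;

  public class MemoryHog {
      public static void main(String[] args) {
          Map<Integer, String> retained = new HashMap<Integer, String>();
          int i = 0;
          while (true) {
              // Every entry stays reachable, so each GC run recovers almost nothing.
              retained.put(i, "payload-" + i);
              i++;
          }
      }
  }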

In this article, we will walk through a real example using the trace records from the garbage collector's output. To learn more about the garbage collector's output, read [1] first.

Correlating Timestamps between server.log and gc.log

Two log files, generated by the HotSpot VM and WebLogic Server, are used in this study:
  • server.log
  • gc.log
Note that the names and locations of these log files are configurable and may differ in your environment. By default, they are generated in the log directory of the WLS domain.
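
For reference, a gc.log with per-reference-type timing like the one analyzed below is typically produced by HotSpot logging options along the following lines (exact flags vary by JVM version, and the file name here is only an example):

  java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -XX:+PrintReferenceGC -Xloggc:gc.log ...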

server.log

In the server log file, we have spotted an OutOfMemoryError at 5:21:40 AM, as shown below:
  <Mar 30, 2012 5:21:40 AM PDT> <Error> 
  <Kernel> <BEA-000802> <ExecuteRequest failed

  java.lang.OutOfMemoryError: GC overhead limit exceeded.

We have also noted that the first timestamp in the server log file is 4:13:56 AM:

  <Mar 30, 2012 4:13:56 AM PDT> <Info> <Security> <BEA-090905> <Disabling CryptoJ

Because the gc.log file uses elapsed time (in seconds) since the server started, we know this out-of-memory incident happened 4064 seconds after server startup:
  • 5:21:40 - 4:13:56 = 4064 secs.
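
For reference, the same offset can be computed programmatically. Below is a minimal sketch (the class name and hard-coded timestamp strings are ours, and it assumes an English default locale for parsing the month and AM/PM markers):

  import java.text.SimpleDateFormat;
  import java.util.Date;

  public class GcLogOffset {
      public static void main(String[] args) throws Exception {
          SimpleDateFormat fmt = new SimpleDateFormat("MMM d, yyyy h:mm:ss a");
          Date serverStart = fmt.parse("Mar 30, 2012 4:13:56 AM");
          Date oomEvent    = fmt.parse("Mar 30, 2012 5:21:40 AM");
          // Offset of the incident on gc.log's elapsed-seconds scale.
          long elapsedSecs = (oomEvent.getTime() - serverStart.getTime()) / 1000;
          System.out.println(elapsedSecs + " secs");  // prints "4064 secs"
      }
  }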

gc.log

Searching for the Full GC events, we have found that something went wrong about 3167 seconds after the server started.

3167.301: [Full GC3168.441:
  [SoftReference, 786 refs, 0.0001560 secs]3168.442:
  [WeakReference, 25548 refs, 0.0041550 secs]3168.446:
  [FinalReference, 7977 refs, 0.0024050 secs]3168.448:
  [PhantomReference, 40 refs, 0.0000070 secs]3168.448:
  [JNI Weak Reference, 0.0000080 secs]
  [PSYoungGen: 100268K->23067K(349568K)]
  [ParOldGen: 2713398K->2752509K(2752512K)]
  2813666K->2775576K(3102080K)
  [PSPermGen: 210744K->210744K(393216K)], 2.8599650 secs]
  [Times: user=18.52 sys=0.03, real=2.85 secs]

3170.809: [Full GC3171.973:
  [SoftReference, 991 refs, 0.0002140 secs]3171.974:
  [WeakReference, 26280 refs, 0.0041760 secs]3171.978:
  [FinalReference, 12136 refs, 0.0142580 secs]3171.992:
  [PhantomReference, 34 refs, 0.0000100 secs]3171.992:
  [JNI Weak Reference, 0.0000080 secs]
  [PSYoungGen: 197915K->32131K(349568K)]
  [ParOldGen: 2752509K->2752511K(2752512K)]
  2950424K->2784643K(3102080K)
  [PSPermGen: 210744K->210744K(393216K)], 2.7171030 secs]
  [Times: user=17.79 sys=0.04, real=2.72 secs]

We have noticed that a series of Full GCs started at 3167.301 seconds; the excerpt above shows the first two such GC events. The second Full GC was triggered only about 3.5 seconds after the first. Usually, Full GCs are interleaved with multiple Minor GCs in between, but that was not the case here.
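
To put numbers on it: the second Full GC began at 3170.809 seconds, only about 3.5 seconds after the first one began at 3167.301 seconds, and the first one alone took 2.85 seconds of real time. In other words, roughly 80% of that interval was spent inside a Full GC, leaving very little time for application work.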

From examining the GC output, we have concluded that:
  • Full GCs happened too frequently (roughly every 3 to 4 seconds)
  • Old generation space was full and remained full after garbage collection.
However, at this point, the exception
  • java.lang.OutOfMemoryError: GC overhead limit exceeded
had not been thrown yet.
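
A simple way to confirm this pattern across the entire gc.log is to scan it for ParOldGen records and flag those where the old generation stays near capacity after collection. Below is a minimal sketch (the class name and the 99% threshold are ours; it assumes each record matches the before->after(capacity) format shown above and appears on a single line), which you could run as "java FullGcScan gc.log":

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class FullGcScan {
      // Matches records such as "[ParOldGen: 2713398K->2752509K(2752512K)]".
      private static final Pattern OLD_GEN =
          Pattern.compile("\\[ParOldGen: (\\d+)K->(\\d+)K\\((\\d+)K\\)\\]");

      public static void main(String[] args) throws Exception {
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          for (String line; (line = in.readLine()) != null; ) {
              Matcher m = OLD_GEN.matcher(line);
              if (m.find()) {
                  long after    = Long.parseLong(m.group(2));
                  long capacity = Long.parseLong(m.group(3));
                  double pct = 100.0 * after / capacity;
                  if (pct > 99.0) {  // old generation still (nearly) full after the Full GC
                      System.out.printf("old gen %.1f%% full after GC: %s%n", pct, line.trim());
                  }
              }
          }
          in.close();
      }
  }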

Based on our calculation, we know the exception was thrown about 4064 seconds after the server started. Here is the garbage collector's output around that time:

4064.456: [Full GC4065.667:
  [SoftReference, 2380 refs, 0.0003730 secs]4065.667:
  [WeakReference, 47627 refs, 0.0091360 secs]4065.677:
  [FinalReference, 3007 refs, 0.0007760 secs]4065.677:
  [PhantomReference, 42 refs, 0.0000070 secs]4065.677:
  [JNI Weak Reference, 0.0000090 secs]
  [PSYoungGen: 174848K->172682K(349568K)]
  [ParOldGen: 2752509K->2752508K(2752512K)]
  2927357K->2925191K(3102080K)
  [PSPermGen: 210598K->210598K(393216K)], 2.3630820 secs]
  [Times: user=16.01 sys=0.05, real=2.37 secs]

4066.837: [Full GC4068.051:
  [SoftReference, 2355 refs, 0.0003750 secs]4068.051:
  [WeakReference, 43590 refs, 0.0081260 secs]4068.060:
  [FinalReference, 6859 refs, 0.0012270 secs]4068.061:
  [PhantomReference, 42 refs, 0.0000220 secs]4068.061:
  [JNI Weak Reference, 0.0000090 secs]
  [PSYoungGen: 174848K->171709K(349568K)]
  [ParOldGen: 2752508K->2752510K(2752512K)]
  2927356K->2924219K(3102080K)
  [PSPermGen: 210598K->210598K(393216K)], 2.8005630 secs]
  [Times: user=18.47 sys=0.02, real=2.81 secs]
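
These two records make the failure condition concrete: the Full GC at 4066.837 seconds started only about 2.4 seconds after the one at 4064.456 seconds, which itself took 2.37 seconds of real time, so nearly all of the elapsed time was being spent in garbage collection. Furthermore, each collection reclaimed only about 3 MB of a roughly 3 GB heap (2927356K -> 2924219K), far below the default 2% free-heap threshold mentioned earlier, which is why the GC overhead limit was finally exceeded around this point.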

Heap Size Adjustment

When a Full GC happens, you need to determine whether it is the occupancy of the old generation space or the occupancy of the permanent generation space that triggered it.  In our case, the trigger was not the occupancy of the permanent generation space:
  • [PSPermGen: 210598K->210598K(393216K)]
From the above, we know the total permanent generation space was 393216K.  After the Full GC, its occupancy remained stable at 210598K, well below capacity.

However, there was a serious issue with the old generation space:
  • [ParOldGen: 2752508K->2752510K(2752512K)]
It became full and remained full after space reclamation.
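
Putting numbers on the comparison: after the Full GC, the permanent generation held 210598K of a 393216K capacity (roughly 54% used), while the old generation held 2752510K of 2752512K (essentially 100% used). The old generation is therefore the space that needs to grow.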

Conclusion

If you observe an OutOfMemoryError in the garbage collection logs, try increasing the Java heap size (which covers both the young and old generations) or the permanent generation space.  If the issue comes from the old generation, you can increase the Java heap size up to about 80% of the physical memory available to the JVM, taking into consideration the memory needed by the OS, the memory needed by other applications running concurrently, and so on.

Based on whether the old generation space or the permanent generation space is running out of memory, adjust the sizes of the heap spaces as follows (see the example after this list):
  • For old generation space OutOfMemoryErrors
    • increase -Xms and -Xmx
  • For permanent generation OutOfMemoryErrors
    • increase -XX:PermSize and -XX:MaxPermSize
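
For example, on a machine with 8 GB of physical memory devoted mostly to this JVM, a starting point along these lines might be reasonable (the values below are purely illustrative and must be tuned to your own system and workload; -Xms is commonly set equal to -Xmx to avoid heap resizing):

  java -Xms6g -Xmx6g -XX:PermSize=512m -XX:MaxPermSize=512m ...

Keep in mind that the Java heap, the permanent generation, and the JVM's own overhead must all still fit within the physical memory left over after the OS and other applications.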

References

  1. Understanding Garbage Collector Output of Hotspot VM 
  2. Java Tuning White Paper 
  3. Java 2 Platform, Standard Edition 5.0 Troubleshooting and Diagnostic Guide
  4. Which JVM?
