
Showing posts with label HotSpot Performance.

Sunday, October 12, 2014

JDK 8: Is Tuning MaxNewSize in G1 GC a Good Idea?

For a throughput collector such as Parallel GC, setting MaxNewSize (or -Xmn) often helps performance.  Does the same hold for G1 GC?[1] The short answer is: it depends (confession: I said "no" because I haven't done enough tests to rule it out).

In this article, we will demonstrate one case in which setting MaxNewSize for G1 actually hurts the application's performance.  BTW, after benchmarking G1 in JDK 7, another author claimed that:[14]
  • There is little to gain and much to lose from setting the new generation size explicitly.

Live Data Size


To tune the heap size of a Java application, it's important to find out its live data size.  To learn how to estimate a Java application's live data size, read [2].  For the benchmark we are using, the live data size is around 1400 MB.  With this benchmark, we are interested in knowing how G1 performs with MaxNewSize set to 400 MB and the Java heap set to 2 GB.

Normally, we use the following settings for G1 tuning:
  • -Xms2g -Xmx2g -XX:+UseG1GC
plus some options for mixed GC (read [3]).  So, for this experiment, we added the MaxNewSize setting like this:
  • -Xms2g -Xmx2g -XX:+UseG1GC -Xmn400m
This turned out to hurt performance a lot.  Note that -Xmn400m is the optimal setting for our benchmark when Parallel GC is used.

Concurrent Cycle in G1 GC


G1 has four main operations:[5]
  • A young collection
  • A background, concurrent cycle
  • A mixed collection
  • If necessary, a full GC

Without much ado, read [5-9] for the needed background of G1 GC.

G1 is a concurrent collector.  Its concurrent cycle has multiple phases—some of which stop all application threads (denoted by STW) and some of which do not:
  • Initial-mark (STW)
    • Denoted by: 
      • [GC pause (G1 Evacuation Pause) (young) (initial-mark)
      • [GC pause (Metadata GC Threshold) (young) (initial-mark)
    • The G1 GC marks the roots during this phase. This phase is piggybacked on a normal (STW) young garbage collection.
  • Root Region Scan
    • Denoted by: 
      • [GC concurrent-root-region-scan-start]
      • [GC concurrent-root-region-scan-end
    • The G1 GC scans survivor regions of the initial mark for references to the old generation and marks the referenced objects. 
    • This phase runs concurrently with the application (not STW) and must complete before the next STW young garbage collection can start.
  • Concurrent marking
    • Denoted by:
      • [GC concurrent-mark-start]
      • [GC concurrent-mark-end
    • The G1 GC finds reachable (live) objects across the entire heap. 
    • This phase happens concurrently with the application, and can be interrupted by STW young garbage collections.
  • Remark (STW)
    • Denoted by:
      • [GC remark
    • This is a STW phase that helps complete the marking cycle. 
      • It finds objects that were missed by the concurrent mark phase because Java application threads updated them after the concurrent collector had already traced them.
    • G1 GC drains SATB buffers, traces unvisited live objects, and performs reference processing.
  • Cleanup (STW)
    • Denoted by:
      • [GC cleanup 1936M->1931M(2048M)
    • Prepares for the next concurrent collection by clearing data structures.
      • In this final phase, the G1 GC performs the STW operations of accounting and RSet scrubbing. 
      • During accounting, the G1 GC identifies completely free regions and mixed garbage collection candidates. The cleanup phase is partly concurrent when it resets and returns the empty regions to the free list.
  • Concurrent cleanup:
    • Denoted by:
      • [GC concurrent-cleanup-start]
      • [GC concurrent-cleanup-end
    • In this phase, G1 reclaims regions which were found empty during marking.
      • Adds empty regions to the free list—i.e., thread local free lists are merged into global free list

G1 GC uses the Snapshot-At-The-Beginning (SATB) algorithm, which takes a snapshot of the set of live objects in the heap at the start of a marking cycle.  During the concurrent cycle, young garbage collections are allowed; they are triggered when eden fills up (note that initial-mark is implemented using a young collection cycle).

The set of live objects after the marking cycle is composed of the live objects in the snapshot plus the objects allocated since the start of the marking cycle.  The G1 GC marking algorithm uses a pre-write barrier to record and mark objects that are part of the logical snapshot.
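
The pre-write barrier idea above can be sketched in a few lines of Java.  This is a conceptual model only, not HotSpot's actual barrier code: the class, the `writeRef` method, and the buffer are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Conceptual sketch of an SATB pre-write barrier (not HotSpot's code):
// before a reference field is overwritten while marking is active, the old
// referent is recorded so the concurrent marker can still trace the
// "snapshot" of objects that were live when the marking cycle started.
public class SatbSketch {
    static final Deque<Object> satbBuffer = new ArrayDeque<>();
    static boolean markingActive = true;

    static class Node {
        Object ref;
    }

    // All reference stores go through this "barrier" while marking is active.
    static void writeRef(Node holder, Object newValue) {
        if (markingActive && holder.ref != null) {
            satbBuffer.push(holder.ref); // remember the old value for the marker
        }
        holder.ref = newValue;
    }

    public static void main(String[] args) {
        Node n = new Node();
        writeRef(n, "a");   // old value is null: nothing recorded
        writeRef(n, "b");   // old value "a" is recorded in the SATB buffer
        System.out.println("SATB buffer entries: " + satbBuffer.size());
    }
}
```

Because only the old value is recorded, objects allocated after the snapshot never enter the buffer; they are considered live implicitly, matching the SATB description above.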

After the concurrent cycle, we expect to see:[5]
  • The eden regions present before the marking cycle have been completely freed and new eden regions have been allocated
  • Old regions could be more occupied because of the promotion of live objects from young regions
  • Some old regions are identified as mostly garbage and become candidates in later mixed or old collection cycles

Diagnosis


The reason setting -Xmn400m caused the G1 regression is this:
The option forces G1 to use a young generation space of up to 400 MB, leaving G1 only about 250 MB (i.e., 2048 MB - 400 MB - 1400 MB) of breathing room for shuffling live objects around.
The small free space has a negative impact on G1's concurrent cycles, causing most marking cycles to fail to complete in time.
Before setting -Xmn400m on the command line, we found only one instance of an aborted marking cycle, which is denoted by:
  • [GC concurrent-mark-abort]
After setting it, we saw the marking cycle aborted 105 times out of 150 instances.  The lesson we have learned here is that for a tight heap like our benchmark's, it's not a good idea to set -Xmn400m for G1.  Without setting MaxNewSize, G1's ergonomics dynamically adjusts the young generation size, and it turns out that G1 does a better job in this case.
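
Counting aborted marking cycles amounts to scanning the GC log for the abort marker.  The helper below is our own illustrative sketch, not a JDK tool; in practice the lines would come from the file given to -Xloggc:.

```java
import java.util.Arrays;
import java.util.List;

// Counts "[GC concurrent-mark-abort]" occurrences in a set of G1 GC log
// lines. Comparing this count against the number of marking cycles started
// shows how often marking failed to complete in time.
public class MarkCycleStats {
    static long countAborts(List<String> logLines) {
        return logLines.stream()
                .filter(l -> l.contains("concurrent-mark-abort"))
                .count();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[GC concurrent-mark-start]",
                "[GC concurrent-mark-abort]",
                "[GC concurrent-mark-start]",
                "[GC concurrent-mark-end, 0.1s]");
        System.out.println("aborted cycles: " + countAborts(sample));
    }
}
```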

Differences between Parallel GC and G1 GC


Note that -Xmn400m is the optimal setting for our benchmark if Parallel GC is used.  When we set MaxNewSize to 400 MB for G1, our benchmark regressed.  So, what's the difference between G1 GC and Parallel GC?

Initially, G1 GC was designed to replace CMS [10] for its relatively lower and more predictable pause times.  This is different from the design of Parallel GC, which aims for higher throughput.  To gain higher throughput, Parallel GC tries to:
  • grow the young generation as large as possible and let it run into a full GC
    • This means more frequent full GCs are expected
      • In our 4-hour experiments, we usually see around 50 full GCs with Parallel GC
      • Note that Parallel GC's full GC pause time is shorter than G1's in the current JDK 8 releases
    • For G1's performance tuning, in contrast, we want to avoid full GCs (see [11] for how)
To conclude: if you upgrade the JDK or switch garbage collectors,[7] always re-evaluate your previously optimal VM settings.

Acknowledgement


Some writings here are based on feedback from Thomas Schatzl.  However, the author assumes full responsibility for the content.

References

  1. Garbage First Garbage Collector Tuning
  2. JRockit: How to Estimate the Size of Live Data Set
  3. g1gc logs - basic - how to print and how to understand
  4. g1gc logs - Ergonomics - how to print and how to understand
  5. Java Performance: The Definitive Guide (Strongly recommended)
  6. Garbage First Garbage Collector Tuning - Oracle
  7. Our Collectors by Jon Masamitsu
  8. Understanding Garbage Collection
  9. HotSpot VM Performance Tuning Tips
  10. Understanding CMS GC Logs
  11. G1 GC: Tuning Mixed Garbage Collections in JDK 8
  12. G1 GC Glossary of Terms
  13. Learn More About Performance Improvements in JDK 8 
  14. Benchmarking G1 and other Java 7 Garbage Collectors
  15. HotSpot Virtual Machine Garbage Collection Tuning Guide
  16. Getting Started with the G1 Garbage Collector
  17. Garbage-First Garbage Collector (JDK 8 HotSpot Virtual Machine Garbage Collection Tuning Guide)
  18. Other JDK 8 articles on Xml and More
  19. Tuning that was great in old JRockit versions might not be so good anymore
    • Trying to bring over each and every tuning option from a JR configuration to an HS one is probably a bad idea.
    • Even when moving between major versions of the same JVM, we usually recommend going back to the default (just pick a collector and heap size) and then redoing any tuning work from scratch (if even necessary).

Sunday, September 14, 2014

G1 GC: Tuning Mixed Garbage Collections in JDK 8

To tune G1 GC (Garbage First Garbage Collector), a good place to start is [1].  For the latest information, read [2].

Assuming you have minimal knowledge of G1 GC, the focus of this article is to tune the Mixed GC for better performance.

The Need to Tune Mixed GC


G1 GC is a generational garbage collector—read [3] to learn what the eden, survivor, and old generation spaces are.  It also uses a region-based architecture that divides the large contiguous Java heap space into multiple fixed-size heap regions.

The destination region for a particular object depends upon the object's age; an object that has aged sufficiently is evacuated to an old generation region (or promoted); otherwise, the object is evacuated to a survivor region and will be included in the CSet[5] of the next young or mixed garbage collection.

During young collections, G1 GC adjusts its young generation (eden and survivor sizes) to meet its pause target. During mixed collections, the G1 GC adjusts the number of old gen regions that are collected based on a target number of mixed garbage collections, the percentage of live objects in each region of the heap, and the overall acceptable heap waste percentage (see details later).

Depending on your application's workload, Full GC could be expensive in G1 GC.  Since Mixed GC collects both young and old regions, you can get better performance by tuning Mixed GC—the goal is to reduce the number of Full GC's when your applications run.

More on Mixed Garbage Collections


Upon successful completion of a concurrent marking cycle, the G1 GC switches from performing young garbage collections to performing mixed garbage collections. In a mixed garbage collection, the G1 GC optionally adds some old regions to the set of eden and survivor regions that will be collected. The exact number of old regions added is controlled by a number of flags that will be discussed later. After the G1 GC collects a sufficient number of old regions (over multiple mixed garbage collections), G1 reverts to performing young garbage collections until the next marking cycle completes.

To summarize, Mixed GC can be characterized by:
  • Collecting both young and old regions
  • Denoted by: [GC pause (G1 Evacuation Pause) (mixed) in the GC log file[2]
  • Only after a completed marking cycle
  • A sequence of mixed GC events to collect old regions, up to 8 by default

Tuning Mixed Garbage Collections


You can play with the following options to tune Mixed GC:[1,2,4]
  • -XX:InitiatingHeapOccupancyPercent 
    • Percentage of the (entire) heap occupancy to start a concurrent GC cycle. It is used by GCs that trigger a concurrent GC cycle based on the occupancy of the entire heap, not just one of the generations.  The default value is 45.
    • This option can be used to change the marking threshold
      • If the threshold is exceeded, concurrent marking will be initiated next.
      • The higher the threshold, the fewer concurrent marking cycles there will be, which also means fewer mixed GC evacuations.
  • -XX:G1MixedGCLiveThresholdPercent 
    • This option can be used to change the threshold which determines whether a region should be added to the CSet or not.
      • Only regions whose live data percentage is less than the threshold will be added to the CSet.
      • The higher the threshold (default: 65), the more likely a region will be added to the CSet, which also means more mixed GC evacuations and longer evacuation times.
  • -XX:G1HeapWastePercent
    • The amount of reclaimable space, expressed as a percentage of the heap size, below which G1 will stop doing mixed GCs.  If the amount of space that can be reclaimed from old generation regions, compared to the total heap, is less than this, G1 stops doing mixed GCs.
    • The current default is 10%.  A lower value, say 5%, will potentially cause G1 to add more expensive regions to evacuate for space reclamation.
      • G1 continues triggering mixed GCs as long as the reclaimable percentage is higher than the waste threshold.  So, a lower waste threshold means more mixed GCs, and possibly more expensive ones.
Besides the above-mentioned options, you may also want to tune the following ones:
  • -XX:G1MixedGCCountTarget 
    • Sets the target number of mixed garbage collections after a marking cycle to collect old regions with at most G1MixedGCLiveThresholdPercent live data.
    • The default is 8 mixed garbage collections. The goal for mixed collections is to be within this target number.
  • -XX:G1OldCSetRegionThresholdPercent
    • Sets an upper limit on the number of old regions to be collected during a mixed garbage collection cycle. The default is 10 percent of the Java heap.
  • -XX:G1HeapRegionSize
    • Sets the size of a G1 region. The value will be a power of two and can range from 1MB to 32MB. 
    • The goal is to have around 2048 regions based on the minimum Java heap size.
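
The interplay of these flags can be illustrated with a simplified policy sketch.  This is not HotSpot's actual code; the method names and the decision logic are a rough approximation of the rules described above, using the default flag values.

```java
// Simplified sketch of the mixed-GC tuning rules (not HotSpot's code):
// a region is a CSet candidate only if its live percentage is below
// G1MixedGCLiveThresholdPercent, and mixed collections continue only while
// reclaimable space exceeds G1HeapWastePercent of the heap.
public class MixedGcPolicySketch {
    static final int liveThresholdPercent = 65; // -XX:G1MixedGCLiveThresholdPercent default
    static final int heapWastePercent = 10;     // -XX:G1HeapWastePercent default

    static boolean isCandidate(int regionLivePercent) {
        return regionLivePercent < liveThresholdPercent;
    }

    static boolean continueMixedGc(long reclaimableBytes, long heapBytes) {
        // Continue while reclaimable space is more than heapWastePercent of the heap.
        return reclaimableBytes * 100 > (long) heapWastePercent * heapBytes;
    }

    public static void main(String[] args) {
        System.out.println(isCandidate(40));  // mostly garbage: a candidate
        System.out.println(isCandidate(90));  // mostly live: skipped
        // ~14.6% of a 2 GB heap is reclaimable, above the 10% waste threshold:
        System.out.println(continueMixedGc(300L << 20, 2048L << 20));
    }
}
```

Raising G1HeapWastePercent in this model makes `continueMixedGc` return false sooner, which matches the description above: fewer, cheaper mixed collections at the cost of leaving more garbage in place.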

Conclusion


Each application is unique, so you may need to tune G1 GC in an iterative process.  Note that all of the above options are product options except G1MixedGCLiveThresholdPercent and G1OldCSetRegionThresholdPercent, which are experimental options.  Be warned that Oracle may remove experimental options at its discretion in future releases.

Acknowledgement


Some writings here are based on feedback from Thomas Schatzl and Yu Zhang.  However, the author assumes full responsibility for the content.

References

  1. Garbage First Garbage Collector Tuning
  2. g1gc logs - Ergonomics - how to print and how to understand
  3. Understanding Garbage Collection  
  4. Garbage First (G1) Garbage Collection Options
  5. CSet
    • The G1 GC reduces heap fragmentation by incremental parallel copying of live objects from one or more sets of regions (called Collection Set (CSet)) into different new region(s) to achieve compaction.
    • The goal of G1 GC is to reclaim as much heap space as possible, starting with those regions that contain the most reclaimable space (i.e., garbage first), while attempting to not exceed the pause time goal.
  6. G1 GC Glossary of Terms
  7. Learn More About Performance Improvements in JDK 8 
  8. HotSpot Virtual Machine Garbage Collection Tuning Guide
  9. Garbage-First Garbage Collector (JDK 8 HotSpot Virtual Machine Garbage Collection Tuning Guide)
  10. G1 GC: Humongous Objects and Humongous Allocations (Xml and More)
  11. Other JDK 8 articles on Xml and More
  12. Concurrent Marking in G1

Friday, August 22, 2014

HotSpot: Sizing System Dictionary in JDK 8

The system dictionary is an internal JVM data structure which holds all the classes loaded by the system.  As described in JDK-7114376,[1]
The System Dictionary hashtable bucket array size is fixed at 1009. This value is too small for large programs with many names, too large for small programs and just about right for medium sized ones. The default should remain at 1009, but it should be possible to override it on the command line.
Finally, this has happened in JDK 8: you can set the hashtable bucket array size to be larger if your applications load more classes, by using:[2]
  • -XX:+UnlockExperimentalVMOptions  -XX:PredictedLoadedClassCount=<#>
This can help calls like Class.forName(), which do lookups into this data structure.

In this article, we will look into sizing system dictionary in more details.

Performance of System Dictionary


In J2SE 1.4.2, there were some performance enhancements.  One of them was making system dictionary reads lock-free.[3]  From this enhancement, you can probably guess that tuning the system dictionary could be important to the JVM's performance.

The number of buckets in the hash table for the system dictionary is set to 1009 by default, which is relatively small for an application that has 49987 classes loaded, as shown below:

Loaded   Bytes Unloaded   Bytes       Time
49987 110488.8     2453  7743.8     105.24

From experience, we know:[5]
In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that.
Assuming the ideal average bucket length is three, the hashtable bucket array size should then be:
Ideal hashtable bucket array size = 49987 / 3 ≈ 16662

 

System Dictionary Tuning


Seeing 49987 classes loaded in our application at the end of its run, we decided to set:
-XX:PredictedLoadedClassCount=16661 (a prime)

Note that PredictedLoadedClassCount is an experimental flag.  So, you also need to set:
-XX:+UnlockExperimentalVMOptions
before it in the command line.

When the JVM allocates memory, the largest chunk is the Java heap.  The system dictionary lives in the C heap and serves as the loaded-class cache.  The goal of setting the PredictedLoadedClassCount flag is to increase the size of the system dictionary in order to make lookups of loaded classes faster.  Before a class is loaded, the JVM must check whether the class is already loaded.  A larger system dictionary will improve JVM performance during this class loading and resolution phase.

Note that the total number of entries in the hashtable does not change—that is based on the number of loaded classes.  What sizing the system dictionary does is increase the spread of the entries, so that the average length of the bucket chain under a single hash result is reduced.  This reduces the time it takes to find a given entry.
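
A quick back-of-the-envelope calculation shows why the spread matters.  The helper below is just illustrative arithmetic, not a JVM API:

```java
// Average chain length in a chained hash table is roughly entries / buckets
// (assuming a reasonably uniform hash), so a lookup in the default-sized
// table scans ~50 entries per bucket versus ~3 in the tuned table.
public class BucketMath {
    static long avgChainLength(long entries, long buckets) {
        return Math.round((double) entries / buckets);
    }

    public static void main(String[] args) {
        long loadedClasses = 49987;  // from the "jstat -class" output above
        System.out.println(avgChainLength(loadedClasses, 1009));   // default size
        System.out.println(avgChainLength(loadedClasses, 16661));  // tuned size
    }
}
```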

How to Find Number of Loaded Classes?


There are multiple ways to find out the number of loaded classes in an application.  For example, you can use -verbose:class to determine which classes are loaded.  Another way, which is what we have done in this article, is to set -XX:+PerfDataSaveToFile and then use the "jstat -class" command to decipher the class statistics offline:

$<snipped>/jdk-hs/bin/jstat -class file:////slot/myserver/appmgr/APPTOP/instance/domains/slcaf977.us.oracle.com/CRMDomain/hsperfdata_9872

Loaded   Bytes Unloaded   Bytes       Time
49987 110488.8     2453  7743.8     105.24

Acknowledgement


Some writings here are based on feedback from Mikael Gerdin and Karen Kinnear.  However, the author assumes full responsibility for the content.

References

  1. Make system dictionary hashtable bucket array size configurable  (JDK-7114376)
  2. Tuning the JVM (at 43:02 mark)
  3. Java 2 Platform, Standard Edition (J2SE Platform), version 1.4.2 Performance White Paper 
  4. The Class Loader Subsystem 
  5. Hash table (Wikipedia) 
  6. VM Class Loading
    • Actually, the HotSpot VM maintains three main hash tables to track class loading
      • SystemDictionary  
      • PlaceholderTable 
      • LoaderConstraintTable
  7. Garbage-First Garbage Collector Tuning

Friday, August 1, 2014

HotSpot: What Does {pd product} Mean?

In [1], we have discussed how to find out default HotSpot JVM values.  In the output, you may have spotted:
{pd product}
What does it mean? In this article, we will examine that in depth.

Platform Dependent


{pd product} means a platform-dependent product option.  First of all, if a flag is a product option, you may want to pay attention to it and perhaps change its value for better performance.  If it's an experimental or diagnostic flag, you may want to ignore it.

If an option is {pd product}, its default value depends on the platform for which its implementation is compiled.  Supported platforms currently include, but are not limited to:
  • ARM
  • PPC
  • Solaris
  • x86

Examples


We would like to discuss two of the {pd product} flags further here.  In JDK 7u51,[2] they have the following default values on x86 platforms:

    uintx ReservedCodeCacheSize     = 50331648        {pd product}
    bool  TieredCompilation         = false           {pd product}

These default values have been changed in later JDK 8 builds.  For example, in JDK 8u20, they have the following default values on the same server:

    uintx ReservedCodeCacheSize     = 251658240        {pd product}
    bool  TieredCompilation         = true             {pd product}

As discussed in [3,4], you may want to tune these flags for better performance.


References

  1. What Are the Default HotSpot JVM Values?
  2. Java™ SE Development Kit 7, Update 51 (JDK 7u51)
  3. Performance Tuning with Hotspot VM Option: -XX:+TieredCompilation (XML and More)
  4. A Case Study of Using Tiered Compilation in HotSpot (XML and More)  

Friday, May 31, 2013

Understanding String Table Size in HotSpot

JDK-6962930[2] requested that the string table size be made configurable.  That bug was resolved on 04/25/2011 and the fix is available in JDK 7.  Another JDK bug[3] requested that the default string table size (i.e., 1009) be increased.

In this article, we will examine the following topics:
  • What string table is
  • How to find the number of interned strings in your applications
  • The tradeoff between memory footprint and lookup cost

String Table


In Java, string interning[1] is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool, which is the string table in HotSpot.
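
A minimal Java example of interning in action:

```java
// Demonstrates string interning: intern() returns the canonical copy stored
// in the string table, so interned strings can be compared with ==.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("hello");  // a distinct heap object
        String b = "hello";              // a string literal, already interned

        System.out.println(a == b);          // false: different objects
        System.out.println(a.intern() == b); // true: intern() returns the pooled copy
    }
}
```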

The size of the string table (i.e., a chained hash table) is configurable in JDK 7.  When the overflow chains become long, performance can degrade.  The current default size of the string table is 1009 (i.e., 1009 buckets), which is too small for applications that stress the string table.  Note that the string table itself is allocated in native memory, but the strings are Java objects.

Increasing the size improves performance (i.e., reduces lookup cost) but increases the StringTable size by 16 bytes on 64-bit systems and 8 bytes on 32-bit systems for every additional entry.  For example, changing the default size to 60013 increases the StringTable size by about 460 KB on 32-bit systems.
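
The footprint arithmetic can be checked quickly (the helper below is ours, for illustration only):

```java
// Verifies the footprint claim above: each additional bucket costs 8 bytes
// on a 32-bit system, so growing the table from 1009 to 60013 buckets adds
// roughly 460 KB.
public class StringTableFootprint {
    static long extraBytes(long oldSize, long newSize, long bytesPerEntry) {
        return (newSize - oldSize) * bytesPerEntry;
    }

    public static void main(String[] args) {
        long extra = extraBytes(1009, 60013, 8); // 32-bit: 8 bytes per entry
        System.out.println(extra + " bytes ~= " + extra / 1024 + " KB");
    }
}
```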

Finding Number of Interned Strings in the Applications


HotSpot provides a product-level option named PrintStringTableStatistics, which can be used to print hash table statistics.[4]  For example, using one of our applications (hereafter referred to as JavaApp), it prints out the following information:

StringTable statistics:
Number of buckets  : 60013
Average bucket size  : 5
Variance of bucket size : 5
Std. dev. of bucket size: 2
Maximum bucket size  : 17

You can find the above output in your managed server's log file in the WebLogic domain.  Note that we have set the following option:
  • -XX:StringTableSize=60013
So, there are 60013 buckets in the hash table (or string table).

The JDK also provides a tool named jmap, which can be used to find out the number of interned strings in your application.  For example, we found the following information using:

$ jdk-hs/bin/jmap -heap 18974
Attaching to process ID 18974, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.0-b43

using thread-local object allocation.
Parallel GC with 18 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 2147483648 (2048.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 402653184 (384.0MB)
   MaxPermSize      = 402653184 (384.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation

<deleted for brevity>

270145 interned Strings occupying 40429904 bytes.


Therefore, we know there are around 270K interned Strings in the table.

Tradeoff Between Memory Footprint and Lookup Cost


Out of curiosity, we tried setting the string table size to 277331 (a prime number) to see how JavaApp performs.  Here are our findings:

  • Average Response Time: +0.75%
  • 90% Response Time: +0.56%

However, the memory footprint has increased:
  • Total Memory Footprint: -1.03%

Finally, here is the hash table statistics based on the new size (i.e., 277331):

StringTable statistics:
Number of buckets       :  277331
Average bucket size     :       1
Variance of bucket size :       1
Std. dev. of bucket size:       1
Maximum bucket size     :       8


The conclusion is that increasing the string table size from 60013 to 277331 helps JavaApp's performance a little bit at the expense of a larger memory footprint.  In this case, the benefit is minimal, so keeping the string table size at 60013 is good enough.

References

  1. String Interning (Wikipedia)
  2. JDK 6962930 : make the string table size configurable
  3. JDK 8009928: Increase default value for StringTableSize
  4. Java GC tuning for strings
  5. All other performance tuning articles on XML and More
  6. G1 GC Glossary of Terms


Wednesday, March 27, 2013

Analyzing the Performance Issue Caused by WebLogic Session Size Too Big

In this article, we will show you several interesting things:
  1. How to investigate an issue with too many live objects kept in old generation, which cannot be reclaimed in Full GC by HotSpot VM
    • For background information regarding Garbage Collectors, read [1-7]
  2. How important the session-timeout element in web.xml is
  3. How to use a Java tool to generate a heap dump
  4. How to use Eclipse Memory Analyzer to analyze the heap dump (i.e., the fod.hprof file)

The Issues


We have seen slow performance in one of our benchmarks (i.e., FOD).  Here are the symptoms:
  • GC took about 50% of the application's CPU time
  • Full GC's were not triggered by full Perm Gen, but by full Old Gen
    • The old generation is completely full and it isn't clearing up anything but a few soft references during Full GC.
So, the symptoms all point to too many live objects kept in Old Gen, which have caused many Full GC's.  There could be different causes for these symptoms.  To investigate, you need to create heap dumps and check what objects the HotSpot VM is holding onto.  Below we will show you how.

Generating Heap Dump


First, we have used a Java tool jmap to generate a summary of the heap contents:

$./jmap -histo 16345 >/scratch/aime1/tmp/fod.summary

 num     #instances         #bytes  class name
----------------------------------------------
   1:      15777231      822030680  [Ljava.lang.Object;
   2:       3074270      463615192  [C
   3:       4076972      331840256  [Ljava.util.HashMap$Entry;
   4:       9287101      297187232  java.util.HashMap$Entry
   5:       2105437      151591464  org.apache.myfaces.trinidad.bean.util.PropertyHashMap
   6:       4075444       97810656  java.util.HashMap$FrontCache
   7:       1893014       90864672  java.util.HashMap
   8:       2018043       80721720  org.apache.myfaces.trinidad.bean.util.FlaggedPropertyMap
   9:       3017161       72411864  java.lang.String


Unfortunately, it didn't tell us much regarding the biggest Java objects, which have taken ~800 MB.  But we did notice that two other property maps also took up a lot of memory space.  So, the next step is to generate a full dump as follows:

$./jmap -dump:live,file=/scratch/aime1/tmp/fod.hprof 16345

To analyze the full heap dump, you need to use Eclipse Memory Analyzer; we have chosen the standalone version.

Analyzing Heap Dump With Memory Analyzer (MAT)


I started the Memory Analyzer with the following command:

$ cd /scratch/sguan/mat/
$ ./MemoryAnalyzer &

When I opened the heap dump file (i.e., fod.hprof), MAT failed with a message related to Java heap space.  So, I needed to modify the following line in the MemoryAnalyzer.ini file:

-vmargs
-Xmx1024m

by changing -Xmx1024m to -Xmx6240m.  Note that how big you should set the -Xmx option depends on:
  • Size of heap dump
    • Our original file size was 6.4 GB and was reduced to 3.6 GB after loading.  Note that MAT removes unreachable objects during the loading process.
  • Size of physical RAM in your system

The Culprit— WebLogic Session Objects


From the MAT, we have found out that the biggest live object kept in the heap was:
  • weblogic.servlet.internal.session.MemorySessionContext @ 0x705e0ff08 
The two previously mentioned property map objects are also related to the session object.  If you trace those objects back to the GC root, you will see that they are pointed to by the session state.  They are the reason the session state is so big.

session-timeout Element in web.xml


The session object is used by the WebLogic container to track a user's session with the StoreFrontModule web application in our FOD benchmark.  Every time a new user comes in, a session starts.  This session object keeps the user's data or state.  If the user leaves, the session is considered idle and the server retains it in memory for a period of time before reclaiming it.

Java web applications can be configured with a session timeout value which specifies the number of minutes a session can be idle before it is abandoned.  This session timeout element is defined in the Web Application Deployment Descriptor web.xml[9]. For example, this was the session timeout value we had:

<session-config>
<session-timeout>35</session-timeout>
</session-config>

Our benchmark's slowness was due to this session timeout value being too high, which caused sessions to pile up and take up lots of memory.  After setting it to a lower value (see [10] for how to patch web.xml in an EAR), our benchmark performed normally.

Tuesday, March 12, 2013

HotSpot—java.lang.OutOfMemoryError: PermGen space

There could be different causes that lead to out-of-memory error in HotSpot VM.  For example, you can run out of memory in PermGen space:
  • java.lang.OutOfMemoryError: PermGen space

In this article, we will discuss:
  • Java Objects vs Java Classes
  • PermGen Collection[2]
  • Class unloading
  • How to find the classes allocated in PermGen?
  • How to enable class unloading for CMS?
Note that this article is mainly based on Jon's excellent article[1].

Java Objects vs Java Classes


Java objects are instantiations of Java classes. HotSpot VM has an internal representation of those Java objects and those internal representations are stored in the heap (in the young generation or the old generation[2]). HotSpot VM also has an internal representation of the Java classes and those are stored in the permanent generation.

PermGen Collector


The internal representation of a Java object and an internal representation of a Java class are very similar.  From now on, we use Java objects and Java classes to refer to their internal representations.  The Java objects and Java classes are similar to the extent that during a garbage collection both are viewed just as objects and are collected in exactly the same way.

Besides its basic fields, a Java class also includes the following:
  • Methods of a class (including the bytecodes)
  • Names of the classes (in the form of an object that points to a string also in the permanent generation)
  • Constant pool information (data read from the class file, see chapter 4 of the JVM specification for all the details).
  • Object arrays and type arrays associated with a class (e.g., an object array containing references to methods).
  • Internal objects created by the JVM (java/lang/Object or java/lang/Exception, for instance)
  • Information used for optimization by the compilers (JITs)
There are a few other bits of information that end up in the permanent generation but nothing of consequence in terms of size. All these are allocated in the permanent generation and stay in the permanent generation.

Class Loading/Unloading


In the old days, most classes were static and custom class loaders were rarely used, so class unloading was often unnecessary.  However, things have changed, and sometimes you may run into the following message:
  • java.lang.OutOfMemoryError: PermGen space
In this case, there are at least two options:
  • Increasing the size of PermGen
  • Enabling class unloading

Increasing the Size of PermGen


Sometimes there is a legitimate need to increase PermGen size by setting the following options:
  • -XX:PermSize=384m -XX:MaxPermSize=384m 
However, before you do that, you may want to find out which Java classes were allocated in PermGen by running:
  • jmap -permstat <pid>
This is supported in JDK5 and later on both Solaris and Linux.
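If you only need a rough count of loaded (and unloaded) classes rather than jmap's per-loader breakdown, you can also query the class-loading MXBean from inside the VM.  This is our own sketch, not part of jmap:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassCount {
    public static void main(String[] args) {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        // The loaded-class count approximates how many klasses currently
        // occupy the permanent generation (metaspace on JDK 8+).
        System.out.println("currently loaded:     " + cl.getLoadedClassCount());
        System.out.println("loaded since start:   " + cl.getTotalLoadedClassCount());
        System.out.println("unloaded since start: " + cl.getUnloadedClassCount());
    }
}
```

A steadily growing "currently loaded" figure under redeployment is a hint that class unloading is not keeping up.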

Enabling Class Unloading


By default, most HotSpot garbage collectors perform class unloading, except the CMS collector[2] (enabled by -XX:+UseConcMarkSweepGC).

If you use CMS collector and run into PermGen's out-of-memory error, you could consider enabling class unloading by setting:
  • -XX:+CMSClassUnloadingEnabled
  • -XX:+CMSPermGenSweepingEnabled

Depending on the release you have, earlier versions (i.e., Java 6 Update 3 or earlier) require you to set both options.  However, in later releases you only need to specify:
  • -XX:+CMSClassUnloadingEnabled

The following are cases in which you want to enable class unloading for the CMS collector:
  • If your application uses multiple class loaders and/or reflection, you may need to enable collecting of garbage in the permanent space.
  • Objects in the permanent space may hold references into the normal old space; thus, even if the permanent space itself is not full, those references may keep some dead objects in the old space from being collected by CMS if class unloading is not enabled.
  • Lots of redeployment may put pressure on the PermGen space
    • A class and its class loader must both be unreachable in order for them to be unloaded. A class X loaded with class loader A and the same class X loaded with class loader B will result in two distinct objects (klasses) in the permanent generation.
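The last point can be demonstrated directly.  The sketch below (class and variable names are ours) loads the same platform class through two isolating loaders; the VM ends up with two distinct klasses that share one name.  (On JDK 8 and later the klasses live in metaspace rather than PermGen, but the duplication is the same.)

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// A class loader that defines its own copy of one chosen class instead of
// delegating to its parent, so every instance produces a distinct klass.
class IsolatingLoader extends ClassLoader {
    private final String target;

    IsolatingLoader(String target) {
        this.target = target;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (!name.equals(target)) {
            return super.loadClass(name, resolve); // delegate everything else
        }
        String path = name.replace('.', '/') + ".class";
        try (InputStream in = ClassLoader.getSystemResourceAsStream(path);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            byte[] bytes = out.toByteArray();
            // defineClass gives this loader its own copy of the class.
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}

public class KlassDemo {
    public static void main(String[] args) throws Exception {
        String name = "javax.xml.XMLConstants"; // any non-java.* class works
        Class<?> a = new IsolatingLoader(name).loadClass(name);
        Class<?> b = new IsolatingLoader(name).loadClass(name);
        System.out.println(a.getName().equals(b.getName())); // true: same name
        System.out.println(a == b);                          // false: distinct klasses
    }
}
```

As long as either loader (or any object it created) stays reachable, its copy of the klass cannot be unloaded.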

References

  1. Understanding GC pauses in JVM, HotSpot's CMS collector.
  2. Understanding Garbage Collection
  3. Presenting the Permanent Generation
  4. Diagnosing Java.lang.OutOfMemoryError
  5. A Case Study of java.lang.OutOfMemoryError: GC overhead limit exceeded
  6. Understanding Garbage Collector Output of Hotspot VM
  7. Java HotSpot VM Options
  8. Eight flavors of java.lang.OutOfMemoryError

Saturday, January 26, 2013

Memlock limit too small: 32768 to accommodate segment size: 4194304

In our database alert_<sid>.log, we have found the following warning message:
  • Memlock limit too small: 32768 to accommodate segment size: 4194304
For Oracle to lock shared memory for the shared pool, memlock limit (i.e., locked-in-memory address space) must be large enough.  For example, our system has the following maximum locked-in-memory address space:

# -l     The maximum size that may be locked into memory
$ ulimit -l    
32

Note that the "memlock" figure is specified in KB.  From the warning, we know our memlock limit needs to be at least 4194304 bytes (i.e., 4096 KB), while it is currently only 32 KB.  In practice, it is better to oversize it a little.
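Mixing bytes and KB is easy to get wrong, so here is the conversion spelled out (values taken from the warning above; the class name is ours):

```java
public class MemlockCheck {
    public static void main(String[] args) {
        long segmentBytes = 4194304L;        // segment size from the warning, in bytes
        long currentKb = 32L;                // what `ulimit -l` reported, in KB
        long neededKb = segmentBytes / 1024; // limits.conf and ulimit both use KB
        System.out.println("needed  >= " + neededKb + " KB");   // 4096 KB
        System.out.println("current  = " + currentKb + " KB");
        System.out.println("too small: " + (currentKb < neededKb)); // true
    }
}
```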


How to increase memlock limit on Linux?


Unix operating systems can enforce limits on the resources a process or user can consume, and memlock is one such resource.  On Linux systems, you can adjust the "memlock" parameter in:

  • /etc/security/limits.conf


For example, this is what we have specified to correct the "memlock-limit-too-small" issue:

#vi /etc/security/limits.conf

@perfgrp    soft    memlock         14680064
@perfgrp    hard    memlock         14680064
*           soft    nofile          65535
*           hard    nofile          65535


In general, individual limits take priority over group limits: if you impose no limits on a group but one of its members has a limits line, that user's limits will be set according to that line. In our case, we set the memlock limit for the "perfgrp" group to 14680064 KB. The nofile entries, which limit the number of open files rather than locked memory, are set to 65535 for all users. Finally, log out and log back in (or reboot) for the new settings to take effect.

Consideration for Java Applications


Similar to database software, applications running on HotSpot have the same tuning requirement. For example:
  • For all users who will run HotSpot with large pages, you need to set their memlock limit to a value higher than the maximum heap size they will use.  This ensures the user running the Java application can lock the required amount of memory.
Of course, the memlock limit to be considered should be based on how much physical memory you have on your systems.  See [1] for details.

References

  1. HugePages Configuration and Monitoring
  2. Large SGA On Linux
  3. How I Simplified Oracle Database Installation on Oracle Linux (Good)
  4. How to Test Large Page Support on Your Linux System

Friday, October 26, 2012

HotSpot VM Performance Tuning Tips

In some cases, it may be obvious from benchmarking that parts of an application need to be rewritten using more efficient algorithms[1]. Sometimes it may just be enough to provide a more optimal runtime environment by tuning the JVM parameters.

In this article, we will show you some of the HotSpot VM performance tuning tips.

What to tune?


You can tune HotSpot performance on multiple fronts:
  • Code generation[8,10]
  • Memory management
In this article, we will focus more on memory management (or garbage collector).  The goals of tuning garbage collector include:
  • To make a garbage collector operate efficiently, by
    • Reducing pause time or
    • Increasing throughput
  • To avoid heap fragmentation
    • Different garbage collectors use different compaction strategies to eliminate fragmentation
  • To make it scalable for multithreaded applications on multiprocessor systems
In this article, we will cover the following tuning options:
  1. Client VM or Server VM
  2. 32-bit VM or 64-bit VM
  3. GC strategy
  4. Heap sizing
  5. Further tuning

Client vs. Server VM


The HotSpot Client JVM has been specially tuned to reduce application startup time and memory footprint, making it particularly well suited for client environments. On all platforms, the HotSpot Client JVM is the default.

The Java HotSpot Server VM is similar to the HotSpot Client JVM except that it has been specially tuned to maximize peak operating speed. It is intended for long-running server applications, for which the fastest possible operating speed is generally more important than having the fastest startup time. To invoke the HotSpot Server JVM instead of the default HotSpot Client JVM, use the -server parameter; for example,
  • java -server MyApp
In [7], the authors mention a third HotSpot VM runtime named tiered.  If you are using Java 6 Update 25, Java 7, or later, you may consider using the tiered server runtime as a replacement for the client runtime.  For more details, read [8,10].

32-Bit or 64-Bit VM


The 32-bit JVM is the default for the HotSpot VM. The choice between a 32-bit and a 64-bit JVM is dictated by the application's required memory footprint, by whether any third-party software used in the application supports 64-bit JVMs, and by whether the application has any native components. All native components using the Java Native Interface (JNI) in a 64-bit JVM must be compiled in 64-bit mode.

Running 64-bit VM has the following advantages[7]:
  • Larger address space
  • Better performance in two fronts
    • 64-bit JVMs can make use of additional CPU registers
    • Help avoid register spilling
      • Register spilling occurs when there is more live state (i.e., variables) in the application than the CPU has registers.
and one disadvantage:
  • Increased width for oops
    • Results in fewer oops being available on a CPU cache line and as a result decreases CPU cache efficiency. 
    • This negative performance impact can be mitigated by setting:
      • -XX:+UseCompressedOops VM command line option
Note that client runtimes are not available in 64-bit HotSpot VMs.   See [11] for more details.

GC Strategy


JVM performance is heavily influenced by the effectiveness of its GC.  Garbage collection (GC) reclaims the heap space previously allocated to objects no longer needed. The process of locating and removing those dead objects can stall your Java application while consuming as much as 25 percent of throughput.

The Java HotSpot virtual machine includes five garbage collectors.[27] All the collectors are generational.
  • Serial Collector
    • Both young and old collections are done serially (using a single CPU), in a stop-the-world fashion.
    • The old and permanent generations are collected via a mark-sweep-compact collection algorithm. 
      • The sweep phase “sweeps” over the generations, identifying garbage. The collector then performs sliding compaction, sliding the live objects towards the beginning of the old generation space (and similarly for the permanent generation), leaving any free space in a single contiguous chunk at the opposite end.
    • When to use
      • For most applications that are run on client-style machines and that do not have a requirement for low pause times
    • How to select
      • In the J2SE 5.0 and above, the serial collector is automatically chosen as the default garbage collector on machines that are not server-class machines. On other machines, the serial collector can be explicitly requested by using the -XX:+UseSerialGC command line option.
  • Parallel Collector (or throughput collector)
    • Young generation collection
      • Uses a parallel version of the young generation collection algorithm utilized by the serial collector
      • It is still a stop-the-world and copying collector, but performing the young generation collection in parallel, using many CPUs, decreases garbage collection overhead and hence increases application throughput.
    • Old generation collection
      • Uses the same serial mark-sweep-compact collection algorithm as the serial collector
    • When to use
      • For applications that run on machines with more than one CPU and do not have pause time constraints, since infrequent, but potentially long, old generation collections will still occur.
      • Examples of applications for which the parallel collector is often appropriate include those that do batch processing, billing, payroll, scientific computing, and so on.
    • How to select
      • In the J2SE 5.0 and above, the parallel collector is automatically chosen as the default garbage collector on server-class machines. On other machines, the parallel collector can be explicitly requested by using the -XX:+UseParallelGC command line option.
  • Parallel Compacting Collector
      • Young generation collection
        • Uses the same algorithm as the parallel collector's young generation collection.
      • Old generation collection
        • The old and permanent generations are collected in a stop-the-world, mostly parallel fashion with sliding compaction
        • The collector utilizes three phases (see [5] for more details):
          • Marking phase
          • Summary phase
          • Compaction phase
      • When to use
        • For applications that are run on machines with more than one CPU and applications that have pause time constraints.
      • How to select
        • If you want the parallel compacting collector to be used, you must select it by specifying the command line option -XX:+UseParallelOldGC.
    • Concurrent Mark-Sweep (CMS) Collector[5,6]
        • Young generation collection
          • The CMS collector collects the young generation in the same manner as the parallel collector. 
        • Old generation collection
          • Most of the collection of the old generation using the CMS collector is done concurrently with the execution of the application. 
          • The CMS collector is the only collector that is non-compacting. That is, after it frees the space that was occupied by dead objects, it does not move the live objects to one end of the old generation. 
            • To minimize the risk of fragmentation, CMS performs statistical analysis of object sizes and maintains separate free lists for objects of different sizes.
        • When to use
          • For applications that need shorter garbage collection pauses and can afford to share processor resources with the garbage collector while running. (Due to its concurrency, the CMS collector takes CPU cycles away from the application during a collection cycle.)
          • Typically, applications that have a relatively large set of long-lived data (a large old generation), and that run on machines with two or more processors, tend to benefit from the use of this collector. 
          • Compared to the parallel collector, the CMS collector decreases old generation pauses—sometimes dramatically—at the expense of slightly longer young generation pauses, some reduction in throughput, and extra heap size requirements.
        • How to select
          • If you want the CMS collector to be used, you must explicitly select it by specifying the command line option -XX:+UseConcMarkSweepGC.  
          • If you want it to run in incremental mode, also enable that mode via the -XX:+CMSIncrementalMode option.
            • This feature is useful when applications that need the low pause times provided by the concurrent collector are run on machines with small numbers of processors (e.g., 1 or 2).
          • ParNewGC is the parallel young generation collector for use with CMS.  To select it, specify the command line option -XX:+UseParNewGC.
      • Garbage First (G1) Garbage Collector[31]
        • Summary
          • Differs from CMS in the following ways
            • Compacting
              • Reduce fragmentation and is good for long-running applications 
            • Heap is split into regions
              • Easy to allocate and resize
          • Evacuation pauses 
            • For both young and old regions
        • Young generation collection
          • During a young GC, survivors from the young regions are evacuated to either survivor regions or old regions
            • Done with stop-the-world evacuation pauses
            • Performed in parallel
        • Old generation collection
          • Some garbage objects in regions with very high live ratio may be left in the heap and be collected later
          • Concurrent marking phase
            • Calculates liveness information per region
              • Empty regions can be reclaimed immediately
              • Identifies best regions for subsequent evacuation pauses
            • Remark is done with one stop-the-world pause, while the initial mark is piggybacked on an evacuation pause
            • No corresponding sweeping phase
            • Different marking algorithm than CMS
          • Old regions are reclaimed by
            • Evacuation pauses
              • Using compaction
              • Where most reclamation is done
            • Remark (when totally empty)
        • When to use
          • The G1 collector is a server-style garbage collector, targeted for multi-processor machines with large memories. It meets garbage collection (GC) pause time goals with high probability, while achieving high throughput. 
        • How to select
          • If you want the G1 garbage collector to be used, you must explicitly select it by specifying the command line option -XX:+UseG1GC.  
Note that the difference between:
  • -XX:+UseParallelOldGC
and
  • -XX:+UseParallelGC
is that -XX:+UseParallelOldGC enables both a multithreaded young generation garbage collector and a multithreaded old generation garbage collector; that is, both minor garbage collections and full garbage collections are multithreaded. -XX:+UseParallelGC enables only a multithreaded young generation garbage collector; the old generation garbage collector used with -XX:+UseParallelGC is single threaded.

Using -XX:+UseParallelOldGC also automatically enables -XX:+UseParallelGC. Hence, if you want to use both a multithreaded young generation garbage collector and a multithreaded old generation garbage collector, you need only specify -XX:+UseParallelOldGC.

Note that the above distinction between -XX:+UseParallelOldGC and -XX:+UseParallelGC is no longer true in JDK 7.  In JDK 7, the following three settings are equivalent:

  • Default
  • -XX:+UseParallelGC
  • -XX:+UseParallelOldGC
They all use multithreaded collectors for both the young generation and the old generation.
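You can confirm which collectors a given set of flags actually selects by querying the GC MXBeans from inside the running VM.  This is our own sketch, not from the referenced guides:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcNames {
    public static void main(String[] args) {
        // Prints the names of the collectors this VM selected, e.g.
        // "PS Scavenge" / "PS MarkSweep" under -XX:+UseParallelOldGC, or
        // "G1 Young Generation" / "G1 Old Generation" under -XX:+UseG1GC.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName());
        }
    }
}
```

Run it as, for example, java -XX:+UseParallelOldGC GcNames and compare the output against the defaults.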

Heap Sizing


If the heap is small, collections will be fast but the heap will fill up more quickly, thus requiring more frequent collections. Conversely, a large heap takes longer to fill up, so collections are less frequent, but they may take longer.

Command line parameters that divide the heap between the new and old generations usually have the greatest performance impact.  Increasing the new generation's size often improves overall throughput; however, it also increases footprint, which may slow down servers with limited memory.

For more details, you can read [12-15].
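To see how a particular -Xms/-Xmx combination plays out at runtime, you can query the heap from inside the VM (a minimal sketch of ours):

```java
public class HeapReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // maxMemory() reflects -Xmx; totalMemory() is the currently committed heap.
        System.out.println("max   = " + rt.maxMemory() / mb + " MB");
        System.out.println("total = " + rt.totalMemory() / mb + " MB");
        System.out.println("free  = " + rt.freeMemory() / mb + " MB");
    }
}
```

For example, java -Xms2g -Xmx2g HeapReport should report a max close to 2048 MB.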

Further Tuning


HotSpot's default parameters are effective for most small applications that require fast startup and a small footprint.  However, more often than not, you will find that the default settings are not good enough and that your Java applications need further tuning.  As shown in [16], there are many VM options exposed that can be tuned by brave souls.  I'm not going to discuss such tunings in this article, but I'll keep posting articles with VM tunings on this blog.  Stay tuned!

References

  1. Java Performance Tips
  2. Pick up performance with generational garbage collection
  3. Java HotSpot™ Virtual Machine Performance Enhancements
  4. Oracle JRockit
  5. Memory Management in the Java HotSpot™ Virtual Machine
  6. Understanding GC pauses in JVM, HotSpot's CMS collector
  7. Java Performance by Charlie Hunt and Binu John
  8. Performance Tuning with Hotspot VM Option: -XX:+TieredCompilation
  9. Java Tuning White Paper
  10. A Case Study of Using Tiered Compilation in HotSpot
  11. HotSpot VM Binaries: 32-Bit vs. 64-Bit
  12. HotSpot Performance Option — SurvivorRatio
  13. A Case Study of java.lang.OutOfMemoryError: GC overhead limit exceeded
  14. Understanding Garbage Collection
  15. Diagnosing Java.lang.OutOfMemoryError
  16. What Are the Default HotSpot JVM Values?
  17. Understanding Garbage Collector Output of Hotspot VM
  18. On Stack Replacement in HotSpot JVM
  19. Professional Oracle WebLogic Server by Robert Patrick, Gregory Nyberg, and Philip Aston
  20. Sun Performance and Tuning: Java and the Internet by Adrian Cockroft and Richard Pettit
  21. Concurrent Programming in Java: Design Principles and Patterns by Doug Lea
  22. Capacity Planning for Web Performance: Metrics, Models, and Methods by Daniel A. Menascé and Virgilio A.F. Almeida
  23. Java Performance Tuning (Michael Finocchiaro)
  24. Diagnosing Heap Stress in HotSpot
  25. Introduction to HotSpot JVM Performance and Tuning
  26. Tuning the JVM (video)
    • Frequency of minor GC dictated by:
      • Application object allocation rate
      • Size of Eden
    • Frequency of object promotion dictated by:
      • Frequency of minor GCs (tenuring)
      • Size of survivor spaces
    • Full GC frequency dictated by:
      • Promotion rate
      • Size of old generation
  27. JEP 173: Retire Some Rarely-Used GC Combinations
  28. G1 GC Glossary of Terms
  29. Learn More About Performance Improvements in JDK 8
  30. Java SE HotSpot at a Glance
  31. Garbage-First Garbage Collector (JDK 8 HotSpot Virtual Machine Garbage Collection Tuning Guide)
  32. Tuning that was great in old JRockit versions might not be so good anymore
    • Trying to bring over each and every tuning option from a JR configuration to an HS one is probably a bad idea.
    • Even when moving between major versions of the same JVM, we usually recommend going back to the default (just pick a collector and heap size) and then redoing any tuning work from scratch (if even necessary).
