Monday, December 30, 2013

JRockit: What's the Total Memory Footprint of a Java Process?

In [1], we have shown a case of performance tuning by sizing JRockit's Thread Local Area (TLA). For both test runs—TLA default and TLA tuned, they have been given the same heap size (i.e, 2g). However, in the conclusion, I have said:
Better performance is achieved by reducing pause time % in GC andTotal CPU % at the expense of total memory footprint (-1.6%).
In this article, we will clarify what the following phrase:
at the expense of total memory footprint (-1.6%)
means.

Java Heap vs Native Memory


As described in [2], total memory footprint of a Java process includes not only Java Heap, but also Native Memory. For example, you can use print_memusage to analyze the memory (including native memory) allocated by a Java process:

> ./jrcmd 26413 print_memusage >JR_print_memusage.txt
>cat JR_print_memusage.txt


26413:
Total mapped                  4897980KB           (reserved=1384484KB)
-              Java heap      2097152KB           (reserved=0KB)
-              GC tables        70156KB
-          Thread stacks        48324KB           (#threads=141)
-          Compiled code      1048576KB           (used=41160KB)
-               Internal         1480KB
-                     OS       401856KB
-                  Other       760932KB
-            Classblocks        27136KB           (malloced=26821KB #58548)
-        Java class data       441344KB           (malloced=439083KB #271063 in 58548 classes)
- Native memory tracking         1024KB           (malloced=242KB #10)

Native Memory


Internal JVM memory management is, to a large extent, kept off the Java heap and allocated natively in the operating system, through system calls like malloc. This non-heap system memory allocated by the JVM is referred to as native memory.

For JRockit, you can constraint the amount of memory allocated to the Java heap, but not the native memory, in a Java process.[3] As shown above, process 26413 has been allocated 4897980KB (i.e., VSZ) in the virtual address space. However, this VSZ value is not very useful. What counts is the one reported by Resident Set Size (RSS; or physically resident memory). To find out the RSS of a Java process, you can do:

>ps -o pid,uid,state,rss,vsz,minflt,majflt,args -p 26413 >ps_26413.tmp
>cat ps_26413.tmp

PID   UID S   RSS    VSZ MINFLT MAJFLT COMMAND
26413 60000 S 3295216 4889792 15194852 1 /scratch/user1/JVMs/jdk-jr/bin/java ...

For the process 26413, its total memory footprint (or RSS) is 3295216 KB and this is the KPI we have quoted in our benchmark comparison.

References

  1. JRockit: A Case Study of Thread Local Area (TLA) Tuning
  2. Why is my JVM process larger than max heap size?
  3. How to Debug Native OutOfMemory in JRockit (Xml and More)

Sunday, December 29, 2013

JRockit: Analyzing GC With JRockit Verbose Output (-Xverbose:memdbg)

Before any JRockit performance tuning, you need to assess the sizes of the objects allocated by your application.[4] One way to access live data size is to view object allocation statistics from the GC log file.

In this article, we will show you how to access object allocation statistics by enabling the JRockit verbose output:
  • -Xverbose:memdbg

Verbose memdbg Log Module[1]


The -Xverbose:memdbg log module is an alias that starts several memory-related log modules to provide a complete picture of the memory management. The memdbg log module sets the memory module to the debug level. This information can also be obtained by setting:
  • -Xverbose:memory=debug

The overhead of the verbose memdbg output is low, which means it can be used in production environments. For example, after enabling -Xverbose:memdb, the overhead has caused the Average Response Time (ART) to deteriorate by 2.25% using our benchmark.

Understanding Verbose Output


After setting the memory module to the debug level, JRockit provides us tons of information. To understand its formats, read [1]. Here we will focus only on OC's (old collections). For example, here is a typical output from the memory log module at the debug level:

[DEBUG][memory ][06310] [OC#41] GC reason: Allocation request failed (1048592 bytes).
[DEBUG][memory ][06310] [OC#41] 2607.371: OC started.
[INFO ][alloc  ][06310] [OC#41] Pending requests at 'Before OC' - Total: 1, TLAs: 0(approx 0 bytes), objects: 1 (1048592 bytes). Max age: 0.
[INFO ][compact][06310] [OC#41] Compaction reason: Normal.
[INFO ][compact][06310] [OC#41] Compacting 128 of 4096 parts at index 384. Compaction type is internal. Exceptional: No.
[INFO ][compact][06310] [OC#41] Compaction area start: 0x8c000000, end: 0x90000000.
Timeout: 411.471 ms.
[INFO ][compact][06310] [OC#41] Compactset limit (per thread): 316839 (dynamic), not using matrixes.
[DEBUG][memory ][06310] [OC#41] Starting parallel marking phase.
[DEBUG][memory ][06310] [OC#41] SemiRef phase WeakJNIHandles run in single threadedmode.
[DEBUG][memory ][06310] [OC#41] SemiRef phase ClassConstraints run in single threaded mode.
[DEBUG][memory ][06310] [OC#41] SemiRef phase FinalMemleak run in single threaded mode.
[DEBUG][memory ][06310] [OC#41] Adding 48 temporary work packets to permanent pool,now 811 packets.
[DEBUG][memory ] [OC#41] Total mark time: 307.204 ms.
[DEBUG][memory ] [OC#41] Ending marking phase.
[DEBUG][memory ] [OC#41] Starting parallel sweeping phase.
[INFO ][nursery] [OC#41] Setting keepAreaMarkers[0] to 0xc9872bc0.
[INFO ][nursery] [OC#41] Setting keepAreaMarkers[1] to 0xcc97f7a0.
[INFO ][nursery] [OC#41] Setting keepAreaMarkers[2] to 0xd20f6088.
[INFO ][nursery] [OC#41] Next keeparea will start at 0xcc97f7a0 and end at 0xd20f6088.
[INFO ][nursery] [OC#41] Nursery size increased from 0KB to 120414KB. Nursery list consists of 2760 parts.
[INFO ][nursery] [OC#41] Average part size: 44KB. Contraharmonic mean (CHM):946KB. CHM per part: 0KB. Normalized CHM: 0.007856.
[DEBUG][memory ][ [OC#41] Total sweep time: 159.954 ms.
[DEBUG][memory ] [OC#41] Ending sweeping phase.
[INFO ][nursery] [OC#41] Nursery size remains at 120414KB. Nursery list consists of 2760 parts.
[INFO ][nursery] [OC#41] Average part size: 44KB. Contraharmonic mean (CHM):946KB. CHM per part: 0KB. Normalized CHM: 0.007856.
[INFO ][alloc  ] [OC#41] Satisfied 0 object and 0 tla allocations. Pending requests went from 1 to 1.
[INFO ][compact] [OC#41] Average compact time ratio (move phase/total time):0.655049.
[INFO ][compact] [OC#41] Compaction time, total: 158.705 ms (target 411.471 ms).
[INFO ][compact] [OC#41] Compaction moved 1295486 objects and left 349 objects. Total moved size 64028976B.
[INFO ][compact] [OC#41] Compaction added 3047752B of free memory in 2 parts.
[INFO ][compact] [OC#41] Compaction time, move phase: 58.023 ms (target 205.735 ms).
[INFO ][compact] [OC#41] Compaction time, update phase: 100.669 ms (target 205.735 ms).
[INFO ][compact] [OC#41] Found 3350566 references. Internal: 1807732  External: 1542834.
[INFO ][compact] [OC#41] Updated 3339159 references. Internal: 1807048  External: 1532111.
[INFO ][alloc  ] [OC#41] Pending requests at 'After OC' - Total: 1, TLAs: 0 (approx 0 bytes), objects: 1 (1048592 bytes). Max age: 1.
[DEBUG][memory ] [OC#41] Page faults before OC: 1, page faults after OC: 1, pages in heap: 524288.
[DEBUG][memory ] [OC#41] Nursery size after OC: 120414KB. (Free: 120371KB Parts: 2760)
[INFO ][memory ] [OC#41] 2607.371-2607.928: OC 1890153KB->1756549KB (2097152KB), 0.558 s, sum of pauses 472.469 ms, longest pause 472.469 ms.

Note that we have used the following JRockit options:
  • -Xverbose:memdbg -Xverbose:gc -Xverbosedecorations=level,module,timestamp,millis,pid

GC Reason


Old collections can be triggered by different events. For example, in the above example, it shows the reason for OC#41 is:
  • Allocation request failed
There are other events can trigger an OC. For example, it can be:
  • GC reason: Artificial, description: Print Heap Summary
  • GC reason: Artificial, description: Print Heap Diagnostics

From our GC log file, we have found the following OC statistics (total: 235)
  • GC reason: Allocation request failed (total: 219)
  • GC reason: Artificial, description: Print Heap Summary (total: 8)
  • GC reason: Artificial, description: Print Heap Diagnostics (total: 8)
Among all events, you can count those artificial heap printing events as the overhead of enabling:
  • -Xverbose:memdbg

References

  1. Understanding Verbose Output
  2. JRockit: A Case Study of Thread Local Area (TLA) Tuning (Xml and More)
  3. JRockit: Thread Local Area Size and Large Objects (Xml and More)
  4. Analyzing the Performance Impact of Memory Utilization and Garbage Collection

Saturday, December 28, 2013

JRockit: A Case Study of Thread Local Area (TLA) Tuning

Increasing the Thread Local Area (TLA) size is beneficial for multi-threaded applications where each thread allocates a lot of objects. Increasing the TLA size is also beneficial when the average size of the allocated objects is large, as this allows larger objects to be allocated in the TLAs.

In this article, we will present a case of TLA tuning using an application with the following characteristics:
  • A multi-threaded application
  • Allocating a lot of objects
  • Allocating large objects 

TLA Basics

Without much ado, see [1] for the differences between JRockit's large-object implementations between pre- and post-R28.   The version of JRockit used in this article is R28.2.5.  For this version, TLA sizes can be tuned using the following option:
  • -XXtlaSize:min=size,preferred=size,wasteLimit=size

Chaning Default TLA Sizes


For our Linux system, the default TLA settings are:
  • min=2k
  • preferred=16k
  • wasteLimit=2k
To tune for better performance, read [2,4] for advice.  For example, one recommendation offered by Oracle is:
  • Setting  -XXtlaSize:wasteLimit to the same value as -XXtlaSize:min.
After trials and errors, we have found the following TLA settings to be better than the defaults:
  • -XXtlaSize:min=8k,preferred=512k,wasteLimit=8k

Performance Comparison: Default vs. TLA Tuned


Looking at different KPIs, we have found the application's performance improved by 5.9% in Average Response Time (ART) and 3.32% in 90% Response Time.  Better performance is achieved by reducing pause time % in GC and Total CPU % at the expense of total memory footprint (-1.6%).

Conclusion


As the benchmark results show, you can tune your application's performance by tuning TLA sizes. The performance improvement may vary based on your JRockit versions and system capabilities. But, before you do any fine tuning, read [2, 4] first.

Finally, be warned that options that are specified with -XX are not stable and are not recommended for casual use. These options are subject to change without notice.[3]

References

Friday, December 27, 2013

JRockit: Thread Local Area Size and Large Objects

Large objects can be a problem for an application. If your Java application allocates a lot of objects, especially large objects, it helps to play around with various Thread Local Area (TLA) settings.[1] For a case study, read [11].

If you try to find the default settings of TLA by using:
  • -XX (an alias for -Xprintflags)
it prints out the following information in JRockit R28.2.5:
        TlaWasteLimit = 0 (default)
                (Alias: -XXlargeObjectLimit)
                - Internal. Use -XXtlaSize:wasteLimit instead.
        TlaMinSize = 0 (default)
                (Alias: -XXminBlockSize)
                - Internal. Use -XXtlaSize:min instead.
        TlaPreferredSize = 0 (default)
                - Internal. Use -XXtlaSize:preferred instead.

Hmm. That is not very helpful. In this article, we will discuss:
  • Thread local area size and large objects
  • How to find out the default TLA settings?

Thread Local Area Size and Large Objects[1-4]


The thread local area (TLA) is a chunk of free space reserved on the heap or the nursery and given to a thread for its exclusive use. A thread can allocate small objects in its own TLA without synchronizing with other threads. When the TLA gets full the thread simply requests a new TLA. The objects allocated in a TLA are accessible to all Java threads and are not considered “thread local” in any way after they have been allocated.

Increasing the TLA size is beneficial for multi-threaded applications where each thread allocates a lot of objects. Increasing the TLA size is also beneficial when the average size of the allocated objects is large, as this allows larger objects to be allocated in the TLAs. Increasing the TLA size too much may however cause more fragmentation and more frequent garbage collections. Before any JRockit performance tuning, you need to assess the sizes of the objects allocated by your application. One way to access live data size is to view object allocation statistics. There are multiple ways of achieving that:
  • You can create a JRockit Flight Recorder recording and view object allocation statistics in the JRockit Flight Recorder[7]
  • You can use the following JVM option:
    • -Xverbose:memdbg Xverbose:gc -Xverbosedecorations=level,module,timestamp,millis,pid

Large Object: Pre-R28 vs. Post-R28[1]


In JRockit versions earlier than R28, large objects were allocated immediately on the heap and never in a TLA. A flag called –XXlargeObjectLimit was provided to tell JRockit the minimum number of bytes an object should be of in order to be treated as "large". The default was 2 KB.

JRockit post R28 uses a waste limit for TLA space instead. This constrains the amount of TLA space that can be thrown away for each TLA when large objects are allocated and is a more flexible solution.

The R28 allocation algorithm now works like this—JRockit tries to allocate every object regardless of its size in the current TLA. If it doesn't fit and the waste limit is less than the space left in the TLA, the object
goes directly on the heap. Otherwise, JRockit will "waste" the rest of this TLA and try to allocate the object in a new TLA or directly on the heap, depending on the size of the object.

The TLA sizes are set using the following option:
  • -XXtlaSize:min=size,preferred=size,wasteLimit=size[8]
    • min
      • Sets the minimum TLA size. 
    • preferred
      • Sets a preferred TLA size.  This means that TLAs will be the preferred size whenever possible, but can be as small as the min size.
    • wasteLimit
      • Sets the waste limit for TLAs. This is the maximum amount of free memory that a TLA is allowed to have when a thread requires a new TLA.
The following relation is true for the TLA size parameters:
  • -XXtlaSize:wasteLimit <= -XXtlaSize:min <= -XXtlaSize:preferred
Read [9] for the tuning advice.

What are the default TLA settings?


To find out the default TLA settings, you can use the following JVM option:[12]
  • -Xverbose:memdbg
For example, here is the output:

[DEBUG][memory ][19709] Minimum TLA size is 2048 bytes.
[DEBUG][memory ][19709] Preferred TLA size is 16384 bytes.
[DEBUG][memory ][1388121278654][19709] TLA waste limit is 2048 bytes.


from the following JVM options:
  • -Xms2g -Xmx2g -Xverbose:memdbg -Xgc:pausetime -Xverbose:gc -Xverbosedecorations=level,module,timestamp,millis,pid

References

  1. Oracle JRockit- The Definitive Guide
  2. Tuning Java Virtual Machines (JVMs)
  3. On Nursery Sizing (Migrated)
    • The goal of nursery sizing to get as much memory as possible freed by young collections rather than old collections.
    • The rule of thumb is that a nursery size of approximately half of the free memory on the heap is nearly optimal.
    • If your application is sensitive to latencies you may want to decrease the nursery size to shorten the young collection pause times.
    • If you're using the garbage collection mode optimizing for short pauses (-Xgcprio:pausetime) or the static generational concurrent garbage collector (-Xgc:gencon) you will most likely want to tune the nursery size manually.
  4. First Steps for Tuning the Oracle JRockit JVM
  5. Oracle® JRockit Diagnostics and Troubleshooting Guide (Release R28)
  6. Oracle® JRockit Performance Tuning Guide (Release R28)
  7. Oracle® JRockit Flight Recorder Run Time Guide (Release R28)
  8. -XXtlaSize Parameters
  9. Optimizing Memory Allocation Performance (Section 4.4 of this pdf)
  10. Oracle® JRockit Command-Line Reference (Release R28)
  11. JRockit: A Case Study of Thread Local Area (TLA) Tuning (Xml and More)
  12. JRockit: Analyzing GC With JRockit Verbose Output (-Xverbose:memdbg)  (Xml and More)
  13. Where did all of these ConstPoolWrapper objects come from?!