Friday, October 26, 2012

HotSpot VM Performance Tuning Tips

In some cases, benchmarking makes it obvious that parts of an application need to be rewritten using more efficient algorithms[1]. In other cases, it may be enough to provide a better runtime environment by tuning the JVM parameters.

In this article, we will show you some HotSpot VM performance tuning tips.

What to tune?


You can tune HotSpot performance on multiple fronts:
  • Code generation[8,10]
  • Memory management
In this article, we will focus on memory management (that is, garbage collection).  The goals of tuning the garbage collector include:
  • To make the garbage collector operate efficiently, by
    • Reducing pause time or
    • Increasing throughput
  • To avoid heap fragmentation
    • Different garbage collectors use different compaction strategies to eliminate fragmentation
  • To make it scalable for multithreaded applications on multiprocessor systems
We will cover the following tuning options:
  1. Client VM or Server VM
  2. 32-bit VM or 64-bit VM
  3. GC strategy
  4. Heap sizing
  5. Further tuning

Client vs. Server VM


The HotSpot Client JVM has been specially tuned to reduce application startup time and memory footprint, making it particularly well suited for client environments. It is the default on 32-bit systems that HotSpot does not classify as server-class machines.

The Java HotSpot Server VM is similar to the HotSpot Client JVM except that it has been specially tuned to maximize peak operating speed. It is intended for long-running server applications, for which the fastest possible operating speed is generally more important than the fastest startup time. To invoke the HotSpot Server JVM where the Client JVM would otherwise be selected, use the -server parameter; for example:
  • java -server MyApp
In [7], the authors mention a third HotSpot VM runtime named tiered.   If you are using Java 6 Update 25, Java 7, or later, you may consider using the tiered server runtime as a replacement for the client runtime.   For more details, read [8,10].
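To confirm which runtime a given command line actually selected, a program can print the VM's own description. The following is a minimal sketch; the class name VmFlavor is made up for illustration, and the exact strings returned depend on the JDK vendor and version (on HotSpot the VM name typically contains "Client" or "Server").

```java
// Sketch: report which VM runtime is executing this program.
// VmFlavor is a hypothetical class name used only for this example.
class VmFlavor {
    /** Builds a one-line description from system properties. */
    static String describe() {
        // Standard property; on HotSpot it usually names the Client or Server VM.
        String name = System.getProperty("java.vm.name");
        // HotSpot-provided detail such as the execution mode; may be null on other VMs.
        String info = System.getProperty("java.vm.info");
        String version = System.getProperty("java.version");
        return name + " (" + info + "), Java " + version;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as "java VmFlavor" and once as "java -server VmFlavor" shows whether the -server parameter changes the selected runtime on your machine.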

32-Bit or 64-Bit VM


The 32-bit JVM is the default for the HotSpot VM. The choice between a 32-bit and a 64-bit JVM is dictated by the memory footprint the application requires, by whether third-party software used in the application supports 64-bit JVMs, and by whether the application has any native components. All native components using the Java Native Interface (JNI) in a 64-bit JVM must be compiled in 64-bit mode.

Running a 64-bit VM has the following advantages[7]:
  • Larger address space
  • Better performance on two fronts
    • 64-bit JVMs can make use of additional CPU registers
    • The additional registers help avoid register spilling
      • Register spilling occurs when there is more live state (i.e., variables) in the application than the CPU has registers.
and one disadvantage:
  • Increased width for oops (ordinary object pointers)
    • Results in fewer oops fitting on a CPU cache line, which decreases CPU cache efficiency.
    • This negative performance impact can be mitigated by setting the -XX:+UseCompressedOops VM command line option
Note that client runtimes are not available in 64-bit HotSpot VMs.   See [11] for more details.
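A running program can check which data model it is using and whether compressed oops are enabled. The sketch below assumes a HotSpot-based JDK 7 or later: the sun.arch.data.model property and the com.sun.management.HotSpotDiagnosticMXBean interface are HotSpot-specific, and the class name DataModelCheck is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: inspect the VM data model and one VM flag at run time.
// Assumes a HotSpot-based JDK 7+; DataModelCheck is a made-up name.
class DataModelCheck {
    /** Returns "32" or "64" on HotSpot, or "unknown" if the property is absent. */
    static String dataModel() {
        return System.getProperty("sun.arch.data.model", "unknown");
    }

    /** Returns the flag's current value, or "n/a" if this VM does not expose it. */
    static String vmOption(String flag) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(flag).getValue();
        } catch (RuntimeException e) {
            // For example, UseCompressedOops does not exist in a 32-bit VM.
            return "n/a";
        }
    }

    public static void main(String[] args) {
        System.out.println("Data model:        " + dataModel() + "-bit");
        System.out.println("UseCompressedOops: " + vmOption("UseCompressedOops"));
    }
}
```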

GC Strategy


JVM performance often depends heavily on how effective its garbage collector is.  Garbage collection (GC) reclaims the heap space previously allocated to objects that are no longer needed. The process of locating and removing those dead objects can stall your Java application while consuming as much as 25 percent of throughput.
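One rough way to see how much of a run is spent in garbage collection is to compare the accumulated collection time reported by the standard java.lang.management API with the JVM's uptime. The sketch below is an approximation under stated assumptions: the class name GcOverhead is hypothetical, and for concurrent collectors the reported time may include work done while the application was running, so the figure is not a pure pause-time measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: approximate share of JVM uptime spent in garbage collection.
// GcOverhead is a hypothetical class name used only for this example.
class GcOverhead {
    /** Sums the accumulated collection time (ms) over all registered collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** GC time as a percentage of uptime; returns 0 when uptime is not positive. */
    static double overheadPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.println("GC time: " + gc + " ms of " + uptime + " ms uptime ("
            + overheadPercent(gc, uptime) + "%)");
    }
}
```

Calling this logic periodically from inside a long-running application gives a more meaningful figure than a single reading taken right after startup.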

The Java HotSpot virtual machine includes five garbage collectors.[27] All the collectors are generational.
  • Serial Collector
    • Both young and old collections are done serially (using a single CPU), in a stop-the-world fashion.
    • The old and permanent generations are collected via a mark-sweep-compact collection algorithm. 
      • The sweep phase “sweeps” over the generations, identifying garbage. The collector then performs sliding compaction, sliding the live objects towards the beginning of the old generation space (and similarly for the permanent generation), leaving any free space in a single contiguous chunk at the opposite end.
    • When to use
      • For most applications that are run on client-style machines and that do not have a requirement for low pause times
    • How to select
      • In J2SE 5.0 and later, the serial collector is automatically chosen as the default garbage collector on machines that are not server-class machines. On other machines, the serial collector can be explicitly requested by using the -XX:+UseSerialGC command line option.
  • Parallel Collector (or throughput collector)
    • Young generation collection
      • Uses a parallel version of the young generation collection algorithm utilized by the serial collector
      • It is still a stop-the-world and copying collector, but performing the young generation collection in parallel, using many CPUs, decreases garbage collection overhead and hence increases application throughput.
    • Old generation collection
      • Uses the same serial mark-sweep-compact collection algorithm as the serial collector
    • When to use
      • For applications that run on machines with more than one CPU and do not have pause time constraints, since infrequent but potentially long old generation collections will still occur.
      • Examples of applications for which the parallel collector is often appropriate include those that do batch processing, billing, payroll, scientific computing, and so on.
    • How to select
      • In J2SE 5.0 and later, the parallel collector is automatically chosen as the default garbage collector on server-class machines. On other machines, the parallel collector can be explicitly requested by using the -XX:+UseParallelGC command line option.
  • Parallel Compacting Collector
    • Young generation collection
      • Uses the same algorithm as young generation collection with the parallel collector.
    • Old generation collection
      • The old and permanent generations are collected in a stop-the-world, mostly parallel fashion with sliding compaction
      • The collector utilizes three phases (see [5] for more details):
        • Marking phase
        • Summary phase
        • Compaction phase
    • When to use
      • For applications that run on machines with more than one CPU and that have pause time constraints.
    • How to select
      • If you want the parallel compacting collector to be used, you must select it by specifying the command line option -XX:+UseParallelOldGC.
  • Concurrent Mark-Sweep (CMS) Collector[5,6]
    • Young generation collection
      • The CMS collector collects the young generation in the same manner as the parallel collector.
    • Old generation collection
      • Most of the collection of the old generation using the CMS collector is done concurrently with the execution of the application.
      • The CMS collector is the only collector that is non-compacting. That is, after it frees the space that was occupied by dead objects, it does not move the live objects to one end of the old generation.
        • To minimize the risk of fragmentation, CMS performs statistical analysis of object sizes and keeps separate free lists for objects of different sizes.
    • When to use
      • For applications that need shorter garbage collection pauses and can afford to share processor resources with the garbage collector while the application is running. (Due to its concurrency, the CMS collector takes CPU cycles away from the application during a collection cycle.)
      • Typically, applications that have a relatively large set of long-lived data (a large old generation), and that run on machines with two or more processors, tend to benefit from the use of this collector.
      • Compared to the parallel collector, the CMS collector decreases old generation pauses, sometimes dramatically, at the expense of slightly longer young generation pauses, some reduction in throughput, and extra heap size requirements.
    • How to select
      • If you want the CMS collector to be used, you must explicitly select it by specifying the command line option -XX:+UseConcMarkSweepGC.
      • If you want it to run in incremental mode, also enable that mode via the -XX:+CMSIncrementalMode option.
        • This feature is useful when applications that need the low pause times provided by the concurrent collector are run on machines with small numbers of processors (e.g., 1 or 2).
      • ParNewGC is the parallel young generation collector for use with CMS.  To choose it, you can specify the command line option -XX:+UseParNewGC.
  • Garbage-First (G1) Garbage Collector[31]
    • Summary
      • Differs from CMS in the following ways
        • Compacting
          • Reduces fragmentation, which is good for long-running applications
        • Heap is split into regions
          • Easy to allocate and resize
      • Evacuation pauses
        • For both young and old regions
    • Young generation collection
      • During a young GC, survivors from the young regions are evacuated to either survivor regions or old regions
        • Done with stop-the-world evacuation pauses
        • Performed in parallel
    • Old generation collection
      • Some garbage objects in regions with a very high live ratio may be left in the heap and be collected later
      • Concurrent marking phase
        • Calculates liveness information per region
          • Empty regions can be reclaimed immediately
          • Identifies the best regions for subsequent evacuation pauses
        • Remark is done with one stop-the-world pause, while initial mark is piggybacked on an evacuation pause
        • No corresponding sweeping phase
        • Uses a different marking algorithm than CMS
      • Old regions are reclaimed by
        • Evacuation pauses
          • Using compaction
          • Where most reclamation is done
        • Remark (when totally empty)
    • When to use
      • The G1 collector is a server-style garbage collector, targeted for multi-processor machines with large memories. It meets garbage collection (GC) pause time goals with high probability, while achieving high throughput.
    • How to select
      • If you want the G1 garbage collector to be used, you must explicitly select it by specifying the command line option -XX:+UseG1GC.
Note that the difference between:
  • -XX:+UseParallelOldGC
and
  • -XX:+UseParallelGC
is that -XX:+UseParallelOldGC enables both a multithreaded young generation garbage collector and a multithreaded old generation garbage collector; that is, both minor garbage collections and full garbage collections are multithreaded. -XX:+UseParallelGC enables only a multithreaded young generation garbage collector. The old generation garbage collector used with -XX:+UseParallelGC is single-threaded.

Using -XX:+UseParallelOldGC also automatically enables -XX:+UseParallelGC. Hence, if you want to use both a multithreaded young generation garbage collector and a multithreaded old generation garbage collector, you need only specify -XX:+UseParallelOldGC.

Note that the above distinction between -XX:+UseParallelOldGC and -XX:+UseParallelGC no longer holds in JDK 7.  In JDK 7, the following three settings are equivalent:

  • Default
  • -XX:+UseParallelGC
  • -XX:+UseParallelOldGC
They all use multithreaded collectors for both the young generation and the old generation.
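Because the defaults differ between JDK versions and machine classes, it can help to ask the running JVM which collectors it actually registered. The following is a minimal sketch using the standard java.lang.management API; the class name ActiveCollectors is hypothetical. The collector names are chosen by the JVM and vary by version: on HotSpot they are typically recognizable (for example, names containing "PS", "ParNew", "ConcurrentMarkSweep", or "G1"), but that mapping is an observation, not a documented contract.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: list the garbage collectors registered in the running JVM.
// ActiveCollectors is a hypothetical class name used only for this example.
class ActiveCollectors {
    /** Returns the JVM-assigned names of all registered collectors. */
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                + ": collections=" + gc.getCollectionCount()
                + ", time=" + gc.getCollectionTime() + " ms"
                + ", pools=" + Arrays.toString(gc.getMemoryPoolNames()));
        }
    }
}
```

Running the class with different selection options, such as -XX:+UseSerialGC or -XX:+UseG1GC, shows which young and old generation collectors each option activates on your JDK.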

Heap Sizing


If the heap is small, each collection will be fast, but the heap will fill up more quickly and thus require more frequent collections. Conversely, a large heap will take longer to fill up, so collections will be less frequent, but each one may take longer.
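The trade-off can be put into rough numbers. As the notes under reference [26] point out, the frequency of minor collections is dictated by the application's allocation rate and the size of Eden, so a first-order estimate of the time between minor collections is Eden size divided by allocation rate. The sketch below is only that back-of-the-envelope model: the class name and the sample figures are invented for illustration, and it ignores effects such as survivor space sizing and allocation bursts.

```java
// Sketch: first-order estimate of the interval between minor collections.
// MinorGcInterval and the sample numbers are hypothetical.
class MinorGcInterval {
    /** Estimated seconds between minor GCs = Eden size / allocation rate. */
    static double secondsBetweenMinorGcs(long edenBytes, long allocBytesPerSecond) {
        if (allocBytesPerSecond <= 0) {
            throw new IllegalArgumentException("allocation rate must be positive");
        }
        return (double) edenBytes / allocBytesPerSecond;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long allocationRate = 64 * mb; // assumed: 64 MB allocated per second

        // A 256 MB Eden fills in about 4 seconds at this rate.
        System.out.println(secondsBetweenMinorGcs(256 * mb, allocationRate));  // 4.0
        // Quadrupling Eden makes minor collections four times less frequent.
        System.out.println(secondsBetweenMinorGcs(1024 * mb, allocationRate)); // 16.0
    }
}
```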

Command line parameters that divide the heap between the new and old generations usually have the greatest performance impact.  If you increase the new generation's size, you often improve overall throughput; however, you also increase the footprint, which may slow down servers with limited memory.

For more details, you can read [12-15].
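Before changing any sizes, it is useful to see how the running JVM currently divides the heap. The sketch below lists the heap memory pools through the standard java.lang.management API; the class name HeapLayout is hypothetical, and pool names (Eden, Survivor, Old or Tenured) differ between collectors and JDK versions. The sizing flags commonly used to change this layout, such as -Xms, -Xmx, -Xmn, -XX:NewRatio, and -XX:SurvivorRatio, are not discussed in the text above and are mentioned here only as pointers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Sketch: print how the heap is currently divided into pools.
// HeapLayout is a hypothetical class name used only for this example.
class HeapLayout {
    /** Formats one pool line; a negative max means the limit is undefined. */
    static String format(String name, long usedBytes, long maxBytes) {
        String max = maxBytes < 0 ? "undefined" : (maxBytes >> 20) + " MB";
        return name + ": used=" + (usedBytes >> 20) + " MB, max=" + max;
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + (Runtime.getRuntime().maxMemory() >> 20) + " MB");
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null) {
                System.out.println(format(pool.getName(), usage.getUsed(), usage.getMax()));
            }
        }
    }
}
```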

Further Tuning


HotSpot's default parameters are effective for most small applications that require faster startup and a smaller footprint.  However, more often than not, you will find that the default settings are not good enough and that your Java applications need further tuning.  As shown in [16], there are many VM options exposed that can be further tuned by brave souls.  I'm not going to discuss such tunings in this article, but I'll keep posting articles on VM tuning on this blog.  Stay tuned!
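Whatever options you end up experimenting with, it helps to record exactly which ones a given JVM was started with. The sketch below reads the startup arguments through the standard RuntimeMXBean and picks out the -XX: options; the class name StartupFlags is hypothetical. On HotSpot, the -XX:+PrintFlagsFinal option can additionally print the effective values of all flags, though its output format differs between versions.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Sketch: show the tuning options this JVM was started with.
// StartupFlags is a hypothetical class name used only for this example.
class StartupFlags {
    /** Keeps only the arguments that look like -XX: tuning options. */
    static List<String> xxFlags(List<String> jvmArgs) {
        List<String> result = new ArrayList<>();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-XX:")) {
                result.add(arg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application's main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("All JVM arguments: " + jvmArgs);
        System.out.println("-XX options:       " + xxFlags(jvmArgs));
    }
}
```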

References

  1. Java Performance Tips
  2. Pick up performance with generational garbage collection
  3. Java HotSpot™ Virtual Machine Performance Enhancements
  4. Oracle JRockit
  5. Memory Management in the Java HotSpot™ Virtual Machine
  6. Understanding GC pauses in JVM, HotSpot's CMS collector
  7. Java Performance by Charlie Hunt and Binu John
  8. Performance Tuning with Hotspot VM Option: -XX:+TieredCompilation
  9. Java Tuning White Paper
  10. A Case Study of Using Tiered Compilation in HotSpot
  11. HotSpot VM Binaries: 32-Bit vs. 64-Bit
  12. HotSpot Performance Option — SurvivorRatio
  13. A Case Study of java.lang.OutOfMemoryError: GC overhead limit exceeded
  14. Understanding Garbage Collection
  15. Diagnosing Java.lang.OutOfMemoryError
  16. What Are the Default HotSpot JVM Values?
  17. Understanding Garbage Collector Output of Hotspot VM
  18. On Stack Replacement in HotSpot JVM
  19. Professional Oracle WebLogic Server by Robert Patrick, Gregory Nyberg, and Philip Aston
  20. Sun Performance and Tuning: Java and the Internet by Adrian Cockcroft and Richard Pettit
  21. Concurrent Programming in Java: Design Principles and Patterns by Doug Lea
  22. Capacity Planning for Web Performance: Metrics, Models, and Methods by Daniel A. Menascé and Virgilio A.F. Almeida
  23. Java Performance Tuning (Michael Finocchiaro)
  24. Diagnosing Heap Stress in HotSpot
  25. Introduction to HotSpot JVM Performance and Tuning
  26. Tuning the JVM (video)
    • Frequency of minor GCs is dictated by:
      • Application object allocation rate
      • Size of Eden
    • Frequency of object promotion is dictated by:
      • Frequency of minor GCs (tenuring)
      • Size of survivor spaces
    • Frequency of full GCs is dictated by:
      • Promotion rate
      • Size of old generation
  27. JEP 173: Retire Some Rarely-Used GC Combinations
  28. G1 GC Glossary of Terms
  29. Learn More About Performance Improvements in JDK 8
  30. Java SE HotSpot at a Glance
  31. Garbage-First Garbage Collector (JDK 8 HotSpot Virtual Machine Garbage Collection Tuning Guide)
  32. Tuning that was great in old JRockit versions might not be so good anymore
    • Trying to bring over each and every tuning option from a JRockit configuration to a HotSpot one is probably a bad idea.
    • Even when moving between major versions of the same JVM, we usually recommend going back to the defaults (just pick a collector and heap size) and then redoing any tuning work from scratch (if it is even necessary).
