Load unsigned and better Compressed Oops Friday, Apr 3 2009 

The HotSpot engineers are constantly working on improving performance. I noticed two interesting commits recently:

Vladimir Kozlov improved Compressed Oops so that no encoding/decoding is needed if the heap is smaller than 4GB, and so that branches/checks are reduced if the heap is between 4GB and 32GB. The end result is that 64-bit mode now surpasses 32-bit performance in more situations. See my entry about Compressed Oops if you don’t know what I’m talking about. :)

Christian Thalinger added support for load unsigned in the -server JIT. This means that masking idioms like bytearray[i] & 0xFF and intarray[i] & 0xFFFFFFFFL (necessary since JVM bytecode doesn’t support unsigned types) can be compiled into load unsigned operations, avoiding the performance penalty of the redundant masking. This can make a decent difference in some cases (e.g. charset operations).
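
To make the idiom concrete, here is a minimal sketch (my own illustration, not code from the commit) of how unsigned reads are conventionally expressed in Java; the optimization lets the JIT turn these into single unsigned load instructions instead of a load followed by a mask:

  // Illustrative only: the masking idioms the load unsigned optimization targets.
  static int unsignedByteAt(byte[] a, int i) {
    return a[i] & 0xFF; // byte loads sign-extend, so mask to get 0..255
  }

  static long unsignedIntAt(int[] a, int i) {
    return a[i] & 0xFFFFFFFFL; // widen to long and mask to get the unsigned value
  }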

32-bit or 64-bit JVM? How about a Hybrid? Tuesday, Oct 14 2008 

Before x86-64 came along, the decision on whether to use 32-bit or 64-bit mode for architectures that supported both was relatively simple: use 64-bit mode if the application requires the larger address space, 32-bit mode otherwise. After all, there’s no point in reducing the amount of data that fits into the processor cache while increasing memory usage and bandwidth if the application doesn’t need the extra addressing space.

When it comes to x86-64, however, there’s also the fact that the number of named general-purpose registers has doubled from 8 to 16 in 64-bit mode. For CPU-intensive apps, this may mean better performance at the cost of extra memory usage. On the other hand, for memory-intensive apps 32-bit mode might be better if you manage to fit your application within the address space provided. Wouldn’t it be nice if there was a single JVM that would cover the common cases?

It turns out that the HotSpot engineers have been working on doing just that through a feature called Compressed Oops. The benefits:

  • Heaps up to 32GB (instead of the theoretical 4GB in 32-bit that in practice is closer to 3GB)
  • 64-bit mode so we get to use the extra registers
  • Managed pointers (including Java references) are 32-bit so we don’t waste memory or cache space

The main disadvantage is that encoding and decoding is required to translate from/to native addresses. HotSpot tries to avoid these operations as much as possible and they are relatively cheap. The hope is that the extra registers give enough of a boost to offset the extra cost introduced by the encoding/decoding.
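
For the curious, here is a conceptual sketch of the arithmetic (my own simplification, not HotSpot source; it ignores null handling and the special cases where the base or shift can be elided):

  // Hypothetical helpers showing the idea behind Compressed Oops.
  // With 8-byte object alignment, a 32-bit "narrow" oop scaled by 8
  // can address 2^32 * 8 bytes = 32GB of heap.
  static long decode(int narrowOop, long heapBase) {
    return heapBase + ((narrowOop & 0xFFFFFFFFL) << 3); // zero-extend, scale, add base
  }

  static int encode(long address, long heapBase) {
    return (int) ((address - heapBase) >>> 3); // subtract base, unscale
  }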

Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill in a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode.

The tested application has two phases, a single threaded one followed by a multi-threaded one. Both phases do a large amount of allocation so memory bandwidth is very important. All tests were done on a dual quad-core Xeon 5400 series with 10GB of RAM. I should note that a different JDK version had to be used for 32-bit mode (JDK6u10rc2) because there is no Linux x86 build of JDK6u6p. I chose the largest heap size that would allow the 32-bit JVM to run the benchmark to completion without crashing.

I started by running the application with a smaller dataset:

JDK6u10rc2 32-bit
Single-threaded phase: 6298ms
Multi-threaded phase (8 threads on 8 cores): 17043ms
Used Heap after full GC: 430MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

JDK6u6p 64-bit with Compressed Oops
Single-threaded phase: 6345ms
Multi-threaded phase (8 threads on 8 cores): 16348ms
Used Heap after full GC: 500MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops

The performance numbers are similar, and the memory usage of the 64-bit JVM with Compressed Oops is about 16% higher.

JDK6u6p 64-bit
Single-threaded phase: 6463ms
Multi-threaded phase (8 threads on 8 cores): 18778ms
Used Heap after full GC: 700MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

The performance is again similar, but the memory usage of the 64-bit JVM is much higher: over 60% more than that of the 32-bit JVM.

Let’s try the larger dataset now:

JDK6u10rc2 32-bit
Single-threaded phase: 14188ms
Multi-threaded phase (8 threads on 8 cores): 73451ms
Used Heap after full GC: 1.25GB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

JDK6u6p 64-bit with Compressed Oops
Single-threaded phase: 13742ms
Multi-threaded phase (8 threads on 8 cores): 76664ms
Used Heap after full GC: 1.45GB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops

The performance difference and memory overhead are the same as with the smaller dataset. The benefit of Compressed Oops here is that we still have plenty of headroom while the 32-bit JVM is getting closer to its limits. This may not be apparent from the heap size after a full GC, but during the multi-threaded phase the peak memory usage is quite a bit larger and the fact that the allocation rate is high does not help. This becomes more obvious when we look at the results for the 64-bit JVM.

JDK6u6p 64-bit
Single-threaded phase: 14610ms
Multi-threaded phase (8 threads on 8 cores): 104992ms
Used Heap after full GC: 2GB
JVM Args: -XX:MaxPermSize=256m -Xms4224m -Xmx4224m -server -XX:+UseConcMarkSweepGC

I had to increase the Xms/Xmx to 4224m for the application to run to completion. Even so, the multi-threaded phase took a substantial performance hit compared to the other two JVM configurations. All in all, the 64-bit JVM without Compressed Oops does not do well here.

In conclusion, Compressed Oops seems to be a feature with a lot of promise: it allows the 64-bit JVM to be competitive even in cases that favour the 32-bit JVM. It might be interesting to test applications with different characteristics and compare the results. It’s also worth mentioning that since this is a new feature, it’s possible that performance will improve further before it’s integrated into the normal JDK releases. As it is, though, it already hits a sweet spot, and if it weren’t for the potential for instability, I would be ready to ditch my 32-bit JVM.

Update: The early access release of JDK 6 Update 14 also contains this feature.
Update 2: This feature is enabled by default since JDK 6 Update 23.

Local variables scope in HotSpot Saturday, Oct 11 2008 

Assume the following code:

  public void foo() {
    C c = new C();
    bar(c.baz); // assume that baz does not reference c
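    // note: c is never used after this call, though it remains in scope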
  }

I was under the impression that HotSpot would not garbage collect c before the local variable went out of scope, although it is legally allowed to do so. I have even heard of issues where unnecessary garbage was retained as a result (the usual workarounds are to null the variable or to inline it).

According to this bug report (Server JIT optimization can cause objects to go out of scope prematurely), the Server VM is actually able to GC the object before the local variable goes out of scope. It would be interesting to know if it can detect such cases reliably.
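
For illustration, here is a minimal, hypothetical test (my own sketch, not the bug report’s test case) of how one might try to observe this. Whether the weak reference is actually cleared depends on JIT and GC behaviour, since System.gc() is only a hint:

  import java.lang.ref.WeakReference;

  public class ScopeTest {
    static class C { final Object baz = new Object(); }

    static void bar(Object o) { /* assume bar keeps no reference to o */ }

    public static void main(String[] args) {
      C c = new C();
      WeakReference<C> ref = new WeakReference<C>(c);
      Object baz = c.baz; // last use of c
      // c is still in scope here, but the server JIT may treat it as dead
      System.gc();
      System.out.println(ref.get() == null ? "collected before end of scope"
                                           : "still reachable");
      bar(baz);
    }
  }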

HotSpot JIT “Optimization” Saturday, Sep 27 2008 

I noticed Scala ticket #1377 the other day. Even though I think the bug is valid, it’s for different reasons than the ones supplied by the reporter. I had my doubts that the casts would have any measurable cost, and the informal benchmark provided looked suspect. This is all described in the ticket, so I won’t go over it again here. The point of this post is some weird behaviour I noticed in my version of the benchmark.

The benchmark looks like:

  object PerfTest {
    def f(i: Iterator[_]) = i.hasNext

    def main(args: Array[String]) {
      test()
    }

    def test() {
      var i = 0
      while (i < 10) {
        val time = System.currentTimeMillis
        val result = inner()
        i += 1
        println("Time: " + (System.currentTimeMillis - time))
        println("Value: " + result)
      }
    }

    def inner() = {
      //val empty = Iterator.empty // checkcast occurs within loop
      val empty: Iterator[_] = Iterator.empty // checkcast outside loop
      var i = 0L
      while (i < 10000000000L) {
        f(empty)
        i += 1
      }
      i
    }
  }

According to this version of the benchmark the extra casts don't cause any performance difference, so I will just focus on the one without the extra casts. The results with JDK 6 Update 10 RC1 64-bit were:

Time: 4903
Time: 4883
Time: 7213
Time: 7197
Time: 7203
Time: 7212
Time: 7185
Time: 7190
Time: 7210
Time: 7188

That is odd. Instead of getting faster, the benchmark gets slower from the third iteration onwards. With JITs like these, we're better off with interpreters. ;) Ok, that's an exaggeration but let's investigate this further.

Before we do so, I should clarify two things. Those paying attention might wonder where the "Value" output went. Its only purpose is to make sure HotSpot does not optimize the inner loop away, so I trimmed it from the output. The second point is that the use of a 64-bit JVM is important because the problem does not occur on the 32-bit version of HotSpot. I initially used the 64-bit version because it's much faster when performing operations on longs, and I had to use a long index in the loop for the benchmark to take a reasonable amount of time.

Ok, so the first step is to re-run the benchmark with -Xbatch and -XX:+PrintCompilation. The results of that were:

  1   b   test.PerfTest$::f (7 bytes)
  2   b   scala.Iterator$$anon$5::hasNext (2 bytes)
  1%  b   test.PerfTest$::inner @ 12 (35 bytes)
Time: 4938
  3   b   test.PerfTest$::inner (35 bytes)
Time: 4861
Time: 7197
Time: 7199
Time: 7241


Ok, so it seems like the inner method got JIT'd a second time, and the recompiled version is slower than the one it replaced, which sounds like a bug. I converted the code into Java, and it turns out that we don't need much more than a loop in the inner method to reproduce the problem:


static long inner() {
  long i = 0L;
  for (; i < 10000000000L; ++i);
  return i;
}

Before reporting the bug to Sun, I was curious whether -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation would give any interesting information. I pasted some parts that someone who is not a HotSpot engineer might understand with the help of the LogCompilation overview (http://wikis.sun.com/display/HotSpotInternals/LogCompilation+overview).
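
For reference, the invocation looks something like this (the class name test.PerfTest matches the log below; with the JDKs of this era, the XML output typically lands in hotspot.log in the working directory):

  java -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation test.PerfTest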


<nmethod compile_id='1' compile_kind='osr' compiler='C2' entry='0x000000002ee0d000' size='520' address='0x000000002ee0ced0' relocation_offset='264' code_offset='304' stub_offset='400' consts_offset='420' scopes_data_offset='424' scopes_pcs_offset='456' dependencies_offset='504' oops_offset='512' method='test/PerfTest inner ()J' bytes='19' count='1' backedge_count='14563' iicount='1' stamp='0.054'/>
<writer thread='12933456'/>
<task_queued compile_id='1' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' blocking='1' stamp='4.965' comment='count' hot_count='2'/>
<writer thread='43477328'/>
  1   b   test.PerfTest::inner (19 bytes)
<nmethod compile_id='1' compiler='C2' entry='0x000000002ee0d240' size='488' address='0x000000002ee0d110' relocation_offset='264' code_offset='304' stub_offset='368' consts_offset='388' scopes_data_offset='392' scopes_pcs_offset='424' dependencies_offset='472' oops_offset='480' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' stamp='4.966'/>
<...>

<task compile_id='1' compile_kind='osr' method='test/PerfTest inner ()J' bytes='19' count='1' backedge_count='14563' iicount='1' osr_bci='5' blocking='1' stamp='0.050'>
<...>
<task_done success='1' nmsize='120' count='1' backedge_count='14563' stamp='0.054'/>

<task compile_id='1' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' blocking='1' stamp='4.965'>
<...>
<task_done success='1' nmsize='88' count='2' backedge_count='5000' stamp='4.966'/>

So, it seems like the inner method was first JIT’d through on-stack replacement (OSR). Usually, on-stack replacement does not produce the best code, so HotSpot recompiles the method again once it gets the chance. Unfortunately, the recompilation generates slower code for some reason, even though the compiled method is smaller (an nmsize of 88 instead of 120).

We could go deeper in this investigation by using a debug JVM like Kohsuke Kawaguchi did here, but I decided to just file a bug and let the HotSpot engineers handle it. :) I will update the blog once a bug number is assigned (I wonder when Sun is going to fix their bug database so that the bug id becomes available after submission of a bug…).

Update: The bug is now available in the Sun database.