Performance of FastMath from Commons Math Wednesday, Feb 23 2011 

Commons Math includes (in trunk and MATH_2_X branch) a FastMath class that is described as a “Faster, more accurate, portable alternative to StrictMath” and is implemented fully in Java (unlike java.lang.Math and java.lang.StrictMath). I am generally interested in faster math operations so I decided to investigate.

The first thing I checked was whether FastMath delegates to java.lang.Math or java.lang.StrictMath for any of its methods. It delegates to java.lang.Math for sqrt and random and to StrictMath for IEEEremainder. I found this interesting because I know that HotSpot includes intrinsic methods for many methods in java.lang.Math. Looking at vmSymbols.hpp, we can see that the following methods have intrinsics: abs, sin, cos, tan, atan2, sqrt, log, log10, pow, exp, min and max. Intrinsic methods are usually highly optimised (often involving assembly and sometimes CPU-specific instructions) and do not incur JNI overhead. I was interested in comparing all of FastMath’s methods with java.lang.Math and java.lang.StrictMath, but I was particularly interested in these ones.
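
For illustration, the delegating methods are one-line wrappers along these lines (paraphrased from the FastMath source, not quoted verbatim):

  // Paraphrased sketch: FastMath simply forwards these to the JDK,
  // which lets them benefit from HotSpot's intrinsics where available.
  public static double sqrt(final double a) {
      return Math.sqrt(a);
  }

  public static double IEEEremainder(double dividend, double divisor) {
      return StrictMath.IEEEremainder(dividend, divisor);
  }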

Bill Rossi (the original author of FastMath) pointed me to a performance test in SVN, so I decided to run that (r1072209 from SVN trunk) on an Intel Core i7 CPU 860 2.80GHz using JDK 6 Update 25 early access release build 1 (includes HotSpot 20) and JDK 7 build 130 (includes HotSpot 21). I originally ran the test as it is in Commons Math, with one addition, but it has a few common micro-benchmarking mistakes (thanks to Aleksey for pointing them out in the comments). I fixed the issues (see here for the modified code) and ran the test again.
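
Common mistakes of this kind are usually a missing warm-up phase and results that the JIT can discard as dead code. A minimal sketch of the safer pattern (my own illustration, not the modified test itself; FastMath.exp is just a stand-in for the method under test, and the package name is the one used on the 2.X branch as far as I remember):

  import org.apache.commons.math.util.FastMath;

  public class FastMathBenchmark {

    // Accumulate into a value that is printed later so that HotSpot
    // cannot treat the loop body as dead code and eliminate it.
    private static double run() {
      double sum = 0;
      for (int i = 0; i < 10000000; i++) {
        sum += FastMath.exp(i * 1e-6);
      }
      return sum;
    }

    public static void main(String[] args) {
      run(); // warm-up, so the timed runs measure JIT-compiled code
      for (int i = 0; i < 5; i++) {
        long start = System.nanoTime();
        double result = run();
        System.out.println((System.nanoTime() - start) / 1000000 + " ms, result: " + result);
      }
    }
  }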

Since the relative results for the two JDKs did not differ much, I am only posting the charts for JDK 6 Update 25 early access release. The results for both JDKs are available in table form here. Note that the results are in milliseconds, so lower is better.

java.lang.Math does quite a bit better on tan, abs and log while FastMath does better on exp and cosh. The rest are similar.

FastMath does very well on hypot, pow, asin, acos, cbrt, sinh and tanh, while java.lang.Math does well on log10 and log1p. java.lang.Math does so badly on hypot that it’s worth mentioning again: it looks like it needs a bit of work.

As stated above, the results for JDK 7 are very similar, with a few small improvements. The methods whose execution time dropped by 15% or more were pow (202ms instead of 238ms), powII (125ms instead of 149ms) and atan (53ms instead of 66ms).

The usual disclaimers about benchmarks apply. The benchmark provided by commons-math is a good start, but I believe it should be fleshed out more so that common cases are measured separately (e.g. pow(x, 2) is much more common than the ones used in the test). Having said that, FastMath is a nice contribution and it seems like some methods in the JDK could be replaced by the ones in FastMath for better results. It also seems like FastMath could do better by delegating to the JDK for more of its methods.
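
As a hypothetical example of such a case, a benchmark for a small constant exponent could mirror the sketch above (runPowSquared is my own name, not part of the test):

  // Measures the common special case pow(x, 2) separately from the
  // general pow(double, double) path exercised by the existing test.
  private static double runPowSquared() {
    double sum = 0;
    for (int i = 0; i < 10000000; i++) {
      double x = i * 1e-6;
      sum += FastMath.pow(x, 2);
    }
    return sum; // consumed by the caller, as before
  }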

Update: Some context for the contribution of FastMath to Apache Commons Math can be found here.

Update 2: I’ve updated the charts to use a modified test instead of the one in Commons Math as there were issues there. I had mentioned some suspicious scores, but thanks to Aleksey for pointing the issues out in the comments (they are common mistakes in micro-benchmarks).

AOT and JVM Wednesday, Oct 8 2008 

There have been a few AOT compilers for Java for some time; two of the better known examples are GCJ and Excelsior JET. Even though Excelsior JET also includes a JIT, its main focus is on the AOT aspect.

There have been many suggestions that HotSpot should cache the JIT-generated code to improve start-up performance, avoid the need for a warm-up phase on every invocation of the application and possibly share more data between applications. This is particularly relevant for desktop applications. HotSpot engineers claim that this is a complex task and that the benefits might not be as great as expected because AOT-generated code would be slower than JIT-generated code, so they implemented a simpler solution in JDK 5, Class Data Sharing. It’s pretty limited because it only works with the Client VM and the serial garbage collector, and because it only loads a fixed set of classes from the system jar.
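
If memory serves, Class Data Sharing is controlled with the -Xshare family of flags (MyApp is a placeholder here):

  java -Xshare:dump         # (re)generate the shared archive (classes.jsa)
  java -Xshare:on MyApp     # require the shared archive, fail if unavailable
  java -Xshare:off MyApp    # disable class data sharing entirely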

Interestingly, IBM did some work in this area in the IBM JRE for Java 5 and improved it further in the one for Java 6. This is described in some detail in a developerWorks article. It’s worth reading if you’re interested in this sort of thing, but I’ll list some points that I found interesting:

  • AOT code can be stored into a shared class cache, allowing another JVM to use it to reduce startup time (a sample command line follows this list).
  • Class compression is used to increase the number of classes that can be stored in the cache.
  • AOT code is still subject to JIT compilation if it’s invoked often, in order to optimise it further.
  • AOT code executed by a JVM is copied from the shared class cache, so there is no direct footprint benefit, but there are memory and CPU savings from reusing this code rather than repeating the compilation.
  • AOT compilation is based on heuristics that select methods that are likely to improve startup time.
  • Eclipse without any additional plugins took 3.313 seconds to start with AOT and shared classes versus 4.204 seconds without. Larger improvements would be expected with more plugins installed.
  • Tomcat startup time improved from 1138ms to 851ms. Using shared classes without AOT resulted in 950ms, which means that both AOT and shared classes contributed to the improvement.
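
A minimal example of enabling the shared class cache on the IBM JVM (the cache name and application are placeholders; see the developerWorks article for the full set of options):

  java -Xshareclasses:name=myCache -jar myApp.jar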

Downloading the IBM JRE/JDK requires registration, and if you need the Windows version, you must download a bundle that includes Eclipse. Links to the various downloads can be found in the developerWorks article.

These are promising results, and it would be interesting to find out whether HotSpot would also benefit from similar enhancements.

HotSpot JIT “Optimization” Saturday, Sep 27 2008 

I noticed Scala ticket #1377 the other day. Even though I think the bug is valid, it’s for different reasons than the ones supplied by the reporter. I had my doubts that the casts would have any measurable cost, and the informal benchmark provided looked suspect. This is all described in the ticket, so I won’t go over it again here. The point of this post is some weird behaviour I noticed in my version of the benchmark.

The benchmark looks like:

  package test

  object PerfTest {
    def f(i: Iterator[_]) = i.hasNext

    def main(args: Array[String]) {
      test()
    }

    def test() {
      var i = 0
      while (i < 10) {
        val time = System.currentTimeMillis
        val result = inner()
        i += 1
        println("Time: " + (System.currentTimeMillis - time))
        println("Value: " + result)
      }
    }

    def inner() = {
      //val empty = Iterator.empty // checkcast occurs within loop
      val empty: Iterator[_] = Iterator.empty // checkcast outside loop
      var i = 0L
      while (i < 10000000000L) {
        f(empty)
        i += 1
      }
      i
    }
  }

According to this version of the benchmark, the extra casts don't cause any performance difference, so I will just focus on the variant without the extra casts. The results with JDK 6 Update 10 RC1 64-bit were:

  Time: 4903
  Time: 4883
  Time: 7213
  Time: 7197
  Time: 7203
  Time: 7212
  Time: 7185
  Time: 7190
  Time: 7210
  Time: 7188

That is odd. Instead of getting faster, the benchmark gets slower from the third iteration onwards. With JITs like these, we're better off with interpreters. ;) Ok, that's an exaggeration, but let's investigate further.

Before we do so, I should clarify two things. Those paying attention might wonder where the "Value" output went. Its only purpose is to make sure HotSpot does not optimize the inner loop away, so I trimmed it from the output shown here. The second point is that the use of a 64-bit JVM is important, because the problem does not occur on the 32-bit version of HotSpot. I initially used the 64-bit version because it's much faster when performing operations on longs, and I had to use a long index in the loop to allow the benchmark to take a reasonable amount of time.

Ok, so the first step is to re-run the benchmark with -Xbatch and -XX:+PrintCompilation. The results were:

    1   b   test.PerfTest$::f (7 bytes)
    2   b   scala.Iterator$$anon$5::hasNext (2 bytes)
    1%  b   test.PerfTest$::inner @ 12 (35 bytes)
  Time: 4938
    3   b   test.PerfTest$::inner (35 bytes)
  Time: 4861
  Time: 7197
  Time: 7199
  Time: 7241

Ok, so it seems like the inner loop got JIT'd a second time and that made it slower than the previous version, which sounds like a bug. I converted the code into Java and it turns out that we don't need much more than a loop in the inner method to reproduce the problem:


static long inner() {
  long i = 0L;
  for (; i < 10000000000L; ++i);
  return i;
}
&#91;/sourcecode&#93;
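
For reference, here is the loop wrapped in a minimal harness (the package and class names match the compilation logs below; the timing scaffolding is my own):

  package test;

  public class PerfTest {

    static long inner() {
      long i = 0L;
      for (; i < 10000000000L; ++i);
      return i;
    }

    public static void main(String[] args) {
      for (int run = 0; run < 10; run++) {
        long time = System.currentTimeMillis();
        long result = inner();
        System.out.println("Time: " + (System.currentTimeMillis() - time));
        System.out.println("Value: " + result); // keeps HotSpot from discarding the loop
      }
    }
  }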

Before reporting the bug to Sun, I was curious whether -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation would give any interesting information. I pasted below some parts that someone who is not a HotSpot engineer might understand with the help of the LogCompilation overview (http://wikis.sun.com/display/HotSpotInternals/LogCompilation+overview).

<nmethod compile_id='1' compile_kind='osr' compiler='C2' entry='0x000000002ee0d000' size='520' address='0x000000002ee0ced0' relocation_offset='264' code_offset='304' stub_offset='400' consts_offset='420' scopes_data_offset='424' scopes_pcs_offset='456' dependencies_offset='504' oops_offset='512' method='test/PerfTest inner ()J' bytes='19' count='1' backedge_count='14563' iicount='1' stamp='0.054'/>
<writer thread='12933456'/>
<task_queued compile_id='1' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' blocking='1' stamp='4.965' comment='count' hot_count='2'/>
<writer thread='43477328'/>
  1   b   test.PerfTest::inner (19 bytes)
<nmethod compile_id='1' compiler='C2' entry='0x000000002ee0d240' size='488' address='0x000000002ee0d110' relocation_offset='264' code_offset='304' stub_offset='368' consts_offset='388' scopes_data_offset='392' scopes_pcs_offset='424' dependencies_offset='472' oops_offset='480' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' stamp='4.966'/>
<...>

<task compile_id='1' compile_kind='osr' method='test/PerfTest inner ()J' bytes='19' count='1' backedge_count='14563' iicount='1' osr_bci='5' blocking='1' stamp='0.050'>
<...>
<task_done success='1' nmsize='120' count='1' backedge_count='14563' stamp='0.054'/>

<task compile_id='1' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' blocking='1' stamp='4.965'>
<...>
<task_done success='1' nmsize='88' count='2' backedge_count='5000' stamp='4.966'/>

So, it seems the inner method was first compiled through on-stack replacement (OSR). Usually, on-stack replacement does not produce the best code, so HotSpot recompiles the method once it gets the chance. Unfortunately, the recompilation generates slower code for some reason, even though the compiled method is smaller (nmsize 88 instead of 120).

We could go deeper into this investigation by using a debug JVM, like Kohsuke Kawaguchi did here, but I decided to just file a bug and let the HotSpot engineers handle it. :) I will update the blog once a bug number is assigned (I wonder when Sun is going to fix their bug database so that the bug id becomes available as soon as a bug is submitted…).

Update: The bug is now available in the Sun database.