Performance of FastMath from Commons Math Wednesday, Feb 23 2011 

Commons Math includes (in trunk and MATH_2_X branch) a FastMath class that is described as a “Faster, more accurate, portable alternative to StrictMath” and is implemented fully in Java (unlike java.lang.Math and java.lang.StrictMath). I am generally interested in faster math operations so I decided to investigate.

The first thing that I checked is if FastMath delegated to java.lang.Math or java.lang.StrictMath for any of its methods. It delegates to java.lang.Math for sqrt and random and to StrictMath for IEEEremainder. I found this interesting because I know that HotSpot includes intrinsic methods for many methods in java.lang.Math. Looking at vmSymbols.hpp, we can see that the following methods have intrinsics: abs, sin, cos, tan, atan2, sqrt, log, log10, pow, exp, min and max. Intrinsic methods are usually highly optimised (often involving assembly and sometimes CPU-specific instructions) and do not incur JNI overhead. I was interested in comparing all of FastMath’s methods with java.lang.Math and java.lang.StrictMath, but I was particularly interested in these ones.

Bill Rossi (the original author of FastMath) pointed me to a performance test in SVN, so I decided to run that (r1072209 from SVN trunk) on a Intel Core i7 CPU 860 2.80GHz using JDK 6 Update 25 early access release build 1 (includes HotSpot 20) and JDK 7 build 130 (includes HotSpot 21). I originally ran the test as it is in Commons Math with one addition but it has a few common micro-benchmarking mistakes (thanks to Aleksey for pointing them out in the comments). I fixed the issues (see here for the modified code) and ran the test again.

Since the relative results for the two JDKs did not differ much, I am only posting the charts for JDK 6 Update 25 early access release. The results for both JDKs are available in table form here. Note that the results are in milliseconds, so lower is better.

java.lang.Math does quite a bit better on tan, abs and log while FastMath does better on exp and cosh. The rest are similar.

FastMath does very well on hypot, pow, asin, acos, cbrt, sinh, tanh while java.lang.Math does well on log10 and log1p. java.lang.Math does so badly on hypot that it’s worth mentioning it again. Looks like it needs a bit of work.

As stated above, the results for JDK 7 are very similar with a few small improvements. Methods with an execution time reduction of 15% or higher were pow (202ms instead of 238ms), powII (125ms instead of 149ms) and atan (53ms instead of 66ms)

The usual disclaimers about benchmarks apply. The benchmark provided by commons-math is a good start, but I believe it should be fleshed out more so that common cases are measured separately (e.g. pow(x, 2) is much more common than the ones used in the test). Having said that, FastMath is a nice contribution and it seems like some methods in the JDK could be replaced by the ones in FastMath for better results. It also seems like FastMath could do better by delegating to the JDK for more of its methods.

Update: Some context for the contribution of FastMath to Apache Commons Math can be found here.

Update 2: I’ve updated the charts to use a modified test instead of the one in Commons Math as there were issues there. I had mentioned some suspicious scores, but thanks to Aleksey for pointing the issues out in the comments (they are common mistakes in micro-benchmarks).

New JVM options and Scala iteration performance Monday, Oct 26 2009 

Internal iterators like foreach tend to do very well in micro-benchmarks on the JVM. In fact, they often do as well as the equivalent manual while or for loop. There is a catch, however, and it’s easy to miss it in micro-benchmarks.

A few months ago, David MacIver forwarded an email from Martin Odersky to the scala-internals mailing list about this. The problem as described by Martin:

“The reason seems to be that the foreach itself is not inlined, so the call to the closure becomes megamorphic and therefore slow.”

I was curious about the benchmarks used to measure this and Tiark Rompf (who had performed the measurements) provided the source code. I said I’d try to take a look the next weekend and I did it… more than 3 months later. Well, better late than never. ;)

There are 3 benchmarks:

  • Benchmark A: traverse a collection with N elements
  • Benchmark B: traverse a collection with N elements, inside the loop/closure traverse another collection with N2 elements 3 times
  • Benchmark C: build a collection (front to back) of N elements

Various approaches are used for each benchmark with various collections types. One weakness of the these benchmarks is that they don’t include a verification mechanism to ensure that all benchmarks produce the same result. However, they include code that tries to prevent the JIT from performing unfair optimisations (e.g. print something if an element in the collection matches a certain condition).

The original results produced by Tiark can be found here.

I added a few benchmarks that use plain arrays (RawArrayIndexed, RawArrayForeach, RawArrayForeachMega), made minor changes to the scripts and pushed the code to a GitHub repository. I left the rest of the Scala code as it was to make it easy to compare with the original results and ran the benchmark with various JVM settings to see what effect they would have. All the tests shared the following:

  • Dual quad-core Xeon E5462 2.80GHz
  • 14 GB RAM
  • Fedora 11
  • Scala 2.8.0.r19261
  • Revision 59521431f5c118b73e35b0b396e3efd6aecec3dd of project
  • 64-bit JDK 6 Update 18 early access b03
  • JVM base settings: -Xms1G -Xmx1G -XX:+UseParallelGC -XX:+UseParallelOldGC

JDK 6 Update 18 is scheduled to be released on Q4, 2009 and it includes HotSpot 16. Even though JDK 6 Update 14 (HotSpot 14) introduced compressed references and scalar replacement, HotSpot 16 includes improved compressed references and many crucial fixes to both features. According to my testing these features are now approaching production-level stability and the OpenJDK engineers seem to agree as they are both enabled by default in HotSpot 17 (which will eventually hit JDK6 too).

Interested in how these features would affect the performance in these benchmarks, I ran them with various combinations. I also added Scala’s compiler -optimise flag in some cases.

The original benchmark from Tiark used 3 collection types: array (java.util.ArrayList), list (scala.List, immutable single linked list) and vector (earlier version of immutable vector that has recently been added to Scala 2.8). I added JVM arrays and they are shown as “rawarray” in the charts. Finally, we get to the actual numbers.

Benchmark A

Click on chart for expanded version

There are some interesting data points here:

  1. Compressed references is a _huge_ win. RawArrayIndexed went from 500ms to 142ms and many of the vector operations were much faster.
  2. Escape analysis (which enables scalar replacement) doesn’t seem to have much of an effect.
  3. scalac -optimise doesn’t seem to have much of an effect.
  4. foreach is misleadingly fast in micro-benchmarks, but it’s easy to bring it down to earth. RawArrayForeach performs similarly to RawArrayIndexed, but RawArrayForeachMega is 10 times slower. The latter simply calls foreach with a few different anonymous functions during the collection creation phase causing the call site to become megamorphic. Once this happens, the only hope for good performance is that the foreach method gets inlined and it doesn’t seem to happen here. With this in mind, it seems like ticket 1338 (Optimize simple for loops) is a good idea.
Benchmark B

Click on chart for expanded version

Once again, compressed references are a large factor in some benchmarks (halving the time taken in some cases).

The new bit of information is that scalac -optimise causes a huge improvement in VectorForeachFast and VectorForeachFastProtect. This makes sense once one considers one of the findings from the previous benchmark. We said that inlining of foreach is of extreme importance once a call site is megamorphic and this is precisely what -optimise does in this case (and the JVM fails to do so at runtime otherwise). Sadly, -optimise cannot do this safely in many cases as it’s shown by the results for VectorForeach.

Benchmark C

Click on chart for expanded version

Once again, compressed references provide a nice boost. Seems like this option is a winner in 64-bit JVMs (if you don’t need a heap larger than 32GB), it saves memory and gives better performance. The usual disclaimer applies though, you should benchmark your own application instead of relying on micro-benchmarks when deciding what JVM options to use.

The complete results are also available. Feel free to play with the source code and provide your own numbers, fixes and/or improvements.

Load unsigned and better Compressed Oops Friday, Apr 3 2009 

The HotSpot engineers are constantly working on improving performance. I noticed two interesting commits recently:

Vladimir Kozlov improved Compressed Oops so that it doesn’t need to do encoding/decoding if the heap is smaller than 4GB and to reduce branches/checks if the heap is between 4GB and 32GB. The end result is that 64-bit now surpasses 32-bit performance in more situations. See my entry about Compressed Oops if you don’t know what I’m talking about. :)

Christian Thalinger added support for load unsigned in the -server JIT. This means that things like bytearray[i] & 0xFF and intarray[i] & 0xFFFFFFFF (necessary since JVM bytecode doesn’t support unsigned types) can be transformed into load unsigned operations to avoid the performance penalty. This can make a decent difference in some cases (e.g. charset operations).

Objects with no allocation overhead Wednesday, Dec 17 2008 

We have all heard about how HotSpot is really good at dealing with short-lived objects (both allocation and GC), but the truth is that object allocation is still pretty costly when compared to operations like addition or multiplication. Allocating an object for each step of an iteration over a large collection to make a simple computation might sound like the kind of thing no-one would ever do, but it’s actually quite common in languages like Scala (as described in a previous post). In Java-land, if you use the Function class from Google Collections with primitive wrappers, the same issue may occur. There are many JVM improvements that could help depending on the specific case (generic specialisation, value types, fixnums to name a few), but it’s unclear if/when we’ll get them.

So, what about that title? Escape analysis was introduced during Java 6, but the information gathered was only used for lock elision. However, this information can also be used for other interesting optimisations like scalar replacement and stack allocation. There have been doubts about the benefits of stack allocation (discussed in the comments) so the focus has been on scalar replacement so that the object is never in memory. At least that’s the theory.

Edward Lee started a couple of threads in the Hotspot-dev mailing list about scalar replacement here and here which reminded me to do some experiments. Note that this feature is still in development so the results posted here are preliminary and not indicative of how it will perform once it’s final. Still, it’s interesting to see how well it works at this time. I picked the latest JDK7 build (build 41) and ran a few tests with the following arguments passed to java “-XX:MaxPermSize=256m -Xms128m -Xmx3328m -server -XX:+UseConcMarkSweepGC” and either XX:-DoEscapeAnalysis or XX:+DoEscapeAnalysis.

I started by choosing the simplest test possible. Note that either testSimpleAllocation or testNoAllocation would be commented out.

class C(val a: Int, val b: Int)

object Test {
  def main(args: Array[String]) {
    for (i <- 1 to 10) testSimpleAllocation()
    //for (i <- 1 to 10) testNoAllocation()
  }
  
  def testSimpleAllocation() = {
    System.gc()
    var time = System.currentTimeMillis;
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(new C(i + 1, i + 2))
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }
  
  def testNoAllocation() = {
    System.gc()
    var time = System.currentTimeMillis;
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(i + 1, i + 2)
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }
  
  def baz(a: Int, b: Int): Int = a + b
  
  def baz(c: C): Int = c.a + c.b
}

The result were:


testNoAllocation: 403
testSimpleAllocation with EA: 1006
testSimpleAllocation without EA: 9190

As we can see, escape analysis has a tremendous effect and the method finishes in 11% of the time taken with it disabled. However, the version with no allocation is still substantially faster.

I decided to test a foreach method that takes a Function object next (this time in Java because it does less magic behind the scenes):

package test;

public class EscapeAnalysis {
  
  interface Function<T, R> {
    R apply(T value);
  }
  
  interface IntProcedure {
    void apply(int value);
  }
  
  static class BoxedArray {
    private final int[] array;
    
    public BoxedArray(int length) {
      array = new int[length];
    }
    
    public int length() {
      return array.length;
    }
    
    public void foreach(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(new Integer(i));
    }
    
    public void foreach(IntFunction function) {
      for (int i : array)
        function.apply(i);
    }

    public void set(int index, int value) {
      array[index] = value;
    }

    public void foreachWithAutoboxing(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(i);
    }
  }
  
  public static void main(String[] args) {
    BoxedArray array = new BoxedArray(100000000);
    /* We are careful not to restrict our ints to the ones in the Integer.valueOf cache */
    for (int i = 0; i < array.length(); i++)
      array.set(i, i);
    
    for (int i = 0; i < 10; i++)
      test(array);
  }

  private static void test(BoxedArray array) {
    System.gc();
    long time = System.currentTimeMillis();
    final int[] sum = new int[] { 0 };
    
    /* Uncomment the one that should be executed */
    testIntegerForeach(array, sum);
//    testIntegerForeachWithAutoboxing(array, sum);
//    testIntForeach(array, sum);

    System.out.println(System.currentTimeMillis() - time);
    System.out.println(sum[0]);
  }
  
  private static void testIntegerForeachWithAutoboxing(BoxedArray array, final int[] sum) {
    array.foreachWithAutoboxing(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }
  
  private static void testIntegerForeach(BoxedArray array, final int[] sum) {
    array.foreach(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }

  private static void testIntForeach(BoxedArray array, final int[] sum) {
    array.foreach(new IntFunction() {
      public void apply(int value) {
        sum[0] += value;
      }
    });
  }
}

The results were:


testIntForeach: 130
testIntegerForeachWithAutoboxing with EA: 1064
testIntegerForeach with EA: 224
testIntegerForeachWithAutoboxing without EA: 1039
testIntegerForeach without EA: 1024

This test shows something interesting, EA gives no improvement if Integer.valueOf (called by auto-boxing) is used instead of new Integer. Apart from that, the results are somewhat similar to the first ones (EA provides a substantial boost, but not enough to match the specialised implementation). After quickly testing that the boxing methods in ScalaRunTime had the same effect as Integer.valueOf, I decided that it was not worth testing more complex scenarios.

It seems like there’s a lot of potential for scalar replacement, but HotSpot needs to do a better job at detecting cases where it can be used safely. If nothing else, at least knowledge of the valueOf methods should be hardcoded into the system. I hope that a more general solution is found though because other languages on the JVM may use different methods (as mentioned earlier Scala uses methods in ScalaRunTime instead). It will also be interesting to see if the performance can get even closer to the no allocation case. Since the feature is still in development, we can hope. :)

Update: The early access release of JDK 6 Update 14 also contains this feature.
Update 2: This feature is enabled by default since JDK 6 Update 23.

JDK6u12, Java 7 and Devoxx Saturday, Dec 13 2008 

With JavaFX 1.0 out of the way, it seems like the Java 7 train has started moving again.

Concurrency plans posted by Doug Lea. Two points of interest are that the parallel collections API is probably not going to get included (but the core fork-join framework will) and that a low-level fences API might get included. Without closures, it’s hard to include the parallel collections API with good performance for primitives while avoiding interface explosion. The positive aspect is that it will continue to be developed outside the JDK giving it more flexibility in terms of API evolution.

Devoxx summary by Danny Coward containing various links covering JavaFX, JDK7 and Languages in the JVM.

Summary of Java 7 update given by Mark Reinhold written by Hamlet D’Arcy. Small language changes and Project Jigsaw had already been announced and many of the items described there had been known for a while, but some new ones are Null dereference expressions, XRender pipeline for Java 2D and the chance of a MVM-lite. The XRender pipeline is interesting because it was developed by Clemens Eisserer as part of the OpenJDK challenge and would be a nice example of an external contribution to OpenJDK.

One thing that is clear from the various announcements is that Sun intends to use OpenJDK as the place where work for Java 7 will take place (even if JSRs are also planned). Some people expressed concern that this is bad because it circumvents the JCP and that OpenJDK is a project controlled by Sun since the vast majority of the committers are from Sun. Given how opaque the JCP is, I don’t see how OpenJDK could be worse, but my hope is that Sun will run these projects transparently as promised where the end-result is diversification of committers due to more external participation.

Finally, for the people who still use the proprietary JDK from Sun, there’s an early access release of JDK6u12 that includes a 64-bit Java Plugin.

32-bit or 64-bit JVM? How about a Hybrid? Tuesday, Oct 14 2008 

Before x86-64 came along, the decision on whether to use 32-bit or 64-bit mode for architectures that supported both was relatively simple: use 64-bit mode if the application requires the larger address space, 32-bit mode otherwise. After all, no point in reducing the amount of data that fits into the processor cache while increasing memory usage and bandwidth if the application doesn’t need the extra addressing space.

When it comes to x86-64, however, there’s also the fact that the number of named general-purpose registers has doubled from 8 to 16 in 64-bit mode. For CPU intensive apps, this may mean performance at the cost of extra memory usage. On the other hand, for memory intensive apps 32-bit mode might be better in if you manage to fit your application within the address space provided. Wouldn’t it be nice if there was a single JVM that would cover the common cases?

It turns out that the HotSpot engineers have been working on doing just that through a feature called Compressed oops. The benefits:

  • Heaps up to 32GB (instead of the theoretical 4GB in 32-bit that in practice is closer to 3GB)
  • 64-bit mode so we get to use the extra registers
  • Managed pointers (including Java references) are 32-bit so we don’t waste memory or cache space

The main disadvantage is that encoding and decoding is required to translate from/to native addresses. HotSpot tries to avoid these operations as much as possible and they are relatively cheap. The hope is that the extra registers give enough of a boost to offset the extra cost introduced by the encoding/decoding.

Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode.

The tested application has two phases, a single threaded one followed by a multi-threaded one. Both phases do a large amount of allocation so memory bandwidth is very important. All tests were done on a dual quad-core Xeon 5400 series with 10GB of RAM. I should note that a different JDK version had to be used for 32-bit mode (JDK6u10rc2) because there is no Linux x86 build of JDK6u6p. I chose the largest heap size that would allow the 32-bit JVM to run the benchmark to completion without crashing.

I started by running the application with a smaller dataset:

JDK6u10rc2 32-bit
Single-threaded phase: 6298ms
Multi-threaded phase (8 threads on 8 cores): 17043ms
Used Heap after full GC: 430MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

JDK6u6p 64-bit with Compressed Oops
Single-threaded phase: 6345
Multi-threaded phase (8 threads on 8 cores): 16348
Used Heap after full GC: 500MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops

The performance numbers are similar and the memory usage of the 64-bit JVM with Compressed Oops is 16% larger.

JDK6u6p 64-bit
Single-threaded phase: 6463
Multi-threaded phase (8 threads on 8 cores): 18778
Used Heap after full GC: 700MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

The performance is again similar, but the memory usage of the 64-bit JVM is much higher, over 60% higher than the 32-bit JVM one.

Let’s try the larger dataset now:

JDK6u10rc2 32-bit
Single-threaded phase: 14188ms
Multi-threaded phase (8 threads on 8 cores): 73451ms
Used Heap after full GC: 1.25GB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

JDK6u6p 64-bit with CompressedOops
Single-threaded phase: 13742
Multi-threaded phase (8 threads on 8 cores): 76664ms
Used Heap after full GC: 1.45GB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops

The performance difference and memory overhead are the same as with the smaller dataset. The benefit of Compressed Oops here is that we still have plenty of headroom while the 32-bit JVM is getting closer to its limits. This may not be apparent from the heap size after a full GC, but during the multi-threaded phase the peak memory usage is quite a bit larger and the fact that the allocation rate is high does not help. This becomes more obvious when we look at the results for the 64-bit JVM.

JDK6u6p 64-bit
Single-threaded phase: 14610
Multi-threaded phase (8 threads on 8 cores): 104992
Used Heap after full GC: 2GB
JVM Args: -XX:MaxPermSize=256m -Xms4224m -Xmx4224m -server -XX:+UseConcMarkSweepGC

I had to increase the Xms/Xmx to 4224m for the application to run to completion. Even so, the performance of the multi-threaded phase took a substantial performance hit when compared to the other two JVM configurations. All in all, the 64-bit JVM without compressed oops does not do well here.

In conclusion, it seems that compressed oops is a feature with a lot of promise and it allows the 64-bit JVM to be competitive even in cases that favour the 32-bit JVM. It might be interesting to test applications with different characteristics to compare the results. It’s also worth mentioning that since this is a new feature, it’s possible that performance will improve further before it’s integrated into the normal JDK releases. As it is though, it already hits a sweet spot and if it weren’t for the potential for instability, I would be ready to ditch my 32-bit JVM.

Update: The early access release of JDK 6 Update 14 also contains this feature.
Update 2: This feature is enabled by default since JDK 6 Update 23.

Local variables scope in HotSpot Saturday, Oct 11 2008 

Assume the following code:

  public void foo() {
    C c = new C();
    bar(c.baz); // assume that baz does not reference c
  }

I was under the impression that HotSpot would not garbage collect c before the local variable went out of scope although it is legally allowed to do so. I even heard of issues where unnecessary garbage was retained as a result (usual workarounds are to null the variable or to inline it).

According to this bug report (Server JIT optimization can cause objects to go out of scope prematurely), the Server VM is actually able to GC the object before the local variable goes out of scope. It would be interesting to know if it can detect such cases reliably.

AOT and JVM Wednesday, Oct 8 2008 

There have been a few AOT compilers for Java for some time. Two of the better known examples are GCJ and Excelsior JET. Even though Excelsior JET also has a JIT, the main focus was on the AOT aspect.

There have been many suggestions that HotSpot should cache the JIT-generated code to improve start-up performance, avoid the need for a warm-up phase on every invocation of the application and possibly share more data between applications. This is particularly relevant for desktop applications. HotSpot engineers claim that this is a complex task and that the benefits might not be as great as expected because AOT-generated code would be slower than JIT-generated one, so they implemented a simpler solution in JDK 5, Class Data Sharing. It’s pretty limited because it only works with the Client VM and the serial garbage collector and also because it only loads a set of classes from the system jar.

Interestingly, IBM did some work in this area in the IBM JRE for Java 5 and improved it further in the one for Java 6. This is described in some detail in a developerWorks article. It’s worth reading if you’re interested in this sort of thing, but I’ll list some points that I found interesting:

  • AOT code can be stored into a shared class cache allowing another JVM to use it to reduce startup time.
  • Class compression is used to increase the amount of classes that are stored in the cache.
  • AOT code is also subject to JIT compilation if it’s invoked often in order to optimise it further.
  • AOT code executed by a JVM is copied from the shared class cache so there is no direct footprint benefit but there are memory and CPU savings from being able to reuse this code rather than repeat the compilation.
  • AOT compilation is based on heuristics that select methods that are likely to improve startup time.
  • Eclipse without any additional plugins took 3.313 to start with AOT and shared classes versus 4.204 seconds without. Larger improvements could take place if there were more plugins installed.
  • Tomcat startup time improved from 1138ms to 851ms. Using shared classes without AOT caused the time to be 950ms, which means that both AOT and shared classes contributed to the improvement.

Downloading the IBM JRE/JDK requires registration and in case you need the Windows version, you must download a bundle that includes Eclipse. Links to the various downloads can be found in the developerWorks article.

These are interesting results, and it would be interesting to find out if HotSpot would also benefit from similar enhancements.