Performance of FastMath from Commons Math Wednesday, Feb 23 2011 

Commons Math includes (in trunk and MATH_2_X branch) a FastMath class that is described as a “Faster, more accurate, portable alternative to StrictMath” and is implemented fully in Java (unlike java.lang.Math and java.lang.StrictMath). I am generally interested in faster math operations so I decided to investigate.

The first thing that I checked is if FastMath delegated to java.lang.Math or java.lang.StrictMath for any of its methods. It delegates to java.lang.Math for sqrt and random and to StrictMath for IEEEremainder. I found this interesting because I know that HotSpot includes intrinsic methods for many methods in java.lang.Math. Looking at vmSymbols.hpp, we can see that the following methods have intrinsics: abs, sin, cos, tan, atan2, sqrt, log, log10, pow, exp, min and max. Intrinsic methods are usually highly optimised (often involving assembly and sometimes CPU-specific instructions) and do not incur JNI overhead. I was interested in comparing all of FastMath’s methods with java.lang.Math and java.lang.StrictMath, but I was particularly interested in these ones.

Bill Rossi (the original author of FastMath) pointed me to a performance test in SVN, so I decided to run that (r1072209 from SVN trunk) on a Intel Core i7 CPU 860 2.80GHz using JDK 6 Update 25 early access release build 1 (includes HotSpot 20) and JDK 7 build 130 (includes HotSpot 21). I originally ran the test as it is in Commons Math with one addition but it has a few common micro-benchmarking mistakes (thanks to Aleksey for pointing them out in the comments). I fixed the issues (see here for the modified code) and ran the test again.

Since the relative results for the two JDKs did not differ much, I am only posting the charts for JDK 6 Update 25 early access release. The results for both JDKs are available in table form here. Note that the results are in milliseconds, so lower is better.

java.lang.Math does quite a bit better on tan, abs and log while FastMath does better on exp and cosh. The rest are similar.

FastMath does very well on hypot, pow, asin, acos, cbrt, sinh, tanh while java.lang.Math does well on log10 and log1p. java.lang.Math does so badly on hypot that it’s worth mentioning it again. Looks like it needs a bit of work.

As stated above, the results for JDK 7 are very similar with a few small improvements. Methods with an execution time reduction of 15% or higher were pow (202ms instead of 238ms), powII (125ms instead of 149ms) and atan (53ms instead of 66ms)

The usual disclaimers about benchmarks apply. The benchmark provided by commons-math is a good start, but I believe it should be fleshed out more so that common cases are measured separately (e.g. pow(x, 2) is much more common than the ones used in the test). Having said that, FastMath is a nice contribution and it seems like some methods in the JDK could be replaced by the ones in FastMath for better results. It also seems like FastMath could do better by delegating to the JDK for more of its methods.

Update: Some context for the contribution of FastMath to Apache Commons Math can be found here.

Update 2: I’ve updated the charts to use a modified test instead of the one in Commons Math as there were issues there. I had mentioned some suspicious scores, but thanks to Aleksey for pointing the issues out in the comments (they are common mistakes in micro-benchmarks).

Advertisements

Json serialization/deserialization faster than protocol buffers? Wednesday, Mar 25 2009 

I was reading the The Book of JOSH and saw the following statement:

“Json delivers on what XML promised. Simple to understand, effective data markup accessible and usable by human and computer alike. Serialization/Deserialization is on par with or faster then XML, Thrift and Protocol Buffers.”

That seemed a bit too definite for my taste. There are so many variables that can affect the results that I was interested in more information, so I asked for it and eventually got an answer.

I had a brief look at the benchmark referenced and that was enough to come up with some talking points. To make it easier to follow, I will just compare protocol buffers and json (jackson). I started by running the benchmark in my machine (java 1.6.0_14-ea-b03):

Object create Serialization Deserialization Serialized Size
protobuf 312.95730 3052.26500 2340.84600 217
json 182.64535 2284.88300 3362.31850 310

Ok, so json doesn’t seem to be faster on deserialization and the size is almost 50% bigger (a big deal if the network is the bottleneck as is often the case). Why is serialization of protobuf so slow though? Let’s see the code:

    public byte[] serialize(MediaContent content, ByteArrayOutputStream baos) throws IOException
    {
        content.writeTo(baos);
        return baos.toByteArray();
    }

How about we replace that with content.toByteArray()?

Object create Serialization Deserialization Serialized Size
protobuf 298.89330 2087.79800 2339.44450 217
json (jackson) 174.49190 2482.53350 3599.90800 310

That’s more like it. Let’s try something a bit more exotic just for fun and add XX:+DoEscapeAnalysis:

Object create Serialization Deserialization Serialized Size
protobuf 260.51330 1925.32300 2302.74250 217
json (jackson) 176.20370 2385.99750 3647.01700 310

That reduces some of the cost of object creation for protobuf, but it’s still substantially slower than json. This is not hard to believe because of the builder pattern employed by the Java classes generated by protocol buffers, but I haven’t investigated it in more detail. In any case, protocol buffers is better in 3 of the measures for this particular benchmark.

What does this mean? Not a lot. As usual, where performance is important, you should create benchmarks that mirror your application and environment. I just couldn’t let the blanket “json is on par with or faster than…” statement pass without a bit of scrutiny. ;)

Objects with no allocation overhead Wednesday, Dec 17 2008 

We have all heard about how HotSpot is really good at dealing with short-lived objects (both allocation and GC), but the truth is that object allocation is still pretty costly when compared to operations like addition or multiplication. Allocating an object for each step of an iteration over a large collection to make a simple computation might sound like the kind of thing no-one would ever do, but it’s actually quite common in languages like Scala (as described in a previous post). In Java-land, if you use the Function class from Google Collections with primitive wrappers, the same issue may occur. There are many JVM improvements that could help depending on the specific case (generic specialisation, value types, fixnums to name a few), but it’s unclear if/when we’ll get them.

So, what about that title? Escape analysis was introduced during Java 6, but the information gathered was only used for lock elision. However, this information can also be used for other interesting optimisations like scalar replacement and stack allocation. There have been doubts about the benefits of stack allocation (discussed in the comments) so the focus has been on scalar replacement so that the object is never in memory. At least that’s the theory.

Edward Lee started a couple of threads in the Hotspot-dev mailing list about scalar replacement here and here which reminded me to do some experiments. Note that this feature is still in development so the results posted here are preliminary and not indicative of how it will perform once it’s final. Still, it’s interesting to see how well it works at this time. I picked the latest JDK7 build (build 41) and ran a few tests with the following arguments passed to java “-XX:MaxPermSize=256m -Xms128m -Xmx3328m -server -XX:+UseConcMarkSweepGC” and either XX:-DoEscapeAnalysis or XX:+DoEscapeAnalysis.

I started by choosing the simplest test possible. Note that either testSimpleAllocation or testNoAllocation would be commented out.

class C(val a: Int, val b: Int)

object Test {
  def main(args: Array[String]) {
    for (i <- 1 to 10) testSimpleAllocation()
    //for (i <- 1 to 10) testNoAllocation()
  }
  
  def testSimpleAllocation() = {
    System.gc()
    var time = System.currentTimeMillis;
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(new C(i + 1, i + 2))
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }
  
  def testNoAllocation() = {
    System.gc()
    var time = System.currentTimeMillis;
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(i + 1, i + 2)
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }
  
  def baz(a: Int, b: Int): Int = a + b
  
  def baz(c: C): Int = c.a + c.b
}

The result were:


testNoAllocation: 403
testSimpleAllocation with EA: 1006
testSimpleAllocation without EA: 9190

As we can see, escape analysis has a tremendous effect and the method finishes in 11% of the time taken with it disabled. However, the version with no allocation is still substantially faster.

I decided to test a foreach method that takes a Function object next (this time in Java because it does less magic behind the scenes):

package test;

public class EscapeAnalysis {
  
  interface Function<T, R> {
    R apply(T value);
  }
  
  interface IntProcedure {
    void apply(int value);
  }
  
  static class BoxedArray {
    private final int[] array;
    
    public BoxedArray(int length) {
      array = new int[length];
    }
    
    public int length() {
      return array.length;
    }
    
    public void foreach(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(new Integer(i));
    }
    
    public void foreach(IntFunction function) {
      for (int i : array)
        function.apply(i);
    }

    public void set(int index, int value) {
      array[index] = value;
    }

    public void foreachWithAutoboxing(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(i);
    }
  }
  
  public static void main(String[] args) {
    BoxedArray array = new BoxedArray(100000000);
    /* We are careful not to restrict our ints to the ones in the Integer.valueOf cache */
    for (int i = 0; i < array.length(); i++)
      array.set(i, i);
    
    for (int i = 0; i < 10; i++)
      test(array);
  }

  private static void test(BoxedArray array) {
    System.gc();
    long time = System.currentTimeMillis();
    final int[] sum = new int[] { 0 };
    
    /* Uncomment the one that should be executed */
    testIntegerForeach(array, sum);
//    testIntegerForeachWithAutoboxing(array, sum);
//    testIntForeach(array, sum);

    System.out.println(System.currentTimeMillis() - time);
    System.out.println(sum[0]);
  }
  
  private static void testIntegerForeachWithAutoboxing(BoxedArray array, final int[] sum) {
    array.foreachWithAutoboxing(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }
  
  private static void testIntegerForeach(BoxedArray array, final int[] sum) {
    array.foreach(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }

  private static void testIntForeach(BoxedArray array, final int[] sum) {
    array.foreach(new IntFunction() {
      public void apply(int value) {
        sum[0] += value;
      }
    });
  }
}

The results were:


testIntForeach: 130
testIntegerForeachWithAutoboxing with EA: 1064
testIntegerForeach with EA: 224
testIntegerForeachWithAutoboxing without EA: 1039
testIntegerForeach without EA: 1024

This test shows something interesting, EA gives no improvement if Integer.valueOf (called by auto-boxing) is used instead of new Integer. Apart from that, the results are somewhat similar to the first ones (EA provides a substantial boost, but not enough to match the specialised implementation). After quickly testing that the boxing methods in ScalaRunTime had the same effect as Integer.valueOf, I decided that it was not worth testing more complex scenarios.

It seems like there’s a lot of potential for scalar replacement, but HotSpot needs to do a better job at detecting cases where it can be used safely. If nothing else, at least knowledge of the valueOf methods should be hardcoded into the system. I hope that a more general solution is found though because other languages on the JVM may use different methods (as mentioned earlier Scala uses methods in ScalaRunTime instead). It will also be interesting to see if the performance can get even closer to the no allocation case. Since the feature is still in development, we can hope. :)

Update: The early access release of JDK 6 Update 14 also contains this feature.
Update 2: This feature is enabled by default since JDK 6 Update 23.

JDK6u12, Java 7 and Devoxx Saturday, Dec 13 2008 

With JavaFX 1.0 out of the way, it seems like the Java 7 train has started moving again.

Concurrency plans posted by Doug Lea. Two points of interest are that the parallel collections API is probably not going to get included (but the core fork-join framework will) and that a low-level fences API might get included. Without closures, it’s hard to include the parallel collections API with good performance for primitives while avoiding interface explosion. The positive aspect is that it will continue to be developed outside the JDK giving it more flexibility in terms of API evolution.

Devoxx summary by Danny Coward containing various links covering JavaFX, JDK7 and Languages in the JVM.

Summary of Java 7 update given by Mark Reinhold written by Hamlet D’Arcy. Small language changes and Project Jigsaw had already been announced and many of the items described there had been known for a while, but some new ones are Null dereference expressions, XRender pipeline for Java 2D and the chance of a MVM-lite. The XRender pipeline is interesting because it was developed by Clemens Eisserer as part of the OpenJDK challenge and would be a nice example of an external contribution to OpenJDK.

One thing that is clear from the various announcements is that Sun intends to use OpenJDK as the place where work for Java 7 will take place (even if JSRs are also planned). Some people expressed concern that this is bad because it circumvents the JCP and that OpenJDK is a project controlled by Sun since the vast majority of the committers are from Sun. Given how opaque the JCP is, I don’t see how OpenJDK could be worse, but my hope is that Sun will run these projects transparently as promised where the end-result is diversification of committers due to more external participation.

Finally, for the people who still use the proprietary JDK from Sun, there’s an early access release of JDK6u12 that includes a 64-bit Java Plugin.