New JVM options and Scala iteration performance Monday, Oct 26 2009 

Internal iterators like foreach tend to do very well in micro-benchmarks on the JVM. In fact, they often do as well as the equivalent manual while or for loop. There is a catch, however, and it’s easy to miss it in micro-benchmarks.
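To make the comparison concrete, here is a minimal sketch of the two styles (hypothetical names, not taken from the benchmarks discussed below): an internal iterator that hands each element to a callback, and the equivalent manual while loop.

```java
public class IterationStyles {
    // Callback used by the internal iterator (similar in spirit to Scala's Function1).
    interface IntProcedure {
        void apply(int value);
    }

    // Internal iteration: the collection drives the loop and calls back per element.
    static void foreach(int[] array, IntProcedure f) {
        for (int i : array) f.apply(i);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};

        // Internal iterator: sum via a callback object.
        final int[] sum1 = {0};
        foreach(data, new IntProcedure() {
            public void apply(int value) { sum1[0] += value; }
        });

        // Equivalent manual while loop: direct indexing, no callback.
        int sum2 = 0;
        int i = 0;
        while (i < data.length) {
            sum2 += data[i];
            i += 1;
        }

        System.out.println(sum1[0] + " " + sum2);
    }
}
```

In a micro-benchmark the JIT often inlines the callback and the two perform identically; the catch described below is about what happens when it can't.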

A few months ago, David MacIver forwarded an email from Martin Odersky to the scala-internals mailing list about this. The problem as described by Martin:

“The reason seems to be that the foreach itself is not inlined, so the call to the closure becomes megamorphic and therefore slow.”

I was curious about the benchmarks used to measure this, and Tiark Rompf (who had performed the measurements) provided the source code. I said I’d try to take a look the following weekend and I did… more than three months later. Well, better late than never. ;)

There are 3 benchmarks:

  • Benchmark A: traverse a collection with N elements
  • Benchmark B: traverse a collection with N elements, inside the loop/closure traverse another collection with N2 elements 3 times
  • Benchmark C: build a collection (front to back) of N elements

Various approaches are used for each benchmark with various collection types. One weakness of these benchmarks is that they don’t include a verification mechanism to ensure that all approaches produce the same result. However, they include code that tries to prevent the JIT from performing unfair optimisations (e.g. printing something if an element in the collection matches a certain condition).
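One common way to add such a verification mechanism (a sketch of the general technique, not code from these benchmarks) is to fold each variant's result into a checksum and cross-check the variants against each other; consuming the result also stops the JIT from discarding the work as dead code:

```java
public class Verify {
    // Two implementations that should agree; each folds its work into a
    // checksum so the JIT cannot prove the loop body is dead and remove it.
    static long sumIndexed(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    static long sumForeach(int[] a) {
        long s = 0;
        for (int v : a) s += v;
        return s;
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        long r1 = sumIndexed(data);
        long r2 = sumForeach(data);
        // Cross-check: every variant must produce the same checksum.
        if (r1 != r2) throw new AssertionError(r1 + " != " + r2);
        System.out.println(r1);
    }
}
```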

The original results produced by Tiark can be found here.

I added a few benchmarks that use plain arrays (RawArrayIndexed, RawArrayForeach, RawArrayForeachMega), made minor changes to the scripts and pushed the code to a GitHub repository. I left the rest of the Scala code as it was to make it easy to compare with the original results and ran the benchmark with various JVM settings to see what effect they would have. All the tests shared the following:

  • Dual quad-core Xeon E5462 2.80GHz
  • 14 GB RAM
  • Fedora 11
  • Scala 2.8.0.r19261
  • Revision 59521431f5c118b73e35b0b396e3efd6aecec3dd of project
  • 64-bit JDK 6 Update 18 early access b03
  • JVM base settings: -Xms1G -Xmx1G -XX:+UseParallelGC -XX:+UseParallelOldGC

JDK 6 Update 18 is scheduled for release in Q4 2009 and includes HotSpot 16. Even though JDK 6 Update 14 (HotSpot 14) introduced compressed references and scalar replacement, HotSpot 16 includes improved compressed references and many crucial fixes to both features. According to my testing, these features are now approaching production-level stability, and the OpenJDK engineers seem to agree: both are enabled by default in HotSpot 17 (which will eventually reach JDK 6 too).

Interested in how these features would affect the performance in these benchmarks, I ran them with various combinations. I also added Scala’s compiler -optimise flag in some cases.

The original benchmark from Tiark used three collection types: array (java.util.ArrayList), list (scala.List, an immutable singly linked list) and vector (an earlier version of the immutable vector that has recently been added to Scala 2.8). I added JVM arrays, shown as “rawarray” in the charts. Finally, we get to the actual numbers.

Benchmark A

[Chart: Benchmark A results (click on chart for expanded version)]

There are some interesting data points here:

  1. Compressed references are a _huge_ win. RawArrayIndexed went from 500ms to 142ms and many of the vector operations were much faster.
  2. Escape analysis (which enables scalar replacement) doesn’t seem to have much of an effect.
  3. scalac -optimise doesn’t seem to have much of an effect.
  4. foreach is misleadingly fast in micro-benchmarks, but it’s easy to bring it back down to earth. RawArrayForeach performs similarly to RawArrayIndexed, but RawArrayForeachMega is 10 times slower. The latter simply calls foreach with a few different anonymous functions during the collection creation phase, causing the call site to become megamorphic. Once that happens, the only hope for good performance is for foreach itself to be inlined, and that doesn’t seem to happen here. With this in mind, ticket 1338 (Optimize simple for loops) seems like a good idea.
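A minimal illustration of how a call site goes megamorphic (hypothetical names, mirroring what RawArrayForeachMega does): a single foreach-style method is fed several distinct function classes, so the callback invocation inside it sees many receiver types:

```java
public class Megamorphic {
    interface IntFunction {
        int apply(int value);
    }

    // One shared loop body: after it has seen several IntFunction classes,
    // the f.apply call site is megamorphic and cannot be devirtualised
    // unless sum() itself is inlined into each (monomorphic) caller.
    static int sum(int[] a, IntFunction f) {
        int s = 0;
        for (int v : a) s += f.apply(v);
        return s;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        // Three distinct anonymous classes pollute the same call site.
        int r = sum(data, new IntFunction() { public int apply(int v) { return v; } })
              + sum(data, new IntFunction() { public int apply(int v) { return 2 * v; } })
              + sum(data, new IntFunction() { public int apply(int v) { return v * v; } });
        System.out.println(r);
    }
}
```

With a single function class the JIT can inline `apply` into the loop; after the third class it typically falls back to a true virtual dispatch per element.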
Benchmark B

[Chart: Benchmark B results (click on chart for expanded version)]

Once again, compressed references are a large factor in some benchmarks (halving the time taken in some cases).

The new piece of information is that scalac -optimise causes a huge improvement in VectorForeachFast and VectorForeachFastProtect. This makes sense in light of one of the findings from the previous benchmark: inlining of foreach is of extreme importance once a call site is megamorphic, and that is precisely what -optimise does in this case (the JVM fails to do so at runtime otherwise). Sadly, -optimise cannot do this safely in many cases, as shown by the results for VectorForeach.

Benchmark C

[Chart: Benchmark C results (click on chart for expanded version)]

Once again, compressed references provide a nice boost. This option seems like a winner on 64-bit JVMs (as long as you don’t need a heap larger than 32GB): it saves memory and improves performance. The usual disclaimer applies, though: benchmark your own application instead of relying on micro-benchmarks when deciding which JVM options to use.

The complete results are also available. Feel free to play with the source code and provide your own numbers, fixes and/or improvements.

Configgy for Scala 2.8 Sunday, Oct 25 2009 

I wanted to use Configgy with Scala 2.8, so I created a branch in my GitHub repository:

http://github.com/ijuma/configgy/tree/scala-2.8

The main code now compiles, but the tests and the build system still need to be updated.

I’m not particularly familiar with Ivy, so if someone else wants to do that, I won’t complain. ;)

The tests use specs and Eric has already released a binary for 2.8.0.

Update: I filed an issue upstream to track 2.8 support.

Update 2: The branch has been updated to Scala 2.8.0.Beta1-RC3 along with the build system, and the tests pass. Thanks to paulp for tracking down an issue caused by a change in stack order in Scala 2.8.0.

Scala 2.7.4-rc1 released Thursday, Apr 16 2009 

Scala 2.7.4-rc1 has been released. It incorporates a few important fixes related to serialization of anonymous functions and generation of Java generics, but the big change is the improved Scala IDE for Eclipse plugin. This means that my backport is now obsolete.

The announcement contains all the juicy information.

Json serialization/deserialization faster than protocol buffers? Wednesday, Mar 25 2009 

I was reading The Book of JOSH and saw the following statement:

“Json delivers on what XML promised. Simple to understand, effective data markup accessible and usable by human and computer alike. Serialization/Deserialization is on par with or faster then XML, Thrift and Protocol Buffers.”

That seemed a bit too definite for my taste. There are so many variables that can affect the results that I was interested in more information, so I asked for it and eventually got an answer.

I had a brief look at the benchmark referenced and that was enough to come up with some talking points. To make it easier to follow, I will just compare protocol buffers and json (jackson). I started by running the benchmark in my machine (java 1.6.0_14-ea-b03):

                 Object create   Serialization   Deserialization   Serialized Size
protobuf             312.95730      3052.26500        2340.84600               217
json                 182.64535      2284.88300        3362.31850               310

OK, so json doesn’t seem to be faster on deserialization, and the serialized size is almost 50% bigger (a big deal if the network is the bottleneck, as is often the case). Why is protobuf serialization so slow, though? Let’s look at the code:

    public byte[] serialize(MediaContent content, ByteArrayOutputStream baos) throws IOException
    {
        content.writeTo(baos);
        return baos.toByteArray();
    }

How about we replace that with content.toByteArray()?
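The difference matters because the ByteArrayOutputStream route copies the bytes twice: once into the stream’s internal buffer and once more in its toByteArray(), which returns a defensive copy, whereas the message’s own toByteArray() serialises straight into a right-sized array. The extra copy is easy to demonstrate with plain bytes (a generic sketch, no protobuf involved):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class ExtraCopy {
    public static void main(String[] args) {
        byte[] message = {1, 2, 3, 4, 5};

        // Indirect route: write into a growable buffer, then copy it out.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(message, 0, message.length);   // copy #1: into the stream's buffer
        byte[] viaStream = baos.toByteArray();    // copy #2: defensive copy out

        // toByteArray returns a fresh array every call: two calls, two arrays.
        System.out.println(viaStream == baos.toByteArray());
        System.out.println(Arrays.equals(viaStream, message));
    }
}
```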

                 Object create   Serialization   Deserialization   Serialized Size
protobuf             298.89330      2087.79800        2339.44450               217
json (jackson)       174.49190      2482.53350        3599.90800               310

That’s more like it. Let’s try something a bit more exotic just for fun and add -XX:+DoEscapeAnalysis:

                 Object create   Serialization   Deserialization   Serialized Size
protobuf             260.51330      1925.32300        2302.74250               217
json (jackson)       176.20370      2385.99750        3647.01700               310

That reduces some of the object creation cost for protobuf, but creation is still substantially slower than for json. This is not hard to believe given the builder pattern employed by the Java classes generated by protocol buffers, but I haven’t investigated in more detail. In any case, protocol buffers come out ahead in three of the four measures in this particular benchmark.

What does this mean? Not a lot. As usual, where performance is important, you should create benchmarks that mirror your application and environment. I just couldn’t let the blanket “json is on par with or faster than…” statement pass without a bit of scrutiny. ;)

Scala plugin for Eclipse snapshot r17139 Wednesday, Feb 18 2009 

The snapshot of the Scala plugin for Eclipse that I posted a few days ago had a couple of annoying bugs. Here’s a new build (note that it’s an Eclipse update site, so there’s no point in trying to access it with a browser) from revision 17139 of the trunk of the plugin, with the compiler and standard libraries from the 2.7.x branch.

Miles has been busy once again and fixes included since the last build are 1717, 1233, 1636, 1662, 1716, 1627, 1534, 1410, 1516, 1483 and 1253.

Update: New build published that includes a fix for ticket 1741. Uninstalling and then reinstalling may be required if p2 starts giving cryptic errors (it doesn’t seem to handle the case where the update site is replaced well). Sorry about that.

Update 2: With the release of Scala 2.7.4-rc1, this backport is obsolete.

Scala plugin snapshot Saturday, Feb 14 2009 

Miles Sabin has recently committed the Equinox Aspects-based approach to JDT integration for the Scala plugin to trunk. This approach is similar to the one used by the AJDT as described by Andrew Eisenberg.

Soon after that commit, Miles embarked on a bug-killing mission. Nightlies are available from the usual place, but they rely on the trunk of the compiler and standard library. With APIs still in flux, it would be nice to be able to use the improved plugin with the compiler and standard library from the 2.7.x branch (this includes a few important fixes added after the 2.7.3 release).

Since it seemed like no-one else had time to do that in the near future, I did it myself. It’s available here in case it’s useful for anyone else. In the interest of full disclosure, I applied the following patch to the plugin:


Index: META-INF/MANIFEST.MF
===================================================================
--- META-INF/MANIFEST.MF (revision 17110)
+++ META-INF/MANIFEST.MF (working copy)
@@ -34,8 +34,8 @@
org.eclipse.ui.navigator.resources,
org.eclipse.ui.views,
org.eclipse.ui.workbench.texteditor,
- scala.library,
- scala.tools.nsc
+ scala.library;bundle-version="[2.7.3,2.7.4)",
+ scala.tools.nsc;bundle-version="[2.7.3,2.7.4)"
Import-Package:
org.eclipse.contribution.jdt.cuprovider,
org.eclipse.contribution.jdt.imagedescriptor,
Index: src/scala/tools/editor/Typers.scala
===================================================================
--- src/scala/tools/editor/Typers.scala (revision 17110)
+++ src/scala/tools/editor/Typers.scala (working copy)
@@ -6,7 +6,6 @@

package scala.tools.editor

-import scala.annotation.unchecked.uncheckedStable
import scala.collection.jcl._

import nsc.{util,io}
Index: src/scala/tools/editor/TypersPresentations.scala
===================================================================
--- src/scala/tools/editor/TypersPresentations.scala (revision 17110)
+++ src/scala/tools/editor/TypersPresentations.scala (working copy)
@@ -6,7 +6,6 @@

package scala.tools.editor

-import scala.annotation.unchecked.uncheckedStable
import scala.collection.jcl.{LinkedHashMap,LinkedHashSet}
import scala.collection.mutable.ListBuffer

@@ -595,10 +594,13 @@
val str = name.decode
val key = if (sym.isMethod) str.trim+header(sym) else str.trim
val last = str.last
+
+ import org.eclipse.jdt.core.search.SearchPattern

// TODO: check accessibility.
- if (name.isTypeName == isType && str.toLowerCase.startsWith(leading.toLowerCase) &&
- (str.indexOf('$') == -1) && last != ' ' && !contains(key)) {
+ if (name.isTypeName == isType && (str.indexOf('$') == -1) && last != ' ' &&
+ !contains(key) && (str.toLowerCase.startsWith(leading.toLowerCase) ||
+ SearchPattern.camelCaseMatch(leading, str))) {
val sym0 = if (verify==null) sym else verify(sym)
if (sym0 != compiler.NoSymbol) {
val high = if (pt != null) sym0.tpe <:< pt

It includes my simple camel-case content assist patch, some version restrictions and the removal of the imports for the uncheckedStable annotation since it was moved outside of the scala package in 2.8.0. The compiler and standard library are the latest version from the 2.7.x branch. I’ve done some light testing and the plugin works for me, but the usual disclaimers apply. Be prepared to have to uninstall and reinstall the plugin if things don’t work out.

Update: I noticed that two annoying bugs were introduced on Friday. I would recommend not using the current build. I will update it once they are fixed.

Update 2: See here for a newer build with fixes for the mentioned issues.

Update 3: With the release of Scala 2.7.4-rc1, this backport is obsolete.

Code reuse by copy and paste Friday, Feb 13 2009 

I was investigating what would need to be done to implement camel-case code assist for the Scala plugin when I ran across the following gem in CharOperation (Eclipse JDT):

public static final boolean camelCaseMatch(char[] pattern, int patternStart, int patternEnd, char[] name, int nameStart, int nameEnd, boolean samePartCount) {

	/* !!!!!!!!!! WARNING !!!!!!!!!!
	 * The content of this method has been fully copied to
	 * SearchPattern#camelCaseMatch(String, int, int, String, int, int, boolean).
	 * 
	 * So, if current method is modified, do NOT forget to copy again its content
	 * to SearchPattern corresponding method!
	 */

And in SearchPattern:

public static final boolean camelCaseMatch(String pattern, int patternStart, int patternEnd, String name, int nameStart, int nameEnd, boolean samePartCount) {

	/* !!!!!!!!!! WARNING !!!!!!!!!!
	 * The algorithm of this method has been entirely copied from
	 * CharOperation#camelCaseMatch(char[], int, int, char[], int, int, boolean).
	 * Array lengths have been replaced with call to {@link String#length()} and
	 * array direct access have been replaced with call to {@link String#charAt(int)}.
	 * 
	 * So, do NOT modify this method directly to fix any bug but modify first the
	 * corresponding CharOperation method and do the copy again to be sure that
	 * these two methods are kept synchronized.
	 */

I am sure there is a reason why this was done, but how ugly…
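For what it’s worth, the matching itself is compact enough that the duplication is a shame. Here is a simplified sketch of camel-case matching, written for this post rather than taken from JDT: each pattern character must match, in order, either the next name character or the start of a later camel hump.

```java
public class CamelCase {
    // Simplified camel-case match (not the full JDT algorithm): "NPE"
    // matches "NullPointerException" because N, P and E line up with
    // successive camel humps of the name.
    static boolean match(String pattern, String name) {
        return match(pattern, 0, name, 0);
    }

    private static boolean match(String p, int pi, String n, int ni) {
        if (pi == p.length()) return true;   // pattern exhausted: match
        if (ni == n.length()) return false;  // name exhausted first: no match
        if (p.charAt(pi) == n.charAt(ni) && match(p, pi + 1, n, ni + 1))
            return true;
        // Otherwise skip ahead to the next hump (upper-case char) in the name.
        for (int j = ni + 1; j < n.length(); j++)
            if (Character.isUpperCase(n.charAt(j)))
                return match(p, pi, n, j);
        return false;
    }

    public static void main(String[] args) {
        System.out.println(match("NPE", "NullPointerException"));
        System.out.println(match("NuPoE", "NullPointerException"));
        System.out.println(match("NPE", "NoProblem"));
    }
}
```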

Update: At least JDT made it easy to implement what I wanted.

Objects with no allocation overhead Wednesday, Dec 17 2008 

We have all heard about how HotSpot is really good at dealing with short-lived objects (both allocation and GC), but the truth is that object allocation is still pretty costly when compared to operations like addition or multiplication. Allocating an object for each step of an iteration over a large collection to make a simple computation might sound like the kind of thing no-one would ever do, but it’s actually quite common in languages like Scala (as described in a previous post). In Java-land, if you use the Function class from Google Collections with primitive wrappers, the same issue may occur. There are many JVM improvements that could help depending on the specific case (generic specialisation, value types, fixnums to name a few), but it’s unclear if/when we’ll get them.

So, what about that title? Escape analysis was introduced during Java 6, but the information gathered was only used for lock elision. However, this information can also be used for other interesting optimisations like scalar replacement and stack allocation. There have been doubts about the benefits of stack allocation (discussed in the comments) so the focus has been on scalar replacement so that the object is never in memory. At least that’s the theory.

Edward Lee started a couple of threads in the hotspot-dev mailing list about scalar replacement here and here, which reminded me to do some experiments. Note that this feature is still in development, so the results posted here are preliminary and not indicative of how it will perform once it’s final. Still, it’s interesting to see how well it works at this time. I picked the latest JDK7 build (build 41) and ran a few tests with the following arguments passed to java: “-XX:MaxPermSize=256m -Xms128m -Xmx3328m -server -XX:+UseConcMarkSweepGC” and either -XX:-DoEscapeAnalysis or -XX:+DoEscapeAnalysis.

I started by choosing the simplest test possible. Note that either testSimpleAllocation or testNoAllocation would be commented out.

class C(val a: Int, val b: Int)

object Test {
  def main(args: Array[String]) {
    for (i <- 1 to 10) testSimpleAllocation()
    //for (i <- 1 to 10) testNoAllocation()
  }
  
  def testSimpleAllocation() = {
    System.gc()
    var time = System.currentTimeMillis;
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(new C(i + 1, i + 2))
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }
  
  def testNoAllocation() = {
    System.gc()
    var time = System.currentTimeMillis;
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(i + 1, i + 2)
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }
  
  def baz(a: Int, b: Int): Int = a + b
  
  def baz(c: C): Int = c.a + c.b
}

The results were:


testNoAllocation: 403
testSimpleAllocation with EA: 1006
testSimpleAllocation without EA: 9190

As we can see, escape analysis has a tremendous effect and the method finishes in 11% of the time taken with it disabled. However, the version with no allocation is still substantially faster.

I decided to test a foreach method that takes a Function object next (this time in Java because it does less magic behind the scenes):

package test;

public class EscapeAnalysis {
  
  interface Function<T, R> {
    R apply(T value);
  }
  
  interface IntFunction {
    void apply(int value);
  }
  
  static class BoxedArray {
    private final int[] array;
    
    public BoxedArray(int length) {
      array = new int[length];
    }
    
    public int length() {
      return array.length;
    }
    
    public void foreach(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(new Integer(i));
    }
    
    public void foreach(IntFunction function) {
      for (int i : array)
        function.apply(i);
    }

    public void set(int index, int value) {
      array[index] = value;
    }

    public void foreachWithAutoboxing(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(i);
    }
  }
  
  public static void main(String[] args) {
    BoxedArray array = new BoxedArray(100000000);
    /* We are careful not to restrict our ints to the ones in the Integer.valueOf cache */
    for (int i = 0; i < array.length(); i++)
      array.set(i, i);
    
    for (int i = 0; i < 10; i++)
      test(array);
  }

  private static void test(BoxedArray array) {
    System.gc();
    long time = System.currentTimeMillis();
    final int[] sum = new int[] { 0 };
    
    /* Uncomment the one that should be executed */
    testIntegerForeach(array, sum);
//    testIntegerForeachWithAutoboxing(array, sum);
//    testIntForeach(array, sum);

    System.out.println(System.currentTimeMillis() - time);
    System.out.println(sum[0]);
  }
  
  private static void testIntegerForeachWithAutoboxing(BoxedArray array, final int[] sum) {
    array.foreachWithAutoboxing(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }
  
  private static void testIntegerForeach(BoxedArray array, final int[] sum) {
    array.foreach(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }

  private static void testIntForeach(BoxedArray array, final int[] sum) {
    array.foreach(new IntFunction() {
      public void apply(int value) {
        sum[0] += value;
      }
    });
  }
}

The results were:


testIntForeach: 130
testIntegerForeachWithAutoboxing with EA: 1064
testIntegerForeach with EA: 224
testIntegerForeachWithAutoboxing without EA: 1039
testIntegerForeach without EA: 1024

This test shows something interesting: EA gives no improvement if Integer.valueOf (called by auto-boxing) is used instead of new Integer. Apart from that, the results are broadly similar to the first ones (EA provides a substantial boost, but not enough to match the specialised implementation). After quickly verifying that the boxing methods in ScalaRunTime had the same effect as Integer.valueOf, I decided it was not worth testing more complex scenarios.
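The valueOf connection makes sense: auto-boxing compiles to a call to Integer.valueOf, which for small values returns a long-lived object from a shared cache, so there is no fresh, non-escaping allocation for scalar replacement to eliminate. The cache is easy to observe:

```java
public class ValueOfCache {
    public static void main(String[] args) {
        // Auto-boxing goes through Integer.valueOf, which serves small values
        // (-128..127 by default) from a shared cache rather than allocating.
        Integer a = 127, b = 127;
        System.out.println(a == b);   // same cached object

        Integer c = 128, d = 128;
        System.out.println(c == d);   // outside the cache: distinct objects

        // new Integer always allocates a fresh object, which is exactly the
        // kind of allocation escape analysis can elide.
        System.out.println(new Integer(127) == new Integer(127));
    }
}
```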

It seems like there’s a lot of potential for scalar replacement, but HotSpot needs to do a better job at detecting cases where it can be used safely. If nothing else, at least knowledge of the valueOf methods should be hardcoded into the system. I hope that a more general solution is found though because other languages on the JVM may use different methods (as mentioned earlier Scala uses methods in ScalaRunTime instead). It will also be interesting to see if the performance can get even closer to the no allocation case. Since the feature is still in development, we can hope. :)

Update: The early access release of JDK 6 Update 14 also contains this feature.
Update 2: This feature is enabled by default since JDK 6 Update 23.

Not just for academics? Friday, Dec 5 2008 

At least ClariFI seems to think so. Paul Chiusano announced in the Scala mailing list that ClariFI is looking to hire developers “with a strong background in functional programming to do a mixture of Scala and Java programming”. The mailing list thread contains more details including a link to the official job posting.

It will be interesting to see if posts like that will become a common occurrence in the future.

Scala in the JVM Languages Summit Tuesday, Sep 30 2008 

Since I’ve seen no mention of this in Planet Scala (although it was discussed in #scala), here are some Scala-related links from the JVM Languages summit:

  • Optimizing Higher-Order Functions in Scala by Iulian Dragos. I haven’t tried the various optimisation flags in scalac yet, but I am curious if they will have any effect in the primitive collections with scala benchmark from a previous blog entry. These flags have been a bit buggy in the past, but hopefully recent fixes have solved the problems.
  • Scalify: Automatic translation from Java to Scala by Paul Phillips. This tool looks very interesting because the approach used is a lot more sophisticated than jatran. I have been meaning to take a look at it, but all of my code uses generics and it doesn’t support that yet.
  • Cliff Click’s post-Summit blog entry. It includes a link to the slides for his talk, “Fast Bytecodes for Funny Languages”. Scala is one of the languages covered (although very briefly) with the results coming from the translation of a simple Java micro-benchmark to Scala available here (in the comments). As expected, Scala can generate bytecode that is pretty much the same as Java if low-level constructs are used, but the performance suffers quite a bit if idiomatic Scala is used while dealing with primitives (at least with the default compilation flags).

For more general coverage of the summit:

Ricky Clarkson mentioned on IRC that videos of the talks will be made available on InfoQ and Google Video.

Update: John Rose’s post-summit blog entry is a nice round-up.
