New JVM options and Scala iteration performance Monday, Oct 26 2009 

Internal iterators like foreach tend to do very well in micro-benchmarks on the JVM. In fact, they often do as well as the equivalent manual while or for loop. There is a catch, however, and it’s easy to miss it in micro-benchmarks.

A few months ago, David MacIver forwarded an email from Martin Odersky to the scala-internals mailing list about this. The problem as described by Martin:

“The reason seems to be that the foreach itself is not inlined, so the call to the closure becomes megamorphic and therefore slow.”

I was curious about the benchmarks used to measure this and Tiark Rompf (who had performed the measurements) provided the source code. I said I’d try to take a look the next weekend and I did it… more than 3 months later. Well, better late than never. ;)

There are 3 benchmarks:

  • Benchmark A: traverse a collection with N elements
  • Benchmark B: traverse a collection with N elements, inside the loop/closure traverse another collection with N2 elements 3 times
  • Benchmark C: build a collection (front to back) of N elements

Various approaches are used for each benchmark with various collections types. One weakness of the these benchmarks is that they don’t include a verification mechanism to ensure that all benchmarks produce the same result. However, they include code that tries to prevent the JIT from performing unfair optimisations (e.g. print something if an element in the collection matches a certain condition).

The original results produced by Tiark can be found here.

I added a few benchmarks that use plain arrays (RawArrayIndexed, RawArrayForeach, RawArrayForeachMega), made minor changes to the scripts and pushed the code to a GitHub repository. I left the rest of the Scala code as it was to make it easy to compare with the original results and ran the benchmark with various JVM settings to see what effect they would have. All the tests shared the following:

  • Dual quad-core Xeon E5462 2.80GHz
  • 14 GB RAM
  • Fedora 11
  • Scala 2.8.0.r19261
  • Revision 59521431f5c118b73e35b0b396e3efd6aecec3dd of project
  • 64-bit JDK 6 Update 18 early access b03
  • JVM base settings: -Xms1G -Xmx1G -XX:+UseParallelGC -XX:+UseParallelOldGC

JDK 6 Update 18 is scheduled to be released on Q4, 2009 and it includes HotSpot 16. Even though JDK 6 Update 14 (HotSpot 14) introduced compressed references and scalar replacement, HotSpot 16 includes improved compressed references and many crucial fixes to both features. According to my testing these features are now approaching production-level stability and the OpenJDK engineers seem to agree as they are both enabled by default in HotSpot 17 (which will eventually hit JDK6 too).

Interested in how these features would affect the performance in these benchmarks, I ran them with various combinations. I also added Scala’s compiler -optimise flag in some cases.

The original benchmark from Tiark used 3 collection types: array (java.util.ArrayList), list (scala.List, immutable single linked list) and vector (earlier version of immutable vector that has recently been added to Scala 2.8). I added JVM arrays and they are shown as “rawarray” in the charts. Finally, we get to the actual numbers.

Benchmark A

Click on chart for expanded version

There are some interesting data points here:

  1. Compressed references is a _huge_ win. RawArrayIndexed went from 500ms to 142ms and many of the vector operations were much faster.
  2. Escape analysis (which enables scalar replacement) doesn’t seem to have much of an effect.
  3. scalac -optimise doesn’t seem to have much of an effect.
  4. foreach is misleadingly fast in micro-benchmarks, but it’s easy to bring it down to earth. RawArrayForeach performs similarly to RawArrayIndexed, but RawArrayForeachMega is 10 times slower. The latter simply calls foreach with a few different anonymous functions during the collection creation phase causing the call site to become megamorphic. Once this happens, the only hope for good performance is that the foreach method gets inlined and it doesn’t seem to happen here. With this in mind, it seems like ticket 1338 (Optimize simple for loops) is a good idea.
Benchmark B

Click on chart for expanded version

Once again, compressed references are a large factor in some benchmarks (halving the time taken in some cases).

The new bit of information is that scalac -optimise causes a huge improvement in VectorForeachFast and VectorForeachFastProtect. This makes sense once one considers one of the findings from the previous benchmark. We said that inlining of foreach is of extreme importance once a call site is megamorphic and this is precisely what -optimise does in this case (and the JVM fails to do so at runtime otherwise). Sadly, -optimise cannot do this safely in many cases as it’s shown by the results for VectorForeach.

Benchmark C

Click on chart for expanded version

Once again, compressed references provide a nice boost. Seems like this option is a winner in 64-bit JVMs (if you don’t need a heap larger than 32GB), it saves memory and gives better performance. The usual disclaimer applies though, you should benchmark your own application instead of relying on micro-benchmarks when deciding what JVM options to use.

The complete results are also available. Feel free to play with the source code and provide your own numbers, fixes and/or improvements.

Advertisement

Configgy for Scala 2.8 Sunday, Oct 25 2009 

I wanted to use Configgy with Scala 2.8, so I created a branch in my GitHub repository:

http://github.com/ijuma/configgy/tree/scala-2.8

The main code now compiles, but the tests and the build system have to be updated still.

I’m not particularly familiar with Ivy, so if someone else wants to do that, I won’t complain. ;)

The tests use specs and Eric has already released a binary for 2.8.0.

Update: I filed an issue upstream to track 2.8 support.

Update 2: The branch has been updated to Scala 2.8.0.Beta1-RC3 along with the build system and the tests pass. Thanks to paulp for tracking an issue that was taking place because the stack order changed in Scala 2.8.0.

Scala 2.7.4-rc1 released Thursday, Apr 16 2009 

Scala 2.7.4-rc1 has been released. It incorporates a few important fixes related to serialization of anonymous functions and generation of Java generics, but the big change is the improved Scala plugin Scala IDE for Eclipse. This means that my backport is now obsolete.

The announcement contains all the juicy information.

Scala plugin for Eclipse snapshot r17139 Wednesday, Feb 18 2009 

The snapshot of the Scala plugin for Eclipse that I posted a few days ago had a couple of annoying bugs. Here’s a new one build (note that it’s an Eclipse update site, so no point in trying to access it with a browser) from revision 17139 of the trunk of the plugin with the compiler and standard libraries from the 2.7.x branch.

Miles has been busy once again and fixes included since the last build are 1717, 1233, 1636, 1662, 1716, 1627, 1534, 1410, 1516, 1483 and 1253.

Update: New build published that includes fix for ticket 1741. Uninstallation followed by installation may be required if p2 starts giving cryptic errors (it seems like it doesn’t handle well the case where the update site is replaced). Sorry about that.

Update 2: With the release of Scala 2.7.4-rc1, this backport is obsolete.

Scala plugin snapshot Saturday, Feb 14 2009 

Miles Sabin has recently committed the Equinox Aspects-based approach to JDT integration for the Scala plugin to trunk. This approach is similar to the one used by the AJDT as described by Andrew Eisenberg.

Soon after that commit, Miles embarked on a bug-killing mission. Nightlies are available from the usual place, but they rely on the trunk of the compiler and standard library. With APIs still in flux, it would be nice to be able to use the improved plugin with the compiler and standard library from the 2.7.x branch (this includes a few important fixes added after the 2.7.3 release).

Since it seemed like no-one else had time to do that in the near future, I did it myself. It’s available here in case it’s useful for anyone else. In the interest of full disclosure, I applied the following patch to the plugin:


Index: META-INF/MANIFEST.MF
===================================================================
--- META-INF/MANIFEST.MF (revision 17110)
+++ META-INF/MANIFEST.MF (working copy)
@@ -34,8 +34,8 @@
org.eclipse.ui.navigator.resources,
org.eclipse.ui.views,
org.eclipse.ui.workbench.texteditor,
- scala.library,
- scala.tools.nsc
+ scala.library;bundle-version="[2.7.3,2.7.4)",
+ scala.tools.nsc;bundle-version="[2.7.3,2.7.4)"
Import-Package:
org.eclipse.contribution.jdt.cuprovider,
org.eclipse.contribution.jdt.imagedescriptor,
Index: src/scala/tools/editor/Typers.scala
===================================================================
--- src/scala/tools/editor/Typers.scala (revision 17110)
+++ src/scala/tools/editor/Typers.scala (working copy)
@@ -6,7 +6,6 @@

package scala.tools.editor

-import scala.annotation.unchecked.uncheckedStable
import scala.collection.jcl._

import nsc.{util,io}
Index: src/scala/tools/editor/TypersPresentations.scala
===================================================================
--- src/scala/tools/editor/TypersPresentations.scala (revision 17110)
+++ src/scala/tools/editor/TypersPresentations.scala (working copy)
@@ -6,7 +6,6 @@

package scala.tools.editor

-import scala.annotation.unchecked.uncheckedStable
import scala.collection.jcl.{LinkedHashMap,LinkedHashSet}
import scala.collection.mutable.ListBuffer

@@ -595,10 +594,13 @@
val str = name.decode
val key = if (sym.isMethod) str.trim+header(sym) else str.trim
val last = str.last
+
+ import org.eclipse.jdt.core.search.SearchPattern

// TODO: check accessibility.
- if (name.isTypeName == isType && str.toLowerCase.startsWith(leading.toLowerCase) &&
- (str.indexOf('$') == -1) && last != ' ' && !contains(key)) {
+ if (name.isTypeName == isType && (str.indexOf('$') == -1) && last != ' ' &&
+ !contains(key) && (str.toLowerCase.startsWith(leading.toLowerCase) ||
+ SearchPattern.camelCaseMatch(leading, str))) {
val sym0 = if (verify==null) sym else verify(sym)
if (sym0 != compiler.NoSymbol) {
val high = if (pt != null) sym0.tpe <:< pt

It includes my simple camel-case content assist patch, some version restrictions and the removal of the imports for the uncheckedStable annotation since it was moved outside of the scala package in 2.8.0. The compiler and standard library are the latest version from the 2.7.x branch. I’ve done some light testing and the plugin works for me, but the usual disclaimers apply. Be prepared to have to uninstall and reinstall the plugin if things don’t work out.

Update: I noticed that two annoying bugs were introduced on Friday. I would recommend not using the current build. I will update it once they are fixed.

Update 2: See here for a newer build with fixes for the mentioned issues.

Update 3: With the release of Scala 2.7.4-rc1, this backport is obsolete.

Not just for academics? Friday, Dec 5 2008 

At least ClariFI seems to think so. Paul Chiusano announced in the Scala mailing list that ClariFI is looking to hire developers “with a strong background in functional programming to do a mixture of Scala and Java programming”. The mailing list thread contains more details including a link to the official job posting.

It will be interesting to see if posts like that will become a common occurrence in the future.

Scala in the JVM Languages Summit Tuesday, Sep 30 2008 

Since I’ve seen no mention of this in Planet Scala (although it was discussed in #scala), here are some Scala-related links from the JVM Languages summit:

  • Optimizing Higher-Order Functions in Scala by Iulian Dragos. I haven’t tried the various optimisation flags in scalac yet, but I am curious if they will have any effect in the primitive collections with scala benchmark from a previous blog entry. These flags have been a bit buggy in the past, but hopefully recent fixes have solved the problems.
  • Scalify: Automatic translation from Java to Scala by Paul Phillips. This tool looks very interesting because the approach used is a lot more sophisticated than jatran. I have been meaning to take a look at it, but all of my code uses generics and it doesn’t support that yet.
  • Cliff Click’s post-Summit blog entry. It includes a link to the slides for his talk, “Fast Bytecodes for Funny Languages”. Scala is one of the languages covered (although very briefly) with the results coming from the translation of a simple Java micro-benchmark to Scala available here (in the comments). As expected, Scala can generate bytecode that is pretty much the same as Java if low-level constructs are used, but the performance suffers quite a bit if idiomatic Scala is used while dealing with primitives (at least with the default compilation flags).

For more general coverage of the summit:

Ricky Clarkson mentioned on IRC that videos of the talks will be made available on InfoQ and Google Video.

Update: John Rose’s post-summit blog entry is a nice round-up.

HotSpot JIT “Optimization” Saturday, Sep 27 2008 

I noticed Scala ticket #1377 the other day. Even though I think the bug is valid, it’s for different reasons to the ones supplied by the reporter. I had my doubts that the casts would have any measurable cost and the informal benchmark provided looked suspect. This is all described in the ticket, so I won’t go over it again here. The point of this blog is some weird behaviour I noticed in my version of the benchmark.

The benchmark looks like:

  def f(i: Iterator[_]) = i.hasNext

  def main(args: Array[String]) {
    test()
  }
    
  def test() {
    var i = 0
    while (i < 10) {
      val time = System.currentTimeMillis
      val result = inner()
      i += 1
      println("Time: " + (System.currentTimeMillis - time))
      println("Value: " + result)
    }
  }
    
  def inner() = {
    //val empty = Iterator.empty // checkcast occurs within loop
    val empty: Iterator&#91;_&#93; = Iterator.empty // checkcast outside loop
    var i = 0L
    while (i < 10000000000L) {
      f(empty)
      i += 1
    }
    i
  }
}
&#91;/sourcecode&#93;

According to this version of the benchmark the extra casts don't cause any performance difference, so I will just focus on the one without the extra casts. The results with JDK 6 Update 10 RC1 64-bit were:

<blockquote>
Time: 4903
Time: 4883
Time: 7213
Time: 7197
Time: 7203
Time: 7212
Time: 7185
Time: 7190
Time: 7210
Time: 7188
</blockquote>

That is odd. Instead of getting faster, the benchmark gets slower from the third iteration onwards. With JITs like these, we're better off with interpreters. ;) Ok, that's an exaggeration but let's investigate this further.

Before we do so, I should clarify two things. The ones paying attention might wonder where the "Value" output went. The purpose of that is just to make sure HotSpot does not optimize the inner loop away, so I trimmed it from the output. The second point is that the usage of a 64-bit JVM is important because the problem does not occur on the 32-bit version of HotSpot. I initially used the 64-bit version because it's much faster when performing operations on longs and I had to use a long index in the loop to allow the benchmark to take a reasonable amount of time.

Ok, so first step is to re-run the benchmark with -Xbatch and -XX:+PrintCompilation. The results of that were:

<blockquote>
  1   b   test.PerfTest$::f (7 bytes)
  2   b   scala.Iterator$$anon$5::hasNext (2 bytes)
  1%  b   test.PerfTest$::inner @ 12 (35 bytes)
Time: 4938
  3   b   test.PerfTest$::inner (35 bytes)
Time: 4861
Time: 7197
Time: 7199
Time: 7241

</blockquote>

Ok, so it seems like the inner loop got JIT'd a second time and that made it slower than the previous version, which sounds like a bug. I converted the code into Java and it turns out that we don't need much more than a loop in the inner method to reproduce the problem:


static long inner() {
  long i = 0L;
  for (; i < 10000000000L; ++i);
  return i;
}
&#91;/sourcecode&#93;

Before reporting the bug to Sun, I was curious if -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation would give any interesting information. I pasted some parts that someone that is not a HotSpot engineer might understand with the help of <a href="http://wikis.sun.com/display/HotSpotInternals/LogCompilation+overview">this</a>.


<nmethod compile_id='1' compile_kind='osr' compiler='C2' entry='0x000000002ee0d000' size='520' address='0x000000002ee0ced0' relocation_offset='264' code_offset='304' stub_offset='400' consts_offset='420' scopes_data_offset='424' scopes_pcs_offset='456' dependencies_offset='504' oops_offset='512' method='test/PerfTest inner ()J' bytes='19' count='1' backedge_count='14563' iicount='1' stamp='0.054'/>
<writer thread='12933456'/>
<task_queued compile_id='1' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' blocking='1' stamp='4.965' comment='count' hot_count='2'/>
<writer thread='43477328'/>
  1   b   test.PerfTest::inner (19 bytes)
<nmethod compile_id='1' compiler='C2' entry='0x000000002ee0d240' size='488' address='0x000000002ee0d110' relocation_offset='264' code_offset='304' stub_offset='368' consts_offset='388' scopes_data_offset='392' scopes_pcs_offset='424' dependencies_offset='472' oops_offset='480' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' stamp='4.966'/>
<...>

<task compile_id='1' compile_kind='osr' method='test/PerfTest inner ()J' bytes='19' count='1' backedge_count='14563' iicount='1' osr_bci='5' blocking='1' stamp='0.050'>
<...>
<task_done success='1' nmsize='120' count='1' backedge_count='14563' stamp='0.054'/>

<task compile_id='1' method='test/PerfTest inner ()J' bytes='19' count='2' backedge_count='5000' iicount='2' blocking='1' stamp='4.965'>
<...>
<task_done success='1' nmsize='88' count='2' backedge_count='5000' stamp='4.966'/>

So, it seems like the inner method was first JIT’d through on-stack-replacement (i.e. OSR). Usually, on-stack-replacement does not produce the best code, so HotSpot recompiles the method again once it gets the chance. Unfortunately, it generates slower code for some reason even though the compiled method size is smaller (88 instead of 120).

We could go deeper in this investigation by using a debug JVM like Kohsuke Kawaguchi did here, but I decided to just file a bug and let the HotSpot engineers handle it. :) I will update the blog once a bug number is assigned (I wonder when Sun is going to fix their bug database so that the bug id becomes available after submission of a bug…).

Update: The bug is now available in the Sun database.

Array variance and method overloading Sunday, Sep 21 2008 

One of these days I used jatran to translate the code for a simple Swing application from Java to Scala. Jatran’s translation had a few issues (some of which have been fixed since I tried it) so there were quite a few compiler errors after it finished. I fixed them and launched the application.

It was at this point that I noticed a change in behaviour and it was not clear why it was happening. After some investigation, it turned out that it was caused by the fact that variance of arrays in Scala is different from Java combined with poor usage of method overloading.

TreePath has two constructors that do different things, but one of them takes a supertype of the other:

public TreePath(Object[] path)
public TreePath(Object singlePath)

Because arrays are nonvariant in Scala, it means that scalac chooses the constructor that takes an Object for the following code:

def pathToRoot: Array[TreeNode] = ...
val treePath = new TreePath(pathToRoot)

On the other hand, arrays are covariant in Java so the constructor that takes an Object[] is chosen by javac for the equivalent Java code. One way to force scalac to choose the correct constructor is to cast the result of pathToRoot:

def pathToRoot: Array[TreeNode] = ...
val treePath = new TreePath(pathToRoot.asInstanceOf[Array[AnyRef]])

It’s a subtle issue that may affect other people automatically converting Java code to Scala, so I thought I’d bring it up in case it saves them some time. :)

Efficient Scala with Primitive Collections Monday, Sep 15 2008 

When dealing with large collections of primitives in Java/Scala, one can achieve substantial memory savings by using something like GNU Trove. Performance should also be better, but the difference may be small if the boxing does not happen in the performance-sensitive parts of the application. Using Trove is not as seamless as it should be because its collections don’t implement java.util interfaces (there are decorators, but they cause a lot of boxing defeating the purpose of using Trove in the first place). However, the advantages outweigh the disadvantages in some cases.

I wanted to use Trove collections from Scala conveniently without losing the efficiency benefits. Convenience in this case means the ability to pass anonymous functions to the usual suspects (i.e. map, flatMap, filter, etc.). One would ordinarily just write a wrapper, sprinkle some implicits and that would be it. Unfortunately, a naive wrapper will not meet the efficiency requirements.

Let’s start with the implementation that relies on an anonymous inner class to work out the baseline performance we’re aiming at. A simple implementation that serves the purpose is a method that sums all the elements of a TIntArrayList and returns it (preventing HotSpot from optimising it out).

  private def sumWithInnerClass(list: TIntArrayList): Int = {
    var sum = 0
    val p = new TIntProcedure() {
      def execute(i: Int) = {
        sum += i
        true
      }
    }
    list.forEach(p)
    sum
  }

This code is not much better than the Java equivalent (if at all), but it performs quite well. After a few iterations to allow the JVM to warm-up, it takes about 120ms to sum a list with 100 million items.

Let’s try a more convenient version of the code:

  implicit def toIntProcedureWithBoxing(f: Int => Unit): TIntProcedure = new TIntProcedure() {
    def execute(i: Int) = {
      f(i)
      true
    }
  }

  private def sumWithFunctionAndBoxing(list: TIntArrayList): Int = {
    var sum = 0
    list.forEach{x:Int => sum += x}
    sum
  }

We define an implicit from Int => Unit (which really means Function1[Int, Unit]) to a TIntProcedure and that allows us to pass an anonymous function to forEach. For the record, I am aware that a fold would have been a nicer way to define the sum (no need for the var), but I wanted to keep it similar to the initial version since I was interested in the performance difference caused by the wrapper. The results were certainly not encouraging, the boxing caused by calling Function1[Int, Unit] from TIntProcedure resulted in performance over 10 times worse: 1300ms on average.

That is not acceptable, so we need a way to avoid the boxing. Fortunately, the anonymous function for Function[Int, Unit] actually defines an apply(i: Int) method… but unfortunately we have no way to cast to the anonymous inner class generated by the Scala compiler, so we can’t invoke it. A possible solution is to generate traits like the following for the primitive types we care about and have the anonymous functions implement them when appropriate.

trait IntFunction[+R] extends Function1[Int, R] {
  def apply(i: Int): R
}

This gets more complicated once you consider that a function can take up to 22 parameters, but let’s for now assume that we only care about the case we use in our benchmark. I tried to think of ways to achieve this and it seemed like some compiler hacking was required. Given all my experience hacking compilers (read: none), I thought “why not?”.

To make my life easier, I decided not to read any documentation (not even sure if there’s any ;)) and settled for browsing and grep’ing the sources. Not too long after, I found what I was looking for, UncurryTransformer.tranform. After some analysis, it seemed like all that was needed was a conditional like:

        if (formals.head == IntClass.tpe && restpe == UnitClass.tpe) {
          anonClass setInfo ClassInfoType(
            List(ObjectClass.tpe, IntFunctionClass.tpe, fun.tpe, ScalaObjectClass.tpe), newScope, anonClass);
        }

Ok, I lie. I also defined IntFunctionClass in Definitions.scala. This is probably obvious, but I will mention it anyway, this is not a complete or general solution and it has a few limitations. However, it’s good enough to continue with the next part of the experiment. We can now define the following:

  implicit def toIntProcedure(f: Int => Unit): TIntProcedure = new TIntProcedure() {
    def execute(i: Int) = {
      f.asInstanceOf[IntFunction[_]](i)
      true
    }
  }
  
  private def sumWithFunction(list: TIntArrayList): Int = {
    var sum = 0
    list.forEach{x: Int => sum += x}
    sum
  }

We cast to IntFunction before calling apply, avoiding the boxing of the int. The performance for this was around 120ms to sum 100 million ints, which is about the same as the initial version. It seems like HotSpot does a very good job of inlining, so the additional indirection is basically free.

Ok, so the hack works. It’s relatively ugly, but at least the cast is encapsulated in the implicit converter and during normal usage, one does not have to deal with it. With more changes to the compiler, I believe it would be possible for Int => Unit to translate to IntFunction[Unit] or even IntUnitFunction instead of Function1[Int, Unit] and the cast would then not be needed. There are certainly some issues that would have to be worked out, but there are some nice performance benefits too.

It would solve issues like #1297 (where Range is 25 times slower than the while loop _after_ the fix) and mitigate #1338. It’s worth noting that work on scalar replacement through escape analysis in HotSpot and optimisation phases for the Scala compiler may reduce the amount of boxing that takes place, reducing the problem shown here. It’s unclear to what extent until we can try them out though.

The benchmark code can be found here. For my tests, I used JDK 6 Update 10 rc1 on an 8 core Mac Pro with the following options:

-XX:MaxPermSize=256m -Xms192m -Xmx2560m -server -XX:+UseConcMarkSweepGC

I think that’s enough for a first blog entry. Until next time. :)