Performance of FastMath from Commons Math Wednesday, Feb 23 2011 

Commons Math includes (in trunk and MATH_2_X branch) a FastMath class that is described as a “Faster, more accurate, portable alternative to StrictMath” and is implemented fully in Java (unlike java.lang.Math and java.lang.StrictMath). I am generally interested in faster math operations so I decided to investigate.

The first thing that I checked is if FastMath delegated to java.lang.Math or java.lang.StrictMath for any of its methods. It delegates to java.lang.Math for sqrt and random and to StrictMath for IEEEremainder. I found this interesting because I know that HotSpot includes intrinsic methods for many methods in java.lang.Math. Looking at vmSymbols.hpp, we can see that the following methods have intrinsics: abs, sin, cos, tan, atan2, sqrt, log, log10, pow, exp, min and max. Intrinsic methods are usually highly optimised (often involving assembly and sometimes CPU-specific instructions) and do not incur JNI overhead. I was interested in comparing all of FastMath’s methods with java.lang.Math and java.lang.StrictMath, but I was particularly interested in these ones.

Bill Rossi (the original author of FastMath) pointed me to a performance test in SVN, so I decided to run that (r1072209 from SVN trunk) on a Intel Core i7 CPU 860 2.80GHz using JDK 6 Update 25 early access release build 1 (includes HotSpot 20) and JDK 7 build 130 (includes HotSpot 21). I originally ran the test as it is in Commons Math with one addition but it has a few common micro-benchmarking mistakes (thanks to Aleksey for pointing them out in the comments). I fixed the issues (see here for the modified code) and ran the test again.

Since the relative results for the two JDKs did not differ much, I am only posting the charts for JDK 6 Update 25 early access release. The results for both JDKs are available in table form here. Note that the results are in milliseconds, so lower is better.

java.lang.Math does quite a bit better on tan, abs and log while FastMath does better on exp and cosh. The rest are similar.

FastMath does very well on hypot, pow, asin, acos, cbrt, sinh, tanh while java.lang.Math does well on log10 and log1p. java.lang.Math does so badly on hypot that it’s worth mentioning it again. Looks like it needs a bit of work.

As stated above, the results for JDK 7 are very similar with a few small improvements. Methods with an execution time reduction of 15% or higher were pow (202ms instead of 238ms), powII (125ms instead of 149ms) and atan (53ms instead of 66ms)

The usual disclaimers about benchmarks apply. The benchmark provided by commons-math is a good start, but I believe it should be fleshed out more so that common cases are measured separately (e.g. pow(x, 2) is much more common than the ones used in the test). Having said that, FastMath is a nice contribution and it seems like some methods in the JDK could be replaced by the ones in FastMath for better results. It also seems like FastMath could do better by delegating to the JDK for more of its methods.

Update: Some context for the contribution of FastMath to Apache Commons Math can be found here.

Update 2: I’ve updated the charts to use a modified test instead of the one in Commons Math as there were issues there. I had mentioned some suspicious scores, but thanks to Aleksey for pointing the issues out in the comments (they are common mistakes in micro-benchmarks).

New JVM options and Scala iteration performance Monday, Oct 26 2009 

Internal iterators like foreach tend to do very well in micro-benchmarks on the JVM. In fact, they often do as well as the equivalent manual while or for loop. There is a catch, however, and it’s easy to miss it in micro-benchmarks.

A few months ago, David MacIver forwarded an email from Martin Odersky to the scala-internals mailing list about this. The problem as described by Martin:

“The reason seems to be that the foreach itself is not inlined, so the call to the closure becomes megamorphic and therefore slow.”

I was curious about the benchmarks used to measure this and Tiark Rompf (who had performed the measurements) provided the source code. I said I’d try to take a look the next weekend and I did it… more than 3 months later. Well, better late than never. ;)

There are 3 benchmarks:

  • Benchmark A: traverse a collection with N elements
  • Benchmark B: traverse a collection with N elements, inside the loop/closure traverse another collection with N2 elements 3 times
  • Benchmark C: build a collection (front to back) of N elements

Various approaches are used for each benchmark with various collections types. One weakness of the these benchmarks is that they don’t include a verification mechanism to ensure that all benchmarks produce the same result. However, they include code that tries to prevent the JIT from performing unfair optimisations (e.g. print something if an element in the collection matches a certain condition).

The original results produced by Tiark can be found here.

I added a few benchmarks that use plain arrays (RawArrayIndexed, RawArrayForeach, RawArrayForeachMega), made minor changes to the scripts and pushed the code to a GitHub repository. I left the rest of the Scala code as it was to make it easy to compare with the original results and ran the benchmark with various JVM settings to see what effect they would have. All the tests shared the following:

  • Dual quad-core Xeon E5462 2.80GHz
  • 14 GB RAM
  • Fedora 11
  • Scala 2.8.0.r19261
  • Revision 59521431f5c118b73e35b0b396e3efd6aecec3dd of project
  • 64-bit JDK 6 Update 18 early access b03
  • JVM base settings: -Xms1G -Xmx1G -XX:+UseParallelGC -XX:+UseParallelOldGC

JDK 6 Update 18 is scheduled to be released on Q4, 2009 and it includes HotSpot 16. Even though JDK 6 Update 14 (HotSpot 14) introduced compressed references and scalar replacement, HotSpot 16 includes improved compressed references and many crucial fixes to both features. According to my testing these features are now approaching production-level stability and the OpenJDK engineers seem to agree as they are both enabled by default in HotSpot 17 (which will eventually hit JDK6 too).

Interested in how these features would affect the performance in these benchmarks, I ran them with various combinations. I also added Scala’s compiler -optimise flag in some cases.

The original benchmark from Tiark used 3 collection types: array (java.util.ArrayList), list (scala.List, immutable single linked list) and vector (earlier version of immutable vector that has recently been added to Scala 2.8). I added JVM arrays and they are shown as “rawarray” in the charts. Finally, we get to the actual numbers.

Benchmark A

Click on chart for expanded version

There are some interesting data points here:

  1. Compressed references is a _huge_ win. RawArrayIndexed went from 500ms to 142ms and many of the vector operations were much faster.
  2. Escape analysis (which enables scalar replacement) doesn’t seem to have much of an effect.
  3. scalac -optimise doesn’t seem to have much of an effect.
  4. foreach is misleadingly fast in micro-benchmarks, but it’s easy to bring it down to earth. RawArrayForeach performs similarly to RawArrayIndexed, but RawArrayForeachMega is 10 times slower. The latter simply calls foreach with a few different anonymous functions during the collection creation phase causing the call site to become megamorphic. Once this happens, the only hope for good performance is that the foreach method gets inlined and it doesn’t seem to happen here. With this in mind, it seems like ticket 1338 (Optimize simple for loops) is a good idea.
Benchmark B

Click on chart for expanded version

Once again, compressed references are a large factor in some benchmarks (halving the time taken in some cases).

The new bit of information is that scalac -optimise causes a huge improvement in VectorForeachFast and VectorForeachFastProtect. This makes sense once one considers one of the findings from the previous benchmark. We said that inlining of foreach is of extreme importance once a call site is megamorphic and this is precisely what -optimise does in this case (and the JVM fails to do so at runtime otherwise). Sadly, -optimise cannot do this safely in many cases as it’s shown by the results for VectorForeach.

Benchmark C

Click on chart for expanded version

Once again, compressed references provide a nice boost. Seems like this option is a winner in 64-bit JVMs (if you don’t need a heap larger than 32GB), it saves memory and gives better performance. The usual disclaimer applies though, you should benchmark your own application instead of relying on micro-benchmarks when deciding what JVM options to use.

The complete results are also available. Feel free to play with the source code and provide your own numbers, fixes and/or improvements.

Configgy for Scala 2.8 Sunday, Oct 25 2009 

I wanted to use Configgy with Scala 2.8, so I created a branch in my GitHub repository:

The main code now compiles, but the tests and the build system have to be updated still.

I’m not particularly familiar with Ivy, so if someone else wants to do that, I won’t complain. ;)

The tests use specs and Eric has already released a binary for 2.8.0.

Update: I filed an issue upstream to track 2.8 support.

Update 2: The branch has been updated to Scala 2.8.0.Beta1-RC3 along with the build system and the tests pass. Thanks to paulp for tracking an issue that was taking place because the stack order changed in Scala 2.8.0.

Scala 2.7.4-rc1 released Thursday, Apr 16 2009 

Scala 2.7.4-rc1 has been released. It incorporates a few important fixes related to serialization of anonymous functions and generation of Java generics, but the big change is the improved Scala plugin Scala IDE for Eclipse. This means that my backport is now obsolete.

The announcement contains all the juicy information.

Load unsigned and better Compressed Oops Friday, Apr 3 2009 

The HotSpot engineers are constantly working on improving performance. I noticed two interesting commits recently:

Vladimir Kozlov improved Compressed Oops so that it doesn’t need to do encoding/decoding if the heap is smaller than 4GB and to reduce branches/checks if the heap is between 4GB and 32GB. The end result is that 64-bit now surpasses 32-bit performance in more situations. See my entry about Compressed Oops if you don’t know what I’m talking about. :)

Christian Thalinger added support for load unsigned in the -server JIT. This means that things like bytearray[i] & 0xFF and intarray[i] & 0xFFFFFFFF (necessary since JVM bytecode doesn’t support unsigned types) can be transformed into load unsigned operations to avoid the performance penalty. This can make a decent difference in some cases (e.g. charset operations).

Json serialization/deserialization faster than protocol buffers? Wednesday, Mar 25 2009 

I was reading the The Book of JOSH and saw the following statement:

“Json delivers on what XML promised. Simple to understand, effective data markup accessible and usable by human and computer alike. Serialization/Deserialization is on par with or faster then XML, Thrift and Protocol Buffers.”

That seemed a bit too definite for my taste. There are so many variables that can affect the results that I was interested in more information, so I asked for it and eventually got an answer.

I had a brief look at the benchmark referenced and that was enough to come up with some talking points. To make it easier to follow, I will just compare protocol buffers and json (jackson). I started by running the benchmark in my machine (java 1.6.0_14-ea-b03):

Object create Serialization Deserialization Serialized Size
protobuf 312.95730 3052.26500 2340.84600 217
json 182.64535 2284.88300 3362.31850 310

Ok, so json doesn’t seem to be faster on deserialization and the size is almost 50% bigger (a big deal if the network is the bottleneck as is often the case). Why is serialization of protobuf so slow though? Let’s see the code:

    public byte[] serialize(MediaContent content, ByteArrayOutputStream baos) throws IOException
        return baos.toByteArray();

How about we replace that with content.toByteArray()?

Object create Serialization Deserialization Serialized Size
protobuf 298.89330 2087.79800 2339.44450 217
json (jackson) 174.49190 2482.53350 3599.90800 310

That’s more like it. Let’s try something a bit more exotic just for fun and add XX:+DoEscapeAnalysis:

Object create Serialization Deserialization Serialized Size
protobuf 260.51330 1925.32300 2302.74250 217
json (jackson) 176.20370 2385.99750 3647.01700 310

That reduces some of the cost of object creation for protobuf, but it’s still substantially slower than json. This is not hard to believe because of the builder pattern employed by the Java classes generated by protocol buffers, but I haven’t investigated it in more detail. In any case, protocol buffers is better in 3 of the measures for this particular benchmark.

What does this mean? Not a lot. As usual, where performance is important, you should create benchmarks that mirror your application and environment. I just couldn’t let the blanket “json is on par with or faster than…” statement pass without a bit of scrutiny. ;)

MercurialEclipse 1.3.1019 Wednesday, Feb 18 2009 

With the talk about DVCSs as active as ever in the Eclipse world, I thought I’d mention the release of MercurialEclipse 1.3.1019. It still lacks quite a bit of polish, but it already supports a decent amount of features.

Even though Mercurial doesn’t have as much momentum as Git, it has been adopted by a respectable amount of projects (including Mozilla, Netbeans and OpenJDK).

Scala plugin for Eclipse snapshot r17139 Wednesday, Feb 18 2009 

The snapshot of the Scala plugin for Eclipse that I posted a few days ago had a couple of annoying bugs. Here’s a new one build (note that it’s an Eclipse update site, so no point in trying to access it with a browser) from revision 17139 of the trunk of the plugin with the compiler and standard libraries from the 2.7.x branch.

Miles has been busy once again and fixes included since the last build are 1717, 1233, 1636, 1662, 1716, 1627, 1534, 1410, 1516, 1483 and 1253.

Update: New build published that includes fix for ticket 1741. Uninstallation followed by installation may be required if p2 starts giving cryptic errors (it seems like it doesn’t handle well the case where the update site is replaced). Sorry about that.

Update 2: With the release of Scala 2.7.4-rc1, this backport is obsolete.

Scala plugin snapshot Saturday, Feb 14 2009 

Miles Sabin has recently committed the Equinox Aspects-based approach to JDT integration for the Scala plugin to trunk. This approach is similar to the one used by the AJDT as described by Andrew Eisenberg.

Soon after that commit, Miles embarked on a bug-killing mission. Nightlies are available from the usual place, but they rely on the trunk of the compiler and standard library. With APIs still in flux, it would be nice to be able to use the improved plugin with the compiler and standard library from the 2.7.x branch (this includes a few important fixes added after the 2.7.3 release).

Since it seemed like no-one else had time to do that in the near future, I did it myself. It’s available here in case it’s useful for anyone else. In the interest of full disclosure, I applied the following patch to the plugin:

--- META-INF/MANIFEST.MF (revision 17110)
+++ META-INF/MANIFEST.MF (working copy)
@@ -34,8 +34,8 @@
- scala.library,
+ scala.library;bundle-version="[2.7.3,2.7.4)",
Index: src/scala/tools/editor/Typers.scala
--- src/scala/tools/editor/Typers.scala (revision 17110)
+++ src/scala/tools/editor/Typers.scala (working copy)
@@ -6,7 +6,6 @@


-import scala.annotation.unchecked.uncheckedStable
import scala.collection.jcl._

import nsc.{util,io}
Index: src/scala/tools/editor/TypersPresentations.scala
--- src/scala/tools/editor/TypersPresentations.scala (revision 17110)
+++ src/scala/tools/editor/TypersPresentations.scala (working copy)
@@ -6,7 +6,6 @@


-import scala.annotation.unchecked.uncheckedStable
import scala.collection.jcl.{LinkedHashMap,LinkedHashSet}
import scala.collection.mutable.ListBuffer

@@ -595,10 +594,13 @@
val str = name.decode
val key = if (sym.isMethod) str.trim+header(sym) else str.trim
val last = str.last
+ import

// TODO: check accessibility.
- if (name.isTypeName == isType && str.toLowerCase.startsWith(leading.toLowerCase) &&
- (str.indexOf('$') == -1) && last != ' ' && !contains(key)) {
+ if (name.isTypeName == isType && (str.indexOf('$') == -1) && last != ' ' &&
+ !contains(key) && (str.toLowerCase.startsWith(leading.toLowerCase) ||
+ SearchPattern.camelCaseMatch(leading, str))) {
val sym0 = if (verify==null) sym else verify(sym)
if (sym0 != compiler.NoSymbol) {
val high = if (pt != null) sym0.tpe <:< pt

It includes my simple camel-case content assist patch, some version restrictions and the removal of the imports for the uncheckedStable annotation since it was moved outside of the scala package in 2.8.0. The compiler and standard library are the latest version from the 2.7.x branch. I’ve done some light testing and the plugin works for me, but the usual disclaimers apply. Be prepared to have to uninstall and reinstall the plugin if things don’t work out.

Update: I noticed that two annoying bugs were introduced on Friday. I would recommend not using the current build. I will update it once they are fixed.

Update 2: See here for a newer build with fixes for the mentioned issues.

Update 3: With the release of Scala 2.7.4-rc1, this backport is obsolete.

Code reuse by copy and paste Friday, Feb 13 2009 

I was investigating what would need to be done to implement camel-case code assist for the Scala plugin when I ran across the following gem in CharOperation (Eclipse JDT):

public static final boolean camelCaseMatch(char[] pattern, int patternStart, int patternEnd, char[] name, int nameStart, int nameEnd, boolean samePartCount) {

	/* !!!!!!!!!! WARNING !!!!!!!!!!
	 * The content of this method has been fully copied to
	 * SearchPattern#camelCaseMatch(String, int, int, String, int, int, boolean).
	 * So, if current method is modified, do NOT forget to copy again its content
	 * to SearchPattern corresponding method!

And in SearchPattern:

public static final boolean camelCaseMatch(String pattern, int patternStart, int patternEnd, String name, int nameStart, int nameEnd, boolean samePartCount) {

	/* !!!!!!!!!! WARNING !!!!!!!!!!
	 * The algorithm of this method has been entirely copied from
	 * CharOperation#camelCaseMatch(char[], int, int, char[], int, int, boolean).
	 * Array lengths have been replaced with call to {@link String#length()} and
	 * array direct access have been replaced with call to {@link String#charAt(int)}.
	 * So, do NOT modify this method directly to fix any bug but modify first the
	 * corresponding CharOperation method and do the copy again to be sure that
	 * these two methods are kept synchronized.

I am sure there is a reason why this was done, but how ugly…

Update: At least JDT made it easy to implement what I wanted.

Next Page »


Get every new post delivered to your Inbox.