When dealing with large collections of primitives in Java/Scala, one can achieve substantial memory savings by using something like GNU Trove. Performance should also be better, but the difference may be small if the boxing does not happen in the performance-sensitive parts of the application. Using Trove is not as seamless as it should be because its collections don’t implement java.util interfaces (there are decorators, but they cause a lot of boxing defeating the purpose of using Trove in the first place). However, the advantages outweigh the disadvantages in some cases.

I wanted to use Trove collections from Scala conveniently without losing the efficiency benefits. Convenience in this case means the ability to pass anonymous functions to the usual suspects (i.e. map, flatMap, filter, etc.). One would ordinarily just write a wrapper, sprinkle some implicits and that would be it. Unfortunately, a naive wrapper will not meet the efficiency requirements.

Let’s start with the implementation that relies on an anonymous inner class to work out the baseline performance we’re aiming at. A simple implementation that serves the purpose is a method that sums all the elements of a TIntArrayList and returns it (preventing HotSpot from optimising it out).

  private def sumWithInnerClass(list: TIntArrayList): Int = {
    var sum = 0
    val p = new TIntProcedure() {
      def execute(i: Int) = {
        sum += i
        true
      }
    }
    list.forEach(p)
    sum
  }

This code is not much better than the Java equivalent (if at all), but it performs quite well. After a few iterations to allow the JVM to warm-up, it takes about 120ms to sum a list with 100 million items.

Let’s try a more convenient version of the code:

  implicit def toIntProcedureWithBoxing(f: Int => Unit): TIntProcedure = new TIntProcedure() {
    def execute(i: Int) = {
      f(i)
      true
    }
  }

  private def sumWithFunctionAndBoxing(list: TIntArrayList): Int = {
    var sum = 0
    list.forEach{x:Int => sum += x}
    sum
  }

We define an implicit from Int => Unit (which really means Function1[Int, Unit]) to a TIntProcedure and that allows us to pass an anonymous function to forEach. For the record, I am aware that a fold would have been a nicer way to define the sum (no need for the var), but I wanted to keep it similar to the initial version since I was interested in the performance difference caused by the wrapper. The results were certainly not encouraging, the boxing caused by calling Function1[Int, Unit] from TIntProcedure resulted in performance over 10 times worse: 1300ms on average.

That is not acceptable, so we need a way to avoid the boxing. Fortunately, the anonymous function for Function[Int, Unit] actually defines an apply(i: Int) method… but unfortunately we have no way to cast to the anonymous inner class generated by the Scala compiler, so we can’t invoke it. A possible solution is to generate traits like the following for the primitive types we care about and have the anonymous functions implement them when appropriate.

trait IntFunction[+R] extends Function1[Int, R] {
  def apply(i: Int): R
}

This gets more complicated once you consider that a function can take up to 22 parameters, but let’s for now assume that we only care about the case we use in our benchmark. I tried to think of ways to achieve this and it seemed like some compiler hacking was required. Given all my experience hacking compilers (read: none), I thought “why not?”.

To make my life easier, I decided not to read any documentation (not even sure if there’s any ;)) and settled for browsing and grep’ing the sources. Not too long after, I found what I was looking for, UncurryTransformer.tranform. After some analysis, it seemed like all that was needed was a conditional like:

        if (formals.head == IntClass.tpe && restpe == UnitClass.tpe) {
          anonClass setInfo ClassInfoType(
            List(ObjectClass.tpe, IntFunctionClass.tpe, fun.tpe, ScalaObjectClass.tpe), newScope, anonClass);
        }

Ok, I lie. I also defined IntFunctionClass in Definitions.scala. This is probably obvious, but I will mention it anyway, this is not a complete or general solution and it has a few limitations. However, it’s good enough to continue with the next part of the experiment. We can now define the following:

  implicit def toIntProcedure(f: Int => Unit): TIntProcedure = new TIntProcedure() {
    def execute(i: Int) = {
      f.asInstanceOf[IntFunction[_]](i)
      true
    }
  }
  
  private def sumWithFunction(list: TIntArrayList): Int = {
    var sum = 0
    list.forEach{x: Int => sum += x}
    sum
  }

We cast to IntFunction before calling apply, avoiding the boxing of the int. The performance for this was around 120ms to sum 100 million ints, which is about the same as the initial version. It seems like HotSpot does a very good job of inlining, so the additional indirection is basically free.

Ok, so the hack works. It’s relatively ugly, but at least the cast is encapsulated in the implicit converter and during normal usage, one does not have to deal with it. With more changes to the compiler, I believe it would be possible for Int => Unit to translate to IntFunction[Unit] or even IntUnitFunction instead of Function1[Int, Unit] and the cast would then not be needed. There are certainly some issues that would have to be worked out, but there are some nice performance benefits too.

It would solve issues like #1297 (where Range is 25 times slower than the while loop _after_ the fix) and mitigate #1338. It’s worth noting that work on scalar replacement through escape analysis in HotSpot and optimisation phases for the Scala compiler may reduce the amount of boxing that takes place, reducing the problem shown here. It’s unclear to what extent until we can try them out though.

The benchmark code can be found here. For my tests, I used JDK 6 Update 10 rc1 on an 8 core Mac Pro with the following options:

-XX:MaxPermSize=256m -Xms192m -Xmx2560m -server -XX:+UseConcMarkSweepGC

I think that’s enough for a first blog entry. Until next time. :)

Advertisement