We have all heard about how HotSpot is really good at dealing with short-lived objects (both allocation and GC), but the truth is that object allocation is still pretty costly when compared to operations like addition or multiplication. Allocating an object for each step of an iteration over a large collection to perform a simple computation might sound like the kind of thing no one would ever do, but it's actually quite common in languages like Scala (as described in a previous post). In Java-land, if you use the Function class from Google Collections with primitive wrappers, the same issue may occur. There are many JVM improvements that could help depending on the specific case (generic specialisation, value types and fixnums, to name a few), but it's unclear if/when we'll get them.
So, what about that title? Escape analysis was introduced during Java 6, but the information gathered was only used for lock elision. However, this information can also be used for other interesting optimisations like scalar replacement and stack allocation. There have been doubts about the benefits of stack allocation (discussed in the comments), so the focus has been on scalar replacement, where the object is never allocated in memory at all. At least that's the theory.
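To make the idea concrete, here is a minimal sketch (my own illustration with a hypothetical Pair class, not HotSpot's actual mechanics): if the JIT can prove that an object never escapes the method that allocates it, scalar replacement lets it treat the object's fields as plain locals, so no allocation happens at all.

class Pair {
    final int a;
    final int b;
    Pair(int a, int b) { this.a = a; this.b = b; }
}

class ScalarReplacementSketch {
    // The Pair never escapes, so it is a candidate for scalar replacement.
    static int sum(int x, int y) {
        Pair p = new Pair(x, y);
        return p.a + p.b;
    }

    // After scalar replacement, the compiled code behaves as if it were:
    static int sumScalarised(int x, int y) {
        int pA = x; // p.a lives in a local/register instead of the heap
        int pB = y; // p.b lives in a local/register instead of the heap
        return pA + pB;
    }
}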
Edward Lee started a couple of threads in the hotspot-dev mailing list about scalar replacement here and here, which reminded me to do some experiments. Note that this feature is still in development, so the results posted here are preliminary and not indicative of how it will perform once it's final. Still, it's interesting to see how well it works at this time. I picked the latest JDK7 build (build 41) and ran a few tests with the following arguments passed to java: "-XX:MaxPermSize=256m -Xms128m -Xmx3328m -server -XX:+UseConcMarkSweepGC" and either -XX:-DoEscapeAnalysis or -XX:+DoEscapeAnalysis.
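For reference, a full invocation looked something like this (the classpath and main class are my assumption, since the test is compiled Scala code; flip the DoEscapeAnalysis sign for the other configuration):

java -XX:MaxPermSize=256m -Xms128m -Xmx3328m -server -XX:+UseConcMarkSweepGC -XX:+DoEscapeAnalysis -cp scala-library.jar:. Test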
I started by choosing the simplest test possible. Note that either testSimpleAllocation or testNoAllocation would be commented out.
class C(val a: Int, val b: Int)

object Test {
  def main(args: Array[String]) {
    for (i <- 1 to 10) testSimpleAllocation()
    //for (i <- 1 to 10) testNoAllocation()
  }

  def testSimpleAllocation() = {
    System.gc()
    val time = System.currentTimeMillis
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(new C(i + 1, i + 2))
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }

  def testNoAllocation() = {
    System.gc()
    val time = System.currentTimeMillis
    var i = 0
    var sum = 0
    while (i < 1000000000) {
      sum += baz(i + 1, i + 2)
      i += 1
    }
    println(sum)
    println(System.currentTimeMillis - time)
  }

  def baz(a: Int, b: Int): Int = a + b

  def baz(c: C): Int = c.a + c.b
}
The results (in milliseconds) were:
testNoAllocation: 403
testSimpleAllocation with EA: 1006
testSimpleAllocation without EA: 9190
As we can see, escape analysis has a tremendous effect and the method finishes in 11% of the time taken with it disabled. However, the version with no allocation is still substantially faster.
Next, I decided to test a foreach method that takes a Function object (this time in Java, because it does less magic behind the scenes):
package test;

public class EscapeAnalysis {

  interface Function<T, R> {
    R apply(T value);
  }

  interface IntFunction {
    void apply(int value);
  }

  static class BoxedArray {
    private final int[] array;

    public BoxedArray(int length) {
      array = new int[length];
    }

    public int length() {
      return array.length;
    }

    public void foreach(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(new Integer(i));
    }

    public void foreach(IntFunction function) {
      for (int i : array)
        function.apply(i);
    }

    public void set(int index, int value) {
      array[index] = value;
    }

    public void foreachWithAutoboxing(Function<Integer, Void> function) {
      for (int i : array)
        function.apply(i);
    }
  }

  public static void main(String[] args) {
    BoxedArray array = new BoxedArray(100000000);
    /* We are careful not to restrict our ints to the ones in the Integer.valueOf cache */
    for (int i = 0; i < array.length(); i++)
      array.set(i, i);

    for (int i = 0; i < 10; i++)
      test(array);
  }

  private static void test(BoxedArray array) {
    System.gc();
    long time = System.currentTimeMillis();
    final int[] sum = new int[] { 0 };

    /* Uncomment the one that should be executed */
    testIntegerForeach(array, sum);
    // testIntegerForeachWithAutoboxing(array, sum);
    // testIntForeach(array, sum);

    System.out.println(System.currentTimeMillis() - time);
    System.out.println(sum[0]);
  }

  private static void testIntegerForeachWithAutoboxing(BoxedArray array, final int[] sum) {
    array.foreachWithAutoboxing(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }

  private static void testIntegerForeach(BoxedArray array, final int[] sum) {
    array.foreach(new Function<Integer, Void>() {
      public Void apply(Integer value) {
        sum[0] += value;
        return null;
      }
    });
  }

  private static void testIntForeach(BoxedArray array, final int[] sum) {
    array.foreach(new IntFunction() {
      public void apply(int value) {
        sum[0] += value;
      }
    });
  }
}
The results (in milliseconds) were:
testIntForeach: 130
testIntegerForeachWithAutoboxing with EA: 1064
testIntegerForeach with EA: 224
testIntegerForeachWithAutoboxing without EA: 1039
testIntegerForeach without EA: 1024
This test shows something interesting: EA gives no improvement if Integer.valueOf (called by auto-boxing) is used instead of new Integer. Apart from that, the results are somewhat similar to the first ones (EA provides a substantial boost, but not enough to match the specialised implementation). After quickly verifying that the boxing methods in ScalaRunTime had the same effect as Integer.valueOf, I decided that it was not worth testing more complex scenarios.
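As a minimal sketch of the difference being measured (my own illustration): auto-boxing compiles to Integer.valueOf, which may return a shared instance from its internal cache, so the JIT cannot treat that call site as a plain allocation, whereas new Integer(i) is a fresh allocation that can be proven not to escape.

public class BoxingSketch {
    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 1000000000; i++) {
            // A fresh allocation: provably non-escaping, so a candidate
            // for scalar replacement.
            sum += new Integer(i).intValue();

            // Auto-boxing is compiled to Integer.valueOf(i), which may
            // return a cached instance; at the time of writing HotSpot
            // did not eliminate this allocation:
            // Integer boxed = i;
        }
        System.out.println(sum);
    }
}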
It seems like there's a lot of potential for scalar replacement, but HotSpot needs to do a better job of detecting cases where it can be applied safely. If nothing else, at least knowledge of the valueOf methods should be hardcoded into the system. I hope a more general solution is found, though, because other languages on the JVM may use different methods (as mentioned earlier, Scala uses methods in ScalaRunTime instead). It will also be interesting to see if the performance can get even closer to the no-allocation case. Since the feature is still in development, we can hope. :)
Update: The early access release of JDK 6 Update 14 also contains this feature.
Update 2: This feature is enabled by default since JDK 6 Update 23.
This is great news. In many applications a small performance loss doesn't really matter, but full stops do. Doing EA without stack allocation would really be a missed opportunity ;)
Hi,
Thanks for the info. Do you know if there are any plans to introduce stack allocation or scalar replacement in Java 7?
Domingos,
Since scalar replacement is already included in the JDK 7 builds, I would expect it to be available in Java 7. I don’t know if it will be enabled by default, but I hope so. I haven’t heard anything about stack allocation, so I have my doubts about that.
Michael,
GC pauses caused by full collections are indeed a problem, but I am not sure if stack allocation or scalar replacement would help here. Objects that do not escape would likely be collected from the young generation so they would not contribute to the longer pauses. Also note that my post was talking about scalar replacement instead of stack allocation. It’s still unclear how much of an advantage the latter presents over objects that are collected in the young generation.
Something worth testing in terms of reducing the time of GC pauses is the G1 Garbage Collector. That will be included in JDK7, but also in a JDK6 update from what I hear.
hi,
If I interpret your results correctly, in the first test one billion iterations take 0.4, 1 and 9.2 seconds respectively, so one iteration takes 0.4, 1 and 9.2 nanoseconds respectively.
That sounds short enough, considering the length of one clock cycle and whatever might be executed inside the loop itself.
Hi Miau,
The time one iteration takes is not particularly important. What matters is how long your user has to wait.
One billion iterations may sound like a lot, but a nested loop over 32,000 items is already past it (32,000 × 32,000 ≈ 1.02 billion inner iterations).
Also, it’s common in many algorithms (e.g. machine learning) for the innermost loops to have very simple operations.
Currently, people are careful not to allocate in such cases even if the resulting code is uglier (this is made worse by the fact that the JVM doesn't support struct-like types). If the JVM could make this unnecessary, that would only be a good thing.
Ismael
Ismael,
thanks for the answer.
I ran your test with -XX:+PrintAssembly, and EA enabled.
I got this, with most everything omitted:
6 b Test$::testNoAllocation (66 bytes)
Decoding compiled method 0xb419a508:
Code:
[Disassembling for mach='i386']
…
0xb419a6a0: mov %ebx,%edi
0xb419a6a2: add %ebx,%edi
0xb419a6a4: add %edi,%ecx
0xb419a6a6: inc %ebx
0xb419a6a7: add $0x3,%ecx
0xb419a6aa: cmp $0x3b9aca00,%ebx
0xb419a6b0: jl 0xb419a6a0
4 b Test$::testSimpleAllocation (73 bytes)
Decoding compiled method 0xb419ad48:
Code:
[Disassembling for mach='i386']
0xb419aea0: mov %ebp,%edi
0xb419aea2: add %ebp,%edi
0xb419aea4: add %edi,%ecx
0xb419aea6: inc %ebp
0xb419aea7: add $0x3,%ecx
0xb419aeaa: cmp $0x3b9aca00,%ebp
0xb419aeb0: jl 0xb419aea0
That is, both noAllocation and simpleAllocation come out to sum += 2*i + 3 (since baz(new C(i + 1, i + 2)) = (i + 1) + (i + 2) = 2*i + 3).
($0x3b9aca00 is one billion in decimal.)
The results stay the same, i.e. noAllocation is 2.5 times as fast as simpleAllocation. Is the extra time spent on the escape analysis inside HotSpot?
The results of -verbose:gc for the simpleAllocation version are quite fascinating with EA enabled versus disabled: zero GCs with EA enabled, and a whole lot with it disabled.
Interesting investigation, Miau. It's unclear why it would be that much slower if the generated code is the same. I would have thought that the overhead of escape analysis would go away after a few of the outer iterations.
Out of curiosity, what build did you use for your tests?
Regarding -verbose:gc, that is indeed one huge benefit of EA.
./java -server -version
java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b45)
Java HotSpot(TM) Server VM (build 14.0-b10, mixed mode)
and options:
-server
-verbose:gc
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintAssembly
-XX:+PrintCompilation
-XX:CICompilerCount=1
-Xbatch
-XX:MaxPermSize=256m
-Xms128m
-Xmx512m
-XX:+UseConcMarkSweepGC
-XX:+DoEscapeAnalysis
this is on linux, in virtualbox with 1 gigabyte of ram.
—
for completeness here’s simpleAllocation without EA enabled:
4 b Test$::testSimpleAllocation (73 bytes)
Decoding compiled method 0xb4262f48:
Code:
[Disassembling for mach='i386']
0xb4263210: mov 0x38(%ecx),%eax
0xb4263213: lea 0x10(%eax),%edx
0xb4263216: cmp 0x40(%ecx),%edx
0xb4263219: jae 0xb42632dd
0xb426321f: mov %edx,0x38(%ecx)
0xb4263222: prefetchnta 0x100(%edx)
0xb4263229: mov $0xa4322948,%edx ; {oop()}
0xb426322e: mov 0x64(%edx),%edx
0xb4263231: mov %edx,(%eax)
0xb4263233: movl $0xa4322948,0x4(%eax) ; {oop()}
0xb426323a: mov %esi,%edi
0xb426323c: inc %edi
0xb426323d: add $0x2,%esi
0xb4263240: mov %esi,0x8(%eax)
0xb4263243: mov %edi,0xc(%eax)
0xb4263246: mov 0x8(%eax),%esi
0xb4263249: add 0xc(%eax),%esi
0xb426324c: add %esi,%ebx
0xb426324e: cmp $0x3b9aca00,%edi
0xb4263254: jge 0xb426325a
0xb4263256: mov %edi,%esi
0xb4263258: jmp 0xb4263210
0xb426325a: mov $0x148,%ecx
—
0xb42632dd: mov %esi,0x18(%esp)
0xb42632e1: mov %ebx,0x14(%esp)
0xb42632e5: mov %ecx,0x10(%esp)
0xb42632e9: mov $0xa4322948,%ecx ; {oop()}
0xb42632ee: nop
0xb42632ef: call 0xb42608a0 ; OopMap{ebp=Oop off=628}
;*new ; – Test$::testSimpleAllocation@21 (line 15)
; {runtime_call}
0xb42632f4: mov 0x10(%esp),%ecx
0xb42632f8: mov 0x14(%esp),%ebx
0xb42632fc: mov 0x18(%esp),%esi
0xb4263300: jmp 0xb426323a
Somewhat more complex. Comparing it to the other one shows how scalar replacement does sweet stuff.
I am not quite sure about the beginning of the loop, but if I read it correctly it is bump-pointer allocation in a thread-local allocation buffer: it loads the allocation pointer from 0x38(%ecx), computes a new pointer 0x10 bytes further on, and compares it against the limit at 0x40(%ecx); if the buffer is exhausted it jumps to the runtime call for new C(), otherwise it stores the bumped pointer back and initialises the object in place.
—
The no-allocation example with one billion iterations comes out to an average of around 1 clock cycle per iteration for the 7-instruction loop, on Linux inside VirtualBox. That's awesomely impressive to me.
I figured it out.
The testNoAllocation assembly I posted was incomplete. The snippet from simpleAllocation is preceded by:
xor %ecx,%ecx
i.e. it sets the counter to zero, while the snippet from noAllocation comes with:
0xb3d9889c: mov $0x3b9aca00,%ebx
0xb3d988a1: sub %eax,%ebx
0xb3d988a3: and $0xfffffff0,%ebx
0xb3d988a6: add %eax,%ebx
0xb3d988a8: cmp %ebx,%eax
0xb3d988aa: jge 0xb3d988e1
0xb3d988ac: nop
0xb3d988ad: nop
0xb3d988ae: nop
0xb3d988af: nop ;*iload
; – Test$::testNoAllocation@18 (line 38)
0xb3d988b0: mov %eax,%ecx
0xb3d988b2: add %eax,%ecx ;*iadd
; – Test$::testNoAllocation@26 (line 38)
0xb3d988b4: add %ecx,%edi
0xb3d988b6: add %ecx,%edi
0xb3d988b8: add %ecx,%edi
0xb3d988ba: add %ecx,%edi
0xb3d988bc: add %ecx,%edi
0xb3d988be: add %ecx,%edi
0xb3d988c0: add %ecx,%edi
0xb3d988c2: add %ecx,%edi
0xb3d988c4: add %ecx,%edi
0xb3d988c6: add %ecx,%edi
0xb3d988c8: add %ecx,%edi
0xb3d988ca: add %ecx,%edi
0xb3d988cc: add %ecx,%edi
0xb3d988ce: add %ecx,%edi
0xb3d988d0: add %ecx,%edi
0xb3d988d2: add %ecx,%edi
0xb3d988d4: add $0x10,%eax ;*iadd
; – Test$::testNoAllocation@23 (line 38)
0xb3d988d7: add $0x120,%edi ;*iadd
; – Test$::testNoAllocation@30 (line 38)
0xb3d988dd: cmp %ebx,%eax
0xb3d988df: jl 0xb3d988b0
0xb3d988e1: cmp $0x3b9aca00,%eax
0xb3d988e7: jge 0xb3d98902
That is, the inner loop is unrolled 16 times and iterated until the counter is within 16 of one billion, and then it falls into the one-at-a-time loop.
0x120 is the sum of (2*i + 3) for i <- 0..15 (2·(0 + 1 + … + 15) + 16·3 = 240 + 48 = 288 = 0x120).
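In Java terms, my reading of the unrolled loop is roughly the following (a hand-written rendering of the assembly, not generated code):

class UnrolledLoopSketch {
    static int run() {
        int sum = 0;
        int i = 0;
        // Iterate in blocks of 16 until within 16 of one billion
        // (the mov $0x3b9aca00,%ebx; sub/and/add prologue).
        int limit = i + ((1000000000 - i) & ~15);
        while (i < limit) {
            int twice = i + i;            // mov %eax,%ecx; add %eax,%ecx
            for (int k = 0; k < 16; k++)
                sum += twice;             // the sixteen "add %ecx,%edi"
            i += 16;                      // add $0x10,%eax
            sum += 0x120;                 // add $0x120,%edi: the folded constants
        }
        while (i < 1000000000) {          // the one-at-a-time tail loop
            sum += 2 * i + 3;
            i += 1;
        }
        return sum;
    }
}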
The debug build of OpenJDK has the following additional parameters available:
-XX:+PrintEscapeAnalysis
-XX:+PrintEliminateAllocations
The latter is for scalar replacement.
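For example, a run against such a debug build might look like this (the exact invocation is my assumption):

java -XX:+DoEscapeAnalysis -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations Test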
Thanks for posting this, it’s very useful information. :)
So, if I understand correctly, you're saying that the loop unrolling optimisation was missed in the testSimpleAllocation case when scalar replacement was applied?
The real optimization would be for the compiler in both cases to realize that the result will be (10^9 * (10^9 + 2)) mod 2^32, and that it will be the same every time.
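(For reference, my arithmetic: each iteration adds (i + 1) + (i + 2) = 2i + 3 for i = 0..N-1, which sums to N(N - 1) + 3N = N(N + 2) with N = 10^9, and the 32-bit int result is that value mod 2^32.)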
The optimization of adding 288 to the sum for every 16 iterations in the no-allocation case seems like a more limited version of this.
Somewhere in the middle is the fact that doing add %ecx,%edi 16 times in a row is not the best way to multiply by 16 in assembler. :)
I am wildly guessing that the HotSpot C2 compiler could pick up on the loop unrolling issue in the simpleAllocation case if it did more passes on the code.
Overall, however, I think this puts the spotlight on the relationship between benchmark code and real-world code. The performance difference now seems to come from consolidating the calculation of i+i from 16 times to once, and from consolidating the 16 per-iteration constants into a single addition of 288. If similar optimizations can be done automatically in actual algorithms, then we should make sure HotSpot unrolls loops as aggressively when doing EliminateAllocations as when compiling the 'raw' loop; in the meantime we could consider writing uglier code for better performance. On the other hand, if the same kinds of optimizations as in the benchmark code can't be done automatically for actual algorithms, then we can start writing beautiful code as soon as Sun releases 1.6u14, happy in the knowledge that one more optimization has been moved from application code to the platform.
And then start waiting for speculative lock elision to show up in OpenJDK.
“I am wildly guessing that the hotspot c2 compiler could pick up on the loop unrolling issue in the simpleAllocation case if it did more passes on the code.”
Indeed, that was what I was thinking. This seems like the kind of performance bug that should be fixed once the basics of scalar replacement are in place (as opposed to an inherent problem with the scheme).
I also agree that it’s nice to be able to write more natural code and have it perform as well as the hand-optimised version.
A much more extensively vetted benchmark has been published at QCon ("Do locking optimizations work?"). We looked at EA and found that it offered limited benefits. Those benefits have been improved in the _12 release of Sun's JVM. The deeper danger is that the benchmark may be measuring biased locking. Biased locking gives a much larger boost than EA does. You need to make sure that biased locking is turned off.
oops, QCon should read InfoQ.
Hi Kirk,
I think you did not read the blog entry. I did not test locking at all. Escape analysis is used to generate information that can be used for various optimisations, and this blog entry only looked at object allocation. Furthermore, JDK 6 Update 12 is too old; you need a build with at least HotSpot 14 (JDK 6 Update 14 early access if you don't want to use a performance release, or JDK 7).
Best,
Ismael