<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Objects with no allocation overhead</title>
	<atom:link href="http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/</link>
	<description></description>
	<lastBuildDate>Thu, 11 Feb 2010 20:51:21 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: Ismael Juma</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-214</link>
		<dc:creator>Ismael Juma</dc:creator>
		<pubDate>Tue, 14 Apr 2009 17:05:16 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-214</guid>
		<description>Hi Kirk,

I think you did not read the blog entry. I did not test locking at all. Escape analysis is used to generate information that can be used for various optimisations. This blog entry only looked at object allocation. Furthermore, JDK 6 Update 12 is too old, you need a build with HS14 at least (JDK 6 Update 14 early access if you don&#039;t want to use a performance release or JDK 7).

Best,
Ismael</description>
		<content:encoded><![CDATA[<p>Hi Kirk,</p>
<p>I think you did not read the blog entry. I did not test locking at all. Escape analysis is used to generate information that can be used for various optimisations. This blog entry only looked at object allocation. Furthermore, JDK 6 Update 12 is too old, you need a build with HS14 at least (JDK 6 Update 14 early access if you don&#8217;t want to use a performance release or JDK 7).</p>
<p>Best,<br />
Ismael</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: KIrk</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-213</link>
		<dc:creator>KIrk</dc:creator>
		<pubDate>Tue, 14 Apr 2009 17:00:33 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-213</guid>
		<description>oops, QCon should read InfoQ.</description>
		<content:encoded><![CDATA[<p>oops, QCon should read InfoQ.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: KIrk</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-212</link>
		<dc:creator>KIrk</dc:creator>
		<pubDate>Tue, 14 Apr 2009 17:00:09 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-212</guid>
		<description>A much more extensively vetted benchmark has been published at QCon. (Do locking optimizations work) We looked at EA and found that it offered limited benefits. Those benefits have been improved in the _12 version of Sun&#039;s JVM. The deeper danger is that the benchmark may be measuring biased locking. Biased locking gives a much larger boost than EA does. You need to make sure that biased locking is turned off.</description>
		<content:encoded><![CDATA[<p>A much more extensively vetted benchmark has been published at QCon. (Do locking optimizations work) We looked at EA and found that it offered limited benefits. Those benefits have been improved in the _12 version of Sun&#8217;s JVM. The deeper danger is that the benchmark may be measuring biased locking. Biased locking gives a much larger boost than EA does. You need to make sure that biased locking is turned off.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ismael Juma</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-163</link>
		<dc:creator>Ismael Juma</dc:creator>
		<pubDate>Mon, 16 Mar 2009 18:49:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-163</guid>
		<description>&quot;I am wildly guessing that the hotspot c2 compiler could pick up on the loop unrolling issue in the simpleAllocation case if it did more passes on the code.&quot;

Indeed, that was what I was thinking. This seems like the kind of performance bug that should be fixed once the basics of scalar replacement are in place (as opposed to an inherent problem with the scheme).

I also agree that it&#039;s nice to be able to write more natural code and have it perform as well as the hand-optimised version.</description>
		<content:encoded><![CDATA[<p>&#8220;I am wildly guessing that the hotspot c2 compiler could pick up on the loop unrolling issue in the simpleAllocation case if it did more passes on the code.&#8221;</p>
<p>Indeed, that was what I was thinking. This seems like the kind of performance bug that should be fixed once the basics of scalar replacement are in place (as opposed to an inherent problem with the scheme).</p>
<p>I also agree that it&#8217;s nice to be able to write more natural code and have it perform as well as the hand-optimised version.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: miau</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-162</link>
		<dc:creator>miau</dc:creator>
		<pubDate>Mon, 16 Mar 2009 15:05:01 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-162</guid>
		<description>The real optimization would be for the compiler in both cases to realize that the result will be (10^9 * (2 + 10^9)) mod 2^32, and it will be the same every time.

The optimization of adding 288 to the sum for every 16 iterations in the no-allocation case seems like a more limited version of this.
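
To check the closed form, here is a minimal sketch in Java. It assumes the benchmark loop is sum += 2*i + 3 for i from 0 up to N, which is what the disassembly posted in this thread suggests; the class and method names are made up for illustration. Java int arithmetic wraps mod 2^32, so N * (N + 2) computed in int should agree with the loop result.

```java
public class ClosedFormCheck {
    // Straightforward loop: sum of (2*i + 3) for i in [0, n), wrapping on int overflow.
    // Assumes n is not negative.
    public static int loopSum(int n) {
        int sum = 0;
        for (int i = 0; i != n; i++) {
            sum += 2 * i + 3;
        }
        return sum;
    }

    // Closed form n * (n + 2); int multiplication wraps mod 2^32, matching the loop.
    public static int closedForm(int n) {
        return n * (n + 2);
    }

    public static void main(String[] args) {
        // Small case with no overflow.
        System.out.println(loopSum(1000) + " " + closedForm(1000));
        // Larger case where both sides overflow int and wrap the same way.
        System.out.println(loopSum(100000) == closedForm(100000));
    }
}
```

For N = 1000 both methods give 1002000, and the second line prints true because the loop and the closed form wrap identically.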

Somewhere in the middle is the fact that doing add %ecx,%edi 16 times in a row is not the best way to multiply by 16 in assembler. :)

I am wildly guessing that the hotspot c2 compiler could pick up on the loop unrolling issue in the simpleAllocation case if it did more passes on the code.

Overall, however, I think this puts emphasis on the relationship between the benchmark code and real-world code. The performance difference now seems to come from consolidating the calculation of i+i from 16 times to once, and consolidating 16 iterations of (2*(i%16)+3) into a single addition of 288. If similar optimizations can be done automatically in actual algorithms, then we should make sure HotSpot unrolls loops as aggressively when doing EliminateAllocations as when compiling the &#039;raw&#039; loop. In the meantime we could consider writing uglier code for better performance. On the other hand, if the same kinds of optimizations as in the benchmark code can&#039;t be done automatically for actual algorithms, then we can start writing beautiful code as soon as Sun releases 1.6u14, happy in the knowledge that one more optimization has been moved from application code to the platform.

and then start waiting for Speculative Lock Elision to show up in openjdk.</description>
		<content:encoded><![CDATA[<p>The real optimization would be for the compiler in both cases to realize that the result will be (10^9 * (2 + 10^9)) mod 2^32, and it will be the same every time.</p>
<p>The optimization of adding 288 to the sum for every 16 iterations in the no-allocation case seems like a more limited version of this.</p>
<p>Somewhere in the middle is the fact that doing add %ecx,%edi 16 times in a row is not the best way to multiply by 16 in assembler. :)</p>
<p>I am wildly guessing that the hotspot c2 compiler could pick up on the loop unrolling issue in the simpleAllocation case if it did more passes on the code.</p>
<p>Overall, however, I think this puts emphasis on the relationship between the benchmark code and real-world code. The performance difference now seems to come from consolidating the calculation of i+i from 16 times to once, and consolidating 16 iterations of (2*(i%16)+3) into a single addition of 288. If similar optimizations can be done automatically in actual algorithms, then we should make sure HotSpot unrolls loops as aggressively when doing EliminateAllocations as when compiling the &#8216;raw&#8217; loop. In the meantime we could consider writing uglier code for better performance. On the other hand, if the same kinds of optimizations as in the benchmark code can&#8217;t be done automatically for actual algorithms, then we can start writing beautiful code as soon as Sun releases 1.6u14, happy in the knowledge that one more optimization has been moved from application code to the platform.</p>
<p>and then start waiting for Speculative Lock Elision to show up in openjdk.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ismael Juma</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-158</link>
		<dc:creator>Ismael Juma</dc:creator>
		<pubDate>Mon, 16 Mar 2009 10:20:43 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-158</guid>
		<description>Thanks for posting this, it&#039;s very useful information. :)

So, if I understand correctly you&#039;re saying that the loop unrolling optimisation was missed by the testSimpleAllocation case with scalar replacement?</description>
		<content:encoded><![CDATA[<p>Thanks for posting this, it&#8217;s very useful information. :)</p>
<p>So, if I understand correctly you&#8217;re saying that the loop unrolling optimisation was missed by the testSimpleAllocation case with scalar replacement?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: miau</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-157</link>
		<dc:creator>miau</dc:creator>
		<pubDate>Mon, 16 Mar 2009 10:16:27 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-157</guid>
		<description>I figured it out.

the testNoAllocation assembly I posted was incomplete,
the snippet from simpleAllocation is preceded by:
xor    %ecx,%ecx
i.e. set the counter to zero, while the snippet from noAllocation comes with:
  0xb3d9889c: mov    $0x3b9aca00,%ebx
  0xb3d988a1: sub    %eax,%ebx
  0xb3d988a3: and    $0xfffffff0,%ebx
  0xb3d988a6: add    %eax,%ebx
  0xb3d988a8: cmp    %ebx,%eax
  0xb3d988aa: jge    0xb3d988e1
  0xb3d988ac: nop    
  0xb3d988ad: nop    
  0xb3d988ae: nop    
  0xb3d988af: nop                       ;*iload
                                        ; - Test$::testNoAllocation@18 (line 38)
  0xb3d988b0: mov    %eax,%ecx
  0xb3d988b2: add    %eax,%ecx          ;*iadd
                                        ; - Test$::testNoAllocation@26 (line 38)
  0xb3d988b4: add    %ecx,%edi
  0xb3d988b6: add    %ecx,%edi
  0xb3d988b8: add    %ecx,%edi
  0xb3d988ba: add    %ecx,%edi
  0xb3d988bc: add    %ecx,%edi
  0xb3d988be: add    %ecx,%edi
  0xb3d988c0: add    %ecx,%edi
  0xb3d988c2: add    %ecx,%edi
  0xb3d988c4: add    %ecx,%edi
  0xb3d988c6: add    %ecx,%edi
  0xb3d988c8: add    %ecx,%edi
  0xb3d988ca: add    %ecx,%edi
  0xb3d988cc: add    %ecx,%edi
  0xb3d988ce: add    %ecx,%edi
  0xb3d988d0: add    %ecx,%edi
  0xb3d988d2: add    %ecx,%edi
  0xb3d988d4: add    $0x10,%eax         ;*iadd
                                        ; - Test$::testNoAllocation@23 (line 38)
  0xb3d988d7: add    $0x120,%edi        ;*iadd
                                        ; - Test$::testNoAllocation@30 (line 38)
  0xb3d988dd: cmp    %ebx,%eax
  0xb3d988df: jl     0xb3d988b0
  0xb3d988e1: cmp    $0x3b9aca00,%eax
  0xb3d988e7: jge    0xb3d98902

that is, the inner loop is unrolled 16 times, and iterated until the counter is within 16 of one billion, and then it goes into the one-at-a-time loop.
0x120 is the sum of (2*i+3) for i &lt;- 0..15
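
The arithmetic behind the unrolled body can be sketched in Java. This is only an illustration with made-up names, not the benchmark source: one method sums (2*i + 3) over 16 consecutive iterations one at a time, the other mirrors the assembly above (16 adds of i0 + i0, then the constant 0x120).

```java
public class UnrollCheck {
    // Sum of (2*i + 3) over 16 consecutive iterations starting at i0, one at a time.
    public static int naive(int i0) {
        int sum = 0;
        for (int k = 0; k != 16; k++) {
            int i = i0 + k;
            sum += 2 * i + 3;
        }
        return sum;
    }

    // What the unrolled loop body computes: 16 adds of (i0 + i0), then the constant 0x120.
    public static int unrolled(int i0) {
        int twice = i0 + i0;
        int sum = 0;
        for (int k = 0; k != 16; k++) {
            sum += twice;
        }
        return sum + 0x120;
    }

    public static void main(String[] args) {
        // With i0 = 0 only the constant part remains: 288, i.e. 0x120.
        System.out.println(naive(0) == 0x120);
        // For any other start the two versions should agree as well.
        System.out.println(naive(32) == unrolled(32));
    }
}
```

Both lines print true: the per-iteration offsets 2*k + 3 for k from 0 to 15 add up to 288, and the remaining part is 16 copies of 2*i0.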

the debug build of openjdk has the following additional parameters available:
-XX:+PrintEscapeAnalysis
-XX:+PrintEliminateAllocations
the latter of which is for scalar replacement.</description>
		<content:encoded><![CDATA[<p>I figured it out.</p>
<p>the testNoAllocation assembly I posted was incomplete,<br />
the snippet from simpleAllocation is preceded by:<br />
xor    %ecx,%ecx<br />
i.e. set the counter to zero, while the snippet from noAllocation comes with:<br />
  0xb3d9889c: mov    $0x3b9aca00,%ebx<br />
  0xb3d988a1: sub    %eax,%ebx<br />
  0xb3d988a3: and    $0xfffffff0,%ebx<br />
  0xb3d988a6: add    %eax,%ebx<br />
  0xb3d988a8: cmp    %ebx,%eax<br />
  0xb3d988aa: jge    0xb3d988e1<br />
  0xb3d988ac: nop<br />
  0xb3d988ad: nop<br />
  0xb3d988ae: nop<br />
  0xb3d988af: nop                       ;*iload<br />
                                        ; - Test$::testNoAllocation@18 (line 38)<br />
  0xb3d988b0: mov    %eax,%ecx<br />
  0xb3d988b2: add    %eax,%ecx          ;*iadd<br />
                                        ; - Test$::testNoAllocation@26 (line 38)<br />
  0xb3d988b4: add    %ecx,%edi<br />
  0xb3d988b6: add    %ecx,%edi<br />
  0xb3d988b8: add    %ecx,%edi<br />
  0xb3d988ba: add    %ecx,%edi<br />
  0xb3d988bc: add    %ecx,%edi<br />
  0xb3d988be: add    %ecx,%edi<br />
  0xb3d988c0: add    %ecx,%edi<br />
  0xb3d988c2: add    %ecx,%edi<br />
  0xb3d988c4: add    %ecx,%edi<br />
  0xb3d988c6: add    %ecx,%edi<br />
  0xb3d988c8: add    %ecx,%edi<br />
  0xb3d988ca: add    %ecx,%edi<br />
  0xb3d988cc: add    %ecx,%edi<br />
  0xb3d988ce: add    %ecx,%edi<br />
  0xb3d988d0: add    %ecx,%edi<br />
  0xb3d988d2: add    %ecx,%edi<br />
  0xb3d988d4: add    $0x10,%eax         ;*iadd<br />
                                        ; - Test$::testNoAllocation@23 (line 38)<br />
  0xb3d988d7: add    $0x120,%edi        ;*iadd<br />
                                        ; - Test$::testNoAllocation@30 (line 38)<br />
  0xb3d988dd: cmp    %ebx,%eax<br />
  0xb3d988df: jl     0xb3d988b0<br />
  0xb3d988e1: cmp    $0x3b9aca00,%eax<br />
  0xb3d988e7: jge    0xb3d98902</p>
<p>that is, the inner loop is unrolled 16 times, and iterated until the counter is within 16 of one billion, and then it goes into the one-at-a-time loop.<br />
0x120 is the sum of (2*i+3) for i &lt;- 0..15</p>
<p>the debug build of openjdk has the following additional parameters available:<br />
-XX:+PrintEscapeAnalysis<br />
-XX:+PrintEliminateAllocations<br />
the latter of which is for scalar replacement.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: miau</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-145</link>
		<dc:creator>miau</dc:creator>
		<pubDate>Thu, 12 Mar 2009 07:15:26 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-145</guid>
		<description>./java -server -version
java version &quot;1.7.0-ea&quot;
Java(TM) SE Runtime Environment (build 1.7.0-ea-b45)
Java HotSpot(TM) Server VM (build 14.0-b10, mixed mode)

and options:

-server 
-verbose:gc
-XX:+UnlockDiagnosticVMOptions 
-XX:+PrintAssembly
-XX:+PrintCompilation
-XX:CICompilerCount=1
-Xbatch
-XX:MaxPermSize=256m
-Xms128m 
-Xmx512m 
-XX:+UseConcMarkSweepGC 
-XX:+DoEscapeAnalysis

this is on linux, in virtualbox with 1 gigabyte of ram.

--

for completeness here&#039;s simpleAllocation without EA enabled:

  4   b   Test$::testSimpleAllocation (73 bytes)
Decoding compiled method 0xb4262f48:
Code:
[Disassembling for mach=&#039;i386&#039;]

  0xb4263210: mov    0x38(%ecx),%eax
  0xb4263213: lea    0x10(%eax),%edx
  0xb4263216: cmp    0x40(%ecx),%edx
  0xb4263219: jae    0xb42632dd
  0xb426321f: mov    %edx,0x38(%ecx)
  0xb4263222: prefetchnta 0x100(%edx)
  0xb4263229: mov    $0xa4322948,%edx   ;   {oop()}
  0xb426322e: mov    0x64(%edx),%edx
  0xb4263231: mov    %edx,(%eax)
  0xb4263233: movl   $0xa4322948,0x4(%eax)  ;   {oop()}
  0xb426323a: mov    %esi,%edi
  0xb426323c: inc    %edi
  0xb426323d: add    $0x2,%esi
  0xb4263240: mov    %esi,0x8(%eax)
  0xb4263243: mov    %edi,0xc(%eax)
  0xb4263246: mov    0x8(%eax),%esi
  0xb4263249: add    0xc(%eax),%esi
  0xb426324c: add    %esi,%ebx
  0xb426324e: cmp    $0x3b9aca00,%edi
  0xb4263254: jge    0xb426325a
  0xb4263256: mov    %edi,%esi
  0xb4263258: jmp    0xb4263210
  0xb426325a: mov    $0x148,%ecx

---

  0xb42632dd: mov    %esi,0x18(%esp)
  0xb42632e1: mov    %ebx,0x14(%esp)
  0xb42632e5: mov    %ecx,0x10(%esp)
  0xb42632e9: mov    $0xa4322948,%ecx   ;   {oop()}
  0xb42632ee: nop    
  0xb42632ef: call   0xb42608a0         ; OopMap{ebp=Oop off=628}
                                        ;*new  ; - Test$::testSimpleAllocation@21 (line 15)
                                        ;   {runtime_call}
  0xb42632f4: mov    0x10(%esp),%ecx
  0xb42632f8: mov    0x14(%esp),%ebx
  0xb42632fc: mov    0x18(%esp),%esi
  0xb4263300: jmp    0xb426323a

somewhat more complex.  comparing it to the other one shows how scalar replacement does sweet stuff.  

I am not quite sure what the beginning of the loop does.  if I read correctly it checks if the 16-bit pointer at ecx+38 is the same as ecx+40, if they are not it calls new C(), otherwise it puts the address at ecx+40 in ecx+38, i.e. reuses the previous C.
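
One possible reading of the start of the loop, offered as an assumption rather than a confirmed fact: it looks like bump-pointer allocation from a thread-local buffer. The value at 0x38(%ecx) would be the current top of the buffer, the value at 0x40(%ecx) its end, and 0x10 the object size; the jump to the runtime call is taken only when the buffer is exhausted. A minimal sketch in Java, with hypothetical names and made-up addresses:

```java
public class BumpAllocator {
    private long top;          // next free address (assumed meaning of the value at 0x38(%ecx))
    private final long end;    // end of the buffer (assumed meaning of the value at 0x40(%ecx))
    private int slowPathCalls; // how often the fast path failed

    public BumpAllocator(long start, long end) {
        this.top = start;
        this.end = end;
    }

    // Returns the address of the new object, or -1 when the fast path fails
    // and a real VM would call into the runtime instead.
    public long allocate(long size) {
        long newTop = top + size;   // lea 0x10(%eax),%edx
        if (newTop >= end) {        // cmp 0x40(%ecx),%edx then jae to the slow path
            slowPathCalls++;
            return -1;
        }
        long obj = top;
        top = newTop;               // mov %edx,0x38(%ecx)
        return obj;
    }

    public int slowPathCalls() {
        return slowPathCalls;
    }

    public static void main(String[] args) {
        BumpAllocator buffer = new BumpAllocator(0, 64);
        for (int n = 0; n != 4; n++) {
            System.out.println(buffer.allocate(16));
        }
    }
}
```

With a 64-byte buffer and 16-byte objects this prints 0, 16, 32 and then -1: the fourth request would reach the end of the buffer, so it takes the slow path. Under this reading no previous object is reused; each iteration gets fresh memory until the buffer runs out.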

--

the no-allocation example with one billion iterations comes out to an average of around 1 clock cycle per iteration for the 7 instruction loop, on linux inside virtualbox.  that&#039;s awesomely impressive to me.</description>
		<content:encoded><![CDATA[<p>./java -server -version<br />
java version &#8220;1.7.0-ea&#8221;<br />
Java(TM) SE Runtime Environment (build 1.7.0-ea-b45)<br />
Java HotSpot(TM) Server VM (build 14.0-b10, mixed mode)</p>
<p>and options:</p>
<p>-server<br />
-verbose:gc<br />
-XX:+UnlockDiagnosticVMOptions<br />
-XX:+PrintAssembly<br />
-XX:+PrintCompilation<br />
-XX:CICompilerCount=1<br />
-Xbatch<br />
-XX:MaxPermSize=256m<br />
-Xms128m<br />
-Xmx512m<br />
-XX:+UseConcMarkSweepGC<br />
-XX:+DoEscapeAnalysis</p>
<p>this is on linux, in virtualbox with 1 gigabyte of ram.</p>
<p>&#8211;</p>
<p>for completeness here&#8217;s simpleAllocation without EA enabled:</p>
<p>  4   b   Test$::testSimpleAllocation (73 bytes)<br />
Decoding compiled method 0xb4262f48:<br />
Code:<br />
[Disassembling for mach='i386']</p>
<p>  0xb4263210: mov    0x38(%ecx),%eax<br />
  0xb4263213: lea    0x10(%eax),%edx<br />
  0xb4263216: cmp    0x40(%ecx),%edx<br />
  0xb4263219: jae    0xb42632dd<br />
  0xb426321f: mov    %edx,0x38(%ecx)<br />
  0xb4263222: prefetchnta 0x100(%edx)<br />
  0xb4263229: mov    $0xa4322948,%edx   ;   {oop()}<br />
  0xb426322e: mov    0x64(%edx),%edx<br />
  0xb4263231: mov    %edx,(%eax)<br />
  0xb4263233: movl   $0xa4322948,0x4(%eax)  ;   {oop()}<br />
  0xb426323a: mov    %esi,%edi<br />
  0xb426323c: inc    %edi<br />
  0xb426323d: add    $0x2,%esi<br />
  0xb4263240: mov    %esi,0x8(%eax)<br />
  0xb4263243: mov    %edi,0xc(%eax)<br />
  0xb4263246: mov    0x8(%eax),%esi<br />
  0xb4263249: add    0xc(%eax),%esi<br />
  0xb426324c: add    %esi,%ebx<br />
  0xb426324e: cmp    $0x3b9aca00,%edi<br />
  0xb4263254: jge    0xb426325a<br />
  0xb4263256: mov    %edi,%esi<br />
  0xb4263258: jmp    0xb4263210<br />
  0xb426325a: mov    $0x148,%ecx</p>
<p>&#8212;</p>
<p>  0xb42632dd: mov    %esi,0x18(%esp)<br />
  0xb42632e1: mov    %ebx,0x14(%esp)<br />
  0xb42632e5: mov    %ecx,0x10(%esp)<br />
  0xb42632e9: mov    $0xa4322948,%ecx   ;   {oop()}<br />
  0xb42632ee: nop<br />
  0xb42632ef: call   0xb42608a0         ; OopMap{ebp=Oop off=628}<br />
                                        ;*new  ; - Test$::testSimpleAllocation@21 (line 15)<br />
                                        ;   {runtime_call}<br />
  0xb42632f4: mov    0x10(%esp),%ecx<br />
  0xb42632f8: mov    0x14(%esp),%ebx<br />
  0xb42632fc: mov    0x18(%esp),%esi<br />
  0xb4263300: jmp    0xb426323a</p>
<p>somewhat more complex.  comparing it to the other one shows how scalar replacement does sweet stuff.  </p>
<p>I am not quite sure what the beginning of the loop does.  if I read correctly it checks if the 16-bit pointer at ecx+38 is the same as ecx+40, if they are not it calls new C(), otherwise it puts the address at ecx+40 in ecx+38, i.e. reuses the previous C.</p>
<p>&#8211;</p>
<p>the no-allocation example with one billion iterations comes out to an average of around 1 clock cycle per iteration for the 7 instruction loop, on linux inside virtualbox.  that&#8217;s awesomely impressive to me.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ismael Juma</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-143</link>
		<dc:creator>Ismael Juma</dc:creator>
		<pubDate>Wed, 11 Mar 2009 23:06:48 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-143</guid>
		<description>Interesting investigation Miau. It&#039;s unclear why it would be that much slower if the generated code is the same. I would have thought that the overhead of Escape Analysis would go away after a few of the outer iterations.

Out of curiosity, what build did you use for your tests?

Regarding -verbose:gc, that is indeed one huge benefit of EA.</description>
		<content:encoded><![CDATA[<p>Interesting investigation Miau. It&#8217;s unclear why it would be that much slower if the generated code is the same. I would have thought that the overhead of Escape Analysis would go away after a few of the outer iterations.</p>
<p>Out of curiosity, what build did you use for your tests?</p>
<p>Regarding -verbose:gc, that is indeed one huge benefit of EA.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: miau</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-142</link>
		<dc:creator>miau</dc:creator>
		<pubDate>Wed, 11 Mar 2009 22:38:27 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-142</guid>
		<description>Ismael, 
thanks for the answer.

I ran your test with -XX:+PrintAssembly, and EA enabled.

I got this, with most everything omitted:

  6   b   Test$::testNoAllocation (66 bytes)
Decoding compiled method 0xb419a508:
Code:
[Disassembling for mach=&#039;i386&#039;]
...
  0xb419a6a0: mov    %ebx,%edi
  0xb419a6a2: add    %ebx,%edi
  0xb419a6a4: add    %edi,%ecx
  0xb419a6a6: inc    %ebx
  0xb419a6a7: add    $0x3,%ecx
  0xb419a6aa: cmp    $0x3b9aca00,%ebx
  0xb419a6b0: jl     0xb419a6a0

  4   b   Test$::testSimpleAllocation (73 bytes)
Decoding compiled method 0xb419ad48:
Code:
[Disassembling for mach=&#039;i386&#039;]

  0xb419aea0: mov    %ebp,%edi
  0xb419aea2: add    %ebp,%edi
  0xb419aea4: add    %edi,%ecx
  0xb419aea6: inc    %ebp
  0xb419aea7: add    $0x3,%ecx
  0xb419aeaa: cmp    $0x3b9aca00,%ebp
  0xb419aeb0: jl     0xb419aea0

that is, both noAllocation and simpleAllocation come out to sum+=(2*i+3).
($0x3b9aca00 is 1billion in decimal.)
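
For readers without the original source at hand, the two loops can be sketched in Java. This is a hypothetical reconstruction based only on the assembly above (the real benchmark is not shown in this thread); the Pair class, the field names and the method names are assumptions made for illustration.

```java
public class ScalarReplacementSketch {
    // Small holder object that never escapes the loop body.
    static final class Pair {
        final int a;
        final int b;

        Pair(int a, int b) {
            this.a = a;
            this.b = b;
        }
    }

    // Allocates a short-lived object per iteration; with escape analysis and
    // scalar replacement the JIT may keep a and b in registers instead.
    public static int simpleAllocation(int n) {
        int sum = 0;
        for (int i = 0; i != n; i++) {
            Pair p = new Pair(i + 2, i + 1);
            sum += p.a + p.b;
        }
        return sum;
    }

    // Hand-written version without the object: the same sum += 2*i + 3.
    public static int noAllocation(int n) {
        int sum = 0;
        for (int i = 0; i != n; i++) {
            sum += (i + 2) + (i + 1);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(simpleAllocation(1000) == noAllocation(1000));
    }
}
```

Both methods return the same value for any n that is not negative, so the program prints true; whether the allocation is actually eliminated depends on the JVM build and flags.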

the results stay the same, i.e. noAllocation is 2.5 times as fast as simpleAllocation.  is the extra time spent on the Escape Analysis inside HotSpot?

The results of -verbose:gc are quite fascinating, in the simpleAllocation version, with EA enabled and disabled. that is, zero gc&#039;s for EA enabled and a whole lot for EA disabled.</description>
		<content:encoded><![CDATA[<p>Ismael,<br />
thanks for the answer.</p>
<p>I ran your test with -XX:+PrintAssembly, and EA enabled.</p>
<p>I got this, with most everything omitted:</p>
<p>  6   b   Test$::testNoAllocation (66 bytes)<br />
Decoding compiled method 0xb419a508:<br />
Code:<br />
[Disassembling for mach='i386']<br />
&#8230;<br />
  0xb419a6a0: mov    %ebx,%edi<br />
  0xb419a6a2: add    %ebx,%edi<br />
  0xb419a6a4: add    %edi,%ecx<br />
  0xb419a6a6: inc    %ebx<br />
  0xb419a6a7: add    $0x3,%ecx<br />
  0xb419a6aa: cmp    $0x3b9aca00,%ebx<br />
  0xb419a6b0: jl     0xb419a6a0</p>
<p>  4   b   Test$::testSimpleAllocation (73 bytes)<br />
Decoding compiled method 0xb419ad48:<br />
Code:<br />
[Disassembling for mach='i386']</p>
<p>  0xb419aea0: mov    %ebp,%edi<br />
  0xb419aea2: add    %ebp,%edi<br />
  0xb419aea4: add    %edi,%ecx<br />
  0xb419aea6: inc    %ebp<br />
  0xb419aea7: add    $0x3,%ecx<br />
  0xb419aeaa: cmp    $0x3b9aca00,%ebp<br />
  0xb419aeb0: jl     0xb419aea0</p>
<p>that is, both noAllocation and simpleAllocation come out to sum+=(2*i+3).<br />
($0x3b9aca00 is 1billion in decimal.)</p>
<p>the results stay the same, i.e. noAllocation is 2.5 times as fast as simpleAllocation.  is the extra time spent on the Escape Analysis inside HotSpot?</p>
<p>The results of -verbose:gc are quite fascinating, in the simpleAllocation version, with EA enabled and disabled. that is, zero gc&#8217;s for EA enabled and a whole lot for EA disabled.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ismael Juma</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-141</link>
		<dc:creator>Ismael Juma</dc:creator>
		<pubDate>Wed, 11 Mar 2009 08:54:18 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-141</guid>
		<description>Hi Miau,

The time one iteration takes is not particularly important. What matters is how long your user has to wait.

One billion iterations may sound like a lot, but if you have a nested loop with 32000 items you&#039;re already past it.

Also, it&#039;s common in many algorithms (e.g. machine learning) for the innermost loops to have very simple operations.

Currently, people are careful not to allocate in such cases even if the resulting code is uglier (this is made worse since the JVM doesn&#039;t support struct-like structures). If the JVM could make this unnecessary, that can only be a good thing.

Ismael</description>
		<content:encoded><![CDATA[<p>Hi Miau,</p>
<p>The time one iteration takes is not particularly important. What matters is how long your user has to wait.</p>
<p>One billion iterations may sound like a lot, but if you have a nested loop with 32000 items you&#8217;re already past it.</p>
<p>Also, it&#8217;s common in many algorithms (e.g. machine learning) for the innermost loops to have very simple operations.</p>
<p>Currently, people are careful not to allocate in such cases even if the resulting code is uglier (this is made worse since the JVM doesn&#8217;t support struct-like structures). If the JVM could make this unnecessary, that can only be a good thing.</p>
<p>Ismael</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: miau</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-140</link>
		<dc:creator>miau</dc:creator>
		<pubDate>Wed, 11 Mar 2009 08:10:14 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-140</guid>
		<description>hi,

if I interpret your results correctly, in the first one, one billion iterations takes 0.4, 1 and 9.2 seconds respectively.  one iteration then takes 0.4, 1 and 9.2 nanoseconds respectively.

sounds short enough, considering the length of one clock cycle, and whatever might be executed inside the loop itself.</description>
		<content:encoded><![CDATA[<p>hi,</p>
<p>if I interpret your results correctly, in the first one, one billion iterations takes 0.4, 1 and 9.2 seconds respectively.  one iteration then takes 0.4, 1 and 9.2 nanoseconds respectively.</p>
<p>sounds short enough, considering the length of one clock cycle, and whatever might be executed inside the loop itself.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ismael Juma</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-82</link>
		<dc:creator>Ismael Juma</dc:creator>
		<pubDate>Wed, 17 Dec 2008 20:31:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-82</guid>
		<description>Domingos,

Since scalar replacement is already included in the JDK 7 builds, I would expect it to be available in Java 7. I don&#039;t know if it will be enabled by default, but I hope so. I haven&#039;t heard anything about stack allocation, so I have my doubts about that.

Michael,

GC pauses caused by full collections are indeed a problem, but I am not sure if stack allocation or scalar replacement would help here. Objects that do not escape would likely be collected from the young generation so they would not contribute to the longer pauses. Also note that my post was talking about scalar replacement instead of stack allocation. It&#039;s still unclear how much of an advantage the latter presents over objects that are collected in the young generation.

Something worth testing in terms of reducing the time of GC pauses is the G1 Garbage Collector. That will be included in JDK7, but also in a JDK6 update from what I hear.</description>
		<content:encoded><![CDATA[<p>Domingos,</p>
<p>Since scalar replacement is already included in the JDK 7 builds, I would expect it to be available in Java 7. I don&#8217;t know if it will be enabled by default, but I hope so. I haven&#8217;t heard anything about stack allocation, so I have my doubts about that.</p>
<p>Michael,</p>
<p>GC pauses caused by full collections are indeed a problem, but I am not sure if stack allocation or scalar replacement would help here. Objects that do not escape would likely be collected from the young generation so they would not contribute to the longer pauses. Also note that my post was talking about scalar replacement instead of stack allocation. It&#8217;s still unclear how much of an advantage the latter presents over objects that are collected in the young generation.</p>
<p>Something worth testing in terms of reducing the time of GC pauses is the G1 Garbage Collector. That will be included in JDK7, but also in a JDK6 update from what I hear.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Domingos Neto</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-81</link>
		<dc:creator>Domingos Neto</dc:creator>
		<pubDate>Wed, 17 Dec 2008 20:17:17 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-81</guid>
		<description>Hi,

Thanks for the info.  Do you know if there are any plans to introduce stack allocation or scalar replacement in Java 7?</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>Thanks for the info.  Do you know if there are any plans to introduce stack allocation or scalar replacement in Java 7?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mbien</title>
		<link>http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/#comment-77</link>
		<dc:creator>mbien</dc:creator>
		<pubDate>Wed, 17 Dec 2008 10:41:31 +0000</pubDate>
		<guid isPermaLink="false">http://blog.juma.me.uk/?p=47#comment-77</guid>
		<description>This is great news. In many applications a small performance loss really does not matter, but full stops do. Doing EA without stack allocation would really be a missed opportunity ;)</description>
		<content:encoded><![CDATA[<p>This is great news. In many applications a small performance loss really does not matter, but full stops do. Doing EA without stack allocation would really be a missed opportunity ;)</p>
]]></content:encoded>
	</item>
</channel>
</rss>
