Commons Math includes (in trunk and MATH_2_X branch) a FastMath class that is described as a “Faster, more accurate, portable alternative to StrictMath” and is implemented fully in Java (unlike java.lang.Math and java.lang.StrictMath). I am generally interested in faster math operations so I decided to investigate.
The first thing I checked was whether FastMath delegates to java.lang.Math or java.lang.StrictMath for any of its methods. It delegates to java.lang.Math for sqrt and random and to StrictMath for IEEEremainder. I found this interesting because I know that HotSpot includes intrinsic methods for many methods in java.lang.Math. Looking at vmSymbols.hpp, we can see that the following methods have intrinsics: abs, sin, cos, tan, atan2, sqrt, log, log10, pow, exp, min and max. Intrinsic methods are usually highly optimised (often involving assembly and sometimes CPU-specific instructions) and do not incur JNI overhead. I was interested in comparing all of FastMath’s methods with java.lang.Math and java.lang.StrictMath, but I was particularly interested in these ones.
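For illustration, delegation to an intrinsic-backed target looks roughly like this (a sketch of the pattern, not FastMath’s actual source):

// Sketch of the delegation pattern: sqrt is handed straight to
// java.lang.Math, whose HotSpot intrinsic typically compiles down to a
// single hardware instruction (sqrtsd on x86), with no JNI overhead.
public final class DelegatingMath {
    public static double sqrt(final double a) {
        return Math.sqrt(a);
    }

    public static void main(String[] args) {
        System.out.println(DelegatingMath.sqrt(2.0));  // 1.4142135623730951
    }
}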
Bill Rossi (the original author of FastMath) pointed me to a performance test in SVN, so I decided to run it (r1072209 from SVN trunk) on an Intel Core i7 860 CPU @ 2.80GHz using JDK 6 Update 25 early access release build 1 (includes HotSpot 20) and JDK 7 build 130 (includes HotSpot 21). I originally ran the test as it is in Commons Math (with one addition), but it contained a few common micro-benchmarking mistakes (thanks to Aleksey for pointing them out in the comments). I fixed the issues (see here for the modified code) and ran the test again.
Since the relative results for the two JDKs did not differ much, I am only posting the charts for JDK 6 Update 25 early access release. The results for both JDKs are available in table form here. Note that the results are in milliseconds, so lower is better.
java.lang.Math does quite a bit better on tan, abs and log while FastMath does better on exp and cosh. The rest are similar.
FastMath does very well on hypot, pow, asin, acos, cbrt, sinh and tanh, while java.lang.Math does well on log10 and log1p. java.lang.Math does so badly on hypot that it’s worth mentioning again: it looks like it needs a bit of work.
As stated above, the results for JDK 7 are very similar with a few small improvements. Methods with an execution time reduction of 15% or higher were pow (202ms instead of 238ms), powII (125ms instead of 149ms) and atan (53ms instead of 66ms).
The usual disclaimers about benchmarks apply. The benchmark provided by commons-math is a good start, but I believe it should be fleshed out more so that common cases are measured separately (e.g. pow(x, 2) is much more common than the ones used in the test). Having said that, FastMath is a nice contribution and it seems like some methods in the JDK could be replaced by the ones in FastMath for better results. It also seems like FastMath could do better by delegating to the JDK for more of its methods.
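As a sketch of what I mean by measuring a common case separately, the following times the pow(x, 2) case on its own (class name and iteration counts are made up for illustration; the volatile sink and the data dependency on x guard against the dead-code elimination discussed in the comments below):

public final class PowSpecialCaseBench {
    static volatile double sink;  // consumed result keeps the loop alive

    public static void main(String[] args) {
        for (int run = 0; run < 10; run++) {  // repeat so the JIT settles
            long t = System.nanoTime();
            double x = 0;
            for (int i = 0; i < 10000000; i++) {
                x += Math.pow(1.0 + i / 1e9, 2);  // the common pow(x, 2) case
            }
            sink = x;
            System.out.println("pow(x, 2): "
                    + (System.nanoTime() - t) / 1000000 + " ms");
        }
    }
}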
Update: Some context for the contribution of FastMath to Apache Commons Math can be found here.
Update 2: I’ve updated the charts to use a modified test instead of the one in Commons Math, as the latter had issues. I had mentioned some suspicious scores in the original post; thanks to Aleksey for pointing out the issues in the comments (they are common micro-benchmarking mistakes).
I’ve noticed that Math.log and Math.exp got a lot faster somewhere between Java 5 and where we are now in Java 6. I haven’t seen any information about it, and I haven’t run any formal tests, but it seems like it’s something the JVM is optimizing now. I know it at least inlines the calls.
I seem to remember Joe Darcy of the Sun/Oracle Java team saying that he (and others?) had long wanted to replace some of those intrinsics with pure Java, he just hasn’t had the time; see http://blogs.sun.com/darcy/entry/everything_old_is_new_again.
Hi Patrick,
Thanks for the link. I had seen it before, but it’s relevant for the discussion. I think you misinterpreted what he said. He said he wanted to replace the JNI implementations with pure Java implementations. The intrinsic versions are a good thing, e.g. he said “Achieving speed with sufficient semantic control in these cases is not really possible unless you have full control over the instruction sequences, as one has in a vm intrinsic”.
You are right, thanks for the correction.
The performance test is suffering from classic microbenchmarking problems. Even if we trust the harness to warm up the JVM by having several @Test invocations before the run (which I don’t believe is happening anyway), I don’t trust any code that does not introduce a data dependency on the local variable “x”. C2, the -server compiler in Oracle’s HotSpot JVM, will optimize this block away:
int x = 0;
time = System.nanoTime();
for (int i = 0; i < RUNS; i++)
    x += Math.log(Math.PI + i/* 1.0 + i/1e9 */);
long mathTime = System.nanoTime() - time;
x = 0;
to
time = System.nanoTime();
long mathTime = System.nanoTime() - time;
x = 0;
…thus giving you zero run time (isn’t that what’s happening with “log” in your own run)?
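A minimal way to defeat that elimination, sketched below with illustrative names (the general pattern, not the exact fix that was applied to the test): accumulate into a double and publish the result after timing, so the loop keeps a live consumer.

public final class LogBench {
    static final int RUNS = 10000000;
    static volatile double sink;  // a live consumer for the loop's result

    public static void main(String[] args) {
        long time = System.nanoTime();
        double x = 0;  // double accumulator, no lossy int cast
        for (int i = 0; i < RUNS; i++) {
            x += Math.log(Math.PI + i);  // result feeds the accumulator
        }
        long mathTime = System.nanoTime() - time;
        sink = x;  // publish x; C2 can no longer treat the loop as dead
        System.out.println("Math.log: " + mathTime / 1000000 + " ms");
    }
}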
Hi Aleksey,
Yes, it seems like there are some issues with the performance test. I suspected as much from some of the scores (as I pointed out in the post) and what you say regarding warm-up and data dependencies is true. I should probably have looked at the test more closely instead of just trusting it. My suspicion is that the results won’t change much apart from the cases where 0 was returned, but I’ll verify later today.
Best,
Ismael
Alas, there’s a general problem with blog posts like this: almost everyone will come away feeling that you measured math performance, but what you really measured is the JVM’s ability to eliminate dead code, plus your luck in which parts of the code got compiled and which ran in the interpreter.
I’m glad you pointed it out in the post, but there’ll be too many people drawing false conclusions from a faulty experiment.
Cheers!
I understand your concern, but I think you overstate the case (given the number of iterations and the numbers reported). We’ll know soon enough though.
Aleksey, I’ve updated the charts and the test. As expected, the scores don’t change much apart from the ones that looked suspicious (log, sqrt, abs).
Let me know if you find issues.
Best,
Ismael
GNU Classpath comes with a “pure java” StrictMath:
http://cvs.savannah.gnu.org/viewvc/*checkout*/classpath/java/lang/StrictMath.java?root=classpath
Math in GNU Classpath delegates most methods to VMMath calls, which are either VM intrinsics, special native implementations in the VM, or delegations to the native (JNI) fdlibm reference implementation.
It was done that way to make StrictMath completely portable, and give VMs the option to optimize Math however they wanted.
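In rough shape, the layering looks like this (an illustrative sketch with VMMath stubbed so the example compiles standalone; the real source is in the linked StrictMath.java):

// Illustrative sketch: Math delegates to a VM-specific class, which each
// VM may back with an intrinsic, a native routine, or JNI fdlibm. VMMath
// is stubbed here so the example compiles on its own.
final class VMMath {
    static double sin(double a) {
        return StrictMath.sin(a);  // stand-in for the VM-specific path
    }
}

public final class ClasspathStyleMath {
    public static double sin(double a) {
        return VMMath.sin(a);
    }

    public static void main(String[] args) {
        System.out.println(ClasspathStyleMath.sin(Math.PI / 6));  // ~0.5
    }
}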
Hey Mark,
Interesting. It would be good to see how the pure Java implementation performs.
Best,
Ismael
I should have added that having StrictMath “pure java” does rely on the VM implementing the strictfp modifier correctly, something that is sometimes hard to get right. IKVM, for example, doesn’t:
http://weblog.ikvm.net/CommentView.aspx?guid=6d657cd6-58ac-4a6a-8be9-c3fba167f531
Ismael, the data and test sure look more sane now.
However, I’m still not convinced there’s enough time for JVM warmup. You still measure time in interpreted mode, then the overhead of OSR, then the compiled-code performance. What’s more, since all three tests are in the same method, you may end up in a situation where StrictMath (first in your method) runs in interpreted mode, exceeds the compilation threshold and forces the entire method to recompile, thus giving Math and FastMath the advantage of executing in fully compiled mode afterwards.
I see this is mitigated by having 10 cycles of the entire benchmark run in test(). Is that enough? Is -XX:+PrintCompilation silent when you actually start measuring?
Note that this also implies you have 10 report() calls per particular test, hence 10 samples, of which the first ones are warmup garbage. Do the new charts account for the last cycle only?
Also, since you have these 10 samples, can you compute confidence intervals for your data? Some of the perceived differences might disappear then.
Sorry for being so obnoxious.
I don’t agree that interpreted-mode time makes it into the numbers used in the charts and the tables. To be clear, I am only using the scores from the 10th iteration, and -XX:+PrintCompilation is silent by then.
Feel free to run the tests yourself and check. On my machines the scores are very stable, which is a good sign.
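For anyone reproducing this, the shape of the warm-up-then-measure structure being discussed is roughly the following (class name, workload and counts are illustrative, not the actual test code); running with -XX:+PrintCompilation confirms compilation has settled before the sample that is kept:

public final class WarmupHarness {
    static final int ITERATIONS = 10;  // illustrative, as discussed above
    static final int RUNS = 10000000;
    static volatile double sink;

    public static void main(String[] args) {
        long lastMs = 0;
        for (int iter = 0; iter < ITERATIONS; iter++) {
            long t = System.nanoTime();
            double x = 0;
            for (int i = 0; i < RUNS; i++) {
                x += Math.sin(i * 1e-6);  // stand-in workload
            }
            lastMs = (System.nanoTime() - t) / 1000000;
            sink = x;
            System.out.println("iteration " + iter + ": " + lastMs + " ms");
        }
        // Only the final iteration is reported, after the JIT has settled.
        System.out.println("final: " + lastMs + " ms");
    }
}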
My 0.02 on benchmarking: this is why tools like Japex and Caliper were created. Perhaps using one of those would alleviate some of Aleksey’s concerns?
Patrick, I am aware of those tools and I’d use one of them if I were writing the tests from scratch. In this case, I think Aleksey’s concerns are unfounded and I’m happy with the results as they are.
Best,
Ismael
Ok, I’m convinced then.
The only thing separating your work from a nice performance analysis is confidence intervals, or some other sort of error estimate for the data.
Thanks and over,
Aleksey.
Good stuff: I think it is very important to have independent verification of claims made by library authors, as well as to check how performance evolves over time with newer JDKs. Too often data from old JVMs is treated as gospel.
One very minor suggestion: it would be good to mention that the graphs represent time in milliseconds (i.e. “shorter is better”), because there are tests where the opposite is true (the value is throughput).
Hey Tatu,
That was the motivation indeed.
Good point about “lower is better”. I’ve updated the post to mention this.
Best,
Ismael
Aleksey’s comments on reporting are spot on. Adding a variance calculation should be automatic, and to say that values are different you need further statistical testing. Without the variance calculation it’s hard for me to tell whether the small numbers are the same or significantly different.
Hi Kirk,
If you look at the test, you’ll see that the numbers are not small; they are just converted at the end. Also, the scores were very stable across iterations. Having said that, it would be nice to have variance scores, and I welcome you or Aleksey to submit the changes to the code (it’s all available, after all). I’ll add the data to the post if that happens.
As for the lack of confidence in the numbers, that’s good. You should not trust performance numbers in random blogs (whether they contain variance calculations or not). Instead you should run tests with your own hardware and application (even more true for math operations). Random blogs can serve at best as the starting point of an investigation. At least that’s the approach I take.
Best,
Ismael
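For reference, a minimal sketch of the kind of variance calculation being requested, assuming the ten per-iteration times have been collected into an array; the 2.262 critical value is Student’s t for 9 degrees of freedom at 95% confidence, matching ten samples (the sample data in main is made up for demonstration):

public final class Stats {
    static void report(String name, long[] samplesMs) {
        int n = samplesMs.length;
        double mean = 0;
        for (long s : samplesMs) mean += s;
        mean /= n;
        double var = 0;
        for (long s : samplesMs) var += (s - mean) * (s - mean);
        var /= (n - 1);  // unbiased sample variance
        // 2.262 is Student's t for n - 1 = 9 degrees of freedom at 95%;
        // it would need adjusting for a different sample count.
        double halfWidth = 2.262 * Math.sqrt(var / n);
        System.out.printf("%s: %.1f ms +/- %.1f ms%n", name, mean, halfWidth);
    }

    public static void main(String[] args) {
        // Made-up timings, purely to demonstrate the output format.
        report("example", new long[] {52, 53, 53, 52, 54, 53, 52, 53, 53, 52});
    }
}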
I recently stumbled upon a “replacement” for java.lang.Math called Jafama. It claims far superior performance but does not provide full accuracy; quoting the project page: “usually about 2-4 (up to 15) times faster, with about 1e-15 accuracy”. I must admit I have never used it, but perhaps benchmarking it would provide some interesting additions to your graphs. I at least would be very interested in performance numbers for it.
http://sourceforge.net/projects/jafama/
Hi Rudiger,
Thanks for the link, it looks interesting. I’ll add it to my list of things to try. :)
Best,
Ismael
I know it’s been a while, but I was working on a very trig-intensive program and came across this. I did a bunch of profiling on my program, using simultaneous averaged calls to compare the performance of a bunch of different trig libraries, and I found that anything table-based does well in benchmarks but falls down pretty heavily once it starts having to compete for cache with my data (which is in the hundreds of megs, and only some of it needs trig).
Because I don’t need enormous accuracy, I ended up rolling my own Chebyshev approximations for (a)sin, (a)cos and (a)tan, which run at minimum 4x faster than the Math classes, and faster still for asin/acos. Of course, the biggest speedups came from algorithm redesign and hoisting the trig out of the deepest loops where possible, since trig is still fairly slow.
Cheers
Julian
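A rough sketch of the kind of low-accuracy polynomial approximation Julian describes, for sine: the coefficients below are plain Taylor terms rather than fitted Chebyshev/minimax coefficients, so accuracy degrades toward ±π, and the range reduction is the simplest possible.

// Illustrative low-accuracy sine: range-reduce to [-pi, pi], then
// evaluate an odd polynomial with Horner's scheme. A real implementation
// would fit minimax coefficients over the reduced range instead.
public final class FastTrig {
    private static final double TWO_PI = 2 * Math.PI;

    public static double fastSin(double x) {
        x = x % TWO_PI;               // crude range reduction
        if (x > Math.PI) x -= TWO_PI;
        if (x < -Math.PI) x += TWO_PI;
        double x2 = x * x;
        // x - x^3/3! + x^5/5! - x^7/7!
        return x * (1 + x2 * (-1.0 / 6 + x2 * (1.0 / 120 + x2 * (-1.0 / 5040))));
    }

    public static void main(String[] args) {
        System.out.println(fastSin(1.0) + " vs " + Math.sin(1.0));
    }
}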