|
How to stop a compiler Suppose I want to write a microbench to test something. Here's a for-instance: // Time sqrt I run this and get: sqrts/sec=Infinity Urr... what? Ah! My fast X86 can do so many sqrts in a second, that "stop-start" is less than 1000 milliseconds, then "(stop-start)/1000" rounds down to zero as integer math. Try again: // Time sqrt, Round 2 This time I get: sqrts/sec=2.5E8 Ahhh... a smug smile, that X86 is really fast. It's a 2.5Ghz Xeon, so that's... lets see... (2.5e9 clks/sec) / (2.5e8 sqrts/sec) = 10 clks/sqrt. Humm... that IS really fast. Let's raise the bar a little: let's try 10 million sqrts instead of 1 million: // Time sqrt, Round 3 This time I get: sqrts/sec=4.5454545454545456E8 Yeek! TWO alarm bells go off in my head! First Alarm: 10x more work but only about 2x more time; now I'm down to 5 clks/sqrt! Second Alarm: that repeating fraction result: it tells me I've likely got a REALLY small number of milliseconds that I'm dividing. In fact... it's 22 milliseconds. Very suspicious! Yup, the compiler is tossing out my inner loop as being dead. To confirm, I'll switch to 1billion sqrts. This time I get: sqrts/sec=3.846153846153846E10 That's roughly 15 sqrts PER CLOCK CYCLE - not 15 clks/sqrt. Yup, that's one speed-demon Xeon. Or a brainiac compiler. To defeat the compiler I need to use all the results of the inner loop in the final output. Here's one way (and notice I'm back to 1million iterations): // Time sqrt, Round 4 Now I'm measuring 1 sqrt and 1 double-precision addition per iteration... but I know the add is fairly fast, or least it's fast relative to square root. This time I get: sqrts/sec=7692307.692307692 Ok, really I want to time various HashTable.get/put implementations but the basic issues are similar. I've got multi-threaded scaling issues to work out as well. Meanwhile, here's my 2002 JavaOne slides on the topic: http://www.azulsystems.com/events/javaone_2002/microbenchmarks.pdf Cliff P.S. As a result of said testing, Bill Pugh's extra fences & ordering guarantees do not appear to cost very much, so I'm including them. The SourceForge project is up but empty; I hope to do an initial check-in later this week. TrackBackTrackBack URL for this entry: Listed below are links to weblogs that reference How to stop a compiler: CommentsYour right hand column content is covering the right 8 or so characters of the text in both IE 6 and firefox 2. Interesting side effect of compiler optimization. :) Posted by: Brian Lloyd-Newberry | Apr 17, 2007 6:32:09 AM Hopefully fixed now. Posted by: Cliff Click | Apr 17, 2007 2:03:51 PM I recently performed a benchmark related to cost (immediate or not) of object allocation and within the test code I made sure to add some code to ensure the dynamic compiler did not optimize away the code I wanted to actually test. Java Performance Myths - Part 2 regards, William Posted by: William Louth | Apr 17, 2007 5:21:31 PM I don't think the compiler optimized away your loop (this is Java, right?). You can look at the bytecode to verify. The more likely explanation is a combination of: 1) currentTimeMillis() tends to be inaccurate (is this Windows or Linux?). and 2) HotSpot kicked in. You could probably do without the add operation if you use System.nanoTime() and run your benchmark for a much longer time. Running for a longer time is a better idea anyway because it averages out the noise from background tasks. Posted by: Bob Lee | Apr 17, 2007 7:44:17 PM You got me on (1) - I should have run for at least 10 seconds, although my final run did end up about 100x larger than CTM's wiggle. As for (2), I *wrote* the HotSpot compiler - so I'm sure it "kicked in". When I said "compiler" I wasn't refering to javac, but to the optimizing compiler built into HotSpot. Cliff Posted by: Cliff Click | Apr 17, 2007 7:53:04 PM Yeah, I realized my mistake right after I clicked the post button. Too late. ;) Is HotSpot really smart enough to skip the loop altogether? Posted by: Bob Lee | Apr 17, 2007 8:19:51 PM Yes; it's a pretty standard dead-code elimination, although simple unrolling gets nearly the same effect: only 1 store per unrolled body so you can lower the times hashCode is computed by any constant factor. Posted by: Cliff Click | Apr 17, 2007 11:18:15 PM Java code execute at very different speeds during an application's lifetime. Initially, it might execute in interpreted mode; if it executes enough times (if it is hot), it becomes a candidate for compilation to native code. Is it possible to know how much times should be a method iterated to become hot? As far as i know, early JVM's do have a method invocation threshold (Client JVM: 1500 and Server JVM:10000). Do the new JVM's have the same threshold? Thanks a lot, Posted by: Fouad Omri | Dec 23, 2007 9:31:45 AM I writing a microbenchmark for Java API methods, and i have also to face the following challanges: JVM optimizations may bias the microbenchmark results as mentionned here. The code we want to benchmark e.g(Math.sqrt()) is considered as Dead Computation and consequently it is optimized. But how would we deal with void methods? They may also be considered as Dead Computations, and we do have no return to fool the compiler with? Thanks a lot, Posted by: Fouad Omri | Dec 23, 2007 9:42:03 AM - On JIT thresholds: they vary alot from JVM to JVM. HotSpot -client & -server thresholds you've already mentioned, although these two can change from release to release. HotSpot on Azul is using both client AND server compilers; we've adjusted the threshold for both as well. Other JVMs do things differently; IBM's J9 (I think) does tick-based profiling at some point - so no counters. - On keeping results alive: the easy answer is to use the result in a "cheap" math. For Math.sqrt, sum the result into counter - then print the counter if it's equal to some unlikely value: - For void functions: Make them return a value! In any case, it's "pure" functions that go dead, not "void" functions, and only if the compiler can figure out it's "pure" (which it will for sqrt). I HIGHLY recommend you do not use Math.sqrt for any microbenchmark, unless your real application is extremely sqrt heavy - and in which case I'd highly recommend you look for some other way to structure your numeric calculation. Sqrt, like sin/cos/tan is one of the X86 FP odd-balls: In "math-sloppy" languages like C you can take all sorts of short-cuts with numerical precision - and so use the X86 FP ops. In Java, you must give correct results for all the evil corner cases which the X86 FP ops do not handle well. This requires a ton of extra ops in Java-on-X86 that are not present in C-on-X86 - so makes Java look very slow compared to C. (and C is faster than Java if you plan on doing a ton of sloppy sqrts on X86!). Sqrts on other platforms are generally same speed in C and Java. Cliff Posted by: Cliff Click | Dec 23, 2007 11:37:12 AM One other thing - HotSpot suffers from the "On-Stack-Replacement problem" - and microbenchmarks especially tend to trigger the problem. The compiler waits some time for profiling, then generates great code for the hot loops. The very next time the method containing the loop gets called, the great code will execute. Alas, in a microbenchmark there IS NO "next time". So eventually the compiler emits somewhat poorer that can be called "in the middle of the loop", replacing "on stack" the existing interpreter frame with the compiled frame. The fix for microbenchmarks is to call the hot loop from several nested loops, each in it's own function. Cliff Posted by: Cliff Click | Dec 23, 2007 11:43:10 AM Thanks a lot Dr. Cliff, i am designing and developping a fine grained benchmark for the Java API. I am trying right now to improve my benchmarking methodology: 1)Warm up the code before benchmarking In the benchmark i implement, are the different API methods considered as black boxes. Metadatas of API methods are collected via Java Reflection. A ParameterGenerator generates appropriates parameters for each method to invoke, and an ErrorHandler tries to handle runtime exceptions if they occur. I do have to face the following problems or challenges: 2) I tried to fool the compiler by making the returns of pure methods used whithin the microbenchmark, just like what you have done with the method Math.sqrt(). But i still have no idea how i could prevent the JIT from optimizing a void pure method, since i am not accessing the source code of the method i benchmark(Reflection). 3) There are not enough docs about JIT Optimizations :( Omri Posted by: Fouad Omri | Dec 23, 2007 1:13:26 PM Reflection goes a long way towards fooling the compiler. The downside is that reflection is relatively expensive so you need to use it relatively less often than the actual workload you are timing. I generally use interface calls instead of reflection (much cheaper but not always applicable.) With interface calls, it suffices to make the hot call-site "mega-morphic" (call more than 2 different targets) during the warm-up phase. Since I pay a mega-morphic interface call (20-50 clk cycles depending on JVM & CPU) per call to the benchmark, I make sure my benchmark target has enough work to swamp that overhead - generally by looping 20,000 iterations over the "real" benchmark call and expecting it to get inlined, unrolled & optimized - but not DCE'd (using my sum-and-print-the-results trick). Good luck with your degree, Posted by: Cliff Click | Dec 24, 2007 9:09:32 AM Cliff, in your Dec 23, 2007 11:43:10 AM response, you noted that OSR can emit suboptimal code (from a performance perspective), relative to the code that could have been emitted had OSR not had to be used. Do you have any ideas what type of optimizations OSR can NOT do? And what type of java code would illustrate this? Do you know of a good example of where the OSR version of the code is measurably inferior to the optimal non-OSR version of the code? I tried writing some simple benchmarks where the original structure was vulnerable to OSR, and compared execution times with rewritten versions that were not vulnerable to OSR (e.g. because I pulled out inner loops into external methods). I found similar execution times in each case, which I interpret as meaning that the OSR code is about as good of quality. Maybe I am not writing complex enough loop body code or something. Posted by: Brent | Feb 25, 2008 7:38:11 AM OSR's compile "in the middle" of the method, usually in the middle of nested loops. If you have a complex loop nest & you start parsing bytecodes from in the middle - sometimes you get irreducible loops. Irreducible loops don't get a lot of optimization, e.g. no loop-hoisting. The irreducible loops are not in the OSR'd loop directly but are the following loops (in the OSR parse: not always the following loops in the bytecodes). Sometimes OSR'd loops cannot do range-check-elimination conveniently, or do not properly peel 1 iteration (to remove loop-invariant null checks). Try writing a single method with a modestly long running outer loop, and several inner loops. Each inner loop should be moderately long and include highly optimizable array math: something like a simple integer vector-sum loop. --- Here's another idea: try to show (lack of) unrolling, by making unrolling insanely profitable: If this loop is unrolled at all, then the Q sum's cancel and the compiler knows it. This then should strip the loop empty and throw it out: runtime will become INDEPENDENT of the trip count N. Runtime will be: time to interpret until compilation, then time to compile, then time to run the (empty) loop. Cliff Posted by: Cliff Click | Feb 25, 2008 8:31:53 AM Bingo! Arrays bound check elimination seems to not occur under OSR code. Consider this version with your suggested multi-inner loops that do simple array math: Running 3 times, I get the consistent output Now do a method extraction of the inner loops so that OSR does not happen for them (they should instead be fully optimally compiled): Running 3 times, I get the consistent output p.s. is there any way to format code here? This website seems to have no decent way, nor does it allow html formatting... Posted by: Brent | Feb 26, 2008 12:08:40 AM I also tried writing a simple test to see if OSR is unable to do loop unrolling. Here is code that should suffer from OSR and whose loop would benefit enormously from unrolling as you suggest: private static int c = 1; private static int c = 1; The first 2 runs of this method yield: Did I code it wrong? Posted by: Brent | Feb 26, 2008 12:43:20 AM Yes, I just complained (again) about being unable to do code well. I just confirmed the OSR issue with your code on my linux 1.5.06 Java. I don't have time to dig into the unrolling OSR - but I can guess that 'c' is well known to the compiler (so everybody 'wins'). You might try having it vary with every other iteration. Also 4% is clearly within the margin of error: it really means "no difference", which might also mean "nobody did unrolling" or "everybody loses". In any case you haven't (yet) distinguished the cases. Cliff Posted by: Cliff Click | Feb 26, 2008 8:22:01 AM Post a comment |


