Another round of Micro-benchmark Advice
March 7, 2008
I ran across this article, http://www.javaspecialists.eu/archive/Issue157.html, and since Heinz is a friend I thought I'd try to figure out what's going on. Here's what I came up with:
There are 3 or 4 conflicting effects and which one dominates at
any point in time "depends". All of the effects can be removed with
some care.
- OSR: all code is in a loop in main. The
-server compiler makes good code for hot looping methods; the next
time that method is called the good code runs. Alas, 'main'
is never called again. So after a time (slowly) interpreting the code,
HotSpot makes mediocre code for "the middle of the method" and does an
On-Stack-Replacement of the interpreter frame for the compiled frame.
The -client compiler is invoked for loop-containing methods immediately,
but makes less optimized code. Fix: drive all timing from
modest-count outer loops which call methods that themselves contain a
long-trip-count loop:
    for( int i=0; i<100; i++ ) test_one();
    void test_one() { for( int i=0; i<1000000; i++ ) do_stuff(); }
- Profiling ends compilation: after compiling the hot loop
the -server compiler notices that it's reaching code that's (1) never
been executed and (2) full of classes that have never been loaded. It
stops compiling, and issues an "uncommon-trap" - HotSpot jargon for
flipping from compiled code back to the interpreter. The -client
compiler usually compiles all the code in a method no matter how hot or
cold. Fix: run *all* test code during the warmup period, which
will force all classes to be loaded. Call all work methods from some
top-level dispatch function which itself will be profiled, hot and
compiled.
- Inline Caches: HotSpot uses an inline-cache for calls
where the compiler cannot prove only a single target can be called. An
inline-cache turns a virtual (or interface) call into a static call
plus a few cycles of work. It is a 1-entry cache inlined in the
code; the Key is the expected class of the 'this' pointer, the Value is
the static target method matching the Key, directly encoded as a call
instruction. As soon as you need 2+ targets for the same call site,
you revert to the much more expensive dynamic lookup
(load/load/load/jump-register). Both compilers use the same runtime
infrastructure, but the server compiler is more aggressive about
proving a single target. Fix: either expect the calls to be
single-target and fast, OR force all calls to be multi-target and
slow. The multi-target solution is easier for this kind of test.
- Bi-morphic (NOT poly-morphic) call site optimization:
Where the -server compiler can prove only TWO classes reach a call site
it will insert a type-check and then statically call both targets
(which may then further inline, etc). The -client compiler doesn't do
this optimization. Fix: either Do or Do Not allow 2 targets
for the result of calls. Usually it's easy to arrange for 1 target
(the norm, and inlined case) OR many more than 2 targets.
- X86 BTB: Some X86 chips include a branch-target-buffer
prediction mechanism, which can sometimes predict the target of
indirect branches. Fix: this one's harder to control, but a light-weight
pseudo-random selection of targets will often defeat the hardware:
make an array of Foo objects populated with various random
selections of Foo subclasses, and make virtual calls against those.
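Putting several of the fixes above together, a harness might look roughly like the sketch below. This is only an illustration under assumptions, not code from the article being discussed: the names Foo, FooA through FooD, MicroBench, makeFoos, testOne and runAll are all made up, and the loop counts are arbitrary. It keeps the long-trip-count loop in its own method, runs all test code during warmup through one top-level dispatch method, and makes virtual calls against a pseudo-random mix of four subclasses so the call site has more than 2 targets.

```java
import java.util.Random;

// Hypothetical hierarchy: four implementations, so the call site below sees
// more than 2 receiver classes (past the 1-entry inline cache and bi-morphic cases).
interface Foo { int work(int x); }
final class FooA implements Foo { public int work(int x) { return x + 1; } }
final class FooB implements Foo { public int work(int x) { return x ^ 3; } }
final class FooC implements Foo { public int work(int x) { return x * 5; } }
final class FooD implements Foo { public int work(int x) { return x - 7; } }

class MicroBench {
  // Array of Foo objects populated with a pseudo-random selection of subclasses.
  // The seed is fixed so that runs are repeatable.
  static Foo[] makeFoos(int n, long seed) {
    Random r = new Random(seed);
    Foo[] foos = new Foo[n];
    for (int i = 0; i < n; i++) {
      switch (r.nextInt(4)) {
        case 0:  foos[i] = new FooA(); break;
        case 1:  foos[i] = new FooB(); break;
        case 2:  foos[i] = new FooC(); break;
        default: foos[i] = new FooD(); break;
      }
    }
    return foos;
  }

  // The long-trip-count loop lives in its own method, which is called many
  // times, so it can be compiled as a whole method instead of relying on OSR.
  static int testOne(Foo[] foos, int trips) {
    int sum = 0;
    for (int i = 0; i < trips; i++)
      sum += foos[i % foos.length].work(i);   // multi-target virtual call
    return sum;
  }

  // Top-level dispatch: every work method is called from here, both during
  // warmup and during the timed run. Further work methods would be added here.
  static int runAll(Foo[] foos, int trips) {
    return testOne(foos, trips);
  }

  public static void main(String[] args) {
    Foo[] foos = makeFoos(1024, 42L);
    int sink = 0;   // accumulate results so the work is not dead code

    // Warmup: run *all* the test code so classes are loaded and profiled.
    for (int i = 0; i < 100; i++) sink += runAll(foos, 100_000);

    // Timed run: modest-count outer loop, long inner loop in the callee.
    long t0 = System.nanoTime();
    for (int i = 0; i < 100; i++) sink += runAll(foos, 1_000_000);
    long t1 = System.nanoTime();

    System.out.println("approx ns per call: " + (t1 - t0) / (100.0 * 1_000_000)
        + " (sink=" + sink + ")");
  }
}
```

Run it under both -server and -client to compare. The single-target variant of the same test would need its own method with its own call site, since sharing testOne with the mixed array would make that call site multi-target anyway.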
Good luck with those micro-benchmarks,
Cliff
Comments
Thanks for summing it all up so nicely.
Posted by: Ashwin Jayaprakash | Mar 8, 2008 10:42:52 PM
Excellent post! Thanks for the tips.
Posted by: Paulo Silveira | Jul 3, 2008 11:47:01 AM