Cliff Click Jr.’s Blog's Blog

« Clifford vs The YellowJackets | Main | Misc Odds & Ends »

To Clone or Not To Clone...
November 6, 2007

There's this evil bit of the JDK involving NIO, where deep in the inner loop of a per-byte copy there's a virtual call:

ByteBuffer bb = ...;
for( int i=0; i<N; i++ )
  buf[i] = bb.get(i);

No problemo, the JIT will inline the single target, right?
Suppose the abstract class ByteBuffer has a single implementing class called DirectByteBuffer (via the abstract class MappedByteBuffer).
Then the JVM can prove that any ByteBuffer must be an instance of a DirectByteBuffer (please pardon my not-quite-exact translation to java-pseudo-code):

ByteBuffer bb = ...;
for( int i=0; i<N; i++ ) {
  DirectByteBuffer tmp= (DirectByteBuffer)bb; // Cast is free!
  buf[i] = tmp.get(i);
}

And then 'tmp.get' gets inlined (along with a few of it's friends) like so:

ByteBuffer bb = ...;
for( int i=0; i<N; i++ ) {
  DirectByteBuffer tmp= (DirectByteBuffer)bb; // Cast is free!
  if( i<0 || i>tmp.limit ) throw ...;
  buf[i] = tmp.unsafe.getByte(tmp.address + i);
}

And then the compiler does it's usual magic with range-check-elimination, loop-invariant hoisting, unrolling, etc: 

ByteBuffer bb = ...;
DirectByteBuffer tmp= (DirectByteBuffer)bb; // Cast is free!
unsafe ptr = tmp.address + 0;
for( int i=0; i<min(N,tmp.limit,buf.length); i+=4 ) {
  prefetch ptr+256; // some cache lines ahead
  buf[i+0] = *(ptr+i+0); // psuedo-code for unrolled loop
  buf[i+1] = *(ptr+i+1);
  buf[i+2] = *(ptr+i+2);
  buf[i+3] = *(ptr+i+3);
}
if( N>tmp.limit || N >buf.limit ) throw ...;

And suddenly the inner loop looks something like the guts of 'memcpy', and performance is quite good.  Yeah for JITs!

This also works if the single implementing class of ByteBuffer is HeapByteBuffer, yielding code very similar (after suitable levels of inlining, of course):

ByteBuffer bb = ...;
HeapByteBuffer tmp= (HeapByteBuffer)bb; // Cast is free!
if( tmp.position >= tmp.limit ) throw...
int init = tmp.position;
unsafe ptr = tmp.hb + tmp.offset;
int reg = tmp.position; // hoist field into register
for( int i=0; i<min(N,tmp.limit-init,buf.length); i+=4 ) {
  prefetch ptr+256; // some cache lines ahead
  buf[i+0] = *(ptr+reg+0);
  buf[i+1] = *(ptr+reg+1);
  buf[i+2] = *(ptr+reg+2);
  buf[i+3] = *(ptr+reg+3);
  reg+=4;
}
tmp.position = reg; // write register back into field
if( N>tmp.limit || N>buf.limit ) throw ...;

Again the JIT straightens out a nest of twisted inlining to yield something close to memcpy.

What Happens When I Call Them Both?

But suppose now this single loop is called with both HeapByteBuffers and DirectByteBuffers.  Then the compiler can't prove a single target class exists and doesn't inline the original 'get' call.  Performance of the same loop plummets (usually by a factor of 10x or so).  Just for exactly this case, HotSpot includes a special optimization called BimorphicInlining (think: Polymorphic inlining for the special case where "poly=2").  If there are exactly TWO possible targets, and both are hot then HotSpot will insert a runtime type test and inline both targets roughly like so:

ByteBuffer bb = ...;
if( bb instanceOf HeapByteBuffer ) {
  ...range checks, other setup
  unsafe ptr = tmp.hb + tmp.offset;
  int reg = tmp.position; // hoist field into register
} else { // else must be a DirectByteBuffer
  ...range checks, other setup
  unsafe ptr = tmp.address + 0;
}
for( int i=0; i<min(N,tmp.limit-init,buf.length); i+=4 ) {
  prefetch ptr+256; // some cache lines ahead
  if( bb instanceOf HeapByteBuffer ) {
    buf[i+0] = *(ptr+reg+0);
    buf[i+1] = *(ptr+reg+1);
    buf[i+2] = *(ptr+reg+2);
    buf[i+3] = *(ptr+reg+3);
    reg+=4;
  } else {  // must be DirectByteBuffer
    buf[i+0] = *(ptr+i+0);
    buf[i+1] = *(ptr+i+1);
    buf[i+2] = *(ptr+i+2);
    buf[i+3] = *(ptr+i+3);
  }
}
if( bb instanceOf HeapByteBuffer ) tmp.position = reg;
if( N>tmp.limit || N>buf.limit ) throw ...;

Voila'! Performance is (almost) back to what is was before.  Of course, the loop body doubled in size (and the array math is subtly different)  and only 1/2 the code is ever used on any invocation.  And we have to do a type-check in the loop on every iteration (made cheaper by unrolling: the type-check now only happens 1/4th as often).

More Than Two Targets

What happens if we have more than 2 hot targets?  HotSpot stops trying, and falls back to the generic virtual call (and 10x performance lossage).  In theory, you can use an optimization called loop-unswitching - really loop-cloning - to clone a new copy of the loop per possible target of 'get' (no production JVM that I know of does this).  But why do I care?

Turns out this kind of coding style has a number of interesting use cases: any time the iteration logic is complex (hence wants to be shared) but the inner-loop work is tiny.  This is true for graphics BitBlit routines where you have lots of interesting edge cases as you work around the borders of some rectangle - but the work-per-iteration is usually something like: load a word from the 2 arrays, XOR them, then write back - or some minor variation thereof.  i.e., you've got a modestly complex set of nested outer loops, and you want a polymorphic bit of work in the inner loop.  If you only wanted a single kind of inner-loop work, you'd get that single target inlined and the JIT would Do The Right Thing.

But what do you do when you've got lots of inner-loop targets?  The graphics guys didn't wait around for the compiler-writers of the world to catch up; they voted with their feet.  They cloned all those loops, by hand in the source code.  [Yeah, yeah, yeah, I'm being somewhat hyperbolic here; I don't know what really goes on in the 2-D libraries for Windows or Mac under the hood; this situation was true for graphics when I wrote games 20 yrs ago and I keep hearing it pop up now and again].

To Clone Or Not To Clone, That is The Question...

And now I have a good friend busy writing more complex iterator code (for parallel & concurrent iteration over complex spaces) with unknown - but typically small - inner loops.  He wants to write the code Once and let the JIT clone whole loops on an as-needed basis.  Alas, this JIT optimization seems unlikely to happen soon.  He's now debating generating copies of his loops specialized to a single target instance on demand.  I.e., an already complex parallel iterating library is about to pick up some even-more-complex bytecode rewriting stuff in an effort to get decent performance out of simple loops.

So: should he Clone his code?  Cloned not even in the source code (where you can at least look for & fix cut-n-paste bugs, tweak for performance on a per-loop basis, etc) - but cloned automatically (with all the evil performance problems THAT implies - excessive repeated JIT compilation, cloning when there's no point, bytecode bloat, impossible-to-figure-out Exceptions coming from machine-generated code)?  Or suffer the slings and arrows of outrageously bad performance, for what's supposed to be a easy-to-use high-performance Java library for parallel programming?


Thanks,
Cliff

PS - I got the yellow-jackets from last post with a 2nd gopher bomb, thanks for all the suggestions!







Category: Web/Tech | | TrackBack (0)

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/t/trackback/240313/23095732

Listed below are links to weblogs that reference To Clone or Not To Clone...:

 

Comments

David Ungar's Self compiler did something that sounds similar to this, only he called it "type splitting". IIRC his runtime didn't invoke the JIT until the code had run a few times, so he had some statistics available to the compiler. So if there were N possible types but only a few of them were frequent then he would clone the code for the frequent cases and leave the original interpreted code for the less common cases. I think his runtime also re-evaluated its optimization strategy periodically (tied to GC I believe) so it could change which types were cloned if the hot types had changed, or even unclone code that had gone cold. I also believe he did this to the entire continuation, not just on loops, so his methods wound up looking more like CLOS multimethods after a few rounds of type-splitting and inlining.

Apropos of nothing, I was rummaging through the C# docs the other day and noticed that C# triggers the JIT the same way Fifth did - they put a "call the compiler" thunk around each method, so the IL for each function is JITted and backpatched when it's first called. Not terribly interesting in and of itself, but I had flashbacks to that XScript system compiling itself out of EPROM back before compiler could generate ROMmable code.

Posted by: Michael Parker | Nov 13, 2007 7:31:11 AM

 

Wow! I've seen you post before and have been thinking "I knew a Michael Parker long long ago... I wonder if it's the same guy", and then you said "Fifth". Now THERE's a flashback! Shoot me an email, we should catch up!

As for the deep-cloning thing, I'm starting to think it's a must-have for HotSpot sooner rather than later. Deep-cloning will be a lot easier than loop-unswitching, and useful in more contexts.

The question becomes one of integrating the selection code into the runtime. Right now we go mega-morphic at virtual call sites for the 'this' pointer only. In this deep-cloning context we'd also want to split out to different codes based on other arguments or loaded values.

Cliff

Posted by: Cliff Click | Nov 13, 2007 8:12:36 AM

 

I always thought this sort of cloning was easier to do if the code was in CPS form, since you just have to clone the continuations on either path of the branch with the appropriate parameters now uniquely typed, or at least more uniquely typed. Obviously a JIT compiler isn't looking at CPS code, and it's probably more trouble than it's worth to convert the IL in and out of CPS. The Self system cached the target method at the call site if it was virtual and checked his assumption in the prologue of the target and redispatched if he landed in the wrong method. All methods in Self were virtual, smalltalk-style, but he optimized the special case where there was only one target, and obviously type-splitting made this special case a whole lot more common. Within the method would potentially do the cloning whenever he needed to select a type.

It's a shame that Self was more widely renowned at the time for its garbage collector (it seems to be the granddaddy of the modern generational collectors) than for its JIT, which was mostly viewed as a curiosity in the pre-Java days. In the hottest spots of the program the full optimizer would generate code that was comparable in speed to optimized C, and since it integrated code profiling into the JIT and GC it was able to do some interesting things that none of the current generation are even attempting. By going back and increasing the optimization level for hot spots in the code Self was able to really boost performance without too much JIT overhead, and by balancing between interpreted IL and various levels of optimized native code it kept the runtime footprint surprisingly low.

Posted by: mparker762 | Nov 13, 2007 9:23:38 AM

 

You might also want to look at how CLOS implementations do it, since what you wind up after splitting with are essentially CLOS multimethods - the Lisp guys have been bumming that sort of code for a long time.

Posted by: Michael Parker | Nov 13, 2007 9:26:30 AM

 

HotSpot has long done virtual calls with an "inline cache" - essentially what you talk about above - and HotSpot came from a strong Self background (Self turned into a startup called Animorphic, got purchased by Sun, and converted into HotSpot).

In retrospect, HotSpot ought to do more "type-splitting".

Cliff

Posted by: Cliff Click | Nov 13, 2007 3:55:56 PM

 

On the subject of 2d graphics inner loops.

My Java PlayStation emulator takes a hybrid approach.

I have a single Java implementation of the code for every possible renderer full of inefficient IFs in the inner loops based on the state of static member variables.

At runtime, I just clone the class (for a particular set of rendering options), and make the member variables final with the corresponding values.

Then the compiler will produce an optimal version of the method for the renderer in question ;-)

Posted by: graham sanderson | Dec 19, 2007 9:38:20 PM

 

In Doug's situation, his inner loops have leaf calls to unknown user fcns - he's trying to get rid of the dynamic dispatch in the inner loop.

Thanks,
Cliff

Posted by: Cliff Click | Dec 20, 2007 12:55:36 PM

 

I've been playing around with Maps and I tried to change the collision handling in java.util.HashMap to use small fragments of 2D-arrays (2 rows, 8 columns each) to hold k,v pairs. Each such fragment is connected to another fragment (if there is a lot of collision). To my horror, I found the array traversals to be much slower than pointer traversals (Entry->Entry-> ...). Could it be because of the array bounds check?

Posted by: Ashwin Jayaprakash | Feb 15, 2008 9:29:24 AM

 

Possibly: it's hard to tell from your description. The Right The To Do is to whip out a profiling tool.

But I suspect your guess it right: usually the linked chains are very short. So in the typical case you get what you want on the first probe. With arrays, you'll need to range-check each array exactly once. Range checks are cheap on long running loops, because the range check can be done once for the whole loop - but this isn't a long running loop. It's probably a run-once loop.

The array for an individual bucket probably makes sense only when there are lots of items in each bucket - as can happen if the hash function is screwed up and everything is hashing to the same place.

I use One Big Array for the whole table - HashMap uses One Big Array that links to short linked-list chains. My style avoids one level of indirection typically.
You've got HashMap's One Big Array but now it links to a structure which is both a linked list itself AND points to arrays.

All those layers of indirection hurt.

Cliff

Posted by: Cliff Click | Feb 15, 2008 10:16:16 AM

 

Post a comment