Talking with Google...
March 29, 2007

[Part 2 : 'put' of my NonBlockingHashMap presentation will appear later]

I presented my NonBlockingHashMap at Google yesterday.  Overall, the talk went over very well with lots of intelligent questions.  This bodes well for presenting this stuff at JavaOne.  To be blunt, I was very concerned that I'd be talking over most folks' heads at JavaOne, but maybe not.

Slides: http://www.azulsystems.com/events/javaone_2007/index.htm 
Video: http://video.google.com/videoplay?docid=2139967204534450862

As part of the Q&A somebody asked me about a narrow race to that happens when installing a new table.  The HashTable auto-resizes, and periodically a new table has to be installed.  During the install process many threads can all independently decide that a table-resize is required and then all go and create a new empty table and attempt to install it.  Only one succeeds in the install (via a CAS) and the rest just drop their extra tables on the floor and let GC reclaim them.  I had supposed that the extra tables created here never amounted to very much - the race seems very narrow and is clearly very rare for modest sized tables (as based on a bunch of profiling).  And I said as much during Q&A.

But I had observed that for high thread&cpu counts (500 to 750+) and very large tables (4million entries, times 2 words (Key,Value) times 8 bytes (64-bit VM), so a 64Mbyte array) some odd timing problems which I suspected were GC related.  So I profiled a little more... and Lo!  The Google engineer was right (sorry I didn't get your name!)... a 64Mbyte array takes some time to create - because it has to be zero'd.  During that time more threads figure out that the table needs resizing, so they also make 64M arrays.  The sum of them start triggering various slow-path GC issues (emergency heap-growth, large object allocation space, GC decisions, etc) which gives more time for more threads to discover that a table resize is needed.  Pretty soon I'm making a few hundred 64M arrays, all of which are dead (except the one winner) and the GC is swallowing a major hiccup.

My fix goes against the spirit of Non-Blocking algorithms: I stall all but the first few threads making huge arrays.  The stalled threads are basically waiting to see if a new table shows up.  The algorithm is still non-blocking in the end: if the proposed new table doesn't show up the stalled threads eventually get around to making one themselves.  But by stalling a little I nearly always avoid the problem; threads that have "promised" to make a new array are in fact busy making one - I just need to give them a little  more time.

This issue points out one of the interesting discrepancies between Java and C.  In C, I could malloc a few hundred 64M arrays relatively quickly: they would all end up mmap'ing virtual memory but NOT physical memory.  Then when I freed the extra's, I get all that virtual memory back - but I'd never end up touching any physical memory.  Once I had a winner determined, I could initialize that 64M array in parallel.  In Java, I only get the pre-initialized array and it's normally zero'd by a single thread.  It takes some substantial time to zero a 64M array.

I could also go with ArrayLets in Java- basically an array of arrays.  For a 4M element array I'd actually make a single 2048-element 2-d array and zero only that.  The inner arrays would get lazily created on demand.  This would let me spread out the cost of zero'ing both over time and across threads.  Biggest downside: every HashTable access would require an extra indirection (and range check) to get the real elements.

At the moment, I'm going to live with my sleazy stall-on-resize-of-huge-arrays.  It appears to work really well, and it's really simple and fast.

Moral: Peer review is Good, and I'll be a little less smug about saying 'that race is really rare' in the future!

Cliff

Category: Web/Tech | | TrackBack (0)

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451bd7669e200d834f1fb7453ef

Listed below are links to weblogs that reference Talking with Google...:

 

Comments

As we say in the performance world, measure don't guess. ;-}

On the whole this is a pretty cool advance on the concurrency libraries.

Kirk

Posted by: Kirk | Mar 30, 2007 4:49:33 AM

 

Re. your talk: State machines permeate my algorithms, too. I suspect it's because they are easy to do atomically, make the "progress" through an algorithm obvious (simplifying concurrent assistance), and make enumerating all cases (the bane of lock-free algorithms) both explicit and fundamental.

Posted by: Chris Purcell | Apr 2, 2007 11:09:29 AM

 

Alternatively, you could modify your JVM to let you allocate unzeroed memory, as you could in C, to avoid the need to solve an artificial problem.

Posted by: Anonymous | May 10, 2007 8:17:56 AM

 

The goal is a pure-Java implementation suitable for the JDK - so no JVM hacks allowed!

But this does point an interesting issue with the Java spec. It seems totally within the spirit of Java to allow un-initialized primitive (non-ref) arrays. There's no type-safety issue, and there's plenty of Fortran-ish codes that know the first step of the algorithm is to write into that new array, old contents be damned.

Cliff

Posted by: Cliff Click | May 10, 2007 8:42:44 AM

 

It seems to me that you could optimize for reference arrays too.

If you zero on first access, as long as you can prove the array reference doesn't escape before its first access, you could zero it before it is first exposed or referenced allowing for something like

volatile Object[] foo;

(I apologize I can't for the life of me figure out how to markup code here)

somecode {
if (foo==null) {
Object[] local = new Object[lots];
if (foo==null) {
// zeroing would only happen here
cas(foo, local, null);
}
}
// use foo
}

not that its clear that this would help a lot in your situation if the allocation is much faster than the zeroing

Posted by: graham sanderson | May 21, 2007 7:42:22 PM

 

of course in general this is a pretty useless optimization!

Posted by: graham sanderson | May 21, 2007 7:48:15 PM

 

Not initializing just-about-to-be-killed-arrays is a VERY standard Fortran optimization, and is optimized with something akin to dead-code-elimination (redundant store elimination but on whole arrays).

I want to allocate uninitialized, then optionally immediately free - so any initialization is wasted (except for the one winner). My current solution appears to work fine (stall a bit) but I think array-lets would be OK as well.

Posted by: Cliff Click | May 21, 2007 10:11:28 PM

 

sorry; i meant that my optimization for reference arrays is not very useful since it only helps in the case where you wanted to minimize overwork by concurrent threads (since you HAVE to zero the array eventually) vs. what I think you mean with primitive arrays where the initialization is not really important

Posted by: graham sanderson | May 21, 2007 10:51:58 PM

 

note; i feel as dirty as i'm sure you'd accuse for suggesting a synchronized block around the allocation, but I really only see this as a horrible thing if you are worried about threads dying during allocation while owning a monitor; whilst it'd be nice to survive this situation, are there any VMs which would actually be in a good state after suffering this situation.

Posted by: graham sanderson | May 21, 2007 11:12:21 PM

 

note that you'd probably want a heuristic on which resizes are too cheap to warrant synchronization; say RAM/processors/2, or just <1 meg entries.

Posted by: graham sanderson | May 22, 2007 7:23:07 AM

 

I HAD a lock around allocation originally, but got rid of it for lock-free "bragging rights".

Actually, VMs are supposed to survive threads getting hit with a "Thread.death()" - arbitrary Java exception. It's not a very likely situation, so it's really about bragging rights. (but in fact the TCK tests throwing exceptions this way and HotSpot is routinely tested that it survives this). I'm sure if we had a *hardware* failure of a CPU then the VM would be in bad shape.

Heuristic note: I think I do something like "stall before doing the 3rd allocation >=1Meg".

Cliff

Posted by: Cliff Click | May 22, 2007 7:47:08 AM

 

Post a comment