Cliff Click Jr.’s Blog's Blog

« Why WeakHashMap Sucks | Main | Clifford vs The YellowJackets »

More thinking about Non-Blocking structures
September 14, 2007

I like the notion of using arrays to hold collections (efficiency (really cool paper), ease of access), but how do you incrementally migrate a collection from one array to another without blocking?  By it's nature, you can't atomically update the entire array at one go.  By the nature of concurrency, there can always be a 'late arriving writer' who wants to update the old array and isn't aware that a migration is in progress.  If you let the writer's write proceed, you risk losing his update (classic hard problem in non-blocking algorithms: dropped update during a 'resize').  How do you get the late-writer's attention that something is going on, when (pretty much by definition) checking an unrelated volatile flag can't do the job?

This deserves a little more discussion; checking the volatile flag narrows the race but always: the writer can check the flag, then the resizer sets the flag, then copies the old data to the new structure, then the writer (having seen the flag is clear) updates the old data, and the resizer then announces the new structure as "the Real Data" and the writer's update has silently been dropped.  In a non-GC'd world it gets even worse: the resizer frees the old structure and the late writer updates into the dangling memory.

The idea here is to update something that the writer must also atomically update.  i.e., the resizer CAS's down a flag into the memory that the late writer must also CAS.  The late writer must either witness the flag (and fails his update, and the flag tells him why), or his update "beat" the resizer's work - and his update is now part of the state being copied/resized into the new array.

BUT (here's the Non-Blocking catch-22), you can't replace the existing data with just a flag without copying it first - because doing so means you've "hidden" the old data.  i.e., if you don't copy the old data, it exists only in your internal thread state between the time you CAS down the flag and the time you publish the old data.  If the OS decides to swap out the copy-thread between those writes the old data can remain hidden indefinitely.  Other threads can see the flag but not the data - so they get "blocked" waiting on you to publish the old data.  e.g., a late-reader will have to wait/block until the OS allows the copy thread to publish the data in the new structure.

My first cut of a NonBlockingHashMap solved this by copying the data to the new structure first, then going back and setting the flag in the old structure.  The problem with this approach is that a late writer can update the old structure - and the copying thread has to restart the copy again.  This lead me to the solution of a 2-phase-commit copy, with as many repeated copy attempts as late-arriving writers (upper bound is the number of threads in the program).  A major downside to this approach is if multiple racing writers are writing and multiple racing copy-threads are copying, then a reader racing against them all can see the writer's writes flip-flopping more times than you might expect.  Technically, this is legal by the JMM (since all threads are racing against each other) but it's not desirable.

So here's the New Plan: CAS a flag into the old structure that does not hide the old data.  This means I need to add a bit to every word - easy in C (or most non-GC'd languages where pointers tend to be "plain") but harder in a GC'd language.  I add the bit in Java by boxing the data - presence of the box is the 'flag' but the data is still available inside the box. (there's this funny AtomicMarkableReference class that looks real close to what I want, but I see in the HotSpot source that its not intrinsified).

The flag forces late-writers to check for a new structure and put their updates in the new structure.  Since the old data remains available, no thread is 'blocked' reading the old data.  This also means that the old data can be copied into the new structure by a different thread than set the flag - allowing for concurrent parallel copying, a requirement for timely copying of very large arrays!

--------------------------------------------------------------------------

Here's the big picture.  Key slots are unchanged from before; either null (slot empty) or some Key (which never changes again) or the big X (meaning: go to the new table).  Value slots are a little different from before: either null (no value ever written), or some Value or Tombstone (may flip wildly) and finally boxed{Value/Tombstone} (and never changes again).

New Key insertion is as before (reprobe to find an empty key-slot then CAS down the key; finding an X means retry operation in the new table).

If a thread find the proper Key, it then checks the Value slot.  Normal Values are returned as a hit.  If it sees null or Tombstone, a reader reports a miss.  If it see a box then it retries in the new structure.  Same basic rules for a writing thread: you're not allowed to update a box.

Copying threads: 

  1. Transit unused Key slots to X (CAS null->X)
  2. For used Key slots:
  3.   Sets the flag (boxes the existing value) if not already done (CAS V->box{V})
  4.   Reads the boxed value
  5.   If boxed value is NOT a tombstone then
  6.     Claims/finds a Key slot in the new structure (CAS null->K)
  7.     Writes the boxed value in the new structure, if nothing else is already there (CAS null->V)
  8.     If something is already in the new table, do nothing: a recent update overwrote anything in the old table

Then the resize operation as a whole proceeds like this:

  • Decide to resize (array full, too many reprobes, etc)
  • Create & CAS-install a new target array.  At this point future users get slower by some small amount, so we want the copy to complete in a reasonable period of time.
  • Copy all live slots from the old to the new
  •   - Copy slots with any combination of readers, writers and background threads
  •   - Copy "some" slots with each touch (incrementally slowing down users to get the copy done)
  • Promote the new array as "the array"

No more 2-phase copy!  No more excessive value flip-flops (when racing with multiple writers & copy-threads)!

--------------------------------------------------------------------------

Now here's the Big Picture, the Crucial Observations:

  • Only units of atomic-update count - unrelated words (the original volatile flag mentioned) can only narrow the race, not close the race
  • I must touch each datum that can be atomically updated, because that's the only way to inform other updaters that something special is going on
  • I can do it with nothing more than 1 bit-per-word (really: per unit of atomic update).  I can do it with much less actually, since stealing 1 address for the Box per Value out of my 2^32 (or 2^64) is really tiny.... but I must steal something out of the address space.
  • I no longer need to check a "resize in progress" volatile flag before proceeding with any operation, as that doesn't close the race in any case - and adds a constant amount of work per operation.

Chris Purcell noted something similar to this thesis: you must be able to atomically update a word, but you'd really like to atomically read many locations along with that single word update.  He proposes a primitive that's stronger than CAS, but is both weaker than DCAS (updates only a single word) and stronger than DCAS (can read atomically more than 2 words).  I really like his primitive, and I hope it sees the light of day (and Yes, I'm ideally placed to make that happen!).

Really, I'm approaching a design-pattern for non-blocking coding of abstract data types: use arrays, use state machines per-array-word, and box/flag every word when it's time to switch arrays.  All the rest is straightforward engineering.

--------------------------------------------------------------------------

My NBHM work has been moving along, albeit slowly - duty at Azul calls first.  In the SourceForge CVS (but not released) are ideas for lightweight iteration & ConcurrentHashSet... and at some point I'll redo the core algorithm to remove the 2-phase commit.

Thanks for all the comments (and excitement) we've seen so far!  This whole Non-Blocking thing has really got people way more excited than I ever dreamed... and it seems to me that some really hard problems are starting to give way.

Cliff Click

Category: Web/Tech | | TrackBack (0)

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451bd7669e200e54eddac178833

Listed below are links to weblogs that reference More thinking about Non-Blocking structures:

 

Comments

Just to be sure I'm on the same page here: instead of copying the data to your new array, then flagging it in the old one (with a lock-free guarantee because the two steps can be interrupted), you copy the data to a new autoboxed variable, then update the old location to the autobox (with a lock-free guarantee because the two steps can be interrupted). Surely this just increases your memory usage and decreases locality of reference during the update?

Posted by: Chris Purcell | Sep 17, 2007 1:54:04 AM

 

Oops, not quite...
Old model:
(1) Read data from old and CAS into new, adding a prime as a go (autoboxing)
(2) CAS a "no more changes" into old table
(3) if (2) fails due to concurrent update, restart at step (1)
(4) CAS-remove prime (unbox) from new table

So the old model always autoboxed AND looped on contention (but contention is guaranteed limited to number of late-arriving threads). Note that you never had to loop, but if you don't succeed at (3) then the copy did not make progress. Note that the value being CAS'd varied by copying thread based on the last read/last write timing - so that multiple copy threads can conflict with each other (until the old-table value settles out).

New model - tighter looping, fewer CASes and still autoboxing:
(1) CAS a prime in the old table (autobox)
(2) if (1) fails due to contention, repeat (but see below)
(3) Read old and CAS to new table, removing prime as you go (unbox)

In the new model I do not need to read the new table until I have already marked the old table. I do not need a 2-phase commit, nor need both old and new tables to agree.

If I choose not to loop at the new-model step (2), then the copy does not make progress BUT step (1) is the same for all threads (unlike in the old model). So if ANY copy thread succeeds (i.e., CAS does not fail spuriously) then the table copy as a whole makes progress.

Cliff

Posted by: Cliff Click | Sep 17, 2007 8:02:56 AM

 

How do you assist the copy phase without the risk of overwriting a more recent value in the new table by mistake? Hmmm...perhaps if I just wait for you to publish the code, this would be easier ;)

Posted by: Chris Purcell | Sep 18, 2007 8:46:22 AM

 

The new table is init'd to null - in step 7 I only CAS null->V. null's never an allowed value. If I want to delete, I'll insert a Tombstone. If I want to map to a null, I'll use a special sentinel.

(I'm furiously working on Azul's J6 port, so no updates to the NBHM code for a few weeks ) :-(

Cliff

Posted by: Cliff Click | Sep 18, 2007 9:06:00 AM

 

Any JDEE experts Out There? I debugged the original NBHM only with printf's (well, System.out.println's). I'm a decent emacs hack, so I figured I'd try an emacs-friendly IDE for future development... but configuring JDEE is giving me fits.

Thanks,
Cliff

Posted by: Cliff Click | Sep 18, 2007 9:08:36 AM

 

Isn't this basically the way the old lisp machines did it during their GCs? Only they used forwarding pointers instead of boxes.

Posted by: Michael Parker | Sep 21, 2007 10:44:42 AM

 

Everything Old is New again?

Those who can't remember the past are condemned to re-invent it?

I was struck by the similarities between Azul's GC hardware and the old Symbolics stuff when I read the papers a few years ago (there's clear differences as well, 25yrs makes a big difference). Both steal a pointer bit to signal "this value is needs special treatment before you use it" (same as this algorithm).

However Azul stores the notion for "this value is stale go elsewhere" NOT in the pointer ... and I can't remember where Symbolic's stored that same notion, so maybe I am repeating the past here.

Cliff

Posted by: Cliff Click | Sep 21, 2007 12:12:55 PM

 

The Symbolics Ivory processor stored the word type in 8 bits of tag that were associated with each 32-bit value, making for an odd 40-bit word size. The forwarding flag was just another type except that in normal circumstances the memory subsystem recognized it on incoming values and handled things itself.

ah memories, misty watercolored memories...

Posted by: Michael Parker | Sep 22, 2007 1:40:35 AM

 

If you're just resizing an array, have you considered using ropes instead of block copying?
http://en.wikipedia.org/wiki/Rope_(computer_science)

Posted by: Michael Brundage | Oct 14, 2007 12:38:46 PM

 

Too much overhead in the normal non-copying case! In any case, I don't do block-copying per-se; threads touching the hash table do a fixed amount of the copy work. There's no requirement for atomic bulk copy; it's entirely fine-grained stripable.

Cliff

Posted by: Cliff Click | Oct 15, 2007 8:27:28 AM

 

Hi Cliff,

Regarding lock and wait free structures. Your concept and design using state machine is great.

For real time java based system, I have started using non-blocking hashmap, hashset structures. I get around 30,000 objects per second. Some threads do building of tree structure, some threads do iteration, some threads remove expired objects based on time from non blocking structure. All these operations should happen concurrently without blocking. So I feel your structures are best.

a> There should be no synchronization in code. I tested this with non blocking hashlong, producer thread adding object and consumer thread removing object. Can thread which is iterating structure also remove object? Without "concurrent modification exception".

b> Kindly let us know how does non blocking structures work and concept behind "wait and lock free".

c>It seems nonblocking Set does not support "Comparator" like sun java Set/TreeSet. Is there any work around for this? To make this non blocking I have taken "sun" code for TreeSet modified to remove concurrent modification exception. This seems to work but bit fuzzy. There is a memory leak problem. I want to use azul structures with comparator clause.

d>With 30,000 objects/secound I need optimal jvm tunning - no major collections but only minor collections - need to attempt real time jvm. Any advice on this.

Thanks

Posted by: satish | Nov 9, 2008 12:46:47 AM

 

Post a comment