Odds & Ends
April 15, 2009

Various pending goodies...

Load-Linked/Store-Conditional vs Compare-And-Swap


Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!).  There was a long period of debate in the CompSci community one what constituted the "minimal" needed instruction to do useful multi-CPU work.  Eventually the community has decided that the Compare-And-Swap (CAS) instruction and the Load-Linked/Store-Conditional (LL/SC) instruction combo both are (1) sufficient to do useful work ("infinite concensus number") and (2) relatively easy to implement in real hardware.  X86's, Sparc's and Azul's CPUs use CAS.  IBM's Power, Alpha, MIPS & ARM6 all use LL/SC. ARM5 only has an atomic-swap.

LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms.  With LL/SC, you first Load-Linked a word.  The hardware marks the cache line as "linked".  You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter).  Finally you issue a Store-Conditional to the same address.  If the cache line is still "linked", the store happens.  If not, then not.  The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line.  Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented).  "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail.  I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure.  The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC.

With CAS, you typical first load a word with a normal load, then manipulate it.  Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory.  CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic ABA bug.  In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless all copies of A die first - including the one loaded before the CAS.  Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS.  Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively.  With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link".  E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS). 

These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations.  Nothing in the academic literature requires any sort of global ordering on these atomic instructions.  Instead the usual academic assumption is that all memory operations are strongly ordered - which is obviously not true on all modern hardware.  Practitioners are required to insert memory fences as needed to achieve the desired ordering.  Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do.  Azul's CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations.  (As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking.  Report back the %-tage of dropped counts and the throughput rate.  Then try it with an Atomic counter.  My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless).  I am less familiar with the fence properties of common LL/SC implementations.  Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc?

build.java


Some time ago I reported on a build tool I've been using.  Currently 'build' is being used to build HotSpot internally to Azul.  We've got it building 400+ files, plus build rules for building portions of the JDK, compiling w/gcc & javac and also 'javah', tar, jar, strip, binary signing, etc.  In addition to the typical parallel-make functionality, it includes auto-discovery of c++ header files, intelligent load-balancing of big builds, etc.  It's blazingly fast, it's easy to hack (being written in plain-olde Java).  I hunted around awhile and couldn't find a good place to dump a medium-large blob of code, so here's a sample build.java in 4 parts:

Part 1
Part 2
Part 3
Part 4


A "HotSpot -server" aka C2 compiler IR visualizer

No I didn't do it!  This deserving grad student did it.

Here is the reference to the visualizer for the ideal graph of the HotSpot server compiler:

http://ssw.jku.at/igv/master.jnlp

It is a JNLP application hosted on a university server (http://ssw.jku.at/General/Staff/TW/ has also a test file to download), but the easiest way to get a first impression.  The source code of the tool is hosted on http://kenai.com/projects/igv  The server compiler instrumentation is part of OpenJDK and also included in Sun's weekly builds of Java 7.

More C2 Goodies

John Cavazos wrote:
... One of the things I am currently looking at is determining the right phase-ordering of optimizations applied to a particular program being optimized.  I have some nice (un-published) results for JikesRVM,
but it would be nice to replicate the research for HotSpot...

I have strong arguments for nearly all orderings of our passes, so I'm curious to know about your phase-ordering results.
The only obvious transform we might be missing is PRE, but our peephole opts pretty nearly subsumes PRE (they are not mathematically equivalent - I can write programs where either PRE or the peephole stuff makes progress against the other).  In practice, PRE will find essentially nothing once the peephole opts are done.  You'll notice that we do the peephole pass alot; it's totally incremental and provably linear bounded.  In other words, if there's nothing to be gained then there's no cost in trying.  The peephole pass includes amongst other things all pessimistic forms of copy propagation, value equivalence, constant propagation (especially the null/not-null property), constant test folding, repeated test folding, dead code elimination, load-after-store opts (aliasing analysis is included for free during building of SSA form), algebraic properties, and lots more.


For HotSpot, the optimization ordering is:

  • Build an inline tree, including class hierarchy analysis.  This is the one pass I'd be willing to move, as it happens too early.
  • (a single unified pass:) parse bytecodes, inline, build SSA form (the IR remains in SSA form always), do peephole opts over SSA form.  This pass typically costs 40% of compile-time budget.
  • Fixed-point the peephole opts over SSA form, once all backedges are known after parsing.
  • Loop opts round 1: "beautify loops" (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, split-if (zipper-peel CFG diamonds with common tests, plus some local cloning where I can prove progress), peel loops (required to remove loop-invariant null checks)
  • Fixed-point the peephole opts over SSA form
  • Loop opts round 2: "beautify loops" (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, lock coarsening, split-if & peel - but if these don't trigger because there's nothing to gain, the do iteration-splitting for range-check-elimination & a 1st loop unrolling.
  • Fixed-point the peephole opts over SSA form
  • Conditional Constant Propagation (the optimistic kind, instead of the pessimistic kind done by the peephole pass)
  • Iterate loop (split-if, peel, lock coarsen - but these typically never trigger again and take very little time to check), unrolling & peephole passes, until loops are unrolled "enough".  On last pass, insert prefetches.  Typically this iterates once or twice, unless this is a microbenchmark and then unrolling might happen 8 or 16 times.
  • Remove tail-duplication, and a bunch of other minor code-shaping optimizations e.g. absorb constant inputs into deoptimization-info in calls, or commuting Add ops so that 2-address machines can do update-in-place.
  • Convert "ideal" IR into machine code IR.
  • Build a real CFG for the 1st time, including a dominator tree, loop tree.  Populate with frequencies from earlier profiling.
  • Global latency-aware (loop-structure-aware) scheduling.
  • Replace null-checks with memory references where appropriate.
  • Register allocate.  Internally this has many passes.  This pass typically costs 40% of compile-time budget.
  • Sort basic blocks to get good control flow ordering (forward branches predicted not-taken, backwards predicted taken, etc)
  • Some last-minute peephole opts.
  • Emit code into a buffer, including OOP-maps & deoptimization info.


Travel Plans

I got too much travel coming up!

Apr 30, May 1st - DaCapo in Boston.  Favorite small group; GC & Java focused. Website: http://www.cs.tufts.edu/research/dacapo/.  I had to turn down the invite to an SSA seminar in France because the dates conflicted.  Very sad.  I hope one of the attendees will post a trip report.

May 11-May 15.  IFIP WG2.4  (International Federation of Information Processors, Working Group 2.4 - a *really* old European group with a random mix of industry & academia.)  It's a nice group to preview JavaOne talks, and always the meetings are a week-long rambling discussion in some quaint resort.

June 2-June 5.  JavaOne.  I owe slides for 3 talks by end of this week.  'enuf said.

July 6-10th - ECOOP, as a paid-for invited speaker.  A free vacation to Italy!   :-)

Cliff

Category: Web/Tech | | TrackBack (0)

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451bd7669e201157020ab72970b

Listed below are links to weblogs that reference Odds & Ends:

 

Comments

ARM doesn't have LL/SC (nor do they have CAS). They have an atomic swap. See A3.13 of http://www.arm.com/miscPDFs/14128.pdf

Posted by: Arm Guy | Apr 15, 2009 11:02:08 AM

 

Fixed! Thanks - but it appears than ARM6 *does* have LL/SC - LDREX & STREX.

Posted by: Cliff Click | Apr 15, 2009 11:40:31 AM

 

http://gist.github.com/ might be a good place for posting build.java

Posted by: Christian Vest Hansen | Apr 16, 2009 8:47:37 AM

 

Thanks - I gave it a shot, but the default setup has a similar file length issue to dzone snippets. Or at least it truncated my attempt at pasting into the widget.
Cliff

Posted by: Cliff Click | Apr 16, 2009 12:13:38 PM

 

Some readers will be confused by your statement that HotSpot is doing "global scheduling". So at the risk of being pedantic I have to point out that "ordering" and "scheduling" are not the same thing. A schedule is a timeline that accommodates fixed resources.

I have a pretty large ToDo list of things I need to accomplish. I manage to keep it up-to-date and start things early that will take a long time, but don't ask me which task I'll be doing at 4pm on the third Tuesday in May. Scheduling is a much harder problem, particularly with lots of dependencies.

GCM is nothing near a scheduler. If you don't think it's worthwhile for compilers to attempt scheduling, then let's be clear about that. Nothing wrong with keeping a JIT lightweight.

Posted by: Andrew Trick | Apr 24, 2009 6:38:51 PM

 

What's in HotSpot is beyond GCM; GCM finds many possible orderings. What little is published merely said "of all choices, pick the latest".

The GCM in HotSpot does a latency balancing act, so that long latency operations end up in earlier basic blocks. Within each basic block there's actual real list scheduler, with knowledge of finite functional units and actual functional-unit delays. A good example is to look at an unrolled dense linear algebra loop, with FP-unit delays and load-unit delays. It *will* be scheduled nicely.

This all happens prior to register allocation, so spills can mess the schedule up.

There was an effort to do an Itanium-style packing; I don't know how that effort panned out.

Cliff

Posted by: Cliff Click | Apr 24, 2009 7:40:03 PM

 

The docs for the IBM PowerPC G5 say that the chip implements LL under the hood as a read-for-write, i.e. grab the cacheline in exclusive mode. Further, that cacheline will then be held for a long time waiting for a subsequent SC before being released in despair, so an LL without a corresponding SC could cause some bad starvation effects.

As far as I can tell, if coupled with a fair scheme for passing around cachelines and a sufficiently "nice" guarantee about scheduling quanta from the OS, that gives a wait-free progress guarantee on the combined LL/SC operation. Wait-free CAS, for instance, doesn't give you wait-free increment, as all CASes could fail. Wait-free LL/SC allows a wait-free implementation of all single-word atomic functions, including increment. Obviously the hardware design doesn't scale well to large numbers of CPUs (but then wait-freedom so rarely does), so not terribly useful for Azul.

Posted by: Chris Purcell | May 1, 2009 4:22:56 PM

 

Azul's CAS has the guarantee that if the expected value is correct, then exactly CAS *will* succeed - i.e., you are guaranteed forward progress if the expected value is right.

(And another guarantee that after a sufficiently long amount of time being held in 1 L2 while being requested in another L2, that it *will* migrate and that migration is roughly round-robin. So no starvation either.)

Cliff

Posted by: Cliff Click | May 1, 2009 6:13:17 PM

 

There was a paper a few years ago proving that wait-free CAS only gives you wait-free single-word operations in general if you're willing to sacrifice O(# threads) memory. IIRC atomic increment was on that list. (Intuitively, every thread needs its own dedicated word to CAS, or it risks never being able to make a visible change.)

Posted by: Chris Purcell | May 4, 2009 3:54:43 AM

 

Sure, although spending O(#threads) isn't any big deal.

The High Scale Java libs include a high-scaling counter, which solves the problem exactly by throwing memory at it. The counter is striped in an array, and if CAS's start failing too often the array is grown to spread the CAS's to different cache lines.

Running the NonBlockingHashMap scaling test on a 864-way machine, the counter array typically grows to 1024 words - or 256 unique cache lines. So every cache line is shared with roughly 3 CPUs. Since the CPUs have to do other work than just spin on the counter (i.e. make a unique String key, do a lookup/mod on the table, sanity check results) that sounds about right.

Cliff

Posted by: Cliff Click | May 4, 2009 8:08:23 AM

 

Post a comment