|
Odds & Ends Various pending goodies... Load-Linked/Store-Conditional vs Compare-And-Swap
LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms. With LL/SC, you first Load-Linked a word. The hardware marks the cache line as "linked". You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter). Finally you issue a Store-Conditional to the same address. If the cache line is still "linked", the store happens. If not, then not. The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line. Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented). "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail. I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure. The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC. With CAS, you typical first load a word with a normal load, then manipulate it. Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory. CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic ABA bug. In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless all copies of A die first - including the one loaded before the CAS. Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS. Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively. With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link". E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS). These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations. Nothing in the academic literature requires any sort of global ordering on these atomic instructions. Instead the usual academic assumption is that all memory operations are strongly ordered - which is obviously not true on all modern hardware. Practitioners are required to insert memory fences as needed to achieve the desired ordering. Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do. Azul's CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations. (As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking. Report back the %-tage of dropped counts and the throughput rate. Then try it with an Atomic counter. My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless). I am less familiar with the fence properties of common LL/SC implementations. Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc? build.java
A "HotSpot -server" aka C2 compiler IR visualizerNo I didn't do it! This deserving grad student did it. Here is the reference to the visualizer for the ideal graph of the
HotSpot server compiler:
More C2 Goodies
John Cavazos wrote:
... One of the things I am currently looking at is determining the right phase-ordering of optimizations applied to a particular program being optimized. I have some nice (un-published) results for JikesRVM, but it would be nice to replicate the research for HotSpot... I have strong arguments for nearly all orderings of our passes, so I'm
curious to know about your phase-ordering results.
Travel PlansI got too much travel coming up! Apr 30, May 1st - DaCapo in Boston. Favorite small group; GC & Java focused. Website: http://www.cs.tufts.edu/research/dacapo/. I had to turn down the invite to an SSA seminar in France because the dates conflicted. Very sad. I hope one of the attendees will post a trip report. May 11-May 15. IFIP WG2.4 (International Federation of Information Processors, Working Group 2.4 - a *really* old European group with a random mix of industry & academia.) It's a nice group to preview JavaOne talks, and always the meetings are a week-long rambling discussion in some quaint resort. June 2-June 5. JavaOne. I owe slides for 3 talks by end of this week. 'enuf said. July 6-10th - ECOOP, as a paid-for invited speaker. A free vacation to Italy! :-) Cliff Category: Web/Tech | | TrackBack (0) TrackBackTrackBack URL for this entry: Listed below are links to weblogs that reference Odds & Ends: CommentsARM doesn't have LL/SC (nor do they have CAS). They have an atomic swap. See A3.13 of http://www.arm.com/miscPDFs/14128.pdf Posted by: Arm Guy | Apr 15, 2009 11:02:08 AM Fixed! Thanks - but it appears than ARM6 *does* have LL/SC - LDREX & STREX. Posted by: Cliff Click | Apr 15, 2009 11:40:31 AM http://gist.github.com/ might be a good place for posting build.java Posted by: Christian Vest Hansen | Apr 16, 2009 8:47:37 AM Thanks - I gave it a shot, but the default setup has a similar file length issue to dzone snippets. Or at least it truncated my attempt at pasting into the widget. Posted by: Cliff Click | Apr 16, 2009 12:13:38 PM Some readers will be confused by your statement that HotSpot is doing "global scheduling". So at the risk of being pedantic I have to point out that "ordering" and "scheduling" are not the same thing. A schedule is a timeline that accommodates fixed resources. I have a pretty large ToDo list of things I need to accomplish. I manage to keep it up-to-date and start things early that will take a long time, but don't ask me which task I'll be doing at 4pm on the third Tuesday in May. Scheduling is a much harder problem, particularly with lots of dependencies. GCM is nothing near a scheduler. If you don't think it's worthwhile for compilers to attempt scheduling, then let's be clear about that. Nothing wrong with keeping a JIT lightweight. Posted by: Andrew Trick | Apr 24, 2009 6:38:51 PM What's in HotSpot is beyond GCM; GCM finds many possible orderings. What little is published merely said "of all choices, pick the latest". The GCM in HotSpot does a latency balancing act, so that long latency operations end up in earlier basic blocks. Within each basic block there's actual real list scheduler, with knowledge of finite functional units and actual functional-unit delays. A good example is to look at an unrolled dense linear algebra loop, with FP-unit delays and load-unit delays. It *will* be scheduled nicely. This all happens prior to register allocation, so spills can mess the schedule up. There was an effort to do an Itanium-style packing; I don't know how that effort panned out. Cliff Posted by: Cliff Click | Apr 24, 2009 7:40:03 PM The docs for the IBM PowerPC G5 say that the chip implements LL under the hood as a read-for-write, i.e. grab the cacheline in exclusive mode. Further, that cacheline will then be held for a long time waiting for a subsequent SC before being released in despair, so an LL without a corresponding SC could cause some bad starvation effects. As far as I can tell, if coupled with a fair scheme for passing around cachelines and a sufficiently "nice" guarantee about scheduling quanta from the OS, that gives a wait-free progress guarantee on the combined LL/SC operation. Wait-free CAS, for instance, doesn't give you wait-free increment, as all CASes could fail. Wait-free LL/SC allows a wait-free implementation of all single-word atomic functions, including increment. Obviously the hardware design doesn't scale well to large numbers of CPUs (but then wait-freedom so rarely does), so not terribly useful for Azul. Posted by: Chris Purcell | May 1, 2009 4:22:56 PM Azul's CAS has the guarantee that if the expected value is correct, then exactly CAS *will* succeed - i.e., you are guaranteed forward progress if the expected value is right. (And another guarantee that after a sufficiently long amount of time being held in 1 L2 while being requested in another L2, that it *will* migrate and that migration is roughly round-robin. So no starvation either.) Cliff Posted by: Cliff Click | May 1, 2009 6:13:17 PM There was a paper a few years ago proving that wait-free CAS only gives you wait-free single-word operations in general if you're willing to sacrifice O(# threads) memory. IIRC atomic increment was on that list. (Intuitively, every thread needs its own dedicated word to CAS, or it risks never being able to make a visible change.) Posted by: Chris Purcell | May 4, 2009 3:54:43 AM Sure, although spending O(#threads) isn't any big deal. The High Scale Java libs include a high-scaling counter, which solves the problem exactly by throwing memory at it. The counter is striped in an array, and if CAS's start failing too often the array is grown to spread the CAS's to different cache lines. Running the NonBlockingHashMap scaling test on a 864-way machine, the counter array typically grows to 1024 words - or 256 unique cache lines. So every cache line is shared with roughly 3 CPUs. Since the CPUs have to do other work than just spin on the counter (i.e. make a unique String key, do a lookup/mod on the table, sanity check results) that sounds about right. Cliff Posted by: Cliff Click | May 4, 2009 8:08:23 AM Post a comment |



| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | ||
| 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | 31 |