|
Just What-the-Heck is a "wait-free" algorithm? What is a wait-free algorithm? Wikipedia has a nice article on it. There's the obvious definition: all threads complete the job in a finite number of steps - where a "step" is some unit of work granted a thread by having the OS give the thread some time-slice on a real CPU. Wait-free isn't interesting for loop-free (and not recursive) programs - there's only a finite number of steps in a finite piece of straight-line code. However, wait-free is often used to describe multi-threaded programs - which often communicate between each other with some sort of atomic-update machine instruction (frequently a CAS instruction but sometimes it's LL/SC and sometimes something else). To re-cap: wait-free algorithms in the real world rely on fragile failable machine instructions - and the failure is often fixed by looping while retrying. To remain a wait-free algorithm, the algorithm must *somehow* know the all loops are of finite length - i.e., failure is finite. This is often a reasonable assumption - under low load most machines' atomic-update instructions mostly work. But now let us posit an adversarial OS. The definition of wait-free doesn't mention an Operating System - and yet real programs running on real hardware have to deal with a real OS. Programs often more threads than CPUs and one of the OS's jobs is to schedule the threads - grant them "steps" on the CPU to make forward progress. The OS chooses when a thread gets CPU cycles (steps) and for how long. Typically OS's attempt to be fair and efficient, granting all threads an equal number of steps in a round-robin fashion. However, the OS isn't required to be completely fair and typically they are not (in the name of efficiency). A wait-free program can make forward progress in the face of an OS that is not being nice, i.e. adversarial OS. Let's assume the adversarial OS has to be at least slightly fair: it can't deny cycles to a thread forever. Indeed, if the program runs forever the the OS must hand out an infinite number of cycles to all threads (the unfair adversary isn't very interesting: the OS can simply choose to not hand out cycles some threads, or even to any thread - ala a hard-crash). What kind of damage can this fair adversarial OS do to our algorithms? For starters, the OS can guarantee that no contended LL/SC ever succeeds! Its easy: the OS schedules threads right up until they execute the LL; then that thread isn't allowed cycles again until another thread also executes an LL to the same address. For all known implementations of LL, this will cause the 1st thread to break the "link" part of load-linked. The first thread is then allowed cycles until it tries another LL. The 2nd thread is preempted at once of course, since it also just executed an LL. Hence the OS ends up allowing all threads an infinite number of cycles while guaranteeing no LL/SC ever succeeds. What does this all mean? It means in theory (since the adversarial OS is a theoretic notion) that IBM, Alpha & MIPS are screwed: they cannot use their primary atomic-update operation to write any algorithm without dragging in the OS. i.e., no one can write even plain locking algorithms on their hardware unless you also can show that the OS is at least slightly not-adversarial. Truly random time-slices would work - but fixed-cycle time slicing will be adversarial for algorithms which are "beating" in tune with the time slice. It's a little worse than that: the "load" part of LL is a ... well, a load. A contended load. It will miss in cache and thus be very slow, with a larger-than-normal cycle count attributed to it. The random-slice OS is slightly more likely to context switch immediately after the missing load than at other points in the program. Of course, the context switch gives other CPUs a huge chance to request the same address - which means that the poor loser thread is both going to lose this LL/SC but also take another cache miss when it retries (and thus have another greater chance of a context switch). This is all very theoretical: OS writers try hard to make their OS's not adversarial and exponential backoff/retry loops work great (or worked great on the 4-cpu PowerPC machines I used years ago). But you had to do it when you used LL/SC or you would get CPUs live-locked spinning on LL/SC. How does CAS fair in the face of the fair adversarial OS? All hardware docs I could find claimed that a CAS does succeed if the initial conditions are correct. According to the docs, assuming that multiple threads attempt to update the same address via CAS using the same initial load then exactly one should succeed. Here then the adversary cannot stop at least one thread from making forward progress. This means that CAS can at least be used to make lock-free algorithms. The OS can arrange for some loser threads to always lose CAS races, by preempting them out after they load the initial CAS conditions and not granting them cycles until after the winner thread does a successful CAS update to the same location. In other words, you can't use CAS directly to write wait-free algorithms (you always write wait-free algorithms without either CAS or LL/SC, although they tend to be brutally slow). Now here's another interesting bit: I distinctly recall using machines where CAS would sometimes fail for all threads even when everybody used the same correct initial conditions. Indeed, I want to claim it was true on early multi-socket Intel machines - but I cannot find any written documentation to support my memory! My fuzzy recollection was that since Intel didn't build the motherboards, Intel would not guarantee forward progress on contended CAS's - which had to do an external R/M/W cycle and the correctness of which all depended on the motherboard manufacturer. Indeed it seems to me that unless a CPU designer also designs the details of how CAS interacts across chip boundaries, then that CPU cannot make forward progress guarantees about CAS. I.e., like IBM has to claim the OS running on their gear is not a fair adversarial OS in order to make LL/SC useful, so also Intel has to claim that motherboards running their chips implement a strong version of the R/M/W cycle to make CAS useful. Ok, this is all stretching the point a great deal. In practice these things work well- at least on smaller CPU count machines. But CPU counts are on the rise, and the issues mentioned here become less and less academic with each passing year. On an Azul 864-way machine, if all threads are attempting to CAS the same location in a tight loop Azul promises you exactly one succeeds. But also I can promise you that at least one CPU will perpetually fail unless you invoke some sort of fair-lock (which we do in the OS; Java threads can't quit write CAS loops that are so tight as to prevent all forward progress to one CPU whereas hand-written asm in the OS did). I wonder how well a 1000-cpu LL/SC machine will fair on such loops and how much exponential backoff will be required (and how much inefficiency will that cause)? Or maybe the answer is: on some future 1million-cpu machine, if you have to CAS all CPUs on the same cache line then you are screwed: performance is so bad it's indistinguishable from a crash; you'll hit ^C and try again. Hence we all end up learning how to write programs which don't do that... I'll let you know if I get it figured out... :-) Cliff Category: Web/Tech | | TrackBack (0) TrackBackTrackBack URL for this entry: Listed below are links to weblogs that reference Just What-the-Heck is a "wait-free" algorithm?: CommentsThis whole site has tiny font disease. What's wrong with 12 or 13 point? Posted by: eric | Aug 8, 2008 10:56:20 PM "The OS can arrange for some loser threads to always lose CAS races, by preempting them out after they load the initial CAS conditions and not granting them cycles until after the winner thread does a successful CAS update to the same location. In other words, you can't use CAS directly to write wait-free algorithms." This doesn't follow. Read Herlihy's "Wait-Free Synchronization" paper for a CAS-based wait-free universal transformation: it always completes in a fixed number of steps, regardless of how nasty the OS is. Wait-free algorithms do still tend to suck on massively parallel machines, because the majority of them have execution times (and memory costs) that are at least linear in the number of threads in the system. Posted by: Chris Purcell | Aug 9, 2008 6:35:27 AM "In other words, you can't use CAS directly to write wait-free algorithms". That is not correct. Maurice Herlihy in his 1993 paper entitled "A methodology for implementing highly concurrent data objects" had proved that you can implement any data structure as a wait-free algorithm even in the face of adversarial OS given wait-free CAS (as implemented on modern architectures). His construction is not very efficient in practice, but it still yields an algorithm that is guaranteed to run in a fixed number of steps regardless of contention, that is "wait-free algorithm". Posted by: Roman Elizarov | Aug 9, 2008 8:56:43 AM Ok, I must be dense of reading. I've read & re-read that paper and I don't see the conflict in what we're both saying. Some comments on that paper: - Herlihy assumes *strong* LL/SC, but in practice only *weak* versions are in any kind of main-stream use. In other words, he assumes limited spurious failure, but the LL/SC's I've only ever seen or used allowed infinite spurious failure - and indeed I was able to achieve infinite spurious failure commonly and quite accidentally. The comment "you can implement any data structure as a wait-free algorithm even in the face of adversarial OS" and what I'm saying above are not in conflict. If you can write a wait-free alg where CAS always fails except for 1 thread, then the CAS can be replaced (with failing branches for loser threads and a simple store for the winner thread) and you still have a wait-free alg. His algorithm, set inside an adversarial OS, still manages to be wait-free - while having only 1 winner thread ever win a CAS. Hence you can replace his use of SC with a plain store for the winner thread and the install-new-result-test SC can always be false for the losers (and so that clause can just be eliminated from Fig 10 for the losers). Essentially, the winner does all the work - and the losers do exponential back-off until the winner finishes the job. No CAS nor LL/SC needed hence this wait-free algorithm can be written without a CAS/LL/SC. Instead LL/SC is used to make it more efficient but LL/SC isn't *required* to make it wait-free. Cliff Posted by: Cliff Click | Aug 9, 2008 10:32:54 AM This seems like an ideal place to plug our (dissenting) book: http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916 Chapter 5 explains why you *cannot* implement most useful WF or LF data structures without using CAS,LL/SC, or equivalent, and Chapter 6 explains how you *can* build a WF implementation of *any* sequential object from CAS (or ideal LL/SC) even if the OS hates you. Plus lots of other crunchy, multicore goodness. Posted by: Maurice Herlihy | Aug 9, 2008 3:16:43 PM I'd like to clarify (ok, weaken) my position ever so slightly. :-) The algorithm that results from taking a WF algorithm and choosing a winner thread, and rewriting 2 copies of the code to not use CAS's has this property: "If ALL threads are given an infinite number of 'steps', then ALL threads make infinite forward progress." Note that this is slightly different from this wording of WF: "If Any threads are given an infinite number of 'steps', then those threads make infinite forward progress." Under a WF algorithm, if the OS permanently halts a thread the other threads continue to make progress. Not so with my hacked CAS-less version. Yet my algorithm is stronger than Obstruction-free: there is no requirement for an thread to have 'k' un-obstructed steps in a row for forward progress; round-robin fine-grained interleaving is fine - the requirement is only that all threads receive steps. Also, my algorithm is stronger than Lock-free is this way (but not in another way; the algorithms are not directly comparable): with a Lock-free algorithm if some threads are given an infinite number of 'steps' then at least one of those threads make infinite forward progress - but perhaps only one thread. Under my algorithm, ALL threads make progress. So I think the disagreement arises because I've got a class of algorithms that doesn't neatly fit into WF, LF or OF. Last note: this programming style (one Winner/Actor/Owner thread doing all the work for a datastructure, and Losers/Workers waiting for work to complete) has been around quite awhile. It looks alot like what you might do with Erlang or Scala (being careful here to not drag in msg-passing, I'm just talking about the notion of Actor who is responsible for doing all work on a datastructure so that structure can remain coherent).
Posted by: Cliff Click | Aug 10, 2008 1:06:39 PM Guaranteeing forward progress is even more problematic with STM which is One of the things I do with lock-free algorithms is make read access wait-free Posted by: Joe Seigh | Aug 11, 2008 2:34:28 PM (sorry for the long delay; got busy at work) "Using algorithms and tricks to reduce contention will probably be the key to achieving scalability when core counts start climbing." I think I'm already there. At Azul Systems we've found - - Obstruction-free algorithms are not suitable *at all* for production work. When hundreds of cores start obstructing each other NOBODY makes any forward progress, ever. Complete hang. The behavior is bi-modal instead of a graceful loss of throughput. - Lock-free algorithms are *mostly* not suitable for production work. Lock-free guarantees forward progress overall, but perhaps only for 1 thread. In practice, we'll have e.g. 100 threads/cpus making forward progress while 700 make no progress at all. It depends on the ratio of "other" work to the "atomic" work, so sometimes these do pan out. - Wait-free algorithms are not suitable for production work - either too hard to write, or too grossly inefficient (and yes we're willing to take some interestingly large slow-downs if we could make forward progress). Also wait-free is "too strong" for us - wait-free guarantees forward progress even if 1 cpu stops executing instructions. But CPUs don't stop executing instructions, or at least catastrophic failure is really really rare (and when it happens, it's *catastrophic* since the whole program isn't written as a giant WF algorithm). So I'm looking at algorithms that fit inbetween lock-free and wait-free - not-wait-free in that they require a "fair" scheduler (but CPUs are nearly always "fair" in that sense), but stronger than lock-free in that they guarantee forward progress for all, not just for some. (and I'm trying hard to avoid "fair-locks" because Amdahl's law just kills me). Cliff Posted by: Cliff Click | Aug 19, 2008 9:20:30 AM Post a comment |


