Clojure: STMs vs Locks
May 28, 2008

I've been participating in this fascinating discussion with Rich Hickey, the author of Clojure about Software Transactional Memory.  I decided to cleanup and echo the conversation here, but the original can be found here.


The Problem Statement: it's not "atomic-vs-lock" but instead it's "where do we put atomic/lock?"

Randall R Schulz

  I found the following comment on an Azul Systems blog (http://blogs.azulsystems.com/cliff/2008/05/javaone.html) :

 

Cliff Click wrote:   

    Performance in an STM is not transparent: it requires a runtime to do abort/retry, and as the runtimes get fancy & complex the behavior under load of STM's gets.... weird and unpredictable.  Not something you can put into a production environment.   

  Any comments or counter-examples to this?

Raoul Duke

Rich Hickey, richhic...@gmail.com

wrote:

  Are explicit locking designs free from such anomalies? I wouldn't think so.

paraphrase: the behaviour under load gets weird and tricky.

well, in some ways maybe that doesn't happen with locks: the question of (depending on which STM system we're looking at) figuring out what caused all the rollbacks vs. with locks, you can at least generally quickly see that a given lock has 90% of all threads waiting on it kind of thing. whereas you don't know what variable in the venn diagram of overlapping transactions caused the retry?

tho i believe Rich previously said that Clojure would actually have a way of finding that out!
http://groups.google.com/group/clojure/browse_thread/thread/2f73b80fd...

Cliff Click

You got it - under load, performance is unpredictable.

With locks, you can see which locks are loaded, generally figure out why, and plan a strategy (stripe the lock, shorter hold time, switch to the j.u.concurrent datastructures, etc).

Not so (at least not yet) with STM.  I've talked to a few people who've tried STM out in a larger scale than the usual academic papers - and the results are downright disappointing.  Unless you become an expert in the STM runtime you're using (where "expert" means "not just grok it, but can build & tweak it") anytime performance is not good enough - you get stuck.  This is an open research problem at best, with no clear indication that the problem can be solved at all.

Rich Hickey

One could as generically argue that systems with manual locking are usually broken, and therefore their behavior itself, never mind their performance, is unpredictable.  That's been my experience with manual locking in the hands of mere mortal developers.  It can be as difficult to understand the behavior of any single manually-concurrent program as to understand the performance characteristics of an STM.  In the latter case, at least the STM is handling correctness for you, and all users of the same STM can share knowledge (and bug fixes and performance improvements), vs each application being its own Gordian knot.  And in any case in which the STM is failing you performance-wise, you can opt out and use manual locks outside of transactions.  To the extent you can use it, STM can drastically reduce the complexity of a system.

I'm wary of unifying the discussion of concurrency with one about performance, as the problems of concurrency start with correctness, where the manual locking story, due to its complexity, is quite bad.  Scalability is not a universal problem - some systems need to handle thousands of simultaneous connections - most don't, some need to leverage hundreds of CPUs in a single application - most don't.  But every application that wants to leverage multiple cores needs to be correct.

It would be no problem to instrument the Clojure STM references with fault counters that would allow one to know the exact per-reference contention level. Once known, the answers in both cases are similar - share less mutable state, don't have both long and short-lived transactions contend for the same data, check your success conditions ASAP etc.

STM designs differ from one another quite a bit, so any general statements are difficult to assess.  I think the level of granularity of the STM comes into play.  Most of the STM papers I've read concentrate on creating lots of transactional cells with which they envision constructing concurrent data structures like the ones in java.util.concurrent.  I don't think that's a good idea at all.

Clojure encourages the use of immutable persistent data structures, which need no locking or coordination whatsoever in order to perform correctly in a concurrent situation, and STM references to support the appearance of identity-preserving change.  It also encourages the use of java.util.concurrent and the like for caches and work queues - the best tools for the job.

As far as I know, Clojure is the first STM that uses snapshot MVCC, avoiding both read logging and transaction restarts due to read invalidations.  Clojure's STM also gives you the ability to 'ensure' a reference you later intend to read or write, thus providing some manual control over resource acquisition order (rather than just relying on the side effect of access).  It also supports explicit commute, further reducing retries for commutative writes.  I'll not claim its performance characteristics are at all proven, but its contention due to, and for, reading is lower than in programs that use manual locking, and lower than in STMs in which write transactions can cause read transactions to retry.

Once you get to record-level STM granularity, like Clojure's, it's also a bit easier to draw parallels to the performance characteristics of the database systems it mimics, which have been doing similar things for decades.

I don't consider STM performance any more or less of a research problem than determining the correctness of programs that do manual locking - there's still work to do.

And of course, neither STM, nor any other mechanism, is a panacea.

Brett Morgan

Given that I am a mere developer, actually with MT i'd consider myself a rank newbie, the most important thing to me is visibility.  I want to be able to see what problems I am creating with my design, so that I can change my design, and see how the problems change.  In effect, I am looking for an environment that teaches me what I am doing wrong, if only by showing me which references are hotly contended.

In keeping with that, it would probably be helpful for both refs and transactions to be namable, such that debugging output from the STM runtime can actually be easily inspected.  The reason I say this is that the data I need to work over is non uniform, so I need to be able to label the hot reference to know which particular chunks of my data set are causing issues.

thoughts?

Cliff Click

The problem with manual locking - and it applies equally well to transactions - is that the there's no name-space-control over what needs to be locked/transacted.  The failure people have with locks (and limited experience with transactions) is that they can't properly name their shared variables.  They think they can, until we start looking hard, and then we keep finding more... and more that need to be locked/transacted atomically.  In short, people don't know where to put the lock/unlock or where to bound the transactional region.

Having said that, locks have some advantages over transactions: transparent performance being one of them.  Programs are busted if the locks are wrong (or transaction regions don't cover all the variables they need to; Clojure isn't immune here) AND they are busted if performance sucks.  For 2-4 way machines the issue is probably moot; for 8-32 way machines (standard servers these days) - it matters.  In a year every el-cheapo linux blade server out there will be looking at 16-64 cpus.  And of course for Azul, 100+ active cpus is the norm.  I claim that in a year, most people will either care (get bit by lack of scaling), or see that their need to care within another year.

Requiring shared variables to have a special syntax, ala Ref, is a big step forward here for both locks & transactions.  You're reducing the complexity of the problem via name-space control (only Ref's are shared) and promoting immutable shared state.  The STM issue is orthogonal.

As evidence of that, suppose you replace your STM implementation with a trivial single global lock.  Are the programs more or less complex than their locking counterparts?  The program is as-correct and as-complex as before (before you swap the STM implementation), and indeed is exactly equal to the locking solution using a single global lock.  Suppose I can get a correct program using STM - then I also have a correct program using a single global lock.  STM only varies from locking as I try to make the implementation more performant: for the STM I trust the language implementor to "stripe" the locks under the  hood.  For locks, I start using more than one lock.  In both cases I can also try to shrink the lock-hold-time (or reduce the size of the STM region), or switch algorithms.

Summary: STM doesn't solve the Big Problem (lack of name-space control).  Ref's might solve the Big Problem - and solving the Big Problem will be a big help to both locks & STM.  Locks have transparent performance and are thus easy to optimize.  Optimizing without good name-space control is a buggy process for both locks & STM.  I don't know how to optimize an STM implementation without becoming an expert in the particulars of the STM runtime (and there are so many to choose from!)

Rich Hickey

I agree that identifying the things that can change is one key issue, and Clojure's Refs, Agents, and Vars all do that.  But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure's) STM works.

Refs are not merely markers, they do enforcement.  There is simply no mutating a ref outside of a transaction - it's not an allowed mistake.  Any attempt to change a ref outside a transaction throws an exception.  So, there is no such thing as "transaction regions don't cover all the variables they need to" in Clojure.

Your thought exercise with a single lock is a good one.  Let's say Clojure's implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.

Which is more correct?  The Clojure one, because it is not subject to undetected 'forgetting to obtain the lock'.  Your assertion that "Suppose I can get a correct program using STM - then I also have a correct program using a single global lock" implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness.  All Clojure STM programs (that don't throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don't throw exceptions) are correct STM (or otherwise) programs.

Which one is has better performance? Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.

As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock acquisition order and the resultant deadlock risk.  At this point STM becomes dramatically less complex than manual locking.

I'll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy.  But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.

Analogies to databases are telling.  DBMS's took over locking long ago and most people were happy to have them do so.  Ditto memory management and GC.  In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.

So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement.  It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I'm sure people are working on that.  STMs can provide correctness enforcement today, and therefore the two are not equivalent.  Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance.

Cliff Click

On May 24, 10:51 am, Rich Hickey

wrote:

I agree that identifying the things that can change is one key issue, and Clojure's Refs, Agents, and Vars all do that.  But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure's) STM works. Refs are not merely markers, they do enforcement.  There is simply no mutating a ref outside of a transaction - it's not an allowed mistake.  Any attempt to change a ref outside a transaction throws an exception.  So, there is no such thing as "transaction regions don't cover all the variables they need to" in Clojure.

Alas - there still is.   :-(
Example down below, and this is my primary complaint against both locks & STM.  However, your other arguments are more quickly answered - mostly because I agree with you.

Your thought exercise with a single lock is a good one.  Let's say Clojure's implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.

Which is more correct? The Clojure one, because it is not subject to undetected 'forgetting to obtain the lock'. Your assertion that "Suppose I can get a correct program using STM - then I also have a correct program using a single global lock" implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness.  All Clojure STM programs (that don't throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don't throw exceptions) are correct STM (or otherwise) programs.

Good one.  I agree.  Clojure wins this round.

Which one is has better performance?  Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.

Not really relevant; nobody writes the 'single global lock' version, including Clojure.  It was just a thought exercise.

As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock acquisition order and the resultant deadlock risk.  At this point STM becomes dramatically less complex than manual locking.

Not interesting in practice.  Because in practice, deadlocks are easy to find & fix.  You get a crash dump, crawl the stacks, discover the lock cycle & reorganize.  Sure the situation could be better, but deadlocks are a 'crash once' kind of bug.  Dev's don't like 'em, but they don't lose sleep over them either.

I'll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy.  But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.

Analogies to databases are telling.  DBMS's took over locking long ago and most people were happy to have them do so.  Ditto memory management and GC.  In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.

Actually, I was thinking of the compiler-vs-ASM analogy that took place in the 60's & 70's.  But point well taken: I agree that we need some kind of support here.  Maybe in the very long run STM runtimes get better & STM does win.  Meanwhile life sucks.  If you can only solve the problem with STM's I'll point out below then I'll agree with you, and maybe even try my hand at an STM implementation.

So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement. It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I'm sure people are working on that. STMs can provide correctness enforcement today, and therefore the two are not equivalent. Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance.

Ok, the long sought after counter example: STM's do not compose w/correctness. Bad Java syntax follows, because I don't 'speak' Clojure.  I'm using the classic example.

    class Account {
      long _money;
      add( long m ) { atomic { _money = _money + m; } }
    }
   
    transfer( Account a1, Account a2, long m ) {
      a1.add( m);
      a2.add(-m);
    }

While add alone is STM'd & safe, the two calls to add that need to be wrapped in an atomic are not - so there is a race where the accounts do not balance.  Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

Less trivial examples quickly spiral out of control.  Sure I know Ref's A, B & C are all Ref's and thus only updated in dosync blocks - but can I split any collection of Ref updates into more than one atomic region?  The ad-absurdum answer is to wrap a dosync around all Ref's - which means the whole program, and that means a single-threaded program.  Imagine the program unrolled in front of you, as an endless sequence of actions (all control flow flattened out)... interspersed with other actions are updates to Ref's.  Now: dissect this list into groups with atomic or lock as you see fit, preserving the program meaning as another unrolled thread of execution is interspersed.  How do you decide where it's OK to split the stream and where it's not OK? Why, in the Grand Scheme of things is:

  "... { read _money ... write money } ... " 
a correct splitting of the action stream, while
  "... { read _money ... write _money }  ... { read _money ... write _money } ..." 
is broken, but
  "... {{ read _money ... write _money } ... { read _money ... write _money }} ..." 
is correct again?

A Brief Digression into the Usefulness of Deadlock Avoidance

Phil Jordan

I'm new to STM, I've stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking.  I especially don't know what goes on under the hood in Clojure's STM implementation, or how powerful the JVM is in terms of memory guarantees.

I'm chipping in out of interest and to improve my understanding, apologies if I sound like an idiot:

Cliff Click wrote:

  On May 24, 10:51 am, Rich Hickey

wrote:   

    As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock  acquisition order and the resultant deadlock risk. At this point STM becomes dramatically less complex than manual locking.   

  Not interesting *in practice*.  Because in practice, deadlocks are easy to find & fix.

From personal experience, this is far from the truth in complex systems.  Deadlocks happening only on "in the wild" systems, appearing in the form of heisenbugs, etc.  Not fun at all.  There's too much in the way of implicit contracts going on, which blows up in your face if you're trying to extend undocumented software that was written by someone who left the company before you arrived.  (and we all know the "well then document it" approach just doesn't happen in practice)

Maybe it's just the implementations I've used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful.

You get a crash dump, crawl the stacks, discover the lock cycle & reorganize. Sure the situation could be better, but deadlocks are a 'crash once' kind of bug.

You need a reasonable amount of infrastructure in place for that, plus you're relying on absence of evidence rather than proof that it can't deadlock.

Dev's don't like 'em, but they don't lose sleep over them either.

The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. :)

  Bad Java syntax follows, because I don't 'speak' Clojure.  I'm using the classic example.

    class Account {
      long _money;
      add( long m ) { atomic { _money = _money + m; } }
    }
   
    transfer( Account a1, Account a2, long m ) {
      a1.add( m);
      a2.add(-m);
    }

While add alone is STM'd & safe, the two calls to add that need to be wrapped in an atomic are not - so there is a race where the accounts do not balance.  Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic. The way Clojure handles this (I've been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a ref, and any attempt at modifying it will fail unless you're in a transaction.  Wrapping the add around the transaction would be the anti-pattern here, you want to make the transfer a transaction.

Less trivial examples quickly spiral out of control.  ...

Okay, you kind of lost me with what you're trying to say here.

I get the impression you're mixing up atomic operations on the low level with a high-level STM implementation like Clojure's.  The latter presumably won't be efficient unless the implementation uses the former, but my understanding is that the programmer using an STM implementation is largely supposed to be isolated from such details, as long as he/she is able to determine what operations should be wrapped in a single transaction.

Cliff Click

I'm new to STM, I've stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking.  I especially don't know what goes on under the hood in Clojure's STM implementation, or how powerful the JVM is in terms of memory guarantees.

I'm chipping in out of interest and to improve my understanding,

Let's see if I can do you some justice...

  From personal experience, this is far from the truth in complex systems.  Deadlocks happening only on "in the wild" systems, appearing in the form of heisenbugs, etc.  Not fun at all.  There's too much in the way of implicit contracts going on, which blows up in your face if you're trying to extend undocumented software that was written by someone who left the company before you arrived.

Yup.  So the deadlock happened.  Ouch.

  (and we all know the "well then document it" approach just doesn't happen in practice)

Yup, but You could make a difference.  HotSpot probably has the highest level of 'asserts' per line of code (and a pretty darned high level of comments per line of code) of anything Out There.  Docs are cheaper than debugging.  But it's an aside.... back to deadlocks-in-practice:

  Maybe it's just the implementations I've used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful.   

    You get a crash dump, crawl the stacks, discover the lock cycle & reorganize.   

  You need a reasonable amount of infrastructure in place for that,

Crash dump?  Core file?  'ctrl-backslash' to a JVM to get a thread-dump?
This stuff is commonly available.

If you don't have a proper tool chain, then Job One for you should be to go get one - and I'm very serious here.  Drop everything else until you get a functional 'path' (protocol, implementation, business process, etc) that allows you do debug a crash from a customer in the field.  You might need a willing beta-partner for this, hand him a broken program & let it crash, figure out he needs disk-space to capture the core & logs, needs training to hit 'ctrl-backslash' when suspecting a deadlock, etc - plus FTP the files back to Home Base, plus you need to be able to capture the files per-customer-per-bug (ala Bugzilla), then decrypt the file (gdb for core, eyeballs or a little perl for stack-trace logs), etc, etc, etc, ....  No Rocket Science, but annoying bits of Engineering along with some Customer Management.

  plus you're relying on absence of evidence rather than proof that it can't deadlock.

Err.... No.

I'm not shooting for perfection here; I'm shooting for "good enough to ship".  Which means that if - In Practice - each possible deadlock happens once, then gets fixed - that's almost always Good Enough.  Maybe not to run a nuclear reactor, but definitely good enough to run large businesses.  In practice, most possible deadlocks never happen - but a few that do require several go-'rounds before getting fixed properly.  And then the deadlock is fixed, and remains fixed.

 

    Dev's don't like 'em, but they don't lose sleep over them either.   

  The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. :)

Yup... and those folks are generally stuck anyways.  But if they are thinking about the problem ahead of time they are going to be way way ahead in the long run.

  My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic.  The way Clojure handles this (I've been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a Ref, and any attempt at modifying it will fail unless you're in a transaction.  Wrapping the add around the transaction would be the anti-pattern here, you want to make the transfer a transaction.   

...   

Okay, you kind of lost me with what you're trying to say here.


I Restate My Argument

Cliff Click

Trivial examples are.... trivial.  They can be fixed in a thousand obvious ways.  We need to extrapolate to the non-trivial, because that's the only place where the STM-vs-Locks argument becomes interesting.  So lets pretend that instead of a single Ref _money and two classes, I've got 500 Ref's and a million lines of code - ala HotSpot (Sun's JVM).  Easily >500 shared concurrent variables, about 750KLOC.  About ~100 unique locks (some are classes of striped locks) guarding them in very complex locking patterns.  Now replace all those locks with an STM & atomic.  Is my program any more correct?  Not really.... ...I might have avoided some potential deadlocks (HotSpot uses lock ranking asserts to avoid deadlock; deadlock rarely happens at the engineers desk and maybe once/year in the field across all HS users).  The set of deadlocks-in-the-field avoided was miniscule.  I'll concede that HotSpot is a rarely-well-engineered piece of concurrent code, and that deadlocks-in-the-field appear to happen more often to other large programs.  But still, fixing after the fact is reasonable when the deadlock rate is so low and each fix 'sticks'.

Instead of deadlock, HS crashes far far more often because the locks don't cover the right set of changes.  Switching out the locks for an STM didn't change what I was guarding; it only removed any fine-grained-lock errors (admittedly useful... but only incrementally more so than avoiding deadlocks).  I'm still stuck with a program that's too Big to see where the proper atomic/STM/locking boundaries need to be.  In a trivial example I can say "go up one call level and atomic there", but in the Real Program - I can't do that.  Go up how many layers and add atomic?  1 layer?  10 layers?  100 layers?  Yes, I see unique call-stacks with >100 layers.  I can't put atomic around main because that makes my program single-threaded.

Here's where I want some kind of compiler support.  Ref helps - because it at least demands I have an atomic around the Ref.  But Ref is insufficient, because a syntactally correct program simply wraps an atomic around each Ref - exact what my trivial example did.  I'd like to be able to specify classes & groupings of Refs that have to be wrapped in atomic - that the compiler will complain about.  Yes I'm still responsible for the semantics - I have to tell the compiler which groupings of Refs are interesting - but I'd like some kind of automatic support, so that as my program goes from 10 lines to 10MLOC I can be told "you touched both Ref Person._checking._money and Ref Person._savings._money without wrapping both Ref accesses in a single atomic, thats a datarace error".

Rich Hickey

On May 25, 11:04 am, cliffc

wrote:

Not interesting *in practice*.  Because in practice, deadlocks are easy to find & fix.

I think you'll find plenty of dissent on that point.

Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

That's a problem with the Account class and OOP - classes are simply the wrong granularity for so many program semantics.  Yet, programmers demonstrate they can handle this, and set appropriate transaction boundaries, because they do so in their database work.  There is a pretty close analogy there, since many DBMSs will treat every statement in a batch as an independent transaction unless collected in an explicit transaction.  In spite of the same risks, work precisely like this is getting done correctly, and relatively easily, I think in no small part due to the simplicity of wrapping a bunch of arbitrary statements in BEGIN TRANSACTION.

Less trivial examples quickly spiral out of control....

What you are asking here is a philosophical question, which computers may never answer, since the answer depends on the semantics of the program, and such semantics will probably never be conveyed to the system more easily than:

  (defn transfer [a1 a2 m]
    (dosync
      (add a1 m)
      (sub a2 m)))
or the Java-esque:
  transfer( Account a1, Account a2, long m ) {
     atomic{
       a1.add( m);
       a2.add(-m);
     }
   }

Absent some similar declaration of relatedness, the program will have to understand itself (or one program another), since while it may look like debit/credit to us, it just looks like foo/bar to it.

I hope you don't wait for an answer to this (hard AI, potentially impossible) problem before dipping your toes in STM - it's likely you could make a significant contribution.

Rich Hickey

Towards that end, and more generally, it would be nice if there were metrics for success other than anecdotes from the field.  There should be some meaningful problem statement STM and other solutions could take aim at, something more real-world than the typical STM paper's correctness test or Erlang's 'pass messages around in a circle' and 'accept connections until you die' benchmarks, and more attainable for a new technology than the million-line systems Dr. Click and Brian Goetz mentioned in their final Java One talk, or Erlang's nine-9s phone switches.  Then everyone could run proposed solutions on their own 2/16/multi-hundred-processor systems and say - nope, not quite, this works, this doesn't.

Unfortunately, describing such a problem in an architecture-neutral way is difficult.  I think there can be a certain presumption that any concurrency solution should allow people to architect systems much as they do now, as a graph of directly-connected mutable stateful objects.  Dropping STM (and many of the other concurrency mechanisms) into such an architecture is likely to forever disappoint.  IMO, if you're not serious about immutability, you're likely to struggle with concurrency.

On my side, points were taken re: transparency and control for STMs.  The Clojure STM's current architecture is certainly capable of both better reporting and control.  So I'll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become 'knobs' for performance tuning - timeout windows, retry counts, history cache sizes etc.  So, I have some work to do before I'd like that kind of scrutiny :)

Cliff Click

On my side, points were taken re: transparency and control for STMs.  The Clojure STM's current architecture is certainly capable of both better reporting and control.  So I'll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become 'knobs' for performance tuning - timeout windows, retry counts, history cache sizes etc.  So, I have some work to do before I'd like that kind of scrutiny :)

I'm going to be a sourpuss again here.   :-(
This is exactly the trap MPI fell into; and you have to do it anyways.   Double-unsmiley.   :-(   :-( 

Here's the deal:

     

  • I write a Large Complex Program, one that Really Needs STM to get it right.   
  • But performance sucks.   
  • So I do a deep dive into the STM runtime, and discover it has warts.   
  • So I hack my code to work around the warts in the STM.   
  • Crap like: at an average of 5 cpus in this atomic block the STM 'works', but at an average of 7 cpus in the same atomic block I get a continous fail/retry rate that's so bad I might as well not bother.  So I guard the STM area with a "5-at-a-time" lock and a queue of threads waiting to enter.  Bleah (been there; done that - for a DB not an STM but same-in-priniciple situation).  A thousand thousand variations of the same crap happens, each requiring a different hack to my code to make it performant.   
  • Meanwhile the STM author (You: Rich) hacks out some warts & hands me a beta version.   
  • I hack my code to match the STM's new behavior, and discover some new warts.

Back & Forth we go - and suddenly: my app's "performance correctness" is intimately tied to the exact STM implementation.  Any change to the STM's behavior kills my performance - and you, Rich, have learned a lot about the building of a robust good STM.  You (now) know the mistakes you made and know it's time to restart the STM implementation from scratch.

My options limited:

     

  1. Scream at you to Not Change A Thing.  Thus C/C++/Clojure Standards Committees are born.  1/2 smiley :-P   
  2. Cling tenaciously to the abandoned version, and recognize my code is now abandon'd ware.  No new features or support for me.  But maybe the App is running fine and I can forget about it.   
  3. Rewrite my app from scratch Yet Again, to match the new STM's warts/features.

MPI is in this position.  Every large parallel scientific App Out There is intimately tied to the marriage of parallelizing_compiler + MPI_runtime + computer_archicture.  Changing any one of those pieces requires the app to be rewritten from scratch in order to achieve a fraction of the required performance.

The parallelizing compilers have stablized in a performance way.  There's a coding style which will auto-vectorize/auto-parallelize/auto-stripe/etc and there's some codes "on the edge" (some compiler+hardware combos yes, some no), and there's a well known "don't go here, the compiler can't hack it".  The compiler support for STM is very much lacking and/or in-flux.  Your heading into this terrain now, as you add Clojure compiler support to your STM.  Me, as an STM user, can't know what's in store for me as you go through the learning process.  I know that I don't know what STM coding styles will be performant and what will not.

The MPI_runtime has now stablized (I believe, not sure) after 20+ years.  Again there's the "this is good, this is marginal, this is bad" folk wisdom that drives coding.  Again, for STM's, this area is very much in flux.  Go read some of the bazillion academic papers Out There.  Everybody's got their own take, every STM is good in this domain & bad in some other domain - and all the domains are different; god forbid I write a large program dependent on STM 'A' and later try to switch over to STM 'B'.

The computer_arch has been stable for message-passing for some time.  For STM, I believe it's trying to ramp-up.  i.e., the computer_arch for STM is "all the world's an X86" right now, and all hardware vendors are furiously studying what it would take to add some kind of STM/HTM/hybrid support.

Thus, for STM to make it to the 'Promised Land' - the STM industry needs to figure out:

     

  • what belongs in the compiler   
  • what belongs in the runtime   
  • where there are Dragons and where there are Green Pastures   
  • teach the STM users which paths lead to Dragons or Green Pastures.

If we BOTH don't go through the excercise, then STM will never hit the promised land.

MPI never really "made it"; the 'Green Pasture' area was too small, and it was always surrounded by steep performance cliffs leading to Dragons when you slipped off the edge.  Upgrade your hardware: rewrite your app.  Upgrade your compiler: 2 months of performance debugging. Upgrade your MPI: 2 months of performance debugging to discover you need to rewrite your app.

GC "made it"; the Green Pasture gradually got bigger & bigger; Dragons remain lurking - but only after you filled up an ever-growing Green Pasture and needed to poke at the edges of stability.


We Agree to Press On, Bringing Light into the Darkness

Raoul Duke

it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)

Rich Hickey

it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)

That could be fun.

Cliff Click

 

    it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)   

  That could be fun.

Send me an email; Azul hands out 'academic' accounts all the time.  You get a remote login; you upload your code & run it.

And, if you don't mind, I'm going to edit this entire thread for clarity and echo it on my blog. This is good stuff (this conversation), and I definitely wish you the best of luck.

Cliff

Category: Web/Tech | | TrackBack (0)

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451bd7669e200e5528a4d288833

Listed below are links to weblogs that reference Clojure: STMs vs Locks:

 

Comments

>>My real bugs arise in 2 competing places:
>>- failure to wrap 'atomic' around a large enough region thus allowing a data race
>>- wrapping 'atomic' around too large a region thus crushing performance to the point of uselessness
>
>But are these really "bugs"? The compiler obviously can't magically figure out what variables need to be updated as one unit, and which variables require that you don't. ... This is information that you *have* to specify somewhere, and it maps *directly* to atomically clauses. Put "atomically" around stuff you need to happen automically - but *don't* put it around everything that's transactional, imagine that the function was called "commit" if you like.

Right! Like Michelangelo carving David, I'll just remove from the bare slab every bit of stone that doesn't look like a David and I'll be left with a Work of Art..... not!

My simple examples aren't getting through to you...should I try again?

I have 2 Ref's, called A & B. Sometimes I touch only A, sometimes only B, sometimes both. I have to tell the compiler when an 'atomic' region includes just an A, or just a B or both. This situation si Fine by me.

Now my program has grown to a million lines of code, and *for each new call path leading to an A and a B* I need to decide whether or not they should be atomic/transactional - or not. Of course, all A's & B's are already in a transactional wrapper - so the program is syntactically correct - even as I add new call paths and new interleavings of A & B. And now, the compiler does not complain at all - each time I add a new possible interleaving I get no message from the compiler that the situation needs inspection yet again. Eventually I don't notice when a new interleaving is possible and a datarace arises when A & B aren't updated atomically.

and also my shared variables now run from A to Z, and AA to ZZZZ....

Cliff

Posted by: Cliff Click | May 29, 2008 4:50:48 PM

 

> The work you've done by the time you update the mod counter won't be lost unless someone else commits a change to the mod counter between the time you read it and the time you write to it. The point is that with enough threads the variable that has the main contention (mod counter) will be read/updated very rapidly over and over again by different threads,

Quick: what's the maximum number of threads that can take a Reader/Writer lock for Read-Only - when there's essentially never a Writer? Hint: it's a much much smaller number than the number of CPUs in an Azul box.

Cache-contention for the control of a single cache-line will crush your performance.

We exactly have this problem in real customer situations *today*.

Cliff

Posted by: Cliff Click | May 29, 2008 4:54:07 PM

 

And of course shared memory may lead to contention. That's obviously going to be the case regardless of what strategy you use.
If you put "atomic" around each thread in WebLogic's thread pools, then that *must* be because each thread does a single atomic step, right (because why would you write a transaction for something which isn't a transaction?)? And if so, it will run no slower than if you use locks to achieve the same thing (barring any constant factors).

STM isn't a silver bullet, it won't magically remove unavoidable things. You can't just put "atomic" around the whole thread and expect that to magically figure out what part should *really* be atomic and ignore your instructions for parts that can be committed independently. If you say that something is one indivislbe step, then that's what you get.

Posted by: Sebastian | May 29, 2008 4:55:49 PM

 

Yeah!!! We made progress!!!

You finally admit: adding 'atomic' naively won't fix things....

Will you also agree that adding 'atomic' at too fine-grained a level also won't fix things?

Adding 'atomic' at the "Right Granularity" is a Really Hard Problem. What I want is support for solving that Really Hard Problem. No I don't expect the compiler to infer my semantics.
I just want tool support to help me, like the compiler does with the type system preventing me from passing an Int when a pointer is needed, or GC helps me not to free a variable with live references.

Where's the equivalent for Ref's?

Cliff

Posted by: Cliff Click | May 29, 2008 5:02:54 PM

 

I don't see how that revised example poses any problem, but then you seem to use references and transactions interchangably in it.

It's dead simple really. IF you need A and B to change atomically, and you want isolation etc., THEN you wrap it in atomically. This is the kind of information that you *need* to supply in any scenario, there is no way to infer this or anything.

If you follow this rule, you will never see any data races. If you see a data race, that must mean that you did not write what you meant. The complexity of the rest of the program does not matter at all, in each case you make a local decision about what your program should actually do - does it read A, or B, or both? As long as you know what you want your program to do, you won't have any race conditions.

Also, since transactions compose, you probably wouldn't make this decision abou A, B.. ZZZ etc. in detail each time, rather you would build larger and larger transactions that handle complicated operations on the shared memory, and you only need to decide which of these transactions need to happen indivisibly and which do not.

And again, this information *needs* to get specified somehow, and I don't see how you could do it in any easier way then just saying it directly ("A and B at once", "only A" etc.), this to me translates the "english" description of how the shared memory should be handled, directly into code, with a no complexity at all.

Posted by: Sebastian | May 29, 2008 5:06:57 PM

 

I have never claimed that the compiler is a magical tool that will correctly guess what you mean. You get what you tell it to give you.

"atomic" means, "do this atomically". That's what you write, and that's what you get.
If you have two things that need to happen at once, and you don't put them inside the same "atomic" then you haven't expressed that those two things need to happen at once. The compiler will never be able to guess this for you, you'll always need to specify these things *somewhow*. I don't see how the compiler could *ever* help you with that, as it's simply a matter of writing something that you don't mean.

The type system does sanity checking, it doesn't write your code for you. STM does sanity checking too (e.g. no I/O in transactions), but it can't write your program for you either.

You will always need to tell the compiler somehow, about which variables need to be updated at once, and which don't. This is information that is specific to the semantics of your program, and the compiler can't help you. And no, I don't think this is "really difficult", since it follows *directly* from what you want the program to do. There is no "wrong" level, just like there is no "wrong" number of passes through a for loop. If you need it to be 10 then write that. The compiler won't tell you that you probably want 11. Likewise, if you need two transactions to happen indivisibly, then that's what you write. The code maps directly to what you want it do do. You can still make logic errors, but the compiler rarely helps you with those in other areas (GC, type checking etc.).

Posted by: Sebastian | May 29, 2008 5:15:26 PM

 

There are two independent aspects of the Haskell STM system involved here. The first is the linguistic mechanism (the STM monad) used to prevent bugs in the program. The second aspect is the data-structures and algorithms used to implement atomicity and transactional isolation, e.g. locks, transaction logs, lock free data-structures, etc.

The linguistic mechanism won't prevent you for making your data structures serialize on a single piece of memory. But it will prevent some programmer from commenting out a line that takes a lock somewhere in your >500KLOC system because "it ran faster and the tests still passed on their machine".

Posted by: Josh Dybnis | May 29, 2008 5:22:58 PM

 

@Cliff

I realize the Hashtable example is not a real one, as such heavily-contended global resources are probably better served by the data structures you've been working on, but STM can bring several things to the general data-overlap problems, none a panacea. First, a multiversion concurrency control STM with snapshot isolation, like Clojure's, can take readers almost completely out of the contention loop. Read-only transactions always see consistent data and don't get rerun due to the activity of writers. And the read portion of a writing transaction does not participate in the transaction's data sharing 'footprint' (and yes, Clojure has an ensure construct for avoiding write-skew). You can achieve similar things by hand with a lot of work, meticulous attention to detail and discipline, but MVCC does make reading consistent data much easier to get right than with manual locks.

Second, having a general-purpose alternative to read-modify-write for commutative operations like mod counters and unconditional associative adds can change transactions from being fully serialized (i.e. one transaction wins and all the others fully rerun), to being commit-serialized, i.e. if all the overlapping data can be commuted, commute allows all transactions to succeed the first time, serializing just the commit phase and not all the work. Clojure's STM supports commute.

Third, if you are using Clojure's persistent data structures in your Refs, as advised, and you only need object-level consistency in a particular bit of logic, you can safely read completely out-of-transaction with Clojure's STM. Immutability is a beautiful thing.

As far as the declarative way to state your sharing/consistency requirements, I think dosync (or any nestable moral equivalent) is it. At some point a program is going to have to declare -> this constitutes an indivisible unit of work mixed up. You slap your forehead and fix it. But it's not like these are your meaningless A and B Refs whose relationships are difficult to discern, in practice they are more like transfer/withdraw and not very mysterious. It's not a fiddly, arbitrary thing. Wrapping too large a region is unlikely in practice, *when* you are using an STM like Clojure's that allows for sub-transactions. Haskell's policy of stay-in-the-STM monad as described above encourages too-large transactions.

Using an STM is substantially easier than manual locking, and while *your* program may have the same number of concurrency bugs, most users will have far fewer with STM. The important thing is, no performance or locking granularity considerations can change the atomicity specification - because it's a logical, not an operational/performance thing. So if you are having performance troubles, it has to do with your data sharing and mutation and nothing else. Share less, perform better. Change less, perform better. Get a better STM :)

So, because 'right granularity' is a logical thing, it is in the same category as all other "have I successfully captured in this program the specification of what it is supposed to do" - automating it is as hard as automating the writing of programs. All you are going to get are tools for annotating atomicity and enforcing your annotations (STMs do this today), but not making them in the first place, just like the compiler does not tell you what types your fields should be, or when to call new.

As far as annotations go, if not dosync, what do you imagine being able to say? Can you say similar things with databases and, if not, why not?

Posted by: Rich Hickey | May 29, 2008 7:29:02 PM

 

Middle 2 paragraphs got gnarled:

As far as the declarative way to state your sharing/consistency requirements, I think dosync (or any nestable moral equivalent) is it. At some point a program is going to have to declare -> this constitutes an indivisible unit of work. This is a business logic thing, and has nothing to do with locking granularity, although it has implications for (dictates) same. Nor is it an inherent property of the data, e.g. the business logic says doing transfer means doing both withdraw and deposit, atomically. What withdraw and deposit do, and what data they require, is encapsulated, as it should be, and subject to change as the business rules governing their definitions change. Note how the specification of atomicity can and often does occur in a function that touches no Refs at all!

Using dosync only when the rules/logic require 'together or not at all' keeps the granularity as fine as possible, and no locking or performance considerations can make it finer. Now, if you run into performance problems, changing what transfer means is not an option. The only option is to reduce the data footprint overlap of the constituent operations. If the business logic defines shuffle-transfer to be two transfers done atomically, again, dosync lets you express the intent (compose transfers) and no locking considerations can subdivide that - it's what the program has been defined to do. So, dosync is sufficient for declaring what constitutes a logical indivisible operation. Can you make a mistake and not include a needed sub-operation - yeah, just like you can get mixed up. You slap your forehead and fix it. But it's not like these are your meaningless A and B Refs whose relationships are difficult to discern, in practice they are more like transfer/withdraw and not very mysterious. It's not a fiddly, arbitrary thing. Wrapping too large a region is unlikely in practice, *when* you are using an STM like Clojure's that allows for sub-transactions. Haskell's policy of stay-in-the-STM monad as described above encourages too-large transactions.

Posted by: Rich Hickey | May 29, 2008 7:34:11 PM

 

The account example, while trivial, has a solution that extends quite well to less trivial problems.

Just make sure you model the right thing. What you want is atomic transmission of money, basically only allow atomic handovers. So model the money, not the receivers. Now by virtue of modeling the correct semantics, you no longer have to worry about the scope of the STM compositions and such.

Like such:

class Money {
private Ref balance = 0;

// only the bank can create money :)
private Money(int amount) { balance = amount; }

public Money() { }
public atomic transferTo(Money other, int amount) {
balance -= amount;
other.amount += amount;
}
protected void finalize() {
// real life should have this ;)
atomic { assert(balance == 0); }
}
}

class Account {
private Money balance = new Money();
public void add(Money source, int amount) {
source.transferTo(balance, amount);
}
}

// cannot be implemented like proposed any longer
// unless we use Money, and not just numbers, this has guarantees like
// no money ever gets lost.
void transfer(Account a1, Account a2, int amount) {
Money m = new Money();
a1.add(m, -amount);
a2.add(m, amount);
}


So what did we gain? No money is ever transferred without being atomic. Of course, while transferring money from account to account as proposed and implemented now, there is still no atomic transmission. However by virtue of the guaranteed atomic transfers, and by the semantics that no money gets lost, the transfer always succeeds. And if not, we get noted of that fact. However, the better transfer method is below.

(As a side note, it got pointed out that writing the OO version of Account is not suited, I would like to say that is simply not true. Transfer should have been written in the Account class, which would also have solved most of the original problems.)


class Account {
...
public void transferTo(Account other, int amount) {
balance.transferTo(other, amount);
}
}


Also I feel greatly for the sentiment that garbage collection was once where STM is now. I for one am quite curious if STM can overcome the non transparent performance problems.

Posted by: Onne Gorter | May 30, 2008 6:18:45 AM

 

Thanks Onne -

Immutability - great; love it

Ref's or otherwise naming the shared variables - great; love it. At least you get an error if you fail to wrap some kind of 'atomic' around your shared variables.

Switching over to 'Money' from 'Account' as the basic modeling element - another great idea, if very domain specific. As you say, Money is what is supposed to be atomic; you've modeled The Right Thing here and the placement of 'dosync' becomes obvious.

STM's will save me from my Large Java Programs? No (and you never said it would) - I've not got the ref/immutability thing going for me .

STM will save me from my Large Clojure/Haskell Programs? I remain extremely skeptical. Quoting: "You will always need to tell the compiler somehow, about which variables need to be updated at once, and which don't."

Large numbers of people who are working on large programs CANT DO THIS RIGHT NOW.

Caveat Emptor's:
- I obviously don't mean small programs, ala Money-Transfer.
- I do mean large numbers of people; a million(?) programmers must be working with the main App Servers, or various things that mimic App Servers to some degree. Or so my ad-hoc sampling tells me.
- Nearly all large concurrent programs are Java programs. Stems from the fact that there are a large number of Java programmers furiously writing concurrent programs.
- Java programs don't have Ref; immutability isn't enforced.

So maybe, just maybe, large numbers of programmers CAN tell what variables to make 'atomic' in large Clojure/Haskell programs - and if so, then STM will win the day (and research $$$ will pour in to make it efficient).

If somebody comes up with an alternative non-STM way of solving this Big Program problem, it will win instead. I'm not voting against STM here, just pointing out that it'll be the first technique that solves the Big Program problem that will win the day.


Cliff

Posted by: Cliff Click | May 30, 2008 8:11:30 AM

 

I've just been referred to this page and, whereas I agree with most of
what Rich wrote, I want to contribute with some comments.

Given that the discussion is voluminous already, I'll try to summarize
in one single post my comments, without interspersing them with the
various bits of the original posters. I hope that it makes it easier
to read.

I've developed an STM myself: the JVSTM (which, unfortunately, has a
very poor page at http://www.esw.inesc-id.pt/~jcachopo/jvstm/). The
JVSTM is a pure Java library implementation of an MVCC-based STM. So,
as far as I could tell from what I've read here, it has many
similarities with Clojure's STM (I wasn't aware of Clojure before
this, so my understanding may be wrong). It seems that Clojure's Refs
correspond to JVSTM's vboxes, and that Clojure's dosync corresponds to
the JVSTM's @Atomic method annotation. Other than this, both seem to
share a set of interesting properties. Most prominently, the fact
that read-only transactions never fail. That, I believe, goes a long
way to solve some of the problems that Cliff complains about. But
I'll get to that a bit later.

First, I would like to point out that, as others have said, STMs are
not a drop-in replacement for locks. It's not just a matter of
replacing all the lock/unlock pairs on your program by begin/commit
pairs. For once, because given the optimistic nature of most STMs you
should be prepared for repeated execution of an atomic block
(something that may not happen with a pessimistic approach).

This thread started with a quote of Cliff saying that the performance
of STMs may become unpredictable under some conditions. Even though
that is true in general, I don't see it as a problem exclusive of
STMs. Actually, I don't see it as being any different of an approach
that uses locks.

As a matter of fact, I strongly believe that the use of STMs have the
potential for more aggressive optimizations than the use of locks. I
like to compare it to JIT compilers. At least in theory, a JIT
compiler may perform better in the long run than an ahead-of-time
compiler because it may explore knowledge about the execution of the
system that the ahead-of-time compiler simply doesn't have.

Likewise, an STM runtime may explore the runtime information after the
execution of a conflicting transaction to ensure that it works
uncontended in the next try. There are many ways in how this can be
done, but the general idea is that, with a high probability, a second
run of the same transaction will perform similarly to the first
conflicting one.

On a side note, I have some anecdotal evidence for this. In the JVSTM
I have what I call "speculative read-only transactions". The idea is
that I speculatively assume that a transaction is read-only and
execute it in some optimized way, bailing out when a write is
attempted, in which case I restart the transaction as a full-blown
transaction. This is specially useful when it is hard to know
before-hand whether a transaction is read-only or not. But still, in
most of the benchmarks that I've tried where I know which operations
are read-only and read-write, it turns out that it is best to always
use speculative read-only transactions in the read-write case also.
In fact, in many cases, a read-write transaction may not write at all.
And in the cases where it does, the second run of the transaction
finds most of the data that it needs in the cache, so the second run
is almost free.

Lock-based approaches, on the other hand, tend to be conservative (if
they want to be correct), and therefore may lock in excess and become
unable to explore the actual parallelism of the future computation.
That's an interesting research topic, I guess ;)


Moving on to Cliff's argument about the difficulty of knowing when to
acquire the locks/putting an atomic block. Again, as others have
argued, this is unsolvable by tools. It is a semantic problem.
It is easy to see this in the example of the "transfer" operation. I
argue that whether to make that operation atomic or not depends on the
application domain and that, therefore, both options are possible.
So, there is no way that a compiler can help us here, as both programs
(with and without an atomic block around the transfer) are valid.

Yet, I strongly believe that STMs are much better in this regard than
locks. Because, whereas locks force us to reason globally (most of
the times), STMs allow us to reason locally, considering the
particular semantics of each operation. This is deeply related to the
composability of transactions, that lock-based approaches simply don't
have. In the transfer example, note that, even if the transfer
operation is not atomic (because it does not need to in some
particular application), we may still have a higher-layer operation
calling transfer that should perform the transfer (and eventually
other operations) atomically. And we can do this by making the
higher-layer operation atomic, without the knowledge or consent of the
transfer operation.

So, answering to Cliff's question on at which layer should we put the
atomics, we should put them in each and every layer that they make
sense. With STMs it is ok to have a higher-layer atomic operation
calling a lower-layer atomic operation.

Related to this, Cliff asks for some way to "declare groups of refs".
Isn't what methods/visibility of class fields (in Java) are meant for?
Are you saying that you wanted an orthogonal way of doing it? I don't
see a good example for that. Can you give some concrete example?

Regarding the hashtable/mod-counter example, high contention variables
are a problem, indeed. But, again, they are so regardless of
atomics/locks. I've proposed a solution to some of these problems
that actually helps solving the hashtable problem mentioned by Cliff
and that is akin (as far as I could tell) to the commute solution
proposed by Rich. I call them per-transaction-boxes and they allow me
to defer the execution of some code that changes a high-contention
variable to the end of a transaction where it will be executed in
mutual exclusion, thereby avoiding a conflict with other transactions.
I've performed some tests that showed some results but I did not have
a machine with a significant number of CPUs to validate it further
(hint ;).

Also, Cliff mentions some large scale applications that tried STMs
with bad results. I would like to know more about the exact problems
and I volunteer to help test the use of the JVSTM on such systems, if
there is interest. The JVSTM is dead simple to use and it is used
daily on a production system with close to a million lines of code
since September 2005.

Actually, this system is a good counter example for Cliff's argument
that making the main method atomic would make the system
single-threaded. This system is a web application where each request
is processed atomically. Therefore, we have very large transactions
(some access more than 100 million boxes). But, given that only 1% of
the transactions write, most of the transactions run without any kind
of interference because of the multi-version concurrency control.
Even though we may have an unusual read/write ratio, I believe that in
many systems the numbers would not be significantly different. Also,
having more writes doesn't mean that there is contention. In fact, in
our system, we have also approximately 1% conflicts (that is, for each
100 write transactions only one needs to be restarted, and, almost
always, it succeeds on the second try).


Finally, regarding cache contention, one interesting characteristic of
the JVSTM is that it does not need any kind of memory synchronization
during the execution of a transaction. There is only a memory barrier
on transaction beginning and then the next synchronization point
occurs only at commit time and only for write transactions. Read
transactions can proceed at full speed with local copies of data from
their CPU caches (at least theoretically, it all depends on the
machine architecture, of course).

I apologize for the length of this post...

Posted by: João Cachopo | May 30, 2008 8:23:51 AM

 

Wow, that's a big post - and on this lousy commenting format also. :-)

Lots of good stuff here, some news to me:

- a large functional STM system (but relying heavily on a very high read-to-write ratio: all transactions read millions of words but only 1% do any writes at all? This is a read-to-write ratio of 10^8!!!!). I wonder how many programs fit this bill?

- JVSTM (& Haskell?) allows "commuting variables", or reversable operations. This technique might be generally useful inside of STMs. The Galois system leans heavily on this idea; Maruice Herlhy has a technique for adding XTN ability to any data structure using reversable ops.

- Running the XTN 1st time as an optimistic read-only sounds like a good idea, especially if you do any kind of tracking on the success-vs-failure rate. You still need some kind of reader/writer lock I assume, which in turn implies some minimum size for a high-thru-put XTN.

I've seen lots & lots of STM implementations, some more like MVCC and some less like MVCC. STMs that are optimized for reads suffer on transactions with a larger write-count; they tend to force endless retries and live-lock. Plenty of academic papers mention this.

Which is another way of saying: a fancier STM runtime with different techniques for different XTN's might give more reliable performance.

Which is another way of saying: more work on STM implementations can still make useful performance progress.

And with that, I *must* get back to work... this blog has been fun but I can't spend any more time here today!

Cliff

Posted by: Cliff Click | May 30, 2008 8:51:39 AM

 

Haskell doesn't come with Clojure's "commute" operation "out of the box" but its very simple to build it on top of the STM monad, as are other interesting behaviors. Modeling various memory effects monadically over STM gives a great "playground" for new ideas, including seamless integration of STM with database transactions, etc.

Absolutely, by the way, I think that a genuinely performance-optimized (rather than half-researchy) STM implementation would have some sort of adaptive algorithm, perhaps by-transaction, or perhaps simply generally, for determining strategies for handling individual and compositions of transactions.

Another issue here, I think, is that for certain problems a manually-tuned concurrent or parallel structure is always going to be the winner, just as GC high-level-languages don't mean we don't drop into C where we need it. STM implementations make much more sense precisely for things like appservers, where the extremely parallel bit is already in place, but simple reasoning-about-correctness-for-the-masses is what's missing (and also, where the execution path is much more variable because there are huge surprises about what the pattern and nature of requests thrown at us will be).

Posted by: sclv | May 30, 2008 12:09:14 PM

 

@Cliff
> Wow, that's a big post

I'm sorry. In my defense, I have to say that it was only one post...

This time it is a quickie though, just to clarify some things.

> - a large functional STM system (but relying heavily on a very high
> read-to-write ratio: all transactions read millions of words but
> only 1% do any writes at all? This is a read-to-write ratio of
> 10^8!!!!). I wonder how many programs fit this bill?

Not all transactions read millions of words. Some do, but most don't.

In fact, the average number of boxes read per read-only transaction is
around 5.000 (whereas the average number of boxes read per write
transaction is 50.000, and the average number of boxes written per
write transaction is 35).

So, given that we have 100 times more read-only transactions than
write transactions, the read/write ratio is much lower than 10^8!

I'm talking about a web application here where the shared data
corresponds to the application's domain entities. Actually, I believe
that the read/write ratio that we have is not at all uncommon for web
applications.

Shameless plug: More detailed info about this transactional workload
can be found on Chapter 7 of my PhD thesis, which describes the JVSTM
in much detail in Chapter 4
(https://dspace.ist.utl.pt/bitstream/2295/132008/2/cachopo-phd.pdf).


> - JVSTM (& Haskell?) allows "commuting variables",

I was talking about Clojure (that I know nothing except what I've read
here), rather than the Haskell STM. As far as I know, the Haskell STM
has no such thing.


> Maruice Herlhy has a technique for adding XTN ability to any data
> structure using reversable ops.

Yes, transactional boosting, as he calls it.


> - Running the XTN 1st time as an optimistic read-only sounds like a
> good idea, especially if you do any kind of tracking on the
> success-vs-failure rate. You still need some kind of reader/writer
> lock I assume, which in turn implies some minimum size for a
> high-thru-put XTN.

Huh? Why would I need a reader/writer lock?
I don't have any locks for read-only transactions.


> STMs that are optimized for reads suffer on transactions with a
> larger write-count; they tend to force endless retries and
> live-lock. Plenty of academic papers mention this.

Actually, even though the JVSTM was designed for more read-oriented
workloads, it performs surprisingly well even on heavily
write-oriented workloads on the benchmarks that I've tried thus far.
Not only on the micro-benchmarks of DSTM2, but also on STMBench7.

Posted by: João Cachopo | May 30, 2008 2:33:37 PM

 

@sclv

> Another issue here, I think, is that for certain problems a
> manually-tuned concurrent or parallel structure is always going to
> be the winner, just as GC high-level-languages don't mean we don't
> drop into C where we need it.

I seriously doubt it, actually.

The way I see it, dynamic concurrency control as it may be performed
by an STM (or some other mechanism) runtime that takes into account
the actual execution environment, such as the number of available
processors and the actual ratio of read/write transactions, will be
very hard to beat by hand-crafted locking strategies in the long run.
Unless, of course, we are talking about hand-crafting a particular
lock-strategy for a single execution environment.

Not that it is not possible, theoretically, to implement a lock
strategy that takes into account the number of processors available at
runtime, but I find it very unlikely to be done reliably. But I may
be wrong here. I'm making an educated guess (at most), here.

Posted by: João Cachopo | May 30, 2008 2:46:41 PM

 

> > - Running the XTN 1st time as an optimistic read-only sounds like a
> > good idea, especially if you do any kind of tracking on the
> > success-vs-failure rate. You still need some kind of reader/writer
> > lock I assume, which in turn implies some minimum size for a
> > high-thru-put XTN.
>
> Huh? Why would I need a reader/writer lock?
> I don't have any locks for read-only transactions.

Ok, I'll bite: how do you handle multi-word updates from other transactions? How do your unlocked readers NOT end up seeing partially completed XTN's?

BTW - shameless plugs on your PhD are perfectly fine. You work for years to get one; it comes with some bragging rights.

(looks like I need to read it, as well as bone up on my Haskell)

Cliff

Posted by: Cliff Click | May 31, 2008 8:39:24 AM

 

> Ok, I'll bite: how do you handle multi-word updates from other
> transactions? How do your unlocked readers NOT end up seeing
> partially completed XTN's?

Because I use multi-version concurrency control. Readers always see
the state as it was when the transaction started, regardless of how
many writes occurred since then.

Readers may happen to see the newer writes, but if they do, they skip
over those values to get to the values that correspond to the version
that they want. To accomplish this I use a global "clock" that
advances when a write transaction commits. This clock is used to tag
values as they are written.

As a matter of fact, the access to this global clock is the only
synchronization point for readers. This clock is stored on a volatile
variable and writers write to this variable at the very end of the
commit operation, after all the values changed were written back. The
readers read this variable before anything else, to ensure a proper
"happens-before" relation (as defined in the Java Memory Model)
between the reader and all the previously committed writers. This
corresponds to the memory-barrier that I mentioned before.

Because of this, none of the remaining shared memory needs to be
volatile, and that's why I mentioned, also, that the JVSTM has good
properties regarding cache contention.

I apologize in advance if this description is too confusing...

Posted by: João Cachopo | May 31, 2008 1:37:53 PM

 

Is it possible that we employ Uniqueness Typing in Clojure instead of STM or locks?
I have seen a successful implementation of Uniqueness Typing in language Clean. (http://clean.cs.ru.nl/)
It is simple to grasp and use; a simple definition would be:

In computing, a unique type guarantees that an object is used in a single-threaded way, with at most a single reference to it. If a value has a unique type it can be modified in place improving the efficiency of functional languages while maintaining referential transparency. Unique types can also be used to integrate functional and imperative programming languages.

more on:
http://en.wikipedia.org/wiki/Uniqueness_type
http://clean.cs.ru.nl/

Posted by: Kaveh Shahbazian | Jun 1, 2008 11:56:14 AM

 

Talking about atomicity and transactions:

Best Paper PPoP 2008

Split Hardware Transactions
http://research.sun.com/scalable/pubs/PPoPP2008-SpHT.pdf

Slides
http://research.ihost.com/ppopp08/presentations/SpHT.PPoPP.ppt

Abstract
Transactional Memory (TM) is on its way to becoming the programming API of choice for writing correct, concurrent, and scalableprograms. Hardware TM (HTM) implementations are expected tob e significantly faster than pure software TM (STM); however, full hardware support for true closed and open nested transactions is unlikely to be practical.
This paper presents a novel mechanism, the split hardware
transaction (SpHT), that uses minimal software support to combine multiple segments of an atomic block, each executed using a separate hardware transaction, into one atomic operation. The idea of segmenting transactions can be used for many purposes, including nesting, local retry, orElse, and user-level thread scheduling; in this paper we focus on how it allows linear closed and open nesting of transactions. SpHT overcomes the limited expressive power of best-effort HTM while imposing overheads dramatically lower than STM and preserving useful guarantees such as strong atomicity
provided by the underlying HTM.

Posted by: kimo | Jun 2, 2008 3:01:41 AM

 

Some comments gathered over the last few days:

- Doug Lea writes to give me moral support, and point out this gem of a paper:

http://domino.research.ibm.com/comm/research_people.nsf/pages/mvaziri.pubs.html

I had forgotten about it, but it looks really good on the surface. People had been asking "what would I like" in large-program concurrency support. This paper is a good start.

- Unique Types - I think I've mentioned them before; they look like the "obvious" solution to making a msg-passing-only communication run fast, by converting a 'copy-on-pass-message' solution into a 'only-one-user-so-just-move-pointer' solution.

- split hardware xtn POPL paper - doesn't address my main concerns with large scale concurrent programming. It's "Yet Another XTN Implementation". Looks good? Sorta.... but to my jaundiced eye it looks no better (and no worse) than loads of papers that have come before it.

Cliff

Posted by: Cliff Click | Jun 2, 2008 8:19:32 AM

 

> Readers may happen to see the newer writes, but if they do, they skip over those values to get to the values that correspond to the version that they want.

So each shared read is really TWO reads, one of the value and one of the local clock. These reads *must* be strongly ordered, lest you see a partial update of a in-progress-commit. Hence each read is a "load / fence / load / cmp / br" sequence.


> To accomplish this I use a global "clock" that advances when a write transaction commits.

Also, each commit has one write to a globally shared clock (puts a limit on XTN commit rate but that's likely not an issue for large XTNs).


> This clock is used to tag values as they are written.

Also each write-back is 2 writes; one of the value and one of the clock. Also, these writes must be strongly ordered as well - "st / fence / st". X86+Sparc will do the store-ordering for you, but not IA64, Alpha, nor Azul.

Do I have it right? (haven't read your thesis yet, but I did download it). What's the performance hit? The usual slowdown for a pure STM doing this is on the order of 5x or so. Using Clojure REF's you can greatly limit the box'd readers (JIT doesn't have to assume ALL reads are sharable, just REFs) so this might work great for Clojure.


> As a matter of fact, the access to this global clock is the only synchronization point for readers.

Doesn't jive with how I interpreted what you wrote above... each shared write & each shared read is a sync point (or at least, requires cache lines to flip back & forth putting an upper bound on the reader+commit rate).


> This clock is stored on a volatile variable and writers write to this variable at the very end of the commit operation, after all the values changed were written back. The readers read this variable before anything else, to ensure a proper "happens-before" relation (as defined in the Java Memory Model) between the reader and all the previously committed writers. This corresponds to the memory-barrier that I mentioned before.

This approach seems to *me* to cause *any write from any XTN anywhere* to abort *all readers everywhere* - which doesn't seem to jive what you wrote before (readers not impacted by unrelated writers).


Cliff

Posted by: Cliff Click | Jun 2, 2008 8:33:37 AM

 

I apologize again for a long post, but I couldn't explain this with
less words...

> So each shared read is really TWO reads, one of the value and one of
> the local clock.

Actually, with my current implementation as a pure Java library, it is
more than that, because shared data is stored within standard java
objects, which I call vboxes.

So, I have this VBox generic class with two methods - "T get()" and
"void put(T val)" - that implement the read/write operations.

So, the read of a shared memory location (that is, a VBox),
corresponds to a method call that goes through a ThreadLocal variable
to access the current transaction and ask the transaction (through
another method call) for the value to the box. The transaction will
then look into the box for finding the correct value. This is the
simple case, when the xtn is read-only, because on the general case,
there is more work needed to record the read-set (the JVSTM ensures
strict serializability, rather than snapshot isolation).

So, as you see, it is much more than two reads. But this may be done
more efficiently, of course, if supported at the JVM level (I have a
PhD student starting his work on this).

Nevertheless:

> These reads *must* be strongly ordered, lest you see a partial
> update of a in-progress-commit. Hence each read is a "load / fence
> / load / cmp / br" sequence.

I'm not a specialist on computer architecture, and I'm not sure that
I'm following you here, but if I'm not mistaken, no, there is no need
for a fence at each shared location read. The only fence needed (for
a read-only xtn, mind you) is at xtn begin.

A key insight here is that transactions always read values as they
were at the time the transaction began (actually, at the time the
fence is used).

VBoxes are nothing more than a pointer to a triple (version, value,
next), creating a linked list of tagged values.

So, consider the case of two shared locations, L1 and L2, with current
values V1 and V2, when a transaction read-only T1 starts. Also,
assume that other transactions T2, ..., Tn change L1 and L2, writing
new values to these locations and committing them before T1 reads any
or both of the values. When T1 needs to read any of the locations,
say L1, it fetches the triple (version, value, next) from L1 and
compares "version" with T1's clock. If the version fetched matches
the T1 desired version, then T1 uses the "value", otherwise, it has to
go for the next triple (version, value, next) of L1 (remember that I
have multiple versions for each location).

So, for this to work, we must have the following: a write transaction
must create first the new triple, and only then it may change the vbox
pointer to the new triple. Also, either the processor ensures that
the same storing order is seen from other processors, or we must use a
memory barrier between the creation of the triple and the vbox pointer
change. In this case, I argue that the read-only transaction does not
need any kind of fence. If I'm making a blatant mistake here, I'll
surely appreciate that you correct me, please.

> Also, each commit has one write to a globally shared clock (puts a
> limit on XTN commit rate but that's likely not an issue for large
> XTNs).

Yep.


> Also each write-back is 2 writes; one of the value and one of the
> clock. Also, these writes must be strongly ordered as well - "st /
> fence / st". X86+Sparc will do the store-ordering for you, but not
> IA64, Alpha, nor Azul.

I don't know the details of each processor, but as I explained above,
I need that the storing order of the write-back be preserved.
But this does not affect the performance of reads, right?


> Do I have it right?

Mostly, yes. Except (I hope) for the reads ;)

> What's the performance hit? The usual slowdown for a pure STM doing
> this is on the order of 5x or so.

One of the difficulties on answering this question is in finding good
(realistic) benchmarks. As you know, most of the work on STMs
typically uses micro-benchmarks.

On top of this, I don't have access to a machine with a significant
number of cores, so my tests are somewhat limited.

Nevertheless, on the STMBench7 benchmark, without long-traversals, the
JVSTM is about 1.5x slower than medium-grained locking for a single
thread, but catches up at 4 threads (this is on a machine with 4
cores), and I would expect it to improve better with more cores.
This is for a read-dominated workload (90% reads, 10% writes).

For a write-dominated workload, the JVSTM is between 1.5 and 1.6 times
slower, from 1 to 4 threads.

With read-only long-traversals enabled in the workload, the JVSTM goes
from 1.3x slower (for 1 thread) to 2x faster (for 4 threads) than the
medium-grained locking approach.


> Using Clojure REF's you can greatly limit the box'd readers (JIT
> doesn't have to assume ALL reads are sharable, just REFs) so this
> might work great for Clojure.

That is similar to the use of vboxes, I guess.


I wrote:
>> As a matter of fact, the access to this global clock is the only
>> synchronization point for readers.

And Cliff:
> Doesn't jive with how I interpreted what you wrote above... each
> shared write & each shared read is a sync point (or at least,
> requires cache lines to flip back & forth putting an upper bound on
> the reader+commit rate).

Does it make more sense now?


Also, after I wrote:
>> This clock is stored on a volatile variable and writers write to
>> this variable at the very end of the commit operation, after all
>> the values changed were written back. The readers read this
>> variable before anything else, to ensure a proper "happens-before"
>> relation (as defined in the Java Memory Model) between the reader
>> and all the previously committed writers. This corresponds to the
>> memory-barrier that I mentioned before.


Cliff adds:
> This approach seems to *me* to cause *any write from any XTN
> anywhere* to abort *all readers everywhere* - which doesn't seem to
> jive what you wrote before (readers not impacted by unrelated
> writers).

No, readers never abort because they will always be able to see a
consistent view of the data (even if in the past).

Again, the key insight here is that read-only transactions are
serialized at the time that they begin, whereas write transactions are
serialized at the time that they commit.

Best regards, and thank you for taking the time to think about this.

Posted by: João Cachopo | Jun 2, 2008 1:56:39 PM

 

Ok, I think I know what's going on now.

You in fact do 2 loads & a compare per read (plus other reads related to not having the JVM directly support your model). There's a load of the Value and a load of the Version.

Those 2 loads MUST BE ORDERED, or your implementation is busted. Same for the writer-side of things; there are 2 writes (more actually, but only 2 that count) - a write of the Value and a write of the Version. Again, those 2 stores MUST BE ORDERED or you are busted. Watch my JavaOne DebuggingDataRaces talk on how missing fences allow memory reorderings which will break you here.

If you declared the 'Version' as a volatile variable, then indeed they ARE ordered - and you are not busted. So the fix is easy (and probably already done!).

If you are running on an X86, then the Stores are always ordered (strong memory model on the X86).

The JVM's implementation of volatile variables, on an X86, will insert a memory-fence between the 2 loads (and - since Intel only very recently announced a real supported Hardware Memory Model - the HotSpot JVM has been working on a very conservative assumptions about X86 and so will insert redundant fences between the stores).

The JVM's implementation of volatile variables on other platforms inserts fences as required by that hardware. Azul & IA64 both have fairly weak HMMs (and both scale really really well - there's a direct correlation there!) and have to insert fences between writers as well as readers.


Cliff

Posted by: Cliff Click | Jun 4, 2008 8:02:45 AM

 

Last note: you want to check your scaling behavior?

Azul hands academic accounts. Send me an email with a 1-liner description of what you to check, and I can give you a remote log-in account to a machine with ~380 cores. You have to share time on the box but it's generally not an issue.

Cliff

Posted by: Cliff Click | Jun 4, 2008 8:05:15 AM

 

Post a comment