And now some Hardware Transactional Memory comments...
February 25, 2009

(sorry for the long gap between postings; work's gotten interesting and I got busy)

I recently attended the Bay Area Workshop on Transactional Memory at Stanford generously hosted by HP.  Slides are here; my slides are helpfully titled "2009_TMW.pdf".  In this workshop I gave a brief overview of Azul Systems' Hardware Transactional Memory support plus our experiences in using it.  This was followed shortly by a blog posting from David Dice discussing Sun's HTM support in the Rock processor.

For Azul Systems' certainly (and to a lesser extent Rock), the name of the game is throughput: we appear to be generously over-provisioned with bandwidth, we can sustain 30G/sec allocation on 600G heaps with max pause times on the order of 10's of milliseconds (and unlike IBM's Metronome hard-real-time collector, our MMU past 20msec is 0.99; see MMU definition at bottom).  Each of our 864 cpus can sustain 2 cache-missing memory ops (plus a bunch of prefetches); a busy box will see 2300+ outstanding memory references at any time.  We have a lite microkernel style OS; we can easily handle 100K runnable threads (not just blocked ones).  Our JVM & GC scales easily to the whole box.  In short: the bottleneck is NOT the platform.  We need our users to be able to write scalable concurrent code.

The goal of our HTM has always been to accelerate "dusty deck" Java programs via Lock Elision.  (We've never been tempted to enter into the language wars and implement some form of full-blown transactional programming via an "atomic" keyword). We see a lot of defensive uses of the "synchronization" keyword; legacy libraries using "sync" on all methods or code maintainers sprinkling "sync" about liberally in order to squash some bug.  In practice, 99% of all locks are never contended - and these we run very fast (our CAS & FENCE instructions can hit in cache; an uncontended lock requires only a few clock cycles).  But the locks that are contended prevent parallel execution (Amdahl's Law and all that)- and we observed that much of the time the contention is on the lock itself and not on the data it protects.  So we added in HTM support in our first processors and this support has been shipping and turned on at all costumer sites for over 4 years now.  By now we have a lot of field experience with using HTM.

But first, a brief overview of our HTM support: we use a few extra bits on each L1 cache line to track "speculatively read" and "speculatively written" cache lines.  Each CPU can be put in a "speculate" mode; loads and stores then set these bits accordingly.  If a speculative line is evicted from the L1 for any reason then the transaction "aborts" (there is also an "abort" instruction).  Speculatively-written lines are marked invalid (meaning: they no longer cache any data) and speculatively-read lines have their spec-bit cleared.  Also the CPU branches to a fixed trap vector for software fixup. 

If the CPU makes it to a "commit" instruction then the spec bits are cleared - including the speculative-write bits - which atomically makes all those writes visible to other CPUs. In other words, the XTN commits.

Nothing aborts a transaction except the "abort" op or losing a speculative line out of L1.  We do not abort on function calls/returns, nor TLB misses, nor exceptions nor rain nor snow nor sleet nor dark of night...  Our XTNs are limited therefore by the size and associativity of the 16K 4-way L1 cache (and the shared inclusive L2).  We routinely see successful HTMs of many thousands of instructions with many hundreds of cache-hitting memory ops. (contrast this to what I can figure out about Rock: Rock appears to abort on function call/return, on running out of store-buffer depth (which limits an XTN to 16 stores?), on TLB misses, on L1 associativity, on common nested locking patterns (because of abort on function calls)). 

Software heuristics determine when to try the HTM (uncontended locks are much cheaper using a cache-hitting CAS).  The software measures the success/fail ratio of the HTM on contented locks and determines when to flip to a standard blocking implementation and when to use the HTM. 

While some apps get a nice 2x speedup (e.g. Trade6), most apps are speedup only slightly (we had lots of teething problems with the software heuristic that used to cause 10-20% slowdowns due to endless fail/retry loops, although these all appear to be fixed now).  We nearly always see a handful of locks using the HTM support, but these appear to only rarely be the "right" locks to get more CPUs really busy.  HTM failure appears to nearly always be due to conflict and not capacity.

Issues

In short, users' don't write "TM-friendly" code.  Neither do library writers.  For example the "modcount" field in Hashtable is bumped for every update in the Hashtable.  For a large Hashtable we can expect most mods to be in unrelated portions of the Hashtable.  i.e., without the modcount field update we would expect the HTM to allow parallel 'put' operations.  With the modcount field update, 'put' operations always have a true data conflict - and this aborts the HTM.  After some number of failed 'put' operations we have to start really locking the Hashtable (to allow the puts to make forward progress) and this now single-threads the 'get' operations as well.  This general pattern of updates to peripherial shared values (usually performance counters of one sort or another) is very common and it kills the HTM - because it presents a true data conflict.

Many times a small rewrite to remove the conflict makes the HTM useful.  But this blows the "dusty deck" code - people just want their old code to run faster.  The hard part here is getting customers to accept that a code rewrite is needed.  Once they are over that mental hump, once a code rewrite is "on the table" - then the customers go whole-hog.  Why make the code XTN-friendly when they can make it lock-friendly as well, and have it run fine on all gear (and not just HTM-enabled gear)?  Also locks have well understood performance characteristics, unlike TM's which generally rely on a complex and not-well-understood runtime portion (and indeed all the STMs out there have wildly varying "sweet spots" such that code which performs well on one STM might be really unusably slow on another STM). 

Really what the customers want to know is: "which locks do I need to 'crack' to get performance?".  Once they have that answer they are ready and willing to write fine-grained locking code.  And nearly always the fine-grained locking is a very simple step up in complexity over what they had before.  It's not the case that they need to write some uber-hard-to-maintain code to get performance.  Instead it's the case that they have no clue which locks need to be "cracked" to get a speedup, and once that's pointed out the fixes are generally straightforward. (e.g., replacing sync/HashMap with ConcurrentHashMap, striping a lock, reducing hold times (generally via caching), switching to AtomicXXX::increment, etc)

Summary

  • Very modest gains for HTM; usually <10% (although 2x for some large apps)
  • Rewrites of "dumb" direct contention (e.g. perf counters) would help a lot
  • Azul did this for some common JDK pieces Out There (e.g. Hashtable), but there's just too much of it
  • Customers not willing in general to touch code for HTM
    • If code is being hacked for performance, it will be with fine-grained locking instead of simply making it "HTM friendly". 

Thanks,
Cliff

[MMU is a common measure of GC performance and stands for "minimum mutator utilization".  It shows the worst-case amount of time a worker thread gets for a timeslice of a given length vs time spent in GC.  Thus an MMU graph plots utilization (a number from 0 to 1 where 1 is better) versus timeslice size.  A program using Metronome might see a max GC pause of 1microsecond; it's MMU at 1 microsecond is then zero because the mutator can do no work for that microsecond.  However the MMU at 1 second might be 70% showing that in the long run Metronome's GC exacts a 30% performance "tax" on mutator operations.  Contrast this to Azul's Pauseless GC: at 10msec our MMU is zero but at 1sec our MMU is 0.99; in the long run there is no GC "tax" but in the short run our max pauses are much larger than Metronome's.  Metronome is suitable for e.g. hard-real-time applications like audio-playback, whereas Azul Systems' is suitable for e.g. soft-real-time transactional apps with high throughput.]

Category: Web/Tech | | TrackBack (0)

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451bd7669e201116897927f970c

Listed below are links to weblogs that reference And now some Hardware Transactional Memory comments...:

 

Comments

Can you give a pointer to rewriting dumb performance counters? Thanks.

Posted by: Bob Chops | Feb 25, 2009 2:52:21 PM

 

Use e.g. AtomicInteger::incrementAndGet() OUTSIDE the locked region. This means the perf counter will rarely briefly be out of sync with the true count of calls to what's being tracked, but this should be ok for a perf counter.

E.g. replace

"sync { ctr++; foo(); }"

with

"AtomicInteger::inc(ctr); sync { foo(); }"

Cliff

Posted by: Cliff Click | Feb 25, 2009 3:03:03 PM

 

Do you think that it would be possible to write, say, PMD rules (or any other static analysis tool) to highlight those changes which would help make code more efficient when it comes to locking?

Posted by: Ian Phillips | Feb 25, 2009 11:58:39 PM

 

The hard part for PMD is telling a perf counter (sloppiness OK) from e.g. a unique-id-generator (sloppiness NOT ok).

However PMD could totally spot some common data-race patterns like:

if( ptr != null ) then ptr.field...

Which is buggy for any shared 'ptr' where ptr can go to null. Instead the proper idiom is:

tmp = ptr; if( tmp != null ) then tmp.field...

Similarly is the not-atomic update to a caching structure:

tmp = hash.get(key);
if( tmp == null )
then { hash.put(key,tmp=expensive_gen()); }

Where racing threads can each do unrelated 'puts' and the last thread 'wins' and the first thread is not aware that he 'lost'. This is true even if 'hash' is a synchronized hash table or e.g. ConcurrentHashMap (or NonBlockingHashMap). The issue is that the whole idiom "get/test/put" needs to be atomic - which is what CHM.putIfAbsent does.

If you can't tell, I just debugged Somebody Else's Buggy App today feature *3* broken un-sync'd get/test/put idioms all in OpenSource code. :-(

Cliff

Posted by: Cliff Click | Feb 26, 2009 7:31:12 AM

 

(sorry if i'm being thick, but i just wanted to check that i'm getting what you are getting at) sounds like you are still more on the side of going with classic locking rather than either stm or htm, yes? or is it more/also the case that clients you work with are not interested in doing anything other than classic locking? i do like the shangrila idea that TM helps give us better flexible composability, which i think is the root of all goodness in a software system. wonder how it will really pan out.

Posted by: Raoul Duke | Feb 26, 2009 11:06:55 AM

 

In general when you scale up TM (i.e. larger atomic sections) you run into more data that is read vs. being written in the transaction. This kind of kills STM implementations that do mutating reads and makes it hard for them to scale up since they while not all the reads are critical to the transaction, the STM implementation can't tell which reads are and which reads aren't.

In conventional locking, this kind of stuff was done with read/write locks. E.g. a read lock to lookup a node in a linked list and a write lock to protect the update of the node while it was still in the list protected by the read lock. You don't notice this usage pattern as much in Java because GC covers up a lot of programming sins. A conventional system without GC would die messily if read locks weren't used to protect read access to global shared data.

I think there's a way around this problem but I haven't gotten around to implementing it in my STM yet. I'm waiting for higher core count cpus than the Atom 330 I'm using now.

Azul's GC sounds interesting since my STM is about 6X slower than conventional solutions for worst case, all mutating writes, (read only access is wait-free) and as far as I can tell it's all GC overhead on my system.

Posted by: Joe Seigh | Feb 26, 2009 5:49:03 PM

 

> i do like the shangrila idea that TM helps give us better flexible composability, which i think is the root of all goodness in a software system

I do to, but I don't think 'atomic' will storm the Java world until a lot more is understood on how to make it perform well and there are a lot of implementations out there to choose from. At the moment, my customers are comfy with well-choosen fine-grained locking solutions.

Cliff

Posted by: Cliff Click | Feb 26, 2009 9:57:55 PM

 

> In general when you scale up TM (i.e. larger atomic sections) you run into more data that is read vs. being written in the transaction. This kind of kills STM implementations that do mutating reads and makes it hard for them to scale up since they while not all the reads are critical to the transaction, the STM implementation can't tell which reads are and which reads aren't.


Thus my standard Clojure plug; Clojure limits the set of transacted variables by probably 100-fold (you have to name shared variables specially; otherwise the variable is either not-shared or not-mutable). This alone might make STM a reasonable solution.

Cliff

Posted by: Cliff Click | Feb 26, 2009 10:01:36 PM

 

> Azul's GC sounds interesting since my STM is about 6X slower than conventional solutions for worst case, all mutating writes, (read only access is wait-free) and as far as I can tell it's all GC overhead on my system.


If you can make a "canned" benchmark, I can run it on Azul & tell you how it goes (in great detail...).


Cliff

Posted by: Cliff Click | Feb 26, 2009 10:03:46 PM

 

It would be at http://sourceforge.net/projects/atomic-ptr-plus/ in the xmem package. It includes a testcase, xmemtest1 which can take a number of threads to run with. The testcase compares an STM queue against other queue implementations. It uses a couple of mbeans to analyse the thread and gc execution times.

It looks like about 10X slower on my desktop dual core.

Posted by: Joe Seigh | Mar 1, 2009 4:11:29 PM

 

Ok, I finally got around to looking at the xmem stuff.
You've got the usual micro-benchmark issues: not nearly long enough runtimes to get good results, no warmup period, 1st-test-fast-2nd-test-slow, no statistical analysis, etc.

For instance, if I just repeat all the tests a 2nd time then most of the non-STM queues get substantially faster. I looked at running all the tests a 3rd time, and times are still improving... I also bumped the loop count by 100-fold to make the tests last long enough to do some profiling.

I got one failure:
Exception in thread "Thread-141" java.lang.NullPointerException
at xmemtest1$1.run(xmemtest1.java:166)
at java.lang.Thread.run(Thread.java:618)
Looks like 'q.poll' returned a null in the harness; this seems legit to me and the harness should handle it.

You've got some hotlocks at this contention level.
Sample stack trace with the lock held:

Detail: low | medium | high

* transact.XmemCollector.enterTransaction (XmemCollector.java:89, bci=3, server compiler)
o locked transact.XmemCollector (0x000000c98c00b960)

COMPILER2 frame
JL0 transact.XmemCollector
JL1 transact.XmemCollector
JL2 0x0000000000000000
JL3 0x0000000000000000

* transact.XmemCollector.atomic (XmemCollector.java:137, bci=25, server compiler)

COMPILER2 frame
JL0 transact.XmemCollector
JL1 XmemQueue$2
JL2 java.lang.Integer
JL3 0x0000000000000000
JL4 0x0000000000000001
JL5 0x0000000000000004
JL6 0x0000000000000000
JL7 0x0000000000000000

* XmemQueue.poll (XmemQueue.java:162, bci=14, server compiler)

COMPILER2 frame
JL0 0x0000000000000000
JL1 0x0000000000000000

* xmemtest1$1.run (xmemtest1.java:165, bci=142, server compiler)

COMPILER2 frame
JL0 xmemtest1$1
JL1 0x0000000000000000
JL3 0x0000000040202de9
JL4 0x000000000000acad
JL5 0x0000000000000000

* java.lang.Thread.run (Thread.java:618, bci=11, interpreter)

INTERPRETER frame
JL0 java.lang.Thread

GC times for Azul are quite reasonable (less than native in all cases) and not the bottleneck. You're doing a paltry 130MB/sec of allocation. For fun I bumped the heap to 20G; total GC time paid by all 20 mutators for all runs is maybe 500ms for 340sec*20threads of runtime.

Cliff

Posted by: Cliff Click | Mar 5, 2009 8:25:00 AM

 

Post a comment