Azul Systems
UNBOUND COMPUTE

Cliff Click Jr.’s Blog's Blog

Category: Web/Tech

And now some Hardware Transactional Memory comments...
February 25, 2009

(sorry for the long gap between postings; work's gotten interesting and I got busy)

I recently attended the Bay Area Workshop on Transactional Memory at Stanford generously hosted by HP.  Slides are here; my slides are helpfully titled "2009_TMW.pdf".  In this workshop I gave a brief overview of Azul Systems' Hardware Transactional Memory support plus our experiences in using it.  This was followed shortly by a blog posting from David Dice discussing Sun's HTM support in the Rock processor.

For Azul Systems' certainly (and to a lesser extent Rock), the name of the game is throughput: we appear to be generously over-provisioned with bandwidth, we can sustain 30G/sec allocation on 600G heaps with max pause times on the order of 10's of milliseconds (and unlike IBM's Metronome hard-real-time collector, our MMU past 20msec is 0.99; see MMU definition at bottom).  Each of our 864 cpus can sustain 2 cache-missing memory ops (plus a bunch of prefetches); a busy box will see 2300+ outstanding memory references at any time.  We have a lite microkernel style OS; we can easily handle 100K runnable threads (not just blocked ones).  Our JVM & GC scales easily to the whole box.  In short: the bottleneck is NOT the platform.  We need our users to be able to write scalable concurrent code.

The goal of our HTM has always been to accelerate "dusty deck" Java programs via Lock Elision.  (We've never been tempted to enter into the language wars and implement some form of full-blown transactional programming via an "atomic" keyword). We see a lot of defensive uses of the "synchronization" keyword; legacy libraries using "sync" on all methods or code maintainers sprinkling "sync" about liberally in order to squash some bug.  In practice, 99% of all locks are never contended - and these we run very fast (our CAS & FENCE instructions can hit in cache; an uncontended lock requires only a few clock cycles).  But the locks that are contended prevent parallel execution (Amdahl's Law and all that)- and we observed that much of the time the contention is on the lock itself and not on the data it protects.  So we added in HTM support in our first processors and this support has been shipping and turned on at all costumer sites for over 4 years now.  By now we have a lot of field experience with using HTM.

But first, a brief overview of our HTM support: we use a few extra bits on each L1 cache line to track "speculatively read" and "speculatively written" cache lines.  Each CPU can be put in a "speculate" mode; loads and stores then set these bits accordingly.  If a speculative line is evicted from the L1 for any reason then the transaction "aborts" (there is also an "abort" instruction).  Speculatively-written lines are marked invalid (meaning: they no longer cache any data) and speculatively-read lines have their spec-bit cleared.  Also the CPU branches to a fixed trap vector for software fixup. 

If the CPU makes it to a "commit" instruction then the spec bits are cleared - including the speculative-write bits - which atomically makes all those writes visible to other CPUs. In other words, the XTN commits.

Nothing aborts a transaction except the "abort" op or losing a speculative line out of L1.  We do not abort on function calls/returns, nor TLB misses, nor exceptions nor rain nor snow nor sleet nor dark of night...  Our XTNs are limited therefore by the size and associativity of the 16K 4-way L1 cache (and the shared inclusive L2).  We routinely see successful HTMs of many thousands of instructions with many hundreds of cache-hitting memory ops. (contrast this to what I can figure out about Rock: Rock appears to abort on function call/return, on running out of store-buffer depth (which limits an XTN to 16 stores?), on TLB misses, on L1 associativity, on common nested locking patterns (because of abort on function calls)). 

Software heuristics determine when to try the HTM (uncontended locks are much cheaper using a cache-hitting CAS).  The software measures the success/fail ratio of the HTM on contented locks and determines when to flip to a standard blocking implementation and when to use the HTM. 

While some apps get a nice 2x speedup (e.g. Trade6), most apps are speedup only slightly (we had lots of teething problems with the software heuristic that used to cause 10-20% slowdowns due to endless fail/retry loops, although these all appear to be fixed now).  We nearly always see a handful of locks using the HTM support, but these appear to only rarely be the "right" locks to get more CPUs really busy.  HTM failure appears to nearly always be due to conflict and not capacity.

Issues

In short, users' don't write "TM-friendly" code.  Neither do library writers.  For example the "modcount" field in Hashtable is bumped for every update in the Hashtable.  For a large Hashtable we can expect most mods to be in unrelated portions of the Hashtable.  i.e., without the modcount field update we would expect the HTM to allow parallel 'put' operations.  With the modcount field update, 'put' operations always have a true data conflict - and this aborts the HTM.  After some number of failed 'put' operations we have to start really locking the Hashtable (to allow the puts to make forward progress) and this now single-threads the 'get' operations as well.  This general pattern of updates to peripherial shared values (usually performance counters of one sort or another) is very common and it kills the HTM - because it presents a true data conflict.

Many times a small rewrite to remove the conflict makes the HTM useful.  But this blows the "dusty deck" code - people just want their old code to run faster.  The hard part here is getting customers to accept that a code rewrite is needed.  Once they are over that mental hump, once a code rewrite is "on the table" - then the customers go whole-hog.  Why make the code XTN-friendly when they can make it lock-friendly as well, and have it run fine on all gear (and not just HTM-enabled gear)?  Also locks have well understood performance characteristics, unlike TM's which generally rely on a complex and not-well-understood runtime portion (and indeed all the STMs out there have wildly varying "sweet spots" such that code which performs well on one STM might be really unusably slow on another STM). 

Really what the customers want to know is: "which locks do I need to 'crack' to get performance?".  Once they have that answer they are ready and willing to write fine-grained locking code.  And nearly always the fine-grained locking is a very simple step up in complexity over what they had before.  It's not the case that they need to write some uber-hard-to-maintain code to get performance.  Instead it's the case that they have no clue which locks need to be "cracked" to get a speedup, and once that's pointed out the fixes are generally straightforward. (e.g., replacing sync/HashMap with ConcurrentHashMap, striping a lock, reducing hold times (generally via caching), switching to AtomicXXX::increment, etc)

Summary

  • Very modest gains for HTM; usually <10% (although 2x for some large apps)
  • Rewrites of "dumb" direct contention (e.g. perf counters) would help a lot
  • Azul did this for some common JDK pieces Out There (e.g. Hashtable), but there's just too much of it
  • Customers not willing in general to touch code for HTM
    • If code is being hacked for performance, it will be with fine-grained locking instead of simply making it "HTM friendly". 

Thanks,
Cliff

[MMU is a common measure of GC performance and stands for "minimum mutator utilization".  It shows the worst-case amount of time a worker thread gets for a timeslice of a given length vs time spent in GC.  Thus an MMU graph plots utilization (a number from 0 to 1 where 1 is better) versus timeslice size.  A program using Metronome might see a max GC pause of 1microsecond; it's MMU at 1 microsecond is then zero because the mutator can do no work for that microsecond.  However the MMU at 1 second might be 70% showing that in the long run Metronome's GC exacts a 30% performance "tax" on mutator operations.  Contrast this to Azul's Pauseless GC: at 10msec our MMU is zero but at 1sec our MMU is 0.99; in the long run there is no GC "tax" but in the short run our max pauses are much larger than Metronome's.  Metronome is suitable for e.g. hard-real-time applications like audio-playback, whereas Azul Systems' is suitable for e.g. soft-real-time transactional apps with high throughput.]

Category: Web/Tech | Permalink | Comments (11) | TrackBack (0)


A Brief Conversation with David Moon
November 18, 2008

I had an email conversation between myself, David Moon & Daniel Weinreb. For the younger readers: David Moon is one of the original architects of the Symbolics machines, which 20 years ago more-or-less attempted the same sorts of things Azul is doing now; i.e. language-specific hardware support. I lightly edited the emails for clarity and ordering (otherwise the nested interleavings get horrendous to follow).
[Ed: 11/20/2008 - Some small follow-on conversation has been appended to the end]


Cliff Click wrote: Mostly I felt inspired by your Symbolics work. Azul Systems makes gear for running Java (but actually it runs a C/C++ program - we just happen to run a large C++ that implements Java). One of the biggest impact changes we made was a hardware read-barrier for GC - a simple instruction that tests invariants of freshly loaded pointers and takes a fast-trap if the test fails. Fixup is entirely in software. It's a RISC-y approach to the GC hardware that Symbolics had 20 years ago.


David A Moon wrote: I remember this now; this was one of the things that impressed me when I read about your machine a couple years ago. As you may or may not know, the Symbolics 3600 had fairly complex GC hardware and the Symbolics Ivory went 90% of the way towards riscifying it, eliminating all the large hardware tables. I think there was still a 16x1 bit (NOT 16Kx1 !) table to indicate which portions of address space are oldspace; with 64 address bits instead of 32 you were able to reduce that to a single bit, which is nice.

Cliff Click wrote: We use very simple 3-address 32-register 64-bit RISC cores, with 54 cores per die. All chips are cache-coherent and uniform (medicore) memory access time for all; the upper bound is 16 chips or 864 cpus. The "big" box is about the size of a dorm room fridge, 14U form-factor. The big box also has top-20 super-computer bandwidth (barely). Modest 16K I- & 16K D-caches for all cpus, plus groups of 9 share a 2-meg L2. Chips are fully interconnected.


David A Moon wrote: 3-address 32-register 64-bit address, 32-bit instruction RISC does seem to be the "sweet spot" for instruction set architectures. [Cliff: Yes!] I hope you avoid needless complexity like condition codes, link registers, maybe even separate integer and floating point registers. [Cliff: Yes, yes & yes] I don't think Java needs floating point rounding and trapping modes either. [Cliff: Yes, it does not] For your application you probably don't even need separate user and supervisor modes since all executable code is generated by your just-in-time compiler from a safe language.

Cliff Click wrote: That's the theory but the practice is a little different: JVM's crash and 1 crashing JVM should not bring down the whole box. So we in fact have a user/kernel split and it's saved our butts any number of times.


David A Moon wrote: Maybe you don't need any kind of mode bits register at all? Being completely modeless would be cool.

Cliff Click wrote: There's also a speculative-memory mode: writes are buffered in the L1 but don't become visible until a 'COMMIT' instruction.


David A Moon wrote: That's a pissload o' cores. Are those 54 real cores per die? Or is it perhaps 6 real cores, each executing 9 interleaved instruction streams, in the style of Denelcor HEP?

Cliff Click wrote: Nope, real cores all the way.


David A Moon wrote: Does the precisely defined Java memory model give you any help in making a more simple implementation of cache coherency than is usually done e.g. for Intel x86 architecture? Or did that memory model come out after your hardware was designed?

Cliff Click wrote: The Java Memory Model came first, and we designed the hardware around it. Actually the hardware implements something somewhat weaker than the JMM by default, and the JIT inserts memory FENCE ops as needed. More like Sparc RMO or Power's memory model.


David A Moon wrote: I assume uniform memory access doesn't mean cached accesses aren't any faster than cache misses, but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network. [Cliff: Yup]

It must be a real challenge to get enough memory bandwidth to keep 864 instruction streams going at full speed, even more so with such tiny caches.


Daniel Weinreb wrote: Moon says: "... but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network." Actually, I heard a talk about a year ago, at the first New England Database Day, suggesting that the way information moves between the L1 caches of the cores is by a sequence of moves, each being up, down, right, or left, in the direction of the destination core.


Cliff Click wrote: No, we're a full direct interconnect on the L2's. 9 L1's share an L2. One step to the L2, then one step to the correct remote L2 - across the box in any direction (including another L2 on the same chip; it was easier to run outside the chip and back in than arrange for a special faster-path for within-chip communication. 15/16ths of your L2 misses are remote and the gain for speeding up the remaining 1/16th of the misses isn't worth it.


Daniel Weinreb wrote: If so, the times would not actually be uniform. Taking advantage of such knowledge in the software would be challenging except for special-purpose domains, I would think. I don't know how the 54 L1 caches (or maybe fewer if you're doing the piplined business that Moon described) are interconnected, but there is a limit to how many can participate in a full crossbar NxN connection. [Cliff: Key word: challenging]


Cliff Click wrote: We have a lot of memory bandwidth - one of our bigger boxes is on the top-20 supercomputer bandwidth list. The chips are fully interconnected.


David A Moon wrote: Meaning every chip has 15 ports directly connected to another chip? [Cliff: Yup] That's a lot of pins. You might want to look at http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect for an interesting slimmer approach.

Cliff Click wrote: I'm sure we looked at sicortex at one point.


Daniel Weinreb wrote: I agree about the challenge of the memory bandwidth for all those cores. Do you have very wide paths between the chips and the main memory?


Cliff Click wrote: There are 4 memory controllers per chip (and up to 16 chips so 64 controllers). All memory is striped across all controllers at a fine grained level so each core's memory requests are spread across the box - mostly eliminating "hotspots".


David A Moon wrote: Also they are a message passing architecture and you have to be a shared memory architecture, but I think that's a separate issue from the interconnect topology. I am no longer sure which I believe in. On the one hand, I am a shared memory guy from way, way back. On the other hand, it is becoming increasingly expensive to pretend that memory is uniform and random-access, and the illusion is becoming increasingly unconvincing. The von Neumann architecture may be reaching the end of its life. I like the way the Cell treats main memory as essentially an I/O device, but I have never attempted to program it.

Cliff Click wrote: By all accounts, programming the Cell is a nightmare. Since we can do shared-memory with so many cores, other people can too. The more difficult issue is parallel programming that uses all those cores. Shared memory is one way to make programming easier.


Daniel Weinreb wrote: What about cache coherency over multiple chips?

These days, as Moon has pointed out to me, the importance of caches means that you gain a lot by not looking at your address space as a sequence of co-equal words, but rather as a sort of block-oriented device (a little like the way one looks at a disk), in which the block size is the size of the cache line in some relevant cache (probably L2 but it depends a lot on the hardware if the hardware is novel). (See the word "illusion" in his mail.)

Does your Java implementation specifically take this into account, e.g. allocating objects aligned on cache line boundaries, implementing collections as B-trees with nodes being an integral number of cache lines, and that sort of thing? If not, you might consider benchmarking it; I'd love to know the results of such a test.


Cliff Click wrote: Our hardware has short cache lines (32b) so there's less to be gained by aligning objects to cache lines. For certain benchmarks (not to mention SpecJBB or anything) such alignment is probably worth alot - but not for most use cases. Perhaps it makes sense to expose cache-line alignment to the Java/JDK programmer?



Cliff Click wrote: Things we do specifically for Java:
  • GC read-barrier enables a fully parallel & concurrent GC; we can sustain 40G/sec allocation on a 400G heap indefinitely, with max-pause times on the order of 10-20msec. This uber-GC is partially made possible because of the read barrier (and partially possible because we 'own' the OS and can play major page-mapping tricks).

David A Moon wrote: That is indeed impressive! Even if I divide 40 GB/sec by 864 it's still fast. Of course you need it, much Java code really likes to generate garbage all over the place. My pet peeve is no way to return multiple values from a method without generating garbage.

Daniel Weinreb wrote: Dave, you say: "My pet peeve is no way to return multiple values from a method without generating garbage." In Rich Hickey's talk about Clojure(Cliff: a new Lisp dialect implemented on the JVM, mainly having been run on the Sun JVM but he ran it on an Azul at least once) addressed this issue. [Cliff: I'm familar with Clojure] He left out multiple value returns exactly on the grounds that the consing cost for a small returned value is so cheap these days, ephemeral GC's being so damned good. I do not know to what extent he has done actual metering, though.


Cliff Click wrote: Yes, annoying. And it's embedded into the JVM bytecodes that way, so you can't even rescue the problem by changing languages while keeping the JVM. The only hope here is for a combo of JIT'ing & smart memory allocation: either inline & optimize away the junk object or do some sort of stack-based allocation for the very-short-lived object.


David A Moon wrote: Can you say anything about what page-mapping tricks are good for?

Actually I am a little surprised you bother with virtual memory at all. If memory gets full surely it is faster to do a garbage collection than to swap pages out to disk (I think you don't even have a disk if I remember correctly). Maybe page mapping is just to avoid core shuffling if there are multiple heaps? Is the page size large to cut down on mapping overhead?


Cliff Click wrote: Ok, a bunch of things here:
  • Page sizes are 1Meg.
  • No swap; swapping is death for GC. If you are out of memory, then you are OutOfMemory. To put it another way: you are going to blow your performance guarantees so badly, the JVM might as well have crashed. Better to make the user allocate enough resources up front to keep the performance guarantees.
  • The page mapping is integral to the GC; we free physical memory earlier in the GC cycle than virtual memory. This lets us recycle physical memory much sooner, and makes it much harder for an application to "surprise" the concurrent GC - i.e. an app whose allocation rate suddenly accelerates might run a different concurrent GC out of physical memory - and force an application stall.
  • The mappings are also used to deny user-mode threads access to a page where the objects are being concurrently relocated, but still allow GC-mode access to that page. GC-mode is a subset of user-mode (so we sorta are like a 3-privilege mode machine), except that user-mode can promote itself to GC-mode anytime it wants.
  • The read-barrier will take a fast-trap on failure, and by default get promoted to GC-mode in the trap handler. This lets the faulting CPU fixup the object reference and continue, without needing to wait for GC to catch up.
  • We also have a write barrier, but it's merely a slight efficiency hack. The read-barrier enables the good GC.

David A Moon wrote: I would expect the write barrier to be more than a slight improvement, unless your memory scanning is incredibly fast. The main thing the write barrier accomplishes is to greatly decrease the number of pages (or cards or other chunks) that have to be scanned for pointers to the youngest generation.

Cliff Click wrote: Ahh.... here's the difference: a 'card-mark write barrier' can be fairly cheaply done with plain-o-instructions, so the write-barrier instruction is a delta above that. The biggest change is our write-barrier op 'filters' card-marks that don't need to happen which in turn cuts down on useless card-mark writes. Writes are generally really cheap - unless there's a ton of contention caused by rapidly updating of an old-gen object holding a pointer into young-gen. In that case the filtering is worth a lot.


David A Moon wrote: Since the GC is concurrent the time required to scan for pointers of interest to the GC shouldn't slow down the application unless the application uses all the cores, except surely you run out of memory bandwidth eventually and then the GC's memory scanning interferes with the application's accesses to memory. So even with all that concurrency and massive processing power and memory bandwidth I would think you would still get a lot of benefit from doing less scanning because of the write barrier. But I haven't measured it and you probably have.

Cliff Click wrote: We're definitely doing a generational collector, so we'd be using a software write-barrier if we didn't have the hardware one handy. Or maybe: "We have a write-barrier. The fast-path is done in hardware. There is a slow-path done in software."


David A Moon wrote: I don't know if this was published, but the write barrier on the Symbolics Ivory was really simple. Each page table word contains an encoding (in 4 bits?) of the age of the youngest generation pointed to by this page; the write barrier just causes a trap when those bits need to be updated. I don't remember if we kept another index or just had the GC scan the page table. Since it's an IBM-style inverted page table the page table is dense even when a lot of pages are swapped out so scanning it is not too slow. It would have worked better if the Ivory page size hadn't been way too small.

Did you ever look at Eliot Moss's "train" garbage collection ideas to do ephemeral-GC-type optimization for long-lived data? If I remember correctly there is a bug: memory consumption can become arbitrarily large in the worst case, but the idea might still be good. I think Microsoft is using a form of it in .NET.


Cliff Click wrote: HotSpot tried out the train collector years ago, and never got it to perform well in practice. There are a bunch of odd-ball issues, (remembered sets with popular objects?) that made it not very performant.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Simple 3-adr RISC is easy to JIT for; it's HotSpot's JIT and makes quite good code - easily comparable to "gcc -O2". [David: Yeah.]
  • Just-In-Time zero'ing of L1 cache for allocation - and we don't issue a memory read for the cache-line getting zap'd to zeros. This is worth about 1/3 less bandwidth than the usual "store a bunch of zeros in a row" implementations.

David A Moon wrote: dcbzinstruction on POWER. Clearly a good idea.

Cliff Click wrote: dcbz doesn't make the grade:
  • it's not a prefetch so you can't issue it far enough in advance without causing the stall you are trying to cover, and
  • any java lock that happens will invoke a fence, which will also stall
  • it does avoid the memory read bandwidth.
Our version gets all of the above right.


David A Moon wrote: Dedicated register for the allocation pointer?

Cliff Click wrote: No. The main issue with allocation is avoiding the mandatory cache-miss as you stream through memory. You can hide tons of dumb int-ops and cache-hitting-loads underneath that cache-miss. So we pay a little bit of integer-cycles to setup our J-I-T-zeroing, and getting the allocation pointer is one of those cycles. In short: it's too small picken's to be worth holding a register for.


Cliff Click wrote: Things we do specifically for Java (continued):
  • GC bits in each pointer (old-gen, young-gen, stack-allocated, normal-C-ptr) [David: I remember that now. Good.]
  • Java type-heirarchy bits in each pointer. This saves a trip to memory to fetch the Java type when making virtual calls.

David A Moon wrote: So you're saying that since 64 bits is way too many, you can use a bunch of those bits to put a rather wide tag in each pointer? Maybe 20 bits wide? Seems like a great idea. No problem with having too many classes defined and running out of bits?

Cliff Click wrote: Yes, wide tag per pointer. No problem (yet) with running out of classes. Big Java Apps these days seem to have about 2^15 classes.


Daniel Weinreb wrote: I am curious about how you do threads. On the Lisp machine, the stack had to be contiguous in virtual memory. So whenever we had to allocate one for a new thread, we had to trade off between wasting VM versus worrying about running out of stack. I don't know if we ever considered making the stack such that it could be a linked list of (big) blocks, but I suppose anything that slows down the function call path is very dicey. Dave, did you ever think about this? It would have been nice to have cheap full coroutines ("stack groups").


Cliff Click wrote: Nope - we're using the fairly classic stack layout for HotSpot. And yes this means you have to decide up front the stack-vs-heap memory split.


Daniel Weinreb wrote: Cliff, have you heard anything about the possibility that the JVM will be extended to be able to do tail-calling? Rich Hickey says that there was recently a "JVM summit" (something like that)of people writing non-Java languages for the JVM, and there was a lot of interest in tail-calling. It would be great for the Lisp world, since Scheme depends on it so much; it would be nice to re-unite the Scheme and Lisp communities as much as possible.


Cliff Click wrote: I'm well aware of the interest in tail-calling optimizations - but it has to compete for scarce engineering resources like everything else.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Really big interconnect & memory bandwidth. Java has lousy locality, so things miss in cache - a lot.

David A Moon wrote: Would that be true even with 4 times larger caches?

Cliff Click wrote: I think so. We did a bunch of studies on cache-sizes, looking at fewer-cores-more-cache vs more-cores-less-cache.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Support for 2 outstanding memory misses per cpu, up to 24 per L2 cache counting prefetches - so 2304 outstanding memory ops across the whole box. [David: "Non-blocking caches are good."]
  • Minor tweaks to help do Java-specific math: array addressing is sign-extend, THEN shift-and-add - which we do in 1 alu op. Also we have a Java range-check op, faster virtual-call support, the IEEE754 subset needed for Java FP, etc. [David: Good.]

David A Moon wrote: Do you have any kind of specialized cache to avoid memory references to find the address of a virtual method to call? I haven't measured it but I get the sense that all those not very predictable indirections through vtbls really hurt C++ speed.

Cliff Click wrote: Yes... specialized hardware, not specialized cache.

We do the standard HotSpot
"inline cache"trick that's done on all CPUs.

The cache is a 1-entry cache, inlined in the code. The Key is the expected class of the object, the Value is the target of the function call encoded as a plain "CALL" instruction. So you do a Key-compare inline then a plain static call. If the Key compare hits the hardware will Do The Right Thing with the static call. In practice, 95% of all call-sites (that can't be optimized by the JIT) never fail the 1-entry cache. The other 5% tend to fail for having lots and lots of targets.

So there IS a "cache", but it's not specialized. It's the normal HotSpot cache. Instead, we have an instruction to do the Key-compare in 1 cycle (and the call is another cycle). So we do cache-hitting v-calls in 2 clks & 2 ops. On a wide-issue O-O-O X86 that same sequence is maybe 4 dependent ops but issues also in 2 or 3 clks - except loading the object header is typically a cache-miss; but the branch predicts right and the O-O-O hardware lets the X86 proceed despite the miss.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Simple hardware transactional memory, used to speculate through Java locks.

David A Moon wrote: I'd be interested in reading more about that.

Cliff Click wrote: We have a white-paper here. http://www.azulsystems.com/products/whitepaper/wp_otc.pdf


Cliff Click wrote: Ok, I'm done boasting. :-)


David A Moon wrote: It's plenty worth boasting about.

--Dave Moon


Cliff Click wrote: Thanks, Cliff


===========================================
11/20/2008 - A little more:

David A Moon wrote: Then I guess your thread scheduler is not aware of the partitioning of the hardware and doesn't try to put related threads onto cores in the same chip.

Cliff Click wrote: The thread scheduler is well aware of the hardware layout. It's no mean feat to schedule 800+ cpus with 1000's of runnable threads.

Instead, nobody has a clue what counts as "related threads". Certainly there's no Java-level hint. You might imagine doing L2-to-L2 miss-rate profiling and try to decide which set of threads are communicating through memory instead of through an L2. It's a non-trivial task for which there are some non-trivial academic papers out there showing modest gains for lots of hard work.


David A Moon wrote: But you said your memory access time is mediocre. That's the first symptom of the uniform memory illusion breaking down. Then you find that no matter how big you make your caches they aren't big enough. Then you find that each new generation has slower memory, as a ratio of memory access time to processor speed. Then you find that programs are too slow unless they are written with detailed knowledge of the hardware memory structure. You probably know way better than me how far we are down that slope today.

Cliff Click wrote: Split out the notions of 'uniform' from 'fast'. Everybody pays attention to caches (or should): the mindset is "memory is slow". But only some people pay attention to memory layout: the mindset is "memory is uniform".

For some parallel scientific programs, with well understood access patterns people can build both the machine and the program such that most accesses are "near" the CPU. In such a case it makes sense to build a NUMA machine: "near" accesses are faster than "far" accesses.

Azul builds are gear as 'UMA' because our programs do not have well understood access patterns. Instead, the patterns are mostly random (after cache filtering) and so it makes sense to have uniform mediocre speeds instead of 1/16th of memory fast and 15/16ths as slow.

In both the NUMA and UMA case everybody has given up on the illusion that memory is fast. Instead we all acknowledge the presence of caches; we prefetch (or expect the hardware to do so); we know that "1st access is slow, 2nd one is free"; we do cache-smart object layout and so on. No illusions about memory there.

Category: Web/Tech | Permalink | Comments (12) | TrackBack (0)


Some Source Forge Threads on NBHM
November 13, 2008

A couple of interesting threads running on high-scale-lib's forums I thought I'd post to a wider audience.

Josh Dybnis (jdybnis) - 2008-11-10 11:46
I wrote an implementation of the non-blocking hash map in C. I put it up at http://code.google.com/p/nbds/


Cliff Click (cliffclickProject Admin) - 2008-11-10 17:59
I never know what to make of such statements -

- if the table has "almost no synchronization" then it's not non-blocking (but it might be highly concurrent). Non-blocking has a specific meaning to the academic community; if you say "non-blocking" you should also be able to say "no synch". If you are based on NBHM then you need to provide e.g. a non-blocking malloc and a way to reclaim storage. Both are things NBHM uses GC for, and are not-trivial to get non-blocking. NBHM does indeed do very little allocation - so the blocking embedded in "malloc" will probably never hurt you.

- No "C" program can ever implement any reasonable concurrent program, because "C" has no memory model (and the proposed model is a major punt and not much better than nothing). Instead you can implement NBHM using "C using gcc_ver_XXX on X86_model_YYY". If you change the "XXX" or the "YYY", then all bets are off, and the program can fail. In short, the "volatile" keyword in C does not have nearly the semantics as it does in Java, and C compilers routinely do things with "volatile" variables that Java does not allow. Changing the C compiler version or the underlying hardware can break things.  

Despite my complaints a C version of NBHM on X86 is surely a very useful and non-trivial thing.
Congrats!
Cliff


Josh Dybnis (jdybnis) - 2008-11-11 01:12
Thanks for the feedback. I really appreciate you sharing the algorithm. It's one of the most elegant pieces of work I've come across in a few years. 

- In theory, I agree 100% with you about writing concurrent programs in C. And people need to be clear about the limitations before attempting such a thing. In practice I think it is reasonable to do "porting" work for each platform/compiler. My implementation is targeted to gcc 4.x on 64-bit x86 (Intel does specify a memory model in their Software Developer's Manual). I think it could be ported to most other architectures without major modifications.

- My implementation includes a non-blocking malloc() and a method to do safe memory reclamation. 

- My statement about "almost no synchronization" was badly stated. I was referring to my malloc implementation and not the NBHM itself. Also I was also using "synchronization" to refer to x86 operations having the "lock" prefix, not any non-blocking property of the program.

- A couple of more caveats about my malloc and the safe memory reclamation. They aren't the most robust implementations. They need a bit of work. But they are non-blocking, and should be very low overhead on systems without too many cores (say less than 32). The malloc does call mmap() from multiple threads, it could block in there. So I guess technically it is not "non-blocking". But that is an artifact of the implementation. In theory one could use an asynchronous method of invoking the kernel to make it truly non-blocking. 

Josh

By: Cliff Click (cliffclickProject Admin) - 2008-11-11 07:41
How do you know when e.g. the NBHM Big Array is free?
Cliff


By: Josh Dybnis (jdybnis) - 2008-11-12 17:06
>How do you know when e.g. the NBHM Big Array is free? 
Missed this earlier. 
After I promote a new array the old one gets queued up for a deferred free. (The same with keys in the old array that are tombstoned and were not copied to the new array.) The frees actually happens at a later point when all the threads are guaranteed not to be holding references to the memory. I accomplish this using a technique from RCU. Each thread that might be holding a reference periodically announces that it is in a quiescent state. I have a way of detecting when every thread has made such an announcement. At that point I can safely free the array. My implementation of all that is not particularly robust. If a thread blocks, and never announces it is in a quiescent state nothing more will ever get freed. This is straightforward to fix with a more complex implementation. I'm having an on-going discussion about it on comp.programming.threads. http://groups.google.com/group/comp.programming.threads/browse_thread/thread/702aedc141b34fa1#

Also, one could also use GC in a C program too. Or one of the safe memory reclamation techniques documented in the academic literature (e.g. hazard pointers, smr, etc.). 


By: Cliff Click (cliffclickProject Admin) - 2008-11-12 17:21
Sounds good. Just was wondering.
Lack-of-GC generally brings about various broken attempts to free memory in concurrent programs.

Cliff

---------------------------

By: Cliff Click (cliffclickProject Admin) - 2008-11-11 13:52
> In theory, I agree 100% with you about writing concurrent programs in C..... I think it could be ported to most other architectures without major modifications. 

Itanium. :-)

Cliff


By: Josh Dybnis (jdybnis) - 2008-11-11 19:47
Yeah Itanium, the Alpha would be even worse. It is your NBHM algorithm though; I don't think there is anything dependent on having a strong memory model. You would just have to put memory barriers in the right places. I'm not saying that it would be easy to figure out the optimal solution, but it wouldn't require any major rework to the code. The lazy way to do it would be to change all the CAS calls to include a full memory fence. Doing it that way would constrict performance of course, but it would be correct, and probably still more scalable than a traditional lock-based hash table.


By: Cliff Click (cliffclickProject Admin) - 2008-11-11 21:25
The NBHM State Machine is "correct" independent of any memory model. I still need fences & a Memory Model around the promotion-logic & table-copy areas. It's here with an IA64 would cause you grief. If you didn't need table-copy (i.e., pre-made fixed-sized table), then it would be correct on any shared-memory machine. 

The NBHM as presented as stronger semantics than a minimalistic non-blocking hash table - Keys are treated with Java volatile semantics, so e.g. you can make & install a fresh Key on 1 thread, and have a 2nd thread see the Key from the table and also see the Key's contents so it can correctly run a Key compare. A naive C IA64 implementation can screw up there, although the X86 strong ordering saves it. Same issue happens with the returned Value (made in Thr1, returned in Thr2, but does Thr2 see the initialized contents of the Value?). 

Here's a sample IA64 table-copy screwup:

  • - T1 copies last K/V from old table to new table.
  • - T1 incrs the last count of copied values. Bug: No ordering between the stores.
  • - T2 sees the last count (read of T1's 2nd store), 
  • - T2 promotes (stores new table as default)
  • - T3 sees the new table (read of T2's store)
  • - T3 probes for K, but does not see T1's store.
  • - T3 reports K as not-in-table, when in fact it is.

Cliff

---------------------------

By: Josh Dybnis (jdybnis) - 2008-11-12 16:36
What happens if a thread copies a slot and then dies before updating _copyDone? Does the copy never get promoted?

By: Cliff Click (cliffclickProject Admin) - 2008-11-12 16:39
The fall-back case is that some thread (all threads?) scan the entire array and discover that all things are copied, despite any failed counts. At this point the "discovering" thread can promote. 


Cliff


By: Josh Dybnis (jdybnis) - 2008-11-12 17:15
I'm missing where that is in the implementation. I see where a thread enters panic-mode and does the scan, but it looks like it always compares its workdone with _copyDone and the size of the array. So if every slot is copied, but a thread didn't get around to updating _copyDone, none of the panicking threads will ever terminate their scans.



By: Cliff Click (cliffclickProject Admin) - 2008-11-12 17:23
Welcome to the difference between algorithm and implementation. :-)

Feel free to post a fix. It should be a 1-liner.
Cliff


By: Josh Dybnis (jdybnis) - 2008-11-12 17:44
Heh. I have a similar bug in my implementation.

=============================

compareAndSwap memory fence (New)

Andrew Trick (andrewtrick) - 2008-11-12 11:16
My interpretation of your slide-set/blog comments is that the NBHM state machine--ignore table copying for now--does not require any memory fences or store ordering. CAS_key, CAS_val gets the job done as long as your machine has atomic compare-exchange. Correct?
 
In fact, in NBHM source you bypass util.concurrent.atomic, calling Unsafe.compareAndSwapXXX presumably to avoid the memory fence. However, the Sun JDK and HotSpot assumes that Unsafe.compareAndSwapXXX enforces a full memory fence. So would it be fair to say that you've made Azul-specific changes to the JDK/JVM to optimize NBHM? Should Sun get rid of the full-fence semantics of Unsafe.compareAndSwap before more Java/JDK code begins to rely on it?
 
-Andy


By: Cliff Click (cliffclickProject Admin) - 2008-11-12 11:20
Yes, I need no fencing (ignoring table copy) for some kind of hashtable semantics. You DO need very carefully placed fences to get CHM semantics - which amount to 'volatile' semantics on values used in put & get calls.
 
I bypassed the Atomic classes to get the no-fencing CAS. The Unsafe calls specifically have weak no-fencing versions. On an X86 & Sparc you are stuck with the fencing anyways, but on Azul the weak no-fence version actually does not fence.
 
So I did not do any Azul-specific changes to the JDK here (Azul's JDK IS different from Sun's - but not here).
 
Cliff


Andrew Trick (andrewtrick) - 2008-11-12 19:05
Thanks for the reply and clarification.
 
The JDK does indeed have AtomicReference.weakCompareAndSet. Sorry to be pedantic about API details but... What I meant by "JDK assumes a fence" is that AtomicReference.compareAndSet for Sun is identical to weakCompareAndSet, so the underlying Unsafe.compareAndSwapObject must have a fence. What I meant by "HotSpot assumes a fence" is that the Unsafe.compareAndSwap intrinsic generates abstract memory barriers, which you would have to materialize as a real barrier on any machine that doesn't have implicit CAS fences.
 
But really I was just trying to understand the intention behind your NBHM source, which is pretty clear now. It's not important whether some careful JVM porting is needed to make it fast on certain h/w.

By: Cliff Click (cliffclickProject Admin) - 2008-11-12 21:35
Letsee...
 
- sun.misc.Unsafe does not specify any memory ordering properties for compareAndSet.
- AtomicReference.compareAndSet is doc'd to have volatile semantics, and operates on a volatile variable.
- AtomicReference.weakCompareAndSet is doc'd to NOT have volatile semantics.
 
A legit JDK implementation could then
- not fence for sun.misc.Unsafe
- but yes fence around volatile ops, including Unsafe ops on volatiles.
 
This is exactly the intent of the docs & implementation, and it specifically allows Azul to not fence on sun.misc.Unsafe & still meet the spec - which we do (on both fronts: not-fence and meet-the-spec). We also typically do not suffer from other peoples' bad assumptions here - where they expect Unsafe to fence and are surprised when it doesn't. I suspect Unsafe.CAS users are a rare breed.
 
Removing the fence for Unsafe ops is a tidy win for Azul on some highly parallel contention-heavy codes.
 
Cliff 

That's all,

Cliff

Category: Web/Tech | Permalink | Comments (0) | TrackBack (0)


Cliff's New England Fall Travel Classic
October 28, 2008

Or how I went to New England to view the Fall Colors in Perfect Weather,
and All You Got was this Lousy Blog

Azul sent me to Manhattan (my least favorite place on the planet) to speak at the NYC Java Users Group.  I discover that the earliest flight out of San Jose (plus 3 time zones) gets in NYC at 4pm - and that you can't get from a NYC airport to downtown Manhattan in 2hrs during rush hour - so I can't make a 6pm talk if I leave San Jose on the same day.  Sigh - it's the Red-Eye for me.

Wednesday: I end up on the red-eye from Tuesday the night before and arrive at JFK at a delightful 6:30am local time.  No sleep for me.  I've done the NYC subway system a number of times so it's not much problem for me to buy a MetroCard and make my way to downtown.  I arrive at the hotel at maybe 8am in the morning.  Lo!  The hotel is full and they have no room for me - and indeed have no room until 3pm. Certainly by then I am one shabby dude - unshaven for 36 hours, same clothes, no shower, no sleep - not quite up (down?) to the standards of the street people hanging around the hotel but definitely looking grubby.  And I'm exhausted.

3pm arrives and I finally get my room and my shower, change clothes, lay down.... bad mistake.  I barely scrape up enough brain cells to force myself out of bed, and stumble, shaking from sleep deprivation, to the clock: only 90 minutes gone in a blink.  I splash some water on my face and go to meet the JUG.

The talks go very well; lots of good questions from an animated audience.  I presented Challenges and Directions in Java Virtual Machines and Towards a Scalable Non-Blocking Coding Style, and I got quite animated myself - all in all very good.  Dinner was some close-by expensive steak joint, but excellent food.  NYC definitely does steaks well.  Finally the hotel bed looms... and dawn breaks.

Thursday: Today is 2 customer visits, one planned and one directly due to the JUG talk the evening before.  Alas the new customer has micro-second response time requirements; we can only provide millisecond response times (albeit with massive throughput).  The planned customer visit goes much better; it's a tools&libraries group looking for advice on scalable Collections.  After that I take a train to visit my Uncle's farm in Connecticut.  Penn Central is full of these little train ticket kiosks; I happily approach one.  It takes my credit card, works me down through the menus to New London, CT - then hangs.  I try another; this one takes 30sec before posting loudly "Out of Servce".  I'm 0 for 2 so I sigh and get in the infinite line to see a human.  The line takes a full hour to buy a ticket.  Then I go wait for my (now a full hour later) train.  Just when it gets near the head of the list of arriving trains, the train board reports my train as "delayed: 10 minutes".  This goes on for 20 minutes before it changes to "stalled", then finally "now on track #9".  But all goes well after that, my Uncle picks me up in New London and we drive off to the farm.

Friday, Sat, Sun: The next few days are a real delight. The weather is perfect - highs maybe 65, lows in the 50's, breezy, no clouds and the trees in full color.  My cousins are all married and with kids all the same ages as my kids so it's a great time to visit.  I go to a soccer game, the end of a 5K run, a local chocolate & wine tasting, buy from a craft fair and shop at the local antiques shops; help (interfere?) with some hay harvesting, take a stab at fixing 2 Windoze machines (bizarre XP SP3 things about not being able to install Beethoven.wm3 after about of hour of disk grinding), and just plain visiting.

Monday: I pack up my suitcase, both physical and mental, and prepare to jump back into the fray.  The flight to Nashville is easy - except for my last child is finally going off to public school (we've home-schooled all our kids for years with great success), and my wife is in a frenzy getting him ready - and I'm a thousand miles away.  So I get to hear about it between plane flights, in airports, in the cab, etc. The Nashville hotel is nice, and only $150/night for a room twice the size of the NYC $500/night room, and a much nicer neighborhood. 

Tuesday, Wednesday, Thursday: OOPSLA.  Paid $710 at the door - I forgot to pre-reg.  I had great a lunch conversation Doug Lea, Bill Pugh, & Ben Zorn.  I take lots of notes.  Lots of great hallway and dinner conversation covering reader/write locks, the state of the JCP & Exec Committee, whether Sun should drop the JCP process altogether, whether the market will implode & we should all run up our credit cards to the max, end of civilization, etc. Ben tries to convince me that Microsoft can be trusted as the sole holder of all my critical personal information (sorry Ben) and that totally transparent sync'ing of all my portable & desktop devices is coming Real Soon Now.  Perhaps - but I'm never going to trust my personal info to some single central server - even if it does get old-school Ma-Bell-land-line-style 5-9's reliability for both the server and the sync'ing.  Too many companies have said "it's safe with us" and been wrong.  Too much motivation for crooks if all the eggs are so close together.  Other conversations that stuck in my mind:

  • Azul could Open Source or JCP our hacks for +UseLockedCollections (see about 1/2-way down the blog) and +UseCheckedCollections.  This could be a Good Thing for Azul as it will reduce the number of apps which fail "out of the box" on Azul as well as raise awareness of the issue.  Note that all major App-Servers (and many a large 3-letter-acronym company's major product) fail on Azul without +UseLockedCollections (and runs great with it) and the issue has nothing to do with Azul except we have more cores.  I.e., I predict these crashes will soon be coming to an X86 multi-core Near You!  I can hand out stack traces for any interested parties; email me for details (and yes, bug reports have been filed but between me and you I suspect most of them get filed into the round bin).
  • Bill Pugh offers that at Google they have a 3rd option: log&continue.  The logging is only invoked where now I would throw a ConcurrentModificationException - but there's no standard JDK place for logging such things.  We certainly could add an Azul logging option where the output is available on a RTPM data-race screen.
  • Doug Lea desperately needs 3 things to make his Fork-Join parallelism have a sane API - note that Fork-Join has a really nice approach to handling large complex parallel problems - there's no real limit on scaling plus it has a fantastic story on automatic load balancing which is typically really crucial for us.
    1. Tail-call optimization.  Part and parcel of the solution when doing recursive decomposition.
    2. Some kind of auto-boxing-removal story.  See also John Rose's DaVinci Machine.
    3. Some kind of forced inlining story, so that a complex F-J iterator gets cloned at the site where the loop-body is declared.  This will allow the (highly megamorphic) inner loop function to be specialized per call site, inlined into the iterator loop and decently optimized.
  • Several people asked for a Weak-Key'd AND Weak-Value'd NonBlocking HashMap.  

Friday: I make last minute plans back to NYC to visit the tools & libraries customer group again.  It's a 6:45am flight out of Nashville, which means be at airport by 5:am so leave hotel at 4:30am and get up at 3:30am.  Sign, another day with no sleep.  The flight to NYC is rocky as all heck, the clouds have finally arrived to ruin my days.  We bump for 2hrs out of a 2:45hr flight.  Then it's AirTrain to Jamaca Station, Long Island Railroad to Penn Station, Subway#3 to Wall Street - I'm coming quit good at navigating NYC.  I'm almost an hour early to my customer visit - which is really an open-ended tutorial on Azul's profiling tool: RTPM.

These guys are doing some fun stuff.  They have their own versions of the standard JDK collections where the expected collection size is 1million to 10+million elements.  They have very efficient striped iterators (100x speedups on a 100-way Azul box) and are looking at Doug Lea's Fork-Join for describing their inner loop closures.  RTPM goes a long way to helping them figure out what's going on.

There's also a nifty in-memory DB thingy where we discover that turning on "-tiered" mode (contrast to HotSpot's -client and -server modes) is worth a stunning 5x speedup (Azul usually sees something like a 15% speedup for -tiered).  We try to make the in-memory DB run that fast under a ReadWriteLock.  The customer has a very light-weight roll-your-own version (new in Java6: lots of new features in the JDK ReentrantReadWriteLock class which slows it down by 50%).  The customers' lock doesn't stripe the reader-counts and instantly suffers massively on the highly contended single reader count word (which the JDK version does as well).  I step their engineer through details of reader-lock-striping; it's definitely ticklish stuff to get right.

Finally 5pm arrives and everybody is going home - although for me "home" is really a subway ride to Penn Station, then a train back to New London, CT for another weekend on the farm.  I am sooooo ready to be back in San Jose.

Saturday, Sunday: Another great weekend on the farm; I finally get to see some rain (it's not rainy season in San Jose yet so I haven't see rain since March).  The wind blows down a tree over the power lines and we lose power from 9pm Sat to 5am Sunday (when my room lights pop on and wake me up, grumble).  With the power out and the internet gone it really feels like a farm.  On Sunday I take the train in to Boston, then have dinner with David Bacon at Legal Seafoods & take the T to my hotel.  Gosh I almost feel like a savvy New England traveler. 

Monday is the 2009 VEE PC meeting.  Despite missing 1/3 of the PC members (two have new babies, one got food poisoning, several had other issues that made them dial-in only for a few minutes at a time), the meeting went very well.  We got done early and I got to take the new Boston T Silverline to the airport - it's labeled like a subway but it's really an electric bus (the long bendy kind) in it's own bus-sized tunnel.  Logan was easy, the flight was smooth and I made it home (finally) at 1am local time.  It's good to be home.

Continue reading "Cliff's New England Fall Travel Classic "

Category: Web/Tech | Permalink | Comments (11) | TrackBack (0)


JVM Language Summit
September 29, 2008

I spent last week at the JVM Language Summit, a small conference focused on non-Java JVM-based languages.  It was populated almost solely by either JVM implementors or bytecodes-are-my-new-assembler types.  It was a very fun & geeky conference for me; some of the best technical discussions I've had in a long time.  My favorite philosophical talk was definitely Erik Meijer talking about Fundamentalist Functional Programming.  In the end, he's all for non-pure side-effecting programming BUT he wants the type system to "stop lying" about side-effects.  We both agreed that we'd love to see a functional programming language which had types for side-effects (e.g. Haskell's IO "sin bin"), but with more elaborate side-effects.  I.e., if I'm writing my program with few-to-none side effects I might want to track them exactly (e.g. this function has type 'side-effects field _code') but if I call into the Swing or AWT libs I might want a type like 'side-effects all Java variables'.

My favorite techy/JVM talk was Bernd's Maxine VM.   I think he's got a really good nearly pure-Java layer figured out for generating ASM code.  You want something that looks like running plain Java code (and does field reads & writes) that guaranteed must inline down to where the JIT turns it into atomic machine code loads & stores.  He's missing lots & lots of major features (e.g. good JIT, good GC) but the framework looks nice.  If we ever are to get a true JVM-in-JVM (and skip bootstrapping your JVM via a C++ compiler which is what HotSpot does), I think Maxine has the best chance.

Most desired JVM feature: invokedynamic; this would cut huge chunks of evil code out of most JVM-based languages (but not Clojure & JPC, both of which target 1.5 JVMs more closely).
Most needed JVM feature without changing the spec: a trivial Escape Analysis suitable for dealing with endless Fixnum/BigNum math.  Most of these non-Java JVM languages do fixnum/bignum math (automatically overflowing fixed-int math into bignums) - which means every trivial bit of integer math creates a new Fixnum object.

I showed off Azul's RTPM and got lots of oooo's and aahhhh's.  I convinced lots of people to sign up for academic accounts mostly for access to RTPM.  I was able to quickly diagnose a big problem w/JRuby (failure to inline a key trampoline, masked by +PrintInlining lying), and also got 20% speedup on the ASM library - the core bytecode parsing library in widespread use.

My own talk was well received:

Brian Goetz wrote:

your talk was both the most popular and highest rated, according to the comment cards. 

The slides are here although the postscript conversion mostly screwed up my transparent ovals.  Mostly I ran this trivial program on Azul's JVM using these languages: Scala, Clojure, Jython, JRuby, JavaScript/Rhino, & JPC. I profiled them using RTPM and report on the profiling.  See the slides for the in-depth discussion.  3 things popped out:

  1. Clojure, JRuby, Jython, Javascript/Rhino all using fixnum/bignums for every trivial piece of math - which ends up allocating new FixNums for each simple integer operation. A major fix for this would be to implement the simplest possible Escape-Analysis. You'll still end up with the overflow checks but the major cost here is the Fixnum allocation (and NOT the GC costs; the fixnums trivally die in the young-gen; direct GC costs are quite modest).
  2. JRuby was thinking that a key trampoline function was inlining - and it wasn't.  See the slides for details, but hopefully we'll see a JRuby with a substantial performance boost soon.
  3. I ran Clojure's STM Traveling Salesman Problem using 600 worker 'ants' to optimize the path.  It ran straight-up without a hitch, no tweaks - and with great scaling (it did allocate about 20Gig/sec on a 200Gig heap but this is easy for an Azul box to cope with).  There may be hope for STM's after all - IF the set of STM variables is limited to a handful of named shared variables, and not for dusty-desks that must assume all-is-shared.

Cliff

Category: Web/Tech | Permalink | Comments (18) | TrackBack (0)


A Plea For Programs
September 19, 2008

[Update 9/21/2008: I've got a simple sample program for people to port, plus I've got at least some code for - Clojure, JavaScript/Rhino, JPC & JRuby; missing Scala at least - thanks].

I would like some non-Java Java-bytecode programs to do performance testing, for a talk I'm giving this coming Friday (my bad for starting this late) and I'm hoping my gentle readers can supply some.  I'd like programs in different languages, but ones that are easy to setup and run.  I'm going to do internal JVM profiling, so I'm not all that concerned with the output or "Foo-per-second" results.  Ideally, my programs would be:

  • Non-Java.  Clojure, Scala, JPython, JRuby all come to mind.  The more variation, the merrier!
  • Easy setup.  I'm not an expert in any of these, so the resulting program has to be easy to setup and run.  Perferably a simple "java -cp Weirdo.jar FunnyProgram" command line.
  • Plain JVM.  Note that the 'java' command has to be there; I intend to use Azul Systems' JVM for profiling and we have our own.  Any kind of odd-ball jar or class files should be fine.
  • Long enough.  The program has to run for several minutes at least, without "babysitting".  Long enough for the JIT to settle down (if it's going to), and long enough for decent profiling.
  • Little I/O.  Besides DBs being a pain to setup, I'm really looking for CPU-bound programs.  Plain file I/O is fine, if the files are provided and can be scripted easily (e.g. "java -cp Weirdo.jar FunnyProgram < BigInput.dat > /dev/null").
  • Be multi-threaded.  Not a requirement, but a definite nice-to-have.  Several of these languages support alternative threading & coherency models and I'd like to test these features.
  • Be Open Source, so I can post the collection for others to compare against.  This is NOT a hard requirement; I'm all fine with keeping private anything you request be kept private.  Performance profiling data *will* be released, as that is what the talk is about!  (I'm also fine with signing NDA's but that's probably not going to be an issue with this crowd).
  • An example: A multi-threaded Mandelbrot program would be fine, computing a 1000x1000 grid of points centered around (1.0,1.0) with a spread of (1.0,1.0) - so fill in the grid (0.5,0.5) to (1.5,1.5), using your choice of thread controls.
  • Please include any names, so I can give credit where credit is due.

I hope to discover things like:

  • How close does "plain code" match the JVM/JIT's expectations?  How well does the JIT turn "plain code" into machine instructions?  I hope to present the JIT'd code for sample language constructs and detailed profiling data.
  • How well does the function-call logic match the JVM/JIT's expectations?  Can trivial functions be inlined?  What's the cost of a not-inlined function-call?
  • Other interesting costs?  (e.g., endless new-Class churning, endless new-bytecode churning causing endless JIT'ing; endless new weak-ref or finalizer creation causing GC grief, etc)
  • How well does the alternative threading & coherency scale?  Can Mandelbrot run on a thousand CPUs?  (I expect: trivially yes). How about programs with more interesting coherency requirements? 

I put a sample Java program here, if you'd like to port something really simple.  The inner loop of this program looks like: "for( i=0; i<1000000; i++ ) { sum += ((int)(sum^i)/i); }".  The JIT'd assembly code from HotSpot's server compiler looks like this, unrolled a few times:

2.83% 243 0x12d93878 add4      r5, r4, 1 // tmp=i+1; unrolled 8 times, this is #1
0.06% 5 0x12d9387c xor       r3, r5, r1 // sum in r1, tmp in r5                  
0.06% 5 0x12d93880 beq       r5, 0, 0x012d93b40 // zero check before divide             
0.35% 30 0x12d93884 div4      r0, r3, r5 // divide, notice cycles on next op      
2.64% 227 0x12d93888 add4      r1, r0, r1
// sum += (sum ^ tmp)/tmp               

As expected, there's a pretty direct mapping from the source code to the machine code.  I'd like to see how other JVM-based languages stack up here. Email me directly with small programs, or post links here.

Thanks!
Cliff

Category: Web/Tech | Permalink | Comments (27) | TrackBack (0)


Chrome is at least Slightly Evil
September 10, 2008

Chrome did a number of Evil Things to me recently, including killing my FirePass VPN and AgeOfConan - even after a full reboot and never launching Chrome again.... but I get ahead of myself.  Here's what happened:

I saw the Chrome announcement on SlashDot, amongst other places, read the comic and pretty much agreed with what the developers were saying.  I installed Chrome to give it a test drive.  Things went pretty well at first: the new interface was clearly going to take getting used to but also had a lot of strong points going for it.  Then I hit one of my not-so-techy websites.  Gack!  Screaming Flashing Flash ADS IN YOUR FACE, Enlarge Your X Life, Your Small Ego Needs A Bigger SUV, etc... Quick, turn on AdBlock!  But no, not available yet.

Without AdBlock, clearly Chrome isn't suitable for general surfing yet.  So I went hunting for Chrome-friendly AdBlock replacements; I only found a gizmo called Privoxy that provides a filtering proxy service.  It was waaay less easy to use than AdBlock and didn't provide that "you never even knew the ad was missing" experience that AdBlock achieves (i.e., I was left with lots of big holes saying "Prixovy chopped this ad out").

So I decided to give Chrome a rest and wait for post-Beta (and presumably give the AdBlock guys some time to work their magic).  Meanwhile, Microsoft installed SP3 on my AMD/HP XP machine (Red Herring #1).  FireFox upgrades to 3.0.  Red Herring #3: I fiddled with my router's port-forwarding rules in an effort to get 5 people behind the same router to play Diablo 2 on Battle.net (me, my 4 kids, plus also my brother Uncle Eric in Atlanta and our lifelong friend in Texas so no LAN party suggestions please)   - BTW D2 works fine for 4 players behind our router and the 5th can play in an unrelated D2 game, but when the 5th player attempts to join the same game Blizzard kicks out the last prior player to join - suggestions welcome on how to fix this!).

A day later and many reboots of both Man and Machine (and Router) I've given up on 5-player D2, and decide to play a little Age of Conan.  But nope, some weirdo message pops up about not connecting to the patch server.  I figure the AoC servers are down (no mention on the forums though) and will try again later.  So I try to log into work via FirePass & VPN - it's also a no-go, with some weirdo message about having IE4.0 or later installed (gack).  I file an IT trouble ticket and go to bed.

Next morning I spend 1/2 the morning failing to get FirePass to work.  I spend hours surfing the web for people with similar problems; there's a fair number of not-quite similar situations.  IT has a bunch of bland unhelpful suggestions, but FirePass is working for some folks with FF 3.0 and not others.  In retrospect it HAD been working for me, for about a day (the FF upgrade came in before the Chrome try).  The Web turns me on to the Microsoft XP SP3 + AMD debacle; I chase that one down all the next day - but that bug causes your machine to boot-to-blue-screen and thankfully I don't have that problem. 

Web surfing with FF is fine (and ad-free!); telnet is fine; email is fine; nslookup & tracert is fine.  I get a putty/SSH channel going so once again I can (text-mode only) log in to work.  Good thing I'm an old-school Emacs-ian; text-mode Emacs is not quite a windowing O/S in-and-of itself; I manage to get some work done that day.  Still AoC is down and in my evening play hours I decide to try and figure out what's going on.  I'm not alone; plenty of people are complaining about this no-connect-to-patch-server problem, but the AoC tech folks keep claiming the patch servers are up and the fault is at the client end.  And somebody mentions a "patcher.log" file dumped out when AoC startup does it's patch attempt. 

This "patch.log" file provides the little bit of evidence I need to break the case.  Buried in there are successfully connect attempts from days of yore, plus my more recent failures.  The recent failures all mention - *TaDa!* - a proxy attempt using port 8118.  ?Huh?  Where did this funny port number come from?  I remember it from somewhere... aha - Privoxy's default port.  Now I un-installed privoxy days ago; who can be remembering it's port number (and feeding it to the AoC Patcher?)... only Chrome.

I have 4 browsers on my machine (IE, FF, Safari, Chrome) but only Chrome had a port 8118 typed into it and nary another mention of that port number has slipped my fingers anywhere.  Somehow Chrome is feeding AoC port 8118 - and nothing's there on that port so AoC times out!  I go check out my default browser settings.  They say "Use my current Web browser".  IE is disabled.  Chrome, FireFox & Safari are enabled.  Just to be sure, I click on FF as the default browser.... and Microsoft pops up and says that Chrome has to be de-installed to be made no longer available!  Safari can politely be made not-the-default, but not Chrome?

So I de-install Chrome.  During the de-install, Chrome asks for feedback using a web-browser... using the disabled IE instead of my (I thought!) default of FireFox.  I exit out of IE as fast as possible.  Now with Chrome removed, AoC starts fine.  So I check out FirePass - and Yup, it's fine as well.

Evil #1 - Chrome became the default browser without permission.  If they asked to become the default, they hid it well.  I never allow new browsers to become the default without a few days prior surfing.
Evil #2 - Chrome silently prevented me from running FirePass VPN, hence telecommuting.  I assume they did it by handing FirePass the same crapolla proxy port somehow.  This happens even after a full power-cycle/reboot and not launching Chrome ever again.
Evil #3 - Chrome silently prevented AoC from launching, again by passing bogus proxy port numbers out.
Evil #4 - Chrome cannot be disabled by unchecking "Enable access to the program", although Safari & IE can.
Evil #5 - The Uninstall launches IE, even though it's flagged as "disabled access" and it wasn't the prior default browser.  Typically I launch IE less than once/year and always because some other program ignores the settings and launches it anyways.
Evil #6 - No AdBlock. Not really a big Evil, just a big Annoyance.

Postlude - I attempt to file a bug with Chrome using FireFox.  But the Google site only allows non-crasher bugs to be filed via Chrome itself - and Chrome never crashed for me, it just broke other programs left and right.  No way I'm going to install it again so instead I attempt to file a "Chrome crashed" bug via the website.  Options are either "you got blue-screened" or "chrome died but you survived" and neither really fits the bill.  I fill in a somewhat abbreviated version of the above blog, then hit submit.  No go!  Google insists I download Yet Another Program and run it and report the results before I can file.  No Way, Google!  I just got big-time burned from the last program I downloaded from Google. 

Evil #7 - I can't even file a bug report with Google (without downloading & running at least some code from Google!).  I know Googly's read my blog.  Consider this your Bug Report.

Cliff

Category: Web/Tech | Permalink | Comments (8) | TrackBack (0)


Death by Decompilation
September 4, 2008

I recently got involved in trying to improve performance of a large customer application.  Azul's JVMs come with an always-on profiling tool - RTPM - so it was easy to spot where the time was going.  The name has been changed to protect the guilty but the name might as well have been "addToSet".  A quick check of the JIT'd assembly showed that all time was spent in spinning down a long array (an ArrayList actually), checking for duplicates.  In the end, no dup gets found and the new element gets appended.  Adding N elements gives us a classic O(N^2) algorithm.  The array was already >32000 elements in length at the time we profiled it, so time is proportional to (32K^2) ~ 1billion.  Ugly stuff.

Naturally (grumble) this code is in a large 3rd party not-open-source application, where the owning corporation has been recently purchased by an even larger Faceless Corporation.  The "proper approach" would be to file a performance bug and wait and wait and ...

We voted to try and work around the problem by doing a little decompilation, hacking the ArrayList into a HashSet, and inserting a new jar file in the bootclasspath.  So we pulled this class file out of the pile-o-jar's and hit it with the standard decompiler jad (www.kpdus.com/jad.html).  Presto!  About 2000 lines of fairly legible Java code.  Plus a pile-o-warnings about un-decompilable stuff (Ut-oh!) mostly around nested try/catch/finally clauses and inner classes.  I surf the Java code looking for other uses of the offending ArrayList - there are definitely some - and start to see places where jad bailed out.  When I turn javac loose on the new source file, I get lots of error messages.

No problem thinks I, I'll just get another decompiler.  I immediately google-up cavaj (cavaj-java-decompiler.en.softonic.com).  A few moments later I get another Java file - and whoops!  It's header mentions jad by name down to the exact same revision.  Sure enough this decompiler is a GUI front-end for jad complete with identically broken Java code.  Turns out there are a number of such GUI wrappers for jad.  So I search some more, discarding mocha (www.brouhaha.com/~eric/software/mocha) for being abandoned, ClassCracker (mayon.actewagl.net.au/index.html) because the demo version didn't produce any Java code at all, and IceBreaker (www.breakertech.com) because their web site was down.  I also tried Java Decompiler (java.decompiler.free.fr) but it also produced loads of syntax errors although in different places than jad.  DJ (members.fortunecity.com/neshkov/dj.html) also choked on my class file but just barely.  Very long forward jumps spanning nested try/finally blocks would confuse it - but the inner classes came out looking nice.  A colleague also tried JDec (sourceforge.net/projects/jdec) with limited success.  I also stumbled across this nice list of decompilers from google.

In the end, I hand-integrated the code from JDec & DJ to get something that would compile.  I then hand-refactored the code to insert (probably re-insert) generics and Java6 style iterators - this step was entirely mechanical on my part but no decompiler I tried did it.  The resulting code was much smaller and more readable... and made a bug in the code more obvious (there are "before" & "after" Sets that periodically get unioned and during the union the dup-elimination code is broken and allows dups).  The actual replacement of ArrayList with a HashSet took only a few minutes. 

In summary: decompiling a single class file to reasonable-looking java code took about 6 hrs (and 7 decompilers looked at and 5 actually turned loose on the class file), while the refactoring to correct the egregious performance bug took maybe 10-minutes (yanked all the search loops and just used 'contains' and 'addAll') and included fixing the dups-allowed bug as a pleasant side-effect. 

Is it time for another Java decompiler bake-off?  Has somebody got a recent comparison of decompilers floating about?

Thanks,
Cliff


 

Category: Web/Tech | Permalink | Comments (17) | TrackBack (0)


Just What-the-Heck is a "wait-free" algorithm?
August 8, 2008

What is a wait-free algorithm?  Wikipedia has a nice article on it.  There's the obvious definition: all threads complete the job in a finite number of steps - where a "step" is some unit of work granted a thread by having the OS give the thread some time-slice on a real CPU.

Wait-free isn't interesting for loop-free (and not recursive) programs - there's only a finite number of steps in a finite piece of straight-line code.  However, wait-free is often used to describe multi-threaded programs - which often communicate between each other with some sort of atomic-update machine instruction (frequently a CAS instruction but sometimes it's LL/SC and sometimes something else).
I.e. - most (all?) interesting wait-free algorithms rely on special atomic instructions (you always can write wait-free programs without any kind of atomic-update instruction but these tend to be really slow).  Both CAS & LL/SC will fail if their initial conditions change and the "fix" for failing is usually to loop and try again. 

To re-cap: wait-free algorithms in the real world rely on fragile failable machine instructions - and the failure is often fixed by looping while retrying.  To remain a wait-free algorithm, the algorithm must *somehow* know the all loops are of finite length - i.e., failure is finite.  This is often a reasonable assumption - under low load most machines' atomic-update instructions mostly work.  But now let us posit an adversarial OS.

The definition of wait-free doesn't mention an Operating System - and yet real programs running on real hardware have to deal with a real OS.  Programs often more threads than CPUs and one of the OS's jobs is to schedule the threads - grant them "steps" on the CPU to make forward progress.  The OS chooses when a thread gets CPU cycles (steps) and for how long.  Typically OS's attempt to be fair and efficient, granting all threads an equal number of steps in a round-robin fashion.  However, the OS isn't required to be completely fair and typically they are not (in the name of efficiency).  A wait-free program can make forward progress in the face of an OS that is not being nice, i.e. adversarial OS.

Let's assume the adversarial OS has to be at least slightly fair: it can't deny cycles to a thread forever.  Indeed, if the program runs forever the the OS must hand out an infinite number of cycles to all threads (the unfair adversary isn't very interesting: the OS can simply choose to not hand out cycles some threads, or even to any thread - ala a hard-crash).  What kind of damage can this fair adversarial OS do to our algorithms?

For starters, the OS can guarantee that no contended LL/SC ever succeeds!  Its easy: the OS schedules threads right up until they execute the LL; then that thread isn't allowed cycles again until another thread also executes an LL to the same address.  For all known implementations of LL, this will cause the 1st thread to break the "link" part of load-linked.  The first thread is then allowed cycles until it tries another LL.  The 2nd thread is preempted at once of course, since it also just executed an LL.  Hence the OS ends up allowing all threads an infinite number of cycles while guaranteeing no LL/SC ever succeeds.

What does this all mean?  It means in theory (since the adversarial OS is a theoretic notion) that IBM, Alpha & MIPS are screwed: they cannot use their primary atomic-update operation to write any algorithm without dragging in the OS.  i.e., no one can write even plain locking algorithms on their hardware unless you also can show that the OS is at least slightly not-adversarial.  Truly random time-slices would work - but fixed-cycle time slicing will be adversarial for algorithms which are "beating" in tune with the time slice. 

It's a little worse than that: the "load" part of LL is a ... well, a load.  A contended load.  It will miss in cache and thus be very slow, with a larger-than-normal cycle count attributed to it.  The random-slice OS is slightly more likely to context switch immediately after the missing load than at other points in the program.  Of course, the context switch gives other CPUs a huge chance to request the same address - which means that the poor loser thread is both going to lose this LL/SC but also take another cache miss when it retries (and thus have another greater chance of a context switch).

This is all very theoretical: OS writers try hard to make their OS's not adversarial and exponential backoff/retry loops work great (or worked great on the 4-cpu PowerPC machines I used years ago).  But you had to do it when you used LL/SC or you would get CPUs live-locked spinning on LL/SC.

How does CAS fair in the face of the fair adversarial OS?  All hardware docs I could find claimed that a CAS does succeed if the initial conditions are correct.  According to the docs, assuming that multiple threads attempt to update the same address via CAS using the same initial load then exactly one should succeed.  Here then the adversary cannot stop at least one thread from making forward progress.  This means that CAS can at least be used to make lock-free algorithms.  The OS can arrange for some loser threads to always lose CAS races, by preempting them out after they load the initial CAS conditions and not granting them cycles until after the winner thread does a successful CAS update to the same location.  In other words, you can't use CAS directly to write wait-free algorithms (you always write wait-free algorithms without either CAS or LL/SC, although they tend to be brutally slow).

Now here's another interesting bit: I distinctly recall using machines where CAS would sometimes fail for all threads even when everybody used the same correct initial conditions.  Indeed, I want to claim it was true on early multi-socket Intel machines - but I cannot find any written documentation to support my memory!  My fuzzy recollection was that since Intel didn't build the motherboards, Intel would not guarantee forward progress on contended CAS's - which had to do an external R/M/W cycle and the correctness of which all depended on the motherboard manufacturer.  Indeed it seems to me that unless a CPU designer also designs the details of how CAS interacts across chip boundaries, then that CPU cannot make forward progress guarantees about CAS.  I.e., like IBM has to claim the OS running on their gear is not a fair adversarial OS in order to make LL/SC useful, so also Intel has to claim that motherboards running their chips implement a strong version of the R/M/W cycle to make CAS useful.


Ok, this is all stretching the point a great deal.  In practice these things work well- at least on smaller CPU count machines.  But CPU counts are on the rise, and the issues mentioned here become less and less academic with each passing year.  On an Azul 864-way machine, if all threads are attempting to CAS the same location in a tight loop Azul promises you exactly one succeeds.  But also I can promise you that at least one CPU will perpetually fail unless you invoke some sort of fair-lock (which we do in the OS; Java threads can't quit write CAS loops that are so tight as to prevent all forward progress to one CPU whereas hand-written asm in the OS did).  I wonder how well a 1000-cpu LL/SC machine will fair on such loops and how much exponential backoff will be required (and how much inefficiency will that cause)? 

Or maybe the answer is: on some future 1million-cpu machine, if you have to CAS all CPUs on the same cache line then you are screwed: performance is so bad it's indistinguishable from a crash; you'll hit ^C and try again.  Hence we all end up learning how to write programs which don't do that...  I'll let you know if I get it figured out...      :-)

Cliff

Category: Web/Tech | Permalink | Comments (8) | TrackBack (0)


On How I Went to 2008 PLDI and All You Got Was These Lousy Notes
June 15, 2008

On How I Went to 2008 PLDI and All You Got Was These Lousy Notes


ISMM and PLDI are in Tucson this year, in June (and CGO was in Boston in Feburary - next year I hope they swap!!!).  Outside temps are expected to hit more than 105 in the shade.  My Dad lives in Tucson, so I went a day early to visit.  He's recovering from a diastrous surgical accident, slowly but surely.

The conference center is in a new resort, located in the desert. Indeed the days where very hot and dry - but mostly spent indoors.  I didn't have very high expectations, but actually the place was very nice.  The landscaping was just superb; they had winding shaded paths running around the resort with labelling on the desert plants.  The staff was nice and competent and the bathtub was ridiculously oversized.

Monday I hiked the Ventanna Canyon trail.  I started at 3:30 in the afternoon - hot part of the day, I know, but I wanted to be back before dinner.  I packed a bottle of water, a broad sun hat and layered on the sunscreen.  The trail itself winds through private land for a mile or so before entering national parklands.  It's a rocky and winding trail, crossing a dry creek bed repeatedly as is slowly climbs into the canyon.  The trail is lined with saguaro cactus, other cactus of every type, the occasional accaia tree and a surprising amount of wildlife.  I saw a dozen jackrabbits, a 3-foot snake (not a rattler), and loads of lizards and birds.  A bright red cardinal landed 3 ft from my head and stared at me; the air was loud with owl hoots and the buzzing of a kind of large heavy hornet.  Some kind of tree was popping seed pods; every few moments there'd be a small pop and the rattling of seeds bouncing around the stones.  My feet crunched on the gravel in the hot sun except where I rock-hopped across the dry creek bed.  At times the trail came right up to the rock canyon wall, other times I was wandering back and forth across a narrow valley.  The trail climbed slowly in the initial part, but as the canyon narrows it starts to climb steeply up a ridgeline. This part of the hike was slow going for me; it was still hot but with nearly a 1000ft elevation gain.  I metered my water carefully - there'd be no refills till I got back - and I took lots of rest breaks.  Finally the trail levels out by my destination - the promised Maiden Pools.  At this time of year, they are reduced to small scummy puddles covered with bugs - definitely a disappointment!  After a short break, I turned around.  The ridgeline had lifted above the canyon and I was treated with gorgous views of Tucson.  The trip down was a lot quicker than the trip up, and mostly in shade from the narrow canyon walls.  I got down about 7, nearly 3.5 hours after starting on a 6.4 mile hike (which is definitely slow going for me!).

I saw my Dad again on the last day; he's finally come home from the hospital.  He's been in the hospital for nearly 2 months, so this is a really welcome change!  He's in good spirits and is on the mend. 

The trip home started out fine: a quick flight to Phoenix from Tucson, then a short layover and a flight home... except we got a slow start, and then a slower start, and then they canceled the flight while we were all sitting on the plane.  Apparently we were too slow, and San Jose's curfew kicked in.  US Airways put us up in a Clarion hotel about 20 minutes away.  I skipped dinner thinking I could eat at home and as of this writing it's 5am; I'm back in the airport before any food joints open, waiting for the 1st flight to San Jose of the day.


As usual with these trip reports, I summarize brutally (and only those talks I actually attend) and sort by my personal agenda.  No offense intended, and if you think I didn't do your talk justice please correct me here!

Some Cool Stuff

Building The Billion Dollar GC
Bounding Leaky Programs
Limits of Heap Compaction
Register Allocation via Puzzle Solving
Register Allocation/Coalescing paper
Foundations of the C++ Concurrency Memory Model
Race Directed Random Testing of Concurrent Programs
SharC - Checking C code for Locking Invariants
Automatic Volume Management for Programmable Microfluidics
Al Aho - Quantum Computer Compilers

Some Memory Management Papers

Immix - A Mark-Region Collector
Concurrent Real-Time Copying Garbage Collection
Limits of Parallel Marking
The Closer: Automating Resource Management in Java
Path Specialization: Reducing Phased Execution Overheads
Region-based Memory Management
Efficient Dynamic Heap Allocation of Scratch-Pad Memory
Virtual-Memory-Compaction....
Supporting Superpage Allocation
MPADs
Parallel GC with a Block-Structured Heap
Memory Management for Self-Adjusting Computation
Combine Ref Counting with Functional Program
A Study of Java Object Demograpics

Some Concurrent Programming Papers

Data-Race Detection via Linear Programming
"sketch" - A Tool for Writing Concurrent Data Structures
Velodrome
Inferring Locks for Atomic Sections
Model Checking Concurrent Programs

Other Stuff

Expressive and Safe Static Reflection with MorphJ
Liquid Types
Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compiling
Execution Indexing Problem
XMEM - Type-Safe Shared Memory for Cross-JVM Communication

TuningFork is now open-sourced on sourceforge. (IBM's trace visualization tool and open trace format)


Building the Billion Dollar GC

David Bacon, Keynote

Metronome - worth several hundred million $$$ in revenue to IBM.

(David openned with an amusing story about the failure of technology - his GPS and backup GPS both failed as his sailboat approached the reefs surrounding Bermuda; weather was too poor to use a sextent reliably).

1999-2001, Recycler.  Ref-counting based GC.  max pause: 2.6ms, MMU 72% @ 10ms.  But has pathologies - no bound on time with large numbers of cycles, or on large cycle.  No compaction.  Required a GC processor on the side; no way to define how beefy the GC processor had to be.

Next project: worked really really hard to get rid of pathologies. Wanted robust performance.  Decided on Snapshot-at-Beginning because at time you declare snapshot, you bound amount of work to complete the cycle.  Was worried about incremental-barrier approach (which is exactly the problem Azul solves with our read-barrier).

Real-Time also means Real-Space.  So some space obsessions.  Choose a segregated free-list.  Lose a little memory to fragmentation.  Compute 'rho' - fragmentation lossage ratio.  Picking a 'rho' defines how many memory (max) can be lost to fragmentation.

Array-Lets - choosen to handle compaction of large arrays well (most allocators have a large-object-space which doesn't compact). 

Would be very interesting to run Fragger on Metronome.

2001-2003: 2nd go-round.  Max pause 12ms (up from 2.6ms).  MMU 45% @20ms (vs 72%@10ms).  Much worse than last time.  BUT!!!  Very robust against poor heap shape and bad apps; immune to pathologies.

Now comes 3rd attempt - built into J9 (IBM production VM). (sidebar - RTSJ doesn't "work" in a large scale, the scoped memory doesn't "compose".  Modules from different places, when run together, have scope unpredictable failures.  Raytheon was choking trying to use RTSJ on the next-gen Destroyer project; saw Metronome and said "we'll buy it now")

2004: went crazy trying to get it ready for Raytheon.  1st demo from Raytheon: max 872us, MMU 65@%10ms,  79%@20ms Finally IBM Software Group agrees to pick up the project (for $150M sales!)

2005-2006: Lots more features added: AoT compilers, timer modules, priority inheritance, etc.  Plus must make "real time friendly" all the other cruft: jvmpi/ti, debug and RAS tools, weird ref's (soft, weak, final, JNI), heap format/dump, class libs, all the RTSJ libs, ArrayLet support, read-barrier support in class libs, documentation, etc.....

Plus testing 24x7, including performance testing, pause-time testing, etc.

Today - new multi-cpu version being worked on.  Still only available on blades - max 8cpus.  Not tried 100G heap; no idea what the limit of sustainable allocation rate. 


Bounding Leaky Programs

Non-mutating Mutator - Take a snapshot from a real mutator thread; run with massive profiling (e.g. all memory refs snapshot'd).  The NonMutator cannot update the heap; it's just profiling.

Bounding Leaky Programs.  After a leaky program eats all of memory, most of memory is reachable - but never used again.  When it throws an OutOfMemory - just capture the OOM.  Then 'cut' a random pointer edge, and get to reclaim most leaky memory.  'pick a random' - since most memory is leaked and dead, likely to be cutting off dead stuff.  But also can use most common heap-type and staleness data (pages never touched in a long long time).  "posion" the cut edge, so that if that pointer is actually used, you then throw the OOM you would have thrown originally.  This can post-pone OOM's from leaky programs anywhere from 100x ("forever") to 10x or more.

A measure of heap fragmentation - sum-of-(freeblocksize^2) / (sum-of-freeblocksize)^2

Gives a number from 0 to 1 -

  •   0==maximally_compacted,
  •   1==maximally_fragmented.

Can convert this number into a characteristic-free-block size.  i.e., if we coalesce all free blocks, then split back into free blocks of the same size - which size gives this metric.

Can I use this to decide "how likely a future alloction request will fail due to fragmentation"?  Probably...


Limits of Heap Compaction

Compare 20-zillion different heap-compression techniques across a wide variety of heaps, and against e.g. gzip/LZW style compression.

For some applications (e.g. Xalan) see heap compressions upwards of 50% or more.  Some of the techniques are cheapandeasy to use as well. Trimming trailing zeros from arrays is a big one.

Biggest winner is zero-based object-compresion.  Back each field with a bit; store only the non-zero fields.  Zero fields have a bit set in the backing bitfield.  Nearly 45% heap-size reduction across all heaps.

Used the Dacapo benchmarks+psuedojbb, 20-25 heap snapshots per benchmark.  Savings of each technique are very stable over time (all snapshots show same savings per benchmark).


Register Allocation via Puzzle Solving

Solving for register-pairs.  Split all live ranges maximally; split before and after EACH instruction, using a complete parallel-rename. 

Very graphical; 'puzzles' represent the short live range.  Puzzles have a top-area and a bottom-area for the incoming and outgoing registers.  Some puzzle pieces only go on the top, or bottom, or (if it's a 2-adr live range) it will cover both.  Never have a piece that can go either on the top OR on the bottom. 

Tried 4 different reg allocs on X86, using LLVM 1.9.  Puzzle solving, linear scan (default in LLVM), iterated register coalescing, PBQP (partitioned boolean quadratic programs).  Had to spill and retry 5% of time.  Puzzle-solving is fast; equiv to existing well-tuned linear scan.  Need to insert about 7% more spill ops than no allocation (compare to C2's graph coloring: about 2% of all executed ops are C2 spill ops).

Cliff Notes:A nice way to handle mixing register pairs ala X86 in a linear-scan allocator.  Alas, as with all linear-scan allocators the question is what happens at merge points.  L-S allocators always do well with large basic blocks, but the typical Java basic block is 5 instructions long (HotSpot "-server", so plenty of inlining and optimization).


Register Allocation/Coalescing paper

conservative coalescing vs aggressive coalescing.  but coalescing breaks SSA-form chordal graph form (and chordal graphs can be colored in linear time!!!). 

So do not modify graph, instead modifying the coloring to get a good coalescing.  Optimistically take the existing coloring (easy, from chordal graph SSA coloring), then flip a nodes' color to coalesce an edge - but this introduces a conflict - so again resolve the color-clash by flipping a neighbor color until no more color clashes. Sometimes have to give up a re-coloring.  And what not should be colored 1st? 

Conduct re-coloring as a transaction.  Mark nodes, change colors, etc. If unresolvable, then rollback.  Also, do not recolor already recolored nodes (prevent coloring cycles).  Also when picking colors, try to pick colors not in any neighbor, etc. 

Will attempt all nodes in the affinity graph (all nodes joined by copies) in a single affinity group (split by self-interferences), will attempt to recolor all those nodes to the same color at once.  The split agrees up front which copies will not be attempted to coalesced.

Not terribly fast, but not too bad: about N^1.2, and 1000msec for 1000 node IFG.  After coalescing, 8.4% of copy-cost is left (vs 6.7% left for PERFECT coalescing).  Overall execution speed is about 85-90% of no-coalescing.


Foundations of the C++ Concurrency Memory Model

Similar to JMM; if you avoid data races, then you get SC.  Explicitly: most compilers are now busted when updating bitfields. 

Provide atomic<T> types that allow concurrent access, without introducing dataraces.  atomics allows concurrent access and compiler optimizations on everything else.  'atomic' ends up being similar to Java 'volatile' - but they couldn't make the existing C++ volatile keyward do anything meaningful.

Looks like they are going to end up doing a good job; most of my earlier complaints appear to be answered.


Race Directed Random Testing of Concurrent Programs

Take a previously known tool for finding potential dataraces.  Then force scheduling to expose the races "for real", and test to see if the race is harmful.  Works on large java programs, finds real races with low false-positive rate.  Reproducible: given the same random starting seed, execution is deterministic. 

Stress Testing: repeated execution.  Fails: end up testing the same schedule over and over again.  Fails: on a crash, can not repro. Fails: crashs in different environments can not be repro'd in-house.

So instead, force the schedule.  Pick a random thread, step 1 instruction.  Turns out to get much better randomized schedules (more schedules tried than normal stress testing).  But still does not do much possible schedule-coverage (exponential schedules) - and so does not cover buggy schedules.   

So prioritize search to expose bugs quickly.  Start with some existing technique to get list of potential race conditions (all the "false positives" are good now). 

Good news: very modest slow-downs; gives a reproducable race/crash with very high probability (reproducable, but perhaps the indicated original potential race isn't a real race).  Found a bunch of new races in the standard JDK container libs, and in 300KLOC programs.


SharC - Checking C code for Locking Invariants

Programmer defines high-level sharing strategy.  Example: this data is protected by that lock, etc.  SharC soundly checks thru static and dynamic checking.  Violations include real data-race traces. 

  • private
  • readonly (read by all, no writes except init)
  • locked(l) - only accessed under lock(l)
  • dynamic - either private or readonly
  • racy - benign race, no checking

Use these as type annotators: 

    "int readonly *private X;"
  • private: only accssed by owning thread
  • readonly; target of X can only be read

No non-private pts to private objects.

Sharc infers most annotations

Some common strategies:

  • locked - counters, queues, caches
  • private->readonly (init privately, goes to read-only)
  • private-lock-private - an object moving from thread to thread

Key Insight: if you have the sole ptr to some object, you can change the sharing stragety for an object.  SharC checks this, and allows a 'sharing cast' to change the sharing strategy. 

During runtime, SharC makes all the right guarantees with combos of ref-counting and runtime checks.  IF there is a race between 2 threads, SharC provides an error regardless of thread schedule. Needed expensive checks for 'dynamic' typed objecs.  Uses WPO to infer dynamic and private sharing modes.

Tried it on 6 small/med concurrent programs.  Added annotations, investigate errors; add more annotations, repeat until no more errors. Programs up to 350KLOC; needed 40 annotations/mods on 350KLOC. Runtime overhead varies from 2% to 14%.  Space overhead sometimes gets large (30%+).

Cliff Notes: Actually seems plausible that HotSpot could be annotated this way!


Automatic Volume Management for Programmable Microfluidics

Think "Lab on a Chip" .  Chip with micro-pumps, micro-fluids, mixers, values, sensors, heaters, reaction loops, etc.  Want to mix and match fluids for various kinds of sensors (e.g., blood glucose meter, or a hormone tester or drug discovery).

'Assays' are an life-science experiment.  Using these fluids, but NOT "computing" using fluids - doing Real Science/Chemistry.  Fluids are not computer variables; fluids can be destroyed; have a fixed volume and when it's used it's used up. 

"Program" specifies relative volumes; but "execution" needs real volumes.  Sometimes a mix is needed for several things, so need to decide how much of each fluid is being used.  Have max capcity issues; have minimum size issues (micro-pumps have a lower bound on how much they can pump).

Very cool problem; using compiler-like technology to solve.  Problem looks something like network-flow, but it's actually harder (NP-hard). Able to throw various linear-programming solvers at the problem.


2008 PLDI  Keynote

Al Aho - Quantum Computer Compilers

Lots of strange algorithmic implications; some very hard problems become asymptotically simpler (e.g. factorization goes from exponential to cubic). Right now QC's are giant physics experiments, but progress is being made on many fronts.


Immix - A Mark-Region Collector

Blackburn, McKinely

Really nice presentation.

GC Fundamentals - time/space tradeoff; Immix - improve this tradeoff curve

  • Allocation (e.g. free list, bump-ptr)
  • garbage type (e.g. tracing, ref-counting)
  • reclaimation (e.g. sweep-to-free, compact, evacuation)
  • 1960 - Mark-Sweep: free-list alloc, trace, sweep-to-fre
  • 1967 - Mark-Compact:
  • 1970 - Semi-Space: bump-alloc + trace + evac

Look at these 3 algorithms for: mutator performance, GC time, min-heap.

They all have spots in the curve where they are weaker than one another.

Add a 4th reclaimation: sweep-to-region Mark-Region: bump + trace + sweep-to-region

Split heap into regions.  Objects cannot span regions. After making, any unmarked regions can be freed in bulk.

Large regions: suffer from fragmentation, but cheap bump-ptr

Small regions: suffer meta-data overhead, and small-fragmentation

Carefully pick block sizes to match TLB boundaries and cache boundaries.  Fragmentation doesn't hurt locality so much if you don't span these boundaries.  Recycle partially freed blocks (from last cycle) first before working out of new free blocks. Recycle in address order (explored other orders extensively).  This mimics the mutators' normal allocation order - i.e., continguous in order generally.

Will also occasionally evacuate to get rid of really bad fragmentation (but this is a whole-heap remap?).  And it's only done rarely.

Extensive testing vs 9 other GCs (and Dacapo, JBB2000, JVM98). Meets or exceeds all other collectors (in a single-gen and 2-gen settings).


Concurrent Real-Time Copying Garbage Collection

  • hard RT GC is gaining support
  • multi-core support is hard

This work: compacts; concurrent; lock-freedom; efficiency (actually, Azul does all these - lock-freedom is very specific: where the mutator interacts with the GC it does not need to lock; important for hard RT.)

For Azul - want happens on a reloc trap?  Somebody locks, so that the copy only happens once.

So this is simpler than Stopless - "Chicken" and "Clover". "Chicken" - fast, but may not copy all objects so allows fragmentation "Clover" - probabilitistic, but guarantees objects get copied

Chicken - abort copy if a mutator writes during copy.  Uses a Brookes-style forwarding ptr.  Tag the low-bit during the copy. Mutator will clear this bit atomically if it wants to write.  (ugh that's a write-barrier!!!). 

So write-barrier is wait-free, and has a fast-path/slow-path. In practice, only 1% of copies get aborted.

Clover - probabilistic indication of per-field copy.  Steal a bit-pattern, a random one.  CAS this bit-pattern down to indicate that the field is copied.  Chance of collision with a REAL bit pattern is 1/2^128 (128-bit CAS on Intel) - infinitesimal.  Then can make the algorithm correct-but-probabilistic-lock-free by detecting if the user is user your random R.

Both schemes - 20% throughput overhead (barrier costs). Clover has more slowdown during the copy phase (3x slower) Chicken - no other slowdown.

Alas, only implemented in Bartok; no comparision to Metronome.

Cliff Notes - I like the abort-copy-on-write as a way to avoid stalling the writing threads.  Be nice if we could be even cheaper on the mutator - some like: TLB-write-protect the whole page. If a TLB write trap happens, record that a write happened somewhere on the page, unprotect the page and let mutator proceed.  At end of copy, the GC thread can see if any writes happened.  Since we only copy objects for pages that are mostly empty, we'd only rarely have false aborts.


Limits of Parallel Marking

Did a study on longest ref chain see in SpecJVM98 heaps.  Very short for 1/2 the specs (max: 40).  Some chains around length 1200 for raytrace/mtrt and javac.  Measured max heap depth every 10000 stores into the heap - so pretty darned fine-grained measurements.  No SpecJAppServer, no Dacapo.  Bummer.

Also compute "relative depth" - ratio between number of objects in whole heap to longest single chain.  Then can compute worse-case speedup possible if whole heap filled with max-length chains.  This reports a guaranteed level of parallelism; real runs should do better because most chains are short.

Cliff Notes: Doesn't know about Azul's spec-marking patent and idea of being able to mark any shape heap in time proportional to size-heap-over-num-cpus.  Several researchers jumped up and pointed out that the speculative approach could theoreticaly bring in the whole heap (including all dead stuff), and I countered with the random starting point approach (that you'd get an interesting amount of dead stuff a vanishingly small %-tage of the time) and we all admitted the analysis was over our collective heads.  There's a PhD thesis in there somewhere.


The Closer: Automating Resource Management in Java

Resource Management in Java (outside of GC)

Examples: socket open/close.  Or AddListener and RemoveListner. Really track anytime you need matching create/delete calls.

Usual approach is manual; has usual problems (double-free, dangling).

Also finalizers - but might run out of e.g. file handles, long before running out of GC (happens to us all the time!).  Error happens on a fresh open, but instead we'd like to run a full GC cycle to reclaim file handles BEFORE the 'open' call fails.

Would like to displose after it's left 'relavent' use.  Listeners are called when an action happens - but if the listener itself won't ever call something else (there's no Observer on the Listener, but the Listener is attached to the Observed), then the listener is dead.  So define 'interest reachablility'.  Non-interest-links keep you GC alive.  'Interest-links' keep your resource alive; call the dispose method when only not-interest-reachable.

Also want to rapidly call 'dispose' as soon as only not-interest- reachable.  This greatly reduces drag-time on large resources. Sometimes can do it statically, but not always.

User provides: primitive resources and non-interest links. CLOSER provides: higher-level resources built from auto-made dispose calls.

Pretty standard flow analysis approach.  See if they can track a single 'owner' for a resource.  If so, then either statically dispose - or single-threaded-check dispose if the single owner ever created the resource.  If multiple owners then switch to ref-counting.

Worked great on a large Swing app (removed all existing manual resource management; had to annotate 5 resources - other 62 resources were 'auto-discovered' and auto-dispose code was almost completely equal to the manual, except for fixing 1 bug.

Cliff Notes: I like this as a idea for simplifying higher-level resource mgt.  I'd rather hook into 'GC' somehow - but GC on 'interest-links' somehow.  Similar to how I'd like to see using compiler/data-flow opts on normal programs. Maybe this work could be done with weak-refs and finalizers and standard GC?  Interest-links are strong links, other links are weak-links and finalizers for cleanup.


Path Specialization: Reducing Phased Execution Overheads

It's "efficient" code cloning, to reduce GC barrier costs.

Have 1 set of code when GC is idea, and no barrier is needed.  Still need polling points, so GC can change phases.

Have another set for each different GC phase, with the different barriers optimized inline.  With a little bit of cleverness, can fold these versions mostly together and not clone quite so much code.

Also claim read barrier costs are typically much much higher than has been reported in the past (common paper is "Barriers: Friend or Foe" which reports 10-20% cost for simple read barriers).  However, e.g. Stopless has barriers costing closer to 60% during some short phases, but cost is back down during most of the GC cycle. 


Region-based Memory Management

Check for unsafe usage of Regions (ala HotSpot) on C programs with 100KLOC.  Found a dozen errors, and another dozen false positives. Actually, quite nice - wonder if it works on HS.


Efficient Dynamic Heap Allocation of Scratch-Pad Memory

Allocator targets memorys less than 1Mb in size, for embedded systems.  Scratch pads are common in small machines, but also on e.g. Cell. Scratch is big enuf to be useful, small enuf to be fast.

But can too much stuff to go in it; get fragmentation and control problems.  Need to manage carefully.

  • manual: nightmare for programmers to manage
  • compiler/static: doesn't make good use, no time sense
  • dynamic: still hard to use (e.g. have a fast heap and a slow heap)

Example: DL-malloc will use 512bytes of state to manage a 4K scratch-pad.  1/8th of fast memory is gone already!

SMA - 40 bytes on a 4K scratch.  Also a lot less code and faster. Uses a fast simple fat-block bitmap approach.  Can break big blocks up to avoid fragmentation.  Split blocks are power-of-2 buddy system and stored on a fixed-sized free-lists. 

No boundary tags.  Instead coalesce tags: whole block region has a bitmap showing free chunks of size 2, and another bitmap for chunks of size 4, 8, etc.  Makes the coalesce operation very efficient (constant time per de-alloc or maybe log-n per de-alloc to coalesce).

Can deferr coalesce and stick with free-list for hi-volumne sizes.

Can update the bitmap with CAS for allocations that do not span a word of bitmap.  If spanning a word, will need several CAS's.  If CAS fails, then roll-back earlier allocations and retry.

Very much lower overhead than DL malloc on small allocations. Won't scale well at all (part of the small-space-fast design trade-off).


Virtual-Memory-Compaction....

double-map physical page as a way to move objects physically without moving them virtually.  It's memory compaction for C++!  If the page-offsets of 2 objects do not overlap, then can copy one object from one phys-page to the other, and double-map both virtual addresses into the same phys-page (and thus reclaim the copied-from phys-page).


Supporting Superpage Allocation

(avoiding fragmentation of physical memory)

There's lots of prior work to avoid frag; none seems to work well. Define characteristics of pages:

  • movable memory,
  • reclaimable,
  • unmovable, or
  • temporary

Group pages together based on mobility.

Movable - e.g. an app page.  Copy the data to a new page, and change the apps Virt-2-Phys mapping.  File-system cache pages are temp.

Free-lists for each kind of memory.  Pre-split memory for each kind of memory.  Lots of details; he talks very very fast.  I can't follow it or type that fast.

Turns out it's working well enough, and will probably be in a future version of Red Hat.  No slow-down in page allocation (and often a speedup), and also able to allocation super-pages about 50% of the time.


MPADs

Auto convert array-of-structs into struct-of-array for better cache locality which searching across the whole dataset.

Pointers-to-structs turn into ptrs-to-1st-field.  All 1st-fields are now in an array, and the array is power-of-2 aligned.  1st field must also be a power-of-2 (and math is cheaper on 1st field, so pick a hot field).  Then can convert the ptr-to-1st-field address into the array index by AND-masking to the array start, and doing a sub/shr.  Other fields can be found by adding different array bases and using the just computed index.

Worked great on microbenchmarks; didn't help SpecCpu ... accidently turned off hardware stream prefetching on Power chips.


Parallel GC with a Block-Structured Heap

Simon Marlow, Simon-Peyton-Jones

It's a parallel GC for Haskell.  GC talks to the OS via 'malloc', not 'mmap'.  Doing this for flexibility and portability.  NO address-space layout requirements.  Grabbing OS pages (4K chunks) at a shot. Similar to HotSpot TLAB - on 4k pages. 

Actually grabbing 'megablocks' = 2^M in side and 2^M aligned.  First 2^D bytes contain "block descriptors", then rest of 2^M holds 2^12 sizes block. 


Memory Manage for Self-Adjusting Computation

Large data-set which changes slowly over time. e.g. robots interact with phys environment

e.g. run the same program over and over again, where the input and output from run to run vary slowly over time.  Use change-propagation to make a meta-program which acts "as if" it ran the whole program from scratch - but it runs much faster, and assumes some large starting state and produces only the delta to output state.

Benefits: much smaller in both time and space over running the same program over and over again.  Downside: more complex (but SMJ can auto-produce such a program from the original whole-run-at-a-shot program).  Downside: lousy GC behavior.

So: result is indepedent from input trace, except for performance. Given a good input trace, and a changed input, can compute a new whole output efficiently.  GIven a lousy input trace, will slowly compute the new output as-if from scratch.

Has the classic bad allocation pattern: it's basically a giant FIFO queue for the live dataset.  New allocations are long-lived; things that die are old things.


Combine Ref Counting with Functional Program

(ref counting not quite dead yet) - ref counting admits lots of interesting optimizations.  Cycles - 2 general approaches; Brownbridge - maintain invariant that any cycle has at least one 'weak' edge and don't walk weak edges.  Very complex, very hard to get right.  Or LMW - trial deletion and see if a cycle is busted.  Downside: very slow if lots of cyclic live data.  Also often have large chunks of data you know isn't cyclic (shape of code) or isn't dead (e.g. roots).

So pick these invariants:

  • no cycle is entirely strong
  • weak-in + strong-out ==> there-exists strong-in

Once the strong-count drops to zero, the whole cycle can be nuked.

Maintain invar#1

  • mutator makes refs in only 3 ways:
  • root->strong; constructor-arg->strong; ditto->weak

invar#2:

  • delete strong ref - check for weak-in and no strong-in and strong out  
       
    • might delete last strong-in  
    • correct by making output strong edge into a weak one (and recursively)  
    • No limit on amount of work for weaking an edge  

This approach limited to pure functional programming - can build cycles in the constructors but NOT by set!'ing.

Independence Theory - edge coloring is independent of other optimizations and heuristics.  "push-out" should be possible. Typically for every node reclaimed, less than 2 are scanned.


A Study of Java Object Demograpics

Current collectors tend to have a very simple classification scheme - short lived, long lived, immortal.  And then act accordingly: young-gen, promotion, old-gen decisions are all driven by the prior classification.

Real programs have a richer structure.  Objects from the libs, the app, and the JVM itself have different lifetimes.  Clusters of allocation sites are stable across program inputs; a small number of clusters dominate all allocations. 

Now look at individual allocation sites, and look at the lifetime of objects from those sites.  Use some cheap statistical test to see if 2 sites have similar average lifetimes.  Group all allocation sites with a greedy-algorithm based on closeness of lifetime patterns.  The top 10 of these groups, or clusters, will cover 90% of all allocation.

A small amount of inlining made each "hot" inlining site very deterministic. 


Data-Race Detection via Linear Programming

Assume a 'capability' for each memory location.  You need the capability to write, or a tiny fraction of it to read.  Thus many readers are allowed, but only 1 writer (and no readers if any writer). From this can define typing rules for the lambda calculus, and from that define some linear inequalities.  Can solve system of linear inequalities with, e.g. GLPK or LPSOLVE. 

Now try to add locks to the lambda calculus; locks "store" capbilities.  Creating a lock also has to include some capabilities. Can stll get a real lambda calc, then linear inequalities.  Able to run this on some small pthreads/C programs and find some races.  Has the usual false-positive issues (lots reported; needs some expert extensions to his lambda calculus formulation to remove the false positives).


"sketch" - A Tool for Writing Concurrent Data Structures

insight - implementation - validation- fails-validate - fix either insight or implementation.

This tool: fix the validate/implementation cycle.  User provides insight, but tool tries implementation and validation cycle. 

You write a partial program and a spec; 'sketcher' fills in the rest of the program (exhaustive search) until it gets a program which meets the spec.  Your partial program includes 'holes' to be filled in, and asserts to be honored.  Plus you fill in a list of sample statements to be filled into the hole (but unordered statements, plus various reg-exp expansions - so pretty flexible about what can go in the 'hole').

Their gizmo needs a counter-example input on failure.  It eats the failing example to help drive a smarter next-go-around.  Failing traces from SPIN model-checker for one program can be used to exclude whole classes of other programs which would exhibit the same bug.

The using constraint-solver to produce new program attempts that avoid all prior buggy traces and fits the 'holes' given by sketcher.


Velodrome

Free from race conditions is not sufficient; must have good atomic properties as well.  Can show some interleaved schedules are equivalent to some serialized execution.  Atomic code blocks are serializable.  Atomicity violations often reveal bugs.

Race-Freedom - as-if sequentially consistent

Atomicity - as-if each atomic block is executed serially

Want both properties.

Race-Freedom/Race-Detectors - e.g. Eraser (incomplete, false alarms) vs complete tools.  But Atomicity checkers only come in the "False alarm" flavor.  Bunch of these kinds of atomicity checkers.

Velodrome - complete checker for atomicity - no false alarms.

Demos a non-standard sync idiom (Dekker's algorithm).  Shows happens-before layered over program order, plus H-B on volatiles and shared variables.  Also, anything not in a XTN is treated as operating in its own 1-instruction XTN.  If this H-B ordering has any cycles IFF then not serializable.  Velodrome detects such cycles.

Now: make this scale to billions of XTN's.  Cannot keep the entire graph; too big.  Can use GC to remove XTN's - XTN's with no incoming H-B edges will never be part of a cycle.  Can't get rid of XTN's that are mid-progress.  Ref-Counting works well (since no cycles!)

Must optimize unary-XTN's (single-instruction XTN's).  If a unary XTN has no in-edges - then it will never have in-edges, so can be GC'd immediately - so actually never even allocate a node for it in the graph.  If unary has exactly 1 in-edge, can merge it into the prior XTN as well - and "1 in-edge" refers to all the nodes smushed into the same larger XTN.  Must do this to keep costs reasonable. 

Velodrome: part of RoadRunner framework for Java; uses BCEL.  Can also add other analysis in a pipeline on top of velodrome (e.g. Atomizer and Eraser).  Velodrome causes slowdown by 13x.  H-B XTN graphs tend to be of size ~20 nodes! 

Compare Velodrome to Atomizer on 14 multi-threaded java programs. Atomizer has 1/3 false warning rate (238 total warnings).  Velodrome gives 133 errors, no false alarms.  84 more errors are Atomizer false alarms, and 21 errors are only detected by Atomizer. 

So - take Atomizer errors, and run with Velodrome but pause the executing thread that Atomizer complains about.  Now Velodrome usually catches a concrete witness of the broken thread scheduling. 


Inferring Locks for Atomic Sections

This talk: compiler support for atomic sections via pessimistic concurrency; i.e. implement the 'atomic' keyword with a plain lock. Useful for XTN's with non-reversable ops (i.e., I/O or wait/notify).

Actually grabbing fine-grained locks at the start of each atomic section corresponding to all the memory locations being touched inside the atomic region.  (how is this different from every other STM manager out there?).  Compiler framework is parameterized: can use any heap-shape analysis to guard chunks of heaps with coarse-grained locks.  Also the coarse-grained locks are kept in a hiearchy (tree order); locks closer to the root cover everything under them.

Performance is better relative to TL2/STM in a 80% puts R/B tree, and comparible when using 80% reads.  Still need to figure out read/write locks, but then think performance will always tie or beat a good STM. 

Testing methodology is flawed, so hard to tell how well it really works.


Model Checking Concurrent Programs

stress testing rarely finds all the bugs.  Poor selection of thread interleavings. 

"CHESS" - Systematically enumerate thread interleavings.  Provide some qualified coverage guarantees.  Replace OS scheduler with a "demonic" scheduler.  Want to enumerate all paths in a state-space diagram, but can't capture program states (too big).  Very effective for acyclic state spaces, but not for cyclic spaces.

Problem is to get interesting schedule coverage.  e.g. spin loops are very boring to spin at; and if you spin 100 vs 101 times - do you then explore the states following each seperately?  Likely the states following the spin loops are same for looping 100 vs 101 times.  Also can't really detect live-locks very well - they just look like bigger badder spin locks.

(Good) programs indicate their lack-of-progress - a thread scheduled infinitely often will yield infinitly often (via sleep, yield, thread-ends, asm("rep nop"), etc).  A violation of the Good Samaritan assumption is a performance error.

Found lots of bugs (not presented in paper!)

Found in singularity - etc, programs from 6K to 175K. Can boot and shutdown singularity under the model-checker - found 2 bugs.

Now in production use at Microsoft!


Expressive and Safe Static Reflection with MorphJ

So far - looks like a technique to avoid replicating synchronized wrapper classes (i.e., instead of a SynchronizedList class and a SynchronizedSet and a SynchronizedMap class, you can add sync to any prior class).

Adds syntax to allow iteration over reflective calls:

  for( R m(A) : T.methods ) ...

where T is an interface Type, T.methods is a list of methods in T. SO for example can declare:

  for( R m (A) : T.methods ) 
    public R m (A args) { sync(mutex) { me.m(A); } }

It's an interesting syntatic extension for defining new classes of code subsumption and commoning... but probably not one that's truely useful.  Much easier just to clone the code by hand.  Certainly have enuf obvious extensions that it seems more broadly applicable.  Looks like a good language hackers' tool.  e.g., can replace reflection hacking + BCEL fairly easily; used MorphJ to replace Heirly's auto-STM tool.

// here is "for all fields with a getter..."
  for( T f : X.fields ; some get#f : X.methods )

Liquid Types

Better static typing?  Can put restrictions on actual int values as 'types'.  Such as 'i is never 0 so division is ok', or a type such as 'i such that i is < array.length'.  Problem with dependent-types: need huge type annotations.  Can't we use inference?

Limited to logical qualifiers: "0<=v" or "v<len a" or "v<100", plus conjuctions of them.  Type qualifiers limited to the usual, plus these logical quals.

Provable array safety; 1% annotation in some large code, 10% for a container library.  Seems to do a decent job of inferring proper types. ML type-inference, plus 'dsolve' - linear constraint solver. Caught an off-by-one in the bitvector library; safety check for bitblit:

  if( c < 0 || 
      offset1 < 0 || offset2 < 0 ||
      offset1+c > length1 || offset2+c > length2 ) ....

Error is should be "offset1+c >= length1"


Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compiling

Bartok

Start with TAL - type-safe asm code

Compiler is 200KLOC; does about 40 optimizations. Generates code that is 2% slower than no TAL; compiler is 40% slower. But gen'd code is typed.

Carrying type-proofs through all of the optimization steps? Have a type-system for passing by-ref values. Support most optimizations; does not support inline allocation, or array-bounds checking ABCD (ABCD is not safe in the prescence of overflow).  Proof is generated by Bartok, carried along with the TAL. Proof-checker can check proof in linear time, long after optimizer is gone.  Verification takes about 2.5% of compilation time.

Also check that GC info is correct, that return address is immutable.

Not handling correctness proofs for concurrent programs.

Object sizes have grown alot, to hold type info - 2x more data overhead than plain optimized code (ala plain C).  Execution time is only a little slower (no inlining of allocation code); I assume allocation is considered trusted code (part of TCB). 


Execution Indexing Problem

same program run twice w/slightly different inputs OR 2 versions of same program OR same multi-threaded program, w/different schedules

How do you represent an execution point across different executions? How to set breakpoints in the same execution-point in 2 different executions?

Describe a grammer for the execution trace (remember Manish's trace compression?)  Same game: it's a grammer to describe execution.  Can compute these online, and use them to decribe execution points.  Can ask for their 'indices' same as asking for a call-stack.  Instrument at branches and post-dominators; counters for loops.


XMEM - Type-Safe Shared Memory for Cross-JVM Communication

Uses HotSpot!  Type-safe shared memory: a pool of shared memory. No shared => private-JVM ptrs allowed.  GC safe.  Direct access. Use it, e.g., for HTTP or RMI or CORBA communication between different JVMs physically located on the same box. 

Ala Azul's DirectPath (but we don't break OS protection).

Shared objects not distinguishable from private objects.  Mapped to same virtual address; shared objects have same pointers in each JVM. XMem uses a write-barrier to enforce no-private-ptrs in shared-heap. Otherwise shared objects are exactly the same as private ones.

Need Local Class Table and Global Class Table - so shared objects can have globally shared classes as well.  Shared objects then have 2 flavors of classes - the global version, and via indirect, a private "real" local version. 

Some special native/JNI calls to allocate objects into the shared memory; plus some conveience modes for making objects shared by default.  Can 'clone' objects into shared space.  Can also open sockets and read/write to get objects into shared space.

GC: extends card-marks to tell which private heaps point into the shared heap.  When shared heap is done, trigger a minor collection in each JVM to find roots into the shared heap.  Any tracing algorithm will now work; they did a parallel STW copy collector.  I assume one JVM did all the GC for the collection using the shared memory.

Support locks in the shared heap; must always inflate such locks - as they cross JVM boundaries. 

Overhead vs XMEM oblivious programs- 2% slowdown inside HotSpot for extra work, lock inflation, etc.

But communication microbenchmarks show 20x to 200x speedups.

Also looked at TomCat and Hsqldb - using shared memory instead of serialization (XML) over sockets - often 1000x faster.

Cliff Notes: They lose OS safety protections.  A broken VM can corrupt the shared memory, thus also crashing an unrelated VM using the same shared memory.  Note the difference between 2 VMs talking via serialization over sockets: a corrupt VM will cause a Java level exception in the other VM, which might be able to correctly handle the crashing VM.  Not so here.


Yet Another sampling talk, where they show that a 1% sampling rate is good enough to track all sorts of memory-trace analyses.


RADAR - dataflow analysis for datarace detection

(solves a problem Java doesn't have)


Postlude

I'm off onto a couple of weeks vacation.  No email!  I'll be going through Internet Addiction Withdrawal.  (No cell phone either, but I'm a visual person and hate the phone anyways).

Cliff

Category: Web/Tech | Permalink | Comments (4) | TrackBack (0)