Azul Systems
UNBOUND COMPUTE

Cliff Click Jr.’s Blog's Blog

Category: Web/Tech

A Plea For Programs
September 19, 2008

[Update 9/21/2008: I've got a simple sample program for people to port, plus I've got at least some code for - Clojure, JavaScript/Rhino, JPC & JRuby; missing Scala at least - thanks].

I would like some non-Java Java-bytecode programs to do performance testing, for a talk I'm giving this coming Friday (my bad for starting this late) and I'm hoping my gentle readers can supply some.  I'd like programs in different languages, but ones that are easy to setup and run.  I'm going to do internal JVM profiling, so I'm not all that concerned with the output or "Foo-per-second" results.  Ideally, my programs would be:

  • Non-Java.  Clojure, Scala, JPython, JRuby all come to mind.  The more variation, the merrier!
  • Easy setup.  I'm not an expert in any of these, so the resulting program has to be easy to setup and run.  Perferably a simple "java -cp Weirdo.jar FunnyProgram" command line.
  • Plain JVM.  Note that the 'java' command has to be there; I intend to use Azul Systems' JVM for profiling and we have our own.  Any kind of odd-ball jar or class files should be fine.
  • Long enough.  The program has to run for several minutes at least, without "babysitting".  Long enough for the JIT to settle down (if it's going to), and long enough for decent profiling.
  • Little I/O.  Besides DBs being a pain to setup, I'm really looking for CPU-bound programs.  Plain file I/O is fine, if the files are provided and can be scripted easily (e.g. "java -cp Weirdo.jar FunnyProgram < BigInput.dat > /dev/null").
  • Be multi-threaded.  Not a requirement, but a definite nice-to-have.  Several of these languages support alternative threading & coherency models and I'd like to test these features.
  • Be Open Source, so I can post the collection for others to compare against.  This is NOT a hard requirement; I'm all fine with keeping private anything you request be kept private.  Performance profiling data *will* be released, as that is what the talk is about!  (I'm also fine with signing NDA's but that's probably not going to be an issue with this crowd).
  • An example: A multi-threaded Mandelbrot program would be fine, computing a 1000x1000 grid of points centered around (1.0,1.0) with a spread of (1.0,1.0) - so fill in the grid (0.5,0.5) to (1.5,1.5), using your choice of thread controls.
  • Please include any names, so I can give credit where credit is due.

I hope to discover things like:

  • How close does "plain code" match the JVM/JIT's expectations?  How well does the JIT turn "plain code" into machine instructions?  I hope to present the JIT'd code for sample language constructs and detailed profiling data.
  • How well does the function-call logic match the JVM/JIT's expectations?  Can trivial functions be inlined?  What's the cost of a not-inlined function-call?
  • Other interesting costs?  (e.g., endless new-Class churning, endless new-bytecode churning causing endless JIT'ing; endless new weak-ref or finalizer creation causing GC grief, etc)
  • How well does the alternative threading & coherency scale?  Can Mandelbrot run on a thousand CPUs?  (I expect: trivially yes). How about programs with more interesting coherency requirements? 

I put a sample Java program here, if you'd like to port something really simple.  The inner loop of this program looks like: "for( i=0; i<1000000; i++ ) { sum += ((int)(sum^i)/i); }".  The JIT'd assembly code from HotSpot's server compiler looks like this, unrolled a few times:

2.83% 243 0x12d93878 add4      r5, r4, 1 // tmp=i+1; unrolled 8 times, this is #1
0.06% 5 0x12d9387c xor       r3, r5, r1 // sum in r1, tmp in r5                  
0.06% 5 0x12d93880 beq       r5, 0, 0x012d93b40 // zero check before divide             
0.35% 30 0x12d93884 div4      r0, r3, r5 // divide, notice cycles on next op      
2.64% 227 0x12d93888 add4      r1, r0, r1
// sum += (sum ^ tmp)/tmp               

As expected, there's a pretty direct mapping from the source code to the machine code.  I'd like to see how other JVM-based languages stack up here. Email me directly with small programs, or post links here.

Thanks!
Cliff

Category: Web/Tech | Permalink | Comments (27) | TrackBack (0)


Chrome is at least Slightly Evil
September 10, 2008

Chrome did a number of Evil Things to me recently, including killing my FirePass VPN and AgeOfConan - even after a full reboot and never launching Chrome again.... but I get ahead of myself.  Here's what happened:

I saw the Chrome announcement on SlashDot, amongst other places, read the comic and pretty much agreed with what the developers were saying.  I installed Chrome to give it a test drive.  Things went pretty well at first: the new interface was clearly going to take getting used to but also had a lot of strong points going for it.  Then I hit one of my not-so-techy websites.  Gack!  Screaming Flashing Flash ADS IN YOUR FACE, Enlarge Your X Life, Your Small Ego Needs A Bigger SUV, etc... Quick, turn on AdBlock!  But no, not available yet.

Without AdBlock, clearly Chrome isn't suitable for general surfing yet.  So I went hunting for Chrome-friendly AdBlock replacements; I only found a gizmo called Privoxy that provides a filtering proxy service.  It was waaay less easy to use than AdBlock and didn't provide that "you never even knew the ad was missing" experience that AdBlock achieves (i.e., I was left with lots of big holes saying "Prixovy chopped this ad out").

So I decided to give Chrome a rest and wait for post-Beta (and presumably give the AdBlock guys some time to work their magic).  Meanwhile, Microsoft installed SP3 on my AMD/HP XP machine (Red Herring #1).  FireFox upgrades to 3.0.  Red Herring #3: I fiddled with my router's port-forwarding rules in an effort to get 5 people behind the same router to play Diablo 2 on Battle.net (me, my 4 kids, plus also my brother Uncle Eric in Atlanta and our lifelong friend in Texas so no LAN party suggestions please)   - BTW D2 works fine for 4 players behind our router and the 5th can play in an unrelated D2 game, but when the 5th player attempts to join the same game Blizzard kicks out the last prior player to join - suggestions welcome on how to fix this!).

A day later and many reboots of both Man and Machine (and Router) I've given up on 5-player D2, and decide to play a little Age of Conan.  But nope, some weirdo message pops up about not connecting to the patch server.  I figure the AoC servers are down (no mention on the forums though) and will try again later.  So I try to log into work via FirePass & VPN - it's also a no-go, with some weirdo message about having IE4.0 or later installed (gack).  I file an IT trouble ticket and go to bed.

Next morning I spend 1/2 the morning failing to get FirePass to work.  I spend hours surfing the web for people with similar problems; there's a fair number of not-quite similar situations.  IT has a bunch of bland unhelpful suggestions, but FirePass is working for some folks with FF 3.0 and not others.  In retrospect it HAD been working for me, for about a day (the FF upgrade came in before the Chrome try).  The Web turns me on to the Microsoft XP SP3 + AMD debacle; I chase that one down all the next day - but that bug causes your machine to boot-to-blue-screen and thankfully I don't have that problem. 

Web surfing with FF is fine (and ad-free!); telnet is fine; email is fine; nslookup & tracert is fine.  I get a putty/SSH channel going so once again I can (text-mode only) log in to work.  Good thing I'm an old-school Emacs-ian; text-mode Emacs is not quite a windowing O/S in-and-of itself; I manage to get some work done that day.  Still AoC is down and in my evening play hours I decide to try and figure out what's going on.  I'm not alone; plenty of people are complaining about this no-connect-to-patch-server problem, but the AoC tech folks keep claiming the patch servers are up and the fault is at the client end.  And somebody mentions a "patcher.log" file dumped out when AoC startup does it's patch attempt. 

This "patch.log" file provides the little bit of evidence I need to break the case.  Buried in there are successfully connect attempts from days of yore, plus my more recent failures.  The recent failures all mention - *TaDa!* - a proxy attempt using port 8118.  ?Huh?  Where did this funny port number come from?  I remember it from somewhere... aha - Privoxy's default port.  Now I un-installed privoxy days ago; who can be remembering it's port number (and feeding it to the AoC Patcher?)... only Chrome.

I have 4 browsers on my machine (IE, FF, Safari, Chrome) but only Chrome had a port 8118 typed into it and nary another mention of that port number has slipped my fingers anywhere.  Somehow Chrome is feeding AoC port 8118 - and nothing's there on that port so AoC times out!  I go check out my default browser settings.  They say "Use my current Web browser".  IE is disabled.  Chrome, FireFox & Safari are enabled.  Just to be sure, I click on FF as the default browser.... and Microsoft pops up and says that Chrome has to be de-installed to be made no longer available!  Safari can politely be made not-the-default, but not Chrome?

So I de-install Chrome.  During the de-install, Chrome asks for feedback using a web-browser... using the disabled IE instead of my (I thought!) default of FireFox.  I exit out of IE as fast as possible.  Now with Chrome removed, AoC starts fine.  So I check out FirePass - and Yup, it's fine as well.

Evil #1 - Chrome became the default browser without permission.  If they asked to become the default, they hid it well.  I never allow new browsers to become the default without a few days prior surfing.
Evil #2 - Chrome silently prevented me from running FirePass VPN, hence telecommuting.  I assume they did it by handing FirePass the same crapolla proxy port somehow.  This happens even after a full power-cycle/reboot and not launching Chrome ever again.
Evil #3 - Chrome silently prevented AoC from launching, again by passing bogus proxy port numbers out.
Evil #4 - Chrome cannot be disabled by unchecking "Enable access to the program", although Safari & IE can.
Evil #5 - The Uninstall launches IE, even though it's flagged as "disabled access" and it wasn't the prior default browser.  Typically I launch IE less than once/year and always because some other program ignores the settings and launches it anyways.
Evil #6 - No AdBlock. Not really a big Evil, just a big Annoyance.

Postlude - I attempt to file a bug with Chrome using FireFox.  But the Google site only allows non-crasher bugs to be filed via Chrome itself - and Chrome never crashed for me, it just broke other programs left and right.  No way I'm going to install it again so instead I attempt to file a "Chrome crashed" bug via the website.  Options are either "you got blue-screened" or "chrome died but you survived" and neither really fits the bill.  I fill in a somewhat abbreviated version of the above blog, then hit submit.  No go!  Google insists I download Yet Another Program and run it and report the results before I can file.  No Way, Google!  I just got big-time burned from the last program I downloaded from Google. 

Evil #7 - I can't even file a bug report with Google (without downloading & running at least some code from Google!).  I know Googly's read my blog.  Consider this your Bug Report.

Cliff

Category: Web/Tech | Permalink | Comments (8) | TrackBack (0)


Death by Decompilation
September 4, 2008

I recently got involved in trying to improve performance of a large customer application.  Azul's JVMs come with an always-on profiling tool - RTPM - so it was easy to spot where the time was going.  The name has been changed to protect the guilty but the name might as well have been "addToSet".  A quick check of the JIT'd assembly showed that all time was spent in spinning down a long array (an ArrayList actually), checking for duplicates.  In the end, no dup gets found and the new element gets appended.  Adding N elements gives us a classic O(N^2) algorithm.  The array was already >32000 elements in length at the time we profiled it, so time is proportional to (32K^2) ~ 1billion.  Ugly stuff.

Naturally (grumble) this code is in a large 3rd party not-open-source application, where the owning corporation has been recently purchased by an even larger Faceless Corporation.  The "proper approach" would be to file a performance bug and wait and wait and ...

We voted to try and work around the problem by doing a little decompilation, hacking the ArrayList into a HashSet, and inserting a new jar file in the bootclasspath.  So we pulled this class file out of the pile-o-jar's and hit it with the standard decompiler jad (www.kpdus.com/jad.html).  Presto!  About 2000 lines of fairly legible Java code.  Plus a pile-o-warnings about un-decompilable stuff (Ut-oh!) mostly around nested try/catch/finally clauses and inner classes.  I surf the Java code looking for other uses of the offending ArrayList - there are definitely some - and start to see places where jad bailed out.  When I turn javac loose on the new source file, I get lots of error messages.

No problem thinks I, I'll just get another decompiler.  I immediately google-up cavaj (cavaj-java-decompiler.en.softonic.com).  A few moments later I get another Java file - and whoops!  It's header mentions jad by name down to the exact same revision.  Sure enough this decompiler is a GUI front-end for jad complete with identically broken Java code.  Turns out there are a number of such GUI wrappers for jad.  So I search some more, discarding mocha (www.brouhaha.com/~eric/software/mocha) for being abandoned, ClassCracker (mayon.actewagl.net.au/index.html) because the demo version didn't produce any Java code at all, and IceBreaker (www.breakertech.com) because their web site was down.  I also tried Java Decompiler (java.decompiler.free.fr) but it also produced loads of syntax errors although in different places than jad.  DJ (members.fortunecity.com/neshkov/dj.html) also choked on my class file but just barely.  Very long forward jumps spanning nested try/finally blocks would confuse it - but the inner classes came out looking nice.  A colleague also tried JDec (sourceforge.net/projects/jdec) with limited success.  I also stumbled across this nice list of decompilers from google.

In the end, I hand-integrated the code from JDec & DJ to get something that would compile.  I then hand-refactored the code to insert (probably re-insert) generics and Java6 style iterators - this step was entirely mechanical on my part but no decompiler I tried did it.  The resulting code was much smaller and more readable... and made a bug in the code more obvious (there are "before" & "after" Sets that periodically get unioned and during the union the dup-elimination code is broken and allows dups).  The actual replacement of ArrayList with a HashSet took only a few minutes. 

In summary: decompiling a single class file to reasonable-looking java code took about 6 hrs (and 7 decompilers looked at and 5 actually turned loose on the class file), while the refactoring to correct the egregious performance bug took maybe 10-minutes (yanked all the search loops and just used 'contains' and 'addAll') and included fixing the dups-allowed bug as a pleasant side-effect. 

Is it time for another Java decompiler bake-off?  Has somebody got a recent comparison of decompilers floating about?

Thanks,
Cliff


 

Category: Web/Tech | Permalink | Comments (15) | TrackBack (0)


Just What-the-Heck is a "wait-free" algorithm?
August 8, 2008

What is a wait-free algorithm?  Wikipedia has a nice article on it.  There's the obvious definition: all threads complete the job in a finite number of steps - where a "step" is some unit of work granted a thread by having the OS give the thread some time-slice on a real CPU.

Wait-free isn't interesting for loop-free (and not recursive) programs - there's only a finite number of steps in a finite piece of straight-line code.  However, wait-free is often used to describe multi-threaded programs - which often communicate between each other with some sort of atomic-update machine instruction (frequently a CAS instruction but sometimes it's LL/SC and sometimes something else).
I.e. - most (all?) interesting wait-free algorithms rely on special atomic instructions (you always can write wait-free programs without any kind of atomic-update instruction but these tend to be really slow).  Both CAS & LL/SC will fail if their initial conditions change and the "fix" for failing is usually to loop and try again. 

To re-cap: wait-free algorithms in the real world rely on fragile failable machine instructions - and the failure is often fixed by looping while retrying.  To remain a wait-free algorithm, the algorithm must *somehow* know the all loops are of finite length - i.e., failure is finite.  This is often a reasonable assumption - under low load most machines' atomic-update instructions mostly work.  But now let us posit an adversarial OS.

The definition of wait-free doesn't mention an Operating System - and yet real programs running on real hardware have to deal with a real OS.  Programs often more threads than CPUs and one of the OS's jobs is to schedule the threads - grant them "steps" on the CPU to make forward progress.  The OS chooses when a thread gets CPU cycles (steps) and for how long.  Typically OS's attempt to be fair and efficient, granting all threads an equal number of steps in a round-robin fashion.  However, the OS isn't required to be completely fair and typically they are not (in the name of efficiency).  A wait-free program can make forward progress in the face of an OS that is not being nice, i.e. adversarial OS.

Let's assume the adversarial OS has to be at least slightly fair: it can't deny cycles to a thread forever.  Indeed, if the program runs forever the the OS must hand out an infinite number of cycles to all threads (the unfair adversary isn't very interesting: the OS can simply choose to not hand out cycles some threads, or even to any thread - ala a hard-crash).  What kind of damage can this fair adversarial OS do to our algorithms?

For starters, the OS can guarantee that no contended LL/SC ever succeeds!  Its easy: the OS schedules threads right up until they execute the LL; then that thread isn't allowed cycles again until another thread also executes an LL to the same address.  For all known implementations of LL, this will cause the 1st thread to break the "link" part of load-linked.  The first thread is then allowed cycles until it tries another LL.  The 2nd thread is preempted at once of course, since it also just executed an LL.  Hence the OS ends up allowing all threads an infinite number of cycles while guaranteeing no LL/SC ever succeeds.

What does this all mean?  It means in theory (since the adversarial OS is a theoretic notion) that IBM, Alpha & MIPS are screwed: they cannot use their primary atomic-update operation to write any algorithm without dragging in the OS.  i.e., no one can write even plain locking algorithms on their hardware unless you also can show that the OS is at least slightly not-adversarial.  Truly random time-slices would work - but fixed-cycle time slicing will be adversarial for algorithms which are "beating" in tune with the time slice. 

It's a little worse than that: the "load" part of LL is a ... well, a load.  A contended load.  It will miss in cache and thus be very slow, with a larger-than-normal cycle count attributed to it.  The random-slice OS is slightly more likely to context switch immediately after the missing load than at other points in the program.  Of course, the context switch gives other CPUs a huge chance to request the same address - which means that the poor loser thread is both going to lose this LL/SC but also take another cache miss when it retries (and thus have another greater chance of a context switch).

This is all very theoretical: OS writers try hard to make their OS's not adversarial and exponential backoff/retry loops work great (or worked great on the 4-cpu PowerPC machines I used years ago).  But you had to do it when you used LL/SC or you would get CPUs live-locked spinning on LL/SC.

How does CAS fair in the face of the fair adversarial OS?  All hardware docs I could find claimed that a CAS does succeed if the initial conditions are correct.  According to the docs, assuming that multiple threads attempt to update the same address via CAS using the same initial load then exactly one should succeed.  Here then the adversary cannot stop at least one thread from making forward progress.  This means that CAS can at least be used to make lock-free algorithms.  The OS can arrange for some loser threads to always lose CAS races, by preempting them out after they load the initial CAS conditions and not granting them cycles until after the winner thread does a successful CAS update to the same location.  In other words, you can't use CAS directly to write wait-free algorithms (you always write wait-free algorithms without either CAS or LL/SC, although they tend to be brutally slow).

Now here's another interesting bit: I distinctly recall using machines where CAS would sometimes fail for all threads even when everybody used the same correct initial conditions.  Indeed, I want to claim it was true on early multi-socket Intel machines - but I cannot find any written documentation to support my memory!  My fuzzy recollection was that since Intel didn't build the motherboards, Intel would not guarantee forward progress on contended CAS's - which had to do an external R/M/W cycle and the correctness of which all depended on the motherboard manufacturer.  Indeed it seems to me that unless a CPU designer also designs the details of how CAS interacts across chip boundaries, then that CPU cannot make forward progress guarantees about CAS.  I.e., like IBM has to claim the OS running on their gear is not a fair adversarial OS in order to make LL/SC useful, so also Intel has to claim that motherboards running their chips implement a strong version of the R/M/W cycle to make CAS useful.


Ok, this is all stretching the point a great deal.  In practice these things work well- at least on smaller CPU count machines.  But CPU counts are on the rise, and the issues mentioned here become less and less academic with each passing year.  On an Azul 864-way machine, if all threads are attempting to CAS the same location in a tight loop Azul promises you exactly one succeeds.  But also I can promise you that at least one CPU will perpetually fail unless you invoke some sort of fair-lock (which we do in the OS; Java threads can't quit write CAS loops that are so tight as to prevent all forward progress to one CPU whereas hand-written asm in the OS did).  I wonder how well a 1000-cpu LL/SC machine will fair on such loops and how much exponential backoff will be required (and how much inefficiency will that cause)? 

Or maybe the answer is: on some future 1million-cpu machine, if you have to CAS all CPUs on the same cache line then you are screwed: performance is so bad it's indistinguishable from a crash; you'll hit ^C and try again.  Hence we all end up learning how to write programs which don't do that...  I'll let you know if I get it figured out...      :-)

Cliff

Category: Web/Tech | Permalink | Comments (8) | TrackBack (0)


On How I Went to 2008 PLDI and All You Got Was These Lousy Notes
June 15, 2008

On How I Went to 2008 PLDI and All You Got Was These Lousy Notes


ISMM and PLDI are in Tucson this year, in June (and CGO was in Boston in Feburary - next year I hope they swap!!!).  Outside temps are expected to hit more than 105 in the shade.  My Dad lives in Tucson, so I went a day early to visit.  He's recovering from a diastrous surgical accident, slowly but surely.

The conference center is in a new resort, located in the desert. Indeed the days where very hot and dry - but mostly spent indoors.  I didn't have very high expectations, but actually the place was very nice.  The landscaping was just superb; they had winding shaded paths running around the resort with labelling on the desert plants.  The staff was nice and competent and the bathtub was ridiculously oversized.

Monday I hiked the Ventanna Canyon trail.  I started at 3:30 in the afternoon - hot part of the day, I know, but I wanted to be back before dinner.  I packed a bottle of water, a broad sun hat and layered on the sunscreen.  The trail itself winds through private land for a mile or so before entering national parklands.  It's a rocky and winding trail, crossing a dry creek bed repeatedly as is slowly climbs into the canyon.  The trail is lined with saguaro cactus, other cactus of every type, the occasional accaia tree and a surprising amount of wildlife.  I saw a dozen jackrabbits, a 3-foot snake (not a rattler), and loads of lizards and birds.  A bright red cardinal landed 3 ft from my head and stared at me; the air was loud with owl hoots and the buzzing of a kind of large heavy hornet.  Some kind of tree was popping seed pods; every few moments there'd be a small pop and the rattling of seeds bouncing around the stones.  My feet crunched on the gravel in the hot sun except where I rock-hopped across the dry creek bed.  At times the trail came right up to the rock canyon wall, other times I was wandering back and forth across a narrow valley.  The trail climbed slowly in the initial part, but as the canyon narrows it starts to climb steeply up a ridgeline. This part of the hike was slow going for me; it was still hot but with nearly a 1000ft elevation gain.  I metered my water carefully - there'd be no refills till I got back - and I took lots of rest breaks.  Finally the trail levels out by my destination - the promised Maiden Pools.  At this time of year, they are reduced to small scummy puddles covered with bugs - definitely a disappointment!  After a short break, I turned around.  The ridgeline had lifted above the canyon and I was treated with gorgous views of Tucson.  The trip down was a lot quicker than the trip up, and mostly in shade from the narrow canyon walls.  I got down about 7, nearly 3.5 hours after starting on a 6.4 mile hike (which is definitely slow going for me!).

I saw my Dad again on the last day; he's finally come home from the hospital.  He's been in the hospital for nearly 2 months, so this is a really welcome change!  He's in good spirits and is on the mend. 

The trip home started out fine: a quick flight to Phoenix from Tucson, then a short layover and a flight home... except we got a slow start, and then a slower start, and then they canceled the flight while we were all sitting on the plane.  Apparently we were too slow, and San Jose's curfew kicked in.  US Airways put us up in a Clarion hotel about 20 minutes away.  I skipped dinner thinking I could eat at home and as of this writing it's 5am; I'm back in the airport before any food joints open, waiting for the 1st flight to San Jose of the day.


As usual with these trip reports, I summarize brutally (and only those talks I actually attend) and sort by my personal agenda.  No offense intended, and if you think I didn't do your talk justice please correct me here!

Some Cool Stuff

Building The Billion Dollar GC
Bounding Leaky Programs
Limits of Heap Compaction
Register Allocation via Puzzle Solving
Register Allocation/Coalescing paper
Foundations of the C++ Concurrency Memory Model
Race Directed Random Testing of Concurrent Programs
SharC - Checking C code for Locking Invariants
Automatic Volume Management for Programmable Microfluidics
Al Aho - Quantum Computer Compilers

Some Memory Management Papers

Immix - A Mark-Region Collector
Concurrent Real-Time Copying Garbage Collection
Limits of Parallel Marking
The Closer: Automating Resource Management in Java
Path Specialization: Reducing Phased Execution Overheads
Region-based Memory Management
Efficient Dynamic Heap Allocation of Scratch-Pad Memory
Virtual-Memory-Compaction....
Supporting Superpage Allocation
MPADs
Parallel GC with a Block-Structured Heap
Memory Management for Self-Adjusting Computation
Combine Ref Counting with Functional Program
A Study of Java Object Demograpics

Some Concurrent Programming Papers

Data-Race Detection via Linear Programming
"sketch" - A Tool for Writing Concurrent Data Structures
Velodrome
Inferring Locks for Atomic Sections
Model Checking Concurrent Programs

Other Stuff

Expressive and Safe Static Reflection with MorphJ
Liquid Types
Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compiling
Execution Indexing Problem
XMEM - Type-Safe Shared Memory for Cross-JVM Communication

TuningFork is now open-sourced on sourceforge. (IBM's trace visualization tool and open trace format)


Building the Billion Dollar GC

David Bacon, Keynote

Metronome - worth several hundred million $$$ in revenue to IBM.

(David openned with an amusing story about the failure of technology - his GPS and backup GPS both failed as his sailboat approached the reefs surrounding Bermuda; weather was too poor to use a sextent reliably).

1999-2001, Recycler.  Ref-counting based GC.  max pause: 2.6ms, MMU 72% @ 10ms.  But has pathologies - no bound on time with large numbers of cycles, or on large cycle.  No compaction.  Required a GC processor on the side; no way to define how beefy the GC processor had to be.

Next project: worked really really hard to get rid of pathologies. Wanted robust performance.  Decided on Snapshot-at-Beginning because at time you declare snapshot, you bound amount of work to complete the cycle.  Was worried about incremental-barrier approach (which is exactly the problem Azul solves with our read-barrier).

Real-Time also means Real-Space.  So some space obsessions.  Choose a segregated free-list.  Lose a little memory to fragmentation.  Compute 'rho' - fragmentation lossage ratio.  Picking a 'rho' defines how many memory (max) can be lost to fragmentation.

Array-Lets - choosen to handle compaction of large arrays well (most allocators have a large-object-space which doesn't compact). 

Would be very interesting to run Fragger on Metronome.

2001-2003: 2nd go-round.  Max pause 12ms (up from 2.6ms).  MMU 45% @20ms (vs 72%@10ms).  Much worse than last time.  BUT!!!  Very robust against poor heap shape and bad apps; immune to pathologies.

Now comes 3rd attempt - built into J9 (IBM production VM). (sidebar - RTSJ doesn't "work" in a large scale, the scoped memory doesn't "compose".  Modules from different places, when run together, have scope unpredictable failures.  Raytheon was choking trying to use RTSJ on the next-gen Destroyer project; saw Metronome and said "we'll buy it now")

2004: went crazy trying to get it ready for Raytheon.  1st demo from Raytheon: max 872us, MMU 65@%10ms,  79%@20ms Finally IBM Software Group agrees to pick up the project (for $150M sales!)

2005-2006: Lots more features added: AoT compilers, timer modules, priority inheritance, etc.  Plus must make "real time friendly" all the other cruft: jvmpi/ti, debug and RAS tools, weird ref's (soft, weak, final, JNI), heap format/dump, class libs, all the RTSJ libs, ArrayLet support, read-barrier support in class libs, documentation, etc.....

Plus testing 24x7, including performance testing, pause-time testing, etc.

Today - new multi-cpu version being worked on.  Still only available on blades - max 8cpus.  Not tried 100G heap; no idea what the limit of sustainable allocation rate. 


Bounding Leaky Programs

Non-mutating Mutator - Take a snapshot from a real mutator thread; run with massive profiling (e.g. all memory refs snapshot'd).  The NonMutator cannot update the heap; it's just profiling.

Bounding Leaky Programs.  After a leaky program eats all of memory, most of memory is reachable - but never used again.  When it throws an OutOfMemory - just capture the OOM.  Then 'cut' a random pointer edge, and get to reclaim most leaky memory.  'pick a random' - since most memory is leaked and dead, likely to be cutting off dead stuff.  But also can use most common heap-type and staleness data (pages never touched in a long long time).  "posion" the cut edge, so that if that pointer is actually used, you then throw the OOM you would have thrown originally.  This can post-pone OOM's from leaky programs anywhere from 100x ("forever") to 10x or more.

A measure of heap fragmentation - sum-of-(freeblocksize^2) / (sum-of-freeblocksize)^2

Gives a number from 0 to 1 -

  •   0==maximally_compacted,
  •   1==maximally_fragmented.

Can convert this number into a characteristic-free-block size.  i.e., if we coalesce all free blocks, then split back into free blocks of the same size - which size gives this metric.

Can I use this to decide "how likely a future alloction request will fail due to fragmentation"?  Probably...


Limits of Heap Compaction

Compare 20-zillion different heap-compression techniques across a wide variety of heaps, and against e.g. gzip/LZW style compression.

For some applications (e.g. Xalan) see heap compressions upwards of 50% or more.  Some of the techniques are cheapandeasy to use as well. Trimming trailing zeros from arrays is a big one.

Biggest winner is zero-based object-compresion.  Back each field with a bit; store only the non-zero fields.  Zero fields have a bit set in the backing bitfield.  Nearly 45% heap-size reduction across all heaps.

Used the Dacapo benchmarks+psuedojbb, 20-25 heap snapshots per benchmark.  Savings of each technique are very stable over time (all snapshots show same savings per benchmark).


Register Allocation via Puzzle Solving

Solving for register-pairs.  Split all live ranges maximally; split before and after EACH instruction, using a complete parallel-rename. 

Very graphical; 'puzzles' represent the short live range.  Puzzles have a top-area and a bottom-area for the incoming and outgoing registers.  Some puzzle pieces only go on the top, or bottom, or (if it's a 2-adr live range) it will cover both.  Never have a piece that can go either on the top OR on the bottom. 

Tried 4 different reg allocs on X86, using LLVM 1.9.  Puzzle solving, linear scan (default in LLVM), iterated register coalescing, PBQP (partitioned boolean quadratic programs).  Had to spill and retry 5% of time.  Puzzle-solving is fast; equiv to existing well-tuned linear scan.  Need to insert about 7% more spill ops than no allocation (compare to C2's graph coloring: about 2% of all executed ops are C2 spill ops).

Cliff Notes:A nice way to handle mixing register pairs ala X86 in a linear-scan allocator.  Alas, as with all linear-scan allocators the question is what happens at merge points.  L-S allocators always do well with large basic blocks, but the typical Java basic block is 5 instructions long (HotSpot "-server", so plenty of inlining and optimization).


Register Allocation/Coalescing paper

conservative coalescing vs aggressive coalescing.  but coalescing breaks SSA-form chordal graph form (and chordal graphs can be colored in linear time!!!). 

So do not modify graph, instead modifying the coloring to get a good coalescing.  Optimistically take the existing coloring (easy, from chordal graph SSA coloring), then flip a nodes' color to coalesce an edge - but this introduces a conflict - so again resolve the color-clash by flipping a neighbor color until no more color clashes. Sometimes have to give up a re-coloring.  And what not should be colored 1st? 

Conduct re-coloring as a transaction.  Mark nodes, change colors, etc. If unresolvable, then rollback.  Also, do not recolor already recolored nodes (prevent coloring cycles).  Also when picking colors, try to pick colors not in any neighbor, etc. 

Will attempt all nodes in the affinity graph (all nodes joined by copies) in a single affinity group (split by self-interferences), will attempt to recolor all those nodes to the same color at once.  The split agrees up front which copies will not be attempted to coalesced.

Not terribly fast, but not too bad: about N^1.2, and 1000msec for 1000 node IFG.  After coalescing, 8.4% of copy-cost is left (vs 6.7% left for PERFECT coalescing).  Overall execution speed is about 85-90% of no-coalescing.


Foundations of the C++ Concurrency Memory Model

Similar to JMM; if you avoid data races, then you get SC.  Explicitly: most compilers are now busted when updating bitfields. 

Provide atomic<T> types that allow concurrent access, without introducing dataraces.  atomics allows concurrent access and compiler optimizations on everything else.  'atomic' ends up being similar to Java 'volatile' - but they couldn't make the existing C++ volatile keyward do anything meaningful.

Looks like they are going to end up doing a good job; most of my earlier complaints appear to be answered.


Race Directed Random Testing of Concurrent Programs

Take a previously known tool for finding potential dataraces.  Then force scheduling to expose the races "for real", and test to see if the race is harmful.  Works on large java programs, finds real races with low false-positive rate.  Reproducible: given the same random starting seed, execution is deterministic. 

Stress Testing: repeated execution.  Fails: end up testing the same schedule over and over again.  Fails: on a crash, can not repro. Fails: crashs in different environments can not be repro'd in-house.

So instead, force the schedule.  Pick a random thread, step 1 instruction.  Turns out to get much better randomized schedules (more schedules tried than normal stress testing).  But still does not do much possible schedule-coverage (exponential schedules) - and so does not cover buggy schedules.   

So prioritize search to expose bugs quickly.  Start with some existing technique to get list of potential race conditions (all the "false positives" are good now). 

Good news: very modest slow-downs; gives a reproducable race/crash with very high probability (reproducable, but perhaps the indicated original potential race isn't a real race).  Found a bunch of new races in the standard JDK container libs, and in 300KLOC programs.


SharC - Checking C code for Locking Invariants

Programmer defines high-level sharing strategy.  Example: this data is protected by that lock, etc.  SharC soundly checks thru static and dynamic checking.  Violations include real data-race traces. 

  • private
  • readonly (read by all, no writes except init)
  • locked(l) - only accessed under lock(l)
  • dynamic - either private or readonly
  • racy - benign race, no checking

Use these as type annotators: 

    "int readonly *private X;"
  • private: only accssed by owning thread
  • readonly; target of X can only be read

No non-private pts to private objects.

Sharc infers most annotations

Some common strategies:

  • locked - counters, queues, caches
  • private->readonly (init privately, goes to read-only)
  • private-lock-private - an object moving from thread to thread

Key Insight: if you have the sole ptr to some object, you can change the sharing stragety for an object.  SharC checks this, and allows a 'sharing cast' to change the sharing strategy. 

During runtime, SharC makes all the right guarantees with combos of ref-counting and runtime checks.  IF there is a race between 2 threads, SharC provides an error regardless of thread schedule. Needed expensive checks for 'dynamic' typed objecs.  Uses WPO to infer dynamic and private sharing modes.

Tried it on 6 small/med concurrent programs.  Added annotations, investigate errors; add more annotations, repeat until no more errors. Programs up to 350KLOC; needed 40 annotations/mods on 350KLOC. Runtime overhead varies from 2% to 14%.  Space overhead sometimes gets large (30%+).

Cliff Notes: Actually seems plausible that HotSpot could be annotated this way!


Automatic Volume Management for Programmable Microfluidics

Think "Lab on a Chip" .  Chip with micro-pumps, micro-fluids, mixers, values, sensors, heaters, reaction loops, etc.  Want to mix and match fluids for various kinds of sensors (e.g., blood glucose meter, or a hormone tester or drug discovery).

'Assays' are an life-science experiment.  Using these fluids, but NOT "computing" using fluids - doing Real Science/Chemistry.  Fluids are not computer variables; fluids can be destroyed; have a fixed volume and when it's used it's used up. 

"Program" specifies relative volumes; but "execution" needs real volumes.  Sometimes a mix is needed for several things, so need to decide how much of each fluid is being used.  Have max capcity issues; have minimum size issues (micro-pumps have a lower bound on how much they can pump).

Very cool problem; using compiler-like technology to solve.  Problem looks something like network-flow, but it's actually harder (NP-hard). Able to throw various linear-programming solvers at the problem.


2008 PLDI  Keynote

Al Aho - Quantum Computer Compilers

Lots of strange algorithmic implications; some very hard problems become asymptotically simpler (e.g. factorization goes from exponential to cubic). Right now QC's are giant physics experiments, but progress is being made on many fronts.


Immix - A Mark-Region Collector

Blackburn, McKinely

Really nice presentation.

GC Fundamentals - time/space tradeoff; Immix - improve this tradeoff curve

  • Allocation (e.g. free list, bump-ptr)
  • garbage type (e.g. tracing, ref-counting)
  • reclaimation (e.g. sweep-to-free, compact, evacuation)
  • 1960 - Mark-Sweep: free-list alloc, trace, sweep-to-fre
  • 1967 - Mark-Compact:
  • 1970 - Semi-Space: bump-alloc + trace + evac

Look at these 3 algorithms for: mutator performance, GC time, min-heap.

They all have spots in the curve where they are weaker than one another.

Add a 4th reclaimation: sweep-to-region Mark-Region: bump + trace + sweep-to-region

Split heap into regions.  Objects cannot span regions. After making, any unmarked regions can be freed in bulk.

Large regions: suffer from fragmentation, but cheap bump-ptr

Small regions: suffer meta-data overhead, and small-fragmentation

Carefully pick block sizes to match TLB boundaries and cache boundaries.  Fragmentation doesn't hurt locality so much if you don't span these boundaries.  Recycle partially freed blocks (from last cycle) first before working out of new free blocks. Recycle in address order (explored other orders extensively).  This mimics the mutators' normal allocation order - i.e., continguous in order generally.

Will also occasionally evacuate to get rid of really bad fragmentation (but this is a whole-heap remap?).  And it's only done rarely.

Extensive testing vs 9 other GCs (and Dacapo, JBB2000, JVM98). Meets or exceeds all other collectors (in a single-gen and 2-gen settings).


Concurrent Real-Time Copying Garbage Collection

  • hard RT GC is gaining support
  • multi-core support is hard

This work: compacts; concurrent; lock-freedom; efficiency (actually, Azul does all these - lock-freedom is very specific: where the mutator interacts with the GC it does not need to lock; important for hard RT.)

For Azul - want happens on a reloc trap?  Somebody locks, so that the copy only happens once.

So this is simpler than Stopless - "Chicken" and "Clover". "Chicken" - fast, but may not copy all objects so allows fragmentation "Clover" - probabilitistic, but guarantees objects get copied

Chicken - abort copy if a mutator writes during copy.  Uses a Brookes-style forwarding ptr.  Tag the low-bit during the copy. Mutator will clear this bit atomically if it wants to write.  (ugh that's a write-barrier!!!). 

So write-barrier is wait-free, and has a fast-path/slow-path. In practice, only 1% of copies get aborted.

Clover - probabilistic indication of per-field copy.  Steal a bit-pattern, a random one.  CAS this bit-pattern down to indicate that the field is copied.  Chance of collision with a REAL bit pattern is 1/2^128 (128-bit CAS on Intel) - infinitesimal.  Then can make the algorithm correct-but-probabilistic-lock-free by detecting if the user is user your random R.

Both schemes - 20% throughput overhead (barrier costs). Clover has more slowdown during the copy phase (3x slower) Chicken - no other slowdown.

Alas, only implemented in Bartok; no comparision to Metronome.

Cliff Notes - I like the abort-copy-on-write as a way to avoid stalling the writing threads.  Be nice if we could be even cheaper on the mutator - some like: TLB-write-protect the whole page. If a TLB write trap happens, record that a write happened somewhere on the page, unprotect the page and let mutator proceed.  At end of copy, the GC thread can see if any writes happened.  Since we only copy objects for pages that are mostly empty, we'd only rarely have false aborts.


Limits of Parallel Marking

Did a study on longest ref chain see in SpecJVM98 heaps.  Very short for 1/2 the specs (max: 40).  Some chains around length 1200 for raytrace/mtrt and javac.  Measured max heap depth every 10000 stores into the heap - so pretty darned fine-grained measurements.  No SpecJAppServer, no Dacapo.  Bummer.

Also compute "relative depth" - ratio between number of objects in whole heap to longest single chain.  Then can compute worse-case speedup possible if whole heap filled with max-length chains.  This reports a guaranteed level of parallelism; real runs should do better because most chains are short.

Cliff Notes: Doesn't know about Azul's spec-marking patent and idea of being able to mark any shape heap in time proportional to size-heap-over-num-cpus.  Several researchers jumped up and pointed out that the speculative approach could theoreticaly bring in the whole heap (including all dead stuff), and I countered with the random starting point approach (that you'd get an interesting amount of dead stuff a vanishingly small %-tage of the time) and we all admitted the analysis was over our collective heads.  There's a PhD thesis in there somewhere.


The Closer: Automating Resource Management in Java

Resource Management in Java (outside of GC)

Examples: socket open/close.  Or AddListener and RemoveListner. Really track anytime you need matching create/delete calls.

Usual approach is manual; has usual problems (double-free, dangling).

Also finalizers - but might run out of e.g. file handles, long before running out of GC (happens to us all the time!).  Error happens on a fresh open, but instead we'd like to run a full GC cycle to reclaim file handles BEFORE the 'open' call fails.

Would like to displose after it's left 'relavent' use.  Listeners are called when an action happens - but if the listener itself won't ever call something else (there's no Observer on the Listener, but the Listener is attached to the Observed), then the listener is dead.  So define 'interest reachablility'.  Non-interest-links keep you GC alive.  'Interest-links' keep your resource alive; call the dispose method when only not-interest-reachable.

Also want to rapidly call 'dispose' as soon as only not-interest- reachable.  This greatly reduces drag-time on large resources. Sometimes can do it statically, but not always.

User provides: primitive resources and non-interest links. CLOSER provides: higher-level resources built from auto-made dispose calls.

Pretty standard flow analysis approach.  See if they can track a single 'owner' for a resource.  If so, then either statically dispose - or single-threaded-check dispose if the single owner ever created the resource.  If multiple owners then switch to ref-counting.

Worked great on a large Swing app (removed all existing manual resource management; had to annotate 5 resources - other 62 resources were 'auto-discovered' and auto-dispose code was almost completely equal to the manual, except for fixing 1 bug.

Cliff Notes: I like this as a idea for simplifying higher-level resource mgt.  I'd rather hook into 'GC' somehow - but GC on 'interest-links' somehow.  Similar to how I'd like to see using compiler/data-flow opts on normal programs. Maybe this work could be done with weak-refs and finalizers and standard GC?  Interest-links are strong links, other links are weak-links and finalizers for cleanup.


Path Specialization: Reducing Phased Execution Overheads

It's "efficient" code cloning, to reduce GC barrier costs.

Have 1 set of code when GC is idea, and no barrier is needed.  Still need polling points, so GC can change phases.

Have another set for each different GC phase, with the different barriers optimized inline.  With a little bit of cleverness, can fold these versions mostly together and not clone quite so much code.

Also claim read barrier costs are typically much much higher than has been reported in the past (common paper is "Barriers: Friend or Foe" which reports 10-20% cost for simple read barriers).  However, e.g. Stopless has barriers costing closer to 60% during some short phases, but cost is back down during most of the GC cycle. 


Region-based Memory Management

Check for unsafe usage of Regions (ala HotSpot) on C programs with 100KLOC.  Found a dozen errors, and another dozen false positives. Actually, quite nice - wonder if it works on HS.


Efficient Dynamic Heap Allocation of Scratch-Pad Memory

Allocator targets memorys less than 1Mb in size, for embedded systems.  Scratch pads are common in small machines, but also on e.g. Cell. Scratch is big enuf to be useful, small enuf to be fast.

But can too much stuff to go in it; get fragmentation and control problems.  Need to manage carefully.

  • manual: nightmare for programmers to manage
  • compiler/static: doesn't make good use, no time sense
  • dynamic: still hard to use (e.g. have a fast heap and a slow heap)

Example: DL-malloc will use 512bytes of state to manage a 4K scratch-pad.  1/8th of fast memory is gone already!

SMA - 40 bytes on a 4K scratch.  Also a lot less code and faster. Uses a fast simple fat-block bitmap approach.  Can break big blocks up to avoid fragmentation.  Split blocks are power-of-2 buddy system and stored on a fixed-sized free-lists. 

No boundary tags.  Instead coalesce tags: whole block region has a bitmap showing free chunks of size 2, and another bitmap for chunks of size 4, 8, etc.  Makes the coalesce operation very efficient (constant time per de-alloc or maybe log-n per de-alloc to coalesce).

Can deferr coalesce and stick with free-list for hi-volumne sizes.

Can update the bitmap with CAS for allocations that do not span a word of bitmap.  If spanning a word, will need several CAS's.  If CAS fails, then roll-back earlier allocations and retry.

Very much lower overhead than DL malloc on small allocations. Won't scale well at all (part of the small-space-fast design trade-off).


Virtual-Memory-Compaction....

double-map physical page as a way to move objects physically without moving them virtually.  It's memory compaction for C++!  If the page-offsets of 2 objects do not overlap, then can copy one object from one phys-page to the other, and double-map both virtual addresses into the same phys-page (and thus reclaim the copied-from phys-page).


Supporting Superpage Allocation

(avoiding fragmentation of physical memory)

There's lots of prior work to avoid frag; none seems to work well. Define characteristics of pages:

  • movable memory,
  • reclaimable,
  • unmovable, or
  • temporary

Group pages together based on mobility.

Movable - e.g. an app page.  Copy the data to a new page, and change the apps Virt-2-Phys mapping.  File-system cache pages are temp.

Free-lists for each kind of memory.  Pre-split memory for each kind of memory.  Lots of details; he talks very very fast.  I can't follow it or type that fast.

Turns out it's working well enough, and will probably be in a future version of Red Hat.  No slow-down in page allocation (and often a speedup), and also able to allocation super-pages about 50% of the time.


MPADs

Auto convert array-of-structs into struct-of-array for better cache locality which searching across the whole dataset.

Pointers-to-structs turn into ptrs-to-1st-field.  All 1st-fields are now in an array, and the array is power-of-2 aligned.  1st field must also be a power-of-2 (and math is cheaper on 1st field, so pick a hot field).  Then can convert the ptr-to-1st-field address into the array index by AND-masking to the array start, and doing a sub/shr.  Other fields can be found by adding different array bases and using the just computed index.

Worked great on microbenchmarks; didn't help SpecCpu ... accidently turned off hardware stream prefetching on Power chips.


Parallel GC with a Block-Structured Heap

Simon Marlow, Simon-Peyton-Jones

It's a parallel GC for Haskell.  GC talks to the OS via 'malloc', not 'mmap'.  Doing this for flexibility and portability.  NO address-space layout requirements.  Grabbing OS pages (4K chunks) at a shot. Similar to HotSpot TLAB - on 4k pages. 

Actually grabbing 'megablocks' = 2^M in side and 2^M aligned.  First 2^D bytes contain "block descriptors", then rest of 2^M holds 2^12 sizes block. 


Memory Manage for Self-Adjusting Computation

Large data-set which changes slowly over time. e.g. robots interact with phys environment

e.g. run the same program over and over again, where the input and output from run to run vary slowly over time.  Use change-propagation to make a meta-program which acts "as if" it ran the whole program from scratch - but it runs much faster, and assumes some large starting state and produces only the delta to output state.

Benefits: much smaller in both time and space over running the same program over and over again.  Downside: more complex (but SMJ can auto-produce such a program from the original whole-run-at-a-shot program).  Downside: lousy GC behavior.

So: result is indepedent from input trace, except for performance. Given a good input trace, and a changed input, can compute a new whole output efficiently.  GIven a lousy input trace, will slowly compute the new output as-if from scratch.

Has the classic bad allocation pattern: it's basically a giant FIFO queue for the live dataset.  New allocations are long-lived; things that die are old things.


Combine Ref Counting with Functional Program

(ref counting not quite dead yet) - ref counting admits lots of interesting optimizations.  Cycles - 2 general approaches; Brownbridge - maintain invariant that any cycle has at least one 'weak' edge and don't walk weak edges.  Very complex, very hard to get right.  Or LMW - trial deletion and see if a cycle is busted.  Downside: very slow if lots of cyclic live data.  Also often have large chunks of data you know isn't cyclic (shape of code) or isn't dead (e.g. roots).

So pick these invariants:

  • no cycle is entirely strong
  • weak-in + strong-out ==> there-exists strong-in

Once the strong-count drops to zero, the whole cycle can be nuked.

Maintain invar#1

  • mutator makes refs in only 3 ways:
  • root->strong; constructor-arg->strong; ditto->weak

invar#2:

  • delete strong ref - check for weak-in and no strong-in and strong out  
       
    • might delete last strong-in  
    • correct by making output strong edge into a weak one (and recursively)  
    • No limit on amount of work for weaking an edge  

This approach limited to pure functional programming - can build cycles in the constructors but NOT by set!'ing.

Independence Theory - edge coloring is independent of other optimizations and heuristics.  "push-out" should be possible. Typically for every node reclaimed, less than 2 are scanned.


A Study of Java Object Demograpics

Current collectors tend to have a very simple classification scheme - short lived, long lived, immortal.  And then act accordingly: young-gen, promotion, old-gen decisions are all driven by the prior classification.

Real programs have a richer structure.  Objects from the libs, the app, and the JVM itself have different lifetimes.  Clusters of allocation sites are stable across program inputs; a small number of clusters dominate all allocations. 

Now look at individual allocation sites, and look at the lifetime of objects from those sites.  Use some cheap statistical test to see if 2 sites have similar average lifetimes.  Group all allocation sites with a greedy-algorithm based on closeness of lifetime patterns.  The top 10 of these groups, or clusters, will cover 90% of all allocation.

A small amount of inlining made each "hot" inlining site very deterministic. 


Data-Race Detection via Linear Programming

Assume a 'capability' for each memory location.  You need the capability to write, or a tiny fraction of it to read.  Thus many readers are allowed, but only 1 writer (and no readers if any writer). From this can define typing rules for the lambda calculus, and from that define some linear inequalities.  Can solve system of linear inequalities with, e.g. GLPK or LPSOLVE. 

Now try to add locks to the lambda calculus; locks "store" capbilities.  Creating a lock also has to include some capabilities. Can stll get a real lambda calc, then linear inequalities.  Able to run this on some small pthreads/C programs and find some races.  Has the usual false-positive issues (lots reported; needs some expert extensions to his lambda calculus formulation to remove the false positives).


"sketch" - A Tool for Writing Concurrent Data Structures

insight - implementation - validation- fails-validate - fix either insight or implementation.

This tool: fix the validate/implementation cycle.  User provides insight, but tool tries implementation and validation cycle. 

You write a partial program and a spec; 'sketcher' fills in the rest of the program (exhaustive search) until it gets a program which meets the spec.  Your partial program includes 'holes' to be filled in, and asserts to be honored.  Plus you fill in a list of sample statements to be filled into the hole (but unordered statements, plus various reg-exp expansions - so pretty flexible about what can go in the 'hole').

Their gizmo needs a counter-example input on failure.  It eats the failing example to help drive a smarter next-go-around.  Failing traces from SPIN model-checker for one program can be used to exclude whole classes of other programs which would exhibit the same bug.

The using constraint-solver to produce new program attempts that avoid all prior buggy traces and fits the 'holes' given by sketcher.


Velodrome

Free from race conditions is not sufficient; must have good atomic properties as well.  Can show some interleaved schedules are equivalent to some serialized execution.  Atomic code blocks are serializable.  Atomicity violations often reveal bugs.

Race-Freedom - as-if sequentially consistent

Atomicity - as-if each atomic block is executed serially

Want both properties.

Race-Freedom/Race-Detectors - e.g. Eraser (incomplete, false alarms) vs complete tools.  But Atomicity checkers only come in the "False alarm" flavor.  Bunch of these kinds of atomicity checkers.

Velodrome - complete checker for atomicity - no false alarms.

Demos a non-standard sync idiom (Dekker's algorithm).  Shows happens-before layered over program order, plus H-B on volatiles and shared variables.  Also, anything not in a XTN is treated as operating in its own 1-instruction XTN.  If this H-B ordering has any cycles IFF then not serializable.  Velodrome detects such cycles.

Now: make this scale to billions of XTN's.  Cannot keep the entire graph; too big.  Can use GC to remove XTN's - XTN's with no incoming H-B edges will never be part of a cycle.  Can't get rid of XTN's that are mid-progress.  Ref-Counting works well (since no cycles!)

Must optimize unary-XTN's (single-instruction XTN's).  If a unary XTN has no in-edges - then it will never have in-edges, so can be GC'd immediately - so actually never even allocate a node for it in the graph.  If unary has exactly 1 in-edge, can merge it into the prior XTN as well - and "1 in-edge" refers to all the nodes smushed into the same larger XTN.  Must do this to keep costs reasonable. 

Velodrome: part of RoadRunner framework for Java; uses BCEL.  Can also add other analysis in a pipeline on top of velodrome (e.g. Atomizer and Eraser).  Velodrome causes slowdown by 13x.  H-B XTN graphs tend to be of size ~20 nodes! 

Compare Velodrome to Atomizer on 14 multi-threaded java programs. Atomizer has 1/3 false warning rate (238 total warnings).  Velodrome gives 133 errors, no false alarms.  84 more errors are Atomizer false alarms, and 21 errors are only detected by Atomizer. 

So - take Atomizer errors, and run with Velodrome but pause the executing thread that Atomizer complains about.  Now Velodrome usually catches a concrete witness of the broken thread scheduling. 


Inferring Locks for Atomic Sections

This talk: compiler support for atomic sections via pessimistic concurrency; i.e. implement the 'atomic' keyword with a plain lock. Useful for XTN's with non-reversable ops (i.e., I/O or wait/notify).

Actually grabbing fine-grained locks at the start of each atomic section corresponding to all the memory locations being touched inside the atomic region.  (how is this different from every other STM manager out there?).  Compiler framework is parameterized: can use any heap-shape analysis to guard chunks of heaps with coarse-grained locks.  Also the coarse-grained locks are kept in a hiearchy (tree order); locks closer to the root cover everything under them.

Performance is better relative to TL2/STM in a 80% puts R/B tree, and comparible when using 80% reads.  Still need to figure out read/write locks, but then think performance will always tie or beat a good STM. 

Testing methodology is flawed, so hard to tell how well it really works.


Model Checking Concurrent Programs

stress testing rarely finds all the bugs.  Poor selection of thread interleavings. 

"CHESS" - Systematically enumerate thread interleavings.  Provide some qualified coverage guarantees.  Replace OS scheduler with a "demonic" scheduler.  Want to enumerate all paths in a state-space diagram, but can't capture program states (too big).  Very effective for acyclic state spaces, but not for cyclic spaces.

Problem is to get interesting schedule coverage.  e.g. spin loops are very boring to spin at; and if you spin 100 vs 101 times - do you then explore the states following each seperately?  Likely the states following the spin loops are same for looping 100 vs 101 times.  Also can't really detect live-locks very well - they just look like bigger badder spin locks.

(Good) programs indicate their lack-of-progress - a thread scheduled infinitely often will yield infinitly often (via sleep, yield, thread-ends, asm("rep nop"), etc).  A violation of the Good Samaritan assumption is a performance error.

Found lots of bugs (not presented in paper!)

Found in singularity - etc, programs from 6K to 175K. Can boot and shutdown singularity under the model-checker - found 2 bugs.

Now in production use at Microsoft!


Expressive and Safe Static Reflection with MorphJ

So far - looks like a technique to avoid replicating synchronized wrapper classes (i.e., instead of a SynchronizedList class and a SynchronizedSet and a SynchronizedMap class, you can add sync to any prior class).

Adds syntax to allow iteration over reflective calls:

  for( R m(A) : T.methods ) ...

where T is an interface Type, T.methods is a list of methods in T. SO for example can declare:

  for( R m (A) : T.methods ) 
    public R m (A args) { sync(mutex) { me.m(A); } }

It's an interesting syntatic extension for defining new classes of code subsumption and commoning... but probably not one that's truely useful.  Much easier just to clone the code by hand.  Certainly have enuf obvious extensions that it seems more broadly applicable.  Looks like a good language hackers' tool.  e.g., can replace reflection hacking + BCEL fairly easily; used MorphJ to replace Heirly's auto-STM tool.

// here is "for all fields with a getter..."
  for( T f : X.fields ; some get#f : X.methods )

Liquid Types

Better static typing?  Can put restrictions on actual int values as 'types'.  Such as 'i is never 0 so division is ok', or a type such as 'i such that i is < array.length'.  Problem with dependent-types: need huge type annotations.  Can't we use inference?

Limited to logical qualifiers: "0<=v" or "v<len a" or "v<100", plus conjuctions of them.  Type qualifiers limited to the usual, plus these logical quals.

Provable array safety; 1% annotation in some large code, 10% for a container library.  Seems to do a decent job of inferring proper types. ML type-inference, plus 'dsolve' - linear constraint solver. Caught an off-by-one in the bitvector library; safety check for bitblit:

  if( c < 0 || 
      offset1 < 0 || offset2 < 0 ||
      offset1+c > length1 || offset2+c > length2 ) ....

Error is should be "offset1+c >= length1"


Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compiling

Bartok

Start with TAL - type-safe asm code

Compiler is 200KLOC; does about 40 optimizations. Generates code that is 2% slower than no TAL; compiler is 40% slower. But gen'd code is typed.

Carrying type-proofs through all of the optimization steps? Have a type-system for passing by-ref values. Support most optimizations; does not support inline allocation, or array-bounds checking ABCD (ABCD is not safe in the prescence of overflow).  Proof is generated by Bartok, carried along with the TAL. Proof-checker can check proof in linear time, long after optimizer is gone.  Verification takes about 2.5% of compilation time.

Also check that GC info is correct, that return address is immutable.

Not handling correctness proofs for concurrent programs.

Object sizes have grown alot, to hold type info - 2x more data overhead than plain optimized code (ala plain C).  Execution time is only a little slower (no inlining of allocation code); I assume allocation is considered trusted code (part of TCB). 


Execution Indexing Problem

same program run twice w/slightly different inputs OR 2 versions of same program OR same multi-threaded program, w/different schedules

How do you represent an execution point across different executions? How to set breakpoints in the same execution-point in 2 different executions?

Describe a grammer for the execution trace (remember Manish's trace compression?)  Same game: it's a grammer to describe execution.  Can compute these online, and use them to decribe execution points.  Can ask for their 'indices' same as asking for a call-stack.  Instrument at branches and post-dominators; counters for loops.


XMEM - Type-Safe Shared Memory for Cross-JVM Communication

Uses HotSpot!  Type-safe shared memory: a pool of shared memory. No shared => private-JVM ptrs allowed.  GC safe.  Direct access. Use it, e.g., for HTTP or RMI or CORBA communication between different JVMs physically located on the same box. 

Ala Azul's DirectPath (but we don't break OS protection).

Shared objects not distinguishable from private objects.  Mapped to same virtual address; shared objects have same pointers in each JVM. XMem uses a write-barrier to enforce no-private-ptrs in shared-heap. Otherwise shared objects are exactly the same as private ones.

Need Local Class Table and Global Class Table - so shared objects can have globally shared classes as well.  Shared objects then have 2 flavors of classes - the global version, and via indirect, a private "real" local version. 

Some special native/JNI calls to allocate objects into the shared memory; plus some conveience modes for making objects shared by default.  Can 'clone' objects into shared space.  Can also open sockets and read/write to get objects into shared space.

GC: extends card-marks to tell which private heaps point into the shared heap.  When shared heap is done, trigger a minor collection in each JVM to find roots into the shared heap.  Any tracing algorithm will now work; they did a parallel STW copy collector.  I assume one JVM did all the GC for the collection using the shared memory.

Support locks in the shared heap; must always inflate such locks - as they cross JVM boundaries. 

Overhead vs XMEM oblivious programs- 2% slowdown inside HotSpot for extra work, lock inflation, etc.

But communication microbenchmarks show 20x to 200x speedups.

Also looked at TomCat and Hsqldb - using shared memory instead of serialization (XML) over sockets - often 1000x faster.

Cliff Notes: They lose OS safety protections.  A broken VM can corrupt the shared memory, thus also crashing an unrelated VM using the same shared memory.  Note the difference between 2 VMs talking via serialization over sockets: a corrupt VM will cause a Java level exception in the other VM, which might be able to correctly handle the crashing VM.  Not so here.


Yet Another sampling talk, where they show that a 1% sampling rate is good enough to track all sorts of memory-trace analyses.


RADAR - dataflow analysis for datarace detection

(solves a problem Java doesn't have)


Postlude

I'm off onto a couple of weeks vacation.  No email!  I'll be going through Internet Addiction Withdrawal.  (No cell phone either, but I'm a visual person and hate the phone anyways).

Cliff

Category: Web/Tech | Permalink | Comments (4) | TrackBack (0)


Clojure: STMs vs Locks
May 28, 2008

I've been participating in this fascinating discussion with Rich Hickey, the author of Clojure about Software Transactional Memory.  I decided to cleanup and echo the conversation here, but the original can be found here.


The Problem Statement: it's not "atomic-vs-lock" but instead it's "where do we put atomic/lock?"

Randall R Schulz

  I found the following comment on an Azul Systems blog (http://blogs.azulsystems.com/cliff/2008/05/javaone.html) :

 

Cliff Click wrote:   

    Performance in an STM is not transparent: it requires a runtime to do abort/retry, and as the runtimes get fancy & complex the behavior under load of STM's gets.... weird and unpredictable.  Not something you can put into a production environment.   

  Any comments or counter-examples to this?

Raoul Duke

Rich Hickey, richhic...@gmail.com

wrote:

  Are explicit locking designs free from such anomalies? I wouldn't think so.

paraphrase: the behaviour under load gets weird and tricky.

well, in some ways maybe that doesn't happen with locks: the question of (depending on which STM system we're looking at) figuring out what caused all the rollbacks vs. with locks, you can at least generally quickly see that a given lock has 90% of all threads waiting on it kind of thing. whereas you don't know what variable in the venn diagram of overlapping transactions caused the retry?

tho i believe Rich previously said that Clojure would actually have a way of finding that out!
http://groups.google.com/group/clojure/browse_thread/thread/2f73b80fd...

Cliff Click

You got it - under load, performance is unpredictable.

With locks, you can see which locks are loaded, generally figure out why, and plan a strategy (stripe the lock, shorter hold time, switch to the j.u.concurrent datastructures, etc).

Not so (at least not yet) with STM.  I've talked to a few people who've tried STM out in a larger scale than the usual academic papers - and the results are downright disappointing.  Unless you become an expert in the STM runtime you're using (where "expert" means "not just grok it, but can build & tweak it") anytime performance is not good enough - you get stuck.  This is an open research problem at best, with no clear indication that the problem can be solved at all.

Rich Hickey

One could as generically argue that systems with manual locking are usually broken, and therefore their behavior itself, never mind their performance, is unpredictable.  That's been my experience with manual locking in the hands of mere mortal developers.  It can be as difficult to understand the behavior of any single manually-concurrent program as to understand the performance characteristics of an STM.  In the latter case, at least the STM is handling correctness for you, and all users of the same STM can share knowledge (and bug fixes and performance improvements), vs each application being its own Gordian knot.  And in any case in which the STM is failing you performance-wise, you can opt out and use manual locks outside of transactions.  To the extent you can use it, STM can drastically reduce the complexity of a system.

I'm wary of unifying the discussion of concurrency with one about performance, as the problems of concurrency start with correctness, where the manual locking story, due to its complexity, is quite bad.  Scalability is not a universal problem - some systems need to handle thousands of simultaneous connections - most don't, some need to leverage hundreds of CPUs in a single application - most don't.  But every application that wants to leverage multiple cores needs to be correct.

It would be no problem to instrument the Clojure STM references with fault counters that would allow one to know the exact per-reference contention level. Once known, the answers in both cases are similar - share less mutable state, don't have both long and short-lived transactions contend for the same data, check your success conditions ASAP etc.

STM designs differ from one another quite a bit, so any general statements are difficult to assess.  I think the level of granularity of the STM comes into play.  Most of the STM papers I've read concentrate on creating lots of transactional cells with which they envision constructing concurrent data structures like the ones in java.util.concurrent.  I don't think that's a good idea at all.

Clojure encourages the use of immutable persistent data structures, which need no locking or coordination whatsoever in order to perform correctly in a concurrent situation, and STM references to support the appearance of identity-preserving change.  It also encourages the use of java.util.concurrent and the like for caches and work queues - the best tools for the job.

As far as I know, Clojure is the first STM that uses snapshot MVCC, avoiding both read logging and transaction restarts due to read invalidations.  Clojure's STM also gives you the ability to 'ensure' a reference you later intend to read or write, thus providing some manual control over resource acquisition order (rather than just relying on the side effect of access).  It also supports explicit commute, further reducing retries for commutative writes.  I'll not claim its performance characteristics are at all proven, but its contention due to, and for, reading is lower than in programs that use manual locking, and lower than in STMs in which write transactions can cause read transactions to retry.

Once you get to record-level STM granularity, like Clojure's, it's also a bit easier to draw parallels to the performance characteristics of the database systems it mimics, which have been doing similar things for decades.

I don't consider STM performance any more or less of a research problem than determining the correctness of programs that do manual locking - there's still work to do.

And of course, neither STM, nor any other mechanism, is a panacea.

Brett Morgan

Given that I am a mere developer, actually with MT i'd consider myself a rank newbie, the most important thing to me is visibility.  I want to be able to see what problems I am creating with my design, so that I can change my design, and see how the problems change.  In effect, I am looking for an environment that teaches me what I am doing wrong, if only by showing me which references are hotly contended.

In keeping with that, it would probably be helpful for both refs and transactions to be namable, such that debugging output from the STM runtime can actually be easily inspected.  The reason I say this is that the data I need to work over is non uniform, so I need to be able to label the hot reference to know which particular chunks of my data set are causing issues.

thoughts?

Cliff Click

The problem with manual locking - and it applies equally well to transactions - is that the there's no name-space-control over what needs to be locked/transacted.  The failure people have with locks (and limited experience with transactions) is that they can't properly name their shared variables.  They think they can, until we start looking hard, and then we keep finding more... and more that need to be locked/transacted atomically.  In short, people don't know where to put the lock/unlock or where to bound the transactional region.

Having said that, locks have some advantages over transactions: transparent performance being one of them.  Programs are busted if the locks are wrong (or transaction regions don't cover all the variables they need to; Clojure isn't immune here) AND they are busted if performance sucks.  For 2-4 way machines the issue is probably moot; for 8-32 way machines (standard servers these days) - it matters.  In a year every el-cheapo linux blade server out there will be looking at 16-64 cpus.  And of course for Azul, 100+ active cpus is the norm.  I claim that in a year, most people will either care (get bit by lack of scaling), or see that their need to care within another year.

Requiring shared variables to have a special syntax, ala Ref, is a big step forward here for both locks & transactions.  You're reducing the complexity of the problem via name-space control (only Ref's are shared) and promoting immutable shared state.  The STM issue is orthogonal.

As evidence of that, suppose you replace your STM implementation with a trivial single global lock.  Are the programs more or less complex than their locking counterparts?  The program is as-correct and as-complex as before (before you swap the STM implementation), and indeed is exactly equal to the locking solution using a single global lock.  Suppose I can get a correct program using STM - then I also have a correct program using a single global lock.  STM only varies from locking as I try to make the implementation more performant: for the STM I trust the language implementor to "stripe" the locks under the  hood.  For locks, I start using more than one lock.  In both cases I can also try to shrink the lock-hold-time (or reduce the size of the STM region), or switch algorithms.

Summary: STM doesn't solve the Big Problem (lack of name-space control).  Ref's might solve the Big Problem - and solving the Big Problem will be a big help to both locks & STM.  Locks have transparent performance and are thus easy to optimize.  Optimizing without good name-space control is a buggy process for both locks & STM.  I don't know how to optimize an STM implementation without becoming an expert in the particulars of the STM runtime (and there are so many to choose from!)

Rich Hickey

I agree that identifying the things that can change is one key issue, and Clojure's Refs, Agents, and Vars all do that.  But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure's) STM works.

Refs are not merely markers, they do enforcement.  There is simply no mutating a ref outside of a transaction - it's not an allowed mistake.  Any attempt to change a ref outside a transaction throws an exception.  So, there is no such thing as "transaction regions don't cover all the variables they need to" in Clojure.

Your thought exercise with a single lock is a good one.  Let's say Clojure's implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.

Which is more correct?  The Clojure one, because it is not subject to undetected 'forgetting to obtain the lock'.  Your assertion that "Suppose I can get a correct program using STM - then I also have a correct program using a single global lock" implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness.  All Clojure STM programs (that don't throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don't throw exceptions) are correct STM (or otherwise) programs.

Which one is has better performance? Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.

As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock acquisition order and the resultant deadlock risk.  At this point STM becomes dramatically less complex than manual locking.

I'll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy.  But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.

Analogies to databases are telling.  DBMS's took over locking long ago and most people were happy to have them do so.  Ditto memory management and GC.  In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.

So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement.  It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I'm sure people are working on that.  STMs can provide correctness enforcement today, and therefore the two are not equivalent.  Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance.

Cliff Click

On May 24, 10:51 am, Rich Hickey

wrote:

I agree that identifying the things that can change is one key issue, and Clojure's Refs, Agents, and Vars all do that.  But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure's) STM works. Refs are not merely markers, they do enforcement.  There is simply no mutating a ref outside of a transaction - it's not an allowed mistake.  Any attempt to change a ref outside a transaction throws an exception.  So, there is no such thing as "transaction regions don't cover all the variables they need to" in Clojure.

Alas - there still is.   :-(
Example down below, and this is my primary complaint against both locks & STM.  However, your other arguments are more quickly answered - mostly because I agree with you.

Your thought exercise with a single lock is a good one.  Let's say Clojure's implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.

Which is more correct? The Clojure one, because it is not subject to undetected 'forgetting to obtain the lock'. Your assertion that "Suppose I can get a correct program using STM - then I also have a correct program using a single global lock" implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness.  All Clojure STM programs (that don't throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don't throw exceptions) are correct STM (or otherwise) programs.

Good one.  I agree.  Clojure wins this round.

Which one is has better performance?  Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.

Not really relevant; nobody writes the 'single global lock' version, including Clojure.  It was just a thought exercise.

As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock acquisition order and the resultant deadlock risk.  At this point STM becomes dramatically less complex than manual locking.

Not interesting in practice.  Because in practice, deadlocks are easy to find & fix.  You get a crash dump, crawl the stacks, discover the lock cycle & reorganize.  Sure the situation could be better, but deadlocks are a 'crash once' kind of bug.  Dev's don't like 'em, but they don't lose sleep over them either.

I'll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy.  But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.

Analogies to databases are telling.  DBMS's took over locking long ago and most people were happy to have them do so.  Ditto memory management and GC.  In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.

Actually, I was thinking of the compiler-vs-ASM analogy that took place in the 60's & 70's.  But point well taken: I agree that we need some kind of support here.  Maybe in the very long run STM runtimes get better & STM does win.  Meanwhile life sucks.  If you can only solve the problem with STM's I'll point out below then I'll agree with you, and maybe even try my hand at an STM implementation.

So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement. It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I'm sure people are working on that. STMs can provide correctness enforcement today, and therefore the two are not equivalent. Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance.

Ok, the long sought after counter example: STM's do not compose w/correctness. Bad Java syntax follows, because I don't 'speak' Clojure.  I'm using the classic example.

    class Account {
      long _money;
      add( long m ) { atomic { _money = _money + m; } }
    }
   
    transfer( Account a1, Account a2, long m ) {
      a1.add( m);
      a2.add(-m);
    }

While add alone is STM'd & safe, the two calls to add that need to be wrapped in an atomic are not - so there is a race where the accounts do not balance.  Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

Less trivial examples quickly spiral out of control.  Sure I know Ref's A, B & C are all Ref's and thus only updated in dosync blocks - but can I split any collection of Ref updates into more than one atomic region?  The ad-absurdum answer is to wrap a dosync around all Ref's - which means the whole program, and that means a single-threaded program.  Imagine the program unrolled in front of you, as an endless sequence of actions (all control flow flattened out)... interspersed with other actions are updates to Ref's.  Now: dissect this list into groups with atomic or lock as you see fit, preserving the program meaning as another unrolled thread of execution is interspersed.  How do you decide where it's OK to split the stream and where it's not OK? Why, in the Grand Scheme of things is:

  "... { read _money ... write money } ... " 
a correct splitting of the action stream, while
  "... { read _money ... write _money }  ... { read _money ... write _money } ..." 
is broken, but
  "... {{ read _money ... write _money } ... { read _money ... write _money }} ..." 
is correct again?

A Brief Digression into the Usefulness of Deadlock Avoidance

Phil Jordan

I'm new to STM, I've stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking.  I especially don't know what goes on under the hood in Clojure's STM implementation, or how powerful the JVM is in terms of memory guarantees.

I'm chipping in out of interest and to improve my understanding, apologies if I sound like an idiot:

Cliff Click wrote:

  On May 24, 10:51 am, Rich Hickey

wrote:   

    As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock  acquisition order and the resultant deadlock risk. At this point STM becomes dramatically less complex than manual locking.   

  Not interesting *in practice*.  Because in practice, deadlocks are easy to find & fix.

From personal experience, this is far from the truth in complex systems.  Deadlocks happening only on "in the wild" systems, appearing in the form of heisenbugs, etc.  Not fun at all.  There's too much in the way of implicit contracts going on, which blows up in your face if you're trying to extend undocumented software that was written by someone who left the company before you arrived.  (and we all know the "well then document it" approach just doesn't happen in practice)

Maybe it's just the implementations I've used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful.

You get a crash dump, crawl the stacks, discover the lock cycle & reorganize. Sure the situation could be better, but deadlocks are a 'crash once' kind of bug.

You need a reasonable amount of infrastructure in place for that, plus you're relying on absence of evidence rather than proof that it can't deadlock.

Dev's don't like 'em, but they don't lose sleep over them either.

The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. :)

  Bad Java syntax follows, because I don't 'speak' Clojure.  I'm using the classic example.

    class Account {
      long _money;
      add( long m ) { atomic { _money = _money + m; } }
    }
   
    transfer( Account a1, Account a2, long m ) {
      a1.add( m);
      a2.add(-m);
    }

While add alone is STM'd & safe, the two calls to add that need to be wrapped in an atomic are not - so there is a race where the accounts do not balance.  Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic. The way Clojure handles this (I've been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a ref, and any attempt at modifying it will fail unless you're in a transaction.  Wrapping the add around the transaction would be the anti-pattern here, you want to make the transfer a transaction.

Less trivial examples quickly spiral out of control.  ...

Okay, you kind of lost me with what you're trying to say here.

I get the impression you're mixing up atomic operations on the low level with a high-level STM implementation like Clojure's.  The latter presumably won't be efficient unless the implementation uses the former, but my understanding is that the programmer using an STM implementation is largely supposed to be isolated from such details, as long as he/she is able to determine what operations should be wrapped in a single transaction.

Cliff Click

I'm new to STM, I've stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking.  I especially don't know what goes on under the hood in Clojure's STM implementation, or how powerful the JVM is in terms of memory guarantees.

I'm chipping in out of interest and to improve my understanding,

Let's see if I can do you some justice...

  From personal experience, this is far from the truth in complex systems.  Deadlocks happening only on "in the wild" systems, appearing in the form of heisenbugs, etc.  Not fun at all.  There's too much in the way of implicit contracts going on, which blows up in your face if you're trying to extend undocumented software that was written by someone who left the company before you arrived.

Yup.  So the deadlock happened.  Ouch.

  (and we all know the "well then document it" approach just doesn't happen in practice)

Yup, but You could make a difference.  HotSpot probably has the highest level of 'asserts' per line of code (and a pretty darned high level of comments per line of code) of anything Out There.  Docs are cheaper than debugging.  But it's an aside.... back to deadlocks-in-practice:

  Maybe it's just the implementations I've used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful.   

    You get a crash dump, crawl the stacks, discover the lock cycle & reorganize.   

  You need a reasonable amount of infrastructure in place for that,

Crash dump?  Core file?  'ctrl-backslash' to a JVM to get a thread-dump?
This stuff is commonly available.

If you don't have a proper tool chain, then Job One for you should be to go get one - and I'm very serious here.  Drop everything else until you get a functional 'path' (protocol, implementation, business process, etc) that allows you do debug a crash from a customer in the field.  You might need a willing beta-partner for this, hand him a broken program & let it crash, figure out he needs disk-space to capture the core & logs, needs training to hit 'ctrl-backslash' when suspecting a deadlock, etc - plus FTP the files back to Home Base, plus you need to be able to capture the files per-customer-per-bug (ala Bugzilla), then decrypt the file (gdb for core, eyeballs or a little perl for stack-trace logs), etc, etc, etc, ....  No Rocket Science, but annoying bits of Engineering along with some Customer Management.

  plus you're relying on absence of evidence rather than proof that it can't deadlock.

Err.... No.

I'm not shooting for perfection here; I'm shooting for "good enough to ship".  Which means that if - In Practice - each possible deadlock happens once, then gets fixed - that's almost always Good Enough.  Maybe not to run a nuclear reactor, but definitely good enough to run large businesses.  In practice, most possible deadlocks never happen - but a few that do require several go-'rounds before getting fixed properly.  And then the deadlock is fixed, and remains fixed.

 

    Dev's don't like 'em, but they don't lose sleep over them either.   

  The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. :)

Yup... and those folks are generally stuck anyways.  But if they are thinking about the problem ahead of time they are going to be way way ahead in the long run.

  My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic.  The way Clojure handles this (I've been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a Ref, and any attempt at modifying it will fail unless you're in a transaction.  Wrapping the add around the transaction would be the anti-pattern here, you want to make the transfer a transaction.   

...   

Okay, you kind of lost me with what you're trying to say here.


I Restate My Argument

Cliff Click

Trivial examples are.... trivial.  They can be fixed in a thousand obvious ways.  We need to extrapolate to the non-trivial, because that's the only place where the STM-vs-Locks argument becomes interesting.  So lets pretend that instead of a single Ref _money and two classes, I've got 500 Ref's and a million lines of code - ala HotSpot (Sun's JVM).  Easily >500 shared concurrent variables, about 750KLOC.  About ~100 unique locks (some are classes of striped locks) guarding them in very complex locking patterns.  Now replace all those locks with an STM & atomic.  Is my program any more correct?  Not really.... ...I might have avoided some potential deadlocks (HotSpot uses lock ranking asserts to avoid deadlock; deadlock rarely happens at the engineers desk and maybe once/year in the field across all HS users).  The set of deadlocks-in-the-field avoided was miniscule.  I'll concede that HotSpot is a rarely-well-engineered piece of concurrent code, and that deadlocks-in-the-field appear to happen more often to other large programs.  But still, fixing after the fact is reasonable when the deadlock rate is so low and each fix 'sticks'.

Instead of deadlock, HS crashes far far more often because the locks don't cover the right set of changes.  Switching out the locks for an STM didn't change what I was guarding; it only removed any fine-grained-lock errors (admittedly useful... but only incrementally more so than avoiding deadlocks).  I'm still stuck with a program that's too Big to see where the proper atomic/STM/locking boundaries need to be.  In a trivial example I can say "go up one call level and atomic there", but in the Real Program - I can't do that.  Go up how many layers and add atomic?  1 layer?  10 layers?  100 layers?  Yes, I see unique call-stacks with >100 layers.  I can't put atomic around main because that makes my program single-threaded.

Here's where I want some kind of compiler support.  Ref helps - because it at least demands I have an atomic around the Ref.  But Ref is insufficient, because a syntactally correct program simply wraps an atomic around each Ref - exact what my trivial example did.  I'd like to be able to specify classes & groupings of Refs that have to be wrapped in atomic - that the compiler will complain about.  Yes I'm still responsible for the semantics - I have to tell the compiler which groupings of Refs are interesting - but I'd like some kind of automatic support, so that as my program goes from 10 lines to 10MLOC I can be told "you touched both Ref Person._checking._money and Ref Person._savings._money without wrapping both Ref accesses in a single atomic, thats a datarace error".

Rich Hickey

On May 25, 11:04 am, cliffc

wrote:

Not interesting *in practice*.  Because in practice, deadlocks are easy to find & fix.

I think you'll find plenty of dissent on that point.

Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

That's a problem with the Account class and OOP - classes are simply the wrong granularity for so many program semantics.  Yet, programmers demonstrate they can handle this, and set appropriate transaction boundaries, because they do so in their database work.  There is a pretty close analogy there, since many DBMSs will treat every statement in a batch as an independent transaction unless collected in an explicit transaction.  In spite of the same risks, work precisely like this is getting done correctly, and relatively easily, I think in no small part due to the simplicity of wrapping a bunch of arbitrary statements in BEGIN TRANSACTION.

Less trivial examples quickly spiral out of control....

What you are asking here is a philosophical question, which computers may never answer, since the answer depends on the semantics of the program, and such semantics will probably never be conveyed to the system more easily than:

  (defn transfer [a1 a2 m]
    (dosync
      (add a1 m)
      (sub a2 m)))
or the Java-esque:
  transfer( Account a1, Account a2, long m ) {
     atomic{
       a1.add( m);
       a2.add(-m);
     }
   }

Absent some similar declaration of relatedness, the program will have to understand itself (or one program another), since while it may look like debit/credit to us, it just looks like foo/bar to it.

I hope you don't wait for an answer to this (hard AI, potentially impossible) problem before dipping your toes in STM - it's likely you could make a significant contribution.

Rich Hickey

Towards that end, and more generally, it would be nice if there were metrics for success other than anecdotes from the field.  There should be some meaningful problem statement STM and other solutions could take aim at, something more real-world than the typical STM paper's correctness test or Erlang's 'pass messages around in a circle' and 'accept connections until you die' benchmarks, and more attainable for a new technology than the million-line systems Dr. Click and Brian Goetz mentioned in their final Java One talk, or Erlang's nine-9s phone switches.  Then everyone could run proposed solutions on their own 2/16/multi-hundred-processor systems and say - nope, not quite, this works, this doesn't.

Unfortunately, describing such a problem in an architecture-neutral way is difficult.  I think there can be a certain presumption that any concurrency solution should allow people to architect systems much as they do now, as a graph of directly-connected mutable stateful objects.  Dropping STM (and many of the other concurrency mechanisms) into such an architecture is likely to forever disappoint.  IMO, if you're not serious about immutability, you're likely to struggle with concurrency.

On my side, points were taken re: transparency and control for STMs.  The Clojure STM's current architecture is certainly capable of both better reporting and control.  So I'll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become 'knobs' for performance tuning - timeout windows, retry counts, history cache sizes etc.  So, I have some work to do before I'd like that kind of scrutiny :)

Cliff Click

On my side, points were taken re: transparency and control for STMs.  The Clojure STM's current architecture is certainly capable of both better reporting and control.  So I'll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become 'knobs' for performance tuning - timeout windows, retry counts, history cache sizes etc.  So, I have some work to do before I'd like that kind of scrutiny :)

I'm going to be a sourpuss again here.   :-(
This is exactly the trap MPI fell into; and you have to do it anyways.   Double-unsmiley.   :-(   :-( 

Here's the deal:

     

  • I write a Large Complex Program, one that Really Needs STM to get it right.   
  • But performance sucks.   
  • So I do a deep dive into the STM runtime, and discover it has warts.   
  • So I hack my code to work around the warts in the STM.   
  • Crap like: at an average of 5 cpus in this atomic block the STM 'works', but at an average of 7 cpus in the same atomic block I get a continous fail/retry rate that's so bad I might as well not bother.  So I guard the STM area with a "5-at-a-time" lock and a queue of threads waiting to enter.  Bleah (been there; done that - for a DB not an STM but same-in-priniciple situation).  A thousand thousand variations of the same crap happens, each requiring a different hack to my code to make it performant.   
  • Meanwhile the STM author (You: Rich) hacks out some warts & hands me a beta version.   
  • I hack my code to match the STM's new behavior, and discover some new warts.

Back & Forth we go - and suddenly: my app's "performance correctness" is intimately tied to the exact STM implementation.  Any change to the STM's behavior kills my performance - and you, Rich, have learned a lot about the building of a robust good STM.  You (now) know the mistakes you made and know it's time to restart the STM implementation from scratch.

My options limited:

     

  1. Scream at you to Not Change A Thing.  Thus C/C++/Clojure Standards Committees are born.  1/2 smiley :-P   
  2. Cling tenaciously to the abandoned version, and recognize my code is now abandon'd ware.  No new features or support for me.  But maybe the App is running fine and I can forget about it.   
  3. Rewrite my app from scratch Yet Again, to match the new STM's warts/features.

MPI is in this position.  Every large parallel scientific App Out There is intimately tied to the marriage of parallelizing_compiler + MPI_runtime + computer_archicture.  Changing any one of those pieces requires the app to be rewritten from scratch in order to achieve a fraction of the required performance.

The parallelizing compilers have stablized in a performance way.  There's a coding style which will auto-vectorize/auto-parallelize/auto-stripe/etc and there's some codes "on the edge" (some compiler+hardware combos yes, some no), and there's a well known "don't go here, the compiler can't hack it".  The compiler support for STM is very much lacking and/or in-flux.  Your heading into this terrain now, as you add Clojure compiler support to your STM.  Me, as an STM user, can't know what's in store for me as you go through the learning process.  I know that I don't know what STM coding styles will be performant and what will not.

The MPI_runtime has now stablized (I believe, not sure) after 20+ years.  Again there's the "this is good, this is marginal, this is bad" folk wisdom that drives coding.  Again, for STM's, this area is very much in flux.  Go read some of the bazillion academic papers Out There.  Everybody's got their own take, every STM is good in this domain & bad in some other domain - and all the domains are different; god forbid I write a large program dependent on STM 'A' and later try to switch over to STM 'B'.

The computer_arch has been stable for message-passing for some time.  For STM, I believe it's trying to ramp-up.  i.e., the computer_arch for STM is "all the world's an X86" right now, and all hardware vendors are furiously studying what it would take to add some kind of STM/HTM/hybrid support.

Thus, for STM to make it to the 'Promised Land' - the STM industry needs to figure out:

     

  • what belongs in the compiler   
  • what belongs in the runtime   
  • where there are Dragons and where there are Green Pastures   
  • teach the STM users which paths lead to Dragons or Green Pastures.

If we BOTH don't go through the excercise, then STM will never hit the promised land.

MPI never really "made it"; the 'Green Pasture' area was too small, and it was always surrounded by steep performance cliffs leading to Dragons when you slipped off the edge.  Upgrade your hardware: rewrite your app.  Upgrade your compiler: 2 months of performance debugging. Upgrade your MPI: 2 months of performance debugging to discover you need to rewrite your app.

GC "made it"; the Green Pasture gradually got bigger & bigger; Dragons remain lurking - but only after you filled up an ever-growing Green Pasture and needed to poke at the edges of stability.


We Agree to Press On, Bringing Light into the Darkness

Raoul Duke

it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)

Rich Hickey

it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)

That could be fun.

Cliff Click

 

    it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)   

  That could be fun.

Send me an email; Azul hands out 'academic' accounts all the time.  You get a remote login; you upload your code & run it.

And, if you don't mind, I'm going to edit this entire thread for clarity and echo it on my blog. This is good stuff (this conversation), and I definitely wish you the best of luck.

Cliff

Category: Web/Tech | Permalink | Comments (54) | TrackBack (0)


Post JavaOne Update
May 12, 2008

I'd like to say I attended dozens of JavaOne presentations, but I barely managed to attend two plus do my own three talks, plus interview 3 job candidates (we're going to make all 3 offers), plus attend a Scala working dinner with Martin Odesky, catch up with some old friends, surf the pavilion, meet with 3 or 4 potential customers, visit the infamous Puzzle Pirate's HQ and have dinner in the oldest continuously operating restaurant in SF.

I attended Bill Pugh's "Defective Java Code: Turning WTF Code into a Learning Experience".  As usual for Bill, the talk was very entertaining with plenty of OMyGod code examples from big well-known projects, all found with static analysis and pattern matching using his fabulous FindBugs.  I found a good writeup in this blog, but I can't find a copy of the slides online.

I attended Brian Goetz's "Let's Resync: What's New for Concurrency".  Again, I found a good writeup here but no slides.  The Big News: Doug Lea's Fork-Join is coming along well and will be in Java7.  It promises to be a very easy to use well-scaling divide-and-conquer style parallel framework.  I'd love to see how it works on one of our big 768-way boxes

I gave 3 talks:

  1. Debugging Data Races.  I talk too fast, or at least I did during the memory-ordering portion of the talk.  Most folks liked the 2nd half (some fairly concrete examples and suggestions), but the 1st half needs some timing/pacing.
  2. Towards a Coding Style for Scalable Nonblocking Data Structures.  Sample blog entry here.  Biggest complaint from the talk: I went *way* to fast.  Sorry - it's a talk I've given a few too many times and I'm getting lazy about presentation.  The code from SF high-scale-lib got a bump: 50 downloads this last week.  I'm always psyc'd when I hear feedback from this talk; it looks to me like a 1-hit-wonder (well 2.5 hits) but I'm always left with the notion that with better tools (ala what the hardware folks have built) that this approach should have some legs.
  3. JVM Challenges and Directions in the Multi-Core Era.  I split the last talk with Brian Goetz.  Two-speaker talks are always hard to pull off, but I think we managed a great job.  Sample blog writeup here.  Funniest question (to me): "How's .Net doing here?" My answer: "I am not my brother's keeper!".  Seriously, I've had dinner talks with some of the .Net runtime group, and they are shockingly out of touch here.  I forsee a large multi-core train thundering down Microsoft's self-induced tunnel vision.

Azul News: We're hiring again!  We managed to survive the downturn in the banking industry.  We spent the last year widening our customer base, and it's finally paying off.  Plus the banks are buying again!  I interviewed 3 candidates who were in town for JavaOne.  I think we're making offers to all three.  Mentally exhausting, moreso for them than me I daresay. 

I had dinner with the bay area Scala working group (sorry couldn't find any official link), and Martin Odersky at Buca di Beppo's.   It was a fun geeky talk/dinner.  I'm please to report that (unlike say JRuby) Scala "maps" well onto a JVM.  Code performance is probably very transparent (very sad, but no personal experience here) and very good.  Martin claims he made no real compromises in fitting the language to the JVM (but perhaps some minor warts with 'null' or a lack of unsigned types).  A harder issue is getting the Actor model to map well.  I personally am a fan of "no shared variables" and I hope he gets a good fit figured out.  Like most functional languages, Scala would like to see tail-recursion promised from the JVM vendors/spec (and rumors from John Rose that this might happen to HotSpot).  With tail-recursion, the cost-model of tail-calls is that they cost no more than a 'goto' (which after JIT'ing can be quite low indeed).  The moral equivalent for data-structures is linear logic.  The key here is allowing the JVM and/or JIT to figure out when a message cannot be referenced by the sender again and hence the "message passing" isn't really a copy-cost - it's no more than a pointer change.  I.e, similar to tail-recursion, the cost-model becomes "pointer copy" - such that you can look at the code and know "this operation is (almost) free".  This is one of the major strong points of the C language - knowing a code's "cost" at a glance.

I also got the guided tour of Puzzle Pirate's headquarters.  All I can say is: Wow.  I'd love to work in that space.  From there I took a hike to have dinner at the Tadich Grill.  This is the most non-San Francisco restaurant I've ever eaten at in SF.  The both the staff and clientele were as about as white as you can get.  The food and service (and prices) were all great - but I definitely felt the lack of SF "color"!  No blue-dyed spiky hair, no leather&stud bracelets, no piercings (other than female ears), and exactly 2 customers not as pasty-white as myself (and I arrived with one of them).  It was so different, that it felt like a refreshing change from the usual SF polyglot... but I was glad to walk out the door and see somebody who didn't look like a mirror image of myself.  Like programming languages, sameness is bland in the short term and a form of death in the long term.  Viva 'la difference!

Cliff

Category: Web/Tech | Permalink | Comments (4) | TrackBack (0)


JavaOne
May 1, 2008

JavaOne is fast upon us.  It's a big year for Azul; we have 4 talks this year.  I'm giving 3 and Gil Tene, our CTO, is giving the other one.  I'm mentioning his talk specifically because it "sold out" and JavaOne kindly gave him a 2nd session in a new larger room at an earlier time - which will now be cavernously empty unless you sign up!  The talk is very interesting (to me) - because it's essentially the GC version of "How NOT To Write A Microbenchmark" - it's full of specific advice on how to benchmark GC for large applications, and what traps to avoid when trying to write your own GC benchmark or testsuite.  Some observations: SpecJBB2000 was originally intended as a Java middle-ware benchmark, yet was specifically designed so that GC never happened during the time portion of the run.  Do you never see a GC cycle during the "timed portion" of your Java workloads?  (SpecJBB2005 came about partially to correct this).  <link to slides coming soon>

Debugging Data Races is a re-run of a BOF I gave last year, but the material is more relevant that ever.  Slides here.  I had a number of comments from people who basically said "you shouldn't write datarace bugs in the first place" implying that (1) writing such bugs is Bad because they are so hard to find & fix, and (2) you wanted to write such bugs in the first place!  My advice from last year still holds, and indeed I've got more evidence than ever that the bugs I'm targeting are the common bugs that bite you, and what I'm proposing is very accessible (no Phd required!).

Towards a Coding Style for Scalable Nonblocking Data Structures is ... well, hopefully the title is fairly self-explanatory.  I've talked this on this blog before; slides here.  The basic idea is a coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a Finite State Machine built from the atomic update and logically replicated per array word.  The end result is a coding style allowing me to build several non-blocking (lock-free) data structures that also scale insanely well.  All the code is available on SourceForge.  At this point I'm sore tempted to add transactions to the non-blocking hash table and essentially create an uber-data-grid.

JVM Challenges and Directions in the Multi-Core Era is where Brian Goetz & I rant about the lack of any decent concurrent programming language.  Slides here.  Java's the obvious best State-of-the-Practice concurrent programming language, and there are plenty of better "academic toy" concurrent programming languages Out There. 
<soapbox>
But none seem to tackle head-on the main problem programmers have with large concurrent programs: lack of name-space control over inter-thread communication, i.e. shared variables.  By default in Java, all global variables are shared with all threads.  Instance fields and code can be made private or public but you can't limit access e.g. to specific threads.  For example: all your DB threads in the DB thread pool can run all the code in your crypto-lib, should you accidentally let them get a crypto instance.  No language or compiler support to throw some kind of early access-error.  Erlang and CSP deny all global variables; Fortress, X10 & Scala provide convenient ways to express some kinds of parallelism without shared variables, but amongst the recent more well-known crop of academic-toy languages I don't see anybody tackling the access-control issue head-on.
</soapbox>

All right I'll stop before I start really ranting, but it's a topic I'm passionate about. 
JavaOne promises to be an exciting time, come and see the show!

Cliff

Category: Web/Tech | Permalink | Comments (8) | TrackBack (0)


Earth Day
April 22, 2008

Today is Earth Day, a celebration of all things Earth-friendly.  As a Foolish Earthling, my family & I have decided to "put our money where our mouth is": we've invested both in solar panels and in a CSA.

Regrid Power installed the panels, which arrived two months early.   What was slated to happen in June is already installed on our roof!  As of this writing we are eagerly awaiting inspection by PG&E before they can be turned on. The installation went fairly smoothly and is basically invisible from street-level.  We suffered one day of roof-noise and three days of a work crew (very nice guys, very respectful of our privacy and property) wondering around our backyard and roof.  The economic theory goes something like this: during the hottest parts of the summer the panels will be generating at their peak power; we'll sell the excess back to PG&E at peak power rates.  During the evenings & winter season the panel won't generate enough, so we'll buy power from PG&E at the off-peak rate.  PG&E won't actually give us any money but the income from the excess power will be used to offset our normal power bill.  This also means we'll be able to run our A/C during peak hours with a much reduced bill.

Community Supported Agriculture is basically a collection of people directly paying the farmers in advance for the food they are going to grow.  We've signed up with Live Earth Farm.  For $30/week, we get about 15-20lbs of fruit & vegetables every week for 8 months.  It means the farmer gets a steady income and we get veggies often within 24 hours of being pulled from the dirt.  We've been loving the change in diet, we're eating a lot of new (to us) veggies we would never have tried otherwise.  Another good sign: our twin apple trees have had plenty of blossoms this spring, and lots of bees wondering around them.  No sign of the yellow jackets that plagued us last year.  The kids & I are anticipating apples galore this fall!

And now for the technical bits: Azul gear is very power efficient for server hardware.  We strive for high throughput, which means high core count (so tons of the very slow memory ops in flight at once - a fast out-of-order X86 might get 3 or 4 cache misses in flight at once while we routinely get ~150 memory ops per socket in flight at once).  High core count means small simple CPUs; these simple CPUs have short simple pipes and are hard to clock at high frequencies - and less productive to clock faster (you'll just hit the next cache miss quicker).  So our clock rate is modest - and this has a huge impact on power.  We aim for high throughput and as a bonus we get low power: about 65W per die when going full blast (we're also using DDR3 instead of FBDIMMs, which is another substantial power savings).  Here's another way to do the math: 65W/socket and 48 core/socket means ~1.3watts/core.  Not embedded-systems terrain but definitely waaaay low power for high-end servers!   Azul also buys the CO2 emissions offsets for every box we sell, both the manufacturing emissions and the lifetime operating costs.

My kids are a daily reminder to me that
There's Another Generation Following Us.
Happy Earth Day to You,
Cliff

Category: Web/Tech | Permalink | Comments (0) | TrackBack (0)


I Went to Boston And All You Got Was This Lousy Trip Report
April 14, 2008

 
CGO was in Boston this year.  I wanted to attend the workshops, so I needed to arrive by 1pm Sunday - there's no way to do that from the West Coast without an extra overnight stay or taking a red-eye.  I was fighting off a cold, so I took a generic night-time-no-cough-do-all pill just before takeoff.  JetBlue has the most legroom in coach and I got the exit aisle to myself: within 10 minutes I was laid out flat and snoozing.  This had to be the comfiest 5 hour flight I've ever taken. The winds blew in our favor, and we arrived a full 1/2 hour early: 4:30am local time in Boston.  I yawned my way down to the ground-transport level.

I lived in Boston briefly maybe 12 years ago; I remembered that it's a nightmare to drive around in but the local subway ("the T") is fast & easy.  I bought a week-pass on the T.  There's a bus from the airport to the nearest T stop; I hopped on the bus figuring I'd have some Real Boston Experiences while trying to re-figure out the T. First snag: the bus driver informs us that the T doesn't start running till 6am on Sunday!  Oops - looks like I got a hour wait in the airport.  While hanging out, Manish Vachharajani (figured out how to compress trace files using BDD's, both massive compression and you can do many trace-file analyses directly on the compressed format) jogs up: he too was stymied trying to take the T.  We end up splitting a cab. 

The Omni Parker House claims to be the oldest continuously operating hotel in America.  Certainly the building is very old, but with that old-hotel charm (gilded ceilings, massive old chandliers, dark wood paneling everywhere, helpful-not-intrusive staff).  It's smack downtown Boston; right between the old North Church cemetery (buried: John Winthrop, first governor, and other loyalists) and Granary Burying Ground (buried: Sam Adams, John Hancock, Paul Revere & revolutionaries).

I paid for maybe one dinner, AMD paid for another, Google paid for another.  I eventually got to use my T pass to get to and from the Google reception (during rush hour: so the subway was a crush but we beat the bus) and also to the airport, so the pass wasn't a waste.

As usual for these trip reports, I skip some sessions and brutally summarize what I do attend.  Apologies in advance if I gave your paper short-shrift, feel free to correct me here.  I actually attended 2 workshops and the CGO conference proper:
  • EPHAM homepage -Workshop on Exploiting Parallelism with Transactional Memory and other Hardware Assisted Methods
  • STMCS homepage - Third Workshop on Software Tools for MultiCore Systems
  • CGO homepage - 2008 International Symposium on Code Generation and Optimization
     

Here's the talks I attended, sorted roughly by how important I feel they are (and not in any time order):

I attended at least 8 more talks, who's notes are not worth repeating here.  I got to see snow, slush and drizzle; hiked around in 30 degree weather with cars splashing salt/sand/mud over the cracking stone curbs, dodged rusty metal poles, wandered through mildew-covered crumbling buildings and graffiti covered subways and generally remembered all the reasons why I left Boston so long ago.  It's a fun place to visit, but I'll never live there again.

Cliff



Long talk with Maurice Herlihy about his current work

He's building STM solutions at a higher level of abstraction, e.g. adding transaction support to a hashtable; the underlying structure (hash table in this case) can be anything with reversable operations. He basically builds an 'abstract lock' lazily as the transaction progresses, by taking locks covering all the 'keys' used in the hash table. A collision on the abstract lock means a (statistical likely) collision on the actual transactional data, and he can roll-back patially executed XTN's by using the reversable operation.

The actual covering locks are just a pile of striped locks in some large hash table, so contention can be made arbitrarily low.  Also, the underlying datastructure, e.g. a hashtable, is treated as a black-box. We could, for instance, add transactions to my NonBlockingHashMap fairly easily with this idea.

Paper:
ftp://ftp.cs.brown.edu/pub/techreports/07/cs07-08.pdf
http://www.cs.brown.edu/people/ejk/papers/boosting-ppopp08.pdf



Private conversation w/Manish

How to speed up worklist/task-queues.

His approach (low latency focus): keep producer & consumer seperated from each other by more than a cache-latency. i.e., producer stuffs into a ring buffer, consumer pulls from ring buffer. Ring buffer needs to be sized such that producer & consumer are not ping-ponging lines during worklist push/pop ops. Clearly the cache lines ping-pong - but you want them to ping-pong once per cycle around the whole ring buffer - not once per each time you insert or remove. Arrange to slow down consumer if it "catches up" to producer, because if the consumer gets to close, he slows down (the already slow) producer by making the producer take more cache-misses while inserting. Same in reverse for slow consumers (but if you have slow consumers then work accumulates ad-infinitem and really you need to throttle the producers).

My approach (scalable focus): stripe queues ab-absurdem until all queues are of length 1, hence really just an array word holding either null or task. Push/pop requires scanning the array for empty/full slots. Real trick for me: proper sizing of the array. Huge scanning is expensive, but some scanning is OK.

Combo: big array, with parallel insert/remove threads. Want an auto-resize heuristic, and an auto-"back-off" heuristic. Auto-resize: need a bigger array if getting too full (producers can't find empty slot).  In the make-bigger logic, can throttle producers (sleep until some reasonable # of empty slots appear). Need to make smaller array if consumers can't find producer's work: if typically end up scanning a larger fraction of array before finding work (it's OK to scan thru a large empty array IF we're following just behind a producer, so really scanning very little). Auto-"back off": if see CAS failures for insert/remove then we've run up against another threads' (could be consumer or producer) cache. ie., we're making the other guy slow (and not being fast ourselves) - so "back off". Either pick a different spot in the array to work, or sleep or grow the array.



Keynote: Vivek Sarker (now at Rice)

Need more emphasis one auto-parallization.
Some recent successes have improved common use cases a bunch.
Lots of cores now.  Lots of libraries for parallel languages; java concurrency, Fortress, X10, intel tbb, Cilk, MPI, .NET parallel extensions

So... compilers are "under seige" - the "old model" of: parse-to-IR, then optimize, then reg alloc, then code emit... isn't working. No room in the IR for all the new parallel paradygms.

Vivek suspects a coming paradygm shift (me too as well).

Shows what breaks if we include parallism in the low-level IR: e.g. parallel CFG; the usual "dominator" relationship doesn't hold; no dominator tree. But this analogy is flawed (in my mind) because the DATAflow still happens "as-if" the parallel blocks executed sequentially, only the control-flow is parallel.

But probably want some form of DATA-parallel use; admit that the usage is racey in the IR, and compiler is allowed to take any value that could reach - or even emit code that makes that choice at runtime.

Goes on to (attempt to) derive a parallel IR onstage in a keynote.

Long discussion followed; he promises to look at using the open-source HotSpot along with Jikes. More-or-less invited me to come visit Rice sometime and check out the grad students.



Keynote, ATI, Norm Rubin


Pixar: 100,000 min of compute per min of image

Blinn's law: time-per-frame is constant. It's 1000 machines (all the movie company can manage) and 100 min/frame (all the schedule allows). Every new movie, toss out all old machines & buy new: they are faster.  Faster machines lets the graphics become more realistic.

8 PCs; 8 groups of 64 lock-step SIMD threads = 512 threads.
Each individual instruction is a 5-way VLIW.
(VLIW packing rate varies from 3.4 ops/VLIW to 4.2 ops/VLIW).
Pipeline is fairly long, but it's 64x5 = 320 fp-ops / cycle.
Loads block a thread till they resolve.

Each SIMD pipe has 256 registers; each register is 128 bits.  If each thread needs 5 registers, then 256/5 = 51 wavefronts can get into run-queue. 51 wave-fronts = 3264 threads per simd or 13056 running or waiting threads.  Roughly 262,144 registers or 1meg of register space.

GPU programming model: producer/consumer relation. Hundreds of pending GPU instructions.  GPU runs a loop 30-60 frames/sec; build vertex buffer, set uniform inputs, draw.

User writes: vertex shader, pixel shader. Nothing about parallelism, or number of cores, or sync. That's all added by the GPU JIT.  No user program has a data-race; the language doesn't allow it.

Open interesting problems in register allocation.



Sawzall
Robert Greismer

Google server farm, map-reduce.
read/write using 'protocal buffers'
GFS
WorkQueue - scheduling system for cluster of processors. Tries to co-locate tasks near data (no network i/o, just disk i/o).
MapReduce - abstracts the distributed system

Filter code is auto-parallelized, and fed into aggegrators.  Not sure what's new here compared to StreamIt on a grand scale.
Sawzall - language of MapReduce. Domain specific. Like graphics, very parallel- but no concurrent programming.

Aggregators can be clever: cannot keep all the data coming in; must reduce the data. Compute sums, averages, std devs, top N.
Domain-specific language to write aggregators; strongly typed and concise. Mostly simple, regular syntax, close to C. Specific basic
types (strings, time, fingerprints, table types); value semantics, even for structs (no pointers). Lots of string regex support.
Explicit "undefined values"; want to abort sooner rather than later; willing to test for explicit "undefined value"; tests are not
auto-inserted everywhere, but must have explicit test. At runtime, can switch a flag to make undefined values silently ignored (better
to ignore e.g. network corruption rather than crash).

Quantifiers in the language: "when (i: some int; P(a[i])) {...}" will iterate over array 'a' for all indices 'i' and run some filter 'P'
first. 3 forms: each, some, all. Nest nicely. Seems similar to Fortress. Generates bytecodes, which are then JIT'd. Ref-counting GC
(value semantics, no cycles possible); uniform object model (all things are ptr's - even int's).



Cliff Notes: Hey! We *could* do some kind of decent sample-based profiling in the HotSpot Tiered World. Periodically have a sample-thread wake up and hack the inline-caches of the top hot methods to call their C1 equivalents. Have the C1 code patch back to the C2 code, but then run on collecting info (for 1 method invocation).  IBM has already shown that if the frequency of profiling is >0.1% then you get good-enough profiling info; we'd only need to hack the inline-caches rarely to get decent continuous C1 profiling info.



Hardware Acceleration for Lock-free Data Structures and Software Transactional Memory - AMD

AMD - Advanced Sync Facilitiy - not in silicon yet.
ASF limited to 8 cache lines.  Issue a set of 'locked' loads. 'acquire' starts critical section. Checks for consistency. Stores rIP & rSP for rollback. No other registers.  Then allow mods to the locked locations.  Finally 'commit' (or abort).  Use ASF for the 1st 8 lines of a TinySTM transaction, then switch to software afterwards.

Cliff Notes: More or less, this is AMD announcing HTM support in their lab: free (good) simulation available, but no silicon.  The hardwired '8' is a lower bound: you are guaranteed you can atomically update at least 8 lines.  Several (most?) lock-free algorithms require atomic update of 2-4 unrelated addresses.  This is an architectural guarantee that such algorithms would work on all future AMD processors with this support.  No actual hardware has been announced.



Energy Concerns using HTM
Maurice Herlihy

No space in a embedded to do a full STM or even a hybrid TM; so expect multiple cores in an embedded system with a simple HTM.

Basically, abort & retry costs power.

MP-ARM - has a 512b transaction cache (fully associative), plus an 8K L1. 128B scratchpad. On an abort, the CPU rolls-back the registers from the scratchpad & enters lower-power mode. It wakes up after a random interval & retries.

T-Cache costs them 20% more power - but make it up with shorter running time. Generally a win for apps with contention (less total power than locking & serializing).

Cliff Notes: Basically it's ARM's version of HTM



Synchronization Aware Conflict Resolution for Runtime Monitoring Using Transactional Memory

Use TM to detect dataraces. AUTO-insert TM begin/end around regions doing conflict detection. Tried it per basic-block in SpecInt - too expensive. Tried to make a TM per 8-12 basic blocks; then overhead is down to 10% or so. (Cliff Notes: What TM are they using? Since the TMs are small, they assume hardware TM which explains the low 10% overhead).

Turns out auto-inserted TM causes lots of live-lock issues.  He demos several live-lock scenarios with reasonable looking TM hardware & spin-locks.

So they spot spin-lock patterns and do not auto-insert TMs around spin-locks... no they make the hardware spot spin-locks and break the TM in a way that makes progress.



Automatic Array Inlining
Christian Wimmer

Uber object-inlining, for arrays. Great for short fixed-length arrays, ala what comes out of Reflection.

Really it's object co-location, which the GC preserves OR will tell the JITs to recompile.
  1. phase 1 - decide what to co-locate (profile in C1); tell GC
  2. phase 2 - GC co-locates all such things; tells JIT
  3. phase 3 - JIT recompiles assuming co-location.

Also JIT must detect that objects are *co-allocated*, and not modified later (implicit final fields: detect changes with field guards).

This technique is easy for known fixed-length arrays: it's just like inlining an object. Also works for unknown-length arrays that are not changing; the "tail" of the inlined-object-set isn't known, but isn't needed.

Can handle changing the inlined array field itself.  Can fold the array length check into the array field guard: set the old array length to 0 (the old array is ONLY pointed to by the original object, this is a precondition to inlining, so no other object/thread can see the length change).  If the range-check vs 0-length fails, 1st go to some slow code which re-checks after loading the proper field.  Eventually the GC resorts/realigns, and then the fast code works again.

Also can handle having remote /shared pointers to the inlined array, even with hacking the array length: have 2 array lengths. The "normal" length used by non-inlined code, and "inlined" length used only by inline-smart code.

Also can do inlining objects into arrays: first lay out the array, then have the GC place the objects following the array in order.  Alas, harder to detect changes & harder to detect setup conditions.  They gave up here - but seems to me you could do it if GC tells you the objects are unshared (only pointed-at by the array), and you catch all array-writes to that array (by tagging the array as read-only after GC does the layout, perhaps placing it in a read-only page).

Get 20% speedup on some SpecJVM98's.
Alas no speedup on DaCapo. :-(
Very surprising to me that no speedup on DaCapo; I would expect the String co-location to be worth alot - aha he's stuck with the old JDK strings which share char arrays, so it doesn't work for him.  I predict a ~10% DaCapo speedup on Azul based on the Strings optimization.



Branch-On-Random

Azul looked at this earlier and declared it a good-idea for our next-gen. Basically, branch-on-random powers-of-2, i.e. either 50% or 25% or 12.5% or ..., etc.  Use it to lower the costs of sampling.  Typical sampling cost uses a counter and every N (e.g. 1024) executions we take a heavy-weight sampling pass. But there is a cost to test&dec the counter, uses an extra register, extra control flow, extra cache, extra instructions.

'brr' is just a conditional branch, but the condition is a frequency. Only need 4 bits to cover all reasonable use cases. Low hardware requirements (a LSFR, total maybe 100 gates including mux's); fits well into existing pipelines. Can be made tolerant of branch mis-predict (i.e., can get the info very early in the pipe).



Multi-core Thread-level Speculative Parallelism

Discovering lots of speculative parallism opportunities in sequential code. Lots more if you are willing to do large code transforms.
Ex: while-ptr-chaining-loop finding the min.
Ex: nest the above loop in a worklist which pulls the min element, then tranforms the list (ala a standard worklist algorithm).

Use value prediction to allow thread-level speculation. Basically, start a parallel loop at a random value-predicted pointer location.
After a few iterations either crash or line up with some linked-list body. Actually, capture some value on the 1st time through the inner
loop; on later iterations thru inner loop have some good choices for prediction. There's some decent auto-load-balancing going on; each iteration of the inner loop records enuf info to allow a good (re)partition for the next iteration of the inner-loop.

Technique does not only do linked-lists; needs any double-nested loop. Inner loop gets parallized, outer loop allows the inner loop to be continously 'profiled' and 'load balanced'. Really can parallize other 'hard-to-parallize' inner loops. Would work well in Java; I probably could do it in C2 with enuf effort.



Analyzing perf diffs in JVMs

Using "calling context tree" - similar to our 8-deep call-stack capture.

Doing 'diffs' on these trees. Diff'ing based both on structure, and on edge weight. Doing things like "adjusting" the weight of top-nodes in a tree so 2 trees match in weight, then looking at diff's farther down the trees. Also looking at weight of normalized subtrees, etc.

Comes up with a fairly 'diff' like interface (cmd-line) to compare 2 forests of trees and highlight the diff's.

Able to track down differences in 2 websphere/trade6 setups.
Discovered differences between timezone settings between AIX & ZLinux, and also DB setups.



PiPa - Pipeline Profiling
(best student paper)

Faster tool for doing e.g. cachegrind. Parallize instrumented application. In SuperPIN - idea: main app just forks off another thread periodically to do the instrumentation work. (how is this different from jamming ops/time into a buffer, and having background
threads pick up new buffers & gzip 'em).

Idea: process the background buffer in another thread; do full analysis on the buffer(s) in parallel - in real time with the app running.

Demo running a cache-simulator, in "real time".  Using double-buffering, large shared buffer.  Pretty obviously equivalent to Azul's tracing mechanism, except she's doing a cache-sim instead of a gzip.  Can run the original program with 3x slowdown, but get a full cache sim.



Prediction & Trace Compression vs Nested Loop Recognition

Looks cool: find simple loops from traces; loops are simple linear algebra. Imperfect loops allowed. From loops, find nested loops.  Get massive compression ratios vs bzip & other trace compression techniques. Very simple to decompress; just "execute" the loops.



Fast Liveness Checking in SSA form

Answer based on demand: is var live here?
Faster precomp, slower queries.  Info depends on CFG & def-use chains
Invariant to adding & removing instrs; sounds useful for reg alloc adding & removing spill ops.

For HotSpot, liveness computation is probably 10% of total compile time.

Compute 'highest point' in dom tree where query point is dominated by def point. Compute all-paths dom-vs-each CFG node

- Compute transitive closure over reduced CFG (simple DAG). Simple post-order traversal
- For each block q, compute highest block (highest backedge target). Simple pre-order & post-order traversal.
- Arrange for highest node to be first set bit in trans closure

Query: goto highest block; see if def dominates highest block.
Looks fast & simple. Might speed up reg-alloc.



Nearly Optimal Instruction Selection on DAGs

Stupidly slick: tile the DAG once, just the way HotSpot does right now - except allow sharing always (right now HotSpot does "CSE" on sharing, forcing shared values in registers - except for small expressions that are common in address-arithmetic).

Then look at each place where a shared value was subsumed into larger instructions (hence replicated in the instructions). For just those places, see what the local cost would be if you did not share. If the cost is better to not-share, flag the node as getting it's own register result. Then re-run the DAG tiling, but living with the shared results from the 1st pass.

Ex: ((a+8)*(b+8)). Should we share the '8' or not?
Pass 1:
t1 = a+8   // 5 bytes
t2 = b+8   // 5 bytes
t3 = t1*t2 // 1 byte

But if '8' goes into a register:
t0 = movlo 8 // 5 bytes
t1 = a+t1    // 1 bytes
t2 = b+t1    // 1 bytes
t3 = t1*t2   // 1 bytes

Reduces code size by 2-4% nearly always, and very close to optimal. What hotspot does is essentially what he calls 'cse-leaf' in paper. His NOTLIS does somewhat better.
Cliff Note: I did a bunch of Azul-specific hackery in our DAG share-vs-clone heuristic to try and get what he gets naturally.

www.cs.cmu.edu/~dkoes - grad'ing soon



Exception Check Motion for Java


  • Run aggressive PRE, ignoring exceptions checks
  • See if can move checks to allow the PRE
  • Undo any PRE motion where the checks can't be moved

Got ~2% on production compiler - which confirms my suspicion that there's not much more to be had there. Mostly moving null-checks around - and HotSpot is already very aggressive about moving & removing null checks.

His examples HS will all get another way; loop-peeling and Split-If transform will wipe out all his gains.



Another unsafe code-motion talk for Java.

He's got better examples: loop-peeling won't wipe out his first example (but versioning will).

Safety checks turned into data-dependence, just like HS

heh, another talk that' almost getting as good as HS has been for the last 10 years.

Then convert data-dependency safety checks into IA64 predication bits and happily predicate speculative loads. Seeing 4% for just the PRE based motion (on SpecJVM98), another 2% for doing hardware predication & speculation bits.



Towards Fair, Scalable Locking
Barcelona Supercomputing & Microsoft

Scalable reader-writer locks (something I've thought about alot in the last year). Want fair (no starvation of either readers or writers). Want fine-grained striped locks (? specific to different data types)

Demo's starvation in a HTM system similar to Azul - eager updates can mutually kill each other.  Demo's cache-directory-reservation can lead to deadlock.

So far seems like a more complex version in hardware of something that's easy enough in software. It's only partially fair (not FIFO). Preserves proportions of readers & writers.

Using STM. Overhead looks very bad.



Runtime Tuning of SM Validation Techniques
Rice University

STM validation is expensive; no one technique wins all cases. Use runtime to dynamically change validation scheme. Periodically (every 5% of TM attempts) run the same TM with 1 of 3 different validation techniques; if the new technique is working better than the old one, then keep it - otherwise immediately revert back to what was working well.



Transactional Support for Dynamic Languages

Looking at Jython - get to use the fast JVM.  To make Python bytecodes need some huge bad wrappers around every little variable access. This makes Jython as slow as regular interpreted python.

What we'd like to do is skip the fancy python lookup and use plain Java variables - except that we want to keep the good python dynamic semantics. Would "lookup x in y" to be as fast as a local variable reference.

So use transactions to test the pre-conditions & use fast specialized code otherwise. i.e., execute inside a transaction & test preconditions.

Use best-effort HTM. Add a few specific speculation primitives to JVM (speculate, abort, commit). Jython generates both specialized-case fast-code and also full semantics. Use HTM to run the fast-case. If that fails, fall back to the full semantics bytecodes.

Cliff Notes: Sun Labs guy noticed that this technique requires abort info from the hardware (was it capacity vs remote thread) - does Rock TM lack the failure-mode data?



User-defined Priority Based Transaction Contention Management

Really, this is poster case for why TM's will not save the world.  He's proposing users can put some kind of priority on transactions - and higher priority transactions can "barge" over lower priority ones. What a mess!!!



Map-Reduce for GPU's

More general term than Google's map/reduce.  Really its stages of independent computation, followed by communication.

GPUs are great for doing 'map' - lots of GPUs, lots of bandwidth. But only local communication is cheap; global communication is very expensive. Also, reduction on GPUs is well known to be very difficult. Lots of PHDs going into deciding how much unrolling is best, or how best to split up computations. Sometimes see 60x slowdowns if the best-reduction for one dataset size is used for a different dataset size.

User writes a MAP function in CUDA (NVidia's C+GPU operaters).  User's map function outputs results in a per-thread local-store. User writes a binary associative reduction operator - takes 2 inputs and returns 1; and his framework will restructure the reduce step for best parallelism. (also needs an identity value, in case some MAP results are filtered out)



Analysis of CUDA Programs

auto-detect race conditions (detects potential dataraces!), detect bank conflicts (extreme poor memory access patterns)

CUDA abtractions: kernel functions, threads, scratchpad ram

Cores are SIMD; execute program in lock-step; program is in scratchpad ram.

2 global bookkeeping arrays, reads & writes to global memory.  two per-thread arrays: reads& writes of a single thread. After each access to a global, check for a race.

Scratchpad is 16-way banked per 16 threads; fast as a register if no bank collisions. Bank conflicts can reduce performance by a factor of 4; slowdown equal to not using scratchpad at all.



Streaming Applications on Multi-Core Systems

Boxes using different hardware for streaming: gpgpu, fpga, webcam, disk, network, etc. Very hard to simulate & debug when lots of different parts are moving.

Auto-Pipe is a set of tools to create, test, build, deploy, optimize distributed streaming apps. Lots of pipeline parallelism, and regular parallelism within a pipe stage.

Auto-Pipe - define a set of blocks and edges or computation & communication. Also describe the hardware layout: cpus, cores, fpus, etc.  Then provide a mapping from blocks to hardware chunks.  No need to worry about communication between the blocks.



Phase-Based Adaptive Recompilation in a VM


Using hardware perf counters to detect phase-changes, and trigger recompiles. There's been a lot of work in this area already, I'm not sure what's new here.



Finding Critical Instructions Without Hardware

"critical" - instructions on the hot path.  Create OFF-line synthetic instruction traces based on existing hardware profile info, including branch mis-predicts & aliasing info.  Then analyze those traces, and compile (schedule) to those.

Take a random-walk thru binary, throwing a weighted-die at each branch (starting at a random spot using weighted-die). Emulate a call stack for call/return sanity. Can "get stuck" in various oddball paths, so reseed the trace after N instructions.

More: seed abstract register state with random values, then start tracing. Loads & stores can use random addresses, or load/store addresses: aliases will appear "obvious" because the random addresses will collide.  Also simulate memory using the random addresses. Turns
out most aliasing is captured, but not all (structured linked lists will behave as random instead).

Finally: stuff the random traces thru the hardware simulator & come up with critical paths & schedules. Then use this info to compile better scheduling (he's doing clustered VLIW's). Seeing maybe 10% performance gain on his funny clustered VLIW's.



Program Optimization Space Pruning for a Multithreaded GPU

NVIDIA CUDA talk. Need loads of threads (12000 hardware supported). Have 45G/sec bandwidth (Gbyte? GBit?) but still run out; must use local caches. Encouraging loop unrolling, hardware prefetch, traditional optimizations. Scheduling loops for thread-level and memory-level parallism. Have about 8000+ total registers.

Increasing regs-per-thread might increase per-thread performance, but run out of the shared register space and thus be able to launch fewer threads in parallel. In short, need to search large space of configs; often hand-tuned code is about 15% below max trapped in some local maxima.  A non-intuitive mix of other optimizations can move you to a better minimum.



Cliff

Category: Web/Tech | Permalink | Comments (1) | TrackBack (0)