Cliff Click Jr.’s Blog's Blog
Category: Web/Tech
A Plea For Programs
September 19, 2008
[Update 9/21/2008: I've got a simple sample program for people to port, plus I've got at least some code for - Clojure, JavaScript/Rhino, JPC & JRuby; missing Scala at least - thanks].
I would like some non-Java Java-bytecode programs to do performance testing, for a talk I'm giving this coming Friday (my bad for starting this late) and I'm hoping my gentle readers can supply some. I'd like programs in different languages, but ones that are easy to setup and run. I'm going to do internal JVM profiling, so I'm not all that concerned with the output or "Foo-per-second" results. Ideally, my programs would be:
- Non-Java. Clojure, Scala, JPython, JRuby all come to mind. The more variation, the merrier!
- Easy setup. I'm not an expert in any of these, so the resulting program has to be easy to setup and run. Perferably a simple "java -cp Weirdo.jar FunnyProgram" command line.
- Plain JVM. Note that the 'java' command has to be there; I intend to use Azul Systems' JVM for profiling and we have our own. Any kind of odd-ball jar or class files should be fine.
- Long enough. The program has to run for several minutes at least, without "babysitting". Long enough for the JIT to settle down (if it's going to), and long enough for decent profiling.
- Little I/O. Besides DBs being a pain to setup, I'm really looking for CPU-bound programs. Plain file I/O is fine, if the files are provided and can be scripted easily (e.g. "java -cp Weirdo.jar FunnyProgram < BigInput.dat > /dev/null").
- Be multi-threaded. Not a requirement, but a definite nice-to-have. Several of these languages support alternative threading & coherency models and I'd like to test these features.
- Be Open Source, so I can post the collection for others to compare against. This is NOT a hard requirement; I'm all fine with keeping private anything you request be kept private. Performance profiling data *will* be released, as that is what the talk is about! (I'm also fine with signing NDA's but that's probably not going to be an issue with this crowd).
- An example: A multi-threaded Mandelbrot program would be fine, computing a 1000x1000 grid of points centered around (1.0,1.0) with a spread of (1.0,1.0) - so fill in the grid (0.5,0.5) to (1.5,1.5), using your choice of thread controls.
- Please include any names, so I can give credit where credit is due.
I hope to discover things like:
- How close does "plain code" match the JVM/JIT's expectations? How well does the JIT turn "plain code" into machine instructions? I hope to present the JIT'd code for sample language constructs and detailed profiling data.
- How well does the function-call logic match the JVM/JIT's expectations? Can trivial functions be inlined? What's the cost of a not-inlined function-call?
- Other interesting costs? (e.g., endless new-Class churning, endless new-bytecode churning causing endless JIT'ing; endless new weak-ref or finalizer creation causing GC grief, etc)
- How well does the alternative threading & coherency scale? Can Mandelbrot run on a thousand CPUs? (I expect: trivially yes). How about programs with more interesting coherency requirements?
I put a sample Java program here, if you'd like to port something really simple. The inner loop of this program looks like: "for( i=0; i<1000000; i++ ) { sum += ((int)(sum^i)/i); }". The JIT'd assembly code from HotSpot's server compiler looks like this, unrolled a few times:
| 2.83% |
243 |
0x12d93878 |
add4 r5, r4, 1 |
// tmp=i+1; unrolled 8 times, this is #1
|
| 0.06% |
5 |
0x12d9387c |
xor r3, r5, r1 |
// sum in r1, tmp in r5
|
| 0.06% |
5 |
0x12d93880 |
beq r5, 0, 0x012d93b40 |
// zero check before divide |
| 0.35% |
30 |
0x12d93884 |
div4 r0, r3, r5 |
// divide, notice cycles on next op |
| 2.64% |
227 |
0x12d93888 |
add4 r1, r0, r1
|
// sum += (sum ^ tmp)/tmp |
|
As expected, there's a pretty direct mapping from the source code to the machine code. I'd like to see how other JVM-based languages stack up here. Email me directly with small programs, or post links here.
Thanks! Cliff
Category: Web/Tech | Permalink
| Comments (27)
| TrackBack (0)
Chrome is at least Slightly Evil
September 10, 2008
Chrome did a number of Evil Things to me recently, including killing my FirePass VPN and AgeOfConan - even after a full reboot and never launching Chrome again.... but I get ahead of myself. Here's what happened:
I saw the Chrome announcement on SlashDot, amongst other places, read the comic and pretty much agreed with what the developers were saying. I installed Chrome to give it a test drive. Things went pretty well at first: the new interface was clearly going to take getting used to but also had a lot of strong points going for it. Then I hit one of my not-so-techy websites. Gack! Screaming Flashing Flash ADS IN YOUR FACE, Enlarge Your X Life, Your Small Ego Needs A Bigger SUV, etc... Quick, turn on AdBlock! But no, not available yet.
Without AdBlock, clearly Chrome isn't suitable for general surfing yet. So I went hunting for Chrome-friendly AdBlock replacements; I only found a gizmo called Privoxy that provides a filtering proxy service. It was waaay less easy to use than AdBlock and didn't provide that "you never even knew the ad was missing" experience that AdBlock achieves (i.e., I was left with lots of big holes saying "Prixovy chopped this ad out").
So I decided to give Chrome a rest and wait for post-Beta (and presumably give the AdBlock guys some time to work their magic). Meanwhile, Microsoft installed SP3 on my AMD/HP XP machine (Red Herring #1). FireFox upgrades to 3.0. Red Herring #3: I fiddled with my router's port-forwarding rules in an effort to get 5 people behind the same router to play Diablo 2 on Battle.net (me, my 4 kids, plus also my brother Uncle Eric in Atlanta and our lifelong friend in Texas so no LAN party suggestions please) - BTW D2 works fine for 4 players behind our router and the 5th can play in an unrelated D2 game, but when the 5th player attempts to join the same game Blizzard kicks out the last prior player to join - suggestions welcome on how to fix this!).
A day later and many reboots of both Man and Machine (and Router) I've given up on 5-player D2, and decide to play a little Age of Conan. But nope, some weirdo message pops up about not connecting to the patch server. I figure the AoC servers are down (no mention on the forums though) and will try again later. So I try to log into work via FirePass & VPN - it's also a no-go, with some weirdo message about having IE4.0 or later installed (gack). I file an IT trouble ticket and go to bed.
Next morning I spend 1/2 the morning failing to get FirePass to work. I spend hours surfing the web for people with similar problems; there's a fair number of not-quite similar situations. IT has a bunch of bland unhelpful suggestions, but FirePass is working for some folks with FF 3.0 and not others. In retrospect it HAD been working for me, for about a day (the FF upgrade came in before the Chrome try). The Web turns me on to the Microsoft XP SP3 + AMD debacle; I chase that one down all the next day - but that bug causes your machine to boot-to-blue-screen and thankfully I don't have that problem.
Web surfing with FF is fine (and ad-free!); telnet is fine; email is fine; nslookup & tracert is fine. I get a putty/SSH channel going so once again I can (text-mode only) log in to work. Good thing I'm an old-school Emacs-ian; text-mode Emacs is not quite a windowing O/S in-and-of itself; I manage to get some work done that day. Still AoC is down and in my evening play hours I decide to try and figure out what's going on. I'm not alone; plenty of people are complaining about this no-connect-to-patch-server problem, but the AoC tech folks keep claiming the patch servers are up and the fault is at the client end. And somebody mentions a "patcher.log" file dumped out when AoC startup does it's patch attempt.
This "patch.log" file provides the little bit of evidence I need to break the case. Buried in there are successfully connect attempts from days of yore, plus my more recent failures. The recent failures all mention - *TaDa!* - a proxy attempt using port 8118. ?Huh? Where did this funny port number come from? I remember it from somewhere... aha - Privoxy's default port. Now I un-installed privoxy days ago; who can be remembering it's port number (and feeding it to the AoC Patcher?)... only Chrome.
I have 4 browsers on my machine (IE, FF, Safari, Chrome) but only Chrome had a port 8118 typed into it and nary another mention of that port number has slipped my fingers anywhere. Somehow Chrome is feeding AoC port 8118 - and nothing's there on that port so AoC times out! I go check out my default browser settings. They say "Use my current Web browser". IE is disabled. Chrome, FireFox & Safari are enabled. Just to be sure, I click on FF as the default browser.... and Microsoft pops up and says that Chrome has to be de-installed to be made no longer available! Safari can politely be made not-the-default, but not Chrome?
So I de-install Chrome. During the de-install, Chrome asks for feedback using a web-browser... using the disabled IE instead of my (I thought!) default of FireFox. I exit out of IE as fast as possible. Now with Chrome removed, AoC starts fine. So I check out FirePass - and Yup, it's fine as well.
Evil #1 - Chrome became the default browser without permission. If they asked to become the default, they hid it well. I never allow new browsers to become the default without a few days prior surfing. Evil #2 - Chrome silently prevented me from running FirePass VPN, hence telecommuting. I assume they did it by handing FirePass the same crapolla proxy port somehow. This happens even after a full power-cycle/reboot and not launching Chrome ever again. Evil #3 - Chrome silently prevented AoC from launching, again by passing bogus proxy port numbers out. Evil #4 - Chrome cannot be disabled by unchecking "Enable access to the program", although Safari & IE can. Evil #5 - The Uninstall launches IE, even though it's flagged as "disabled access" and it wasn't the prior default browser. Typically I launch IE less than once/year and always because some other program ignores the settings and launches it anyways. Evil #6 - No AdBlock. Not really a big Evil, just a big Annoyance.
Postlude - I attempt to file a bug with Chrome using FireFox. But the Google site only allows non-crasher bugs to be filed via Chrome itself - and Chrome never crashed for me, it just broke other programs left and right. No way I'm going to install it again so instead I attempt to file a "Chrome crashed" bug via the website. Options are either "you got blue-screened" or "chrome died but you survived" and neither really fits the bill. I fill in a somewhat abbreviated version of the above blog, then hit submit. No go! Google insists I download Yet Another Program and run it and report the results before I can file. No Way, Google! I just got big-time burned from the last program I downloaded from Google.
Evil #7 - I can't even file a bug report with Google (without downloading & running at least some code from Google!). I know Googly's read my blog. Consider this your Bug Report.
Cliff
Category: Web/Tech | Permalink
| Comments (8)
| TrackBack (0)
Death by Decompilation
September 4, 2008
I recently got involved in trying to improve performance of a large customer application. Azul's JVMs come with an always-on profiling tool - RTPM - so it was easy to spot where the time was going. The name has been changed to protect the guilty but the name might as well have been "addToSet". A quick check of the JIT'd assembly showed that all time was spent in spinning down a long array (an ArrayList actually), checking for duplicates. In the end, no dup gets found and the new element gets appended. Adding N elements gives us a classic O(N^2) algorithm. The array was already >32000 elements in length at the time we profiled it, so time is proportional to (32K^2) ~ 1billion. Ugly stuff.
Naturally (grumble) this code is in a large 3rd party not-open-source application, where the owning corporation has been recently purchased by an even larger Faceless Corporation. The "proper approach" would be to file a performance bug and wait and wait and ...
We voted to try and work around the problem by doing a little decompilation, hacking the ArrayList into a HashSet, and inserting a new jar file in the bootclasspath. So we pulled this class file out of the pile-o-jar's and hit it with the standard decompiler jad (www.kpdus.com/jad.html). Presto! About 2000 lines of fairly legible Java code. Plus a pile-o-warnings about un-decompilable stuff (Ut-oh!) mostly around nested try/catch/finally clauses and inner classes. I surf the Java code looking for other uses of the offending ArrayList - there are definitely some - and start to see places where jad bailed out. When I turn javac loose on the new source file, I get lots of error messages.
No problem thinks I, I'll just get another decompiler. I immediately google-up cavaj (cavaj-java-decompiler.en.softonic.com). A few moments later I get another Java file - and whoops! It's header mentions jad by name down to the exact same revision. Sure enough this decompiler is a GUI front-end for jad complete with identically broken Java code. Turns out there are a number of such GUI wrappers for jad. So I search some more, discarding mocha (www.brouhaha.com/~eric/software/mocha) for being abandoned, ClassCracker (mayon.actewagl.net.au/index.html) because the demo version didn't produce any Java code at all, and IceBreaker (www.breakertech.com) because their web site was down. I also tried Java Decompiler (java.decompiler.free.fr) but it also produced loads of syntax errors although in different places than jad. DJ (members.fortunecity.com/neshkov/dj.html) also choked on my class file but just barely. Very long forward jumps spanning nested try/finally blocks would confuse it - but the inner classes came out looking nice. A colleague also tried JDec (sourceforge.net/projects/jdec) with limited success. I also stumbled across this nice list of decompilers from google.
In the end, I hand-integrated the code from JDec & DJ to get something that would compile. I then hand-refactored the code to insert (probably re-insert) generics and Java6 style iterators - this step was entirely mechanical on my part but no decompiler I tried did it. The resulting code was much smaller and more readable... and made a bug in the code more obvious (there are "before" & "after" Sets that periodically get unioned and during the union the dup-elimination code is broken and allows dups). The actual replacement of ArrayList with a HashSet took only a few minutes.
In summary: decompiling a single class file to reasonable-looking java code took about 6 hrs (and 7 decompilers looked at and 5 actually turned loose on the class file), while the refactoring to correct the egregious performance bug took maybe 10-minutes (yanked all the search loops and just used 'contains' and 'addAll') and included fixing the dups-allowed bug as a pleasant side-effect.
Is it time for another Java decompiler bake-off? Has somebody got a recent comparison of decompilers floating about?
Thanks, Cliff
Category: Web/Tech | Permalink
| Comments (15)
| TrackBack (0)
Just What-the-Heck is a "wait-free" algorithm?
August 8, 2008
What is a wait-free algorithm? Wikipedia has a nice article on it. There's the obvious definition: all threads complete the job in a finite number of steps - where a "step" is some unit of work granted a thread by having the OS give the thread some time-slice on a real CPU.
Wait-free isn't interesting for loop-free (and not recursive) programs - there's only a finite number of steps in a finite piece of straight-line code. However, wait-free is often used to describe multi-threaded programs - which often communicate between each other with some sort of atomic-update machine instruction (frequently a CAS instruction but sometimes it's LL/SC and sometimes something else). I.e. - most (all?) interesting wait-free algorithms rely on special atomic instructions (you always can write wait-free programs without any kind of atomic-update instruction but these tend to be really slow). Both CAS & LL/SC will fail if their initial conditions change and the "fix" for failing is usually to loop and try again.
To re-cap: wait-free algorithms in the real world rely on fragile failable machine instructions - and the failure is often fixed by looping while retrying. To remain a wait-free algorithm, the algorithm must *somehow* know the all loops are of finite length - i.e., failure is finite. This is often a reasonable assumption - under low load most machines' atomic-update instructions mostly work. But now let us posit an adversarial OS.
The definition of wait-free doesn't mention an Operating System - and yet real programs running on real hardware have to deal with a real OS. Programs often more threads than CPUs and one of the OS's jobs is to schedule the threads - grant them "steps" on the CPU to make forward progress. The OS chooses when a thread gets CPU cycles (steps) and for how long. Typically OS's attempt to be fair and efficient, granting all threads an equal number of steps in a round-robin fashion. However, the OS isn't required to be completely fair and typically they are not (in the name of efficiency). A wait-free program can make forward progress in the face of an OS that is not being nice, i.e. adversarial OS.
Let's assume the adversarial OS has to be at least slightly fair: it can't deny cycles to a thread forever. Indeed, if the program runs forever the the OS must hand out an infinite number of cycles to all threads (the unfair adversary isn't very interesting: the OS can simply choose to not hand out cycles some threads, or even to any thread - ala a hard-crash). What kind of damage can this fair adversarial OS do to our algorithms?
For starters, the OS can guarantee that no contended LL/SC ever succeeds! Its easy: the OS schedules threads right up until they execute the LL; then that thread isn't allowed cycles again until another thread also executes an LL to the same address. For all known implementations of LL, this will cause the 1st thread to break the "link" part of load-linked. The first thread is then allowed cycles until it tries another LL. The 2nd thread is preempted at once of course, since it also just executed an LL. Hence the OS ends up allowing all threads an infinite number of cycles while guaranteeing no LL/SC ever succeeds.
What does this all mean? It means in theory (since the adversarial OS is a theoretic notion) that IBM, Alpha & MIPS are screwed: they cannot use their primary atomic-update operation to write any algorithm without dragging in the OS. i.e., no one can write even plain locking algorithms on their hardware unless you also can show that the OS is at least slightly not-adversarial. Truly random time-slices would work - but fixed-cycle time slicing will be adversarial for algorithms which are "beating" in tune with the time slice.
It's a little worse than that: the "load" part of LL is a ... well, a load. A contended load. It will miss in cache and thus be very slow, with a larger-than-normal cycle count attributed to it. The random-slice OS is slightly more likely to context switch immediately after the missing load than at other points in the program. Of course, the context switch gives other CPUs a huge chance to request the same address - which means that the poor loser thread is both going to lose this LL/SC but also take another cache miss when it retries (and thus have another greater chance of a context switch).
This is all very theoretical: OS writers try hard to make their OS's not adversarial and exponential backoff/retry loops work great (or worked great on the 4-cpu PowerPC machines I used years ago). But you had to do it when you used LL/SC or you would get CPUs live-locked spinning on LL/SC.
How does CAS fair in the face of the fair adversarial OS? All hardware docs I could find claimed that a CAS does succeed if the initial conditions are correct. According to the docs, assuming that multiple threads attempt to update the same address via CAS using the same initial load then exactly one should succeed. Here then the adversary cannot stop at least one thread from making forward progress. This means that CAS can at least be used to make lock-free algorithms. The OS can arrange for some loser threads to always lose CAS races, by preempting them out after they load the initial CAS conditions and not granting them cycles until after the winner thread does a successful CAS update to the same location. In other words, you can't use CAS directly to write wait-free algorithms (you always write wait-free algorithms without either CAS or LL/SC, although they tend to be brutally slow).
Now here's another interesting bit: I distinctly recall using machines where CAS would sometimes fail for all threads even when everybody used the same correct initial conditions. Indeed, I want to claim it was true on early multi-socket Intel machines - but I cannot find any written documentation to support my memory! My fuzzy recollection was that since Intel didn't build the motherboards, Intel would not guarantee forward progress on contended CAS's - which had to do an external R/M/W cycle and the correctness of which all depended on the motherboard manufacturer. Indeed it seems to me that unless a CPU designer also designs the details of how CAS interacts across chip boundaries, then that CPU cannot make forward progress guarantees about CAS. I.e., like IBM has to claim the OS running on their gear is not a fair adversarial OS in order to make LL/SC useful, so also Intel has to claim that motherboards running their chips implement a strong version of the R/M/W cycle to make CAS useful.
Ok, this is all stretching the point a great deal. In practice these things work well- at least on smaller CPU count machines. But CPU counts are on the rise, and the issues mentioned here become less and less academic with each passing year. On an Azul 864-way machine, if all threads are attempting to CAS the same location in a tight loop Azul promises you exactly one succeeds. But also I can promise you that at least one CPU will perpetually fail unless you invoke some sort of fair-lock (which we do in the OS; Java threads can't quit write CAS loops that are so tight as to prevent all forward progress to one CPU whereas hand-written asm in the OS did). I wonder how well a 1000-cpu LL/SC machine will fair on such loops and how much exponential backoff will be required (and how much inefficiency will that cause)?
Or maybe the answer is: on some future 1million-cpu machine, if you have to CAS all CPUs on the same cache line then you are screwed: performance is so bad it's indistinguishable from a crash; you'll hit ^C and try again. Hence we all end up learning how to write programs which don't do that... I'll let you know if I get it figured out... :-)
Cliff
Category: Web/Tech | Permalink
| Comments (8)
| TrackBack (0)
On How I Went to 2008 PLDI and All You Got Was These Lousy Notes
June 15, 2008
On How I Went to 2008 PLDI and All You Got Was These Lousy Notes
ISMM and PLDI are in Tucson this year, in June (and CGO was in Boston
in Feburary - next year I hope they swap!!!). Outside temps are
expected to hit more than 105 in the shade. My Dad lives in Tucson,
so I went a day early to visit. He's recovering from a diastrous
surgical accident, slowly but surely.
The conference center is in a new resort, located in the desert.
Indeed the days where very hot and dry - but mostly spent indoors. I
didn't have very high expectations, but actually the place was very
nice. The landscaping was just superb; they had winding shaded paths
running around the resort with labelling on the desert plants. The
staff was nice and competent and the bathtub was ridiculously
oversized.
Monday I hiked the
Ventanna Canyon
trail. I started at 3:30 in the afternoon - hot part of the day, I
know, but I wanted to be back before dinner. I packed a bottle of
water, a broad sun hat and layered on the sunscreen. The trail itself
winds through private land for a mile or so before entering national
parklands. It's a rocky and winding trail, crossing a dry creek bed
repeatedly as is slowly climbs into the canyon. The trail is lined
with saguaro cactus, other cactus of every type, the occasional accaia
tree and a surprising amount of wildlife. I saw a dozen jackrabbits,
a 3-foot snake (not a rattler), and loads of lizards and birds. A
bright red cardinal landed 3 ft from my head and stared at me; the air
was loud with owl hoots and the buzzing of a kind of large heavy
hornet. Some kind of tree was popping seed pods; every few moments
there'd be a small pop and the rattling of seeds bouncing around the
stones. My feet crunched on the gravel in the hot sun except where I
rock-hopped across the dry creek bed. At times the trail came right
up to the rock canyon wall, other times I was wandering back and forth
across a narrow valley. The trail climbed slowly in the initial part,
but as the canyon narrows it starts to climb steeply up a ridgeline.
This part of the hike was slow going for me; it was still hot but with
nearly a 1000ft elevation gain. I metered my water carefully -
there'd be no refills till I got back - and I took lots of rest
breaks. Finally the trail levels out by my destination - the promised
Maiden Pools. At this time of year, they are reduced to small scummy
puddles covered with bugs - definitely a disappointment! After a
short break, I turned around. The ridgeline had lifted above the
canyon and I was treated with gorgous views of Tucson. The trip down
was a lot quicker than the trip up, and mostly in shade from the
narrow canyon walls. I got down about 7, nearly 3.5 hours after
starting on a 6.4 mile hike (which is definitely slow going for me!).
I saw my Dad again on the last day; he's finally come home from the
hospital. He's been in the hospital for nearly 2 months, so this is a
really welcome change! He's in good spirits and is on the mend.
The trip home started out fine: a quick flight to Phoenix from
Tucson, then a short layover and a flight home... except we got a slow
start, and then a slower start, and then they canceled the flight
while we were all sitting on the plane. Apparently we were too slow,
and San Jose's curfew kicked in. US Airways put us up in a Clarion
hotel about 20 minutes away. I skipped dinner thinking I could eat at
home and as of this writing it's 5am; I'm back in the airport before
any food joints open, waiting for the 1st flight to San Jose of the
day.
As usual with these trip reports, I summarize brutally (and only those
talks I actually attend) and sort by my personal agenda. No offense
intended, and if you think I didn't do your talk justice please
correct me here!
Some Cool Stuff
Building The Billion Dollar GC
Bounding Leaky Programs
Limits of Heap Compaction
Register Allocation via Puzzle Solving
Register Allocation/Coalescing paper
Foundations of the C++ Concurrency Memory Model
Race Directed Random Testing of Concurrent Programs
SharC - Checking C code for Locking Invariants
Automatic Volume Management for Programmable Microfluidics
Al Aho - Quantum Computer Compilers
Some Memory Management Papers
Immix - A Mark-Region Collector
Concurrent Real-Time Copying Garbage Collection
Limits of Parallel Marking
The Closer: Automating Resource Management in Java
Path Specialization: Reducing Phased Execution Overheads
Region-based Memory Management
Efficient Dynamic Heap Allocation of Scratch-Pad Memory
Virtual-Memory-Compaction....
Supporting Superpage Allocation
MPADs
Parallel GC with a Block-Structured Heap
Memory Management for Self-Adjusting Computation
Combine Ref Counting with Functional Program
A Study of Java Object Demograpics
Some Concurrent Programming Papers
Data-Race Detection via Linear Programming
"sketch" - A Tool for Writing Concurrent Data Structures
Velodrome
Inferring Locks for Atomic Sections
Model Checking Concurrent Programs
Other Stuff
Expressive and Safe Static Reflection with MorphJ
Liquid Types
Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compiling
Execution Indexing Problem
XMEM - Type-Safe Shared Memory for Cross-JVM Communication
TuningFork is now open-sourced on sourceforge.
(IBM's trace visualization tool and open trace format)
Building the Billion Dollar GC
David Bacon, Keynote
Metronome - worth several hundred million $$$ in revenue to IBM.
(David openned with an amusing story about the failure of technology -
his GPS and backup GPS both failed as his sailboat approached the reefs
surrounding Bermuda; weather was too poor to use a sextent reliably).
1999-2001, Recycler. Ref-counting based GC. max pause: 2.6ms, MMU
72% @ 10ms. But has pathologies - no bound on time with large numbers
of cycles, or on large cycle. No compaction. Required a GC processor
on the side; no way to define how beefy the GC processor had to be.
Next project: worked really really hard to get rid of pathologies.
Wanted robust performance. Decided on
Snapshot-at-Beginning because at time you declare snapshot, you bound
amount of work to complete the cycle. Was worried about
incremental-barrier approach (which is exactly the problem Azul
solves with our read-barrier).
Real-Time also means Real-Space. So some space obsessions. Choose a
segregated free-list. Lose a little memory to fragmentation. Compute
'rho' - fragmentation lossage ratio. Picking a 'rho' defines how many
memory (max) can be lost to fragmentation.
Array-Lets - choosen to handle compaction of large arrays well (most
allocators have a large-object-space which doesn't compact).
Would be very interesting to
run Fragger
on Metronome.
2001-2003: 2nd go-round. Max pause 12ms (up from 2.6ms).
MMU 45% @20ms (vs 72%@10ms). Much worse than last time.
BUT!!! Very robust against poor heap shape and bad apps;
immune to pathologies.
Now comes 3rd attempt - built into J9 (IBM production VM).
(sidebar - RTSJ doesn't "work" in a large scale, the scoped memory
doesn't "compose". Modules from different places, when run together,
have scope unpredictable failures. Raytheon was choking trying to use
RTSJ on the next-gen Destroyer project; saw Metronome and said "we'll
buy it now")
2004: went crazy trying to get it ready for Raytheon.
1st demo from Raytheon:
max 872us, MMU 65@%10ms, 79%@20ms
Finally IBM Software Group agrees to pick up the project (for $150M sales!)
2005-2006: Lots more features added: AoT compilers, timer modules,
priority inheritance, etc. Plus must make "real time friendly" all
the other cruft: jvmpi/ti, debug and RAS tools, weird ref's (soft, weak,
final, JNI), heap format/dump, class libs, all the RTSJ libs, ArrayLet
support, read-barrier support in class libs, documentation, etc.....
Plus testing 24x7, including performance testing, pause-time testing, etc.
Today - new multi-cpu version being worked on. Still only available
on blades - max 8cpus. Not tried 100G heap; no idea what the limit of
sustainable allocation rate.
Bounding Leaky Programs
Non-mutating Mutator - Take a snapshot from a real mutator thread; run
with massive profiling (e.g. all memory refs snapshot'd). The
NonMutator cannot update the heap; it's just profiling.
Bounding Leaky Programs. After a leaky program eats all of memory,
most of memory is reachable - but never used again. When it throws an
OutOfMemory - just capture the OOM. Then 'cut' a random pointer edge,
and get to reclaim most leaky memory. 'pick a random' - since most
memory is leaked and dead, likely to be cutting off dead stuff. But
also can use most common heap-type and staleness data (pages never
touched in a long long time). "posion" the cut edge, so that if that
pointer is actually used, you then throw the OOM you
would have thrown originally. This can post-pone OOM's from leaky
programs anywhere from 100x ("forever") to 10x or more.
A measure of heap fragmentation -
sum-of-(freeblocksize^2) / (sum-of-freeblocksize)^2
Gives a number from 0 to 1 -
- 0==maximally_compacted,
- 1==maximally_fragmented.
Can convert this number into a characteristic-free-block size. i.e.,
if we coalesce all free blocks, then split back into free blocks of
the same size - which size gives this metric.
Can I use this to decide "how likely a future alloction request will
fail due to fragmentation"? Probably...
Limits of Heap Compaction
Compare 20-zillion different heap-compression techniques across a wide
variety of heaps, and against e.g. gzip/LZW style compression.
For some applications (e.g. Xalan) see heap compressions upwards of
50% or more. Some of the techniques are cheapandeasy to use as well.
Trimming trailing zeros from arrays is a big one.
Biggest winner is zero-based object-compresion. Back each field with
a bit; store only the non-zero fields. Zero fields have a bit set in
the backing bitfield. Nearly 45% heap-size reduction across all heaps.
Used the Dacapo benchmarks+psuedojbb, 20-25 heap snapshots per
benchmark. Savings of each technique are very stable over time (all
snapshots show same savings per benchmark).
Register Allocation via Puzzle Solving
Solving for register-pairs. Split all live ranges maximally; split
before and after EACH instruction, using a complete parallel-rename.
Very graphical; 'puzzles' represent the short live range. Puzzles
have a top-area and a bottom-area for the incoming and outgoing
registers. Some puzzle pieces only go on the top, or bottom, or (if
it's a 2-adr live range) it will cover both. Never have a piece that
can go either on the top OR on the bottom.
Tried 4 different reg allocs on X86, using LLVM 1.9. Puzzle solving,
linear scan (default in LLVM), iterated register coalescing, PBQP
(partitioned boolean quadratic programs). Had to spill and retry 5% of
time. Puzzle-solving is fast; equiv to existing well-tuned linear
scan. Need to insert about 7% more spill ops than no allocation
(compare to C2's graph coloring: about 2% of all executed ops are C2
spill ops).
Cliff Notes:A nice way to handle mixing register pairs ala
X86 in a linear-scan allocator. Alas, as with all linear-scan
allocators the question is what happens at merge points. L-S
allocators always do well with large basic blocks, but the typical
Java basic block is 5 instructions long (HotSpot "-server", so plenty
of inlining and optimization).
Register Allocation/Coalescing paper
conservative coalescing vs aggressive coalescing.
but coalescing breaks SSA-form chordal graph form (and chordal graphs
can be colored in linear time!!!).
So do not modify graph, instead modifying the coloring to get a good
coalescing. Optimistically take the existing coloring (easy, from
chordal graph SSA coloring), then flip a nodes' color to coalesce an
edge - but this introduces a conflict - so again resolve the
color-clash by flipping a neighbor color until no more color clashes.
Sometimes have to give up a re-coloring. And what not should be
colored 1st?
Conduct re-coloring as a transaction. Mark nodes, change colors, etc.
If unresolvable, then rollback. Also, do not recolor already
recolored nodes (prevent coloring cycles). Also when picking colors,
try to pick colors not in any neighbor, etc.
Will attempt all nodes in the affinity graph (all nodes joined by
copies) in a single affinity group (split by self-interferences), will
attempt to recolor all those nodes to the same color at once. The
split agrees up front which copies will not be attempted to coalesced.
Not terribly fast, but not too bad: about N^1.2, and 1000msec for
1000 node IFG. After coalescing, 8.4% of copy-cost is left (vs 6.7%
left for PERFECT coalescing). Overall execution speed is about 85-90%
of no-coalescing.
Foundations of the C++ Concurrency Memory Model
Similar to JMM; if you avoid data races, then you get SC. Explicitly:
most compilers are now busted when updating bitfields.
Provide atomic<T> types that allow concurrent
access, without introducing dataraces. atomics allows concurrent
access and compiler optimizations on everything else. 'atomic' ends
up being similar to Java 'volatile' - but they couldn't make the
existing C++ volatile keyward do anything meaningful.
Looks like they are going to end up doing a good job; most of my
earlier complaints appear to be answered.
Race Directed Random Testing of Concurrent Programs
Take a previously known tool for finding potential dataraces. Then
force scheduling to expose the races "for real", and test to see if
the race is harmful. Works on large java programs, finds real races
with low false-positive rate. Reproducible: given the same random
starting seed, execution is deterministic.
Stress Testing: repeated execution. Fails: end up testing the same
schedule over and over again. Fails: on a crash, can not repro.
Fails: crashs in different environments can not be repro'd in-house.
So instead, force the schedule. Pick a random thread, step 1
instruction. Turns out to get much better randomized schedules (more
schedules tried than normal stress testing). But still does not do
much possible schedule-coverage (exponential schedules) - and so does
not cover buggy schedules.
So prioritize search to expose bugs quickly. Start with some existing
technique to get list of potential race conditions (all the "false
positives" are good now).
Good news: very modest slow-downs; gives a reproducable race/crash
with very high probability (reproducable, but perhaps the indicated
original potential race isn't a real race). Found a bunch of new
races in the standard JDK container libs, and in 300KLOC programs.
SharC - Checking C code for Locking Invariants
Programmer defines high-level sharing strategy. Example: this data is
protected by that lock, etc. SharC soundly checks thru static and dynamic
checking. Violations include real data-race traces.
- private
- readonly (read by all, no writes except init)
- locked(l) - only accessed under lock(l)
- dynamic - either private or readonly
- racy - benign race, no checking
Use these as type annotators:
"int readonly *private X;"
- private: only accssed by owning thread
- readonly; target of X can only be read
No non-private pts to private objects.
Sharc infers most annotations
Some common strategies:
- locked - counters, queues, caches
- private->readonly (init privately, goes to read-only)
- private-lock-private - an object moving from thread to thread
Key Insight: if you have the sole ptr to some object, you can change
the sharing stragety for an object. SharC checks this, and allows a
'sharing cast' to change the sharing strategy.
During runtime, SharC makes all the right guarantees with combos of
ref-counting and runtime checks. IF there is a race between 2
threads, SharC provides an error regardless of thread schedule.
Needed expensive checks for 'dynamic' typed objecs. Uses WPO to infer
dynamic and private sharing modes.
Tried it on 6 small/med concurrent programs. Added annotations,
investigate errors; add more annotations, repeat until no more errors.
Programs up to 350KLOC; needed 40 annotations/mods on 350KLOC.
Runtime overhead varies from 2% to 14%. Space overhead sometimes gets
large (30%+).
Cliff Notes: Actually seems plausible that HotSpot could be
annotated this way!
Automatic Volume Management for Programmable Microfluidics
Think "Lab on a Chip" . Chip with micro-pumps, micro-fluids, mixers,
values, sensors, heaters, reaction loops, etc. Want to mix and match
fluids for various kinds of sensors (e.g., blood glucose meter, or a
hormone tester or drug discovery).
'Assays' are an life-science experiment. Using these fluids, but NOT
"computing" using fluids - doing Real Science/Chemistry. Fluids are
not computer variables; fluids can be destroyed; have a fixed volume
and when it's used it's used up.
"Program" specifies relative volumes; but "execution" needs real
volumes. Sometimes a mix is needed for several things, so need to
decide how much of each fluid is being used. Have max capcity issues;
have minimum size issues (micro-pumps have a lower bound on how much
they can pump).
Very cool problem; using compiler-like technology to solve. Problem
looks something like network-flow, but it's actually harder (NP-hard).
Able to throw various linear-programming solvers at the problem.
2008 PLDI Keynote
Al Aho - Quantum Computer Compilers
Lots of strange algorithmic implications; some very hard problems become
asymptotically simpler (e.g. factorization goes from exponential to cubic).
Right now QC's are giant physics experiments, but progress is being
made on many fronts.
Immix - A Mark-Region Collector
Blackburn, McKinely
Really nice presentation.
GC Fundamentals - time/space tradeoff; Immix - improve this tradeoff curve
- Allocation (e.g. free list, bump-ptr)
- garbage type (e.g. tracing, ref-counting)
- reclaimation (e.g. sweep-to-free, compact, evacuation)
- 1960 - Mark-Sweep: free-list alloc, trace, sweep-to-fre
- 1967 - Mark-Compact:
- 1970 - Semi-Space: bump-alloc + trace + evac
Look at these 3 algorithms for: mutator performance, GC time, min-heap.
They all have spots in the curve where they are weaker than one another.
Add a 4th reclaimation: sweep-to-region
Mark-Region: bump + trace + sweep-to-region
Split heap into regions. Objects cannot span regions.
After making, any unmarked regions can be freed in bulk.
Large regions: suffer from fragmentation, but cheap bump-ptr
Small regions: suffer meta-data overhead, and small-fragmentation
Carefully pick block sizes to match TLB boundaries and cache
boundaries. Fragmentation doesn't hurt locality so much if you don't
span these boundaries. Recycle partially freed blocks (from last
cycle) first before working out of new free blocks.
Recycle in address order (explored other orders extensively). This
mimics the mutators' normal allocation order - i.e., continguous in
order generally.
Will also occasionally evacuate to get rid of really bad fragmentation
(but this is a whole-heap remap?). And it's only done rarely.
Extensive testing vs 9 other GCs (and Dacapo, JBB2000, JVM98).
Meets or exceeds all other collectors (in a single-gen and 2-gen settings).
Concurrent Real-Time Copying Garbage Collection
- hard RT GC is gaining support
- multi-core support is hard
This work: compacts; concurrent; lock-freedom; efficiency (actually,
Azul does all these - lock-freedom is very specific: where the mutator
interacts with the GC it does not need to lock; important for hard RT.)
For Azul - want happens on a reloc trap? Somebody locks, so that the
copy only happens once.
So this is simpler than Stopless - "Chicken" and "Clover".
"Chicken" - fast, but may not copy all objects so allows fragmentation
"Clover" - probabilitistic, but guarantees objects get copied
Chicken - abort copy if a mutator writes during copy. Uses a
Brookes-style forwarding ptr. Tag the low-bit during the copy.
Mutator will clear this bit atomically if it wants to write. (ugh
that's a write-barrier!!!).
So write-barrier is wait-free, and has a fast-path/slow-path.
In practice, only 1% of copies get aborted.
Clover - probabilistic indication of per-field copy. Steal a
bit-pattern, a random one. CAS this bit-pattern down to indicate that
the field is copied. Chance of collision with a REAL bit pattern is
1/2^128 (128-bit CAS on Intel) - infinitesimal. Then can make the
algorithm correct-but-probabilistic-lock-free by detecting if the
user is user your random R.
Both schemes - 20% throughput overhead (barrier costs).
Clover has more slowdown during the copy phase (3x slower)
Chicken - no other slowdown.
Alas, only implemented in Bartok; no comparision to Metronome.
Cliff Notes - I like the abort-copy-on-write as a way to
avoid stalling the writing threads. Be nice if we could be even
cheaper on the mutator - some like: TLB-write-protect the whole page.
If a TLB write trap happens, record that a write
happened somewhere on the page, unprotect the page
and let mutator proceed. At end of copy, the GC thread can see if any
writes happened. Since we only copy objects for pages that are mostly
empty, we'd only rarely have false aborts.
Limits of Parallel Marking
Did a study on longest ref chain see in SpecJVM98 heaps. Very short
for 1/2 the specs (max: 40). Some chains around length 1200 for
raytrace/mtrt and javac. Measured max heap depth every 10000 stores
into the heap - so pretty darned fine-grained measurements. No
SpecJAppServer, no Dacapo. Bummer.
Also compute "relative depth" - ratio between number of objects in
whole heap to longest single chain. Then can compute worse-case
speedup possible if whole heap filled with max-length chains. This
reports a guaranteed level of parallelism; real runs should do better
because most chains are short.
Cliff Notes: Doesn't know about Azul's spec-marking patent
and idea of being able to mark any shape heap in time
proportional to size-heap-over-num-cpus. Several researchers jumped
up and pointed out that the speculative approach could theoreticaly
bring in the whole heap (including all dead stuff), and I countered
with the random starting point approach (that you'd get an interesting
amount of dead stuff a vanishingly small %-tage of the time) and we
all admitted the analysis was over our collective heads. There's a
PhD thesis in there somewhere.
The Closer: Automating Resource Management in Java
Resource Management in Java (outside of GC)
Examples: socket open/close. Or AddListener and RemoveListner.
Really track anytime you need matching create/delete calls.
Usual approach is manual; has usual problems (double-free, dangling).
Also finalizers - but might run out of e.g. file handles, long before
running out of GC (happens to us all the time!). Error happens on a
fresh open, but instead we'd like to run a full GC cycle to reclaim
file handles BEFORE the 'open' call fails.
Would like to displose after it's left 'relavent' use. Listeners are
called when an action happens - but if the listener itself won't ever
call something else (there's no Observer on the Listener, but the
Listener is attached to the Observed), then the listener is dead. So
define 'interest reachablility'. Non-interest-links keep you GC
alive. 'Interest-links' keep your resource alive; call the dispose
method when only not-interest-reachable.
Also want to rapidly call 'dispose' as soon as only not-interest-
reachable. This greatly reduces drag-time on large resources.
Sometimes can do it statically, but not always.
User provides: primitive resources and non-interest links.
CLOSER provides: higher-level resources built from auto-made dispose calls.
Pretty standard flow analysis approach. See if they can track a
single 'owner' for a resource. If so, then either statically dispose
- or single-threaded-check dispose if the single owner ever created
the resource. If multiple owners then switch to ref-counting.
Worked great on a large Swing app (removed all existing manual
resource management; had to annotate 5 resources - other 62 resources
were 'auto-discovered' and auto-dispose code was almost completely
equal to the manual, except for fixing 1 bug.
Cliff Notes: I like this as a idea for simplifying
higher-level resource mgt. I'd rather hook into 'GC' somehow - but GC
on 'interest-links' somehow. Similar to how I'd like to see using
compiler/data-flow opts on normal programs.
Maybe this work could be done with weak-refs and finalizers and
standard GC? Interest-links are strong links, other links are weak-links and finalizers for cleanup.
Path Specialization: Reducing Phased Execution Overheads
It's "efficient" code cloning, to reduce GC barrier costs.
Have 1 set of code when GC is idea, and no barrier is needed. Still
need polling points, so GC can change phases.
Have another set for each different GC phase, with the different
barriers optimized inline. With a little bit of cleverness, can fold
these versions mostly together and not clone quite so much code.
Also claim read barrier costs are typically much much higher than has
been reported in the past (common paper is "Barriers: Friend or Foe"
which reports 10-20% cost for simple read barriers). However, e.g.
Stopless has barriers costing closer to 60% during some short phases,
but cost is back down during most of the GC cycle.
Region-based Memory Management
Check for unsafe usage of Regions (ala HotSpot) on C programs with
100KLOC. Found a dozen errors, and another dozen false positives.
Actually, quite nice - wonder if it works on HS.
Efficient Dynamic Heap Allocation of Scratch-Pad Memory
Allocator targets memorys less than 1Mb in size, for embedded systems.
Scratch pads are common in small machines, but also on e.g. Cell.
Scratch is big enuf to be useful, small enuf to be fast.
But can too much stuff to go in it; get fragmentation and control
problems. Need to manage carefully.
- manual: nightmare for programmers to manage
- compiler/static: doesn't make good use, no time sense
- dynamic: still hard to use (e.g. have a fast heap and a slow heap)
Example: DL-malloc will use 512bytes of state to manage a 4K
scratch-pad. 1/8th of fast memory is gone already!
SMA - 40 bytes on a 4K scratch. Also a lot less code and faster.
Uses a fast simple fat-block bitmap approach. Can break big
blocks up to avoid fragmentation. Split blocks are power-of-2
buddy system and stored on a fixed-sized free-lists.
No boundary tags. Instead coalesce tags: whole block region has a
bitmap showing free chunks of size 2, and another bitmap for chunks of
size 4, 8, etc. Makes the coalesce operation very efficient (constant
time per de-alloc or maybe log-n per de-alloc to coalesce).
Can deferr coalesce and stick with free-list for hi-volumne sizes.
Can update the bitmap with CAS for allocations that do not span a word
of bitmap. If spanning a word, will need several CAS's. If CAS fails,
then roll-back earlier allocations and retry.
Very much lower overhead than DL malloc on small allocations.
Won't scale well at all (part of the small-space-fast design
trade-off).
Virtual-Memory-Compaction....
double-map physical page as a way to move objects physically without
moving them virtually. It's memory compaction for C++! If the
page-offsets of 2 objects do not overlap, then can copy one object
from one phys-page to the other, and double-map both virtual addresses
into the same phys-page (and thus reclaim the copied-from phys-page).
Supporting Superpage Allocation
(avoiding fragmentation of physical memory)
There's lots of prior work to avoid frag; none seems to work well.
Define characteristics of pages:
- movable memory,
- reclaimable,
- unmovable, or
- temporary
Group pages together based on mobility.
Movable - e.g. an app page. Copy the data to a new page, and change
the apps Virt-2-Phys mapping. File-system cache pages are temp.
Free-lists for each kind of memory. Pre-split memory for each kind
of memory. Lots of details; he talks very very fast. I can't follow
it or type that fast.
Turns out it's working well enough, and will probably be in a
future version of Red Hat. No slow-down in page allocation (and often
a speedup), and also able to allocation super-pages about 50% of the
time.
MPADs
Auto convert array-of-structs into struct-of-array for better
cache locality which searching across the whole dataset.
Pointers-to-structs turn into ptrs-to-1st-field. All 1st-fields are
now in an array, and the array is power-of-2 aligned. 1st field must
also be a power-of-2 (and math is cheaper on 1st field, so pick a hot
field). Then can convert the ptr-to-1st-field address into the array
index by AND-masking to the array start, and doing a sub/shr. Other
fields can be found by adding different array bases and using the just
computed index.
Worked great on microbenchmarks; didn't help SpecCpu ... accidently
turned off hardware stream prefetching on Power chips.
Parallel GC with a Block-Structured Heap
Simon Marlow, Simon-Peyton-Jones
It's a parallel GC for Haskell. GC talks to the OS via 'malloc', not
'mmap'. Doing this for flexibility and portability. NO address-space
layout requirements. Grabbing OS pages (4K chunks) at a shot.
Similar to HotSpot TLAB - on 4k pages.
Actually grabbing 'megablocks' = 2^M in side and 2^M aligned. First
2^D bytes contain "block descriptors", then rest of 2^M holds 2^12
sizes block.
Memory Manage for Self-Adjusting Computation
Large data-set which changes slowly over time.
e.g. robots interact with phys environment
e.g. run the same program over and over again, where the input and output
from run to run vary slowly over time. Use change-propagation to
make a meta-program which acts "as if" it ran the whole program from
scratch - but it runs much faster, and assumes some large starting state
and produces only the delta to output state.
Benefits: much smaller in both time and space over running the same
program over and over again. Downside: more complex (but SMJ can
auto-produce such a program from the original whole-run-at-a-shot
program). Downside: lousy GC behavior.
So: result is indepedent from input trace, except for performance.
Given a good input trace, and a changed input, can compute a new
whole output efficiently. GIven a lousy input trace, will slowly
compute the new output as-if from scratch.
Has the classic bad allocation pattern: it's basically a giant FIFO
queue for the live dataset. New allocations are long-lived; things
that die are old things.
Combine Ref Counting with Functional Program
(ref counting not quite dead yet) - ref counting admits lots of
interesting optimizations. Cycles - 2 general approaches; Brownbridge
- maintain invariant that any cycle has at least one 'weak' edge and
don't walk weak edges. Very complex, very hard to get right. Or LMW
- trial deletion and see if a cycle is busted. Downside: very slow if
lots of cyclic live data. Also often have large chunks of data you
know isn't cyclic (shape of code) or isn't dead
(e.g. roots).
So pick these invariants:
- no cycle is entirely strong
- weak-in + strong-out
==> there-exists strong-in
Once the strong-count drops to zero, the whole cycle can be nuked.
Maintain invar#1
- mutator makes refs in only 3 ways:
- root
->strong; constructor-arg->strong; ditto->weak
invar#2:
- delete strong ref - check for weak-in and no strong-in and strong out
- might delete last strong-in
- correct by making output strong edge into a weak one (and recursively)
- No limit on amount of work for weaking an edge
This approach limited to pure functional programming - can build cycles
in the constructors but NOT by set!'ing.
Independence Theory - edge coloring is independent of other
optimizations and heuristics. "push-out" should be possible.
Typically for every node reclaimed, less than 2 are scanned.
A Study of Java Object Demograpics
Current collectors tend to have a very simple classification scheme -
short lived, long lived, immortal. And then act accordingly:
young-gen, promotion, old-gen decisions are all driven by the prior
classification.
Real programs have a richer structure. Objects from the libs, the
app, and the JVM itself have different lifetimes. Clusters of
allocation sites are stable across program inputs; a small number of
clusters dominate all allocations.
Now look at individual allocation sites, and look at the lifetime of
objects from those sites. Use some cheap statistical test to see if 2
sites have similar average lifetimes. Group all allocation sites with
a greedy-algorithm based on closeness of lifetime patterns. The top
10 of these groups, or clusters, will cover 90% of all allocation.
A small amount of inlining made each "hot" inlining site very
deterministic.
Data-Race Detection via Linear Programming
Assume a 'capability' for each memory location. You need the
capability to write, or a tiny fraction of it to read. Thus many
readers are allowed, but only 1 writer (and no readers if any writer).
From this can define typing rules for the lambda calculus, and from
that define some linear inequalities. Can solve system of linear
inequalities with, e.g. GLPK or LPSOLVE.
Now try to add locks to the lambda calculus; locks "store"
capbilities. Creating a lock also has to include some capabilities.
Can stll get a real lambda calc, then linear inequalities. Able to
run this on some small pthreads/C programs and find some races. Has the
usual false-positive issues (lots reported; needs some expert
extensions to his lambda calculus formulation to remove the false
positives).
"sketch" - A Tool for Writing Concurrent Data Structures
insight - implementation - validation- fails-validate - fix either
insight or implementation.
This tool: fix the validate/implementation cycle. User provides
insight, but tool tries implementation and validation cycle.
You write a partial program and a spec; 'sketcher' fills in the rest
of the program (exhaustive search) until it gets a program which meets
the spec. Your partial program includes 'holes' to be filled in, and
asserts to be honored. Plus you fill in a list of sample statements
to be filled into the hole (but unordered statements, plus various
reg-exp expansions - so pretty flexible about what can go in the
'hole').
Their gizmo needs a counter-example input on failure. It eats the
failing example to help drive a smarter next-go-around. Failing
traces from SPIN model-checker for one program can be used to exclude
whole classes of other programs which would exhibit the same bug.
The using constraint-solver to produce new program attempts that avoid
all prior buggy traces and fits the 'holes' given by sketcher.
Velodrome
Free from race conditions is not sufficient; must have good atomic
properties as well. Can show some interleaved schedules are
equivalent to some serialized execution. Atomic code blocks are
serializable. Atomicity violations often reveal bugs.
Race-Freedom - as-if sequentially consistent
Atomicity - as-if each atomic block is executed serially
Want both properties.
Race-Freedom/Race-Detectors - e.g. Eraser (incomplete, false alarms)
vs complete tools. But Atomicity checkers only come in the "False
alarm" flavor. Bunch of these kinds of atomicity checkers.
Velodrome - complete checker for atomicity - no false alarms.
Demos a non-standard sync idiom (Dekker's algorithm). Shows
happens-before layered over program order, plus H-B on volatiles and
shared variables. Also, anything not in a XTN is treated as operating
in its own 1-instruction XTN. If this H-B ordering has any cycles IFF
then not serializable. Velodrome detects such cycles.
Now: make this scale to billions of XTN's. Cannot keep the entire
graph; too big. Can use GC to remove XTN's - XTN's with no incoming
H-B edges will never be part of a cycle. Can't get rid of XTN's that
are mid-progress. Ref-Counting works well (since no cycles!)
Must optimize unary-XTN's (single-instruction XTN's). If a unary XTN
has no in-edges - then it will never have in-edges, so can be GC'd
immediately - so actually never even allocate a node for it in the
graph. If unary has exactly 1 in-edge, can merge it into the prior
XTN as well - and "1 in-edge" refers to all the nodes smushed into the
same larger XTN. Must do this to keep costs reasonable.
Velodrome: part of RoadRunner framework for Java; uses BCEL. Can also
add other analysis in a pipeline on top of velodrome (e.g. Atomizer
and Eraser). Velodrome causes slowdown by 13x. H-B XTN graphs tend
to be of size ~20 nodes!
Compare Velodrome to Atomizer on 14 multi-threaded java programs.
Atomizer has 1/3 false warning rate (238 total warnings). Velodrome
gives 133 errors, no false alarms. 84 more errors are Atomizer false
alarms, and 21 errors are only detected by Atomizer.
So - take Atomizer errors, and run with Velodrome but pause the
executing thread that Atomizer complains about. Now Velodrome usually
catches a concrete witness of the broken thread scheduling.
Inferring Locks for Atomic Sections
This talk: compiler support for atomic sections via pessimistic
concurrency; i.e. implement the 'atomic' keyword with a plain lock.
Useful for XTN's with non-reversable ops (i.e., I/O or wait/notify).
Actually grabbing fine-grained locks at the start of each atomic
section corresponding to all the memory locations being touched inside
the atomic region. (how is this different from every other STM
manager out there?). Compiler framework is parameterized: can use any
heap-shape analysis to guard chunks of heaps with coarse-grained
locks. Also the coarse-grained locks are kept in a hiearchy (tree
order); locks closer to the root cover everything under them.
Performance is better relative to TL2/STM in a 80% puts R/B tree, and
comparible when using 80% reads. Still need to figure out read/write
locks, but then think performance will always tie or beat a good STM.
Testing methodology is flawed, so hard to tell how well it really works.
Model Checking Concurrent Programs
stress testing rarely finds all the bugs. Poor selection of thread
interleavings.
"CHESS" - Systematically enumerate thread interleavings. Provide some
qualified coverage guarantees. Replace OS scheduler with a "demonic"
scheduler. Want to enumerate all paths in a state-space diagram, but
can't capture program states (too big). Very effective for acyclic
state spaces, but not for cyclic spaces.
Problem is to get interesting schedule coverage. e.g. spin loops are
very boring to spin at; and if you spin 100 vs 101 times - do you then
explore the states following each seperately? Likely the states
following the spin loops are same for looping 100 vs 101 times. Also
can't really detect live-locks very well - they just look like bigger
badder spin locks.
(Good) programs indicate their lack-of-progress - a thread scheduled
infinitely often will yield infinitly often (via sleep, yield,
thread-ends, asm("rep nop"), etc). A violation of the Good Samaritan
assumption is a performance error.
Found lots of bugs (not presented in paper!)
Found in singularity - etc, programs from 6K to 175K.
Can boot and shutdown singularity under the model-checker - found 2 bugs.
Now in production use at Microsoft!
Expressive and Safe Static Reflection with MorphJ
So far - looks like a technique to avoid replicating synchronized
wrapper classes (i.e., instead of a SynchronizedList class and a
SynchronizedSet and a SynchronizedMap class, you can add sync to any
prior class).
Adds syntax to allow iteration over reflective calls:
for( R m(A) : T.methods ) ...
where T is an interface Type, T.methods is a list of methods in T.
SO for example can declare:
for( R m (A) : T.methods ) public R m (A args) { sync(mutex) { me.m(A); } }
It's an interesting syntatic extension for defining new classes of
code subsumption and commoning... but probably not one that's truely
useful. Much easier just to clone the code by hand. Certainly have
enuf obvious extensions that it seems more broadly applicable. Looks
like a good language hackers' tool. e.g., can replace reflection
hacking + BCEL fairly easily; used MorphJ to replace Heirly's auto-STM tool.
// here is "for all fields with a getter..." for( T f : X.fields ; some get#f : X.methods )
Liquid Types
Better static typing? Can put restrictions on actual int values as
'types'. Such as 'i is never 0 so division is ok', or a type such as
'i such that i is < array.length'. Problem with dependent-types: need
huge type annotations. Can't we use inference?
Limited to logical qualifiers: "0<=v" or "v<len a" or "v<100", plus
conjuctions of them. Type qualifiers limited to the usual, plus these
logical quals.
Provable array safety; 1% annotation in some large code, 10% for a
container library. Seems to do a decent job of inferring proper types.
ML type-inference, plus 'dsolve' - linear constraint solver.
Caught an off-by-one in the bitvector library; safety check for bitblit:
if( c < 0 || offset1 < 0 || offset2 < 0 || offset1+c > length1 || offset2+c > length2 ) ....
Error is should be "offset1+c >= length1"
Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compiling
Bartok
Start with TAL - type-safe asm code
Compiler is 200KLOC; does about 40 optimizations.
Generates code that is 2% slower than no TAL; compiler is 40% slower.
But gen'd code is typed.
Carrying type-proofs through all of the optimization steps?
Have a type-system for passing by-ref values.
Support most optimizations; does not support inline allocation, or
array-bounds checking ABCD (ABCD is not safe in the prescence of
overflow). Proof is generated by Bartok, carried along with the TAL.
Proof-checker can check proof in linear time, long after optimizer is
gone. Verification takes about 2.5% of compilation time.
Also check that GC info is correct, that return address is immutable.
Not handling correctness proofs for concurrent programs.
Object sizes have grown alot, to hold type info - 2x more data
overhead than plain optimized code (ala plain C). Execution time is
only a little slower (no inlining of allocation code); I assume
allocation is considered trusted code (part of TCB).
Execution Indexing Problem
same program run twice w/slightly different inputs OR
2 versions of same program OR
same multi-threaded program, w/different schedules
How do you represent an execution point across different executions?
How to set breakpoints in the same execution-point in 2 different
executions?
Describe a grammer for the execution trace (remember Manish's trace
compression?) Same game: it's a grammer to describe execution. Can
compute these online, and use them to decribe execution points. Can
ask for their 'indices' same as asking for a call-stack. Instrument
at branches and post-dominators; counters for loops.
XMEM - Type-Safe Shared Memory for Cross-JVM Communication
Uses HotSpot! Type-safe shared memory: a pool of shared memory.
No shared => private-JVM ptrs allowed. GC safe. Direct access.
Use it, e.g., for HTTP or RMI or CORBA communication between
different JVMs physically located on the same box.
Ala Azul's DirectPath (but we don't break OS protection).
Shared objects not distinguishable from private objects. Mapped to
same virtual address; shared objects have same pointers in each JVM.
XMem uses a write-barrier to enforce no-private-ptrs in shared-heap.
Otherwise shared objects are exactly the same as
private ones.
Need Local Class Table and Global Class Table - so shared objects can
have globally shared classes as well. Shared objects then have 2
flavors of classes - the global version, and via indirect, a private
"real" local version.
Some special native/JNI calls to allocate objects into the shared
memory; plus some conveience modes for making objects shared by
default. Can 'clone' objects into shared space. Can also open
sockets and read/write to get objects into shared space.
GC: extends card-marks to tell which private heaps point into the
shared heap. When shared heap is done, trigger a minor collection in
each JVM to find roots into the shared heap. Any tracing algorithm
will now work; they did a parallel STW copy collector. I assume one
JVM did all the GC for the collection using the shared memory.
Support locks in the shared heap; must always inflate such locks - as
they cross JVM boundaries.
Overhead vs XMEM oblivious programs- 2% slowdown inside HotSpot for
extra work, lock inflation, etc.
But communication microbenchmarks show 20x to 200x speedups.
Also looked at TomCat and Hsqldb - using shared memory instead of
serialization (XML) over sockets - often 1000x faster.
Cliff Notes: They lose OS safety protections. A broken VM
can corrupt the shared memory, thus also crashing an unrelated VM
using the same shared memory. Note the difference between 2 VMs
talking via serialization over sockets: a corrupt VM will cause a Java
level exception in the other VM, which might be able to correctly
handle the crashing VM. Not so here.
Yet Another sampling talk, where they show that a 1% sampling rate
is good enough to track all sorts of memory-trace analyses.
RADAR - dataflow analysis for datarace detection
(solves a problem Java doesn't have)
Postlude
I'm off onto a couple of weeks vacation. No email! I'll be going
through Internet Addiction Withdrawal. (No cell phone either, but I'm
a visual person and hate the phone anyways).
Cliff
Category: Web/Tech | Permalink
| Comments (4)
| TrackBack (0)
Clojure: STMs vs Locks
May 28, 2008
I've been participating in this fascinating discussion with Rich Hickey, the author of Clojure about Software Transactional Memory. I decided to cleanup and echo the conversation here, but the original can be found here.
The Problem Statement: it's not "atomic-vs-lock" but instead it's "where do we put atomic/lock?"
Randall R Schulz I found the following comment on an Azul Systems blog (http://blogs.azulsystems.com/cliff/2008/05/javaone.html) :
Cliff Click wrote: Performance in an STM is not transparent: it requires a runtime to do abort/retry, and as the runtimes get fancy & complex the behavior under load of STM's gets.... weird and unpredictable. Not something you can put into a production environment.
Any comments or counter-examples to this?
Raoul Duke
Rich Hickey, richhic...@gmail.com
wrote: Are explicit locking designs free from such anomalies? I wouldn't think so.
paraphrase: the behaviour under load gets weird and tricky.
well, in some ways maybe that doesn't happen with locks: the question of (depending on which STM system we're looking at) figuring out what caused all the rollbacks vs. with locks, you can at least generally quickly see that a given lock has 90% of all threads waiting on it kind of thing. whereas you don't know what variable in the venn diagram of overlapping transactions caused the retry?
tho i believe Rich previously said that Clojure would actually have a way of finding that out! http://groups.google.com/group/clojure/browse_thread/thread/2f73b80fd...
Cliff Click
You got it - under load, performance is unpredictable.
With locks, you can see which locks are loaded, generally figure out why, and plan a strategy (stripe the lock, shorter hold time, switch to the j.u.concurrent datastructures, etc).
Not so (at least not yet) with STM. I've talked to a few people who've tried STM out in a larger scale than the usual academic papers - and the results are downright disappointing. Unless you become an expert in the STM runtime you're using (where "expert" means "not just grok it, but can build & tweak it") anytime performance is not good enough - you get stuck. This is an open research problem at best, with no clear indication that the problem can be solved at all. Rich Hickey
One could as generically argue that systems with manual locking are usually broken, and therefore their behavior itself, never mind their performance, is unpredictable. That's been my experience with manual locking in the hands of mere mortal developers. It can be as difficult to understand the behavior of any single manually-concurrent program as to understand the performance characteristics of an STM. In the latter case, at least the STM is handling correctness for you, and all users of the same STM can share knowledge (and bug fixes and performance improvements), vs each application being its own Gordian knot. And in any case in which the STM is failing you performance-wise, you can opt out and use manual locks outside of transactions. To the extent you can use it, STM can drastically reduce the complexity of a system.
I'm wary of unifying the discussion of concurrency with one about performance, as the problems of concurrency start with correctness, where the manual locking story, due to its complexity, is quite bad. Scalability is not a universal problem - some systems need to handle thousands of simultaneous connections - most don't, some need to leverage hundreds of CPUs in a single application - most don't. But every application that wants to leverage multiple cores needs to be correct.
It would be no problem to instrument the Clojure STM references with fault counters that would allow one to know the exact per-reference contention level. Once known, the answers in both cases are similar - share less mutable state, don't have both long and short-lived transactions contend for the same data, check your success conditions ASAP etc.
STM designs differ from one another quite a bit, so any general statements are difficult to assess. I think the level of granularity of the STM comes into play. Most of the STM papers I've read concentrate on creating lots of transactional cells with which they envision constructing concurrent data structures like the ones in java.util.concurrent. I don't think that's a good idea at all.
Clojure encourages the use of immutable persistent data structures, which need no locking or coordination whatsoever in order to perform correctly in a concurrent situation, and STM references to support the appearance of identity-preserving change. It also encourages the use of java.util.concurrent and the like for caches and work queues - the best tools for the job.
As far as I know, Clojure is the first STM that uses snapshot MVCC, avoiding both read logging and transaction restarts due to read invalidations. Clojure's STM also gives you the ability to 'ensure' a reference you later intend to read or write, thus providing some manual control over resource acquisition order (rather than just relying on the side effect of access). It also supports explicit commute, further reducing retries for commutative writes. I'll not claim its performance characteristics are at all proven, but its contention due to, and for, reading is lower than in programs that use manual locking, and lower than in STMs in which write transactions can cause read transactions to retry.
Once you get to record-level STM granularity, like Clojure's, it's also a bit easier to draw parallels to the performance characteristics of the database systems it mimics, which have been doing similar things for decades.
I don't consider STM performance any more or less of a research problem than determining the correctness of programs that do manual locking - there's still work to do.
And of course, neither STM, nor any other mechanism, is a panacea. > | Brett Morgan
Given that I am a mere developer, actually with MT i'd consider myself a rank newbie, the most important thing to me is visibility. I want to be able to see what problems I am creating with my design, so that I can change my design, and see how the problems change. In effect, I am looking for an environment that teaches me what I am doing wrong, if only by showing me which references are hotly contended.
In keeping with that, it would probably be helpful for both refs and transactions to be namable, such that debugging output from the STM runtime can actually be easily inspected. The reason I say this is that the data I need to work over is non uniform, so I need to be able to label the hot reference to know which particular chunks of my data set are causing issues.
thoughts? |
Cliff Click
The problem with manual locking - and it applies equally well to transactions - is that the there's no name-space-control over what needs to be locked/transacted. The failure people have with locks (and limited experience with transactions) is that they can't properly name their shared variables. They think they can, until we start looking hard, and then we keep finding more... and more that need to be locked/transacted atomically. In short, people don't know where to put the lock/unlock or where to bound the transactional region.
Having said that, locks have some advantages over transactions: transparent performance being one of them. Programs are busted if the locks are wrong (or transaction regions don't cover all the variables they need to; Clojure isn't immune here) AND they are busted if performance sucks. For 2-4 way machines the issue is probably moot; for 8-32 way machines (standard servers these days) - it matters. In a year every el-cheapo linux blade server out there will be looking at 16-64 cpus. And of course for Azul, 100+ active cpus is the norm. I claim that in a year, most people will either care (get bit by lack of scaling), or see that their need to care within another year.
Requiring shared variables to have a special syntax, ala Ref, is a big step forward here for both locks & transactions. You're reducing the complexity of the problem via name-space control (only Ref's are shared) and promoting immutable shared state. The STM issue is orthogonal.
As evidence of that, suppose you replace your STM implementation with a trivial single global lock. Are the programs more or less complex than their locking counterparts? The program is as-correct and as-complex as before (before you swap the STM implementation), and indeed is exactly equal to the locking solution using a single global lock. Suppose I can get a correct program using STM - then I also have a correct program using a single global lock. STM only varies from locking as I try to make the implementation more performant: for the STM I trust the language implementor to "stripe" the locks under the hood. For locks, I start using more than one lock. In both cases I can also try to shrink the lock-hold-time (or reduce the size of the STM region), or switch algorithms.
Summary: STM doesn't solve the Big Problem (lack of name-space control). Ref's might solve the Big Problem - and solving the Big Problem will be a big help to both locks & STM. Locks have transparent performance and are thus easy to optimize. Optimizing without good name-space control is a buggy process for both locks & STM. I don't know how to optimize an STM implementation without becoming an expert in the particulars of the STM runtime (and there are so many to choose from!) |
Rich Hickey
I agree that identifying the things that can change is one key issue, and Clojure's Refs, Agents, and Vars all do that. But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure's) STM works.
Refs are not merely markers, they do enforcement. There is simply no mutating a ref outside of a transaction - it's not an allowed mistake. Any attempt to change a ref outside a transaction throws an exception. So, there is no such thing as "transaction regions don't cover all the variables they need to" in Clojure.
Your thought exercise with a single lock is a good one. Let's say Clojure's implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.
Which is more correct? The Clojure one, because it is not subject to undetected 'forgetting to obtain the lock'. Your assertion that "Suppose I can get a correct program using STM - then I also have a correct program using a single global lock" implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness. All Clojure STM programs (that don't throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don't throw exceptions) are correct STM (or otherwise) programs.
Which one is has better performance? Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.
As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock acquisition order and the resultant deadlock risk. At this point STM becomes dramatically less complex than manual locking.
I'll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy. But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.
Analogies to databases are telling. DBMS's took over locking long ago and most people were happy to have them do so. Ditto memory management and GC. In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.
So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement. It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I'm sure people are working on that. STMs can provide correctness enforcement today, and therefore the two are not equivalent. Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance. |
Cliff ClickOn May 24, 10:51 am, Rich Hickey wrote: I agree that identifying the things that can change is one key issue, and Clojure's Refs, Agents, and Vars all do that. But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure's) STM works. Refs are not merely markers, they do enforcement. There is simply no mutating a ref outside of a transaction - it's not an allowed mistake. Any attempt to change a ref outside a transaction throws an exception. So, there is no such thing as "transaction regions don't cover all the variables they need to" in Clojure.
Alas - there still is. :-( Example down below, and this is my primary complaint against both locks & STM. However, your other arguments are more quickly answered - mostly because I agree with you. Your thought exercise with a single lock is a good one. Let's say Clojure's implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.
Which is more correct? The Clojure one, because it is not subject to undetected 'forgetting to obtain the lock'. Your assertion that "Suppose I can get a correct program using STM - then I also have a correct program using a single global lock" implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness. All Clojure STM programs (that don't throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don't throw exceptions) are correct STM (or otherwise) programs.
Good one. I agree. Clojure wins this round. Which one is has better performance? Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.
Not really relevant; nobody writes the 'single global lock' version, including Clojure. It was just a thought exercise. As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock acquisition order and the resultant deadlock risk. At this point STM becomes dramatically less complex than manual locking.
Not interesting in practice. Because in practice, deadlocks are easy to find & fix. You get a crash dump, crawl the stacks, discover the lock cycle & reorganize. Sure the situation could be better, but deadlocks are a 'crash once' kind of bug. Dev's don't like 'em, but they don't lose sleep over them either. I'll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy. But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.
Analogies to databases are telling. DBMS's took over locking long ago and most people were happy to have them do so. Ditto memory management and GC. In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.
Actually, I was thinking of the compiler-vs-ASM analogy that took place in the 60's & 70's. But point well taken: I agree that we need some kind of support here. Maybe in the very long run STM runtimes get better & STM does win. Meanwhile life sucks. If you can only solve the problem with STM's I'll point out below then I'll agree with you, and maybe even try my hand at an STM implementation. So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement. It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I'm sure people are working on that. STMs can provide correctness enforcement today, and therefore the two are not equivalent. Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance.
Ok, the long sought after counter example: STM's do not compose w/correctness. Bad Java syntax follows, because I don't 'speak' Clojure. I'm using the classic example.
class Account { long _money; add( long m ) { atomic { _money = _money + m; } } } transfer( Account a1, Account a2, long m ) { a1.add( m); a2.add(-m); }
While add alone is STM'd & safe, the two calls to add that need to be wrapped in an atomic are not - so there is a race where the accounts do not balance. Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class. THIS is the source of vast numbers of bugs.
Less trivial examples quickly spiral out of control. Sure I know Ref's A, B & C are all Ref's and thus only updated in dosync blocks - but can I split any collection of Ref updates into more than one atomic region? The ad-absurdum answer is to wrap a dosync around all Ref's - which means the whole program, and that means a single-threaded program. Imagine the program unrolled in front of you, as an endless sequence of actions (all control flow flattened out)... interspersed with other actions are updates to Ref's. Now: dissect this list into groups with atomic or lock as you see fit, preserving the program meaning as another unrolled thread of execution is interspersed. How do you decide where it's OK to split the stream and where it's not OK? Why, in the Grand Scheme of things is:
"... { read _money ... write money } ... " a correct splitting of the action stream, while "... { read _money ... write _money } ... { read _money ... write _money } ..." is broken, but "... {{ read _money ... write _money } ... { read _money ... write _money }} ..." is correct again? |
A Brief Digression into the Usefulness of Deadlock Avoidance
Phil Jordan
I'm new to STM, I've stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking. I especially don't know what goes on under the hood in Clojure's STM implementation, or how powerful the JVM is in terms of memory guarantees.
I'm chipping in out of interest and to improve my understanding, apologies if I sound like an idiot:
Cliff Click wrote: On May 24, 10:51 am, Rich Hickey wrote: As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach - lock acquisition order and the resultant deadlock risk. At this point STM becomes dramatically less complex than manual locking.
Not interesting *in practice*. Because in practice, deadlocks are easy to find & fix.
From personal experience, this is far from the truth in complex systems. Deadlocks happening only on "in the wild" systems, appearing in the form of heisenbugs, etc. Not fun at all. There's too much in the way of implicit contracts going on, which blows up in your face if you're trying to extend undocumented software that was written by someone who left the company before you arrived. (and we all know the "well then document it" approach just doesn't happen in practice)
Maybe it's just the implementations I've used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful. You get a crash dump, crawl the stacks, discover the lock cycle & reorganize. Sure the situation could be better, but deadlocks are a 'crash once' kind of bug.
You need a reasonable amount of infrastructure in place for that, plus you're relying on absence of evidence rather than proof that it can't deadlock. Dev's don't like 'em, but they don't lose sleep over them either.
The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. :) Bad Java syntax follows, because I don't 'speak' Clojure. I'm using the classic example.
class Account { long _money; add( long m ) { atomic { _money = _money + m; } } } transfer( Account a1, Account a2, long m ) { a1.add( m); a2.add(-m); }
While add alone is STM'd & safe, the two calls to add that need to be wrapped in an atomic are not - so there is a race where the accounts do not balance. Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class. THIS is the source of vast numbers of bugs.
My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic. The way Clojure handles this (I've been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a ref, and any attempt at modifying it will fail unless you're in a transaction. Wrapping the add around the transaction would be the anti-pattern here, you want to make the transfer a transaction. Less trivial examples quickly spiral out of control. ...
Okay, you kind of lost me with what you're trying to say here.
I get the impression you're mixing up atomic operations on the low level with a high-level STM implementation like Clojure's. The latter presumably won't be efficient unless the implementation uses the former, but my understanding is that the programmer using an STM implementation is largely supposed to be isolated from such details, as long as he/she is able to determine what operations should be wrapped in a single transaction. |
Cliff ClickI'm new to STM, I've stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking. I especially don't know what goes on under the hood in Clojure's STM implementation, or how powerful the JVM is in terms of memory guarantees.
I'm chipping in out of interest and to improve my understanding,
Let's see if I can do you some justice... From personal experience, this is far from the truth in complex systems. Deadlocks happening only on "in the wild" systems, appearing in the form of heisenbugs, etc. Not fun at all. There's too much in the way of implicit contracts going on, which blows up in your face if you're trying to extend undocumented software that was written by someone who left the company before you arrived.
Yup. So the deadlock happened. Ouch. (and we all know the "well then document it" approach just doesn't happen in practice)
Yup, but You could make a difference. HotSpot probably has the highest level of 'asserts' per line of code (and a pretty darned high level of comments per line of code) of anything Out There. Docs are cheaper than debugging. But it's an aside.... back to deadlocks-in-practice: Maybe it's just the implementations I've used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful. You get a crash dump, crawl the stacks, discover the lock cycle & reorganize.
You need a reasonable amount of infrastructure in place for that, Crash dump? Core file? 'ctrl-backslash' to a JVM to get a thread-dump? This stuff is commonly available.
If you don't have a proper tool chain, then Job One for you should be to go get one - and I'm very serious here. Drop everything else until you get a functional 'path' (protocol, implementation, business process, etc) that allows you do debug a crash from a customer in the field. You might need a willing beta-partner for this, hand him a broken program & let it crash, figure out he needs disk-space to capture the core & logs, needs training to hit 'ctrl-backslash' when suspecting a deadlock, etc - plus FTP the files back to Home Base, plus you need to be able to capture the files per-customer-per-bug (ala Bugzilla), then decrypt the file (gdb for core, eyeballs or a little perl for stack-trace logs), etc, etc, etc, .... No Rocket Science, but annoying bits of Engineering along with some Customer Management. plus you're relying on absence of evidence rather than proof that it can't deadlock.
Err.... No.
I'm not shooting for perfection here; I'm shooting for "good enough to ship". Which means that if - In Practice - each possible deadlock happens once, then gets fixed - that's almost always Good Enough. Maybe not to run a nuclear reactor, but definitely good enough to run large businesses. In practice, most possible deadlocks never happen - but a few that do require several go-'rounds before getting fixed properly. And then the deadlock is fixed, and remains fixed. Dev's don't like 'em, but they don't lose sleep over them either.
The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. :)
Yup... and those folks are generally stuck anyways. But if they are thinking about the problem ahead of time they are going to be way way ahead in the long run. My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic. The way Clojure handles this (I've been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a Ref, and any attempt at modifying it will fail unless you're in a transaction. Wrapping the add around the transaction would be the anti-pattern here, you want to make the transfer a transaction.
...
Okay, you kind of lost me with what you're trying to say here.
|
I Restate My Argument
Cliff Click
Trivial examples are.... trivial. They can be fixed in a thousand obvious ways. We need to extrapolate to the non-trivial, because that's the only place where the STM-vs-Locks argument becomes interesting. So lets pretend that instead of a single Ref _money and two classes, I've got 500 Ref's and a million lines of code - ala HotSpot (Sun's JVM). Easily >500 shared concurrent variables, about 750KLOC. About ~100 unique locks (some are classes of striped locks) guarding them in very complex locking patterns. Now replace all those locks with an STM & atomic. Is my program any more correct? Not really.... ...I might have avoided some potential deadlocks (HotSpot uses lock ranking asserts to avoid deadlock; deadlock rarely happens at the engineers desk and maybe once/year in the field across all HS users). The set of deadlocks-in-the-field avoided was miniscule. I'll concede that HotSpot is a rarely-well-engineered piece of concurrent code, and that deadlocks-in-the-field appear to happen more often to other large programs. But still, fixing after the fact is reasonable when the deadlock rate is so low and each fix 'sticks'.
Instead of deadlock, HS crashes far far more often because the locks don't cover the right set of changes. Switching out the locks for an STM didn't change what I was guarding; it only removed any fine-grained-lock errors (admittedly useful... but only incrementally more so than avoiding deadlocks). I'm still stuck with a program that's too Big to see where the proper atomic/STM/locking boundaries need to be. In a trivial example I can say "go up one call level and atomic there", but in the Real Program - I can't do that. Go up how many layers and add atomic? 1 layer? 10 layers? 100 layers? Yes, I see unique call-stacks with >100 layers. I can't put atomic around main because that makes my program single-threaded.
Here's where I want some kind of compiler support. Ref helps - because it at least demands I have an atomic around the Ref. But Ref is insufficient, because a syntactally correct program simply wraps an atomic around each Ref - exact what my trivial example did. I'd like to be able to specify classes & groupings of Refs that have to be wrapped in atomic - that the compiler will complain about. Yes I'm still responsible for the semantics - I have to tell the compiler which groupings of Refs are interesting - but I'd like some kind of automatic support, so that as my program goes from 10 lines to 10MLOC I can be told "you touched both Ref Person._checking._money and Ref Person._savings._money without wrapping both Ref accesses in a single atomic, thats a datarace error". |
Rich HickeyOn May 25, 11:04 am, cliffc wrote: Not interesting *in practice*. Because in practice, deadlocks are easy to find & fix.
I think you'll find plenty of dissent on that point. Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class. THIS is the source of vast numbers of bugs.
That's a problem with the Account class and OOP - classes are simply the wrong granularity for so many program semantics. Yet, programmers demonstrate they can handle this, and set appropriate transaction boundaries, because they do so in their database work. There is a pretty close analogy there, since many DBMSs will treat every statement in a batch as an independent transaction unless collected in an explicit transaction. In spite of the same risks, work precisely like this is getting done correctly, and relatively easily, I think in no small part due to the simplicity of wrapping a bunch of arbitrary statements in BEGIN TRANSACTION. Less trivial examples quickly spiral out of control....
What you are asking here is a philosophical question, which computers may never answer, since the answer depends on the semantics of the program, and such semantics will probably never be conveyed to the system more easily than:
(defn transfer [a1 a2 m] (dosync (add a1 m) (sub a2 m))) or the Java-esque: transfer( Account a1, Account a2, long m ) { atomic{ a1.add( m); a2.add(-m); } }
Absent some similar declaration of relatedness, the program will have to understand itself (or one program another), since while it may look like debit/credit to us, it just looks like foo/bar to it.
I hope you don't wait for an answer to this (hard AI, potentially impossible) problem before dipping your toes in STM - it's likely you could make a significant contribution. |
Rich Hickey
Towards that end, and more generally, it would be nice if there were metrics for success other than anecdotes from the field. There should be some meaningful problem statement STM and other solutions could take aim at, something more real-world than the typical STM paper's correctness test or Erlang's 'pass messages around in a circle' and 'accept connections until you die' benchmarks, and more attainable for a new technology than the million-line systems Dr. Click and Brian Goetz mentioned in their final Java One talk, or Erlang's nine-9s phone switches. Then everyone could run proposed solutions on their own 2/16/multi-hundred-processor systems and say - nope, not quite, this works, this doesn't.
Unfortunately, describing such a problem in an architecture-neutral way is difficult. I think there can be a certain presumption that any concurrency solution should allow people to architect systems much as they do now, as a graph of directly-connected mutable stateful objects. Dropping STM (and many of the other concurrency mechanisms) into such an architecture is likely to forever disappoint. IMO, if you're not serious about immutability, you're likely to struggle with concurrency.
On my side, points were taken re: transparency and control for STMs. The Clojure STM's current architecture is certainly capable of both better reporting and control. So I'll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become 'knobs' for performance tuning - timeout windows, retry counts, history cache sizes etc. So, I have some work to do before I'd like that kind of scrutiny :) |
Cliff ClickOn my side, points were taken re: transparency and control for STMs. The Clojure STM's current architecture is certainly capable of both better reporting and control. So I'll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become 'knobs' for performance tuning - timeout windows, retry counts, history cache sizes etc. So, I have some work to do before I'd like that kind of scrutiny :) I'm going to be a sourpuss again here. :-( This is exactly the trap MPI fell into; and you have to do it anyways. Double-unsmiley. :-( :-(
Here's the deal:
- I write a Large Complex Program, one that Really Needs STM to get it right.
- But performance sucks.
- So I do a deep dive into the STM runtime, and discover it has warts.
- So I hack my code to work around the warts in the STM.
- Crap like: at an average of 5 cpus in this atomic block the STM 'works', but at an average of 7 cpus in the same atomic block I get a continous fail/retry rate that's so bad I might as well not bother. So I guard the STM area with a "5-at-a-time" lock and a queue of threads waiting to enter. Bleah (been there; done that - for a DB not an STM but same-in-priniciple situation). A thousand thousand variations of the same crap happens, each requiring a different hack to my code to make it performant.
- Meanwhile the STM author (You: Rich) hacks out some warts & hands me a beta version.
- I hack my code to match the STM's new behavior, and discover some new warts.
Back & Forth we go - and suddenly: my app's "performance correctness" is intimately tied to the exact STM implementation. Any change to the STM's behavior kills my performance - and you, Rich, have learned a lot about the building of a robust good STM. You (now) know the mistakes you made and know it's time to restart the STM implementation from scratch.
My options limited:
- Scream at you to Not Change A Thing. Thus C/C++/Clojure Standards Committees are born. 1/2 smiley :-P
- Cling tenaciously to the abandoned version, and recognize my code is now abandon'd ware. No new features or support for me. But maybe the App is running fine and I can forget about it.
- Rewrite my app from scratch Yet Again, to match the new STM's warts/features.
MPI is in this position. Every large parallel scientific App Out There is intimately tied to the marriage of parallelizing_compiler + MPI_runtime + computer_archicture. Changing any one of those pieces requires the app to be rewritten from scratch in order to achieve a fraction of the required performance.
The parallelizing compilers have stablized in a performance way. There's a coding style which will auto-vectorize/auto-parallelize/auto-stripe/etc and there's some codes "on the edge" (some compiler+hardware combos yes, some no), and there's a well known "don't go here, the compiler can't hack it". The compiler support for STM is very much lacking and/or in-flux. Your heading into this terrain now, as you add Clojure compiler support to your STM. Me, as an STM user, can't know what's in store for me as you go through the learning process. I know that I don't know what STM coding styles will be performant and what will not.
The MPI_runtime has now stablized (I believe, not sure) after 20+ years. Again there's the "this is good, this is marginal, this is bad" folk wisdom that drives coding. Again, for STM's, this area is very much in flux. Go read some of the bazillion academic papers Out There. Everybody's got their own take, every STM is good in this domain & bad in some other domain - and all the domains are different; god forbid I write a large program dependent on STM 'A' and later try to switch over to STM 'B'.
The computer_arch has been stable for message-passing for some time. For STM, I believe it's trying to ramp-up. i.e., the computer_arch for STM is "all the world's an X86" right now, and all hardware vendors are furiously studying what it would take to add some kind of STM/HTM/hybrid support.
Thus, for STM to make it to the 'Promised Land' - the STM industry needs to figure out:
- what belongs in the compiler
- what belongs in the runtime
- where there are Dragons and where there are Green Pastures
- teach the STM users which paths lead to Dragons or Green Pastures.
If we BOTH don't go through the excercise, then STM will never hit the promised land.
MPI never really "made it"; the 'Green Pasture' area was too small, and it was always surrounded by steep performance cliffs leading to Dragons when you slipped off the edge. Upgrade your hardware: rewrite your app. Upgrade your compiler: 2 months of performance debugging. Upgrade your MPI: 2 months of performance debugging to discover you need to rewrite your app.
GC "made it"; the Green Pasture gradually got bigger & bigger; Dragons remain lurking - but only after you filled up an ever-growing Green Pasture and needed to poke at the edges of stability. |
We Agree to Press On, Bringing Light into the Darkness
Raoul Duke
it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-) |
Rich Hickeyit would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)
That could be fun. |
Cliff Click it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box :-)
That could be fun. Send me an email; Azul hands out 'academic' accounts all the time. You get a remote login; you upload your code & run it. And, if you don't mind, I'm going to edit this entire thread for clarity and echo it on my blog. This is good stuff (this conversation), and I definitely wish you the best of luck. |
Cliff
Category: Web/Tech | Permalink
| Comments (54)
| TrackBack (0)
Post JavaOne Update
May 12, 2008
I'd like to say I attended dozens of JavaOne presentations, but I barely managed to attend two plus do my own three talks, plus interview 3 job candidates (we're going to make all 3 offers), plus attend a Scala working dinner with Martin Odesky, catch up with some old friends, surf the pavilion, meet with 3 or 4 potential customers, visit the infamous Puzzle Pirate's HQ and have dinner in the oldest continuously operating restaurant in SF.
I attended Bill Pugh's "Defective Java Code: Turning WTF Code into a Learning Experience". As usual for Bill, the talk was very entertaining with plenty of OMyGod code examples from big well-known projects, all found with static analysis and pattern matching using his fabulous FindBugs. I found a good writeup in this blog, but I can't find a copy of the slides online.
I attended Brian Goetz's "Let's Resync: What's New for Concurrency". Again, I found a good writeup here but no slides. The Big News: Doug Lea's Fork-Join is coming along well and will be in Java7. It promises to be a very easy to use well-scaling divide-and-conquer style parallel framework. I'd love to see how it works on one of our big 768-way boxes.
I gave 3 talks:
- Debugging Data Races. I talk too fast, or at least I did during the memory-ordering portion of the talk. Most folks liked the 2nd half (some fairly concrete examples and suggestions), but the 1st half needs some timing/pacing.
- Towards a Coding Style for Scalable Nonblocking Data Structures. Sample blog entry here. Biggest complaint from the talk: I went *way* to fast. Sorry - it's a talk I've given a few too many times and I'm getting lazy about presentation. The code from SF high-scale-lib got a bump: 50 downloads this last week. I'm always psyc'd when I hear feedback from this talk; it looks to me like a 1-hit-wonder (well 2.5 hits) but I'm always left with the notion that with better tools (ala what the hardware folks have built) that this approach should have some legs.
- JVM Challenges and Directions in the Multi-Core Era. I split the last talk with Brian Goetz. Two-speaker talks are always hard to pull off, but I think we managed a great job. Sample blog writeup here. Funniest question (to me): "How's .Net doing here?" My answer: "I am not my brother's keeper!". Seriously, I've had dinner talks with some of the .Net runtime group, and they are shockingly out of touch here. I forsee a large multi-core train thundering down Microsoft's self-induced tunnel vision.
Azul News: We're hiring again! We managed to survive the downturn in the banking industry. We spent the last year widening our customer base, and it's finally paying off. Plus the banks are buying again! I interviewed 3 candidates who were in town for JavaOne. I think we're making offers to all three. Mentally exhausting, moreso for them than me I daresay.
I had dinner with the bay area Scala working group (sorry couldn't find any official link), and Martin Odersky at Buca di Beppo's. It was a fun geeky talk/dinner. I'm please to report that (unlike say JRuby) Scala "maps" well onto a JVM. Code performance is probably very transparent (very sad, but no personal experience here) and very good. Martin claims he made no real compromises in fitting the language to the JVM (but perhaps some minor warts with 'null' or a lack of unsigned types). A harder issue is getting the Actor model to map well. I personally am a fan of "no shared variables" and I hope he gets a good fit figured out. Like most functional languages, Scala would like to see tail-recursion promised from the JVM vendors/spec (and rumors from John Rose that this might happen to HotSpot). With tail-recursion, the cost-model of tail-calls is that they cost no more than a 'goto' (which after JIT'ing can be quite low indeed). The moral equivalent for data-structures is linear logic. The key here is allowing the JVM and/or JIT to figure out when a message cannot be referenced by the sender again and hence the "message passing" isn't really a copy-cost - it's no more than a pointer change. I.e, similar to tail-recursion, the cost-model becomes "pointer copy" - such that you can look at the code and know "this operation is (almost) free". This is one of the major strong points of the C language - knowing a code's "cost" at a glance.
I also got the guided tour of Puzzle Pirate's headquarters. All I can say is: Wow. I'd love to work in that space. From there I took a hike to have dinner at the Tadich Grill. This is the most non-San Francisco restaurant I've ever eaten at in SF. The both the staff and clientele were as about as white as you can get. The food and service (and prices) were all great - but I definitely felt the lack of SF "color"! No blue-dyed spiky hair, no leather&stud bracelets, no piercings (other than female ears), and exactly 2 customers not as pasty-white as myself (and I arrived with one of them). It was so different, that it felt like a refreshing change from the usual SF polyglot... but I was glad to walk out the door and see somebody who didn't look like a mirror image of myself. Like programming languages, sameness is bland in the short term and a form of death in the long term. Viva 'la difference!
Cliff
Category: Web/Tech | Permalink
| Comments (4)
| TrackBack (0)
JavaOne
May 1, 2008
JavaOne is fast upon us. It's a big year for Azul; we have 4 talks this year. I'm giving 3 and Gil Tene, our CTO, is giving the other one. I'm mentioning his talk specifically because it "sold out" and JavaOne kindly gave him a 2nd session in a new larger room at an earlier time - which will now be cavernously empty unless you sign up! The talk is very interesting (to me) - because it's essentially the GC version of "How NOT To Write A Microbenchmark" - it's full of specific advice on how to benchmark GC for large applications, and what traps to avoid when trying to write your own GC benchmark or testsuite. Some observations: SpecJBB2000 was originally intended as a Java middle-ware benchmark, yet was specifically designed so that GC never happened during the time portion of the run. Do you never see a GC cycle during the "timed portion" of your Java workloads? (SpecJBB2005 came about partially to correct this). <link to slides coming soon>
Debugging Data Races is a re-run of a BOF I gave last year, but the material is more relevant that ever. Slides here. I had a number of comments from people who basically said "you shouldn't write datarace bugs in the first place" implying that (1) writing such bugs is Bad because they are so hard to find & fix, and (2) you wanted to write such bugs in the first place! My advice from last year still holds, and indeed I've got more evidence than ever that the bugs I'm targeting are the common bugs that bite you, and what I'm proposing is very accessible (no Phd required!).
Towards a Coding Style for Scalable Nonblocking Data Structures is ... well, hopefully the title is fairly self-explanatory. I've talked this on this blog before; slides here. The basic idea is a coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a Finite State Machine built from the atomic update and logically replicated per array word. The end result is a coding style allowing me to build several non-blocking (lock-free) data structures that also scale insanely well. All the code is available on SourceForge. At this point I'm sore tempted to add transactions to the non-blocking hash table and essentially create an uber-data-grid.
JVM Challenges and Directions in the Multi-Core Era is where Brian Goetz & I rant about the lack of any decent concurrent programming language. Slides here. Java's the obvious best State-of-the-Practice concurrent programming language, and there are plenty of better "academic toy" concurrent programming languages Out There. <soapbox> But none seem to tackle head-on the main problem programmers have with large concurrent programs: lack of name-space control over inter-thread communication, i.e. shared variables. By default in Java, all global variables are shared with all threads. Instance fields and code can be made private or public but you can't limit access e.g. to specific threads. For example: all your DB threads in the DB thread pool can run all the code in your crypto-lib, should you accidentally let them get a crypto instance. No language or compiler support to throw some kind of early access-error. Erlang and CSP deny all global variables; Fortress, X10 & Scala provide convenient ways to express some kinds of parallelism without shared variables, but amongst the recent more well-known crop of academic-toy languages I don't see anybody tackling the access-control issue head-on. </soapbox>
All right I'll stop before I start really ranting, but it's a topic I'm passionate about. JavaOne promises to be an exciting time, come and see the show!
Cliff
Category: Web/Tech | Permalink
| Comments (8)
| TrackBack (0)
Earth Day
April 22, 2008
Today is Earth Day,
a celebration of all things Earth-friendly. As a Foolish
Earthling, my family & I have decided to "put our money where
our mouth is": we've invested both in solar
panels and in a CSA.
Regrid Power installed the panels,
which arrived two months early. What was slated to happen in
June is already installed on our roof! As of this writing we are
eagerly awaiting inspection by PG&E before they can be turned on.
The installation went fairly smoothly and is basically invisible from
street-level. We suffered one day of roof-noise and three days of a
work crew (very nice guys, very respectful of our privacy and property)
wondering around our backyard and roof. The economic theory goes
something like this: during the hottest parts of the summer the panels
will be generating at their peak power; we'll sell the excess back to
PG&E at peak power rates. During the evenings & winter season
the panel won't generate enough, so we'll buy power from PG&E at
the off-peak rate. PG&E won't actually give us any money but the
income from the excess power will be used to offset our normal power
bill. This also means we'll be able to run our A/C during peak hours
with a much reduced bill.
Community Supported Agriculture is basically a collection of
people directly paying the farmers in advance for the food they are
going to grow. We've signed up with Live Earth Farm. For
$30/week, we get about 15-20lbs of fruit & vegetables every week
for 8 months. It means the farmer gets a steady income and we get
veggies often within 24 hours of being pulled from the dirt. We've
been loving the change in diet, we're eating a lot of new (to us)
veggies we would never have tried otherwise. Another good sign: our
twin apple trees have had plenty of blossoms this spring, and lots of
bees wondering around them. No sign of the yellow
jackets that plagued us last year. The kids & I are
anticipating apples galore this fall!
And now for the technical bits: Azul gear is very
power efficient for server hardware. We strive for high throughput,
which means high core count (so tons of the very slow memory ops in
flight at once - a fast out-of-order X86 might get 3 or 4 cache misses
in flight at once while we routinely get ~150 memory ops per socket in
flight at once). High core count means small simple CPUs; these simple
CPUs have short simple pipes and are hard to clock at high frequencies
- and less productive to clock faster (you'll just hit the next cache
miss quicker). So our clock rate is modest - and this has a huge
impact on power. We aim for high throughput and as a bonus we
get low power: about 65W per die when going full blast (we're also
using DDR3 instead of FBDIMMs, which is another substantial power
savings). Here's another way to do the math: 65W/socket and 48
core/socket means ~1.3watts/core. Not embedded-systems terrain but
definitely waaaay low power for high-end servers! Azul also buys the
CO2 emissions
offsets for every box we sell, both the manufacturing emissions and
the lifetime operating costs.
My kids are a daily reminder to me that
There's Another Generation Following Us.
Happy Earth Day to You,
Cliff
Category: Web/Tech | Permalink
| Comments (0)
| TrackBack (0)
I Went to Boston And All You Got Was This Lousy Trip Report
April 14, 2008
CGO was in Boston this year. I wanted to attend the workshops, so I
needed to arrive by 1pm Sunday - there's no way to do that from the
West Coast without an extra overnight stay or taking a red-eye. I was
fighting off a cold, so I took a generic night-time-no-cough-do-all
pill just before takeoff. JetBlue has the most legroom in coach and I
got the exit aisle to myself: within 10 minutes I was laid out flat and
snoozing. This had to be the comfiest 5 hour flight I've ever taken.
The winds blew in our favor, and we arrived a full 1/2 hour early:
4:30am local time in Boston. I yawned my way down to the
ground-transport level.
I lived in Boston briefly maybe 12 years ago; I remembered that it's a
nightmare to drive around in but the local subway ("the T") is fast
& easy. I bought a week-pass on the T. There's a bus from the
airport to the nearest T stop; I hopped on the bus figuring I'd have
some Real Boston Experiences while trying to re-figure out the T.
First snag: the bus driver informs us that the T doesn't start running
till 6am on Sunday! Oops - looks like I got a hour wait in the
airport. While hanging out, Manish Vachharajani (figured out how to
compress trace files using BDD's, both massive compression and you can
do many trace-file analyses directly on the compressed format) jogs up:
he too was stymied trying to take the T. We end up splitting a cab.
The Omni Parker House claims to be the oldest continuously operating
hotel in America. Certainly the building is very old, but with that
old-hotel charm (gilded ceilings, massive old chandliers, dark wood
paneling everywhere, helpful-not-intrusive staff). It's smack downtown
Boston; right between the old North Church cemetery (buried: John
Winthrop, first governor, and other loyalists) and Granary Burying
Ground (buried: Sam Adams, John Hancock, Paul Revere &
revolutionaries).
I paid for maybe one dinner, AMD paid for another, Google paid for
another. I eventually got to use my T pass to get to and from the
Google reception (during rush hour: so the subway was a crush but we
beat the bus) and also to the airport, so the pass wasn't a waste.
As usual for these trip reports, I skip some sessions and brutally
summarize what I do attend. Apologies in advance if I gave your paper short-shrift, feel free to correct me here. I actually attended 2 workshops and the CGO conference proper:
- EPHAM homepage
-Workshop on Exploiting Parallelism with Transactional Memory and other
Hardware Assisted Methods
- STMCS
homepage - Third Workshop on Software Tools for MultiCore Systems
- CGO homepage - 2008
International Symposium on Code Generation and Optimization
Here's the talks I attended, sorted roughly by how important I feel
they are (and not in any time order):
I attended at least 8 more talks, who's notes are not worth repeating
here. I got to see snow, slush and drizzle; hiked around in 30 degree
weather with cars splashing salt/sand/mud over the cracking stone
curbs, dodged rusty metal poles, wandered through mildew-covered
crumbling buildings and graffiti covered subways and generally
remembered all the reasons why I left Boston so long ago. It's a fun
place to visit, but I'll never live there again.
Cliff
Long talk with Maurice Herlihy about his
current work
He's building STM solutions at a higher level of abstraction, e.g.
adding transaction support to a hashtable; the underlying structure
(hash table in this case) can be anything with reversable operations.
He basically builds an 'abstract lock' lazily as the transaction
progresses, by taking locks covering all the 'keys' used in the hash
table. A collision on the abstract lock means a (statistical likely)
collision on the actual transactional data, and he can roll-back
patially executed XTN's by using the reversable operation.
The actual covering locks are just a pile of striped locks in some
large hash table, so contention can be made arbitrarily low. Also, the
underlying datastructure, e.g. a hashtable, is treated as a black-box.
We could, for instance, add transactions to my NonBlockingHashMap
fairly easily with this idea.
Paper:
ftp://ftp.cs.brown.edu/pub/techreports/07/cs07-08.pdf
http://www.cs.brown.edu/people/ejk/papers/boosting-ppopp08.pdf
Private conversation w/Manish
How to speed up worklist/task-queues.
His approach (low latency focus): keep producer & consumer
seperated from each other by more than a cache-latency. i.e., producer
stuffs into a ring buffer, consumer pulls from ring buffer. Ring buffer
needs to be sized such that producer & consumer are not
ping-ponging lines during worklist push/pop ops. Clearly the cache
lines ping-pong - but you want them to ping-pong once per cycle around
the whole ring buffer - not once per each time you insert or remove.
Arrange to slow down consumer if it "catches up" to producer, because
if the consumer gets to close, he slows down (the already slow)
producer by making the producer take more cache-misses while inserting.
Same in reverse for slow consumers (but if you have slow consumers then
work accumulates ad-infinitem and really you need to throttle the
producers).
My approach (scalable focus): stripe queues ab-absurdem until all
queues are of length 1, hence really just an array word holding either
null or task. Push/pop requires scanning the array for empty/full
slots. Real trick for me: proper sizing of the array. Huge scanning is
expensive, but some scanning is OK.
Combo: big array, with parallel insert/remove threads. Want an
auto-resize heuristic, and an auto-"back-off" heuristic. Auto-resize:
need a bigger array if getting too full (producers can't find empty
slot). In the make-bigger logic, can throttle producers (sleep until
some reasonable # of empty slots appear). Need to make smaller array if
consumers can't find producer's work: if typically end up scanning a
larger fraction of array before finding work (it's OK to scan thru a
large empty array IF we're following just behind a producer, so really
scanning very little). Auto-"back off": if see CAS failures for
insert/remove then we've run up against another threads' (could be
consumer or producer) cache. ie., we're making the other guy slow (and
not being fast ourselves) - so "back off". Either pick a different spot
in the array to work, or sleep or grow the array.
Keynote: Vivek Sarker (now at Rice)
Need more emphasis one auto-parallization.
Some recent successes have improved common use cases a bunch.
Lots of cores now. Lots of libraries for parallel languages; java
concurrency, Fortress, X10, intel tbb, Cilk, MPI, .NET parallel
extensions
So... compilers are "under seige" - the "old model" of: parse-to-IR,
then
optimize, then reg alloc, then code emit... isn't working. No room in
the IR for all the new parallel paradygms.
Vivek suspects a coming paradygm shift (me too as well).
Shows what breaks if we include parallism in the low-level IR: e.g.
parallel CFG; the usual "dominator" relationship doesn't hold; no
dominator tree. But this analogy is flawed (in my mind) because the
DATAflow still happens "as-if" the parallel blocks executed
sequentially, only the control-flow is parallel.
But probably want some form of DATA-parallel use; admit that the usage
is racey in the IR, and compiler is allowed to take any value that
could reach - or even emit code that makes that choice at runtime.
Goes on to (attempt to) derive a parallel IR onstage in a keynote.
Long discussion followed; he promises to look at using the open-source
HotSpot along with Jikes. More-or-less invited me to come visit Rice
sometime and check out the grad students.
Keynote, ATI, Norm Rubin
Pixar: 100,000 min of compute per min of image
Blinn's law: time-per-frame is constant. It's 1000 machines (all the
movie company can manage) and 100 min/frame (all the schedule allows).
Every new movie, toss out all old machines & buy new: they are
faster. Faster machines lets the graphics become more realistic.
8 PCs; 8 groups of 64 lock-step SIMD threads = 512 threads.
Each individual instruction is a 5-way VLIW.
(VLIW packing rate varies from 3.4 ops/VLIW to 4.2 ops/VLIW).
Pipeline is fairly long, but it's 64x5 = 320 fp-ops / cycle.
Loads block a thread till they resolve.
Each SIMD pipe has 256 registers; each register is 128 bits. If each
thread needs 5 registers, then 256/5 = 51 wavefronts can get into
run-queue. 51 wave-fronts = 3264 threads per simd or 13056 running or
waiting threads. Roughly 262,144 registers or 1meg of register space.
GPU programming model: producer/consumer relation. Hundreds of pending
GPU instructions. GPU runs a loop 30-60 frames/sec; build vertex
buffer, set uniform inputs, draw.
User writes: vertex shader, pixel shader. Nothing about parallelism, or
number of cores, or sync. That's all added by the GPU JIT. No user
program has a data-race; the language doesn't allow it.
Open interesting problems in register allocation.
Sawzall
Robert Greismer
Google server farm, map-reduce.
read/write using 'protocal buffers'
GFS
WorkQueue - scheduling system for cluster of processors. Tries to
co-locate tasks near data (no network i/o, just disk i/o).
MapReduce - abstracts the distributed system
Filter code is auto-parallelized, and fed into aggegrators. Not sure
what's new here compared to StreamIt on a grand scale.
Sawzall - language of MapReduce. Domain specific. Like graphics, very
parallel- but no concurrent programming.
Aggregators can be clever: cannot keep all the data coming in; must
reduce the data. Compute sums, averages, std devs, top N.
Domain-specific language to write aggregators; strongly typed and
concise. Mostly simple, regular syntax, close to C. Specific basic
types (strings, time, fingerprints, table types); value semantics, even
for structs (no pointers). Lots of string regex support.
Explicit "undefined values"; want to abort sooner rather than later;
willing to test for explicit "undefined value"; tests are not
auto-inserted everywhere, but must have explicit test. At runtime, can
switch a flag to make undefined values silently ignored (better
to ignore e.g. network corruption rather than crash).
Quantifiers in the language: "when (i: some int; P(a[i])) {...}" will
iterate over array 'a' for all indices 'i' and run some filter 'P'
first. 3 forms: each, some, all. Nest nicely. Seems similar to
Fortress. Generates bytecodes, which are then JIT'd. Ref-counting GC
(value semantics, no cycles possible); uniform object model (all things
are ptr's - even int's).
Cliff Notes: Hey! We *could* do some kind of decent sample-based
profiling in the HotSpot Tiered World. Periodically have a sample-thread wake
up and hack the inline-caches of the top hot methods to call their C1
equivalents. Have the C1 code patch back to the C2 code, but then run
on collecting info (for 1 method invocation). IBM has already shown
that if the frequency of profiling is >0.1% then you get good-enough
profiling info; we'd only need to hack the inline-caches rarely to get
decent continuous C1 profiling info.
Hardware Acceleration for Lock-free Data
Structures and Software Transactional Memory - AMD
AMD - Advanced Sync Facilitiy - not in silicon yet.
ASF limited to 8 cache lines. Issue a set of 'locked' loads.
'acquire' starts critical section. Checks for consistency. Stores rIP
& rSP for rollback. No other registers. Then allow mods to the
locked locations. Finally 'commit' (or abort). Use ASF for the 1st 8
lines of a TinySTM transaction, then switch to software afterwards.
Cliff Notes: More or less, this is AMD announcing HTM support
in their lab: free (good) simulation available, but no silicon. The
hardwired '8' is a lower bound: you are guaranteed you can atomically
update at least 8 lines. Several (most?) lock-free algorithms require
atomic update of 2-4 unrelated addresses. This is an architectural
guarantee that such algorithms would work on all future AMD processors
with this support. No actual hardware has been announced.
Energy Concerns using HTM
Maurice Herlihy
No space in a embedded to do a full STM or even a hybrid TM; so expect
multiple cores in an embedded system with a simple HTM.
Basically, abort & retry costs power.
MP-ARM - has a 512b transaction cache (fully associative), plus an 8K
L1. 128B scratchpad. On an abort, the CPU rolls-back the registers from
the scratchpad & enters lower-power mode. It wakes up after a
random interval & retries.
T-Cache costs them 20% more power - but make it up with shorter running
time. Generally a win for apps with contention (less total power than
locking & serializing).
Cliff Notes: Basically it's ARM's version of HTM
Synchronization Aware Conflict
Resolution for Runtime Monitoring Using Transactional Memory
Use TM to detect dataraces. AUTO-insert TM begin/end around
regions
doing conflict detection. Tried it per basic-block in SpecInt - too
expensive. Tried to make a TM per 8-12 basic blocks; then overhead is
down to 10% or so. ( Cliff Notes: What TM are they using? Since
the TMs are small, they assume hardware TM which explains the low 10%
overhead).
Turns out auto-inserted TM causes lots of live-lock issues. He demos
several live-lock scenarios with reasonable looking TM hardware &
spin-locks.
So they spot spin-lock patterns and do not auto-insert TMs around
spin-locks... no they make the hardware spot spin-locks and break the
TM in a way that makes progress.
Automatic Array Inlining
Christian Wimmer
Uber object-inlining, for arrays. Great for short fixed-length arrays,
ala what comes out of Reflection.
Really it's object co-location, which the GC preserves OR will tell the
JITs to recompile.
- phase 1 - decide what to co-locate (profile in C1); tell GC
- phase 2 - GC co-locates all such things; tells JIT
- phase 3 - JIT recompiles assuming co-location.
Also JIT must detect that objects are *co-allocated*,
and not modified later (implicit final fields: detect changes with
field guards).
This technique is easy for known fixed-length arrays: it's just like
inlining an object. Also works for unknown-length arrays that are not
changing; the "tail" of the inlined-object-set isn't known, but isn't
needed.
Can handle changing the inlined array field itself. Can fold the array
length check into the array field guard: set the old array length to 0
(the old array is ONLY pointed to by the original object, this is a
precondition to inlining, so no other object/thread can see the length
change). If the range-check vs 0-length fails, 1st go to some slow
code which re-checks after loading the proper field. Eventually the GC
resorts/realigns, and then the fast code works again.
Also can handle having remote /shared pointers to the inlined array,
even with hacking the array length: have 2 array lengths. The "normal"
length used by non-inlined code, and "inlined" length used only by
inline-smart code.
Also can do inlining objects into arrays: first lay out the array, then
have the GC place the objects following the array in order. Alas,
harder to detect changes & harder to detect setup conditions. They
gave up here - but seems to me you could do it if GC tells you the
objects are unshared (only pointed-at by the array), and you catch all
array-writes to that array (by tagging the array as read-only after GC
does the layout, perhaps placing it in a read-only page).
Get 20% speedup on some SpecJVM98's.
Alas no speedup on DaCapo. :-(
Very surprising to me that no speedup on DaCapo; I would expect the
String co-location to be worth alot - aha he's stuck with the old
JDK strings which share char arrays, so it doesn't work for him. I
predict a ~10% DaCapo speedup on Azul based on the Strings optimization.
Branch-On-Random
Azul looked at this earlier and declared it a good-idea for our next-gen.
Basically, branch-on-random powers-of-2, i.e. either 50% or 25% or
12.5% or ..., etc. Use it to lower the costs of sampling. Typical
sampling cost uses a counter and every N (e.g. 1024) executions we take
a heavy-weight sampling pass. But there is a cost to test&dec the
counter, uses an extra register, extra control flow, extra cache, extra
instructions.
'brr' is just a conditional branch, but the condition is a frequency.
Only need 4 bits to cover all reasonable use cases. Low hardware
requirements (a LSFR, total maybe 100 gates including mux's); fits well
into existing pipelines. Can be made tolerant of branch mis-predict
(i.e., can get the info very early in the pipe).
Multi-core Thread-level Speculative
Parallelism
Discovering lots of speculative parallism opportunities in sequential
code. Lots more if you are willing to do large code transforms.
Ex: while-ptr-chaining-loop finding the min.
Ex: nest the above loop in a worklist which pulls the min element, then
tranforms the list (ala a standard worklist algorithm).
Use value prediction to allow thread-level speculation. Basically,
start a parallel loop at a random value-predicted pointer location.
After a few iterations either crash or line up with some linked-list
body. Actually, capture some value on the 1st time through the inner
loop; on later iterations thru inner loop have some good choices for
prediction. There's some decent auto-load-balancing going on; each
iteration of the inner loop records enuf info to allow a good
(re)partition for the next iteration of the inner-loop.
Technique does not only do linked-lists; needs any double-nested loop.
Inner loop gets parallized, outer loop allows the inner loop to be
continously 'profiled' and 'load balanced'. Really can parallize other
'hard-to-parallize' inner loops. Would work well in Java; I probably
could do it in C2 with enuf effort.
Analyzing perf diffs in JVMs
Using "calling context tree" - similar to our 8-deep call-stack capture.
Doing 'diffs' on these trees. Diff'ing based both on structure, and on
edge weight. Doing things like "adjusting" the weight of top-nodes in a
tree so 2 trees match in weight, then looking at diff's farther down
the trees. Also looking at weight of normalized subtrees, etc.
Comes up with a fairly 'diff' like interface (cmd-line) to compare 2
forests of trees and highlight the diff's.
Able to track down differences in 2 websphere/trade6 setups.
Discovered differences between timezone settings between AIX &
ZLinux, and also DB setups.
PiPa - Pipeline Profiling
(best student paper)
Faster tool for doing e.g. cachegrind. Parallize instrumented
application. In SuperPIN - idea: main app just forks off another thread
periodically to do the instrumentation work. (how is this different
from jamming ops/time into a buffer, and having background
threads pick up new buffers & gzip 'em).
Idea: process the background buffer in another thread; do full analysis
on the buffer(s) in parallel - in real time with the app running.
Demo running a cache-simulator, in "real time". Using
double-buffering, large shared buffer. Pretty obviously equivalent to
Azul's tracing mechanism, except she's doing a cache-sim instead of a
gzip. Can run the original program with 3x slowdown, but get a full
cache sim.
Prediction & Trace Compression vs
Nested Loop Recognition
Looks cool: find simple loops from traces; loops are simple linear
algebra. Imperfect loops allowed. From loops, find nested loops. Get
massive compression ratios vs bzip & other trace compression
techniques. Very simple to decompress; just "execute" the loops.
Fast Liveness Checking in SSA form
Answer based on demand: is var live here?
Faster precomp, slower queries. Info depends on CFG & def-use
chains
Invariant to adding & removing instrs; sounds useful for reg alloc
adding & removing spill ops.
For HotSpot, liveness computation is probably 10% of total compile time.
Compute 'highest point' in dom tree where query point is dominated by
def point. Compute all-paths dom-vs-each CFG node
- Compute transitive closure over reduced CFG (simple DAG). Simple
post-order traversal
- For each block q, compute highest block (highest backedge target).
Simple pre-order & post-order traversal.
- Arrange for highest node to be first set bit in trans closure
Query: goto highest block; see if def dominates highest block.
Looks fast & simple. Might speed up reg-alloc.
Nearly Optimal Instruction Selection on
DAGs
Stupidly slick: tile the DAG once, just the way HotSpot does right now
- except allow sharing always (right now HotSpot does "CSE" on sharing,
forcing shared values in registers - except for small expressions that
are common in address-arithmetic).
Then look at each place where a shared value was subsumed into larger
instructions (hence replicated in the instructions). For just those
places, see what the local cost would be if you did not share. If the
cost is better to not-share, flag the node as getting it's own register
result. Then re-run the DAG tiling, but living with the shared results
from the 1st pass.
Ex: ((a+8)*(b+8)). Should we share the '8' or not?
Pass 1:
t1 = a+8 // 5 bytes
t2 = b+8 // 5 bytes
t3 = t1*t2 // 1 byte
But if '8' goes into a register:
t0 = movlo 8 // 5 bytes
t1 = a+t1 // 1 bytes
t2 = b+t1 // 1 bytes
t3 = t1*t2 // 1 bytes
Reduces code size by 2-4% nearly always, and very close to optimal.
What hotspot does is essentially what he calls 'cse-leaf' in paper. His
NOTLIS does somewhat better.
Cliff Note: I did a bunch of Azul-specific hackery in our DAG
share-vs-clone heuristic to try and get what he gets naturally.
www.cs.cmu.edu/~dkoes -
grad'ing soon
Exception Check Motion for Java
- Run aggressive PRE, ignoring exceptions checks
- See if can move checks to allow the PRE
- Undo any PRE motion where the checks can't be moved
Got ~2% on production compiler - which confirms my suspicion that
there's not much more to be had there. Mostly moving null-checks around
- and HotSpot is already very aggressive about moving & removing
null checks.
His examples HS will all get another way; loop-peeling and Split-If
transform will wipe out all his gains.
Another unsafe code-motion talk for
Java.
He's got better examples: loop-peeling won't wipe out his first example
(but versioning will).
Safety checks turned into data-dependence, just like HS
heh, another talk that' almost getting as good as HS has been for the
last 10 years.
Then convert data-dependency safety checks into IA64 predication bits
and happily predicate speculative loads. Seeing 4% for just the PRE
based motion (on SpecJVM98), another 2% for doing hardware predication
& speculation bits.
Towards Fair, Scalable Locking
Barcelona Supercomputing & Microsoft
Scalable reader-writer locks (something I've thought about alot in the
last year). Want fair (no starvation of either readers or writers).
Want fine-grained striped locks (? specific to different data types)
Demo's starvation in a HTM system similar to Azul - eager updates can
mutually kill each other. Demo's cache-directory-reservation can lead
to deadlock.
So far seems like a more complex version in hardware of something
that's easy enough in software. It's only partially fair (not FIFO).
Preserves proportions of readers & writers.
Using STM. Overhead looks very bad.
Runtime Tuning of SM Validation Techniques
Rice University
STM validation is expensive; no one technique wins all cases. Use
runtime to dynamically change validation scheme. Periodically (every 5%
of TM attempts) run the same TM with 1 of 3 different validation
techniques; if the new technique is working better than the old one,
then keep it - otherwise immediately revert back to what was working
well.
Transactional Support for Dynamic Languages
Looking at Jython - get to use the fast JVM. To make Python bytecodes
need some huge bad wrappers around every little variable access. This
makes Jython as slow as regular interpreted python.
What we'd like to do is skip the fancy python lookup and use plain Java
variables - except that we want to keep the good python dynamic
semantics. Would "lookup x in y" to be as fast as a local variable
reference.
So use transactions to test the pre-conditions & use fast
specialized code otherwise. i.e., execute inside a transaction &
test preconditions.
Use best-effort HTM. Add a few specific speculation primitives to JVM
(speculate, abort, commit). Jython generates both specialized-case
fast-code and also full semantics. Use HTM to run the fast-case. If
that fails, fall back to the full semantics bytecodes.
Cliff Notes: Sun Labs guy noticed that this technique requires
abort info from the hardware (was it capacity vs remote thread) - does
Rock TM lack the failure-mode data?
User-defined Priority Based Transaction
Contention Management
Really, this is poster case for why TM's will not save the world. He's
proposing users can put some kind of priority on transactions - and
higher priority transactions can "barge" over lower priority ones. What
a mess!!!
Map-Reduce for GPU's
More general term than Google's map/reduce. Really its stages of
independent computation, followed by communication.
GPUs are great for doing 'map' - lots of GPUs, lots of bandwidth. But
only local communication is cheap; global communication is very
expensive. Also, reduction on GPUs is well known to be very difficult.
Lots of PHDs going into deciding how much unrolling is best, or how
best to split up computations. Sometimes see 60x slowdowns if the
best-reduction for one dataset size is used for a different dataset
size.
User writes a MAP function in CUDA (NVidia's C+GPU operaters). User's
map function outputs results in a per-thread local-store. User writes a
binary associative reduction operator - takes 2 inputs and returns 1;
and his framework will restructure the reduce step for best
parallelism. (also needs an identity value, in case some MAP results
are filtered out)
Analysis of CUDA Programs
auto-detect race conditions (detects potential dataraces!),
detect bank
conflicts (extreme poor memory access patterns)
CUDA abtractions: kernel functions, threads, scratchpad ram
Cores are SIMD; execute program in lock-step; program is in scratchpad
ram.
2 global bookkeeping arrays, reads & writes to global memory. two
per-thread arrays: reads& writes of a single thread. After each
access to a global, check for a race.
Scratchpad is 16-way banked per 16 threads; fast as a register if no
bank collisions. Bank conflicts can reduce performance by a factor of
4; slowdown equal to not using scratchpad at all.
Streaming Applications on Multi-Core
Systems
Boxes using different hardware for streaming: gpgpu, fpga, webcam,
disk, network, etc. Very hard to simulate & debug when lots of
different parts are moving.
Auto-Pipe is a set of tools to create, test, build, deploy, optimize
distributed streaming apps. Lots of pipeline parallelism, and regular
parallelism within a pipe stage.
Auto-Pipe - define a set of blocks and edges or computation &
communication. Also describe the hardware layout: cpus, cores, fpus,
etc. Then provide a mapping from blocks to hardware chunks. No need
to worry about communication between the blocks.
Phase-Based Adaptive Recompilation in
a VM
Using hardware perf counters to detect phase-changes, and trigger
recompiles. There's been a lot of work in this area already, I'm not
sure what's new here.
Finding Critical Instructions Without
Hardware
"critical" - instructions on the hot path. Create OFF-line synthetic
instruction traces based on existing hardware profile info, including
branch mis-predicts & aliasing info. Then analyze those traces,
and compile (schedule) to those.
Take a random-walk thru binary, throwing a weighted-die at each branch
(starting at a random spot using weighted-die). Emulate a call stack
for call/return sanity. Can "get stuck" in various oddball paths, so
reseed the trace after N instructions.
More: seed abstract register state with random values, then start
tracing. Loads & stores can use random addresses, or load/store
addresses: aliases will appear "obvious" because the random addresses
will collide. Also simulate memory using the random addresses. Turns
out most aliasing is captured, but not all (structured linked lists
will behave as random instead).
Finally: stuff the random traces thru the hardware simulator & come
up with critical paths & schedules. Then use this info to compile
better scheduling (he's doing clustered VLIW's). Seeing maybe 10%
performance gain on his funny clustered VLIW's.
Program Optimization Space Pruning for a
Multithreaded GPU
NVIDIA CUDA talk. Need loads of threads (12000 hardware supported).
Have 45G/sec bandwidth (Gbyte? GBit?) but still run out; must use local
caches. Encouraging loop unrolling, hardware prefetch, traditional
optimizations. Scheduling loops for thread-level and memory-level
parallism. Have about 8000+ total registers.
Increasing regs-per-thread might increase per-thread performance, but
run out of the shared register space and thus be able to launch fewer
threads in parallel. In short, need to search large space of configs;
often hand-tuned code is about 15% below max trapped in some local
maxima. A non-intuitive mix of other optimizations can move you to a
better minimum.
Cliff
Category: Web/Tech | Permalink
| Comments (1)
| TrackBack (0)
|