Azul Systems
UNBOUND COMPUTE

Cliff Click Jr.’s Blog's Blog

Category: Web/Tech

I've Been Slashdot'd
January 19, 2010

I've been Slashdot'd.  The slides in question are also here.

I gave the talk at the JVM Language Summit, which itself was a lot of fun.

The talk is a repeat of one of the talks I did at JavaOne.  I also gave two other talks, but the Sun JavaOne website appears to be unable to deliver the video right now.  I also gave a short interview at JavaOne.

One of the talks I mentioned on the InfoQ video is also available here as a Google Tech Talk; Java on 1000 cores: Tales of Hardware/Software Co-Design.  I also mentioned a talk Azul's Experiences with Hardware Transactional Memory, and my blog on that is here.  Alas, I don't believe the HTM talk has been video'd for public consumption at any time.  If you are interested in HTM support, you should also check out this short gem.  The GC talk alluded too has slides all over the web; here's the original paper, but I could not find a public video presentation.

Cliff

Category: Web/Tech | Permalink | Comments (10) | TrackBack (0)


Touching Base...
December 22, 2009

It's been awhile since I blogged, so I thought I'd touch base with people to let them know what's been going on.  Azul Systems has been hard at work improving our JVM.  This is a bigger statement than it sounds - there are not many groups that have a large enough 'quorum' of JVM engineers to do large-scale changes to the HotSpot JVM.  Azul has nearly a dozen engineers doing core HotSpot work (not counting JDK work or QA folks - counting only core JVM engineers)!  We've been doing large-scale changes to HotSpot for nearly 8 years now.  Our HotSpot has been improved over Sun's standard HotSpot or the OpenJDK in a large number of ways, some more visible and some less so.   Some of the more obvious stuff we've got working:

  • A new complete replacement GC: Generational Pauseless GC (and the older PauselessGC paper is here).  This is one of our core strengths.  GPGC handles heaps from 60Megabytes to 600Gigabytes and allocation rates from 4Megabytes/sec to 40Gigabytes/sec, with MAX pause-times consistently down in the 10-20msec range. 
  • GPGC requires read barriers, and this means instrumenting every read from the garbage-collected heap.  Instrumenting the JIT'd reads is easy: we altered the JITs long ago to emit the needed instructions.  Instrumenting the VM itself is a bigger job; every time we integrate a new source drop from Sun we have to find all the new heap-reads Sun has inserted into their new C++ code (HotSpot itself is a large complex C++ program) and add read-barriers to them.
  • Real Time Performance Monitoring - RTPM.  This is our high-resolution always-on no-overhead integrated profiling tool and is our 2nd major selling point.  Because it's no-overhead (literally less than 1%; it's very hard to measure the overhead) we leave it always on.  This means you can look at a JVM that's been up in production for a week or a month and introspect it.  It's *common* for a 1hr session with RTPM to answer performance questions that have plagued production systems for years, or to have people walk away with 10-line fixes worth 30% speedups.  It's as-if you've been blind to what your JVM has been doing and suddenly your eyes are opened.  Live stack traces, heap contents, leaks, hot-locks with contending stack traces, profiled JIT'd assembly, I/O bottlenecks, GC issues, etc, etc.  See the link for a demo.
  • Virtualized JVM - We can take pretty much any old server, install a new JDK, change JAVA_HOME to the new JDK and re-launch the application... and it now runs on Azul's JVM backed by an Azul appliance.  No hardware change and no OS change.  This is a great solution for in-place speedups of older gear.
  • More recently of course, we've been hard at working porting our JVM to our new hardware platform.  This work is going well; look for more discussion here as we have things to announce!

Here's some of the LESS obvious stuff we have working:

  • Tiered Compilation.  Despite the fact that Sun has shipped "-client" and "-server" configurations for years, they never integrated these two JITs into a single system.  Most other JVMs have had a tiered compilation configuration for years and Azul Systems did this to HotSpot a few years ago.  We consistently see a roughly 15% speed improvement over a plain "-server" configuration.  We use the "-client" JIT (also known internally as C1) to do fast high-resolution profiling; this high-quality profile information allows the "-server" JIT (C2) to do a much better job of inlining and compiling.
  • A complete replacement for the existing HotSpot CodeCache: the holder of all JIT'd code in the system.  While *adding* code has always been easy, *removing* code has always been tricky (well, tricky to do it without blowing all code away at once and without requiring all calls to indirect through a 'handle').  Most large server apps slowly churn new code, so if you leak code you eventually run out of memory.  The new CodeCache uses GC to control code lifetimes and this results in a vastly simpler and less buggy structure all around.  We also use GC to manage all the auxiliary data structures surrounding code, e.g. the list of "class dependencies" for a piece of JIT'd code is a standard heap object now.  (A "class dependency" lists the set of classes & methods that a piece of JIT'd code assumes are NOT overridden; if a new class and/or method overrides one of these then some inlining decision made by the JIT is now illegal and the JIT'd code needs to be deoptimized, removed & recompiled).  Besides being a common management point for all code, the CodeCache is pinned in the low 4-Gig.  This means all hardware Program Counters can be limited to 32bits (in our otherwise 64-bit system) and this is a tidy cost savings (shorter instruction sequences for calls; less I-cache space consumed, etc).
  • Tons of internal JVM scaling work.  We run on systems with 100's of CPUs and so we've found (and fixed!) any number of internal JVM scaling limitations.  GPGC can run with hundreds of worker CPUs if needed.  The JITs compile in parallel with dozens of CPUs (50 is common during a large application startup).  Many internal VM structures have been made lock-free or have had their lock hold-times reduced by 10x or more.  Self-tuning auto-sizing JIT/compiler thread pool.  Concurrent stub/native-wrapper generation.  Concurrent code-dependency insertion (during compilation) and checking (during class loading).  Self-tuning finalizer work queues.  etc, etc, etc....
  • Cooperative Safepointing allows thousands of *running* threads (not just alive-but-blocked-on-IO) to come to a Safepoint in under a millisecond.  Merely safepointing 100's of threads is down in the microseconds.  Note that a full-on Safepoint does not happen until the last thread checks-in but the stall time starts when the first thread stops for a Safepoint.  The time-to-safepoint pause is measured from when the first running thread stops till when the last thread checks-in.
  • The ability to asynchronously stop & signal individual threads, to have them do various self-service tasks cheaper than a remote thread can do it.  This includes, e.g. stack crawls for GC or profiling (a thread's stack is hot in his own L1 cache and can be crawled vastly faster than by a remote thread), or to acknowledge GC phase shifts or to allow code to be deoptimized (jargon word for what happens to code that is no longer valid due to class loading). We can also efficiently do "ragged safepoints" - this is like a full Safepoint except we don't need to simultaneously stop all threads.  Instead we merely need to know when all threads have acknowledged e.g. a GC phase shift.  The threads "check in" as they individually acknowledge the Safepoint and keep on running.  When the last thread has checked in, the "ragged safepoint" (and GC phase shift) is complete.
  • No more "perm-gen" space to run out or require a separate tuning flag.  No more old-gen or young-gen either.  No GC-thread-count knobs, or space/ratio tuning knobs or GC age or SurvivorXXX flags.  GPGC takes no flags (except max total resources allowed), and runs well.  There Is Only One Heap Space, and GPGC Rules It All.
  • A new thread & stack layout that lets us use the stack-pointer also as a ThreadLocal storage pointer, the HotSpot "JavaThread*", AND as a small dense integer thread-id (requires 1 or 2 integer ops to flip between these forms).  This frees up a CPU register for general use, while still allowing 1-cycle access to performance critical thread-local structures.
  • A complete replacement for the existing HotSpot locking mechanisms.  Our new locks are 'biased' (here's the original paper idea) similar in theory to Sun's +BiasedLocking but based on entirely new code.  No more "displaced header" madness (this comment is probably only relevant to hard-core HotSpot engineers).  Biased locks do not require ANY atomic operation or memory barrier during locking & unlocking, unless the lock needs to "change hands".  Since we can stop individual threads asynchronously, we have a fairly cheap way to hand biased locks off between threads.  Once individual locks demonstrate that they need to "change hands", we inflate that one lock (not the whole class of locks) and it  becomes a "thin lock" as long as the contention is low enough switching over to a "thick lock" only when there are threads waiting to acquire the lock.  The issues here are fairly complex and subtle and deserve an entire 'nother blog!

That's enough for this Blog.  More later...

Cliff

Category: Web/Tech | Permalink | Comments (14) | TrackBack (0)


Panic Last Week
August 16, 2009

(warning: no technical content)

Prior Friday: my Mom calls my sister & tells her my brother Eric (hiking the entire PCT this summer) is 2 days late/overdue.  Sister calls me in a PANIC!!!  I plan on Saturday driving 5 hrs to northern California to calm my Mom & help coordinate search & rescue.  It's not like I can get on the trail and expect to cover 100miles in 4 days like my brother's been doing.  I can't reach Mom (no doubt she's driving twisty mountain roads and is out of cell phone range).  To make matters worse my wife has been sick all the prior week while trying to run Vacation Bible School (her and 8 kids got it from somebody who brung it; why do people bring their sick kids to School?); she's still recovering.

Saturday: no panic.  Mom calls; Eric's been found and is not really late: he had told her he probably would not be able to call midway through the 100 mile stretch and indeed had no cell coverage the entire time.  Mom had the local Sheriff out canvassing everything and almost ready to put up a Search&Rescue plane when she finds Eric (and the Sheriff really didn't want to do that because there was already almost 100 planes in the air fighting the local wildfires).  Mom's too old to panic like that.   Grrrrr.....

More Saturday: I take the kids to the beach while my wife sleeps off the cold; my eldest daughter needs to photo tidepools for her Bio project and we all need to relax.  In the middle of the relaxation my Dad calls home (my folks are divorced so no relationship with my Mom anymore) and reaches my wife napping; Dad and new wife (of nearly 20 years) are in Stockton and will be arriving tomorrow.  PANIC!!!!  Wife's still recovering; the house is a total shambles AND her inlaws are arriving in less than 24 hrs. 

Sunday: Inlaws arrive.  Much grandparent fun.

Monday: I leave my folks to have a looong planned meeting with Intel.  I arrive and PANIC!!! the power is out.  We relocate to the break room which has floor-to-ceiling windows.  Makes me feel like Azul is a high-powered got-act-together startup to host Intel folks in the dark.  Dad & wife move on (an unannounced drive-by grand-parenting), as they are traveling the western US as part of their retirement.

Tuesday.  I spend all day on training & mentoring, and recovering from repeated power failures.  All boxes need rebooting and ..."PANIC!!!"... patches get installed, don't play nice together, need more patches, more screwing around.  Things more or less functional again by late Tuesday.  Of course a company wide demo coming on Thursday is hitting an unexpected new crash, so the demo gods are panic'ing and I can't help them all day because of power outage crapping out my machines.

Wednesday.  Unrelated at-home PANIC keeps me busy all morning (plus bugs in high-scale-lib showing up a year late).  I finally get the fires out around noon and my wife offers to go out to lunch w/me.  First time we'd eaten together w/out kids in a month.  Of course, the housekeeper has hid my scheduler book and there's something in it keeps nagging me.  Just before we walk out the door I find it and stuff it in my briefcase un-inspected - because a rare lunch opportunity w/my wife is not to be missed.  5 minutes later Jeremy of Google calls: I missed my planned Google lunch w/ high-powered Java hackers and he wants to know if I'm still giving the tech-talk at 1?  PANIC!!!!  I turn the car around, drop off the wife, panic another 5 minutes trying to get latest slides on wife's tiny new netbook because the standard presentation laptop is sick before the laptop recovers and I grab it and roar out the door.  I make it to Google just in time (no officer, I wasn't speeding I was just flying too low).  The talk goes well.  (slides are here

Thursday.  Demo goes OK.  I actually get some work done.  It's quite shocking.

Friday: Planning on some kind of panic but it hasn't happened yet.  Got a concert planned tonight in Santa Cruz withOUT kids... so an hour drive away from any kind of rescue.  We're making plans (got neighbors lined up, etc).  In the end wife and I are so exhausted that we bail on the concert and settle for a nice dinner out without kids.

Saturday: We have a 7:30am Dr appointment for my youngest (no: I normally never hear of Dr's giving 7:30am Saturday appointments, but my youngest needs to see a serious specialist who happens also to be a really nice guy), then a long planned trip to SF zoo (In theory: more grist for the daughter's Bio project.  In practice: she's got a long-distance romance going on with somebody who's 2 hours north of SF so they agreed to meet in the middle).   We all get too much sun wandering the SF zoo in perfect weather.  I finally come down with the dreaded VBS cold, and feel like crud the whole day but had a great time nonetheless.

Sunday: We're all recovering from massive sunburns and a week of madness.  School starts tomorrow... plenty of time to panic then!!!


Cliff

PS- Concerned that there might be a correlation between PANIC!!!! and my receding hairline

Category: Web/Tech | Permalink | Comments (6) | TrackBack (0)


2009 ECOOP
August 2, 2009

Long delayed, but at last I found time to publish my notes.



I got a free all-expense paid trip to Italy, and all I had to do was give a keynote speech (slides) at 2009 ECOOP.  It's a nice gig if you can get it.  Travel to Genoa from San Francisco takes awhile; my 8:30am SF flight got me in Genoa about 2pm on the *next* day - plus 2 hrs lead time at SF, plus a 1hr drive to SF - I started my "day" at 6am on Friday and ended it at 8pm on Saturday.  I managed to stay up till dinner (barely) then crashed pretty hard.  Sunday was my day for sightseeing; I managed to wander pretty far all around Genoa; rode some funiculars; walked all over; took loads of pictures; ate weird food and generally acted like a typical American tourist. 

There have been loads of sidewalk vendors; all Blacks aggressively selling fake designer purses and sunglasses laid out on sheets on the sidewalk.  Monday afternoon I got to watch a bust go down; lots of yelling, they grabbed their sheet bundles and took off running with plain clothes police in hot pursuit.  Somebody lost his wares and the police quit chasing to grab the loot.  It's Tuesday evening and the Blacks haven't returned.  I promised my wife I'd get her a cheap Gucci purse and now I'm regretting I didn't do it earlier.

Hotel Bristol is a nice older place (means: lots of paint chips and paint layers, big chandeliers, gold leaf trim, dark wood, creaky old elevators; rooms are a mix of really old and really new; follow the link and oogle at the pictures).  My room is vast with 15ft ceilings (with chandelier of course) and a large jacuzzi tub; enough room to put up a basketball hoop.  The conference food has been pretty nice (free good wine with every meal); lunches are served in the Palace Ducalle - historic summer residence of Doges.  I can't do the setting enough justice; the website is nice but the pictures don't really tell the story.  50ft ceiling with 20ft chandeliers and a dozen 8ft marble statues; gold leaf trim on every fresco; stunning paintings the size of basketball courts...  I totally laid down on the floor and stared at the ceiling for a good hour.  The local restaurants, on the other hand, are a hit-or-miss affair; food is only served at certain hours; sometimes the food has been very good and sometimes no better than the local deli around the corner from Azul.

The ICOOOLPS workshop was on Monday; I got invited to talk there at the last minute.  I had some misgivings but most of the papers at the workshop were pretty nice; I enjoyed myself.  I'm still fighting jet-lag (today is Tuesday) pretty hard, so I'm writing this to try and stay awake until dinner.  The papers are here.


1st up - The ICOOOLPS workshop.  ECOOP notes are further down.  As usual I abbreviate ruthlessly; skip some talks; take notes in a stream-of-consciousness style.  Caveat Emptor.  And before I forget:

The "Nick Mitchell SimpleDateFormat Challenge": parse a sample common SDF and JIT code to translate Date objects to ASCII efficiently.  Similar to C/C++ compiler optimizations for 'printf' strings.

Let me know when you got something working.



Towards an Actor-based Concurrent Machine Model

"delegation based" machine model; dynamic separation of concerns "MDSOC"

Machine model: objects, msgs, delegation.  High lvl objects represented as at least 2 low-level objects; a "proxy" and a real object.  Indirection to allow message interception.  All msg-sends get forwarded thru the proxy ("message" here is a java-lvl function call, not an OpenMPI msg - but could be either in a distributed system).

Aspect-Oriented-Programming - intercept msg-send links (between Class-proxy and methods in the actual Class-body), etc.  How to do CHA with basically dynamic sub-class insertion?  Just re-JIT in the new class hierarchy?

Ok - now new stuff... want concurrent in the delegation function.  Actor is a collection of local objects (and ptrs to remote objects).  Fcn calls between local objects are fast; fcn calls across actors are basically remote-msg send. Then receiving a msg invokes a co-routine in the remote actor.

Ugh, using 'yield' with the co-routines.  Sounds buggy to me: semantics of dead-lock will depend on proper calling 'yield'. 

(assuming a VM w/ actors not thread support)


An Efficient Lock-Aware Transactional Memory Implementation
Justin Gottschlich

Trying to integrate into the "boost" STM system.

Locks+TM break things.
But locks are prevalent - must be able to compose with them.

Example: Locks outside Transaction
- T1 runs lock; T2 runs transaction
Example: swap code.  Works with only-locks or only-TMs but breaks if mixed

Locks inside Trans
- T1 runs trans, calls lock; T2 runs either trans or lock

Prior work: "full lock protection" - if you take a lock, then the XTNs are all stopped/aborted/etc and you only get locking behavior.

Offers a programmer solution: programmer lists lock/XTN conflicts and the XTN system deals when you take a conflicting lock.  (I believe this is unrealistic).


Eric Jul? - Whiteboard not slides....

Blurring the line between compilers & runtimes

Gives some examples already done by hotspot

Nice discussion, but nothing new.  Mostly trying to get a discussion going about crosing compilers/JITs & runtimes.  He might not have been aware about what the JIT already does.


Trace-Based Type Specialization in JavaScript
Andreas Gal

Same basic talk as given at PLDI.  TraceMonkey? FireFox 3.5
JavaScript & Flash instead of Java/C/C++

JS has been very slow (interpreted only).  But is here to stay; very popular & growing.  ActiveX & client-side Java dying out (not sure about client-side Java which was never popular in the 1st place).

Static typing makes life easy, but dynamic typing is required.  More complex data types and more runtime tests.  Tag bits to be checked; overflow to Double, etc.

Coming around to wanting a HotSpot like dynamic JIT'ing thing based on types that happen to be true at the moment.  Basically, types in program traces remain stable over time.

  - for( var i=0; i<100; ++i ) { /* nothing */}

loop:
  if ( int(i) >= 100 ) break;
  i = box(int(i)+1);
  goto loop;

So trace can record, e.g. that 'i' has always been an 'int' so far.  Trace has a guard on the input types, and are type-specific.  Function calls are inlined in the traces, along with guard statements to check that you are taking the same control-flow. 

Tracing loops, and can verify trace is type-stable across loop.  So can remove e.g. boxing in the loop.

But real traces have many exits and the many exits are really taken.  So trace along each exit.  Build trace-trees, rooted at loop headers (exponential growth of trace trees as the loop DAG is split out?).  Can only link back into original tree if types all match up - else need a new loop /tree header.

Suffering for lack of a language spec; JS programs are driving the language semantics (i.e., w/a new interpreter some JS programs fail so we call the interpreter buggy- even if it meets the loose "spec").  No nothing of a memory model or threading; lots of other holes in the language.  Echo's of what Java went through: the popular implementations defined the spec.


Tracing the Meta-Level: PyPy's Tracing JIT Compiler
Carl Bolz

PyPy is a tracing JIT compiler.  Now apply this tracing JIT to an interpreter.
Compiler "Restricted Python" RPython - can target C, Java, .Net
Various interpreters: python, prolog, smalltalk, scheme, etc

Basic idea:
trace loops; look for type-stable loop execution; look for similar code path loops.

But the dispatch loop in interpreters means you never execute the same loop code twice (because each time you are running a new different bytecode).  Goal: trace user-mode program, not the language interpreter.  Effective the tracing interpreter unrolls the bytecode dispatch loop.  Provde 3 hints to laguange interpreter.  - hint for position key; here is the language interpreter's BCI; here are backwards edges; here is the PC modification

Works fairly well to clean out the language interpreter from the traces.
Then get traces which are fairly clean & can JIT them.

Bears resemblance to partial evaluation, arrived at by different means.  Future work: better optimization of traces; some escape analysis to remove boxing operations.  Optimize frame objects, apply to larger programs.


Faster than C#: efficient implementation of dynamic languages on .NET
Antonio Cuni

Trying to make e.g. Python faster on .Net.
Looking at IronPython, Jython, JRuby, Groovy
(also Self, JavaScript/TraceMonkey/V8)

Why so slow?  Hard to compile efficiently; lack of type info at runtime; VMs not optimized to run them.  .Net is a multi-language VM, right (sure, as long as the language is C# - his quote, not mine!).  JVM is in better shape, but still heavily optimized for Java.

JIT compiler?  Wait until you know what you need; interleave compile-time and runtime; exploit runtime info.  JIT layering; fight the VM...

PyPy; JIT compiler generator; Python semantics "for free".  JIT frontend not limited to python; JIT backend: x86 or CLI/.NET backends.  Fun games with partial eval: assume e.g. Python bytecodes to be constant & constant-prop them into the python interpreter.

Do even more fun with constants than HotSpot+OnStackReplacement - totally doing speculative constant value JIT'ing - if this argument is of value '3' then here is the JIT'd code.  Trick is to pick which variable to constant- speculate on (and getting that spec value as well).

Not yet doing all of Python, but getting really great speedups.


Strata Virtual Machine

Software Dynamic Translation - read a binary, & translate it/jit the translated code.

Using a code-based hash-table lookup.  Hash; jump to hash-table entry.  Miss: jump to 'strata' interpreter
Hit: jump to bucket; bucket checks target (like HS inline cache, check at target)

HotSpot could use this to make more efficient v-calls?  Hash; indirect jump to nearby code table; table jumps to target & checks target; expect no misses after warmup.  Nah... still got indirect branch in there.

Looks like a standard binary dynamic translation type stuff (converting PPC instructions to Java bytecodes?)


Automatic Vectorization in JIKES RVM

Using SIMD ops on X86.  Can't use BURS/BURG to pattern-match vector ops.  Unroll loops to make the patterns more obvious.

Basically getting some SIMD stuff to work (but talk given by PC not by author, and PC believes this is not the right way to discover SIMD ops).  I also believe this... it's not a clean fit to BURS.

Code emit via simple bitmask/shift/and/or.


JIT Compilation on ARM Processors
Michele Tartara

ARM - 31 32-bit GPRs, 16 available at a time?.  SP, LNK, PC are GPRs.  Fixed format 32-bit ops.  Dynamic fast compilation (cell-phone targets).  No BURS or tiling, just greedy rules.  This is probably the right way to go always.



ECOOP 2009 Proper Starts

ECOOP is being held in the Piazzo Ducall (Ducal Palace) - the main conference presentation chamber has perhaps 40ft ceilings, acres of gold leaf on the walls in between the massive medieval paintings.  The speaker dias was clearly meant to hold a throne or the orchestra; it's a circular marble dias perhaps 20ft in diameter with marble balustrade and marble railings.  The high tech screen & projector setup in the middle is really anachronistic.

The adjacent formal ballroom is much larger; 50ft+ ceilings; chandeliers of at least 20ft tall and 30ft in diameter; paintings & gold leaf in plenty plus also perhaps a dozen 8ft marble statues with ancient greek themes.


Keynote - Classes, Jim, but Not as we Know Them
Simon P Jones

As usual for Simon, he gave a wonderful presentation.

History - Haskell is 20yrs old.  Lots of fun with new languages & machines & ideas (Functional Programming).  After Backus's Turing Award lecture opened the gates, a storm of new languages hit the field.  Took 10 yrs before Haskell formed out of the chaos of new languages.

Lifetime of most programming languages:

  • Invented by 1 person, used by 1 person, dies within a year.
  • Slow death version: invited by 1 person, used by 100, dies in 5 yrs
  • Successful: used by 100K+ programmers, never dies, e.g. Fortran still alive
  • Haskell?  hovering around 1000 programmers, but sharp recent uptick?

Haskell might cross over into immortal language (100K+ programmers)?  Hackage: user uploaded libs, reaching 300 new libs uploaded /month, and approaching 1million downloads.

Type Classes-
Shows examples of lots of things you want actions on a wide variety of types:

  • equality (for set membership?)  how to do equality of functions?
  • ordering (sorting lists)
  • print/showing
  • serialization
  • computing hash function

(so far sounds like Java interfaces)
For all 'a' such that 'a' is an instance of class 'Num'
  square Num a ==> return a*a

Num classes have the following methods: +,-,*,/, etc.
The implementation of Num classes binds '+' to eq 'plusInt' or 'plusFloat'

Implementation is perhaps more clever than Java invokeinterface bytecode lookup: is passed in a vector of fcns, of the correct type of interface 'Num'.  The fcn lookup happens on this particular 'vtable' - call it an itable, but it's really a unique v-table per interface.

For Haskell, this is all done with syntatic sugar over the basic Haskell.  Which itself is very cool (e.g. Haskell compiler (javac equivalent) is doing interface calls w/syntatic sugar).

CLIFF NOTES: This is a faster way to do Java interface calls!!!  Or maybe this is how they are impl'd?  Find the i-table from a list of implemented interfaces, then pull out the correct fcn from the i-index.  In his world the i-table is pulled out early and kept pulled out, which avoids the point-by-point lookup of interfaces.  Ahhh.... he's statically computing the i-table ahead of time during compilation, using some combo of type unification and the closed-universe assumption.  Not applicable to Java.

Able to handle fcns which themselves take arguments of fcns with any signature, as long as the fcn handles the correct interface.

Doing type-based dispatch instead of value-based dispatch.  Not the same as Java interfaces...
In Haskell, can bind a value arg to multiple interfaces all at once.
In Haskell, existing classes can be made instances of new interfaces (duck typing)

Haskell has NO subtyping, but uses the interface (well: type class) notion instead.

Haskell: types are inferenced typically, type annotations occasionally needed
Java: types are declared explicitly; inference occasionally possible

Haskell does type unification, so things like Java's 'equal' call - which is normally a fcn of (Object,Object)=>Bool - in Haskell, we KNOW ahead of time that the 2 args are of the exact same type.  Hence the implementations of various equals calls do not need to ask the "instanceof" question that all Java equals call implementations all start with.


Making Sense of Large Heaps
(Nick Mitchell, IBM)

Where does the memory go in large java apps.

Example from 2008, Cobol to Java.  In Cobol: 600 bytes; in Java 4492 bytes.
Blow up from delegation, bulky collections, or too many data fields.

Size examples: 58M objects, 3Gig ram, 8000+ types.

Yeti Tool - does smart cluster of datatypes (String w/char[], or HashMap w/HashMapEntry[], HashMapEntry, HashMapEntrySet).

Yeti also does dominance of sharing; notices single-owner root/tree bits, and then breaks out the shared parts at the leaves.  Also detect e.g., similar kinds of sharing, such as array B[] points to a set of B's; but also a linked list of Entry's points to the elements of B[] in turn.  Really lumping together things with the same owner; maximal dominance w/same ownership relationship.  Does nice heap shape summaries, followed by size info. 

Picture is really alternate sandwich effect, alternating between the Collections you selected, and the user-data at that layer of the 'sandwich'.  Edges hold the various sizes of things.  They fold up collection 'backbones'. 

If we get the data structures right, can easily shrink heaps by 2x to 10x!!!  Huge speedups possible.  Shows examples of people using CHM to hold 1 element (but programmer doesn't want to think about expected size usage, so does the easy thing and grabs CHM).


Scaling CFL-Reachability-Based Points-To Analysis Using Context Sensitive Must-Not-Alias Analysis

Basically computes a must-not-alias approximation; where it holds do not bother to run a more expensive pts-too analysis.

Some experiments are medium programs (SpecJVM - 7 progs, Dacapo - 4 progs, 8 other progs).  Can compute pts-to graph in 2x to 5x smaller.  Compute time for the pts-to analysis is 3x faster on average than prior work (but still in the several minutes range)... but what do you do with the info?  (this is my standard question when presented with better/faster/new&improved points-to work: now that you have the info how much does it *really* help speed up programs?  especially compared to having strong typing in the first place)

Nice presentation of a pts-to analysis, but hit right at my sleepy jet-lag afternoon siesta; had trouble staying awake.


Intel NePalTM - Design and Implementation of Nested Parallelism for Transactional Memory Systems.  (Intel)

Issue is using locks instead a XTN; the locks are for the nested parallelism, but want the whole operation to be atomic.  Azul makes no attempt to use the HTM to span across threads.  Can't work for us, because the threads need to communicate through shared memory - and that shared memory will blow out the HTM. 

This is Intel's compiler-gen'd STM system for C++.  Handles either optimistic (read-friendly) or pessimistic (write-friendly) concurrency; roll-back & retry.

Sigh - shows basically no speedup over plain locking.

(and perhaps not the best name....)


Debugging Method Names

What's a naming bug?  A mismatch between the name and what the code does.

Break names apart (CamelCase in Java), do some grammer abstraction.
getLastName ==> get - last - name  ==> (verb) (adj) (noun), etc

Then also do semantic analysis of code.  Look for things like: has a loop, has a incoming parm, returns a value (from multiple points), includes some run-time checks in the loop, etc.  Decide it's a search loop, returning a result or NULL.

So come up with lots of data points for names & methods.
Then mine the rules from a large corpus of code.

E.g. a "contains-XXX" name method should NOT be a void return, because a "contains-XXX" name implies a question with an answer.  In fact, in probably ought to return a bool.

Then with rules, apply rules to a program and report violations.
Turns out pretty easy to suggestion candidate replacement names.

Example: 
  public void isCaching(bool b) { this.caching=b; }
Method isn't asking a question, it's setting value.
Tool suggestions name should be "setCaching".

Example:
  public bool equals( Object o) {
    if( this==o ) return true;
    if( o instanceof Value )
      return equals((Value)o);
    return equals(new Value(o.toString()));
  }
So not really an 'equals' method, because it makes a new 'Value' object

More examples, ends up finding interesting bugs as well.  Hard part is to decide whether or not to change the name or change the code.  Sometimes the name is correct, but the code is 'wrong' here - and also wrong elsewhere.  Basically finding a poor factoring of code.


Mapping & Recommending API usage Patterns

API, libraries - often complex, lots of classes & methods; complex; hard to use.

People use: books & docs, forums & newgroups; code search engines.

Example: we wish to add an item to Eclipse menu.  Browse docs; find potential call: "appendToGroup".  Google search finds 151 code snippets (at time of paper submission) and 287 code snippets (at time of talk).  The returned code snippets at least 2 different API usages. 

'MAPO' Tool does search; parses code; clusters snippets based on used; mines patterns; recommends final patterns.  Tool needs to inline non-3rd-party methods (scattered implementation).

Examples hard to follow...  I've been bit here before repeatedly (complex API & no way to learn how to use it easily).  So very sad that it's so hard to follow his talk. 


Supporting Framework Use via Automatically Extract Concept Implementation Templates

Another talk about finding & figuring out reuse of existing code.

Choice between Templates vs Documentation had much less impact on dev time than the task's concept complexity.


I skipped all the refactoring talks.  Exhaustion is setting in, plus I've seen one or two of these talks before.

Went to a couple of talks attempting to deal better with constructors.  Mostly it's issues like publishing objects before construction completes, or in Java call an overridden v-call from a constructor (which means the sub-class v-call executes before the sub-call constructor can work on the object).  Final fields, read-only objects still being constructed; not-null field properties before you set the value; et al; there's a bunch of problems that stem from constructors not being instantly-quick - and if they take time, what does the object look like in that in-between state.

This is a general language issue with constructors.  They are a nice notion, but unless they are instantly quick, you have to live with having objects which are not 'complete' or 'constructed' yet and hence do not meet the expected invariants for the object.


Inlining Security Policies

Want to inline code into the original program; the new program either runs or is truncated (if a security violation happens).  Must show no *new* behaviors from the inlining, also all old behaviors are allowed (except the secuirty ones), etc.  Single-threaded versions of these inlinings all exist and are well documented.

Multi-threaded is harder: general case isn't solvable (probably because the monitors themselves become non-deterministic).  But if the monitor is race-free, (and some other easy properties), then you can do it.  Monitor is normally a state-machine.


Cliff Click Talks About Azul

My keynote went well enough; lots of Q&A from the senior people in the audience afterwards.  We had to stop the Q&A after running quite a bit over.  I also got a lot of positive comments afterwards, so I think people really enjoyed the talk.  I'll add that I enjoyed both Simon Jones' and Dave Ungars' keynotes also.  This ECOOP was unusual in having 3 "gentleman hackers" give keynotes; we're all three both very scholarly and very prolific programmers.  Doug Lea added in a later email:

"Cliff: definitely the best talk I've seen you give (which is quite a few). Someday  you'll have to figure out how you packed in so much technical content yet still fully captured attention of people supposedly without the background to understand most of it." 

High praise from Doug Lea, so I musta done something right.   :-)

Trip home is uneventful, except for Paris's Charles DeGaul airport being the usual zoo.  I get up at 4:30am and am in a Genova taxi by 5am for the 1/2hr drive to the airport.  Except that there's no traffic at this hour and the driver thinks he's Mario Andretti.  We make the trip in about 10 minutes.

So I'm way early into Genova's airport for a 7:10am flight.  No power connections for my laptop, for-pay wi-fi (not worth it), no shops are open - not even coffee.  No air-partnership; I cannot check in for my USAirways flight from the Air France desk in Genova.  I blog for the next hour and a half.  The flight leaves on-time and the two hours flying are very smooth.  Kudos to Air France.  Then we land at Charles DeGaul and the madness begins.  We have to walk down the plane stairs and across the tarmac and about 1/4 mile more to the main terminal - but actually it's the small-plane terminal.  After figuring out I'm not in Terminal's 1, 2 OR 3, I get instructions to take bus Number 3 to the automatic train and take it to Terminal 1.  Of course the buses & bus stop are labeled- with LOTS and LOTS of numbers and plenty of French - takes a question of the drivers to figure out which is the #3 bus.

The bus ride is longish (for an airport bus) and uneventful and then I find the train.  Thankfully the train is clearly marked and again a longer ride to Terminal 1 than I expect: Charles DeGaul is a really spread out airport.  Terminal 1 is a total zoo; I see lines of people snaking out all over filling the large hall, hundreds of people long.  Three times I ask people about which line they are in; none of the lines are marked and all snake back and forth through out the hall; there's a steady stream of people traffic cutting through all the lines and the lines branch and rejoin helter-skelter.  Finally I find the 'entrance' to the line for flight #771 to Charlotte - just next to the line for Philly which clearly orbits the entire hall.  The Charlotte line is actually fairly short, maybe 20 people long, and after 10 minutes I'm talking to the agent.  No problem checking in and then I head off for gate 55 (cutting through the Philly line again).  Then it's customs (another 10 minute line), and then a small shopping zone - I figure I'll come back to it for a snack (no breakfast yes; nothing was open in Genova).  But then I discover there's yet another line for security.  This one is about 20mins long and no way I'm going back for my snack and then back through security.  I figure it's like US airports - once past security there will be food and such... WRONG!  It's just the end of the lines; I'm in a seating area with about 10 gates and absolutely no way to get a drink or snack - or bathroom near as I can tell. It's now 10:10pm and it's taken me a solid hour to navigate Charles DeGaul to find my gate.  As I type, I'm standing by a laptop charging station (no chairs nearby), in the hopes I can do a little something on the laptop during the next 8 hrs of flight.

Later: flight is fairly smooth.  They serve some meals; I have another flight through Charlotte, NC to San Francisco.  Home at last!  For about 24 hrs that is, then I'm on vacation in Texas with my entire family.  More plane flights  for me!  Yumm, I can already taste the small salty snacks they are going to serve...


Category: Web/Tech | Permalink | Comments (8) | TrackBack (0)


JavaOne Slides
July 21, 2009

Slides for my 2009 JavaOne talks:



I'd like to say more on JavaOne, but I'm (re)discovering that if you're a speaker at JavaOne it's really hard to get into the other talks & technology being presented.  I had my head full with my 3 talks, making sure they looked good, were relevant, etc.  To be sure, there were plenty of talks I wanted to attend, but I never seemed to find the time.   :-(

And of course shortly after JavaOne, I went to 2009 ECOOP in Italy for a week, and now I'm on a long deserved vacation. 

More blogging soon, on ECOOP at least,
Cliff

Category: Web/Tech | Permalink | Comments (10) | TrackBack (0)


IFIP WG2.4 Trip Report
May 18, 2009


2009 IFIP

IFIP WG2.4 remains one of my favorite mini-conferences to attend.  The group is eclectic & varied; the discussion vigorous. Everybody is willing to interrupt the speaker & ask questions.  The talks are usually lively and pointed, and this time was no exception.

The conference was held in Fort Worden, a historic Big-Gun (planes rapidly obsoleted the guns) fort of the Pacific Northwest, defending Admiralty Inlet and our shores from the marauding Russians ... or Japanese... or somebody, surely.  Nobody ever attacked that way, so clearly the fort was a big success.

Meanwhile, the fort has been turned into a park and it's beautiful.  The whitewashed buildings with green trim remind me of some mythical 50's era America that never was, the kind that shows up in the backdrop of various movies or maybe "Leave It To Beaver".  We stayed in the officers quarters and held our meeting in the old mess hall.  The officers lived quite nicely; the quarters are actually old duplexes in great shape, with high ceilings and crown molding and carved brass fittings everywhere - the house is clearly larger than my house and with a vast green lawn ta-boot.  In San Jose such a place would be upwards of $2m...

Fort Worden overlooks Admiralty Inlet from some high bluffs (classic fort-like location) and the surroundings are all gorgeous.  The weather varied from overcast & chilly to clear & crisp (alas, overcast and chilly was the mode on our all-day "whale" watching expedition.  We did lots of watching but saw no
whales of any kind.  And we bounced over heavy seas and got sprayed with frigid water most of the day.)

Since the Hood Canal bridge was out I got treated to a 3 hr drive the long way around the peninsula, but somebody else drove and the scenery was worth looking at.  The airplane portion of the trip was easy.


As usual, my reports are (very) brief, stream-of-consciousness writing.  I often skip around, or brutally summarize (or brutally criticize - but this group is above the usual conference par, so less of that is needed).  I skip talks; I sleep through ones I'm not interested in, etc, etc.  Caveat Emptor.

Synchronization Talks
Atomic Sets - another nice concurrency alternative
Fork/Join for .Net
occam / CSP - Uber lite-weight user-thread scheduling

Java Analysis Talks
SnuggleBug - Better bug reporting for Java
SATIrE - A complete deep analysis toolkit for Java
PointsTo - A comparison of various points-to algorithms for Java
analyzing JavaScript - Has a type-lattice as complex as C2's internal lattice
Performance variations in JVMs - Noticed the wild variations in run-to-run performance of JVMs

Misc
Cheap Cores discussion
Compiler-directed FPGA variations
Clone Wars - huge study on cut-n-paste code in large projects
Business Process Modeling

Security
PDF Attack - "Adobe Reader is the new Internet Explorer"
The Britney Spears Problem
Squeezing your Trusted Base
Who you gonna call?

Evolving Systems
Swarms in Space
Finding emergent behavior




Frank Tip, Type-Based Data-Centric Synchronization

Locks are hard.  Auto-inference of correct sync in O-O programs.
Even with no data-races still have atomicity races.

Instead of code-centric locking paradigms, do data-centric locking.
Tied to classes with fields, leverage O-O structures

Group a subset of fields in a class into a *atomic_set*.
Then find units_of_work for that atomic_set.

Add language features to Java:
  atomicset account;
  atomic(account) int checking;
  ...
  atomicset logging;
  atomic(logging) Log log;
  atomic(logging) int logCount;

So add 'atomicset' keyword, 'atomic' keyword and strongly type variables with atomic_sets.  Units_of_work expand to reach the syntactic borders anytime an atomic variable is mentioned.  Can expand unit_of_work by hand also.

Can at runtime cause atomic_sets to alias - to union together.  Used on ownership types (e.g. HashMap uses HashMap$Entry, so units_of_work on HashMap or HashMap$Entry become units_of_work for both).

Atomic_sets can be unioned; e.g. for LinkList the entire backbone of the list is one atomicset.  Lots of discussion - points out the issues with hacking all the Code Out There.

Has proven strong serializable properties on atomicsets.  Proven various kinds of soundness, including *internal* objects are only accessed by their owner, etc.

All concurrency annotations appear in the libraries but not in the client code - shows a nice concurrent example with 2 threads manipulating the same linked list.   

Has a working version using Eclipse & taking annotations as Java comments.  Using j.u.c.Locks; can do lock aliasing by simply sharing j.u.c Locks.  Applied this to a bunch of Java Collections; 63 types & 10KLOC of code.  Needs about 1 line of annotation per 21 lines of code for the libraries (and none in
clients)

About 15-30% slower than using plain sync wrappers on a single-threaded testcase; probably because the "synchronization" keyword is very optimized.  Performance is similar or slightly faster when running with 10 threads & high contention.


Daan Leijen - The Task Parallel Library for .Net

- A concurrent parallel library for .Net (think dl's Fork/Join).

Infrastructure: locks/threads/STM, async agents (responsiveness), task parallelism (performance)

Gives a demo of a parallel matrix-multiply w/C#.  Very ugly code.

So instead, write a nice combinator...  

Standard task-stealing work-queue approach.  (private set of tasks that don't need CAS, public set of steal-able tasks that require CAS).

Points out that want fine-grained parallelism, so can do work-stealing and do auto-load-balancing.  

-- Effectful state.  Statically typed, with typed side-effects.  Type signature includes all side-effects (yes!)

  int fib(int n) { return n<=1 ? 1 : fib(n-1)+fib(n-2); }

But this program

  int fib(int n) { return n<=1 ? print"hi",1 : fib(n-1)+fib(n-2); }

Does not type in Haskell - need the IO monad.  And now the function returns 2 results: the IO "hi" and the int.

So want some syntax to allow "auto-lifting" into a side-effect world - same function syntax, but auto-includes the IO monad as a result.

Then can prove that some stateful computation does not leave the function - that the state is entirely encapsulated & unobservable outside.  So can do type'ing which allows some state-ful work internally, but it doesn't escape the function "box".

  int fib(int n) {
    ref f1 = 1; // side effects, but entirely internal to the function
    ref f2 = 1;
    while( n ) { sum = *f1 + *f2; *f1 = *f2; *f2 = sum; n--; }
    return sum;
  }


Peter Welch, Multicore Scheduling for Lightweight Communicating Processes

Google: kroc install

Carl Ritson did all the work...

Cliff Notes: Insanely cheap process scheduling, plus "free" load balancing, plus "free" communicating-process affinity.  Very similar to work-stealing in GC marking, but with "threads".  OS guys take note!

Scheduler for OCCAM/PI.  Process-oriented computation (millions), very fine grained processors.  Message passing over channels, plus barriers.  Uses CSP and pi-calculus for reasoning.  Dynamic creation & deletion of processes & channels.

Goal: automatic exploitation of shared-memory multis.  "Nothing" for the programmer to do: the program exposes tons of parallelism.  "Nothing" for the compiler to do - it's all runtime scheduling.  

Goals: scalable performance from 10's of processes to millions; co-locate communicating processes - maximize cache, minimize atomics (lock-free & wait-free where possible).  Heuristics to balance spreading processes around on spare CPUs and co-locating the communicating ones.  

Usual sales job for CSP... which still looks good.  

A blocked process requires only *8* words of state (no stack for you!). Scheduling is cooperative not preemptive.  No complex processor state, no register dumps.  Up to 100M process contexts per second.  Typical context switch is 10ns to 100ns.  Performance heavily depends on getting good cache behavior.  So processes are batched.  Stacks are spaghetti-stack (no that's why no stack saving...).  These 8 words include the process run-queues.

Batching: sets of e.g. 48 processes are run on a single core; that core is the only core touching this queue & round-robins them until the entire batch times out.  So no cache-misses.  A Batch is a Process, plus a run-queue.

A Core is a Batch, plus a queue of Batches.  Again, no other cores touch this list either.  

A Core with no Batches does work-stealing.  There's some proposed batches available for work-stealing from each Core.  Cores need some atomic ops to manipulate the queue of Batches for these steal-able Batches.  Careful design to ensure all the interesting structures fit in a single cache line.

So until you need to enqueue, dequeue or steal a Batch - it's all done without any contention for process scheduling.  These Batch-switching operations require atomics but are otherwise fairly rare.  

Key bit missing: scaling up beyond 8 or 16 cores with some kind of log structure.  But otherwise looks really good for very low cost context switch & work steal.  Lots of questions about fairness.

Batches are formed & split by the runtime (no compiler or programmer).  Aim is to keep processes that communicating on the same core (so it's all cache-hit communication), but spread the work across cores.  Initial batches are kept small to encourage spreading.  Channel ops require atomics, but typically only 1 (sometimes 2).  Choosing amongst many channels might require up to 4 atomic ops.  Each time communication happens on a channel, the sleeping process is pulled into the run queue Batch of the awake process - thus naturally grouping communicating processes.  But as batches get large, need to split.  

Periodically can observe that if all the processes in a Batch are all communicating with each other in some complex cycle - then the spare run queue empties from time-to-time (with one runnable process).  So as long as the run queue empties now & then, then the Batch should stay together.  But if the run queue never empties, then split-the-Batch.  Split off a single process into a new Batch - which will drag it's connected component together into the new Batch.  But eventually if the original Batch had 2 complete strong graphs,
they will now be split - and can be stolen onto other CPUs.  

So this kernel is about 100x less overhead than the same design done on a JVM.
:-)


Satish Chandra, Symbolic Analysis for Java

"Snugglebug"

Communication between the programmer & the tool tends to stop early: "Dear Programmer: you have a null-ptr bug here"...

Going to the next step: "Dear Mr Programmer: try running your method with *these* values and you'll see the bug...".  Programmer:  now that I think about it, these values will never arise.

We just forced the programmer to define the legitmite range of values.

More: Bug Report is more believable if given a concrete input is given. 
API hardening: gives the preconditions under which this library will throw an exception
Unit-test generation: inputs needed to make a specific assert fail.

Works by weakest-precondition.  Given an assert in the program, work backwards computing the weakest-precondition at each point until you reach the top of the method.  If this chain works, then hand the WP to a decision procedure; sometime the decision is "I don't know" but many times you get an explicit counter example.

Good match for the counter-example problem.
- Yes loops are a problem, but don't really need the *weakest* precondition.

Most Java features can modeled:
- heap r/w, arrays, allocation, subtyping, virtual dispatch
- exceptions: many bugs are here so we modeled exception handling try/catch
  very closely.

Have a path-explosion problem.  For heap updates, even a single path includes branch effectively depending on aliasing.

Has to do strong interprocedural analysis.  Lazily grow the call-graph consistent with known symbolic constraints.  First look for a conflict on a call-free path, then on call-paths but without needing to look into the call.  Finally, if I must peek into the call - but use the symbolic info the reduce the set of possible call targets.

Generally, no fixed-point.  Cannot expect programs to give loop invariants.
Work-list management: avoid various common pathologies.
Goal is to find *a* contradiction, not to explore all options.


SATIrE

http://www.complang.tuwien.ac.at/satire/

Shape analysis, points-to, flow-sensitive, context-sensitive, annotations, can read them in & out.  Applications to e.g., slicing.

Read C/C++ - large complex tool chain.  Reads in program x-forms in either a functional programming language or in Prolog.  Can combine the tools in any order at this point (very well integrated).  Output is C/C++ w/annotations at each step (or internal "ROSE" ASTrees, etc).  

Has interval analysis for integers, shape analysis, loop (bounds) analysis, points-to, etc.  Gives example of reaching-def's in the functional programming language.  It's a domain-specific language for writing compiler analyzes.  Can specify the strength of the analyzes in many cases: e.g. the depth of the allowed call-chain for context-senstive (set to zero for context insensitive).

Shape analysis gets may/must alias pairs, etc - and all results put back into the C program as comments/annotations.

Basically describes a large complex tool chain for doing pointer analysis on C & C++ programs.  The tool chain looks very complete & robust.


Welf Lowe - Comparing, Combining & Improving Points-to Analysis

...skips explaining what points-to is.

Clients: compilers, JIT compilers, refactoring tools, software comprehension.
Different clients have different needs: speed, precision, security.
Granularity & conversatism also matter.  Clients might want, e.g., dead code elimination, or removing polymorphic calls, or rename methods & calls.  Other people want object granularity - escape info & side-effects (i.e., compilers and JITs).  Also static-vs-dynamic analysis.  Dynamic analyzes typically run the program and produce optimistic results.

Doing careful experiments of 2 analysis w.r.t. accuracy.  Hard part: can't get a perfect "gold standard" for real programs.  Hard even for a human to inspect a proposed "gold standard" and declare it "gold" or not.  Some special cases are easier: conservative analysis (no false positives) & optimistic analysis
(no false negatives).  Can under- and over-estimate the gold standard, but this still messes with the results (messes with detecting which of the 2 analysis is better w.r.t. accuracy).

E.g., ask the question: is it worth increasing k>1 (i.e., becoming context sensitive vs staying insensitive).  Checking size of computed call-graph - smaller graph is more precise.  Very small increase in accuracy.  Sometimes made things worse - because the original analysis was optimistic sometimes.

Open question: what is the use of an Uber Points-To - it's definitely beyond the point of diminishing returns for JIT compilers.


Anders Moller - Type Analysis for JavaScript

Not much tool support for catching errors during programming.
No nice IDE support.  Many bugs are browser specific, but we focus on language bugs.

Most JS programs are quite small, so throw a heavy-weight analysis at it.

Object based, dynamic typed, proto-type inheritance, 1st-class functions, coercions, dynamic code construction, exceptions, plus 161 built-in functions.  Types include null & undefined, primitive types, some properties "ReadOnly" and "DontDelete".

Broken things: tracking down an "undefined" and where it came from, reading an absent variable or absent property, etc.

So do a dataflow analysis over the program, including a lattice, transfer functions, etc.  Handle interprocedural stuff.  Tightly interleaved control flow & data - so need a complex lattice.  

Lattice is roughly as complex as C2's lattice, with more stuff on the allocation sites and less on the class hierarchy.  Also then model all the abstract machine state (C2 does this as well).  Full context info (total inlining at analysis time except for recursion).


Cliff Notes: Nice idea: for each allocation site, model both the *most recent object* from that allocation site, and also model *all the rest* from that site.  You can do strong-update on the singleton most-recent object, but weak update on all the rest.


Rhodes Brown - Measuring the Performance of Adaptive Virtual Machines

Noticing the badness of the "compile plan" - 1 run in 10 had a >10% performance reduction.

JIKES - always compiled, no interpreter.  On demand at 1st invoke.  Focused recompilation, lots of GC variants.

Points out the places where you get variations - stack-profiles for estimating the call-graph, basic-block counters, etc.

Causes of variation?
So far: GC & inlining
Concerned that i-cache placement is a major issue of instability.

*No* correlation between total amount of compile-time spent vs score.  You must compile the hot stuff, but apparently tons of not-so-hot code is getting compiled and it doesn't help the score any.


Kurt William - lots of cores for cheap

General discussion round, not really a "talk".

Expect >32 core/die in 5 yr, >128 core/die in 10 yrs.
What's "New" and what can we expect in 5 years?

Theory: contention is that there's nothing new here since the 70's & 80's (concerning concurrency).  (general agreement on incremental progress but not breakthrough).

Systems: Lots of libraries (TBB, pthreads), lots of platforms (webservers), all OS's support multi-core.  JVMs are New in this space.  Might see OS-support for user-level thread scheduling.

Languages: Support been there, but it's locks & threads - no good.  Support for TM is coming, but will it help?  Then there's CSP...

Applications: Not much new here.... more games, more interactive more user-interface.  Old stuff: lots of HPC, speech, image.  Matching: easy parallelism; lots of interesting apps here.


Uwe Kastens, Compiler-Driven Dynamic Reconfiguration of Arch Variants

What is "reconfigurable"?  "Usual" approach - HW guys compile to a netlist, do some mapping, place & route, make a FGPA/silicon, while the SW guys write some assembler on the FPGA - then at the end they try to run the SW on the HW.  Goal is to run some complex function faster on a FPGA than in normal ops.

Our approach: read source code, look at program structure, find hot code & hot loops; compiler knows how to switch the hardware between variants; then it compiles the code for a particular FPGA variant for each loop, and reloads the FPGA with a new function between loops.  But need to limit the FPGA to certain variants so don't actually need to reload the whole FPGA.  

Fixed small set of architecture variants.  Keeps cost to reconfigure the FPGA very low.  Compiler can choose the variants.  Example: FPGA runs either a small CPU or a small SIMD CPU or a MIMD CPU.  Or reconfigure the pipeline structure, or the default register bank access patterns.  In the different configurations can use more or less power for problems that are more or less parallel.  e.g., in SIMD mode have all ALUs active but turn all but 1 decoder.

Pick variants & design space to those known to be efficiently solved by a compiler - gets a fast compile time & good results there (use a human elsewhere).

Can map any arch register to any physical register - so can e.g. have CPUs share registers, or do unbalanced register allocation - if some CPU needs more registers in it's portion of code than others.  Does some register allocation affinity games, to avoid having to reconfigure register layouts between blocks of code.  

Nicely flexible overall design.  Instructions name arch registers & ops, but also can re-map real registers & ops periodically.  So use targeted registers & ops for a particular chunk of code, then reconfigure when the next chunk of code is different enough from the last chunk- and the cost to reconfigure is lower than running with a less-well-targeted registers & instructions.

Ugh, can't read data charts - has some 4 embedded programs.  Seeing 30% reduction in execution time AND power cost, by reconfiguring.  Code size is a little larger, because either poor parallization(?) & the reconfiguration overhead.


Jim - Clone Detection

Exact clone'd code, white-space only changes, near-miss clones.
Bug in 1 deserves inspection in all the rest.

Lots of cloning studies out there -esecpially in-depth of the linux file system.  But no wide-range-of-systems studies.  We did a bunch of open-source systems, C, Java, C#.  OS, Web apps, DBs, etc.  Used their own clone-detection, as a combo of AST-detection and text-based detection.

Text-based comparisons are sensitive to formating changes.  But AST parsing is expensive (requires tree compares) and does not find "near miss" clones.  Ended up doing "standard pretty-printing" - which uses the AST to normalize the text.  Then do text-based comparisons.  Also handle small diffs in the middle of clones.

Did 24 systems, 10 C, 7 Java, 7 C#.  Linux 2.6 kernel, all of j2sdk-swing, db4.  Varied from 4KLOC to 6MLOC.

Java methods have more cloning than C methods - 15%-25% (depending all the diff threshold) of Java methods are clones; C varies from 3% to 13%.  C# is more like Java, until the methods allow more diffs - and then there is a lot more cloning.  Swing is hugely clone-ish, >70%.

C systems - Linux has the same cloning behavior as postgres & bison & wget & httpd. Java has no more clones as you allow more diffs as compared to C - I suspect good Java refactoring tools detect & remove clones earlier.

C# shows a LOT more clones as more diff's are allowed.  Maybe cut-n-paste is a coding style?  C#'s clones are larger methods as well.

Something like 10-20% of all files in Java have 50% of functions are clones.  In C# it's more like 20-35% of all files.

Files with 100% cloned code: 15% of C# files are totally cloned.  For Java whole file cloning is fairly low.

Localization of cloned pairs: are they in the same file?  same directory?  different directories?


Thomas Gschwind - Business Process Modeling Accelerators

Business Process Modeling, patterns, x-forms, refactorings
Process Structure Tree -

 - Process is a large unstructured graph.  Visio-style graphical is error prone for large process models.

But can take the structure graph and build a tree from it.
(The graph is all fork/join parallelism).
A small part of the tree requires state-space exploration.

After the tree is correct, can apply patterns & Xforms to the tree.

Essentially Petri Net edit'ing on a structured graph/tree using Eclipse.

100's of users inside of IBM; generated business models are much higher quality, make them 2x faster.


Philip Levy, The New Age of Software Attacks

Why?
  • Not enuf to do?
  • Crime is Big Business
  • Usable exploit: can be sold for $75K to $100K
  • So called "security researchers".  Some are respectable, most are going after the $100K payoff.

"Visual Virus Studio" - actually exists.  PDF is a common vector for *targeted* attacks.  "Adobe Reader is the new Internet Explorer".

Looked at all the fixed vulnerabilities in Adobe Reader
  • Need to do a better job teaching students on how to write secure software
  • Stop writing in assembly language
    • ANY unmanaged code, include C/C++, should be banned
  • Languages need to be upgraded to be more secure
    • New constructs & type systems
  • More sophisticated testing tools & methodologies
  • Better analysis tools for legacy systems

How big is Adobe Reader?
  • 100K FILES
  • 42MLOC, no comments or blank lines: 25MLOC
  • Estimated 75 man years of development per year over it's lifetime

Problems are clustered - e.g. one module has had 25% of all security problems.
Top 6 modules have 80% of security problems.

Break down problems another way:
  • array index OOB - 32%
  • unknown - 27%
  • intege overflow - 14%
  • range check missing - 8%
  • capability check missing - 4%
  • out of memory 3%

How found:
  • Fuzzing - 42%  (creating illegal inputs and throwing them at the program)
  • Code Review - 35%
  • External report - 12%

Some examples:
  • asserting against null, but no asserts in debug build and null possible in release build.
  • Failed to check for error returns, sometimes even not catching return value
  • Casting "attacker controlled values" (values from the PDF file) to valid enum
  • Or using attacker-controlled value for loop & array bounds
  • Sometimes a complex pre-condition needs to be checked first
  • Some of the fixes are still incorrect

A Quick List:
  • Badly designed lib funcs (strcat, sprintf)
    • Microsoft's banned API list is 4 pages
  • AIOOB
  • Calling thru func ptrs
  • Integer over/underflow.  Many protocols send N followed by N bytes.  Send a really big number that overflows (after scaling for malloc) into a small number - then malloc succeeds, but the data overwrites the buffer.
  • Pointer math
  • Uninitialized vars
  • Dangling ref pointer
Basically: it's BASIC STUFF that goes bad, not fancy stuff.

All memory integrity or soundness type stuff.
But it goes beyond AIOOB -
e.g. Name string like "Robert); DROP TABLE Students;-- "


Jeff Fischer, Solving the "Britney Spears Problem"

Users assigned roles; roles are assigned privileges.
Benefit: users can come & go without changing policies.

"Patients" can view medical records, but "doctors" can change them.

Can add annotations to Java that specify roles.

Problem: lack of expressiveness; solution: logging.  Violation happened (people saw Britney Spears' data), but the logging allowed catching the violators.  Another solution: manual checks.  But no compiler support, so missed checks.

Problem: Lack of explicit access policies.
Really: need a way to "type" data (and hence methods using the data) with some kind of security/access policy.  More problems: often checks are hoisted to entry point for efficiency, and then violate policy later.  Relying on manually-inserted runtime checking.

Our stuff: parameterize classes by some kind of index value defining Role.
Each method includes an explicit policy (statically checked).
Annotation processor can either statically type-check the policy, or insert runtime checks.


Jens Knoop - Squeeze your Trusted Annotation Base

"Interesting program properties are undecidable!"

Compiler optimization - not so worried: just lose some optimality.  Results still usable; but if we have better results we can produce better code.

Worse-Case execution time analysis for hard real time: Bang.  Your dead.
WCET bounds must be safe (soundness), and wish to be tight (optimal or useful).
E.g., a bound of 1hr for a mpeg3 audio-player interrupt is safe, but useless.
Need, e.g., all loop bounds.  
Need user annotations.
User annotations are untrustworthy, but must be trusted.

Squeeze the trusted annotation base - reduce user chance of screwing up.

Try to replace trust by proof - as a side-effect generally improving optimality.  Then take advantage of trade-off between power & performance of analysis  algorithms.

Use model-checking with care.  If something can't be bounded automatically, as
the user for some bound - then prove it.  BUt also believe the user's bound isn't tight.  Try to prove a tighter bound with the model-checking using binary search.  Indeed - can do away with user assistance often; just guess and binary search to tighten it.

Summary: can often completely empty the trusted annotation base, usually shrink it; often improve WCET bounds.

Start with the cheapest techniques (constant prop), move up to interval analysis, then symbolic bounds & model checkers...


Joe Newcomer - Who do you trust to run code on your computer?

(Joe is a Windows attack expert)

Attack Vectors - autoexec.bat, .cshrc & friends, autoplay, device-driver installs (Sony), mac resource attacks

Engineered attacks: phishing attacks, downloads, activeX, client-side scripting

Lots of "security" designs & codes are done by junior programmers w/out adequate oversight.  Pervasive attitude: "software is secure - as designed, as implemented, as maintained".  Small+fast is all that matters...  not...

Rant against all modern OS's, talking about #of security patches in Mac, Linux, unix, Windows...

Client-side scripting is vulnerable. Flash, Office, Adobe Reader, etc.

What are we teaching people about security?  How about non-CS types?
Graphics designers, website designers...


Stefan Jahnichen, Swarm Systems in Space

Build very small satellites, swarms of them to watch the Earth.  They will spread out after launch to watch e.g. weather, bush fires, etc.  Takes about 2 days after launching to get a communication link.

Satellites have good earth coverage; orbit at 500 to 600km, orbit earth every 3 hrs or so.  Wide range of sensors amongst different satellites.

Have a Real Swarm of cameras; map a Virtual Swarm onto that; map mission goals on the V.S.  

Optimized synthesis of consensus algorithms.

Special OS - distributed OS basically.  Used to maintain swarm integrity (keep satellites together).  


Peter Welch, Engineering Emergence...

Thesis: in the future, systems will be to complex to design *explicitly*

Instead engineer systems to have the desired properties "implicitly".

"mob programming", ant mounds, etc.... simple rules of behavior leading to complex behaviors that emerge.  Emergent Behavior

Mechanisms Design - (game theory),
  • Rational actors have local private info
  • Emergent: optimal allocation of scarce resources
  • Optimal decisions rely on truth revelation

Swarming behavior (flocks, etc)
  • local actors, local behaviors
  • Emergent "swarm" behavior
  • UAV swarms & autonomous robots

Social communication (gossip, epidemic algorithms)
  • large, ad hoc networks
  • emergent: min power to achieve eventual consistency
  • low power, low reliability sensors & data propagation

Design for emergent behavior
  • occam process per "location", a 1024x1024 grid has 1mil processes
    • each location has a com channel to it's 4(8) neighbors
  • occam process per "mobile agent"
    • agents register with local process (or up to 4 if it's straddling a local region) & local process tells the agent what's nearby.


No idea how to design high-level behavior from low-level rules, but busy experimenting and looking for some cases.


Cliff

Category: Web/Tech | Permalink | Comments (2) | TrackBack (0)


DaCapo Trip Report
May 5, 2009


http://www.cs.tufts.edu/research/dacapo
DaCapo is a small (but growing) research group focused on GC & Memory Management issues.  They meet about every 9 months, and the meeting bounces around depending on who is hosting it.  People mostly present work-in-progress, and the audience is up front with their critiques... often interrupting the speakers or getting sidetracked into loud off-topic conversations, i.e., it's a fun boisterous crowd.

DaCapo is in the Boston area this year, held at Tufts University's Medford campus (everyone's heard of Harvard and MIT, but how about the Berklee College of Architecture, Fisher College, Boston University, Suffolk University and dozens and dozens more? - Boston has more colleges and universities per square mile than any other place I know).  As usual for me traveling to Boston, I skipped the car and just used public transportation - the T works marvelously well (as compared to e.g. VTA; crossing the entire town from Logan Airport stuck way out in the water to Tufts some 8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose on the VTA takes something like 2 hrs). 

Anyways the weather was supposed to be highs of 65 and rainy with lows in the 40's - so I packed my parka.  Turned out the weather report was a little bit off, with highs in the upper 80's and sunny.  Man did I roast.  But I got out and enjoyed the sun - we (the entire DaCapo group of 40 or 50 people) walked about a mile to lunch each day, plus I walked to and from my hotel & the T stop (also about a mile).  Enough ruminating, on to the talks!

As usual, I pay attention to some talks and chat w/neighbors through others.  I brutally summarize.  I'm opinionated & narrow focused... so Caveat Emptor.



Change Tracking

What changed?  Who did it?  When?  Points out the flaws of using 'diff'.

Logical-Structure diff.  Auto-magically spotted systematic change patterns.
It's a better code-review diff.

Can report diff's as reports like "for all XXX, replace XXX with YYY except for ZZZ_XXX_ZZZ", etc.  Vastly better error reporting!!!  Vastly better diff reporting!!!

Cliff Notes: We must get this gizmo and check it out!!!!

Paper report: http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf
Sample output: http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm



Inferred Call Path Profiling

Very low cost path profiling.  Hardware counters have very little cost, but provide little context.  Software is reversed: high context & high cost.
Cliff Notes: Compare this to call/ret RPC hashing.
Cliff Notes: Bond & McKinely does something similar.

Instead, look at number of bytes from main() to leaf call on stack.  Basically PC & SP (offset from stack base) provides all the context we need.  Dynamically detect unique PC & SP (they did it in a training run) pairs, and reverse them to get a call-stack.

Lots of duplicated stack heights, how to tell them apart?  On SpecInt unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call stack).  But huge spread; really sucks bad on some benchmarks.

If have a list of entire call-graph, then can remove ambiguity by growing some frames.  But requires some global analysis to know which frames to grow in order to be unique.  "Solved" problem with a simple greedy search.  Compute all call-stacks over whole program run, pick a conflict, grow some frame, re-compute & keep if more PC/SP pairs are distinct.  Repeat until it's rare enough that can live with dups.

Cliff Notes: The whole-program ahead-of-time thing isn't feasible for us.  How to do this on-the-fly during tick-profiling?  Hash PC/SP, find a tick record, 1-time-in-1024 check for dups (crawl stack the hard way and confirm call-stack), otherwise assume unique & increment tick record.  If MISS in hash table, then crawl stack the hard way.  If record is labeled as having dups, then crawl the call-stack & find true record?)  Actually, probably expect the call stacks to be common, but the PC's to be fairly unique.  So really asking for unique RPC+SP (not PC+SP).

His experiments: collect PC/SP pairs about 100 times/sec.  0.2% slowdown (must have had careful measurements to detect that!)



Breadcrumbs - Recovering Efficiently Computed Dynamic Calling Contexts


Also builds on probabilistic calling contexts. 

Given a (nearly) unique ID (say, computed at each fcn entry via a simple hash on the return PC).  IN the past, reversing these unique ids: use a dictionary (very large & high cost), use a calling-context-tree (smaller, but requires you to track where you are in the CC-tree at runtime).  Forward search, but undecidable and limited options for pruning.

Attempt to reverse the hash fcn (but only when we need to take the semi-unique id) and convert it to a call-stack.  Straight-up reverse is no good, but is easy to guide search.  Callers of leaf call themselves have structure - e.g. expect to see a call-op from the call-site, etc.  Still need a list of all call-sites (generally available anyways because of inline-cache structures).  His function is "p' = (3p+c)", where "c" is RPC?  Requires a handful of instructions on function entry.

Cliff Notes: If this reverse search is successful most of the time, it says that during a profile tick we can record just the PC/SP (or RPC/SP/PC?) and then reverse them to the entire call-stack later - on demand when RTPM requests.  So a 'tick' is no more than record a few bits (and handle buffer overflow), but no stack crawl.

Cliff Notes: combined nicely with 'null-assignment-stack' info, ie. collect the entire stack and 'color' all NULLs with a unique stack id.  Then if get a NPC can report both the NPC at the crash site, but also the call-stack that assigned the NULL in the first place.



Transparent Clustering - Phil McGachey, Eliot Moss, Tony Hosking

Attempt to auto-distribute multi-threaded programs.  Pure Java, no JVM hacks.  No programmer involvement.  Dynamic class re-writing.  JVMs can vary.  Hardware can vary.

All machines on a network run RuggedJ nodes.  RuggedJ includes a rewriting class loader, some runtime libs, and some application-specific partioning strategy.

Rewriter: generate lots of classes & interfaces; especially hook methods & fields.  Hook into the runtime.  Abstract over object location; remote & local objects.

Ex: Standard class "X".  Rewrite into a X_Local and X_Stub classes.  X objects on Node A are of type X_Local.  On Node B, X objects are really X_Stub objects where all the calls are remote calls.  Both X_Local and X_Stub implement the "X" interface.  Some other Y_Local can refer to X_Local on Node A, but refer to X_Stub on Node B.  If want to migrate from Node A to Node C, want to indirect the X_Locals; so for mobile objects there's an indirection layer called "X_Proxy", that then be switched from pointing to an X_Local vs an X_Stub (if the object migrates).  Non-mobile objects skip the X_Proxy indirection layer. 

Inheritance & interfaces are maintained in the rewritten classes.  Static classes conform to the RuggedJ model (I assume replication is allowed?).  Arrays are also wrapped to support indirection. 

Constraints: Native code breaks everything.  Try to not rewrite stuff referred to by native code (maybe can intercept all native calls and rewrite args?).  System code is also an issue: loaded by the system loader, is not rewritten.  (Can they use -Xbootclasspath?).  But most code run is actually system code!  (eg. String libs are very common!)

So must deal with system code without changing the JVM. 
  1. Can wrap system class objects.  Must unwrap when passing to system or native code; must re-box on return.  Re-wrapping is expensive: need the exact same wrapper, so need some kind of hash-lookup to do re-wrapping.  High overhead, so avoided where possible.
  2. Extend: some classes can be extended and preserve the original class & signatures.  Extended class follows the RuggedJ model but also the original system model. No need to unwrap/re-wrap.  Does not work e.g. when system code generates the object (since sys code makes instances of the original and not of the extended superclass). 
  3. Some system classes are only ever referenced from user code; can just dup the code out of rt.jar and make a user-class (which now gets rewritten by the normal RuggedJ model). 
  4. Immutable classes.  Just replicate.

Apparently able to cluster some complex programs.
No idea about performance as compared to explicitly clustered apps.



Jikes RVM Native Threads

Was Green threads, now Native threads
(everybody else is dropping green threads too....)
Motherhood & Apple Pie.



Computer Science as an Experimental Science

Reproducibility?  Experimental bias?
Show how these problems are solved in very-high-energy astrophysics.

(1) No publish without multiple colleague confirmation.
(2) Publish
(3) MUST also now publish reproductions or non-reproductions's of other peoples' experiments (or lose credibility as a research entity).  But bar to publication is lower, and expectations are lower.  i.e., people expect to publish in-progress work

Biggest comp-sci problem now: lack of infrastructures.
Examples of infrastructure bias: green-vs-native threads, some barrier styles are hard to use (.Net embedded objects), stack-scan (OVM makes fast stack scanning at the cost of everything else being slower).  HotSpot (high-value compiler, good GC, good most everything) vs JIKES.

Cliff Notes: This was actually a really fun talk (and lots of discussion) despite my paucity of notes. 



Web 2.0 Profiling

Do web 2.0 workloads differ from legacy, e.g. SpecJVM98, DaCapo suite.
Relies on browser to run Flash, JavaScript, REST, AJAX.
Benchmarks: IBM social networking software; Java PetStore 2.0, AJAX-based RIA using JEE5, uses Lucene search.
Apply workload: viewBlog (from 9 blogs), createBookmark (total 14), selectTag (PetStore 9), etc. 
Workload mix can be varied, but based on commonly observed usage pattersn from internal deployment of Lotus Connections).

Wanted to remove DB from the workload, put DB on ramdisk.  Same for LDAP.  Used Grinder to generate load.  Main server running J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5 1.66GHz.  Fairly reasonable setup for a small web/app load.

See lots more chattiness; very short fast requests, high interrupt rates, smaller requests but many more of them.  Looked at low-level profiling.  See fewer I-cache misses as compared to a large legacy apps.  Better I-cache behavior.  But data-cache miss rate is still very significant.

Also different transactions have different scaling limitations.  For PetStore the Java persistence layer started falling off after 4 cpus. 



Stream Processing

"Spade" - Cluster stream processing at IBM.  Continuous high-level streams.  Stream arrives for weeks & months; want to process and then emit stream data continously.  Want to express the problem with a language.

Priorities - fast streaming on a cluster; then Generality; finally Usability.  Beyond StreamIt.  Like StreamIt, works with streams that can be split & joined.  Language is typed; stream contents are strongly typed. 



Demystifying Magic - High-Level Low-Level Programming


Seen this talk before, but this time given by Steve Blackburn.  Nice historical perspective on writing system software in C vs asm - and the parallels to writing sys software in Java vs asm.

Java-In-Java
  • Key Principle: Containment.  Write as much as possible (99%!) in pure Java, and dive into the low level "magic" bits as little as possible.
  • Extensibility: Requires change quickly, languages change slowly.
  • Encapsulation: contain low-level semantics
  • Fine-grained lower of semantics: minimize impedance, seperation of concerns

Framework:
  • Extend Java semantics; intrinsics
  • Scoped semantics changes
  • Extend types; un/boxing, ref-vs-value, architecture sizes
  • Pragmatic: retain syntax where possible to keep front-end tools working

Introduce raw pointer type: "Address" looks like a Java type.
E.g.: Address a = ...;   ... = a.loadByte(offset);

Semantic Regimes:
Additive: "its ok here to have unchecked casts"
Subtractive: "no allocation via new allowed", similar to HotSpot VerifyNoGC.

Allow "unboxed" types - no v-table, so it's a bare primitive type.  But this isn't the default (but can be asked for).

Abstraction results are good: can implement a JVM on bare hardware and with a simple change essentially virtualize the same JVM inside a GC debugging harness running inside another JVM (where the virtualized JVM's heap is just a big array in the outer implementing JVM), etc.



Grace, Safe Multithreaded Programming for C/C++ - Emery Berger

Seen the paper before, but this is the talk by Emery.

Fake sequential semantics on stock multi-threaded hardware using fork/join and process protection. 

Grace is a pthreads replacement: "g++ -lpthreads" becomes "g++ -lgrace".

Speedups: in the absence of common benign races, Grace run programs run about 10% slower than raw pthreads - but have full sequential semantics.  It's "as-if" all the parallel programs are run sequentially immediately at the thread spawn point. 

CAN'T run applications where threads run forever, i.e. reactive or server apps.
Works well with fork/join, Cilk, TBB, etc.

So thread spawn becomes a unix fork with COW, use mmap for allow memory sharing.  At join points, smash in join'd threads memory updates via mmap.  Also need scalable thread-local malloc heap, plus aligned globals (to avoid false-sharing at the page granularity), plus some improved I/O.

Some simple measures remove nearly all false sharing.  Big one: everybody mallocs into their own space.  2nd big one: spread the globals out 1 per page.

Thread-spawn is as cheap as a fork on linux (has experiments to show it).  Due to thread-cpu affinity, if you spawn a bunch of threads they share a single CPU.  At the low end of thread-grain-length, fork is faster than spawn because the scheduler immediately spreads the processes across CPUs, whereas threads share for awhile.  So fork is actually faster than thread-spawn for awhile.

Real performance limiter for Grace is conflicts & rollbacks, and not thread-spawn overheads.

Performance is much better than e.g. STM, because after the 1st touch on a new page (and the SEGV & accounting), all accesses on that page run at full speed.



GC Assertions: Using the GC to check heap properties

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&rep=rep1&type=pdf

Global knowledge is required to understand seemingly local properties.
JBB example:

    Order order;
    for(...)
      order = ....;
    delete orderTable;
    ? are all 'orders' dead here?

But actually 'order' is held by the last-customer-transaction (as an optimization, or convienvence for customer?).  Leads to a leak of 'orders'.  Really: programmer does not understand program, too complex.

Add assertions to the code, and use the GC to check the property.

Cliff Notes: Azul is already gathering global heap properties during GC - but not asserts.  Gathering liveness & points-to info.  If points-to included a handful of sample real instances (and all the sames linked when possible), would be a very powerful way to instantly check some heap properties.

Sample asserts:
'assert_dead(p)' - expect to reclaim 'p' at the next GC cycle
Or region-collection: start_region(); ... assert_all_dead();
Or shape property: assert_unshared(p);
Or ownership properties (members of a collection are 'owned' by the collection)

If an assert fails, then do a slow crawl and provide the full path through the heap showing the failure.   Asserts only checked at GC points.



Proving Proving Correctness of Abstract Concurrency Control and Recovery - Trek Palmer & Eliot Moss

Transactions - closed nesting has issues
Open nesting & Boosting need from programmer: conflict info & roll-back info.
These are hard to provide correctly.

This work: a description language of abstract data types or structures.
Can describe the conflict predicates & inverses.
Can prove correctness of the conflict & inverse expressions.

Working with the abstract description of the data structure, and NOT e.g. with the real Java implemention.  E.g., no loops, no mutations, no recursion.  So the language isn't a 'real' language, but can describe many kinds of ADTs in code that looks sorta like Java.

Output isn't a functional program, instead the output is the result of using a SAT solver to prove correctness. 

XTN-semantics allows conflict detection to be proven correct pairwise, instead of having to do full interleaving.  Formally, a conflict
predicate is correct iff it is true when the operations do no commute.

The conflict predicate tells when 2 XTNs conflict, and the inverse allows an optimistic XTN to rollback.

Obvious use-case: use this tool to write a transactional-version of NBHM or other JDK concurrency utilities.



Hard Real-Time GC Scheduling

- Periodic Scheduling
 - GC runs at highest priority, but periodically yields to mutator
 - Metronome
- Slack-based Scheduling
 - GC runs at lowest priority
 - Can be preempted at any time
- Makinac (Sun RTS)
- Work based Scheduling
 - GC runs at allocation time
 - Problem with allocation-rate jitter

HRT - Deadlines must never be missed AND must be verifiable analytically that no deadlines are missed.  Systems tend to therefore be very simple, so that they can be proven.

OVM - like Metronome.  Dynamic defrag; arraylets; Brooks style barrier; replicating barrier/; incremental, concurrent, supporting slack-based style scheduling. 



Understanding Performance of Interactive Applications

Typical profilers give the wrong numbers: they report total time spent, not time spent between mouse-click and page updated.

AWT/Swing & SWT - similar: single-threaded, gui thread in "event loop", relies heavily on Listener model.  So profile using gui call-back Listeners between user events and the eventual display change.

Really simple idea: profiling & event visualizer tool between click & view times.


Program Metamorphsis

Trouble w/refactorings: must preserve program behavior with each step.  But if we want multiple refactorings then at each step along the way fixup code (needed to preserve program behavior) pollutes the code.  Want to do multiple refactoring steps at once - while leaving the program possibly broken in the in-between steps.

So compute a program-semantics (names & def-use chains), and let the user make partial refactoring steps and then declare "I think I'm done" - and the system compares the new program-semantics with the old, and reports if they are not equal. 

So actually comes up with primitive not-quite refactoring steps, which he will compose to build larger "real" refactorings, or compose lots to build a multi-stage refactoring. 



Instruction Selectors


HotSpot C2 uses a BURS framework.  Very hard to optimize & debug.  So these guys have a semantics which can span both the ideal IR and real machine instructions.  Describe both using "CISL" and the tool does the mapping.

Once the tool has a mapping, the user provides an adapter that converts the mapping into the code needed by the compiler back-end.  This would replace e.g. the ADLC.  Gives examples of machine encodings for PPC and Java bytecodes.

CISL is given an IR tree and produces a target tree with equivalent semantics.

Not sure it's entirely better than BURS (maybe no worse), but I'm still sold on rewriting in e.g. C++/Java hand-written greedy match rules.  These code-gen-gen tools are pain for long-term maintenance.



Do Design & Performance Relate?


Is fast code required to be ugly?  Is beautiful required to be slow?

Pick 200 code metrics.  Also pick some performance metrics (cycles, single threaded, objects created, etc).  Could not follow talk... or at least he wasn't very clear with the objectives & results (if any).  Interesting idea though.


Category: Web/Tech | Permalink | Comments (1) | TrackBack (0)


Odds & Ends
April 15, 2009

Various pending goodies...

Load-Linked/Store-Conditional vs Compare-And-Swap


Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!).  There was a long period of debate in the CompSci community one what constituted the "minimal" needed instruction to do useful multi-CPU work.  Eventually the community has decided that the Compare-And-Swap (CAS) instruction and the Load-Linked/Store-Conditional (LL/SC) instruction combo both are (1) sufficient to do useful work ("infinite concensus number") and (2) relatively easy to implement in real hardware.  X86's, Sparc's and Azul's CPUs use CAS.  IBM's Power, Alpha, MIPS & ARM6 all use LL/SC. ARM5 only has an atomic-swap.

LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms.  With LL/SC, you first Load-Linked a word.  The hardware marks the cache line as "linked".  You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter).  Finally you issue a Store-Conditional to the same address.  If the cache line is still "linked", the store happens.  If not, then not.  The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line.  Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented).  "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail.  I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure.  The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC.

With CAS, you typical first load a word with a normal load, then manipulate it.  Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory.  CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic ABA bug.  In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless all copies of A die first - including the one loaded before the CAS.  Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS.  Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively.  With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link".  E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS). 

These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations.  Nothing in the academic literature requires any sort of global ordering on these atomic instructions.  Instead the usual academic assumption is that all memory operations are strongly ordered - which is obviously not true on all modern hardware.  Practitioners are required to insert memory fences as needed to achieve the desired ordering.  Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do.  Azul's CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations.  (As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking.  Report back the %-tage of dropped counts and the throughput rate.  Then try it with an Atomic counter.  My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless).  I am less familiar with the fence properties of common LL/SC implementations.  Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc?

build.java


Some time ago I reported on a build tool I've been using.  Currently 'build' is being used to build HotSpot internally to Azul.  We've got it building 400+ files, plus build rules for building portions of the JDK, compiling w/gcc & javac and also 'javah', tar, jar, strip, binary signing, etc.  In addition to the typical parallel-make functionality, it includes auto-discovery of c++ header files, intelligent load-balancing of big builds, etc.  It's blazingly fast, it's easy to hack (being written in plain-olde Java).  I hunted around awhile and couldn't find a good place to dump a medium-large blob of code, so here's a sample build.java in 4 parts:

Part 1
Part 2
Part 3
Part 4


A "HotSpot -server" aka C2 compiler IR visualizer

No I didn't do it!  This deserving grad student did it.

Here is the reference to the visualizer for the ideal graph of the HotSpot server compiler:

http://ssw.jku.at/igv/master.jnlp

It is a JNLP application hosted on a university server (http://ssw.jku.at/General/Staff/TW/ has also a test file to download), but the easiest way to get a first impression.  The source code of the tool is hosted on http://kenai.com/projects/igv  The server compiler instrumentation is part of OpenJDK and also included in Sun's weekly builds of Java 7.

More C2 Goodies

John Cavazos wrote:
... One of the things I am currently looking at is determining the right phase-ordering of optimizations applied to a particular program being optimized.  I have some nice (un-published) results for JikesRVM,
but it would be nice to replicate the research for HotSpot...

I have strong arguments for nearly all orderings of our passes, so I'm curious to know about your phase-ordering results.
The only obvious transform we might be missing is PRE, but our peephole opts pretty nearly subsumes PRE (they are not mathematically equivalent - I can write programs where either PRE or the peephole stuff makes progress against the other).  In practice, PRE will find essentially nothing once the peephole opts are done.  You'll notice that we do the peephole pass alot; it's totally incremental and provably linear bounded.  In other words, if there's nothing to be gained then there's no cost in trying.  The peephole pass includes amongst other things all pessimistic forms of copy propagation, value equivalence, constant propagation (especially the null/not-null property), constant test folding, repeated test folding, dead code elimination, load-after-store opts (aliasing analysis is included for free during building of SSA form), algebraic properties, and lots more.


For HotSpot, the optimization ordering is:

  • Build an inline tree, including class hierarchy analysis.  This is the one pass I'd be willing to move, as it happens too early.
  • (a single unified pass:) parse bytecodes, inline, build SSA form (the IR remains in SSA form always), do peephole opts over SSA form.  This pass typically costs 40% of compile-time budget.
  • Fixed-point the peephole opts over SSA form, once all backedges are known after parsing.
  • Loop opts round 1: "beautify loops" (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, split-if (zipper-peel CFG diamonds with common tests, plus some local cloning where I can prove progress), peel loops (required to remove loop-invariant null checks)
  • Fixed-point the peephole opts over SSA form
  • Loop opts round 2: "beautify loops" (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, lock coarsening, split-if & peel - but if these don't trigger because there's nothing to gain, the do iteration-splitting for range-check-elimination & a 1st loop unrolling.
  • Fixed-point the peephole opts over SSA form
  • Conditional Constant Propagation (the optimistic kind, instead of the pessimistic kind done by the peephole pass)
  • Iterate loop (split-if, peel, lock coarsen - but these typically never trigger again and take very little time to check), unrolling & peephole passes, until loops are unrolled "enough".  On last pass, insert prefetches.  Typically this iterates once or twice, unless this is a microbenchmark and then unrolling might happen 8 or 16 times.
  • Remove tail-duplication, and a bunch of other minor code-shaping optimizations e.g. absorb constant inputs into deoptimization-info in calls, or commuting Add ops so that 2-address machines can do update-in-place.
  • Convert "ideal" IR into machine code IR.
  • Build a real CFG for the 1st time, including a dominator tree, loop tree.  Populate with frequencies from earlier profiling.
  • Global latency-aware (loop-structure-aware) scheduling.
  • Replace null-checks with memory references where appropriate.
  • Register allocate.  Internally this has many passes.  This pass typically costs 40% of compile-time budget.
  • Sort basic blocks to get good control flow ordering (forward branches predicted not-taken, backwards predicted taken, etc)
  • Some last-minute peephole opts.
  • Emit code into a buffer, including OOP-maps & deoptimization info.


Travel Plans

I got too much travel coming up!

Apr 30, May 1st - DaCapo in Boston.  Favorite small group; GC & Java focused. Website: http://www.cs.tufts.edu/research/dacapo/.  I had to turn down the invite to an SSA seminar in France because the dates conflicted.  Very sad.  I hope one of the attendees will post a trip report.

May 11-May 15.  IFIP WG2.4  (International Federation of Information Processors, Working Group 2.4 - a *really* old European group with a random mix of industry & academia.)  It's a nice group to preview JavaOne talks, and always the meetings are a week-long rambling discussion in some quaint resort.

June 2-June 5.  JavaOne.  I owe slides for 3 talks by end of this week.  'enuf said.

July 6-10th - ECOOP, as a paid-for invited speaker.  A free vacation to Italy!   :-)

Cliff

Category: Web/Tech | Permalink | Comments (10) | TrackBack (0)


YATR2YAC (yet another trip report to yet another conference)
April 5, 2009

'Tis the Season to be Conferencing...
Blah, I'd rather be hacking, but here goes...  

I went to EclipseCon, to CGO and back to EclipseCon.  Here's how it can down: I got invited out of the blue to talk at EclipseCon by Darin Wright, who had seen me speak earlier at the JVM Language Summit.  Like a dummy I accepted it instantly instead of checking my calander.  Naturally it conflicts with CGO - and I was on the CGO PC this year, so I felt obligated to attend CGO as well.  Turns out my talk at EclipseCon got scheduled for Wednesday, and I decided I could skip CGO on Monday (I think there was one talk I was interested in) - so I scheduled a 1-day only trip to Seattle.  Monday I wandered EclispeCon at the Santa Clara Convention Center, Tuesday I took the 1st plane up to Seattle and the last plane back, and Wednesday it was back to Santa Clara to give my talk and surf the EclipseCon booths.  The flights on Alaska Air were smooth and easy.  No WiFi at CGO - the Hotel wanted some totally outrageous price for conference WiFi (I think the local chair was saying $50/head prepaid?!?).  (Remember my grief over Hotel WiFi in the last blog?)  So over lunch break (which was Chinese food and I don't like Asian food of any kind) I headed over to Starbucks.  Turns out Starbucks will happily give you 2hrs of free WiFi a day from ATT - provided you register $5 on a Starbucks card with them.  Yes!!!  Suddenly I have a $5/month nation-wide WiFi plan!  I snacked on some salad'y thing, drank my triple-latte, and happily Surfed the Internet until show-time at CGO came 'round again.

1st up: EclispeCon, mostly cause I got nothing to say.  These guys are in a totally different ball park from where I live; they mostly deal with complexity control (or at least 1/2 the talks essentially amounted to adding, changing, or removing a control layer).  Makes me wonder if the cure is really better than the disease. I gave my talk, but it was too much talk in too little a time window.  Both me and the attendees would have been better off if I could have spent more time on the details.  C'est la vie.  Other thing of note: the web site for EclipseCon is vastly better than the usual academic conference I attend.  Also the staff was exceptional friendly and easy to work with.

Next up: CGO.  Despite my 1 day fly-in-fly-out I got a lot more out of CGO. 

----------------------

Keynote: Vikram Adve

Missed!  Arrived barely in time for the last 10 minutes.  That's the penalty for attempting a 1-day fly-by.  I hear he said something good about Azul systems (he did ping me the day before for some details).  Basically it's a Call To Arms for improving C compilers and C runtimes.  He does say that fully auto-parallelizing code is probably a bad idea, and that we need to change the language instead (something I very much agree with!).

----------------------

Revisiting Out-Of-SSA Translation

Cliff Notes: HotSpot "comes out of" SSA form during register allocation, where it really preserves SSA form throughout the entire compile process, but adds register annotations as part of reg alloc (and inserts live-range splits as part of spilling). So mostly this talk is about stuff HotSpot has been doing for years... but there's a key new piece.

Linear-time colaescing congruence classes; speed & memory footprints.
Why is coming out of SSA hard? (Cliff Notes: It's not!)

And now they show what HS has been using for years: insert copies as needed for a simple out-of-SSA translation, then aggressively coalesce away the extra copies.

Nice diagrams for IFGs + copy-affinities.

Notices the value-equivalence property for defining IFGs (also done by HS).

Ok: here's what's new compared to HotSpot -server:

Fast interference test w/out building an IFG!
Only need to test against closest dominating var?
So scan the dom tree, with a stack-based DFS traverse of dom tree.
Uses 2 sorted lists (?); more complexity for value-based IFG tests.
But no building (and rebuilding?) the quadratic IFG.
Claim 2x faster than Sreedhar for coming out of SSA.
Big memory footprint reduction, because no need for live-ness sets & IFG!!!

40% of HS's total JIT budget is in reg-alloc! A faster reg-alloc is definitely interesting. A large part of that 40% is building the liveness & IFG sets

----------------------

Wave Propagation and Deep Propagation for Pointer Analysis

Wave Prop is very fast and uses huge memory (working on parallel version)
Deep prop is faster for small & precise memory sets - works better on desktop (and JIT?)

There's a zoo of ptr-analysis names:
- context-sensitive or not
- flow-sensitive or not
- field-sensitive or not
- inclusion-based (Andersen-based) or unification-based (Steesgaard-based)

Cliff Notes: HotSpot -server uses context-INsenstive, flow-INsensitive, field-Sensitive, unification-based

Build a graph of constraints, and do contraint-propagation.
Code: "a = &b;" Constraint: "a superset {b}" - means b element-of P(a)
Code "a = b" constaint: "a superset b"- means P(b) contained in P(a)
Code "a = *b" or "*a = b" - complex constraint, must add edges to the graph as constraint propagation adds new sets to the nodes.

Now build a graph: nodes are P(a) - sets of things pted-at by 'a', and edges represent subsets.

Cycles are common; collapsing them is crucial for performance. All things in the cycle must be equal. Uses Nuutila to find strongly connected components in a single pass (so does HotSpot - Vick's algorithm).  After collapsing cycles, you get a DAG - and Nuutila also produces a topo-sort as output.

(Still looks like a complex diff/delta propagation algorithm - but makes me wonder if standard bit-vector stuff is faster).

Not a bad-looking ptr-analysis. Might work in a JIT.
Manesh suggests using BDD's.

Cliff Notes: Real question of course, is if all the extra alias precision actually buys you any performance. PLDI paper of a few years ago suggestions otherwise: that HS's style of alias-analysis gets you 95% of all the possible gain.

----------------------

Fast Track: A Software System for Speculative Program Optimization

Chen Ding: another process-level fork/join speculative program optimization (contrast this to Emery Burger's "Grace" system). Gives sequential program semantics. Add annotations to "suggest" fork/join
points. Actually fork a process, use COW & mmap to guarantee sequential semantics in the face of races (end speculation, or force commit-ordering on speculative threads, etc).

New Idea: User can write an alternative implementation of some module; run both versions in parallel; keep the faster version & kill the slower version: "return do_whichever_is_faster( methodX(), fast_methodX() )"

New Idea: "fast_methodX" might be speculative and *wrong*. So run slow-version to completion and do error checking after the fact. But let the fast version run ahead (and if it's faster), it can spawn off new speculative fast versions. When slow version completes, use page-level protections to know what's been changed, and verify the correct semantics of the fast version. Run all the correctness checks in parallel with all the normal fast-case stuff. Needs some careful throttling.

Example using gcc's "mudflap". Array under/overflow, leaks, un-init'd memory, etc. Run 2 versions of all code: the existing code is "fast track" and the mudflap version is the "slow correct track". Goal: same speed as the "fast track" code with all the safety checks of the "slow track using mudflap". In some cases can get real nice speedups.

Cliff Notes: can we turn on mudflag for Azul?
----------------------
Scenario Based Optimization

Profile the hot methods. Tweak various parameters (e.g. loop unrolling, or cache-tiling parameters). Profile both old and new versions. Pick the best winner, and stick with it for awhile. Keep profiling continously, and switching code as needed.  Some JVM JITs do this.  HotSpot only sorta does.

The programmer has to pick what the new & old versions are, and the runtime picks the fastest one (no support for correctness - what if one of the versions is busted? Or tickles a timing bug which breaks things?)

Seeing some nice gains, some of the time. Avoids losing when turning on some aggressive optimizations. But generally unrealistic.

----------------------
Software Pipelined Execution of Stream Programs on GPUs

Basically StreamIt on a GPU

StreamIt is much easier to use than e.g. CUDA or OpenCL.

StreamIt model is a collection of connected serial parts; connected via pipes. Due to splits & joins, might need to exec some parts more often than than others (e.g. part Y "eats" 2 ouputs from part "X" before outputing a result itself). Get complex patterns of execution; difficult to map down to GPGPUs. Have bounded-buffer problems: need to map to a steady-state schedule using a fixed count of buffers. Then can arrange to do no allocation during execution.

- Profile each filter standalone, to figure out it's work load.
- Configure selection
- Add constraints
- Use CPLEX to solve
- Output CUDA code.

Looks like a classic software-pipeline, just done on a GPGPU. Compute minimum initialization interval, compile resource constraints, etc. Have issues with data-layout, so can do parallel data access.

Impressive compile times (30sec to 178sec).
Can get pretty good utilization on some kernels, so seeing 100x speedups.

Cliff Notes: My usual caveat for domain-specific languages applies here: if your domain fits StreamIt's model - then StreamIt is the fastest language for you!  If what you got doesn't look like StreamIt, then Too Bad.

----------------------
Stream Compilation for Real-Time Embedded Systems

StreamIt Again, but for Cell.
New constraint: out of memory per cell (distributed memory).
Different from GPU - GPU has shared memory, but very slow if doing bank-conflicting access - but can pile on more memory (more buffers) for some GPUs and less for others. Not so for Cell - must have a
uniform split of memory.


Thanks,
Cliff

PS - Thanks Doug for inspiring this blog


Category: Web/Tech | Permalink | Comments (8) | TrackBack (0)


Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report
March 19, 2009

Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report... sigh, all good hacks must come to an end, or at least be put aside so Real Life can intervene.  I went to DC for the ASPLOS and VEE conferences and all you get is this lousy blog.


2009 ASPLOS & VEE Conferences



ASPLOS & VEE are co-located in D.C. this year.  Mostly the trip itself was uneventful.  I was very busy either attending the conference sessions or preparing for my own talk and I didn't get around much.  The flights themselves were mostly uneventful (ComAir's HQ in the midwest lost power from a tornado, so I was stuck on the tarmac at JFK for an hour waiting to get permission to fly into DC - first time I've been delayed by bad weather when the weather between the source and destination airports was perfect).  The most notable point was that the hotel used WayPort as their internet access.  WayPort sucks horribly; I've used them in the past and swore I'd never use them again - but I forgot to ask when checking in to the hotel!  WayPort charges $13/night, and service is filtered through some horrible layout routing through TimBuk Too - average ping times to hit a server in the DC area was about 600msec (ping times from AT&T wireless in the conference room were ~50ms).  To add insult to injury, some other conference attendees stayed at a nearby hotel - which was cheaper and had free WiFi AND the free stuff was faster!  Definitely having WayPort as your WiFi provider is sufficient for me to change hotels.

Cliff Notes: I had a wonderful 1.5-long 1-on-1 talk with Monica Lam (http://en.wikipedia.org/wiki/Monica_S._Lam); she didn't know what Azul does.
Cliff Notes: More interesting talks with Jon Shapiro (http://www.eros-os.com, who's busy implementing a high-level low-level language for high-availability systems).

Reality Check: yes again a Mac failed to be able to output to the projector, and we all waited while the presenter hotswapped to a PC

Caveat Emptor: As usual for these talks, I only attend some sessions, I write stream-of-consciousness reports which I only lightly edit and the reports tend to be brutal and short.  They are roughly sorted here by my interest level, and not by e.g. time order.

General Cool Stuff:
Keynote: Monica Lam's Vision of the Future of Human / Computers Interaction
Getting Wrong Answers Without Doing Anything Wrong - all your base are belong to us
Text Processing in Parallel in a Register - wicked fast
Nanoscale Computing - wicked scary
Demystifying High-level Low-level Programming - how to write Java-in-Java
Asymmetric Chip Multiprocessor - 1st good use I've seen for these
Leak Pruning - Reported on before, but still good
TRIPS - Funky dataflow/VLIW hybrid (but not the weirdest I've recently seen!), too bad it didn't pan out

HTM Talks
Which I read but due to a snafu failed to attend the talks.   :-(
Maximum Benefit from Minimum HTM
- not so minimum (bigger than Azul's HTM!) and lousy results
Rock's HTM - a peek at Sun's experience with Rock's HTM; mostly Rock's HTM is really really weak

Multithreaded Bug Finders
CTrigger - The Winner in this section: cleverly induced sleeps trigger race bugs reliably
Anomaly-based Bug Detection & Prediction - It's an, ahhhh, interesting hack
Isolator - silently prevent race bugs

Making Programs Deterministic
Cliff Notes: Major flaw in the thinking here: Real Programs have Real Random Inputs.  Unless you capture & replay ALL external events, minor things like network-packet-arrival-order and the chaotic butterfly effect (http://en.wikipedia.org/wiki/Butterfly_effect) ruin your determinism.  So I think all these papers are tilting at windmills.
Kendo - Stock hardware, but requires a race-free program
Deterministic Replay - Goes only 1/2-way to preventing the Butterfly Effect
DMP - Deterministic Shared Multiprocessing - No replay attempt, just full-speed determinism (assuming no Butterflies)

Other Stuff
Talks I attended but aren't otherwise interested in...
Memory Buddies - Yet Another Use for Bloom Filters
Live VM Migration - pre-copy push still beats post-copy pull
Validate - Microsoft patches Are Bad (tm)
Raytracing vs GPUs - Doom!
Commutativity Analysis for Software Parallelization - the original work is WAY more interesting than this follow-on
Predicting Yield of GCs - Yea Olde Hat Trick
Phantom BTB - winner of the Most Unrealistic Hardware award



Keynote: Monica Lam

Why do computers break?  When we buy something, we expect it to work.  If it breaks, you buy another one.  But with computers they break on their own (e.g. u$oft update) or you can break it doing "normal" stuff - and if it breaks it doesn't help to buy another one - the new one is broken already.

Virtual Desktop - idea: keep the virtual machine in the datacenter, and do remote login.  Does not work well - requires WiFi everywhere.

With each new comp generation, 10x cheaper & 10x more users.
Mainframe-> mini -> worksta -> PC -> laptop -> cell
Also see 3 industries: internet, media/TV, telephone
Also see 3 networks: cable/satillite, broadband/phone, wifi/wimax
Also see 3 computing models: cell (always on, pocket), PC (rich legacy, offline exec, locality of reference), Cloud (device independent, central management). 
Things like GoogleGears or Virtual Desktop are trying to split the Cloud/PC model.

Guess where we are going?

1- (Man) a phone-y thing: key, cache, window into digital cloud, ID, personality, asserts, internet access.  Small, always on.  Can be lost easily & replaced easily.  No important state, but many Gigs of flash cache.  No real compute power.  Small visual display & keyboard.

2- (Earth) Can (1) plug into a PC.  Get big screen & keyboard, get fast processor.  Get hot DRAM, teribyte disk drive.

3- (Heaven) Backs onto the internet/cloud.


This model has some different security issues.  DataCenter is probably secure (or can be?)  You hold key in your hands.  Can we make path from hand-held-key to datacenter secure?

Trends - green, Bring-Your-Own-PC, Mac-vs-Windows.  End of Corporate PC model.  80% of people use personal PC for work; 45% of people use work PC for personal use.  25% of workers change security settings, etc.

SO... want: {User, Ubiquity, IT}  Separation of Concerns.

x86 virtualization seperates: hardware from VM; hardware from OS; different hardware configs; different user roles

MokiaFive company - Desktop-As-A-Service.  LivePC images "in the cloud".  Managed on central server.  Can go up to any mac/pc and run "your desktop" from a USB stick. USB stick/phone is a local cache only - backed up on the cloud. Think of USB memory as a "network accelerator".

Nice live demo.  From USB stick to a selection of PC desktops is not very long (+2 mins for a new PC to install software from stick).  No hotel wifi?  Claims to be sync'ing with cloud???

Nice integrity model: basically re-install the OS clean each time on a fresh VM-ware layer with anti-viral.  Intent is to clean out root-kits with each fresh login to machine.

Graphics is hard: many games are wired to DirectX, so losing much 3D performance.  Trying to bypass all the virtualization layers for DirectX.  But... guest & host have different address spaces; but graphics has lots of pointers; fast pointer munging is hard.

Other services: LDAP, revocation, offline access, logging, dlp. Encyption, Storage Virtualization: reju, incremental & atomic restore; secure access.  Uniquification, single-sign-on.

Seperate-of-Concerns Principles - 90% of code handles exceptions (only 10% does the normal stuff).  "Re-Install the OS" as a root-kit prevention.  e.g. system is structured as a fresh-OS-install because of the rare root-kit issue.

Bootstrap from a small trusted base: baremetal VM; LivePC player;  IT-managed base image (Windows+AntiVirus); finally user-apps & data

Like the Azul philosophy about GC - make the hard case (FullGC, infection from rootkit) common and then get it right - then all else follows.

Need to keep users ownership of own data - avoid the Big Brother mentality.  Give users a choice on where data is kept.


Getting Wrong Answers Without Doing Anything Wrong

Variations in unix env variable size, can see 15% variation in performance!  Running gcc -O2 vs gcc-O3, 400.perlbench.  Bunch of runs at each setting very repeatable per-run.  But with different env sizes see huge variation on 400.perlbench & 'ibm' benchmark.  

Conclusion: setting unix env variables can change your conclusions.

Also, sorting object files on the link-lines differently changes layout in memory, and in I-cache layout.

Conclusion: linking with 'ld *.o' is different from 'ld a.o b.o'.

Bias holds true even if looked at a large benchmark suite, or across different CPU #'s from Intel.

See a prior paper (which I now cannot find!!!) which studies this carefully.  Note that there are a bunch of JVM/Java benchmark issues here that also have been written up recently.

Cliff Notes: I've dealt with both of these in the past; so has Mark Moir (Sun) and Bean Anderson (SGI).  I discovered the need for I-cache layout while working for Motorola on PPC chips (somebody else actually did the implementation).  The results gave us an average 10% performance boost on a bunch of benchmarks, plus removed a huge source of variation.  Later at Azul, with our shared low-associativity caches we discovered that stack placement was crucial to avoid conflicts, fixed with a little random offset to each stack base.

Cliff Notes: What are other large systemic biases?  Are these two it?  And obvious question for a JIT - should I start doing I-cache layout games for performance?  How about making copies of an existing hot method and running both old & new and doing sampling, and pick the winner?


Text Processing in Parallel in a Register

Cliff Notes: this technique is insanely fast
Like SIMD, but char/bytes within a standard GPR.  So 4 or 8 chars in a GPR?

Doubling: field width from n-bit to 2n-bit
Inductive: repeated doubling.

Parallel bit streams; 1-to-1 correspondence to byte stream.  Vastly faster XML parsing; better cache locality; better branch prediction.

So: take 1st bit from each byte, and jam 64 bytes (bit #0 only) into 1 GPR.  Take 2nd bit into 2nd GPR, etc.  Basically, load 64 bytes into 8 GPRs, but bit-striped.  Need architecture support for bit-striping pack/unpack.

Now break up desired operations (e.g., compare for lexical token boundaries) into bit-field operations.  Can do compares of 64 bytes for tokens by simple ops on 8 GPRs.

Cliff Notes: really lousy presentation.  Hope the paper is better. Guesswork: take this string "java/lang/Object(Crunkify)" and find the '(' tokens.

  GPR #0 - xxxxxxxx - bit 0 from all bytes
  GPR #1 - xxxxxxxx - bit 1 from all bytes
  GPR #2 - xxxxxxxx - bit 2 from all bytes
  GPR #3 - xxxxxxxx - bit 3 from all bytes

My turn for a lousy presentation: Look at bits for '('.  Want some combo of 0's & 1's from the GPRs.  So e.g. XOR w/-1 all GPRs needing 0's - now need them as 1's. Then do 7 AND's (parallel log tree AND).  Any bits still set represent the '('s.  Nicely wide-issues per 64-bytes.  Expect to compare 64-bytes for a single char in maybe 6-8 clocks!?!?!?!  Plus evil pack/unpack issues/costs.

Wrote some code to demo finding '(' in a long string using maybe 1/10th clock per character, plus whatever the packing costs are (which probably dominate!).  In this code, 'b0' through 'b7' are 64-bit values holding bits 0 (through 7) for 64 chars in turn.  Here's the test-64-chars-for-paren check:
      // The open-paren char, '(', hex 0x28:
      //           0     0    1     0    1     0     0     0
      long res = ~b7 & ~b6 & b5 & ~b4 & b3 & ~b2 & ~b1 & ~b0;
      // A non-zero res means a '(' is found, and the bit-number is
      // the index into the 64-byte array where it was found.


See attached source code for TextScan.java


Nanoscale Computing
128 bytes of ram; 5micrometers in area

Smaller version: 8 bytes of ram; 1.5micrometers in area; similar in size to largest virus.  Chemical sensors report results by directly changing bits in the instruction stream (e.g., flipping a bit changes a JMP to a NOP, in effect making a conditional branch).  Can detect adequately running at 100Hz (not 100MHz).  Can write small simple test/loop codes in this form factor.  Can e.g. count molecule rates or virus rates in a solution (injected in bloodstream). 


Cliff Notes: Very Very Scary.  See Vernor Vinge's dystopia "A Deepness in the Sky".  http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky
http://en.wikipedia.org/wiki/Smartdust


Demystifying High-level Low-level Programming

Goal: Abstraction w/out guilt.  Separation of concerns.  Nice historical perspective.  Issues go back as far as 1975 - using C for writing an OS was a big issue back then.  So it's time for the modern revolution of the same thing.  JVMs for OS's. 

Various C support tools: coding styles; Purify; idioms; tools (Valgrind).  Also Systems-languages - BLISS, Modula-3, Cyclone, etc Or 2 languages: JNI+Java JikesRVM extends Java, also OVM, Moxie, DRLVM, C# (Bartok).  Usually in context of runtime research.  This talk: Jikes, Java-in-Java.

Containment:
- minimize exposure to low-level coding
- extensible; requires change fast, but language changes slowly
Encapsulate: contained low-level semantics.
Fine-grained; separation of concerns.

"Address" is Magic, like unsafe.
"Address foo; int x = foo.loadInt(offset)"
More types for "void*" in Java, or generic OOP pointers.

Intrinsics: contract between you & the compiler; at the Java level it has an empty implementation.  But the compiler will always insert the correct code for the e.g. prefetch intrinsic.

Also annotations to do e.g., turn off bounds-checks or null-checks (performance hints, but break safety).

Scoped semantic regions - scoped rule changes, e.g. "NoNew" - no allocation in this scope (HotSpot uses this technique extensively), or "NoBoundsChecks" in this scope, etc.  Unboxed is another annotation - no class header, no v-table,  basically a C struct.  Also can do direct field layout.

@RawStorage - basic static C memory layout.  Backed by native width raw storage, only accessed by native-width C intrinsics.

Can both compile "fast" - down to raw bits, and "virtual" - to some virtual implementations allowing full GC on full simulation.


Asymmetric Chip Multiprocessor

Idea: instead of e.g. 16 cores on a die, have 12 for normal work and 1 "fat" core for hot contended locks.

Cross grid of threads vs open-tables in MySQL. (looks like ref-counting of MySQL tables).  Blah - implementation looks alot of NmethodBuckets in HotSpot.  2-d grid.  Updates are complex & require complex locking.

Look at some locking traces.  If we could accelerate the critical sections then we get a Amdhal's Law approach - removes the serial portion.

Idea: switch to the one fast CPU on a contended lock.
Seeing speedups of 50% or so for 12-16 cpu system, where program was seeing e.g. 5x speedup and now gets e.g. 8x speedup.  One of the benefits: locks sections always executed by same CPU, so caches remain 'hot' for hot-lock code.

Cliff Notes: This is the first real use of Asymmetric Multiprocessors that I believe in.


Leak Pruning

See it & reported it before.  Survive OutOfMemory & leaks by reclaiming predicted dead objects.  Used a simple X86 read-barrier w/6% slowdown.  Hoping to see new slowdown numbers for the read-barrier: see 3% slowdown on Core2 & 5% on Pentium4.  Using 2 bits per reference; bit 0 is for stale-refs; bit 1 is for pruned refs.

So instead of throwing an OOM, look thru heap and see if find a dominate object.  Assume dominate object is dead & leaked; "Cut" link to object and replace with a poisoned pointer.  Touching this pointer
throws the original OOM.  But if never touch object, then never throw OOM.  Cut links allows dead object (and his transitive closure) to be reclaimed.

Try to predict roots of dead datastructures.  So profile max time between last use (using something like GC age bits, but exponential counter bits).  Also need to track which refs are keeping most bytes alive.  

HashEntry -> PreparedStatement -> ParserInfo -> byte[][] -> byte[]
                               -> ResultSet -> Field[]

Compute during Phase 1:
Sometimes the HashTable grows, and PreparedStatement gets touched.  So max-stale-then-used for HashTable-> PreparedStatement is 16-32 GC cycles.  But max-stale-then-used for  PreparedStatement-> ParserInfo is 0-1 GC cycles.

Compute during Phase 2:
Count bytes kept alive for each type of thing that has a low MSTU number.  See PPS->ParserInfo has 420M kept alive.  So kills all edges from old PPS->ParserInfo (but not young ones, because MSTU is kept per object?  I now think he keeps MSTU per-class, so no bits in object but can suffer from common types like Object[] being used both in leaks and in hot code).  Don't worry about DAG sharing - ends up blaming the leak on 1 ref when it's really kept alive by several refs; just means on 1st go-around you prune one edge - but leak remains alive.  On 2nd go-around all the leak is blamed on object type#2 and pruned then.

Results are decent - keeping some complex real apps alive essentially forever.  Performance is decent, but near OOM conditions things slow down as first come frequent GCs, then LeakPruning kicks in and cleans the heap back out.

Cliff Notes: Wondering if could do read-barrier check on low bits by e.g. loading from the loaded pointer with an X86 op which faults on odd addresses?  Means faulting on stale & poisoned refs; faulting on poisoned is OK (means throwing OOM anyways), but faulting on stale might be too frequent still.


TRIPS, McKinley et al

TRIPs is an alternative processor architecture, with a square 4x4 grid of functional units (GPUs); registers are along one "wall", D-cache is along another 2 sides, etc.  Inter-GPU traffic is explicitly scheduled, but with a data-flow-like flavor (static scheduling, dynamic timing).  I've reported on this project before.  This talk is essentially an end-of-project summary.  And the summary is: not much better than a Core2 (if the chip had been built with Core2 technology).

Each chip has 2 processors; 1M L2.  16GPUs, 128 registers, 32K L1 cache.  Processor is distributed.  4x4 grid of GPUs; registers along one side; D-cache along other sides. Window size of 1024 instructions.

Block atomic execution model; dataflow between blocks.  Blocks are single-entry, multi-exit.  1 non-speculative block; up to 7 speculative blocks.

Heavy use of predication to build larger blocks.  Direct instruction communication model.  Special instructions to handle large fan-out.  Large block of conditional/predicted stuff mapped onto the 4x4 grid of FPUs; using reads from registers and reads/writes to d-cache.

Larger blocks require more scheduling; harder to place.  What is the utilization of the large instruction window?  Both scheduling, plus e.g. cache-miss & branch mis-predict.  Static delay: scheduling issues; 1-clk to move data from FPU to FPU.

Benchmarks: 53 compiled benchmarks; 13 hand-optimized benchmarks; 2 hand-optimized and hand-scheduled.  Gather numbers of the actual chip, except when needs to use SIM numbers (because chip didn't gather the right data).

Generally, out of the possible 128 instructions in a block compiler is able to get about 60 instructions.  Hand-optimized code uses smaller blocks because better scalar optimizations reduces instruction count.  
About 1/4 of all instructions are predicated & never executed.
About 1/4 of all cycles are lost due to scheduling.
About 1/2 of all cycles do useful work.

Very large code-bloat: 55% bloat from nops, or direct reg read/write ops, 18% from predicated useless, 27% from instruction-replication optimizations (loop unrolling, inlining).  4 SPEC benchmarks have i-cache miss rates > 10%.  Direct memory accesses are much lower (no spill ops?)

Need variable-sized blocks; would reduce the code bloat from 11x to 6x (over canonical Alpha code).

L1 is partitioned.  Compiler can acheive 2 mem-ops /cycle on vector add; hand-optimized can hit the max of 4 mem-ops / cycle.  Again for matrix-multiple the compiler hits 1 flops/cycle; hand optimized hits 7 flops/cycle.  

Clock speed is 366Mhz; mem speed is 200Mhz; process tech is 130nm.  Compare to a Core2 but boost TRIPS "as if" if gets the 65nm process. Compiled versions are competitive to Core2.  Hand-optimized about 3x faster than Core2.

What works: dramatic reduction in data-cache & register file refs.  Compiler is able to work (to schedule to this weirdo thing).  But performance isn't very big over a Core2.  Challenges: dynamic code size; dataflow overhead instructions; I-cache efficiency; predication overhead; generating efficient code for control intensive codes.


Maximum Benefit from Minimum HTM

Running a (supposedly) simple HTM on a transaction-aware Linux kernel.  Their proposed HTM is somewhat stronger than Azuls', in that they support parallel independent transactions (the main one, and ones run inside interrupt handlers), a special version of CAS, and graceful overflow to software.  Also they holding pending transactions in a much more generous L1 than Azuls'.

Results look pretty anemic and fragile, which is on par with Azuls' results (4x speedups on 16-way boxes; 60% of exec cycles wasted on retries).  On the plus side, they at least attempt a reasonable benchmark set.


The Rock's HTM

A look at Sun's Rock's HTM.  It's a lot weaker than Azul's - max of 16 (or 32, depending on how many hardware threads you use) pending writes; aborts on function calls, TLB misses, mispredicted branches & some fp ops, - plus the same reasons Azul's HTM can fail: eviction from the L1 (due to transaction being too large, or conflict from another CPU).

Major discovery (just like Azul discovered!): feedback is crucial to decent heuristics.  Your failed HTMs need to report as much as possible about what went wrong so the heuristic stands a chance of getting it right.  This required a hardware spin (Rock2).  One useful heuristic is the standard backoff mechanism: if there's true contention, then backing off can spread the memory references far enough apart to allow the HTM to succeed (presumably succeed with less overhead than a simple spin-lock solution). 

They report hashtable numbers, but wildly concurrent hashtables are commonly available. 
They report using the Rock HTM for doing an N-CAS.  This is largely successful, and might simplify many libraries.


CTrigger - Atomicity Violation Bugs

Software testing is inadequate.  Poor interleaving during standard stress testing.

CTrigger - for any given program input, force interesting interleavings.  There is an exponential interleaving space for each input, but stress testing is using random interleavings.   Previous work only controlled the context-switch interleavings.  CTrigger focuses on interleavings that likely hide atomicity bugs and rarely happen during normal testing.

Focus on interleavings that likely hide bugs: e.g. 2 nearby refs to the same memory in one thread - and another write to same ref in another thread.  Force these accesses to interleave by injecting "sleeps".

Tried on 7 apps (server, client, scientific) on 8-core machine.  143 total Unserializable Interleavings (program interleavings deemed interesting from atomicity standpoint).  Stress testing only hits about 80 to 100 of the U.I., out of 143, even for many many runs.  Basically some (1/3rd) of the U.I.s are very unlikely to hit with stress testing - this demonstrates the key weakness of standard stress testing.

CTrigger looks at the access traces and attempts to force the interleaving.  Needs to trim out infeasible paths: locking forces many orderings to be impossible (i.e., the code is correct and no race happens).  Some refs are inside the same lock; some are Happens-Before on thread-create or thread-join.  Typically 90% of apparent bad interleavings are infeasible.

Focus on rare interleavings also (not-rare races happen easily in stress testing).  If the "local gap" is small - time between 2 refs in the same thread is short, then the race is narrow - and hard to hit in stress testing.  Also look at the distance between external ref distance in time to the gap.  Sort these potential races by likelihood.  Instrument the program to add an artificial delay at a few points to try & force these rare races to happen often.  

Then run program as normal.  Program speed is similar to unmodified program, because artificial delays are small and few in number.  Program runs fine on multi-core machines; using the multi-cores as before.  But now the rare races are common, and stress testing finds them quickly.

Typically find bugs now 100x faster than before!!!  Found bugs in all these apps.  Was able to repro the breakage 100% of the time, generally very quickly.


Anomaly-based Bug Detection & Prediction

Defect - the Bug
Infection - Defect is triggered ==> program state is corrupt
Failure - An observable error (Crash, hang, wrong result)

Infections spread over time, or get overwritten (hence bug is squashed in this case).

Software debugging: hypothesis bug; change code; attempt to repro bug, etc.

- detecting anomalies; out-of-bounds addresses, unusual control paths, page faults or redundant  computations.  So profile program to teach the detector what the "good" behavior is.  Some values are in fixed ranges during the teaching runs, or some mem locations only accessed from a few ld/st ops; or abnormal number of loop iterations.  Current crop of anomaly detectors have a fairly high false-positive rate.

- Now isolate relevant anomalies: compute dynamic forward slices from anomalies to crash point.  Ouch - this step looks really expensive.  Rerun the broken program, propagating tokens from anomaly site to crash site.  Use binary search on the anomaly set.  Can reduce the count of false positive anomalies by a factor of 10.

- Now "validate" that have found buggy line of code: NOP that line out and see that you can execute further.  

Still don't know what the fix is, but can point out the step that e.g. corrupted memory.

Appears to scale up to e.g. gcc (330KLOC) found a bug in distributive laws causing a loop in RTL tree optimization (C2 used to suffer these routinely).


Isolator

Isolation in concurrent programs: detect isolation guarantees; i.e. access in critical regions from non-locking threads (or incorrect locking).  Bad threads can break well-behaved threads, (the usual motivation talk for debugging data races, I wish people could skip this or at least limit it to 1 slide & 1 minute).  Also STMs only have weak atomicity, so bugs still happen for well-behaved STM threads.  Cliff Notes: This is the key: some other buggy thread breaks your well-behaved concurrent thread; studying the break point is useless.

Isolator: Program executes as-if no isolation violation.  Uses a combo of data-replication, virtual memory & code instrumentation.  Does not require the whole program.  Requires "guarded by" annotations.

Cliff Notes: just having thread-level TLB protection would let you do some COW-style (http://en.wikipedia.org/wiki/Copy-on-write) protections.  Just write-protect all pages touched during a locked region (except allow writes from self).  All other threads that attempt to write without holding the lock will get a TLB fault and can be stalled (and reported as racing).

Stalling can cause deadlocks, so need dynamic deadlock detector.  (Youch!)
Either allow racing access OR throw a DeadlockError.

This above option is done by Isolator, except requires annotations rather than lazily discovering touched variables.  

Issues: expensive to change protection levels on every lock.  Fix: "lazy unprotect" - only remove protections on-demand, as other threads need them.  Could "shatter" big pages, or split unrelated objects to private pages (via GC). 
Cliff Notes: similar to HotSpot's Biased Locking, but this wants per-thread page-protections; really want user-mode control over per-thread page protections.

Issues: page-size granularity - has fragmentation overhead for fine-grained locks.


Kendo

Only works for data-race free programs.  Very efficient deterministic interleaving.  Progress is made with "Deterministic logical clock" time.

Work on SPLASH benchmarks.  See 18% to 27% overhead, but deterministic always.  Requires no special hardware. 

Each thread maintains a "logical clock" similar to Azul's COUNT register or TSC on X86.  Could be e.g., a "bytecodes executed" counter.  Globally, the thread with the lowest "logical clock" has the right to do a e.g. lock acquire (non-determinstic action).  Assumes all inter-thread communication happens nicely inside locks.

Enforces lock-acquire ordering, except for nested locks (have to prevent deadlock, since basically adding another locking layer).

Cliff Notes: Mostly interesting because this could work on stock hardware (but it is limited to race-free programs which are precisely those that you'd like this to work with).


Deterministic Replay
"Time Travel" for concurrent debugging

Phase 1 - record non-deterministic events
Phase 2 - replay

Software versions have been around for ~15 years.  Now commercially available.  Also available are libraries, compilers & OS's to help.  SW stack flaw: limited to speed of 1 cpu

How do you integrate HW & SW stacks?  i.e., If HW has captured a log, where does it go?  Do not want to capture or replay the SW stack that is doing the SW recording.

Also need the HW to be running across the whole machine, so that all processes are using it because interacting processes need to all be deterministic (see butterfly effect above).

So some abstraction: not all elements of the system are being recorded and can be replayed.  May need to simo record multiple user-mode processes in the same recording "sphere".  Only user-mode threads are being recorded (OS support needed to virtualize HW, but also OS needs SW record/replay that is process specific).  Really he's making a weak stab at recording everything to avoid the butterfly effect.

Now record 2 different kinds of info:
HW info: memory interleave log
SW info: system stuff log; scheduling; context switch; I/O; signals

Replay also has to handle, e.g. malloc has to be deterministic.
Cliff Notes: but not GC, although HashCode has to be deterministic.

Key Issues:
1- Deterministic injection of data into controlled "sphere" (I/O)
2- Using fewer processors during replay than were used during recording
3- Emulating vs re-execute system call (e.g. fork requires a real system call) - but this makes a lot of OS state changes.

Key Issue #1 - Capturing ALL I/O
- Random OS calls write into user-space (i.e. setting return values in memory).  But during replay need e.g. the SAME "fd" as before.  But OS under no obligation to hand out the same "fd" at an open().  Or OS writes data into a buffer - BUT another user-thread is ALSO writing into the same buffer.  

Key Issue #2 - Not enough CPUs during replay
- I think he's expecting HW to do the memory interleaving on replay!!  But basically, it's time for thread#1 to run - but the OS wants to schedule thread#2 - just need to block #2 and allow #1 to  run.  Very slow (constant OS thread scheduling overhead).

Log size - in bits per kilo-instructions.  
Small: 3.15 bits/kilo-instruction.
Recording performance overhead: 20-50%.
Replay performance: using replay HW!?!?!  About 2x slowdown during replay.

How to handle large datasets?
 

DMP - Deterministic Shared Multiprocessing

Hard to debug, etc, etc...
So just make the whole thing deterministic.  No "Heisenbugs"; can use time-travel debugging.  Can test like it's a serial program.  No attempt at record/replay really - just force same ordering every time.  Cliff Notes: Suffers butterfly effect.

Instructions communicate via shared memory.  Only care about "communicating" instructions.  Can serialize - but that blows performance.

So really want to figure out which instructions really communicate. Requires HW support.  

Conceptually have a "deterministic token" which is handed out to processors and then is passed around all the CPUs in round-robin format.  This token totally orders all instructions.  Each "quantum" is roughly 1000 instructions, but varies.  Now filter out non-communicating instructions.  Allow the non-comm instructions to run in parallel.  Leverage cache coherency to track when cache lines move.  Build a "sharing table" to help order memory accesses.

Now add some TM support to help get more parallelism.
Now come some quantum-building optimizations: break at sync/lock boundaries.
Communication is often bursty: so break quantum after a series of stores to shared memory.


Memory Buddies

Sharing memory between VMware VMs(?)  Hypervisor finds identical pages and shares them under the hood.  (Probably good for e.g. shared libraries). See 1/3 reduction in memory usage in VMware ESX server. Get sharing only if you run similar apps on the same hardware.  Placement is hard in datacenter; lots of clients & software stacks; things change over time.  Manual placement is tough.

Goal: estimate sharing ability across many physical machines and many virtual machines across the datacenter.

Create 32bit hash per 4K page.  1MB hashs per 1GB of ram.  Forward to other hosts.  Comparing lists is slow.  So use a Bloom filter.  With hash lists, just sort & compare.  With Bloom filters you use a dot-product of bits; very accurate in estimating the number of elements shared between Bloom filters.

New VM launch - run on staging host.  Figure out it's memory footprint.  Compute it's Bloom filter per page.  Compare it's filter against other Filters on all other hosts in data center.  Find closest fit (on machine with enuf memory & cpu) and move the new VM there.


Live VM Migration

How to migrate VMs cheaper (force their free memory pools to shrink before migration; similar to having a JVM do a GC & heap-shrink before migration, then inflate back to normal size in the new host).

Contrast to the usual approach:
- 1st iteration copy all
- later iterations copy any dirty pages
- after dirty workset is "small enuf", stop VM, copy active dirty pages,
- restart on new host.
Goal: reduce downtime.
Con: Write-intensive apps have long migration times (because of constant page copying) and high network overhead.

Alternative: copy minimal CPU state & pages, start running on target.  VM demand-pages from source.  Also actively push pages from source to target.  Good: copy pages only once; lower total migration time & network overhead.  Works well for both read-intensive & write-intensive.  Con: resumed app suffers while pages demand-paged in.

Can make some predictions on future needed pages from apps page-faults - likely will want other pages nearby in virtual-memory-addresses soon.  Examples are walking an array or working a stack.

Make a psudeo-swap device in the guest OS for demand-pull pages (basically swapped over the network).  Allows the VM job to keep running in other threads or processes - only 1 thread is blocked on a page fault.  (Actual implementation has lots of warts due to nature of existing Xen infrastructure).  e.g. we have a much more direcy way to detect memory-full conditions and politely share memory between JVMS.

Results: feasible, but clearly more downtime over pre-copy in the common case.  But works well w/write-intensive VMs.


Validate

Patches are dangerous.  Delaying change is a safer bet even for security bugs: more likely for downtime from patches than your system gets hacked while you wait.  


Raytracing vs GPUs

Raytracing is more expensive per-ray, but less expensive per-object.  Can be parallelized.  Makes really great shadows & reflections.  Want to improve predictability in ray-tracing to get "good enough" frame rates.

Trying to use SIMD.  Primary rays are parallel, easy to SIMD.
Seconday rays are incoherent, not good for SIMD.

Cliff Notes: no surprise but modern GPUs are not well suited to raytracing work


Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture

Sources of Commutativity - e.g. sum, malloc

Can detect it?  Check memory state before & after running fcn X.

But misses e.g. set's (via linked-list) - linked-list order is different when swapping 'put(A)' and 'put(B)' calls, but semantically the same: both elements are in the set, but the set's implementation stores them differently.

How to test for commutativity? Symbolically execute 'put' calls in different order. See if memory contents are same: "put(A);put(B)" vs "put(B);put(A)".  If yes, then Yes - but suppose not.

Now check reader fcns: try is_member() (on what values???) 
Also check writers like 'remove'.

Problem: exponential search space for equivalence.
Use Random Interpretation-
- choose random input values
- interpret till conditional branch
- follow which way it goes; tests program
- Then back up to cond branch
- Adjust memory & values such that test fails
- Now follow thru on other branch; test program

Now combine memory states for each branch of program.  Using an 'affine join' to combine memory states???  Somehow correctness bug becomes an exponentially low; very very unlikely the Random Interpretation ever computes some wrong answer (declares as commutative some function which is not really commutative).

Looking at MediaBench & SpecInt on an infinite issue machine.  See 20% to 30% of all functions are commutative (with themselves?).  Only see speedup when commutative fcns are called near each other, but then can see parallel speedup (log tree combining?).  Average is 50% speedup.  Apparently assuming an extremely wide machine, as opposed to threads (ignoring communication costs & thread start/stop).

Cliff Notes: Random Interpretation looks interesting, not sure about the rest of this paper.  In fact, this paper looks like a small delta (and very difficult to use correctly) over the R.I> work.   Instead see 2003 paper: "Discovering Affine Equalities Using Random Interpretation" http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf



Predicting Yield of GCs

Amount of memory reclaimed by GC *can* be low.  But looking at minimal GC footprints.  When GC is not productive, then time to extend the heap.  Estimate GC yield by using Virtual Memory to track pages which are touched and which are not touched - not-touched pages are probably full of mostly dead stuff.

Cliff Notes: Yawn, Olde Hat Trick.
Cliff Notes: got bored; symptom of Day 3 at a conference; worked on slides


Phantom BTB

Cliff Notes: Winner of the most ridiculous proposed hardware, backed up a big speedup on the most-unrealistic simulated hardware.

Virtual BTB Design - as in Virtual Memory not as in VMWare.
BTBs today are {small,fast,inaccurate}.  Want {big,fast,accurate}.
Allocate space for the BTB in the L2(!?!?!). 
But latency to the L2 is large, so much prefetch the BTB data.

Getting tidy 10%+ speedups on unrealistic hardware.

BTB in L2 cache is large+slow.  Missing BTB entries are evicted to L2. Then prefetch from L2 based on previous missing branches (assuming that an upcoming BTB miss can be predicted from a just-prior missing branch).  Attempted lots of other predictors; most failed miserably.

Pay-as-you go model; BTB gets effectively larger as more pressure happens on the on-chip BTB.

Cliff Notes: HotSpot inline-caches wipe out the need for faster jump-register support in hardware; standard branch predictors work well with HotSpot.  Total clocks spent doing jump-register instructions are less than 1% of all clocks on realistic hardware.





import java.lang.*;

class TextScan {

static String test1 = "scan_for_some_token(like_this_open_paren";
static String junk0 = "scan_for_some_token_like_this_open_paren";
static String junk1 = junk0.concat(junk0);
static String junk2 = junk1.concat(junk1);
static String test2 = junk2.concat(test1);

// A TextScan is a lite object that's really 8 longs representing
// the bits of 64 bytes split 90-degrees from verticle.
long b0,b1,b2,b3,b4,b5,b6,b7;

public static void main( String args[] ) {
TextScan[] bits = bit_split(test2);

System.out.print("bits="+bits.length+"@(");
for( int i = 0; i<bits.length; i++ )
System.out.print(bits[i]);
System.out.println(")");

// Scan for byte
System.out.println("Scanning for byte "+(byte)'(' + (char)0x28);
System.out.print("[");
for( int i=0; i<bits.length; i++ ) {
// bit pattern 0x28 - 0 0 1 0 1 0 0 0
TextScan ts = bits[i];
// The open-paren char, '(', hex 0x28:
// 0 0 1 0 1 0 0 0
long res = (~ts.b7) & (~ts.b6) & ts.b5 & (~ts.b4) & ts.b3 & (~ts.b2) & (~ts.b1) & (~ts.b0);
// A non-zero res means a '(' is found, and the bit-number is
// the index into the 64-byte array where it was found.
System.out.print(Long.toHexString(res)+",");
}
System.out.println("]");
}

// Split a String into an array of 64-bit longs.
// Parallel-split bits: bit 0 from 64 bytes go in 'b0',
// bit 7 from 64 bytes go in 'b7', etc.
public static TextScan[] bit_split( String s ) {
int slen = s.length();
TextScan[] bits = new TextScan[(slen+63)>>6];
for( int i=0 ; i<bits.length; i++ ) {
TextScan ts = bits[i] = new TextScan();
// Suck 64 bytes into bytearray from String.
for( int j=0; j<64; j++ ) {
int idx = (i<<6) + j;
byte b = idx<slen ? (byte)s.charAt(idx) : 0;
if( (b&(1<<0)) != 0 ) ts.b0 |= (1L<<j);
if( (b&(1<<1)) != 0 ) ts.b1 |= (1L<<j);
if( (b&(1<<2)) != 0 ) ts.b2 |= (1L<<j);
if( (b&(1<<3)) != 0 ) ts.b3 |= (1L<<j);
if( (b&(1<<4)) != 0 ) ts.b4 |= (1L<<j);
if( (b&(1<<5)) != 0 ) ts.b5 |= (1L<<j);
if( (b&(1<<6)) != 0 ) ts.b6 |= (1L<<j);
if( (b&(1<<7)) != 0 ) ts.b7 |= (1L<<j);
}
}
return bits;
}

// --- toString
// Reverse bit-swapping to a String.
public String toString() {
String s = "[";
for( int i=0; i<64; i++ ) {
byte b = 0;
b |= ((b0>>i)&1) << 0;
b |= ((b1>>i)&1) << 1;
b |= ((b2>>i)&1) << 2;
b |= ((b3>>i)&1) << 3;
b |= ((b4>>i)&1) << 4;
b |= ((b5>>i)&1) << 5;
b |= ((b6>>i)&1) << 6;
b |= ((b7>>i)&1) << 7;
if( b == 0 ) break;
s += (char)b;
}
s += "]";

return s;
}

}



Category: Web/Tech | Permalink | Comments (4) | TrackBack (0)