Cliff Click Jr.’s Blog

IFIP WG2.4 Trip Report
May 18, 2009

2009 IFIP

IFIP WG2.4 remains one of my favorite mini-conferences to attend.  The group is eclectic & varied; the discussion vigorous. Everybody is willing to interrupt the speaker & ask questions.  The talks are usually lively and pointed, and this time was no exception.

The conference was held in Fort Worden, a historic Big-Gun (planes rapidly obsoleted the guns) fort of the Pacific Northwest, defending Admiralty Inlet and our shores from the marauding Russians ... or Japanese... or somebody, surely.  Nobody ever attacked that way, so clearly the fort was a big success.

Meanwhile, the fort has been turned into a park and it's beautiful.  The whitewashed buildings with green trim remind me of some mythical 50's era America that never was, the kind that shows up in the backdrop of various movies or maybe "Leave It To Beaver".  We stayed in the officers quarters and held our meeting in the old mess hall.  The officers lived quite nicely; the quarters are actually old duplexes in great shape, with high ceilings and crown molding and carved brass fittings everywhere - the house is clearly larger than my house and with a vast green lawn ta-boot.  In San Jose such a place would be upwards of $2m...

Fort Worden overlooks Admiralty Inlet from some high bluffs (classic fort-like location) and the surroundings are all gorgeous.  The weather varied from overcast & chilly to clear & crisp (alas, overcast and chilly was the mode on our all-day "whale" watching expedition.  We did lots of watching but saw no
whales of any kind.  And we bounced over heavy seas and got sprayed with frigid water most of the day.)

Since the Hood Canal bridge was out I got treated to a 3 hr drive the long way around the peninsula, but somebody else drove and the scenery was worth looking at.  The airplane portion of the trip was easy.


As usual, my reports are (very) brief, stream-of-consciousness writing.  I often skip around, or brutally summarize (or brutally criticize - but this group is above the usual conference par, so less of that is needed).  I skip talks; I sleep through ones I'm not interested in, etc, etc.  Caveat Emptor.

Synchronization Talks
Atomic Sets - another nice concurrency alternative
Fork/Join for .Net
occam / CSP - Uber lite-weight user-thread scheduling

Java Analysis Talks
SnuggleBug - Better bug reporting for Java
SATIrE - A complete deep analysis toolkit for Java
PointsTo - A comparison of various points-to algorithms for Java
analyzing JavaScript - Has a type-lattice as complex as C2's internal lattice
Performance variations in JVMs - Noticed the wild variations in run-to-run performance of JVMs

Misc
Cheap Cores discussion
Compiler-directed FPGA variations
Clone Wars - huge study on cut-n-paste code in large projects
Business Process Modeling

Security
PDF Attack - "Adobe Reader is the new Internet Explorer"
The Britney Spears Problem
Squeezing your Trusted Base
Who you gonna call?

Evolving Systems
Swarms in Space
Finding emergent behavior




Frank Tip, Type-Based Data-Centric Synchronization

Locks are hard.  Auto-inference of correct sync in O-O programs.
Even with no data-races still have atomicity races.

Instead of code-centric locking paradigms, do data-centric locking.
Tied to classes with fields, leverage O-O structures

Group a subset of fields in a class into a *atomic_set*.
Then find units_of_work for that atomic_set.

Add language features to Java:
  atomicset account;
  atomic(account) int checking;
  ...
  atomicset logging;
  atomic(logging) Log log;
  atomic(logging) int logCount;

So add 'atomicset' keyword, 'atomic' keyword and strongly type variables with atomic_sets.  Units_of_work expand to reach the syntactic borders anytime an atomic variable is mentioned.  Can expand unit_of_work by hand also.

Can at runtime cause atomic_sets to alias - to union together.  Used on ownership types (e.g. HashMap uses HashMap$Entry, so units_of_work on HashMap or HashMap$Entry become units_of_work for both).

Atomic_sets can be unioned; e.g. for LinkList the entire backbone of the list is one atomicset.  Lots of discussion - points out the issues with hacking all the Code Out There.

Has proven strong serializable properties on atomicsets.  Proven various kinds of soundness, including *internal* objects are only accessed by their owner, etc.

All concurrency annotations appear in the libraries but not in the client code - shows a nice concurrent example with 2 threads manipulating the same linked list.   

Has a working version using Eclipse & taking annotations as Java comments.  Using j.u.c.Locks; can do lock aliasing by simply sharing j.u.c Locks.  Applied this to a bunch of Java Collections; 63 types & 10KLOC of code.  Needs about 1 line of annotation per 21 lines of code for the libraries (and none in
clients)

About 15-30% slower than using plain sync wrappers on a single-threaded testcase; probably because the "synchronization" keyword is very optimized.  Performance is similar or slightly faster when running with 10 threads & high contention.


Daan Leijen - The Task Parallel Library for .Net

- A concurrent parallel library for .Net (think dl's Fork/Join).

Infrastructure: locks/threads/STM, async agents (responsiveness), task parallelism (performance)

Gives a demo of a parallel matrix-multiply w/C#.  Very ugly code.

So instead, write a nice combinator...  

Standard task-stealing work-queue approach.  (private set of tasks that don't need CAS, public set of steal-able tasks that require CAS).

Points out that want fine-grained parallelism, so can do work-stealing and do auto-load-balancing.  

-- Effectful state.  Statically typed, with typed side-effects.  Type signature includes all side-effects (yes!)

  int fib(int n) { return n<=1 ? 1 : fib(n-1)+fib(n-2); }

But this program

  int fib(int n) { return n<=1 ? print"hi",1 : fib(n-1)+fib(n-2); }

Does not type in Haskell - need the IO monad.  And now the function returns 2 results: the IO "hi" and the int.

So want some syntax to allow "auto-lifting" into a side-effect world - same function syntax, but auto-includes the IO monad as a result.

Then can prove that some stateful computation does not leave the function - that the state is entirely encapsulated & unobservable outside.  So can do type'ing which allows some state-ful work internally, but it doesn't escape the function "box".

  int fib(int n) {
    ref f1 = 1; // side effects, but entirely internal to the function
    ref f2 = 1;
    while( n ) { sum = *f1 + *f2; *f1 = *f2; *f2 = sum; n--; }
    return sum;
  }


Peter Welch, Multicore Scheduling for Lightweight Communicating Processes

Google: kroc install

Carl Ritson did all the work...

Cliff Notes: Insanely cheap process scheduling, plus "free" load balancing, plus "free" communicating-process affinity.  Very similar to work-stealing in GC marking, but with "threads".  OS guys take note!

Scheduler for OCCAM/PI.  Process-oriented computation (millions), very fine grained processors.  Message passing over channels, plus barriers.  Uses CSP and pi-calculus for reasoning.  Dynamic creation & deletion of processes & channels.

Goal: automatic exploitation of shared-memory multis.  "Nothing" for the programmer to do: the program exposes tons of parallelism.  "Nothing" for the compiler to do - it's all runtime scheduling.  

Goals: scalable performance from 10's of processes to millions; co-locate communicating processes - maximize cache, minimize atomics (lock-free & wait-free where possible).  Heuristics to balance spreading processes around on spare CPUs and co-locating the communicating ones.  

Usual sales job for CSP... which still looks good.  

A blocked process requires only *8* words of state (no stack for you!). Scheduling is cooperative not preemptive.  No complex processor state, no register dumps.  Up to 100M process contexts per second.  Typical context switch is 10ns to 100ns.  Performance heavily depends on getting good cache behavior.  So processes are batched.  Stacks are spaghetti-stack (no that's why no stack saving...).  These 8 words include the process run-queues.

Batching: sets of e.g. 48 processes are run on a single core; that core is the only core touching this queue & round-robins them until the entire batch times out.  So no cache-misses.  A Batch is a Process, plus a run-queue.

A Core is a Batch, plus a queue of Batches.  Again, no other cores touch this list either.  

A Core with no Batches does work-stealing.  There's some proposed batches available for work-stealing from each Core.  Cores need some atomic ops to manipulate the queue of Batches for these steal-able Batches.  Careful design to ensure all the interesting structures fit in a single cache line.

So until you need to enqueue, dequeue or steal a Batch - it's all done without any contention for process scheduling.  These Batch-switching operations require atomics but are otherwise fairly rare.  

Key bit missing: scaling up beyond 8 or 16 cores with some kind of log structure.  But otherwise looks really good for very low cost context switch & work steal.  Lots of questions about fairness.

Batches are formed & split by the runtime (no compiler or programmer).  Aim is to keep processes that communicating on the same core (so it's all cache-hit communication), but spread the work across cores.  Initial batches are kept small to encourage spreading.  Channel ops require atomics, but typically only 1 (sometimes 2).  Choosing amongst many channels might require up to 4 atomic ops.  Each time communication happens on a channel, the sleeping process is pulled into the run queue Batch of the awake process - thus naturally grouping communicating processes.  But as batches get large, need to split.  

Periodically can observe that if all the processes in a Batch are all communicating with each other in some complex cycle - then the spare run queue empties from time-to-time (with one runnable process).  So as long as the run queue empties now & then, then the Batch should stay together.  But if the run queue never empties, then split-the-Batch.  Split off a single process into a new Batch - which will drag it's connected component together into the new Batch.  But eventually if the original Batch had 2 complete strong graphs,
they will now be split - and can be stolen onto other CPUs.  

So this kernel is about 100x less overhead than the same design done on a JVM.
:-)


Satish Chandra, Symbolic Analysis for Java

"Snugglebug"

Communication between the programmer & the tool tends to stop early: "Dear Programmer: you have a null-ptr bug here"...

Going to the next step: "Dear Mr Programmer: try running your method with *these* values and you'll see the bug...".  Programmer:  now that I think about it, these values will never arise.

We just forced the programmer to define the legitmite range of values.

More: Bug Report is more believable if given a concrete input is given. 
API hardening: gives the preconditions under which this library will throw an exception
Unit-test generation: inputs needed to make a specific assert fail.

Works by weakest-precondition.  Given an assert in the program, work backwards computing the weakest-precondition at each point until you reach the top of the method.  If this chain works, then hand the WP to a decision procedure; sometime the decision is "I don't know" but many times you get an explicit counter example.

Good match for the counter-example problem.
- Yes loops are a problem, but don't really need the *weakest* precondition.

Most Java features can modeled:
- heap r/w, arrays, allocation, subtyping, virtual dispatch
- exceptions: many bugs are here so we modeled exception handling try/catch
  very closely.

Have a path-explosion problem.  For heap updates, even a single path includes branch effectively depending on aliasing.

Has to do strong interprocedural analysis.  Lazily grow the call-graph consistent with known symbolic constraints.  First look for a conflict on a call-free path, then on call-paths but without needing to look into the call.  Finally, if I must peek into the call - but use the symbolic info the reduce the set of possible call targets.

Generally, no fixed-point.  Cannot expect programs to give loop invariants.
Work-list management: avoid various common pathologies.
Goal is to find *a* contradiction, not to explore all options.


SATIrE

http://www.complang.tuwien.ac.at/satire/

Shape analysis, points-to, flow-sensitive, context-sensitive, annotations, can read them in & out.  Applications to e.g., slicing.

Read C/C++ - large complex tool chain.  Reads in program x-forms in either a functional programming language or in Prolog.  Can combine the tools in any order at this point (very well integrated).  Output is C/C++ w/annotations at each step (or internal "ROSE" ASTrees, etc).  

Has interval analysis for integers, shape analysis, loop (bounds) analysis, points-to, etc.  Gives example of reaching-def's in the functional programming language.  It's a domain-specific language for writing compiler analyzes.  Can specify the strength of the analyzes in many cases: e.g. the depth of the allowed call-chain for context-senstive (set to zero for context insensitive).

Shape analysis gets may/must alias pairs, etc - and all results put back into the C program as comments/annotations.

Basically describes a large complex tool chain for doing pointer analysis on C & C++ programs.  The tool chain looks very complete & robust.


Welf Lowe - Comparing, Combining & Improving Points-to Analysis

...skips explaining what points-to is.

Clients: compilers, JIT compilers, refactoring tools, software comprehension.
Different clients have different needs: speed, precision, security.
Granularity & conversatism also matter.  Clients might want, e.g., dead code elimination, or removing polymorphic calls, or rename methods & calls.  Other people want object granularity - escape info & side-effects (i.e., compilers and JITs).  Also static-vs-dynamic analysis.  Dynamic analyzes typically run the program and produce optimistic results.

Doing careful experiments of 2 analysis w.r.t. accuracy.  Hard part: can't get a perfect "gold standard" for real programs.  Hard even for a human to inspect a proposed "gold standard" and declare it "gold" or not.  Some special cases are easier: conservative analysis (no false positives) & optimistic analysis
(no false negatives).  Can under- and over-estimate the gold standard, but this still messes with the results (messes with detecting which of the 2 analysis is better w.r.t. accuracy).

E.g., ask the question: is it worth increasing k>1 (i.e., becoming context sensitive vs staying insensitive).  Checking size of computed call-graph - smaller graph is more precise.  Very small increase in accuracy.  Sometimes made things worse - because the original analysis was optimistic sometimes.

Open question: what is the use of an Uber Points-To - it's definitely beyond the point of diminishing returns for JIT compilers.


Anders Moller - Type Analysis for JavaScript

Not much tool support for catching errors during programming.
No nice IDE support.  Many bugs are browser specific, but we focus on language bugs.

Most JS programs are quite small, so throw a heavy-weight analysis at it.

Object based, dynamic typed, proto-type inheritance, 1st-class functions, coercions, dynamic code construction, exceptions, plus 161 built-in functions.  Types include null & undefined, primitive types, some properties "ReadOnly" and "DontDelete".

Broken things: tracking down an "undefined" and where it came from, reading an absent variable or absent property, etc.

So do a dataflow analysis over the program, including a lattice, transfer functions, etc.  Handle interprocedural stuff.  Tightly interleaved control flow & data - so need a complex lattice.  

Lattice is roughly as complex as C2's lattice, with more stuff on the allocation sites and less on the class hierarchy.  Also then model all the abstract machine state (C2 does this as well).  Full context info (total inlining at analysis time except for recursion).


Cliff Notes: Nice idea: for each allocation site, model both the *most recent object* from that allocation site, and also model *all the rest* from that site.  You can do strong-update on the singleton most-recent object, but weak update on all the rest.


Rhodes Brown - Measuring the Performance of Adaptive Virtual Machines

Noticing the badness of the "compile plan" - 1 run in 10 had a >10% performance reduction.

JIKES - always compiled, no interpreter.  On demand at 1st invoke.  Focused recompilation, lots of GC variants.

Points out the places where you get variations - stack-profiles for estimating the call-graph, basic-block counters, etc.

Causes of variation?
So far: GC & inlining
Concerned that i-cache placement is a major issue of instability.

*No* correlation between total amount of compile-time spent vs score.  You must compile the hot stuff, but apparently tons of not-so-hot code is getting compiled and it doesn't help the score any.


Kurt William - lots of cores for cheap

General discussion round, not really a "talk".

Expect >32 core/die in 5 yr, >128 core/die in 10 yrs.
What's "New" and what can we expect in 5 years?

Theory: contention is that there's nothing new here since the 70's & 80's (concerning concurrency).  (general agreement on incremental progress but not breakthrough).

Systems: Lots of libraries (TBB, pthreads), lots of platforms (webservers), all OS's support multi-core.  JVMs are New in this space.  Might see OS-support for user-level thread scheduling.

Languages: Support been there, but it's locks & threads - no good.  Support for TM is coming, but will it help?  Then there's CSP...

Applications: Not much new here.... more games, more interactive more user-interface.  Old stuff: lots of HPC, speech, image.  Matching: easy parallelism; lots of interesting apps here.


Uwe Kastens, Compiler-Driven Dynamic Reconfiguration of Arch Variants

What is "reconfigurable"?  "Usual" approach - HW guys compile to a netlist, do some mapping, place & route, make a FGPA/silicon, while the SW guys write some assembler on the FPGA - then at the end they try to run the SW on the HW.  Goal is to run some complex function faster on a FPGA than in normal ops.

Our approach: read source code, look at program structure, find hot code & hot loops; compiler knows how to switch the hardware between variants; then it compiles the code for a particular FPGA variant for each loop, and reloads the FPGA with a new function between loops.  But need to limit the FPGA to certain variants so don't actually need to reload the whole FPGA.  

Fixed small set of architecture variants.  Keeps cost to reconfigure the FPGA very low.  Compiler can choose the variants.  Example: FPGA runs either a small CPU or a small SIMD CPU or a MIMD CPU.  Or reconfigure the pipeline structure, or the default register bank access patterns.  In the different configurations can use more or less power for problems that are more or less parallel.  e.g., in SIMD mode have all ALUs active but turn all but 1 decoder.

Pick variants & design space to those known to be efficiently solved by a compiler - gets a fast compile time & good results there (use a human elsewhere).

Can map any arch register to any physical register - so can e.g. have CPUs share registers, or do unbalanced register allocation - if some CPU needs more registers in it's portion of code than others.  Does some register allocation affinity games, to avoid having to reconfigure register layouts between blocks of code.  

Nicely flexible overall design.  Instructions name arch registers & ops, but also can re-map real registers & ops periodically.  So use targeted registers & ops for a particular chunk of code, then reconfigure when the next chunk of code is different enough from the last chunk- and the cost to reconfigure is lower than running with a less-well-targeted registers & instructions.

Ugh, can't read data charts - has some 4 embedded programs.  Seeing 30% reduction in execution time AND power cost, by reconfiguring.  Code size is a little larger, because either poor parallization(?) & the reconfiguration overhead.


Jim - Clone Detection

Exact clone'd code, white-space only changes, near-miss clones.
Bug in 1 deserves inspection in all the rest.

Lots of cloning studies out there -esecpially in-depth of the linux file system.  But no wide-range-of-systems studies.  We did a bunch of open-source systems, C, Java, C#.  OS, Web apps, DBs, etc.  Used their own clone-detection, as a combo of AST-detection and text-based detection.

Text-based comparisons are sensitive to formating changes.  But AST parsing is expensive (requires tree compares) and does not find "near miss" clones.  Ended up doing "standard pretty-printing" - which uses the AST to normalize the text.  Then do text-based comparisons.  Also handle small diffs in the middle of clones.

Did 24 systems, 10 C, 7 Java, 7 C#.  Linux 2.6 kernel, all of j2sdk-swing, db4.  Varied from 4KLOC to 6MLOC.

Java methods have more cloning than C methods - 15%-25% (depending all the diff threshold) of Java methods are clones; C varies from 3% to 13%.  C# is more like Java, until the methods allow more diffs - and then there is a lot more cloning.  Swing is hugely clone-ish, >70%.

C systems - Linux has the same cloning behavior as postgres & bison & wget & httpd. Java has no more clones as you allow more diffs as compared to C - I suspect good Java refactoring tools detect & remove clones earlier.

C# shows a LOT more clones as more diff's are allowed.  Maybe cut-n-paste is a coding style?  C#'s clones are larger methods as well.

Something like 10-20% of all files in Java have 50% of functions are clones.  In C# it's more like 20-35% of all files.

Files with 100% cloned code: 15% of C# files are totally cloned.  For Java whole file cloning is fairly low.

Localization of cloned pairs: are they in the same file?  same directory?  different directories?


Thomas Gschwind - Business Process Modeling Accelerators

Business Process Modeling, patterns, x-forms, refactorings
Process Structure Tree -

 - Process is a large unstructured graph.  Visio-style graphical is error prone for large process models.

But can take the structure graph and build a tree from it.
(The graph is all fork/join parallelism).
A small part of the tree requires state-space exploration.

After the tree is correct, can apply patterns & Xforms to the tree.

Essentially Petri Net edit'ing on a structured graph/tree using Eclipse.

100's of users inside of IBM; generated business models are much higher quality, make them 2x faster.


Philip Levy, The New Age of Software Attacks

Why?
  • Not enuf to do?
  • Crime is Big Business
  • Usable exploit: can be sold for $75K to $100K
  • So called "security researchers".  Some are respectable, most are going after the $100K payoff.

"Visual Virus Studio" - actually exists.  PDF is a common vector for *targeted* attacks.  "Adobe Reader is the new Internet Explorer".

Looked at all the fixed vulnerabilities in Adobe Reader
  • Need to do a better job teaching students on how to write secure software
  • Stop writing in assembly language
    • ANY unmanaged code, include C/C++, should be banned
  • Languages need to be upgraded to be more secure
    • New constructs & type systems
  • More sophisticated testing tools & methodologies
  • Better analysis tools for legacy systems

How big is Adobe Reader?
  • 100K FILES
  • 42MLOC, no comments or blank lines: 25MLOC
  • Estimated 75 man years of development per year over it's lifetime

Problems are clustered - e.g. one module has had 25% of all security problems.
Top 6 modules have 80% of security problems.

Break down problems another way:
  • array index OOB - 32%
  • unknown - 27%
  • intege overflow - 14%
  • range check missing - 8%
  • capability check missing - 4%
  • out of memory 3%

How found:
  • Fuzzing - 42%  (creating illegal inputs and throwing them at the program)
  • Code Review - 35%
  • External report - 12%

Some examples:
  • asserting against null, but no asserts in debug build and null possible in release build.
  • Failed to check for error returns, sometimes even not catching return value
  • Casting "attacker controlled values" (values from the PDF file) to valid enum
  • Or using attacker-controlled value for loop & array bounds
  • Sometimes a complex pre-condition needs to be checked first
  • Some of the fixes are still incorrect

A Quick List:
  • Badly designed lib funcs (strcat, sprintf)
    • Microsoft's banned API list is 4 pages
  • AIOOB
  • Calling thru func ptrs
  • Integer over/underflow.  Many protocols send N followed by N bytes.  Send a really big number that overflows (after scaling for malloc) into a small number - then malloc succeeds, but the data overwrites the buffer.
  • Pointer math
  • Uninitialized vars
  • Dangling ref pointer
Basically: it's BASIC STUFF that goes bad, not fancy stuff.

All memory integrity or soundness type stuff.
But it goes beyond AIOOB -
e.g. Name string like "Robert); DROP TABLE Students;-- "


Jeff Fischer, Solving the "Britney Spears Problem"

Users assigned roles; roles are assigned privileges.
Benefit: users can come & go without changing policies.

"Patients" can view medical records, but "doctors" can change them.

Can add annotations to Java that specify roles.

Problem: lack of expressiveness; solution: logging.  Violation happened (people saw Britney Spears' data), but the logging allowed catching the violators.  Another solution: manual checks.  But no compiler support, so missed checks.

Problem: Lack of explicit access policies.
Really: need a way to "type" data (and hence methods using the data) with some kind of security/access policy.  More problems: often checks are hoisted to entry point for efficiency, and then violate policy later.  Relying on manually-inserted runtime checking.

Our stuff: parameterize classes by some kind of index value defining Role.
Each method includes an explicit policy (statically checked).
Annotation processor can either statically type-check the policy, or insert runtime checks.


Jens Knoop - Squeeze your Trusted Annotation Base

"Interesting program properties are undecidable!"

Compiler optimization - not so worried: just lose some optimality.  Results still usable; but if we have better results we can produce better code.

Worse-Case execution time analysis for hard real time: Bang.  Your dead.
WCET bounds must be safe (soundness), and wish to be tight (optimal or useful).
E.g., a bound of 1hr for a mpeg3 audio-player interrupt is safe, but useless.
Need, e.g., all loop bounds.  
Need user annotations.
User annotations are untrustworthy, but must be trusted.

Squeeze the trusted annotation base - reduce user chance of screwing up.

Try to replace trust by proof - as a side-effect generally improving optimality.  Then take advantage of trade-off between power & performance of analysis  algorithms.

Use model-checking with care.  If something can't be bounded automatically, as
the user for some bound - then prove it.  BUt also believe the user's bound isn't tight.  Try to prove a tighter bound with the model-checking using binary search.  Indeed - can do away with user assistance often; just guess and binary search to tighten it.

Summary: can often completely empty the trusted annotation base, usually shrink it; often improve WCET bounds.

Start with the cheapest techniques (constant prop), move up to interval analysis, then symbolic bounds & model checkers...


Joe Newcomer - Who do you trust to run code on your computer?

(Joe is a Windows attack expert)

Attack Vectors - autoexec.bat, .cshrc & friends, autoplay, device-driver installs (Sony), mac resource attacks

Engineered attacks: phishing attacks, downloads, activeX, client-side scripting

Lots of "security" designs & codes are done by junior programmers w/out adequate oversight.  Pervasive attitude: "software is secure - as designed, as implemented, as maintained".  Small+fast is all that matters...  not...

Rant against all modern OS's, talking about #of security patches in Mac, Linux, unix, Windows...

Client-side scripting is vulnerable. Flash, Office, Adobe Reader, etc.

What are we teaching people about security?  How about non-CS types?
Graphics designers, website designers...


Stefan Jahnichen, Swarm Systems in Space

Build very small satellites, swarms of them to watch the Earth.  They will spread out after launch to watch e.g. weather, bush fires, etc.  Takes about 2 days after launching to get a communication link.

Satellites have good earth coverage; orbit at 500 to 600km, orbit earth every 3 hrs or so.  Wide range of sensors amongst different satellites.

Have a Real Swarm of cameras; map a Virtual Swarm onto that; map mission goals on the V.S.  

Optimized synthesis of consensus algorithms.

Special OS - distributed OS basically.  Used to maintain swarm integrity (keep satellites together).  


Peter Welch, Engineering Emergence...

Thesis: in the future, systems will be to complex to design *explicitly*

Instead engineer systems to have the desired properties "implicitly".

"mob programming", ant mounds, etc.... simple rules of behavior leading to complex behaviors that emerge.  Emergent Behavior

Mechanisms Design - (game theory),
  • Rational actors have local private info
  • Emergent: optimal allocation of scarce resources
  • Optimal decisions rely on truth revelation

Swarming behavior (flocks, etc)
  • local actors, local behaviors
  • Emergent "swarm" behavior
  • UAV swarms & autonomous robots

Social communication (gossip, epidemic algorithms)
  • large, ad hoc networks
  • emergent: min power to achieve eventual consistency
  • low power, low reliability sensors & data propagation

Design for emergent behavior
  • occam process per "location", a 1024x1024 grid has 1mil processes
    • each location has a com channel to it's 4(8) neighbors
  • occam process per "mobile agent"
    • agents register with local process (or up to 4 if it's straddling a local region) & local process tells the agent what's nearby.


No idea how to design high-level behavior from low-level rules, but busy experimenting and looking for some cases.


Cliff

Category: Web/Tech | Permalink » | Comments (2) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
DaCapo Trip Report
May 5, 2009

http://www.cs.tufts.edu/research/dacapo
DaCapo is a small (but growing) research group focused on GC & Memory Management issues.  They meet about every 9 months, and the meeting bounces around depending on who is hosting it.  People mostly present work-in-progress, and the audience is up front with their critiques... often interrupting the speakers or getting sidetracked into loud off-topic conversations, i.e., it's a fun boisterous crowd.

DaCapo is in the Boston area this year, held at Tufts University's Medford campus (everyone's heard of Harvard and MIT, but how about the Berklee College of Architecture, Fisher College, Boston University, Suffolk University and dozens and dozens more? - Boston has more colleges and universities per square mile than any other place I know).  As usual for me traveling to Boston, I skipped the car and just used public transportation - the T works marvelously well (as compared to e.g. VTA; crossing the entire town from Logan Airport stuck way out in the water to Tufts some 8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose on the VTA takes something like 2 hrs). 

Anyways the weather was supposed to be highs of 65 and rainy with lows in the 40's - so I packed my parka.  Turned out the weather report was a little bit off, with highs in the upper 80's and sunny.  Man did I roast.  But I got out and enjoyed the sun - we (the entire DaCapo group of 40 or 50 people) walked about a mile to lunch each day, plus I walked to and from my hotel & the T stop (also about a mile).  Enough ruminating, on to the talks!

As usual, I pay attention to some talks and chat w/neighbors through others.  I brutally summarize.  I'm opinionated & narrow focused... so Caveat Emptor.



Change Tracking

What changed?  Who did it?  When?  Points out the flaws of using 'diff'.

Logical-Structure diff.  Auto-magically spotted systematic change patterns.
It's a better code-review diff.

Can report diff's as reports like "for all XXX, replace XXX with YYY except for ZZZ_XXX_ZZZ", etc.  Vastly better error reporting!!!  Vastly better diff reporting!!!

Cliff Notes: We must get this gizmo and check it out!!!!

Paper report: http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf
Sample output: http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm



Inferred Call Path Profiling

Very low cost path profiling.  Hardware counters have very little cost, but provide little context.  Software is reversed: high context & high cost.
Cliff Notes: Compare this to call/ret RPC hashing.
Cliff Notes: Bond & McKinely does something similar.

Instead, look at number of bytes from main() to leaf call on stack.  Basically PC & SP (offset from stack base) provides all the context we need.  Dynamically detect unique PC & SP (they did it in a training run) pairs, and reverse them to get a call-stack.

Lots of duplicated stack heights, how to tell them apart?  On SpecInt unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call stack).  But huge spread; really sucks bad on some benchmarks.

If have a list of entire call-graph, then can remove ambiguity by growing some frames.  But requires some global analysis to know which frames to grow in order to be unique.  "Solved" problem with a simple greedy search.  Compute all call-stacks over whole program run, pick a conflict, grow some frame, re-compute & keep if more PC/SP pairs are distinct.  Repeat until it's rare enough that can live with dups.

Cliff Notes: The whole-program ahead-of-time thing isn't feasible for us.  How to do this on-the-fly during tick-profiling?  Hash PC/SP, find a tick record, 1-time-in-1024 check for dups (crawl stack the hard way and confirm call-stack), otherwise assume unique & increment tick record.  If MISS in hash table, then crawl stack the hard way.  If record is labeled as having dups, then crawl the call-stack & find true record?)  Actually, probably expect the call stacks to be common, but the PC's to be fairly unique.  So really asking for unique RPC+SP (not PC+SP).

His experiments: collect PC/SP pairs about 100 times/sec.  0.2% slowdown (must have had careful measurements to detect that!)



Breadcrumbs - Recovering Efficiently Computed Dynamic Calling Contexts


Also builds on probabilistic calling contexts. 

Given a (nearly) unique ID (say, computed at each fcn entry via a simple hash on the return PC).  IN the past, reversing these unique ids: use a dictionary (very large & high cost), use a calling-context-tree (smaller, but requires you to track where you are in the CC-tree at runtime).  Forward search, but undecidable and limited options for pruning.

Attempt to reverse the hash fcn (but only when we need to take the semi-unique id) and convert it to a call-stack.  Straight-up reverse is no good, but is easy to guide search.  Callers of leaf call themselves have structure - e.g. expect to see a call-op from the call-site, etc.  Still need a list of all call-sites (generally available anyways because of inline-cache structures).  His function is "p' = (3p+c)", where "c" is RPC?  Requires a handful of instructions on function entry.

Cliff Notes: If this reverse search is successful most of the time, it says that during a profile tick we can record just the PC/SP (or RPC/SP/PC?) and then reverse them to the entire call-stack later - on demand when RTPM requests.  So a 'tick' is no more than record a few bits (and handle buffer overflow), but no stack crawl.

Cliff Notes: combined nicely with 'null-assignment-stack' info, ie. collect the entire stack and 'color' all NULLs with a unique stack id.  Then if get a NPC can report both the NPC at the crash site, but also the call-stack that assigned the NULL in the first place.



Transparent Clustering - Phil McGachey, Eliot Moss, Tony Hosking

Attempt to auto-distribute multi-threaded programs.  Pure Java, no JVM hacks.  No programmer involvement.  Dynamic class re-writing.  JVMs can vary.  Hardware can vary.

All machines on a network run RuggedJ nodes.  RuggedJ includes a rewriting class loader, some runtime libs, and some application-specific partioning strategy.

Rewriter: generate lots of classes & interfaces; especially hook methods & fields.  Hook into the runtime.  Abstract over object location; remote & local objects.

Ex: Standard class "X".  Rewrite into a X_Local and X_Stub classes.  X objects on Node A are of type X_Local.  On Node B, X objects are really X_Stub objects where all the calls are remote calls.  Both X_Local and X_Stub implement the "X" interface.  Some other Y_Local can refer to X_Local on Node A, but refer to X_Stub on Node B.  If want to migrate from Node A to Node C, want to indirect the X_Locals; so for mobile objects there's an indirection layer called "X_Proxy", that then be switched from pointing to an X_Local vs an X_Stub (if the object migrates).  Non-mobile objects skip the X_Proxy indirection layer. 

Inheritance & interfaces are maintained in the rewritten classes.  Static classes conform to the RuggedJ model (I assume replication is allowed?).  Arrays are also wrapped to support indirection. 

Constraints: Native code breaks everything.  Try to not rewrite stuff referred to by native code (maybe can intercept all native calls and rewrite args?).  System code is also an issue: loaded by the system loader, is not rewritten.  (Can they use -Xbootclasspath?).  But most code run is actually system code!  (eg. String libs are very common!)

So must deal with system code without changing the JVM. 
  1. Can wrap system class objects.  Must unwrap when passing to system or native code; must re-box on return.  Re-wrapping is expensive: need the exact same wrapper, so need some kind of hash-lookup to do re-wrapping.  High overhead, so avoided where possible.
  2. Extend: some classes can be extended and preserve the original class & signatures.  Extended class follows the RuggedJ model but also the original system model. No need to unwrap/re-wrap.  Does not work e.g. when system code generates the object (since sys code makes instances of the original and not of the extended superclass). 
  3. Some system classes are only ever referenced from user code; can just dup the code out of rt.jar and make a user-class (which now gets rewritten by the normal RuggedJ model). 
  4. Immutable classes.  Just replicate.

Apparently able to cluster some complex programs.
No idea about performance as compared to explicitly clustered apps.



Jikes RVM Native Threads

Was Green threads, now Native threads
(everybody else is dropping green threads too....)
Motherhood & Apple Pie.



Computer Science as an Experimental Science

Reproducibility?  Experimental bias?
Show how these problems are solved in very-high-energy astrophysics.

(1) No publish without multiple colleague confirmation.
(2) Publish
(3) MUST also now publish reproductions or non-reproductions's of other peoples' experiments (or lose credibility as a research entity).  But bar to publication is lower, and expectations are lower.  i.e., people expect to publish in-progress work

Biggest comp-sci problem now: lack of infrastructures.
Examples of infrastructure bias: green-vs-native threads, some barrier styles are hard to use (.Net embedded objects), stack-scan (OVM makes fast stack scanning at the cost of everything else being slower).  HotSpot (high-value compiler, good GC, good most everything) vs JIKES.

Cliff Notes: This was actually a really fun talk (and lots of discussion) despite my paucity of notes. 



Web 2.0 Profiling

Do web 2.0 workloads differ from legacy, e.g. SpecJVM98, DaCapo suite.
Relies on browser to run Flash, JavaScript, REST, AJAX.
Benchmarks: IBM social networking software; Java PetStore 2.0, AJAX-based RIA using JEE5, uses Lucene search.
Apply workload: viewBlog (from 9 blogs), createBookmark (total 14), selectTag (PetStore 9), etc. 
Workload mix can be varied, but based on commonly observed usage pattersn from internal deployment of Lotus Connections).

Wanted to remove DB from the workload, put DB on ramdisk.  Same for LDAP.  Used Grinder to generate load.  Main server running J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5 1.66GHz.  Fairly reasonable setup for a small web/app load.

See lots more chattiness; very short fast requests, high interrupt rates, smaller requests but many more of them.  Looked at low-level profiling.  See fewer I-cache misses as compared to a large legacy apps.  Better I-cache behavior.  But data-cache miss rate is still very significant.

Also different transactions have different scaling limitations.  For PetStore the Java persistence layer started falling off after 4 cpus. 



Stream Processing

"Spade" - Cluster stream processing at IBM.  Continuous high-level streams.  Stream arrives for weeks & months; want to process and then emit stream data continously.  Want to express the problem with a language.

Priorities - fast streaming on a cluster; then Generality; finally Usability.  Beyond StreamIt.  Like StreamIt, works with streams that can be split & joined.  Language is typed; stream contents are strongly typed. 



Demystifying Magic - High-Level Low-Level Programming


Seen this talk before, but this time given by Steve Blackburn.  Nice historical perspective on writing system software in C vs asm - and the parallels to writing sys software in Java vs asm.

Java-In-Java
  • Key Principle: Containment.  Write as much as possible (99%!) in pure Java, and dive into the low level "magic" bits as little as possible.
  • Extensibility: Requires change quickly, languages change slowly.
  • Encapsulation: contain low-level semantics
  • Fine-grained lower of semantics: minimize impedance, seperation of concerns

Framework:
  • Extend Java semantics; intrinsics
  • Scoped semantics changes
  • Extend types; un/boxing, ref-vs-value, architecture sizes
  • Pragmatic: retain syntax where possible to keep front-end tools working

Introduce raw pointer type: "Address" looks like a Java type.
E.g.: Address a = ...;   ... = a.loadByte(offset);

Semantic Regimes:
Additive: "its ok here to have unchecked casts"
Subtractive: "no allocation via new allowed", similar to HotSpot VerifyNoGC.

Allow "unboxed" types - no v-table, so it's a bare primitive type.  But this isn't the default (but can be asked for).

Abstraction results are good: can implement a JVM on bare hardware and with a simple change essentially virtualize the same JVM inside a GC debugging harness running inside another JVM (where the virtualized JVM's heap is just a big array in the outer implementing JVM), etc.



Grace, Safe Multithreaded Programming for C/C++ - Emery Berger

Seen the paper before, but this is the talk by Emery.

Fake sequential semantics on stock multi-threaded hardware using fork/join and process protection. 

Grace is a pthreads replacement: "g++ -lpthreads" becomes "g++ -lgrace".

Speedups: in the absence of common benign races, Grace run programs run about 10% slower than raw pthreads - but have full sequential semantics.  It's "as-if" all the parallel programs are run sequentially immediately at the thread spawn point. 

CAN'T run applications where threads run forever, i.e. reactive or server apps.
Works well with fork/join, Cilk, TBB, etc.

So thread spawn becomes a unix fork with COW, use mmap for allow memory sharing.  At join points, smash in join'd threads memory updates via mmap.  Also need scalable thread-local malloc heap, plus aligned globals (to avoid false-sharing at the page granularity), plus some improved I/O.

Some simple measures remove nearly all false sharing.  Big one: everybody mallocs into their own space.  2nd big one: spread the globals out 1 per page.

Thread-spawn is as cheap as a fork on linux (has experiments to show it).  Due to thread-cpu affinity, if you spawn a bunch of threads they share a single CPU.  At the low end of thread-grain-length, fork is faster than spawn because the scheduler immediately spreads the processes across CPUs, whereas threads share for awhile.  So fork is actually faster than thread-spawn for awhile.

Real performance limiter for Grace is conflicts & rollbacks, and not thread-spawn overheads.

Performance is much better than e.g. STM, because after the 1st touch on a new page (and the SEGV & accounting), all accesses on that page run at full speed.



GC Assertions: Using the GC to check heap properties

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&rep=rep1&type=pdf

Global knowledge is required to understand seemingly local properties.
JBB example:

    Order order;
    for(...)
      order = ....;
    delete orderTable;
    ? are all 'orders' dead here?

But actually 'order' is held by the last-customer-transaction (as an optimization, or convienvence for customer?).  Leads to a leak of 'orders'.  Really: programmer does not understand program, too complex.

Add assertions to the code, and use the GC to check the property.

Cliff Notes: Azul is already gathering global heap properties during GC - but not asserts.  Gathering liveness & points-to info.  If points-to included a handful of sample real instances (and all the sames linked when possible), would be a very powerful way to instantly check some heap properties.

Sample asserts:
'assert_dead(p)' - expect to reclaim 'p' at the next GC cycle
Or region-collection: start_region(); ... assert_all_dead();
Or shape property: assert_unshared(p);
Or ownership properties (members of a collection are 'owned' by the collection)

If an assert fails, then do a slow crawl and provide the full path through the heap showing the failure.   Asserts only checked at GC points.



Proving Proving Correctness of Abstract Concurrency Control and Recovery - Trek Palmer & Eliot Moss

Transactions - closed nesting has issues
Open nesting & Boosting need from programmer: conflict info & roll-back info.
These are hard to provide correctly.

This work: a description language of abstract data types or structures.
Can describe the conflict predicates & inverses.
Can prove correctness of the conflict & inverse expressions.

Working with the abstract description of the data structure, and NOT e.g. with the real Java implemention.  E.g., no loops, no mutations, no recursion.  So the language isn't a 'real' language, but can describe many kinds of ADTs in code that looks sorta like Java.

Output isn't a functional program, instead the output is the result of using a SAT solver to prove correctness. 

XTN-semantics allows conflict detection to be proven correct pairwise, instead of having to do full interleaving.  Formally, a conflict
predicate is correct iff it is true when the operations do no commute.

The conflict predicate tells when 2 XTNs conflict, and the inverse allows an optimistic XTN to rollback.

Obvious use-case: use this tool to write a transactional-version of NBHM or other JDK concurrency utilities.



Hard Real-Time GC Scheduling

- Periodic Scheduling
 - GC runs at highest priority, but periodically yields to mutator
 - Metronome
- Slack-based Scheduling
 - GC runs at lowest priority
 - Can be preempted at any time
- Makinac (Sun RTS)
- Work based Scheduling
 - GC runs at allocation time
 - Problem with allocation-rate jitter

HRT - Deadlines must never be missed AND must be verifiable analytically that no deadlines are missed.  Systems tend to therefore be very simple, so that they can be proven.

OVM - like Metronome.  Dynamic defrag; arraylets; Brooks style barrier; replicating barrier/; incremental, concurrent, supporting slack-based style scheduling. 



Understanding Performance of Interactive Applications

Typical profilers give the wrong numbers: they report total time spent, not time spent between mouse-click and page updated.

AWT/Swing & SWT - similar: single-threaded, gui thread in "event loop", relies heavily on Listener model.  So profile using gui call-back Listeners between user events and the eventual display change.

Really simple idea: profiling & event visualizer tool between click & view times.


Program Metamorphsis

Trouble w/refactorings: must preserve program behavior with each step.  But if we want multiple refactorings then at each step along the way fixup code (needed to preserve program behavior) pollutes the code.  Want to do multiple refactoring steps at once - while leaving the program possibly broken in the in-between steps.

So compute a program-semantics (names & def-use chains), and let the user make partial refactoring steps and then declare "I think I'm done" - and the system compares the new program-semantics with the old, and reports if they are not equal. 

So actually comes up with primitive not-quite refactoring steps, which he will compose to build larger "real" refactorings, or compose lots to build a multi-stage refactoring. 



Instruction Selectors


HotSpot C2 uses a BURS framework.  Very hard to optimize & debug.  So these guys have a semantics which can span both the ideal IR and real machine instructions.  Describe both using "CISL" and the tool does the mapping.

Once the tool has a mapping, the user provides an adapter that converts the mapping into the code needed by the compiler back-end.  This would replace e.g. the ADLC.  Gives examples of machine encodings for PPC and Java bytecodes.

CISL is given an IR tree and produces a target tree with equivalent semantics.

Not sure it's entirely better than BURS (maybe no worse), but I'm still sold on rewriting in e.g. C++/Java hand-written greedy match rules.  These code-gen-gen tools are pain for long-term maintenance.



Do Design & Performance Relate?


Is fast code required to be ugly?  Is beautiful required to be slow?

Pick 200 code metrics.  Also pick some performance metrics (cycles, single threaded, objects created, etc).  Could not follow talk... or at least he wasn't very clear with the objectives & results (if any).  Interesting idea though.


Category: Web/Tech | Permalink » | Comments (1) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
Odds & Ends
April 15, 2009

Various pending goodies...

Load-Linked/Store-Conditional vs Compare-And-Swap


Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!).  There was a long period of debate in the CompSci community one what constituted the "minimal" needed instruction to do useful multi-CPU work.  Eventually the community has decided that the Compare-And-Swap (CAS) instruction and the Load-Linked/Store-Conditional (LL/SC) instruction combo both are (1) sufficient to do useful work ("infinite concensus number") and (2) relatively easy to implement in real hardware.  X86's, Sparc's and Azul's CPUs use CAS.  IBM's Power, Alpha, MIPS & ARM6 all use LL/SC. ARM5 only has an atomic-swap.

LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms.  With LL/SC, you first Load-Linked a word.  The hardware marks the cache line as "linked".  You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter).  Finally you issue a Store-Conditional to the same address.  If the cache line is still "linked", the store happens.  If not, then not.  The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line.  Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented).  "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail.  I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure.  The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC.

With CAS, you typical first load a word with a normal load, then manipulate it.  Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory.  CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic ABA bug.  In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless all copies of A die first - including the one loaded before the CAS.  Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS.  Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively.  With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link".  E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS). 

These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations.  Nothing in the academic literature requires any sort of global ordering on these atomic instructions.  Instead the usual academic assumption is that all memory operations are strongly ordered - which is obviously not true on all modern hardware.  Practitioners are required to insert memory fences as needed to achieve the desired ordering.  Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do.  Azul's CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations.  (As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking.  Report back the %-tage of dropped counts and the throughput rate.  Then try it with an Atomic counter.  My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless).  I am less familiar with the fence properties of common LL/SC implementations.  Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc?

build.java


Some time ago I reported on a build tool I've been using.  Currently 'build' is being used to build HotSpot internally to Azul.  We've got it building 400+ files, plus build rules for building portions of the JDK, compiling w/gcc & javac and also 'javah', tar, jar, strip, binary signing, etc.  In addition to the typical parallel-make functionality, it includes auto-discovery of c++ header files, intelligent load-balancing of big builds, etc.  It's blazingly fast, it's easy to hack (being written in plain-olde Java).  I hunted around awhile and couldn't find a good place to dump a medium-large blob of code, so here's a sample build.java in 4 parts:

Part 1
Part 2
Part 3
Part 4


A "HotSpot -server" aka C2 compiler IR visualizer

No I didn't do it!  This deserving grad student did it.

Here is the reference to the visualizer for the ideal graph of the HotSpot server compiler:

http://ssw.jku.at/igv/master.jnlp

It is a JNLP application hosted on a university server (http://ssw.jku.at/General/Staff/TW/ has also a test file to download), but the easiest way to get a first impression.  The source code of the tool is hosted on http://kenai.com/projects/igv  The server compiler instrumentation is part of OpenJDK and also included in Sun's weekly builds of Java 7.

More C2 Goodies

John Cavazos wrote:
... One of the things I am currently looking at is determining the right phase-ordering of optimizations applied to a particular program being optimized.  I have some nice (un-published) results for JikesRVM,
but it would be nice to replicate the research for HotSpot...

I have strong arguments for nearly all orderings of our passes, so I'm curious to know about your phase-ordering results.
The only obvious transform we might be missing is PRE, but our peephole opts pretty nearly subsumes PRE (they are not mathematically equivalent - I can write programs where either PRE or the peephole stuff makes progress against the other).  In practice, PRE will find essentially nothing once the peephole opts are done.  You'll notice that we do the peephole pass alot; it's totally incremental and provably linear bounded.  In other words, if there's nothing to be gained then there's no cost in trying.  The peephole pass includes amongst other things all pessimistic forms of copy propagation, value equivalence, constant propagation (especially the null/not-null property), constant test folding, repeated test folding, dead code elimination, load-after-store opts (aliasing analysis is included for free during building of SSA form), algebraic properties, and lots more.


For HotSpot, the optimization ordering is:

  • Build an inline tree, including class hierarchy analysis.  This is the one pass I'd be willing to move, as it happens too early.
  • (a single unified pass:) parse bytecodes, inline, build SSA form (the IR remains in SSA form always), do peephole opts over SSA form.  This pass typically costs 40% of compile-time budget.
  • Fixed-point the peephole opts over SSA form, once all backedges are known after parsing.
  • Loop opts round 1: "beautify loops" (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, split-if (zipper-peel CFG diamonds with common tests, plus some local cloning where I can prove progress), peel loops (required to remove loop-invariant null checks)
  • Fixed-point the peephole opts over SSA form
  • Loop opts round 2: "beautify loops" (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, lock coarsening, split-if & peel - but if these don't trigger because there's nothing to gain, the do iteration-splitting for range-check-elimination & a 1st loop unrolling.
  • Fixed-point the peephole opts over SSA form
  • Conditional Constant Propagation (the optimistic kind, instead of the pessimistic kind done by the peephole pass)
  • Iterate loop (split-if, peel, lock coarsen - but these typically never trigger again and take very little time to check), unrolling & peephole passes, until loops are unrolled "enough".  On last pass, insert prefetches.  Typically this iterates once or twice, unless this is a microbenchmark and then unrolling might happen 8 or 16 times.
  • Remove tail-duplication, and a bunch of other minor code-shaping optimizations e.g. absorb constant inputs into deoptimization-info in calls, or commuting Add ops so that 2-address machines can do update-in-place.
  • Convert "ideal" IR into machine code IR.
  • Build a real CFG for the 1st time, including a dominator tree, loop tree.  Populate with frequencies from earlier profiling.
  • Global latency-aware (loop-structure-aware) scheduling.
  • Replace null-checks with memory references where appropriate.
  • Register allocate.  Internally this has many passes.  This pass typically costs 40% of compile-time budget.
  • Sort basic blocks to get good control flow ordering (forward branches predicted not-taken, backwards predicted taken, etc)
  • Some last-minute peephole opts.
  • Emit code into a buffer, including OOP-maps & deoptimization info.


Travel Plans

I got too much travel coming up!

Apr 30, May 1st - DaCapo in Boston.  Favorite small group; GC & Java focused. Website: http://www.cs.tufts.edu/research/dacapo/.  I had to turn down the invite to an SSA seminar in France because the dates conflicted.  Very sad.  I hope one of the attendees will post a trip report.

May 11-May 15.  IFIP WG2.4  (International Federation of Information Processors, Working Group 2.4 - a *really* old European group with a random mix of industry & academia.)  It's a nice group to preview JavaOne talks, and always the meetings are a week-long rambling discussion in some quaint resort.

June 2-June 5.  JavaOne.  I owe slides for 3 talks by end of this week.  'enuf said.

July 6-10th - ECOOP, as a paid-for invited speaker.  A free vacation to Italy!   :-)

Cliff

Category: Web/Tech | Permalink » | Comments (10) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
YATR2YAC (yet another trip report to yet another conference)
April 5, 2009

'Tis the Season to be Conferencing...
Blah, I'd rather be hacking, but here goes...  

I went to EclipseCon, to CGO and back to EclipseCon.  Here's how it can down: I got invited out of the blue to talk at EclipseCon by Darin Wright, who had seen me speak earlier at the JVM Language Summit.  Like a dummy I accepted it instantly instead of checking my calander.  Naturally it conflicts with CGO - and I was on the CGO PC this year, so I felt obligated to attend CGO as well.  Turns out my talk at EclipseCon got scheduled for Wednesday, and I decided I could skip CGO on Monday (I think there was one talk I was interested in) - so I scheduled a 1-day only trip to Seattle.  Monday I wandered EclispeCon at the Santa Clara Convention Center, Tuesday I took the 1st plane up to Seattle and the last plane back, and Wednesday it was back to Santa Clara to give my talk and surf the EclipseCon booths.  The flights on Alaska Air were smooth and easy.  No WiFi at CGO - the Hotel wanted some totally outrageous price for conference WiFi (I think the local chair was saying $50/head prepaid?!?).  (Remember my grief over Hotel WiFi in the last blog?)  So over lunch break (which was Chinese food and I don't like Asian food of any kind) I headed over to Starbucks.  Turns out Starbucks will happily give you 2hrs of free WiFi a day from ATT - provided you register $5 on a Starbucks card with them.  Yes!!!  Suddenly I have a $5/month nation-wide WiFi plan!  I snacked on some salad'y thing, drank my triple-latte, and happily Surfed the Internet until show-time at CGO came 'round again.

1st up: EclispeCon, mostly cause I got nothing to say.  These guys are in a totally different ball park from where I live; they mostly deal with complexity control (or at least 1/2 the talks essentially amounted to adding, changing, or removing a control layer).  Makes me wonder if the cure is really better than the disease. I gave my talk, but it was too much talk in too little a time window.  Both me and the attendees would have been better off if I could have spent more time on the details.  C'est la vie.  Other thing of note: the web site for EclipseCon is vastly better than the usual academic conference I attend.  Also the staff was exceptional friendly and easy to work with.

Next up: CGO.  Despite my 1 day fly-in-fly-out I got a lot more out of CGO. 

----------------------

Keynote: Vikram Adve

Missed!  Arrived barely in time for the last 10 minutes.  That's the penalty for attempting a 1-day fly-by.  I hear he said something good about Azul systems (he did ping me the day before for some details).  Basically it's a Call To Arms for improving C compilers and C runtimes.  He does say that fully auto-parallelizing code is probably a bad idea, and that we need to change the language instead (something I very much agree with!).

----------------------

Revisiting Out-Of-SSA Translation

Cliff Notes: HotSpot "comes out of" SSA form during register allocation, where it really preserves SSA form throughout the entire compile process, but adds register annotations as part of reg alloc (and inserts live-range splits as part of spilling). So mostly this talk is about stuff HotSpot has been doing for years... but there's a key new piece.

Linear-time colaescing congruence classes; speed & memory footprints.
Why is coming out of SSA hard? (Cliff Notes: It's not!)

And now they show what HS has been using for years: insert copies as needed for a simple out-of-SSA translation, then aggressively coalesce away the extra copies.

Nice diagrams for IFGs + copy-affinities.

Notices the value-equivalence property for defining IFGs (also done by HS).

Ok: here's what's new compared to HotSpot -server:

Fast interference test w/out building an IFG!
Only need to test against closest dominating var?
So scan the dom tree, with a stack-based DFS traverse of dom tree.
Uses 2 sorted lists (?); more complexity for value-based IFG tests.
But no building (and rebuilding?) the quadratic IFG.
Claim 2x faster than Sreedhar for coming out of SSA.
Big memory footprint reduction, because no need for live-ness sets & IFG!!!

40% of HS's total JIT budget is in reg-alloc! A faster reg-alloc is definitely interesting. A large part of that 40% is building the liveness & IFG sets

----------------------

Wave Propagation and Deep Propagation for Pointer Analysis

Wave Prop is very fast and uses huge memory (working on parallel version)
Deep prop is faster for small & precise memory sets - works better on desktop (and JIT?)

There's a zoo of ptr-analysis names:
- context-sensitive or not
- flow-sensitive or not
- field-sensitive or not
- inclusion-based (Andersen-based) or unification-based (Steesgaard-based)

Cliff Notes: HotSpot -server uses context-INsenstive, flow-INsensitive, field-Sensitive, unification-based

Build a graph of constraints, and do contraint-propagation.
Code: "a = &b;" Constraint: "a superset {b}" - means b element-of P(a)
Code "a = b" constaint: "a superset b"- means P(b) contained in P(a)
Code "a = *b" or "*a = b" - complex constraint, must add edges to the graph as constraint propagation adds new sets to the nodes.

Now build a graph: nodes are P(a) - sets of things pted-at by 'a', and edges represent subsets.

Cycles are common; collapsing them is crucial for performance. All things in the cycle must be equal. Uses Nuutila to find strongly connected components in a single pass (so does HotSpot - Vick's algorithm).  After collapsing cycles, you get a DAG - and Nuutila also produces a topo-sort as output.

(Still looks like a complex diff/delta propagation algorithm - but makes me wonder if standard bit-vector stuff is faster).

Not a bad-looking ptr-analysis. Might work in a JIT.
Manesh suggests using BDD's.

Cliff Notes: Real question of course, is if all the extra alias precision actually buys you any performance. PLDI paper of a few years ago suggestions otherwise: that HS's style of alias-analysis gets you 95% of all the possible gain.

----------------------

Fast Track: A Software System for Speculative Program Optimization

Chen Ding: another process-level fork/join speculative program optimization (contrast this to Emery Burger's "Grace" system). Gives sequential program semantics. Add annotations to "suggest" fork/join
points. Actually fork a process, use COW & mmap to guarantee sequential semantics in the face of races (end speculation, or force commit-ordering on speculative threads, etc).

New Idea: User can write an alternative implementation of some module; run both versions in parallel; keep the faster version & kill the slower version: "return do_whichever_is_faster( methodX(), fast_methodX() )"

New Idea: "fast_methodX" might be speculative and *wrong*. So run slow-version to completion and do error checking after the fact. But let the fast version run ahead (and if it's faster), it can spawn off new speculative fast versions. When slow version completes, use page-level protections to know what's been changed, and verify the correct semantics of the fast version. Run all the correctness checks in parallel with all the normal fast-case stuff. Needs some careful throttling.

Example using gcc's "mudflap". Array under/overflow, leaks, un-init'd memory, etc. Run 2 versions of all code: the existing code is "fast track" and the mudflap version is the "slow correct track". Goal: same speed as the "fast track" code with all the safety checks of the "slow track using mudflap". In some cases can get real nice speedups.

Cliff Notes: can we turn on mudflag for Azul?
----------------------
Scenario Based Optimization

Profile the hot methods. Tweak various parameters (e.g. loop unrolling, or cache-tiling parameters). Profile both old and new versions. Pick the best winner, and stick with it for awhile. Keep profiling continously, and switching code as needed.  Some JVM JITs do this.  HotSpot only sorta does.

The programmer has to pick what the new & old versions are, and the runtime picks the fastest one (no support for correctness - what if one of the versions is busted? Or tickles a timing bug which breaks things?)

Seeing some nice gains, some of the time. Avoids losing when turning on some aggressive optimizations. But generally unrealistic.

----------------------
Software Pipelined Execution of Stream Programs on GPUs

Basically StreamIt on a GPU

StreamIt is much easier to use than e.g. CUDA or OpenCL.

StreamIt model is a collection of connected serial parts; connected via pipes. Due to splits & joins, might need to exec some parts more often than than others (e.g. part Y "eats" 2 ouputs from part "X" before outputing a result itself). Get complex patterns of execution; difficult to map down to GPGPUs. Have bounded-buffer problems: need to map to a steady-state schedule using a fixed count of buffers. Then can arrange to do no allocation during execution.

- Profile each filter standalone, to figure out it's work load.
- Configure selection
- Add constraints
- Use CPLEX to solve
- Output CUDA code.

Looks like a classic software-pipeline, just done on a GPGPU. Compute minimum initialization interval, compile resource constraints, etc. Have issues with data-layout, so can do parallel data access.

Impressive compile times (30sec to 178sec).
Can get pretty good utilization on some kernels, so seeing 100x speedups.

Cliff Notes: My usual caveat for domain-specific languages applies here: if your domain fits StreamIt's model - then StreamIt is the fastest language for you!  If what you got doesn't look like StreamIt, then Too Bad.

----------------------
Stream Compilation for Real-Time Embedded Systems

StreamIt Again, but for Cell.
New constraint: out of memory per cell (distributed memory).
Different from GPU - GPU has shared memory, but very slow if doing bank-conflicting access - but can pile on more memory (more buffers) for some GPUs and less for others. Not so for Cell - must have a
uniform split of memory.


Thanks,
Cliff

PS - Thanks Doug for inspiring this blog


Category: Web/Tech | Permalink » | Comments (7) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report
March 19, 2009

Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report... sigh, all good hacks must come to an end, or at least be put aside so Real Life can intervene.  I went to DC for the ASPLOS and VEE conferences and all you get is this lousy blog.


2009 ASPLOS & VEE Conferences



ASPLOS & VEE are co-located in D.C. this year.  Mostly the trip itself was uneventful.  I was very busy either attending the conference sessions or preparing for my own talk and I didn't get around much.  The flights themselves were mostly uneventful (ComAir's HQ in the midwest lost power from a tornado, so I was stuck on the tarmac at JFK for an hour waiting to get permission to fly into DC - first time I've been delayed by bad weather when the weather between the source and destination airports was perfect).  The most notable point was that the hotel used WayPort as their internet access.  WayPort sucks horribly; I've used them in the past and swore I'd never use them again - but I forgot to ask when checking in to the hotel!  WayPort charges $13/night, and service is filtered through some horrible layout routing through TimBuk Too - average ping times to hit a server in the DC area was about 600msec (ping times from AT&T wireless in the conference room were ~50ms).  To add insult to injury, some other conference attendees stayed at a nearby hotel - which was cheaper and had free WiFi AND the free stuff was faster!  Definitely having WayPort as your WiFi provider is sufficient for me to change hotels.

Cliff Notes: I had a wonderful 1.5-long 1-on-1 talk with Monica Lam (http://en.wikipedia.org/wiki/Monica_S._Lam); she didn't know what Azul does.
Cliff Notes: More interesting talks with Jon Shapiro (http://www.eros-os.com, who's busy implementing a high-level low-level language for high-availability systems).

Reality Check: yes again a Mac failed to be able to output to the projector, and we all waited while the presenter hotswapped to a PC

Caveat Emptor: As usual for these talks, I only attend some sessions, I write stream-of-consciousness reports which I only lightly edit and the reports tend to be brutal and short.  They are roughly sorted here by my interest level, and not by e.g. time order.

General Cool Stuff:
Keynote: Monica Lam's Vision of the Future of Human / Computers Interaction
Getting Wrong Answers Without Doing Anything Wrong - all your base are belong to us
Text Processing in Parallel in a Register - wicked fast
Nanoscale Computing - wicked scary
Demystifying High-level Low-level Programming - how to write Java-in-Java
Asymmetric Chip Multiprocessor - 1st good use I've seen for these
Leak Pruning - Reported on before, but still good
TRIPS - Funky dataflow/VLIW hybrid (but not the weirdest I've recently seen!), too bad it didn't pan out

HTM Talks
Which I read but due to a snafu failed to attend the talks.   :-(
Maximum Benefit from Minimum HTM
- not so minimum (bigger than Azul's HTM!) and lousy results
Rock's HTM - a peek at Sun's experience with Rock's HTM; mostly Rock's HTM is really really weak

Multithreaded Bug Finders
CTrigger - The Winner in this section: cleverly induced sleeps trigger race bugs reliably
Anomaly-based Bug Detection & Prediction - It's an, ahhhh, interesting hack
Isolator - silently prevent race bugs

Making Programs Deterministic
Cliff Notes: Major flaw in the thinking here: Real Programs have Real Random Inputs.  Unless you capture & replay ALL external events, minor things like network-packet-arrival-order and the chaotic butterfly effect (http://en.wikipedia.org/wiki/Butterfly_effect) ruin your determinism.  So I think all these papers are tilting at windmills.
Kendo - Stock hardware, but requires a race-free program
Deterministic Replay - Goes only 1/2-way to preventing the Butterfly Effect
DMP - Deterministic Shared Multiprocessing - No replay attempt, just full-speed determinism (assuming no Butterflies)

Other Stuff
Talks I attended but aren't otherwise interested in...
Memory Buddies - Yet Another Use for Bloom Filters
Live VM Migration - pre-copy push still beats post-copy pull
Validate - Microsoft patches Are Bad (tm)
Raytracing vs GPUs - Doom!
Commutativity Analysis for Software Parallelization - the original work is WAY more interesting than this follow-on
Predicting Yield of GCs - Yea Olde Hat Trick
Phantom BTB - winner of the Most Unrealistic Hardware award



Keynote: Monica Lam

Why do computers break?  When we buy something, we expect it to work.  If it breaks, you buy another one.  But with computers they break on their own (e.g. u$oft update) or you can break it doing "normal" stuff - and if it breaks it doesn't help to buy another one - the new one is broken already.

Virtual Desktop - idea: keep the virtual machine in the datacenter, and do remote login.  Does not work well - requires WiFi everywhere.

With each new comp generation, 10x cheaper & 10x more users.
Mainframe-> mini -> worksta -> PC -> laptop -> cell
Also see 3 industries: internet, media/TV, telephone
Also see 3 networks: cable/satillite, broadband/phone, wifi/wimax
Also see 3 computing models: cell (always on, pocket), PC (rich legacy, offline exec, locality of reference), Cloud (device independent, central management). 
Things like GoogleGears or Virtual Desktop are trying to split the Cloud/PC model.

Guess where we are going?

1- (Man) a phone-y thing: key, cache, window into digital cloud, ID, personality, asserts, internet access.  Small, always on.  Can be lost easily & replaced easily.  No important state, but many Gigs of flash cache.  No real compute power.  Small visual display & keyboard.

2- (Earth) Can (1) plug into a PC.  Get big screen & keyboard, get fast processor.  Get hot DRAM, teribyte disk drive.

3- (Heaven) Backs onto the internet/cloud.


This model has some different security issues.  DataCenter is probably secure (or can be?)  You hold key in your hands.  Can we make path from hand-held-key to datacenter secure?

Trends - green, Bring-Your-Own-PC, Mac-vs-Windows.  End of Corporate PC model.  80% of people use personal PC for work; 45% of people use work PC for personal use.  25% of workers change security settings, etc.

SO... want: {User, Ubiquity, IT}  Separation of Concerns.

x86 virtualization seperates: hardware from VM; hardware from OS; different hardware configs; different user roles

MokiaFive company - Desktop-As-A-Service.  LivePC images "in the cloud".  Managed on central server.  Can go up to any mac/pc and run "your desktop" from a USB stick. USB stick/phone is a local cache only - backed up on the cloud. Think of USB memory as a "network accelerator".

Nice live demo.  From USB stick to a selection of PC desktops is not very long (+2 mins for a new PC to install software from stick).  No hotel wifi?  Claims to be sync'ing with cloud???

Nice integrity model: basically re-install the OS clean each time on a fresh VM-ware layer with anti-viral.  Intent is to clean out root-kits with each fresh login to machine.

Graphics is hard: many games are wired to DirectX, so losing much 3D performance.  Trying to bypass all the virtualization layers for DirectX.  But... guest & host have different address spaces; but graphics has lots of pointers; fast pointer munging is hard.

Other services: LDAP, revocation, offline access, logging, dlp. Encyption, Storage Virtualization: reju, incremental & atomic restore; secure access.  Uniquification, single-sign-on.

Seperate-of-Concerns Principles - 90% of code handles exceptions (only 10% does the normal stuff).  "Re-Install the OS" as a root-kit prevention.  e.g. system is structured as a fresh-OS-install because of the rare root-kit issue.

Bootstrap from a small trusted base: baremetal VM; LivePC player;  IT-managed base image (Windows+AntiVirus); finally user-apps & data

Like the Azul philosophy about GC - make the hard case (FullGC, infection from rootkit) common and then get it right - then all else follows.

Need to keep users ownership of own data - avoid the Big Brother mentality.  Give users a choice on where data is kept.


Getting Wrong Answers Without Doing Anything Wrong

Variations in unix env variable size, can see 15% variation in performance!  Running gcc -O2 vs gcc-O3, 400.perlbench.  Bunch of runs at each setting very repeatable per-run.  But with different env sizes see huge variation on 400.perlbench & 'ibm' benchmark.  

Conclusion: setting unix env variables can change your conclusions.

Also, sorting object files on the link-lines differently changes layout in memory, and in I-cache layout.

Conclusion: linking with 'ld *.o' is different from 'ld a.o b.o'.

Bias holds true even if looked at a large benchmark suite, or across different CPU #'s from Intel.

See a prior paper (which I now cannot find!!!) which studies this carefully.  Note that there are a bunch of JVM/Java benchmark issues here that also have been written up recently.

Cliff Notes: I've dealt with both of these in the past; so has Mark Moir (Sun) and Bean Anderson (SGI).  I discovered the need for I-cache layout while working for Motorola on PPC chips (somebody else actually did the implementation).  The results gave us an average 10% performance boost on a bunch of benchmarks, plus removed a huge source of variation.  Later at Azul, with our shared low-associativity caches we discovered that stack placement was crucial to avoid conflicts, fixed with a little random offset to each stack base.

Cliff Notes: What are other large systemic biases?  Are these two it?  And obvious question for a JIT - should I start doing I-cache layout games for performance?  How about making copies of an existing hot method and running both old & new and doing sampling, and pick the winner?


Text Processing in Parallel in a Register

Cliff Notes: this technique is insanely fast
Like SIMD, but char/bytes within a standard GPR.  So 4 or 8 chars in a GPR?

Doubling: field width from n-bit to 2n-bit
Inductive: repeated doubling.

Parallel bit streams; 1-to-1 correspondence to byte stream.  Vastly faster XML parsing; better cache locality; better branch prediction.

So: take 1st bit from each byte, and jam 64 bytes (bit #0 only) into 1 GPR.  Take 2nd bit into 2nd GPR, etc.  Basically, load 64 bytes into 8 GPRs, but bit-striped.  Need architecture support for bit-striping pack/unpack.

Now break up desired operations (e.g., compare for lexical token boundaries) into bit-field operations.  Can do compares of 64 bytes for tokens by simple ops on 8 GPRs.

Cliff Notes: really lousy presentation.  Hope the paper is better. Guesswork: take this string "java/lang/Object(Crunkify)" and find the '(' tokens.

  GPR #0 - xxxxxxxx - bit 0 from all bytes
  GPR #1 - xxxxxxxx - bit 1 from all bytes
  GPR #2 - xxxxxxxx - bit 2 from all bytes
  GPR #3 - xxxxxxxx - bit 3 from all bytes

My turn for a lousy presentation: Look at bits for '('.  Want some combo of 0's & 1's from the GPRs.  So e.g. XOR w/-1 all GPRs needing 0's - now need them as 1's. Then do 7 AND's (parallel log tree AND).  Any bits still set represent the '('s.  Nicely wide-issues per 64-bytes.  Expect to compare 64-bytes for a single char in maybe 6-8 clocks!?!?!?!  Plus evil pack/unpack issues/costs.

Wrote some code to demo finding '(' in a long string using maybe 1/10th clock per character, plus whatever the packing costs are (which probably dominate!).  In this code, 'b0' through 'b7' are 64-bit values holding bits 0 (through 7) for 64 chars in turn.  Here's the test-64-chars-for-paren check:
      // The open-paren char, '(', hex 0x28:
      //           0     0    1     0    1     0     0     0
      long res = ~b7 & ~b6 & b5 & ~b4 & b3 & ~b2 & ~b1 & ~b0;
      // A non-zero res means a '(' is found, and the bit-number is
      // the index into the 64-byte array where it was found.


See attached source code for TextScan.java


Nanoscale Computing
128 bytes of ram; 5micrometers in area

Smaller version: 8 bytes of ram; 1.5micrometers in area; similar in size to largest virus.  Chemical sensors report results by directly changing bits in the instruction stream (e.g., flipping a bit changes a JMP to a NOP, in effect making a conditional branch).  Can detect adequately running at 100Hz (not 100MHz).  Can write small simple test/loop codes in this form factor.  Can e.g. count molecule rates or virus rates in a solution (injected in bloodstream). 


Cliff Notes: Very Very Scary.  See Vernor Vinge's dystopia "A Deepness in the Sky".  http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky
http://en.wikipedia.org/wiki/Smartdust


Demystifying High-level Low-level Programming

Goal: Abstraction w/out guilt.  Separation of concerns.  Nice historical perspective.  Issues go back as far as 1975 - using C for writing an OS was a big issue back then.  So it's time for the modern revolution of the same thing.  JVMs for OS's. 

Various C support tools: coding styles; Purify; idioms; tools (Valgrind).  Also Systems-languages - BLISS, Modula-3, Cyclone, etc Or 2 languages: JNI+Java JikesRVM extends Java, also OVM, Moxie, DRLVM, C# (Bartok).  Usually in context of runtime research.  This talk: Jikes, Java-in-Java.

Containment:
- minimize exposure to low-level coding
- extensible; requires change fast, but language changes slowly
Encapsulate: contained low-level semantics.
Fine-grained; separation of concerns.

"Address" is Magic, like unsafe.
"Address foo; int x = foo.loadInt(offset)"
More types for "void*" in Java, or generic OOP pointers.

Intrinsics: contract between you & the compiler; at the Java level it has an empty implementation.  But the compiler will always insert the correct code for the e.g. prefetch intrinsic.

Also annotations to do e.g., turn off bounds-checks or null-checks (performance hints, but break safety).

Scoped semantic regions - scoped rule changes, e.g. "NoNew" - no allocation in this scope (HotSpot uses this technique extensively), or "NoBoundsChecks" in this scope, etc.  Unboxed is another annotation - no class header, no v-table,  basically a C struct.  Also can do direct field layout.

@RawStorage - basic static C memory layout.  Backed by native width raw storage, only accessed by native-width C intrinsics.

Can both compile "fast" - down to raw bits, and "virtual" - to some virtual implementations allowing full GC on full simulation.


Asymmetric Chip Multiprocessor

Idea: instead of e.g. 16 cores on a die, have 12 for normal work and 1 "fat" core for hot contended locks.

Cross grid of threads vs open-tables in MySQL. (looks like ref-counting of MySQL tables).  Blah - implementation looks alot of NmethodBuckets in HotSpot.  2-d grid.  Updates are complex & require complex locking.

Look at some locking traces.  If we could accelerate the critical sections then we get a Amdhal's Law approach - removes the serial portion.

Idea: switch to the one fast CPU on a contended lock.
Seeing speedups of 50% or so for 12-16 cpu system, where program was seeing e.g. 5x speedup and now gets e.g. 8x speedup.  One of the benefits: locks sections always executed by same CPU, so caches remain 'hot' for hot-lock code.

Cliff Notes: This is the first real use of Asymmetric Multiprocessors that I believe in.


Leak Pruning

See it & reported it before.  Survive OutOfMemory & leaks by reclaiming predicted dead objects.  Used a simple X86 read-barrier w/6% slowdown.  Hoping to see new slowdown numbers for the read-barrier: see 3% slowdown on Core2 & 5% on Pentium4.  Using 2 bits per reference; bit 0 is for stale-refs; bit 1 is for pruned refs.

So instead of throwing an OOM, look thru heap and see if find a dominate object.  Assume dominate object is dead & leaked; "Cut" link to object and replace with a poisoned pointer.  Touching this pointer
throws the original OOM.  But if never touch object, then never throw OOM.  Cut links allows dead object (and his transitive closure) to be reclaimed.

Try to predict roots of dead datastructures.  So profile max time between last use (using something like GC age bits, but exponential counter bits).  Also need to track which refs are keeping most bytes alive.  

HashEntry -> PreparedStatement -> ParserInfo -> byte[][] -> byte[]
                               -> ResultSet -> Field[]

Compute during Phase 1:
Sometimes the HashTable grows, and PreparedStatement gets touched.  So max-stale-then-used for HashTable-> PreparedStatement is 16-32 GC cycles.  But max-stale-then-used for  PreparedStatement-> ParserInfo is 0-1 GC cycles.

Compute during Phase 2:
Count bytes kept alive for each type of thing that has a low MSTU number.  See PPS->ParserInfo has 420M kept alive.  So kills all edges from old PPS->ParserInfo (but not young ones, because MSTU is kept per object?  I now think he keeps MSTU per-class, so no bits in object but can suffer from common types like Object[] being used both in leaks and in hot code).  Don't worry about DAG sharing - ends up blaming the leak on 1 ref when it's really kept alive by several refs; just means on 1st go-around you prune one edge - but leak remains alive.  On 2nd go-around all the leak is blamed on object type#2 and pruned then.

Results are decent - keeping some complex real apps alive essentially forever.  Performance is decent, but near OOM conditions things slow down as first come frequent GCs, then LeakPruning kicks in and cleans the heap back out.

Cliff Notes: Wondering if could do read-barrier check on low bits by e.g. loading from the loaded pointer with an X86 op which faults on odd addresses?  Means faulting on stale & poisoned refs; faulting on poisoned is OK (means throwing OOM anyways), but faulting on stale might be too frequent still.


TRIPS, McKinley et al

TRIPs is an alternative processor architecture, with a square 4x4 grid of functional units (GPUs); registers are along one "wall", D-cache is along another 2 sides, etc.  Inter-GPU traffic is explicitly scheduled, but with a data-flow-like flavor (static scheduling, dynamic timing).  I've reported on this project before.  This talk is essentially an end-of-project summary.  And the summary is: not much better than a Core2 (if the chip had been built with Core2 technology).

Each chip has 2 processors; 1M L2.  16GPUs, 128 registers, 32K L1 cache.  Processor is distributed.  4x4 grid of GPUs; registers along one side; D-cache along other sides. Window size of 1024 instructions.

Block atomic execution model; dataflow between blocks.  Blocks are single-entry, multi-exit.  1 non-speculative block; up to 7 speculative blocks.

Heavy use of predication to build larger blocks.  Direct instruction communication model.  Special instructions to handle large fan-out.  Large block of conditional/predicted stuff mapped onto the 4x4 grid of FPUs; using reads from registers and reads/writes to d-cache.

Larger blocks require more scheduling; harder to place.  What is the utilization of the large instruction window?  Both scheduling, plus e.g. cache-miss & branch mis-predict.  Static delay: scheduling issues; 1-clk to move data from FPU to FPU.

Benchmarks: 53 compiled benchmarks; 13 hand-optimized benchmarks; 2 hand-optimized and hand-scheduled.  Gather numbers of the actual chip, except when needs to use SIM numbers (because chip didn't gather the right data).

Generally, out of the possible 128 instructions in a block compiler is able to get about 60 instructions.  Hand-optimized code uses smaller blocks because better scalar optimizations reduces instruction count.  
About 1/4 of all instructions are predicated & never executed.
About 1/4 of all cycles are lost due to scheduling.
About 1/2 of all cycles do useful work.

Very large code-bloat: 55% bloat from nops, or direct reg read/write ops, 18% from predicated useless, 27% from instruction-replication optimizations (loop unrolling, inlining).  4 SPEC benchmarks have i-cache miss rates > 10%.  Direct memory accesses are much lower (no spill ops?)

Need variable-sized blocks; would reduce the code bloat from 11x to 6x (over canonical Alpha code).

L1 is partitioned.  Compiler can acheive 2 mem-ops /cycle on vector add; hand-optimized can hit the max of 4 mem-ops / cycle.  Again for matrix-multiple the compiler hits 1 flops/cycle; hand optimized hits 7 flops/cycle.  

Clock speed is 366Mhz; mem speed is 200Mhz; process tech is 130nm.  Compare to a Core2 but boost TRIPS "as if" if gets the 65nm process. Compiled versions are competitive to Core2.  Hand-optimized about 3x faster than Core2.

What works: dramatic reduction in data-cache & register file refs.  Compiler is able to work (to schedule to this weirdo thing).  But performance isn't very big over a Core2.  Challenges: dynamic code size; dataflow overhead instructions; I-cache efficiency; predication overhead; generating efficient code for control intensive codes.


Maximum Benefit from Minimum HTM

Running a (supposedly) simple HTM on a transaction-aware Linux kernel.  Their proposed HTM is somewhat stronger than Azuls', in that they support parallel independent transactions (the main one, and ones run inside interrupt handlers), a special version of CAS, and graceful overflow to software.  Also they holding pending transactions in a much more generous L1 than Azuls'.

Results look pretty anemic and fragile, which is on par with Azuls' results (4x speedups on 16-way boxes; 60% of exec cycles wasted on retries).  On the plus side, they at least attempt a reasonable benchmark set.


The Rock's HTM

A look at Sun's Rock's HTM.  It's a lot weaker than Azul's - max of 16 (or 32, depending on how many hardware threads you use) pending writes; aborts on function calls, TLB misses, mispredicted branches & some fp ops, - plus the same reasons Azul's HTM can fail: eviction from the L1 (due to transaction being too large, or conflict from another CPU).

Major discovery (just like Azul discovered!): feedback is crucial to decent heuristics.  Your failed HTMs need to report as much as possible about what went wrong so the heuristic stands a chance of getting it right.  This required a hardware spin (Rock2).  One useful heuristic is the standard backoff mechanism: if there's true contention, then backing off can spread the memory references far enough apart to allow the HTM to succeed (presumably succeed with less overhead than a simple spin-lock solution). 

They report hashtable numbers, but wildly concurrent hashtables are commonly available. 
They report using the Rock HTM for doing an N-CAS.  This is largely successful, and might simplify many libraries.


CTrigger - Atomicity Violation Bugs

Software testing is inadequate.  Poor interleaving during standard stress testing.

CTrigger - for any given program input, force interesting interleavings.  There is an exponential interleaving space for each input, but stress testing is using random interleavings.   Previous work only controlled the context-switch interleavings.  CTrigger focuses on interleavings that likely hide atomicity bugs and rarely happen during normal testing.

Focus on interleavings that likely hide bugs: e.g. 2 nearby refs to the same memory in one thread - and another write to same ref in another thread.  Force these accesses to interleave by injecting "sleeps".

Tried on 7 apps (server, client, scientific) on 8-core machine.  143 total Unserializable Interleavings (program interleavings deemed interesting from atomicity standpoint).  Stress testing only hits about 80 to 100 of the U.I., out of 143, even for many many runs.  Basically some (1/3rd) of the U.I.s are very unlikely to hit with stress testing - this demonstrates the key weakness of standard stress testing.

CTrigger looks at the access traces and attempts to force the interleaving.  Needs to trim out infeasible paths: locking forces many orderings to be impossible (i.e., the code is correct and no race happens).  Some refs are inside the same lock; some are Happens-Before on thread-create or thread-join.  Typically 90% of apparent bad interleavings are infeasible.

Focus on rare interleavings also (not-rare races happen easily in stress testing).  If the "local gap" is small - time between 2 refs in the same thread is short, then the race is narrow - and hard to hit in stress testing.  Also look at the distance between external ref distance in time to the gap.  Sort these potential races by likelihood.  Instrument the program to add an artificial delay at a few points to try & force these rare races to happen often.  

Then run program as normal.  Program speed is similar to unmodified program, because artificial delays are small and few in number.  Program runs fine on multi-core machines; using the multi-cores as before.  But now the rare races are common, and stress testing finds them quickly.

Typically find bugs now 100x faster than before!!!  Found bugs in all these apps.  Was able to repro the breakage 100% of the time, generally very quickly.


Anomaly-based Bug Detection & Prediction

Defect - the Bug
Infection - Defect is triggered ==> program state is corrupt
Failure - An observable error (Crash, hang, wrong result)

Infections spread over time, or get overwritten (hence bug is squashed in this case).

Software debugging: hypothesis bug; change code; attempt to repro bug, etc.

- detecting anomalies; out-of-bounds addresses, unusual control paths, page faults or redundant  computations.  So profile program to teach the detector what the "good" behavior is.  Some values are in fixed ranges during the teaching runs, or some mem locations only accessed from a few ld/st ops; or abnormal number of loop iterations.  Current crop of anomaly detectors have a fairly high false-positive rate.

- Now isolate relevant anomalies: compute dynamic forward slices from anomalies to crash point.  Ouch - this step looks really expensive.  Rerun the broken program, propagating tokens from anomaly site to crash site.  Use binary search on the anomaly set.  Can reduce the count of false positive anomalies by a factor of 10.

- Now "validate" that have found buggy line of code: NOP that line out and see that you can execute further.  

Still don't know what the fix is, but can point out the step that e.g. corrupted memory.

Appears to scale up to e.g. gcc (330KLOC) found a bug in distributive laws causing a loop in RTL tree optimization (C2 used to suffer these routinely).


Isolator

Isolation in concurrent programs: detect isolation guarantees; i.e. access in critical regions from non-locking threads (or incorrect locking).  Bad threads can break well-behaved threads, (the usual motivation talk for debugging data races, I wish people could skip this or at least limit it to 1 slide & 1 minute).  Also STMs only have weak atomicity, so bugs still happen for well-behaved STM threads.  Cliff Notes: This is the key: some other buggy thread breaks your well-behaved concurrent thread; studying the break point is useless.

Isolator: Program executes as-if no isolation violation.  Uses a combo of data-replication, virtual memory & code instrumentation.  Does not require the whole program.  Requires "guarded by" annotations.

Cliff Notes: just having thread-level TLB protection would let you do some COW-style (http://en.wikipedia.org/wiki/Copy-on-write) protections.  Just write-protect all pages touched during a locked region (except allow writes from self).  All other threads that attempt to write without holding the lock will get a TLB fault and can be stalled (and reported as racing).

Stalling can cause deadlocks, so need dynamic deadlock detector.  (Youch!)
Either allow racing access OR throw a DeadlockError.

This above option is done by Isolator, except requires annotations rather than lazily discovering touched variables.  

Issues: expensive to change protection levels on every lock.  Fix: "lazy unprotect" - only remove protections on-demand, as other threads need them.  Could "shatter" big pages, or split unrelated objects to private pages (via GC). 
Cliff Notes: similar to HotSpot's Biased Locking, but this wants per-thread page-protections; really want user-mode control over per-thread page protections.

Issues: page-size granularity - has fragmentation overhead for fine-grained locks.


Kendo

Only works for data-race free programs.  Very efficient deterministic interleaving.  Progress is made with "Deterministic logical clock" time.

Work on SPLASH benchmarks.  See 18% to 27% overhead, but deterministic always.  Requires no special hardware. 

Each thread maintains a "logical clock" similar to Azul's COUNT register or TSC on X86.  Could be e.g., a "bytecodes executed" counter.  Globally, the thread with the lowest "logical clock" has the right to do a e.g. lock acquire (non-determinstic action).  Assumes all inter-thread communication happens nicely inside locks.

Enforces lock-acquire ordering, except for nested locks (have to prevent deadlock, since basically adding another locking layer).

Cliff Notes: Mostly interesting because this could work on stock hardware (but it is limited to race-free programs which are precisely those that you'd like this to work with).


Deterministic Replay
"Time Travel" for concurrent debugging

Phase 1 - record non-deterministic events
Phase 2 - replay

Software versions have been around for ~15 years.  Now commercially available.  Also available are libraries, compilers & OS's to help.  SW stack flaw: limited to speed of 1 cpu

How do you integrate HW & SW stacks?  i.e., If HW has captured a log, where does it go?  Do not want to capture or replay the SW stack that is doing the SW recording.

Also need the HW to be running across the whole machine, so that all processes are using it because interacting processes need to all be deterministic (see butterfly effect above).

So some abstraction: not all elements of the system are being recorded and can be replayed.  May need to simo record multiple user-mode processes in the same recording "sphere".  Only user-mode threads are being recorded (OS support needed to virtualize HW, but also OS needs SW record/replay that is process specific).  Really he's making a weak stab at recording everything to avoid the butterfly effect.

Now record 2 different kinds of info:
HW info: memory interleave log
SW info: system stuff log; scheduling; context switch; I/O; signals

Replay also has to handle, e.g. malloc has to be deterministic.
Cliff Notes: but not GC, although HashCode has to be deterministic.

Key Issues:
1- Deterministic injection of data into controlled "sphere" (I/O)
2- Using fewer processors during replay than were used during recording
3- Emulating vs re-execute system call (e.g. fork requires a real system call) - but this makes a lot of OS state changes.

Key Issue #1 - Capturing ALL I/O
- Random OS calls write into user-space (i.e. setting return values in memory).  But during replay need e.g. the SAME "fd" as before.  But OS under no obligation to hand out the same "fd" at an open().  Or OS writes data into a buffer - BUT another user-thread is ALSO writing into the same buffer.  

Key Issue #2 - Not enough CPUs during replay
- I think he's expecting HW to do the memory interleaving on replay!!  But basically, it's time for thread#1 to run - but the OS wants to schedule thread#2 - just need to block #2 and allow #1 to  run.  Very slow (constant OS thread scheduling overhead).

Log size - in bits per kilo-instructions.  
Small: 3.15 bits/kilo-instruction.
Recording performance overhead: 20-50%.
Replay performance: using replay HW!?!?!  About 2x slowdown during replay.

How to handle large datasets?
 

DMP - Deterministic Shared Multiprocessing

Hard to debug, etc, etc...
So just make the whole thing deterministic.  No "Heisenbugs"; can use time-travel debugging.  Can test like it's a serial program.  No attempt at record/replay really - just force same ordering every time.  Cliff Notes: Suffers butterfly effect.

Instructions communicate via shared memory.  Only care about "communicating" instructions.  Can serialize - but that blows performance.

So really want to figure out which instructions really communicate. Requires HW support.  

Conceptually have a "deterministic token" which is handed out to processors and then is passed around all the CPUs in round-robin format.  This token totally orders all instructions.  Each "quantum" is roughly 1000 instructions, but varies.  Now filter out non-communicating instructions.  Allow the non-comm instructions to run in parallel.  Leverage cache coherency to track when cache lines move.  Build a "sharing table" to help order memory accesses.

Now add some TM support to help get more parallelism.
Now come some quantum-building optimizations: break at sync/lock boundaries.
Communication is often bursty: so break quantum after a series of stores to shared memory.


Memory Buddies

Sharing memory between VMware VMs(?)  Hypervisor finds identical pages and shares them under the hood.  (Probably good for e.g. shared libraries). See 1/3 reduction in memory usage in VMware ESX server. Get sharing only if you run similar apps on the same hardware.  Placement is hard in datacenter; lots of clients & software stacks; things change over time.  Manual placement is tough.

Goal: estimate sharing ability across many physical machines and many virtual machines across the datacenter.

Create 32bit hash per 4K page.  1MB hashs per 1GB of ram.  Forward to other hosts.  Comparing lists is slow.  So use a Bloom filter.  With hash lists, just sort & compare.  With Bloom filters you use a dot-product of bits; very accurate in estimating the number of elements shared between Bloom filters.

New VM launch - run on staging host.  Figure out it's memory footprint.  Compute it's Bloom filter per page.  Compare it's filter against other Filters on all other hosts in data center.  Find closest fit (on machine with enuf memory & cpu) and move the new VM there.


Live VM Migration

How to migrate VMs cheaper (force their free memory pools to shrink before migration; similar to having a JVM do a GC & heap-shrink before migration, then inflate back to normal size in the new host).

Contrast to the usual approach:
- 1st iteration copy all
- later iterations copy any dirty pages
- after dirty workset is "small enuf", stop VM, copy active dirty pages,
- restart on new host.
Goal: reduce downtime.
Con: Write-intensive apps have long migration times (because of constant page copying) and high network overhead.

Alternative: copy minimal CPU state & pages, start running on target.  VM demand-pages from source.  Also actively push pages from source to target.  Good: copy pages only once; lower total migration time & network overhead.  Works well for both read-intensive & write-intensive.  Con: resumed app suffers while pages demand-paged in.

Can make some predictions on future needed pages from apps page-faults - likely will want other pages nearby in virtual-memory-addresses soon.  Examples are walking an array or working a stack.

Make a psudeo-swap device in the guest OS for demand-pull pages (basically swapped over the network).  Allows the VM job to keep running in other threads or processes - only 1 thread is blocked on a page fault.  (Actual implementation has lots of warts due to nature of existing Xen infrastructure).  e.g. we have a much more direcy way to detect memory-full conditions and politely share memory between JVMS.

Results: feasible, but clearly more downtime over pre-copy in the common case.  But works well w/write-intensive VMs.


Validate

Patches are dangerous.  Delaying change is a safer bet even for security bugs: more likely for downtime from patches than your system gets hacked while you wait.  


Raytracing vs GPUs

Raytracing is more expensive per-ray, but less expensive per-object.  Can be parallelized.  Makes really great shadows & reflections.  Want to improve predictability in ray-tracing to get "good enough" frame rates.

Trying to use SIMD.  Primary rays are parallel, easy to SIMD.
Seconday rays are incoherent, not good for SIMD.

Cliff Notes: no surprise but modern GPUs are not well suited to raytracing work


Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture

Sources of Commutativity - e.g. sum, malloc

Can detect it?  Check memory state before & after running fcn X.

But misses e.g. set's (via linked-list) - linked-list order is different when swapping 'put(A)' and 'put(B)' calls, but semantically the same: both elements are in the set, but the set's implementation stores them differently.

How to test for commutativity? Symbolically execute 'put' calls in different order. See if memory contents are same: "put(A);put(B)" vs "put(B);put(A)".  If yes, then Yes - but suppose not.

Now check reader fcns: try is_member() (on what values???) 
Also check writers like 'remove'.

Problem: exponential search space for equivalence.
Use Random Interpretation-
- choose random input values
- interpret till conditional branch
- follow which way it goes; tests program
- Then back up to cond branch
- Adjust memory & values such that test fails
- Now follow thru on other branch; test program

Now combine memory states for each branch of program.  Using an 'affine join' to combine memory states???  Somehow correctness bug becomes an exponentially low; very very unlikely the Random Interpretation ever computes some wrong answer (declares as commutative some function which is not really commutative).

Looking at MediaBench & SpecInt on an infinite issue machine.  See 20% to 30% of all functions are commutative (with themselves?).  Only see speedup when commutative fcns are called near each other, but then can see parallel speedup (log tree combining?).  Average is 50% speedup.  Apparently assuming an extremely wide machine, as opposed to threads (ignoring communication costs & thread start/stop).

Cliff Notes: Random Interpretation looks interesting, not sure about the rest of this paper.  In fact, this paper looks like a small delta (and very difficult to use correctly) over the R.I> work.   Instead see 2003 paper: "Discovering Affine Equalities Using Random Interpretation" http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf



Predicting Yield of GCs

Amount of memory reclaimed by GC *can* be low.  But looking at minimal GC footprints.  When GC is not productive, then time to extend the heap.  Estimate GC yield by using Virtual Memory to track pages which are touched and which are not touched - not-touched pages are probably full of mostly dead stuff.

Cliff Notes: Yawn, Olde Hat Trick.
Cliff Notes: got bored; symptom of Day 3 at a conference; worked on slides


Phantom BTB

Cliff Notes: Winner of the most ridiculous proposed hardware, backed up a big speedup on the most-unrealistic simulated hardware.

Virtual BTB Design - as in Virtual Memory not as in VMWare.
BTBs today are {small,fast,inaccurate}.  Want {big,fast,accurate}.
Allocate space for the BTB in the L2(!?!?!). 
But latency to the L2 is large, so much prefetch the BTB data.

Getting tidy 10%+ speedups on unrealistic hardware.

BTB in L2 cache is large+slow.  Missing BTB entries are evicted to L2. Then prefetch from L2 based on previous missing branches (assuming that an upcoming BTB miss can be predicted from a just-prior missing branch).  Attempted lots of other predictors; most failed miserably.

Pay-as-you go model; BTB gets effectively larger as more pressure happens on the on-chip BTB.

Cliff Notes: HotSpot inline-caches wipe out the need for faster jump-register support in hardware; standard branch predictors work well with HotSpot.  Total clocks spent doing jump-register instructions are less than 1% of all clocks on realistic hardware.





import java.lang.*;

class TextScan {

static String test1 = "scan_for_some_token(like_this_open_paren";
static String junk0 = "scan_for_some_token_like_this_open_paren";
static String junk1 = junk0.concat(junk0);
static String junk2 = junk1.concat(junk1);
static String test2 = junk2.concat(test1);

// A TextScan is a lite object that's really 8 longs representing
// the bits of 64 bytes split 90-degrees from verticle.
long b0,b1,b2,b3,b4,b5,b6,b7;

public static void main( String args[] ) {
TextScan[] bits = bit_split(test2);

System.out.print("bits="+bits.length+"@(");
for( int i = 0; i<bits.length; i++ )
System.out.print(bits[i]);
System.out.println(")");

// Scan for byte
System.out.println("Scanning for byte "+(byte)'(' + (char)0x28);
System.out.print("[");
for( int i=0; i<bits.length; i++ ) {
// bit pattern 0x28 - 0 0 1 0 1 0 0 0
TextScan ts = bits[i];
// The open-paren char, '(', hex 0x28:
// 0 0 1 0 1 0 0 0
long res = (~ts.b7) & (~ts.b6) & ts.b5 & (~ts.b4) & ts.b3 & (~ts.b2) & (~ts.b1) & (~ts.b0);
// A non-zero res means a '(' is found, and the bit-number is
// the index into the 64-byte array where it was found.
System.out.print(Long.toHexString(res)+",");
}
System.out.println("]");
}

// Split a String into an array of 64-bit longs.
// Parallel-split bits: bit 0 from 64 bytes go in 'b0',
// bit 7 from 64 bytes go in 'b7', etc.
public static TextScan[] bit_split( String s ) {
int slen = s.length();
TextScan[] bits = new TextScan[(slen+63)>>6];
for( int i=0 ; i<bits.length; i++ ) {
TextScan ts = bits[i] = new TextScan();
// Suck 64 bytes into bytearray from String.
for( int j=0; j<64; j++ ) {
int idx = (i<<6) + j;
byte b = idx<slen ? (byte)s.charAt(idx) : 0;
if( (b&(1<<0)) != 0 ) ts.b0 |= (1L<<j);
if( (b&(1<<1)) != 0 ) ts.b1 |= (1L<<j);
if( (b&(1<<2)) != 0 ) ts.b2 |= (1L<<j);
if( (b&(1<<3)) != 0 ) ts.b3 |= (1L<<j);
if( (b&(1<<4)) != 0 ) ts.b4 |= (1L<<j);
if( (b&(1<<5)) != 0 ) ts.b5 |= (1L<<j);
if( (b&(1<<6)) != 0 ) ts.b6 |= (1L<<j);
if( (b&(1<<7)) != 0 ) ts.b7 |= (1L<<j);
}
}
return bits;
}

// --- toString
// Reverse bit-swapping to a String.
public String toString() {
String s = "[";
for( int i=0; i<64; i++ ) {
byte b = 0;
b |= ((b0>>i)&1) << 0;
b |= ((b1>>i)&1) << 1;
b |= ((b2>>i)&1) << 2;
b |= ((b3>>i)&1) << 3;
b |= ((b4>>i)&1) << 4;
b |= ((b5>>i)&1) << 5;
b |= ((b6>>i)&1) << 6;
b |= ((b7>>i)&1) << 7;
if( b == 0 ) break;
s += (char)b;
}
s += "]";

return s;
}

}



Category: Web/Tech | Permalink » | Comments (4) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
And now some Hardware Transactional Memory comments...
February 25, 2009

(sorry for the long gap between postings; work's gotten interesting and I got busy)

I recently attended the Bay Area Workshop on Transactional Memory at Stanford generously hosted by HP.  Slides are here; my slides are helpfully titled "2009_TMW.pdf".  In this workshop I gave a brief overview of Azul Systems' Hardware Transactional Memory support plus our experiences in using it.  This was followed shortly by a blog posting from David Dice discussing Sun's HTM support in the Rock processor.

For Azul Systems' certainly (and to a lesser extent Rock), the name of the game is throughput: we appear to be generously over-provisioned with bandwidth, we can sustain 30G/sec allocation on 600G heaps with max pause times on the order of 10's of milliseconds (and unlike IBM's Metronome hard-real-time collector, our MMU past 20msec is 0.99; see MMU definition at bottom).  Each of our 864 cpus can sustain 2 cache-missing memory ops (plus a bunch of prefetches); a busy box will see 2300+ outstanding memory references at any time.  We have a lite microkernel style OS; we can easily handle 100K runnable threads (not just blocked ones).  Our JVM & GC scales easily to the whole box.  In short: the bottleneck is NOT the platform.  We need our users to be able to write scalable concurrent code.

The goal of our HTM has always been to accelerate "dusty deck" Java programs via Lock Elision.  (We've never been tempted to enter into the language wars and implement some form of full-blown transactional programming via an "atomic" keyword). We see a lot of defensive uses of the "synchronization" keyword; legacy libraries using "sync" on all methods or code maintainers sprinkling "sync" about liberally in order to squash some bug.  In practice, 99% of all locks are never contended - and these we run very fast (our CAS & FENCE instructions can hit in cache; an uncontended lock requires only a few clock cycles).  But the locks that are contended prevent parallel execution (Amdahl's Law and all that)- and we observed that much of the time the contention is on the lock itself and not on the data it protects.  So we added in HTM support in our first processors and this support has been shipping and turned on at all costumer sites for over 4 years now.  By now we have a lot of field experience with using HTM.

But first, a brief overview of our HTM support: we use a few extra bits on each L1 cache line to track "speculatively read" and "speculatively written" cache lines.  Each CPU can be put in a "speculate" mode; loads and stores then set these bits accordingly.  If a speculative line is evicted from the L1 for any reason then the transaction "aborts" (there is also an "abort" instruction).  Speculatively-written lines are marked invalid (meaning: they no longer cache any data) and speculatively-read lines have their spec-bit cleared.  Also the CPU branches to a fixed trap vector for software fixup. 

If the CPU makes it to a "commit" instruction then the spec bits are cleared - including the speculative-write bits - which atomically makes all those writes visible to other CPUs. In other words, the XTN commits.

Nothing aborts a transaction except the "abort" op or losing a speculative line out of L1.  We do not abort on function calls/returns, nor TLB misses, nor exceptions nor rain nor snow nor sleet nor dark of night...  Our XTNs are limited therefore by the size and associativity of the 16K 4-way L1 cache (and the shared inclusive L2).  We routinely see successful HTMs of many thousands of instructions with many hundreds of cache-hitting memory ops. (contrast this to what I can figure out about Rock: Rock appears to abort on function call/return, on running out of store-buffer depth (which limits an XTN to 16 stores?), on TLB misses, on L1 associativity, on common nested locking patterns (because of abort on function calls)). 

Software heuristics determine when to try the HTM (uncontended locks are much cheaper using a cache-hitting CAS).  The software measures the success/fail ratio of the HTM on contented locks and determines when to flip to a standard blocking implementation and when to use the HTM. 

While some apps get a nice 2x speedup (e.g. Trade6), most apps are speedup only slightly (we had lots of teething problems with the software heuristic that used to cause 10-20% slowdowns due to endless fail/retry loops, although these all appear to be fixed now).  We nearly always see a handful of locks using the HTM support, but these appear to only rarely be the "right" locks to get more CPUs really busy.  HTM failure appears to nearly always be due to conflict and not capacity.

Issues

In short, users' don't write "TM-friendly" code.  Neither do library writers.  For example the "modcount" field in Hashtable is bumped for every update in the Hashtable.  For a large Hashtable we can expect most mods to be in unrelated portions of the Hashtable.  i.e., without the modcount field update we would expect the HTM to allow parallel 'put' operations.  With the modcount field update, 'put' operations always have a true data conflict - and this aborts the HTM.  After some number of failed 'put' operations we have to start really locking the Hashtable (to allow the puts to make forward progress) and this now single-threads the 'get' operations as well.  This general pattern of updates to peripherial shared values (usually performance counters of one sort or another) is very common and it kills the HTM - because it presents a true data conflict.

Many times a small rewrite to remove the conflict makes the HTM useful.  But this blows the "dusty deck" code - people just want their old code to run faster.  The hard part here is getting customers to accept that a code rewrite is needed.  Once they are over that mental hump, once a code rewrite is "on the table" - then the customers go whole-hog.  Why make the code XTN-friendly when they can make it lock-friendly as well, and have it run fine on all gear (and not just HTM-enabled gear)?  Also locks have well understood performance characteristics, unlike TM's which generally rely on a complex and not-well-understood runtime portion (and indeed all the STMs out there have wildly varying "sweet spots" such that code which performs well on one STM might be really unusably slow on another STM). 

Really what the customers want to know is: "which locks do I need to 'crack' to get performance?".  Once they have that answer they are ready and willing to write fine-grained locking code.  And nearly always the fine-grained locking is a very simple step up in complexity over what they had before.  It's not the case that they need to write some uber-hard-to-maintain code to get performance.  Instead it's the case that they have no clue which locks need to be "cracked" to get a speedup, and once that's pointed out the fixes are generally straightforward. (e.g., replacing sync/HashMap with ConcurrentHashMap, striping a lock, reducing hold times (generally via caching), switching to AtomicXXX::increment, etc)

Summary

  • Very modest gains for HTM; usually <10% (although 2x for some large apps)
  • Rewrites of "dumb" direct contention (e.g. perf counters) would help a lot
  • Azul did this for some common JDK pieces Out There (e.g. Hashtable), but there's just too much of it
  • Customers not willing in general to touch code for HTM
    • If code is being hacked for performance, it will be with fine-grained locking instead of simply making it "HTM friendly". 

Thanks,
Cliff

[MMU is a common measure of GC performance and stands for "minimum mutator utilization".  It shows the worst-case amount of time a worker thread gets for a timeslice of a given length vs time spent in GC.  Thus an MMU graph plots utilization (a number from 0 to 1 where 1 is better) versus timeslice size.  A program using Metronome might see a max GC pause of 1microsecond; it's MMU at 1 microsecond is then zero because the mutator can do no work for that microsecond.  However the MMU at 1 second might be 70% showing that in the long run Metronome's GC exacts a 30% performance "tax" on mutator operations.  Contrast this to Azul's Pauseless GC: at 10msec our MMU is zero but at 1sec our MMU is 0.99; in the long run there is no GC "tax" but in the short run our max pauses are much larger than Metronome's.  Metronome is suitable for e.g. hard-real-time applications like audio-playback, whereas Azul Systems' is suitable for e.g. soft-real-time transactional apps with high throughput.]

Category: Web/Tech | Permalink » | Comments (11) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
A Brief Conversation with David Moon
November 18, 2008
I had an email conversation between myself, David Moon & Daniel Weinreb. For the younger readers: David Moon is one of the original architects of the Symbolics machines, which 20 years ago more-or-less attempted the same sorts of things Azul is doing now; i.e. language-specific hardware support. I lightly edited the emails for clarity and ordering (otherwise the nested interleavings get horrendous to follow).
[Ed: 11/20/2008 - Some small follow-on conversation has been appended to the end]


Cliff Click wrote: Mostly I felt inspired by your Symbolics work. Azul Systems makes gear for running Java (but actually it runs a C/C++ program - we just happen to run a large C++ that implements Java). One of the biggest impact changes we made was a hardware read-barrier for GC - a simple instruction that tests invariants of freshly loaded pointers and takes a fast-trap if the test fails. Fixup is entirely in software. It's a RISC-y approach to the GC hardware that Symbolics had 20 years ago.


David A Moon wrote: I remember this now; this was one of the things that impressed me when I read about your machine a couple years ago. As you may or may not know, the Symbolics 3600 had fairly complex GC hardware and the Symbolics Ivory went 90% of the way towards riscifying it, eliminating all the large hardware tables. I think there was still a 16x1 bit (NOT 16Kx1 !) table to indicate which portions of address space are oldspace; with 64 address bits instead of 32 you were able to reduce that to a single bit, which is nice.

Cliff Click wrote: We use very simple 3-address 32-register 64-bit RISC cores, with 54 cores per die. All chips are cache-coherent and uniform (medicore) memory access time for all; the upper bound is 16 chips or 864 cpus. The "big" box is about the size of a dorm room fridge, 14U form-factor. The big box also has top-20 super-computer bandwidth (barely). Modest 16K I- & 16K D-caches for all cpus, plus groups of 9 share a 2-meg L2. Chips are fully interconnected.


David A Moon wrote: 3-address 32-register 64-bit address, 32-bit instruction RISC does seem to be the "sweet spot" for instruction set architectures. [Cliff: Yes!] I hope you avoid needless complexity like condition codes, link registers, maybe even separate integer and floating point registers. [Cliff: Yes, yes & yes] I don't think Java needs floating point rounding and trapping modes either. [Cliff: Yes, it does not] For your application you probably don't even need separate user and supervisor modes since all executable code is generated by your just-in-time compiler from a safe language.

Cliff Click wrote: That's the theory but the practice is a little different: JVM's crash and 1 crashing JVM should not bring down the whole box. So we in fact have a user/kernel split and it's saved our butts any number of times.


David A Moon wrote: Maybe you don't need any kind of mode bits register at all? Being completely modeless would be cool.

Cliff Click wrote: There's also a speculative-memory mode: writes are buffered in the L1 but don't become visible until a 'COMMIT' instruction.


David A Moon wrote: That's a pissload o' cores. Are those 54 real cores per die? Or is it perhaps 6 real cores, each executing 9 interleaved instruction streams, in the style of Denelcor HEP?

Cliff Click wrote: Nope, real cores all the way.


David A Moon wrote: Does the precisely defined Java memory model give you any help in making a more simple implementation of cache coherency than is usually done e.g. for Intel x86 architecture? Or did that memory model come out after your hardware was designed?

Cliff Click wrote: The Java Memory Model came first, and we designed the hardware around it. Actually the hardware implements something somewhat weaker than the JMM by default, and the JIT inserts memory FENCE ops as needed. More like Sparc RMO or Power's memory model.


David A Moon wrote: I assume uniform memory access doesn't mean cached accesses aren't any faster than cache misses, but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network. [Cliff: Yup]

It must be a real challenge to get enough memory bandwidth to keep 864 instruction streams going at full speed, even more so with such tiny caches.


Daniel Weinreb wrote: Moon says: "... but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network." Actually, I heard a talk about a year ago, at the first New England Database Day, suggesting that the way information moves between the L1 caches of the cores is by a sequence of moves, each being up, down, right, or left, in the direction of the destination core.


Cliff Click wrote: No, we're a full direct interconnect on the L2's. 9 L1's share an L2. One step to the L2, then one step to the correct remote L2 - across the box in any direction (including another L2 on the same chip; it was easier to run outside the chip and back in than arrange for a special faster-path for within-chip communication. 15/16ths of your L2 misses are remote and the gain for speeding up the remaining 1/16th of the misses isn't worth it.


Daniel Weinreb wrote: If so, the times would not actually be uniform. Taking advantage of such knowledge in the software would be challenging except for special-purpose domains, I would think. I don't know how the 54 L1 caches (or maybe fewer if you're doing the piplined business that Moon described) are interconnected, but there is a limit to how many can participate in a full crossbar NxN connection. [Cliff: Key word: challenging]


Cliff Click wrote: We have a lot of memory bandwidth - one of our bigger boxes is on the top-20 supercomputer bandwidth list. The chips are fully interconnected.


David A Moon wrote: Meaning every chip has 15 ports directly connected to another chip? [Cliff: Yup] That's a lot of pins. You might want to look at http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect for an interesting slimmer approach.

Cliff Click wrote: I'm sure we looked at sicortex at one point.


Daniel Weinreb wrote: I agree about the challenge of the memory bandwidth for all those cores. Do you have very wide paths between the chips and the main memory?


Cliff Click wrote: There are 4 memory controllers per chip (and up to 16 chips so 64 controllers). All memory is striped across all controllers at a fine grained level so each core's memory requests are spread across the box - mostly eliminating "hotspots".


David A Moon wrote: Also they are a message passing architecture and you have to be a shared memory architecture, but I think that's a separate issue from the interconnect topology. I am no longer sure which I believe in. On the one hand, I am a shared memory guy from way, way back. On the other hand, it is becoming increasingly expensive to pretend that memory is uniform and random-access, and the illusion is becoming increasingly unconvincing. The von Neumann architecture may be reaching the end of its life. I like the way the Cell treats main memory as essentially an I/O device, but I have never attempted to program it.

Cliff Click wrote: By all accounts, programming the Cell is a nightmare. Since we can do shared-memory with so many cores, other people can too. The more difficult issue is parallel programming that uses all those cores. Shared memory is one way to make programming easier.


Daniel Weinreb wrote: What about cache coherency over multiple chips?

These days, as Moon has pointed out to me, the importance of caches means that you gain a lot by not looking at your address space as a sequence of co-equal words, but rather as a sort of block-oriented device (a little like the way one looks at a disk), in which the block size is the size of the cache line in some relevant cache (probably L2 but it depends a lot on the hardware if the hardware is novel). (See the word "illusion" in his mail.)

Does your Java implementation specifically take this into account, e.g. allocating objects aligned on cache line boundaries, implementing collections as B-trees with nodes being an integral number of cache lines, and that sort of thing? If not, you might consider benchmarking it; I'd love to know the results of such a test.


Cliff Click wrote: Our hardware has short cache lines (32b) so there's less to be gained by aligning objects to cache lines. For certain benchmarks (not to mention SpecJBB or anything) such alignment is probably worth alot - but not for most use cases. Perhaps it makes sense to expose cache-line alignment to the Java/JDK programmer?



Cliff Click wrote: Things we do specifically for Java:
  • GC read-barrier enables a fully parallel & concurrent GC; we can sustain 40G/sec allocation on a 400G heap indefinitely, with max-pause times on the order of 10-20msec. This uber-GC is partially made possible because of the read barrier (and partially possible because we 'own' the OS and can play major page-mapping tricks).

David A Moon wrote: That is indeed impressive! Even if I divide 40 GB/sec by 864 it's still fast. Of course you need it, much Java code really likes to generate garbage all over the place. My pet peeve is no way to return multiple values from a method without generating garbage.

Daniel Weinreb wrote: Dave, you say: "My pet peeve is no way to return multiple values from a method without generating garbage." In Rich Hickey's talk about Clojure(Cliff: a new Lisp dialect implemented on the JVM, mainly having been run on the Sun JVM but he ran it on an Azul at least once) addressed this issue. [Cliff: I'm familar with Clojure] He left out multiple value returns exactly on the grounds that the consing cost for a small returned value is so cheap these days, ephemeral GC's being so damned good. I do not know to what extent he has done actual metering, though.


Cliff Click wrote: Yes, annoying. And it's embedded into the JVM bytecodes that way, so you can't even rescue the problem by changing languages while keeping the JVM. The only hope here is for a combo of JIT'ing & smart memory allocation: either inline & optimize away the junk object or do some sort of stack-based allocation for the very-short-lived object.


David A Moon wrote: Can you say anything about what page-mapping tricks are good for?

Actually I am a little surprised you bother with virtual memory at all. If memory gets full surely it is faster to do a garbage collection than to swap pages out to disk (I think you don't even have a disk if I remember correctly). Maybe page mapping is just to avoid core shuffling if there are multiple heaps? Is the page size large to cut down on mapping overhead?


Cliff Click wrote: Ok, a bunch of things here:
  • Page sizes are 1Meg.
  • No swap; swapping is death for GC. If you are out of memory, then you are OutOfMemory. To put it another way: you are going to blow your performance guarantees so badly, the JVM might as well have crashed. Better to make the user allocate enough resources up front to keep the performance guarantees.
  • The page mapping is integral to the GC; we free physical memory earlier in the GC cycle than virtual memory. This lets us recycle physical memory much sooner, and makes it much harder for an application to "surprise" the concurrent GC - i.e. an app whose allocation rate suddenly accelerates might run a different concurrent GC out of physical memory - and force an application stall.
  • The mappings are also used to deny user-mode threads access to a page where the objects are being concurrently relocated, but still allow GC-mode access to that page. GC-mode is a subset of user-mode (so we sorta are like a 3-privilege mode machine), except that user-mode can promote itself to GC-mode anytime it wants.
  • The read-barrier will take a fast-trap on failure, and by default get promoted to GC-mode in the trap handler. This lets the faulting CPU fixup the object reference and continue, without needing to wait for GC to catch up.
  • We also have a write barrier, but it's merely a slight efficiency hack. The read-barrier enables the good GC.

David A Moon wrote: I would expect the write barrier to be more than a slight improvement, unless your memory scanning is incredibly fast. The main thing the write barrier accomplishes is to greatly decrease the number of pages (or cards or other chunks) that have to be scanned for pointers to the youngest generation.

Cliff Click wrote: Ahh.... here's the difference: a 'card-mark write barrier' can be fairly cheaply done with plain-o-instructions, so the write-barrier instruction is a delta above that. The biggest change is our write-barrier op 'filters' card-marks that don't need to happen which in turn cuts down on useless card-mark writes. Writes are generally really cheap - unless there's a ton of contention caused by rapidly updating of an old-gen object holding a pointer into young-gen. In that case the filtering is worth a lot.


David A Moon wrote: Since the GC is concurrent the time required to scan for pointers of interest to the GC shouldn't slow down the application unless the application uses all the cores, except surely you run out of memory bandwidth eventually and then the GC's memory scanning interferes with the application's accesses to memory. So even with all that concurrency and massive processing power and memory bandwidth I would think you would still get a lot of benefit from doing less scanning because of the write barrier. But I haven't measured it and you probably have.

Cliff Click wrote: We're definitely doing a generational collector, so we'd be using a software write-barrier if we didn't have the hardware one handy. Or maybe: "We have a write-barrier. The fast-path is done in hardware. There is a slow-path done in software."


David A Moon wrote: I don't know if this was published, but the write barrier on the Symbolics Ivory was really simple. Each page table word contains an encoding (in 4 bits?) of the age of the youngest generation pointed to by this page; the write barrier just causes a trap when those bits need to be updated. I don't remember if we kept another index or just had the GC scan the page table. Since it's an IBM-style inverted page table the page table is dense even when a lot of pages are swapped out so scanning it is not too slow. It would have worked better if the Ivory page size hadn't been way too small.

Did you ever look at Eliot Moss's "train" garbage collection ideas to do ephemeral-GC-type optimization for long-lived data? If I remember correctly there is a bug: memory consumption can become arbitrarily large in the worst case, but the idea might still be good. I think Microsoft is using a form of it in .NET.


Cliff Click wrote: HotSpot tried out the train collector years ago, and never got it to perform well in practice. There are a bunch of odd-ball issues, (remembered sets with popular objects?) that made it not very performant.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Simple 3-adr RISC is easy to JIT for; it's HotSpot's JIT and makes quite good code - easily comparable to "gcc -O2". [David: Yeah.]
  • Just-In-Time zero'ing of L1 cache for allocation - and we don't issue a memory read for the cache-line getting zap'd to zeros. This is worth about 1/3 less bandwidth than the usual "store a bunch of zeros in a row" implementations.

David A Moon wrote: dcbzinstruction on POWER. Clearly a good idea.

Cliff Click wrote: dcbz doesn't make the grade:
  • it's not a prefetch so you can't issue it far enough in advance without causing the stall you are trying to cover, and
  • any java lock that happens will invoke a fence, which will also stall
  • it does avoid the memory read bandwidth.
Our version gets all of the above right.


David A Moon wrote: Dedicated register for the allocation pointer?

Cliff Click wrote: No. The main issue with allocation is avoiding the mandatory cache-miss as you stream through memory. You can hide tons of dumb int-ops and cache-hitting-loads underneath that cache-miss. So we pay a little bit of integer-cycles to setup our J-I-T-zeroing, and getting the allocation pointer is one of those cycles. In short: it's too small picken's to be worth holding a register for.


Cliff Click wrote: Things we do specifically for Java (continued):
  • GC bits in each pointer (old-gen, young-gen, stack-allocated, normal-C-ptr) [David: I remember that now. Good.]
  • Java type-heirarchy bits in each pointer. This saves a trip to memory to fetch the Java type when making virtual calls.

David A Moon wrote: So you're saying that since 64 bits is way too many, you can use a bunch of those bits to put a rather wide tag in each pointer? Maybe 20 bits wide? Seems like a great idea. No problem with having too many classes defined and running out of bits?

Cliff Click wrote: Yes, wide tag per pointer. No problem (yet) with running out of classes. Big Java Apps these days seem to have about 2^15 classes.


Daniel Weinreb wrote: I am curious about how you do threads. On the Lisp machine, the stack had to be contiguous in virtual memory. So whenever we had to allocate one for a new thread, we had to trade off between wasting VM versus worrying about running out of stack. I don't know if we ever considered making the stack such that it could be a linked list of (big) blocks, but I suppose anything that slows down the function call path is very dicey. Dave, did you ever think about this? It would have been nice to have cheap full coroutines ("stack groups").


Cliff Click wrote: Nope - we're using the fairly classic stack layout for HotSpot. And yes this means you have to decide up front the stack-vs-heap memory split.


Daniel Weinreb wrote: Cliff, have you heard anything about the possibility that the JVM will be extended to be able to do tail-calling? Rich Hickey says that there was recently a "JVM summit" (something like that)of people writing non-Java languages for the JVM, and there was a lot of interest in tail-calling. It would be great for the Lisp world, since Scheme depends on it so much; it would be nice to re-unite the Scheme and Lisp communities as much as possible.


Cliff Click wrote: I'm well aware of the interest in tail-calling optimizations - but it has to compete for scarce engineering resources like everything else.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Really big interconnect & memory bandwidth. Java has lousy locality, so things miss in cache - a lot.

David A Moon wrote: Would that be true even with 4 times larger caches?

Cliff Click wrote: I think so. We did a bunch of studies on cache-sizes, looking at fewer-cores-more-cache vs more-cores-less-cache.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Support for 2 outstanding memory misses per cpu, up to 24 per L2 cache counting prefetches - so 2304 outstanding memory ops across the whole box. [David: "Non-blocking caches are good."]
  • Minor tweaks to help do Java-specific math: array addressing is sign-extend, THEN shift-and-add - which we do in 1 alu op. Also we have a Java range-check op, faster virtual-call support, the IEEE754 subset needed for Java FP, etc. [David: Good.]

David A Moon wrote: Do you have any kind of specialized cache to avoid memory references to find the address of a virtual method to call? I haven't measured it but I get the sense that all those not very predictable indirections through vtbls really hurt C++ speed.

Cliff Click wrote: Yes... specialized hardware, not specialized cache.

We do the standard HotSpot
"inline cache"trick that's done on all CPUs.

The cache is a 1-entry cache, inlined in the code. The Key is the expected class of the object, the Value is the target of the function call encoded as a plain "CALL" instruction. So you do a Key-compare inline then a plain static call. If the Key compare hits the hardware will Do The Right Thing with the static call. In practice, 95% of all call-sites (that can't be optimized by the JIT) never fail the 1-entry cache. The other 5% tend to fail for having lots and lots of targets.

So there IS a "cache", but it's not specialized. It's the normal HotSpot cache. Instead, we have an instruction to do the Key-compare in 1 cycle (and the call is another cycle). So we do cache-hitting v-calls in 2 clks & 2 ops. On a wide-issue O-O-O X86 that same sequence is maybe 4 dependent ops but issues also in 2 or 3 clks - except loading the object header is typically a cache-miss; but the branch predicts right and the O-O-O hardware lets the X86 proceed despite the miss.


Cliff Click wrote: Things we do specifically for Java (continued):
  • Simple hardware transactional memory, used to speculate through Java locks.

David A Moon wrote: I'd be interested in reading more about that.

Cliff Click wrote: We have a white-paper here. http://www.azulsystems.com/products/whitepaper/wp_otc.pdf


Cliff Click wrote: Ok, I'm done boasting. :-)


David A Moon wrote: It's plenty worth boasting about.

--Dave Moon


Cliff Click wrote: Thanks, Cliff


===========================================
11/20/2008 - A little more:

David A Moon wrote: Then I guess your thread scheduler is not aware of the partitioning of the hardware and doesn't try to put related threads onto cores in the same chip.

Cliff Click wrote: The thread scheduler is well aware of the hardware layout. It's no mean feat to schedule 800+ cpus with 1000's of runnable threads.

Instead, nobody has a clue what counts as "related threads". Certainly there's no Java-level hint. You might imagine doing L2-to-L2 miss-rate profiling and try to decide which set of threads are communicating through memory instead of through an L2. It's a non-trivial task for which there are some non-trivial academic papers out there showing modest gains for lots of hard work.


David A Moon wrote: But you said your memory access time is mediocre. That's the first symptom of the uniform memory illusion breaking down. Then you find that no matter how big you make your caches they aren't big enough. Then you find that each new generation has slower memory, as a ratio of memory access time to processor speed. Then you find that programs are too slow unless they are written with detailed knowledge of the hardware memory structure. You probably know way better than me how far we are down that slope today.

Cliff Click wrote: Split out the notions of 'uniform' from 'fast'. Everybody pays attention to caches (or should): the mindset is "memory is slow". But only some people pay attention to memory layout: the mindset is "memory is uniform".

For some parallel scientific programs, with well understood access patterns people can build both the machine and the program such that most accesses are "near" the CPU. In such a case it makes sense to build a NUMA machine: "near" accesses are faster than "far" accesses.

Azul builds are gear as 'UMA' because our programs do not have well understood access patterns. Instead, the patterns are mostly random (after cache filtering) and so it makes sense to have uniform mediocre speeds instead of 1/16th of memory fast and 15/16ths as slow.

In both the NUMA and UMA case everybody has given up on the illusion that memory is fast. Instead we all acknowledge the presence of caches; we prefetch (or expect the hardware to do so); we know that "1st access is slow, 2nd one is free"; we do cache-smart object layout and so on. No illusions about memory there.

Category: Web/Tech | Permalink » | Comments (11) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
Some Source Forge Threads on NBHM
November 13, 2008

A couple of interesting threads running on high-scale-lib's forums I thought I'd post to a wider audience.

Josh Dybnis (jdybnis) - 2008-11-10 11:46
I wrote an implementation of the non-blocking hash map in C. I put it up at http://code.google.com/p/nbds/


Cliff Click (cliffclickProject Admin) - 2008-11-10 17:59
I never know what to make of such statements -

- if the table has "almost no synchronization" then it's not non-blocking (but it might be highly concurrent). Non-blocking has a specific meaning to the academic community; if you say "non-blocking" you should also be able to say "no synch". If you are based on NBHM then you need to provide e.g. a non-blocking malloc and a way to reclaim storage. Both are things NBHM uses GC for, and are not-trivial to get non-blocking. NBHM does indeed do very little allocation - so the blocking embedded in "malloc" will probably never hurt you.

- No "C" program can ever implement any reasonable concurrent program, because "C" has no memory model (and the proposed model is a major punt and not much better than nothing). Instead you can implement NBHM using "C using gcc_ver_XXX on X86_model_YYY". If you change the "XXX" or the "YYY", then all bets are off, and the program can fail. In short, the "volatile" keyword in C does not have nearly the semantics as it does in Java, and C compilers routinely do things with "volatile" variables that Java does not allow. Changing the C compiler version or the underlying hardware can break things.  

Despite my complaints a C version of NBHM on X86 is surely a very useful and non-trivial thing.
Congrats!
Cliff


Josh Dybnis (jdybnis) - 2008-11-11 01:12
Thanks for the feedback. I really appreciate you sharing the algorithm. It's one of the most elegant pieces of work I've come across in a few years. 

- In theory, I agree 100% with you about writing concurrent programs in C. And people need to be clear about the limitations before attempting such a thing. In practice I think it is reasonable to do "porting" work for each platform/compiler. My implementation is targeted to gcc 4.x on 64-bit x86 (Intel does specify a memory model in their Software Developer's Manual). I think it could be ported to most other architectures without major modifications.

- My implementation includes a non-blocking malloc() and a method to do safe memory reclamation. 

- My statement about "almost no synchronization" was badly stated. I was referring to my malloc implementation and not the NBHM itself. Also I was also using "synchronization" to refer to x86 operations having the "lock" prefix, not any non-blocking property of the program.

- A couple of more caveats about my malloc and the safe memory reclamation. They aren't the most robust implementations. They need a bit of work. But they are non-blocking, and should be very low overhead on systems without too many cores (say less than 32). The malloc does call mmap() from multiple threads, it could block in there. So I guess technically it is not "non-blocking". But that is an artifact of the implementation. In theory one could use an asynchronous method of invoking the kernel to make it truly non-blocking. 

Josh

By: Cliff Click (cliffclickProject Admin) - 2008-11-11 07:41
How do you know when e.g. the NBHM Big Array is free?
Cliff


By: Josh Dybnis (jdybnis) - 2008-11-12 17:06
>How do you know when e.g. the NBHM Big Array is free? 
Missed this earlier. 
After I promote a new array the old one gets queued up for a deferred free. (The same with keys in the old array that are tombstoned and were not copied to the new array.) The frees actually happens at a later point when all the threads are guaranteed not to be holding references to the memory. I accomplish this using a technique from RCU. Each thread that might be holding a reference periodically announces that it is in a quiescent state. I have a way of detecting when every thread has made such an announcement. At that point I can safely free the array. My implementation of all that is not particularly robust. If a thread blocks, and never announces it is in a quiescent state nothing more will ever get freed. This is straightforward to fix with a more complex implementation. I'm having an on-going discussion about it on comp.programming.threads. http://groups.google.com/group/comp.programming.threads/browse_thread/thread/702aedc141b34fa1#

Also, one could also use GC in a C program too. Or one of the safe memory reclamation techniques documented in the academic literature (e.g. hazard pointers, smr, etc.). 


By: Cliff Click (cliffclickProject Admin) - 2008-11-12 17:21
Sounds good. Just was wondering.
Lack-of-GC generally brings about various broken attempts to free memory in concurrent programs.

Cliff

---------------------------

By: Cliff Click (cliffclickProject Admin) - 2008-11-11 13:52
> In theory, I agree 100% with you about writing concurrent programs in C..... I think it could be ported to most other architectures without major modifications. 

Itanium. :-)

Cliff


By: Josh Dybnis (jdybnis) - 2008-11-11 19:47
Yeah Itanium, the Alpha would be even worse. It is your NBHM algorithm though; I don't think there is anything dependent on having a strong memory model. You would just have to put memory barriers in the right places. I'm not saying that it would be easy to figure out the optimal solution, but it wouldn't require any major rework to the code. The lazy way to do it would be to change all the CAS calls to include a full memory fence. Doing it that way would constrict performance of course, but it would be correct, and probably still more scalable than a traditional lock-based hash table.


By: Cliff Click (cliffclickProject Admin) - 2008-11-11 21:25
The NBHM State Machine is "correct" independent of any memory model. I still need fences & a Memory Model around the promotion-logic & table-copy areas. It's here with an IA64 would cause you grief. If you didn't need table-copy (i.e., pre-made fixed-sized table), then it would be correct on any shared-memory machine. 

The NBHM as presented as stronger semantics than a minimalistic non-blocking hash table - Keys are treated with Java volatile semantics, so e.g. you can make & install a fresh Key on 1 thread, and have a 2nd thread see the Key from the table and also see the Key's contents so it can correctly run a Key compare. A naive C IA64 implementation can screw up there, although the X86 strong ordering saves it. Same issue happens with the returned Value (made in Thr1, returned in Thr2, but does Thr2 see the initialized contents of the Value?). 

Here's a sample IA64 table-copy screwup:

  • - T1 copies last K/V from old table to new table.
  • - T1 incrs the last count of copied values. Bug: No ordering between the stores.
  • - T2 sees the last count (read of T1's 2nd store), 
  • - T2 promotes (stores new table as default)
  • - T3 sees the new table (read of T2's store)
  • - T3 probes for K, but does not see T1's store.
  • - T3 reports K as not-in-table, when in fact it is.

Cliff

---------------------------

By: Josh Dybnis (jdybnis) - 2008-11-12 16:36
What happens if a thread copies a slot and then dies before updating _copyDone? Does the copy never get promoted?

By: Cliff Click (cliffclickProject Admin) - 2008-11-12 16:39
The fall-back case is that some thread (all threads?) scan the entire array and discover that all things are copied, despite any failed counts. At this point the "discovering" thread can promote. 


Cliff


By: Josh Dybnis (jdybnis) - 2008-11-12 17:15
I'm missing where that is in the implementation. I see where a thread enters panic-mode and does the scan, but it looks like it always compares its workdone with _copyDone and the size of the array. So if every slot is copied, but a thread didn't get around to updating _copyDone, none of the panicking threads will ever terminate their scans.



By: Cliff Click (cliffclickProject Admin) - 2008-11-12 17:23
Welcome to the difference between algorithm and implementation. :-)

Feel free to post a fix. It should be a 1-liner.
Cliff


By: Josh Dybnis (jdybnis) - 2008-11-12 17:44
Heh. I have a similar bug in my implementation.

=============================

compareAndSwap memory fence (New)

Andrew Trick (andrewtrick) - 2008-11-12 11:16
My interpretation of your slide-set/blog comments is that the NBHM state machine--ignore table copying for now--does not require any memory fences or store ordering. CAS_key, CAS_val gets the job done as long as your machine has atomic compare-exchange. Correct?
 
In fact, in NBHM source you bypass util.concurrent.atomic, calling Unsafe.compareAndSwapXXX presumably to avoid the memory fence. However, the Sun JDK and HotSpot assumes that Unsafe.compareAndSwapXXX enforces a full memory fence. So would it be fair to say that you've made Azul-specific changes to the JDK/JVM to optimize NBHM? Should Sun get rid of the full-fence semantics of Unsafe.compareAndSwap before more Java/JDK code begins to rely on it?
 
-Andy


By: Cliff Click (cliffclickProject Admin) - 2008-11-12 11:20
Yes, I need no fencing (ignoring table copy) for some kind of hashtable semantics. You DO need very carefully placed fences to get CHM semantics - which amount to 'volatile' semantics on values used in put & get calls.
 
I bypassed the Atomic classes to get the no-fencing CAS. The Unsafe calls specifically have weak no-fencing versions. On an X86 & Sparc you are stuck with the fencing anyways, but on Azul the weak no-fence version actually does not fence.
 
So I did not do any Azul-specific changes to the JDK here (Azul's JDK IS different from Sun's - but not here).
 
Cliff


Andrew Trick (andrewtrick) - 2008-11-12 19:05
Thanks for the reply and clarification.
 
The JDK does indeed have AtomicReference.weakCompareAndSet. Sorry to be pedantic about API details but... What I meant by "JDK assumes a fence" is that AtomicReference.compareAndSet for Sun is identical to weakCompareAndSet, so the underlying Unsafe.compareAndSwapObject must have a fence. What I meant by "HotSpot assumes a fence" is that the Unsafe.compareAndSwap intrinsic generates abstract memory barriers, which you would have to materialize as a real barrier on any machine that doesn't have implicit CAS fences.
 
But really I was just trying to understand the intention behind your NBHM source, which is pretty clear now. It's not important whether some careful JVM porting is needed to make it fast on certain h/w.

By: Cliff Click (cliffclickProject Admin) - 2008-11-12 21:35
Letsee...
 
- sun.misc.Unsafe does not specify any memory ordering properties for compareAndSet.
- AtomicReference.compareAndSet is doc'd to have volatile semantics, and operates on a volatile variable.
- AtomicReference.weakCompareAndSet is doc'd to NOT have volatile semantics.
 
A legit JDK implementation could then
- not fence for sun.misc.Unsafe
- but yes fence around volatile ops, including Unsafe ops on volatiles.
 
This is exactly the intent of the docs & implementation, and it specifically allows Azul to not fence on sun.misc.Unsafe & still meet the spec - which we do (on both fronts: not-fence and meet-the-spec). We also typically do not suffer from other peoples' bad assumptions here - where they expect Unsafe to fence and are surprised when it doesn't. I suspect Unsafe.CAS users are a rare breed.
 
Removing the fence for Unsafe ops is a tidy win for Azul on some highly parallel contention-heavy codes.
 
Cliff 

That's all,

Cliff

Category: Web/Tech | Permalink » | Comments (0) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
Cliff's New England Fall Travel Classic
October 28, 2008

Or how I went to New England to view the Fall Colors in Perfect Weather,
and All You Got was this Lousy Blog

Azul sent me to Manhattan (my least favorite place on the planet) to speak at the NYC Java Users Group.  I discover that the earliest flight out of San Jose (plus 3 time zones) gets in NYC at 4pm - and that you can't get from a NYC airport to downtown Manhattan in 2hrs during rush hour - so I can't make a 6pm talk if I leave San Jose on the same day.  Sigh - it's the Red-Eye for me.

Wednesday: I end up on the red-eye from Tuesday the night before and arrive at JFK at a delightful 6:30am local time.  No sleep for me.  I've done the NYC subway system a number of times so it's not much problem for me to buy a MetroCard and make my way to downtown.  I arrive at the hotel at maybe 8am in the morning.  Lo!  The hotel is full and they have no room for me - and indeed have no room until 3pm. Certainly by then I am one shabby dude - unshaven for 36 hours, same clothes, no shower, no sleep - not quite up (down?) to the standards of the street people hanging around the hotel but definitely looking grubby.  And I'm exhausted.

3pm arrives and I finally get my room and my shower, change clothes, lay down.... bad mistake.  I barely scrape up enough brain cells to force myself out of bed, and stumble, shaking from sleep deprivation, to the clock: only 90 minutes gone in a blink.  I splash some water on my face and go to meet the JUG.

The talks go very well; lots of good questions from an animated audience.  I presented Challenges and Directions in Java Virtual Machines and Towards a Scalable Non-Blocking Coding Style, and I got quite animated myself - all in all very good.  Dinner was some close-by expensive steak joint, but excellent food.  NYC definitely does steaks well.  Finally the hotel bed looms... and dawn breaks.

Thursday: Today is 2 customer visits, one planned and one directly due to the JUG talk the evening before.  Alas the new customer has micro-second response time requirements; we can only provide millisecond response times (albeit with massive throughput).  The planned customer visit goes much better; it's a tools&libraries group looking for advice on scalable Collections.  After that I take a train to visit my Uncle's farm in Connecticut.  Penn Central is full of these little train ticket kiosks; I happily approach one.  It takes my credit card, works me down through the menus to New London, CT - then hangs.  I try another; this one takes 30sec before posting loudly "Out of Servce".  I'm 0 for 2 so I sigh and get in the infinite line to see a human.  The line takes a full hour to buy a ticket.  Then I go wait for my (now a full hour later) train.  Just when it gets near the head of the list of arriving trains, the train board reports my train as "delayed: 10 minutes".  This goes on for 20 minutes before it changes to "stalled", then finally "now on track #9".  But all goes well after that, my Uncle picks me up in New London and we drive off to the farm.

Friday, Sat, Sun: The next few days are a real delight. The weather is perfect - highs maybe 65, lows in the 50's, breezy, no clouds and the trees in full color.  My cousins are all married and with kids all the same ages as my kids so it's a great time to visit.  I go to a soccer game, the end of a 5K run, a local chocolate & wine tasting, buy from a craft fair and shop at the local antiques shops; help (interfere?) with some hay harvesting, take a stab at fixing 2 Windoze machines (bizarre XP SP3 things about not being able to install Beethoven.wm3 after about of hour of disk grinding), and just plain visiting.

Monday: I pack up my suitcase, both physical and mental, and prepare to jump back into the fray.  The flight to Nashville is easy - except for my last child is finally going off to public school (we've home-schooled all our kids for years with great success), and my wife is in a frenzy getting him ready - and I'm a thousand miles away.  So I get to hear about it between plane flights, in airports, in the cab, etc. The Nashville hotel is nice, and only $150/night for a room twice the size of the NYC $500/night room, and a much nicer neighborhood. 

Tuesday, Wednesday, Thursday: OOPSLA.  Paid $710 at the door - I forgot to pre-reg.  I had great a lunch conversation Doug Lea, Bill Pugh, & Ben Zorn.  I take lots of notes.  Lots of great hallway and dinner conversation covering reader/write locks, the state of the JCP & Exec Committee, whether Sun should drop the JCP process altogether, whether the market will implode & we should all run up our credit cards to the max, end of civilization, etc. Ben tries to convince me that Microsoft can be trusted as the sole holder of all my critical personal information (sorry Ben) and that totally transparent sync'ing of all my portable & desktop devices is coming Real Soon Now.  Perhaps - but I'm never going to trust my personal info to some single central server - even if it does get old-school Ma-Bell-land-line-style 5-9's reliability for both the server and the sync'ing.  Too many companies have said "it's safe with us" and been wrong.  Too much motivation for crooks if all the eggs are so close together.  Other conversations that stuck in my mind:

  • Azul could Open Source or JCP our hacks for +UseLockedCollections (see about 1/2-way down the blog) and +UseCheckedCollections.  This could be a Good Thing for Azul as it will reduce the number of apps which fail "out of the box" on Azul as well as raise awareness of the issue.  Note that all major App-Servers (and many a large 3-letter-acronym company's major product) fail on Azul without +UseLockedCollections (and runs great with it) and the issue has nothing to do with Azul except we have more cores.  I.e., I predict these crashes will soon be coming to an X86 multi-core Near You!  I can hand out stack traces for any interested parties; email me for details (and yes, bug reports have been filed but between me and you I suspect most of them get filed into the round bin).
  • Bill Pugh offers that at Google they have a 3rd option: log&continue.  The logging is only invoked where now I would throw a ConcurrentModificationException - but there's no standard JDK place for logging such things.  We certainly could add an Azul logging option where the output is available on a RTPM data-race screen.
  • Doug Lea desperately needs 3 things to make his Fork-Join parallelism have a sane API - note that Fork-Join has a really nice approach to handling large complex parallel problems - there's no real limit on scaling plus it has a fantastic story on automatic load balancing which is typically really crucial for us.
    1. Tail-call optimization.  Part and parcel of the solution when doing recursive decomposition.
    2. Some kind of auto-boxing-removal story.  See also John Rose's DaVinci Machine.
    3. Some kind of forced inlining story, so that a complex F-J iterator gets cloned at the site where the loop-body is declared.  This will allow the (highly megamorphic) inner loop function to be specialized per call site, inlined into the iterator loop and decently optimized.
  • Several people asked for a Weak-Key'd AND Weak-Value'd NonBlocking HashMap.  

Friday: I make last minute plans back to NYC to visit the tools & libraries customer group again.  It's a 6:45am flight out of Nashville, which means be at airport by 5:am so leave hotel at 4:30am and get up at 3:30am.  Sign, another day with no sleep.  The flight to NYC is rocky as all heck, the clouds have finally arrived to ruin my days.  We bump for 2hrs out of a 2:45hr flight.  Then it's AirTrain to Jamaca Station, Long Island Railroad to Penn Station, Subway#3 to Wall Street - I'm coming quit good at navigating NYC.  I'm almost an hour early to my customer visit - which is really an open-ended tutorial on Azul's profiling tool: RTPM.

These guys are doing some fun stuff.  They have their own versions of the standard JDK collections where the expected collection size is 1million to 10+million elements.  They have very efficient striped iterators (100x speedups on a 100-way Azul box) and are looking at Doug Lea's Fork-Join for describing their inner loop closures.  RTPM goes a long way to helping them figure out what's going on.

There's also a nifty in-memory DB thingy where we discover that turning on "-tiered" mode (contrast to HotSpot's -client and -server modes) is worth a stunning 5x speedup (Azul usually sees something like a 15% speedup for -tiered).  We try to make the in-memory DB run that fast under a ReadWriteLock.  The customer has a very light-weight roll-your-own version (new in Java6: lots of new features in the JDK ReentrantReadWriteLock class which slows it down by 50%).  The customers' lock doesn't stripe the reader-counts and instantly suffers massively on the highly contended single reader count word (which the JDK version does as well).  I step their engineer through details of reader-lock-striping; it's definitely ticklish stuff to get right.

Finally 5pm arrives and everybody is going home - although for me "home" is really a subway ride to Penn Station, then a train back to New London, CT for another weekend on the farm.  I am sooooo ready to be back in San Jose.

Saturday, Sunday: Another great weekend on the farm; I finally get to see some rain (it's not rainy season in San Jose yet so I haven't see rain since March).  The wind blows down a tree over the power lines and we lose power from 9pm Sat to 5am Sunday (when my room lights pop on and wake me up, grumble).  With the power out and the internet gone it really feels like a farm.  On Sunday I take the train in to Boston, then have dinner with David Bacon at Legal Seafoods & take the T to my hotel.  Gosh I almost feel like a savvy New England traveler. 

Monday is the 2009 VEE PC meeting.  Despite missing 1/3 of the PC members (two have new babies, one got food poisoning, several had other issues that made them dial-in only for a few minutes at a time), the meeting went very well.  We got done early and I got to take the new Boston T Silverline to the airport - it's labeled like a subway but it's really an electric bus (the long bendy kind) in it's own bus-sized tunnel.  Logan was easy, the flight was smooth and I made it home (finally) at 1am local time.  It's good to be home.

Continue reading "Cliff's New England Fall Travel Classic " »

Category: Web/Tech | Permalink » | Comments (11) » | TrackBack (0) »

Like this post? del.icio.us | digg.com
 
JVM Language Summit
September 29, 2008

I spent last week at the JVM Language Summit, a small conference focused on non-Java JVM-based languages.  It was populated almost solely by either JVM implementors or bytecodes-are-my-new-assembler types.  It was a very fun & geeky conference for me; some of the best technical discussions I've had in a long time.  My favorite philosophical talk was definitely Erik Meijer talking about Fundamentalist Functional Programming.  In the end, he's all for non-pure side-effecting programming BUT he wants the type system to "stop lying" about side-effects.  We both agreed that we'd love to see a functional programming language which had types for side-effects (e.g. Haskell's IO "sin bin"), but with more elaborate side-effects.  I.e., if I'm writing my program with few-to-none side effects I might want to track them exactly (e.g. this function has type 'side-effects field _code') but if I call into the Swing or AWT libs I might want a type like 'side-effects all Java variables'.

My favorite techy/JVM talk was Bernd's Maxine VM.   I think he's got a really good nearly pure-Java layer figured out for generating ASM code.  You want something that looks like running plain Java code (and does field reads & writes) that guaranteed must inline down to where the JIT turns it into atomic machine code loads & stores.  He's missing lots & lots of major features (e.g. good JIT, good GC) but the framework looks nice.  If we ever are to get a true JVM-in-JVM (and skip bootstrapping your JVM via a C++ compiler which is what HotSpot does), I think Maxine has the best chance.

Most desired JVM feature: invokedynamic; this would cut huge chunks of evil code out of most JVM-based languages (but not Clojure & JPC, both of which target 1.5 JVMs more closely).
Most needed JVM feature without changing the spec: a trivial Escape Analysis suitable for dealing with endless Fixnum/BigNum math.  Most of these non-Java JVM languages do fixnum/bignum math (automatically overflowing fixed-int math into bignums) - which means every trivial bit of integer math creates a new Fixnum object.

I showed off Azul's RTPM and got lots of oooo's and aahhhh's.  I convinced lots of people to sign up for academic accounts mostly for access to RTPM.  I was able to quickly diagnose a big problem w/JRuby (failure to inline a key trampoline, masked by +PrintInlining lying), and also got 20% speedup on the ASM library - the core bytecode parsing library in widespread use.

My own talk was well received:

Brian Goetz wrote:

your talk was both the most popular and highest rated, according to the comment cards. 

The slides are here although the postscript conversion mostly screwed up my transparent ovals.  Mostly I ran this trivial program on Azul's JVM using these languages: Scala, Clojure, Jython, JRuby, JavaScript/Rhino, & JPC. I profiled them using RTPM and report on the profiling.  See the slides for the in-depth discussion.  3 things popped out:

  1. Clojure, JRuby, Jython, Javascript/Rhino all using fixnum/bignums for every trivial piece of math - which ends up allocating new FixNums for each simple integer operation. A major fix for this would be to implement the simplest possible Escape-Analysis. You'll still end up with the overflow checks but the major cost here is the Fixnum allocation (and NOT the GC costs; the fixnums trivally die in the young-gen; direct GC costs are quite modest).
  2. JRuby was thinking that a key trampoline function was inlining - and it wasn't.  See the slides for details, but hopefully we'll see a JRuby with a substantial performance boost soon.
  3. I ran Clojure's STM Traveling Salesman Problem using 600 worker 'ants' to optimize the path.  It ran straight-up without a hitch, no tweaks - and with great scaling (it did allocate about 20Gig/sec on a 200Gig heap but this is easy for an Azul box to cope with).  There may be hope for STM's after all - IF the set of STM variables is limited to a handful of named shared variables, and not for dusty-desks that must assume all-is-shared.

Cliff

Category: Web/Tech | Permalink » | Comments (17) » | TrackBack (0) »

Like this post? del.icio.us | digg.com