|
Cliff Click Jr.’s Blog's Blog
Category: Web/Tech
I've Been Slashdot'd
January 19, 2010
I've been Slashdot'd. The slides in question are also here.I gave the talk at the JVM Language Summit, which itself was a lot of fun. The talk is a repeat of one of the talks I did at JavaOne. I also gave two other talks, but the Sun JavaOne website appears to be unable to deliver the video right now. I also gave a short interview at JavaOne. One of the talks I mentioned on the InfoQ video is also available here as a Google Tech Talk; Java on 1000 cores: Tales of Hardware/Software Co-Design. I also mentioned a talk Azul's Experiences with Hardware Transactional Memory, and my blog on that is here. Alas, I don't believe the HTM talk has been video'd for public consumption at any time. If you are interested in HTM support, you should also check out this short gem. The GC talk alluded too has slides all over the web; here's the original paper, but I could not find a public video presentation. Cliff
Category: Web/Tech | Permalink
| Comments (10)
| TrackBack (0)
Touching Base...
December 22, 2009
It's been awhile since I blogged, so I thought I'd touch base with
people to let them know what's been going on. Azul Systems has been
hard at work improving our JVM. This is a bigger statement than it
sounds - there are not many groups that have a large enough 'quorum' of
JVM engineers to do large-scale changes to the HotSpot JVM. Azul has
nearly a dozen engineers doing core HotSpot work (not counting JDK work
or QA folks - counting only core JVM engineers)! We've been doing
large-scale changes to HotSpot for nearly 8 years now. Our HotSpot has
been improved over Sun's standard HotSpot or the OpenJDK in a large
number of ways, some more visible and some less so. Some of the more
obvious stuff we've got working:
- A new complete replacement GC: Generational
Pauseless GC (and the older PauselessGC
paper is here). This is one of our core strengths. GPGC handles
heaps from 60Megabytes to 600Gigabytes and allocation rates from
4Megabytes/sec to 40Gigabytes/sec, with MAX pause-times consistently
down in the
10-20msec range.
- GPGC requires read barriers,
and this means
instrumenting every read from the garbage-collected heap.
Instrumenting the JIT'd reads is easy: we altered the JITs long ago to
emit the needed instructions. Instrumenting the VM itself is a bigger
job; every time we integrate a new source drop from Sun we have to find
all the new heap-reads Sun has inserted into their new C++ code
(HotSpot itself is a large complex C++ program) and add read-barriers
to them.
- Real Time Performance
Monitoring - RTPM. This
is our high-resolution always-on no-overhead integrated profiling tool
and is our 2nd major selling point. Because it's no-overhead
(literally less than 1%; it's very hard to measure the overhead) we
leave it always on. This means you can look at a JVM that's been up in
production for a week or a month and introspect it. It's *common*
for a 1hr session with RTPM to answer performance questions that have
plagued production systems for years, or to have people walk away with
10-line fixes worth 30% speedups. It's as-if you've been blind to what
your JVM has been doing and suddenly your eyes are opened. Live stack
traces, heap contents, leaks, hot-locks with contending stack traces,
profiled JIT'd assembly, I/O bottlenecks, GC issues, etc, etc. See the
link for a demo.
- Virtualized JVM - We can take pretty much any old server, install
a new JDK, change JAVA_HOME to the new JDK and re-launch the
application... and it now runs on Azul's JVM backed by an Azul
appliance. No hardware change and no OS change. This is a great
solution for in-place speedups of older gear.
- More recently of course, we've been hard at working porting our
JVM to our new hardware platform. This work is going well; look for
more discussion here as we have things to announce!
Here's some of the LESS obvious stuff we have working:
- Tiered Compilation. Despite the fact that Sun has shipped
"-client" and "-server" configurations for years, they never integrated
these two JITs into a single system. Most other JVMs have had a tiered
compilation configuration for years and Azul Systems did this to
HotSpot a few years
ago. We consistently see a roughly 15% speed improvement over a
plain "-server" configuration. We use the "-client" JIT (also known
internally as C1) to do fast high-resolution profiling; this
high-quality profile information allows the "-server" JIT (C2) to do a
much better job of inlining and compiling.
- A complete replacement for the existing HotSpot CodeCache: the
holder of all JIT'd code in the system. While *adding* code has
always been easy, *removing* code has always been tricky (well,
tricky to do it without blowing all code away at once and without
requiring all calls to indirect through a 'handle'). Most
large server apps slowly churn new code, so if you leak code you
eventually run out of memory. The
new CodeCache uses GC to control code lifetimes and this results in a
vastly simpler and less buggy structure all around. We also use GC to
manage all the
auxiliary data structures surrounding code, e.g. the list of "class
dependencies" for a piece of JIT'd code is a standard heap object now.
(A "class dependency" lists the set of classes & methods that a
piece of JIT'd code assumes are NOT overridden; if a new class and/or
method overrides one of these then some inlining decision made by the
JIT is now illegal and the JIT'd code needs to be deoptimized, removed
&
recompiled). Besides being a common management point for all code, the
CodeCache is pinned in the low 4-Gig. This means all hardware Program
Counters can be limited to 32bits (in our otherwise 64-bit system) and
this is a
tidy cost savings (shorter instruction sequences for calls; less
I-cache space consumed, etc).
- Tons of internal JVM scaling work. We run on systems with 100's
of CPUs and so we've found (and fixed!) any number of internal JVM
scaling limitations. GPGC can run with hundreds of worker CPUs if
needed. The JITs compile in parallel with dozens of CPUs (50 is common
during a large application startup). Many internal VM structures have
been made lock-free or have had their lock hold-times reduced by 10x or more.
Self-tuning auto-sizing JIT/compiler thread pool. Concurrent stub/native-wrapper generation. Concurrent code-dependency insertion (during compilation) and checking (during class loading). Self-tuning
finalizer work queues. etc, etc, etc....
- Cooperative Safepointing allows thousands of *running*
threads
(not just alive-but-blocked-on-IO) to come to a Safepoint in
under a millisecond. Merely safepointing 100's of threads is down in
the microseconds. Note that a full-on Safepoint does not happen until
the last thread checks-in but the stall time starts when the first
thread stops for a Safepoint. The time-to-safepoint pause is measured
from when the first running thread stops till when the last thread
checks-in.
- The
ability to asynchronously stop & signal individual
threads, to have them do various self-service tasks cheaper than a
remote thread can do it. This includes, e.g. stack crawls for GC or
profiling (a thread's stack is hot in his own L1 cache and can be
crawled
vastly faster than by a remote thread), or to acknowledge GC phase
shifts or to allow code to be deoptimized (jargon word for what happens
to code that is no longer valid due to class loading). We can also
efficiently do "ragged safepoints" - this is like a full Safepoint
except we don't need to simultaneously stop all threads. Instead we
merely need to know when all threads have acknowledged e.g. a GC phase
shift. The threads "check in" as they individually acknowledge the
Safepoint and keep on running. When the last thread has checked in,
the "ragged safepoint" (and GC phase shift) is complete.
- No more "perm-gen" space to run out or require a separate tuning
flag. No more old-gen or young-gen either. No GC-thread-count knobs,
or space/ratio tuning knobs or GC age or SurvivorXXX flags. GPGC takes
no flags (except max total resources allowed), and runs well. There Is
Only One Heap
Space, and GPGC Rules It All.
- A new thread & stack layout that lets us use the
stack-pointer also as a ThreadLocal storage pointer, the HotSpot
"JavaThread*", AND as a small dense integer thread-id (requires 1 or 2
integer ops to flip between these forms). This frees up a CPU register
for general use, while still allowing 1-cycle access to performance
critical thread-local structures.
- A complete replacement for the existing HotSpot locking
mechanisms. Our new locks are 'biased' (here's the
original paper idea) similar in
theory to Sun's +BiasedLocking but based on entirely new code. No more
"displaced
header" madness
(this comment
is probably only relevant to
hard-core HotSpot engineers). Biased locks do not require ANY atomic
operation or memory barrier during locking & unlocking, unless the
lock needs to "change hands". Since we can stop individual threads
asynchronously, we have a fairly cheap way to hand biased locks off
between threads. Once individual locks demonstrate that they need to
"change hands", we inflate that one lock (not the whole class of locks)
and it becomes a "thin lock" as long as the contention is low enough
switching over to a "thick lock" only when there are threads waiting to
acquire the lock. The issues here are fairly complex and subtle and
deserve an entire 'nother blog!
That's enough for this Blog. More later...
Cliff
Category: Web/Tech | Permalink
| Comments (14)
| TrackBack (0)
Panic Last Week
August 16, 2009
(warning: no technical content) Prior Friday: my Mom calls my sister & tells her my brother Eric (hiking the entire PCT this summer) is 2 days late/overdue.
Sister calls me in a PANIC!!! I plan on Saturday driving 5 hrs to
northern California to calm my Mom & help coordinate search & rescue.
It's not like I can get on the trail and expect to cover 100miles in 4
days like my brother's been doing. I can't reach Mom (no doubt she's driving twisty mountain roads and
is out of cell phone range). To make matters worse my wife has been sick all the
prior week while trying to run Vacation Bible School (her and 8 kids got it from somebody who brung it; why do people bring their sick kids to School?); she's still recovering.
Saturday: no panic. Mom calls; Eric's been found and is not really
late: he had told her he probably would not be able to call midway
through the 100 mile stretch and indeed had no cell coverage the entire
time. Mom had the local Sheriff out canvassing everything and
almost ready to put up a Search&Rescue plane when she finds Eric (and the Sheriff really didn't want to do that because there was already almost 100 planes in the air fighting the local wildfires). Mom's too
old to panic like that. Grrrrr.....
More Saturday: I take the kids to the beach while my wife sleeps off the cold; my eldest daughter needs to photo tidepools for her Bio project and we all need
to relax. In the middle of the relaxation my Dad calls home (my folks are divorced so no relationship with my Mom anymore) and reaches my wife napping; Dad and new wife (of nearly 20 years) are
in Stockton and will be arriving tomorrow. PANIC!!!! Wife's
still recovering; the house is a total shambles AND her inlaws are
arriving in less than 24 hrs.
Sunday: Inlaws arrive. Much grandparent fun.
Monday: I leave my folks to have a looong planned meeting with Intel. I arrive and PANIC!!! the power is out. We
relocate to the break room which has floor-to-ceiling windows. Makes me feel like Azul is a high-powered
got-act-together startup to host Intel folks in the dark. Dad & wife move
on (an unannounced drive-by grand-parenting), as they are traveling the western US as part of their retirement.
Tuesday. I spend all day on training & mentoring, and
recovering from repeated power failures. All boxes need rebooting and
..."PANIC!!!"... patches get installed, don't play nice together, need
more patches, more screwing around. Things more or less functional
again by late Tuesday. Of course a company wide demo coming on Thursday is hitting an unexpected new crash, so the demo gods are panic'ing and I can't help them all day because of power outage crapping out my machines.
Wednesday. Unrelated at-home PANIC keeps me busy all morning
(plus bugs in high-scale-lib showing up a year late). I finally get the fires out around
noon and my wife offers to go out to lunch w/me. First time we'd eaten
together w/out kids in a month. Of course, the housekeeper has hid my
scheduler book and there's something in it keeps nagging me. Just
before we walk out the door I find it and stuff it in my briefcase
un-inspected - because a rare lunch opportunity w/my wife is not to be
missed. 5 minutes later Jeremy of Google calls: I missed my planned Google
lunch w/ high-powered Java hackers and he wants to know if I'm
still giving the tech-talk at 1? PANIC!!!! I
turn the car around, drop off the wife, panic another 5 minutes trying to get latest
slides on wife's tiny new netbook because the standard presentation laptop is
sick before the laptop recovers and I grab it and roar out the door. I make it to Google
just in time (no officer, I wasn't speeding I was just flying too low). The talk goes well. (slides are here)
Thursday. Demo goes OK. I actually get some work done. It's quite shocking.
Friday: Planning on some kind of panic but it hasn't
happened yet. Got a concert planned tonight in Santa Cruz withOUT
kids... so an hour drive away from any kind of rescue. We're making
plans (got neighbors lined up, etc). In the end wife and I are so exhausted that we bail on the concert and settle for a nice dinner out without kids.
Saturday: We have a 7:30am Dr appointment for my youngest (no: I normally never hear of Dr's giving 7:30am Saturday appointments, but my youngest needs to see a serious specialist who happens also to be a really nice guy), then a long planned trip
to SF zoo (In theory: more grist for the daughter's Bio project. In practice: she's got a long-distance romance going on with somebody who's 2 hours north of SF so they agreed to meet in the middle). We all get too much sun wandering the SF zoo in perfect weather. I finally come down with the dreaded VBS cold, and feel like crud the whole day but had a great time nonetheless.
Sunday: We're all recovering from massive sunburns and a week of madness. School starts tomorrow... plenty of time to panic then!!!
Cliff
PS- Concerned that there might be a correlation between PANIC!!!! and my
receding hairline
Category: Web/Tech | Permalink
| Comments (6)
| TrackBack (0)
2009 ECOOP
August 2, 2009
Long delayed, but at last I found time to publish my notes.
I got a free all-expense paid trip to Italy, and all I had to do was give a keynote speech (slides) at 2009 ECOOP. It's a nice gig if you can get it. Travel to Genoa from San Francisco takes awhile; my 8:30am SF flight got me in Genoa about 2pm on the *next* day - plus 2 hrs lead time at SF, plus a 1hr drive to SF - I started my "day" at 6am on Friday and ended it at 8pm on Saturday. I managed to stay up till dinner (barely) then crashed pretty hard. Sunday was my day for sightseeing; I managed to wander pretty far all around Genoa; rode some funiculars; walked all over; took loads of pictures; ate weird food and generally acted like a typical American tourist.
There have been loads of sidewalk vendors; all Blacks aggressively selling fake designer purses and sunglasses laid out on sheets on the sidewalk. Monday afternoon I got to watch a bust go down; lots of yelling, they grabbed their sheet bundles and took off running with plain clothes police in hot pursuit. Somebody lost his wares and the police quit chasing to grab the loot. It's Tuesday evening and the Blacks haven't returned. I promised my wife I'd get her a cheap Gucci purse and now I'm regretting I didn't do it earlier. Hotel Bristol is a nice older place (means: lots of paint chips and paint layers, big chandeliers, gold leaf trim, dark wood, creaky old elevators; rooms are a mix of really old and really new; follow the link and oogle at the pictures). My room is vast with 15ft ceilings (with chandelier of course) and a large jacuzzi tub; enough room to put up a basketball hoop. The conference food has been pretty nice (free good wine with every meal); lunches are served in the Palace Ducalle - historic summer residence of Doges. I can't do the setting enough justice; the website is nice but the pictures don't really tell the story. 50ft ceiling with 20ft chandeliers and a dozen 8ft marble statues; gold leaf trim on every fresco; stunning paintings the size of basketball courts... I totally laid down on the floor and stared at the ceiling for a good hour. The local restaurants, on the other hand, are a hit-or-miss affair; food is only served at certain hours; sometimes the food has been very good and sometimes no better than the local deli around the corner from Azul. The ICOOOLPS workshop was on Monday; I got invited to talk there at the last minute. I had some misgivings but most of the papers at the workshop were pretty nice; I enjoyed myself. I'm still fighting jet-lag (today is Tuesday) pretty hard, so I'm writing this to try and stay awake until dinner. The papers are here. 1st up - The ICOOOLPS workshop. ECOOP notes are further down. As usual I abbreviate ruthlessly; skip some talks; take notes in a stream-of-consciousness style. Caveat Emptor. And before I forget:
The "Nick Mitchell SimpleDateFormat Challenge": parse a sample common SDF and JIT code to translate Date objects to ASCII efficiently. Similar to C/C++ compiler optimizations for 'printf' strings.
Let me know when you got something working.
Towards an Actor-based Concurrent Machine Model "delegation based" machine model; dynamic separation of concerns "MDSOC" Machine model: objects, msgs, delegation. High lvl objects represented as at least 2 low-level objects; a "proxy" and a real object. Indirection to allow message interception. All msg-sends get forwarded thru the proxy ("message" here is a java-lvl function call, not an OpenMPI msg - but could be either in a distributed system). Aspect-Oriented-Programming - intercept msg-send links (between Class-proxy and methods in the actual Class-body), etc. How to do CHA with basically dynamic sub-class insertion? Just re-JIT in the new class hierarchy? Ok - now new stuff... want concurrent in the delegation function. Actor is a collection of local objects (and ptrs to remote objects). Fcn calls between local objects are fast; fcn calls across actors are basically remote-msg send. Then receiving a msg invokes a co-routine in the remote actor. Ugh, using 'yield' with the co-routines. Sounds buggy to me: semantics of dead-lock will depend on proper calling 'yield'. (assuming a VM w/ actors not thread support)
An Efficient Lock-Aware Transactional Memory Implementation Justin Gottschlich Trying to integrate into the "boost" STM system. Locks+TM break things. But locks are prevalent - must be able to compose with them. Example: Locks outside Transaction - T1 runs lock; T2 runs transaction Example: swap code. Works with only-locks or only-TMs but breaks if mixed Locks inside Trans - T1 runs trans, calls lock; T2 runs either trans or lock Prior work: "full lock protection" - if you take a lock, then the XTNs are all stopped/aborted/etc and you only get locking behavior. Offers a programmer solution: programmer lists lock/XTN conflicts and the XTN system deals when you take a conflicting lock. (I believe this is unrealistic).
Eric Jul? - Whiteboard not slides.... Blurring the line between compilers & runtimes Gives some examples already done by hotspot Nice discussion, but nothing new. Mostly trying to get a discussion going about crosing compilers/JITs & runtimes. He might not have been aware about what the JIT already does.
Trace-Based Type Specialization in JavaScript Andreas Gal Same basic talk as given at PLDI. TraceMonkey? FireFox 3.5 JavaScript & Flash instead of Java/C/C++ JS has been very slow (interpreted only). But is here to stay; very popular & growing. ActiveX & client-side Java dying out (not sure about client-side Java which was never popular in the 1st place). Static typing makes life easy, but dynamic typing is required. More complex data types and more runtime tests. Tag bits to be checked; overflow to Double, etc. Coming around to wanting a HotSpot like dynamic JIT'ing thing based on types that happen to be true at the moment. Basically, types in program traces remain stable over time. - for( var i=0; i<100; ++i ) { /* nothing */} loop: if ( int(i) >= 100 ) break; i = box(int(i)+1); goto loop; So trace can record, e.g. that 'i' has always been an 'int' so far. Trace has a guard on the input types, and are type-specific. Function calls are inlined in the traces, along with guard statements to check that you are taking the same control-flow. Tracing loops, and can verify trace is type-stable across loop. So can remove e.g. boxing in the loop. But real traces have many exits and the many exits are really taken. So trace along each exit. Build trace-trees, rooted at loop headers (exponential growth of trace trees as the loop DAG is split out?). Can only link back into original tree if types all match up - else need a new loop /tree header. Suffering for lack of a language spec; JS programs are driving the language semantics (i.e., w/a new interpreter some JS programs fail so we call the interpreter buggy- even if it meets the loose "spec"). No nothing of a memory model or threading; lots of other holes in the language. Echo's of what Java went through: the popular implementations defined the spec.
Tracing the Meta-Level: PyPy's Tracing JIT Compiler Carl Bolz PyPy is a tracing JIT compiler. Now apply this tracing JIT to an interpreter. Compiler "Restricted Python" RPython - can target C, Java, .Net Various interpreters: python, prolog, smalltalk, scheme, etc Basic idea: trace loops; look for type-stable loop execution; look for similar code path loops. But the dispatch loop in interpreters means you never execute the same loop code twice (because each time you are running a new different bytecode). Goal: trace user-mode program, not the language interpreter. Effective the tracing interpreter unrolls the bytecode dispatch loop. Provde 3 hints to laguange interpreter. - hint for position key; here is the language interpreter's BCI; here are backwards edges; here is the PC modification Works fairly well to clean out the language interpreter from the traces. Then get traces which are fairly clean & can JIT them. Bears resemblance to partial evaluation, arrived at by different means. Future work: better optimization of traces; some escape analysis to remove boxing operations. Optimize frame objects, apply to larger programs.
Faster than C#: efficient implementation of dynamic languages on .NET Antonio Cuni Trying to make e.g. Python faster on .Net. Looking at IronPython, Jython, JRuby, Groovy (also Self, JavaScript/TraceMonkey/V8) Why so slow? Hard to compile efficiently; lack of type info at runtime; VMs not optimized to run them. .Net is a multi-language VM, right (sure, as long as the language is C# - his quote, not mine!). JVM is in better shape, but still heavily optimized for Java. JIT compiler? Wait until you know what you need; interleave compile-time and runtime; exploit runtime info. JIT layering; fight the VM... PyPy; JIT compiler generator; Python semantics "for free". JIT frontend not limited to python; JIT backend: x86 or CLI/.NET backends. Fun games with partial eval: assume e.g. Python bytecodes to be constant & constant-prop them into the python interpreter. Do even more fun with constants than HotSpot+OnStackReplacement - totally doing speculative constant value JIT'ing - if this argument is of value '3' then here is the JIT'd code. Trick is to pick which variable to constant- speculate on (and getting that spec value as well). Not yet doing all of Python, but getting really great speedups.
Strata Virtual Machine Software Dynamic Translation - read a binary, & translate it/jit the translated code. Using a code-based hash-table lookup. Hash; jump to hash-table entry. Miss: jump to 'strata' interpreter Hit: jump to bucket; bucket checks target (like HS inline cache, check at target) HotSpot could use this to make more efficient v-calls? Hash; indirect jump to nearby code table; table jumps to target & checks target; expect no misses after warmup. Nah... still got indirect branch in there. Looks like a standard binary dynamic translation type stuff (converting PPC instructions to Java bytecodes?)
Automatic Vectorization in JIKES RVM Using SIMD ops on X86. Can't use BURS/BURG to pattern-match vector ops. Unroll loops to make the patterns more obvious. Basically getting some SIMD stuff to work (but talk given by PC not by author, and PC believes this is not the right way to discover SIMD ops). I also believe this... it's not a clean fit to BURS. Code emit via simple bitmask/shift/and/or.
JIT Compilation on ARM Processors Michele Tartara ARM - 31 32-bit GPRs, 16 available at a time?. SP, LNK, PC are GPRs. Fixed format 32-bit ops. Dynamic fast compilation (cell-phone targets). No BURS or tiling, just greedy rules. This is probably the right way to go always.
ECOOP 2009 Proper Starts
ECOOP is being held in the Piazzo Ducall (Ducal Palace) - the main conference presentation chamber has perhaps 40ft ceilings, acres of gold leaf on the walls in between the massive medieval paintings. The speaker dias was clearly meant to hold a throne or the orchestra; it's a circular marble dias perhaps 20ft in diameter with marble balustrade and marble railings. The high tech screen & projector setup in the middle is really anachronistic. The adjacent formal ballroom is much larger; 50ft+ ceilings; chandeliers of at least 20ft tall and 30ft in diameter; paintings & gold leaf in plenty plus also perhaps a dozen 8ft marble statues with ancient greek themes.
Keynote - Classes, Jim, but Not as we Know Them Simon P Jones As usual for Simon, he gave a wonderful presentation. History - Haskell is 20yrs old. Lots of fun with new languages & machines & ideas (Functional Programming). After Backus's Turing Award lecture opened the gates, a storm of new languages hit the field. Took 10 yrs before Haskell formed out of the chaos of new languages. Lifetime of most programming languages:
- Invented by 1 person, used by 1 person, dies within a year.
- Slow death version: invited by 1 person, used by 100, dies in 5 yrs
- Successful: used by 100K+ programmers, never dies, e.g. Fortran still alive
- Haskell? hovering around 1000 programmers, but sharp recent uptick?
Haskell might cross over into immortal language (100K+ programmers)? Hackage: user uploaded libs, reaching 300 new libs uploaded /month, and approaching 1million downloads. Type Classes- Shows examples of lots of things you want actions on a wide variety of types:
- equality (for set membership?) how to do equality of functions?
- ordering (sorting lists)
- print/showing
- serialization
- computing hash function
(so far sounds like Java interfaces) For all 'a' such that 'a' is an instance of class 'Num' square Num a ==> return a*a Num classes have the following methods: +,-,*,/, etc. The implementation of Num classes binds '+' to eq 'plusInt' or 'plusFloat' Implementation is perhaps more clever than Java invokeinterface bytecode lookup: is passed in a vector of fcns, of the correct type of interface 'Num'. The fcn lookup happens on this particular 'vtable' - call it an itable, but it's really a unique v-table per interface. For Haskell, this is all done with syntatic sugar over the basic Haskell. Which itself is very cool (e.g. Haskell compiler (javac equivalent) is doing interface calls w/syntatic sugar). CLIFF NOTES: This is a faster way to do Java interface calls!!! Or maybe this is how they are impl'd? Find the i-table from a list of implemented interfaces, then pull out the correct fcn from the i-index. In his world the i-table is pulled out early and kept pulled out, which avoids the point-by-point lookup of interfaces. Ahhh.... he's statically computing the i-table ahead of time during compilation, using some combo of type unification and the closed-universe assumption. Not applicable to Java. Able to handle fcns which themselves take arguments of fcns with any signature, as long as the fcn handles the correct interface. Doing type-based dispatch instead of value-based dispatch. Not the same as Java interfaces... In Haskell, can bind a value arg to multiple interfaces all at once. In Haskell, existing classes can be made instances of new interfaces (duck typing) Haskell has NO subtyping, but uses the interface (well: type class) notion instead. Haskell: types are inferenced typically, type annotations occasionally needed Java: types are declared explicitly; inference occasionally possible Haskell does type unification, so things like Java's 'equal' call - which is normally a fcn of (Object,Object)=>Bool - in Haskell, we KNOW ahead of time that the 2 args are of the exact same type. Hence the implementations of various equals calls do not need to ask the "instanceof" question that all Java equals call implementations all start with.
Making Sense of Large Heaps (Nick Mitchell, IBM) Where does the memory go in large java apps. Example from 2008, Cobol to Java. In Cobol: 600 bytes; in Java 4492 bytes. Blow up from delegation, bulky collections, or too many data fields. Size examples: 58M objects, 3Gig ram, 8000+ types. Yeti Tool - does smart cluster of datatypes (String w/char[], or HashMap w/HashMapEntry[], HashMapEntry, HashMapEntrySet). Yeti also does dominance of sharing; notices single-owner root/tree bits, and then breaks out the shared parts at the leaves. Also detect e.g., similar kinds of sharing, such as array B[] points to a set of B's; but also a linked list of Entry's points to the elements of B[] in turn. Really lumping together things with the same owner; maximal dominance w/same ownership relationship. Does nice heap shape summaries, followed by size info. Picture is really alternate sandwich effect, alternating between the Collections you selected, and the user-data at that layer of the 'sandwich'. Edges hold the various sizes of things. They fold up collection 'backbones'. If we get the data structures right, can easily shrink heaps by 2x to 10x!!! Huge speedups possible. Shows examples of people using CHM to hold 1 element (but programmer doesn't want to think about expected size usage, so does the easy thing and grabs CHM).
Scaling CFL-Reachability-Based Points-To Analysis Using Context Sensitive Must-Not-Alias Analysis Basically computes a must-not-alias approximation; where it holds do not bother to run a more expensive pts-too analysis. Some experiments are medium programs (SpecJVM - 7 progs, Dacapo - 4 progs, 8 other progs). Can compute pts-to graph in 2x to 5x smaller. Compute time for the pts-to analysis is 3x faster on average than prior work (but still in the several minutes range)... but what do you do with the info? (this is my standard question when presented with better/faster/new&improved points-to work: now that you have the info how much does it *really* help speed up programs? especially compared to having strong typing in the first place) Nice presentation of a pts-to analysis, but hit right at my sleepy jet-lag afternoon siesta; had trouble staying awake.
Intel NePalTM - Design and Implementation of Nested Parallelism for Transactional Memory Systems. (Intel) Issue is using locks instead a XTN; the locks are for the nested parallelism, but want the whole operation to be atomic. Azul makes no attempt to use the HTM to span across threads. Can't work for us, because the threads need to communicate through shared memory - and that shared memory will blow out the HTM. This is Intel's compiler-gen'd STM system for C++. Handles either optimistic (read-friendly) or pessimistic (write-friendly) concurrency; roll-back & retry. Sigh - shows basically no speedup over plain locking. (and perhaps not the best name....)
Debugging Method Names What's a naming bug? A mismatch between the name and what the code does. Break names apart (CamelCase in Java), do some grammer abstraction. getLastName ==> get - last - name ==> (verb) (adj) (noun), etc Then also do semantic analysis of code. Look for things like: has a loop, has a incoming parm, returns a value (from multiple points), includes some run-time checks in the loop, etc. Decide it's a search loop, returning a result or NULL. So come up with lots of data points for names & methods. Then mine the rules from a large corpus of code. E.g. a "contains-XXX" name method should NOT be a void return, because a "contains-XXX" name implies a question with an answer. In fact, in probably ought to return a bool. Then with rules, apply rules to a program and report violations. Turns out pretty easy to suggestion candidate replacement names. Example: public void isCaching(bool b) { this.caching=b; } Method isn't asking a question, it's setting value. Tool suggestions name should be "setCaching". Example: public bool equals( Object o) { if( this==o ) return true; if( o instanceof Value ) return equals((Value)o); return equals(new Value(o.toString())); } So not really an 'equals' method, because it makes a new 'Value' object More examples, ends up finding interesting bugs as well. Hard part is to decide whether or not to change the name or change the code. Sometimes the name is correct, but the code is 'wrong' here - and also wrong elsewhere. Basically finding a poor factoring of code.
Mapping & Recommending API usage Patterns API, libraries - often complex, lots of classes & methods; complex; hard to use. People use: books & docs, forums & newgroups; code search engines. Example: we wish to add an item to Eclipse menu. Browse docs; find potential call: "appendToGroup". Google search finds 151 code snippets (at time of paper submission) and 287 code snippets (at time of talk). The returned code snippets at least 2 different API usages. 'MAPO' Tool does search; parses code; clusters snippets based on used; mines patterns; recommends final patterns. Tool needs to inline non-3rd-party methods (scattered implementation). Examples hard to follow... I've been bit here before repeatedly (complex API & no way to learn how to use it easily). So very sad that it's so hard to follow his talk.
Supporting Framework Use via Automatically Extract Concept Implementation Templates Another talk about finding & figuring out reuse of existing code. Choice between Templates vs Documentation had much less impact on dev time than the task's concept complexity.
I skipped all the refactoring talks. Exhaustion is setting in, plus I've seen one or two of these talks before. Went to a couple of talks attempting to deal better with constructors. Mostly it's issues like publishing objects before construction completes, or in Java call an overridden v-call from a constructor (which means the sub-class v-call executes before the sub-call constructor can work on the object). Final fields, read-only objects still being constructed; not-null field properties before you set the value; et al; there's a bunch of problems that stem from constructors not being instantly-quick - and if they take time, what does the object look like in that in-between state. This is a general language issue with constructors. They are a nice notion, but unless they are instantly quick, you have to live with having objects which are not 'complete' or 'constructed' yet and hence do not meet the expected invariants for the object.
Inlining Security Policies Want to inline code into the original program; the new program either runs or is truncated (if a security violation happens). Must show no *new* behaviors from the inlining, also all old behaviors are allowed (except the secuirty ones), etc. Single-threaded versions of these inlinings all exist and are well documented. Multi-threaded is harder: general case isn't solvable (probably because the monitors themselves become non-deterministic). But if the monitor is race-free, (and some other easy properties), then you can do it. Monitor is normally a state-machine.
Cliff Click Talks About Azul My keynote went well enough; lots of Q&A from the senior people in the audience afterwards. We had to stop the Q&A after running quite a bit over. I also got a lot of positive comments afterwards, so I think people really enjoyed the talk. I'll add that I enjoyed both Simon Jones' and Dave Ungars' keynotes also. This ECOOP was unusual in having 3 "gentleman hackers" give keynotes; we're all three both very scholarly and very prolific programmers. Doug Lea added in a later email: "Cliff: definitely the best talk I've seen you give (which is
quite a few). Someday you'll have to figure out how you packed in so much technical content yet still fully captured attention of people supposedly without the background to understand most of it." High praise from Doug Lea, so I musta done something right. :-) Trip home is uneventful, except for Paris's Charles DeGaul airport being the usual zoo. I get up at 4:30am and am in a Genova taxi by 5am for the 1/2hr drive to the airport. Except that there's no traffic at this hour and the driver thinks he's Mario Andretti. We make the trip in about 10 minutes. So I'm way early into Genova's airport for a 7:10am flight. No power connections for my laptop, for-pay wi-fi (not worth it), no shops are open - not even coffee. No air-partnership; I cannot check in for my USAirways flight from the Air France desk in Genova. I blog for the next hour and a half. The flight leaves on-time and the two hours flying are very smooth. Kudos to Air France. Then we land at Charles DeGaul and the madness begins. We have to walk down the plane stairs and across the tarmac and about 1/4 mile more to the main terminal - but actually it's the small-plane terminal. After figuring out I'm not in Terminal's 1, 2 OR 3, I get instructions to take bus Number 3 to the automatic train and take it to Terminal 1. Of course the buses & bus stop are labeled- with LOTS and LOTS of numbers and plenty of French - takes a question of the drivers to figure out which is the #3 bus. The bus ride is longish (for an airport bus) and uneventful and then I find the train. Thankfully the train is clearly marked and again a longer ride to Terminal 1 than I expect: Charles DeGaul is a really spread out airport. Terminal 1 is a total zoo; I see lines of people snaking out all over filling the large hall, hundreds of people long. Three times I ask people about which line they are in; none of the lines are marked and all snake back and forth through out the hall; there's a steady stream of people traffic cutting through all the lines and the lines branch and rejoin helter-skelter. Finally I find the 'entrance' to the line for flight #771 to Charlotte - just next to the line for Philly which clearly orbits the entire hall. The Charlotte line is actually fairly short, maybe 20 people long, and after 10 minutes I'm talking to the agent. No problem checking in and then I head off for gate 55 (cutting through the Philly line again). Then it's customs (another 10 minute line), and then a small shopping zone - I figure I'll come back to it for a snack (no breakfast yes; nothing was open in Genova). But then I discover there's yet another line for security. This one is about 20mins long and no way I'm going back for my snack and then back through security. I figure it's like US airports - once past security there will be food and such... WRONG! It's just the end of the lines; I'm in a seating area with about 10 gates and absolutely no way to get a drink or snack - or bathroom near as I can tell. It's now 10:10pm and it's taken me a solid hour to navigate Charles DeGaul to find my gate. As I type, I'm standing by a laptop charging station (no chairs nearby), in the hopes I can do a little something on the laptop during the next 8 hrs of flight. Later: flight is fairly smooth. They serve some meals; I have another flight through Charlotte, NC to San Francisco. Home at last! For about 24 hrs that is, then I'm on vacation in Texas with my entire family. More plane flights for me! Yumm, I can already taste the small salty snacks they are going to serve...
Category: Web/Tech | Permalink
| Comments (8)
| TrackBack (0)
JavaOne Slides
July 21, 2009
Slides for my 2009 JavaOne talks:
I'd like to say more on JavaOne, but I'm (re)discovering that if you're a speaker at JavaOne it's really hard to get into the other talks & technology being presented. I had my head full with my 3 talks, making sure they looked good, were relevant, etc. To be sure, there were plenty of talks I wanted to attend, but I never seemed to find the time. :-(
And of course shortly after JavaOne, I went to 2009 ECOOP in Italy for a week, and now I'm on a long deserved vacation. More blogging soon, on ECOOP at least, Cliff
Category: Web/Tech | Permalink
| Comments (10)
| TrackBack (0)
IFIP WG2.4 Trip Report
May 18, 2009
2009 IFIP
IFIP WG2.4 remains one of my
favorite mini-conferences to attend. The group is eclectic &
varied; the discussion vigorous. Everybody is willing to interrupt the
speaker & ask questions. The talks are usually lively and pointed,
and this time was no exception.
The conference was held in Fort Worden, a historic
Big-Gun (planes rapidly obsoleted the guns) fort of the Pacific
Northwest, defending Admiralty Inlet
and our shores from the marauding Russians ... or Japanese... or
somebody, surely. Nobody ever attacked that way, so clearly the fort
was a big success.
Meanwhile, the fort has been turned into a park and it's beautiful.
The whitewashed buildings with green trim remind me of some mythical
50's era America that never was, the kind that shows up in the backdrop
of various movies or maybe "Leave
It To Beaver". We stayed in the officers quarters and held our
meeting in the old mess hall. The officers lived quite nicely; the
quarters are actually old duplexes in great shape, with high ceilings
and crown molding and carved brass fittings everywhere - the house is
clearly larger than my house and with a vast green lawn ta-boot. In
San Jose such a place would be upwards of $2m...
Fort Worden overlooks Admiralty Inlet from some high bluffs (classic
fort-like location) and the surroundings are all gorgeous. The weather
varied from overcast & chilly to clear & crisp (alas, overcast
and chilly was the mode on our all-day "whale" watching expedition. We
did lots of watching but saw no
whales of any kind. And we bounced over heavy seas and got sprayed
with frigid water most of the day.)
Since the Hood Canal bridge was out I got treated to a 3 hr drive the
long way around the peninsula, but somebody else drove and the scenery
was worth looking at. The airplane portion of the trip was easy.
As usual, my reports are (very) brief, stream-of-consciousness
writing. I often skip around, or brutally summarize (or brutally
criticize - but this group is above the usual conference par, so less
of that is needed). I skip talks; I sleep through ones I'm not
interested in, etc, etc. Caveat Emptor.
Synchronization Talks
Atomic Sets - another nice concurrency alternative
Fork/Join for .Net
occam / CSP - Uber lite-weight user-thread scheduling
Java Analysis Talks
SnuggleBug - Better bug reporting for Java
SATIrE - A complete deep analysis toolkit for Java
PointsTo - A comparison of various points-to
algorithms for Java
analyzing JavaScript - Has a type-lattice as
complex as C2's internal lattice
Performance variations in JVMs - Noticed the wild
variations in run-to-run performance of JVMs
Misc
Cheap Cores discussion
Compiler-directed FPGA variations
Clone Wars - huge study on cut-n-paste code in
large projects
Business Process Modeling
Security
PDF Attack - "Adobe Reader is the new
Internet Explorer"
The Britney Spears Problem
Squeezing your Trusted Base
Who you gonna call?
Evolving Systems
Swarms in Space
Finding emergent behavior
Frank Tip,
Type-Based Data-Centric
Synchronization
Locks are hard. Auto-inference of correct sync in O-O programs.
Even with no data-races still have atomicity races.
Instead of code-centric locking paradigms, do data-centric locking.
Tied to classes with fields, leverage O-O structures
Group a subset of fields in a class into a *atomic_set*.
Then find units_of_work for that atomic_set.
Add language features to Java:
atomicset account;
atomic(account) int checking;
...
atomicset logging;
atomic(logging) Log log;
atomic(logging) int logCount;
So add 'atomicset' keyword, 'atomic' keyword and strongly type
variables with atomic_sets. Units_of_work expand to reach the
syntactic borders anytime an atomic variable is mentioned. Can expand
unit_of_work by hand also.
Can at runtime cause atomic_sets to alias - to union together. Used on
ownership types (e.g. HashMap uses HashMap$Entry, so units_of_work on
HashMap or HashMap$Entry become units_of_work for both).
Atomic_sets can be unioned; e.g. for LinkList the entire backbone of
the list is one atomicset. Lots of discussion - points out the issues
with hacking all the Code Out There.
Has proven strong serializable properties on atomicsets. Proven
various kinds of soundness, including *internal* objects are
only
accessed by their owner, etc.
All concurrency annotations appear in the libraries but not in the
client code - shows a nice concurrent example with 2 threads
manipulating the same linked list.
Has a working version using Eclipse & taking annotations as Java
comments. Using j.u.c.Locks; can do lock aliasing by simply sharing
j.u.c Locks. Applied this to a bunch of Java Collections; 63 types
& 10KLOC of code. Needs about 1 line of annotation per 21 lines of
code for the libraries (and none in
clients)
About 15-30% slower than using plain sync wrappers on a single-threaded
testcase; probably because the "synchronization" keyword is very
optimized. Performance is similar or slightly faster when running with
10 threads & high contention.
Daan Leijen - The
Task Parallel Library
for .Net
- A concurrent parallel library for .Net (think dl's Fork/Join).
Infrastructure: locks/threads/STM, async agents (responsiveness), task
parallelism (performance)
Gives a demo of a parallel matrix-multiply w/C#. Very ugly code.
So instead, write a nice combinator...
Standard task-stealing work-queue approach. (private set of tasks that
don't need CAS, public set of steal-able tasks that require CAS).
Points out that want fine-grained parallelism, so can do work-stealing
and do auto-load-balancing.
-- Effectful state. Statically typed, with typed side-effects. Type
signature includes all side-effects (yes!)
int fib(int n) { return n<=1 ? 1 : fib(n-1)+fib(n-2); }
But this program
int fib(int n) { return n<=1 ? print"hi",1 : fib(n-1)+fib(n-2); }
Does not type in Haskell - need the IO monad. And now the function
returns 2 results: the IO "hi" and the int.
So want some syntax to allow "auto-lifting" into a side-effect world -
same function syntax, but auto-includes the IO monad as a result.
Then can prove that some stateful computation does not leave the
function - that the state is entirely encapsulated & unobservable
outside. So can do type'ing which allows some state-ful work
internally, but it doesn't escape the function "box".
int fib(int n) {
ref f1 = 1; // side effects, but entirely internal to the function
ref f2 = 1;
while( n ) { sum = *f1 + *f2; *f1 = *f2; *f2 = sum; n--; }
return sum;
}
Peter Welch, Multicore
Scheduling for
Lightweight Communicating Processes
Google: kroc install
Carl Ritson did all the work...
Cliff Notes: Insanely cheap process scheduling, plus "free" load
balancing, plus "free" communicating-process affinity. Very similar to
work-stealing in GC marking, but with "threads". OS guys take
note!
Scheduler for OCCAM/PI. Process-oriented computation (millions), very
fine grained processors. Message passing over channels, plus
barriers. Uses CSP and pi-calculus for reasoning. Dynamic creation
& deletion of processes & channels.
Goal: automatic exploitation of shared-memory multis. "Nothing" for
the programmer to do: the program exposes tons of parallelism.
"Nothing" for the compiler to do - it's all runtime scheduling.
Goals: scalable performance from 10's of processes to millions;
co-locate communicating processes - maximize cache, minimize atomics
(lock-free & wait-free where possible). Heuristics to balance
spreading processes around on spare CPUs and co-locating the
communicating ones.
Usual sales job for CSP... which still looks good.
A blocked process requires only *8* words of state (no stack for you!).
Scheduling is cooperative not preemptive. No complex processor state,
no register dumps. Up to 100M process contexts per second. Typical
context switch is 10ns to 100ns. Performance heavily depends on
getting good cache behavior. So processes are batched. Stacks are
spaghetti-stack (no that's why no stack saving...). These 8 words
include the process run-queues.
Batching: sets of e.g. 48 processes are run on a single core; that core
is the only core touching this queue & round-robins them until the
entire batch times out. So no cache-misses. A Batch is a Process,
plus a run-queue.
A Core is a Batch, plus a queue of Batches. Again, no other cores
touch this list either.
A Core with no Batches does work-stealing. There's some proposed
batches available for work-stealing from each Core. Cores need some
atomic ops to manipulate the queue of Batches for these steal-able
Batches. Careful design to ensure all the interesting structures fit
in a single cache line.
So until you need to enqueue, dequeue or steal a Batch - it's all done
without any contention for process scheduling. These Batch-switching
operations require atomics but are otherwise fairly rare.
Key bit missing: scaling up beyond 8 or 16 cores with some kind of log
structure. But otherwise looks really good for very low cost context
switch & work steal. Lots of questions about fairness.
Batches are formed & split by the runtime (no compiler or
programmer). Aim is to keep processes that communicating on the same
core (so it's all cache-hit communication), but spread the work across
cores. Initial batches are kept small to encourage spreading. Channel
ops require atomics, but typically only 1 (sometimes 2). Choosing
amongst many channels might require up to 4 atomic ops. Each time
communication happens on a channel, the sleeping process is pulled into
the run queue Batch of the awake process - thus naturally grouping
communicating processes. But as batches get large, need to split.
Periodically can observe that if all the processes in a Batch are all
communicating with each other in some complex cycle - then the spare
run queue empties from time-to-time (with one runnable process). So as
long as the run queue empties now & then, then the Batch should
stay together. But if the run queue never empties, then
split-the-Batch. Split off a single process into a new Batch - which
will drag it's connected component together into the new Batch. But
eventually if the original Batch had 2 complete strong graphs,
they will now be split - and can be stolen onto other CPUs.
So this kernel is about 100x less overhead than the same design done on
a JVM.
:-)
Satish Chandra,
Symbolic Analysis for Java
"Snugglebug"
Communication between the programmer & the tool tends to stop
early: "Dear Programmer: you have a null-ptr bug here"...
Going to the next step: "Dear Mr Programmer: try running your method
with *these* values and you'll see the bug...". Programmer:
now that I think about it, these values will never arise.
We just forced the programmer to define the legitmite range of values.
More: Bug Report is more believable if given a concrete input is
given.
API hardening: gives the preconditions under which this library
will throw an exception
Unit-test generation: inputs needed to make a specific assert
fail.
Works by weakest-precondition. Given an assert in the program, work
backwards computing the weakest-precondition at each point until you
reach the top of the method. If this chain works, then hand the WP to
a decision procedure; sometime the decision is "I don't know" but many
times you get an explicit counter example.
Good match for the counter-example problem.
- Yes loops are a problem, but don't really need the *weakest*
precondition.
Most Java features can modeled:
- heap r/w, arrays, allocation, subtyping, virtual dispatch
- exceptions: many bugs are here so we modeled exception handling
try/catch
very closely.
Have a path-explosion problem. For heap updates, even a single path
includes branch effectively depending on aliasing.
Has to do strong interprocedural analysis. Lazily grow the call-graph
consistent with known symbolic constraints. First look for a conflict
on a call-free path, then on call-paths but without needing to look
into the call. Finally, if I must peek into the call - but use the
symbolic info the reduce the set of possible call targets.
Generally, no fixed-point. Cannot expect programs to give loop
invariants.
Work-list management: avoid various common pathologies.
Goal is to find *a* contradiction, not to explore all options.
SATIrE
http://www.complang.tuwien.ac.at/satire/
Shape analysis, points-to, flow-sensitive, context-sensitive,
annotations, can read them in & out. Applications to e.g., slicing.
Read C/C++ - large complex tool chain. Reads in program x-forms in
either a functional programming language or in Prolog. Can combine the
tools in any order at this point (very well integrated). Output is
C/C++ w/annotations at each step (or internal "ROSE" ASTrees, etc).
Has interval analysis for integers, shape analysis, loop (bounds)
analysis, points-to, etc. Gives example of reaching-def's in the
functional programming language. It's a domain-specific language for
writing compiler analyzes. Can specify the strength of the analyzes in
many cases: e.g. the depth of the allowed call-chain for
context-senstive (set to zero for context insensitive).
Shape analysis gets may/must alias pairs, etc - and all results put
back into the C program as comments/annotations.
Basically describes a large complex tool chain for doing pointer
analysis on C & C++ programs. The tool chain looks very complete
& robust.
Welf Lowe -
Comparing, Combining &
Improving Points-to Analysis
...skips explaining what points-to is.
Clients: compilers, JIT compilers, refactoring tools, software
comprehension.
Different clients have different needs: speed, precision, security.
Granularity & conversatism also matter. Clients might want, e.g.,
dead code elimination, or removing polymorphic calls, or rename methods
& calls. Other people want object granularity - escape info &
side-effects (i.e., compilers and JITs). Also static-vs-dynamic
analysis. Dynamic analyzes typically run the program and produce
optimistic results.
Doing careful experiments of 2 analysis w.r.t. accuracy. Hard part:
can't get a perfect "gold standard" for real programs. Hard even for a
human to inspect a proposed "gold standard" and declare it "gold" or
not. Some special cases are easier: conservative analysis (no false
positives) & optimistic analysis
(no false negatives). Can under- and over-estimate the gold standard,
but this still messes with the results (messes with detecting which of
the 2 analysis is better w.r.t. accuracy).
E.g., ask the question: is it worth increasing k>1 (i.e., becoming
context sensitive vs staying insensitive). Checking size of computed
call-graph - smaller graph is more precise. Very small increase in
accuracy. Sometimes made things worse - because the original analysis
was optimistic sometimes.
Open question: what is the use of an Uber Points-To - it's definitely
beyond the point of diminishing returns for JIT compilers.
Anders Moller -
Type Analysis for
JavaScript
Not much tool support for catching errors during programming.
No nice IDE support. Many bugs are browser specific, but we focus on
language bugs.
Most JS programs are quite small, so throw a heavy-weight analysis at
it.
Object based, dynamic typed, proto-type inheritance, 1st-class
functions, coercions, dynamic code construction, exceptions, plus 161
built-in functions. Types include null & undefined, primitive
types, some properties "ReadOnly" and "DontDelete".
Broken things: tracking down an "undefined" and where it came from,
reading an absent variable or absent property, etc.
So do a dataflow analysis over the program, including a lattice,
transfer functions, etc. Handle interprocedural stuff. Tightly
interleaved control flow & data - so need a complex lattice.
Lattice is roughly as complex as C2's lattice, with more stuff on the
allocation sites and less on the class hierarchy. Also then model all
the abstract machine state (C2 does this as well). Full context info
(total inlining at analysis time except for recursion).
Cliff Notes: Nice idea: for each allocation site, model both
the *most recent object* from that allocation site, and also
model *all the rest* from that site. You can do strong-update
on the singleton most-recent object, but weak update on all the rest.
Rhodes Brown -
Measuring the Performance
of Adaptive Virtual Machines
Noticing the badness of the "compile plan" - 1 run in 10 had a >10%
performance reduction.
JIKES - always compiled, no interpreter. On demand at 1st invoke.
Focused recompilation, lots of GC variants.
Points out the places where you get variations - stack-profiles for
estimating the call-graph, basic-block counters, etc.
Causes of variation?
So far: GC & inlining
Concerned that i-cache placement is a major issue of instability.
*No* correlation between total amount of compile-time spent vs
score. You must compile the hot stuff, but apparently tons of
not-so-hot code is getting compiled and it doesn't help the score any.
Kurt William -
lots of cores for cheap
General discussion round, not really a "talk".
Expect >32 core/die in 5 yr, >128 core/die in 10 yrs.
What's "New" and what can we expect in 5 years?
Theory: contention is that there's nothing new here since the
70's & 80's (concerning concurrency). (general agreement on
incremental progress but not breakthrough).
Systems: Lots of libraries (TBB, pthreads), lots of platforms
(webservers), all OS's support multi-core. JVMs are New in this
space. Might see OS-support for user-level thread scheduling.
Languages: Support been there, but it's locks & threads - no
good. Support for TM is coming, but will it help? Then there's CSP...
Applications: Not much new here.... more games, more
interactive more user-interface. Old stuff: lots of HPC, speech,
image. Matching: easy parallelism; lots of interesting apps here.
Uwe Kastens,
Compiler-Driven Dynamic
Reconfiguration of Arch Variants
What is "reconfigurable"? "Usual" approach - HW guys compile to a
netlist, do some mapping, place & route, make a FGPA/silicon, while
the SW guys write some assembler on the FPGA - then at the end they try
to run the SW on the HW. Goal is to run some complex function faster
on a FPGA than in normal ops.
Our approach: read source code, look at program structure, find hot
code & hot loops; compiler knows how to switch the hardware between
variants; then it compiles the code for a particular FPGA variant for
each loop, and reloads the FPGA with a new function between loops. But
need to limit the FPGA to certain variants so don't actually need to
reload the whole FPGA.
Fixed small set of architecture variants. Keeps cost to reconfigure
the FPGA very low. Compiler can choose the variants. Example: FPGA
runs either a small CPU or a small SIMD CPU or a MIMD CPU. Or
reconfigure the pipeline structure, or the default register bank access
patterns. In the different configurations can use more or less power
for problems that are more or less parallel. e.g., in SIMD mode have
all ALUs active but turn all but 1 decoder.
Pick variants & design space to those known to be efficiently
solved by a compiler - gets a fast compile time & good results
there (use a human elsewhere).
Can map any arch register to any physical register - so can e.g. have
CPUs share registers, or do unbalanced register allocation - if some
CPU needs more registers in it's portion of code than others. Does
some register allocation affinity games, to avoid having to reconfigure
register layouts between blocks of code.
Nicely flexible overall design. Instructions name arch registers &
ops, but also can re-map real registers & ops periodically. So use
targeted registers & ops for a particular chunk of code, then
reconfigure when the next chunk of code is different enough from the
last chunk- and the cost to reconfigure is lower than running with a
less-well-targeted registers & instructions.
Ugh, can't read data charts - has some 4 embedded programs. Seeing 30%
reduction in execution time AND power cost, by reconfiguring. Code
size is a little larger, because either poor parallization(?) & the
reconfiguration overhead.
Jim - Clone Detection
Exact clone'd code, white-space only changes, near-miss clones.
Bug in 1 deserves inspection in all the rest.
Lots of cloning studies out there -esecpially in-depth of the linux
file system. But no wide-range-of-systems studies. We did a bunch of
open-source systems, C, Java, C#. OS, Web apps, DBs, etc. Used their
own clone-detection, as a combo of AST-detection and text-based
detection.
Text-based comparisons are sensitive to formating changes. But AST
parsing is expensive (requires tree compares) and does not find "near
miss" clones. Ended up doing "standard pretty-printing" - which uses
the AST to normalize the text. Then do text-based comparisons. Also
handle small diffs in the middle of clones.
Did 24 systems, 10 C, 7 Java, 7 C#. Linux 2.6 kernel, all of
j2sdk-swing, db4. Varied from 4KLOC to 6MLOC.
Java methods have more cloning than C methods - 15%-25% (depending all
the diff threshold) of Java methods are clones; C varies from 3% to
13%. C# is more like Java, until the methods allow more diffs - and
then there is a lot more cloning. Swing is hugely clone-ish, >70%.
C systems - Linux has the same cloning behavior as postgres & bison
& wget & httpd. Java has no more clones as you allow more diffs
as compared to C - I suspect good Java refactoring tools detect &
remove clones earlier.
C# shows a LOT more clones as more diff's are allowed. Maybe
cut-n-paste is a coding style? C#'s clones are larger methods as well.
Something like 10-20% of all files in Java have 50% of functions are
clones. In C# it's more like 20-35% of all files.
Files with 100% cloned code: 15% of C# files are totally cloned. For
Java whole file cloning is fairly low.
Localization of cloned pairs: are they in the same file? same
directory? different directories?
Thomas Gschwind -
Business Process
Modeling Accelerators
Business Process Modeling, patterns, x-forms, refactorings
Process Structure Tree -
- Process is a large unstructured graph. Visio-style graphical is
error prone for large process models.
But can take the structure graph and build a tree from it.
(The graph is all fork/join parallelism).
A small part of the tree requires state-space exploration.
After the tree is correct, can apply patterns & Xforms to the tree.
Essentially Petri Net edit'ing on a structured graph/tree using Eclipse.
100's of users inside of IBM; generated business models are much higher
quality, make them 2x faster.
Philip Levy, The
New Age of Software
Attacks
Why?
- Not enuf to do?
- Crime is Big Business
- Usable exploit: can be sold for $75K to $100K
- So called "security researchers". Some are respectable, most are
going after the $100K payoff.
"Visual Virus Studio" - actually exists. PDF is a common vector for *targeted*
attacks. "Adobe Reader is the new Internet Explorer".
Looked at all the fixed vulnerabilities in Adobe Reader
- Need to do a better job teaching students on how to write secure
software
- Stop writing in assembly language
- ANY unmanaged code, include C/C++, should be banned
- Languages need to be upgraded to be more secure
- New constructs & type systems
- More sophisticated testing tools & methodologies
- Better analysis tools for legacy systems
How big is Adobe Reader?
- 100K FILES
- 42MLOC, no comments or blank lines: 25MLOC
- Estimated 75 man years of development per year over it's lifetime
Problems are clustered - e.g. one module has had 25% of all security
problems.
Top 6 modules have 80% of security problems.
Break down problems another way:
- array index OOB - 32%
- unknown - 27%
- intege overflow - 14%
- range check missing - 8%
- capability check missing - 4%
- out of memory 3%
How found:
- Fuzzing - 42% (creating illegal inputs and throwing them at the
program)
- Code Review - 35%
- External report - 12%
Some examples:
- asserting against null, but no asserts in debug build and null
possible in release build.
- Failed to check for error returns, sometimes even not catching
return value
- Casting "attacker controlled values" (values from the PDF file)
to valid enum
- Or using attacker-controlled value for loop & array bounds
- Sometimes a complex pre-condition needs to be checked first
- Some of the fixes are still incorrect
A Quick List:
- Badly designed lib funcs (strcat, sprintf)
- Microsoft's banned API list is 4 pages
- AIOOB
- Calling thru func ptrs
- Integer over/underflow. Many protocols send N followed by N
bytes. Send a really big number that overflows (after scaling for
malloc) into a small number - then malloc succeeds, but the data
overwrites the buffer.
- Pointer math
- Uninitialized vars
- Dangling ref pointer
Basically: it's BASIC STUFF that goes bad, not fancy stuff.
All memory integrity or soundness type stuff.
But it goes beyond AIOOB -
e.g. Name string like "Robert); DROP TABLE Students;-- "
Jeff Fischer,
Solving the "Britney Spears
Problem"
Users assigned roles; roles are assigned privileges.
Benefit: users can come & go without changing policies.
"Patients" can view medical records, but "doctors" can change them.
Can add annotations to Java that specify roles.
Problem: lack of expressiveness; solution: logging. Violation happened
(people saw Britney Spears' data), but the logging allowed catching the
violators. Another solution: manual checks. But no compiler support,
so missed checks.
Problem: Lack of explicit access policies.
Really: need a way to "type" data (and hence methods using the data)
with some kind of security/access policy. More problems: often checks
are hoisted to entry point for efficiency, and then violate policy
later. Relying on manually-inserted runtime checking.
Our stuff: parameterize classes by some kind of index value defining
Role.
Each method includes an explicit policy (statically checked).
Annotation processor can either statically type-check the policy, or
insert runtime checks.
Jens Knoop -
Squeeze your Trusted
Annotation Base
"Interesting program properties are undecidable!"
Compiler optimization - not so worried: just lose some optimality.
Results still usable; but if we have better results we can produce
better code.
Worse-Case execution time analysis for hard real time: Bang. Your dead.
WCET bounds must be safe (soundness), and wish to be tight (optimal or
useful).
E.g., a bound of 1hr for a mpeg3 audio-player interrupt is safe, but
useless.
Need, e.g., all loop bounds.
Need user annotations.
User annotations are untrustworthy, but must be trusted.
Squeeze the trusted annotation base - reduce user chance of screwing up.
Try to replace trust by proof - as a side-effect generally improving
optimality. Then take advantage of trade-off between power &
performance of analysis algorithms.
Use model-checking with care. If something can't be bounded
automatically, as
the user for some bound - then prove it. BUt also believe the user's
bound isn't tight. Try to prove a tighter bound with the
model-checking using binary search. Indeed - can do away with user
assistance often; just guess and binary search to tighten it.
Summary: can often completely empty the trusted annotation base,
usually shrink it; often improve WCET bounds.
Start with the cheapest techniques (constant prop), move up to interval
analysis, then symbolic bounds & model checkers...
Joe Newcomer - Who do
you trust to run
code on your computer?
(Joe is a Windows attack expert)
Attack Vectors - autoexec.bat, .cshrc & friends, autoplay,
device-driver installs (Sony), mac resource attacks
Engineered attacks: phishing attacks, downloads, activeX, client-side
scripting
Lots of "security" designs & codes are done by junior programmers
w/out adequate oversight. Pervasive attitude: "software is secure - as
designed, as implemented, as maintained". Small+fast is all that
matters... not...
Rant against all modern OS's, talking about #of security patches in
Mac, Linux, unix, Windows...
Client-side scripting is vulnerable. Flash, Office, Adobe Reader, etc.
What are we teaching people about security? How about non-CS types?
Graphics designers, website designers...
Stefan
Jahnichen, Swarm Systems in Space
Build very small satellites, swarms of them to watch the Earth. They
will spread out after launch to watch e.g. weather, bush fires, etc.
Takes about 2 days after launching to get a communication link.
Satellites have good earth coverage; orbit at 500 to 600km, orbit earth
every 3 hrs or so. Wide range of sensors amongst different satellites.
Have a Real Swarm of cameras; map a Virtual Swarm onto that; map
mission goals on the V.S.
Optimized synthesis of consensus algorithms.
Special OS - distributed OS basically. Used to maintain swarm
integrity (keep satellites together).
Peter Welch,
Engineering Emergence...
Thesis: in the future, systems will be to complex to design *explicitly*
Instead engineer systems to have the desired properties "implicitly".
"mob programming", ant mounds, etc.... simple rules of behavior leading
to complex behaviors that emerge. Emergent Behavior
Mechanisms Design - (game theory),
- Rational actors have local private info
- Emergent: optimal allocation of scarce resources
- Optimal decisions rely on truth revelation
Swarming behavior (flocks, etc)
- local actors, local behaviors
- Emergent "swarm" behavior
- UAV swarms & autonomous robots
Social communication (gossip, epidemic algorithms)
- large, ad hoc networks
- emergent: min power to achieve eventual consistency
- low power, low reliability sensors & data propagation
Design for emergent behavior
- occam process per "location", a 1024x1024 grid has 1mil processes
- each location has a com channel to it's 4(8) neighbors
- occam process per "mobile agent"
- agents register with local process (or up to 4 if it's
straddling a local region) & local process tells the agent what's
nearby.
No idea how to design high-level behavior from low-level rules, but
busy experimenting and looking for some cases.
Cliff
Category: Web/Tech | Permalink
| Comments (2)
| TrackBack (0)
DaCapo Trip Report
May 5, 2009
http://www.cs.tufts.edu/research/dacapo
DaCapo is a small (but growing) research group focused on GC &
Memory Management issues. They meet about every 9 months, and the
meeting bounces around depending on who is hosting it. People mostly
present work-in-progress, and the audience is up front with their
critiques... often interrupting the speakers or getting sidetracked
into loud off-topic conversations, i.e., it's a fun boisterous crowd.
DaCapo is in the Boston area this year, held at Tufts University's Medford campus
(everyone's heard of Harvard and MIT, but how about the Berklee College
of Architecture, Fisher College, Boston University, Suffolk University
and dozens and dozens more? - Boston has more colleges and universities
per square mile than any other place I know). As usual for me
traveling to Boston, I skipped the car and just used public
transportation - the T
works marvelously well (as compared to e.g. VTA; crossing the
entire town from Logan Airport stuck way out in the water to Tufts some
8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose
on the VTA takes something like 2 hrs).
Anyways the weather was supposed to be highs of 65 and rainy with lows
in the 40's - so I packed my parka. Turned out the weather report was
a little bit off, with highs in the upper 80's and sunny. Man
did I roast. But I got out and enjoyed the sun - we (the entire DaCapo
group of 40 or 50 people) walked about a mile to lunch each day, plus I
walked to and from my hotel & the T stop (also about a mile).
Enough ruminating, on to the talks!
As usual, I pay attention to some talks and chat w/neighbors through others. I brutally summarize. I'm opinionated & narrow focused... so Caveat Emptor.
Change Tracking
What changed? Who did it? When? Points out the flaws of using 'diff'.
Logical-Structure diff. Auto-magically spotted systematic change
patterns.
It's a better code-review diff.
Can report diff's as reports like "for all XXX, replace XXX with YYY
except for ZZZ_XXX_ZZZ", etc. Vastly better error reporting!!! Vastly
better diff reporting!!!
Cliff Notes: We must get this gizmo and check it
out!!!!
Paper report: http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf
Sample output: http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm
Inferred Call Path Profiling
Very low cost path profiling. Hardware counters have very little cost,
but provide little context. Software is reversed: high context &
high
cost.
Cliff Notes: Compare this to call/ret RPC hashing.
Cliff Notes: Bond & McKinely does something similar.
Instead, look at number of bytes from main() to leaf call on stack.
Basically PC & SP (offset from stack base) provides all the context
we need. Dynamically detect unique PC & SP (they did it in a
training run) pairs, and reverse them to get a call-stack.
Lots of duplicated stack heights, how to tell them apart? On SpecInt
unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call
stack). But huge spread; really sucks bad on some benchmarks.
If have a list of entire call-graph, then can remove ambiguity by
growing some frames. But requires some global analysis to know which
frames to grow in order to be unique. "Solved" problem with a simple
greedy search. Compute all call-stacks over whole program run, pick a
conflict, grow some frame, re-compute & keep if more PC/SP pairs
are distinct. Repeat until it's rare enough that can live with dups.
Cliff Notes: The whole-program ahead-of-time thing isn't feasible for us. How to do this on-the-fly during tick-profiling?
Hash PC/SP, find a
tick record, 1-time-in-1024 check for dups (crawl stack the hard way
and confirm call-stack), otherwise assume unique & increment tick
record. If MISS in hash table, then crawl stack the hard way. If
record is labeled as having dups, then crawl the call-stack & find true record?) Actually, probably expect the
call stacks to be common, but the PC's to be fairly unique. So really
asking for unique RPC+SP (not PC+SP).
His experiments: collect PC/SP pairs about 100 times/sec. 0.2%
slowdown (must have had careful measurements to detect that!)
Breadcrumbs - Recovering Efficiently Computed Dynamic Calling Contexts
Also builds on probabilistic calling contexts.
Given a (nearly) unique ID (say, computed at each fcn entry via a
simple hash on the return PC). IN the past, reversing these unique
ids: use a dictionary (very large & high cost), use a
calling-context-tree (smaller, but requires you to track where you are
in the CC-tree at runtime). Forward search, but undecidable and
limited options for pruning.
Attempt to reverse the hash fcn (but only when we need to take the
semi-unique id) and convert it to a call-stack. Straight-up reverse is
no good, but is easy to guide search. Callers of leaf call themselves
have structure - e.g. expect to see a call-op from the call-site, etc.
Still need a list of all call-sites (generally available anyways
because of inline-cache structures). His function is "p' = (3p+c)", where "c" is RPC? Requires a handful of instructions on function entry.
Cliff Notes: If this reverse search is successful most of the
time, it says that during a profile tick we can record just the PC/SP
(or RPC/SP/PC?) and then reverse them to the entire call-stack later -
on demand when RTPM requests. So a 'tick' is no more than record a few
bits (and handle buffer overflow), but no stack crawl.
Cliff Notes: combined nicely with 'null-assignment-stack' info,
ie. collect the entire stack and 'color' all NULLs with a unique stack
id. Then if get a NPC can report both the NPC at the crash site, but
also the call-stack that assigned the NULL in the first place.
Transparent Clustering - Phil McGachey, Eliot Moss, Tony
Hosking
Attempt to auto-distribute multi-threaded programs. Pure Java, no JVM
hacks. No programmer involvement. Dynamic class re-writing.
JVMs can
vary. Hardware can vary.
All machines on a network run RuggedJ nodes. RuggedJ includes a
rewriting class loader, some runtime libs, and some
application-specific partioning strategy.
Rewriter: generate lots of classes & interfaces; especially hook
methods & fields. Hook into the runtime. Abstract over object
location; remote & local objects.
Ex: Standard class "X". Rewrite into a X_Local and X_Stub classes. X
objects on Node A are of type X_Local. On Node B, X objects are really
X_Stub objects where all the calls are remote calls. Both X_Local and
X_Stub implement the "X" interface. Some other Y_Local can refer to
X_Local on Node A, but refer to X_Stub on Node B. If want to migrate
from Node A to Node C, want to indirect the X_Locals; so for mobile
objects there's an indirection layer called "X_Proxy", that then be
switched from pointing to an X_Local vs an X_Stub (if the object
migrates). Non-mobile objects skip the X_Proxy indirection layer.
Inheritance & interfaces are maintained in the rewritten classes.
Static classes conform to the RuggedJ model (I assume replication is
allowed?). Arrays are also wrapped to support indirection.
Constraints: Native code breaks everything. Try to not rewrite stuff
referred to by native code (maybe can intercept all native calls and
rewrite args?). System code is also an issue: loaded by the system
loader, is not rewritten. (Can they use -Xbootclasspath?). But most
code run is actually system code! (eg. String libs are very common!)
So must deal with system code without changing the JVM.
- Can wrap system class objects. Must unwrap when passing to
system or native code; must re-box on return. Re-wrapping is
expensive: need the exact same wrapper, so need some kind of
hash-lookup to do re-wrapping. High overhead, so avoided where
possible.
- Extend: some classes can be extended and preserve the original
class & signatures. Extended class follows the RuggedJ model but
also the original system model. No need to unwrap/re-wrap. Does not
work e.g. when system code generates the object (since sys code makes
instances of the original and not of the extended superclass).
- Some system classes are only ever referenced from user code; can
just dup the code out of rt.jar and make a user-class (which now gets
rewritten by the normal RuggedJ model).
- Immutable classes. Just replicate.
Apparently able to cluster some complex programs.
No idea about performance as compared to explicitly clustered apps.
Jikes RVM Native Threads
Was Green threads, now Native threads
(everybody else is dropping green threads too....)
Motherhood & Apple Pie.
Computer Science as an Experimental Science
Reproducibility? Experimental bias?
Show how these problems are solved in very-high-energy astrophysics.
(1) No publish without multiple colleague confirmation.
(2) Publish
(3) MUST also now publish reproductions or non-reproductions's
of other peoples' experiments (or lose credibility as a research
entity). But bar to publication is lower, and expectations are lower.
i.e., people expect to publish in-progress work
Biggest comp-sci problem now: lack of infrastructures.
Examples of infrastructure bias: green-vs-native threads, some barrier
styles are hard to use (.Net embedded objects), stack-scan (OVM makes
fast stack scanning at the cost of everything else being slower).
HotSpot (high-value compiler, good GC, good most everything) vs JIKES.
Cliff Notes: This was actually a really fun talk (and lots of
discussion) despite my paucity of notes.
Web 2.0 Profiling
Do web 2.0 workloads differ from legacy, e.g. SpecJVM98, DaCapo suite.
Relies on browser to run Flash, JavaScript, REST, AJAX.
Benchmarks: IBM social networking software; Java PetStore 2.0,
AJAX-based RIA using JEE5, uses Lucene search.
Apply workload: viewBlog (from 9 blogs), createBookmark (total 14),
selectTag (PetStore 9), etc.
Workload mix can be varied, but based on commonly observed usage
pattersn from internal deployment of Lotus Connections).
Wanted to remove DB from the workload, put DB on ramdisk. Same for
LDAP. Used Grinder to generate load. Main server running
J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5
1.66GHz. Fairly reasonable setup for a small web/app load.
See lots more chattiness; very short fast requests, high interrupt
rates, smaller requests but many more of them. Looked at low-level
profiling. See fewer I-cache misses as compared to a large legacy
apps. Better I-cache behavior. But data-cache miss rate is still very
significant.
Also different transactions have different scaling limitations. For
PetStore the Java persistence layer started falling off after 4 cpus.
Stream Processing
"Spade"
- Cluster stream processing at IBM. Continuous high-level
streams. Stream arrives for weeks & months; want to process and
then emit stream data continously. Want to express the problem with a
language.
Priorities - fast streaming on a cluster; then Generality; finally
Usability. Beyond StreamIt.
Like StreamIt, works with streams that
can be split & joined. Language is typed; stream contents are
strongly typed.
Demystifying Magic - High-Level Low-Level Programming
Seen
this talk before, but this time given by Steve Blackburn. Nice
historical perspective on writing system software in C vs asm - and the
parallels to writing sys software in Java vs asm.
Java-In-Java
- Key Principle: Containment. Write as much as possible (99%!) in
pure Java, and dive into the low level "magic" bits as little as
possible.
- Extensibility: Requires change quickly, languages change slowly.
- Encapsulation: contain low-level semantics
- Fine-grained lower of semantics: minimize impedance, seperation
of concerns
Framework:
- Extend Java semantics; intrinsics
- Scoped semantics changes
- Extend types; un/boxing, ref-vs-value, architecture sizes
- Pragmatic: retain syntax where possible to keep front-end tools
working
Introduce raw pointer type: "Address" looks like a Java type.
E.g.: Address a =
...; ... = a.loadByte(offset);
Semantic Regimes:
Additive: "its ok here to have unchecked casts"
Subtractive: "no allocation via new allowed", similar to HotSpot
VerifyNoGC.
Allow "unboxed" types - no v-table, so it's a bare primitive type. But
this isn't the default (but can be asked for).
Abstraction results are good: can implement a JVM on bare hardware and
with a simple change essentially virtualize the same JVM inside a GC
debugging harness running inside another JVM (where the virtualized
JVM's heap is just a big array in the outer implementing JVM), etc.
Grace, Safe Multithreaded Programming for C/C++ - Emery Berger
Seen the paper before, but this is the talk by Emery.
Fake sequential semantics on stock multi-threaded hardware using
fork/join and process protection.
Grace is a pthreads replacement: "g++ -lpthreads" becomes "g++ -lgrace".
Speedups: in the absence of common benign races, Grace run programs run
about 10% slower than raw pthreads - but have full sequential
semantics. It's "as-if" all the parallel programs are run sequentially
immediately at the thread spawn point.
CAN'T run applications where threads run forever, i.e. reactive or
server apps.
Works well with fork/join, Cilk, TBB, etc.
So thread spawn becomes a unix fork with COW, use mmap for allow memory
sharing. At join points, smash in join'd threads memory updates via
mmap. Also need scalable thread-local malloc heap, plus aligned
globals (to avoid false-sharing at the page granularity), plus some
improved I/O.
Some simple measures remove nearly all false sharing. Big one:
everybody mallocs into their own space. 2nd big one: spread the
globals out 1 per page.
Thread-spawn is as cheap as a fork on linux (has experiments to
show it). Due to thread-cpu affinity, if you spawn a bunch of threads
they share a single CPU. At the low end of thread-grain-length, fork
is faster than spawn because the scheduler immediately spreads the
processes across CPUs, whereas threads share for awhile. So fork is
actually faster than thread-spawn for awhile.
Real performance limiter for Grace is conflicts & rollbacks, and
not thread-spawn overheads.
Performance is much better than e.g. STM, because after the 1st touch
on a new page (and the SEGV & accounting), all accesses on that
page run at full speed.
GC Assertions: Using the GC to check heap properties
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&rep=rep1&type=pdf
Global knowledge is required to understand seemingly local properties.
JBB example:
Order order;
for(...)
order = ....;
delete orderTable;
? are all 'orders' dead here?
But actually 'order' is held by the last-customer-transaction (as an
optimization, or convienvence for customer?). Leads to a leak of
'orders'. Really: programmer does not understand program, too complex.
Add assertions to the code, and use the GC to check the property.
Cliff Notes: Azul is already gathering global heap properties
during GC - but not asserts. Gathering liveness & points-to info.
If points-to included a handful of sample real instances (and all the
sames linked when possible), would be a very powerful way to instantly
check some heap properties.
Sample asserts:
'assert_dead(p)' - expect to reclaim 'p' at the next GC cycle
Or region-collection: start_region(); ... assert_all_dead();
Or shape property: assert_unshared(p);
Or ownership properties (members of a collection are 'owned' by the
collection)
If an assert fails, then do a slow crawl and provide the full path
through the heap showing the failure. Asserts only checked at GC
points.
Proving Proving Correctness of Abstract Concurrency Control and Recovery
- Trek Palmer & Eliot Moss
Transactions - closed nesting has issues
Open nesting & Boosting need from programmer: conflict info &
roll-back info.
These are hard to provide correctly.
This work: a description language of abstract data types or structures.
Can describe the conflict predicates & inverses.
Can prove correctness of the conflict & inverse expressions.
Working with the abstract description of the data structure,
and
NOT e.g. with the real Java implemention. E.g., no loops, no
mutations, no recursion. So the language isn't a 'real' language, but
can describe many kinds of ADTs in code that looks sorta like Java.
Output isn't a functional program, instead the output is the result of
using a SAT solver to prove correctness.
XTN-semantics allows conflict detection to be proven correct pairwise,
instead of having to do full interleaving. Formally, a conflict
predicate is correct iff it is true when the operations do no commute.
The conflict predicate tells when 2 XTNs conflict, and the inverse
allows an optimistic XTN to rollback.
Obvious use-case: use this tool to write a transactional-version of
NBHM or other JDK concurrency utilities.
Hard Real-Time GC Scheduling
- Periodic Scheduling
- GC runs at highest priority, but periodically yields to mutator
- Metronome
- Slack-based Scheduling
- GC runs at lowest priority
- Can be preempted at any time
- Makinac (Sun RTS)
- Work based Scheduling
- GC runs at allocation time
- Problem with allocation-rate jitter
HRT - Deadlines must never be missed AND must be verifiable
analytically that no deadlines are missed. Systems tend to therefore
be very simple, so that they can be proven.
OVM - like Metronome. Dynamic defrag; arraylets; Brooks style barrier;
replicating barrier/; incremental, concurrent, supporting slack-based
style scheduling.
Understanding Performance of Interactive Applications
Typical profilers give the wrong numbers: they report total time spent,
not time spent between mouse-click and page updated.
AWT/Swing & SWT - similar: single-threaded, gui thread in "event
loop", relies heavily on Listener model. So profile using gui
call-back Listeners between user events and the eventual display change.
Really simple idea: profiling & event visualizer tool between click
& view times.
Program Metamorphsis
Trouble w/refactorings: must preserve program behavior with each step.
But if we want multiple refactorings then at each step along the way
fixup code (needed to preserve program behavior) pollutes the code.
Want to do multiple refactoring steps at once - while leaving the
program possibly broken in the in-between steps.
So compute a program-semantics (names & def-use chains), and let
the user make partial refactoring steps and then declare "I think I'm
done" - and the system compares the new program-semantics with the old,
and reports if they are not equal.
So actually comes up with primitive not-quite refactoring steps, which
he will compose to build larger "real" refactorings, or compose lots to
build a multi-stage refactoring.
Instruction Selectors
HotSpot C2 uses a BURS framework. Very hard to optimize & debug.
So these guys have a semantics which can span both the ideal IR and
real machine instructions. Describe both using "CISL" and the tool
does the mapping.
Once the tool has a mapping, the user provides an adapter that converts
the mapping into the code needed by the compiler back-end. This would
replace e.g. the ADLC. Gives examples of machine encodings for PPC and
Java bytecodes.
CISL is given an IR tree and produces a target tree with equivalent
semantics.
Not sure it's entirely better than BURS (maybe no worse), but I'm still
sold on rewriting in e.g. C++/Java hand-written greedy match rules.
These code-gen-gen tools are pain for long-term maintenance.
Do Design & Performance Relate?
Is fast code required to be ugly? Is beautiful required to be slow?
Pick 200 code metrics. Also pick some performance metrics (cycles,
single threaded, objects created, etc). Could not follow talk... or at
least he wasn't very clear with the objectives & results (if any). Interesting idea though.
Category: Web/Tech | Permalink
| Comments (1)
| TrackBack (0)
Odds & Ends
April 15, 2009
Various pending goodies... Load-Linked/Store-Conditional vs Compare-And-Swap Pretty much all modern CPUs include some form of instruction for doing an atomic update - required for shared-memory multi-CPU machines (X86 has lots and lots!). There was a long period of debate in the CompSci community one what constituted the "minimal" needed instruction to do useful multi-CPU work. Eventually the community has decided that the Compare-And-Swap (CAS) instruction and the Load-Linked/Store-Conditional (LL/SC) instruction combo both are (1) sufficient to do useful work ("infinite concensus number") and (2) relatively easy to implement in real hardware. X86's, Sparc's and Azul's CPUs use CAS. IBM's Power, Alpha, MIPS & ARM6 all use LL/SC. ARM5 only has an atomic-swap.
LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms. With LL/SC, you first Load-Linked a word. The hardware marks the cache line as "linked". You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter). Finally you issue a Store-Conditional to the same address. If the cache line is still "linked", the store happens. If not, then not. The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line. Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented). "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail. I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure. The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC. With CAS, you typical first load a word with a normal load, then manipulate it. Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory. CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic ABA bug. In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless all copies of A die first - including the one loaded before the CAS. Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS. Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively. With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link". E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS). These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations. Nothing in the academic literature requires any sort of global ordering on these atomic instructions. Instead the usual academic assumption is that all memory operations are strongly ordered - which is obviously not true on all modern hardware. Practitioners are required to insert memory fences as needed to achieve the desired ordering. Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do. Azul's CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations. (As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking. Report back the %-tage of dropped counts and the throughput rate. Then try it with an Atomic counter. My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless). I am less familiar with the fence properties of common LL/SC implementations. Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc? build.java Some time ago I reported on a build tool I've been using.
Currently 'build' is being used to build HotSpot internally to Azul. We've got it building 400+ files, plus build rules for building portions of the
JDK, compiling w/gcc & javac and also 'javah', tar, jar, strip,
binary signing, etc. In addition to the typical parallel-make functionality, it includes auto-discovery of c++ header files, intelligent load-balancing of big builds, etc. It's blazingly fast, it's easy to hack (being written in plain-olde Java). I hunted around awhile and couldn't find a good place to dump a medium-large blob of code, so here's a sample build.java in 4 parts:
Part 1 Part 2 Part 3 Part 4
A "HotSpot -server" aka C2 compiler IR visualizerNo I didn't do it! This deserving grad student did it. Here is the reference to the visualizer for the ideal graph of the
HotSpot server compiler:
http://ssw.jku.at/igv/master.jnlp
It is a JNLP application hosted on a university server
(http://ssw.jku.at/General/Staff/TW/ has also a test file to
download), but the easiest way to get a first impression.
The source code of the tool is hosted on http://kenai.com/projects/igv
The server compiler instrumentation is part of OpenJDK and also
included in Sun's weekly builds of Java 7.
More C2 Goodies
John Cavazos wrote:
... One of the things I am
currently looking
at is determining the right phase-ordering of optimizations applied to
a particular program being optimized. I have some nice (un-published) results for
JikesRVM,
but it would be nice to replicate the research for HotSpot...
I have strong arguments for nearly all orderings of our passes, so I'm
curious to know about your phase-ordering results.
The only obvious transform we might be missing is PRE, but our peephole
opts pretty nearly subsumes PRE (they are not mathematically equivalent
- I can write programs where either PRE or the peephole stuff makes
progress against the other). In practice, PRE will find essentially
nothing once the peephole opts are done. You'll notice that we do the
peephole pass alot; it's totally incremental and provably linear
bounded. In other words, if there's nothing to be gained then there's no
cost in trying. The peephole pass includes amongst other things
all pessimistic forms of copy propagation, value equivalence,
constant propagation (especially the null/not-null property), constant
test folding, repeated test folding, dead code elimination,
load-after-store opts (aliasing analysis is included for free during
building of SSA form), algebraic properties, and lots more.
For HotSpot, the optimization ordering is:
- Build an inline tree, including class hierarchy analysis. This is the
one pass I'd be willing to move, as it happens too early.
- (a single unified pass:) parse bytecodes, inline, build SSA form (the
IR remains in SSA form always), do peephole opts over SSA form. This
pass typically costs 40% of compile-time budget.
- Fixed-point the peephole opts over SSA form, once all backedges are
known after parsing.
- Loop opts round 1: "beautify loops" (force polite nesting of
ill-structured loops), build a loop-tree & dominator tree, split-if
(zipper-peel CFG diamonds with common tests, plus some local cloning
where I can prove progress), peel loops (required to remove
loop-invariant null checks)
- Fixed-point the peephole opts over SSA form
- Loop opts round 2: "beautify loops" (force polite nesting of
ill-structured loops), build a loop-tree & dominator tree, lock
coarsening, split-if & peel - but if these don't trigger because there's
nothing to gain, the do iteration-splitting for range-check-elimination
& a 1st loop unrolling.
- Fixed-point the peephole opts over SSA form
- Conditional Constant Propagation (the optimistic kind, instead of the
pessimistic kind done by the peephole pass)
- Iterate loop (split-if, peel, lock coarsen - but these typically never
trigger again and take very little time to check), unrolling & peephole
passes, until loops are unrolled "enough". On last pass, insert
prefetches. Typically this iterates once or twice, unless this is a
microbenchmark and then unrolling might happen 8 or 16 times.
- Remove tail-duplication, and a bunch of other minor code-shaping
optimizations e.g. absorb constant inputs into deoptimization-info in
calls, or commuting Add ops so that 2-address machines can do
update-in-place.
- Convert "ideal" IR into machine code IR.
- Build a real CFG for the 1st time, including a dominator tree, loop
tree. Populate with frequencies from earlier profiling.
- Global latency-aware (loop-structure-aware) scheduling.
- Replace null-checks with memory references where appropriate.
- Register allocate. Internally this has many passes. This pass
typically costs 40% of compile-time budget.
- Sort basic blocks to get good control flow ordering (forward branches
predicted not-taken, backwards predicted taken, etc)
- Some last-minute peephole opts.
- Emit code into a buffer, including OOP-maps & deoptimization info.
Travel Plans
I got too much travel coming up! Apr 30, May 1st - DaCapo in Boston. Favorite small group; GC &
Java focused. Website:
http://www.cs.tufts.edu/research/dacapo/. I had to turn down the invite to an SSA seminar in France because the dates conflicted. Very sad. I hope one of the attendees will post a trip report.
May 11-May 15. IFIP WG2.4 (International Federation of Information
Processors, Working Group 2.4 - a *really* old European group with a
random mix of industry & academia.) It's a nice group to preview
JavaOne talks, and always the meetings are a week-long rambling
discussion in some quaint resort.
June 2-June 5. JavaOne. I owe slides for 3 talks by end of this week. 'enuf said.
July 6-10th - ECOOP, as a paid-for invited speaker. A free vacation to
Italy! :-) Cliff
Category: Web/Tech | Permalink
| Comments (10)
| TrackBack (0)
YATR2YAC (yet another trip report to yet another conference)
April 5, 2009
'Tis the Season to be Conferencing... Blah, I'd rather be hacking, but here goes... I went to EclipseCon, to CGO and back to EclipseCon. Here's how it can down: I got invited out of the blue to talk at EclipseCon by Darin Wright, who had seen me speak earlier at the JVM Language Summit. Like a dummy I accepted it instantly instead of checking my calander. Naturally it conflicts with CGO - and I was on the CGO PC this year, so I felt obligated to attend CGO as well. Turns out my talk at EclipseCon got scheduled for Wednesday, and I decided I could skip CGO on Monday (I think there was one talk I was interested in) - so I scheduled a 1-day only trip to Seattle. Monday I wandered EclispeCon at the Santa Clara Convention Center, Tuesday I took the 1st plane up to Seattle and the last plane back, and Wednesday it was back to Santa Clara to give my talk and surf the EclipseCon booths. The flights on Alaska Air were smooth and easy. No WiFi at CGO - the Hotel wanted some totally outrageous price for conference WiFi (I think the local chair was saying $50/head prepaid?!?). (Remember my grief over Hotel WiFi in the last blog?) So over lunch break (which was Chinese food and I don't like Asian food of any kind) I headed over to Starbucks. Turns out Starbucks will happily give you 2hrs of free WiFi a day from ATT - provided you register $5 on a Starbucks card with them. Yes!!! Suddenly I have a $5/month nation-wide WiFi plan! I snacked on some salad'y thing, drank my triple-latte, and happily Surfed the Internet until show-time at CGO came 'round again. 1st up: EclispeCon, mostly cause I got nothing to say. These guys are in a totally different ball park from where I live; they mostly deal with complexity control (or at least 1/2 the talks essentially amounted to adding, changing, or removing a control layer). Makes me wonder if the cure is really better than the disease. I gave my talk, but it was too much talk in too little a time window. Both me and the attendees would have been better off if I could have spent more time on the details. C'est la vie. Other thing of note: the web site for EclipseCon is vastly better than the usual academic conference I attend. Also the staff was exceptional friendly and easy to work with.
Next up: CGO. Despite my 1 day fly-in-fly-out I got a lot more out of CGO. ---------------------- Keynote: Vikram Adve Missed! Arrived barely in time for the last 10 minutes. That's the penalty for attempting a 1-day fly-by. I hear he said something good about Azul systems (he did ping me the day before for some details). Basically it's a Call To Arms for improving C compilers and C runtimes. He does say that fully auto-parallelizing code is probably a bad idea, and that we need to change the language instead (something I very much agree with!). ---------------------- Revisiting Out-Of-SSA Translation Cliff Notes: HotSpot "comes out of" SSA form during register allocation, where it really preserves SSA form throughout the entire compile process, but adds register annotations as part of reg alloc (and inserts live-range splits as part of spilling). So mostly this talk is about stuff HotSpot has been doing for years... but there's a key new piece.
Linear-time colaescing congruence classes; speed & memory footprints. Why is coming out of SSA hard? (Cliff Notes: It's not!)
And now they show what HS has been using for years: insert copies as needed for a simple out-of-SSA translation, then aggressively coalesce away the extra copies.
Nice diagrams for IFGs + copy-affinities.
Notices the value-equivalence property for defining IFGs (also done by HS).
Ok: here's what's new compared to HotSpot -server:
Fast interference test w/out building an IFG! Only need to test against closest dominating var? So scan the dom tree, with a stack-based DFS traverse of dom tree. Uses 2 sorted lists (?); more complexity for value-based IFG tests. But no building (and rebuilding?) the quadratic IFG. Claim 2x faster than Sreedhar for coming out of SSA. Big memory footprint reduction, because no need for live-ness sets & IFG!!!
40% of HS's total JIT budget is in reg-alloc! A faster reg-alloc is definitely interesting. A large part of that 40% is building the liveness & IFG sets
---------------------- Wave Propagation and Deep Propagation for Pointer Analysis
Wave Prop is very fast and uses huge memory (working on parallel version)Deep prop is faster for small & precise memory sets - works better on desktop (and JIT?)There's a zoo of ptr-analysis names:- context-sensitive or not- flow-sensitive or not- field-sensitive or not- inclusion-based (Andersen-based) or unification-based (Steesgaard-based)Cliff Notes: HotSpot -server uses context-INsenstive, flow-INsensitive, field-Sensitive, unification-basedBuild a graph of constraints, and do contraint-propagation.Code: "a = &b;" Constraint: "a superset {b}" - means b element-of P(a)Code "a = b" constaint: "a superset b"- means P(b) contained in P(a)Code "a = *b" or "*a = b" - complex constraint, must add edges to the graph as constraint propagation adds new sets to the nodes.Now build a graph: nodes are P(a) - sets of things pted-at by 'a', and edges represent subsets.Cycles are common; collapsing them is crucial for performance. All things in the cycle must be equal. Uses Nuutila to find strongly connected components in a single pass (so does HotSpot - Vick's algorithm). After collapsing cycles, you get a DAG - and Nuutila also produces a topo-sort as output.(Still looks like a complex diff/delta propagation algorithm - but makes me wonder if standard bit-vector stuff is faster).Not a bad-looking ptr-analysis. Might work in a JIT.Manesh suggests using BDD's.Cliff Notes: Real question of course, is if all the extra alias precision actually buys you any performance. PLDI paper of a few years ago suggestions otherwise: that HS's style of alias-analysis gets you 95% of all the possible gain. ---------------------- Fast Track: A Software System for Speculative Program OptimizationChen Ding: another process-level fork/join speculative program optimization (contrast this to Emery Burger's "Grace" system). Gives sequential program semantics. Add annotations to "suggest" fork/joinpoints. Actually fork a process, use COW & mmap to guarantee sequential semantics in the face of races (end speculation, or force commit-ordering on speculative threads, etc).New Idea: User can write an alternative implementation of some module; run both versions in parallel; keep the faster version & kill the slower version: "return do_whichever_is_faster( methodX(), fast_methodX() )"New Idea: "fast_methodX" might be speculative and *wrong*. So run slow-version to completion and do error checking after the fact. But let the fast version run ahead (and if it's faster), it can spawn off new speculative fast versions. When slow version completes, use page-level protections to know what's been changed, and verify the correct semantics of the fast version. Run all the correctness checks in parallel with all the normal fast-case stuff. Needs some careful throttling.Example using gcc's "mudflap". Array under/overflow, leaks, un-init'd memory, etc. Run 2 versions of all code: the existing code is "fast track" and the mudflap version is the "slow correct track". Goal: same speed as the "fast track" code with all the safety checks of the "slow track using mudflap". In some cases can get real nice speedups.Cliff Notes: can we turn on mudflag for Azul? ----------------------
Scenario Based OptimizationProfile the hot methods. Tweak various parameters (e.g. loop unrolling, or cache-tiling parameters). Profile both old and new versions. Pick the best winner, and stick with it for awhile. Keep profiling continously, and switching code as needed. Some JVM JITs do this. HotSpot only sorta does.The programmer has to pick what the new & old versions are, and the runtime picks the fastest one (no support for correctness - what if one of the versions is busted? Or tickles a timing bug which breaks things?)Seeing some nice gains, some of the time. Avoids losing when turning on some aggressive optimizations. But generally unrealistic.----------------------
Software Pipelined Execution of Stream Programs on GPUs
Basically StreamIt on a GPUStreamIt is much easier to use than e.g. CUDA or OpenCL.StreamIt model is a collection of connected serial parts; connected via pipes. Due to splits & joins, might need to exec some parts more often than than others (e.g. part Y "eats" 2 ouputs from part "X" before outputing a result itself). Get complex patterns of execution; difficult to map down to GPGPUs. Have bounded-buffer problems: need to map to a steady-state schedule using a fixed count of buffers. Then can arrange to do no allocation during execution.- Profile each filter standalone, to figure out it's work load.- Configure selection- Add constraints - Use CPLEX to solve- Output CUDA code.Looks like a classic software-pipeline, just done on a GPGPU. Compute minimum initialization interval, compile resource constraints, etc. Have issues with data-layout, so can do parallel data access.Impressive compile times (30sec to 178sec).Can get pretty good utilization on some kernels, so seeing 100x speedups.Cliff Notes: My usual caveat for domain-specific languages applies here: if your domain fits StreamIt's model - then StreamIt is the fastest language for you! If what you got doesn't look like StreamIt, then Too Bad. ---------------------- Stream Compilation for Real-Time Embedded SystemsStreamIt Again, but for Cell.New constraint: out of memory per cell (distributed memory).Different from GPU - GPU has shared memory, but very slow if doing bank-conflicting access - but can pile on more memory (more buffers) for some GPUs and less for others. Not so for Cell - must have a uniform split of memory.Thanks, Cliff PS - Thanks Doug for inspiring this blog
Category: Web/Tech | Permalink
| Comments (8)
| TrackBack (0)
Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report
March 19, 2009
Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report... sigh, all good hacks must come to an end, or at least be put aside so Real Life can intervene. I went to DC for the ASPLOS and VEE conferences and all you get is this lousy blog.
2009 ASPLOS & VEE Conferences
ASPLOS & VEE are co-located in D.C. this year. Mostly the trip
itself was uneventful. I was very busy either attending the conference
sessions or preparing for my own talk and I didn't get around much.
The flights themselves were mostly uneventful (ComAir's HQ in the
midwest lost power from a tornado, so I was stuck on the tarmac at JFK
for an hour waiting to get permission to fly into DC - first time I've
been delayed by bad weather when the weather between the source and
destination airports was perfect). The most notable point was that the
hotel used WayPort as their internet access. WayPort sucks horribly;
I've used them in the past and swore I'd never use them again - but I
forgot to ask when checking in to the hotel! WayPort charges
$13/night, and service is filtered through some horrible layout routing
through TimBuk Too - average ping times to hit a server in the DC area
was about 600msec (ping times from AT&T wireless in the conference
room were ~50ms). To add insult to injury, some other conference
attendees stayed at a nearby hotel - which was cheaper and had free
WiFi AND the free stuff was faster! Definitely having WayPort as your
WiFi provider is sufficient for me to change hotels.
Cliff Notes: I had a wonderful 1.5-long 1-on-1 talk with Monica
Lam (http://en.wikipedia.org/wiki/Monica_S._Lam);
she didn't know what Azul does.
Cliff Notes: More interesting talks with Jon Shapiro (http://www.eros-os.com, who's
busy implementing a high-level low-level language for high-availability
systems).
Reality Check: yes again a Mac failed to be able to output to the
projector, and we all waited while the presenter hotswapped to a PC
Caveat Emptor: As usual for these talks, I only attend some sessions, I
write
stream-of-consciousness reports which I only lightly edit and the
reports tend to be brutal and short. They are roughly sorted here by
my interest level, and not by e.g. time order.
General Cool Stuff:
Keynote: Monica Lam's Vision of the Future of
Human / Computers Interaction
Getting Wrong Answers Without Doing
Anything Wrong - all
your base are belong to us
Text Processing in Parallel in a Register
- wicked fast
Nanoscale Computing - wicked scary
Demystifying High-level Low-level Programming
- how to write Java-in-Java
Asymmetric Chip Multiprocessor - 1st
good use I've seen for these
Leak Pruning - Reported on before, but
still good
TRIPS - Funky dataflow/VLIW hybrid (but not
the weirdest I've recently seen!), too bad it didn't pan out
HTM Talks
Which I read but due to a snafu failed to attend the talks. :-(
Maximum Benefit from Minimum HTM - not so
minimum (bigger than Azul's HTM!) and lousy results
Rock's HTM - a peek at Sun's experience with
Rock's HTM; mostly Rock's HTM is really really weak
Multithreaded Bug Finders
CTrigger - The Winner in this section:
cleverly induced
sleeps trigger race bugs reliably
Anomaly-based Bug Detection &
Prediction - It's an, ahhhh, interesting hack
Isolator - silently prevent race bugs
Making Programs Deterministic
Cliff Notes: Major flaw in the thinking here: Real Programs
have Real Random Inputs. Unless you capture & replay ALL external
events, minor things like network-packet-arrival-order and the chaotic
butterfly effect (http://en.wikipedia.org/wiki/Butterfly_effect)
ruin
your determinism. So I think all these papers are tilting at windmills.
Kendo - Stock hardware, but requires a
race-free program
Deterministic Replay - Goes only 1/2-way
to preventing the Butterfly Effect
DMP - Deterministic Shared Multiprocessing -
No replay attempt, just full-speed determinism (assuming no Butterflies)
Other Stuff
Talks I attended but aren't otherwise interested in...
Memory Buddies - Yet Another Use for
Bloom Filters
Live VM Migration - pre-copy push
still beats post-copy pull
Validate - Microsoft patches Are Bad (tm)
Raytracing vs GPUs - Doom!
Commutativity Analysis for Software
Parallelization - the
original work is WAY more interesting than
this follow-on
Predicting Yield of GCs - Yea Olde Hat
Trick
Phantom BTB - winner of the Most
Unrealistic Hardware award
Keynote: Monica
Lam
Why do computers break? When we buy something, we expect it to work.
If it breaks, you buy another one. But with computers they break on
their own (e.g. u$oft update) or you can break it doing "normal" stuff
- and if it breaks it doesn't help to buy another one - the new one is
broken already.
Virtual Desktop - idea: keep the virtual machine in the datacenter, and
do remote login. Does not work well - requires WiFi everywhere.
With each new comp generation, 10x cheaper & 10x more users.
Mainframe-> mini -> worksta -> PC -> laptop -> cell
Also see 3 industries: internet, media/TV, telephone
Also see 3 networks: cable/satillite, broadband/phone, wifi/wimax
Also see 3 computing models: cell (always on, pocket), PC (rich legacy,
offline exec, locality of reference), Cloud (device independent,
central management).
Things like GoogleGears or Virtual Desktop are trying to split the
Cloud/PC model.
Guess where we are going?
1- (Man) a phone-y thing: key, cache, window into digital cloud, ID,
personality, asserts, internet access. Small, always on. Can be lost
easily & replaced easily. No important state, but many
Gigs of flash cache. No real compute power. Small visual display
& keyboard.
2- (Earth) Can (1) plug into a PC. Get big screen & keyboard, get
fast processor. Get hot DRAM, teribyte disk drive.
3- (Heaven) Backs onto the internet/cloud.
This model has some different security issues. DataCenter is probably
secure (or can be?) You hold key in your hands. Can we make path from
hand-held-key to datacenter secure?
Trends - green, Bring-Your-Own-PC, Mac-vs-Windows. End of Corporate PC
model. 80% of people use personal PC for work; 45% of people use work
PC for personal use. 25% of workers change security settings, etc.
SO... want: {User, Ubiquity, IT} Separation of Concerns.
x86 virtualization seperates: hardware from VM; hardware from OS;
different hardware configs; different user roles
MokiaFive company - Desktop-As-A-Service. LivePC images "in the
cloud". Managed on central server. Can go up to any mac/pc and run
"your desktop" from a USB stick. USB stick/phone is a local cache only
- backed up on the cloud. Think of USB memory as a "network
accelerator".
Nice live demo. From USB stick to a selection of PC desktops is not
very long (+2 mins for a new PC to install software from stick). No
hotel wifi? Claims to be sync'ing with cloud???
Nice integrity model: basically re-install the OS clean each time on a
fresh VM-ware layer with anti-viral. Intent is to clean out root-kits
with each fresh login to machine.
Graphics is hard: many games are wired to DirectX, so losing much 3D
performance. Trying to bypass all the virtualization layers for
DirectX. But... guest & host have different address spaces; but
graphics has lots of pointers; fast pointer munging is hard.
Other services: LDAP, revocation, offline access, logging, dlp.
Encyption, Storage Virtualization: reju, incremental & atomic
restore; secure access. Uniquification, single-sign-on.
Seperate-of-Concerns Principles - 90% of code handles exceptions (only
10% does the normal stuff). "Re-Install the OS" as a root-kit
prevention. e.g. system is structured as a fresh-OS-install because of
the rare root-kit issue.
Bootstrap from a small trusted base: baremetal VM; LivePC player;
IT-managed base image (Windows+AntiVirus); finally user-apps & data
Like the Azul philosophy about GC - make the hard case (FullGC,
infection from rootkit) common and then get it right - then all else
follows.
Need to keep users ownership of own data - avoid the Big Brother
mentality. Give users a choice on where data is kept.
Getting Wrong Answers
Without Doing
Anything Wrong
Variations in unix env variable size, can see 15% variation in
performance! Running gcc -O2 vs gcc-O3, 400.perlbench. Bunch of runs
at each setting very repeatable per-run. But with different env sizes
see huge variation on 400.perlbench & 'ibm' benchmark.
Conclusion: setting unix env variables can change your conclusions.
Also, sorting object files on the link-lines differently changes layout
in memory, and in I-cache layout.
Conclusion: linking with 'ld *.o' is different from 'ld a.o b.o'.
Bias holds true even if looked at a large benchmark suite, or across
different CPU #'s from Intel.
See a prior paper (which I now cannot find!!!) which studies this
carefully. Note that there are a bunch of JVM/Java benchmark issues
here that also have been written up recently.
Cliff Notes: I've dealt with both of these in the past; so has
Mark Moir (Sun) and Bean Anderson (SGI). I discovered the need for I-cache layout while working for
Motorola on PPC chips (somebody else actually did the implementation).
The results gave us an average 10% performance boost on a bunch of
benchmarks, plus removed a huge source of variation. Later at Azul,
with our shared low-associativity caches we discovered that stack
placement was crucial to avoid conflicts, fixed with a little random
offset to each stack base.
Cliff Notes: What are other large systemic biases? Are these
two it? And obvious question for a JIT - should I start doing I-cache
layout games for performance? How about making copies of an existing
hot method and running both old & new and doing sampling, and pick
the winner?
Text Processing in
Parallel in a Register
Cliff Notes: this technique is insanely fast
Like SIMD, but char/bytes within a standard GPR. So 4 or 8 chars in a
GPR?
Doubling: field width from n-bit to 2n-bit
Inductive: repeated doubling.
Parallel bit streams; 1-to-1 correspondence to byte stream. Vastly
faster XML parsing; better cache locality; better branch prediction.
So: take 1st bit from each byte, and jam 64 bytes (bit #0 only) into 1
GPR. Take 2nd bit into 2nd GPR, etc. Basically, load 64 bytes into 8
GPRs, but bit-striped. Need architecture support for bit-striping
pack/unpack.
Now break up desired operations (e.g., compare for lexical token
boundaries) into bit-field operations. Can do compares of 64 bytes for
tokens by simple ops on 8 GPRs.
Cliff Notes: really lousy presentation. Hope the paper is
better. Guesswork: take this string "java/lang/Object(Crunkify)" and
find the '(' tokens.
GPR #0 - xxxxxxxx - bit 0 from all bytes
GPR #1 - xxxxxxxx - bit 1 from all bytes
GPR #2 - xxxxxxxx - bit 2 from all bytes
GPR #3 - xxxxxxxx - bit 3 from all bytes
My turn for a lousy presentation: Look at bits for '('. Want some
combo of 0's & 1's from the GPRs. So e.g. XOR w/-1 all GPRs
needing 0's - now need them as 1's. Then do 7 AND's (parallel log tree
AND). Any bits still set represent the '('s. Nicely wide-issues per
64-bytes. Expect to compare 64-bytes for a single char in maybe 6-8
clocks!?!?!?! Plus evil pack/unpack issues/costs.
Wrote some code to demo finding '(' in a long string using maybe 1/10th
clock per character, plus whatever the packing costs are (which
probably dominate!). In this code, 'b0' through 'b7' are 64-bit values
holding bits 0 (through 7) for 64 chars in turn. Here's the
test-64-chars-for-paren check:
// The open-paren char, '(', hex 0x28:
// 0 0 1 0 1 0 0 0
long res = ~b7 & ~b6 & b5 & ~b4 & b3 & ~b2
& ~b1 & ~b0;
// A non-zero res means a '(' is found, and the bit-number is
// the index into the 64-byte array where it was found.
See attached source code for TextScan.java
Nanoscale
Computing
128 bytes of ram; 5micrometers in area
Smaller version: 8 bytes of ram; 1.5micrometers in area; similar in
size to largest virus. Chemical sensors report results by directly
changing bits in the instruction stream (e.g., flipping a bit changes a
JMP to a NOP, in effect making a conditional branch). Can detect
adequately running at 100Hz (not 100MHz). Can write small
simple test/loop codes in this form factor. Can e.g. count molecule
rates or virus rates in a solution (injected in bloodstream).
Cliff Notes: Very Very Scary. See Vernor Vinge's dystopia "A
Deepness in the Sky".
http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky
http://en.wikipedia.org/wiki/Smartdust
Demystifying High-level Low-level Programming
Goal: Abstraction w/out guilt. Separation of concerns. Nice
historical perspective. Issues go back as far as 1975 - using C for
writing an OS was a big issue back then. So it's time for the modern
revolution of the same thing. JVMs for OS's.
Various C support tools: coding styles; Purify; idioms; tools
(Valgrind). Also Systems-languages - BLISS, Modula-3, Cyclone, etc Or
2 languages: JNI+Java JikesRVM extends Java, also OVM, Moxie, DRLVM, C#
(Bartok). Usually in context of runtime research. This talk: Jikes,
Java-in-Java.
Containment:
- minimize exposure to low-level coding
- extensible; requires change fast, but language changes slowly
Encapsulate: contained low-level semantics.
Fine-grained; separation of concerns.
"Address" is Magic, like unsafe.
"Address foo; int x = foo.loadInt(offset)"
More types for "void*" in Java, or generic OOP pointers.
Intrinsics: contract between you & the compiler; at the Java level
it has an empty implementation. But the compiler will always insert
the correct code for the e.g. prefetch intrinsic.
Also annotations to do e.g., turn off bounds-checks or null-checks
(performance hints, but break safety).
Scoped semantic regions - scoped rule changes, e.g. "NoNew" - no
allocation in this scope (HotSpot uses this technique extensively),
or "NoBoundsChecks" in this scope, etc. Unboxed is another annotation
- no class header, no v-table, basically a C struct. Also can do
direct field layout.
@RawStorage - basic static C memory layout. Backed by native width raw
storage, only accessed by native-width C intrinsics.
Can both compile "fast" - down to raw bits, and "virtual" - to some
virtual implementations allowing full GC on full simulation.
Asymmetric Chip
Multiprocessor
Idea: instead of e.g. 16 cores on a die, have 12 for normal work and 1
"fat" core for hot contended locks.
Cross grid of threads vs open-tables in MySQL. (looks like ref-counting
of MySQL tables). Blah - implementation looks alot of NmethodBuckets
in HotSpot. 2-d grid. Updates are complex & require complex
locking.
Look at some locking traces. If we could accelerate the critical
sections then we get a Amdhal's Law approach - removes the serial
portion.
Idea: switch to the one fast CPU on a contended lock.
Seeing speedups of 50% or so for 12-16 cpu system, where program was
seeing e.g. 5x speedup and now gets e.g. 8x speedup. One of the
benefits: locks sections always executed by same CPU, so caches remain
'hot' for hot-lock code.
Cliff Notes: This is the first real use of Asymmetric
Multiprocessors that I believe in.
Leak Pruning
See it & reported it before. Survive OutOfMemory & leaks by
reclaiming predicted dead objects. Used a simple X86 read-barrier w/6%
slowdown. Hoping to see new slowdown numbers for the read-barrier: see
3% slowdown on Core2 & 5% on Pentium4. Using 2 bits per reference;
bit 0 is for stale-refs; bit 1 is for pruned refs.
So instead of throwing an OOM, look thru heap and see if find a
dominate object. Assume dominate object is dead & leaked; "Cut"
link to object and replace with a poisoned pointer. Touching this
pointer
throws the original OOM. But if never touch object, then never throw
OOM. Cut links allows dead object (and his transitive closure) to be
reclaimed.
Try to predict roots of dead datastructures. So profile max time
between last use (using something like GC age bits, but exponential
counter bits). Also need to track which refs are keeping most bytes
alive.
HashEntry -> PreparedStatement -> ParserInfo -> byte[][]
-> byte[]
-> ResultSet -> Field[]
Compute during Phase 1:
Sometimes the HashTable grows, and PreparedStatement gets touched. So
max-stale-then-used for HashTable-> PreparedStatement is 16-32 GC
cycles. But max-stale-then-used for PreparedStatement-> ParserInfo
is 0-1 GC cycles.
Compute during Phase 2:
Count bytes kept alive for each type of thing that has a low MSTU
number. See PPS->ParserInfo has 420M kept alive. So kills all
edges from old PPS->ParserInfo (but not young ones, because MSTU is
kept per object? I now think he keeps MSTU per-class, so no bits in
object but can suffer from common types like Object[] being used both
in leaks and in hot code). Don't worry about DAG sharing - ends up
blaming the leak on 1 ref when it's really kept alive by several refs;
just means on 1st go-around you prune one edge - but leak remains
alive. On 2nd go-around all the leak is blamed on object type#2 and
pruned then.
Results are decent - keeping some complex real apps alive essentially
forever. Performance is decent, but near OOM conditions things slow
down as first come frequent GCs, then LeakPruning kicks in and cleans
the heap back out.
Cliff Notes: Wondering if could do read-barrier check on low
bits by e.g. loading from the loaded pointer with an X86 op which
faults on odd addresses? Means faulting on stale & poisoned refs;
faulting on poisoned is OK (means throwing OOM anyways), but faulting
on stale might be too frequent still.
TRIPS, McKinley et al
TRIPs is an alternative processor architecture, with a square 4x4 grid
of functional units (GPUs); registers are along one "wall", D-cache is
along another 2 sides, etc. Inter-GPU traffic is explicitly scheduled,
but with a data-flow-like flavor (static scheduling, dynamic timing).
I've reported on this project before. This talk is essentially an
end-of-project summary. And the summary is: not much better than a
Core2 (if the chip had been built with Core2 technology).
Each chip has 2 processors; 1M L2. 16GPUs, 128 registers, 32K L1
cache. Processor is distributed. 4x4 grid of GPUs; registers along
one side; D-cache along other sides. Window size of 1024 instructions.
Block atomic execution model; dataflow between blocks. Blocks are
single-entry, multi-exit. 1 non-speculative block; up to 7 speculative
blocks.
Heavy use of predication to build larger blocks. Direct instruction
communication model. Special instructions to handle large fan-out.
Large block of conditional/predicted stuff mapped onto the 4x4 grid of
FPUs; using reads from registers and reads/writes to d-cache.
Larger blocks require more scheduling; harder to place. What is the
utilization of the large instruction window? Both scheduling, plus
e.g. cache-miss & branch mis-predict. Static delay: scheduling
issues; 1-clk to move data from FPU to FPU.
Benchmarks: 53 compiled benchmarks; 13 hand-optimized benchmarks; 2
hand-optimized and hand-scheduled. Gather numbers of the actual chip,
except when needs to use SIM numbers (because chip didn't gather the
right data).
Generally, out of the possible 128 instructions in a block compiler is
able to get about 60 instructions. Hand-optimized code uses smaller
blocks because better scalar optimizations reduces instruction count.
About 1/4 of all instructions are predicated & never executed.
About 1/4 of all cycles are lost due to scheduling.
About 1/2 of all cycles do useful work.
Very large code-bloat: 55% bloat from nops, or direct reg read/write
ops, 18% from predicated useless, 27% from instruction-replication
optimizations (loop unrolling, inlining). 4 SPEC benchmarks have
i-cache miss rates > 10%. Direct memory accesses are much lower (no
spill ops?)
Need variable-sized blocks; would reduce the code bloat from 11x to 6x
(over canonical Alpha code).
L1 is partitioned. Compiler can acheive 2 mem-ops /cycle on vector
add; hand-optimized can hit the max of 4 mem-ops / cycle. Again for
matrix-multiple the compiler hits 1 flops/cycle; hand optimized hits 7
flops/cycle.
Clock speed is 366Mhz; mem speed is 200Mhz; process tech is 130nm.
Compare to a Core2 but boost TRIPS "as if" if gets the 65nm process.
Compiled versions are competitive to Core2. Hand-optimized about 3x
faster than Core2.
What works: dramatic reduction in data-cache & register file refs.
Compiler is able to work (to schedule to this weirdo thing).
But performance isn't very big over a Core2. Challenges: dynamic code
size; dataflow overhead instructions; I-cache efficiency; predication
overhead; generating efficient code for control intensive codes.
Maximum Benefit from
Minimum HTM
Running a (supposedly) simple HTM on a transaction-aware Linux kernel.
Their proposed HTM is somewhat stronger than Azuls', in that they
support parallel independent transactions (the main one, and ones run
inside interrupt handlers), a special version of CAS, and graceful
overflow to software. Also they holding pending transactions in a much
more generous L1 than Azuls'.
Results look pretty anemic and fragile, which is on par with Azuls'
results (4x speedups on 16-way boxes; 60% of exec cycles wasted on
retries). On the plus side, they at least attempt a reasonable
benchmark set.
The Rock's HTM
A look at Sun's Rock's HTM. It's a lot weaker than Azul's - max of 16
(or 32, depending on how many hardware threads you use) pending writes;
aborts on function calls, TLB misses, mispredicted branches & some
fp ops, - plus the same reasons Azul's HTM can fail: eviction from the
L1 (due to transaction being too large, or conflict from another CPU).
Major discovery (just like Azul discovered!): feedback is crucial to
decent heuristics. Your failed HTMs need to report as much as possible
about what went wrong so the heuristic stands a chance of getting it
right. This required a hardware spin (Rock2). One useful heuristic is
the standard backoff mechanism: if there's true contention, then
backing off can spread the memory references far enough apart to allow
the HTM to succeed (presumably succeed with less overhead than a simple
spin-lock solution).
They report hashtable numbers, but wildly concurrent hashtables are
commonly available.
They report using the Rock HTM for doing an N-CAS. This is largely
successful, and might simplify many libraries.
CTrigger -
Atomicity Violation Bugs
Software testing is inadequate. Poor interleaving during standard
stress testing.
CTrigger - for any given program input, force interesting
interleavings. There is an exponential interleaving space for each
input, but stress testing is using random interleavings. Previous
work only controlled the context-switch interleavings. CTrigger
focuses on interleavings that likely hide atomicity bugs and rarely
happen during normal testing.
Focus on interleavings that likely hide bugs: e.g. 2 nearby refs to the
same memory in one thread - and another write to same ref in another
thread. Force these accesses to interleave by injecting "sleeps".
Tried on 7 apps (server, client, scientific) on 8-core machine. 143
total Unserializable Interleavings (program interleavings deemed
interesting from atomicity standpoint). Stress testing only hits about
80 to 100 of the U.I., out of 143, even for many many runs. Basically
some (1/3rd) of the U.I.s are very unlikely to hit with stress testing
- this demonstrates the key weakness of standard stress testing.
CTrigger looks at the access traces and attempts to force the
interleaving. Needs to trim out infeasible paths: locking forces many
orderings to be impossible (i.e., the code is correct and no race
happens). Some refs are inside the same lock; some are Happens-Before
on thread-create or thread-join. Typically 90% of apparent bad
interleavings are infeasible.
Focus on rare interleavings also (not-rare races happen easily in
stress testing). If the "local gap" is small - time between 2 refs in
the same thread is short, then the race is narrow - and hard to hit in
stress testing. Also look at the distance between external ref
distance in time to the gap. Sort these potential races by
likelihood. Instrument the program to add an artificial delay at a few
points to try & force these rare races to happen often.
Then run program as normal. Program speed is similar to unmodified
program, because artificial delays are small and few in number.
Program runs fine on multi-core machines; using the multi-cores as
before. But now the rare races are common, and stress testing finds
them quickly.
Typically find bugs now 100x faster than before!!! Found bugs in all
these apps. Was able to repro the breakage 100% of the time, generally
very quickly.
Anomaly-based Bug
Detection &
Prediction
Defect - the Bug
Infection - Defect is triggered ==> program state is corrupt
Failure - An observable error (Crash, hang, wrong result)
Infections spread over time, or get overwritten (hence bug is squashed
in this case).
Software debugging: hypothesis bug; change code; attempt to repro bug,
etc.
- detecting anomalies; out-of-bounds addresses, unusual control paths,
page faults or redundant computations. So profile program to teach
the detector what the "good" behavior is. Some values are in fixed
ranges during the teaching runs, or some mem locations only accessed
from a few ld/st ops; or abnormal number of loop iterations. Current
crop of anomaly detectors have a fairly high false-positive rate.
- Now isolate relevant anomalies: compute dynamic forward slices from
anomalies to crash point. Ouch - this step looks really expensive.
Rerun the broken program, propagating tokens from anomaly site to crash
site. Use binary search on the anomaly set. Can reduce the count of
false positive anomalies by a factor of 10.
- Now "validate" that have found buggy line of code: NOP that line out
and see that you can execute further.
Still don't know what the fix is, but can point out the step that e.g.
corrupted memory.
Appears to scale up to e.g. gcc (330KLOC) found a bug in distributive
laws causing a loop in RTL tree optimization (C2 used to suffer these
routinely).
Isolator
Isolation in concurrent programs: detect isolation guarantees; i.e.
access in critical regions from non-locking threads (or incorrect
locking). Bad threads can break well-behaved threads, (the usual
motivation talk for debugging data races, I wish people could skip this
or at least limit it to 1 slide & 1 minute). Also STMs only have
weak atomicity, so bugs still happen for well-behaved STM threads. Cliff
Notes: This is the key: some other buggy thread breaks your
well-behaved concurrent thread; studying the break point is useless.
Isolator: Program executes as-if no isolation violation. Uses a combo
of data-replication, virtual memory & code instrumentation. Does
not require the whole program. Requires "guarded by" annotations.
Cliff Notes: just having thread-level TLB protection would let
you do some COW-style (http://en.wikipedia.org/wiki/Copy-on-write)
protections. Just write-protect all pages touched during a locked
region (except allow writes from self). All other threads that attempt
to write without holding the lock will get a TLB fault and can be
stalled (and reported as racing).
Stalling can cause deadlocks, so need dynamic deadlock detector.
(Youch!)
Either allow racing access OR throw a DeadlockError.
This above option is done by Isolator, except requires annotations
rather than lazily discovering touched variables.
Issues: expensive to change protection levels on every lock. Fix:
"lazy unprotect" - only remove protections on-demand, as other threads
need them. Could "shatter" big pages, or split unrelated objects to
private pages (via GC).
Cliff Notes: similar to HotSpot's Biased Locking, but this
wants per-thread page-protections; really want user-mode control over
per-thread page protections.
Issues: page-size granularity - has fragmentation overhead for
fine-grained locks.
Kendo
Only works for data-race free programs. Very efficient deterministic
interleaving. Progress is made with "Deterministic logical clock" time.
Work on SPLASH benchmarks. See 18% to 27% overhead, but deterministic
always. Requires no special hardware.
Each thread maintains a "logical clock" similar to Azul's COUNT
register or TSC on X86. Could be e.g., a "bytecodes executed"
counter. Globally, the thread with the lowest "logical clock" has the
right to do a e.g. lock acquire (non-determinstic action). Assumes all
inter-thread communication happens nicely inside locks.
Enforces lock-acquire ordering, except for nested locks (have to
prevent deadlock, since basically adding another locking layer).
Cliff Notes: Mostly interesting because this could work on
stock hardware (but it is limited to race-free programs which are
precisely those that you'd like this to work with).
Deterministic Replay
"Time Travel" for concurrent debugging
Phase 1 - record non-deterministic events
Phase 2 - replay
Software versions have been around for ~15 years. Now commercially
available. Also available are libraries, compilers & OS's to
help. SW stack flaw: limited to speed of 1 cpu
How do you integrate HW & SW stacks? i.e., If HW has captured a
log, where does it go? Do not want to capture or replay the SW stack
that is doing the SW recording.
Also need the HW to be running across the whole machine, so that all
processes are using it because interacting processes need to all be
deterministic (see butterfly effect above).
So some abstraction: not all elements of the system are being recorded
and can be replayed. May need to simo record multiple user-mode
processes in the same recording "sphere". Only user-mode threads are
being recorded (OS support needed to virtualize HW, but also OS needs
SW record/replay that is process specific). Really he's making a weak
stab at recording everything to avoid the butterfly effect.
Now record 2 different kinds of info:
HW info: memory interleave log
SW info: system stuff log; scheduling; context switch; I/O; signals
Replay also has to handle, e.g. malloc has to be deterministic.
Cliff Notes: but not GC, although HashCode has to be
deterministic.
Key Issues:
1- Deterministic injection of data into controlled "sphere" (I/O)
2- Using fewer processors during replay than were used during recording
3- Emulating vs re-execute system call (e.g. fork requires a real
system call) - but this makes a lot of OS state changes.
Key Issue #1 - Capturing ALL I/O
- Random OS calls write into user-space (i.e. setting return values in
memory). But during replay need e.g. the SAME "fd" as before. But OS
under no obligation to hand out the same "fd" at an open(). Or OS
writes data into a buffer - BUT another user-thread is ALSO writing
into the same buffer.
Key Issue #2 - Not enough CPUs during replay
- I think he's expecting HW to do the memory interleaving on
replay!! But basically, it's time for thread#1 to run - but the OS
wants to schedule thread#2 - just need to block #2 and allow #1 to
run. Very slow (constant OS thread scheduling overhead).
Log size - in bits per kilo-instructions.
Small: 3.15 bits/kilo-instruction.
Recording performance overhead: 20-50%.
Replay performance: using replay HW!?!?! About 2x slowdown during
replay.
How to handle large datasets?
DMP - Deterministic
Shared Multiprocessing
Hard to debug, etc, etc...
So just make the whole thing deterministic. No "Heisenbugs"; can use
time-travel debugging. Can test like it's a serial program. No
attempt at record/replay really - just force same ordering every time.
Cliff Notes: Suffers butterfly effect.
Instructions communicate via shared memory. Only care about
"communicating" instructions. Can serialize - but that blows
performance.
So really want to figure out which instructions really communicate.
Requires HW support.
Conceptually have a "deterministic token" which is handed out to
processors and then is passed around all the CPUs in round-robin
format. This token totally orders all instructions. Each "quantum" is
roughly 1000 instructions, but varies. Now filter out
non-communicating instructions. Allow the non-comm instructions to run
in parallel. Leverage cache coherency to track when cache lines move.
Build a "sharing table" to help order memory accesses.
Now add some TM support to help get more parallelism.
Now come some quantum-building optimizations: break at sync/lock
boundaries.
Communication is often bursty: so break quantum after a series of
stores to shared memory.
Memory Buddies
Sharing memory between VMware VMs(?) Hypervisor finds identical pages
and shares them under the hood. (Probably good for e.g. shared
libraries). See 1/3 reduction in memory usage in VMware ESX server. Get
sharing only if you run similar apps on the same hardware. Placement
is hard in datacenter; lots of clients & software stacks; things
change over time. Manual placement is tough.
Goal: estimate sharing ability across many physical machines and many
virtual machines across the datacenter.
Create 32bit hash per 4K page. 1MB hashs per 1GB of ram. Forward to
other hosts. Comparing lists is slow. So use a Bloom filter. With
hash lists, just sort & compare. With Bloom filters you use a
dot-product of bits; very accurate in estimating the number of elements
shared between Bloom filters.
New VM launch - run on staging host. Figure out it's memory
footprint. Compute it's Bloom filter per page. Compare it's filter
against other Filters on all other hosts in data center. Find closest
fit (on machine with enuf memory & cpu) and move the new VM there.
Live VM
Migration
How to migrate VMs cheaper (force their free memory pools to shrink
before migration; similar to having a JVM do a GC & heap-shrink
before migration, then inflate back to normal size in the new host).
Contrast to the usual approach:
- 1st iteration copy all
- later iterations copy any dirty pages
- after dirty workset is "small enuf", stop VM, copy active dirty pages,
- restart on new host.
Goal: reduce downtime.
Con: Write-intensive apps have long migration times (because of
constant page copying) and high network overhead.
Alternative: copy minimal CPU state & pages, start running on
target. VM demand-pages from source. Also actively push pages from
source to target. Good: copy pages only once; lower total migration
time & network overhead. Works well for both read-intensive &
write-intensive. Con: resumed app suffers while pages demand-paged in.
Can make some predictions on future needed pages from apps page-faults
- likely will want other pages nearby in virtual-memory-addresses
soon. Examples are walking an array or working a stack.
Make a psudeo-swap device in the guest OS for demand-pull pages
(basically swapped over the network). Allows the VM job to keep
running in other threads or processes - only 1 thread is blocked on a
page fault. (Actual implementation has lots of warts due to nature of
existing Xen infrastructure). e.g. we have a much more direcy way to
detect memory-full conditions and politely share memory between JVMS.
Results: feasible, but clearly more downtime over pre-copy in the
common
case. But works well w/write-intensive VMs.
Validate
Patches are dangerous. Delaying change is a safer bet even for
security bugs: more likely for downtime from patches than your system
gets hacked while you wait.
Raytracing vs GPUs
Raytracing is more expensive per-ray, but less expensive per-object.
Can be parallelized. Makes really great shadows & reflections.
Want
to improve predictability in ray-tracing to get "good enough" frame
rates.
Trying to use SIMD. Primary rays are parallel, easy to SIMD.
Seconday rays are incoherent, not good for SIMD.
Cliff Notes: no surprise but modern GPUs are not well suited to
raytracing work
Commutativity
Analysis for Software
Parallelization: Letting Program Transformations See the Big Picture
Sources of Commutativity - e.g. sum, malloc
Can detect it? Check memory state before & after running fcn X.
But misses e.g. set's (via linked-list) - linked-list order is
different when swapping 'put(A)' and 'put(B)' calls, but semantically
the same: both elements are in the set, but the set's implementation
stores them differently.
How to test for commutativity? Symbolically execute 'put' calls in
different order. See if memory contents are same: "put(A);put(B)" vs
"put(B);put(A)". If yes, then Yes - but suppose not.
Now check reader fcns: try is_member() (on what values???)
Also check writers like 'remove'.
Problem: exponential search space for equivalence.
Use Random Interpretation-
- choose random input values
- interpret till conditional branch
- follow which way it goes; tests program
- Then back up to cond branch
- Adjust memory & values such that test fails
- Now follow thru on other branch; test program
Now combine memory states for each branch of program. Using an 'affine
join' to combine memory states??? Somehow correctness bug becomes an
exponentially low; very very unlikely the Random Interpretation ever
computes some wrong answer (declares as commutative some function which
is not really commutative).
Looking at MediaBench & SpecInt on an infinite issue machine. See
20% to 30% of all functions are commutative (with themselves?). Only
see speedup when commutative fcns are called near each other, but then
can see parallel speedup (log tree combining?). Average is 50%
speedup. Apparently assuming an extremely wide machine, as opposed to
threads (ignoring communication costs & thread start/stop).
Cliff Notes: Random Interpretation looks interesting, not sure
about the rest of this paper. In fact, this paper looks like
a small delta (and very difficult to use correctly) over the R.I>
work. Instead see 2003 paper: "Discovering Affine Equalities Using
Random Interpretation"
http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf
Predicting Yield of
GCs
Amount of memory reclaimed by GC *can* be low. But looking at minimal
GC footprints. When GC is not productive, then time to extend the
heap. Estimate GC yield by using Virtual Memory to track pages which
are touched and which are not touched - not-touched pages are probably
full of mostly dead stuff.
Cliff Notes: Yawn, Olde Hat Trick.
Cliff Notes: got bored; symptom of Day 3 at a conference;
worked on slides
Phantom BTB
Cliff Notes: Winner of the most ridiculous proposed hardware,
backed up a big speedup on the most-unrealistic simulated hardware.
Virtual BTB Design - as in Virtual Memory not as in VMWare.
BTBs today are {small,fast,inaccurate}. Want {big,fast,accurate}.
Allocate space for the BTB in the L2(!?!?!).
But latency to the L2 is large, so much prefetch the BTB data.
Getting tidy 10%+ speedups on unrealistic hardware.
BTB in L2 cache is large+slow. Missing BTB entries are evicted to L2.
Then prefetch from L2 based on previous missing branches (assuming that
an upcoming BTB miss can be predicted from a just-prior missing
branch). Attempted lots of other predictors; most failed miserably.
Pay-as-you go model; BTB gets effectively larger as more pressure
happens on the on-chip BTB.
Cliff Notes: HotSpot inline-caches wipe out the need for faster
jump-register support in hardware; standard branch predictors work well
with HotSpot. Total clocks spent doing jump-register instructions are
less than 1% of all clocks on realistic hardware.
import java.lang.*;
class TextScan {
static String test1 = "scan_for_some_token(like_this_open_paren"; static String junk0 = "scan_for_some_token_like_this_open_paren"; static String junk1 = junk0.concat(junk0); static String junk2 = junk1.concat(junk1); static String test2 = junk2.concat(test1);
// A TextScan is a lite object that's really 8 longs representing // the bits of 64 bytes split 90-degrees from verticle. long b0,b1,b2,b3,b4,b5,b6,b7;
public static void main( String args[] ) { TextScan[] bits = bit_split(test2);
System.out.print("bits="+bits.length+"@("); for( int i = 0; i<bits.length; i++ ) System.out.print(bits[i]); System.out.println(")");
// Scan for byte System.out.println("Scanning for byte "+(byte)'(' + (char)0x28); System.out.print("["); for( int i=0; i<bits.length; i++ ) { // bit pattern 0x28 - 0 0 1 0 1 0 0 0 TextScan ts = bits[i]; // The open-paren char, '(', hex 0x28: // 0 0 1 0 1 0 0 0 long res = (~ts.b7) & (~ts.b6) & ts.b5 & (~ts.b4) & ts.b3 & (~ts.b2) & (~ts.b1) & (~ts.b0); // A non-zero res means a '(' is found, and the bit-number is // the index into the 64-byte array where it was found. System.out.print(Long.toHexString(res)+","); } System.out.println("]"); }
// Split a String into an array of 64-bit longs. // Parallel-split bits: bit 0 from 64 bytes go in 'b0', // bit 7 from 64 bytes go in 'b7', etc. public static TextScan[] bit_split( String s ) { int slen = s.length(); TextScan[] bits = new TextScan[(slen+63)>>6]; for( int i=0 ; i<bits.length; i++ ) { TextScan ts = bits[i] = new TextScan(); // Suck 64 bytes into bytearray from String. for( int j=0; j<64; j++ ) { int idx = (i<<6) + j; byte b = idx<slen ? (byte)s.charAt(idx) : 0; if( (b&(1<<0)) != 0 ) ts.b0 |= (1L<<j); if( (b&(1<<1)) != 0 ) ts.b1 |= (1L<<j); if( (b&(1<<2)) != 0 ) ts.b2 |= (1L<<j); if( (b&(1<<3)) != 0 ) ts.b3 |= (1L<<j); if( (b&(1<<4)) != 0 ) ts.b4 |= (1L<<j); if( (b&(1<<5)) != 0 ) ts.b5 |= (1L<<j); if( (b&(1<<6)) != 0 ) ts.b6 |= (1L<<j); if( (b&(1<<7)) != 0 ) ts.b7 |= (1L<<j); } } return bits; }
// --- toString // Reverse bit-swapping to a String. public String toString() { String s = "["; for( int i=0; i<64; i++ ) { byte b = 0; b |= ((b0>>i)&1) << 0; b |= ((b1>>i)&1) << 1; b |= ((b2>>i)&1) << 2; b |= ((b3>>i)&1) << 3; b |= ((b4>>i)&1) << 4; b |= ((b5>>i)&1) << 5; b |= ((b6>>i)&1) << 6; b |= ((b7>>i)&1) << 7; if( b == 0 ) break; s += (char)b; } s += "]";
return s; }
}
Category: Web/Tech | Permalink
| Comments (4)
| TrackBack (0)
|
|
|
 |
 |
 |
| |
| Sun |
Mon |
Tue |
Wed |
Thu |
Fri |
Sat |
| |
|
|
|
|
1 |
2 |
| 3 |
4 |
5 |
6 |
7 |
8 |
9 |
| 10 |
11 |
12 |
13 |
14 |
15 |
16 |
| 17 |
18 |
19 |
20 |
21 |
22 |
23 |
| 24 |
25 |
26 |
27 |
28 |
29 |
30 |
| 31 |
|
|
|
|
|
|
(Learn more about RSS) |
|
|
 |
 |
 |
|
|