|
IFIP WG2.4 Trip Report May 18, 2009 IFIP WG2.4 remains one of my favorite mini-conferences to attend. The group is eclectic & varied; the discussion vigorous. Everybody is willing to interrupt the speaker & ask questions. The talks are usually lively and pointed, and this time was no exception. The conference was held in Fort Worden, a historic Big-Gun (planes rapidly obsoleted the guns) fort of the Pacific Northwest, defending Admiralty Inlet and our shores from the marauding Russians ... or Japanese... or somebody, surely. Nobody ever attacked that way, so clearly the fort was a big success. Meanwhile, the fort has been turned into a park and it's beautiful. The whitewashed buildings with green trim remind me of some mythical 50's era America that never was, the kind that shows up in the backdrop of various movies or maybe "Leave It To Beaver". We stayed in the officers quarters and held our meeting in the old mess hall. The officers lived quite nicely; the quarters are actually old duplexes in great shape, with high ceilings and crown molding and carved brass fittings everywhere - the house is clearly larger than my house and with a vast green lawn ta-boot. In San Jose such a place would be upwards of $2m... Fort Worden overlooks Admiralty Inlet from some high bluffs (classic fort-like location) and the surroundings are all gorgeous. The weather varied from overcast & chilly to clear & crisp (alas, overcast and chilly was the mode on our all-day "whale" watching expedition. We did lots of watching but saw no whales of any kind. And we bounced over heavy seas and got sprayed with frigid water most of the day.) Since the Hood Canal bridge was out I got treated to a 3 hr drive the long way around the peninsula, but somebody else drove and the scenery was worth looking at. The airplane portion of the trip was easy. As usual, my reports are (very) brief, stream-of-consciousness writing. I often skip around, or brutally summarize (or brutally criticize - but this group is above the usual conference par, so less of that is needed). I skip talks; I sleep through ones I'm not interested in, etc, etc. Caveat Emptor. Synchronization Talks Atomic Sets - another nice concurrency alternative Fork/Join for .Net occam / CSP - Uber lite-weight user-thread scheduling Java Analysis Talks SnuggleBug - Better bug reporting for Java SATIrE - A complete deep analysis toolkit for Java PointsTo - A comparison of various points-to algorithms for Java analyzing JavaScript - Has a type-lattice as complex as C2's internal lattice Performance variations in JVMs - Noticed the wild variations in run-to-run performance of JVMs Misc Cheap Cores discussion Compiler-directed FPGA variations Clone Wars - huge study on cut-n-paste code in large projects Business Process Modeling Security PDF Attack - "Adobe Reader is the new Internet Explorer" The Britney Spears Problem Squeezing your Trusted Base Who you gonna call? Evolving Systems Swarms in Space Finding emergent behavior Locks are hard. Auto-inference of correct sync in O-O programs. Even with no data-races still have atomicity races. Instead of code-centric locking paradigms, do data-centric locking. Tied to classes with fields, leverage O-O structures Group a subset of fields in a class into a *atomic_set*. Then find units_of_work for that atomic_set. Add language features to Java: atomicset account; atomic(account) int checking; ... atomicset logging; atomic(logging) Log log; atomic(logging) int logCount; So add 'atomicset' keyword, 'atomic' keyword and strongly type variables with atomic_sets. Units_of_work expand to reach the syntactic borders anytime an atomic variable is mentioned. Can expand unit_of_work by hand also. Can at runtime cause atomic_sets to alias - to union together. Used on ownership types (e.g. HashMap uses HashMap$Entry, so units_of_work on HashMap or HashMap$Entry become units_of_work for both). Atomic_sets can be unioned; e.g. for LinkList the entire backbone of the list is one atomicset. Lots of discussion - points out the issues with hacking all the Code Out There. Has proven strong serializable properties on atomicsets. Proven various kinds of soundness, including *internal* objects are only accessed by their owner, etc. All concurrency annotations appear in the libraries but not in the client code - shows a nice concurrent example with 2 threads manipulating the same linked list. Has a working version using Eclipse & taking annotations as Java comments. Using j.u.c.Locks; can do lock aliasing by simply sharing j.u.c Locks. Applied this to a bunch of Java Collections; 63 types & 10KLOC of code. Needs about 1 line of annotation per 21 lines of code for the libraries (and none in clients) About 15-30% slower than using plain sync wrappers on a single-threaded testcase; probably because the "synchronization" keyword is very optimized. Performance is similar or slightly faster when running with 10 threads & high contention. - A concurrent parallel library for .Net (think dl's Fork/Join). Infrastructure: locks/threads/STM, async agents (responsiveness), task parallelism (performance) Gives a demo of a parallel matrix-multiply w/C#. Very ugly code. So instead, write a nice combinator... Standard task-stealing work-queue approach. (private set of tasks that don't need CAS, public set of steal-able tasks that require CAS). Points out that want fine-grained parallelism, so can do work-stealing and do auto-load-balancing. -- Effectful state. Statically typed, with typed side-effects. Type signature includes all side-effects (yes!) int fib(int n) { return n<=1 ? 1 : fib(n-1)+fib(n-2); } But this program int fib(int n) { return n<=1 ? print"hi",1 : fib(n-1)+fib(n-2); } Does not type in Haskell - need the IO monad. And now the function returns 2 results: the IO "hi" and the int. So want some syntax to allow "auto-lifting" into a side-effect world - same function syntax, but auto-includes the IO monad as a result. Then can prove that some stateful computation does not leave the function - that the state is entirely encapsulated & unobservable outside. So can do type'ing which allows some state-ful work internally, but it doesn't escape the function "box". int fib(int n) { ref f1 = 1; // side effects, but entirely internal to the function ref f2 = 1; while( n ) { sum = *f1 + *f2; *f1 = *f2; *f2 = sum; n--; } return sum; } Google: kroc install Carl Ritson did all the work... Cliff Notes: Insanely cheap process scheduling, plus "free" load balancing, plus "free" communicating-process affinity. Very similar to work-stealing in GC marking, but with "threads". OS guys take note! Scheduler for OCCAM/PI. Process-oriented computation (millions), very fine grained processors. Message passing over channels, plus barriers. Uses CSP and pi-calculus for reasoning. Dynamic creation & deletion of processes & channels. Goal: automatic exploitation of shared-memory multis. "Nothing" for the programmer to do: the program exposes tons of parallelism. "Nothing" for the compiler to do - it's all runtime scheduling. Goals: scalable performance from 10's of processes to millions; co-locate communicating processes - maximize cache, minimize atomics (lock-free & wait-free where possible). Heuristics to balance spreading processes around on spare CPUs and co-locating the communicating ones. Usual sales job for CSP... which still looks good. A blocked process requires only *8* words of state (no stack for you!). Scheduling is cooperative not preemptive. No complex processor state, no register dumps. Up to 100M process contexts per second. Typical context switch is 10ns to 100ns. Performance heavily depends on getting good cache behavior. So processes are batched. Stacks are spaghetti-stack (no that's why no stack saving...). These 8 words include the process run-queues. Batching: sets of e.g. 48 processes are run on a single core; that core is the only core touching this queue & round-robins them until the entire batch times out. So no cache-misses. A Batch is a Process, plus a run-queue. A Core is a Batch, plus a queue of Batches. Again, no other cores touch this list either. A Core with no Batches does work-stealing. There's some proposed batches available for work-stealing from each Core. Cores need some atomic ops to manipulate the queue of Batches for these steal-able Batches. Careful design to ensure all the interesting structures fit in a single cache line. So until you need to enqueue, dequeue or steal a Batch - it's all done without any contention for process scheduling. These Batch-switching operations require atomics but are otherwise fairly rare. Key bit missing: scaling up beyond 8 or 16 cores with some kind of log structure. But otherwise looks really good for very low cost context switch & work steal. Lots of questions about fairness. Batches are formed & split by the runtime (no compiler or programmer). Aim is to keep processes that communicating on the same core (so it's all cache-hit communication), but spread the work across cores. Initial batches are kept small to encourage spreading. Channel ops require atomics, but typically only 1 (sometimes 2). Choosing amongst many channels might require up to 4 atomic ops. Each time communication happens on a channel, the sleeping process is pulled into the run queue Batch of the awake process - thus naturally grouping communicating processes. But as batches get large, need to split. Periodically can observe that if all the processes in a Batch are all communicating with each other in some complex cycle - then the spare run queue empties from time-to-time (with one runnable process). So as long as the run queue empties now & then, then the Batch should stay together. But if the run queue never empties, then split-the-Batch. Split off a single process into a new Batch - which will drag it's connected component together into the new Batch. But eventually if the original Batch had 2 complete strong graphs, they will now be split - and can be stolen onto other CPUs. So this kernel is about 100x less overhead than the same design done on a JVM. :-) "Snugglebug" Communication between the programmer & the tool tends to stop early: "Dear Programmer: you have a null-ptr bug here"... Going to the next step: "Dear Mr Programmer: try running your method with *these* values and you'll see the bug...". Programmer: now that I think about it, these values will never arise. We just forced the programmer to define the legitmite range of values. More: Bug Report is more believable if given a concrete input is given. API hardening: gives the preconditions under which this library will throw an exception Unit-test generation: inputs needed to make a specific assert fail. Works by weakest-precondition. Given an assert in the program, work backwards computing the weakest-precondition at each point until you reach the top of the method. If this chain works, then hand the WP to a decision procedure; sometime the decision is "I don't know" but many times you get an explicit counter example. Good match for the counter-example problem. - Yes loops are a problem, but don't really need the *weakest* precondition. Most Java features can modeled: - heap r/w, arrays, allocation, subtyping, virtual dispatch - exceptions: many bugs are here so we modeled exception handling try/catch very closely. Have a path-explosion problem. For heap updates, even a single path includes branch effectively depending on aliasing. Has to do strong interprocedural analysis. Lazily grow the call-graph consistent with known symbolic constraints. First look for a conflict on a call-free path, then on call-paths but without needing to look into the call. Finally, if I must peek into the call - but use the symbolic info the reduce the set of possible call targets. Generally, no fixed-point. Cannot expect programs to give loop invariants. Work-list management: avoid various common pathologies. Goal is to find *a* contradiction, not to explore all options. http://www.complang.tuwien.ac.at/satire/ Shape analysis, points-to, flow-sensitive, context-sensitive, annotations, can read them in & out. Applications to e.g., slicing. Read C/C++ - large complex tool chain. Reads in program x-forms in either a functional programming language or in Prolog. Can combine the tools in any order at this point (very well integrated). Output is C/C++ w/annotations at each step (or internal "ROSE" ASTrees, etc). Has interval analysis for integers, shape analysis, loop (bounds) analysis, points-to, etc. Gives example of reaching-def's in the functional programming language. It's a domain-specific language for writing compiler analyzes. Can specify the strength of the analyzes in many cases: e.g. the depth of the allowed call-chain for context-senstive (set to zero for context insensitive). Shape analysis gets may/must alias pairs, etc - and all results put back into the C program as comments/annotations. Basically describes a large complex tool chain for doing pointer analysis on C & C++ programs. The tool chain looks very complete & robust. ...skips explaining what points-to is. Clients: compilers, JIT compilers, refactoring tools, software comprehension. Different clients have different needs: speed, precision, security. Granularity & conversatism also matter. Clients might want, e.g., dead code elimination, or removing polymorphic calls, or rename methods & calls. Other people want object granularity - escape info & side-effects (i.e., compilers and JITs). Also static-vs-dynamic analysis. Dynamic analyzes typically run the program and produce optimistic results. Doing careful experiments of 2 analysis w.r.t. accuracy. Hard part: can't get a perfect "gold standard" for real programs. Hard even for a human to inspect a proposed "gold standard" and declare it "gold" or not. Some special cases are easier: conservative analysis (no false positives) & optimistic analysis (no false negatives). Can under- and over-estimate the gold standard, but this still messes with the results (messes with detecting which of the 2 analysis is better w.r.t. accuracy). E.g., ask the question: is it worth increasing k>1 (i.e., becoming context sensitive vs staying insensitive). Checking size of computed call-graph - smaller graph is more precise. Very small increase in accuracy. Sometimes made things worse - because the original analysis was optimistic sometimes. Open question: what is the use of an Uber Points-To - it's definitely beyond the point of diminishing returns for JIT compilers. Not much tool support for catching errors during programming. No nice IDE support. Many bugs are browser specific, but we focus on language bugs. Most JS programs are quite small, so throw a heavy-weight analysis at it. Object based, dynamic typed, proto-type inheritance, 1st-class functions, coercions, dynamic code construction, exceptions, plus 161 built-in functions. Types include null & undefined, primitive types, some properties "ReadOnly" and "DontDelete". Broken things: tracking down an "undefined" and where it came from, reading an absent variable or absent property, etc. So do a dataflow analysis over the program, including a lattice, transfer functions, etc. Handle interprocedural stuff. Tightly interleaved control flow & data - so need a complex lattice. Lattice is roughly as complex as C2's lattice, with more stuff on the allocation sites and less on the class hierarchy. Also then model all the abstract machine state (C2 does this as well). Full context info (total inlining at analysis time except for recursion). Cliff Notes: Nice idea: for each allocation site, model both the *most recent object* from that allocation site, and also model *all the rest* from that site. You can do strong-update on the singleton most-recent object, but weak update on all the rest. Noticing the badness of the "compile plan" - 1 run in 10 had a >10% performance reduction. JIKES - always compiled, no interpreter. On demand at 1st invoke. Focused recompilation, lots of GC variants. Points out the places where you get variations - stack-profiles for estimating the call-graph, basic-block counters, etc. Causes of variation? So far: GC & inlining Concerned that i-cache placement is a major issue of instability. *No* correlation between total amount of compile-time spent vs score. You must compile the hot stuff, but apparently tons of not-so-hot code is getting compiled and it doesn't help the score any. General discussion round, not really a "talk". Expect >32 core/die in 5 yr, >128 core/die in 10 yrs. What's "New" and what can we expect in 5 years? Theory: contention is that there's nothing new here since the 70's & 80's (concerning concurrency). (general agreement on incremental progress but not breakthrough). Systems: Lots of libraries (TBB, pthreads), lots of platforms (webservers), all OS's support multi-core. JVMs are New in this space. Might see OS-support for user-level thread scheduling. Languages: Support been there, but it's locks & threads - no good. Support for TM is coming, but will it help? Then there's CSP... Applications: Not much new here.... more games, more interactive more user-interface. Old stuff: lots of HPC, speech, image. Matching: easy parallelism; lots of interesting apps here. What is "reconfigurable"? "Usual" approach - HW guys compile to a netlist, do some mapping, place & route, make a FGPA/silicon, while the SW guys write some assembler on the FPGA - then at the end they try to run the SW on the HW. Goal is to run some complex function faster on a FPGA than in normal ops. Our approach: read source code, look at program structure, find hot code & hot loops; compiler knows how to switch the hardware between variants; then it compiles the code for a particular FPGA variant for each loop, and reloads the FPGA with a new function between loops. But need to limit the FPGA to certain variants so don't actually need to reload the whole FPGA. Fixed small set of architecture variants. Keeps cost to reconfigure the FPGA very low. Compiler can choose the variants. Example: FPGA runs either a small CPU or a small SIMD CPU or a MIMD CPU. Or reconfigure the pipeline structure, or the default register bank access patterns. In the different configurations can use more or less power for problems that are more or less parallel. e.g., in SIMD mode have all ALUs active but turn all but 1 decoder. Pick variants & design space to those known to be efficiently solved by a compiler - gets a fast compile time & good results there (use a human elsewhere). Can map any arch register to any physical register - so can e.g. have CPUs share registers, or do unbalanced register allocation - if some CPU needs more registers in it's portion of code than others. Does some register allocation affinity games, to avoid having to reconfigure register layouts between blocks of code. Nicely flexible overall design. Instructions name arch registers & ops, but also can re-map real registers & ops periodically. So use targeted registers & ops for a particular chunk of code, then reconfigure when the next chunk of code is different enough from the last chunk- and the cost to reconfigure is lower than running with a less-well-targeted registers & instructions. Ugh, can't read data charts - has some 4 embedded programs. Seeing 30% reduction in execution time AND power cost, by reconfiguring. Code size is a little larger, because either poor parallization(?) & the reconfiguration overhead. Exact clone'd code, white-space only changes, near-miss clones. Bug in 1 deserves inspection in all the rest. Lots of cloning studies out there -esecpially in-depth of the linux file system. But no wide-range-of-systems studies. We did a bunch of open-source systems, C, Java, C#. OS, Web apps, DBs, etc. Used their own clone-detection, as a combo of AST-detection and text-based detection. Text-based comparisons are sensitive to formating changes. But AST parsing is expensive (requires tree compares) and does not find "near miss" clones. Ended up doing "standard pretty-printing" - which uses the AST to normalize the text. Then do text-based comparisons. Also handle small diffs in the middle of clones. Did 24 systems, 10 C, 7 Java, 7 C#. Linux 2.6 kernel, all of j2sdk-swing, db4. Varied from 4KLOC to 6MLOC. Java methods have more cloning than C methods - 15%-25% (depending all the diff threshold) of Java methods are clones; C varies from 3% to 13%. C# is more like Java, until the methods allow more diffs - and then there is a lot more cloning. Swing is hugely clone-ish, >70%. C systems - Linux has the same cloning behavior as postgres & bison & wget & httpd. Java has no more clones as you allow more diffs as compared to C - I suspect good Java refactoring tools detect & remove clones earlier. C# shows a LOT more clones as more diff's are allowed. Maybe cut-n-paste is a coding style? C#'s clones are larger methods as well. Something like 10-20% of all files in Java have 50% of functions are clones. In C# it's more like 20-35% of all files. Files with 100% cloned code: 15% of C# files are totally cloned. For Java whole file cloning is fairly low. Localization of cloned pairs: are they in the same file? same directory? different directories? Business Process Modeling, patterns, x-forms, refactorings Process Structure Tree - - Process is a large unstructured graph. Visio-style graphical is error prone for large process models. But can take the structure graph and build a tree from it. (The graph is all fork/join parallelism). A small part of the tree requires state-space exploration. After the tree is correct, can apply patterns & Xforms to the tree. Essentially Petri Net edit'ing on a structured graph/tree using Eclipse. 100's of users inside of IBM; generated business models are much higher quality, make them 2x faster. Why?
"Visual Virus Studio" - actually exists. PDF is a common vector for *targeted* attacks. "Adobe Reader is the new Internet Explorer". Looked at all the fixed vulnerabilities in Adobe Reader
How big is Adobe Reader?
Problems are clustered - e.g. one module has had 25% of all security problems. Top 6 modules have 80% of security problems. Break down problems another way:
How found:
Some examples:
A Quick List:
All memory integrity or soundness type stuff. But it goes beyond AIOOB - e.g. Name string like "Robert); DROP TABLE Students;-- " Users assigned roles; roles are assigned privileges. Benefit: users can come & go without changing policies. "Patients" can view medical records, but "doctors" can change them. Can add annotations to Java that specify roles. Problem: lack of expressiveness; solution: logging. Violation happened (people saw Britney Spears' data), but the logging allowed catching the violators. Another solution: manual checks. But no compiler support, so missed checks. Problem: Lack of explicit access policies. Really: need a way to "type" data (and hence methods using the data) with some kind of security/access policy. More problems: often checks are hoisted to entry point for efficiency, and then violate policy later. Relying on manually-inserted runtime checking. Our stuff: parameterize classes by some kind of index value defining Role. Each method includes an explicit policy (statically checked). Annotation processor can either statically type-check the policy, or insert runtime checks. "Interesting program properties are undecidable!" Compiler optimization - not so worried: just lose some optimality. Results still usable; but if we have better results we can produce better code. Worse-Case execution time analysis for hard real time: Bang. Your dead. WCET bounds must be safe (soundness), and wish to be tight (optimal or useful). E.g., a bound of 1hr for a mpeg3 audio-player interrupt is safe, but useless. Need, e.g., all loop bounds. Need user annotations. User annotations are untrustworthy, but must be trusted. Squeeze the trusted annotation base - reduce user chance of screwing up. Try to replace trust by proof - as a side-effect generally improving optimality. Then take advantage of trade-off between power & performance of analysis algorithms. Use model-checking with care. If something can't be bounded automatically, as the user for some bound - then prove it. BUt also believe the user's bound isn't tight. Try to prove a tighter bound with the model-checking using binary search. Indeed - can do away with user assistance often; just guess and binary search to tighten it. Summary: can often completely empty the trusted annotation base, usually shrink it; often improve WCET bounds. Start with the cheapest techniques (constant prop), move up to interval analysis, then symbolic bounds & model checkers... (Joe is a Windows attack expert) Attack Vectors - autoexec.bat, .cshrc & friends, autoplay, device-driver installs (Sony), mac resource attacks Engineered attacks: phishing attacks, downloads, activeX, client-side scripting Lots of "security" designs & codes are done by junior programmers w/out adequate oversight. Pervasive attitude: "software is secure - as designed, as implemented, as maintained". Small+fast is all that matters... not... Rant against all modern OS's, talking about #of security patches in Mac, Linux, unix, Windows... Client-side scripting is vulnerable. Flash, Office, Adobe Reader, etc. What are we teaching people about security? How about non-CS types? Graphics designers, website designers... Build very small satellites, swarms of them to watch the Earth. They will spread out after launch to watch e.g. weather, bush fires, etc. Takes about 2 days after launching to get a communication link. Satellites have good earth coverage; orbit at 500 to 600km, orbit earth every 3 hrs or so. Wide range of sensors amongst different satellites. Have a Real Swarm of cameras; map a Virtual Swarm onto that; map mission goals on the V.S. Optimized synthesis of consensus algorithms. Special OS - distributed OS basically. Used to maintain swarm integrity (keep satellites together). Thesis: in the future, systems will be to complex to design *explicitly* Instead engineer systems to have the desired properties "implicitly". "mob programming", ant mounds, etc.... simple rules of behavior leading to complex behaviors that emerge. Emergent Behavior Mechanisms Design - (game theory),
Swarming behavior (flocks, etc)
Social communication (gossip, epidemic algorithms)
Design for emergent behavior
No idea how to design high-level behavior from low-level rules, but busy experimenting and looking for some cases. Cliff Category: Web/Tech | Permalink » | Comments (2) » | TrackBack (0) » Like this post? |
May 5, 2009 http://www.cs.tufts.edu/research/dacapoDaCapo is a small (but growing) research group focused on GC & Memory Management issues. They meet about every 9 months, and the meeting bounces around depending on who is hosting it. People mostly present work-in-progress, and the audience is up front with their critiques... often interrupting the speakers or getting sidetracked into loud off-topic conversations, i.e., it's a fun boisterous crowd. DaCapo is in the Boston area this year, held at Tufts University's Medford campus (everyone's heard of Harvard and MIT, but how about the Berklee College of Architecture, Fisher College, Boston University, Suffolk University and dozens and dozens more? - Boston has more colleges and universities per square mile than any other place I know). As usual for me traveling to Boston, I skipped the car and just used public transportation - the T works marvelously well (as compared to e.g. VTA; crossing the entire town from Logan Airport stuck way out in the water to Tufts some 8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose on the VTA takes something like 2 hrs). Anyways the weather was supposed to be highs of 65 and rainy with lows in the 40's - so I packed my parka. Turned out the weather report was a little bit off, with highs in the upper 80's and sunny. Man did I roast. But I got out and enjoyed the sun - we (the entire DaCapo group of 40 or 50 people) walked about a mile to lunch each day, plus I walked to and from my hotel & the T stop (also about a mile). Enough ruminating, on to the talks! As usual, I pay attention to some talks and chat w/neighbors through others. I brutally summarize. I'm opinionated & narrow focused... so Caveat Emptor. Change Tracking What changed? Who did it? When? Points out the flaws of using 'diff'. Logical-Structure diff. Auto-magically spotted systematic change patterns. It's a better code-review diff. Can report diff's as reports like "for all XXX, replace XXX with YYY except for ZZZ_XXX_ZZZ", etc. Vastly better error reporting!!! Vastly better diff reporting!!! Cliff Notes: We must get this gizmo and check it out!!!! Paper report: http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf Sample output: http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm Inferred Call Path Profiling Very low cost path profiling. Hardware counters have very little cost, but provide little context. Software is reversed: high context & high cost. Cliff Notes: Compare this to call/ret RPC hashing. Cliff Notes: Bond & McKinely does something similar. Instead, look at number of bytes from main() to leaf call on stack. Basically PC & SP (offset from stack base) provides all the context we need. Dynamically detect unique PC & SP (they did it in a training run) pairs, and reverse them to get a call-stack. Lots of duplicated stack heights, how to tell them apart? On SpecInt unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call stack). But huge spread; really sucks bad on some benchmarks. If have a list of entire call-graph, then can remove ambiguity by growing some frames. But requires some global analysis to know which frames to grow in order to be unique. "Solved" problem with a simple greedy search. Compute all call-stacks over whole program run, pick a conflict, grow some frame, re-compute & keep if more PC/SP pairs are distinct. Repeat until it's rare enough that can live with dups. Cliff Notes: The whole-program ahead-of-time thing isn't feasible for us. How to do this on-the-fly during tick-profiling? Hash PC/SP, find a tick record, 1-time-in-1024 check for dups (crawl stack the hard way and confirm call-stack), otherwise assume unique & increment tick record. If MISS in hash table, then crawl stack the hard way. If record is labeled as having dups, then crawl the call-stack & find true record?) Actually, probably expect the call stacks to be common, but the PC's to be fairly unique. So really asking for unique RPC+SP (not PC+SP). His experiments: collect PC/SP pairs about 100 times/sec. 0.2% slowdown (must have had careful measurements to detect that!) Breadcrumbs - Recovering Efficiently Computed Dynamic Calling Contexts Also builds on probabilistic calling contexts. Given a (nearly) unique ID (say, computed at each fcn entry via a simple hash on the return PC). IN the past, reversing these unique ids: use a dictionary (very large & high cost), use a calling-context-tree (smaller, but requires you to track where you are in the CC-tree at runtime). Forward search, but undecidable and limited options for pruning. Attempt to reverse the hash fcn (but only when we need to take the semi-unique id) and convert it to a call-stack. Straight-up reverse is no good, but is easy to guide search. Callers of leaf call themselves have structure - e.g. expect to see a call-op from the call-site, etc. Still need a list of all call-sites (generally available anyways because of inline-cache structures). His function is "p' = (3p+c)", where "c" is RPC? Requires a handful of instructions on function entry. Cliff Notes: If this reverse search is successful most of the time, it says that during a profile tick we can record just the PC/SP (or RPC/SP/PC?) and then reverse them to the entire call-stack later - on demand when RTPM requests. So a 'tick' is no more than record a few bits (and handle buffer overflow), but no stack crawl. Cliff Notes: combined nicely with 'null-assignment-stack' info, ie. collect the entire stack and 'color' all NULLs with a unique stack id. Then if get a NPC can report both the NPC at the crash site, but also the call-stack that assigned the NULL in the first place. Transparent Clustering - Phil McGachey, Eliot Moss, Tony Hosking Attempt to auto-distribute multi-threaded programs. Pure Java, no JVM hacks. No programmer involvement. Dynamic class re-writing. JVMs can vary. Hardware can vary. All machines on a network run RuggedJ nodes. RuggedJ includes a rewriting class loader, some runtime libs, and some application-specific partioning strategy. Rewriter: generate lots of classes & interfaces; especially hook methods & fields. Hook into the runtime. Abstract over object location; remote & local objects. Ex: Standard class "X". Rewrite into a X_Local and X_Stub classes. X objects on Node A are of type X_Local. On Node B, X objects are really X_Stub objects where all the calls are remote calls. Both X_Local and X_Stub implement the "X" interface. Some other Y_Local can refer to X_Local on Node A, but refer to X_Stub on Node B. If want to migrate from Node A to Node C, want to indirect the X_Locals; so for mobile objects there's an indirection layer called "X_Proxy", that then be switched from pointing to an X_Local vs an X_Stub (if the object migrates). Non-mobile objects skip the X_Proxy indirection layer. Inheritance & interfaces are maintained in the rewritten classes. Static classes conform to the RuggedJ model (I assume replication is allowed?). Arrays are also wrapped to support indirection. Constraints: Native code breaks everything. Try to not rewrite stuff referred to by native code (maybe can intercept all native calls and rewrite args?). System code is also an issue: loaded by the system loader, is not rewritten. (Can they use -Xbootclasspath?). But most code run is actually system code! (eg. String libs are very common!) So must deal with system code without changing the JVM.
Apparently able to cluster some complex programs. No idea about performance as compared to explicitly clustered apps. Jikes RVM Native Threads Was Green threads, now Native threads (everybody else is dropping green threads too....) Motherhood & Apple Pie. Computer Science as an Experimental Science Reproducibility? Experimental bias? Show how these problems are solved in very-high-energy astrophysics. (1) No publish without multiple colleague confirmation. (2) Publish (3) MUST also now publish reproductions or non-reproductions's of other peoples' experiments (or lose credibility as a research entity). But bar to publication is lower, and expectations are lower. i.e., people expect to publish in-progress work Biggest comp-sci problem now: lack of infrastructures. Examples of infrastructure bias: green-vs-native threads, some barrier styles are hard to use (.Net embedded objects), stack-scan (OVM makes fast stack scanning at the cost of everything else being slower). HotSpot (high-value compiler, good GC, good most everything) vs JIKES. Cliff Notes: This was actually a really fun talk (and lots of discussion) despite my paucity of notes. Web 2.0 Profiling Do web 2.0 workloads differ from legacy, e.g. SpecJVM98, DaCapo suite. Relies on browser to run Flash, JavaScript, REST, AJAX. Benchmarks: IBM social networking software; Java PetStore 2.0, AJAX-based RIA using JEE5, uses Lucene search. Apply workload: viewBlog (from 9 blogs), createBookmark (total 14), selectTag (PetStore 9), etc. Workload mix can be varied, but based on commonly observed usage pattersn from internal deployment of Lotus Connections). Wanted to remove DB from the workload, put DB on ramdisk. Same for LDAP. Used Grinder to generate load. Main server running J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5 1.66GHz. Fairly reasonable setup for a small web/app load. See lots more chattiness; very short fast requests, high interrupt rates, smaller requests but many more of them. Looked at low-level profiling. See fewer I-cache misses as compared to a large legacy apps. Better I-cache behavior. But data-cache miss rate is still very significant. Also different transactions have different scaling limitations. For PetStore the Java persistence layer started falling off after 4 cpus. Stream Processing "Spade" - Cluster stream processing at IBM. Continuous high-level streams. Stream arrives for weeks & months; want to process and then emit stream data continously. Want to express the problem with a language. Priorities - fast streaming on a cluster; then Generality; finally Usability. Beyond StreamIt. Like StreamIt, works with streams that can be split & joined. Language is typed; stream contents are strongly typed. Demystifying Magic - High-Level Low-Level Programming Seen this talk before, but this time given by Steve Blackburn. Nice historical perspective on writing system software in C vs asm - and the parallels to writing sys software in Java vs asm. Java-In-Java
Framework:
Introduce raw pointer type: "Address" looks like a Java type. E.g.: Address a = ...; ... = a.loadByte(offset); Semantic Regimes: Additive: "its ok here to have unchecked casts" Subtractive: "no allocation via new allowed", similar to HotSpot VerifyNoGC. Allow "unboxed" types - no v-table, so it's a bare primitive type. But this isn't the default (but can be asked for). Abstraction results are good: can implement a JVM on bare hardware and with a simple change essentially virtualize the same JVM inside a GC debugging harness running inside another JVM (where the virtualized JVM's heap is just a big array in the outer implementing JVM), etc. Grace, Safe Multithreaded Programming for C/C++ - Emery Berger Seen the paper before, but this is the talk by Emery. Fake sequential semantics on stock multi-threaded hardware using fork/join and process protection. Grace is a pthreads replacement: "g++ -lpthreads" becomes "g++ -lgrace". Speedups: in the absence of common benign races, Grace run programs run about 10% slower than raw pthreads - but have full sequential semantics. It's "as-if" all the parallel programs are run sequentially immediately at the thread spawn point. CAN'T run applications where threads run forever, i.e. reactive or server apps. Works well with fork/join, Cilk, TBB, etc. So thread spawn becomes a unix fork with COW, use mmap for allow memory sharing. At join points, smash in join'd threads memory updates via mmap. Also need scalable thread-local malloc heap, plus aligned globals (to avoid false-sharing at the page granularity), plus some improved I/O. Some simple measures remove nearly all false sharing. Big one: everybody mallocs into their own space. 2nd big one: spread the globals out 1 per page. Thread-spawn is as cheap as a fork on linux (has experiments to show it). Due to thread-cpu affinity, if you spawn a bunch of threads they share a single CPU. At the low end of thread-grain-length, fork is faster than spawn because the scheduler immediately spreads the processes across CPUs, whereas threads share for awhile. So fork is actually faster than thread-spawn for awhile. Real performance limiter for Grace is conflicts & rollbacks, and not thread-spawn overheads. Performance is much better than e.g. STM, because after the 1st touch on a new page (and the SEGV & accounting), all accesses on that page run at full speed. GC Assertions: Using the GC to check heap properties http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&rep=rep1&type=pdf Global knowledge is required to understand seemingly local properties. JBB example: Order order; for(...) order = ....; delete orderTable; ? are all 'orders' dead here? But actually 'order' is held by the last-customer-transaction (as an optimization, or convienvence for customer?). Leads to a leak of 'orders'. Really: programmer does not understand program, too complex. Add assertions to the code, and use the GC to check the property. Cliff Notes: Azul is already gathering global heap properties during GC - but not asserts. Gathering liveness & points-to info. If points-to included a handful of sample real instances (and all the sames linked when possible), would be a very powerful way to instantly check some heap properties. Sample asserts: 'assert_dead(p)' - expect to reclaim 'p' at the next GC cycle Or region-collection: start_region(); ... assert_all_dead(); Or shape property: assert_unshared(p); Or ownership properties (members of a collection are 'owned' by the collection) If an assert fails, then do a slow crawl and provide the full path through the heap showing the failure. Asserts only checked at GC points. Proving Proving Correctness of Abstract Concurrency Control and Recovery - Trek Palmer & Eliot Moss Transactions - closed nesting has issues Open nesting & Boosting need from programmer: conflict info & roll-back info. These are hard to provide correctly. This work: a description language of abstract data types or structures. Can describe the conflict predicates & inverses. Can prove correctness of the conflict & inverse expressions. Working with the abstract description of the data structure, and NOT e.g. with the real Java implemention. E.g., no loops, no mutations, no recursion. So the language isn't a 'real' language, but can describe many kinds of ADTs in code that looks sorta like Java. Output isn't a functional program, instead the output is the result of using a SAT solver to prove correctness. XTN-semantics allows conflict detection to be proven correct pairwise, instead of having to do full interleaving. Formally, a conflict predicate is correct iff it is true when the operations do no commute. The conflict predicate tells when 2 XTNs conflict, and the inverse allows an optimistic XTN to rollback. Obvious use-case: use this tool to write a transactional-version of NBHM or other JDK concurrency utilities. Hard Real-Time GC Scheduling - Periodic Scheduling - GC runs at highest priority, but periodically yields to mutator - Metronome - Slack-based Scheduling - GC runs at lowest priority - Can be preempted at any time - Makinac (Sun RTS) - Work based Scheduling - GC runs at allocation time - Problem with allocation-rate jitter HRT - Deadlines must never be missed AND must be verifiable analytically that no deadlines are missed. Systems tend to therefore be very simple, so that they can be proven. OVM - like Metronome. Dynamic defrag; arraylets; Brooks style barrier; replicating barrier/; incremental, concurrent, supporting slack-based style scheduling. Understanding Performance of Interactive Applications Typical profilers give the wrong numbers: they report total time spent, not time spent between mouse-click and page updated. AWT/Swing & SWT - similar: single-threaded, gui thread in "event loop", relies heavily on Listener model. So profile using gui call-back Listeners between user events and the eventual display change. Really simple idea: profiling & event visualizer tool between click & view times. Program Metamorphsis Trouble w/refactorings: must preserve program behavior with each step. But if we want multiple refactorings then at each step along the way fixup code (needed to preserve program behavior) pollutes the code. Want to do multiple refactoring steps at once - while leaving the program possibly broken in the in-between steps. So compute a program-semantics (names & def-use chains), and let the user make partial refactoring steps and then declare "I think I'm done" - and the system compares the new program-semantics with the old, and reports if they are not equal. So actually comes up with primitive not-quite refactoring steps, which he will compose to build larger "real" refactorings, or compose lots to build a multi-stage refactoring. Instruction Selectors HotSpot C2 uses a BURS framework. Very hard to optimize & debug. So these guys have a semantics which can span both the ideal IR and real machine instructions. Describe both using "CISL" and the tool does the mapping. Once the tool has a mapping, the user provides an adapter that converts the mapping into the code needed by the compiler back-end. This would replace e.g. the ADLC. Gives examples of machine encodings for PPC and Java bytecodes. CISL is given an IR tree and produces a target tree with equivalent semantics. Not sure it's entirely better than BURS (maybe no worse), but I'm still sold on rewriting in e.g. C++/Java hand-written greedy match rules. These code-gen-gen tools are pain for long-term maintenance. Do Design & Performance Relate? Is fast code required to be ugly? Is beautiful required to be slow? Pick 200 code metrics. Also pick some performance metrics (cycles, single threaded, objects created, etc). Could not follow talk... or at least he wasn't very clear with the objectives & results (if any). Interesting idea though. Category: Web/Tech | Permalink » | Comments (1) » | TrackBack (0) » Like this post? |
April 15, 2009 Various pending goodies... Load-Linked/Store-Conditional vs Compare-And-Swap
LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms. With LL/SC, you first Load-Linked a word. The hardware marks the cache line as "linked". You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter). Finally you issue a Store-Conditional to the same address. If the cache line is still "linked", the store happens. If not, then not. The line remains linked as long as it does not leave this CPU's cache; e.g. no other CPU requests the line. Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented). "Weak" LL/SC implementations can easily lead to live-lock - if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their "link" - and their SC's will fail. I suspect most modern implementations of LL do not request the line in exclusive mode - avoiding the obvious live-lock failure. The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC. With CAS, you typical first load a word with a normal load, then manipulate it. Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory. CAS can succeed if the original cache line leaves the CPU and returns holding the same value - this allows the classic ABA bug. In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an "A" pointer unless all copies of A die first - including the one loaded before the CAS. Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS. Azul has a load-exclusive instruction to avoid with this - a plain load but the line is loaded exclusively. With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there's a small finite number of operations that can happen between the LL and the SC without losing the "link". E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an "infinite distance" between the original load and the CAS). These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations. Nothing in the academic literature requires any sort of global ordering on these atomic instructions. Instead the usual academic assumption is that all memory operations are strongly ordered - which is obviously not true on all modern hardware. Practitioners are required to insert memory fences as needed to achieve the desired ordering. Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do. Azul's CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations. (As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking. Report back the %-tage of dropped counts and the throughput rate. Then try it with an Atomic counter. My simple tests show with a handful of CPUs it's easy to achieve a 99%+ drop-rate - which basically makes the counter utterly useless). I am less familiar with the fence properties of common LL/SC implementations. Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc? build.java
A "HotSpot -server" aka C2 compiler IR visualizerNo I didn't do it! This deserving grad student did it. Here is the reference to the visualizer for the ideal graph of the
HotSpot server compiler:
More C2 Goodies
John Cavazos wrote:
... One of the things I am currently looking at is determining the right phase-ordering of optimizations applied to a particular program being optimized. I have some nice (un-published) results for JikesRVM, but it would be nice to replicate the research for HotSpot... I have strong arguments for nearly all orderings of our passes, so I'm
curious to know about your phase-ordering results.
Travel PlansI got too much travel coming up! Apr 30, May 1st - DaCapo in Boston. Favorite small group; GC & Java focused. Website: http://www.cs.tufts.edu/research/dacapo/. I had to turn down the invite to an SSA seminar in France because the dates conflicted. Very sad. I hope one of the attendees will post a trip report. May 11-May 15. IFIP WG2.4 (International Federation of Information Processors, Working Group 2.4 - a *really* old European group with a random mix of industry & academia.) It's a nice group to preview JavaOne talks, and always the meetings are a week-long rambling discussion in some quaint resort. June 2-June 5. JavaOne. I owe slides for 3 talks by end of this week. 'enuf said. July 6-10th - ECOOP, as a paid-for invited speaker. A free vacation to Italy! :-) Cliff Category: Web/Tech | Permalink » | Comments (10) » | TrackBack (0) » Like this post? |
April 5, 2009 'Tis the Season to be Conferencing... I went to EclipseCon, to CGO and back to EclipseCon. Here's how it can down: I got invited out of the blue to talk at EclipseCon by Darin Wright, who had seen me speak earlier at the JVM Language Summit. Like a dummy I accepted it instantly instead of checking my calander. Naturally it conflicts with CGO - and I was on the CGO PC this year, so I felt obligated to attend CGO as well. Turns out my talk at EclipseCon got scheduled for Wednesday, and I decided I could skip CGO on Monday (I think there was one talk I was interested in) - so I scheduled a 1-day only trip to Seattle. Monday I wandered EclispeCon at the Santa Clara Convention Center, Tuesday I took the 1st plane up to Seattle and the last plane back, and Wednesday it was back to Santa Clara to give my talk and surf the EclipseCon booths. The flights on Alaska Air were smooth and easy. No WiFi at CGO - the Hotel wanted some totally outrageous price for conference WiFi (I think the local chair was saying $50/head prepaid?!?). (Remember my grief over Hotel WiFi in the last blog?) So over lunch break (which was Chinese food and I don't like Asian food of any kind) I headed over to Starbucks. Turns out Starbucks will happily give you 2hrs of free WiFi a day from ATT - provided you register $5 on a Starbucks card with them. Yes!!! Suddenly I have a $5/month nation-wide WiFi plan! I snacked on some salad'y thing, drank my triple-latte, and happily Surfed the Internet until show-time at CGO came 'round again. 1st up: EclispeCon, mostly cause I got nothing to say. These guys are in a totally different ball park from where I live; they mostly deal with complexity control (or at least 1/2 the talks essentially amounted to adding, changing, or removing a control layer). Makes me wonder if the cure is really better than the disease. I gave my talk, but it was too much talk in too little a time window. Both me and the attendees would have been better off if I could have spent more time on the details. C'est la vie. Other thing of note: the web site for EclipseCon is vastly better than the usual academic conference I attend. Also the staff was exceptional friendly and easy to work with. Next up: CGO. Despite my 1 day fly-in-fly-out I got a lot more out of CGO. ---------------------- Keynote: Vikram Adve Missed! Arrived barely in time for the last 10 minutes. That's the penalty for attempting a 1-day fly-by. I hear he said something good about Azul systems (he did ping me the day before for some details). Basically it's a Call To Arms for improving C compilers and C runtimes. He does say that fully auto-parallelizing code is probably a bad idea, and that we need to change the language instead (something I very much agree with!). ---------------------- Revisiting Out-Of-SSA Translation Cliff Notes: HotSpot "comes out of" SSA form during register allocation, where it really preserves SSA form throughout the entire compile process, but adds register annotations as part of reg alloc (and inserts live-range splits as part of spilling). So mostly this talk is about stuff HotSpot has been doing for years... but there's a key new piece. ---------------------- Wave Propagation and Deep Propagation for Pointer Analysis Wave Prop is very fast and uses huge memory (working on parallel version) Deep prop is faster for small & precise memory sets - works better on desktop (and JIT?) There's a zoo of ptr-analysis names: - context-sensitive or not - flow-sensitive or not - field-sensitive or not - inclusion-based (Andersen-based) or unification-based (Steesgaard-based) Cliff Notes: HotSpot -server uses context-INsenstive, flow-INsensitive, field-Sensitive, unification-based Build a graph of constraints, and do contraint-propagation. Code: "a = &b;" Constraint: "a superset {b}" - means b element-of P(a) Code "a = b" constaint: "a superset b"- means P(b) contained in P(a) Code "a = *b" or "*a = b" - complex constraint, must add edges to the graph as constraint propagation adds new sets to the nodes. Now build a graph: nodes are P(a) - sets of things pted-at by 'a', and edges represent subsets. Cycles are common; collapsing them is crucial for performance. All things in the cycle must be equal. Uses Nuutila to find strongly connected components in a single pass (so does HotSpot - Vick's algorithm). After collapsing cycles, you get a DAG - and Nuutila also produces a topo-sort as output. (Still looks like a complex diff/delta propagation algorithm - but makes me wonder if standard bit-vector stuff is faster). Not a bad-looking ptr-analysis. Might work in a JIT. Manesh suggests using BDD's. Cliff Notes: Real question of course, is if all the extra alias precision actually buys you any performance. PLDI paper of a few years ago suggestions otherwise: that HS's style of alias-analysis gets you 95% of all the possible gain. ---------------------- Fast Track: A Software System for Speculative Program Optimization Chen Ding: another process-level fork/join speculative program optimization (contrast this to Emery Burger's "Grace" system). Gives sequential program semantics. Add annotations to "suggest" fork/join points. Actually fork a process, use COW & mmap to guarantee sequential semantics in the face of races (end speculation, or force commit-ordering on speculative threads, etc). New Idea: User can write an alternative implementation of some module; run both versions in parallel; keep the faster version & kill the slower version: "return do_whichever_is_faster( methodX(), fast_methodX() )" New Idea: "fast_methodX" might be speculative and *wrong*. So run slow-version to completion and do error checking after the fact. But let the fast version run ahead (and if it's faster), it can spawn off new speculative fast versions. When slow version completes, use page-level protections to know what's been changed, and verify the correct semantics of the fast version. Run all the correctness checks in parallel with all the normal fast-case stuff. Needs some careful throttling. Example using gcc's "mudflap". Array under/overflow, leaks, un-init'd memory, etc. Run 2 versions of all code: the existing code is "fast track" and the mudflap version is the "slow correct track". Goal: same speed as the "fast track" code with all the safety checks of the "slow track using mudflap". In some cases can get real nice speedups. Cliff Notes: can we turn on mudflag for Azul? ---------------------- Scenario Based Optimization Profile the hot methods. Tweak various parameters (e.g. loop unrolling, or cache-tiling parameters). Profile both old and new versions. Pick the best winner, and stick with it for awhile. Keep profiling continously, and switching code as needed. Some JVM JITs do this. HotSpot only sorta does. The programmer has to pick what the new & old versions are, and the runtime picks the fastest one (no support for correctness - what if one of the versions is busted? Or tickles a timing bug which breaks things?) Seeing some nice gains, some of the time. Avoids losing when turning on some aggressive optimizations. But generally unrealistic. ---------------------- Software Pipelined Execution of Stream Programs on GPUs Basically StreamIt on a GPU StreamIt is much easier to use than e.g. CUDA or OpenCL. StreamIt model is a collection of connected serial parts; connected via pipes. Due to splits & joins, might need to exec some parts more often than than others (e.g. part Y "eats" 2 ouputs from part "X" before outputing a result itself). Get complex patterns of execution; difficult to map down to GPGPUs. Have bounded-buffer problems: need to map to a steady-state schedule using a fixed count of buffers. Then can arrange to do no allocation during execution. - Profile each filter standalone, to figure out it's work load. - Configure selection - Add constraints - Use CPLEX to solve - Output CUDA code. Looks like a classic software-pipeline, just done on a GPGPU. Compute minimum initialization interval, compile resource constraints, etc. Have issues with data-layout, so can do parallel data access. Impressive compile times (30sec to 178sec). Can get pretty good utilization on some kernels, so seeing 100x speedups. Cliff Notes: My usual caveat for domain-specific languages applies here: if your domain fits StreamIt's model - then StreamIt is the fastest language for you! If what you got doesn't look like StreamIt, then Too Bad. ----------------------Stream Compilation for Real-Time Embedded Systems StreamIt Again, but for Cell. New constraint: out of memory per cell (distributed memory). Different from GPU - GPU has shared memory, but very slow if doing bank-conflicting access - but can pile on more memory (more buffers) for some GPUs and less for others. Not so for Cell - must have a uniform split of memory. Thanks, Cliff PS - Thanks Doug for inspiring this blog
Category: Web/Tech | Permalink » | Comments (7) » | TrackBack (0) » Like this post? |
March 19, 2009 Once Again I Put Down my Hacker's Keyboard to Bring You This Trip Report... sigh, all good hacks must come to an end, or at least be put aside so Real Life can intervene. I went to DC for the ASPLOS and VEE conferences and all you get is this lousy blog. ASPLOS & VEE are co-located in D.C. this year. Mostly the trip itself was uneventful. I was very busy either attending the conference sessions or preparing for my own talk and I didn't get around much. The flights themselves were mostly uneventful (ComAir's HQ in the midwest lost power from a tornado, so I was stuck on the tarmac at JFK for an hour waiting to get permission to fly into DC - first time I've been delayed by bad weather when the weather between the source and destination airports was perfect). The most notable point was that the hotel used WayPort as their internet access. WayPort sucks horribly; I've used them in the past and swore I'd never use them again - but I forgot to ask when checking in to the hotel! WayPort charges $13/night, and service is filtered through some horrible layout routing through TimBuk Too - average ping times to hit a server in the DC area was about 600msec (ping times from AT&T wireless in the conference room were ~50ms). To add insult to injury, some other conference attendees stayed at a nearby hotel - which was cheaper and had free WiFi AND the free stuff was faster! Definitely having WayPort as your WiFi provider is sufficient for me to change hotels. Cliff Notes: I had a wonderful 1.5-long 1-on-1 talk with Monica Lam (http://en.wikipedia.org/wiki/Monica_S._Lam); she didn't know what Azul does. Cliff Notes: More interesting talks with Jon Shapiro (http://www.eros-os.com, who's busy implementing a high-level low-level language for high-availability systems). Reality Check: yes again a Mac failed to be able to output to the projector, and we all waited while the presenter hotswapped to a PC Caveat Emptor: As usual for these talks, I only attend some sessions, I write stream-of-consciousness reports which I only lightly edit and the reports tend to be brutal and short. They are roughly sorted here by my interest level, and not by e.g. time order. General Cool Stuff: Keynote: Monica Lam's Vision of the Future of Human / Computers Interaction Getting Wrong Answers Without Doing Anything Wrong - all your base are belong to us Text Processing in Parallel in a Register - wicked fast Nanoscale Computing - wicked scary Demystifying High-level Low-level Programming - how to write Java-in-Java Asymmetric Chip Multiprocessor - 1st good use I've seen for these Leak Pruning - Reported on before, but still good TRIPS - Funky dataflow/VLIW hybrid (but not the weirdest I've recently seen!), too bad it didn't pan out HTM Talks Which I read but due to a snafu failed to attend the talks. :-( Maximum Benefit from Minimum HTM - not so minimum (bigger than Azul's HTM!) and lousy results Rock's HTM - a peek at Sun's experience with Rock's HTM; mostly Rock's HTM is really really weak Multithreaded Bug Finders CTrigger - The Winner in this section: cleverly induced sleeps trigger race bugs reliably Anomaly-based Bug Detection & Prediction - It's an, ahhhh, interesting hack Isolator - silently prevent race bugs Making Programs Deterministic Cliff Notes: Major flaw in the thinking here: Real Programs have Real Random Inputs. Unless you capture & replay ALL external events, minor things like network-packet-arrival-order and the chaotic butterfly effect (http://en.wikipedia.org/wiki/Butterfly_effect) ruin your determinism. So I think all these papers are tilting at windmills. Kendo - Stock hardware, but requires a race-free program Deterministic Replay - Goes only 1/2-way to preventing the Butterfly Effect DMP - Deterministic Shared Multiprocessing - No replay attempt, just full-speed determinism (assuming no Butterflies) Other Stuff Talks I attended but aren't otherwise interested in... Memory Buddies - Yet Another Use for Bloom Filters Live VM Migration - pre-copy push still beats post-copy pull Validate - Microsoft patches Are Bad (tm) Raytracing vs GPUs - Doom! Commutativity Analysis for Software Parallelization - the original work is WAY more interesting than this follow-on Predicting Yield of GCs - Yea Olde Hat Trick Phantom BTB - winner of the Most Unrealistic Hardware award Why do computers break? When we buy something, we expect it to work. If it breaks, you buy another one. But with computers they break on their own (e.g. u$oft update) or you can break it doing "normal" stuff - and if it breaks it doesn't help to buy another one - the new one is broken already. Virtual Desktop - idea: keep the virtual machine in the datacenter, and do remote login. Does not work well - requires WiFi everywhere. With each new comp generation, 10x cheaper & 10x more users. Mainframe-> mini -> worksta -> PC -> laptop -> cell Also see 3 industries: internet, media/TV, telephone Also see 3 networks: cable/satillite, broadband/phone, wifi/wimax Also see 3 computing models: cell (always on, pocket), PC (rich legacy, offline exec, locality of reference), Cloud (device independent, central management). Things like GoogleGears or Virtual Desktop are trying to split the Cloud/PC model. Guess where we are going? 1- (Man) a phone-y thing: key, cache, window into digital cloud, ID, personality, asserts, internet access. Small, always on. Can be lost easily & replaced easily. No important state, but many Gigs of flash cache. No real compute power. Small visual display & keyboard. 2- (Earth) Can (1) plug into a PC. Get big screen & keyboard, get fast processor. Get hot DRAM, teribyte disk drive. 3- (Heaven) Backs onto the internet/cloud. This model has some different security issues. DataCenter is probably secure (or can be?) You hold key in your hands. Can we make path from hand-held-key to datacenter secure? Trends - green, Bring-Your-Own-PC, Mac-vs-Windows. End of Corporate PC model. 80% of people use personal PC for work; 45% of people use work PC for personal use. 25% of workers change security settings, etc. SO... want: {User, Ubiquity, IT} Separation of Concerns. x86 virtualization seperates: hardware from VM; hardware from OS; different hardware configs; different user roles MokiaFive company - Desktop-As-A-Service. LivePC images "in the cloud". Managed on central server. Can go up to any mac/pc and run "your desktop" from a USB stick. USB stick/phone is a local cache only - backed up on the cloud. Think of USB memory as a "network accelerator". Nice live demo. From USB stick to a selection of PC desktops is not very long (+2 mins for a new PC to install software from stick). No hotel wifi? Claims to be sync'ing with cloud??? Nice integrity model: basically re-install the OS clean each time on a fresh VM-ware layer with anti-viral. Intent is to clean out root-kits with each fresh login to machine. Graphics is hard: many games are wired to DirectX, so losing much 3D performance. Trying to bypass all the virtualization layers for DirectX. But... guest & host have different address spaces; but graphics has lots of pointers; fast pointer munging is hard. Other services: LDAP, revocation, offline access, logging, dlp. Encyption, Storage Virtualization: reju, incremental & atomic restore; secure access. Uniquification, single-sign-on. Seperate-of-Concerns Principles - 90% of code handles exceptions (only 10% does the normal stuff). "Re-Install the OS" as a root-kit prevention. e.g. system is structured as a fresh-OS-install because of the rare root-kit issue. Bootstrap from a small trusted base: baremetal VM; LivePC player; IT-managed base image (Windows+AntiVirus); finally user-apps & data Like the Azul philosophy about GC - make the hard case (FullGC, infection from rootkit) common and then get it right - then all else follows. Need to keep users ownership of own data - avoid the Big Brother mentality. Give users a choice on where data is kept. Variations in unix env variable size, can see 15% variation in performance! Running gcc -O2 vs gcc-O3, 400.perlbench. Bunch of runs at each setting very repeatable per-run. But with different env sizes see huge variation on 400.perlbench & 'ibm' benchmark. Conclusion: setting unix env variables can change your conclusions. Also, sorting object files on the link-lines differently changes layout in memory, and in I-cache layout. Conclusion: linking with 'ld *.o' is different from 'ld a.o b.o'. Bias holds true even if looked at a large benchmark suite, or across different CPU #'s from Intel. See a prior paper (which I now cannot find!!!) which studies this carefully. Note that there are a bunch of JVM/Java benchmark issues here that also have been written up recently. Cliff Notes: I've dealt with both of these in the past; so has Mark Moir (Sun) and Bean Anderson (SGI). I discovered the need for I-cache layout while working for Motorola on PPC chips (somebody else actually did the implementation). The results gave us an average 10% performance boost on a bunch of benchmarks, plus removed a huge source of variation. Later at Azul, with our shared low-associativity caches we discovered that stack placement was crucial to avoid conflicts, fixed with a little random offset to each stack base. Cliff Notes: What are other large systemic biases? Are these two it? And obvious question for a JIT - should I start doing I-cache layout games for performance? How about making copies of an existing hot method and running both old & new and doing sampling, and pick the winner? Cliff Notes: this technique is insanely fast Like SIMD, but char/bytes within a standard GPR. So 4 or 8 chars in a GPR? Doubling: field width from n-bit to 2n-bit Inductive: repeated doubling. Parallel bit streams; 1-to-1 correspondence to byte stream. Vastly faster XML parsing; better cache locality; better branch prediction. So: take 1st bit from each byte, and jam 64 bytes (bit #0 only) into 1 GPR. Take 2nd bit into 2nd GPR, etc. Basically, load 64 bytes into 8 GPRs, but bit-striped. Need architecture support for bit-striping pack/unpack. Now break up desired operations (e.g., compare for lexical token boundaries) into bit-field operations. Can do compares of 64 bytes for tokens by simple ops on 8 GPRs. Cliff Notes: really lousy presentation. Hope the paper is better. Guesswork: take this string "java/lang/Object(Crunkify)" and find the '(' tokens. GPR #0 - xxxxxxxx - bit 0 from all bytes GPR #1 - xxxxxxxx - bit 1 from all bytes GPR #2 - xxxxxxxx - bit 2 from all bytes GPR #3 - xxxxxxxx - bit 3 from all bytes My turn for a lousy presentation: Look at bits for '('. Want some combo of 0's & 1's from the GPRs. So e.g. XOR w/-1 all GPRs needing 0's - now need them as 1's. Then do 7 AND's (parallel log tree AND). Any bits still set represent the '('s. Nicely wide-issues per 64-bytes. Expect to compare 64-bytes for a single char in maybe 6-8 clocks!?!?!?! Plus evil pack/unpack issues/costs. Wrote some code to demo finding '(' in a long string using maybe 1/10th clock per character, plus whatever the packing costs are (which probably dominate!). In this code, 'b0' through 'b7' are 64-bit values holding bits 0 (through 7) for 64 chars in turn. Here's the test-64-chars-for-paren check: // The open-paren char, '(', hex 0x28:See attached source code for TextScan.java 128 bytes of ram; 5micrometers in area Smaller version: 8 bytes of ram; 1.5micrometers in area; similar in size to largest virus. Chemical sensors report results by directly changing bits in the instruction stream (e.g., flipping a bit changes a JMP to a NOP, in effect making a conditional branch). Can detect adequately running at 100Hz (not 100MHz). Can write small simple test/loop codes in this form factor. Can e.g. count molecule rates or virus rates in a solution (injected in bloodstream). Cliff Notes: Very Very Scary. See Vernor Vinge's dystopia "A Deepness in the Sky". http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky http://en.wikipedia.org/wiki/Smartdust Demystifying High-level Low-level Programming Goal: Abstraction w/out guilt. Separation of concerns. Nice historical perspective. Issues go back as far as 1975 - using C for writing an OS was a big issue back then. So it's time for the modern revolution of the same thing. JVMs for OS's. Various C support tools: coding styles; Purify; idioms; tools (Valgrind). Also Systems-languages - BLISS, Modula-3, Cyclone, etc Or 2 languages: JNI+Java JikesRVM extends Java, also OVM, Moxie, DRLVM, C# (Bartok). Usually in context of runtime research. This talk: Jikes, Java-in-Java. Containment: - minimize exposure to low-level coding - extensible; requires change fast, but language changes slowly Encapsulate: contained low-level semantics. Fine-grained; separation of concerns. "Address" is Magic, like unsafe. "Address foo; int x = foo.loadInt(offset)" More types for "void*" in Java, or generic OOP pointers. Intrinsics: contract between you & the compiler; at the Java level it has an empty implementation. But the compiler will always insert the correct code for the e.g. prefetch intrinsic. Also annotations to do e.g., turn off bounds-checks or null-checks (performance hints, but break safety). Scoped semantic regions - scoped rule changes, e.g. "NoNew" - no allocation in this scope (HotSpot uses this technique extensively), or "NoBoundsChecks" in this scope, etc. Unboxed is another annotation - no class header, no v-table, basically a C struct. Also can do direct field layout. @RawStorage - basic static C memory layout. Backed by native width raw storage, only accessed by native-width C intrinsics. Can both compile "fast" - down to raw bits, and "virtual" - to some virtual implementations allowing full GC on full simulation. Cross grid of threads vs open-tables in MySQL. (looks like ref-counting of MySQL tables). Blah - implementation looks alot of NmethodBuckets in HotSpot. 2-d grid. Updates are complex & require complex locking. Look at some locking traces. If we could accelerate the critical sections then we get a Amdhal's Law approach - removes the serial portion. Idea: switch to the one fast CPU on a contended lock. Seeing speedups of 50% or so for 12-16 cpu system, where program was seeing e.g. 5x speedup and now gets e.g. 8x speedup. One of the benefits: locks sections always executed by same CPU, so caches remain 'hot' for hot-lock code. Cliff Notes: This is the first real use of Asymmetric Multiprocessors that I believe in. See it & reported it before. Survive OutOfMemory & leaks by reclaiming predicted dead objects. Used a simple X86 read-barrier w/6% slowdown. Hoping to see new slowdown numbers for the read-barrier: see 3% slowdown on Core2 & 5% on Pentium4. Using 2 bits per reference; bit 0 is for stale-refs; bit 1 is for pruned refs. So instead of throwing an OOM, look thru heap and see if find a dominate object. Assume dominate object is dead & leaked; "Cut" link to object and replace with a poisoned pointer. Touching this pointer throws the original OOM. But if never touch object, then never throw OOM. Cut links allows dead object (and his transitive closure) to be reclaimed. Try to predict roots of dead datastructures. So profile max time between last use (using something like GC age bits, but exponential counter bits). Also need to track which refs are keeping most bytes alive. HashEntry -> PreparedStatement -> ParserInfo -> byte[][] -> byte[] -> ResultSet -> Field[] Compute during Phase 1: Sometimes the HashTable grows, and PreparedStatement gets touched. So max-stale-then-used for HashTable-> PreparedStatement is 16-32 GC cycles. But max-stale-then-used for PreparedStatement-> ParserInfo is 0-1 GC cycles. Compute during Phase 2: Count bytes kept alive for each type of thing that has a low MSTU number. See PPS->ParserInfo has 420M kept alive. So kills all edges from old PPS->ParserInfo (but not young ones, because MSTU is kept per object? I now think he keeps MSTU per-class, so no bits in object but can suffer from common types like Object[] being used both in leaks and in hot code). Don't worry about DAG sharing - ends up blaming the leak on 1 ref when it's really kept alive by several refs; just means on 1st go-around you prune one edge - but leak remains alive. On 2nd go-around all the leak is blamed on object type#2 and pruned then. Results are decent - keeping some complex real apps alive essentially forever. Performance is decent, but near OOM conditions things slow down as first come frequent GCs, then LeakPruning kicks in and cleans the heap back out. Cliff Notes: Wondering if could do read-barrier check on low bits by e.g. loading from the loaded pointer with an X86 op which faults on odd addresses? Means faulting on stale & poisoned refs; faulting on poisoned is OK (means throwing OOM anyways), but faulting on stale might be too frequent still. TRIPs is an alternative processor architecture, with a square 4x4 grid of functional units (GPUs); registers are along one "wall", D-cache is along another 2 sides, etc. Inter-GPU traffic is explicitly scheduled, but with a data-flow-like flavor (static scheduling, dynamic timing). I've reported on this project before. This talk is essentially an end-of-project summary. And the summary is: not much better than a Core2 (if the chip had been built with Core2 technology). Each chip has 2 processors; 1M L2. 16GPUs, 128 registers, 32K L1 cache. Processor is distributed. 4x4 grid of GPUs; registers along one side; D-cache along other sides. Window size of 1024 instructions. Block atomic execution model; dataflow between blocks. Blocks are single-entry, multi-exit. 1 non-speculative block; up to 7 speculative blocks. Heavy use of predication to build larger blocks. Direct instruction communication model. Special instructions to handle large fan-out. Large block of conditional/predicted stuff mapped onto the 4x4 grid of FPUs; using reads from registers and reads/writes to d-cache. Larger blocks require more scheduling; harder to place. What is the utilization of the large instruction window? Both scheduling, plus e.g. cache-miss & branch mis-predict. Static delay: scheduling issues; 1-clk to move data from FPU to FPU. Benchmarks: 53 compiled benchmarks; 13 hand-optimized benchmarks; 2 hand-optimized and hand-scheduled. Gather numbers of the actual chip, except when needs to use SIM numbers (because chip didn't gather the right data). Generally, out of the possible 128 instructions in a block compiler is able to get about 60 instructions. Hand-optimized code uses smaller blocks because better scalar optimizations reduces instruction count. About 1/4 of all instructions are predicated & never executed. About 1/4 of all cycles are lost due to scheduling. About 1/2 of all cycles do useful work. Very large code-bloat: 55% bloat from nops, or direct reg read/write ops, 18% from predicated useless, 27% from instruction-replication optimizations (loop unrolling, inlining). 4 SPEC benchmarks have i-cache miss rates > 10%. Direct memory accesses are much lower (no spill ops?) Need variable-sized blocks; would reduce the code bloat from 11x to 6x (over canonical Alpha code). L1 is partitioned. Compiler can acheive 2 mem-ops /cycle on vector add; hand-optimized can hit the max of 4 mem-ops / cycle. Again for matrix-multiple the compiler hits 1 flops/cycle; hand optimized hits 7 flops/cycle. Clock speed is 366Mhz; mem speed is 200Mhz; process tech is 130nm. Compare to a Core2 but boost TRIPS "as if" if gets the 65nm process. Compiled versions are competitive to Core2. Hand-optimized about 3x faster than Core2. What works: dramatic reduction in data-cache & register file refs. Compiler is able to work (to schedule to this weirdo thing). But performance isn't very big over a Core2. Challenges: dynamic code size; dataflow overhead instructions; I-cache efficiency; predication overhead; generating efficient code for control intensive codes. Running a (supposedly) simple HTM on a transaction-aware Linux kernel. Their proposed HTM is somewhat stronger than Azuls', in that they support parallel independent transactions (the main one, and ones run inside interrupt handlers), a special version of CAS, and graceful overflow to software. Also they holding pending transactions in a much more generous L1 than Azuls'. Results look pretty anemic and fragile, which is on par with Azuls' results (4x speedups on 16-way boxes; 60% of exec cycles wasted on retries). On the plus side, they at least attempt a reasonable benchmark set. A look at Sun's Rock's HTM. It's a lot weaker than Azul's - max of 16 (or 32, depending on how many hardware threads you use) pending writes; aborts on function calls, TLB misses, mispredicted branches & some fp ops, - plus the same reasons Azul's HTM can fail: eviction from the L1 (due to transaction being too large, or conflict from another CPU). Major discovery (just like Azul discovered!): feedback is crucial to decent heuristics. Your failed HTMs need to report as much as possible about what went wrong so the heuristic stands a chance of getting it right. This required a hardware spin (Rock2). One useful heuristic is the standard backoff mechanism: if there's true contention, then backing off can spread the memory references far enough apart to allow the HTM to succeed (presumably succeed with less overhead than a simple spin-lock solution). They report hashtable numbers, but wildly concurrent hashtables are commonly available. They report using the Rock HTM for doing an N-CAS. This is largely successful, and might simplify many libraries. Software testing is inadequate. Poor interleaving during standard stress testing. CTrigger - for any given program input, force interesting interleavings. There is an exponential interleaving space for each input, but stress testing is using random interleavings. Previous work only controlled the context-switch interleavings. CTrigger focuses on interleavings that likely hide atomicity bugs and rarely happen during normal testing. Focus on interleavings that likely hide bugs: e.g. 2 nearby refs to the same memory in one thread - and another write to same ref in another thread. Force these accesses to interleave by injecting "sleeps". Tried on 7 apps (server, client, scientific) on 8-core machine. 143 total Unserializable Interleavings (program interleavings deemed interesting from atomicity standpoint). Stress testing only hits about 80 to 100 of the U.I., out of 143, even for many many runs. Basically some (1/3rd) of the U.I.s are very unlikely to hit with stress testing - this demonstrates the key weakness of standard stress testing. CTrigger looks at the access traces and attempts to force the interleaving. Needs to trim out infeasible paths: locking forces many orderings to be impossible (i.e., the code is correct and no race happens). Some refs are inside the same lock; some are Happens-Before on thread-create or thread-join. Typically 90% of apparent bad interleavings are infeasible. Focus on rare interleavings also (not-rare races happen easily in stress testing). If the "local gap" is small - time between 2 refs in the same thread is short, then the race is narrow - and hard to hit in stress testing. Also look at the distance between external ref distance in time to the gap. Sort these potential races by likelihood. Instrument the program to add an artificial delay at a few points to try & force these rare races to happen often. Then run program as normal. Program speed is similar to unmodified program, because artificial delays are small and few in number. Program runs fine on multi-core machines; using the multi-cores as before. But now the rare races are common, and stress testing finds them quickly. Typically find bugs now 100x faster than before!!! Found bugs in all these apps. Was able to repro the breakage 100% of the time, generally very quickly. Defect - the Bug Infection - Defect is triggered ==> program state is corrupt Failure - An observable error (Crash, hang, wrong result) Infections spread over time, or get overwritten (hence bug is squashed in this case). Software debugging: hypothesis bug; change code; attempt to repro bug, etc. - detecting anomalies; out-of-bounds addresses, unusual control paths, page faults or redundant computations. So profile program to teach the detector what the "good" behavior is. Some values are in fixed ranges during the teaching runs, or some mem locations only accessed from a few ld/st ops; or abnormal number of loop iterations. Current crop of anomaly detectors have a fairly high false-positive rate. - Now isolate relevant anomalies: compute dynamic forward slices from anomalies to crash point. Ouch - this step looks really expensive. Rerun the broken program, propagating tokens from anomaly site to crash site. Use binary search on the anomaly set. Can reduce the count of false positive anomalies by a factor of 10. - Now "validate" that have found buggy line of code: NOP that line out and see that you can execute further. Still don't know what the fix is, but can point out the step that e.g. corrupted memory. Appears to scale up to e.g. gcc (330KLOC) found a bug in distributive laws causing a loop in RTL tree optimization (C2 used to suffer these routinely). Isolation in concurrent programs: detect isolation guarantees; i.e. access in critical regions from non-locking threads (or incorrect locking). Bad threads can break well-behaved threads, (the usual motivation talk for debugging data races, I wish people could skip this or at least limit it to 1 slide & 1 minute). Also STMs only have weak atomicity, so bugs still happen for well-behaved STM threads. Cliff Notes: This is the key: some other buggy thread breaks your well-behaved concurrent thread; studying the break point is useless. Isolator: Program executes as-if no isolation violation. Uses a combo of data-replication, virtual memory & code instrumentation. Does not require the whole program. Requires "guarded by" annotations. Cliff Notes: just having thread-level TLB protection would let you do some COW-style (http://en.wikipedia.org/wiki/Copy-on-write) protections. Just write-protect all pages touched during a locked region (except allow writes from self). All other threads that attempt to write without holding the lock will get a TLB fault and can be stalled (and reported as racing). Stalling can cause deadlocks, so need dynamic deadlock detector. (Youch!) Either allow racing access OR throw a DeadlockError. This above option is done by Isolator, except requires annotations rather than lazily discovering touched variables. Issues: expensive to change protection levels on every lock. Fix: "lazy unprotect" - only remove protections on-demand, as other threads need them. Could "shatter" big pages, or split unrelated objects to private pages (via GC). Cliff Notes: similar to HotSpot's Biased Locking, but this wants per-thread page-protections; really want user-mode control over per-thread page protections. Issues: page-size granularity - has fragmentation overhead for fine-grained locks. Only works for data-race free programs. Very efficient deterministic interleaving. Progress is made with "Deterministic logical clock" time. Work on SPLASH benchmarks. See 18% to 27% overhead, but deterministic always. Requires no special hardware. Each thread maintains a "logical clock" similar to Azul's COUNT register or TSC on X86. Could be e.g., a "bytecodes executed" counter. Globally, the thread with the lowest "logical clock" has the right to do a e.g. lock acquire (non-determinstic action). Assumes all inter-thread communication happens nicely inside locks. Enforces lock-acquire ordering, except for nested locks (have to prevent deadlock, since basically adding another locking layer). Cliff Notes: Mostly interesting because this could work on stock hardware (but it is limited to race-free programs which are precisely those that you'd like this to work with). "Time Travel" for concurrent debugging Phase 1 - record non-deterministic events Phase 2 - replay Software versions have been around for ~15 years. Now commercially available. Also available are libraries, compilers & OS's to help. SW stack flaw: limited to speed of 1 cpu How do you integrate HW & SW stacks? i.e., If HW has captured a log, where does it go? Do not want to capture or replay the SW stack that is doing the SW recording. Also need the HW to be running across the whole machine, so that all processes are using it because interacting processes need to all be deterministic (see butterfly effect above). So some abstraction: not all elements of the system are being recorded and can be replayed. May need to simo record multiple user-mode processes in the same recording "sphere". Only user-mode threads are being recorded (OS support needed to virtualize HW, but also OS needs SW record/replay that is process specific). Really he's making a weak stab at recording everything to avoid the butterfly effect. Now record 2 different kinds of info: HW info: memory interleave log SW info: system stuff log; scheduling; context switch; I/O; signals Replay also has to handle, e.g. malloc has to be deterministic. Cliff Notes: but not GC, although HashCode has to be deterministic. Key Issues: 1- Deterministic injection of data into controlled "sphere" (I/O) 2- Using fewer processors during replay than were used during recording 3- Emulating vs re-execute system call (e.g. fork requires a real system call) - but this makes a lot of OS state changes. Key Issue #1 - Capturing ALL I/O - Random OS calls write into user-space (i.e. setting return values in memory). But during replay need e.g. the SAME "fd" as before. But OS under no obligation to hand out the same "fd" at an open(). Or OS writes data into a buffer - BUT another user-thread is ALSO writing into the same buffer. Key Issue #2 - Not enough CPUs during replay - I think he's expecting HW to do the memory interleaving on replay!! But basically, it's time for thread#1 to run - but the OS wants to schedule thread#2 - just need to block #2 and allow #1 to run. Very slow (constant OS thread scheduling overhead). Log size - in bits per kilo-instructions. Small: 3.15 bits/kilo-instruction. Recording performance overhead: 20-50%. Replay performance: using replay HW!?!?! About 2x slowdown during replay. How to handle large datasets? Hard to debug, etc, etc... So just make the whole thing deterministic. No "Heisenbugs"; can use time-travel debugging. Can test like it's a serial program. No attempt at record/replay really - just force same ordering every time. Cliff Notes: Suffers butterfly effect. Instructions communicate via shared memory. Only care about "communicating" instructions. Can serialize - but that blows performance. So really want to figure out which instructions really communicate. Requires HW support. Conceptually have a "deterministic token" which is handed out to processors and then is passed around all the CPUs in round-robin format. This token totally orders all instructions. Each "quantum" is roughly 1000 instructions, but varies. Now filter out non-communicating instructions. Allow the non-comm instructions to run in parallel. Leverage cache coherency to track when cache lines move. Build a "sharing table" to help order memory accesses. Now add some TM support to help get more parallelism. Now come some quantum-building optimizations: break at sync/lock boundaries. Communication is often bursty: so break quantum after a series of stores to shared memory. Sharing memory between VMware VMs(?) Hypervisor finds identical pages and shares them under the hood. (Probably good for e.g. shared libraries). See 1/3 reduction in memory usage in VMware ESX server. Get sharing only if you run similar apps on the same hardware. Placement is hard in datacenter; lots of clients & software stacks; things change over time. Manual placement is tough. Goal: estimate sharing ability across many physical machines and many virtual machines across the datacenter. Create 32bit hash per 4K page. 1MB hashs per 1GB of ram. Forward to other hosts. Comparing lists is slow. So use a Bloom filter. With hash lists, just sort & compare. With Bloom filters you use a dot-product of bits; very accurate in estimating the number of elements shared between Bloom filters. New VM launch - run on staging host. Figure out it's memory footprint. Compute it's Bloom filter per page. Compare it's filter against other Filters on all other hosts in data center. Find closest fit (on machine with enuf memory & cpu) and move the new VM there. How to migrate VMs cheaper (force their free memory pools to shrink before migration; similar to having a JVM do a GC & heap-shrink before migration, then inflate back to normal size in the new host). Contrast to the usual approach: - 1st iteration copy all - later iterations copy any dirty pages - after dirty workset is "small enuf", stop VM, copy active dirty pages, - restart on new host. Goal: reduce downtime. Con: Write-intensive apps have long migration times (because of constant page copying) and high network overhead. Alternative: copy minimal CPU state & pages, start running on target. VM demand-pages from source. Also actively push pages from source to target. Good: copy pages only once; lower total migration time & network overhead. Works well for both read-intensive & write-intensive. Con: resumed app suffers while pages demand-paged in. Can make some predictions on future needed pages from apps page-faults - likely will want other pages nearby in virtual-memory-addresses soon. Examples are walking an array or working a stack. Make a psudeo-swap device in the guest OS for demand-pull pages (basically swapped over the network). Allows the VM job to keep running in other threads or processes - only 1 thread is blocked on a page fault. (Actual implementation has lots of warts due to nature of existing Xen infrastructure). e.g. we have a much more direcy way to detect memory-full conditions and politely share memory between JVMS. Results: feasible, but clearly more downtime over pre-copy in the common case. But works well w/write-intensive VMs. Patches are dangerous. Delaying change is a safer bet even for security bugs: more likely for downtime from patches than your system gets hacked while you wait. Raytracing is more expensive per-ray, but less expensive per-object. Can be parallelized. Makes really great shadows & reflections. Want to improve predictability in ray-tracing to get "good enough" frame rates. Trying to use SIMD. Primary rays are parallel, easy to SIMD. Seconday rays are incoherent, not good for SIMD. Cliff Notes: no surprise but modern GPUs are not well suited to raytracing work Sources of Commutativity - e.g. sum, malloc Can detect it? Check memory state before & after running fcn X. But misses e.g. set's (via linked-list) - linked-list order is different when swapping 'put(A)' and 'put(B)' calls, but semantically the same: both elements are in the set, but the set's implementation stores them differently. How to test for commutativity? Symbolically execute 'put' calls in different order. See if memory contents are same: "put(A);put(B)" vs "put(B);put(A)". If yes, then Yes - but suppose not. Now check reader fcns: try is_member() (on what values???) Also check writers like 'remove'. Problem: exponential search space for equivalence. Use Random Interpretation- - choose random input values - interpret till conditional branch - follow which way it goes; tests program - Then back up to cond branch - Adjust memory & values such that test fails - Now follow thru on other branch; test program Now combine memory states for each branch of program. Using an 'affine join' to combine memory states??? Somehow correctness bug becomes an exponentially low; very very unlikely the Random Interpretation ever computes some wrong answer (declares as commutative some function which is not really commutative). Looking at MediaBench & SpecInt on an infinite issue machine. See 20% to 30% of all functions are commutative (with themselves?). Only see speedup when commutative fcns are called near each other, but then can see parallel speedup (log tree combining?). Average is 50% speedup. Apparently assuming an extremely wide machine, as opposed to threads (ignoring communication costs & thread start/stop). Cliff Notes: Random Interpretation looks interesting, not sure about the rest of this paper. In fact, this paper looks like a small delta (and very difficult to use correctly) over the R.I> work. Instead see 2003 paper: "Discovering Affine Equalities Using Random Interpretation" http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf Amount of memory reclaimed by GC *can* be low. But looking at minimal GC footprints. When GC is not productive, then time to extend the heap. Estimate GC yield by using Virtual Memory to track pages which are touched and which are not touched - not-touched pages are probably full of mostly dead stuff. Cliff Notes: Yawn, Olde Hat Trick. Cliff Notes: got bored; symptom of Day 3 at a conference; worked on slides Cliff Notes: Winner of the most ridiculous proposed hardware, backed up a big speedup on the most-unrealistic simulated hardware. Virtual BTB Design - as in Virtual Memory not as in VMWare. BTBs today are {small,fast,inaccurate}. Want {big,fast,accurate}. Allocate space for the BTB in the L2(!?!?!). But latency to the L2 is large, so much prefetch the BTB data. Getting tidy 10%+ speedups on unrealistic hardware. BTB in L2 cache is large+slow. Missing BTB entries are evicted to L2. Then prefetch from L2 based on previous missing branches (assuming that an upcoming BTB miss can be predicted from a just-prior missing branch). Attempted lots of other predictors; most failed miserably. Pay-as-you go model; BTB gets effectively larger as more pressure happens on the on-chip BTB. Cliff Notes: HotSpot inline-caches wipe out the need for faster jump-register support in hardware; standard branch predictors work well with HotSpot. Total clocks spent doing jump-register instructions are less than 1% of all clocks on realistic hardware.
Category: Web/Tech | Permalink » | Comments (4) » | TrackBack (0) » Like this post? |
February 25, 2009 (sorry for the long gap between postings; work's gotten interesting and I got busy) I recently attended the Bay Area Workshop on Transactional Memory at Stanford generously hosted by HP. Slides are here; my slides are helpfully titled "2009_TMW.pdf". In this workshop I gave a brief overview of Azul Systems' Hardware Transactional Memory support plus our experiences in using it. This was followed shortly by a blog posting from David Dice discussing Sun's HTM support in the Rock processor. For Azul Systems' certainly (and to a lesser extent Rock), the name of the game is throughput: we appear to be generously over-provisioned with bandwidth, we can sustain 30G/sec allocation on 600G heaps with max pause times on the order of 10's of milliseconds (and unlike IBM's Metronome hard-real-time collector, our MMU past 20msec is 0.99; see MMU definition at bottom). Each of our 864 cpus can sustain 2 cache-missing memory ops (plus a bunch of prefetches); a busy box will see 2300+ outstanding memory references at any time. We have a lite microkernel style OS; we can easily handle 100K runnable threads (not just blocked ones). Our JVM & GC scales easily to the whole box. In short: the bottleneck is NOT the platform. We need our users to be able to write scalable concurrent code. The goal of our HTM has always been to accelerate "dusty deck" Java programs via Lock Elision. (We've never been tempted to enter into the language wars and implement some form of full-blown transactional programming via an "atomic" keyword). We see a lot of defensive uses of the "synchronization" keyword; legacy libraries using "sync" on all methods or code maintainers sprinkling "sync" about liberally in order to squash some bug. In practice, 99% of all locks are never contended - and these we run very fast (our CAS & FENCE instructions can hit in cache; an uncontended lock requires only a few clock cycles). But the locks that are contended prevent parallel execution (Amdahl's Law and all that)- and we observed that much of the time the contention is on the lock itself and not on the data it protects. So we added in HTM support in our first processors and this support has been shipping and turned on at all costumer sites for over 4 years now. By now we have a lot of field experience with using HTM. But first, a brief overview of our HTM support: we use a few extra bits on each L1 cache line to track "speculatively read" and "speculatively written" cache lines. Each CPU can be put in a "speculate" mode; loads and stores then set these bits accordingly. If a speculative line is evicted from the L1 for any reason then the transaction "aborts" (there is also an "abort" instruction). Speculatively-written lines are marked invalid (meaning: they no longer cache any data) and speculatively-read lines have their spec-bit cleared. Also the CPU branches to a fixed trap vector for software fixup. If the CPU makes it to a "commit" instruction then the spec bits are cleared - including the speculative-write bits - which atomically makes all those writes visible to other CPUs. In other words, the XTN commits. Nothing aborts a transaction except the "abort" op or losing a speculative line out of L1. We do not abort on function calls/returns, nor TLB misses, nor exceptions nor rain nor snow nor sleet nor dark of night... Our XTNs are limited therefore by the size and associativity of the 16K 4-way L1 cache (and the shared inclusive L2). We routinely see successful HTMs of many thousands of instructions with many hundreds of cache-hitting memory ops. (contrast this to what I can figure out about Rock: Rock appears to abort on function call/return, on running out of store-buffer depth (which limits an XTN to 16 stores?), on TLB misses, on L1 associativity, on common nested locking patterns (because of abort on function calls)). Software heuristics determine when to try the HTM (uncontended locks are much cheaper using a cache-hitting CAS). The software measures the success/fail ratio of the HTM on contented locks and determines when to flip to a standard blocking implementation and when to use the HTM. While some apps get a nice 2x speedup (e.g. Trade6), most apps are speedup only slightly (we had lots of teething problems with the software heuristic that used to cause 10-20% slowdowns due to endless fail/retry loops, although these all appear to be fixed now). We nearly always see a handful of locks using the HTM support, but these appear to only rarely be the "right" locks to get more CPUs really busy. HTM failure appears to nearly always be due to conflict and not capacity. Issues In short, users' don't write "TM-friendly" code. Neither do library writers. For example the "modcount" field in Hashtable is bumped for every update in the Hashtable. For a large Hashtable we can expect most mods to be in unrelated portions of the Hashtable. i.e., without the modcount field update we would expect the HTM to allow parallel 'put' operations. With the modcount field update, 'put' operations always have a true data conflict - and this aborts the HTM. After some number of failed 'put' operations we have to start really locking the Hashtable (to allow the puts to make forward progress) and this now single-threads the 'get' operations as well. This general pattern of updates to peripherial shared values (usually performance counters of one sort or another) is very common and it kills the HTM - because it presents a true data conflict. Many times a small rewrite to remove the conflict makes the HTM useful. But this blows the "dusty deck" code - people just want their old code to run faster. The hard part here is getting customers to accept that a code rewrite is needed. Once they are over that mental hump, once a code rewrite is "on the table" - then the customers go whole-hog. Why make the code XTN-friendly when they can make it lock-friendly as well, and have it run fine on all gear (and not just HTM-enabled gear)? Also locks have well understood performance characteristics, unlike TM's which generally rely on a complex and not-well-understood runtime portion (and indeed all the STMs out there have wildly varying "sweet spots" such that code which performs well on one STM might be really unusably slow on another STM). Really what the customers want to know is: "which locks do I need to 'crack' to get performance?". Once they have that answer they are ready and willing to write fine-grained locking code. And nearly always the fine-grained locking is a very simple step up in complexity over what they had before. It's not the case that they need to write some uber-hard-to-maintain code to get performance. Instead it's the case that they have no clue which locks need to be "cracked" to get a speedup, and once that's pointed out the fixes are generally straightforward. (e.g., replacing sync/HashMap with ConcurrentHashMap, striping a lock, reducing hold times (generally via caching), switching to AtomicXXX::increment, etc) Summary
Thanks, [MMU is a common measure of GC performance and stands for "minimum mutator utilization". It shows the worst-case amount of time a worker thread gets for a timeslice of a given length vs time spent in GC. Thus an MMU graph plots utilization (a number from 0 to 1 where 1 is better) versus timeslice size. A program using Metronome might see a max GC pause of 1microsecond; it's MMU at 1 microsecond is then zero because the mutator can do no work for that microsecond. However the MMU at 1 second might be 70% showing that in the long run Metronome's GC exacts a 30% performance "tax" on mutator operations. Contrast this to Azul's Pauseless GC: at 10msec our MMU is zero but at 1sec our MMU is 0.99; in the long run there is no GC "tax" but in the short run our max pauses are much larger than Metronome's. Metronome is suitable for e.g. hard-real-time applications like audio-playback, whereas Azul Systems' is suitable for e.g. soft-real-time transactional apps with high throughput.] Category: Web/Tech | Permalink » | Comments (11) » | TrackBack (0) » Like this post? |
November 18, 2008 I had an email conversation between myself, David Moon & Daniel Weinreb. For the younger readers: David Moon is one of the original architects of the Symbolics machines, which 20 years ago more-or-less attempted the same sorts of things Azul is doing now; i.e. language-specific hardware support. I lightly edited the emails for clarity and ordering (otherwise the nested interleavings get horrendous to follow).
[Ed: 11/20/2008 - Some small follow-on conversation has been appended to the end] Cliff Click wrote: Mostly I felt inspired by your Symbolics work. Azul Systems makes gear for running Java (but actually it runs a C/C++ program - we just happen to run a large C++ that implements Java). One of the biggest impact changes we made was a hardware read-barrier for GC - a simple instruction that tests invariants of freshly loaded pointers and takes a fast-trap if the test fails. Fixup is entirely in software. It's a RISC-y approach to the GC hardware that Symbolics had 20 years ago. David A Moon wrote: I remember this now; this was one of the things that impressed me when I read about your machine a couple years ago. As you may or may not know, the Symbolics 3600 had fairly complex GC hardware and the Symbolics Ivory went 90% of the way towards riscifying it, eliminating all the large hardware tables. I think there was still a 16x1 bit (NOT 16Kx1 !) table to indicate which portions of address space are oldspace; with 64 address bits instead of 32 you were able to reduce that to a single bit, which is nice. Cliff Click wrote: We use very simple 3-address 32-register 64-bit RISC cores, with 54 cores per die. All chips are cache-coherent and uniform (medicore) memory access time for all; the upper bound is 16 chips or 864 cpus. The "big" box is about the size of a dorm room fridge, 14U form-factor. The big box also has top-20 super-computer bandwidth (barely). Modest 16K I- & 16K D-caches for all cpus, plus groups of 9 share a 2-meg L2. Chips are fully interconnected. David A Moon wrote: 3-address 32-register 64-bit address, 32-bit instruction RISC does seem to be the "sweet spot" for instruction set architectures. [Cliff: Yes!] I hope you avoid needless complexity like condition codes, link registers, maybe even separate integer and floating point registers. [Cliff: Yes, yes & yes] I don't think Java needs floating point rounding and trapping modes either. [Cliff: Yes, it does not] For your application you probably don't even need separate user and supervisor modes since all executable code is generated by your just-in-time compiler from a safe language. Cliff Click wrote: That's the theory but the practice is a little different: JVM's crash and 1 crashing JVM should not bring down the whole box. So we in fact have a user/kernel split and it's saved our butts any number of times. David A Moon wrote: Maybe you don't need any kind of mode bits register at all? Being completely modeless would be cool. Cliff Click wrote: There's also a speculative-memory mode: writes are buffered in the L1 but don't become visible until a 'COMMIT' instruction. David A Moon wrote: That's a pissload o' cores. Are those 54 real cores per die? Or is it perhaps 6 real cores, each executing 9 interleaved instruction streams, in the style of Denelcor HEP? Cliff Click wrote: Nope, real cores all the way. David A Moon wrote: Does the precisely defined Java memory model give you any help in making a more simple implementation of cache coherency than is usually done e.g. for Intel x86 architecture? Or did that memory model come out after your hardware was designed? Cliff Click wrote: The Java Memory Model came first, and we designed the hardware around it. Actually the hardware implements something somewhat weaker than the JMM by default, and the JIT inserts memory FENCE ops as needed. More like Sparc RMO or Power's memory model. David A Moon wrote: I assume uniform memory access doesn't mean cached accesses aren't any faster than cache misses, but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network. [Cliff: Yup] It must be a real challenge to get enough memory bandwidth to keep 864 instruction streams going at full speed, even more so with such tiny caches. Daniel Weinreb wrote: Moon says: "... but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network." Actually, I heard a talk about a year ago, at the first New England Database Day, suggesting that the way information moves between the L1 caches of the cores is by a sequence of moves, each being up, down, right, or left, in the direction of the destination core. Cliff Click wrote: No, we're a full direct interconnect on the L2's. 9 L1's share an L2. One step to the L2, then one step to the correct remote L2 - across the box in any direction (including another L2 on the same chip; it was easier to run outside the chip and back in than arrange for a special faster-path for within-chip communication. 15/16ths of your L2 misses are remote and the gain for speeding up the remaining 1/16th of the misses isn't worth it. Daniel Weinreb wrote: If so, the times would not actually be uniform. Taking advantage of such knowledge in the software would be challenging except for special-purpose domains, I would think. I don't know how the 54 L1 caches (or maybe fewer if you're doing the piplined business that Moon described) are interconnected, but there is a limit to how many can participate in a full crossbar NxN connection. [Cliff: Key word: challenging] Cliff Click wrote: We have a lot of memory bandwidth - one of our bigger boxes is on the top-20 supercomputer bandwidth list. The chips are fully interconnected. David A Moon wrote: Meaning every chip has 15 ports directly connected to another chip? [Cliff: Yup] That's a lot of pins. You might want to look at http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect for an interesting slimmer approach. Cliff Click wrote: I'm sure we looked at sicortex at one point. Daniel Weinreb wrote: I agree about the challenge of the memory bandwidth for all those cores. Do you have very wide paths between the chips and the main memory? Cliff Click wrote: There are 4 memory controllers per chip (and up to 16 chips so 64 controllers). All memory is striped across all controllers at a fine grained level so each core's memory requests are spread across the box - mostly eliminating "hotspots". David A Moon wrote: Also they are a message passing architecture and you have to be a shared memory architecture, but I think that's a separate issue from the interconnect topology. I am no longer sure which I believe in. On the one hand, I am a shared memory guy from way, way back. On the other hand, it is becoming increasingly expensive to pretend that memory is uniform and random-access, and the illusion is becoming increasingly unconvincing. The von Neumann architecture may be reaching the end of its life. I like the way the Cell treats main memory as essentially an I/O device, but I have never attempted to program it. Cliff Click wrote: By all accounts, programming the Cell is a nightmare. Since we can do shared-memory with so many cores, other people can too. The more difficult issue is parallel programming that uses all those cores. Shared memory is one way to make programming easier. Daniel Weinreb wrote: What about cache coherency over multiple chips? These days, as Moon has pointed out to me, the importance of caches means that you gain a lot by not looking at your address space as a sequence of co-equal words, but rather as a sort of block-oriented device (a little like the way one looks at a disk), in which the block size is the size of the cache line in some relevant cache (probably L2 but it depends a lot on the hardware if the hardware is novel). (See the word "illusion" in his mail.) Does your Java implementation specifically take this into account, e.g. allocating objects aligned on cache line boundaries, implementing collections as B-trees with nodes being an integral number of cache lines, and that sort of thing? If not, you might consider benchmarking it; I'd love to know the results of such a test. Cliff Click wrote: Our hardware has short cache lines (32b) so there's less to be gained by aligning objects to cache lines. For certain benchmarks (not to mention SpecJBB or anything) such alignment is probably worth alot - but not for most use cases. Perhaps it makes sense to expose cache-line alignment to the Java/JDK programmer? Cliff Click wrote: Things we do specifically for Java:
David A Moon wrote: That is indeed impressive! Even if I divide 40 GB/sec by 864 it's still fast. Of course you need it, much Java code really likes to generate garbage all over the place. My pet peeve is no way to return multiple values from a method without generating garbage. Daniel Weinreb wrote: Dave, you say: "My pet peeve is no way to return multiple values from a method without generating garbage." In Rich Hickey's talk about Clojure(Cliff: a new Lisp dialect implemented on the JVM, mainly having been run on the Sun JVM but he ran it on an Azul at least once) addressed this issue. [Cliff: I'm familar with Clojure] He left out multiple value returns exactly on the grounds that the consing cost for a small returned value is so cheap these days, ephemeral GC's being so damned good. I do not know to what extent he has done actual metering, though. Cliff Click wrote: Yes, annoying. And it's embedded into the JVM bytecodes that way, so you can't even rescue the problem by changing languages while keeping the JVM. The only hope here is for a combo of JIT'ing & smart memory allocation: either inline & optimize away the junk object or do some sort of stack-based allocation for the very-short-lived object. David A Moon wrote: Can you say anything about what page-mapping tricks are good for? Actually I am a little surprised you bother with virtual memory at all. If memory gets full surely it is faster to do a garbage collection than to swap pages out to disk (I think you don't even have a disk if I remember correctly). Maybe page mapping is just to avoid core shuffling if there are multiple heaps? Is the page size large to cut down on mapping overhead? Cliff Click wrote: Ok, a bunch of things here:
David A Moon wrote: I would expect the write barrier to be more than a slight improvement, unless your memory scanning is incredibly fast. The main thing the write barrier accomplishes is to greatly decrease the number of pages (or cards or other chunks) that have to be scanned for pointers to the youngest generation. Cliff Click wrote: Ahh.... here's the difference: a 'card-mark write barrier' can be fairly cheaply done with plain-o-instructions, so the write-barrier instruction is a delta above that. The biggest change is our write-barrier op 'filters' card-marks that don't need to happen which in turn cuts down on useless card-mark writes. Writes are generally really cheap - unless there's a ton of contention caused by rapidly updating of an old-gen object holding a pointer into young-gen. In that case the filtering is worth a lot. David A Moon wrote: Since the GC is concurrent the time required to scan for pointers of interest to the GC shouldn't slow down the application unless the application uses all the cores, except surely you run out of memory bandwidth eventually and then the GC's memory scanning interferes with the application's accesses to memory. So even with all that concurrency and massive processing power and memory bandwidth I would think you would still get a lot of benefit from doing less scanning because of the write barrier. But I haven't measured it and you probably have. Cliff Click wrote: We're definitely doing a generational collector, so we'd be using a software write-barrier if we didn't have the hardware one handy. Or maybe: "We have a write-barrier. The fast-path is done in hardware. There is a slow-path done in software." David A Moon wrote: I don't know if this was published, but the write barrier on the Symbolics Ivory was really simple. Each page table word contains an encoding (in 4 bits?) of the age of the youngest generation pointed to by this page; the write barrier just causes a trap when those bits need to be updated. I don't remember if we kept another index or just had the GC scan the page table. Since it's an IBM-style inverted page table the page table is dense even when a lot of pages are swapped out so scanning it is not too slow. It would have worked better if the Ivory page size hadn't been way too small. Did you ever look at Eliot Moss's "train" garbage collection ideas to do ephemeral-GC-type optimization for long-lived data? If I remember correctly there is a bug: memory consumption can become arbitrarily large in the worst case, but the idea might still be good. I think Microsoft is using a form of it in .NET. Cliff Click wrote: HotSpot tried out the train collector years ago, and never got it to perform well in practice. There are a bunch of odd-ball issues, (remembered sets with popular objects?) that made it not very performant. Cliff Click wrote: Things we do specifically for Java (continued):
David A Moon wrote: dcbzinstruction on POWER. Clearly a good idea. Cliff Click wrote: dcbz doesn't make the grade:
David A Moon wrote: Dedicated register for the allocation pointer? Cliff Click wrote: No. The main issue with allocation is avoiding the mandatory cache-miss as you stream through memory. You can hide tons of dumb int-ops and cache-hitting-loads underneath that cache-miss. So we pay a little bit of integer-cycles to setup our J-I-T-zeroing, and getting the allocation pointer is one of those cycles. In short: it's too small picken's to be worth holding a register for. Cliff Click wrote: Things we do specifically for Java (continued):
David A Moon wrote: So you're saying that since 64 bits is way too many, you can use a bunch of those bits to put a rather wide tag in each pointer? Maybe 20 bits wide? Seems like a great idea. No problem with having too many classes defined and running out of bits? Cliff Click wrote: Yes, wide tag per pointer. No problem (yet) with running out of classes. Big Java Apps these days seem to have about 2^15 classes. Daniel Weinreb wrote: I am curious about how you do threads. On the Lisp machine, the stack had to be contiguous in virtual memory. So whenever we had to allocate one for a new thread, we had to trade off between wasting VM versus worrying about running out of stack. I don't know if we ever considered making the stack such that it could be a linked list of (big) blocks, but I suppose anything that slows down the function call path is very dicey. Dave, did you ever think about this? It would have been nice to have cheap full coroutines ("stack groups"). Cliff Click wrote: Nope - we're using the fairly classic stack layout for HotSpot. And yes this means you have to decide up front the stack-vs-heap memory split. Daniel Weinreb wrote: Cliff, have you heard anything about the possibility that the JVM will be extended to be able to do tail-calling? Rich Hickey says that there was recently a "JVM summit" (something like that)of people writing non-Java languages for the JVM, and there was a lot of interest in tail-calling. It would be great for the Lisp world, since Scheme depends on it so much; it would be nice to re-unite the Scheme and Lisp communities as much as possible. Cliff Click wrote: I'm well aware of the interest in tail-calling optimizations - but it has to compete for scarce engineering resources like everything else. Cliff Click wrote: Things we do specifically for Java (continued):
David A Moon wrote: Would that be true even with 4 times larger caches? Cliff Click wrote: I think so. We did a bunch of studies on cache-sizes, looking at fewer-cores-more-cache vs more-cores-less-cache. Cliff Click wrote: Things we do specifically for Java (continued):
David A Moon wrote: Do you have any kind of specialized cache to avoid memory references to find the address of a virtual method to call? I haven't measured it but I get the sense that all those not very predictable indirections through vtbls really hurt C++ speed. Cliff Click wrote: Yes... specialized hardware, not specialized cache. We do the standard HotSpot "inline cache"trick that's done on all CPUs. The cache is a 1-entry cache, inlined in the code. The Key is the expected class of the object, the Value is the target of the function call encoded as a plain "CALL" instruction. So you do a Key-compare inline then a plain static call. If the Key compare hits the hardware will Do The Right Thing with the static call. In practice, 95% of all call-sites (that can't be optimized by the JIT) never fail the 1-entry cache. The other 5% tend to fail for having lots and lots of targets. So there IS a "cache", but it's not specialized. It's the normal HotSpot cache. Instead, we have an instruction to do the Key-compare in 1 cycle (and the call is another cycle). So we do cache-hitting v-calls in 2 clks & 2 ops. On a wide-issue O-O-O X86 that same sequence is maybe 4 dependent ops but issues also in 2 or 3 clks - except loading the object header is typically a cache-miss; but the branch predicts right and the O-O-O hardware lets the X86 proceed despite the miss. Cliff Click wrote: Things we do specifically for Java (continued):
David A Moon wrote: I'd be interested in reading more about that. Cliff Click wrote: We have a white-paper here. http://www.azulsystems.com/products/whitepaper/wp_otc.pdf Cliff Click wrote: Ok, I'm done boasting. :-) David A Moon wrote: It's plenty worth boasting about. --Dave Moon Cliff Click wrote: Thanks, Cliff =========================================== 11/20/2008 - A little more: David A Moon wrote: Then I guess your thread scheduler is not aware of the partitioning of the hardware and doesn't try to put related threads onto cores in the same chip. Cliff Click wrote: The thread scheduler is well aware of the hardware layout. It's no mean feat to schedule 800+ cpus with 1000's of runnable threads. Instead, nobody has a clue what counts as "related threads". Certainly there's no Java-level hint. You might imagine doing L2-to-L2 miss-rate profiling and try to decide which set of threads are communicating through memory instead of through an L2. It's a non-trivial task for which there are some non-trivial academic papers out there showing modest gains for lots of hard work. David A Moon wrote: But you said your memory access time is mediocre. That's the first symptom of the uniform memory illusion breaking down. Then you find that no matter how big you make your caches they aren't big enough. Then you find that each new generation has slower memory, as a ratio of memory access time to processor speed. Then you find that programs are too slow unless they are written with detailed knowledge of the hardware memory structure. You probably know way better than me how far we are down that slope today. Cliff Click wrote: Split out the notions of 'uniform' from 'fast'. Everybody pays attention to caches (or should): the mindset is "memory is slow". But only some people pay attention to memory layout: the mindset is "memory is uniform". For some parallel scientific programs, with well understood access patterns people can build both the machine and the program such that most accesses are "near" the CPU. In such a case it makes sense to build a NUMA machine: "near" accesses are faster than "far" accesses. Azul builds are gear as 'UMA' because our programs do not have well understood access patterns. Instead, the patterns are mostly random (after cache filtering) and so it makes sense to have uniform mediocre speeds instead of 1/16th of memory fast and 15/16ths as slow. In both the NUMA and UMA case everybody has given up on the illusion that memory is fast. Instead we all acknowledge the presence of caches; we prefetch (or expect the hardware to do so); we know that "1st access is slow, 2nd one is free"; we do cache-smart object layout and so on. No illusions about memory there. Category: Web/Tech | Permalink » | Comments (11) » | TrackBack (0) » Like this post? |
November 13, 2008 A couple of interesting threads running on high-scale-lib's forums I thought I'd post to a wider audience. Josh Dybnis (jdybnis) - 2008-11-10 11:46
- if the table has "almost no synchronization" then it's not non-blocking (but it might be highly concurrent). Non-blocking has a specific meaning to the academic community; if you say "non-blocking" you should also be able to say "no synch". If you are based on NBHM then you need to provide e.g. a non-blocking malloc and a way to reclaim storage. Both are things NBHM uses GC for, and are not-trivial to get non-blocking. NBHM does indeed do very little allocation - so the blocking embedded in "malloc" will probably never hurt you. - No "C" program can ever implement any reasonable concurrent program, because "C" has no memory model (and the proposed model is a major punt and not much better than nothing). Instead you can implement NBHM using "C using gcc_ver_XXX on X86_model_YYY". If you change the "XXX" or the "YYY", then all bets are off, and the program can fail. In short, the "volatile" keyword in C does not have nearly the semantics as it does in Java, and C compilers routinely do things with "volatile" variables that Java does not allow. Changing the C compiler version or the underlying hardware can break things. Despite my complaints a C version of NBHM on X86 is surely a very useful and non-trivial thing.
- In theory, I agree 100% with you about writing concurrent programs in C. And people need to be clear about the limitations before attempting such a thing. In practice I think it is reasonable to do "porting" work for each platform/compiler. My implementation is targeted to gcc 4.x on 64-bit x86 (Intel does specify a memory model in their Software Developer's Manual). I think it could be ported to most other architectures without major modifications. - My implementation includes a non-blocking malloc() and a method to do safe memory reclamation. - My statement about "almost no synchronization" was badly stated. I was referring to my malloc implementation and not the NBHM itself. Also I was also using "synchronization" to refer to x86 operations having the "lock" prefix, not any non-blocking property of the program. - A couple of more caveats about my malloc and the safe memory reclamation. They aren't the most robust implementations. They need a bit of work. But they are non-blocking, and should be very low overhead on systems without too many cores (say less than 32). The malloc does call mmap() from multiple threads, it could block in there. So I guess technically it is not "non-blocking". But that is an artifact of the implementation. In theory one could use an asynchronous method of invoking the kernel to make it truly non-blocking. Josh By: Cliff Click (cliffclickProject Admin) - 2008-11-11 07:41
Also, one could also use GC in a C program too. Or one of the safe memory reclamation techniques documented in the academic literature (e.g. hazard pointers, smr, etc.).
Cliff --------------------------- By: Cliff Click (cliffclickProject Admin) - 2008-11-11 13:52
Here's a sample IA64 table-copy screwup:
Cliff --------------------------- By: Josh Dybnis (jdybnis) - 2008-11-12 16:36
Feel free to post a fix. It should be a 1-liner. ============================= compareAndSwap memory fence (New) Andrew Trick (andrewtrick) - 2008-11-12 11:16 By: Cliff Click (cliffclickProject Admin) - 2008-11-12 21:35 That's all, Cliff Category: Web/Tech | Permalink » | Comments (0) » | TrackBack (0) » Like this post? |
October 28, 2008 Or how I went to New England to view the Fall Colors in Perfect Weather, |



