We Don't Know How To Program...
April 4, 2008

... the current (and future!) crop of machines.  "Yeah, yeah, yeah," I hear you say, "tell me something new".  Instead I'm going to tell you something olde.

10 Years Ago - GC was widely understood and rarely used.  People were skeptical that it was ready for production use.  Now GC is well understood and widely used.  It's less buggy than malloc/free (e.g. no dangling pointer bugs, but still leaks) and in general is faster to do allocation (hard to beat parallel bump-pointer allocation).  The 'Collection' part of GC has gotten substantially faster in the last decade: parallel collectors are available from all major Java Virtual Machines.  Rarely do I see GC consuming more than 10% of total CPU time; usually it's closer to 5%.  Concurrent & Incremental GC's are mostly available and mostly work (e.g. Azul can demo sustained 40+ Gigabyte/sec allocation & collection with no GC pauses); hard-real-time GC's are starting to see real applications.

10 Years Ago - JIT'ing - Just-In-Time Compiling - was considered controversial.  People widely claimed "the code quality will never approach static compilation" and "it's too slow".  These days it's fairly easy to find programs where Java beats "gcc -O2" (and vice versa; we're clearly at some kind of performance plateau).  I personally mostly laid these claims to rest: HotSpot remains "The JVM to Beat", with world-beating records in a variety of areas.  My name still appears in the HotSpot source base, buried down in the src/share/vm/opto directory files.

10 Years Ago - JVMs and Managed Runtimes (no .Net 10 years ago; v1.0 appeared in 2002) were unstable on dual-CPU machines (that's being charitable: the bug list from that era was shocking; e.g. at one point we (the HotSpot team) discovered that emitting all memory barriers had been accidentally disabled on multi-cpu X86's).  These days JVMs are rock solid on 4-8 CPU machines; this count of CPUs has been Sun's bread-and-butter for years.  Indeed, HotSpot routinely runs well on 64 - 128 CPU counts from a variety of hardware vendors; Azul's version of HotSpot runs well on 768 cpus.

10 Years Ago - The JDK was in its infancy; the libraries generally were broken on multi-threaded programs.  Then the Abstract Data Types (widely called the Collection Classes for Java) got synchronized (slow, but correct), then in 1.2 unsynchronized versions appeared (fast, but incorrect for multi-threaded usage), then in 1.5 fast & correct concurrent library routines appeared.  These days you can drop in a hash table that will scale linearly to 768 concurrent threads doing inserts & lookups, or write parallel-for loops with a little bit of stylized boiler-plate.
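
To make the "drop-in hash table plus stylized parallel-for" point concrete, here is a minimal sketch, not anything from HotSpot or Azul. The class name ConcurrentCountSketch and the method countInParallel are made up for illustration, and it leans on lambdas and ConcurrentHashMap.merge, which arrived in Java releases later than the 1.5 libraries described above. The range is chopped into chunks, each chunk runs on a pool thread, and all threads update one shared table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: a hand-rolled parallel-for over [0, n) feeding one shared table.
class ConcurrentCountSketch {

    // Counts how many i in [0, n) fall into each bucket (i % buckets), using `threads` workers.
    static ConcurrentHashMap<Integer, Integer> countInParallel(int n, int buckets, int threads) {
        ConcurrentHashMap<Integer, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // The "stylized boiler-plate": split the iteration space into one chunk per thread.
            List<Callable<Void>> chunks = new ArrayList<>();
            int chunkSize = Math.max(1, (n + threads - 1) / threads);
            for (int start = 0; start < n; start += chunkSize) {
                final int lo = start;
                final int hi = Math.min(n, start + chunkSize);
                chunks.add(() -> {
                    for (int i = lo; i < hi; i++) {
                        counts.merge(i % buckets, 1, Integer::sum); // atomic per-key update
                    }
                    return null;
                });
            }
            // invokeAll blocks until every chunk is done; get() rethrows any worker failure.
            for (Future<Void> done : pool.invokeAll(chunks)) {
                done.get();
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countInParallel(1_000_000, 4, 8));
    }
}
```

The table needs no external lock; whether it scales linearly on a given machine is a separate question this sketch does not measure.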

 

In Summary: The JVM, Collection Classes & GC are Ready for large-core-count machines.  The OS has been ready for a long time (Cray, SGI, IBM et al showed you could beat down all the multi-CPU OS bugs years ago).  In short: every CPU cycle the hardware guys make comes up through the OS, JVM (& JDK & GC) and is usefully exposed to the application writer.

But...

We Still Can't Write the Million Line Concurrent Program.

I don't mean the data-parallel (scientific) programming, where large datasets and large CPU counts have been the norm for years.  These guys generally have way fewer lines of code than the big complex Java apps; there's some correlation there between bytes of dataset vs lines of code.  Both the scientific apps and big Java apps look at gigabytes of data; the problem is that the Java guys use 100x more lines of code - and it's because they are doing something 100x more difficult to express in code.

I don't mean non-main-stream programming languages; I mean programs as written by the millions of programmers today, using tools (both software and mental) that are widely available.  As for the specialty languages, Erlang is the only language I know of where million-line concurrent programs are written, maintained and used in a business-critical way on a daily basis.  Near as I can tell, all other specialty languages are used by a handful of academics or on a small scale as experiments by desperate businesses.

Concurrent Programming is widely considered very hard; try googling "notoriously difficult concurrency" sometime.  Lots of folks are busy trying to figure it out - and it seems pretty clear to me: We Haven't Figured It Out yet.  Here are some obvious differences between concurrent and serial programming that bear repeating:

  • "Parallelism" for Serial Programming is generally automatic and very fine-grained: e.g. out-of-order hardware or hit-under-miss cache.  Your normal code "just got faster" and the hardware auto-parallelized under the hood for you.  We mined out this kind of parallelism years ago, and just expect it (remember the speed-demon vs brainiac wars?)  "Parallelism" for Concurrent Programs generally means: task-level parallelism.  You gotta hack your code to expose it.
  • Testing provides Confidence for Serial Programming; you can get decent code coverage.  In practice, testing does little for concurrent programming.  Bugs that never appear in testing routinely appear in production - where the load issues are subtly different, or QA doesn't have the same count of CPUs as production.  Indeed, in theory the situation is gruesome: there's basically no chance Testing can cover the exponential explosion of possible code interleavings.  There's a glimmer of hope here: there's some evidence that the 'n' in the exponent can be reduced to 3 or 4 in many cases without missing too many real bugs.
  • Debugging serial programs can be made human-thought slow.  Time is NOT of the essence.  In concurrent programming, everything needs to be full-speed or the bug never appears - leading to the HeisenBug syndrome.
  • Program reasoning is local for serial programs.  In concurrent programs, you have to reason about all the code in the system at once - because some other thread can be in every other piece of code concurrently with this thread.
  • Control flow is the main concern in serial programming; you ask questions like "how did I get here?" and "where do I go next?".  Data flow is the main concern in concurrent programming; you ask questions like "who-the-flock changed this variable out from under me?".
  • Timing is nothing in serial programming.  Nearly all languages have no way to express time, and program meaning is defined mathematically independent of time.  The same program can be faster or slower - which preserves its meaning but might change its usefulness.  Timing is everything in concurrent programming: correctness flows from well timed interleavings.  Changing that timing will often break programs that have functioned well for years.  Note that most languages (outside of Java) start out with a disclaimer like "in race-free programs, the meaning is this...".  Most large concurrent programs have races, so most large concurrent (non-Java) programs have no theoretical meaning.
  • Nothing changes between statements in serial programs; it's "as if" the Universe Stops and waits for you.  All of memory (could) change between "ticks" of the program's clock in concurrent programming.  Just because you looked at some variables a nano-second ago, doesn't mean those variables remain the same.
  • Serial programming has fundamental name-space control for code & data & their cross product.  Classes, Object-Oriented Programming, Functional Programming, lexical scopes, etc are all techniques to control access to either code, or data or both.  There is no equivalent notion for threads - any thread can touch any piece of data or run any piece of code.
  • The Tools are Mature, pretty much by definition.  It's obvious that support for concurrent programming sucks right now, so therefore the existing (serial) tools must be mature.  Bleah.  Some examples: I don't know of any widely used (and usable) tools for auto-extracting parallelism ("widely usable" excludes tools requiring a PhD, or a Cray).  I know of no tool which can do data-race detection on HotSpot: it's too big, it's too concurrent, it JIT's code which has subtle race-invariants with the static code, it uses GC, it uses hardware exceptions, it uses lock-free idioms, etc.
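
The "who changed this variable out from under me?" and "memory can change between ticks" points can be shown in a few lines. This is a sketch under assumptions: the class LostUpdateSketch and its methods are invented for illustration, and the latches exist only to force the unlucky interleaving every time, whereas in a real program it would show up rarely and only at full speed. The first method loses a deposit; the second uses an atomic read-modify-write and does not.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a lost update, with the bad schedule forced by latches.
class LostUpdateSketch {

    // Thread A reads, the main thread (call it B) writes, then A writes back a stale result.
    static int racyDeposits() {
        int[] balance = {100};                 // plain shared state, no lock
        CountDownLatch aHasRead = new CountDownLatch(1);
        CountDownLatch bHasWritten = new CountDownLatch(1);
        Thread a = new Thread(() -> {
            int seen = balance[0];             // A looks at the variable...
            aHasRead.countDown();
            await(bHasWritten);                // ...memory changes underneath it...
            balance[0] = seen + 10;            // ...and A overwrites B's deposit
        });
        a.start();
        await(aHasRead);
        balance[0] += 50;                      // B's deposit lands between A's read and write
        bHasWritten.countDown();
        join(a);
        return balance[0];                     // 110, not the 160 a serial reading suggests
    }

    // Same two deposits, but each one is a single atomic read-modify-write.
    static int atomicDeposits() {
        AtomicInteger balance = new AtomicInteger(100);
        Thread a = new Thread(() -> balance.addAndGet(10));
        a.start();
        balance.addAndGet(50);
        join(a);
        return balance.get();                  // 160 under any interleaving
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("racy:   " + racyDeposits());
        System.out.println("atomic: " + atomicDeposits());
    }
}
```

Note that nothing in the racy version looks wrong when either thread's code is read on its own, which is the "reasoning is no longer local" problem in miniature.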

What can we do, without changing our fundamental programming model, to use these extra cores to speed up our programs?  Some, but not much.

  • GC can become concurrent by default.  Other than the slowdown in the mutators to preserve the GC invariants, you'll never see a GC pause.   Azul's made a lot of hay here; "we solve GC problems".  Stock JVMs on stock hardware are still "getting there" for production use - but clearly are heading that way.  This technique probably uses about 20% more CPUs to do the GC than you have mutators usefully producing garbage.
  • All codes are JIT'd in the best (most expensive) way.  JIT'ing parallelizes the same way as "pmake" parallelizes builds - you use a thread pool of background compiler threads to JIT furiously.  During a Big Application startup, Azul might see 50 compiler threads JIT'ing in the background, with many megabytes/sec of high quality code being produced.  Of course, post-startup this doesn't speed up your programs anymore: there's nothing more to compile.
  • Continuous profiling and introspection need only a fraction of a CPU.
  • More auto-parallelizing hardware tricks like the Run-Ahead Scout haven't (yet) panned out: the "scout" is more like a blind idiot without warmed-up caches & branch-predictors.
  • Larger-scale speculation seems plausible for a bit, with combinations of hardware & runtime support.  While plain Speculative Lock Elision is a bit stiff on the hardware requirements, Azul's version shows you can do this with a fairly small amount of hardware, plus some simple runtime support.  Azul's SLE experience shows there's a real upper bound to the amount of parallelism you can usefully expose here: the code really does communicate with data inside most locked regions.
  • There are also a bunch of tricks to auto-parallelize & speculate loop iterations, although these techniques probably only pay off for larger loop bodies running longer iteration counts than are normal for business Java.  i.e., this probably pays nicely for scientific apps but not XML-parsing or Web-Servers.

Moving outside of the "auto-parallelize" world, we see a bunch of coding styles to enable parallelism.  Transaction-level parallelism is common, with J2EE a large complex example of it.  This is really "chunk-o-data+work" kind of parallelism, where the "chunk-o-data" is much larger and more complex than in the standard scientific app.  Under this same vague notion of "chunk-o-data+work" we also see things like stream programming (GPGPU is the hardware-accelerated version of this).  Large Java uses a lot of pipeline-of-thread-pools style programming.
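
For readers who have not met the pipeline-of-thread-pools style, a toy version looks roughly like this. It is a sketch only: the class PipelineSketch, the pool sizes, and the "parse then square" stages are all invented, and it uses CompletableFuture from Java releases later than this post. Each request is a chunk-o-data that hops from one pool to the next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical two-stage pipeline: stage 1 parses on one pool, stage 2 computes on another.
class PipelineSketch {

    static List<Integer> run(List<String> requests) {
        ExecutorService parsePool = Executors.newFixedThreadPool(2); // stage 1 workers
        ExecutorService workPool = Executors.newFixedThreadPool(4);  // stage 2 workers
        try {
            List<CompletableFuture<Integer>> inFlight = new ArrayList<>();
            for (String request : requests) {
                inFlight.add(CompletableFuture
                        .supplyAsync(() -> Integer.parseInt(request.trim()), parsePool)
                        .thenApplyAsync(n -> n * n, workPool)); // hand-off between pools
            }
            // Collect results in request order; join() rethrows a stage failure unchecked.
            List<Integer> results = new ArrayList<>();
            for (CompletableFuture<Integer> f : inFlight) {
                results.add(f.join());
            }
            return results;
        } finally {
            parsePool.shutdown();
            workPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("1", " 2", "3 ")));
    }
}
```

The fixed pool sizes are exactly the kind of knob that gets capped too low in practice, and nothing in this code says which stage is the bottleneck.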

In general, performance analysis tools for complex concurrent programs remain very weak.  Tools to help programmers write in the pipeline-of-thread-pools style are totally lacking.  Classic tools tell me things like "your time goes in these routines; here's the code being executed (a lot)".  What I want is things like "your threads can't progress because they are blocked on this lock" (fairly easy: Azul shows this now), or "this thread pool is starved for work because that pool isn't feeding it" (because that pool's thread-count is capped too low, or its threads are being starved in turn from another pool, etc, etc).
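
As a rough idea of what a "this pool is starved" report could be built on, the standard ThreadPoolExecutor already exposes active-thread and queue-depth counters. The sketch below is an assumption-laden toy, not how Azul's tooling works: the class PoolHealthSketch, the three states, and the thresholds are invented, and getActiveCount() is documented as approximate. A real tool would sample over time and follow the chain from one pool to the pool that feeds it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical classifier: is a pool waiting for work, drowning in it, or somewhere in between?
class PoolHealthSketch {
    enum State { STARVED, SATURATED, FLOWING }

    static State classify(ThreadPoolExecutor pool) {
        int active = pool.getActiveCount();      // approximate count of threads running tasks
        int queued = pool.getQueue().size();     // tasks waiting for a thread
        if (active == 0 && queued == 0) return State.STARVED;   // nobody is feeding this pool
        if (active >= pool.getMaximumPoolSize() && queued > 0) return State.SATURATED; // cap too low?
        return State.FLOWING;
    }

    // Returns the state of a 1-thread pool before any work arrives and while it is backed up.
    static State[] demo() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            State before = classify(pool);
            pool.execute(() -> { started.countDown(); awaitQuietly(release); }); // occupies the only thread
            awaitQuietly(started);
            pool.execute(() -> { });             // has to wait in the queue
            State during = classify(pool);
            return new State[] { before, during };
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        State[] states = demo();
        System.out.println(states[0] + " then " + states[1]);
    }
}
```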


But how well do these techniques work as we move from dozens of cores to hundreds (Azul's already there!)?  Right now, the Big 3 Application Servers can rarely use more than 50-100 cores before they become choked up on internal locks - and that includes the 20% for GC and using SLE.  Maybe we go to individual programs communicating via sockets (SOA?).  But isn't this just a very complex version of CSP?  Might we be better off just switching to CSP in the first place (or some hopefully more modern version)?
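
For anyone who has not seen the CSP style, the flavor is roughly this: sequential processes that share nothing and talk only over channels. The sketch is a loose Java approximation, assuming a SynchronousQueue as a stand-in for an unbuffered CSP channel and a made-up sentinel value to mark end-of-stream; the class CspSketch is invented for illustration.

```java
import java.util.concurrent.SynchronousQueue;

// Hypothetical CSP-flavored example: a producer process and a consumer process share only a channel.
class CspSketch {
    private static final int DONE = -1; // assumed end-of-stream marker, not part of any standard

    // The producer sends 1..n over the channel; the consumer (the caller) sums what it receives.
    static int sumOverChannel(int n) {
        SynchronousQueue<Integer> channel = new SynchronousQueue<>(); // each put waits for a take
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    channel.put(i);
                }
                channel.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int msg = channel.take(); msg != DONE; msg = channel.take()) {
                sum += msg; // only this thread ever touches `sum`
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOverChannel(10));
    }
}
```

All the communication is visible at the put and take calls, which is the attraction; whether that holds up at a million lines is the open question.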

In short, the easy pickin's have long gone, and now we need complex tools and complex coding styles to get more performance from more cores using the existing languages.  While I think we can go farther with what we have, it's time to get serious about exploring serious language changes.  Somewhere, Out There, There's a Way to write large programs without me having to sweat out Every Bloody Little Detail about how my parallel program communicates internally.  Screw Java: I got a JVM with a super GC, fantastic JIT and decent concurrent libraries; it can do loads more stuff than just run Java.  I got reasonable OS's and an ever-growing mountain of (parallel) CPU cycles.  Isn't there Some Way to beat back the Concurrent-Programming Demon?

Cliff


Category: Web/Tech
 

Comments

Cliff "In general, performance analysis tools for complex concurrent programs remain very weak."

Since 2002 JXInsight has reported the GC, thread waits, thread monitor contention (blocking) along with the standard CPU and wall clock timings for every method interception or transaction path. We even reported allocation at the method level.

Last year we introduced a thread resource metering solution that has unlimited flexibility in the reporting of new meters that can be mapped onto various counters that can go up and down the abstraction layers (from byte counts to workflow messages). You can meter practically anything as long as it can be measured and it grows. Here is a benchmark analysis which shows this in practice and with lock contention detection.

Benchmark Analysis: Guice vs Spring
http://blog.jinspired.com/?p=156

For complex concurrent programs we also offer extensions that support tracing across threads and processes. From a client side I can determine the lock contention in multiple nodes in a data grid responsible for servicing a request or executing a compute job.

DataGrid Traffic Analysis
http://blog.jinspired.com/?p=58

Parallel & Remote Tracing
http://blog.jinspired.com/?p=144

I do not think it is necessarily a problem with the tools. Complex tools are required to resolve complex problems within a complex domain but no one is interested in such complexity. Instead users want 6 green/red circles on a dashboard informing them that everything is operating normally.

William

Posted by: William Louth | Apr 4, 2008 11:04:41 AM

 

Worth looking at Scala's Actors. They are based around the idea of Erlang's processes. Where possible, the Actors reuse threads from the calling actor. Lift is a good example of a concurrent system using Actors.

Posted by: Alrx Blewitt | Apr 4, 2008 11:37:38 AM

 

Yeah, in general I like Scala and wish I had more time to play with it. Issues like "where possible Actors reuse threads" are hopefully implementation details & entirely transparent to the language. i.e., I'd expect the runtime+JIT to decide what is the right 'coarseness' for CPU+OS_threads and leave the programmer free(er) to express the problem - similar to how the JVMs choose what & how-much to JIT (and the JVMs mostly get that right by now).

I'd still call Scala an "experiment in how to program concurrently" - and one that definitely needs to be done.

Cliff

Posted by: Cliff Click | Apr 4, 2008 12:21:41 PM

 

As for JXInsight - I wish I knew more. My experience along these lines is very very bad:

- "JXInsight has reported ... the standard CPU and wall clock timings for every method interception". Do you do this efficiently? If so, how? Every tool I've seen that claims to do this, does it by slowing the app by gross amounts. i.e., definitely NOT production ready. Azul hooks deep into the JVM in order to provide this with literally no-cost (and so it's always-on in production)... how well does JXInsight do here?

- "JXInsight has reported the GC... timings". Again, my experience here is very very bad: the GC timings as reported by the Big 3 JVMs using "-verbose:gc" blatantly lie badly - or did so in 2005 when I last tested. A simple "run a zillion transactions" version of JBB with short fixed-size transactions would report various kinds of huge delays for some transactions, despite the app being entirely parallel (no blocking on any transaction) AND GC reporting only tiny GC pauses. Somehow, somewhere, application threads would take unreported pauses of upwards of 10x the GC reported pauses.

- Your blog entries are less than inspiring: it appears you need a good understanding of the target app in order to insert instrumentation. No doubt about it: hand-inserted instrumentation is very useful, and can tell you things you can't get any other way. But what do I do when I've downloaded a JDBC driver from vendor X, Calypso from Y, a JVM from Z, and 5 other 3rd party libraries? Hand-insert in them all? Get a deep understanding of each library, so I have a chance to hand-insert probes?

Cliff

Posted by: Cliff Click | Apr 4, 2008 12:33:56 PM

 

William writes:
"I do not think it is necessarily a problem with the tools. Complex tools are required to resolve complex problems within a complex domain but no one is interested in such complexity."

I agree: the tools are the symptom, not the cause. As soon as I say "I can do this if I had a fancy tool that...." I've basically admitted that I can't do "that" in the bare language, but there's some mechanical helper that can. Hence it looks like a weak/inappropriate language issue - I'd like to believe I could do "that" if the language fit the bill better.


Cliff

Posted by: Cliff Click | Apr 4, 2008 12:37:54 PM

 

Your comments about the differences between serial and concurrent worlds make a lot of sense to me, especially coming at it from the perspective of debugging.

I think that for us to be able to debug, we need to understand the state-space and the current state. This tends to be more constrained in a serial program. This tends to be less so in a concurrent one. So debugging is harder.

Which means we need a way to really say what part of the state space is allowed vs. not.

[ hand waving ensues ]

Now the sad thing is that we've never done a good job at that even with serial systems. The number of people who really know the state space and really know what their source code says can/not happen is small. Everybody puts a top level exception handler and just leaves it at that. (I'm overstating it - but not by much, I think ;-)

We've basically only ever just gotten lucky with serial stuff. It would mostly work, and sometimes it would crash horribly and we'd either just restart+pray (see: fail-fast/only software) or we would maybe get out the debugger and figure it out. But it was "debugging into existence".

So I'm going to hypothesize / posit that we have 2 choices: Either we all need to start doing Design By Contract / Correct By Construction / Formal Methods / Programming in Coq, so that we actually define the state space that is OK,

or

We need our systems to do some amount of automatic inference about what not to allow.

?

Posted by: Raoul Duke | Apr 4, 2008 1:21:35 PM

 

Heh, shockingly close to my thinking as well.

I've been playing with FSM's, and gotten some pretty outrageous results - but clearly the tools leave a lot to be desired. Somebody that wraps a nice GUI around a formal-methods-tool with the opening "tips" really being a hidden formal-methods tutorial might do really well here.


Cliff

Posted by: Cliff Click | Apr 4, 2008 1:54:53 PM

 

Cliff: "Do you do this efficiently?"

More efficiently than any other vendor other than those with a 35K profiler, ;-)
http://blog.jinspired.com/?p=227
The overhead comes down to the underlying meter used. Now if we could access other meters other than the 15+ meters we already support that would be great.

"Your blog entries are less than inspiring: it appears you need a good understanding of the target app in order to insert instrumentation."

Ouch. "Less than inspiring". I thought our ability to use multiple meters (http://blog.jinspired.com/?p=154) as well as strategies (http://blog.jinspired.com/?p=190) would have at least earned some brownie points considering no other solution has this today (other than a 35K profiler?).

You do know that the links show the Open API used by our instrumentation extensions? We use various techniques including raw BCI, AspectJ, Proxy (JNDI Bound Objects, JDBC Drivers), Filters, and Interceptors (CORBA distributed tracing). Most importantly, we are not just code reporters: Probes and Traces can be contextual, reflecting more of the workload (URL, SQL, TXID, JNDI Name....).

We come with over 465 AspectJ extension libraries which cover a pretty broad range of technologies and our product has 3,000 system properties.

We do NOT use verbose:gc reporting and up to now the reporting of GC times has correlated well with problem applications and transactions. Of course it varies across JVM's but at least we have a tool that does work across each vendor.

I am a little disappointed you did not think the Probes API was a good approach at separating the metering from the instrumentation, allowing JVM vendors to create special thread counters that could then be metered (billed) against developer-specific probes, which again could relate to code or other forms of execution. We talked with one other JVM vendor who was pretty impressed with our approach to allowing developers to create custom metering/billing engines.

William

Posted by: William Louth | Apr 4, 2008 2:26:55 PM

 

Re: Possibly formal methods.

I am not an expert, nor on TV, etc.; here is my subjective break-down gamut scale range graph thought spam. Apologies if this just sounds wrong and weird and like something somebody like RUSirius would write :-)

First off, whatever it is you do, it has to guarantee that your data doesn't get broken. Hopefully we all know by now that shared-mutable-state-with-locking is not the way to get that with any sanity left.

A problem is that "data" has many levels of hierarchy, and "broken" is a relative thing. So while you might have a system which statically prevents deadlocks and race conditions and transactions or whatever, that doesn't mean it guarantees your higher-level semantics are right / enforced. You still have to do work to make sure what you ask for makes sense for your goals. So that means you need a way to understand the system in the larger sense than just atomic reads and writes or the "volatile" keyword.

I think the spectrum has to do with the communications between parts of code which is a facet of the state space your system can inhabit.

On one end, you have "serial++" which is stuff like SIMD vector operations.

Then you have stuff like skeletons for composition of parallel activities. The point of those is to keep the concurrency constrained such that you greatly reduce the chance of unknowingly screwing yourself; you still have a lot of serial things, but they are pretty well compartmentalized away from each other.

Maybe after that there's something like the Actor model which attempts to have islands of serialization which then interact to form a concurrent whole.

[ I think that is where you can more seriously start to get into the, to abuse a phrase, uncanny valley where our human brains aren't going to do a good job of understanding what we have created and let loose on our 4096 cpus. ]

Basically at those points you have one foot in the "this is completely impossible to debug by attaching a debugger and see what is going on because it is just too complicated and non-deterministic from my vantage point".

So then I think you maybe have to throw up your hands and go one of two directions:

a) the proof direction. things like Epigram or Coq or whatever where you are not writing programs, you are writing proofs which are then used by the system to generate the programs you want. some sort of proofs in a concurrent setting which lead to some auto-generated system which is guaranteed to not violate the principles you wrote down.

b) the biological direction. you don't actually understand everything, and not everything even actually does the right thing, but there is some kind of overarching feedback system / natural selection / genetic algorithm / systems healing rules system which somehow culls the bad results. the good thing is that biology sure is impressive so it can't be a completely stupid idea. the really bad thing is that it is pretty much something we cannot really peer into and fully understand any more.

Of course all of this is predicated on us even understanding how to do specifications. "Thereby reducing it to an earlier [sad] joke."

Posted by: Raoul Duke | Apr 4, 2008 3:03:33 PM

 

It's one of these "ignorance isn't bliss" kind of things. I stared hard at the then-current crop of Java tools about 5-6 years ago and got thoroughly disgusted. Obviously the tools have moved on.

Ok, I "owe you" a harder look at JXInsight - in my copious spare time. :-(

But can you put some numbers on "efficiently"? What's the overhead to run SpecJVM98 (or any single-threaded CPU-bound app) to tell me "the standard CPU and wall clock timings for every method interception" - which is one of the common and very useful starting points for lots of performance work?

What's the "hot code profile" for SpecJBB (maxed warehouses on a big multi-core X86, with some decent GC tuning)? These are things I know well; I'm curious to see how well JXInsight does on spotting the hot code.

(and yes: it's not the same thing as hand-inserting probes or tuning a big web server; that's a whole different ball game attempting to solve an entirely different problem... and one that's more-or-less orthogonal to the "concurrent programming is hard" thread). I'm not trying to bash JXInsight here - indeed I can see the obvious need for application-level-event profiling (which is what I consider hand-inserted probes to be)... I'm trying to track your claim of low-level profiling capabilities.


Cliff

Posted by: Cliff Click | Apr 4, 2008 3:04:31 PM

 

I knew this blog post would bring out fun comments.
I've been playing in the "Proof" space but I like the "biological" space better - if I can figure out what "better" means. I actually think most big systems live closer to the "biological space" - they have internal invariants, tests for those invariants, and "cull it out & retry" sorts of solutions. Eg: many thread pools catch generic top-level exceptions, kill the thread in question, launch another thread (to replace the sicko one) and re-run the task (assuming its failure left no bad side effects lying around).
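
A stripped-down version of the "cull it out & retry" idiom might look like the sketch below. Everything in it is illustrative: the class CullAndRetrySketch, the retry limit, and the flaky task are invented. It also simplifies in one respect: because it uses submit(), the executor catches the failure inside the Future and keeps the worker thread alive, rather than killing the thread and launching a replacement as described above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical "cull the failed attempt and re-run the task" helper.
class CullAndRetrySketch {

    // Runs the task on the pool; on any failure, discards that attempt and tries again.
    // Only sensible if a failed attempt leaves no bad side effects lying around.
    static <T> T runWithRetry(ExecutorService pool, Callable<T> task, int maxAttempts) {
        Throwable last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return pool.submit(task).get();
            } catch (ExecutionException e) {
                last = e.getCause();             // generic top-level catch: cull and loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }

    // A task that fails twice and then succeeds; returns how many calls it took.
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger calls = new AtomicInteger();
        Callable<Integer> flaky = () -> {
            int n = calls.incrementAndGet();
            if (n < 3) {
                throw new IllegalStateException("transient failure " + n);
            }
            return n;
        };
        try {
            return runWithRetry(pool, flaky, 5);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("succeeded on call " + demo());
    }
}
```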


Cliff

Posted by: Cliff Click | Apr 4, 2008 3:09:44 PM

 

".... big multi-core X86"

Can I get access to this when I am in San Mateo in 2 weeks?

[JXInsight 5.5.2.1]
Our Probes API is pure Java based though we do have instrumentation that uses JVMPI/JVMTI agents. Measuring the overhead of our pure Java Probes calls, which support reporting of resource metering in groups (http://blog.jinspired.com/?p=150) at both thread and process levels (http://blog.jinspired.com/?p=152), including inherent totals (http://blog.jinspired.com/?p=149), a method interception on my PowerBook G4 running Java 5 takes 1.73 microseconds (average of 100 million calls). Taking off the 2 clock time access calls, which take 0.3 microseconds each, we are talking about 1 microsecond, which is the time for Probe.begin() and Probe.end().

Maybe I can tune even more off whilst still keeping the feature set and extensibility (not just code profiling) but comparing our solution to other vendors (not just profilers) we are 4 times faster than our nearest competitor.

I am sure you can create a JVM specific solution that has far less overhead, but would it be accessible to Java developers, allowing them to tweak the instrumentation points and extend it to higher levels of abstraction that are based on context and not visible to the JVM executing byte code, such that code becomes less important? I am not sure.

I would love to see a JVM/hardware vendor open up the thread specific cumulative counters as meters to our Probes.

Developers can still have the same instrumentation injected/inserted:

probe = Probes.begin();
....
....
probe.end();

they would accumulate various low level meters/counters during the thread's execution which would then be accessible via the same API for tool vendors to publish and enhance.

Note: Developers are completely unaware of the actual meters included in the metering. They just mark the region with labels which can be hierarchical to allow billing of resource usage at various levels just like in cost accounting in companies today.
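
For readers trying to picture the begin/end pattern, here is a generic sketch of a labeled probe. It is emphatically not the JXInsight Probes API: the class ProbeSketch, its methods, and the choice of wall-clock nanoseconds as the only meter are assumptions made for illustration. Code marks a region with a label, and the totals accumulate per label behind the scenes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical labeled begin/end probe; the only "meter" here is elapsed wall-clock time.
class ProbeSketch {
    private static final ConcurrentHashMap<String, LongAdder> NANOS = new ConcurrentHashMap<>();
    private static final ConcurrentHashMap<String, LongAdder> CALLS = new ConcurrentHashMap<>();

    private final String label;
    private final long startNanos;

    private ProbeSketch(String label) {
        this.label = label;
        this.startNanos = System.nanoTime();     // first clock access
    }

    // Marks the start of a labeled region.
    static ProbeSketch begin(String label) {
        return new ProbeSketch(label);
    }

    // Marks the end of the region and adds its cost to the label's running totals.
    void end() {
        long elapsed = System.nanoTime() - startNanos; // second clock access
        NANOS.computeIfAbsent(label, k -> new LongAdder()).add(elapsed);
        CALLS.computeIfAbsent(label, k -> new LongAdder()).increment();
    }

    static long calls(String label) {
        LongAdder adder = CALLS.get(label);
        return adder == null ? 0 : adder.sum();
    }

    static long totalNanos(String label) {
        LongAdder adder = NANOS.get(label);
        return adder == null ? 0 : adder.sum();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            ProbeSketch probe = ProbeSketch.begin("app.request.parse");
            Math.sqrt(i);                        // stand-in for the instrumented region
            probe.end();
        }
        System.out.println(calls("app.request.parse") + " calls, "
                + totalNanos("app.request.parse") + " ns total");
    }
}
```

Even this toy pays for two clock reads and two map lookups per interception, which is why the per-call overhead question matters once every method is wrapped.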

Posted by: William Louth | Apr 4, 2008 4:29:50 PM

 

re: bio vs. proof

* it is hard to sell bio to somebody else e.g. do i really want my 777 to be running something loosey-goosey-bioy, or do i want to be running code done in Praxis SPARK Ada? (the problem is that we don't get to have P.S.A. for everything; we end up with some possibly questionable stuff in between.)

* i worry that by going for the fail-fast/only bio approach, what we are really saying is that we aren't smart enough or principled enough to keep focusing on how to really prove things and hone that approach to the point where we are Masters of the Universe (read: statespace).

* i think the proof approach is more likely to get people thinking and growing and being smarter and stuff. i like the idea of those fringe benefits. gosh i need to go clone myself and then go spend all my free time learning Coq. ha ha. :-(

* biology is stunningly cool and disturbingly robust. completely mind-blowing when you really look at it and think about it. so maybe it really is the only way to go in the long long run.

* on the other hand, machines are not biology. they don't +have+ to be forced into the same development path as biology. maybe machines of logic would in fact be way better than biology.

* maybe if we take the biological approach then SkyNet will feel some sympathy and empathy for us. whereas if we take the purely logical machines approach then we'll end up with Berzerkers that think we're just ugly bags of mostly water.

* i just want to write big-ass video game systems that scale like nobody's business and don't corrupt data or go tits up.

Posted by: Raoul Duke | Apr 4, 2008 4:32:28 PM

 

By the way, the solution is to have a low-level complex event processing (CEP) engine designed with patterns specific to the language, runtime, and frameworks. Runtimes, frameworks, and application developers would create events at various levels of abstraction that could be tracked in terms of causality. Then it comes down to creating rules that extend this further.
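
A rough sketch of what "events at various levels of abstraction, tracked in terms of causality" could look like. Everything here is an assumption made for illustration (the `CausalEventLog` class, the layer names, the use of -1 for "no cause"); it is not an existing CEP engine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each event records which earlier event caused it.
final class CausalEventLog {
    private static final class Event {
        final String layer;   // e.g. "app", "framework", "runtime"
        final String name;
        final int causeId;    // id of the causing event, or -1 for a root event

        Event(String layer, String name, int causeId) {
            this.layer = layer;
            this.name = name;
            this.causeId = causeId;
        }
    }

    private final Map<Integer, Event> events = new HashMap<>();
    private int nextId = 0;

    // Record an event and return its id so later events can name it as their cause.
    int emit(String layer, String name, int causeId) {
        int id = nextId++;
        events.put(id, new Event(layer, name, causeId));
        return id;
    }

    // Walk back from an event to its root cause; the result is ordered root first.
    List<String> causalChain(int id) {
        List<String> chain = new ArrayList<>();
        for (Event e = events.get(id); e != null; e = events.get(e.causeId)) {
            chain.add(0, e.layer + ":" + e.name);
        }
        return chain;
    }
}
```

A rule would then be a predicate over such chains, for example "a runtime-level lock contention event whose root cause is an app-level checkout request".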

If there were more time in the day, this is what I would be working on to solve this problem.

William

Posted by: William Louth | Apr 4, 2008 4:34:41 PM

 

Raoul writes - "i just want to write big-ass video game systems that scale like nobody's business and don't corrupt data or go tits up."

I've been wondering if Azul would make a good MMORPG server engine: large flat coherent memory space; hundreds of cores can use the memory; pauseless-gc. You might be able to get away from the usual heavily-instanced-worlds I see, which are obviously done so we don't get 5000 players interacting on the same X86.

Cliff

Posted by: Cliff Click | Apr 4, 2008 4:47:34 PM

 

William writes - "Our Probes API is pure Java based though we do have instrumentation that uses JVMPI/JVMTI agents. But measuring the overhead of our pure Java..."

You're being very defensive; I'm not "on the attack" and I think app-level event profiling is very useful... but I'm looking for something very specific here, and it relates to the "complex tool thread" from far far above:

What's the cost to get low-level "hot code" style info, that is commonly available for C programs the world over, from dozens of different tool vendors? If I use such tools on Java, I discover stuff like: "my JIT'd code is hot" with no mapping back to the original Java.

If I do this with JVMPI/TI/pure Java, performance sucks so bad that the numbers are meaningless - indeed generally outright misleading. I suspect JXInsight falls into this trap when gathering low-level stuff. Indeed I don't see how it cannot. Hence my curiosity: what's the "score" for e.g. SpecJVM98 (all benchmarks in 1 JVM, best-of-5-runs), or SpecJBB-4-warehouses-generous-heap? Azul gives me this info easily, and I've used this kind of info for years as a first-step in any performance analysis.

William writes: "Can I get access to this when I am San Mateo in 2 weeks?" Sure... but I don't really need numbers from a "big multi-core X86". I'd like numbers from a "well described & understood multi-core X86". I can extrapolate from there.

Please note that I fully understand that this kind of low-level information is NOT the same thing as "your portal app can't handle more than 100 users because thread-pool-X is undersized" or some such kind of info; I can well believe JXInsight shines in places Azul's profiler does not.


Cliff

Posted by: Cliff Click | Apr 4, 2008 4:59:46 PM

 

William writes - "I would love to see a JVM/hardware vendor open up the thread specific cumulative counters as meters to our Probes."

Azul's solution exposes all our info as XML via http GET. Your app can probe itself. I'm sure this could lead to some totally impossible-to-grok feedback-driven heuristic death.

The "usual" usage pattern is a browser run by a developer, or a monitoring agent running the GETs and logging to a file (and sometimes parsing the XML and making trendlines).
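
For the monitoring-agent case, a sketch of the GET-and-parse step in plain Java (11 or later, for `java.net.http`). The URL you would pass to `fetch` and the XML shape (`<counter name="..." value="..."/>` elements) are assumptions for illustration only; the real feed's layout will differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of a polling agent; the endpoint and XML schema are made up.
final class XmlCounterPoller {
    // Issue one HTTP GET and return the response body as text.
    static String fetch(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        try {
            return HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString())
                    .body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while polling " + url, e);
        }
    }

    // Pull every <counter name=".." value=".."/> element into a name -> value map.
    static Map<String, Long> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("counter");
            Map<String, Long> counters = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                counters.put(e.getAttribute("name"), Long.parseLong(e.getAttribute("value")));
            }
            return counters;
        } catch (Exception e) {
            throw new IllegalArgumentException("unparseable counter XML", e);
        }
    }
}
```

An agent would call `parse(fetch(url))` on a timer and append the map to a log file; the trendlines come from diffing successive samples.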

Cliff

Posted by: Cliff Click | Apr 4, 2008 5:04:10 PM

 

Actually, Azul's low-level abilities & JXInsight's high-level abilities might combine very well.

It might make an excellent "JavaOne Demo" to take some big hairy app & run it on Azul - and flip back-and-forth between low-level and high-level views while getting the app to scream on 750+ cores.


Cliff

Posted by: Cliff Click | Apr 4, 2008 5:09:43 PM

 

"MMORPG server engine"

i honestly had no ulterior motive when posting comments on this particular 'thread', but nevertheless the other day i was really hoping that i could somehow convince Azul and my current employer to work together to test out our Java MMORPG game server on more cores, since the current plan really is to just throw hardware at the problem of supporting more people :-) (we just hit alpha.)

Posted by: Raoul Duke | Apr 4, 2008 5:16:02 PM

 

"complex event processing (CEP)"

any favourite pointers to things of such ilk? i'm reading the wikipedia pages...

it sounds at very first quick skim blush sorta like what i've been thinking of as a possibly right way to approach systems which are likely to have constantly changing requirements/specifications: some sort of blackboarding tuplespace db thing which has all the data in it, and then a bunch of semantic rules layered on top of all that data. screw encapsulation! ;-)

Posted by: Raoul Duke | Apr 4, 2008 5:20:05 PM

 

Random thought: use model checkers to explore the state space. then tell your system to restrict the runtime to not go into the unchecked parts of the state space. the more of the state space you "prove" via checking, the more your system can expand. but it would be guaranteed never to go off into untested la la lands.

re: "tools" in the above see e.g. http://cm.bell-labs.com/who/god/verisoft/abstract.html

re: "not go into" in the above see e.g. http://infoscience.epfl.ch/record/115079 for avoidance based on previous knowledge
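
a toy sketch of the random thought above, under heavy assumptions: states are plain values, the "model checker" is just a bounded breadth-first search, and the names (`CheckedStateSpace`, `explore`, `mayEnter`) are made up. real checkers like the ones linked do vastly more.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: check part of the state space, then fence the runtime inside it.
final class CheckedStateSpace<S> {
    private final Set<S> verified = new HashSet<>();

    // Breadth-first exploration from `initial`. Stops once `budget` states in total
    // have been verified. Returns false as soon as an explored state violates `safe`.
    boolean explore(S initial, Function<S, List<S>> successors, Predicate<S> safe, int budget) {
        Deque<S> queue = new ArrayDeque<>();
        Set<S> seen = new HashSet<>();
        queue.add(initial);
        seen.add(initial);
        while (!queue.isEmpty() && verified.size() < budget) {
            S state = queue.poll();
            if (!safe.test(state)) {
                return false; // violation found; this state is never marked verified
            }
            verified.add(state);
            for (S next : successors.apply(state)) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return true;
    }

    // Runtime guard: only allow transitions into states the checker has covered.
    boolean mayEnter(S next) {
        return verified.contains(next);
    }
}
```

calling `explore` again with a bigger budget grows the verified set, which is the "the more you prove, the more your system can expand" part.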

Posted by: Raoul Duke | Apr 4, 2008 5:43:41 PM

 

oh crap and of course the funny thing is that model checking is something you'd like to be able to throw hardware at! :-)

Posted by: Raoul Duke | Apr 4, 2008 5:59:01 PM

 

and with this i shall endeavour to STOP.

http://ietfec.oxfordjournals.org/cgi/content/abstract/E88-A/4/941

Posted by: Raoul Duke | Apr 4, 2008 5:59:47 PM

 

Raoul writes - "lots and lots". :-)

re: MMORPG - I bet Azul would rock here.

re: model checkers & throwing hardware at it: my 2006 JavaOne talk on "Scaling a Real Application on Azul" was me max'ing out a model checker on Azul.

re: deadlock detection/avoidance. We've been having this whole deadlock discussion at Azul, after we (finally) got a real deadlock in HotSpot in the field. (HotSpot itself has long had deadlock detection built in; we present all deadlocked threads & their stacks at a glance in our monitoring tool). We ended up "fixing" the existing lock-ranking scheme in HotSpot (i.e., we removed all the exceptions to the rank asserts people had leaked in over the years, and re-sorted the locks until we quit getting rank-asserts).
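
For readers who haven't met lock ranking: a minimal Java sketch of the idea, not HotSpot's actual C++ implementation. The class `RankedLock`, the "strictly ascending rank" convention, and the requirement that locks be released in reverse order of acquisition are all assumptions of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: every lock has a rank, and a thread may only acquire
// a lock whose rank is strictly higher than any lock it already holds.
final class RankedLock {
    // Ranks held by the current thread, most recently acquired on top.
    private static final ThreadLocal<Deque<Integer>> HELD =
            ThreadLocal.withInitial(ArrayDeque::new);

    private final ReentrantLock lock = new ReentrantLock();
    private final String name;
    private final int rank;

    RankedLock(String name, int rank) {
        this.name = name;
        this.rank = rank;
    }

    // The "rank assert": fail fast on an out-of-order acquire instead of deadlocking later.
    void lock() {
        Deque<Integer> held = HELD.get();
        Integer top = held.peek();
        if (top != null && top >= rank) {
            throw new IllegalStateException("lock rank violation: acquiring " + name
                    + " (rank " + rank + ") while holding rank " + top);
        }
        lock.lock();
        held.push(rank);
    }

    // Assumes locks are released in reverse order of acquisition.
    void unlock() {
        HELD.get().pop();
        lock.unlock();
    }
}
```

If every thread respects one global order, no cycle of waiting threads can form; the catch, as noted below, is proving that every path really does respect it.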

The discussion wound on and on going something like this: we can detect deadlock dynamically easily; we can collect & record them; we can sort lock ranks based on all viewed locking paths, including all deadlocks ever seen... BUT we cannot prove ahead of time that we got-all-paths covered without real compiler or language support.

And then we basically said: "it's easy to fix the deadlock bugs, once we have stack traces for the deadlocked threads - and this is generally true in Java as it is for HotSpot innards (a big C++ program). So let's not spend more time on an Uber Solution here."
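
On the Java side, getting those stack traces takes only the standard `java.lang.management` API. A small sketch (the class name `DeadlockReporter` and the report format are just for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: ask the JVM for deadlocked threads and format their stacks.
final class DeadlockReporter {
    // Returns an empty string when no deadlock is detected.
    static String report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null means no deadlock found
        if (ids == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        ThreadInfo[] infos = threads.getThreadInfo(ids,
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            out.append(info.getThreadName())
               .append(" waiting on ").append(info.getLockName())
               .append(" held by ").append(info.getLockOwnerName())
               .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }
}
```

Run from a watchdog thread or a management endpoint, this shows which lock each stuck thread wants and who holds it, which is usually enough to fix the ordering bug.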

Cliff

Posted by: Cliff Click | Apr 4, 2008 7:15:21 PM

 
