Cliff Click Jr.'s Blog


Touching Base...
December 22, 2009

It's been a while since I blogged, so I thought I'd touch base with people to let them know what's been going on.  Azul Systems has been hard at work improving our JVM.  This is a bigger statement than it sounds - there are not many groups that have a large enough 'quorum' of JVM engineers to do large-scale changes to the HotSpot JVM.  Azul has nearly a dozen engineers doing core HotSpot work (not counting JDK work or QA folks - counting only core JVM engineers)!  We've been doing large-scale changes to HotSpot for nearly 8 years now.  Our HotSpot has been improved over Sun's standard HotSpot or the OpenJDK in a large number of ways, some more visible and some less so.  Some of the more obvious stuff we've got working:

  • A new complete replacement GC: Generational Pauseless GC (and the older PauselessGC paper is here).  This is one of our core strengths.  GPGC handles heaps from 60 megabytes to 600 gigabytes and allocation rates from 4 megabytes/sec to 40 gigabytes/sec, with MAX pause-times consistently down in the 10-20 msec range.
  • GPGC requires read barriers, and this means instrumenting every read from the garbage-collected heap.  Instrumenting the JIT'd reads is easy: we altered the JITs long ago to emit the needed instructions.  Instrumenting the VM itself is a bigger job; every time we integrate a new source drop from Sun we have to find all the new heap-reads Sun has inserted into their new C++ code (HotSpot itself is a large complex C++ program) and add read-barriers to them.
  • Real Time Performance Monitoring - RTPM.  This is our high-resolution always-on no-overhead integrated profiling tool and is our 2nd major selling point.  Because it's no-overhead (literally less than 1%; it's very hard to measure the overhead) we leave it always on.  This means you can look at a JVM that's been up in production for a week or a month and introspect it.  It's *common* for a 1hr session with RTPM to answer performance questions that have plagued production systems for years, or to have people walk away with 10-line fixes worth 30% speedups.  It's as if you've been blind to what your JVM has been doing and suddenly your eyes are opened.  Live stack traces, heap contents, leaks, hot-locks with contending stack traces, profiled JIT'd assembly, I/O bottlenecks, GC issues, etc, etc.  See the link for a demo.
  • Virtualized JVM - We can take pretty much any old server, install a new JDK, change JAVA_HOME to the new JDK and re-launch the application... and it now runs on Azul's JVM backed by an Azul appliance.  No hardware change and no OS change.  This is a great solution for in-place speedups of older gear.
  • More recently of course, we've been hard at work porting our JVM to our new hardware platform.  This work is going well; look for more discussion here as we have things to announce!
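
To make the read-barrier bullet above a bit more concrete, here is a toy sketch in Java.  It is NOT how GPGC implements its barrier (the post doesn't say); the forwarding table, the `loadField` helper and the repair-on-read behaviour are all assumptions made up for illustration.  The only point it tries to show is that every read of a heap reference goes through a check before the application gets to use the value.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class ReadBarrierSketch {
    /** A toy heap object with a single reference field. */
    static final class Obj {
        final String name;
        Obj field;
        Obj(String name) { this.name = name; }
    }

    // Hypothetical forwarding table: stale (old) copy -> relocated (new) copy.
    // A real collector would not use a Java map; this is purely illustrative.
    static final Map<Obj, Obj> forwarding = new IdentityHashMap<>();

    // Counts how often the barrier had to repair a stale reference.
    static int repairs = 0;

    /** Barriered read of holder.field; single-threaded toy, no synchronization. */
    static Obj loadField(Obj holder) {
        Obj ref = holder.field;
        Obj moved = (ref == null) ? null : forwarding.get(ref);
        if (moved == null) {
            return ref;          // fast path: the reference is still good
        }
        holder.field = moved;    // slow path: repair the field so later reads are fast
        repairs++;
        return moved;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a");
        Obj oldB = new Obj("b");
        a.field = oldB;

        // Pretend the collector relocated b while the application kept running.
        Obj newB = new Obj("b (relocated)");
        forwarding.put(oldB, newB);

        System.out.println(loadField(a).name + ", repairs=" + repairs);
        System.out.println(loadField(a).name + ", repairs=" + repairs);
    }
}
```

In this sketch a raw `a.field` read anywhere else would be exactly the kind of un-barriered heap read that has to be hunted down after each new source drop.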

Here's some of the LESS obvious stuff we have working:

  • Tiered Compilation.  Despite the fact that Sun has shipped "-client" and "-server" configurations for years, they never integrated these two JITs into a single system.  Most other JVMs have had a tiered compilation configuration for years and Azul Systems did this to HotSpot a few years ago.  We consistently see a roughly 15% speed improvement over a plain "-server" configuration.  We use the "-client" JIT (also known internally as C1) to do fast high-resolution profiling; this high-quality profile information allows the "-server" JIT (C2) to do a much better job of inlining and compiling.
  • A complete replacement for the existing HotSpot CodeCache: the holder of all JIT'd code in the system.  While *adding* code has always been easy, *removing* code has always been tricky (well, tricky to do it without blowing all code away at once and without requiring all calls to indirect through a 'handle').  Most large server apps slowly churn new code, so if you leak code you eventually run out of memory.  The new CodeCache uses GC to control code lifetimes and this results in a vastly simpler and less buggy structure all around.  We also use GC to manage all the auxiliary data structures surrounding code, e.g. the list of "class dependencies" for a piece of JIT'd code is a standard heap object now.  (A "class dependency" lists the set of classes & methods that a piece of JIT'd code assumes are NOT overridden; if a new class and/or method overrides one of these then some inlining decision made by the JIT is now illegal and the JIT'd code needs to be deoptimized, removed & recompiled).  Besides being a common management point for all code, the CodeCache is pinned in the low 4-Gig.  This means all hardware Program Counters can be limited to 32bits (in our otherwise 64-bit system) and this is a tidy cost savings (shorter instruction sequences for calls; less I-cache space consumed, etc).
  • Tons of internal JVM scaling work.  We run on systems with 100's of CPUs and so we've found (and fixed!) any number of internal JVM scaling limitations.  GPGC can run with hundreds of worker CPUs if needed.  The JITs compile in parallel with dozens of CPUs (50 is common during a large application startup).  Many internal VM structures have been made lock-free or have had their lock hold-times reduced by 10x or more.  Self-tuning auto-sizing JIT/compiler thread pool.  Concurrent stub/native-wrapper generation.  Concurrent code-dependency insertion (during compilation) and checking (during class loading).  Self-tuning finalizer work queues.  etc, etc, etc....
  • Cooperative Safepointing allows thousands of *running* threads (not just alive-but-blocked-on-IO) to come to a Safepoint in under a millisecond.  Merely safepointing 100's of threads is down in the microseconds.  Note that a full-on Safepoint does not happen until the last thread checks in, but the stall time starts when the first thread stops for a Safepoint.  The time-to-safepoint pause is measured from when the first running thread stops until the last thread checks in.
  • The ability to asynchronously stop & signal individual threads, to have them do various self-service tasks cheaper than a remote thread can do it.  This includes, e.g. stack crawls for GC or profiling (a thread's stack is hot in its own L1 cache and can be crawled vastly faster than by a remote thread), or to acknowledge GC phase shifts or to allow code to be deoptimized (jargon word for what happens to code that is no longer valid due to class loading).  We can also efficiently do "ragged safepoints" - this is like a full Safepoint except we don't need to simultaneously stop all threads.  Instead we merely need to know when all threads have acknowledged e.g. a GC phase shift.  The threads "check in" as they individually acknowledge the Safepoint and keep on running.  When the last thread has checked in, the "ragged safepoint" (and GC phase shift) is complete.
  • No more "perm-gen" space to run out or require a separate tuning flag.  No more old-gen or young-gen either.  No GC-thread-count knobs, or space/ratio tuning knobs or GC age or SurvivorXXX flags.  GPGC takes no flags (except max total resources allowed), and runs well.  There Is Only One Heap Space, and GPGC Rules It All.
  • A new thread & stack layout that lets us use the stack-pointer also as a ThreadLocal storage pointer, the HotSpot "JavaThread*", AND as a small dense integer thread-id (requires 1 or 2 integer ops to flip between these forms).  This frees up a CPU register for general use, while still allowing 1-cycle access to performance critical thread-local structures.
  • A complete replacement for the existing HotSpot locking mechanisms.  Our new locks are 'biased' (here's the original paper idea) similar in theory to Sun's +BiasedLocking but based on entirely new code.  No more "displaced header" madness (this comment is probably only relevant to hard-core HotSpot engineers).  Biased locks do not require ANY atomic operation or memory barrier during locking & unlocking, unless the lock needs to "change hands".  Since we can stop individual threads asynchronously, we have a fairly cheap way to hand biased locks off between threads.  Once an individual lock demonstrates that it needs to "change hands", we inflate that one lock (not the whole class of locks).  It becomes a "thin lock" as long as the contention is low enough, switching over to a "thick lock" only when there are threads waiting to acquire the lock.  The issues here are fairly complex and subtle and deserve an entire 'nother blog!
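
The "ragged safepoint" idea from the list above can be sketched in plain Java.  This is only an illustration under my own assumptions, not the JVM-internal mechanism: the class name, the polling loop and the use of a `CountDownLatch` are all invented for the example.  What it tries to show is the shape of the protocol: the coordinator announces a phase shift, each worker checks in at its next poll and keeps right on running, and the shift is complete when the last worker has checked in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RaggedSafepointSketch {
    // Phase number the coordinator wants every worker to acknowledge.
    private final AtomicInteger requestedPhase = new AtomicInteger(0);
    // Counts outstanding check-ins for the phase shift currently in flight.
    private volatile CountDownLatch checkIns = new CountDownLatch(0);
    private volatile boolean running = true;
    private final List<Thread> workers = new ArrayList<>();

    /** Start n workers; in this sketch they must be started before the first phase shift. */
    void startWorkers(int n) {
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(this::workLoop, "worker-" + i);
            t.setDaemon(true);
            workers.add(t);
            t.start();
        }
    }

    private void workLoop() {
        int seenPhase = 0;
        while (running) {
            Thread.yield();                    // stand-in for a unit of real work
            int phase = requestedPhase.get();  // the poll
            if (phase != seenPhase) {
                seenPhase = phase;
                checkIns.countDown();          // check in, then keep running
            }
        }
    }

    /** Request a phase shift; returns true once every worker has checked in. */
    boolean phaseShift() {
        CountDownLatch latch = new CountDownLatch(workers.size());
        checkIns = latch;                      // publish the latch before the new phase
        requestedPhase.incrementAndGet();      // workers notice this at their next poll
        try {
            return latch.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    int currentPhase() { return requestedPhase.get(); }

    void stop() {
        running = false;
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        RaggedSafepointSketch s = new RaggedSafepointSketch();
        s.startWorkers(4);
        for (int i = 0; i < 3; i++) {
            boolean ok = s.phaseShift();
            System.out.println("phase " + s.currentPhase() + " acknowledged by all workers: " + ok);
        }
        s.stop();
    }
}
```

Note that no worker ever blocks here; only the coordinator waits.  A full Safepoint would differ in that each worker would also have to stay stopped until the last one had checked in.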

That's enough for this Blog.  More later...

Cliff

Category: Web/Tech


Comments

Thanks for documenting all this cool stuff. Unfortunately all my clients use "old boring" platforms, so my self-serving question is, are all these improvements private to Azul products or do you give some stuff back to Sun (and others)? I realize that some features may be mostly specific to your platform or very important as competitive selling points, e.g. GPGC (not to mention that Sun and other JVM makers have their own next-gen GC projects etc. in parallel; btw this includes tiered JIT in JDK 7). But I'd be interested in more fundamental things, like general fixes for scalability issues in HotSpot.

I would also like to know more details about each feature, e.g. how exactly your RTPM works and how it compares to Sun's DTrace which is the "gold standard" for that stuff (it's one of the reasons why I just chose Solaris 10 for a brand new testing cluster, even though my client's production servers are a completely different platform... the other reason being that said platform sucks and would cost our eyeballs ;-)).

Posted by: Osvaldo Pinali Doederlein | Dec 22, 2009 10:49:48 AM

 

Damn, sounds awesome. When is the Intel port? :)

Posted by: Sam Pullara | Dec 22, 2009 10:54:15 AM

 

- On giving stuff back: if we go open-source then obviously yes, some stuff comes back to the community. Going open-source isn't that far-fetched of an idea for us.


- How does RTPM compare vs DTrace: PROS: RTPM is much more JVM-specific & aware than DTrace; it's much much easier to use (you 'surf' your JVM in a web browser). It's always on & always available. The overhead is always tiny (although many common things are cheap in DTrace, you can ask for stuff that is very expensive). CONS: It does not cover everything; you can ask for e.g. OS thread-scheduling events from DTrace but not RTPM. DTrace can filter high-frequency events online (RTPM cannot, but it can certainly log-to-disk and filter offline). This is obviously a very superficial comparison between the two products and they clearly do very different things.


- Ahh, when is the Intel port? Ummm, when we can get around to it? :-)


Cliff

Posted by: Cliff Click | Dec 22, 2009 1:56:10 PM

 

Awesome stuff. Is there any possibility of you guys supporting Mono in a similar fashion?

Posted by: Brien | Dec 23, 2009 6:27:05 AM

 

The problem with Mono is 3-fold:

1- Microsoft 'owns' the spec and can change it at will. This gives them a headlock on your profits before you begin.

2- There's no high-margin high-end market (well, at least it's a very small market). Not very attractive for a small company.

3- The spec is 'loose' already. They have to support all that legacy code and there are holes aplenty in the spec... and people drive through them on a regular basis (because they've been doing it that way since forever and Microsoft is famous for backwards compatibility...). So you have to be bug-for-bug compatible and that's really hard to do.


That said, we did put in support for the CLR in our hardware but the market never materialized.

Cliff

Posted by: Cliff Click | Dec 23, 2009 8:59:39 AM

 

How do I get this? I can't find a download link or a pricing page anywhere on your website.....

thanks
Dan

Posted by: Daniel Lucraft | Dec 28, 2009 9:25:54 AM

 

You asked "How do I get this?". I assume 'this' is a Mono port of Azul's stuff - we never made a port of CLR, there was never enough market to pay for it.

If you want to be contacted by an Azul sales person, email me privately or I believe there is a registration link somewhere on the site which feeds into the sales database.

Cliff

Posted by: Cliff Click | Dec 28, 2009 9:36:53 AM

 

Hi Cliff, thanks for the prompt response.

I actually meant the Azul HotSpot referred to in the blog. Since I couldn't find any other information on the site, I didn't know if it was open source or paid for, or if there was a trial version.

I guess it's bundled with some of your other products, and not available separately. Sounds very interesting though! If you do open source it I'd love to try it out.

thanks
Dan

Posted by: Daniel Lucraft | Dec 28, 2009 9:51:56 AM

 

Yes, our JVM is bundled with our hardware. We're debating Open Sourcing our stuff, but nothing is settled as of right now.

Cliff

Posted by: Cliff Click | Dec 28, 2009 9:54:33 AM

 

By "give back" I didn't mean open sourcing - although that would certainly be great, I hope this happens eventually. But I wonder if your licensing agreement with Sun implies that you must give back to Sun (and by consequence to many other JVMs) any improvements that you make in code originally from Sun. It seems that other vendors, like IBM and Apple, routinely do this; but then, I don't know if this happens by contractual obligation. There are other reasons to share improvements, like reducing your effort to merge them again with every new Sun JDK build, and interop.

The hardware support for GPGC reminds me of earlier attempts to create CPUs with ISA extensions to help Java, like Sun's picoJava and MAJC architectures (both RIP, afaik). I'm interested in CPU technology that enables modern software advancements; transactional memory is another important item that comes to mind, there's a bunch of research in hardware+software TM (too bad Sun's ROCK failed). It's intriguing that Intel doesn't seem to be paying any attention to this stuff.

Posted by: Osvaldo Pinali Doederlein | Dec 29, 2009 8:54:50 AM

 

Humm, lots here...

1- We are required to report back to Sun any bugs found; we do this routinely and usually also hand them our bug fixes.

2- We are NOT required to hand back any new feature work (or maintenance cleanup work).

3- We tried, years ago (pre-OpenJDK), to hand back a major chunk of work related to the thread self-service & safepointing, but Sun was not interested.

4- The actual hardware needed for GPGC is really quite trivial. It would be less than the dot on the hair on the flea on the wart of a dog to put in an X86.

5- We have hardware transactional memory support, and have it turned on and running for quite a few years now. We use it for software-lock-elision of dusty-deck Java. It "doesn't work". Meaning: the hardware works as expected and we routinely allow parallel execution of dozens of otherwise serialized lock regions (transactions that are 1000's of instructions long), but we never can speed up any programs. See http://blogs.azulsystems.com/cliff/2009/02/and-now-some-hardware-transactional-memory-comments.html

Cliff

Posted by: Cliff Click | Dec 29, 2009 9:11:48 AM

 

hi there,

"every time we integrate a new source drop from Sun we have to find all the new heap-reads Sun has inserted into their new C++ code (HotSpot itself is a large complex C++ program) and add read-barriers to them."

How do you get new source drops from Sun for existing JDKs (not including OpenJDK)?

Thank you,

BR,
~A

Posted by: anjan bacchu | Feb 28, 2010 9:40:41 PM

 

Azul is a HotSpot licensee.
Cliff

Posted by: Cliff Click | Mar 1, 2010 7:37:48 AM

 

AFAIK the point of transactional memory is not to speed up programs but to make them easier to write, at least that's how it's being "sold". Most people would be happy, I believe, if it achieved that in return for a small speed penalty, never mind a speed increase. So why are you so disappointed?

Posted by: Olivier | Mar 5, 2010 11:12:25 AM

 
