Touching Base...
December 22, 2009
It's been a while since I blogged, so I thought I'd touch base with
people to let them know what's been going on. Azul Systems has been
hard at work improving our JVM. This is a bigger statement than it
sounds - there are not many groups that have a large enough 'quorum' of
JVM engineers to do large-scale changes to the HotSpot JVM. Azul has
nearly a dozen engineers doing core HotSpot work (not counting JDK work
or QA folks - counting only core JVM engineers)! We've been doing
large-scale changes to HotSpot for nearly 8 years now. Our HotSpot has
been improved over Sun's standard HotSpot or the OpenJDK in a large
number of ways, some more visible and some less so. Some of the more
obvious stuff we've got working:
- A new complete replacement GC: Generational Pauseless GC (and the
older PauselessGC paper is here). This is one of our core strengths.
GPGC handles heaps from 60 megabytes to 600 gigabytes and allocation
rates from 4 megabytes/sec to 40 gigabytes/sec, with MAX pause-times
consistently down in the 10-20 msec range.
- GPGC requires read barriers, and this means instrumenting every read
from the garbage-collected heap.
Instrumenting the JIT'd reads is easy: we altered the JITs long ago to
emit the needed instructions. Instrumenting the VM itself is a bigger
job; every time we integrate a new source drop from Sun we have to find
all the new heap-reads Sun has inserted into their new C++ code
(HotSpot itself is a large complex C++ program) and add read-barriers
to them.
- Real Time Performance Monitoring - RTPM. This is our high-resolution
always-on no-overhead integrated profiling tool and is our 2nd major
selling point. Because it's no-overhead (literally less than 1%; it's
very hard to measure the overhead) we leave it always on. This means
you can look at a JVM that's been up in production for a week or a
month and introspect it. It's *common* for a 1hr session with RTPM to
answer performance questions that have plagued production systems for
years, or to have people walk away with 10-line fixes worth 30%
speedups. It's as if you've been blind to what your JVM has been doing
and suddenly your eyes are opened. Live stack traces, heap contents,
leaks, hot-locks with contending stack traces, profiled JIT'd assembly,
I/O bottlenecks, GC issues, etc, etc. See the link for a demo.
- Virtualized JVM - We can take pretty much any old server, install
a new JDK, change JAVA_HOME to the new JDK and re-launch the
application... and it now runs on Azul's JVM backed by an Azul
appliance. No hardware change and no OS change. This is a great
solution for in-place speedups of older gear.
- More recently of course, we've been hard at work porting our
JVM to our new hardware platform. This work is going well; look for
more discussion here as we have things to announce!
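To make the read-barrier idea from the list above concrete, here is a toy Java sketch. It is not Azul's implementation: the `Node` class, the `FORWARDING` table, and the `relocate` and `loadNext` helpers are all hypothetical names invented for illustration, and repairing the stale field in place is a design choice of this sketch rather than something stated above. It only shows the shape of the technique: every load of a reference from the heap goes through one extra check before the application gets to use the value.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Toy model of a read barrier: every reference load goes through loadNext(). */
class ReadBarrierDemo {

    /** A heap object with a single reference field. */
    static final class Node {
        final String name;
        Node next;
        Node(String name) { this.name = name; }
    }

    /** Hypothetical forwarding table (old copy -> new copy), filled in by the "collector". */
    static final Map<Node, Node> FORWARDING = new IdentityHashMap<>();

    /** Stand-in for the collector moving an object while the application keeps running. */
    static Node relocate(Node old) {
        Node copy = new Node(old.name);
        copy.next = old.next;
        FORWARDING.put(old, copy);
        return copy;
    }

    /** The read barrier: the raw load plus a check that the loaded reference is still current. */
    static Node loadNext(Node holder) {
        Node ref = holder.next;            // the raw heap read
        Node moved = FORWARDING.get(ref);  // barrier check: was the target relocated?
        if (moved == null) {
            return ref;                    // fast path: reference is up to date
        }
        holder.next = moved;               // slow path: repair the stale field in place
        return moved;
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        a.next = b;
        Node newB = relocate(b);                  // "GC" moves b behind the application's back
        System.out.println(loadNext(a) == newB);  // the barrier hands back the new copy
        System.out.println(a.next == newB);       // and the field itself has been repaired
    }
}
```

This toy is single-threaded and uses a hash lookup where a real JIT would emit a few inline instructions, so it says nothing about cost. The point is only that a read missing its barrier (say, a raw `holder.next` in newly integrated VM code) would silently see the old copy, which is why every new heap-read has to be found and instrumented.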
Here's some of the LESS obvious stuff we have working:
- Tiered Compilation. Despite the fact that Sun has shipped
"-client" and "-server" configurations for years, they never integrated
these two JITs into a single system. Most other JVMs have had a tiered
compilation configuration for years and Azul Systems did this to
HotSpot a few years ago. We consistently see a roughly 15% speed
improvement over a plain "-server" configuration. We use the "-client"
JIT (also known internally as C1) to do fast high-resolution profiling;
this high-quality profile information allows the "-server" JIT (C2) to
do a much better job of inlining and compiling.
- A complete replacement for the existing HotSpot CodeCache: the
holder of all JIT'd code in the system. While *adding* code has
always been easy, *removing* code has always been tricky (well,
tricky to do it without blowing all code away at once and without
requiring all calls to indirect through a 'handle'). Most large server
apps slowly churn new code, so if you leak code you eventually run out
of memory. The new CodeCache uses GC to control code lifetimes and this
results in a vastly simpler and less buggy structure all around. We
also use GC to manage all the auxiliary data structures surrounding
code, e.g. the list of "class dependencies" for a piece of JIT'd code
is a standard heap object now. (A "class dependency" lists the set of
classes & methods that a piece of JIT'd code assumes are NOT
overridden; if a new class and/or method overrides one of these then
some inlining decision made by the JIT is now illegal and the JIT'd
code needs to be deoptimized, removed & recompiled.) Besides being a
common management point for all code, the CodeCache is pinned in the
low 4-Gig. This means all hardware Program Counters can be limited to
32 bits (in our otherwise 64-bit system) and this is a tidy cost
savings (shorter instruction sequences for calls; less I-cache space
consumed, etc).
- Tons of internal JVM scaling work. We run on systems with hundreds
of CPUs and so we've found (and fixed!) any number of internal JVM
scaling limitations. GPGC can run with hundreds of worker CPUs if
needed. The JITs compile in parallel with dozens of CPUs (50 is common
during a large application startup). Many internal VM structures have
been made lock-free or have had their lock hold-times reduced by 10x or
more. Self-tuning auto-sizing JIT/compiler thread pool. Concurrent
stub/native-wrapper generation. Concurrent code-dependency insertion
(during compilation) and checking (during class loading). Self-tuning
finalizer work queues. etc, etc, etc....
- Cooperative Safepointing allows thousands of *running* threads (not
just alive-but-blocked-on-IO) to come to a Safepoint in under a
millisecond. Merely safepointing hundreds of threads is down in the
microseconds. Note that a full-on Safepoint does not happen until the
last thread checks in, but the stall time starts when the first thread
stops for a Safepoint. The time-to-safepoint pause is measured from
when the first running thread stops until the last thread checks in.
- The ability to asynchronously stop & signal individual threads, to
have them do various self-service tasks cheaper than a remote thread
can do it. This includes, e.g. stack crawls for GC or profiling (a
thread's stack is hot in its own L1 cache and can be crawled vastly
faster than by a remote thread), or to acknowledge GC phase shifts or
to allow code to be deoptimized (jargon word for what happens to code
that is no longer valid due to class loading). We can also
efficiently do "ragged safepoints" - this is like a full Safepoint
except we don't need to simultaneously stop all threads. Instead we
merely need to know when all threads have acknowledged e.g. a GC phase
shift. The threads "check in" as they individually acknowledge the
Safepoint and keep on running. When the last thread has checked in,
the "ragged safepoint" (and GC phase shift) is complete.
- No more "perm-gen" space to run out or require a separate tuning
flag. No more old-gen or young-gen either. No GC-thread-count knobs,
or space/ratio tuning knobs or GC age or SurvivorXXX flags. GPGC takes
no flags (except max total resources allowed), and runs well. There Is
Only One Heap Space, and GPGC Rules It All.
- A new thread & stack layout that lets us use the
stack-pointer also as a ThreadLocal storage pointer, the HotSpot
"JavaThread*", AND as a small dense integer thread-id (requires 1 or 2
integer ops to flip between these forms). This frees up a CPU register
for general use, while still allowing 1-cycle access to
performance-critical thread-local structures.
- A complete replacement for the existing HotSpot locking mechanisms.
Our new locks are 'biased' (here's the original paper idea), similar in
theory to Sun's +BiasedLocking but based on entirely new code. No more
"displaced header" madness (this comment is probably only relevant to
hard-core HotSpot engineers). Biased locks do not require ANY atomic
operation or memory barrier during locking & unlocking, unless the
lock needs to "change hands". Since we can stop individual threads
asynchronously, we have a fairly cheap way to hand biased locks off
between threads. Once an individual lock demonstrates that it needs to
"change hands", we inflate that one lock (not the whole class of locks)
and it becomes a "thin lock" as long as the contention is low enough,
switching over to a "thick lock" only when there are threads waiting to
acquire the lock. The issues here are fairly complex and subtle and
deserve an entire 'nother blog!
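The "ragged safepoint" from the list above can be sketched in plain Java. This is an illustration only, not how the Azul JVM does it: the `Coordinator` class and its `poll` and `raggedSafepoint` methods are hypothetical names, and the sketch supports exactly one phase shift. Each worker polls at its own convenient points, checks in once when it notices the new phase, and keeps running; the requester only waits for the last check-in.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of a ragged safepoint: threads acknowledge a phase shift without stopping. */
class RaggedSafepointDemo {

    static final class Coordinator {
        private final CountDownLatch acks;        // one count per worker thread
        private volatile int phase = 0;           // bumped once by the requester
        private volatile boolean shutdown = false;

        Coordinator(int threads) { acks = new CountDownLatch(threads); }

        /** Worker-side poll: check in if the phase changed, then carry on. */
        int poll(int seenPhase) {
            int current = phase;
            if (current != seenPhase) {
                acks.countDown();                 // acknowledge, but do NOT block
            }
            return current;
        }

        /** Requester side: announce the phase shift, wait for the last check-in. */
        void raggedSafepoint() throws InterruptedException {
            phase = phase + 1;                    // single writer, so no atomic needed
            acks.await();
        }
    }

    /** Returns true if every worker was still running when the safepoint completed. */
    static boolean run(int threads) {
        Coordinator c = new Coordinator(threads);
        AtomicLong workDone = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                int seen = 0;
                while (!c.shutdown) {
                    workDone.incrementAndGet();   // stand-in for real application work
                    seen = c.poll(seen);          // cooperative poll point
                }
            });
            workers[i].start();
        }
        try {
            c.raggedSafepoint();                  // returns once all workers have checked in
            boolean allStillRunning = true;
            for (Thread t : workers) {
                allStillRunning &= t.isAlive();
            }
            c.shutdown = true;
            for (Thread t : workers) {
                t.join();
            }
            return allStillRunning;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("all workers checked in while running: " + run(4));
    }
}
```

A full Safepoint would instead have each worker block at the poll until the requester releases it; dropping that block is the whole difference. The sketch uses an ordinary `CountDownLatch` and a `volatile` flag, which is far heavier than the per-thread signalling described above.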
That's enough for this Blog. More later...
Cliff
Category: Web/Tech
Comments
Thanks for documenting all this cool stuff. Unfortunately all my clients use "old boring" platforms, so my self-serving question is, are all these improvements private to Azul products or do you give some stuff back to Sun (and others)? I realize that some features may be mostly specific to your platform or very important as competitive selling points, e.g. GPGC (not to mention that Sun and other JVM makers have their own next-gen GC projects etc. in parallel; btw this includes tiered JIT in JDK 7). But I'd be interested in more fundamental things, like general fixes for scalability issues in HotSpot.
I would also like to know more details about each feature, e.g. how exactly your RTPM works and how it compares to Sun's DTrace which is the "gold standard" for that stuff (it's one of the reasons why I just chose Solaris 10 for a brand new testing cluster, even though my client's production servers are a completely different platform... the other reason being that said platform sucks and would cost our eyeballs ;-)).
Posted by: Osvaldo Pinali Doederlein | Dec 22, 2009 10:49:48 AM
Damn, sounds awesome. When is the Intel port? :)
Posted by: Sam Pullara | Dec 22, 2009 10:54:15 AM
- On giving stuff back: if we go open-source then obviously yes, some stuff comes back to the community. Going open-source isn't that far-fetched of an idea for us.
- How does RTPM compare vs DTrace: PROS: RTPM is much more JVM-specific & aware than DTrace; it's much much easier to use (you 'surf' your JVM in a web browser). It's always on & always available. The overhead is always tiny (although many common things are cheap in DTrace, you can ask for stuff that is very expensive). CONS: It does not cover everything; you can ask for e.g. OS thread-scheduling events from DTrace but not RTPM. DTrace can filter high-frequency events online (RTPM cannot, but it can certainly log-to-disk and filter offline). This is obviously a very superficial comparison between the two products and they clearly do very different things.
- Ahh, when is the Intel port? Ummm, when we can get around to it? :-)
Cliff
Posted by: Cliff Click | Dec 22, 2009 1:56:10 PM
Awesome stuff. Is there any possibility of you guys supporting Mono in a similar fashion?
Posted by: Brien | Dec 23, 2009 6:27:05 AM
The problem with Mono is 3-fold:
1- Microsoft 'owns' the spec and can change it at will. This gives them a headlock on your profits before you begin.
2- There's no high-margin high-end market (well, at least it's a very small market). Not very attractive for a small company.
3- The spec is 'loose' already. They have to support all that legacy code and there are holes aplenty in the spec... and people drive through them on a regular basis (because they've been doing it that way since forever and Microsoft is famous for backwards compatibility...). So you have to be bug-for-bug compatible and that's really hard to do.
That said, we did put in support for the CLR in our hardware but the market never materialized.
Cliff
Posted by: Cliff Click | Dec 23, 2009 8:59:39 AM
How do I get this? I can't find a download link or a pricing page anywhere on your website.....
thanks
Dan
Posted by: Daniel Lucraft | Dec 28, 2009 9:25:54 AM
You asked "How do I get this?". I assume 'this' is a Mono port of Azul's stuff - we never made a port of CLR, there was never enough market to pay for it.
If you want to be contacted by an Azul sales person, email me privately or I believe there is a registration link somewhere on the site which feeds into the sales database.
Cliff
Posted by: Cliff Click | Dec 28, 2009 9:36:53 AM
Hi Cliff, thanks for the prompt response.
I actually meant the Azul HotSpot referred to in the blog. Since I couldn't find any other information on the site, I didn't know whether it was open source or paid for, or if there was a trial version.
I guess it's bundled with some of your other products, and not available separately. Sounds very interesting though! If you do open source it I'd love to try it out.
thanks
Dan
Posted by: Daniel Lucraft | Dec 28, 2009 9:51:56 AM
Yes, our JVM is bundled with our hardware. We're debating Open Sourcing our stuff, but nothing is settled as of right now.
Cliff
Posted by: Cliff Click | Dec 28, 2009 9:54:33 AM
By "give back" I didn't mean open sourcing - although that would certainly be great, I hope this happens eventually. But I wonder if your licensing agreement with Sun implies that you must give back to Sun (and by consequence to many other JVMs) any improvements that you make in code originally from Sun. It seems that other vendors, like IBM and Apple, routinely do this; but then, I don't know if this happens by contractual obligation. There are other reasons to share improvements, like reducing your effort to merge them again with every new Sun JDK build, and interop.
The hardware support for GPGC reminds me of earlier attempts to create CPUs with ISA extensions to help Java, like Sun's picoJava and MAJC architectures (both RIP, afaik). I'm interested in CPU technology that enables modern software advancements; transactional memory is another important item that comes to mind, there's a bunch of research in hardware+software TM (too bad Sun's ROCK failed). It's intriguing that Intel doesn't seem to be paying any attention to this stuff.
Posted by: Osvaldo Pinali Doederlein | Dec 29, 2009 8:54:50 AM
Humm, lots here...
1- We are required to report back to Sun any bugs found; we do this routinely and usually also hand them our bug fixes.
2- We are NOT required to hand back any new feature work (or maintenance cleanup work).
3- We tried, years ago (pre-OpenJDK), to hand back a major chunk of work related to the thread self-service & safepointing, but Sun was not interested.
4- The actual hardware needed for GPGC is really quite trivial. It would be less than the dot on the hair on the flea on the wart of a dog to put in an X86.
5- We have hardware transactional memory support, and have it turned on and running for quite a few years now. We use it for software-lock-elision of dusty-deck Java. It "doesn't work". Meaning: the hardware works as expected and we routinely allow parallel execution of dozens of otherwise serialized lock regions (transactions that are 1000's of instructions long), but we never can speed up any programs. See http://blogs.azulsystems.com/cliff/2009/02/and-now-some-hardware-transactional-memory-comments.html
Cliff
Posted by: Cliff Click | Dec 29, 2009 9:11:48 AM
hi there,
"every time we integrate a new source drop from Sun we have to find all the new heap-reads Sun has inserted into their new C++ code (HotSpot itself is a large complex C++ program) and add read-barriers to them."
How do you get new source drops from Sun for existing JDKs (not including OpenJDK)?
Thank you,
BR,
~A
Posted by: anjan bacchu | Feb 28, 2010 9:40:41 PM
Azul is a HotSpot licensee.
Cliff
Posted by: Cliff Click | Mar 1, 2010 7:37:48 AM
AFAIK the point of transactional memory is not to speed up programs but to make them easier to write, at least that's how it's being "sold". Most people would be happy, I believe, if it achieved that in return for a small speed penalty, never mind a speed increase. So why are you so disappointed?
Posted by: Olivier | Mar 5, 2010 11:12:25 AM