Talk is Cheap...
February 12, 2008

Been busy this month getting talks together.

First up: MSPC08, Workshop on Memory Systems Performance and Correctness.  I got my IWannaBit! paper accepted.  In this paper I propose a single hardware Bit; one that tracks L1 misses & evicts.  Using this one bit I can prove atomicity of a series of operations, or take corrective action if the operations are not atomic.  I basically can convert a hardware stall/fence for memory ordering into a software spin loop (plus I get to do other things while I'm waiting on the hardware).  It's a pretty cool notion, half-way to Transactional Memory and half-way to DCAS but with almost no hardware (and replaces either LL/SC or CAS ta-boot!).

Next I submitted 4 talks to JavaOne, and amazingly these 3 got accepted:

  • JVM Challenges and Directions in the Multicore Era - This is where I rant against all our current concurrent programming models.  Concurrent programming is very hard, and most of the currently debated solutions (e.g. transactional memory, atomic, locks, threads) does NOT address the hardest part of writing concurrent code.
  • Debugging Data Races - given as a BOF to last years JavaOne.  I'll freshen up the slides but mostly cover the same ground.
  • Toward a Coding Style for Scalable Nonblocking Data Structures - This the notion behind my NonBlockingHashTable taken to extremes.  Again, these slides  are old but start in the right direction: the JavaOne talk will "up level" the discussion and apply the same techniques to several different datastructures.

I'll probably be speaking at TSS Las Vegas again this year, but I haven't a clue about what!     :-)

More cheap talk: I think I've figured out how to do all the key bits for a full-blown in-memory DB: atomicity (nobody sees partial updates), non-blocking (no deadlocks; always some transaction can make progress), probably even fair (everybody's transaction gets to complete eventually, no matter the size or conflicts).  And fast: I bet I can sustain > 100million (simple) DB ops/sec.  Larger transactions will take longer of course.  Of course, it's all in-memory so the usual notion of durability is limited to the up-time of the server (it's actually not that bad: I can still checkpoint slowly to disk so e.g. the disk is no more than 60sec out-of-date and is still coherent (represents some snapshot of the real DB)).

Enough Cheap Talk!  I need to get busy making slides...

Cliff

Category: Web/Tech | | TrackBack (0)

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451bd7669e200e550441a8e8834

Listed below are links to weblogs that reference Talk is Cheap...:

 

Comments

Hi Cliff,

Will this in-memory DB be more than a hash map? If so then will it be based on relational database technology and including a mature SQL? query engine that is aware of the underlying (extreme transaction) architecture and plan the execution accordingly under dynamic workloads and data distributions. I assume you mean to take an existing database product and make it scream or are you planning on doing the slaying of the beast all by yourself.

Kind regards,

William

Posted by: William Louth | Feb 12, 2008 2:28:29 PM

 

I don't want to slay the beast all by my lonesome. I can provide a core piece: a high throughput "transactional memory"; an atomic read/modify/write for large disconnected chunks of memory. Surely Somebody (for some definition of Somebody) can turn that into an awesome SQL.

Cliff

Posted by: Cliff Click | Feb 12, 2008 2:38:27 PM

 

Why not just an awesome transactional data grid as this more than likely better reflects the actual user usage of such a system?

I would shy away from the DB term as most people will have preconceptions that it will be more than a transactional data store. Those (the few) that would need such a solution would have already discounted the DB technology so you do not gain any advantage. That is my honest opinion but I maybe (more than likely) completely wrong when it comes to such sales and marketing issues.

"In-memory DB" also undersells what is being talked about as most will just see it as removing disk reads when it is much more than that.

William

Posted by: William Louth | Feb 12, 2008 3:02:02 PM

 

I have a solution; I'm out looking for Problem with Money. :-)

I'm all too happy to drop the DB layer and just provide the awesome transactional data grid. You got a Problem? ;-)

Cliff

Posted by: Cliff Click | Feb 12, 2008 3:05:08 PM

 

If you want a better link for e.g. Bloom filter reference, the thesis is now publically available at http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-697.html;

Posted by: Chris Purcell | Feb 12, 2008 4:05:32 PM

 

Oh, and (at least in the design I proposed in the thesis), the Bloom filter allows snapshots to extend beyond the cache by snooping all transactions on the memory bus. This obviously does not scale directly to larger systems (no snooping), and any system I can think of for those involves some modification to the directory system, so the L1 restriction of IWannaBit makes a lot of sense for keeping the system cheap.

Posted by: Chris Purcell | Feb 12, 2008 4:45:55 PM

 

The paper is in to MSPC already, so it's probably to late to change the printed reference. I change it in my slides, however.

Cliff

Posted by: Cliff Click | Feb 12, 2008 10:32:57 PM

 

Sorry, late at night when I posted that, I missed the most important bit: congratulations! That's a lot of papers/talks! Best of luck at the conferences.

Posted by: Chris Purcell | Feb 13, 2008 12:35:00 PM

 

Thanks,
Cliff

Posted by: Cliff Click | Feb 13, 2008 3:40:06 PM

 

Are you just saying that if you have a L1 cache that's exclusively tied to a thread for a time, that if all the loads and stores hit in the cache and have the required permission, then you don't need any locks?

if so, isn't that a "duh...well yeah"

Don't you have to do some performance analysis of how often this is true, and how often you have to fall back to the "some other method" to complete the work?

I mean, minimally, you should do some performance analysis on what size caches are required. Even mentioning that you need a L1 cache that supports clean and dirty lines, right? (a lot of L1 caches don't, and some share between cores)

I would also think that prefetch instructions would be useful, for getting the cache in the desired state first, before you start doing any real work, would be good. (I say prefetch because you can probably do those at higher thruput than loads and stores).

And your evict case..is that supposed to cover all MP memory order issues? i.e. are you assuming "evict" covers all L1 cache state changes for any reason (due to this thread, or threads elsewhere in the system)
If so, then there's more performance analysis needed there. If the L1 cache is real big (2MB) I could see how it would suck.

I think you're assuming you have relatively small L1 caches, backed by larger L2/L3 caches.

Imagine if there is just L1 cache. Big. bad. Right? (think false sharing effects..lifetime of data in caches)

I have a counter proposal.
A single bit in the system. It is cleared when anyone tries to do a load or store in the system.

Obviously dumb, right?

Your proposal seems just a few steps down from it though. I think you need perf. analysis to describe the scenarios under which it might work well.

Posted by: Bart | Feb 21, 2008 3:20:03 PM

 

Letsee... lots of big comments, so maybe a big reply:

>Are you just saying that if you have a L1 cache that's exclusively tied to a thread for a time, that if all the loads and stores hit in the cache and have the required permission, then you don't need any locks?
>if so, isn't that a "duh...well yeah"

In cache-coherent systems, ala all desktops & all servers these days, the L1 partakes in the coherence protocol. So I'm looking to figure out when "protocol didn't happen".


> Don't you have to do some performance analysis of how often this is true, and how often you have to fall back to the "some other method" to complete the work?

Well understood & documented in the industry: most Java locks are very short and not contended. There are now a number of JIT optimizations in main-stream JVMs targeted directly at these sorts of locks.


> I mean, minimally, you should do some performance analysis on what size caches are required.

For my intended use case, implementing interesting non-blocking algorithms, a 4-word cache with 4-way associativity is sufficient. All the caches Out There are orders of magnitude larger.


> Even mentioning that you need a L1 cache that supports clean and dirty lines, right? (a lot of L1 caches don't, and some share between cores)

Actually, I'd say the reverse: *most* caches support clean & dirty AND are private to the core. Certainly true of the bazillion X86's Out There today. Sparc has it's "interesting" write-back cache vs read-only L1. PPC, IA64, MIPS, Azul, etc have private caches as far as I know.

But I don't need clean & dirty bits: I need atomicity. This Bit is useful on a write-through cache.


> I would also think that prefetch instructions would be useful, for getting the cache in the desired state first, before you start doing any real work, would be good. (I say prefetch because you can probably do those at higher thruput than loads and stores).

If you read the IWannaBit! paper, you'll see that my expected use-case is to cache-miss on the first go-'round, then loop & rerun with things hot in cache. I get to turn a hardware coherency stall (from a mfence instruction) into a software spin-loop - hopefully doing more useful work while I wait.

More in another post.

Cliff

Posted by: Cliff Click | Feb 22, 2008 10:00:57 AM

 

> And your evict case..is that supposed to cover all MP memory order issues?

Yup.


> i.e. are you assuming "evict" covers all L1 cache state changes for any reason (due to this thread, or threads elsewhere in the system)

Yup. I'm looking for a minimalist approach that seems to realistically do stuff. Fancier hardware makes you more immune to other threads' actions - but nobody (other than Azul) has shipped fancier hardware. I'm trying to lower the barrier to entry.


> If so, then there's more performance analysis needed there. If the L1 cache is real big (2MB) I could see how it would suck.

Why? Do you suspect constant eviction from other threads? I suspect the evictions would die out very quickly.


> I think you're assuming you have relatively small L1 caches, backed by larger L2/L3 caches.

I am... it's the current common approach.


> Imagine if there is just L1 cache. Big. bad. Right? (think false sharing effects..lifetime of data in caches)

Doesn't seem so bad to me. As I say above, I suspect the false-sharing evictions will quickly stabilize long enough to get the atomic operation done.


> I have a counter proposal.
> A single bit in the system. It is cleared when anyone tries to do a load or store in the system.
> Obviously dumb, right?

Yup. I got an 800 core system *right now*. Not proposed: shipping to paying customers. There's never a clock cycle where a load or store isn't happening somewhere.


> Your proposal seems just a few steps down from it though. I think you need perf. analysis to describe the scenarios under which it might work well.

It's those few crucial steps down that matter!

And I *do* have some performance analysis; traces that tell me that standard modest sized L1's (commonly available) running Java apps see time-between-evictions on the order of 1000's of instructions.

Cliff

Posted by: Cliff Click | Feb 22, 2008 10:08:40 AM

 

Quick answer showing that it's dangerous when SW guys say they want a particular HW "thingee" as opposed to describing behaviors.
You should describe the behavior you want from a load/store/memory model point of view. Not assume
a hw implementation.

Here's an example of breakage:

In SUN TSO, A thread is allowed to see the results of it's own stores early..In UltraSparc I & II, this allowed the write-thru L1 to be updated BEFORE ownership of the line was resolved in terms of system wide cache coherency. This is legal because TSO allows it. (check your SPARC V9 book)

You're suggesting a solution that works assuming other things: implementation, memory model etc. Like L1 being a commit point for system-wide coherency resolution (or even that the system wide memory model has some write-atomicity requirement?)

So I offer the existence proof that on UltraSPARC I & II, hitting in the L1 cache for a store doesn't get you what you want.
(a commit point, although it's not just that, because the notion of "commit point" requires something from the system memory model. Talking about commit points is broken too...need to talk about relationships.)

Now sure you can say those are old brain dead processors. But at the very least, it shows you have to describe all the things that have to be true for your scheme to work.

That's my point.

and btw, write-thru L1 caches are still common...like in xbox 360:
http://www-128.ibm.com/developerworks/power/library/pa-fpfxbox/

Posted by: Bart | Feb 24, 2008 11:55:31 AM

 

I certainly ran this idea by the same hardware guys that designed that shipping 800+ core system I was talking about, and they assured me it could be made to work. Agreed that this is Not the same thing as making it work in a shipping product, but this *is* for a work-in-progress workshop paper after all.

As for the write-thru cache thing: I don't think it matters for my intended usage at all. The whole point is to conditionally store exactly once, after an atomic string of reads. No need to roll-back the store or a set of stores (as might be expected if I insisted on a write-back cache).

As for Sparc TSO: I don't see how that impacts this notion either. My desired behaviors are pretty obvious, and laid out in the paper mentioned in the blog. TSO, including the early update of stores, doesn't "break" the Bit. TSO might imply a particular implementation, that's all.

So here's the TSO example. What I want to do is read 2 words, then conditional write the 3rd - atomically, or not at all. The sequence is:
1 - clear the Bit; begin watching the cache for evicts
2 - issue 1st load; set the Bit if it misses in L1
3 - issue 2nd load; set the Bit if it misses in L1
4 - At this point, either both loads are in cache & have been in cache since I started OR the bit is set (not exclusive-OR, Bit might be set for unrelated eviction)
5 - Issue a store-conditional. Either the Bit is still clear & the store succeeds, OR the store is not visible outside the CPU (e.g. does not alter the L1). This store is atomic with evicts that set the Bit (i.e., hardware closes the race where an unrelated evict is coming "up" from the lower levels of coherency protocol - which will set the Bit and abort the store). I don't care if the aborted store is also visible to other loads issued from the same CPU (successful stores must match the normal MM).
6 - Test the Bit for remaining clear; take corrective active if it's not.

Nowhere in this sequence do I read-after-write, hence I don't care if on Sparc stores can update early. Nor do I care if a successful store writes-thru to memory or just to L1. I contend that a straightforward implementation of the IWannaBit! will work fine on an UltraSparcI/II or any TSO machine.


Cliff

Posted by: Cliff Click | Feb 24, 2008 12:48:21 PM

 

I have to read your paper more, but I think your store completion is with respect to getting "completed" in the system, not just in the L1. I can see that in your system, the two are identical. With my example of US-I/II, it's not. It completes to L1 before system completion. (if the line is already
in the L1).

Now you could track when the store actually completes relative to the system rules, if you need to. But it goes to the point of what the IWannaBit needs to cover.

Like I said, I have to reread to see if you want it to mean that all outstanding stores have completed in terms of some definition of system completeness. I suspect you do.

I think you've mentally internalized a description of how stores work in your system with respect to your L1, so that it's difficult to see what I mean in the general case.

I still think it would be better to talk
about the assumed memory model characteristics, and the setting and clearing of this bit being defined relative to the completion of various aspects of load and store completion in the system.

Whether or not there's an L1 in that picture doesn't really matter, other than a required minimum size for good performance, and probably a maximum size for good performance also (due to the false sharing issue).


It seems like what you're really saying is that L1 cache sizes today, typically seem to match the "size needed" for # of lines needed for your transaction. So rather than having state for every line that's tracked independently, having one bit of state for the whole cache is sufficient.

But: it would be nice to know what size is "right".

Just like there are sometimes reasons to make separate store/write caches, you could imagine that there is a separate transaction cache that is only used for the transaction cases. Fully associative..could be small...you could look at it from that perspective. You're describing a transaction cache, and implying you could layer it onto the normal behavior of a small, typically designed L1 cache. But define typical more clearly. (and implementations that require
special handling like I've suggested)

Posted by: Bart | Feb 24, 2008 9:11:01 PM

 

NOW we're getting somewhere. I need TWO things out of IWannaBit - and if I don't get both, then the Bit is pointless:

1- Multi-word atomic-read set, with 1 word atomic-write set. "multi" is useful at 2 words, more useful at 3 or 4. Not really tons more useful at 1000 words.

2- Cheap in hardware. If the hardware guys are willing to give me a transactional cache, or bits-per-line or any number of other proposed HTM extensions, then this Bit is dwarfed. But Big HTM hasn't happened yet (except on Azul shipping, and maybe Rock: not shipping, heard no news on how it works). So I'm hoping I can get "small HTM" in the form of this Bit.

So implementation and Bit meaning go hand-in-hand here. If the implementation isn't cheap either I won't get it at all or instead I'll get a much more impressive HTM. In either case I don't care about the Bit (which is why I'm resisting talking about *meaning* in a MM).

On Store ordering: I don't care about ordering of other stores from this CPU. I don't care about stores from OTHER CPUs, as long as they don't muck with the atomicity of my reads-plus-one-write.

On what size is "Right": 2-atomic-read words and 1 atomic-write is Good. More is better of course, but 2/1 is enough to write interesting algorithms that canNOT be written with plain CAS. 3/1 is enough better that I'd like to not give up on it: means I can do lock-free updates on structures that have 3 interlocking links like R/B trees.

---

Another underlying assumption that might be confusing us: I DONT CARE what the hardware memory model is: PSO, TSO, IA64, Hamsters-with-PostIt-notes - I REALLY DONT CARE.

I'm a JVM writer. I must implement the JMM. No Hardware MM matches the JMM, so I must bridge the gap to *my* customers (the Java writers). So in all cases *I* need detailed understanding of the Hardware MM - but then I hide that from the Java writers. I'm very good at it: I can "hide" the most amazingly weak Hardware MM's from the Java programmer and make them look like the JMM.

So when I say "I DONT CARE" I mean: you tell me what happens if I have outstanding stores & issue loads & stores against them, and I can work with it.

---

On the Sparc US I/II issue: we should break that out in a separate conversation. I don't understand how it "doesn't work" with my notion of the Bit. Everything you say still seems to play-nice with the Bit.

Cliff

Posted by: Cliff Click | Feb 25, 2008 9:10:23 AM

 

I live in San Jose, and will be in Seattle next week for ASPLOS & MSPC.

You get around either of those places? We can talk through the issues much faster face to face, and it would be nice to meet you.
The blog posting software is a bear to write long posts on!

Thanks,
Cliff

Posted by: Cliff Click | Feb 25, 2008 9:14:33 AM

 

okay. good job thinking it thru.

I'm just throwing jabs at you so you'll do a better job at your presentation. I know you like being quick and glib, but it's easy for the "IWannaBit" title to make it seem like you're just a typical sw guy who hasn't thought thru all the issues. If you can show you've thought thru how it maps onto what hw guys are doing in the future, and why it makes sense in comparison to alternative approaches, then you'll be able to sell to the hw guys better. It's all about showing that the prior solutions were overconstrained because no sw guy said "hey we'll support this half of the problem if you do this"

It's all about showing how you've recognized a new way to match requirements and implementation possibilities.

My experience is that it's not just about "new ideas" but showing how a new idea encompasses all prior knowledge, recognizes what hw latitude wasn't allowed before, but should have been.

Advances in hw depend on sw transitions. Being able to articulate simple sw transitions that need simple hw support is hugely important.

The best computer system is always one that has "just enough" stuff and no more. The hw guys are stuck in thinking sw guys can't adapt.

So you're doing the right thing in taking the lead in exploring how things can be.

Posted by: bart | Feb 25, 2008 10:59:41 PM

 

Post a comment