|
Talk is Cheap... Been busy this month getting talks together. First up: MSPC08, Workshop on Memory Systems Performance and Correctness. I got my IWannaBit! paper accepted. In this paper I propose a single hardware Bit; one that tracks L1 misses & evicts. Using this one bit I can prove atomicity of a series of operations, or take corrective action if the operations are not atomic. I basically can convert a hardware stall/fence for memory ordering into a software spin loop (plus I get to do other things while I'm waiting on the hardware). It's a pretty cool notion, half-way to Transactional Memory and half-way to DCAS but with almost no hardware (and replaces either LL/SC or CAS ta-boot!). Next I submitted 4 talks to JavaOne, and amazingly these 3 got accepted:
I'll probably be speaking at TSS Las Vegas again this year, but I haven't a clue about what! :-) More cheap talk: I think I've figured out how to do all the key bits for a full-blown in-memory DB: atomicity (nobody sees partial updates), non-blocking (no deadlocks; always some transaction can make progress), probably even fair (everybody's transaction gets to complete eventually, no matter the size or conflicts). And fast: I bet I can sustain > 100million (simple) DB ops/sec. Larger transactions will take longer of course. Of course, it's all in-memory so the usual notion of durability is limited to the up-time of the server (it's actually not that bad: I can still checkpoint slowly to disk so e.g. the disk is no more than 60sec out-of-date and is still coherent (represents some snapshot of the real DB)). Enough Cheap Talk! I need to get busy making slides... Cliff Category: Web/Tech | | TrackBack (0) TrackBackTrackBack URL for this entry: Listed below are links to weblogs that reference Talk is Cheap...: CommentsHi Cliff, Will this in-memory DB be more than a hash map? If so then will it be based on relational database technology and including a mature SQL? query engine that is aware of the underlying (extreme transaction) architecture and plan the execution accordingly under dynamic workloads and data distributions. I assume you mean to take an existing database product and make it scream or are you planning on doing the slaying of the beast all by yourself. Kind regards, William Posted by: William Louth | Feb 12, 2008 2:28:29 PM I don't want to slay the beast all by my lonesome. I can provide a core piece: a high throughput "transactional memory"; an atomic read/modify/write for large disconnected chunks of memory. Surely Somebody (for some definition of Somebody) can turn that into an awesome SQL. Cliff Posted by: Cliff Click | Feb 12, 2008 2:38:27 PM Why not just an awesome transactional data grid as this more than likely better reflects the actual user usage of such a system? I would shy away from the DB term as most people will have preconceptions that it will be more than a transactional data store. Those (the few) that would need such a solution would have already discounted the DB technology so you do not gain any advantage. That is my honest opinion but I maybe (more than likely) completely wrong when it comes to such sales and marketing issues. "In-memory DB" also undersells what is being talked about as most will just see it as removing disk reads when it is much more than that. William Posted by: William Louth | Feb 12, 2008 3:02:02 PM I have a solution; I'm out looking for Problem with Money. :-) I'm all too happy to drop the DB layer and just provide the awesome transactional data grid. You got a Problem? ;-) Cliff Posted by: Cliff Click | Feb 12, 2008 3:05:08 PM If you want a better link for e.g. Bloom filter reference, the thesis is now publically available at http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-697.html; Posted by: Chris Purcell | Feb 12, 2008 4:05:32 PM Oh, and (at least in the design I proposed in the thesis), the Bloom filter allows snapshots to extend beyond the cache by snooping all transactions on the memory bus. This obviously does not scale directly to larger systems (no snooping), and any system I can think of for those involves some modification to the directory system, so the L1 restriction of IWannaBit makes a lot of sense for keeping the system cheap. Posted by: Chris Purcell | Feb 12, 2008 4:45:55 PM The paper is in to MSPC already, so it's probably to late to change the printed reference. I change it in my slides, however. Cliff Posted by: Cliff Click | Feb 12, 2008 10:32:57 PM Sorry, late at night when I posted that, I missed the most important bit: congratulations! That's a lot of papers/talks! Best of luck at the conferences. Posted by: Chris Purcell | Feb 13, 2008 12:35:00 PM Thanks, Posted by: Cliff Click | Feb 13, 2008 3:40:06 PM Are you just saying that if you have a L1 cache that's exclusively tied to a thread for a time, that if all the loads and stores hit in the cache and have the required permission, then you don't need any locks? if so, isn't that a "duh...well yeah" Don't you have to do some performance analysis of how often this is true, and how often you have to fall back to the "some other method" to complete the work? I mean, minimally, you should do some performance analysis on what size caches are required. Even mentioning that you need a L1 cache that supports clean and dirty lines, right? (a lot of L1 caches don't, and some share between cores) I would also think that prefetch instructions would be useful, for getting the cache in the desired state first, before you start doing any real work, would be good. (I say prefetch because you can probably do those at higher thruput than loads and stores). And your evict case..is that supposed to cover all MP memory order issues? i.e. are you assuming "evict" covers all L1 cache state changes for any reason (due to this thread, or threads elsewhere in the system) I think you're assuming you have relatively small L1 caches, backed by larger L2/L3 caches. Imagine if there is just L1 cache. Big. bad. Right? (think false sharing effects..lifetime of data in caches) I have a counter proposal. Obviously dumb, right? Your proposal seems just a few steps down from it though. I think you need perf. analysis to describe the scenarios under which it might work well. Posted by: Bart | Feb 21, 2008 3:20:03 PM Letsee... lots of big comments, so maybe a big reply: >Are you just saying that if you have a L1 cache that's exclusively tied to a thread for a time, that if all the loads and stores hit in the cache and have the required permission, then you don't need any locks? In cache-coherent systems, ala all desktops & all servers these days, the L1 partakes in the coherence protocol. So I'm looking to figure out when "protocol didn't happen".
Well understood & documented in the industry: most Java locks are very short and not contended. There are now a number of JIT optimizations in main-stream JVMs targeted directly at these sorts of locks.
For my intended use case, implementing interesting non-blocking algorithms, a 4-word cache with 4-way associativity is sufficient. All the caches Out There are orders of magnitude larger.
Actually, I'd say the reverse: *most* caches support clean & dirty AND are private to the core. Certainly true of the bazillion X86's Out There today. Sparc has it's "interesting" write-back cache vs read-only L1. PPC, IA64, MIPS, Azul, etc have private caches as far as I know. But I don't need clean & dirty bits: I need atomicity. This Bit is useful on a write-through cache.
If you read the IWannaBit! paper, you'll see that my expected use-case is to cache-miss on the first go-'round, then loop & rerun with things hot in cache. I get to turn a hardware coherency stall (from a mfence instruction) into a software spin-loop - hopefully doing more useful work while I wait. More in another post. Cliff Posted by: Cliff Click | Feb 22, 2008 10:00:57 AM > And your evict case..is that supposed to cover all MP memory order issues? Yup.
Yup. I'm looking for a minimalist approach that seems to realistically do stuff. Fancier hardware makes you more immune to other threads' actions - but nobody (other than Azul) has shipped fancier hardware. I'm trying to lower the barrier to entry.
Why? Do you suspect constant eviction from other threads? I suspect the evictions would die out very quickly.
I am... it's the current common approach.
Doesn't seem so bad to me. As I say above, I suspect the false-sharing evictions will quickly stabilize long enough to get the atomic operation done.
Yup. I got an 800 core system *right now*. Not proposed: shipping to paying customers. There's never a clock cycle where a load or store isn't happening somewhere.
It's those few crucial steps down that matter! And I *do* have some performance analysis; traces that tell me that standard modest sized L1's (commonly available) running Java apps see time-between-evictions on the order of 1000's of instructions. Cliff Posted by: Cliff Click | Feb 22, 2008 10:08:40 AM Quick answer showing that it's dangerous when SW guys say they want a particular HW "thingee" as opposed to describing behaviors. Here's an example of breakage: In SUN TSO, A thread is allowed to see the results of it's own stores early..In UltraSparc I & II, this allowed the write-thru L1 to be updated BEFORE ownership of the line was resolved in terms of system wide cache coherency. This is legal because TSO allows it. (check your SPARC V9 book) You're suggesting a solution that works assuming other things: implementation, memory model etc. Like L1 being a commit point for system-wide coherency resolution (or even that the system wide memory model has some write-atomicity requirement?) So I offer the existence proof that on UltraSPARC I & II, hitting in the L1 cache for a store doesn't get you what you want. Now sure you can say those are old brain dead processors. But at the very least, it shows you have to describe all the things that have to be true for your scheme to work. That's my point. and btw, write-thru L1 caches are still common...like in xbox 360: Posted by: Bart | Feb 24, 2008 11:55:31 AM I certainly ran this idea by the same hardware guys that designed that shipping 800+ core system I was talking about, and they assured me it could be made to work. Agreed that this is Not the same thing as making it work in a shipping product, but this *is* for a work-in-progress workshop paper after all. As for the write-thru cache thing: I don't think it matters for my intended usage at all. The whole point is to conditionally store exactly once, after an atomic string of reads. No need to roll-back the store or a set of stores (as might be expected if I insisted on a write-back cache). As for Sparc TSO: I don't see how that impacts this notion either. My desired behaviors are pretty obvious, and laid out in the paper mentioned in the blog. TSO, including the early update of stores, doesn't "break" the Bit. TSO might imply a particular implementation, that's all. So here's the TSO example. What I want to do is read 2 words, then conditional write the 3rd - atomically, or not at all. The sequence is: Nowhere in this sequence do I read-after-write, hence I don't care if on Sparc stores can update early. Nor do I care if a successful store writes-thru to memory or just to L1. I contend that a straightforward implementation of the IWannaBit! will work fine on an UltraSparcI/II or any TSO machine.
Posted by: Cliff Click | Feb 24, 2008 12:48:21 PM I have to read your paper more, but I think your store completion is with respect to getting "completed" in the system, not just in the L1. I can see that in your system, the two are identical. With my example of US-I/II, it's not. It completes to L1 before system completion. (if the line is already Now you could track when the store actually completes relative to the system rules, if you need to. But it goes to the point of what the IWannaBit needs to cover. Like I said, I have to reread to see if you want it to mean that all outstanding stores have completed in terms of some definition of system completeness. I suspect you do. I think you've mentally internalized a description of how stores work in your system with respect to your L1, so that it's difficult to see what I mean in the general case. I still think it would be better to talk Whether or not there's an L1 in that picture doesn't really matter, other than a required minimum size for good performance, and probably a maximum size for good performance also (due to the false sharing issue).
But: it would be nice to know what size is "right". Just like there are sometimes reasons to make separate store/write caches, you could imagine that there is a separate transaction cache that is only used for the transaction cases. Fully associative..could be small...you could look at it from that perspective. You're describing a transaction cache, and implying you could layer it onto the normal behavior of a small, typically designed L1 cache. But define typical more clearly. (and implementations that require Posted by: Bart | Feb 24, 2008 9:11:01 PM NOW we're getting somewhere. I need TWO things out of IWannaBit - and if I don't get both, then the Bit is pointless: 1- Multi-word atomic-read set, with 1 word atomic-write set. "multi" is useful at 2 words, more useful at 3 or 4. Not really tons more useful at 1000 words. 2- Cheap in hardware. If the hardware guys are willing to give me a transactional cache, or bits-per-line or any number of other proposed HTM extensions, then this Bit is dwarfed. But Big HTM hasn't happened yet (except on Azul shipping, and maybe Rock: not shipping, heard no news on how it works). So I'm hoping I can get "small HTM" in the form of this Bit. So implementation and Bit meaning go hand-in-hand here. If the implementation isn't cheap either I won't get it at all or instead I'll get a much more impressive HTM. In either case I don't care about the Bit (which is why I'm resisting talking about *meaning* in a MM). On Store ordering: I don't care about ordering of other stores from this CPU. I don't care about stores from OTHER CPUs, as long as they don't muck with the atomicity of my reads-plus-one-write. On what size is "Right": 2-atomic-read words and 1 atomic-write is Good. More is better of course, but 2/1 is enough to write interesting algorithms that canNOT be written with plain CAS. 3/1 is enough better that I'd like to not give up on it: means I can do lock-free updates on structures that have 3 interlocking links like R/B trees. --- Another underlying assumption that might be confusing us: I DONT CARE what the hardware memory model is: PSO, TSO, IA64, Hamsters-with-PostIt-notes - I REALLY DONT CARE. I'm a JVM writer. I must implement the JMM. No Hardware MM matches the JMM, so I must bridge the gap to *my* customers (the Java writers). So in all cases *I* need detailed understanding of the Hardware MM - but then I hide that from the Java writers. I'm very good at it: I can "hide" the most amazingly weak Hardware MM's from the Java programmer and make them look like the JMM. So when I say "I DONT CARE" I mean: you tell me what happens if I have outstanding stores & issue loads & stores against them, and I can work with it. --- On the Sparc US I/II issue: we should break that out in a separate conversation. I don't understand how it "doesn't work" with my notion of the Bit. Everything you say still seems to play-nice with the Bit. Cliff Posted by: Cliff Click | Feb 25, 2008 9:10:23 AM I live in San Jose, and will be in Seattle next week for ASPLOS & MSPC. You get around either of those places? We can talk through the issues much faster face to face, and it would be nice to meet you. Thanks, Posted by: Cliff Click | Feb 25, 2008 9:14:33 AM okay. good job thinking it thru. I'm just throwing jabs at you so you'll do a better job at your presentation. I know you like being quick and glib, but it's easy for the "IWannaBit" title to make it seem like you're just a typical sw guy who hasn't thought thru all the issues. If you can show you've thought thru how it maps onto what hw guys are doing in the future, and why it makes sense in comparison to alternative approaches, then you'll be able to sell to the hw guys better. It's all about showing that the prior solutions were overconstrained because no sw guy said "hey we'll support this half of the problem if you do this" It's all about showing how you've recognized a new way to match requirements and implementation possibilities. My experience is that it's not just about "new ideas" but showing how a new idea encompasses all prior knowledge, recognizes what hw latitude wasn't allowed before, but should have been. Advances in hw depend on sw transitions. Being able to articulate simple sw transitions that need simple hw support is hugely important. The best computer system is always one that has "just enough" stuff and no more. The hw guys are stuck in thinking sw guys can't adapt. So you're doing the right thing in taking the lead in exploring how things can be. Posted by: bart | Feb 25, 2008 10:59:41 PM Post a comment |


