|
A Non-Blocking HashTable I've been wrestling with concurrent algorithms again. This time, it's a Non-Blocking (concurrent, lock-free) HashTable. I've had it basically figured out for a few months now and I'm slowing widening the circle of people I expose it too. This HashTable seems too good to be true, which is why I'm trying hard to vet it's correctness before claiming victory. It's single-threaded performance appears to equal java.util.HashTable, and at high CPU count (750 cpus on a high-end Azul machine) it's about twice as fast as Doug Lea's marvelous java.util.concurrent.ConcurrentHashMap - with 4096-way striping. This is for very high read-to-write ratios (99% reads, 1% table mods). For higher write ratios the Non-Blocking table scales much better than ConcurrentHashMap. Of course, if you lower the 4096-way striping to the default 16-way, then the Non-Blocking table becomes easily 100x faster. Performance summary:
Ok, how do I do it? By writing a State-Machine based algorithm, instead of the usual concurrent programming style. From this idea I get a simple state machine, a short Java implementation of it, and a very fast very scalable algorithm. First the easy stuff: I assume you have written hash tables in the past (nearly every non-Java programmer does at one point or another; the Java guys get it easy). A hash table is a collection of {key,value} pairs with fast random access (first access is via the hash function). I've chosen a power-of-2 closed table - collisions cause re-probing into the table (contrast this to an open table where collisions turn into linked-list traversals). I chose power-of-2 instead of a prime number because 'AND' is faster than 'MOD' by an amount larger than an L1-miss-L2-hit. I re-probe by adding 1 and AND'ing to the table size (better cache behavior for repeated re-probes). Other design decisions make sense here, depending on your exact usage pattern. Now the fun stuff: How do I handle the {key,value} pairs? This is where the state-machine kicks in. I define states for each interesting key & value. For keys the states are {null,K} - where K is any key. A key-slot makes a 1-time transition from null to some K. For values the states are {null,V/T} - where V is any value and T is a Tombstone, a special token that is not any valid value and represents a deleted key. A value-slot makes a 1-time transition from null to some V, and can waffle about being different V's or T (deleted) according to whatever was the last table modification. This gives me 2 key-states and 3 value-states, so the {key,value} pair has 6 states.
Now here's the cool thing: it doesn't matter what order I read or write these bits within each pair, I always get some sane state. I can act on that state. This means I do not need to reason about ordering when either inserting or looking up data in the table! This also means I do not need any memory fencing or any locking! I do need CAS when changing the table (but not double-CAS, since I can read out-of-order I can write out-of-order as well - hence I do not need to atomically update both words). Here, then, is the 'get(Object key)' code: Object get( Object K ) {
I've left out some details (e.g. using the hash to avoid doing some key-compares that are doomed to fail), but this is the gist of it. 'get' is extremely lightweight - just the basic hash, lookup and return value for a hit. More next time on how I do 'put()' as that is where the state machine gets a real workout. But this post is already too long. Cliff PS - Slides will be online soon, and the source code as well (need to get the license figured out) Category: Web/Tech | | TrackBack (1) TrackBackTrackBack URL for this entry: Listed below are links to weblogs that reference A Non-Blocking HashTable:
» I wish I had a C-based lock-free hash table... from Kyle Tracked on Sep 20, 2007 2:21:09 PM CommentsIf I understand it correctly, under this arrangement there are no guarantees for visibility of changes: from thread A one can add K1,V1 then K2,V2, and see only K2,V2 from thread B. Is that right? Posted by: Gabor Melis | Apr 2, 2007 3:36:22 AM Have you read Chris Purcell's work? (link follows, report is on 'An Open Addressed Lock-free Hashtable', it made up most of his thesis) http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-639.html Posted by: Henry | Apr 2, 2007 6:30:19 AM Yes, I've read Chris Purcell's work. My work also touches only 1 cache line if there are no collisions (and no saved hash or written in C; 2 if hashes are memoized - no arrays-of-structs; a weakness of Java). For multiple collisions, my work turns into a linear scan (very cache efficient). Although my work requires a resize, that resize is fully concurrent, incremental & parallelizable. His TR includes substantially more experimentation than mine (suitable for peer-reviewed publication). This just represents the early stage of development my work is at. Posted by: Cliff Click | Apr 2, 2007 8:06:31 AM > If I understand it correctly, under this arrangement there are no guarantees for visibility of changes: from thread A one can add K1,V1 then K2,V2, and see only K2,V2 from thread B. Is that right? Yup; threads A & B need to synchronize externally somehow to guarantee visibility. They can do this by, e.g., thread A writes to a volatile in V2; when thread B sees V2 it can do a volatile read from V2 guaranteeing it now sees all of thread A's writes including K1,V1. If you are worried about the speed of visibility (e.g., one thread is rapidly inserting new info; the other thread is looking for new info to avoid work), in practice visibility of updates typically happens very quickly (nanosecond scale). Posted by: Cliff Click | Apr 2, 2007 8:17:56 AM Bob Lee reads his email if the content is interesting enough, and this certainly qualifies. Did you pass this by him? By the way, can you do a search/replace on the entire article and replace HashTable, a class that doesn't even exist, with HashMap? :-P just a minor nit. (Hashtable, lower case t, is a class, but it's the crappy version of HashMap). Posted by: Reinier Zwitserloot | Apr 2, 2007 9:20:03 AM Crazy Bob? You got an email address for him? I can't find one on crazybob.org. Two comments about HashTable: Thanks, Posted by: Cliff Click | Apr 2, 2007 9:30:03 AM > it made up most of his thesis Well, a little under half. > My work allows for re-inserting of a prior key, Chris's does not. Indeed -- not in published work. Posted by: Chris Purcell | Apr 2, 2007 10:59:48 AM >> My work allows for re-inserting of a prior key, Chris's does not. Can't say my work here is published either. ;-) Posted by: Cliff Click | Apr 2, 2007 12:03:15 PM I found this an interesting paper: "Lock-free dynamic hash tables with open addressing", which can be found here: http://www.cs.rug.nl/~wim/mechver/ -- Posted by: bakotaco | Apr 2, 2007 12:40:24 PM I read the paper briefly. There are a lot of parallels. My solution is also "Almost Wait-Free". From the paper: "Strictly speaking, our algorithm is lock-free. However, we call our algorithm almost wait-free since wait-freedom is only violated when a hashtable is resized, which is a relatively rare operation.". There are clearly some implementation differences, and I'll claim that my use of a state machine makes understanding vastly easier. Or at least I couldn't follow all that was going on in a 1/2hr read or so. I believe some of the complexity comes from avoiding GC - i.e. exact usage tracking to allow deallocate of old tables. Certainly there are scaling issues as well: e.g. they propose atomic inc/dec on a shared counter (to track going in and out of the hashtable code); this does not scale past 100 cpus or so. It's hard to tell how the performance varies; my version shows 1000x better performance but on presumably newer hardware. Posted by: Cliff Click | Apr 2, 2007 2:52:13 PM Hashtable IS a crappy implementation, for the following reasons: 1. Collections.synchronizedMap(new HashMap()); does the same thing if you really need that extra functionality. I have Bob Lee's mail address. I'll mail it to you. Posted by: Reinier Zwitserloot | Apr 3, 2007 10:00:55 AM Click, This is awesome, I watched the google techtalk and was blown away. I would just love to implement this in C or C++ - unless you have such an implementation going open source soon, of course :-) Posted by: Tim | May 12, 2007 4:21:06 AM No C++ implementations anytime soon. Posted by: Cliff Click | May 12, 2007 12:12:03 PM Tim, I am working on a lock-free hash table with built-in ref counting and guaranteed-order iteration for C. It's open source (MIT style license) and you are welcome to join the effort. I use an entirely different model than Cliff does though. There are two implementations, one which isn't resizable (preferable for the type of application I needed it for) and another one which is (but isn't part of the snapshot yet because I need to refactor it to match API changes made to the first one). The resizable one only resizes those buckets which show heavy collisions. This is possible because I use hash tables for the buckets too in this case. A detailed description is at the URL and I believe it has some interesting concepts too. Note that the snapshot is now out of sync with the description as I am just going through a major change of the code. To Cliff I would like to say: Thanks for the inspiration, you made me realise that I could aim for more than I intended to with my library (dynamic resizing with minimal impact in this case). Keep up the good work and BTW, when I am done, I would love to see somebody test drive my code on one of those 768 CPU systems of yours ;-) regards Posted by: benjk | Jul 20, 2007 11:50:52 AM How are you handling a lack-of-memory-model for C? On different C compilers & different platforms, you don't have any guarantees about inter-process shared-memory communication. And yes, I can test-drive your code for you on one of our boxes. Posted by: Cliff Click | Jul 20, 2007 12:04:27 PM The library doesn't use any global variables. All data is provided by and returned to the caller. Concurrent threads can share that data, but I don't think concurrent processes could. Memory allocators could become a bottleneck, unless of course a lock-free allocator is used. I provide hooks to plug in other allocators and I might just try out some of the lock-free allocators that have popped up lately. As for compilers and platforms, well, I have to start somewhere. At this point I am working on ppc and x86 and I only use gcc. For now that will have to be my focus. Nevertheless, two friends are interested to make it work on Sparc. Beyond that, I will have to wait and see if there are any joiners willing to test on and if need be port to other architectures. It's open source after all ;) But perhaps I should explain the motivation for doing this. At least in the short term my motivation is a little different than yours. I am doing open source telephony, in particluar softswitches. Eventually, scalability on oodles of CPUs will be of interest there too, but right now the code we have (forked from Asterisk) resembles a house where you spend most of your time locking and unlocking doors and closets, regularly lose the keys and frequently fall into trap doors. It all comes down to not having any proper data repository. The library I am working on is meant to provide such a repository. That's why its got built-in ref counting and why I wanted to get rid of as many locks as possible. If we can manage to get 10 times the number of concurrent calls on a single or dual processor box, that will already be a big success. If we can evolve it to scale up further on n>2 processors, that will be icing on the cake. In any event, Rome wasn't built in one day, so I will be taking this one step at a time and see where it leads. Thanks for the offer to test drive, I will contact you about that again when I am ready. regards Posted by: benjk | Jul 21, 2007 3:47:43 PM > The library doesn't use any global variables. Mine doesn't either!
These sentences conflict: "all data is provided by the caller" and "concurrent threads share data". I think you mean the 2nd senctence - sharing is between threads in the same address space, not between processes.
Advantage of using a multi-threaded GC'd language: the allocator already scales to allow concurrent allocation across all the CPUs in the box! :-)
This is what I mean: for each rev of the compiler, chip, motherboard, cpu, etc, the rules for sharing will change. This makes it really hard to make *portable* multi-threaded code with fine-grained sharing.
I think ref-counting is a mistake. It requires an atomic read/modify/write cycle, which in turn limits the number of cpus that can simultaneously touch the same data. No Big Hardware supports concurrent r/m/w; they support any number of concurrent readers, and concurrent writers to INDEPENDENT data. Needing a r/m/w to read a piece of data will limit your concurrent readers. Also, be sure the count is split from the data; a classic ref-counting mistake is to put the count in the same memory chunk it is guarding. --- Have you pondered targeting Azul for your telephony app? Not sure it makes sense in an open-source environment, but our hardware does have a lot of features that otherwise make sense: very low pause times, very high throughput, etc. Cliff Posted by: Cliff Click | Jul 22, 2007 9:40:25 AM I never assumed your implementation would use global vars, but since I use C and "popular" C coding practises often involve some - shall we say - bad habits, I thought I better explain ;-) And yes, I intended data to be shared between threads in the same address space, not processes. Make no mistake, I am no fan of C. However, the code was originally forked from a project which uses C only and that is what most of the contributors are most familiar with. If it had been written in a language which already provided managed memory and hashed storage, I don't think I would have bothered starting this library. As for gc, it is generally acknowledged that it is not always suitable for real-time applications. Some of the telephony protocols, have extremely tight specs which are already a challenge as it stands. For example, SS7 requires guaranteed response times of 20ms. If a node goes down, you have to transfer all links to an alternative node and still serve all outstanding requests in the correct order within those 20ms. If a gc cycle kicks in during such a transition you may not be able to satisfy the spec and your system will then fail acceptance/approval. I understand the issues around ref count updating. Its a necessary evil I have to deal with. As I said, in the short term the aim is to get good mileage on 1, 2 and 4 core/cpu systems. Beyond that, I will have to see what the impact is and then deal with it at that point in time. But anyway, thanks for the input, I really appreciate it. As for targeting Azul with our telephony server, that would certainly be exciting. Let's talk about that offline. regds Posted by: benjk | Jul 22, 2007 5:44:11 PM > I never assumed your implementation would use global vars, but since I use C and "popular" C coding practises often involve some - shall we say - bad habits, I thought I better explain Ahh... makes sense in retrospect.
Just an FYI - IBM's Metronome project is about hard real-time GC. Last I heard they are getting guaranteed max pause times about 10 MICROseconds (but with a 20-40% GC load factor). They require interesting proofs about the app's max allocation rate & max live set, which is generally possible for a fixed app set like a telephony app. Cliff Posted by: Cliff Click | Jul 23, 2007 9:18:22 AM I was about to e-mail you about the race condition existing between the key and val assignments, but your Part 2 post addressed it: "Note that Key slots, once set, can never be UNset - this is required for correctness, lest Thread A tests for K1 in a slot, then Thread B deletes it & inserts K2/V2, and finally Thread A gets around to reading the value slot and reads V2 - and not the now replaced V1." Nice to see someone else with such a solid grasp on the subject! Posted by: Cas | Aug 1, 2007 12:07:17 AM Cliff, How does the above scenario compare with To make my point more clear imagine an application which keeps an awful lot of data in memory in one of those concurrent maps. Posted by: lorenzo | Sep 28, 2007 8:00:17 AM Performance is as good or better than CHM under nearly all loads (better with more CPUs or more writers). The odd updater won't slow down the readers at all (nor will they slow down for CHM, unless you have dozens of CPUs writing). As with CHM, iterators are only guaranteed to see Keys which exist in the table for the whole duration of iteration. If you add or remove a Key during iteration it might be skipped. Both CHM and NBHM are ideal for the scenario you're painting. NBHM just works better for higher CPU counts and with more writers (but both will do fine for the dozen writers). NBHM continues to give great throughput with hundreds of concurrent writing threads (and many hundreds of concurrent readers). Cliff Posted by: Cliff Click | Sep 28, 2007 8:09:20 AM If you don't mind me hijaking your blog towards a slightly different conversation I will ask you more. The application I am describing wants to use 1000s of threads in parallel to constantly analyze a large set of data: I am thinking of running 1000 of threads all rescanning the data in parallel map.keySet().iterator() I am of course aware that the data might change as I am iterating through but for that let's say I can cater or at worse My point is that a CHM or possibly your hashmap can help me out dealing with a large volume of data which I can not distribute a priori amongst threads/processes. Too often I hear about the ideal scenario where parallel tasks work with their own data.... Thanks
Posted by: lorenzo puccetti | Sep 28, 2007 9:37:50 AM Sounds like an ideal job for an Azul box! ;-) Cliff Posted by: Cliff Click | Sep 28, 2007 10:13:48 AM Any chance of getting hold of that source? We have a number of use cases coming up, some somewhat pressing - the asymmetry of Doug Lea's CHM on a lookup miss is quite a killer. Some kind of BSD/Apache-style licence would be really super :) Cheers Posted by: Antranig Basman | Oct 3, 2007 7:41:10 PM
Next
»
Post a comment |


