Changeset 991d2d8 in mainline


Timestamp:
2012-11-15T23:52:18Z (11 years ago)
Author:
Adam Hraska <adam.hraska+hos@…>
Branches:
lfn, master, serial, ticket/834-toolchain-update, topic/msim-upgrade, topic/simplify-dev-export
Children:
4f5e1c7c
Parents:
dcb0751
Message:

cht: Added a higher-level description of the resizing algorithm.

File:
1 edited

  • kernel/generic/src/adt/cht.c

    rdcb0751 r991d2d8  
    3535 * @file
    3636 * @brief Scalable resizable concurrent lock-free hash table.
     37 *
     38 * CHT is a concurrent hash table that is scalable, resizable, and lock-free.
     39 * resizable = the number of buckets of the table increases or decreases
     40 *     depending on the average number of elements per bucket (ie load)
     41 * scalable = accessing the table from more cpus increases performance
     42 *     almost linearly
     43 * lock-free = common operations never block; even if any of the operations
     44 *     is preempted or interrupted at any time, other operations will still
     45 *     make forward progress
    3746 *
    38  */
     47 * CHT is designed for read-mostly scenarios. Performance degrades as the
     48 * fraction of updates (insert/remove) increases. Other data structures
     49 * significantly outperform CHT if the fraction of updates exceeds ~40%.
     50 *
     51 * CHT tolerates hardware exceptions and may be accessed from exception
     52 * handlers as long as the underlying RCU implementation is exception safe.
     53 *
     54 * @par Caveats
     55 *
     56 * 0) Never assume an item is still in the table.
     57 * The table may be accessed concurrently; therefore, other threads may
     58 * insert or remove an item at any time. Do not assume an item is still
     59 * in the table if cht_find() just returned it to you. Similarly, an
     60 * item may have already been inserted by the time cht_find() returns NULL.
     61 *
     62 * 1) Always use RCU read locks when searching the table.
     63 * Holding an RCU lock guarantees that an item found in the table remains
     64 * valid (eg is not freed) even if the item was removed from the table
     65 * in the meantime by another thread.
     66 *
     67 * 2) Never update values in place.
     68 * Do not update items in the table in place, ie directly. The changes
     69 * will not propagate to other readers (on other cpus) immediately or even
     70 * correctly. Some readers may then encounter items that have only some
     71 * of their fields changed or are completely inconsistent.
     72 *
     73 * Instead consider inserting an updated/changed copy of the item and
     74 * removing the original item. Or contact the maintainer to provide
     75 * you with a function that atomically replaces an item with a copy.
     76 *
     77 * 3) Use cht_insert_unique() instead of checking for duplicates with cht_find()
     78 * The following code is prone to race conditions:
     79 * @code
     80 * if (NULL == cht_find(&h, key)) {
     81 *     // If another thread inserts an item here, we'll insert a duplicate.
     82 *     cht_insert(&h, item);
     83 * }
     84 * @endcode
     85 * See cht_insert_unique() for how to correctly fix this.
     86 *
     87 *
     88 * @par Semantics
     89 *
     90 * Lazy readers = cht_find_lazy(), cht_find_next_lazy()
     91 * Readers = lazy readers, cht_find(), cht_find_next()
     92 * Updates = cht_insert(), cht_insert_unique(), cht_remove_key(),
     93 *     cht_remove_item()
     94 *
     95 * Readers (but not lazy readers) are guaranteed to see the effects
     96 * of @e completed updates. In other words, if cht_find() is invoked
     97 * after a cht_insert() @e returned eg on another cpu, cht_find() is
     98 * guaranteed to see the inserted item.
     99 *
     100 * Similarly, updates see the effects of @e completed updates. For example,
     101 * issuing cht_remove() after a cht_insert() for that key returned (even
     102 * on another cpu) is guaranteed to remove the inserted item.
     103 *
     104 * Reading or updating the table concurrently with other updates
     105 * always returns consistent data and never corrupts the table.
     106 * However the effects of concurrent updates may or may not be
     107 * visible to all other concurrent readers or updaters. Eg, not
     108 * all readers may see that an item has already been inserted
     109 * if cht_insert() has not yet returned.
     110 *
     111 * Lazy readers are guaranteed to eventually see updates but it
     112 * may take some time (possibly milliseconds) after the update
     113 * completes for the change to propagate to lazy readers on all
     114 * cpus.
     115 *
     116 * @par Implementation
     117 *
     118 * Collisions in CHT are resolved with chaining. The number of buckets
     119 * is always a power of 2. Each bucket is represented with a singly linked
     120 * lock-free list [1]. Items in buckets are sorted by their mixed hashes
     121 * in ascending order. All buckets are terminated with a single global
     122 * sentinel node whose mixed hash value is the greatest possible.
     123 *
     124 * CHT with 2^k buckets uses the k most significant bits of a hash value
     125 * to determine the bucket number where an item is to be stored. To
     126 * avoid storing all items in a single bucket if the user supplied
     127 * hash function does not produce uniform hashes, hash values are
     128 * mixed first so that the top bits of a mixed hash change even if hash
     129 * values differ only in the least significant bits. The mixed hash
     130 * values are cached in cht_link.hash (which is overwritten once the
     131 * item is scheduled for removal via rcu_call).
     132 *
     133 * A new item is inserted before all other existing items in the bucket
     134 * with the same hash value as the newly inserted item (a la the original
     135 * lock-free list [2]). Placing new items at the start of a same-hash
     136 * sequence of items (eg duplicates) allows us to easily check for duplicates
     137 * in cht_insert_unique(). The function can first check that there are
     138 * no duplicates of the newly inserted item amongst the items with the
     139 * same hash as the new item. If there were no duplicates the new item
     140 * is linked before the same-hash items. Inserting a duplicate while
     141 * the function is checking for duplicates is detected as a change of
     142 * the link to the first checked same-hash item (and the search for
     143 * duplicates can be restarted).
     144 *
     145 * @par Table resize algorithm
     146 *
     147 * Table resize is based on [3] and [5]. First, a new bucket head array
     148 * is allocated and initialized. Second, old bucket heads are moved
     149 * to the new bucket head array with the protocol mentioned in [5].
     150 * At this point updaters start using the new bucket heads. Third,
     151 * buckets are split (or joined) so that the table can make use of
     152 * the extra bucket head slots in the new array (or stop wasting space
     153 * with the unnecessary extra slots in the old array). Splitting
     154 * or joining buckets employs a custom protocol. Last, the new array
     155 * replaces the original bucket array.
     156 *
     157 * A single background work item (of the system work queue) guides
     158 * resizing of the table. If an updater detects that the bucket it
     159 * is about to access is undergoing a resize (ie its head is moving
     160 * or it needs to be split/joined), it helps out and completes the
     161 * head move or the bucket split/join.
     162 *
     163 * The table always grows or shrinks by a factor of 2. Because items
     164 * are assigned a bucket based on the top k bits of their mixed hash
     165 * values, when growing the table each bucket is split into two buckets
     166 * and all items of the two new buckets come from the single bucket in the
     167 * original table. Ie items from separate buckets in the original table
     168 * never intermix in the new buckets. Moreover,
     169 * since the buckets are sorted by their mixed hash values, the items
     170 * at the beginning of the old bucket will end up in the first new
     171 * bucket while all the remaining items of the old bucket will end up
     172 * in the second new bucket. Therefore, there is a single point where
     173 * to split the linked list of the old bucket into two correctly sorted
     174 * linked lists of the new buckets:
     175 *                            .- bucket split
     176 *                            |
     177 *             <-- first -->  v  <-- second -->
     178 *   [old] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
     179 *              ^                 ^   
     180 *   [new0] -- -+                 | 
     181 *   [new1] -- -- -- -- -- -- -- -+
     182 *
     183 * Resize in greater detail:
     184 *
     185 * a) First, a resizer (a single background system work queue item
     186 * in charge of resizing the table) allocates and initializes a new
     187 * bucket head array. New bucket heads are pointed to the sentinel
     188 * and marked Invalid (in the lower order bits of the pointer to the
     189 * next item, ie the sentinel in this case):
     190 *
     191 *   [old, N] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
     192 *                                                    ^ ^
     193 *   [new0, Inv] -------------------------------------+ |
     194 *   [new1, Inv] ---------------------------------------+
     195 *
     196 *
     197 * b) Second, the resizer starts moving old bucket heads with the following
     198 * lock-free protocol (from [5]) where cas(variable, expected_val, new_val)
     199 * is short for compare-and-swap:
     200 *
     201 *   old head     new0 head      transition to next state
     202 *   --------     ---------      ------------------------
     203 *   addr, N      sentinel, Inv  cas(old, (addr, N), (addr, Const))
     204 *                               .. mark the old head as immutable, so that
     205 *                                  updaters do not relink it to other nodes
     206 *                                  until the head move is done.
     207 *   addr, Const  sentinel, Inv  cas(new0, (sentinel, Inv), (addr, N))
     208 *                               .. move the address to the new head and mark
     209 *                                  the new head normal so updaters can start
     210 *                                  using it.
     211 *   addr, Const  addr, N        cas(old, (addr, Const), (addr, Inv))
     212 *                               .. mark the old head Invalid to signify
     213 *                                  the head move is done.
     214 *   addr, Inv    addr, N
     215 *
     216 * Notice that concurrent updaters may step in at any point and correctly
     217 * complete the head move without disrupting the resizer. At worst, the
     218 * resizer or other concurrent updaters will attempt a number of CAS()
     219 * operations that will correctly fail.
     220 *
     221 *   [old, Inv] -> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
     222 *                 ^                                   ^
     223 *   [new0, N] ----+                                   |
     224 *   [new1, Inv] --------------------------------------+
     225 *
     226 * 
     227 * c) Third, buckets are split if the table is growing; or joined if
     228 * shrinking (by the resizer or updaters depending on whoever accesses
     229 * the bucket first). See split_bucket() and join_buckets() for details.
     230 *
     231 *  1) Mark the last item of new0 with JOIN_FOLLOWS:
     232 *   [old, Inv] -> [00b] -> [01b, JF] -> [10b] -> [11b] -> sentinel
     233 *                 ^                                       ^
     234 *   [new0, N] ----+                                       |
     235 *   [new1, Inv] ------------------------------------------+
     236 *
     237 *  2) Mark the first item of new1 with JOIN_NODE:
     238 *   [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
     239 *                 ^                                           ^
     240 *   [new0, N] ----+                                           |
     241 *   [new1, Inv] ----------------------------------------------+
     242 *
     243 *  3) Point new1 to the join-node and mark new1 NORMAL.
     244 *   [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
     245 *                 ^                     ^
     246 *   [new0, N] ----+                     |
     247 *   [new1, N] --------------------------+
     248 *
     249 *
     250 * d) Fourth, the resizer cleans up extra marks added during bucket
     251 * splits/joins but only when it is sure all updaters are accessing
     252 * the table via the new bucket heads only (ie it is certain there
     253 * are no delayed updaters unaware of the resize and accessing the
     254 * table via the old bucket head).
     255 *
     256 *   [old, Inv] ---+
     257 *                 v
     258 *   [new0, N] --> [00b] -> [01b, N] ---+
     259 *                                      v
     260 *   [new1, N] --> [10b, N] -> [11b] -> sentinel
     261 *
     262 *
     263 * e) Last, the resizer publishes the new bucket head array for everyone
     264 * to see and use. This signals the end of the resize and the old bucket
     265 * array is freed.
     266 *
     267 *
     268 * To understand details of how the table is resized, read [1, 3, 5]
     269 * and comments in join_buckets(), split_bucket().
     270 * 
     271 *
     272 * [1] High performance dynamic lock-free hash tables and list-based sets,
     273 *     Michael, 2002
     274 *     http://www.research.ibm.com/people/m/michael/spaa-2002.pdf
     275 * [2] Lock-free linked lists using compare-and-swap,
     276 *     Valois, 1995
     277 *     http://people.csail.mit.edu/bushl2/rpi/portfolio/lockfree-grape/documents/lock-free-linked-lists.pdf
     278 * [3] Resizable, scalable, concurrent hash tables via relativistic programming,
     279 *     Triplett, 2011
     280 *     http://www.usenix.org/event/atc11/tech/final_files/Triplett.pdf
     281 * [4] Split-ordered Lists: Lock-free Extensible Hash Tables,
     282 *     Shavit, 2006
     283 *     http://www.cs.ucf.edu/~dcm/Teaching/COT4810-Spring2011/Literature/SplitOrderedLists.pdf
     284 * [5] Towards a Scalable Non-blocking Coding Style,
     285 *     Click, 2008
     286 *     http://www.azulsystems.com/events/javaone_2008/2008_CodingNonBlock.pdf
     287 */
     288
    39289
    40290#include <adt/cht.h>
     
    645895 * table.
    646896 *
    647  * The following is NOT thread-safe, so do not use:
    648  * \code
     897 * The following is @e NOT thread-safe, so do not use:
     898 * @code
    649899 * if (!cht_find(h, key)) {
    650900 *     // A concurrent insert here may go unnoticed by cht_find() above.
     
    653903 *     // Now we may have two items with equal search keys.
    654904 * }
    655  * \endcode
     905 * @endcode
    656906 *
    657907 * Replace such code with:
    658  * \code
     908 * @code
    659909 * item = malloc(..);
    660910 * if (!cht_insert_unique(h, item)) {
     
    665915 *     // there are no other equal items.
    666916 * }
    667  * \endcode
     917 * @endcode
    668918 *
    669919 */
     
    15331783         *  [dest_head | Inv]
    15341784         *
    1535          * 2)                             ,-- split_hash
     1785         * 3)                             ,-- split_hash
    15361786         *                                v 
    15371787         *  [src_head | N] -> .. -> [JF] -> [JN]