source: mainline/kernel/generic/src/adt/cht.c@ 314f4b59

1/*
2 * Copyright (c) 2012 Adam Hraska
3 * All rights reserved.
4 *
5 * Redistribution and use in source and binary forms, with or without
6 * modification, are permitted provided that the following conditions
7 * are met:
8 *
9 * - Redistributions of source code must retain the above copyright
10 * notice, this list of conditions and the following disclaimer.
11 * - Redistributions in binary form must reproduce the above copyright
12 * notice, this list of conditions and the following disclaimer in the
13 * documentation and/or other materials provided with the distribution.
14 * - The name of the author may not be used to endorse or promote products
15 * derived from this software without specific prior written permission.
16 *
17 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
18 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
19 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
20 * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
21 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
22 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
23 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
24 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
25 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
26 * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
27 */
28
29
30/** @addtogroup genericadt
31 * @{
32 */
33
34/**
35 * @file
 * @brief Scalable, resizable, concurrent, lock-free hash table.
 *
 * CHT is a concurrent hash table that is scalable, resizable, and lock-free.
 * resizable = the number of buckets of the table increases or decreases
 *             depending on the average number of elements per bucket (ie load)
 * scalable = accessing the table from more cpus increases performance
 *            almost linearly
 * lock-free = common operations never block; even if any of the operations
 *             is preempted or interrupted at any time, other operations will still
 *             make forward progress
46 *
 * CHT is designed for read-mostly scenarios. Performance degrades as the
48 * fraction of updates (insert/remove) increases. Other data structures
49 * significantly outperform CHT if the fraction of updates exceeds ~40%.
50 *
51 * CHT tolerates hardware exceptions and may be accessed from exception
52 * handlers as long as the underlying RCU implementation is exception safe.
53 *
54 * @par Caveats
55 *
56 * 0) Never assume an item is still in the table.
57 * The table may be accessed concurrently; therefore, other threads may
58 * insert or remove an item at any time. Do not assume an item is still
59 * in the table if cht_find() just returned it to you. Similarly, an
60 * item may have already been inserted by the time cht_find() returns NULL.
61 *
62 * 1) Always use RCU read locks when searching the table.
63 * Holding an RCU lock guarantees that an item found in the table remains
64 * valid (eg is not freed) even if the item was removed from the table
65 * in the meantime by another thread.
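 *
 * For illustration, a lookup that follows this rule may be sketched as
 * follows (h is a cht_t; key is whatever cht_ops_t.key_hash()/.key_equal()
 * expect):
 * @code
 * rcu_read_lock();
 *
 * cht_link_t *link = cht_find(&h, key);
 * if (link) {
 *     // Safe to use the item containing the link until rcu_read_unlock(),
 *     // even if another cpu removes it from the table in the meantime.
 * }
 *
 * rcu_read_unlock();
 * @endcode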
66 *
67 * 2) Never update values in place.
68 * Do not update items in the table in place, ie directly. The changes
69 * will not propagate to other readers (on other cpus) immediately or even
70 * correctly. Some readers may then encounter items that have only some
71 * of their fields changed or are completely inconsistent.
72 *
73 * Instead consider inserting an updated/changed copy of the item and
74 * removing the original item. Or contact the maintainer to provide
75 * you with a function that atomically replaces an item with a copy.
76 *
 * 3) Use cht_insert_unique() instead of checking for duplicates with cht_find().
 * The following code is prone to race conditions:
 * @code
 * if (NULL == cht_find(&h, key)) {
 *     // If another thread inserts an item here, we'll insert a duplicate.
 *     cht_insert(&h, item);
 * }
 * @endcode
85 * See cht_insert_unique() on how to correctly fix this.
86 *
87 *
88 * @par Semantics
89 *
90 * Lazy readers = cht_find_lazy(), cht_find_next_lazy()
91 * Readers = lazy readers, cht_find(), cht_find_next()
92 * Updates = cht_insert(), cht_insert_unique(), cht_remove_key(),
93 * cht_remove_item()
94 *
95 * Readers (but not lazy readers) are guaranteed to see the effects
96 * of @e completed updates. In other words, if cht_find() is invoked
97 * after a cht_insert() @e returned eg on another cpu, cht_find() is
98 * guaranteed to see the inserted item.
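 *
 * For example (a sketch; item is a user structure with an embedded
 * cht_link_t member named link, and cpu2 only starts after cht_insert()
 * on cpu1 has returned):
 * @code
 * // cpu1
 * cht_insert(&h, &item->link);
 *
 * // cpu2, afterwards
 * rcu_read_lock();
 * cht_link_t *found = cht_find(&h, key);  // sees cpu1's item (unless removed)
 * rcu_read_unlock();
 * @endcode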
99 *
100 * Similarly, updates see the effects of @e completed updates. For example,
101 * issuing cht_remove() after a cht_insert() for that key returned (even
102 * on another cpu) is guaranteed to remove the inserted item.
103 *
104 * Reading or updating the table concurrently with other updates
105 * always returns consistent data and never corrupts the table.
106 * However the effects of concurrent updates may or may not be
107 * visible to all other concurrent readers or updaters. Eg, not
108 * all readers may see that an item has already been inserted
109 * if cht_insert() has not yet returned.
110 *
111 * Lazy readers are guaranteed to eventually see updates but it
112 * may take some time (possibly milliseconds) after the update
113 * completes for the change to propagate to lazy readers on all
114 * cpus.
115 *
116 * @par Implementation
117 *
118 * Collisions in CHT are resolved with chaining. The number of buckets
 * is always a power of 2. Each bucket is represented with a singly linked
120 * lock-free list [1]. Items in buckets are sorted by their mixed hashes
121 * in ascending order. All buckets are terminated with a single global
122 * sentinel node whose mixed hash value is the greatest possible.
123 *
124 * CHT with 2^k buckets uses the k most significant bits of a hash value
125 * to determine the bucket number where an item is to be stored. To
126 * avoid storing all items in a single bucket if the user supplied
127 * hash function does not produce uniform hashes, hash values are
128 * mixed first so that the top bits of a mixed hash change even if hash
129 * values differ only in the least significant bits. The mixed hash
130 * values are cached in cht_link.hash (which is overwritten once the
131 * item is scheduled for removal via rcu_call).
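 *
 * For example, with 2^order buckets the bucket index of an item is derived
 * from the top order bits of its mixed hash, roughly as follows (a sketch
 * of what calc_bucket_idx() below is expected to do):
 * @code
 * size_t bucket_idx = mixed_hash >> (CHT_MAX_ORDER - order);
 * @endcode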
132 *
133 * A new item is inserted before all other existing items in the bucket
134 * with the same hash value as the newly inserted item (a la the original
135 * lock-free list [2]). Placing new items at the start of a same-hash
136 * sequence of items (eg duplicates) allows us to easily check for duplicates
137 * in cht_insert_unique(). The function can first check that there are
138 * no duplicates of the newly inserted item amongst the items with the
139 * same hash as the new item. If there were no duplicates the new item
140 * is linked before the same-hash items. Inserting a duplicate while
141 * the function is checking for duplicates is detected as a change of
142 * the link to the first checked same-hash item (and the search for
143 * duplicates can be restarted).
144 *
145 * @par Table resize algorithm
146 *
147 * Table resize is based on [3] and [5]. First, a new bucket head array
148 * is allocated and initialized. Second, old bucket heads are moved
149 * to the new bucket head array with the protocol mentioned in [5].
150 * At this point updaters start using the new bucket heads. Third,
151 * buckets are split (or joined) so that the table can make use of
152 * the extra bucket head slots in the new array (or stop wasting space
153 * with the unnecessary extra slots in the old array). Splitting
154 * or joining buckets employs a custom protocol. Last, the new array
155 * replaces the original bucket array.
156 *
157 * A single background work item (of the system work queue) guides
158 * resizing of the table. If an updater detects that the bucket it
159 * is about to access is undergoing a resize (ie its head is moving
160 * or it needs to be split/joined), it helps out and completes the
161 * head move or the bucket split/join.
162 *
163 * The table always grows or shrinks by a factor of 2. Because items
164 * are assigned a bucket based on the top k bits of their mixed hash
165 * values, when growing the table each bucket is split into two buckets
166 * and all items of the two new buckets come from the single bucket in the
167 * original table. Ie items from separate buckets in the original table
 * never intermix in the new buckets. Moreover,
 * since the buckets are sorted by their mixed hash values, the items
 * at the beginning of the old bucket will end up in the first new
 * bucket while all the remaining items of the old bucket will end up
 * in the second new bucket. Therefore, there is a single point at which
 * to split the linked list of the old bucket into two correctly sorted
 * linked lists of the new buckets:
 *
 *                            .- bucket split
 *                            |
 *            <-- first -->   v <-- second -->
 *  [old] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *             ^                 ^
 *  [new0] -- -+                 |
 *  [new1] -- -- -- -- -- -- -- -+
182 *
183 * Resize in greater detail:
184 *
185 * a) First, a resizer (a single background system work queue item
186 * in charge of resizing the table) allocates and initializes a new
187 * bucket head array. New bucket heads are pointed to the sentinel
188 * and marked Invalid (in the lower order bits of the pointer to the
189 * next item, ie the sentinel in this case):
190 *
 *  [old, N] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *                                                    ^ ^
 *  [new0, Inv] --------------------------------------+ |
 *  [new1, Inv] ----------------------------------------+
195 *
196 *
197 * b) Second, the resizer starts moving old bucket heads with the following
198 * lock-free protocol (from [5]) where cas(variable, expected_val, new_val)
199 * is short for compare-and-swap:
200 *
 *  old head      new0 head      transition to next state
 *  --------      ---------      ------------------------
 *  addr, N       sentinel, Inv  cas(old, (addr, N), (addr, Const))
 *                               .. mark the old head as immutable, so that
 *                                  updaters do not relink it to other nodes
 *                                  until the head move is done.
 *  addr, Const   sentinel, Inv  cas(new0, (sentinel, Inv), (addr, N))
 *                               .. move the address to the new head and mark
 *                                  the new head normal so updaters can start
 *                                  using it.
 *  addr, Const   addr, N        cas(old, (addr, Const), (addr, Inv))
 *                               .. mark the old head Invalid to signify
 *                                  the head move is done.
 *  addr, Inv     addr, N
215 *
216 * Notice that concurrent updaters may step in at any point and correctly
217 * complete the head move without disrupting the resizer. At worst, the
218 * resizer or other concurrent updaters will attempt a number of CAS() that
219 * will correctly fail.
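 *
 * For illustration, one pass of the protocol may be sketched in terms of
 * the helpers declared below (cas_link() CASes the address and the mark of
 * a head atomically; a failed CAS simply means someone else has already
 * completed that step):
 * @code
 * cht_link_t *next = get_next(*old_head);
 * // Freeze the old head so updaters stop relinking it.
 * cas_link(old_head, next, N_NORMAL, next, N_CONST);
 * // Publish the bucket's first item in the new head.
 * cas_link(new_head, &sentinel, N_INVALID, next, N_NORMAL);
 * // Mark the old head invalid - the head move is done.
 * cas_link(old_head, next, N_CONST, next, N_INVALID);
 * @endcode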
220 *
 *  [old, Inv] -> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *                 ^                                   ^
 *  [new0, N] ----+                                    |
 *  [new1, Inv] --------------------------------------+
225 *
226 *
227 * c) Third, buckets are split if the table is growing; or joined if
228 * shrinking (by the resizer or updaters depending on whoever accesses
229 * the bucket first). See split_bucket() and join_buckets() for details.
230 *
231 * 1) Mark the last item of new0 with JOIN_FOLLOWS:
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b] -> [11b] -> sentinel
 *                 ^                                       ^
 *  [new0, N] ----+                                        |
 *  [new1, Inv] ------------------------------------------+
236 *
237 * 2) Mark the first item of new1 with JOIN_NODE:
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
 *                 ^                                           ^
 *  [new0, N] ----+                                            |
 *  [new1, Inv] ----------------------------------------------+
242 *
243 * 3) Point new1 to the join-node and mark new1 NORMAL.
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
 *                 ^                      ^
 *  [new0, N] ----+                       |
 *  [new1, N] ---------------------------+
248 *
249 *
250 * d) Fourth, the resizer cleans up extra marks added during bucket
251 * splits/joins but only when it is sure all updaters are accessing
252 * the table via the new bucket heads only (ie it is certain there
253 * are no delayed updaters unaware of the resize and accessing the
254 * table via the old bucket head).
255 *
 *  [old, Inv] ---+
 *                v
 *  [new0, N] --> [00b] -> [01b, N] ---+
 *                                     v
 *  [new1, N] --> [10b, N] -> [11b] -> sentinel
261 *
262 *
263 * e) Last, the resizer publishes the new bucket head array for everyone
264 * to see and use. This signals the end of the resize and the old bucket
265 * array is freed.
266 *
267 *
268 * To understand details of how the table is resized, read [1, 3, 5]
269 * and comments in join_buckets(), split_bucket().
270 *
271 *
272 * [1] High performance dynamic lock-free hash tables and list-based sets,
273 * Michael, 2002
274 * http://www.research.ibm.com/people/m/michael/spaa-2002.pdf
275 * [2] Lock-free linked lists using compare-and-swap,
276 * Valois, 1995
277 * http://people.csail.mit.edu/bushl2/rpi/portfolio/lockfree-grape/documents/lock-free-linked-lists.pdf
278 * [3] Resizable, scalable, concurrent hash tables via relativistic programming,
279 * Triplett, 2011
280 * http://www.usenix.org/event/atc11/tech/final_files/Triplett.pdf
281 * [4] Split-ordered Lists: Lock-free Extensible Hash Tables,
282 * Shavit, 2006
283 * http://www.cs.ucf.edu/~dcm/Teaching/COT4810-Spring2011/Literature/SplitOrderedLists.pdf
284 * [5] Towards a Scalable Non-blocking Coding Style,
285 * Click, 2008
286 * http://www.azulsystems.com/events/javaone_2008/2008_CodingNonBlock.pdf
287 */
288
289
290#include <adt/cht.h>
291#include <adt/hash.h>
292#include <assert.h>
293#include <mm/slab.h>
294#include <arch/barrier.h>
295#include <compiler/barrier.h>
296#include <atomic.h>
297#include <synch/rcu.h>
298
299#ifdef CONFIG_DEBUG
300/* Do not enclose in parentheses. */
301#define DBG(x) x
302#else
303#define DBG(x)
304#endif
305
306/* Logarithm of the min bucket count. Must be at least 3. 2^6 == 64 buckets. */
307#define CHT_MIN_ORDER 6
308/* Logarithm of the max bucket count. */
309#define CHT_MAX_ORDER (8 * sizeof(size_t))
310/* Minimum number of hash table buckets. */
311#define CHT_MIN_BUCKET_CNT (1 << CHT_MIN_ORDER)
312/* Does not have to be a power of 2. */
313#define CHT_MAX_LOAD 2
314
315typedef cht_ptr_t marked_ptr_t;
316typedef bool (*equal_pred_t)(void *arg, const cht_link_t *item);
317
318/** The following mark items and bucket heads.
319 *
320 * They are stored in the two low order bits of the next item pointers.
321 * Some marks may be combined. Some marks share the same binary value and
322 * are distinguished only by context (eg bucket head vs an ordinary item),
323 * in particular by walk_mode_t.
324 */
325typedef enum mark {
326 /** Normal non-deleted item or a valid bucket head. */
327 N_NORMAL = 0,
328 /** Logically deleted item that might have already been unlinked.
329 *
330 * May be combined with N_JOIN and N_JOIN_FOLLOWS. Applicable only
331 * to items; never to bucket heads.
332 *
333 * Once marked deleted an item remains marked deleted.
334 */
335 N_DELETED = 1,
336 /** Immutable bucket head.
337 *
338 * The bucket is being moved or joined with another and its (old) head
339 * must not be modified.
340 *
341 * May be combined with N_INVALID. Applicable only to old bucket heads,
342 * ie cht_t.b and not cht_t.new_b.
343 */
344 N_CONST = 1,
345 /** Invalid bucket head. The bucket head must not be modified.
346 *
347 * Old bucket heads (ie cht_t.b) are marked invalid if they have
348 * already been moved to cht_t.new_b or if the bucket had already
349 * been merged with another when shrinking the table. New bucket
350 * heads (ie cht_t.new_b) are marked invalid if the old bucket had
351 * not yet been moved or if an old bucket had not yet been split
352 * when growing the table.
353 */
354 N_INVALID = 3,
355 /** The item is a join node, ie joining two buckets
356 *
357 * A join node is either the first node of the second part of
358 * a bucket to be split; or it is the first node of the bucket
359 * to be merged into/appended to/joined with another bucket.
360 *
361 * May be combined with N_DELETED. Applicable only to items, never
362 * to bucket heads.
363 *
364 * Join nodes are referred to from two different buckets and may,
365 * therefore, not be safely/atomically unlinked from both buckets.
366 * As a result join nodes are not unlinked but rather just marked
367 * deleted. Once resize completes join nodes marked deleted are
368 * garbage collected.
369 */
370 N_JOIN = 2,
371 /** The next node is a join node and will soon be marked so.
372 *
373 * A join-follows node is the last node of the first part of bucket
374 * that is to be split, ie it is the last node that will remain
375 * in the same bucket after splitting it.
376 *
377 * May be combined with N_DELETED. Applicable to items as well
378 * as to bucket heads of the bucket to be split (but only in cht_t.new_b).
379 */
380 N_JOIN_FOLLOWS = 2,
381 /** Bit mask to filter out the address to the next item from the next ptr. */
382 N_MARK_MASK = 3
383} mark_t;
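
/*
 * For illustration, the marks are packed into the two low order bits of a
 * next pointer roughly as follows (a sketch of what make_link(), get_next()
 * and get_mark() are expected to do; the exact implementations live below):
 *
 *   marked_ptr_t link = (marked_ptr_t)((uintptr_t) next | mark);
 *   cht_link_t *next  = (cht_link_t *)(link & ~N_MARK_MASK);
 *   mark_t      mark  = (mark_t)(link & N_MARK_MASK);
 */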
384
/** Determines how a bucket chain is walked, ie which resize-related nodes to expect. */
386typedef enum walk_mode {
387 /** The table is not resizing. */
388 WM_NORMAL = 4,
389 /** The table is undergoing a resize. Join nodes may be encountered. */
390 WM_LEAVE_JOIN,
391 /** The table is growing. A join-follows node may be encountered. */
392 WM_MOVE_JOIN_FOLLOWS
393} walk_mode_t;
394
395/** Bucket position window. */
396typedef struct wnd {
397 /** Pointer to cur's predecessor. */
398 marked_ptr_t *ppred;
399 /** Current item. */
400 cht_link_t *cur;
401 /** Last encountered item. Deleted or not. */
402 cht_link_t *last;
403} wnd_t;
404
405
406/* Sentinel node used by all buckets. Stores the greatest possible hash value.*/
407static const cht_link_t sentinel = {
408 /* NULL and N_NORMAL */
409 .link = 0 | N_NORMAL,
410 .hash = -1
411};
412
413
414static size_t size_to_order(size_t bucket_cnt, size_t min_order);
415static cht_buckets_t *alloc_buckets(size_t order, bool set_invalid,
416 bool can_block);
417static inline cht_link_t *find_lazy(cht_t *h, void *key);
418static cht_link_t *search_bucket(cht_t *h, marked_ptr_t head, void *key,
419 size_t search_hash);
420static cht_link_t *find_resizing(cht_t *h, void *key, size_t hash,
421 marked_ptr_t old_head, size_t old_idx);
422static bool insert_impl(cht_t *h, cht_link_t *item, cht_link_t **dup_item);
423static bool insert_at(cht_link_t *item, const wnd_t *wnd, walk_mode_t walk_mode,
424 bool *resizing);
425static bool has_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
426 cht_link_t *cur, cht_link_t **dup_item);
427static cht_link_t *find_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
428 cht_link_t *start);
429static bool remove_pred(cht_t *h, size_t hash, equal_pred_t pred, void *pred_arg);
430static bool delete_at(cht_t *h, wnd_t *wnd, walk_mode_t walk_mode,
431 bool *deleted_but_gc, bool *resizing);
432static bool mark_deleted(cht_link_t *cur, walk_mode_t walk_mode, bool *resizing);
433static bool unlink_from_pred(wnd_t *wnd, walk_mode_t walk_mode, bool *resizing);
434static bool find_wnd_and_gc_pred(cht_t *h, size_t hash, walk_mode_t walk_mode,
435 equal_pred_t pred, void *pred_arg, wnd_t *wnd, bool *resizing);
436static bool find_wnd_and_gc(cht_t *h, size_t hash, walk_mode_t walk_mode,
437 wnd_t *wnd, bool *resizing);
438static bool gc_deleted_node(cht_t *h, walk_mode_t walk_mode, wnd_t *wnd,
439 bool *resizing);
440static bool join_completed(cht_t *h, const wnd_t *wnd);
441static void upd_resizing_head(cht_t *h, size_t hash, marked_ptr_t **phead,
442 bool *join_finishing, walk_mode_t *walk_mode);
443static void item_removed(cht_t *h);
444static void item_inserted(cht_t *h);
445static void free_later(cht_t *h, cht_link_t *item);
446static void help_head_move(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head);
447static void start_head_move(marked_ptr_t *psrc_head);
448static void mark_const(marked_ptr_t *psrc_head);
449static void complete_head_move(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head);
450static void split_bucket(cht_t *h, marked_ptr_t *psrc_head,
451 marked_ptr_t *pdest_head, size_t split_hash);
452static void mark_join_follows(cht_t *h, marked_ptr_t *psrc_head,
453 size_t split_hash, wnd_t *wnd);
454static void mark_join_node(cht_link_t *join_node);
455static void join_buckets(cht_t *h, marked_ptr_t *psrc_head,
456 marked_ptr_t *pdest_head, size_t split_hash);
457static void link_to_join_node(cht_t *h, marked_ptr_t *pdest_head,
458 cht_link_t *join_node, size_t split_hash);
459static void resize_table(work_t *arg);
460static void grow_table(cht_t *h);
461static void shrink_table(cht_t *h);
462static void cleanup_join_node(cht_t *h, marked_ptr_t *new_head);
463static void clear_join_and_gc(cht_t *h, cht_link_t *join_node,
464 marked_ptr_t *new_head);
465static void cleanup_join_follows(cht_t *h, marked_ptr_t *new_head);
466static marked_ptr_t make_link(const cht_link_t *next, mark_t mark);
467static cht_link_t * get_next(marked_ptr_t link);
468static mark_t get_mark(marked_ptr_t link);
469static void next_wnd(wnd_t *wnd);
470static bool same_node_pred(void *node, const cht_link_t *item2);
471static size_t calc_key_hash(cht_t *h, void *key);
472static size_t node_hash(cht_t *h, const cht_link_t *item);
473static size_t calc_node_hash(cht_t *h, const cht_link_t *item);
474static void memoize_node_hash(cht_t *h, cht_link_t *item);
475static size_t calc_split_hash(size_t split_idx, size_t order);
476static size_t calc_bucket_idx(size_t hash, size_t order);
477static size_t grow_to_split_idx(size_t old_idx);
478static size_t grow_idx(size_t idx);
479static size_t shrink_idx(size_t idx);
480static marked_ptr_t cas_link(marked_ptr_t *link, const cht_link_t *cur_next,
481 mark_t cur_mark, const cht_link_t *new_next, mark_t new_mark);
482static marked_ptr_t _cas_link(marked_ptr_t *link, marked_ptr_t cur,
483 marked_ptr_t new);
484static void cas_order_barrier(void);
485
486static void dummy_remove_callback(cht_link_t *item)
487{
488 /* empty */
489}
490
491/** Creates a concurrent hash table.
492 *
493 * @param h Valid pointer to a cht_t instance.
494 * @param op Item specific operations. All operations are compulsory.
495 * @return True if successfully created the table. False otherwise.
496 */
497bool cht_create_simple(cht_t *h, cht_ops_t *op)
498{
499 return cht_create(h, 0, 0, 0, false, op);
500}
501
502/** Creates a concurrent hash table.
503 *
504 * @param h Valid pointer to a cht_t instance.
505 * @param init_size The initial number of buckets the table should contain.
506 * The table may be shrunk below this value if deemed necessary.
507 * Uses the default value if 0.
508 * @param min_size Minimum number of buckets that the table should contain.
509 * The number of buckets never drops below this value,
510 * although it may be rounded up internally as appropriate.
511 * Uses the default value if 0.
 * @param max_load Maximum average number of items per bucket that is
 *                 allowed before the table grows.
514 * @param can_block If true creating the table blocks until enough memory
515 * is available (possibly indefinitely). Otherwise,
516 * table creation does not block and returns immediately
517 * even if not enough memory is available.
518 * @param op Item specific operations. All operations are compulsory.
519 * @return True if successfully created the table. False otherwise.
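 *
 * A minimal usage sketch (the my_* callbacks are hypothetical and must
 * match the cht_ops_t signatures):
 * @code
 * static cht_ops_t my_ops = {
 *     .hash = my_hash,
 *     .key_hash = my_key_hash,
 *     .equal = my_equal,
 *     .key_equal = my_key_equal,
 *     .remove_callback = my_remove_callback
 * };
 *
 * cht_t my_table;
 * if (!cht_create(&my_table, 0, 0, 0, false, &my_ops)) {
 *     // Not enough memory.
 * }
 * @endcode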
520 */
521bool cht_create(cht_t *h, size_t init_size, size_t min_size, size_t max_load,
522 bool can_block, cht_ops_t *op)
523{
524 assert(h);
525 assert(op && op->hash && op->key_hash && op->equal && op->key_equal);
526 /* Memoized hashes are stored in the rcu_link.func function pointer. */
527 static_assert(sizeof(size_t) == sizeof(rcu_func_t), "");
528 assert(sentinel.hash == (uintptr_t)sentinel.rcu_link.func);
529
530 /* All operations are compulsory. */
531 if (!op || !op->hash || !op->key_hash || !op->equal || !op->key_equal)
532 return false;
533
534 size_t min_order = size_to_order(min_size, CHT_MIN_ORDER);
535 size_t order = size_to_order(init_size, min_order);
536
537 h->b = alloc_buckets(order, false, can_block);
538
539 if (!h->b)
540 return false;
541
542 h->max_load = (max_load == 0) ? CHT_MAX_LOAD : max_load;
543 h->min_order = min_order;
544 h->new_b = NULL;
545 h->op = op;
546 atomic_set(&h->item_cnt, 0);
547 atomic_set(&h->resize_reqs, 0);
548
549 if (NULL == op->remove_callback) {
550 h->op->remove_callback = dummy_remove_callback;
551 }
552
553 /*
554 * Cached item hashes are stored in item->rcu_link.func. Once the item
555 * is deleted rcu_link.func will contain the value of invalid_hash.
556 */
557 h->invalid_hash = (uintptr_t)h->op->remove_callback;
558
559 /* Ensure the initialization takes place before we start using the table. */
560 write_barrier();
561
562 return true;
563}
564
565/** Allocates and initializes 2^order buckets.
566 *
567 * All bucket heads are initialized to point to the sentinel node.
568 *
569 * @param order The number of buckets to allocate is 2^order.
570 * @param set_invalid Bucket heads are marked invalid if true; otherwise
571 * they are marked N_NORMAL.
572 * @param can_block If true memory allocation blocks until enough memory
573 * is available (possibly indefinitely). Otherwise,
574 * memory allocation does not block.
575 * @return Newly allocated and initialized buckets or NULL if not enough memory.
576 */
577static cht_buckets_t *alloc_buckets(size_t order, bool set_invalid, bool can_block)
578{
579 size_t bucket_cnt = (1 << order);
580 size_t bytes =
581 sizeof(cht_buckets_t) + (bucket_cnt - 1) * sizeof(marked_ptr_t);
582 cht_buckets_t *b = malloc(bytes, can_block ? 0 : FRAME_ATOMIC);
583
584 if (!b)
585 return NULL;
586
587 b->order = order;
588
589 marked_ptr_t head_link = set_invalid
590 ? make_link(&sentinel, N_INVALID)
591 : make_link(&sentinel, N_NORMAL);
592
593 for (size_t i = 0; i < bucket_cnt; ++i) {
594 b->head[i] = head_link;
595 }
596
597 return b;
598}
599
600/** Returns the smallest k such that bucket_cnt <= 2^k and min_order <= k.*/
601static size_t size_to_order(size_t bucket_cnt, size_t min_order)
602{
603 size_t order = min_order;
604
605 /* Find a power of two such that bucket_cnt <= 2^order */
606 do {
607 if (bucket_cnt <= ((size_t)1 << order))
608 return order;
609
610 ++order;
611 } while (order < CHT_MAX_ORDER);
612
613 return order;
614}
615
616/** Destroys a CHT successfully created via cht_create().
617 *
618 * Waits for all outstanding concurrent operations to complete and
 * frees internally allocated resources. The table is, however, not cleared
620 * and items already present in the table (if any) are leaked.
621 */
622void cht_destroy(cht_t *h)
623{
624 cht_destroy_unsafe(h);
625
626 /* You must clear the table of items. Otherwise cht_destroy will leak. */
627 assert(atomic_get(&h->item_cnt) == 0);
628}
629
630/** Destroys a successfully created CHT but does no error checking. */
631void cht_destroy_unsafe(cht_t *h)
632{
633 /* Wait for resize to complete. */
634 while (0 < atomic_get(&h->resize_reqs)) {
635 rcu_barrier();
636 }
637
638 /* Wait for all remove_callback()s to complete. */
639 rcu_barrier();
640
641 free(h->b);
642 h->b = NULL;
643}
644
645/** Returns the first item equal to the search key or NULL if not found.
646 *
 * The call must be enclosed in a rcu_read_lock()/unlock() pair. The
648 * returned item is guaranteed to be allocated until rcu_read_unlock()
649 * although the item may be concurrently removed from the table by another
650 * cpu.
651 *
652 * Further items matching the key may be retrieved via cht_find_next().
653 *
654 * cht_find() sees the effects of any completed cht_remove(), cht_insert().
655 * If a concurrent remove or insert had not yet completed cht_find() may
656 * or may not see the effects of it (eg it may find an item being removed).
657 *
658 * @param h CHT to operate on.
659 * @param key Search key as defined by cht_ops_t.key_equal() and .key_hash().
660 * @return First item equal to the key or NULL if such an item does not exist.
661 */
662cht_link_t *cht_find(cht_t *h, void *key)
663{
664 /* Make the most recent changes to the table visible. */
665 read_barrier();
666 return cht_find_lazy(h, key);
667}
668
669/** Returns the first item equal to the search key or NULL if not found.
670 *
671 * Unlike cht_find(), cht_find_lazy() may not see the effects of
672 * cht_remove() or cht_insert() even though they have already completed.
673 * It may take a couple of milliseconds for those changes to propagate
674 * and become visible to cht_find_lazy(). On the other hand, cht_find_lazy()
675 * operates a bit faster than cht_find().
676 *
677 * See cht_find() for more details.
678 */
679cht_link_t *cht_find_lazy(cht_t *h, void *key)
680{
681 return find_lazy(h, key);
682}
683
684/** Finds the first item equal to the search key. */
685static inline cht_link_t *find_lazy(cht_t *h, void *key)
686{
687 assert(h);
688 /* See docs to cht_find() and cht_find_lazy(). */
689 assert(rcu_read_locked());
690
691 size_t hash = calc_key_hash(h, key);
692
693 cht_buckets_t *b = rcu_access(h->b);
694 size_t idx = calc_bucket_idx(hash, b->order);
695 /*
696 * No need for access_once. b->head[idx] will point to an allocated node
697 * even if marked invalid until we exit rcu read section.
698 */
699 marked_ptr_t head = b->head[idx];
700
701 /* Undergoing a resize - take the slow path. */
702 if (N_INVALID == get_mark(head))
703 return find_resizing(h, key, hash, head, idx);
704
705 return search_bucket(h, head, key, hash);
706}
707
708/** Returns the next item matching \a item.
709 *
710 * Must be enclosed in a rcu_read_lock()/unlock() pair. Effects of
711 * completed cht_remove(), cht_insert() are guaranteed to be visible
712 * to cht_find_next().
713 *
714 * See cht_find() for more details.
715 */
716cht_link_t *cht_find_next(cht_t *h, const cht_link_t *item)
717{
718 /* Make the most recent changes to the table visible. */
719 read_barrier();
720 return cht_find_next_lazy(h, item);
721}
722
723/** Returns the next item matching \a item.
724 *
725 * Must be enclosed in a rcu_read_lock()/unlock() pair. Effects of
726 * completed cht_remove(), cht_insert() may or may not be visible
727 * to cht_find_next_lazy().
728 *
729 * See cht_find_lazy() for more details.
730 */
731cht_link_t *cht_find_next_lazy(cht_t *h, const cht_link_t *item)
732{
733 assert(h);
734 assert(rcu_read_locked());
735 assert(item);
736
737 return find_duplicate(h, item, calc_node_hash(h, item), get_next(item->link));
738}
739
740/** Searches the bucket at head for key using search_hash. */
741static inline cht_link_t *search_bucket(cht_t *h, marked_ptr_t head, void *key,
742 size_t search_hash)
743{
744 /*
745 * It is safe to access nodes even outside of this bucket (eg when
746 * splitting the bucket). The resizer makes sure that any node we
747 * may find by following the next pointers is allocated.
748 */
749
750 cht_link_t *cur = NULL;
751 marked_ptr_t prev = head;
752
753try_again:
754 /* Filter out items with different hashes. */
755 do {
756 cur = get_next(prev);
757 assert(cur);
758 prev = cur->link;
759 } while (node_hash(h, cur) < search_hash);
760
761 /*
762 * Only search for an item with an equal key if cur is not the sentinel
763 * node or a node with a different hash.
764 */
765 while (node_hash(h, cur) == search_hash) {
766 if (h->op->key_equal(key, cur)) {
767 if (!(N_DELETED & get_mark(cur->link)))
768 return cur;
769 }
770
771 cur = get_next(cur->link);
772 assert(cur);
773 }
774
775 /*
776 * In the unlikely case that we have encountered a node whose cached
777 * hash has been overwritten due to a pending rcu_call for it, skip
778 * the node and try again.
779 */
780 if (node_hash(h, cur) == h->invalid_hash) {
781 prev = cur->link;
782 goto try_again;
783 }
784
785 return NULL;
786}
787
788/** Searches for the key while the table is undergoing a resize. */
789static cht_link_t *find_resizing(cht_t *h, void *key, size_t hash,
790 marked_ptr_t old_head, size_t old_idx)
791{
792 assert(N_INVALID == get_mark(old_head));
793 assert(h->new_b);
794
795 size_t new_idx = calc_bucket_idx(hash, h->new_b->order);
796 marked_ptr_t new_head = h->new_b->head[new_idx];
797 marked_ptr_t search_head = new_head;
798
799 /* Growing. */
800 if (h->b->order < h->new_b->order) {
801 /*
802 * Old bucket head is invalid, so it must have been already
803 * moved. Make the new head visible if still not visible, ie
804 * invalid.
805 */
806 if (N_INVALID == get_mark(new_head)) {
807 /*
808 * We should be searching a newly added bucket but the old
 * moved bucket has not yet been split (it's marked invalid)
810 * or we have not yet seen the split.
811 */
812 if (grow_idx(old_idx) != new_idx) {
813 /*
814 * Search the moved bucket. It is guaranteed to contain
815 * items of the newly added bucket that were present
816 * before the moved bucket was split.
817 */
818 new_head = h->new_b->head[grow_idx(old_idx)];
819 }
820
821 /* new_head is now the moved bucket, either valid or invalid. */
822
823 /*
824 * The old bucket was definitely moved to new_head but the
825 * change of new_head had not yet propagated to this cpu.
826 */
827 if (N_INVALID == get_mark(new_head)) {
828 /*
829 * We could issue a read_barrier() and make the now valid
830 * moved bucket head new_head visible, but instead fall back
831 * on using the old bucket. Although the old bucket head is
832 * invalid, it points to a node that is allocated and in the
833 * right bucket. Before the node can be freed, it must be
834 * unlinked from the head (or another item after that item
835 * modified the new_head) and a grace period must elapse.
836 * As a result had the node been already freed the grace
 * period preceding the free() would make the unlink and
838 * any changes to new_head visible. Therefore, it is safe
839 * to use the node pointed to from the old bucket head.
840 */
841
842 search_head = old_head;
843 } else {
844 search_head = new_head;
845 }
846 }
847
848 return search_bucket(h, search_head, key, hash);
849 } else if (h->b->order > h->new_b->order) {
850 /* Shrinking. */
851
852 /* Index of the bucket in the old table that was moved. */
853 size_t move_src_idx = grow_idx(new_idx);
854 marked_ptr_t moved_old_head = h->b->head[move_src_idx];
855
856 /*
857 * h->b->head[move_src_idx] had already been moved to new_head
858 * but the change to new_head had not yet propagated to us.
859 */
860 if (N_INVALID == get_mark(new_head)) {
861 /*
862 * new_head is definitely valid and we could make it visible
863 * to this cpu with a read_barrier(). Instead, use the bucket
864 * in the old table that was moved even though it is now marked
865 * as invalid. The node it points to must be allocated because
866 * a grace period would have to elapse before it could be freed;
867 * and the grace period would make the now valid new_head
868 * visible to all cpus.
869 *
870 * Note that move_src_idx may not be the same as old_idx.
871 * If move_src_idx != old_idx then old_idx is the bucket
872 * in the old table that is not moved but instead it is
873 * appended to the moved bucket, ie it is added at the tail
874 * of new_head. In that case an invalid old_head notes that
875 * it had already been merged into (the moved) new_head.
876 * We will try to search that bucket first because it
877 * may contain some newly added nodes after the bucket
878 * join. Moreover, the bucket joining link may already be
879 * visible even if new_head is not. Therefore, if we're
880 * lucky we'll find the item via moved_old_head. In any
881 * case, we'll retry in proper old_head if not found.
882 */
883 search_head = moved_old_head;
884 }
885
886 cht_link_t *ret = search_bucket(h, search_head, key, hash);
887
888 if (ret)
889 return ret;
890 /*
891 * Bucket old_head was already joined with moved_old_head
892 * in the new table but we have not yet seen change of the
893 * joining link (or the item is not in the table).
894 */
895 if (move_src_idx != old_idx && get_next(old_head) != &sentinel) {
896 /*
897 * Note that old_head (the bucket to be merged into new_head)
898 * points to an allocated join node (if non-null) even if marked
 * invalid. Before the resizer lets join nodes be unlinked
900 * (and freed) it sets old_head to NULL and waits for a grace period.
901 * So either the invalid old_head points to join node; or old_head
902 * is null and we would have seen a completed bucket join while
903 * traversing search_head.
904 */
905 assert(N_JOIN & get_mark(get_next(old_head)->link));
906 return search_bucket(h, old_head, key, hash);
907 }
908
909 return NULL;
910 } else {
911 /*
912 * Resize is almost done. The resizer is waiting to make
913 * sure all cpus see that the new table replaced the old one.
914 */
915 assert(h->b->order == h->new_b->order);
916 /*
917 * The resizer must ensure all new bucket heads are visible before
918 * replacing the old table.
919 */
920 assert(N_NORMAL == get_mark(new_head));
921 return search_bucket(h, new_head, key, hash);
922 }
923}
924
925/** Inserts an item. Succeeds even if an equal item is already present. */
926void cht_insert(cht_t *h, cht_link_t *item)
927{
928 insert_impl(h, item, NULL);
929}
930
931/** Inserts a unique item. Returns false if an equal item was already present.
932 *
933 * Use this function to atomically check if an equal/duplicate item had
934 * not yet been inserted into the table and to insert this item into the
935 * table.
936 *
937 * The following is @e NOT thread-safe, so do not use:
938 * @code
 * if (!cht_find(h, key)) {
 *     // A concurrent insert here may go unnoticed by cht_find() above.
 *     item = malloc(..);
 *     cht_insert(h, item);
 *     // Now we may have two items with equal search keys.
 * }
945 * @endcode
946 *
947 * Replace such code with:
948 * @code
 * item = malloc(..);
 * if (!cht_insert_unique(h, item, &dup_item)) {
 *     // Whoops, someone beat us to it - an equal item 'dup_item'
 *     // had already been inserted.
 *     free(item);
 * } else {
 *     // Successfully inserted the item and we are guaranteed that
 *     // there are no other equal items.
 * }
958 * @endcode
959 *
960 */
961bool cht_insert_unique(cht_t *h, cht_link_t *item, cht_link_t **dup_item)
962{
963 assert(rcu_read_locked());
964 assert(dup_item);
965 return insert_impl(h, item, dup_item);
966}
967
/** Inserts the item into the table and checks for duplicates if dup_item is non-NULL. */
969static bool insert_impl(cht_t *h, cht_link_t *item, cht_link_t **dup_item)
970{
971 rcu_read_lock();
972
973 cht_buckets_t *b = rcu_access(h->b);
974 memoize_node_hash(h, item);
975 size_t hash = node_hash(h, item);
976 size_t idx = calc_bucket_idx(hash, b->order);
977 marked_ptr_t *phead = &b->head[idx];
978
979 bool resizing = false;
980 bool inserted = false;
981
982 do {
983 walk_mode_t walk_mode = WM_NORMAL;
984 bool join_finishing;
985
986 resizing = resizing || (N_NORMAL != get_mark(*phead));
987
988 /* The table is resizing. Get the correct bucket head. */
989 if (resizing) {
990 upd_resizing_head(h, hash, &phead, &join_finishing, &walk_mode);
991 }
992
993 wnd_t wnd = {
994 .ppred = phead,
995 .cur = get_next(*phead),
996 .last = NULL
997 };
998
999 if (!find_wnd_and_gc(h, hash, walk_mode, &wnd, &resizing)) {
1000 /* Could not GC a node; or detected an unexpected resize. */
1001 continue;
1002 }
1003
1004 if (dup_item && has_duplicate(h, item, hash, wnd.cur, dup_item)) {
1005 rcu_read_unlock();
1006 return false;
1007 }
1008
1009 inserted = insert_at(item, &wnd, walk_mode, &resizing);
1010 } while (!inserted);
1011
1012 rcu_read_unlock();
1013
1014 item_inserted(h);
1015 return true;
1016}
1017
1018/** Inserts item between wnd.ppred and wnd.cur.
1019 *
1020 * @param item Item to link to wnd.ppred and wnd.cur.
1021 * @param wnd The item will be inserted before wnd.cur. Wnd.ppred
1022 * must be N_NORMAL.
1023 * @param walk_mode
1024 * @param resizing Set to true only if the table is undergoing resize
1025 * and it was not expected (ie walk_mode == WM_NORMAL).
1026 * @return True if the item was successfully linked to wnd.ppred. False
1027 * if whole insert operation must be retried because the predecessor
1028 * of wnd.cur has changed.
1029 */
1030inline static bool insert_at(cht_link_t *item, const wnd_t *wnd,
1031 walk_mode_t walk_mode, bool *resizing)
1032{
1033 marked_ptr_t ret;
1034
1035 if (walk_mode == WM_NORMAL) {
1036 item->link = make_link(wnd->cur, N_NORMAL);
1037 /* Initialize the item before adding it to a bucket. */
1038 memory_barrier();
1039
1040 /* Link a clean/normal predecessor to the item. */
1041 ret = cas_link(wnd->ppred, wnd->cur, N_NORMAL, item, N_NORMAL);
1042
1043 if (ret == make_link(wnd->cur, N_NORMAL)) {
1044 return true;
1045 } else {
1046 /* This includes an invalid head but not a const head. */
1047 *resizing = ((N_JOIN_FOLLOWS | N_JOIN) & get_mark(ret));
1048 return false;
1049 }
1050 } else if (walk_mode == WM_MOVE_JOIN_FOLLOWS) {
1051 /* Move JOIN_FOLLOWS mark but filter out the DELETED mark. */
1052 mark_t jf_mark = get_mark(*wnd->ppred) & N_JOIN_FOLLOWS;
1053 item->link = make_link(wnd->cur, jf_mark);
1054 /* Initialize the item before adding it to a bucket. */
1055 memory_barrier();
1056
1057 /* Link the not-deleted predecessor to the item. Move its JF mark. */
1058 ret = cas_link(wnd->ppred, wnd->cur, jf_mark, item, N_NORMAL);
1059
1060 return ret == make_link(wnd->cur, jf_mark);
1061 } else {
1062 assert(walk_mode == WM_LEAVE_JOIN);
1063
1064 item->link = make_link(wnd->cur, N_NORMAL);
1065 /* Initialize the item before adding it to a bucket. */
1066 memory_barrier();
1067
1068 mark_t pred_mark = get_mark(*wnd->ppred);
1069 /* If the predecessor is a join node it may be marked deleted.*/
1070 mark_t exp_pred_mark = (N_JOIN & pred_mark) ? pred_mark : N_NORMAL;
1071
1072 ret = cas_link(wnd->ppred, wnd->cur, exp_pred_mark, item, exp_pred_mark);
1073 return ret == make_link(wnd->cur, exp_pred_mark);
1074 }
1075}
1076
1077/** Returns true if the chain starting at cur has an item equal to \a item.
1078 *
1079 * @param h CHT to operate on.
1080 * @param item Item whose duplicates the function looks for.
1081 * @param hash Hash of \a item.
 * @param[in] cur The first node with a hash greater than or equal to item's hash.
1083 * @param[out] dup_item The first duplicate item encountered.
1084 * @return True if a non-deleted item equal to \a item exists in the table.
1085 */
1086static inline bool has_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
1087 cht_link_t *cur, cht_link_t **dup_item)
1088{
1089 assert(cur);
1090 assert(cur == &sentinel || hash <= node_hash(h, cur)
1091 || node_hash(h, cur) == h->invalid_hash);
1092
1093 /* hash < node_hash(h, cur) */
1094 if (hash != node_hash(h, cur) && h->invalid_hash != node_hash(h, cur))
1095 return false;
1096
1097 /*
 * Load the most recent node marks. Otherwise we might mistake a
1099 * logically deleted node for a duplicate of the item just because
1100 * the deleted node's DEL mark had not yet propagated to this cpu.
1101 */
1102 read_barrier();
1103
1104 *dup_item = find_duplicate(h, item, hash, cur);
1105 return NULL != *dup_item;
1106}
1107
1108/** Returns an item that is equal to \a item starting in a chain at \a start. */
1109static cht_link_t *find_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
1110 cht_link_t *start)
1111{
1112 assert(hash <= node_hash(h, start) || h->invalid_hash == node_hash(h, start));
1113
1114 cht_link_t *cur = start;
1115
1116try_again:
1117 assert(cur);
1118
1119 while (node_hash(h, cur) == hash) {
1120 assert(cur != &sentinel);
1121
1122 bool deleted = (N_DELETED & get_mark(cur->link));
1123
1124 /* Skip logically deleted nodes. */
1125 if (!deleted && h->op->equal(item, cur))
1126 return cur;
1127
1128 cur = get_next(cur->link);
1129 assert(cur);
1130 }
1131
1132 /* Skip logically deleted nodes with rcu_call() in progress. */
1133 if (h->invalid_hash == node_hash(h, cur)) {
1134 cur = get_next(cur->link);
1135 goto try_again;
1136 }
1137
1138 return NULL;
1139}
1140
1141/** Removes all items matching the search key. Returns the number of items removed.*/
1142size_t cht_remove_key(cht_t *h, void *key)
1143{
1144 assert(h);
1145
1146 size_t hash = calc_key_hash(h, key);
1147 size_t removed = 0;
1148
1149 while (remove_pred(h, hash, h->op->key_equal, key))
1150 ++removed;
1151
1152 return removed;
1153}
1154
1155/** Removes a specific item from the table.
1156 *
 * The caller must hold the rcu read lock.
1158 *
1159 * @param item Item presumably present in the table and to be removed.
1160 * @return True if the item was removed successfully; or false if it had
1161 * already been deleted.
1162 */
1163bool cht_remove_item(cht_t *h, cht_link_t *item)
1164{
1165 assert(h);
1166 assert(item);
1167 /* Otherwise a concurrent cht_remove_key might free the item. */
1168 assert(rcu_read_locked());
1169
1170 /*
1171 * Even though we know the node we want to delete we must unlink it
1172 * from the correct bucket and from a clean/normal predecessor. Therefore,
1173 * we search for it again from the beginning of the correct bucket.
1174 */
1175 size_t hash = calc_node_hash(h, item);
1176 return remove_pred(h, hash, same_node_pred, item);
1177}
1178
1179/** Removes an item equal to pred_arg according to the predicate pred. */
1180static bool remove_pred(cht_t *h, size_t hash, equal_pred_t pred, void *pred_arg)
1181{
1182 rcu_read_lock();
1183
1184 bool resizing = false;
1185 bool deleted = false;
1186 bool deleted_but_gc = false;
1187
1188 cht_buckets_t *b = rcu_access(h->b);
1189 size_t idx = calc_bucket_idx(hash, b->order);
1190 marked_ptr_t *phead = &b->head[idx];
1191
1192 do {
1193 walk_mode_t walk_mode = WM_NORMAL;
1194 bool join_finishing = false;
1195
1196 resizing = resizing || (N_NORMAL != get_mark(*phead));
1197
1198 /* The table is resizing. Get the correct bucket head. */
1199 if (resizing) {
1200 upd_resizing_head(h, hash, &phead, &join_finishing, &walk_mode);
1201 }
1202
1203 wnd_t wnd = {
1204 .ppred = phead,
1205 .cur = get_next(*phead),
1206 .last = NULL
1207 };
1208
1209 if (!find_wnd_and_gc_pred(
1210 h, hash, walk_mode, pred, pred_arg, &wnd, &resizing)) {
1211 /* Could not GC a node; or detected an unexpected resize. */
1212 continue;
1213 }
1214
1215 /*
1216 * The item lookup is affected by a bucket join but effects of
1217 * the bucket join have not been seen while searching for the item.
1218 */
1219 if (join_finishing && !join_completed(h, &wnd)) {
1220 /*
1221 * Bucket was appended at the end of another but the next
1222 * ptr linking them together was not visible on this cpu.
1223 * join_completed() makes this appended bucket visible.
1224 */
1225 continue;
1226 }
1227
1228 /* Already deleted, but delete_at() requested one GC pass. */
1229 if (deleted_but_gc)
1230 break;
1231
1232 bool found = (wnd.cur != &sentinel && pred(pred_arg, wnd.cur));
1233
1234 if (!found) {
1235 rcu_read_unlock();
1236 return false;
1237 }
1238
1239 deleted = delete_at(h, &wnd, walk_mode, &deleted_but_gc, &resizing);
1240 } while (!deleted || deleted_but_gc);
1241
1242 rcu_read_unlock();
1243 return true;
1244}
1245
1246/** Unlinks wnd.cur from wnd.ppred and schedules a deferred free for the item.
1247 *
1248 * Ignores nodes marked N_JOIN if walk mode is WM_LEAVE_JOIN.
1249 *
1250 * @param h CHT to operate on.
1251 * @param wnd Points to the item to delete and its N_NORMAL predecessor.
 * @param walk_mode Bucket chain walk mode.
1253 * @param deleted_but_gc Set to true if the item had been logically deleted,
1254 * but a garbage collecting walk of the bucket is in order for
1255 * it to be fully unlinked.
1256 * @param resizing Set to true if the table is undergoing an unexpected
1257 * resize (ie walk_mode == WM_NORMAL).
1258 * @return False if the wnd.ppred changed in the meantime and the whole
1259 * delete operation must be retried.
1260 */
1261static inline bool delete_at(cht_t *h, wnd_t *wnd, walk_mode_t walk_mode,
1262 bool *deleted_but_gc, bool *resizing)
1263{
1264 assert(wnd->cur && wnd->cur != &sentinel);
1265
1266 *deleted_but_gc = false;
1267
1268 if (!mark_deleted(wnd->cur, walk_mode, resizing)) {
1269 /* Already deleted, or unexpectedly marked as JOIN/JOIN_FOLLOWS. */
1270 return false;
1271 }
1272
1273 /* Marked deleted. Unlink from the bucket. */
1274
1275 /* Never unlink join nodes. */
1276 if (walk_mode == WM_LEAVE_JOIN && (N_JOIN & get_mark(wnd->cur->link)))
1277 return true;
1278
1279 cas_order_barrier();
1280
1281 if (unlink_from_pred(wnd, walk_mode, resizing)) {
1282 free_later(h, wnd->cur);
1283 } else {
1284 *deleted_but_gc = true;
1285 }
1286
1287 return true;
1288}
1289
1290/** Marks cur logically deleted. Returns false to request a retry. */
1291static inline bool mark_deleted(cht_link_t *cur, walk_mode_t walk_mode,
1292 bool *resizing)
1293{
1294 assert(cur && cur != &sentinel);
1295
1296 /*
1297 * Btw, we could loop here if the cas fails but let's not complicate
1298 * things and let's retry from the head of the bucket.
1299 */
1300
1301 cht_link_t *next = get_next(cur->link);
1302
1303 if (walk_mode == WM_NORMAL) {
1304 /* Only mark clean/normal nodes - JF/JN is used only during resize. */
1305 marked_ptr_t ret = cas_link(&cur->link, next, N_NORMAL, next, N_DELETED);
1306
1307 if (ret != make_link(next, N_NORMAL)) {
1308 *resizing = (N_JOIN | N_JOIN_FOLLOWS) & get_mark(ret);
1309 return false;
1310 }
1311 } else {
1312 static_assert(N_JOIN == N_JOIN_FOLLOWS, "");
1313
1314 /* Keep the N_JOIN/N_JOIN_FOLLOWS mark but strip N_DELETED. */
1315 mark_t cur_mark = get_mark(cur->link) & N_JOIN_FOLLOWS;
1316
1317 marked_ptr_t ret =
1318 cas_link(&cur->link, next, cur_mark, next, cur_mark | N_DELETED);
1319
1320 if (ret != make_link(next, cur_mark))
1321 return false;
1322 }
1323
1324 return true;
1325}
1326
1327/** Unlinks wnd.cur from wnd.ppred. Returns false if it should be retried. */
1328static inline bool unlink_from_pred(wnd_t *wnd, walk_mode_t walk_mode,
1329 bool *resizing)
1330{
1331 assert(wnd->cur != &sentinel);
1332 assert(wnd->cur && (N_DELETED & get_mark(wnd->cur->link)));
1333
1334 cht_link_t *next = get_next(wnd->cur->link);
1335
1336 if (walk_mode == WM_LEAVE_JOIN) {
1337 /* Never try to unlink join nodes. */
1338 assert(!(N_JOIN & get_mark(wnd->cur->link)));
1339
1340 mark_t pred_mark = get_mark(*wnd->ppred);
1341 /* Succeed only if the predecessor is clean/normal or a join node. */
1342 mark_t exp_pred_mark = (N_JOIN & pred_mark) ? pred_mark : N_NORMAL;
1343
1344 marked_ptr_t pred_link = make_link(wnd->cur, exp_pred_mark);
1345 marked_ptr_t next_link = make_link(next, exp_pred_mark);
1346
1347 if (pred_link != _cas_link(wnd->ppred, pred_link, next_link))
1348 return false;
1349 } else {
1350 assert(walk_mode == WM_MOVE_JOIN_FOLLOWS || walk_mode == WM_NORMAL);
1351 /* Move the JF mark if set. Clear DEL mark. */
1352 mark_t cur_mark = N_JOIN_FOLLOWS & get_mark(wnd->cur->link);
1353
1354 /* The predecessor must be clean/normal. */
1355 marked_ptr_t pred_link = make_link(wnd->cur, N_NORMAL);
1356 /* Link to cur's successor keeping/copying cur's JF mark. */
1357 marked_ptr_t next_link = make_link(next, cur_mark);
1358
1359 marked_ptr_t ret = _cas_link(wnd->ppred, pred_link, next_link);
1360
1361 if (pred_link != ret) {
1362 /* If we're not resizing the table there are no JF/JN nodes. */
1363 *resizing = (walk_mode == WM_NORMAL)
1364 && (N_JOIN_FOLLOWS & get_mark(ret));
1365 return false;
1366 }
1367 }
1368
1369 return true;
1370}
1371
1372/** Finds the first non-deleted item equal to \a pred_arg according to \a pred.
1373 *
1374 * The function returns the candidate item in \a wnd. Logically deleted
1375 * nodes are garbage collected so the predecessor will most likely not
1376 * be marked as deleted.
1377 *
1378 * Unlike find_wnd_and_gc(), this function never returns a node that
1379 * is known to have already been marked N_DELETED.
1380 *
1381 * Any logically deleted nodes (ie those marked N_DELETED) are garbage
 * collected, ie freed in the background via rcu_call (except for join-nodes
1383 * if walk_mode == WM_LEAVE_JOIN).
1384 *
1385 * @param h CHT to operate on.
 * @param hash Hash to search for.
1387 * @param walk_mode Bucket chain walk mode.
1388 * @param pred Predicate used to find an item equal to pred_arg.
1389 * @param pred_arg Argument to pass to the equality predicate \a pred.
1390 * @param[in,out] wnd The search starts with wnd.cur. If the desired
1391 * item is found wnd.cur will point to it.
1392 * @param resizing Set to true if the table is resizing but it was not
1393 * expected (ie walk_mode == WM_NORMAL).
1394 * @return False if the operation has to be retried. True otherwise
1395 * (even if an equal item had not been found).
1396 */
1397static bool find_wnd_and_gc_pred(cht_t *h, size_t hash, walk_mode_t walk_mode,
1398 equal_pred_t pred, void *pred_arg, wnd_t *wnd, bool *resizing)
1399{
1400 assert(wnd->cur);
1401
1402 if (wnd->cur == &sentinel)
1403 return true;
1404
1405 /*
1406 * A read barrier is not needed here to bring up the most recent
1407 * node marks (esp the N_DELETED). At worst we'll try to delete
1408 * an already deleted node; fail in delete_at(); and retry.
1409 */
1410
1411 size_t cur_hash;
1412
1413try_again:
1414 cur_hash = node_hash(h, wnd->cur);
1415
1416 while (cur_hash <= hash) {
1417 assert(wnd->cur && wnd->cur != &sentinel);
1418
1419 /* GC any deleted nodes on the way. */
1420 if (N_DELETED & get_mark(wnd->cur->link)) {
1421 if (!gc_deleted_node(h, walk_mode, wnd, resizing)) {
1422 /* Retry from the head of a bucket. */
1423 return false;
1424 }
1425 } else {
1426 /* Is this the node we were looking for? */
1427 if (cur_hash == hash && pred(pred_arg, wnd->cur))
1428 return true;
1429
1430 next_wnd(wnd);
1431 }
1432
1433 cur_hash = node_hash(h, wnd->cur);
1434 }
1435
1436 if (cur_hash == h->invalid_hash) {
1437 next_wnd(wnd);
1438 assert(wnd->cur);
1439 goto try_again;
1440 }
1441
1442 /* The searched for node is not in the current bucket. */
1443 return true;
1444}
1445
/** Finds the first item (deleted or not) with a hash greater than or equal to \a hash.
1447 *
1448 * The function returns the first item with a hash that is greater or
 * equal to \a hash in \a wnd. Moreover, it garbage collects logically
 * deleted nodes that have not yet been unlinked and freed. Therefore,
1451 * the returned node's predecessor will most likely be N_NORMAL.
1452 *
1453 * Unlike find_wnd_and_gc_pred(), this function may return a node
 * that is known to have been marked N_DELETED.
1455 *
1456 * @param h CHT to operate on.
1457 * @param hash Hash of the item to find.
1458 * @param walk_mode Bucket chain walk mode.
 * @param[in,out] wnd wnd.cur denotes the first node of the chain. If the
 *                 operation is successful, \a wnd points to the desired
1461 * item.
1462 * @param resizing Set to true if a table resize was detected but walk_mode
1463 * suggested the table was not undergoing a resize.
1464 * @return False indicates the operation must be retried. True otherwise
 * (even if an item with exactly the same hash was not found).
1466 */
1467static bool find_wnd_and_gc(cht_t *h, size_t hash, walk_mode_t walk_mode,
1468 wnd_t *wnd, bool *resizing)
1469{
1470try_again:
1471 assert(wnd->cur);
1472
1473 while (node_hash(h, wnd->cur) < hash) {
1474 /* GC any deleted nodes along the way to our desired node. */
1475 if (N_DELETED & get_mark(wnd->cur->link)) {
1476 if (!gc_deleted_node(h, walk_mode, wnd, resizing)) {
1477 /* Failed to remove the garbage node. Retry. */
1478 return false;
1479 }
1480 } else {
1481 next_wnd(wnd);
1482 }
1483
1484 assert(wnd->cur);
1485 }
1486
1487 if (node_hash(h, wnd->cur) == h->invalid_hash) {
1488 next_wnd(wnd);
1489 goto try_again;
1490 }
1491
1492 /* wnd->cur may be NULL or even marked N_DELETED. */
1493 return true;
1494}
1495
1496/** Garbage collects the N_DELETED node at \a wnd skipping join nodes. */
1497static bool gc_deleted_node(cht_t *h, walk_mode_t walk_mode, wnd_t *wnd,
1498 bool *resizing)
1499{
1500 assert(N_DELETED & get_mark(wnd->cur->link));
1501
1502 /* Skip deleted JOIN nodes. */
1503 if (walk_mode == WM_LEAVE_JOIN && (N_JOIN & get_mark(wnd->cur->link))) {
1504 next_wnd(wnd);
1505 } else {
1506 /* Ordinary deleted node or a deleted JOIN_FOLLOWS. */
1507 assert(walk_mode != WM_LEAVE_JOIN
1508 || !((N_JOIN | N_JOIN_FOLLOWS) & get_mark(wnd->cur->link)));
1509
1510 /* Unlink an ordinary deleted node, move JOIN_FOLLOWS mark. */
1511 if (!unlink_from_pred(wnd, walk_mode, resizing)) {
1512 /* Retry. The predecessor was deleted, invalid, const, join_follows. */
1513 return false;
1514 }
1515
1516 free_later(h, wnd->cur);
1517
1518 /* Leave ppred as is. */
1519 wnd->last = wnd->cur;
1520 wnd->cur = get_next(wnd->cur->link);
1521 }
1522
1523 return true;
1524}
1525
1526/** Returns true if a bucket join had already completed.
1527 *
1528 * May only be called if upd_resizing_head() indicates a bucket join
1529 * may be in progress.
1530 *
1531 * If it returns false, the search must be retried in order to guarantee
1532 * all item that should have been encountered have been seen.
1533 */
1534static bool join_completed(cht_t *h, const wnd_t *wnd)
1535{
1536 /*
1537 * The table is shrinking and the searched for item is in a bucket
1538 * appended to another. Check that the link joining these two buckets
1539 * is visible and if not, make it visible to this cpu.
1540 */
1541
1542 /*
1543 * Resizer ensures h->b->order stays the same for the duration of this
1544 * func. We got here because there was an alternative head to search.
1545	 * The resizer waits for all preexisting readers to finish after
1546	 * it publishes the new table and before it frees the old buckets.
1547 */
1548 assert(h->b->order > h->new_b->order);
1549 assert(wnd->cur);
1550
1551 /* Either we did not need the joining link or we have already followed it.*/
1552 if (wnd->cur != &sentinel)
1553 return true;
1554
1555 /* We have reached the end of a bucket. */
1556
1557 if (wnd->last != &sentinel) {
1558 size_t last_seen_hash = node_hash(h, wnd->last);
1559
1560 if (last_seen_hash == h->invalid_hash) {
1561 last_seen_hash = calc_node_hash(h, wnd->last);
1562 }
1563
1564 size_t last_old_idx = calc_bucket_idx(last_seen_hash, h->b->order);
1565 size_t move_src_idx = grow_idx(shrink_idx(last_old_idx));
1566
1567 /*
1568 * Last node seen was in the joining bucket - if the searched
1569 * for node is there we will find it.
1570 */
1571 if (move_src_idx != last_old_idx)
1572 return true;
1573 }
1574
1575 /*
1576 * Reached the end of the bucket but no nodes from the joining bucket
1577 * were seen. There should have at least been a JOIN node so we have
1578 * definitely not seen (and followed) the joining link. Make the link
1579 * visible and retry.
1580 */
1581 read_barrier();
1582 return false;
1583}
1584
1585/** When resizing, returns the bucket head to start the search with in \a phead.
1586 *
1587 * If a resize had been detected (eg cht_t.b.head[idx] is marked immutable),
1588 * upd_resizing_head() moves the bucket for \a hash from the old head
1589 * to the new head. Moreover, it splits or joins buckets as necessary.
1590 *
1591 * @param h CHT to operate on.
1592 * @param hash Hash of an item whose chain we would like to traverse.
1593 * @param[out] phead Head of the bucket to search for \a hash.
1594 * @param[out] join_finishing Set to true if a bucket join might be
1595 * in progress and the bucket may have to be traversed again
1596 * as indicated by join_completed().
1597 * @param[out] walk_mode Specifies how to interpret node marks.
1598 */
1599static void upd_resizing_head(cht_t *h, size_t hash, marked_ptr_t **phead,
1600 bool *join_finishing, walk_mode_t *walk_mode)
1601{
1602 cht_buckets_t *b = rcu_access(h->b);
1603 size_t old_idx = calc_bucket_idx(hash, b->order);
1604 size_t new_idx = calc_bucket_idx(hash, h->new_b->order);
1605
1606 marked_ptr_t *pold_head = &b->head[old_idx];
1607 marked_ptr_t *pnew_head = &h->new_b->head[new_idx];
1608
1609 /* In any case, use the bucket in the new table. */
1610 *phead = pnew_head;
1611
1612 /* Growing the table. */
1613 if (b->order < h->new_b->order) {
1614 size_t move_dest_idx = grow_idx(old_idx);
1615 marked_ptr_t *pmoved_head = &h->new_b->head[move_dest_idx];
1616
1617 /* Complete moving the bucket from the old to the new table. */
1618 help_head_move(pold_head, pmoved_head);
1619
1620 /* The hash belongs to the moved bucket. */
1621 if (move_dest_idx == new_idx) {
1622 assert(pmoved_head == pnew_head);
1623 /*
1624 * move_head() makes the new head of the moved bucket visible.
1625			 * The new head may be marked with a JOIN_FOLLOWS mark.
1626 */
1627 assert(!(N_CONST & get_mark(*pmoved_head)));
1628 *walk_mode = WM_MOVE_JOIN_FOLLOWS;
1629 } else {
1630 assert(pmoved_head != pnew_head);
1631 /*
1632 * The hash belongs to the bucket that is the result of splitting
1633 * the old/moved bucket, ie the bucket that contains the second
1634 * half of the split/old/moved bucket.
1635 */
1636
1637 /* The moved bucket has not yet been split. */
1638 if (N_NORMAL != get_mark(*pnew_head)) {
1639 size_t split_hash = calc_split_hash(new_idx, h->new_b->order);
1640 split_bucket(h, pmoved_head, pnew_head, split_hash);
1641 /*
1642 * split_bucket() makes the new head visible. No
1643 * JOIN_FOLLOWS in this part of split bucket.
1644 */
1645 assert(N_NORMAL == get_mark(*pnew_head));
1646 }
1647
1648 *walk_mode = WM_LEAVE_JOIN;
1649 }
1650	} else if (h->new_b->order < b->order) {
1651 /* Shrinking the table. */
1652
1653 size_t move_src_idx = grow_idx(new_idx);
1654
1655 /*
1656 * Complete moving the bucket from the old to the new table.
1657 * Makes a valid pnew_head visible if already moved.
1658 */
1659 help_head_move(&b->head[move_src_idx], pnew_head);
1660
1661 /* Hash belongs to the bucket to be joined with the moved bucket. */
1662 if (move_src_idx != old_idx) {
1663 /* Bucket join not yet completed. */
1664 if (N_INVALID != get_mark(*pold_head)) {
1665 size_t split_hash = calc_split_hash(old_idx, b->order);
1666 join_buckets(h, pold_head, pnew_head, split_hash);
1667 }
1668
1669 /*
1670 * The resizer sets pold_head to &sentinel when all cpus are
1671 * guaranteed to see the bucket join.
1672 */
1673 *join_finishing = (&sentinel != get_next(*pold_head));
1674 }
1675
1676 /* move_head() or join_buckets() makes it so or makes the mark visible.*/
1677 assert(N_INVALID == get_mark(*pold_head));
1678 /* move_head() makes it visible. No JOIN_FOLLOWS used when shrinking. */
1679 assert(N_NORMAL == get_mark(*pnew_head));
1680
1681 *walk_mode = WM_LEAVE_JOIN;
1682 } else {
1683 /*
1684 * Final stage of resize. The resizer is waiting for all
1685 * readers to notice that the old table had been replaced.
1686 */
1687 assert(b == h->new_b);
1688 *walk_mode = WM_NORMAL;
1689 }
1690}
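
/*
 * Simplified sketch (excluded from the build) of how a reader might combine
 * upd_resizing_head(), find_wnd_and_gc() and join_completed(). The check of
 * the old head's mark and the retry policy are assumptions for illustration;
 * the real lookup and update paths of this table handle more cases.
 */
#if 0
static bool example_walk_bucket(cht_t *h, size_t hash, wnd_t *wnd)
{
	cht_buckets_t *b = rcu_access(h->b);
	marked_ptr_t *phead = &b->head[calc_bucket_idx(hash, b->order)];
	bool join_finishing = false;
	walk_mode_t walk_mode = WM_NORMAL;
	bool resizing = false;

	/* A non-normal head mark means a resize is (or was) in progress. */
	if (N_NORMAL != get_mark(*phead))
		upd_resizing_head(h, hash, &phead, &join_finishing, &walk_mode);

	wnd->ppred = phead;
	wnd->cur = get_next(*phead);

	/* Failed to GC a deleted node - restart the walk. */
	if (!find_wnd_and_gc(h, hash, walk_mode, wnd, &resizing))
		return false;

	/* An unnoticed resize started midway - restart in resize mode. */
	if (resizing)
		return false;

	/* A bucket join might not be fully visible yet - retry if it was not. */
	if (join_finishing && !join_completed(h, wnd))
		return false;

	return true;
}
#endif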
1691
1692
1693#if 0
1694static void move_head(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head)
1695{
1696 start_head_move(psrc_head);
1697 cas_order_barrier();
1698 complete_head_move(psrc_head, pdest_head);
1699}
1700#endif
1701
1702/** Moves an immutable head \a psrc_head of cht_t.b to \a pdest_head of cht_t.new_b.
1703 *
1704 * The function guarantees the move will be visible on this cpu once
1705 * it completes. In particular, *pdest_head will not be N_INVALID.
1706 *
1707 * Unlike complete_head_move(), help_head_move() checks if the head had already
1708 * been moved and tries to avoid moving the bucket heads if possible.
1709 */
1710static inline void help_head_move(marked_ptr_t *psrc_head,
1711 marked_ptr_t *pdest_head)
1712{
1713	/* Head move has to be in progress already when calling this func. */
1714 assert(N_CONST & get_mark(*psrc_head));
1715
1716 /* Head already moved. */
1717 if (N_INVALID == get_mark(*psrc_head)) {
1718 /* Effects of the head move have not yet propagated to this cpu. */
1719 if (N_INVALID == get_mark(*pdest_head)) {
1720 /* Make the move visible on this cpu. */
1721 read_barrier();
1722 }
1723 } else {
1724 complete_head_move(psrc_head, pdest_head);
1725 }
1726
1727 assert(!(N_CONST & get_mark(*pdest_head)));
1728}
1729
1730/** Initiates the move of the old head \a psrc_head.
1731 *
1732 * The move may be completed with help_head_move().
1733 */
1734static void start_head_move(marked_ptr_t *psrc_head)
1735{
1736 /* Mark src head immutable. */
1737 mark_const(psrc_head);
1738}
1739
1740/** Marks the head immutable. */
1741static void mark_const(marked_ptr_t *psrc_head)
1742{
1743 marked_ptr_t ret, src_link;
1744
1745 /* Mark src head immutable. */
1746 do {
1747 cht_link_t *next = get_next(*psrc_head);
1748 src_link = make_link(next, N_NORMAL);
1749
1750 /* Mark the normal/clean src link immutable/const. */
1751 ret = cas_link(psrc_head, next, N_NORMAL, next, N_CONST);
1752 } while(ret != src_link && !(N_CONST & get_mark(ret)));
1753}
1754
1755/** Completes moving head psrc_head to pdest_head (started by start_head_move()).*/
1756static void complete_head_move(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head)
1757{
1758 assert(N_JOIN_FOLLOWS != get_mark(*psrc_head));
1759 assert(N_CONST & get_mark(*psrc_head));
1760
1761 cht_link_t *next = get_next(*psrc_head);
1762
1763 DBG(marked_ptr_t ret = )
1764 cas_link(pdest_head, &sentinel, N_INVALID, next, N_NORMAL);
1765 assert(ret == make_link(&sentinel, N_INVALID) || (N_NORMAL == get_mark(ret)));
1766 cas_order_barrier();
1767
1768 DBG(ret = )
1769 cas_link(psrc_head, next, N_CONST, next, N_INVALID);
1770 assert(ret == make_link(next, N_CONST) || (N_INVALID == get_mark(ret)));
1771 cas_order_barrier();
1772}
1773
1774/** Splits the bucket at psrc_head and links to the remainder from pdest_head.
1775 *
1776 * Items with hashes greater or equal to \a split_hash are moved to bucket
1777 * with head at \a pdest_head.
1778 *
1779 * @param h CHT to operate on.
1780 * @param psrc_head Head of the bucket to split (in cht_t.new_b).
1781 * @param pdest_head Head of the bucket that points to the second part
1782 * of the split bucket in psrc_head. (in cht_t.new_b)
1783 * @param split_hash Hash of the first possible item in the remainder of
1784 * psrc_head, ie the smallest hash pdest_head is allowed
1785 * to point to.
1786 */
1787static void split_bucket(cht_t *h, marked_ptr_t *psrc_head,
1788 marked_ptr_t *pdest_head, size_t split_hash)
1789{
1790 /* Already split. */
1791 if (N_NORMAL == get_mark(*pdest_head))
1792 return;
1793
1794 /*
1795 * L == Last node of the first part of the split bucket. That part
1796 * remains in the original/src bucket.
1797 * F == First node of the second part of the split bucket. That part
1798 * will be referenced from the dest bucket head.
1799 *
1800 * We want to first mark a clean L as JF so that updaters unaware of
1801 * the split (or table resize):
1802 * - do not insert a new node between L and F
1803 * - do not unlink L (that is why it has to be clean/normal)
1804 * - do not unlink F
1805 *
1806 * Then we can safely mark F as JN even if it has been marked deleted.
1807 * Once F is marked as JN updaters aware of table resize will not
1808 * attempt to unlink it (JN will have two predecessors - we cannot
1809 * safely unlink from both at the same time). Updaters unaware of
1810 * ongoing resize can reach F only via L and that node is already
1811 * marked JF, so they won't unlink F.
1812 *
1813 * Last, link the new/dest head to F.
1814 *
1815 *
1816 * 0) ,-- split_hash, first hash of the dest bucket
1817 * v
1818 * [src_head | N] -> .. -> [L] -> [F]
1819 * [dest_head | Inv]
1820 *
1821 * 1) ,-- split_hash
1822 * v
1823 * [src_head | N] -> .. -> [JF] -> [F]
1824 * [dest_head | Inv]
1825 *
1826 * 2) ,-- split_hash
1827 * v
1828 * [src_head | N] -> .. -> [JF] -> [JN]
1829 * [dest_head | Inv]
1830 *
1831 * 3) ,-- split_hash
1832 * v
1833 * [src_head | N] -> .. -> [JF] -> [JN]
1834 * ^
1835 * [dest_head | N] -----------------'
1836 */
1837 wnd_t wnd;
1838
1839 rcu_read_lock();
1840
1841 /* Mark the last node of the first part of the split bucket as JF. */
1842 mark_join_follows(h, psrc_head, split_hash, &wnd);
1843 cas_order_barrier();
1844
1845 /* There are nodes in the dest bucket, ie the second part of the split. */
1846 if (wnd.cur != &sentinel) {
1847 /*
1848 * Mark the first node of the dest bucket as a join node so
1849 * updaters do not attempt to unlink it if it is deleted.
1850 */
1851 mark_join_node(wnd.cur);
1852 cas_order_barrier();
1853 } else {
1854 /*
1855 * Second part of the split bucket is empty. There are no nodes
1856 * to mark as JOIN nodes and there never will be.
1857 */
1858 }
1859
1860 /* Link the dest head to the second part of the split. */
1861 DBG(marked_ptr_t ret = )
1862 cas_link(pdest_head, &sentinel, N_INVALID, wnd.cur, N_NORMAL);
1863 assert(ret == make_link(&sentinel, N_INVALID) || (N_NORMAL == get_mark(ret)));
1864 cas_order_barrier();
1865
1866 rcu_read_unlock();
1867}
1868
1869/** Finds and marks the last node of psrc_head w/ hash less than split_hash.
1870 *
1871 * Finds a node in psrc_head with the greatest hash that is strictly less
1872 * than split_hash and marks it with N_JOIN_FOLLOWS.
1873 *
1874 * Returns a window pointing to that node.
1875 *
1876 * Any logically deleted nodes along the way are
1877 * garbage collected; therefore, the predecessor node (if any) will most
1878 * likely not be marked N_DELETED.
1879 *
1880 * @param h CHT to operate on.
1881 * @param psrc_head Bucket head.
1882 * @param split_hash The smallest hash a join node (ie the node following
1883 * the desired join-follows node) may have.
1884 * @param[out] wnd Points to the node marked with N_JOIN_FOLLOWS.
1885 */
1886static void mark_join_follows(cht_t *h, marked_ptr_t *psrc_head,
1887 size_t split_hash, wnd_t *wnd)
1888{
1889 /* See comment in split_bucket(). */
1890
1891 bool done = false;
1892
1893 do {
1894 bool resizing = false;
1895 wnd->ppred = psrc_head;
1896 wnd->cur = get_next(*psrc_head);
1897
1898 /*
1899 * Find the split window, ie the last node of the first part of
1900		 * the split bucket and its successor - the first node of
1901 * the second part of the split bucket. Retry if GC failed.
1902 */
1903 if (!find_wnd_and_gc(h, split_hash, WM_MOVE_JOIN_FOLLOWS, wnd, &resizing))
1904 continue;
1905
1906 /* Must not report that the table is resizing if WM_MOVE_JOIN_FOLLOWS.*/
1907 assert(!resizing);
1908 /*
1909 * Mark the last node of the first half of the split bucket
1910 * that a join node follows. It must be clean/normal.
1911 */
1912 marked_ptr_t ret
1913 = cas_link(wnd->ppred, wnd->cur, N_NORMAL, wnd->cur, N_JOIN_FOLLOWS);
1914
1915 /*
1916 * Successfully marked as a JF node or already marked that way (even
1917 * if also marked deleted - unlinking the node will move the JF mark).
1918 */
1919 done = (ret == make_link(wnd->cur, N_NORMAL))
1920 || (N_JOIN_FOLLOWS & get_mark(ret));
1921 } while (!done);
1922}
1923
1924/** Marks join_node with N_JOIN. */
1925static void mark_join_node(cht_link_t *join_node)
1926{
1927 /* See comment in split_bucket(). */
1928
1929 bool done;
1930 do {
1931 cht_link_t *next = get_next(join_node->link);
1932 mark_t mark = get_mark(join_node->link);
1933
1934 /*
1935 * May already be marked as deleted, but it won't be unlinked
1936 * because its predecessor is marked with JOIN_FOLLOWS or CONST.
1937 */
1938 marked_ptr_t ret
1939 = cas_link(&join_node->link, next, mark, next, mark | N_JOIN);
1940
1941 /* Successfully marked or already marked as a join node. */
1942 done = (ret == make_link(next, mark))
1943 || (N_JOIN & get_mark(ret));
1944 } while(!done);
1945}
1946
1947/** Appends the bucket at psrc_head to the bucket at pdest_head.
1948 *
1949 * @param h CHT to operate on.
1950 * @param psrc_head Bucket to merge with pdest_head.
1951 * @param pdest_head Bucket to be joined by psrc_head.
1952 * @param split_hash The smallest hash psrc_head may contain.
1953 */
1954static void join_buckets(cht_t *h, marked_ptr_t *psrc_head,
1955 marked_ptr_t *pdest_head, size_t split_hash)
1956{
1957 /* Buckets already joined. */
1958 if (N_INVALID == get_mark(*psrc_head))
1959 return;
1960 /*
1961 * F == First node of psrc_head, ie the bucket we want to append
1962 * to (ie join with) the bucket starting at pdest_head.
1963 * L == Last node of pdest_head, ie the bucket that psrc_head will
1964 * be appended to.
1965 *
1966 * (1) We first mark psrc_head immutable to signal that a join is
1967 * in progress and so that updaters unaware of the join (or table
1968 * resize):
1969 * - do not insert new nodes between the head psrc_head and F
1970 * - do not unlink F (it may already be marked deleted)
1971 *
1972 * (2) Next, F is marked as a join node. Updaters aware of table resize
1973 * will not attempt to unlink it. We cannot safely/atomically unlink
1974 * the join node because it will be pointed to from two different
1975 * buckets. Updaters unaware of resize will fail to unlink the join
1976 * node due to the head being marked immutable.
1977 *
1978 * (3) Then the tail of the bucket at pdest_head is linked to the join
1979 * node. From now on, nodes in both buckets can be found via pdest_head.
1980 *
1981 * (4) Last, mark immutable psrc_head as invalid. It signals updaters
1982 * that the join is complete and they can insert new nodes (originally
1983 * destined for psrc_head) into pdest_head.
1984 *
1985 * Note that pdest_head keeps pointing at the join node. This allows
1986 * lookups and updaters to determine if they should see a link between
1987 * the tail L and F when searching for nodes originally in psrc_head
1988 * via pdest_head. If they reach the tail of pdest_head without
1989 * encountering any nodes of psrc_head, either there were no nodes
1990 * in psrc_head to begin with or the link between L and F did not
1991	 * yet propagate to their cpus. If psrc_head was empty, it keeps pointing
1992	 * to the sentinel. Otherwise psrc_head points to a join node (it will not be
1993 * unlinked until table resize completes) and updaters/lookups
1994 * should issue a read_barrier() to make the link [L]->[JN] visible.
1995 *
1996 * 0) ,-- split_hash, first hash of the src bucket
1997 * v
1998 * [dest_head | N]-> .. -> [L]
1999 * [src_head | N]--> [F] -> ..
2000 * ^
2001 * ` split_hash, first hash of the src bucket
2002 *
2003 * 1) ,-- split_hash
2004 * v
2005 * [dest_head | N]-> .. -> [L]
2006 * [src_head | C]--> [F] -> ..
2007 *
2008 * 2) ,-- split_hash
2009 * v
2010 * [dest_head | N]-> .. -> [L]
2011 * [src_head | C]--> [JN] -> ..
2012 *
2013 * 3) ,-- split_hash
2014 * v
2015 * [dest_head | N]-> .. -> [L] --+
2016 * v
2017 * [src_head | C]-------------> [JN] -> ..
2018 *
2019 * 4) ,-- split_hash
2020 * v
2021 * [dest_head | N]-> .. -> [L] --+
2022 * v
2023 * [src_head | Inv]-----------> [JN] -> ..
2024 */
2025
2026 rcu_read_lock();
2027
2028 /* Mark src_head immutable - signals updaters that bucket join started. */
2029 mark_const(psrc_head);
2030 cas_order_barrier();
2031
2032 cht_link_t *join_node = get_next(*psrc_head);
2033
2034 if (join_node != &sentinel) {
2035 mark_join_node(join_node);
2036 cas_order_barrier();
2037
2038 link_to_join_node(h, pdest_head, join_node, split_hash);
2039 cas_order_barrier();
2040 }
2041
2042 DBG(marked_ptr_t ret = )
2043 cas_link(psrc_head, join_node, N_CONST, join_node, N_INVALID);
2044 assert(ret == make_link(join_node, N_CONST) || (N_INVALID == get_mark(ret)));
2045 cas_order_barrier();
2046
2047 rcu_read_unlock();
2048}
2049
2050/** Links the tail of pdest_head to join_node.
2051 *
2052 * @param h CHT to operate on.
2053 * @param pdest_head Head of the bucket whose tail is to be linked to join_node.
2054 * @param join_node A node marked N_JOIN with a hash greater or equal to
2055 * split_hash.
2056 * @param split_hash The least hash that is greater than the hash of any items
2057 * (originally) in pdest_head.
2058 */
2059static void link_to_join_node(cht_t *h, marked_ptr_t *pdest_head,
2060 cht_link_t *join_node, size_t split_hash)
2061{
2062 bool done = false;
2063
2064 do {
2065 wnd_t wnd = {
2066 .ppred = pdest_head,
2067 .cur = get_next(*pdest_head)
2068 };
2069
2070 bool resizing = false;
2071
2072 if (!find_wnd_and_gc(h, split_hash, WM_LEAVE_JOIN, &wnd, &resizing))
2073 continue;
2074
2075 assert(!resizing);
2076
2077 if (wnd.cur != &sentinel) {
2078 /* Must be from the new appended bucket. */
2079 assert(split_hash <= node_hash(h, wnd.cur)
2080 || h->invalid_hash == node_hash(h, wnd.cur));
2081 return;
2082 }
2083
2084 /* Reached the tail of pdest_head - link it to the join node. */
2085 marked_ptr_t ret =
2086 cas_link(wnd.ppred, &sentinel, N_NORMAL, join_node, N_NORMAL);
2087
2088 done = (ret == make_link(&sentinel, N_NORMAL));
2089 } while (!done);
2090}
2091
2092/** Instructs RCU to free the item once all preexisting references are dropped.
2093 *
2094 * The item is freed via op->remove_callback().
2095 */
2096static void free_later(cht_t *h, cht_link_t *item)
2097{
2098 assert(item != &sentinel);
2099
2100 /*
2101 * remove_callback only works as rcu_func_t because rcu_link is the first
2102 * field in cht_link_t.
2103 */
2104 rcu_call(&item->rcu_link, (rcu_func_t)h->op->remove_callback);
2105
2106 item_removed(h);
2107}
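
/*
 * Illustrative sketch of the layout requirement mentioned above (excluded
 * from the build). Because rcu_link is the first member of cht_link_t, the
 * &item->rcu_link pointer handed to rcu_call() is also a valid pointer to
 * the link itself, so the table's remove_callback can serve directly as the
 * RCU callback. The item type and callback below are hypothetical client
 * code, not part of this file.
 */
#if 0
typedef struct example_item {
	cht_link_t link;	/* Embedded table link; rcu_link is its first member. */
	int key;
} example_item_t;

static void example_remove_callback(cht_link_t *item)
{
	/* Recover the enclosing item and free it. */
	free(member_to_inst(item, example_item_t, link));
}
#endif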
2108
2109/** Notes that an item had been unlinked from the table and shrinks it if needed.
2110 *
2111 * If the number of items in the table drops below 1/4 of the maximum
2112 * allowed load the table is shrunk in the background.
2113 */
2114static inline void item_removed(cht_t *h)
2115{
2116 size_t items = (size_t) atomic_predec(&h->item_cnt);
2117 size_t bucket_cnt = (1 << h->b->order);
2118
2119 bool need_shrink = (items == h->max_load * bucket_cnt / 4);
2120 bool missed_shrink = (items == h->max_load * bucket_cnt / 8);
2121
2122 if ((need_shrink || missed_shrink) && h->b->order > h->min_order) {
2123 atomic_count_t resize_reqs = atomic_preinc(&h->resize_reqs);
2124 /* The first resize request. Start the resizer. */
2125 if (1 == resize_reqs) {
2126 workq_global_enqueue_noblock(&h->resize_work, resize_table);
2127 }
2128 }
2129}
2130
2131/** Notes an item had been inserted and grows the table if needed.
2132 *
2133 * The table is resized in the background.
2134 */
2135static inline void item_inserted(cht_t *h)
2136{
2137 size_t items = (size_t) atomic_preinc(&h->item_cnt);
2138 size_t bucket_cnt = (1 << h->b->order);
2139
2140 bool need_grow = (items == h->max_load * bucket_cnt);
2141 bool missed_grow = (items == 2 * h->max_load * bucket_cnt);
2142
2143 if ((need_grow || missed_grow) && h->b->order < CHT_MAX_ORDER) {
2144 atomic_count_t resize_reqs = atomic_preinc(&h->resize_reqs);
2145 /* The first resize request. Start the resizer. */
2146 if (1 == resize_reqs) {
2147 workq_global_enqueue_noblock(&h->resize_work, resize_table);
2148 }
2149 }
2150}
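
/*
 * Worked example of the resize triggers above (the max_load value is just an
 * illustration, not necessarily the configured default): with max_load == 4
 * and order == 10 there are 1024 buckets, so item_inserted() requests a grow
 * when the item count reaches 4 * 1024 == 4096 and item_removed() requests a
 * shrink when it drops to 4 * 1024 / 4 == 1024. The missed_grow/missed_shrink
 * checks fire at 8192 and 512 items respectively, giving a second trigger
 * point in case the first request did not lead to an actual resize (eg the
 * new bucket array could not be allocated).
 */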
2151
2152/** Resize request handler. Invoked on the system work queue. */
2153static void resize_table(work_t *arg)
2154{
2155 cht_t *h = member_to_inst(arg, cht_t, resize_work);
2156
2157#ifdef CONFIG_DEBUG
2158 assert(h->b);
2159 /* Make resize_reqs visible. */
2160 read_barrier();
2161 assert(0 < atomic_get(&h->resize_reqs));
2162#endif
2163
2164 bool done = false;
2165
2166 do {
2167 /* Load the most recent h->item_cnt. */
2168 read_barrier();
2169 size_t cur_items = (size_t) atomic_get(&h->item_cnt);
2170 size_t bucket_cnt = (1 << h->b->order);
2171 size_t max_items = h->max_load * bucket_cnt;
2172
2173 if (cur_items >= max_items && h->b->order < CHT_MAX_ORDER) {
2174 grow_table(h);
2175 } else if (cur_items <= max_items / 4 && h->b->order > h->min_order) {
2176 shrink_table(h);
2177 } else {
2178 /* Table is just the right size. */
2179 atomic_count_t reqs = atomic_predec(&h->resize_reqs);
2180 done = (reqs == 0);
2181 }
2182 } while (!done);
2183}
2184
2185/** Increases the number of buckets two-fold. Blocks until done. */
2186static void grow_table(cht_t *h)
2187{
2188 if (h->b->order >= CHT_MAX_ORDER)
2189 return;
2190
2191 h->new_b = alloc_buckets(h->b->order + 1, true, false);
2192
2193 /* Failed to alloc a new table - try next time the resizer is run. */
2194 if (!h->new_b)
2195 return;
2196
2197 /* Wait for all readers and updaters to see the initialized new table. */
2198 rcu_synchronize();
2199 size_t old_bucket_cnt = (1 << h->b->order);
2200
2201 /*
2202 * Give updaters a chance to help out with the resize. Do the minimum
2203 * work needed to announce a resize is in progress, ie start moving heads.
2204 */
2205 for (size_t idx = 0; idx < old_bucket_cnt; ++idx) {
2206 start_head_move(&h->b->head[idx]);
2207 }
2208
2209 /* Order start_head_move() wrt complete_head_move(). */
2210 cas_order_barrier();
2211
2212 /* Complete moving heads and split any buckets not yet split by updaters. */
2213 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2214 marked_ptr_t *move_dest_head = &h->new_b->head[grow_idx(old_idx)];
2215 marked_ptr_t *move_src_head = &h->b->head[old_idx];
2216
2217 /* Head move not yet completed. */
2218 if (N_INVALID != get_mark(*move_src_head)) {
2219 complete_head_move(move_src_head, move_dest_head);
2220 }
2221
2222 size_t split_idx = grow_to_split_idx(old_idx);
2223 size_t split_hash = calc_split_hash(split_idx, h->new_b->order);
2224 marked_ptr_t *split_dest_head = &h->new_b->head[split_idx];
2225
2226 split_bucket(h, move_dest_head, split_dest_head, split_hash);
2227 }
2228
2229 /*
2230 * Wait for all updaters to notice the new heads. Once everyone sees
2231 * the invalid old bucket heads they will know a resize is in progress
2232 * and updaters will modify the correct new buckets.
2233 */
2234 rcu_synchronize();
2235
2236 /* Clear the JOIN_FOLLOWS mark and remove the link between the split buckets.*/
2237 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2238 size_t new_idx = grow_idx(old_idx);
2239
2240 cleanup_join_follows(h, &h->new_b->head[new_idx]);
2241 }
2242
2243 /*
2244 * Wait for everyone to notice that buckets were split, ie link connecting
2245 * the join follows and join node has been cut.
2246 */
2247 rcu_synchronize();
2248
2249 /* Clear the JOIN mark and GC any deleted join nodes. */
2250 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2251 size_t new_idx = grow_to_split_idx(old_idx);
2252
2253 cleanup_join_node(h, &h->new_b->head[new_idx]);
2254 }
2255
2256 /* Wait for everyone to see that the table is clear of any resize marks. */
2257 rcu_synchronize();
2258
2259 cht_buckets_t *old_b = h->b;
2260 rcu_assign(h->b, h->new_b);
2261
2262 /* Wait for everyone to start using the new table. */
2263 rcu_synchronize();
2264
2265 free(old_b);
2266
2267 /* Not needed; just for increased readability. */
2268 h->new_b = NULL;
2269}
2270
2271/** Halves the number of buckets. Blocks until done. */
2272static void shrink_table(cht_t *h)
2273{
2274 if (h->b->order <= h->min_order)
2275 return;
2276
2277 h->new_b = alloc_buckets(h->b->order - 1, true, false);
2278
2279 /* Failed to alloc a new table - try next time the resizer is run. */
2280 if (!h->new_b)
2281 return;
2282
2283 /* Wait for all readers and updaters to see the initialized new table. */
2284 rcu_synchronize();
2285
2286 size_t old_bucket_cnt = (1 << h->b->order);
2287
2288 /*
2289 * Give updaters a chance to help out with the resize. Do the minimum
2290 * work needed to announce a resize is in progress, ie start moving heads.
2291 */
2292 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2293 size_t new_idx = shrink_idx(old_idx);
2294
2295 /* This bucket should be moved. */
2296 if (grow_idx(new_idx) == old_idx) {
2297 start_head_move(&h->b->head[old_idx]);
2298 } else {
2299 /* This bucket should join the moved bucket once the move is done.*/
2300 }
2301 }
2302
2303	/* Order start_head_move() wrt complete_head_move(). */
2304 cas_order_barrier();
2305
2306 /* Complete moving heads and join buckets with the moved buckets. */
2307 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2308 size_t new_idx = shrink_idx(old_idx);
2309 size_t move_src_idx = grow_idx(new_idx);
2310
2311 /* This bucket should be moved. */
2312 if (move_src_idx == old_idx) {
2313 /* Head move not yet completed. */
2314 if (N_INVALID != get_mark(h->b->head[old_idx])) {
2315 complete_head_move(&h->b->head[old_idx], &h->new_b->head[new_idx]);
2316 }
2317 } else {
2318 /* This bucket should join the moved bucket. */
2319 size_t split_hash = calc_split_hash(old_idx, h->b->order);
2320 join_buckets(h, &h->b->head[old_idx], &h->new_b->head[new_idx],
2321 split_hash);
2322 }
2323 }
2324
2325 /*
2326 * Wait for all updaters to notice the new heads. Once everyone sees
2327 * the invalid old bucket heads they will know a resize is in progress
2328 * and updaters will modify the correct new buckets.
2329 */
2330 rcu_synchronize();
2331
2332 /* Let everyone know joins are complete and fully visible. */
2333 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2334 size_t move_src_idx = grow_idx(shrink_idx(old_idx));
2335
2336		/* Reset the invalid joinee head to point to the sentinel. */
2337 if (old_idx != move_src_idx) {
2338 assert(N_INVALID == get_mark(h->b->head[old_idx]));
2339
2340 if (&sentinel != get_next(h->b->head[old_idx]))
2341 h->b->head[old_idx] = make_link(&sentinel, N_INVALID);
2342 }
2343 }
2344
2345	/* Wait for everyone to see the reset joinee heads before GCing the deleted join nodes. */
2346 rcu_synchronize();
2347
2348 size_t new_bucket_cnt = (1 << h->new_b->order);
2349
2350 /* Clear the JOIN mark and GC any deleted join nodes. */
2351 for (size_t new_idx = 0; new_idx < new_bucket_cnt; ++new_idx) {
2352 cleanup_join_node(h, &h->new_b->head[new_idx]);
2353 }
2354
2355 /* Wait for everyone to see that the table is clear of any resize marks. */
2356 rcu_synchronize();
2357
2358 cht_buckets_t *old_b = h->b;
2359 rcu_assign(h->b, h->new_b);
2360
2361 /* Wait for everyone to start using the new table. */
2362 rcu_synchronize();
2363
2364 free(old_b);
2365
2366 /* Not needed; just for increased readability. */
2367 h->new_b = NULL;
2368}
2369
2370/** Finds and clears the N_JOIN mark from a node in new_head (if present). */
2371static void cleanup_join_node(cht_t *h, marked_ptr_t *new_head)
2372{
2373 rcu_read_lock();
2374
2375 cht_link_t *cur = get_next(*new_head);
2376
2377 while (cur != &sentinel) {
2378 /* Clear the join node's JN mark - even if it is marked as deleted. */
2379 if (N_JOIN & get_mark(cur->link)) {
2380 clear_join_and_gc(h, cur, new_head);
2381 break;
2382 }
2383
2384 cur = get_next(cur->link);
2385 }
2386
2387 rcu_read_unlock();
2388}
2389
2390/** Clears the join_node's N_JOIN mark and frees it if marked N_DELETED as well. */
2391static void clear_join_and_gc(cht_t *h, cht_link_t *join_node,
2392 marked_ptr_t *new_head)
2393{
2394 assert(join_node != &sentinel);
2395 assert(join_node && (N_JOIN & get_mark(join_node->link)));
2396
2397 bool done;
2398
2399 /* Clear the JN mark. */
2400 do {
2401 marked_ptr_t jn_link = join_node->link;
2402 cht_link_t *next = get_next(jn_link);
2403 /* Clear the JOIN mark but keep the DEL mark if present. */
2404 mark_t cleared_mark = get_mark(jn_link) & N_DELETED;
2405
2406 marked_ptr_t ret =
2407 _cas_link(&join_node->link, jn_link, make_link(next, cleared_mark));
2408
2409 /* Done if the mark was cleared. Retry if a new node was inserted. */
2410 done = (ret == jn_link);
2411 assert(ret == jn_link || (get_mark(ret) & N_JOIN));
2412 } while (!done);
2413
2414 if (!(N_DELETED & get_mark(join_node->link)))
2415 return;
2416
2417 /* The join node had been marked as deleted - GC it. */
2418
2419 /* Clear the JOIN mark before trying to unlink the deleted join node.*/
2420 cas_order_barrier();
2421
2422 size_t jn_hash = node_hash(h, join_node);
2423 do {
2424 bool resizing = false;
2425
2426 wnd_t wnd = {
2427 .ppred = new_head,
2428 .cur = get_next(*new_head)
2429 };
2430
2431 done = find_wnd_and_gc_pred(h, jn_hash, WM_NORMAL, same_node_pred,
2432 join_node, &wnd, &resizing);
2433
2434 assert(!resizing);
2435 } while (!done);
2436}
2437
2438/** Finds a non-deleted node with N_JOIN_FOLLOWS and clears the mark. */
2439static void cleanup_join_follows(cht_t *h, marked_ptr_t *new_head)
2440{
2441 assert(new_head);
2442
2443 rcu_read_lock();
2444
2445 wnd_t wnd = {
2446 .ppred = NULL,
2447 .cur = NULL
2448 };
2449 marked_ptr_t *cur_link = new_head;
2450
2451 /*
2452 * Find the non-deleted node with a JF mark and clear the JF mark.
2453 * The JF node may be deleted and/or the mark moved to its neighbors
2454 * at any time. Therefore, we GC deleted nodes until we find the JF
2455 * node in order to remove stale/deleted JF nodes left behind eg by
2456 * delayed threads that did not yet get a chance to unlink the deleted
2457 * JF node and move its mark.
2458 *
2459 * Note that the head may be marked JF (but never DELETED).
2460 */
2461 while (true) {
2462 bool is_jf_node = N_JOIN_FOLLOWS & get_mark(*cur_link);
2463
2464 /* GC any deleted nodes on the way - even deleted JOIN_FOLLOWS. */
2465 if (N_DELETED & get_mark(*cur_link)) {
2466 assert(cur_link != new_head);
2467 assert(wnd.ppred && wnd.cur && wnd.cur != &sentinel);
2468 assert(cur_link == &wnd.cur->link);
2469
2470 bool dummy;
2471 bool deleted = gc_deleted_node(h, WM_MOVE_JOIN_FOLLOWS, &wnd, &dummy);
2472
2473 /* Failed to GC or collected a deleted JOIN_FOLLOWS. */
2474 if (!deleted || is_jf_node) {
2475 /* Retry from the head of the bucket. */
2476 cur_link = new_head;
2477 continue;
2478 }
2479 } else {
2480 /* Found a non-deleted JF. Clear its JF mark. */
2481 if (is_jf_node) {
2482 cht_link_t *next = get_next(*cur_link);
2483 marked_ptr_t ret =
2484 cas_link(cur_link, next, N_JOIN_FOLLOWS, &sentinel, N_NORMAL);
2485
2486 assert(next == &sentinel
2487 || ((N_JOIN | N_JOIN_FOLLOWS) & get_mark(ret)));
2488
2489 /* Successfully cleared the JF mark of a non-deleted node. */
2490 if (ret == make_link(next, N_JOIN_FOLLOWS)) {
2491 break;
2492 } else {
2493 /*
2494 * The JF node had been deleted or a new node inserted
2495 * right after it. Retry from the head.
2496 */
2497 cur_link = new_head;
2498 continue;
2499 }
2500 } else {
2501 wnd.ppred = cur_link;
2502 wnd.cur = get_next(*cur_link);
2503 }
2504 }
2505
2506 /* We must encounter a JF node before we reach the end of the bucket. */
2507 assert(wnd.cur && wnd.cur != &sentinel);
2508 cur_link = &wnd.cur->link;
2509 }
2510
2511 rcu_read_unlock();
2512}
2513
2514/** Returns the first possible hash following a bucket split point.
2515 *
2516 * In other words the returned hash is the smallest possible hash
2517 * the remainder of the split bucket may contain.
2518 */
2519static inline size_t calc_split_hash(size_t split_idx, size_t order)
2520{
2521 assert(1 <= order && order <= 8 * sizeof(size_t));
2522 return split_idx << (8 * sizeof(size_t) - order);
2523}
2524
2525/** Returns the bucket head index given the table size order and item hash. */
2526static inline size_t calc_bucket_idx(size_t hash, size_t order)
2527{
2528 assert(1 <= order && order <= 8 * sizeof(size_t));
2529 return hash >> (8 * sizeof(size_t) - order);
2530}
2531
2532/** Returns the index of the bucket receiving the second half of a split bucket when growing. */
2533static inline size_t grow_to_split_idx(size_t old_idx)
2534{
2535 return grow_idx(old_idx) | 1;
2536}
2537
2538/** Returns the destination index of a bucket head when the table is growing. */
2539static inline size_t grow_idx(size_t idx)
2540{
2541 return idx << 1;
2542}
2543
2544/** Returns the destination index of a bucket head when the table is shrinking.*/
2545static inline size_t shrink_idx(size_t idx)
2546{
2547 return idx >> 1;
2548}
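
/*
 * Worked example of the index and split-hash helpers above (excluded from
 * the build; the 64-bit size_t in the last two asserts is an assumption).
 */
#if 0
static void example_index_math(void)
{
	/*
	 * Growing from order 3 to order 4: old bucket 5 moves to new index 10
	 * and the upper half of its hash range splits off into index 11.
	 */
	assert(grow_idx(5) == 10);
	assert(grow_to_split_idx(5) == 11);

	/* Shrinking maps both new buckets back onto old index 5. */
	assert(shrink_idx(10) == 5 && shrink_idx(11) == 5);

	/* The split-off bucket may only contain hashes >= 11 << 60 ... */
	assert(calc_split_hash(11, 4) == ((size_t)11 << 60));
	/* ... and such a hash indeed maps to bucket 11 by its top 4 bits. */
	assert(calc_bucket_idx((size_t)11 << 60, 4) == 11);
}
#endif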
2549
2550/** Returns a mixed hash of the search key.*/
2551static inline size_t calc_key_hash(cht_t *h, void *key)
2552{
2553 /* Mimic calc_node_hash. */
2554 return hash_mix(h->op->key_hash(key)) & ~(size_t)1;
2555}
2556
2557/** Returns a memoized mixed hash of the item. */
2558static inline size_t node_hash(cht_t *h, const cht_link_t *item)
2559{
2560 assert(item->hash == h->invalid_hash
2561 || item->hash == sentinel.hash
2562 || item->hash == calc_node_hash(h, item));
2563
2564 return item->hash;
2565}
2566
2567/** Calculates and mixes the hash of the item. */
2568static inline size_t calc_node_hash(cht_t *h, const cht_link_t *item)
2569{
2570 assert(item != &sentinel);
2571 /*
2572 * Clear the lowest order bit in order for sentinel's node hash
2573 * to be the greatest possible.
2574 */
2575 return hash_mix(h->op->hash(item)) & ~(size_t)1;
2576}
2577
2578/** Computes and memoizes the hash of the item. */
2579static inline void memoize_node_hash(cht_t *h, cht_link_t *item)
2580{
2581 item->hash = calc_node_hash(h, item);
2582}
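
/*
 * Example of the hashing scheme above: both key and node hashes have their
 * lowest bit cleared, so they can never equal the sentinel's hash (assumed
 * here to be the all-ones value, ie the greatest possible size_t). Hash
 * ordered bucket walks are therefore guaranteed to stop at the sentinel.
 */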
2583
2584/** Packs the next pointer address and the mark into a single pointer. */
2585static inline marked_ptr_t make_link(const cht_link_t *next, mark_t mark)
2586{
2587 marked_ptr_t ptr = (marked_ptr_t) next;
2588
2589 assert(!(ptr & N_MARK_MASK));
2590 assert(!((unsigned)mark & ~N_MARK_MASK));
2591
2592 return ptr | mark;
2593}
2594
2595/** Strips any marks from the next item link and returns the next item's address.*/
2596static inline cht_link_t * get_next(marked_ptr_t link)
2597{
2598 return (cht_link_t*)(link & ~N_MARK_MASK);
2599}
2600
2601/** Returns the current node's mark stored in the next item link. */
2602static inline mark_t get_mark(marked_ptr_t link)
2603{
2604 return (mark_t)(link & N_MARK_MASK);
2605}
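
/*
 * Illustrative example of the link/mark packing above (excluded from the
 * build). Marks live in the low bits of the (suitably aligned) next pointer,
 * so a pointer and its marks are read and CASed together as a single word.
 */
#if 0
static void example_mark_packing(cht_link_t *next)
{
	/* A node may be marked both deleted and a join node at once. */
	marked_ptr_t link = make_link(next, N_DELETED | N_JOIN);

	assert(get_next(link) == next);
	assert(get_mark(link) == (N_DELETED | N_JOIN));
}
#endif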
2606
2607/** Moves the window by one item so that it points to the next item. */
2608static inline void next_wnd(wnd_t *wnd)
2609{
2610 assert(wnd);
2611 assert(wnd->cur);
2612
2613 wnd->last = wnd->cur;
2614 wnd->ppred = &wnd->cur->link;
2615 wnd->cur = get_next(wnd->cur->link);
2616}
2617
2618/** Predicate that matches only exactly the same node. */
2619static bool same_node_pred(void *node, const cht_link_t *item2)
2620{
2621 const cht_link_t *item1 = (const cht_link_t*) node;
2622 return item1 == item2;
2623}
2624
2625/** Compare-and-swaps a next item link. */
2626static inline marked_ptr_t cas_link(marked_ptr_t *link, const cht_link_t *cur_next,
2627 mark_t cur_mark, const cht_link_t *new_next, mark_t new_mark)
2628{
2629 return _cas_link(link, make_link(cur_next, cur_mark),
2630 make_link(new_next, new_mark));
2631}
2632
2633/** Compare-and-swaps a next item link. */
2634static inline marked_ptr_t _cas_link(marked_ptr_t *link, marked_ptr_t cur,
2635 marked_ptr_t new)
2636{
2637 assert(link != &sentinel.link);
2638 /*
2639 * cas(x) on the same location x on one cpu must be ordered, but do not
2640 * have to be ordered wrt to other cas(y) to a different location y
2641 * on the same cpu.
2642 *
2643 * cas(x) must act as a write barrier on x, ie if cas(x) succeeds
2644 * and is observed by another cpu, then all cpus must be able to
2645 * make the effects of cas(x) visible just by issuing a load barrier.
2646 * For example:
2647 * cpu1 cpu2 cpu3
2648 * cas(x, 0 -> 1), succeeds
2649 * cas(x, 0 -> 1), fails
2650 * MB, to order load of x in cas and store to y
2651 * y = 7
2652 * sees y == 7
2653 * loadMB must be enough to make cas(x) on cpu3 visible to cpu1, ie x == 1.
2654 *
2655 * If cas() did not work this way:
2656 * a) our head move protocol would not be correct.
2657 * b) freeing an item linked to a moved head after another item was
2658 * inserted in front of it, would require more than one grace period.
2659 *
2660 * Ad (a): In the following example, cpu1 starts moving old_head
2661 * to new_head, cpu2 completes the move and cpu3 notices cpu2
2662 * completed the move before cpu1 gets a chance to notice cpu2
2663 * had already completed the move. Our requirements for cas()
2664 * assume cpu3 will see a valid and mutable value in new_head
2665 * after issuing a load memory barrier once it has determined
2666 * the old_head's value had been successfully moved to new_head
2667 * (because it sees old_head marked invalid).
2668 *
2669 * cpu1 cpu2 cpu3
2670 * cas(old_head, <addr, N>, <addr, Const>), succeeds
2671 * cas-order-barrier
2672 * // Move from old_head to new_head started, now the interesting stuff:
2673 * cas(new_head, <0, Inv>, <addr, N>), succeeds
2674 *
2675 * cas(new_head, <0, Inv>, <addr, N>), but fails
2676 * cas-order-barrier
2677 * cas(old_head, <addr, Const>, <addr, Inv>), succeeds
2678 *
2679 * Sees old_head marked Inv (by cpu2)
2680 * load-MB
2681 * assert(new_head == <addr, N>)
2682 *
2683 * cas-order-barrier
2684 *
2685 * Even though cpu1 did not yet issue a cas-order-barrier, cpu1's store
2686 * to new_head (successful cas()) must be made visible to cpu3 with
2687 * a load memory barrier if cpu1's store to new_head is visible
2688 * on another cpu (cpu2) and that cpu's (cpu2's) store to old_head
2689	 * is already visible to cpu3.
2690 */
2691 void *expected = (void*)cur;
2692
2693 /*
2694 * Use the acquire-release model, although we could probably
2695 * get away even with the relaxed memory model due to our use
2696 * of explicit memory barriers.
2697 */
2698 __atomic_compare_exchange_n((void**)link, &expected, (void *)new, false,
2699 __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
2700
2701 return (marked_ptr_t) expected;
2702}
2703
2704/** Orders compare-and-swaps to different memory locations. */
2705static inline void cas_order_barrier(void)
2706{
2707 /* Make sure CAS to different memory locations are ordered. */
2708 write_barrier();
2709}
2710
2711
2712/** @}
2713 */