source: mainline/kernel/generic/src/adt/cht.c@ 314f4b59

1/*
2 * Copyright (c) 2012 Adam Hraska
3 * All rights reserved.
4 *
5 * Redistribution and use in source and binary forms, with or without
6 * modification, are permitted provided that the following conditions
7 * are met:
8 *
9 * - Redistributions of source code must retain the above copyright
10 * notice, this list of conditions and the following disclaimer.
11 * - Redistributions in binary form must reproduce the above copyright
12 * notice, this list of conditions and the following disclaimer in the
13 * documentation and/or other materials provided with the distribution.
14 * - The name of the author may not be used to endorse or promote products
15 * derived from this software without specific prior written permission.
16 *
17 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
18 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
19 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
20 * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
21 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
22 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
23 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
24 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
25 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
26 * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
27 */
28
29
30/** @addtogroup genericadt
31 * @{
32 */
33
34/**
35 * @file
 * @brief Scalable, resizable, concurrent, lock-free hash table.
 *
 * CHT is a concurrent hash table that is scalable, resizable, and lock-free.
 * resizable = the number of buckets of the table increases or decreases
 *             depending on the average number of elements per bucket (ie load)
 * scalable = accessing the table from more cpus increases performance
 *            almost linearly
 * lock-free = common operations never block; even if any of the operations
 *             is preempted or interrupted at any time, other operations will still
 *             make forward progress
46 *
 * CHT is designed for read-mostly scenarios. Performance degrades as the
48 * fraction of updates (insert/remove) increases. Other data structures
49 * significantly outperform CHT if the fraction of updates exceeds ~40%.
50 *
51 * CHT tolerates hardware exceptions and may be accessed from exception
52 * handlers as long as the underlying RCU implementation is exception safe.
53 *
54 * @par Caveats
55 *
56 * 0) Never assume an item is still in the table.
57 * The table may be accessed concurrently; therefore, other threads may
58 * insert or remove an item at any time. Do not assume an item is still
59 * in the table if cht_find() just returned it to you. Similarly, an
60 * item may have already been inserted by the time cht_find() returns NULL.
61 *
62 * 1) Always use RCU read locks when searching the table.
63 * Holding an RCU lock guarantees that an item found in the table remains
64 * valid (eg is not freed) even if the item was removed from the table
65 * in the meantime by another thread.
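 *
 * For illustration, a lookup that follows this rule may be sketched as
 * follows (h is a cht_t; key is whatever cht_ops_t.key_hash()/.key_equal()
 * expect):
 * @code
 * rcu_read_lock();
 *
 * cht_link_t *link = cht_find(&h, key);
 * if (link) {
 *     // Safe to use the item containing the link until rcu_read_unlock(),
 *     // even if another cpu removes it from the table in the meantime.
 * }
 *
 * rcu_read_unlock();
 * @endcode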
66 *
67 * 2) Never update values in place.
68 * Do not update items in the table in place, ie directly. The changes
69 * will not propagate to other readers (on other cpus) immediately or even
70 * correctly. Some readers may then encounter items that have only some
71 * of their fields changed or are completely inconsistent.
72 *
73 * Instead consider inserting an updated/changed copy of the item and
74 * removing the original item. Or contact the maintainer to provide
75 * you with a function that atomically replaces an item with a copy.
76 *
 * 3) Use cht_insert_unique() instead of checking for duplicates with cht_find().
 * The following code is prone to race conditions:
 * @code
 * if (NULL == cht_find(&h, key)) {
 *     // If another thread inserts an item here, we'll insert a duplicate.
 *     cht_insert(&h, item);
 * }
 * @endcode
85 * See cht_insert_unique() on how to correctly fix this.
86 *
87 *
88 * @par Semantics
89 *
90 * Lazy readers = cht_find_lazy(), cht_find_next_lazy()
91 * Readers = lazy readers, cht_find(), cht_find_next()
92 * Updates = cht_insert(), cht_insert_unique(), cht_remove_key(),
93 * cht_remove_item()
94 *
95 * Readers (but not lazy readers) are guaranteed to see the effects
96 * of @e completed updates. In other words, if cht_find() is invoked
97 * after a cht_insert() @e returned eg on another cpu, cht_find() is
98 * guaranteed to see the inserted item.
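 *
 * For example (a sketch; item is a user structure with an embedded
 * cht_link_t member named link, and cpu2 only starts after cht_insert()
 * on cpu1 has returned):
 * @code
 * // cpu1
 * cht_insert(&h, &item->link);
 *
 * // cpu2, afterwards
 * rcu_read_lock();
 * cht_link_t *found = cht_find(&h, key);  // sees cpu1's item (unless removed)
 * rcu_read_unlock();
 * @endcode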
99 *
100 * Similarly, updates see the effects of @e completed updates. For example,
101 * issuing cht_remove() after a cht_insert() for that key returned (even
102 * on another cpu) is guaranteed to remove the inserted item.
103 *
104 * Reading or updating the table concurrently with other updates
105 * always returns consistent data and never corrupts the table.
106 * However the effects of concurrent updates may or may not be
107 * visible to all other concurrent readers or updaters. Eg, not
108 * all readers may see that an item has already been inserted
109 * if cht_insert() has not yet returned.
110 *
111 * Lazy readers are guaranteed to eventually see updates but it
112 * may take some time (possibly milliseconds) after the update
113 * completes for the change to propagate to lazy readers on all
114 * cpus.
115 *
116 * @par Implementation
117 *
118 * Collisions in CHT are resolved with chaining. The number of buckets
 * is always a power of 2. Each bucket is represented with a singly linked
120 * lock-free list [1]. Items in buckets are sorted by their mixed hashes
121 * in ascending order. All buckets are terminated with a single global
122 * sentinel node whose mixed hash value is the greatest possible.
123 *
124 * CHT with 2^k buckets uses the k most significant bits of a hash value
125 * to determine the bucket number where an item is to be stored. To
126 * avoid storing all items in a single bucket if the user supplied
127 * hash function does not produce uniform hashes, hash values are
128 * mixed first so that the top bits of a mixed hash change even if hash
129 * values differ only in the least significant bits. The mixed hash
130 * values are cached in cht_link.hash (which is overwritten once the
131 * item is scheduled for removal via rcu_call).
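 *
 * For example, with 2^order buckets the bucket index of an item is derived
 * from the top order bits of its mixed hash, roughly as follows (a sketch
 * of what calc_bucket_idx() below is expected to do):
 * @code
 * size_t bucket_idx = mixed_hash >> (CHT_MAX_ORDER - order);
 * @endcode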
132 *
133 * A new item is inserted before all other existing items in the bucket
134 * with the same hash value as the newly inserted item (a la the original
135 * lock-free list [2]). Placing new items at the start of a same-hash
136 * sequence of items (eg duplicates) allows us to easily check for duplicates
137 * in cht_insert_unique(). The function can first check that there are
138 * no duplicates of the newly inserted item amongst the items with the
139 * same hash as the new item. If there were no duplicates the new item
140 * is linked before the same-hash items. Inserting a duplicate while
141 * the function is checking for duplicates is detected as a change of
142 * the link to the first checked same-hash item (and the search for
143 * duplicates can be restarted).
144 *
145 * @par Table resize algorithm
146 *
147 * Table resize is based on [3] and [5]. First, a new bucket head array
148 * is allocated and initialized. Second, old bucket heads are moved
149 * to the new bucket head array with the protocol mentioned in [5].
150 * At this point updaters start using the new bucket heads. Third,
151 * buckets are split (or joined) so that the table can make use of
152 * the extra bucket head slots in the new array (or stop wasting space
153 * with the unnecessary extra slots in the old array). Splitting
154 * or joining buckets employs a custom protocol. Last, the new array
155 * replaces the original bucket array.
156 *
157 * A single background work item (of the system work queue) guides
158 * resizing of the table. If an updater detects that the bucket it
159 * is about to access is undergoing a resize (ie its head is moving
160 * or it needs to be split/joined), it helps out and completes the
161 * head move or the bucket split/join.
162 *
163 * The table always grows or shrinks by a factor of 2. Because items
164 * are assigned a bucket based on the top k bits of their mixed hash
165 * values, when growing the table each bucket is split into two buckets
166 * and all items of the two new buckets come from the single bucket in the
167 * original table. Ie items from separate buckets in the original table
 * never intermix in the new buckets. Moreover,
 * since the buckets are sorted by their mixed hash values, the items
 * at the beginning of the old bucket will end up in the first new
 * bucket while all the remaining items of the old bucket will end up
 * in the second new bucket. Therefore, there is a single point at which
 * to split the linked list of the old bucket into two correctly sorted
 * linked lists of the new buckets:
 *
 *                            .- bucket split
 *                            |
 *            <-- first -->   v <-- second -->
 *  [old] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *             ^                 ^
 *  [new0] -- -+                 |
 *  [new1] -- -- -- -- -- -- -- -+
182 *
183 * Resize in greater detail:
184 *
185 * a) First, a resizer (a single background system work queue item
186 * in charge of resizing the table) allocates and initializes a new
187 * bucket head array. New bucket heads are pointed to the sentinel
188 * and marked Invalid (in the lower order bits of the pointer to the
189 * next item, ie the sentinel in this case):
190 *
 *  [old, N] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *                                                    ^ ^
 *  [new0, Inv] --------------------------------------+ |
 *  [new1, Inv] ----------------------------------------+
195 *
196 *
197 * b) Second, the resizer starts moving old bucket heads with the following
198 * lock-free protocol (from [5]) where cas(variable, expected_val, new_val)
199 * is short for compare-and-swap:
200 *
 *  old head      new0 head      transition to next state
 *  --------      ---------      ------------------------
 *  addr, N       sentinel, Inv  cas(old, (addr, N), (addr, Const))
 *                               .. mark the old head as immutable, so that
 *                                  updaters do not relink it to other nodes
 *                                  until the head move is done.
 *  addr, Const   sentinel, Inv  cas(new0, (sentinel, Inv), (addr, N))
 *                               .. move the address to the new head and mark
 *                                  the new head normal so updaters can start
 *                                  using it.
 *  addr, Const   addr, N        cas(old, (addr, Const), (addr, Inv))
 *                               .. mark the old head Invalid to signify
 *                                  the head move is done.
 *  addr, Inv     addr, N
215 *
216 * Notice that concurrent updaters may step in at any point and correctly
217 * complete the head move without disrupting the resizer. At worst, the
218 * resizer or other concurrent updaters will attempt a number of CAS() that
219 * will correctly fail.
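 *
 * For illustration, one pass of the protocol may be sketched in terms of
 * the helpers declared below (cas_link() CASes the address and the mark of
 * a head atomically; a failed CAS simply means someone else has already
 * completed that step):
 * @code
 * cht_link_t *next = get_next(*old_head);
 * // Freeze the old head so updaters stop relinking it.
 * cas_link(old_head, next, N_NORMAL, next, N_CONST);
 * // Publish the bucket's first item in the new head.
 * cas_link(new_head, &sentinel, N_INVALID, next, N_NORMAL);
 * // Mark the old head invalid - the head move is done.
 * cas_link(old_head, next, N_CONST, next, N_INVALID);
 * @endcode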
220 *
 *  [old, Inv] -> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *                 ^                                   ^
 *  [new0, N] ----+                                    |
 *  [new1, Inv] --------------------------------------+
225 *
226 *
227 * c) Third, buckets are split if the table is growing; or joined if
228 * shrinking (by the resizer or updaters depending on whoever accesses
229 * the bucket first). See split_bucket() and join_buckets() for details.
230 *
231 * 1) Mark the last item of new0 with JOIN_FOLLOWS:
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b] -> [11b] -> sentinel
 *                 ^                                       ^
 *  [new0, N] ----+                                        |
 *  [new1, Inv] ------------------------------------------+
236 *
237 * 2) Mark the first item of new1 with JOIN_NODE:
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
 *                 ^                                           ^
 *  [new0, N] ----+                                            |
 *  [new1, Inv] ----------------------------------------------+
242 *
243 * 3) Point new1 to the join-node and mark new1 NORMAL.
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
 *                 ^                      ^
 *  [new0, N] ----+                       |
 *  [new1, N] ---------------------------+
248 *
249 *
250 * d) Fourth, the resizer cleans up extra marks added during bucket
251 * splits/joins but only when it is sure all updaters are accessing
252 * the table via the new bucket heads only (ie it is certain there
253 * are no delayed updaters unaware of the resize and accessing the
254 * table via the old bucket head).
255 *
 *  [old, Inv] ---+
 *                v
 *  [new0, N] --> [00b] -> [01b, N] ---+
 *                                     v
 *  [new1, N] --> [10b, N] -> [11b] -> sentinel
261 *
262 *
263 * e) Last, the resizer publishes the new bucket head array for everyone
264 * to see and use. This signals the end of the resize and the old bucket
265 * array is freed.
266 *
267 *
268 * To understand details of how the table is resized, read [1, 3, 5]
269 * and comments in join_buckets(), split_bucket().
270 *
271 *
272 * [1] High performance dynamic lock-free hash tables and list-based sets,
273 * Michael, 2002
274 * http://www.research.ibm.com/people/m/michael/spaa-2002.pdf
275 * [2] Lock-free linked lists using compare-and-swap,
276 * Valois, 1995
277 * http://people.csail.mit.edu/bushl2/rpi/portfolio/lockfree-grape/documents/lock-free-linked-lists.pdf
278 * [3] Resizable, scalable, concurrent hash tables via relativistic programming,
279 * Triplett, 2011
280 * http://www.usenix.org/event/atc11/tech/final_files/Triplett.pdf
281 * [4] Split-ordered Lists: Lock-free Extensible Hash Tables,
282 * Shavit, 2006
283 * http://www.cs.ucf.edu/~dcm/Teaching/COT4810-Spring2011/Literature/SplitOrderedLists.pdf
284 * [5] Towards a Scalable Non-blocking Coding Style,
285 * Click, 2008
286 * http://www.azulsystems.com/events/javaone_2008/2008_CodingNonBlock.pdf
287 */
288
289
290#include <adt/cht.h>
291#include <adt/hash.h>
292#include <assert.h>
293#include <mm/slab.h>
294#include <arch/barrier.h>
295#include <compiler/barrier.h>
296#include <atomic.h>
297#include <synch/rcu.h>
298
299#ifdef CONFIG_DEBUG
300/* Do not enclose in parentheses. */
301#define DBG(x) x
302#else
303#define DBG(x)
304#endif
305
306/* Logarithm of the min bucket count. Must be at least 3. 2^6 == 64 buckets. */
307#define CHT_MIN_ORDER 6
308/* Logarithm of the max bucket count. */
309#define CHT_MAX_ORDER (8 * sizeof(size_t))
310/* Minimum number of hash table buckets. */
311#define CHT_MIN_BUCKET_CNT (1 << CHT_MIN_ORDER)
312/* Does not have to be a power of 2. */
313#define CHT_MAX_LOAD 2
314
315typedef cht_ptr_t marked_ptr_t;
316typedef bool (*equal_pred_t)(void *arg, const cht_link_t *item);
317
318/** The following mark items and bucket heads.
319 *
320 * They are stored in the two low order bits of the next item pointers.
321 * Some marks may be combined. Some marks share the same binary value and
322 * are distinguished only by context (eg bucket head vs an ordinary item),
323 * in particular by walk_mode_t.
324 */
325typedef enum mark {
326 /** Normal non-deleted item or a valid bucket head. */
327 N_NORMAL = 0,
328 /** Logically deleted item that might have already been unlinked.
329 *
330 * May be combined with N_JOIN and N_JOIN_FOLLOWS. Applicable only
331 * to items; never to bucket heads.
332 *
333 * Once marked deleted an item remains marked deleted.
334 */
335 N_DELETED = 1,
336 /** Immutable bucket head.
337 *
338 * The bucket is being moved or joined with another and its (old) head
339 * must not be modified.
340 *
341 * May be combined with N_INVALID. Applicable only to old bucket heads,
342 * ie cht_t.b and not cht_t.new_b.
343 */
344 N_CONST = 1,
345 /** Invalid bucket head. The bucket head must not be modified.
346 *
347 * Old bucket heads (ie cht_t.b) are marked invalid if they have
348 * already been moved to cht_t.new_b or if the bucket had already
349 * been merged with another when shrinking the table. New bucket
350 * heads (ie cht_t.new_b) are marked invalid if the old bucket had
351 * not yet been moved or if an old bucket had not yet been split
352 * when growing the table.
353 */
354 N_INVALID = 3,
355 /** The item is a join node, ie joining two buckets
356 *
357 * A join node is either the first node of the second part of
358 * a bucket to be split; or it is the first node of the bucket
359 * to be merged into/appended to/joined with another bucket.
360 *
361 * May be combined with N_DELETED. Applicable only to items, never
362 * to bucket heads.
363 *
364 * Join nodes are referred to from two different buckets and may,
365 * therefore, not be safely/atomically unlinked from both buckets.
366 * As a result join nodes are not unlinked but rather just marked
367 * deleted. Once resize completes join nodes marked deleted are
368 * garbage collected.
369 */
370 N_JOIN = 2,
371 /** The next node is a join node and will soon be marked so.
372 *
373 * A join-follows node is the last node of the first part of bucket
374 * that is to be split, ie it is the last node that will remain
375 * in the same bucket after splitting it.
376 *
377 * May be combined with N_DELETED. Applicable to items as well
378 * as to bucket heads of the bucket to be split (but only in cht_t.new_b).
379 */
380 N_JOIN_FOLLOWS = 2,
381 /** Bit mask to filter out the address to the next item from the next ptr. */
382 N_MARK_MASK = 3
383} mark_t;
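
/*
 * For illustration, the marks are packed into the two low order bits of a
 * next pointer roughly as follows (a sketch of what make_link(), get_next()
 * and get_mark() are expected to do; the exact implementations live below):
 *
 *   marked_ptr_t link = (marked_ptr_t)((uintptr_t) next | mark);
 *   cht_link_t *next  = (cht_link_t *)(link & ~N_MARK_MASK);
 *   mark_t      mark  = (mark_t)(link & N_MARK_MASK);
 */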
384
/** Determines how a bucket chain is walked, ie which resize-related nodes to expect. */
386typedef enum walk_mode {
387 /** The table is not resizing. */
388 WM_NORMAL = 4,
389 /** The table is undergoing a resize. Join nodes may be encountered. */
390 WM_LEAVE_JOIN,
391 /** The table is growing. A join-follows node may be encountered. */
392 WM_MOVE_JOIN_FOLLOWS
393} walk_mode_t;
394
395/** Bucket position window. */
396typedef struct wnd {
397 /** Pointer to cur's predecessor. */
398 marked_ptr_t *ppred;
399 /** Current item. */
400 cht_link_t *cur;
401 /** Last encountered item. Deleted or not. */
402 cht_link_t *last;
403} wnd_t;
404
405
406/* Sentinel node used by all buckets. Stores the greatest possible hash value.*/
407static const cht_link_t sentinel = {
408 /* NULL and N_NORMAL */
409 .link = 0 | N_NORMAL,
410 .hash = -1
411};
412
413
414static size_t size_to_order(size_t bucket_cnt, size_t min_order);
415static cht_buckets_t *alloc_buckets(size_t order, bool set_invalid,
416 bool can_block);
417static inline cht_link_t *find_lazy(cht_t *h, void *key);
418static cht_link_t *search_bucket(cht_t *h, marked_ptr_t head, void *key,
419 size_t search_hash);
420static cht_link_t *find_resizing(cht_t *h, void *key, size_t hash,
421 marked_ptr_t old_head, size_t old_idx);
422static bool insert_impl(cht_t *h, cht_link_t *item, cht_link_t **dup_item);
423static bool insert_at(cht_link_t *item, const wnd_t *wnd, walk_mode_t walk_mode,
424 bool *resizing);
425static bool has_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
426 cht_link_t *cur, cht_link_t **dup_item);
427static cht_link_t *find_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
428 cht_link_t *start);
429static bool remove_pred(cht_t *h, size_t hash, equal_pred_t pred, void *pred_arg);
430static bool delete_at(cht_t *h, wnd_t *wnd, walk_mode_t walk_mode,
431 bool *deleted_but_gc, bool *resizing);
432static bool mark_deleted(cht_link_t *cur, walk_mode_t walk_mode, bool *resizing);
433static bool unlink_from_pred(wnd_t *wnd, walk_mode_t walk_mode, bool *resizing);
434static bool find_wnd_and_gc_pred(cht_t *h, size_t hash, walk_mode_t walk_mode,
435 equal_pred_t pred, void *pred_arg, wnd_t *wnd, bool *resizing);
436static bool find_wnd_and_gc(cht_t *h, size_t hash, walk_mode_t walk_mode,
437 wnd_t *wnd, bool *resizing);
438static bool gc_deleted_node(cht_t *h, walk_mode_t walk_mode, wnd_t *wnd,
439 bool *resizing);
440static bool join_completed(cht_t *h, const wnd_t *wnd);
441static void upd_resizing_head(cht_t *h, size_t hash, marked_ptr_t **phead,
442 bool *join_finishing, walk_mode_t *walk_mode);
443static void item_removed(cht_t *h);
444static void item_inserted(cht_t *h);
445static void free_later(cht_t *h, cht_link_t *item);
446static void help_head_move(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head);
447static void start_head_move(marked_ptr_t *psrc_head);
448static void mark_const(marked_ptr_t *psrc_head);
449static void complete_head_move(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head);
450static void split_bucket(cht_t *h, marked_ptr_t *psrc_head,
451 marked_ptr_t *pdest_head, size_t split_hash);
452static void mark_join_follows(cht_t *h, marked_ptr_t *psrc_head,
453 size_t split_hash, wnd_t *wnd);
454static void mark_join_node(cht_link_t *join_node);
455static void join_buckets(cht_t *h, marked_ptr_t *psrc_head,
456 marked_ptr_t *pdest_head, size_t split_hash);
457static void link_to_join_node(cht_t *h, marked_ptr_t *pdest_head,
458 cht_link_t *join_node, size_t split_hash);
459static void resize_table(work_t *arg);
460static void grow_table(cht_t *h);
461static void shrink_table(cht_t *h);
462static void cleanup_join_node(cht_t *h, marked_ptr_t *new_head);
463static void clear_join_and_gc(cht_t *h, cht_link_t *join_node,
464 marked_ptr_t *new_head);
465static void cleanup_join_follows(cht_t *h, marked_ptr_t *new_head);
466static marked_ptr_t make_link(const cht_link_t *next, mark_t mark);
467static cht_link_t * get_next(marked_ptr_t link);
468static mark_t get_mark(marked_ptr_t link);
469static void next_wnd(wnd_t *wnd);
470static bool same_node_pred(void *node, const cht_link_t *item2);
471static size_t calc_key_hash(cht_t *h, void *key);
472static size_t node_hash(cht_t *h, const cht_link_t *item);
473static size_t calc_node_hash(cht_t *h, const cht_link_t *item);
474static void memoize_node_hash(cht_t *h, cht_link_t *item);
475static size_t calc_split_hash(size_t split_idx, size_t order);
476static size_t calc_bucket_idx(size_t hash, size_t order);
477static size_t grow_to_split_idx(size_t old_idx);
478static size_t grow_idx(size_t idx);
479static size_t shrink_idx(size_t idx);
480static marked_ptr_t cas_link(marked_ptr_t *link, const cht_link_t *cur_next,
481 mark_t cur_mark, const cht_link_t *new_next, mark_t new_mark);
482static marked_ptr_t _cas_link(marked_ptr_t *link, marked_ptr_t cur,
483 marked_ptr_t new);
484static void cas_order_barrier(void);
485
486static void dummy_remove_callback(cht_link_t *item)
487{
488 /* empty */
489}
490
491/** Creates a concurrent hash table.
492 *
493 * @param h Valid pointer to a cht_t instance.
494 * @param op Item specific operations. All operations are compulsory.
495 * @return True if successfully created the table. False otherwise.
496 */
497bool cht_create_simple(cht_t *h, cht_ops_t *op)
498{
499 return cht_create(h, 0, 0, 0, false, op);
500}
501
502/** Creates a concurrent hash table.
503 *
504 * @param h Valid pointer to a cht_t instance.
505 * @param init_size The initial number of buckets the table should contain.
506 * The table may be shrunk below this value if deemed necessary.
507 * Uses the default value if 0.
508 * @param min_size Minimum number of buckets that the table should contain.
509 * The number of buckets never drops below this value,
510 * although it may be rounded up internally as appropriate.
511 * Uses the default value if 0.
 * @param max_load Maximum average number of items per bucket that is
 *                 allowed before the table grows.
514 * @param can_block If true creating the table blocks until enough memory
515 * is available (possibly indefinitely). Otherwise,
516 * table creation does not block and returns immediately
517 * even if not enough memory is available.
518 * @param op Item specific operations. All operations are compulsory.
519 * @return True if successfully created the table. False otherwise.
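 *
 * A minimal usage sketch (the my_* callbacks are hypothetical and must
 * match the cht_ops_t signatures):
 * @code
 * static cht_ops_t my_ops = {
 *     .hash = my_hash,
 *     .key_hash = my_key_hash,
 *     .equal = my_equal,
 *     .key_equal = my_key_equal,
 *     .remove_callback = my_remove_callback
 * };
 *
 * cht_t my_table;
 * if (!cht_create(&my_table, 0, 0, 0, false, &my_ops)) {
 *     // Not enough memory.
 * }
 * @endcode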
520 */
521bool cht_create(cht_t *h, size_t init_size, size_t min_size, size_t max_load,
522 bool can_block, cht_ops_t *op)
523{
524 assert(h);
525 assert(op && op->hash && op->key_hash && op->equal && op->key_equal);
526 /* Memoized hashes are stored in the rcu_link.func function pointer. */
527 static_assert(sizeof(size_t) == sizeof(rcu_func_t), "");
528 assert(sentinel.hash == (uintptr_t)sentinel.rcu_link.func);
529
530 /* All operations are compulsory. */
531 if (!op || !op->hash || !op->key_hash || !op->equal || !op->key_equal)
532 return false;
533
534 size_t min_order = size_to_order(min_size, CHT_MIN_ORDER);
535 size_t order = size_to_order(init_size, min_order);
536
537 h->b = alloc_buckets(order, false, can_block);
538
539 if (!h->b)
540 return false;
541
542 h->max_load = (max_load == 0) ? CHT_MAX_LOAD : max_load;
543 h->min_order = min_order;
544 h->new_b = NULL;
545 h->op = op;
546 atomic_set(&h->item_cnt, 0);
547 atomic_set(&h->resize_reqs, 0);
548
549 if (NULL == op->remove_callback) {
550 h->op->remove_callback = dummy_remove_callback;
551 }
552
553 /*
554 * Cached item hashes are stored in item->rcu_link.func. Once the item
555 * is deleted rcu_link.func will contain the value of invalid_hash.
556 */
557 h->invalid_hash = (uintptr_t)h->op->remove_callback;
558
559 /* Ensure the initialization takes place before we start using the table. */
560 write_barrier();
561
562 return true;
563}
564
565/** Allocates and initializes 2^order buckets.
566 *
567 * All bucket heads are initialized to point to the sentinel node.
568 *
569 * @param order The number of buckets to allocate is 2^order.
570 * @param set_invalid Bucket heads are marked invalid if true; otherwise
571 * they are marked N_NORMAL.
572 * @param can_block If true memory allocation blocks until enough memory
573 * is available (possibly indefinitely). Otherwise,
574 * memory allocation does not block.
575 * @return Newly allocated and initialized buckets or NULL if not enough memory.
576 */
577static cht_buckets_t *alloc_buckets(size_t order, bool set_invalid, bool can_block)
578{
579 size_t bucket_cnt = (1 << order);
580 size_t bytes =
581 sizeof(cht_buckets_t) + (bucket_cnt - 1) * sizeof(marked_ptr_t);
582 cht_buckets_t *b = malloc(bytes, can_block ? 0 : FRAME_ATOMIC);
583
584 if (!b)
585 return NULL;
586
587 b->order = order;
588
589 marked_ptr_t head_link = set_invalid
590 ? make_link(&sentinel, N_INVALID)
591 : make_link(&sentinel, N_NORMAL);
592
593 for (size_t i = 0; i < bucket_cnt; ++i) {
594 b->head[i] = head_link;
595 }
596
597 return b;
598}
599
600/** Returns the smallest k such that bucket_cnt <= 2^k and min_order <= k.*/
601static size_t size_to_order(size_t bucket_cnt, size_t min_order)
602{
603 size_t order = min_order;
604
605 /* Find a power of two such that bucket_cnt <= 2^order */
606 do {
607 if (bucket_cnt <= ((size_t)1 << order))
608 return order;
609
610 ++order;
611 } while (order < CHT_MAX_ORDER);
612
613 return order;
614}
615
616/** Destroys a CHT successfully created via cht_create().
617 *
618 * Waits for all outstanding concurrent operations to complete and
 * frees internally allocated resources. The table is, however, not cleared
620 * and items already present in the table (if any) are leaked.
621 */
622void cht_destroy(cht_t *h)
623{
624 cht_destroy_unsafe(h);
625
626 /* You must clear the table of items. Otherwise cht_destroy will leak. */
627 assert(atomic_get(&h->item_cnt) == 0);
628}
629
630/** Destroys a successfully created CHT but does no error checking. */
631void cht_destroy_unsafe(cht_t *h)
632{
633 /* Wait for resize to complete. */
634 while (0 < atomic_get(&h->resize_reqs)) {
635 rcu_barrier();
636 }
637
638 /* Wait for all remove_callback()s to complete. */
639 rcu_barrier();
640
641 free(h->b);
642 h->b = NULL;
643}
644
645/** Returns the first item equal to the search key or NULL if not found.
646 *
 * The call must be enclosed in a rcu_read_lock()/unlock() pair. The
648 * returned item is guaranteed to be allocated until rcu_read_unlock()
649 * although the item may be concurrently removed from the table by another
650 * cpu.
651 *
652 * Further items matching the key may be retrieved via cht_find_next().
653 *
654 * cht_find() sees the effects of any completed cht_remove(), cht_insert().
655 * If a concurrent remove or insert had not yet completed cht_find() may
656 * or may not see the effects of it (eg it may find an item being removed).
657 *
658 * @param h CHT to operate on.
659 * @param key Search key as defined by cht_ops_t.key_equal() and .key_hash().
660 * @return First item equal to the key or NULL if such an item does not exist.
661 */
662cht_link_t *cht_find(cht_t *h, void *key)
663{
664 /* Make the most recent changes to the table visible. */
665 read_barrier();
666 return cht_find_lazy(h, key);
667}
668
669/** Returns the first item equal to the search key or NULL if not found.
670 *
671 * Unlike cht_find(), cht_find_lazy() may not see the effects of
672 * cht_remove() or cht_insert() even though they have already completed.
673 * It may take a couple of milliseconds for those changes to propagate
674 * and become visible to cht_find_lazy(). On the other hand, cht_find_lazy()
675 * operates a bit faster than cht_find().
676 *
677 * See cht_find() for more details.
678 */
679cht_link_t *cht_find_lazy(cht_t *h, void *key)
680{
681 return find_lazy(h, key);
682}
683
684/** Finds the first item equal to the search key. */
685static inline cht_link_t *find_lazy(cht_t *h, void *key)
686{
687 assert(h);
688 /* See docs to cht_find() and cht_find_lazy(). */
689 assert(rcu_read_locked());
690
691 size_t hash = calc_key_hash(h, key);
692
693 cht_buckets_t *b = rcu_access(h->b);
694 size_t idx = calc_bucket_idx(hash, b->order);
695 /*
696 * No need for access_once. b->head[idx] will point to an allocated node
697 * even if marked invalid until we exit rcu read section.
698 */
699 marked_ptr_t head = b->head[idx];
700
701 /* Undergoing a resize - take the slow path. */
702 if (N_INVALID == get_mark(head))
703 return find_resizing(h, key, hash, head, idx);
704
705 return search_bucket(h, head, key, hash);
706}
707
708/** Returns the next item matching \a item.
709 *
710 * Must be enclosed in a rcu_read_lock()/unlock() pair. Effects of
711 * completed cht_remove(), cht_insert() are guaranteed to be visible
712 * to cht_find_next().
713 *
714 * See cht_find() for more details.
715 */
716cht_link_t *cht_find_next(cht_t *h, const cht_link_t *item)
717{
718 /* Make the most recent changes to the table visible. */
719 read_barrier();
720 return cht_find_next_lazy(h, item);
721}
722
723/** Returns the next item matching \a item.
724 *
725 * Must be enclosed in a rcu_read_lock()/unlock() pair. Effects of
726 * completed cht_remove(), cht_insert() may or may not be visible
727 * to cht_find_next_lazy().
728 *
729 * See cht_find_lazy() for more details.
730 */
731cht_link_t *cht_find_next_lazy(cht_t *h, const cht_link_t *item)
732{
733 assert(h);
734 assert(rcu_read_locked());
735 assert(item);
736
737 return find_duplicate(h, item, calc_node_hash(h, item), get_next(item->link));
738}
739
740/** Searches the bucket at head for key using search_hash. */
741static inline cht_link_t *search_bucket(cht_t *h, marked_ptr_t head, void *key,
742 size_t search_hash)
743{
744 /*
745 * It is safe to access nodes even outside of this bucket (eg when
746 * splitting the bucket). The resizer makes sure that any node we
747 * may find by following the next pointers is allocated.
748 */
749
750 cht_link_t *cur = NULL;
751 marked_ptr_t prev = head;
752
753try_again:
754 /* Filter out items with different hashes. */
755 do {
756 cur = get_next(prev);
757 assert(cur);
758 prev = cur->link;
759 } while (node_hash(h, cur) < search_hash);
760
761 /*
762 * Only search for an item with an equal key if cur is not the sentinel
763 * node or a node with a different hash.
764 */
765 while (node_hash(h, cur) == search_hash) {
766 if (h->op->key_equal(key, cur)) {
767 if (!(N_DELETED & get_mark(cur->link)))
768 return cur;
769 }
770
771 cur = get_next(cur->link);
772 assert(cur);
773 }
774
775 /*
776 * In the unlikely case that we have encountered a node whose cached
777 * hash has been overwritten due to a pending rcu_call for it, skip
778 * the node and try again.
779 */
780 if (node_hash(h, cur) == h->invalid_hash) {
781 prev = cur->link;
782 goto try_again;
783 }
784
785 return NULL;
786}
787
788/** Searches for the key while the table is undergoing a resize. */
789static cht_link_t *find_resizing(cht_t *h, void *key, size_t hash,
790 marked_ptr_t old_head, size_t old_idx)
791{
792 assert(N_INVALID == get_mark(old_head));
793 assert(h->new_b);
794
795 size_t new_idx = calc_bucket_idx(hash, h->new_b->order);
796 marked_ptr_t new_head = h->new_b->head[new_idx];
797 marked_ptr_t search_head = new_head;
798
799 /* Growing. */
800 if (h->b->order < h->new_b->order) {
801 /*
802 * Old bucket head is invalid, so it must have been already
803 * moved. Make the new head visible if still not visible, ie
804 * invalid.
805 */
806 if (N_INVALID == get_mark(new_head)) {
807 /*
808 * We should be searching a newly added bucket but the old
 * moved bucket has not yet been split (it's marked invalid)
810 * or we have not yet seen the split.
811 */
812 if (grow_idx(old_idx) != new_idx) {
813 /*
814 * Search the moved bucket. It is guaranteed to contain
815 * items of the newly added bucket that were present
816 * before the moved bucket was split.
817 */
818 new_head = h->new_b->head[grow_idx(old_idx)];
819 }
820
821 /* new_head is now the moved bucket, either valid or invalid. */
822
823 /*
824 * The old bucket was definitely moved to new_head but the
825 * change of new_head had not yet propagated to this cpu.
826 */
827 if (N_INVALID == get_mark(new_head)) {
828 /*
829 * We could issue a read_barrier() and make the now valid
830 * moved bucket head new_head visible, but instead fall back
831 * on using the old bucket. Although the old bucket head is
832 * invalid, it points to a node that is allocated and in the
833 * right bucket. Before the node can be freed, it must be
834 * unlinked from the head (or another item after that item
835 * modified the new_head) and a grace period must elapse.
836 * As a result had the node been already freed the grace
 * period preceding the free() would make the unlink and
838 * any changes to new_head visible. Therefore, it is safe
839 * to use the node pointed to from the old bucket head.
840 */
841
842 search_head = old_head;
843 } else {
844 search_head = new_head;
845 }
846 }
847
848 return search_bucket(h, search_head, key, hash);
849 } else if (h->b->order > h->new_b->order) {
850 /* Shrinking. */
851
852 /* Index of the bucket in the old table that was moved. */
853 size_t move_src_idx = grow_idx(new_idx);
854 marked_ptr_t moved_old_head = h->b->head[move_src_idx];
855
856 /*
857 * h->b->head[move_src_idx] had already been moved to new_head
858 * but the change to new_head had not yet propagated to us.
859 */
860 if (N_INVALID == get_mark(new_head)) {
861 /*
862 * new_head is definitely valid and we could make it visible
863 * to this cpu with a read_barrier(). Instead, use the bucket
864 * in the old table that was moved even though it is now marked
865 * as invalid. The node it points to must be allocated because
866 * a grace period would have to elapse before it could be freed;
867 * and the grace period would make the now valid new_head
868 * visible to all cpus.
869 *
870 * Note that move_src_idx may not be the same as old_idx.
871 * If move_src_idx != old_idx then old_idx is the bucket
872 * in the old table that is not moved but instead it is
873 * appended to the moved bucket, ie it is added at the tail
874 * of new_head. In that case an invalid old_head notes that
875 * it had already been merged into (the moved) new_head.
876 * We will try to search that bucket first because it
877 * may contain some newly added nodes after the bucket
878 * join. Moreover, the bucket joining link may already be
879 * visible even if new_head is not. Therefore, if we're
880 * lucky we'll find the item via moved_old_head. In any
881 * case, we'll retry in proper old_head if not found.
882 */
883 search_head = moved_old_head;
884 }
885
886 cht_link_t *ret = search_bucket(h, search_head, key, hash);
887
888 if (ret)
889 return ret;
890 /*
891 * Bucket old_head was already joined with moved_old_head
892 * in the new table but we have not yet seen change of the
893 * joining link (or the item is not in the table).
894 */
895 if (move_src_idx != old_idx && get_next(old_head) != &sentinel) {
896 /*
897 * Note that old_head (the bucket to be merged into new_head)
898 * points to an allocated join node (if non-null) even if marked
 * invalid. Before the resizer lets join nodes be unlinked
900 * (and freed) it sets old_head to NULL and waits for a grace period.
901 * So either the invalid old_head points to join node; or old_head
902 * is null and we would have seen a completed bucket join while
903 * traversing search_head.
904 */
905 assert(N_JOIN & get_mark(get_next(old_head)->link));
906 return search_bucket(h, old_head, key, hash);
907 }
908
909 return NULL;
910 } else {
911 /*
912 * Resize is almost done. The resizer is waiting to make
913 * sure all cpus see that the new table replaced the old one.
914 */
915 assert(h->b->order == h->new_b->order);
916 /*
917 * The resizer must ensure all new bucket heads are visible before
918 * replacing the old table.
919 */
920 assert(N_NORMAL == get_mark(new_head));
921 return search_bucket(h, new_head, key, hash);
922 }
923}
924
925/** Inserts an item. Succeeds even if an equal item is already present. */
926void cht_insert(cht_t *h, cht_link_t *item)
927{
928 insert_impl(h, item, NULL);
929}
930
931/** Inserts a unique item. Returns false if an equal item was already present.
932 *
933 * Use this function to atomically check if an equal/duplicate item had
934 * not yet been inserted into the table and to insert this item into the
935 * table.
936 *
937 * The following is @e NOT thread-safe, so do not use:
938 * @code
 * if (!cht_find(h, key)) {
 *     // A concurrent insert here may go unnoticed by cht_find() above.
 *     item = malloc(..);
 *     cht_insert(h, item);
 *     // Now we may have two items with equal search keys.
 * }
945 * @endcode
946 *
947 * Replace such code with:
948 * @code
 * item = malloc(..);
 * if (!cht_insert_unique(h, item, &dup_item)) {
 *     // Whoops, someone beat us to it - an equal item 'dup_item'
 *     // had already been inserted.
 *     free(item);
 * } else {
 *     // Successfully inserted the item and we are guaranteed that
 *     // there are no other equal items.
 * }
958 * @endcode
959 *
960 */
961bool cht_insert_unique(cht_t *h, cht_link_t *item, cht_link_t **dup_item)
962{
963 assert(rcu_read_locked());
964 assert(dup_item);
965 return insert_impl(h, item, dup_item);
966}
967
/** Inserts the item into the table and checks for duplicates if dup_item is non-NULL. */
969static bool insert_impl(cht_t *h, cht_link_t *item, cht_link_t **dup_item)
970{
971 rcu_read_lock();
972
973 cht_buckets_t *b = rcu_access(h->b);
974 memoize_node_hash(h, item);
975 size_t hash = node_hash(h, item);
976 size_t idx = calc_bucket_idx(hash, b->order);
977 marked_ptr_t *phead = &b->head[idx];
978
979 bool resizing = false;
980 bool inserted = false;
981
982 do {
983 walk_mode_t walk_mode = WM_NORMAL;
984 bool join_finishing;
985
986 resizing = resizing || (N_NORMAL != get_mark(*phead));
987
988 /* The table is resizing. Get the correct bucket head. */
989 if (resizing) {
990 upd_resizing_head(h, hash, &phead, &join_finishing, &walk_mode);
991 }
992
993 wnd_t wnd = {
994 .ppred = phead,
995 .cur = get_next(*phead),
996 .last = NULL
997 };
998
999 if (!find_wnd_and_gc(h, hash, walk_mode, &wnd, &resizing)) {
1000 /* Could not GC a node; or detected an unexpected resize. */
1001 continue;
1002 }
1003
1004 if (dup_item && has_duplicate(h, item, hash, wnd.cur, dup_item)) {
1005 rcu_read_unlock();
1006 return false;
1007 }
1008
1009 inserted = insert_at(item, &wnd, walk_mode, &resizing);
1010 } while (!inserted);
1011
1012 rcu_read_unlock();
1013
1014 item_inserted(h);
1015 return true;
1016}
1017
1018/** Inserts item between wnd.ppred and wnd.cur.
1019 *
1020 * @param item Item to link to wnd.ppred and wnd.cur.
1021 * @param wnd The item will be inserted before wnd.cur. Wnd.ppred
1022 * must be N_NORMAL.
1023 * @param walk_mode
1024 * @param resizing Set to true only if the table is undergoing resize
1025 * and it was not expected (ie walk_mode == WM_NORMAL).
1026 * @return True if the item was successfully linked to wnd.ppred. False
1027 * if whole insert operation must be retried because the predecessor
1028 * of wnd.cur has changed.
1029 */
1030inline static bool insert_at(cht_link_t *item, const wnd_t *wnd,
1031 walk_mode_t walk_mode, bool *resizing)
1032{
1033 marked_ptr_t ret;
1034
1035 if (walk_mode == WM_NORMAL) {
1036 item->link = make_link(wnd->cur, N_NORMAL);
1037 /* Initialize the item before adding it to a bucket. */
1038 memory_barrier();
1039
1040 /* Link a clean/normal predecessor to the item. */
1041 ret = cas_link(wnd->ppred, wnd->cur, N_NORMAL, item, N_NORMAL);
1042
1043 if (ret == make_link(wnd->cur, N_NORMAL)) {
1044 return true;
1045 } else {
1046 /* This includes an invalid head but not a const head. */
1047 *resizing = ((N_JOIN_FOLLOWS | N_JOIN) & get_mark(ret));
1048 return false;
1049 }
1050 } else if (walk_mode == WM_MOVE_JOIN_FOLLOWS) {
1051 /* Move JOIN_FOLLOWS mark but filter out the DELETED mark. */
1052 mark_t jf_mark = get_mark(*wnd->ppred) & N_JOIN_FOLLOWS;
1053 item->link = make_link(wnd->cur, jf_mark);
1054 /* Initialize the item before adding it to a bucket. */
1055 memory_barrier();
1056
1057 /* Link the not-deleted predecessor to the item. Move its JF mark. */
1058 ret = cas_link(wnd->ppred, wnd->cur, jf_mark, item, N_NORMAL);
1059
1060 return ret == make_link(wnd->cur, jf_mark);
1061 } else {
1062 assert(walk_mode == WM_LEAVE_JOIN);
1063
1064 item->link = make_link(wnd->cur, N_NORMAL);
1065 /* Initialize the item before adding it to a bucket. */
1066 memory_barrier();
1067
1068 mark_t pred_mark = get_mark(*wnd->ppred);
1069 /* If the predecessor is a join node it may be marked deleted.*/
1070 mark_t exp_pred_mark = (N_JOIN & pred_mark) ? pred_mark : N_NORMAL;
1071
1072 ret = cas_link(wnd->ppred, wnd->cur, exp_pred_mark, item, exp_pred_mark);
1073 return ret == make_link(wnd->cur, exp_pred_mark);
1074 }
1075}
1076
1077/** Returns true if the chain starting at cur has an item equal to \a item.
1078 *
1079 * @param h CHT to operate on.
1080 * @param item Item whose duplicates the function looks for.
1081 * @param hash Hash of \a item.
 * @param[in] cur The first node with a hash greater than or equal to item's hash.
1083 * @param[out] dup_item The first duplicate item encountered.
1084 * @return True if a non-deleted item equal to \a item exists in the table.
1085 */
1086static inline bool has_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
1087 cht_link_t *cur, cht_link_t **dup_item)
1088{
1089 assert(cur);
1090 assert(cur == &sentinel || hash <= node_hash(h, cur)
1091 || node_hash(h, cur) == h->invalid_hash);
1092
1093 /* hash < node_hash(h, cur) */
1094 if (hash != node_hash(h, cur) && h->invalid_hash != node_hash(h, cur))
1095 return false;
1096
1097 /*
 * Load the most recent node marks. Otherwise we might mistake a
1099 * logically deleted node for a duplicate of the item just because
1100 * the deleted node's DEL mark had not yet propagated to this cpu.
1101 */
1102 read_barrier();
1103
1104 *dup_item = find_duplicate(h, item, hash, cur);
1105 return NULL != *dup_item;
1106}
1107
1108/** Returns an item that is equal to \a item starting in a chain at \a start. */
1109static cht_link_t *find_duplicate(cht_t *h, const cht_link_t *item, size_t hash,
1110 cht_link_t *start)
1111{
1112 assert(hash <= node_hash(h, start) || h->invalid_hash == node_hash(h, start));
1113
1114 cht_link_t *cur = start;
1115
1116try_again:
1117 assert(cur);
1118
1119 while (node_hash(h, cur) == hash) {
1120 assert(cur != &sentinel);
1121
1122 bool deleted = (N_DELETED & get_mark(cur->link));
1123
1124 /* Skip logically deleted nodes. */
1125 if (!deleted && h->op->equal(item, cur))
1126 return cur;
1127
1128 cur = get_next(cur->link);
1129 assert(cur);
1130 }
1131
1132 /* Skip logically deleted nodes with rcu_call() in progress. */
1133 if (h->invalid_hash == node_hash(h, cur)) {
1134 cur = get_next(cur->link);
1135 goto try_again;
1136 }
1137
1138 return NULL;
1139}
1140
1141/** Removes all items matching the search key. Returns the number of items removed.*/
1142size_t cht_remove_key(cht_t *h, void *key)
1143{
1144 assert(h);
1145
1146 size_t hash = calc_key_hash(h, key);
1147 size_t removed = 0;
1148
1149 while (remove_pred(h, hash, h->op->key_equal, key))
1150 ++removed;
1151
1152 return removed;
1153}
1154
1155/** Removes a specific item from the table.
1156 *
 * The caller must hold the rcu read lock.
1158 *
1159 * @param item Item presumably present in the table and to be removed.
1160 * @return True if the item was removed successfully; or false if it had
1161 * already been deleted.
1162 */
1163bool cht_remove_item(cht_t *h, cht_link_t *item)
1164{
1165 assert(h);
1166 assert(item);
1167 /* Otherwise a concurrent cht_remove_key might free the item. */
1168 assert(rcu_read_locked());
1169
1170 /*
1171 * Even though we know the node we want to delete we must unlink it
1172 * from the correct bucket and from a clean/normal predecessor. Therefore,
1173 * we search for it again from the beginning of the correct bucket.
1174 */
1175 size_t hash = calc_node_hash(h, item);
1176 return remove_pred(h, hash, same_node_pred, item);
1177}
1178
1179/** Removes an item equal to pred_arg according to the predicate pred. */
1180static bool remove_pred(cht_t *h, size_t hash, equal_pred_t pred, void *pred_arg)
1181{
1182 rcu_read_lock();
1183
1184 bool resizing = false;
1185 bool deleted = false;
1186 bool deleted_but_gc = false;
1187
1188 cht_buckets_t *b = rcu_access(h->b);
1189 size_t idx = calc_bucket_idx(hash, b->order);
1190 marked_ptr_t *phead = &b->head[idx];
1191
1192 do {
1193 walk_mode_t walk_mode = WM_NORMAL;
1194 bool join_finishing = false;
1195
1196 resizing = resizing || (N_NORMAL != get_mark(*phead));
1197
1198 /* The table is resizing. Get the correct bucket head. */
1199 if (resizing) {
1200 upd_resizing_head(h, hash, &phead, &join_finishing, &walk_mode);
1201 }
1202
1203 wnd_t wnd = {
1204 .ppred = phead,
1205 .cur = get_next(*phead),
1206 .last = NULL
1207 };
1208
1209 if (!find_wnd_and_gc_pred(
1210 h, hash, walk_mode, pred, pred_arg, &wnd, &resizing)) {
1211 /* Could not GC a node; or detected an unexpected resize. */
1212 continue;
1213 }
1214
1215 /*
1216 * The item lookup is affected by a bucket join but effects of
1217 * the bucket join have not been seen while searching for the item.
1218 */
1219 if (join_finishing && !join_completed(h, &wnd)) {
1220 /*
1221 * Bucket was appended at the end of another but the next
1222 * ptr linking them together was not visible on this cpu.
1223 * join_completed() makes this appended bucket visible.
1224 */
1225 continue;
1226 }
1227
1228 /* Already deleted, but delete_at() requested one GC pass. */
1229 if (deleted_but_gc)
1230 break;
1231
1232 bool found = (wnd.cur != &sentinel && pred(pred_arg, wnd.cur));
1233
1234 if (!found) {
1235 rcu_read_unlock();
1236 return false;
1237 }
1238
1239 deleted = delete_at(h, &wnd, walk_mode, &deleted_but_gc, &resizing);
1240 } while (!deleted || deleted_but_gc);
1241
1242 rcu_read_unlock();
1243 return true;
1244}
1245
1246/** Unlinks wnd.cur from wnd.ppred and schedules a deferred free for the item.
1247 *
1248 * Ignores nodes marked N_JOIN if walk mode is WM_LEAVE_JOIN.
1249 *
1250 * @param h CHT to operate on.
1251 * @param wnd Points to the item to delete and its N_NORMAL predecessor.
 * @param walk_mode Bucket chain walk mode.
1253 * @param deleted_but_gc Set to true if the item had been logically deleted,
1254 * but a garbage collecting walk of the bucket is in order for
1255 * it to be fully unlinked.
1256 * @param resizing Set to true if the table is undergoing an unexpected
1257 * resize (ie walk_mode == WM_NORMAL).
1258 * @return False if the wnd.ppred changed in the meantime and the whole
1259 * delete operation must be retried.
1260 */
1261static inline bool delete_at(cht_t *h, wnd_t *wnd, walk_mode_t walk_mode,
1262 bool *deleted_but_gc, bool *resizing)
1263{
1264 assert(wnd->cur && wnd->cur != &sentinel);
1265
1266 *deleted_but_gc = false;
1267
1268 if (!mark_deleted(wnd->cur, walk_mode, resizing)) {
1269 /* Already deleted, or unexpectedly marked as JOIN/JOIN_FOLLOWS. */
1270 return false;
1271 }
1272
1273 /* Marked deleted. Unlink from the bucket. */
1274
1275 /* Never unlink join nodes. */
1276 if (walk_mode == WM_LEAVE_JOIN && (N_JOIN & get_mark(wnd->cur->link)))
1277 return true;
1278
1279 cas_order_barrier();
1280
1281 if (unlink_from_pred(wnd, walk_mode, resizing)) {
1282 free_later(h, wnd->cur);
1283 } else {
1284 *deleted_but_gc = true;
1285 }
1286
1287 return true;
1288}
1289
1290/** Marks cur logically deleted. Returns false to request a retry. */
1291static inline bool mark_deleted(cht_link_t *cur, walk_mode_t walk_mode,
1292 bool *resizing)
1293{
1294 assert(cur && cur != &sentinel);
1295
1296 /*
1297 * Btw, we could loop here if the cas fails but let's not complicate
1298 * things and let's retry from the head of the bucket.
1299 */
1300
1301 cht_link_t *next = get_next(cur->link);
1302
1303 if (walk_mode == WM_NORMAL) {
1304 /* Only mark clean/normal nodes - JF/JN is used only during resize. */
1305 marked_ptr_t ret = cas_link(&cur->link, next, N_NORMAL, next, N_DELETED);
1306
1307 if (ret != make_link(next, N_NORMAL)) {
1308 *resizing = (N_JOIN | N_JOIN_FOLLOWS) & get_mark(ret);
1309 return false;
1310 }
1311 } else {
1312 static_assert(N_JOIN == N_JOIN_FOLLOWS, "");
1313
1314 /* Keep the N_JOIN/N_JOIN_FOLLOWS mark but strip N_DELETED. */
1315 mark_t cur_mark = get_mark(cur->link) & N_JOIN_FOLLOWS;
1316
1317 marked_ptr_t ret =
1318 cas_link(&cur->link, next, cur_mark, next, cur_mark | N_DELETED);
1319
1320 if (ret != make_link(next, cur_mark))
1321 return false;
1322 }
1323
1324 return true;
1325}
1326
1327/** Unlinks wnd.cur from wnd.ppred. Returns false if it should be retried. */
1328static inline bool unlink_from_pred(wnd_t *wnd, walk_mode_t walk_mode,
1329 bool *resizing)
1330{
1331 assert(wnd->cur != &sentinel);
1332 assert(wnd->cur && (N_DELETED & get_mark(wnd->cur->link)));
1333
1334 cht_link_t *next = get_next(wnd->cur->link);
1335
1336 if (walk_mode == WM_LEAVE_JOIN) {
1337 /* Never try to unlink join nodes. */
1338 assert(!(N_JOIN & get_mark(wnd->cur->link)));
1339
1340 mark_t pred_mark = get_mark(*wnd->ppred);
1341 /* Succeed only if the predecessor is clean/normal or a join node. */
1342 mark_t exp_pred_mark = (N_JOIN & pred_mark) ? pred_mark : N_NORMAL;
1343
1344 marked_ptr_t pred_link = make_link(wnd->cur, exp_pred_mark);
1345 marked_ptr_t next_link = make_link(next, exp_pred_mark);
1346
1347 if (pred_link != _cas_link(wnd->ppred, pred_link, next_link))
1348 return false;
1349 } else {
1350 assert(walk_mode == WM_MOVE_JOIN_FOLLOWS || walk_mode == WM_NORMAL);
1351 /* Move the JF mark if set. Clear DEL mark. */
1352 mark_t cur_mark = N_JOIN_FOLLOWS & get_mark(wnd->cur->link);
1353
1354 /* The predecessor must be clean/normal. */
1355 marked_ptr_t pred_link = make_link(wnd->cur, N_NORMAL);
1356 /* Link to cur's successor keeping/copying cur's JF mark. */
1357 marked_ptr_t next_link = make_link(next, cur_mark);
1358
1359 marked_ptr_t ret = _cas_link(wnd->ppred, pred_link, next_link);
1360
1361 if (pred_link != ret) {
1362 /* If we're not resizing the table there are no JF/JN nodes. */
1363 *resizing = (walk_mode == WM_NORMAL)
1364 && (N_JOIN_FOLLOWS & get_mark(ret));
1365 return false;
1366 }
1367 }
1368
1369 return true;
1370}
1371
1372/** Finds the first non-deleted item equal to \a pred_arg according to \a pred.
1373 *
1374 * The function returns the candidate item in \a wnd. Logically deleted
1375 * nodes are garbage collected so the predecessor will most likely not
1376 * be marked as deleted.
1377 *
1378 * Unlike find_wnd_and_gc(), this function never returns a node that
1379 * is known to have already been marked N_DELETED.
1380 *
1381 * Any logically deleted nodes (ie those marked N_DELETED) are garbage
 * collected, ie freed in the background via rcu_call (except for join-nodes
1383 * if walk_mode == WM_LEAVE_JOIN).
1384 *
1385 * @param h CHT to operate on.
 * @param hash Hash to search for.
1387 * @param walk_mode Bucket chain walk mode.
1388 * @param pred Predicate used to find an item equal to pred_arg.
1389 * @param pred_arg Argument to pass to the equality predicate \a pred.
1390 * @param[in,out] wnd The search starts with wnd.cur. If the desired
1391 * item is found wnd.cur will point to it.
1392 * @param resizing Set to true if the table is resizing but it was not
1393 * expected (ie walk_mode == WM_NORMAL).
1394 * @return False if the operation has to be retried. True otherwise
1395 * (even if an equal item had not been found).
1396 */
1397static bool find_wnd_and_gc_pred(cht_t *h, size_t hash, walk_mode_t walk_mode,
1398 equal_pred_t pred, void *pred_arg, wnd_t *wnd, bool *resizing)
1399{
1400 assert(wnd->cur);
1401
1402 if (wnd->cur == &sentinel)
1403 return true;
1404
1405 /*
1406 * A read barrier is not needed here to bring up the most recent
1407 * node marks (esp the N_DELETED). At worst we'll try to delete
1408 * an already deleted node; fail in delete_at(); and retry.
1409 */
1410
1411 size_t cur_hash;
1412
1413try_again:
1414 cur_hash = node_hash(h, wnd->cur);
1415
1416 while (cur_hash <= hash) {
1417 assert(wnd->cur && wnd->cur != &sentinel);
1418
1419 /* GC any deleted nodes on the way. */
1420 if (N_DELETED & get_mark(wnd->cur->link)) {
1421 if (!gc_deleted_node(h, walk_mode, wnd, resizing)) {
1422 /* Retry from the head of a bucket. */
1423 return false;
1424 }
1425 } else {
1426 /* Is this the node we were looking for? */
1427 if (cur_hash == hash && pred(pred_arg, wnd->cur))
1428 return true;
1429
1430 next_wnd(wnd);
1431 }
1432
1433 cur_hash = node_hash(h, wnd->cur);
1434 }
1435
1436 if (cur_hash == h->invalid_hash) {
1437 next_wnd(wnd);
1438 assert(wnd->cur);
1439 goto try_again;
1440 }
1441
1442 /* The searched for node is not in the current bucket. */
1443 return true;
1444}
1445
/** Finds the first item (deleted or not) with a hash greater than or equal to \a hash.
1447 *
1448 * The function returns the first item with a hash that is greater or
 * equal to \a hash in \a wnd. Moreover, it garbage collects logically
 * deleted nodes that have not yet been unlinked and freed. Therefore,
1451 * the returned node's predecessor will most likely be N_NORMAL.
1452 *
1453 * Unlike find_wnd_and_gc_pred(), this function may return a node
 * that is known to have been marked N_DELETED.
1455 *
1456 * @param h CHT to operate on.
1457 * @param hash Hash of the item to find.
1458 * @param walk_mode Bucket chain walk mode.
 * @param[in,out] wnd wnd.cur denotes the first node of the chain. If the
 *                 operation is successful, \a wnd points to the desired
1461 * item.
1462 * @param resizing Set to true if a table resize was detected but walk_mode
1463 * suggested the table was not undergoing a resize.
1464 * @return False indicates the operation must be retried. True otherwise
 * (even if an item with exactly the same hash was not found).
1466 */
1467static bool find_wnd_and_gc(cht_t *h, size_t hash, walk_mode_t walk_mode,
1468 wnd_t *wnd, bool *resizing)
1469{
1470try_again:
1471 assert(wnd->cur);
1472
1473 while (node_hash(h, wnd->cur) < hash) {
1474 /* GC any deleted nodes along the way to our desired node. */
1475 if (N_DELETED & get_mark(wnd->cur->link)) {
1476 if (!gc_deleted_node(h, walk_mode, wnd, resizing)) {
1477 /* Failed to remove the garbage node. Retry. */
1478 return false;
1479 }
1480 } else {
1481 next_wnd(wnd);
1482 }
1483
1484 assert(wnd->cur);
1485 }
1486
1487 if (node_hash(h, wnd->cur) == h->invalid_hash) {
1488 next_wnd(wnd);
1489 goto try_again;
1490 }
1491
1492 /* wnd->cur may be NULL or even marked N_DELETED. */
1493 return true;
1494}
1495
1496/** Garbage collects the N_DELETED node at \a wnd skipping join nodes. */
1497static bool gc_deleted_node(cht_t *h, walk_mode_t walk_mode, wnd_t *wnd,
1498 bool *resizing)
1499{
1500 assert(N_DELETED & get_mark(wnd->cur->link));
1501
1502 /* Skip deleted JOIN nodes. */
1503 if (walk_mode == WM_LEAVE_JOIN && (N_JOIN & get_mark(wnd->cur->link))) {
1504 next_wnd(wnd);
1505 } else {
1506 /* Ordinary deleted node or a deleted JOIN_FOLLOWS. */
1507 assert(walk_mode != WM_LEAVE_JOIN
1508 || !((N_JOIN | N_JOIN_FOLLOWS) & get_mark(wnd->cur->link)));
1509
1510 /* Unlink an ordinary deleted node, move JOIN_FOLLOWS mark. */
1511 if (!unlink_from_pred(wnd, walk_mode, resizing)) {
1512 /* Retry. The predecessor was deleted, invalid, const, join_follows. */
1513 return false;
1514 }
1515
1516 free_later(h, wnd->cur);
1517
1518 /* Leave ppred as is. */
1519 wnd->last = wnd->cur;
1520 wnd->cur = get_next(wnd->cur->link);
1521 }
1522
1523 return true;
1524}
1525
1526/** Returns true if a bucket join had already completed.
1527 *
1528 * May only be called if upd_resizing_head() indicates a bucket join
1529 * may be in progress.
1530 *
1531 * If it returns false, the search must be retried in order to guarantee
1532 * all item that should have been encountered have been seen.
1533 */
1534static bool join_completed(cht_t *h, const wnd_t *wnd)
1535{
1536 /*
1537 * The table is shrinking and the searched for item is in a bucket
1538 * appended to another. Check that the link joining these two buckets
1539 * is visible and if not, make it visible to this cpu.
1540 */
1541
1542 /*
1543 * Resizer ensures h->b->order stays the same for the duration of this
1544 * func. We got here because there was an alternative head to search.
1545	 * The resizer waits for all preexisting readers to finish after
1546	 * it publishes the new table and before it frees the old buckets.
1547 */
1548 assert(h->b->order > h->new_b->order);
1549 assert(wnd->cur);
1550
1551 /* Either we did not need the joining link or we have already followed it.*/
1552 if (wnd->cur != &sentinel)
1553 return true;
1554
1555 /* We have reached the end of a bucket. */
1556
1557 if (wnd->last != &sentinel) {
1558 size_t last_seen_hash = node_hash(h, wnd->last);
1559
1560 if (last_seen_hash == h->invalid_hash) {
1561 last_seen_hash = calc_node_hash(h, wnd->last);
1562 }
1563
1564 size_t last_old_idx = calc_bucket_idx(last_seen_hash, h->b->order);
1565 size_t move_src_idx = grow_idx(shrink_idx(last_old_idx));
1566
1567 /*
1568 * Last node seen was in the joining bucket - if the searched
1569 * for node is there we will find it.
1570 */
1571 if (move_src_idx != last_old_idx)
1572 return true;
1573 }
1574
1575 /*
1576 * Reached the end of the bucket but no nodes from the joining bucket
1577 * were seen. There should have at least been a JOIN node so we have
1578 * definitely not seen (and followed) the joining link. Make the link
1579 * visible and retry.
1580 */
1581 read_barrier();
1582 return false;
1583}
1584
1585/** When resizing, returns the bucket head to start the search with in \a phead.
1586 *
1587 * If a resize had been detected (eg cht_t.b.head[idx] is marked immutable),
1588 * upd_resizing_head() moves the bucket for \a hash from the old head
1589 * to the new head. Moreover, it splits or joins buckets as necessary.
1590 *
1591 * @param h CHT to operate on.
1592 * @param hash Hash of an item whose chain we would like to traverse.
1593 * @param[out] phead Head of the bucket to search for \a hash.
1594 * @param[out] join_finishing Set to true if a bucket join might be
1595 * in progress and the bucket may have to be traversed again
1596 * as indicated by join_completed().
1597 * @param[out] walk_mode Specifies how to interpret node marks.
1598 */
1599static void upd_resizing_head(cht_t *h, size_t hash, marked_ptr_t **phead,
1600 bool *join_finishing, walk_mode_t *walk_mode)
1601{
1602 cht_buckets_t *b = rcu_access(h->b);
1603 size_t old_idx = calc_bucket_idx(hash, b->order);
1604 size_t new_idx = calc_bucket_idx(hash, h->new_b->order);
1605
1606 marked_ptr_t *pold_head = &b->head[old_idx];
1607 marked_ptr_t *pnew_head = &h->new_b->head[new_idx];
1608
1609 /* In any case, use the bucket in the new table. */
1610 *phead = pnew_head;
1611
1612 /* Growing the table. */
1613 if (b->order < h->new_b->order) {
1614 size_t move_dest_idx = grow_idx(old_idx);
1615 marked_ptr_t *pmoved_head = &h->new_b->head[move_dest_idx];
1616
1617 /* Complete moving the bucket from the old to the new table. */
1618 help_head_move(pold_head, pmoved_head);
1619
1620 /* The hash belongs to the moved bucket. */
1621 if (move_dest_idx == new_idx) {
1622 assert(pmoved_head == pnew_head);
1623 /*
1624 * move_head() makes the new head of the moved bucket visible.
1625			 * The new head may be marked with a JOIN_FOLLOWS mark.
1626 */
1627 assert(!(N_CONST & get_mark(*pmoved_head)));
1628 *walk_mode = WM_MOVE_JOIN_FOLLOWS;
1629 } else {
1630 assert(pmoved_head != pnew_head);
1631 /*
1632 * The hash belongs to the bucket that is the result of splitting
1633 * the old/moved bucket, ie the bucket that contains the second
1634 * half of the split/old/moved bucket.
1635 */
1636
1637 /* The moved bucket has not yet been split. */
1638 if (N_NORMAL != get_mark(*pnew_head)) {
1639 size_t split_hash = calc_split_hash(new_idx, h->new_b->order);
1640 split_bucket(h, pmoved_head, pnew_head, split_hash);
1641 /*
1642 * split_bucket() makes the new head visible. No
1643 * JOIN_FOLLOWS in this part of split bucket.
1644 */
1645 assert(N_NORMAL == get_mark(*pnew_head));
1646 }
1647
1648 *walk_mode = WM_LEAVE_JOIN;
1649 }
1650	} else if (h->new_b->order < b->order) {
1651 /* Shrinking the table. */
1652
1653 size_t move_src_idx = grow_idx(new_idx);
1654
1655 /*
1656 * Complete moving the bucket from the old to the new table.
1657 * Makes a valid pnew_head visible if already moved.
1658 */
1659 help_head_move(&b->head[move_src_idx], pnew_head);
1660
1661 /* Hash belongs to the bucket to be joined with the moved bucket. */
1662 if (move_src_idx != old_idx) {
1663 /* Bucket join not yet completed. */
1664 if (N_INVALID != get_mark(*pold_head)) {
1665 size_t split_hash = calc_split_hash(old_idx, b->order);
1666 join_buckets(h, pold_head, pnew_head, split_hash);
1667 }
1668
1669 /*
1670 * The resizer sets pold_head to &sentinel when all cpus are
1671 * guaranteed to see the bucket join.
1672 */
1673 *join_finishing = (&sentinel != get_next(*pold_head));
1674 }
1675
1676 /* move_head() or join_buckets() makes it so or makes the mark visible.*/
1677 assert(N_INVALID == get_mark(*pold_head));
1678 /* move_head() makes it visible. No JOIN_FOLLOWS used when shrinking. */
1679 assert(N_NORMAL == get_mark(*pnew_head));
1680
1681 *walk_mode = WM_LEAVE_JOIN;
1682 } else {
1683 /*
1684 * Final stage of resize. The resizer is waiting for all
1685 * readers to notice that the old table had been replaced.
1686 */
1687 assert(b == h->new_b);
1688 *walk_mode = WM_NORMAL;
1689 }
1690}
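
/*
 * Simplified sketch (excluded from the build) of how a reader might combine
 * upd_resizing_head(), find_wnd_and_gc() and join_completed(). The check of
 * the old head's mark and the retry policy are assumptions for illustration;
 * the real lookup and update paths of this table handle more cases.
 */
#if 0
static bool example_walk_bucket(cht_t *h, size_t hash, wnd_t *wnd)
{
	cht_buckets_t *b = rcu_access(h->b);
	marked_ptr_t *phead = &b->head[calc_bucket_idx(hash, b->order)];
	bool join_finishing = false;
	walk_mode_t walk_mode = WM_NORMAL;
	bool resizing = false;

	/* A non-normal head mark means a resize is (or was) in progress. */
	if (N_NORMAL != get_mark(*phead))
		upd_resizing_head(h, hash, &phead, &join_finishing, &walk_mode);

	wnd->ppred = phead;
	wnd->cur = get_next(*phead);

	/* Failed to GC a deleted node - restart the walk. */
	if (!find_wnd_and_gc(h, hash, walk_mode, wnd, &resizing))
		return false;

	/* An unnoticed resize started midway - restart in resize mode. */
	if (resizing)
		return false;

	/* A bucket join might not be fully visible yet - retry if it was not. */
	if (join_finishing && !join_completed(h, wnd))
		return false;

	return true;
}
#endif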
1691
1692
1693#if 0
1694static void move_head(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head)
1695{
1696 start_head_move(psrc_head);
1697 cas_order_barrier();
1698 complete_head_move(psrc_head, pdest_head);
1699}
1700#endif
1701
1702/** Moves an immutable head \a psrc_head of cht_t.b to \a pdest_head of cht_t.new_b.
1703 *
1704 * The function guarantees the move will be visible on this cpu once
1705 * it completes. In particular, *pdest_head will not be N_INVALID.
1706 *
1707 * Unlike complete_head_move(), help_head_move() checks if the head had already
1708 * been moved and tries to avoid moving the bucket heads if possible.
1709 */
1710static inline void help_head_move(marked_ptr_t *psrc_head,
1711 marked_ptr_t *pdest_head)
1712{
1713	/* Head move has to be in progress already when calling this func. */
1714 assert(N_CONST & get_mark(*psrc_head));
1715
1716 /* Head already moved. */
1717 if (N_INVALID == get_mark(*psrc_head)) {
1718 /* Effects of the head move have not yet propagated to this cpu. */
1719 if (N_INVALID == get_mark(*pdest_head)) {
1720 /* Make the move visible on this cpu. */
1721 read_barrier();
1722 }
1723 } else {
1724 complete_head_move(psrc_head, pdest_head);
1725 }
1726
1727 assert(!(N_CONST & get_mark(*pdest_head)));
1728}
1729
1730/** Initiates the move of the old head \a psrc_head.
1731 *
1732 * The move may be completed with help_head_move().
1733 */
1734static void start_head_move(marked_ptr_t *psrc_head)
1735{
1736 /* Mark src head immutable. */
1737 mark_const(psrc_head);
1738}
1739
1740/** Marks the head immutable. */
1741static void mark_const(marked_ptr_t *psrc_head)
1742{
1743 marked_ptr_t ret, src_link;
1744
1745 /* Mark src head immutable. */
1746 do {
1747 cht_link_t *next = get_next(*psrc_head);
1748 src_link = make_link(next, N_NORMAL);
1749
1750 /* Mark the normal/clean src link immutable/const. */
1751 ret = cas_link(psrc_head, next, N_NORMAL, next, N_CONST);
1752 } while(ret != src_link && !(N_CONST & get_mark(ret)));
1753}
1754
1755/** Completes moving head psrc_head to pdest_head (started by start_head_move()).*/
1756static void complete_head_move(marked_ptr_t *psrc_head, marked_ptr_t *pdest_head)
1757{
1758 assert(N_JOIN_FOLLOWS != get_mark(*psrc_head));
1759 assert(N_CONST & get_mark(*psrc_head));
1760
1761 cht_link_t *next = get_next(*psrc_head);
1762
1763 DBG(marked_ptr_t ret = )
1764 cas_link(pdest_head, &sentinel, N_INVALID, next, N_NORMAL);
1765 assert(ret == make_link(&sentinel, N_INVALID) || (N_NORMAL == get_mark(ret)));
1766 cas_order_barrier();
1767
1768 DBG(ret = )
1769 cas_link(psrc_head, next, N_CONST, next, N_INVALID);
1770 assert(ret == make_link(next, N_CONST) || (N_INVALID == get_mark(ret)));
1771 cas_order_barrier();
1772}
1773
1774/** Splits the bucket at psrc_head and links to the remainder from pdest_head.
1775 *
1776 * Items with hashes greater or equal to \a split_hash are moved to bucket
1777 * with head at \a pdest_head.
1778 *
1779 * @param h CHT to operate on.
1780 * @param psrc_head Head of the bucket to split (in cht_t.new_b).
1781 * @param pdest_head Head of the bucket that points to the second part
1782 * of the split bucket in psrc_head. (in cht_t.new_b)
1783 * @param split_hash Hash of the first possible item in the remainder of
1784 * psrc_head, ie the smallest hash pdest_head is allowed
1785 * to point to.
1786 */
1787static void split_bucket(cht_t *h, marked_ptr_t *psrc_head,
1788 marked_ptr_t *pdest_head, size_t split_hash)
1789{
1790 /* Already split. */
1791 if (N_NORMAL == get_mark(*pdest_head))
1792 return;
1793
1794 /*
1795 * L == Last node of the first part of the split bucket. That part
1796 * remains in the original/src bucket.
1797 * F == First node of the second part of the split bucket. That part
1798 * will be referenced from the dest bucket head.
1799 *
1800 * We want to first mark a clean L as JF so that updaters unaware of
1801 * the split (or table resize):
1802 * - do not insert a new node between L and F
1803 * - do not unlink L (that is why it has to be clean/normal)
1804 * - do not unlink F
1805 *
1806 * Then we can safely mark F as JN even if it has been marked deleted.
1807 * Once F is marked as JN updaters aware of table resize will not
1808 * attempt to unlink it (JN will have two predecessors - we cannot
1809 * safely unlink from both at the same time). Updaters unaware of
1810 * ongoing resize can reach F only via L and that node is already
1811 * marked JF, so they won't unlink F.
1812 *
1813 * Last, link the new/dest head to F.
1814 *
1815 *
1816 * 0) ,-- split_hash, first hash of the dest bucket
1817 * v
1818 * [src_head | N] -> .. -> [L] -> [F]
1819 * [dest_head | Inv]
1820 *
1821 * 1) ,-- split_hash
1822 * v
1823 * [src_head | N] -> .. -> [JF] -> [F]
1824 * [dest_head | Inv]
1825 *
1826 * 2) ,-- split_hash
1827 * v
1828 * [src_head | N] -> .. -> [JF] -> [JN]
1829 * [dest_head | Inv]
1830 *
1831 * 3) ,-- split_hash
1832 * v
1833 * [src_head | N] -> .. -> [JF] -> [JN]
1834 * ^
1835 * [dest_head | N] -----------------'
1836 */
1837 wnd_t wnd;
1838
1839 rcu_read_lock();
1840
1841 /* Mark the last node of the first part of the split bucket as JF. */
1842 mark_join_follows(h, psrc_head, split_hash, &wnd);
1843 cas_order_barrier();
1844
1845 /* There are nodes in the dest bucket, ie the second part of the split. */
1846 if (wnd.cur != &sentinel) {
1847 /*
1848 * Mark the first node of the dest bucket as a join node so
1849 * updaters do not attempt to unlink it if it is deleted.
1850 */
1851 mark_join_node(wnd.cur);
1852 cas_order_barrier();
1853 } else {
1854 /*
1855 * Second part of the split bucket is empty. There are no nodes
1856 * to mark as JOIN nodes and there never will be.
1857 */
1858 }
1859
1860 /* Link the dest head to the second part of the split. */
1861 DBG(marked_ptr_t ret = )
1862 cas_link(pdest_head, &sentinel, N_INVALID, wnd.cur, N_NORMAL);
1863 assert(ret == make_link(&sentinel, N_INVALID) || (N_NORMAL == get_mark(ret)));
1864 cas_order_barrier();
1865
1866 rcu_read_unlock();
1867}
1868
1869/** Finds and marks the last node of psrc_head w/ hash less than split_hash.
1870 *
1871 * Finds a node in psrc_head with the greatest hash that is strictly less
1872 * than split_hash and marks it with N_JOIN_FOLLOWS.
1873 *
1874 * Returns a window pointing to that node.
1875 *
1876 * Any logically deleted nodes along the way are
1877 * garbage collected; therefore, the predecessor node (if any) will most
1878 * likely not be marked N_DELETED.
1879 *
1880 * @param h CHT to operate on.
1881 * @param psrc_head Bucket head.
1882 * @param split_hash The smallest hash a join node (ie the node following
1883 * the desired join-follows node) may have.
1884 * @param[out] wnd Points to the node marked with N_JOIN_FOLLOWS.
1885 */
1886static void mark_join_follows(cht_t *h, marked_ptr_t *psrc_head,
1887 size_t split_hash, wnd_t *wnd)
1888{
1889 /* See comment in split_bucket(). */
1890
1891 bool done = false;
1892
1893 do {
1894 bool resizing = false;
1895 wnd->ppred = psrc_head;
1896 wnd->cur = get_next(*psrc_head);
1897
1898 /*
1899 * Find the split window, ie the last node of the first part of
1900		 * the split bucket and its successor - the first node of
1901 * the second part of the split bucket. Retry if GC failed.
1902 */
1903 if (!find_wnd_and_gc(h, split_hash, WM_MOVE_JOIN_FOLLOWS, wnd, &resizing))
1904 continue;
1905
1906 /* Must not report that the table is resizing if WM_MOVE_JOIN_FOLLOWS.*/
1907 assert(!resizing);
1908 /*
1909 * Mark the last node of the first half of the split bucket
1910 * that a join node follows. It must be clean/normal.
1911 */
1912 marked_ptr_t ret
1913 = cas_link(wnd->ppred, wnd->cur, N_NORMAL, wnd->cur, N_JOIN_FOLLOWS);
1914
1915 /*
1916 * Successfully marked as a JF node or already marked that way (even
1917 * if also marked deleted - unlinking the node will move the JF mark).
1918 */
1919 done = (ret == make_link(wnd->cur, N_NORMAL))
1920 || (N_JOIN_FOLLOWS & get_mark(ret));
1921 } while (!done);
1922}
1923
1924/** Marks join_node with N_JOIN. */
1925static void mark_join_node(cht_link_t *join_node)
1926{
1927 /* See comment in split_bucket(). */
1928
1929 bool done;
1930 do {
1931 cht_link_t *next = get_next(join_node->link);
1932 mark_t mark = get_mark(join_node->link);
1933
1934 /*
1935 * May already be marked as deleted, but it won't be unlinked
1936 * because its predecessor is marked with JOIN_FOLLOWS or CONST.
1937 */
1938 marked_ptr_t ret
1939 = cas_link(&join_node->link, next, mark, next, mark | N_JOIN);
1940
1941 /* Successfully marked or already marked as a join node. */
1942 done = (ret == make_link(next, mark))
1943 || (N_JOIN & get_mark(ret));
1944 } while(!done);
1945}
1946
1947/** Appends the bucket at psrc_head to the bucket at pdest_head.
1948 *
1949 * @param h CHT to operate on.
1950 * @param psrc_head Bucket to merge with pdest_head.
1951 * @param pdest_head Bucket to be joined by psrc_head.
1952 * @param split_hash The smallest hash psrc_head may contain.
1953 */
1954static void join_buckets(cht_t *h, marked_ptr_t *psrc_head,
1955 marked_ptr_t *pdest_head, size_t split_hash)
1956{
1957 /* Buckets already joined. */
1958 if (N_INVALID == get_mark(*psrc_head))
1959 return;
1960 /*
1961 * F == First node of psrc_head, ie the bucket we want to append
1962 * to (ie join with) the bucket starting at pdest_head.
1963 * L == Last node of pdest_head, ie the bucket that psrc_head will
1964 * be appended to.
1965 *
1966 * (1) We first mark psrc_head immutable to signal that a join is
1967 * in progress and so that updaters unaware of the join (or table
1968 * resize):
1969 * - do not insert new nodes between the head psrc_head and F
1970 * - do not unlink F (it may already be marked deleted)
1971 *
1972 * (2) Next, F is marked as a join node. Updaters aware of table resize
1973 * will not attempt to unlink it. We cannot safely/atomically unlink
1974 * the join node because it will be pointed to from two different
1975 * buckets. Updaters unaware of resize will fail to unlink the join
1976 * node due to the head being marked immutable.
1977 *
1978 * (3) Then the tail of the bucket at pdest_head is linked to the join
1979 * node. From now on, nodes in both buckets can be found via pdest_head.
1980 *
1981 * (4) Last, mark immutable psrc_head as invalid. It signals updaters
1982 * that the join is complete and they can insert new nodes (originally
1983 * destined for psrc_head) into pdest_head.
1984 *
1985 * Note that pdest_head keeps pointing at the join node. This allows
1986 * lookups and updaters to determine if they should see a link between
1987 * the tail L and F when searching for nodes originally in psrc_head
1988 * via pdest_head. If they reach the tail of pdest_head without
1989 * encountering any nodes of psrc_head, either there were no nodes
1990 * in psrc_head to begin with or the link between L and F did not
1991	 * yet propagate to their cpus. If psrc_head was empty, it keeps pointing
1992	 * to the sentinel. Otherwise psrc_head points to a join node (it will not be
1993 * unlinked until table resize completes) and updaters/lookups
1994 * should issue a read_barrier() to make the link [L]->[JN] visible.
1995 *
1996 * 0) ,-- split_hash, first hash of the src bucket
1997 * v
1998 * [dest_head | N]-> .. -> [L]
1999 * [src_head | N]--> [F] -> ..
2000 * ^
2001 * ` split_hash, first hash of the src bucket
2002 *
2003 * 1) ,-- split_hash
2004 * v
2005 * [dest_head | N]-> .. -> [L]
2006 * [src_head | C]--> [F] -> ..
2007 *
2008 * 2) ,-- split_hash
2009 * v
2010 * [dest_head | N]-> .. -> [L]
2011 * [src_head | C]--> [JN] -> ..
2012 *
2013 * 3) ,-- split_hash
2014 * v
2015 * [dest_head | N]-> .. -> [L] --+
2016 * v
2017 * [src_head | C]-------------> [JN] -> ..
2018 *
2019 * 4) ,-- split_hash
2020 * v
2021 * [dest_head | N]-> .. -> [L] --+
2022 * v
2023 * [src_head | Inv]-----------> [JN] -> ..
2024 */
2025
2026 rcu_read_lock();
2027
2028 /* Mark src_head immutable - signals updaters that bucket join started. */
2029 mark_const(psrc_head);
2030 cas_order_barrier();
2031
2032 cht_link_t *join_node = get_next(*psrc_head);
2033
2034 if (join_node != &sentinel) {
2035 mark_join_node(join_node);
2036 cas_order_barrier();
2037
2038 link_to_join_node(h, pdest_head, join_node, split_hash);
2039 cas_order_barrier();
2040 }
2041
2042 DBG(marked_ptr_t ret = )
2043 cas_link(psrc_head, join_node, N_CONST, join_node, N_INVALID);
2044 assert(ret == make_link(join_node, N_CONST) || (N_INVALID == get_mark(ret)));
2045 cas_order_barrier();
2046
2047 rcu_read_unlock();
2048}
2049
2050/** Links the tail of pdest_head to join_node.
2051 *
2052 * @param h CHT to operate on.
2053 * @param pdest_head Head of the bucket whose tail is to be linked to join_node.
2054 * @param join_node A node marked N_JOIN with a hash greater or equal to
2055 * split_hash.
2056 * @param split_hash The least hash that is greater than the hash of any items
2057 * (originally) in pdest_head.
2058 */
2059static void link_to_join_node(cht_t *h, marked_ptr_t *pdest_head,
2060 cht_link_t *join_node, size_t split_hash)
2061{
2062 bool done = false;
2063
2064 do {
2065 wnd_t wnd = {
2066 .ppred = pdest_head,
2067 .cur = get_next(*pdest_head)
2068 };
2069
2070 bool resizing = false;
2071
2072 if (!find_wnd_and_gc(h, split_hash, WM_LEAVE_JOIN, &wnd, &resizing))
2073 continue;
2074
2075 assert(!resizing);
2076
2077 if (wnd.cur != &sentinel) {
2078 /* Must be from the new appended bucket. */
2079 assert(split_hash <= node_hash(h, wnd.cur)
2080 || h->invalid_hash == node_hash(h, wnd.cur));
2081 return;
2082 }
2083
2084 /* Reached the tail of pdest_head - link it to the join node. */
2085 marked_ptr_t ret =
2086 cas_link(wnd.ppred, &sentinel, N_NORMAL, join_node, N_NORMAL);
2087
2088 done = (ret == make_link(&sentinel, N_NORMAL));
2089 } while (!done);
2090}
2091
2092/** Instructs RCU to free the item once all preexisting references are dropped.
2093 *
2094 * The item is freed via op->remove_callback().
2095 */
2096static void free_later(cht_t *h, cht_link_t *item)
2097{
2098 assert(item != &sentinel);
2099
2100 /*
2101 * remove_callback only works as rcu_func_t because rcu_link is the first
2102 * field in cht_link_t.
2103 */
2104 rcu_call(&item->rcu_link, (rcu_func_t)h->op->remove_callback);
2105
2106 item_removed(h);
2107}
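
/*
 * Illustrative sketch of the layout requirement mentioned above (excluded
 * from the build). Because rcu_link is the first member of cht_link_t, the
 * &item->rcu_link pointer handed to rcu_call() is also a valid pointer to
 * the link itself, so the table's remove_callback can serve directly as the
 * RCU callback. The item type and callback below are hypothetical client
 * code, not part of this file.
 */
#if 0
typedef struct example_item {
	cht_link_t link;	/* Embedded table link; rcu_link is its first member. */
	int key;
} example_item_t;

static void example_remove_callback(cht_link_t *item)
{
	/* Recover the enclosing item and free it. */
	free(member_to_inst(item, example_item_t, link));
}
#endif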
2108
2109/** Notes that an item had been unlinked from the table and shrinks it if needed.
2110 *
2111 * If the number of items in the table drops below 1/4 of the maximum
2112 * allowed load the table is shrunk in the background.
2113 */
2114static inline void item_removed(cht_t *h)
2115{
2116 size_t items = (size_t) atomic_predec(&h->item_cnt);
2117 size_t bucket_cnt = (1 << h->b->order);
2118
2119 bool need_shrink = (items == h->max_load * bucket_cnt / 4);
2120 bool missed_shrink = (items == h->max_load * bucket_cnt / 8);
2121
2122 if ((need_shrink || missed_shrink) && h->b->order > h->min_order) {
2123 atomic_count_t resize_reqs = atomic_preinc(&h->resize_reqs);
2124 /* The first resize request. Start the resizer. */
2125 if (1 == resize_reqs) {
2126 workq_global_enqueue_noblock(&h->resize_work, resize_table);
2127 }
2128 }
2129}
2130
2131/** Notes an item had been inserted and grows the table if needed.
2132 *
2133 * The table is resized in the background.
2134 */
2135static inline void item_inserted(cht_t *h)
2136{
2137 size_t items = (size_t) atomic_preinc(&h->item_cnt);
2138 size_t bucket_cnt = (1 << h->b->order);
2139
2140 bool need_grow = (items == h->max_load * bucket_cnt);
2141 bool missed_grow = (items == 2 * h->max_load * bucket_cnt);
2142
2143 if ((need_grow || missed_grow) && h->b->order < CHT_MAX_ORDER) {
2144 atomic_count_t resize_reqs = atomic_preinc(&h->resize_reqs);
2145 /* The first resize request. Start the resizer. */
2146 if (1 == resize_reqs) {
2147 workq_global_enqueue_noblock(&h->resize_work, resize_table);
2148 }
2149 }
2150}
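
/*
 * Worked example of the resize triggers above (the max_load value is just an
 * illustration, not necessarily the configured default): with max_load == 4
 * and order == 10 there are 1024 buckets, so item_inserted() requests a grow
 * when the item count reaches 4 * 1024 == 4096 and item_removed() requests a
 * shrink when it drops to 4 * 1024 / 4 == 1024. The missed_grow/missed_shrink
 * checks fire at 8192 and 512 items respectively, giving a second trigger
 * point in case the first request did not lead to an actual resize (eg the
 * new bucket array could not be allocated).
 */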
2151
2152/** Resize request handler. Invoked on the system work queue. */
2153static void resize_table(work_t *arg)
2154{
2155 cht_t *h = member_to_inst(arg, cht_t, resize_work);
2156
2157#ifdef CONFIG_DEBUG
2158 assert(h->b);
2159 /* Make resize_reqs visible. */
2160 read_barrier();
2161 assert(0 < atomic_get(&h->resize_reqs));
2162#endif
2163
2164 bool done = false;
2165
2166 do {
2167 /* Load the most recent h->item_cnt. */
2168 read_barrier();
2169 size_t cur_items = (size_t) atomic_get(&h->item_cnt);
2170 size_t bucket_cnt = (1 << h->b->order);
2171 size_t max_items = h->max_load * bucket_cnt;
2172
2173 if (cur_items >= max_items && h->b->order < CHT_MAX_ORDER) {
2174 grow_table(h);
2175 } else if (cur_items <= max_items / 4 && h->b->order > h->min_order) {
2176 shrink_table(h);
2177 } else {
2178 /* Table is just the right size. */
2179 atomic_count_t reqs = atomic_predec(&h->resize_reqs);
2180 done = (reqs == 0);
2181 }
2182 } while (!done);
2183}
2184
2185/** Increases the number of buckets two-fold. Blocks until done. */
2186static void grow_table(cht_t *h)
2187{
2188 if (h->b->order >= CHT_MAX_ORDER)
2189 return;
2190
2191 h->new_b = alloc_buckets(h->b->order + 1, true, false);
2192
2193 /* Failed to alloc a new table - try next time the resizer is run. */
2194 if (!h->new_b)
2195 return;
2196
2197 /* Wait for all readers and updaters to see the initialized new table. */
2198 rcu_synchronize();
2199 size_t old_bucket_cnt = (1 << h->b->order);
2200
2201 /*
2202 * Give updaters a chance to help out with the resize. Do the minimum
2203 * work needed to announce a resize is in progress, ie start moving heads.
2204 */
2205 for (size_t idx = 0; idx < old_bucket_cnt; ++idx) {
2206 start_head_move(&h->b->head[idx]);
2207 }
2208
2209 /* Order start_head_move() wrt complete_head_move(). */
2210 cas_order_barrier();
2211
2212 /* Complete moving heads and split any buckets not yet split by updaters. */
2213 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2214 marked_ptr_t *move_dest_head = &h->new_b->head[grow_idx(old_idx)];
2215 marked_ptr_t *move_src_head = &h->b->head[old_idx];
2216
2217 /* Head move not yet completed. */
2218 if (N_INVALID != get_mark(*move_src_head)) {
2219 complete_head_move(move_src_head, move_dest_head);
2220 }
2221
2222 size_t split_idx = grow_to_split_idx(old_idx);
2223 size_t split_hash = calc_split_hash(split_idx, h->new_b->order);
2224 marked_ptr_t *split_dest_head = &h->new_b->head[split_idx];
2225
2226 split_bucket(h, move_dest_head, split_dest_head, split_hash);
2227 }
2228
2229 /*
2230 * Wait for all updaters to notice the new heads. Once everyone sees
2231 * the invalid old bucket heads they will know a resize is in progress
2232 * and updaters will modify the correct new buckets.
2233 */
2234 rcu_synchronize();
2235
2236 /* Clear the JOIN_FOLLOWS mark and remove the link between the split buckets.*/
2237 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2238 size_t new_idx = grow_idx(old_idx);
2239
2240 cleanup_join_follows(h, &h->new_b->head[new_idx]);
2241 }
2242
2243 /*
2244 * Wait for everyone to notice that buckets were split, ie link connecting
2245 * the join follows and join node has been cut.
2246 */
2247 rcu_synchronize();
2248
2249 /* Clear the JOIN mark and GC any deleted join nodes. */
2250 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2251 size_t new_idx = grow_to_split_idx(old_idx);
2252
2253 cleanup_join_node(h, &h->new_b->head[new_idx]);
2254 }
2255
2256 /* Wait for everyone to see that the table is clear of any resize marks. */
2257 rcu_synchronize();
2258
2259 cht_buckets_t *old_b = h->b;
2260 rcu_assign(h->b, h->new_b);
2261
2262 /* Wait for everyone to start using the new table. */
2263 rcu_synchronize();
2264
2265 free(old_b);
2266
2267 /* Not needed; just for increased readability. */
2268 h->new_b = NULL;
2269}
2270
2271/** Halves the number of buckets. Blocks until done. */
2272static void shrink_table(cht_t *h)
2273{
2274 if (h->b->order <= h->min_order)
2275 return;
2276
2277 h->new_b = alloc_buckets(h->b->order - 1, true, false);
2278
2279 /* Failed to alloc a new table - try next time the resizer is run. */
2280 if (!h->new_b)
2281 return;
2282
2283 /* Wait for all readers and updaters to see the initialized new table. */
2284 rcu_synchronize();
2285
2286 size_t old_bucket_cnt = (1 << h->b->order);
2287
2288 /*
2289 * Give updaters a chance to help out with the resize. Do the minimum
2290 * work needed to announce a resize is in progress, ie start moving heads.
2291 */
2292 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2293 size_t new_idx = shrink_idx(old_idx);
2294
2295 /* This bucket should be moved. */
2296 if (grow_idx(new_idx) == old_idx) {
2297 start_head_move(&h->b->head[old_idx]);
2298 } else {
2299 /* This bucket should join the moved bucket once the move is done.*/
2300 }
2301 }
2302
2303	/* Order start_head_move() wrt complete_head_move(). */
2304 cas_order_barrier();
2305
2306 /* Complete moving heads and join buckets with the moved buckets. */
2307 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2308 size_t new_idx = shrink_idx(old_idx);
2309 size_t move_src_idx = grow_idx(new_idx);
2310
2311 /* This bucket should be moved. */
2312 if (move_src_idx == old_idx) {
2313 /* Head move not yet completed. */
2314 if (N_INVALID != get_mark(h->b->head[old_idx])) {
2315 complete_head_move(&h->b->head[old_idx], &h->new_b->head[new_idx]);
2316 }
2317 } else {
2318 /* This bucket should join the moved bucket. */
2319 size_t split_hash = calc_split_hash(old_idx, h->b->order);
2320 join_buckets(h, &h->b->head[old_idx], &h->new_b->head[new_idx],
2321 split_hash);
2322 }
2323 }
2324
2325 /*
2326 * Wait for all updaters to notice the new heads. Once everyone sees
2327 * the invalid old bucket heads they will know a resize is in progress
2328 * and updaters will modify the correct new buckets.
2329 */
2330 rcu_synchronize();
2331
2332 /* Let everyone know joins are complete and fully visible. */
2333 for (size_t old_idx = 0; old_idx < old_bucket_cnt; ++old_idx) {
2334 size_t move_src_idx = grow_idx(shrink_idx(old_idx));
2335
2336		/* Reset the invalid joinee head to point to the sentinel. */
2337 if (old_idx != move_src_idx) {
2338 assert(N_INVALID == get_mark(h->b->head[old_idx]));
2339
2340 if (&sentinel != get_next(h->b->head[old_idx]))
2341 h->b->head[old_idx] = make_link(&sentinel, N_INVALID);
2342 }
2343 }
2344
2345	/* Wait for everyone to see the reset joinee heads before GCing the deleted join nodes. */
2346 rcu_synchronize();
2347
2348 size_t new_bucket_cnt = (1 << h->new_b->order);
2349
2350 /* Clear the JOIN mark and GC any deleted join nodes. */
2351 for (size_t new_idx = 0; new_idx < new_bucket_cnt; ++new_idx) {
2352 cleanup_join_node(h, &h->new_b->head[new_idx]);
2353 }
2354
2355 /* Wait for everyone to see that the table is clear of any resize marks. */
2356 rcu_synchronize();
2357
2358 cht_buckets_t *old_b = h->b;
2359 rcu_assign(h->b, h->new_b);
2360
2361 /* Wait for everyone to start using the new table. */
2362 rcu_synchronize();
2363
2364 free(old_b);
2365
2366 /* Not needed; just for increased readability. */
2367 h->new_b = NULL;
2368}
2369
2370/** Finds and clears the N_JOIN mark from a node in new_head (if present). */
2371static void cleanup_join_node(cht_t *h, marked_ptr_t *new_head)
2372{
2373 rcu_read_lock();
2374
2375 cht_link_t *cur = get_next(*new_head);
2376
2377 while (cur != &sentinel) {
2378 /* Clear the join node's JN mark - even if it is marked as deleted. */
2379 if (N_JOIN & get_mark(cur->link)) {
2380 clear_join_and_gc(h, cur, new_head);
2381 break;
2382 }
2383
2384 cur = get_next(cur->link);
2385 }
2386
2387 rcu_read_unlock();
2388}
2389
2390/** Clears the join_node's N_JOIN mark and frees it if marked N_DELETED as well. */
2391static void clear_join_and_gc(cht_t *h, cht_link_t *join_node,
2392 marked_ptr_t *new_head)
2393{
2394 assert(join_node != &sentinel);
2395 assert(join_node && (N_JOIN & get_mark(join_node->link)));
2396
2397 bool done;
2398
2399 /* Clear the JN mark. */
2400 do {
2401 marked_ptr_t jn_link = join_node->link;
2402 cht_link_t *next = get_next(jn_link);
2403 /* Clear the JOIN mark but keep the DEL mark if present. */
2404 mark_t cleared_mark = get_mark(jn_link) & N_DELETED;
2405
2406 marked_ptr_t ret =
2407 _cas_link(&join_node->link, jn_link, make_link(next, cleared_mark));
2408
2409 /* Done if the mark was cleared. Retry if a new node was inserted. */
2410 done = (ret == jn_link);
2411 assert(ret == jn_link || (get_mark(ret) & N_JOIN));
2412 } while (!done);
2413
2414 if (!(N_DELETED & get_mark(join_node->link)))
2415 return;
2416
2417 /* The join node had been marked as deleted - GC it. */
2418
2419 /* Clear the JOIN mark before trying to unlink the deleted join node.*/
2420 cas_order_barrier();
2421
2422 size_t jn_hash = node_hash(h, join_node);
2423 do {
2424 bool resizing = false;
2425
2426 wnd_t wnd = {
2427 .ppred = new_head,
2428 .cur = get_next(*new_head)
2429 };
2430
2431 done = find_wnd_and_gc_pred(h, jn_hash, WM_NORMAL, same_node_pred,
2432 join_node, &wnd, &resizing);
2433
2434 assert(!resizing);
2435 } while (!done);
2436}
2437
2438/** Finds a non-deleted node with N_JOIN_FOLLOWS and clears the mark. */
2439static void cleanup_join_follows(cht_t *h, marked_ptr_t *new_head)
2440{
2441 assert(new_head);
2442
2443 rcu_read_lock();
2444
2445 wnd_t wnd = {
2446 .ppred = NULL,
2447 .cur = NULL
2448 };
2449 marked_ptr_t *cur_link = new_head;
2450
2451 /*
2452 * Find the non-deleted node with a JF mark and clear the JF mark.
2453 * The JF node may be deleted and/or the mark moved to its neighbors
2454 * at any time. Therefore, we GC deleted nodes until we find the JF
2455 * node in order to remove stale/deleted JF nodes left behind eg by
2456 * delayed threads that did not yet get a chance to unlink the deleted
2457 * JF node and move its mark.
2458 *
2459 * Note that the head may be marked JF (but never DELETED).
2460 */
2461 while (true) {
2462 bool is_jf_node = N_JOIN_FOLLOWS & get_mark(*cur_link);
2463
2464 /* GC any deleted nodes on the way - even deleted JOIN_FOLLOWS. */
2465 if (N_DELETED & get_mark(*cur_link)) {
2466 assert(cur_link != new_head);
2467 assert(wnd.ppred && wnd.cur && wnd.cur != &sentinel);
2468 assert(cur_link == &wnd.cur->link);
2469
2470 bool dummy;
2471 bool deleted = gc_deleted_node(h, WM_MOVE_JOIN_FOLLOWS, &wnd, &dummy);
2472
2473 /* Failed to GC or collected a deleted JOIN_FOLLOWS. */
2474 if (!deleted || is_jf_node) {
2475 /* Retry from the head of the bucket. */
2476 cur_link = new_head;
2477 continue;
2478 }
2479 } else {
2480 /* Found a non-deleted JF. Clear its JF mark. */
2481 if (is_jf_node) {
2482 cht_link_t *next = get_next(*cur_link);
2483 marked_ptr_t ret =
2484 cas_link(cur_link, next, N_JOIN_FOLLOWS, &sentinel, N_NORMAL);
2485
2486 assert(next == &sentinel
2487 || ((N_JOIN | N_JOIN_FOLLOWS) & get_mark(ret)));
2488
2489 /* Successfully cleared the JF mark of a non-deleted node. */
2490 if (ret == make_link(next, N_JOIN_FOLLOWS)) {
2491 break;
2492 } else {
2493 /*
2494 * The JF node had been deleted or a new node inserted
2495 * right after it. Retry from the head.
2496 */
2497 cur_link = new_head;
2498 continue;
2499 }
2500 } else {
2501 wnd.ppred = cur_link;
2502 wnd.cur = get_next(*cur_link);
2503 }
2504 }
2505
2506 /* We must encounter a JF node before we reach the end of the bucket. */
2507 assert(wnd.cur && wnd.cur != &sentinel);
2508 cur_link = &wnd.cur->link;
2509 }
2510
2511 rcu_read_unlock();
2512}
2513
2514/** Returns the first possible hash following a bucket split point.
2515 *
2516 * In other words the returned hash is the smallest possible hash
2517 * the remainder of the split bucket may contain.
2518 */
2519static inline size_t calc_split_hash(size_t split_idx, size_t order)
2520{
2521 assert(1 <= order && order <= 8 * sizeof(size_t));
2522 return split_idx << (8 * sizeof(size_t) - order);
2523}
2524
2525/** Returns the bucket head index given the table size order and item hash. */
2526static inline size_t calc_bucket_idx(size_t hash, size_t order)
2527{
2528 assert(1 <= order && order <= 8 * sizeof(size_t));
2529 return hash >> (8 * sizeof(size_t) - order);
2530}
2531
2532/** Returns the index of the bucket receiving the second half of a split bucket when growing. */
2533static inline size_t grow_to_split_idx(size_t old_idx)
2534{
2535 return grow_idx(old_idx) | 1;
2536}
2537
2538/** Returns the destination index of a bucket head when the table is growing. */
2539static inline size_t grow_idx(size_t idx)
2540{
2541 return idx << 1;
2542}
2543
2544/** Returns the destination index of a bucket head when the table is shrinking.*/
2545static inline size_t shrink_idx(size_t idx)
2546{
2547 return idx >> 1;
2548}
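
/*
 * Worked example of the index and split-hash helpers above (excluded from
 * the build; the 64-bit size_t in the last two asserts is an assumption).
 */
#if 0
static void example_index_math(void)
{
	/*
	 * Growing from order 3 to order 4: old bucket 5 moves to new index 10
	 * and the upper half of its hash range splits off into index 11.
	 */
	assert(grow_idx(5) == 10);
	assert(grow_to_split_idx(5) == 11);

	/* Shrinking maps both new buckets back onto old index 5. */
	assert(shrink_idx(10) == 5 && shrink_idx(11) == 5);

	/* The split-off bucket may only contain hashes >= 11 << 60 ... */
	assert(calc_split_hash(11, 4) == ((size_t)11 << 60));
	/* ... and such a hash indeed maps to bucket 11 by its top 4 bits. */
	assert(calc_bucket_idx((size_t)11 << 60, 4) == 11);
}
#endif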
2549
2550/** Returns a mixed hash of the search key.*/
2551static inline size_t calc_key_hash(cht_t *h, void *key)
2552{
2553 /* Mimic calc_node_hash. */
2554 return hash_mix(h->op->key_hash(key)) & ~(size_t)1;
2555}
2556
2557/** Returns a memoized mixed hash of the item. */
2558static inline size_t node_hash(cht_t *h, const cht_link_t *item)
2559{
2560 assert(item->hash == h->invalid_hash
2561 || item->hash == sentinel.hash
2562 || item->hash == calc_node_hash(h, item));
2563
2564 return item->hash;
2565}
2566
2567/** Calculates and mixes the hash of the item. */
2568static inline size_t calc_node_hash(cht_t *h, const cht_link_t *item)
2569{
2570 assert(item != &sentinel);
2571 /*
2572 * Clear the lowest order bit in order for sentinel's node hash
2573 * to be the greatest possible.
2574 */
2575 return hash_mix(h->op->hash(item)) & ~(size_t)1;
2576}
2577
2578/** Computes and memoizes the hash of the item. */
2579static inline void memoize_node_hash(cht_t *h, cht_link_t *item)
2580{
2581 item->hash = calc_node_hash(h, item);
2582}
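
/*
 * Example of the hashing scheme above: both key and node hashes have their
 * lowest bit cleared, so they can never equal the sentinel's hash (assumed
 * here to be the all-ones value, ie the greatest possible size_t). Hash
 * ordered bucket walks are therefore guaranteed to stop at the sentinel.
 */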
2583
2584/** Packs the next pointer address and the mark into a single pointer. */
2585static inline marked_ptr_t make_link(const cht_link_t *next, mark_t mark)
2586{
2587 marked_ptr_t ptr = (marked_ptr_t) next;
2588
2589 assert(!(ptr & N_MARK_MASK));
2590 assert(!((unsigned)mark & ~N_MARK_MASK));
2591
2592 return ptr | mark;
2593}
2594
2595/** Strips any marks from the next item link and returns the next item's address.*/
2596static inline cht_link_t * get_next(marked_ptr_t link)
2597{
2598 return (cht_link_t*)(link & ~N_MARK_MASK);
2599}
2600
2601/** Returns the current node's mark stored in the next item link. */
2602static inline mark_t get_mark(marked_ptr_t link)
2603{
2604 return (mark_t)(link & N_MARK_MASK);
2605}
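
/*
 * Illustrative example of the link/mark packing above (excluded from the
 * build). Marks live in the low bits of the (suitably aligned) next pointer,
 * so a pointer and its marks are read and CASed together as a single word.
 */
#if 0
static void example_mark_packing(cht_link_t *next)
{
	/* A node may be marked both deleted and a join node at once. */
	marked_ptr_t link = make_link(next, N_DELETED | N_JOIN);

	assert(get_next(link) == next);
	assert(get_mark(link) == (N_DELETED | N_JOIN));
}
#endif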
2606
2607/** Moves the window by one item so that it points to the next item. */
2608static inline void next_wnd(wnd_t *wnd)
2609{
2610 assert(wnd);
2611 assert(wnd->cur);
2612
2613 wnd->last = wnd->cur;
2614 wnd->ppred = &wnd->cur->link;
2615 wnd->cur = get_next(wnd->cur->link);
2616}
2617
2618/** Predicate that matches only exactly the same node. */
2619static bool same_node_pred(void *node, const cht_link_t *item2)
2620{
2621 const cht_link_t *item1 = (const cht_link_t*) node;
2622 return item1 == item2;
2623}
2624
2625/** Compare-and-swaps a next item link. */
2626static inline marked_ptr_t cas_link(marked_ptr_t *link, const cht_link_t *cur_next,
2627 mark_t cur_mark, const cht_link_t *new_next, mark_t new_mark)
2628{
2629 return _cas_link(link, make_link(cur_next, cur_mark),
2630 make_link(new_next, new_mark));
2631}
2632
2633/** Compare-and-swaps a next item link. */
2634static inline marked_ptr_t _cas_link(marked_ptr_t *link, marked_ptr_t cur,
2635 marked_ptr_t new)
2636{
2637 assert(link != &sentinel.link);
2638 /*
2639 * cas(x) on the same location x on one cpu must be ordered, but do not
2640 * have to be ordered wrt to other cas(y) to a different location y
2641 * on the same cpu.
2642 *
2643 * cas(x) must act as a write barrier on x, ie if cas(x) succeeds
2644 * and is observed by another cpu, then all cpus must be able to
2645 * make the effects of cas(x) visible just by issuing a load barrier.
2646 * For example:
2647 * cpu1 cpu2 cpu3
2648 * cas(x, 0 -> 1), succeeds
2649 * cas(x, 0 -> 1), fails
2650 * MB, to order load of x in cas and store to y
2651 * y = 7
2652 * sees y == 7
2653 * loadMB must be enough to make cas(x) on cpu3 visible to cpu1, ie x == 1.
2654 *
2655 * If cas() did not work this way:
2656 * a) our head move protocol would not be correct.
2657 * b) freeing an item linked to a moved head after another item was
2658 * inserted in front of it, would require more than one grace period.
2659 *
2660 * Ad (a): In the following example, cpu1 starts moving old_head
2661 * to new_head, cpu2 completes the move and cpu3 notices cpu2
2662 * completed the move before cpu1 gets a chance to notice cpu2
2663 * had already completed the move. Our requirements for cas()
2664 * assume cpu3 will see a valid and mutable value in new_head
2665 * after issuing a load memory barrier once it has determined
2666 * the old_head's value had been successfully moved to new_head
2667 * (because it sees old_head marked invalid).
2668 *
2669 * cpu1 cpu2 cpu3
2670 * cas(old_head, <addr, N>, <addr, Const>), succeeds
2671 * cas-order-barrier
2672 * // Move from old_head to new_head started, now the interesting stuff:
2673 * cas(new_head, <0, Inv>, <addr, N>), succeeds
2674 *
2675 * cas(new_head, <0, Inv>, <addr, N>), but fails
2676 * cas-order-barrier
2677 * cas(old_head, <addr, Const>, <addr, Inv>), succeeds
2678 *
2679 * Sees old_head marked Inv (by cpu2)
2680 * load-MB
2681 * assert(new_head == <addr, N>)
2682 *
2683 * cas-order-barrier
2684 *
2685 * Even though cpu1 did not yet issue a cas-order-barrier, cpu1's store
2686 * to new_head (successful cas()) must be made visible to cpu3 with
2687 * a load memory barrier if cpu1's store to new_head is visible
2688 * on another cpu (cpu2) and that cpu's (cpu2's) store to old_head
2689	 * is already visible to cpu3.
2690 */
2691 void *expected = (void*)cur;
2692
2693 /*
2694 * Use the acquire-release model, although we could probably
2695 * get away even with the relaxed memory model due to our use
2696 * of explicit memory barriers.
2697 */
2698 __atomic_compare_exchange_n((void**)link, &expected, (void *)new, false,
2699 __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
2700
2701 return (marked_ptr_t) expected;
2702}
2703
2704/** Orders compare-and-swaps to different memory locations. */
2705static inline void cas_order_barrier(void)
2706{
2707 /* Make sure CAS to different memory locations are ordered. */
2708 write_barrier();
2709}
2710
2711
2712/** @}
2713 */