Queue (unbounded, many producer, many consumer)
Source Files
└───liblfds710 ├───inc │ └───liblfds710 │ lfds710_queue_umm.h └───src └───lfds710_queue lfds710_queue_umm_cleanup.c lfds710_queue_umm_dequeue.c lfds710_queue_umm_enqueue.c lfds710_queue_umm_init.c lfds710_queue_umm_internal.h lfds710_queue_umm_query.c
Enums
enum lfds710_queue_umm_query;
Opaque Structures
struct lfds710_queue_umm_element; struct lfds710_queue_umm_state;
Macros
#define LFDS710_QUEUE_UMM_GET_KEY_FROM_ELEMENT( queue_umm_element ) #define LFDS710_QUEUE_UMM_SET_KEY_IN_ELEMENT( queue_umm_element, new_key ) #define LFDS710_QUEUE_UMM_GET_VALUE_FROM_ELEMENT( queue_umm_element ) #define LFDS710_QUEUE_UMM_SET_VALUE_IN_ELEMENT( queue_umm_element, new_value ) #define LFDS710_QUEUE_UMM_GET_USER_STATE_FROM_STATE( queue_umm_state )
Prototypes
void lfds710_queue_init_valid_on_current_logical_core( struct lfds710_queue_umm_state *qumms, struct lfds710_queue_umm_element *qumme_dummy, void *user_state ); void lfds710_queue_umm_cleanup( struct lfds710_queue_umm_state *qumms, void (*element_dequeue_callback)(struct lfds710_queue_umm_state *qumms, struct lfds710_queue_umm_element *qumme, enum lfds710_liblfds_misc_flag dummy_element_flag) ); void lfds710_queue_umm_enqueue( struct lfds710_queue_umm_state *qumms, struct lfds710_queue_umm_element *qumme ); int lfds710_queue_umm_dequeue( struct lfds710_queue_umm_state *qumms, struct lfds710_queue_umm_element **qumme ); void lfds710_queue_umm_query( struct lfds710_queue_umm_state *qumms, enum lfds710_queue_umm_query query_type, void *query_input, void *query_output );
Overview
This data structure implements an unbounded, many-producer, many-consumer queue and internally implements exponential backoff to help deal with high load and so improve scalability.
The implementation performs no allocations. The user is responsible for all allocations (and deallocations), where these allocations are passed into the API functions, which then use them. As such, allocations can be on the stack, on the heap, or as can sometimes be the the case in embedded systems, allocated with fixed addresses at compile time from a fixed global store. Allocations can also be shared memory, but in this case, the virtual addresses used must be the same in all processes.
General usage is that the user calls lfds710_queue_umm_init_valid_on_current_logical_core to initialize a struct lfds710_queue_umm_state, and then calls lfds710_queue_umm_enqueue and lfds710_queue_umm_dequeue to enqueue and dequeue struct lfds710_queue_umm_elements.
A queue element provides the ability to store a key and a value, both of which are of type void *. The key is not used in any way by the queue (and of course the value is neither), rather, it is available as a convenience for the user, for situations where data is being transferred between different types of data structures, where some of these data structures do support a meaningful key. The key and value are get and set by macros, such as LFDS710_QUEUE_UMM_SET_VALUE_IN_ELEMENT. The SET macros can only be used when an element is outside of the queue. (Things may seem to work even if they are used on elements which are in the queue, but it's a matter of pure chance).
(See the section below, on lock-free specific behaviour, for an explanation of the unusual init function name.)
The state and element structures are both public, present in the lfds710_queue_umm.h header file, which is in line with the style for the rest of the library. Using this queue however differs somewhat from the usual expected method whereby each user structure contains an element structure (freelist element, btree element, etc) of the data structure it is to be placed in, because the queue always contains a dummy element (which is why there is a queue element argument to the init function; it is for the dummy element).
If we imagine the queue immediately after init, but before any enqueue, we see the queue contains a single dummy element, which contains no data. If we try to dequeue, dequeue fails (the queue knows that only the dummy is present). If we enqueue, the queue then contains the dummy element (with no data) and the newly enqueued real element (with data). Now if we dequeue, we are given the dummy element, but is has been populated with the data from the real element, and the real element (which now lacks data) has become the dummy element.
As such, the usual expected method of putting queue element structures in the user structures which are being enqueued does not quite work as we expect because the queue element which comes back to us is *not* the queue element in the user structure which has been returned to us. We can immediately re-use, or place on a freelist, the queue element which is returned to us - this is safe, it has left the queue - but we cannot use the queue element which is in the user structure, because this is now the dummy element in the queue - it is still in the queue.
This also means we lose locality between the queue element and the user structure its in, which is unfortunate.
Lock-free Specific Behaviour
The state initialization function, lfds710_queue_umm_init_valid_on_current_logical_core, as the same suggests, initializes the state structure but that initialization is only valid on the current logical core. For the initialization to be valid on other logical cores (i.e. other threads where they are running on other logical cores) those other threads need to call the long-windedly named macro LFDS710_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE, which will do that which its name suggests.
Once a queue element structure has been enqueued to the queue, it cannot be deallocated (free, or stack allocation lifetimes ending due to say a thread ending, etc) until function lfds710_queue_umm_cleanup has returned. Typical usage is to return dequeued queue elements to a freelist.
The SET macros (for setting key and value, in stack elements) can only be correctly used on elements which are outside of a queue, and the GET macros, if called by a thread on a logical core other than the logical core of the thread which called the SET macros, can only be correctly used on a queue element which has been enqueued to the queue and then dequeued.
By correctly is it meant to say that the GET macros will actually read the data written by the SET macros, and not some other data.
The queue should be regarded as a safe communication channel between threads. Any queue element which has a key and/or value set, and then is enqueued to a queue, will allow any thread which dequeues that element to correctly read the key and/or value (and by this is it meant to say not just the void pointer of the key and value, but also whatever they point to). This is the only guarantee. Any reads or writes of key and/or value, or what they point to, which occur outside of this enqueuing and dequeuing pair are not guaranteed to be correct; the data written may never be seen by other threads.
Benchmark Results and Analysis
queue (unbounded, many produces, many consumer) | ||
---|---|---|
ARM32 | x64 | |
Raspberry Pi 2 Model B | AWS dedicated VM | Core i5 |
One benchmark operation is an enqueue-dequeue pair. The enqueue function requires two DWCAS operations, and the dequeue one. The enqueue takes about twice as long as the dequeue.
Quite a bit of work went into improving queue performance in 7.1.0. The basic problem was that in 7.0.0, the static exponential backoff value was much, much too low. The autotuning exponential backoff in 7.1.0 addresses this issue. There was also some improvement (or fixing of mistakes :-) with regard to cache-line usage in the queue element. The upshot is a large (3x) improvement on x64. However, for reasons yet unexplored, the ARM32 version of this data structure performs appallingly; this was not noticed during the performance improvement work, due to too-many-columns-in-a-row-blindness.
White Paper
This data structure impliments a modified version of Michael and Scott's queue.
Note however the word modified. There are two modifications.
The first relates to an issue identified by Bryan Kerr, which can be found in this blog post. In short, where the counters are uninitialized, and where elements carry a counter with them, it is possible when those uninitialized values happen to have the same starting values and when moving elements between multiple queues to end up when enqueuing an element to Queue A, to actually end up enqueing to Queue B - essentially, you come to enqueue the element to Queue A, and then just before that occurs, everything is dequeued from Queue A and placed into Queue B. To the atomic operation about to perform the enqueue, everything looks as it should!
The solution adopted has been to endow each queue instance with an ABA counter, which is initialized by a high quality PRNG, and which then is used with atomic add to provide the initial ABA value for an element when it is enqueued.
The second modification is more questionable. It is an issue which appears to me to exist - I am sure I must be wrong, but cannot for the life of me see that it does not exist, and so until someone can explain to me what has failed to be understood, I have modified the code to address the issue.
The issue is discussed in this blog post, on the liblfds blog.
Essentially, it seems there may be a typo or a bug in the psudo-code in the paper, the upshot of which is a race condition which makes the call to free in the dequeue unsafe. It is easily fixed - so easily, and given the seeming typo in the enqueue code vs enqueue comment on line E4, that is seems to me that this may be a typographic error rather than an actual design flaw. However, this does not explain the matching bug in the dequeue code, which suffers no such typo.
The fix is very, very simple and does not invalidate or in any way significantly change the code or the function of the code - see the blog post for a description.
License
Unclear. The design comes from the 1998 white paper by Michael and Scott. No patent is known, and the queue is widely used - Java uses it, for example, for ConcurrentLinkedQueue.
Example
#include <stdio.h> #include <stdlib.h> #include "liblfds710.h" struct test_data { struct lfds710_queue_element qe; int number; }; int main() { int long long unsigned loop; struct lfds710_queue_element *qe, qe_dummy; struct lfds710_queue_state qs; struct test_data *td, *td_temp; lfds710_queue_umm_init_valid_on_current_logical_core( &qs, &qe_dummy, NULL ); td = malloc( sizeof(struct test_data) * 10 ); // TRD : queue ten elements for( loop = 0 ; loop < 10 ; loop++ ) { // TRD : we enqueue the numbers 0 to 9 td[loop].number = (int) loop; LFDS710_QUEUE_UMM_SET_VALUE_IN_ELEMENT( td[loop].qe, &td[loop] ); lfds710_queue_umm_enqueue( &qs, &td[loop].qe ); } // TRD : dequeue until the queue is empty while( lfds710_queue_umm_dequeue(&qs, &qe) ) { temp_td = LFDS710_QUEUE_UMM_GET_VALUE_FROM_ELEMENT( *qe ); // TRD : we dequeue the numbers 0 to 9 printf( "number = %d\n", temp_td->number ); } lfds710_queue_cleanup( &qs, NULL ); free( td ); return( EXIT_SUCCESS ); }