Usage Guide (liblfds)

Introduction

This page describes how to use the liblfds library, and then covers the novel and pecuilar issues which originate in the lock-free nature of the data structures in the library.

Where-ever possible such issues have been hidden from the user, but there are some which simply cannot be hidden and as such the user has be aware of them.

Library Initialization and Cleanup

No library initialization or cleanup are required.

Usage

To use liblfds, include the header file liblfds710.h and link as normal to the library in your build.

Novel and Peculiar Issues

Memory Allocation

The liblfds library performs no memory allocation or deallocation. Accordindly, there are no new and delete functions, but rather init and cleanup.

The user is responsible for all allocation and all deallocation. As such, allocations can be from the heap or from the stack, or from user-mode or from the kernel; the library itself just uses what you give it, and doesn't know about and so does not differentiate between virtual or physical addresses. Allocations can be shared memory, but note the virtual memory ranges must be the same in all processes - liblfds uses addresses directly, rather than using offsets. Being able to used shared memory is particularly important for Windows, which lacks a high performance cross-process lock; the data structures in liblfds when used with shared memory meet this need.

Memory Deallocation

Any data structure element which has at any time been present in a lock-free data structure can never be passed to free until the data structure in question is no longer in use and has had its cleanup function called.

As such, typical usage is for data structure elements to be supplied from (and returned to) a lock-free freelist.

There is a single exception to this, which is the unbounded, many producer, many consumer queue. It is safe to deallocate elements which have emerged from this data structure.

Data Structure Initialization

Passing a data structure state to its init function initializes that state but that initialization is and is only valid for the logical core upon which it occurs.

The macro LFDS710_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE is used to make the initialization valid on other logical cores and it will make the initialization valid upon and only upon the logical core which calls the macro.

Expected use is that a thread will initialize a data structure, pass a pointer to its state to other threads, all of whom will then call LFDS710_LIBLFDS_MAKE_USABLE_TO_CURRENT_LOGICAL_CORE_INITS_PERFORMED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE.

The Standard Library and lfds710_pal_uint_t

The liblfds library is intended for both 32 and 64 bit platforms. As such, there is a need for an unsigned type which is 32 bits long on a 32 bit platform, and 64 bits long on a 64 bit platform - but remember that the Standard Library is not used, so we can't turn to it for a solution (and also that for C89, there wasn't really such a type anyway - size_t did in fact behave in this way on Windows and Linux, but semantically size_t means something else, and so it is only co-incidentally behaving in this way).

As such, liblfds in the platform abstraction layer typedefs liblfds710_pal_uint_t (and a signed equivelent, lfds710_pal_int_t). This is set to be an unsigned integer which is the natural length for the platform, i.e. the length of the processor register, 32 bits on a 32 bit CPU and 64 bits on a 64 bit CPU.

Exclusive Reservation Granule (ARM, POWERPC)

On ARM and POWERPC there is a define in the liblfds header file lfds710_porting_abstraction_layer_processor.h, LFDS710_PAL_ATOMIC_ISOLATION_IN_BYTES which SHOULD BE SET CORRECTLY, as the value in the header file is necessrily the worse-case (longest) value, which in the case of ARM is 2048 bytes - which has the effect ot making many of the data structure structures HUGE.

There are two approaches in hardware, in processors, to implementing atomic operations. The first approach is compare-and-swap (CAS), the second approach is load-linked/store-conditional (LL/SC).

Atomic operation involve writing a variable to memory. CAS implements atomicity by locking the cache-line containing the variable. LL/SC implements atomicity by loading the variable into a register, where anything can be done with it, but watching the memory location it comes from until the variable is stored again; if in the meantime another write occurred to that memory location, the store aborts.

The granularity of the 'watching' varies a great deal. On some platforms, such as MIPS, only one 'watcher' is available per logical processor. On some platforms, such as ARM and POWERPC, memory is conceptually divided up into pages (which are known as "Exclusive Reservation Granules", "ERG" for short) and if a write occurs to the page which contains the target varible, then the store fails.

On ARM the Exclusive Reservation Granule range in size from 8 to 2048 bytes (always a power of two, though - 8, 16, 32, 64, etc), depending on implementation.

Obviously, for liblfds to alway work, the header file has to use the 2048 byte value - but where in all the liblfds structures variables which are the target of atomic operations have to be in their own granule, then naturally the larger the granule size, the larger the structure. Some structures have a number of variables which are the target of atomic operations, and so those structures can become very large.

This then leads to the question of finding out or determining the ERG length.

The ERG length can be obtained from the processor, it's stored in a register, but on ARM this is only possible when the processor is in supervisor mode; so liblfds cannot access this information. The documentation for any given system should, somewhere deeply buried, indicate the ERG length.

There is however another way. The libtest library offers a function libtest_misc_determine_erg which attempts to empircally determine the ERG length, by running on one logical core an LL operation, then on every other logical core touching memory just inside the largest possible ERG size and then trying the SC operation, repeating this with progressively smaller ERG sizes, until the operation fails, which indicates the ERG size.

This function can only work on systems which have more than one physical processor (multiple logical processors in one physical processor is not enough). This is because ARM implements per-processor 'local' watchers, which are typically much more relaxed than the 'global' (system-wide, i.e. multiple physical processors) watchers, which normally not throw an error even if a write occurs inside the ERG - i.e. with the local watcher only, it's not possible to make the LL/SC fail, so the code cannot work out the length.

The test binary has an argument, "-e", which runs a test using this function, like so;

test -e

The output will look like this (plus a bunch of fixed explanatory text);

 ERG length in bytes : Number successful LL/SC ops
 =================================================
 4 bytes : 0
 8 bytes : 0
 16 bytes : 0
 32 bytes : 0
 64 bytes : 1024
 128 bytes : 1023
 256 bytes : 1023
 512 bytes : 1024
 1024 bytes : 1024
 2048 bytes : 12

The ERG size is on the left, the number of successful LL/SC ops on the right. Each size is tested 1024 times. The smallest size with 1024, or almost 1024, operations, is the ERG size. (LL/SC ops can fail naturally due to system activity, so it's expected that sometimes one or two will fail by themselves).

We see here the ERG sie for this platform is 64 bytes, which is correct - this is a Cortex A7 in a Raspberry Pi 2 Model B.

(That there are 12 successful ops for teh 2048 byte size is not in fact understood. There is a 4 byte ERG test as a sanity check, because if it passes, then it is clear the test is not working properly).