Difference between pages "r7.0.0:Usage Guide (liblfds)" and "r7.0.0:Porting Guide (liblfds)"

From liblfds.org
(Difference between pages)
Jump to navigation Jump to search
 
 
Line 1: Line 1:
{{DISPLAYTITLE:Usage Guide (liblfds)}}
{{DISPLAYTITLE:Porting Guide (liblfds)}}
==Introduction==
==Introduction==
This page describes how to use the ''liblfds'' library in terms of compiling and linking, and then covers the novel and pecuilar issues which originate in the lock-free nature of the data structures in the library.
To minimize the work involved is making ''liblfds'' work on any given platform a porting abstraction layer has been written.  Porting simply involves implementing the porting abstraction layer; the library will then compile and work.  Implementation involves providing values to a small set defines, macros and typedefs.  The porting abstraction layer is implemented as a small set of header files in the public header file directory, thus;


Where-ever possible such issues have been hidden from the user, but there are some - those here - which simply cannot be hidden and the user has to know about them.
└───liblfds700
    └───inc
        └───liblfds700
                lfds700_porting_abstraction_layer_atomic_compiler.h
                lfds700_porting_abstraction_layer_atomic_operating_system.h
                lfds700_porting_abstraction_layer_atomic_processor.h


== Compiling and Linking==
Each header file uses #ifdefs and compiler defined macros to select the appropriate porting abstraction layer implementation for the local platform.
The liblfds library compiles to either a static or shared library, to the pathname ''liblfds700/bin/liblfds700.[nnn]'', where the suffix of the library file varies by platform and by the type of library (static or dynamic).


==Library Initialization and Cleanup==
So, for example, in ''lfds700_porting_abstraction_layer_operating_system.h'', for MSVC, x86, Windows user-mode, we see the following, which is for MSVC on Windows kernel-mode.
The function ''[[r7.0.0:function lfds700_misc_library_init_valid_on_current_logical_core|lfds700_misc_library_init_valid_on_current_logical_core]]'' must be called before any other functions are called.  This function is called once, by one thread.


Other threads must call ''[[r7.0.0:define LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE|LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE]]'' for the library init to be valid for them. Threads are permitted to call this and only this macro, even though they have not themselves called ''lfds700_misc_library_init_valid_on_current_logical_core''.
#if( defined _MSC_VER && _MSC_VER >= 1400 && defined _WIN32 && defined _KERNEL_MODE )
  [snip porting abstraction layer implementation]
  #endif


The function ''[[r7.0.0:function lfds700_misc_library_cleanup|lfds700_misc_cleanup]]'' must be called when finished with the library.  Again, this is called once, by one thread.  Other threads have no work to do, to cleanup.
So, to add a new platform, introduce a new #ifdef, which matches the appropriate compiler defined macros for your platform.


==General Usage Style==
==The Header Files==
All allocations are performed by the user and passed into the library; as such there are no ''new'' and ''delete'' functions, but rather ''init'' and ''cleanup''.  Each data structure is represented by a ''state'' structure (i.e. ''lfds700_freelist_state''), and usually has also an ''element'' structure (i.e. ''lfds700_freelist_element'').  Each element carries both a key and value, the key being mandatory for the btree, hash and ordered list, and optional for all other data structures (being provided for convenience, in the case of passing data between different data structures and wishing to retain the key).  There are getter and setter macros for key and value, where the data structure functions themselves almost only ever deal with operations on the data structure elements themselves; the user sets or gets the keys and values from the elements, before or after a data structure function is called.
A porting abstraction layer is a small set of defines, macros and typedefs.  Each of the compiler, the operating system and the processor (for the processor, in the form of information about the processor provided by the user in #defines) turns out to be a source of information for a porting abstraction layer.


==Novel and Peculiar Issues==
In general then the porting abstraction layer is split up into these three files, so that everything in the porting layer which comes from any one source is in one place - i.e. everything which comes from the GCC compiler is in one section, for the GCC compiler, in the compiler header file.


===Memory Deallocation===
In almost all cases, across all the platforms supported so far - i.e. all the compilers, all the operating systems, all the processors - the same sets of information come from the same sources, i.e. the compiler provides the atomic instrincs, the processor provides cache-line lengths, etc.  However, this not quite a hard and fast rule.  There has so far been one exception - on Windows, with SSE2 or greater, processor barriers are a compiler instrinc (and so are in the compiler file), but prior to this, they were a function call provided by the operating system (and so are in the operating system file).
Any data structure element which has at any time been present in a lock-free data structure can never be passed to ''free'' until the data structure in question is no longer in use and has had its ''cleanup'' function called.


As such, typical usage is for data structure elements to be supplied from (and returned to) a lock-free freelist.
Basically, put things where they come from - and the current divison seems to work well for this end.


This is true for both the add-only and non-add-only data structures.
===''lfds700_porting_abstraction_layer_atomic_compiler.h''===
A compiler port should consist of everything which is a compiler intrinsic and this always or almost always consists of the following;


So, for example, if the user allocates a queue element and some user data and enqueues them, when that queue element is finally dequeued, it cannot be passed to ''free'' until the queue itself has been terminated and has had its ''cleanup'' function called.
#define [[r7.0.0:define LFDS700_PAL_COMPILER_STRING|LFDS700_PAL_COMPILER_STRING]]
#define [[r7.0.0:macro LFDS700_PAL_ALIGN|LFDS700_PAL_ALIGN]]( alignment )
#define [[r7.0.0:define LFDS700_PAL_INLINE|LFDS700_PAL_INLINE]]
#define [[r7.0.0:define LFDS700_PAL_BARRIER_COMPILER_LOAD|LFDS700_PAL_BARRIER_COMPILER_LOAD]]
#define [[r7.0.0:define LFDS700_PAL_BARRIER_COMPILER_STORE|LFDS700_PAL_BARRIER_COMPILER_STORE]]
#define [[r7.0.0:define LFDS700_PAL_BARRIER_COMPILER_FULL|LFDS700_PAL_BARRIER_COMPILER_FULL]]
#define [[r7.0.0:define LFDS700_PAL_BARRIER_PROCESSOR_LOAD|LFDS700_PAL_BARRIER_PROCESSOR_LOAD]]
#define [[r7.0.0:define LFDS700_PAL_BARRIER_PROCESSOR_STORE|LFDS700_PAL_BARRIER_PROCESSOR_STORE]]
#define [[r7.0.0:define LFDS700_PAL_BARRIER_PROCESSOR_FULL|LFDS700_PAL_BARRIER_PROCESSOR_FULL]]
#define [[r7.0.0:macro LFDS700_PAL_ATOMIC_CAS|LFDS700_PAL_ATOMIC_CAS]]( pointer_to_destination, pointer_to_compare, new_destination, cas_strength, result )
#define [[r7.0.0:macro LFDS700_PAL_ATOMIC_DWCAS|LFDS700_PAL_ATOMIC_DWCAS]]( pointer_to_destination, pointer_to_compare, pointer_to_new_destination, cas_strength, result )
#define [[r7.0.0:macro LFDS700_PAL_ATOMIC_EXCHANGE|LFDS700_PAL_ATOMIC_EXCHANGE]]( pointer_to_destination, pointer_to_exchange )


===Memory Allocation (and Shared Memory Support)===
===''lfds700_porting_abstraction_layer_atomic_operating_system.h''===
The ''liblfds'' library performs no memory allocation or deallocation.
An operating system port contains and #includes and implements the following;


The user is responsible for all allocation and all deallocation. As such, allocations can be from the heap or from the stack, or from user-mode or from the kernel; the library itself just uses what you give it, and doesn't know about and so does not differentiate between virtual or physical addressesAllocations can be shared memory, but note the virtual memory ranges must be the same in all processes - ''liblfds'' uses addresses directly, rather than using offsets. Being able to used shared memory is particularly important for Windows, which lacks a high performance cross-process lock; the data structures in ''liblfds'' when used with shared memory meet this need.
#define [[r7.0.0:define LFDS700_PAL_OS_STRING|LFDS700_PAL_OS_STRING]]
  #define [[r7.0.0:macro LFDS700_PAL_ASSERT|LFDS700_PAL_ASSERT]]( expression )


===Data Structure Initialization===
===''lfds700_porting_abstraction_layer_atomic_processor.h''===
Passing a data structure state to its ''init'' function initializes that state but that initialization is and is only valid for the logical core upon which it occurs.
A processor port contains user-provided information about the processor;


The macro ''[[r7.0.0:LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE|LFDS700_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE]]'' is used to make the initialization valid on other logical cores and it will make the initialization valid upon and only upon the logical core which calls the macro.
typedef [type] [[r7.0.0:typedef lfds700_pal_atom_t|lfds700_pal_atom_t]];
typedef [type] [[r7.0.0:typedef lfds700_pal_uint_t|lfds700_pal_uint_t]];
#define [[r7.0.0:define LFDS700_PAL_PROCESSOR_STRING|LFDS700_PAL_PROCESSOR_STRING]]
#define [[r7.0.0:define LFDS700_PAL_ALIGN_SINGLE_POINTER|LFDS700_PAL_ALIGN_SINGLE_POINTER]]
#define [[r7.0.0:define LFDS700_PAL_ALIGN_DOUBLE_POINTER|LFDS700_PAL_ALIGN_DOUBLE_POINTER]]
#define [[r7.0.0:define LFDS700_PAL_CACHE_LINE_LENGTH_IN_BYTES|LFDS700_PAL_CACHE_LINE_LENGTH_IN_BYTES]]
#define [[r7.0.0:define LFDS700_PAL_ATOMIC_ISOLATION_IN_BYTES|LFDS700_PAL_ATOMIC_ISOLATION_IN_BYTES]]


Expected use is that a thread will initialize a data structure, pass a pointer to its state to other threads, all of whom will then call ''LFDS700_LIBLFDS_MAKE_USABLE_TO_CURRENT_LOGICAL_CORE_INITS_PERFORMED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE''.
==Partial Implementatons==
It is not necessary to fully implement a porting abstraction layer; not everything in the porting abstration layer is mandatory, and each of the atomic operations has a dummy version provided, for the purpose of allowing the library to compile even if that atomic operation has not been implemented or is not available on the platform (as not all atomic operations are available on all processors).  Naturally, if a data structure depends on a given atomic operation and that atomic operation has not in fact been implemented, then you cannot use that data structure.


===Lock-Free Exponential Backoff===
      <table border="1" cellpadding="4">
Lock-free data structure operations generally are implemented as loops which attempt an operation, check to see if it succeeded or failed, and if it failed, try again.  This is because lock-free operations can collide; if two or more threads are attempting to perform the same operation concurrently (such as pushing to a stack), then the thread which succeeds will in fact disrupt the other thread's operation and cause it to fail - and so to need to repeat the operation in its loop.
        <tr>
          <th colspan="6">Atomic Operation Support by Processor</th>
        </tr>


This can lead to a problem known as ''livelock'', where so many threads are trying to perform a given operation that they make no or almost no forward progress, because they keep running into each other.
        <tr>
          <th>&nbsp;</th>
          <th>Memory<BR>Barriers</th>
          <th>CAS</th>
          <th>DWCAS</th>
          <th>Exchange</th>
        </tr>


A method which ameilorates this problem is exponential backoff.  When a thread attempts and fails to perform an operation, it performs a very brief busy-wait loop, before retrying.
        <tr align="center">
          <th align="left">ARM32/64, x86, x64</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
        </tr>


The duration of the loop is computed by randomly picking a value between 0 and a maximum value, where that maximum value starts at one and doubles every time the attempted operation fails - i.e. the maximum possible backoff (and so also the mean actual backoff, since the random value will on mean be half of the maximum) exponentially increases, where that value chosen is then multiplied by a constant, where the result is finally the actual number of iterations to perform.
        <tr align="center">
          <th align="left">Alpha, Itanium, MIPS32/64, PowerPC32/64, SPARC32/64</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
        </tr>
      </table>


There is one constant for CAS operations and another for DWCAS operations.  They are intended to reflect how long the CAS and DWCAS take to operate; so in fact when we randomly select the first number, between 0 and the current maximum value, we are conceptually selecting how many CAS or DWCAS operations to wait for.
The data structures have the following requirements;


The value of these constants can be get and set, at any time, by calling ''[[r7.0.0:function lfds700_misc_query|lfds700_misc_query]]''.  The ability to set them is only really useful when the system ''liblfds'' will run on is fixed, as the optimal values for these constants varies by processor model, clock speed and current system load.
      <table border="1" cellpadding="4">
        <tr>
          <th colspan="6">Data Structure by Atomic Operations</th>
        </tr>


====Randon Number Generation====
        <tr>
A key issue here is the need to be able to generate random numbers.  The standard library function ''rand'' is unsuitable in every way; it is not thread safe, it has a very poor algorithm and only has a guaranteeded maximum value of 2^15-1, the performance of the implementation is unknown and may very across implementations and it is anyway a function call (i.e. slow) - remember, we want to generate a random number in the middle of a lock-free operation loop; we want to be FAST.  The idea of a function call, with its attendent cache disruption and delay, is beyond the pale.
          <th>&nbsp;</th>
          <th>Memory<BR>Barriers</th>
          <th>CAS</th>
          <th>DWCAS</th>
          <th>Exchange</th>
        </tr>


Out of necessity then ''liblfds'' has implemented well-known fast and lightweight random number genertor (on 32-bit platforms, a 32-bit xorshift, on 64-bit platforms an xorshift64*) and uses this for timely random number generation.
        <tr align="center">
          <th align="left">Binary Tree (add-only, unbalanced)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
        </tr>


Each thread using ''liblfds'' is expected to initialize one instance of ''struct lfds700_liblfds_prng_state'' and it is this which is passed as the last argument to function calls which require the ability to produce random numbers for exponential backoff.
        <tr align="center">
          <th align="left">Freelist</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
        </tr>


The random number generator code is not thread safe; it is expected that there is one state per thread. This is also processor cache friendly.
        <tr align="center">
          <th align="left">Hash (add-only)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
</tr>


This leads to the question of the seed value for the random number generator.
        <tr align="center">
          <th align="left">List (add-only, ordered, singly-linked)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
</tr>


It is often the case that the ''time'' function is used to produce a seed. However, the ''liblfds'' library is written to compile on a bare C implementation, so ''time'' cannot be used. In fact, on a bare C implementation, there are no external sources of entropy. So what to do?  because if we use the same seed, all of the thread will progress through the same sequence of random numbers - probably it will be okay, but all the same, it would be nicer to do something better than this.
        <tr align="center">
          <th align="left">List (add-only, singly-linked, unordered)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
        </tr>


What has been done is to implement a second random number generator, a xorshift1024*. This is big (1024 bits of entropy, vs 32 bits for xorshift and 64 bits for xorshift64*), slow and high quality. There is a single fixed seed, ''[[r7.0.0:define LFDS700_MISC_PRNG_SEED|LFDS700_MISC_PRNG_SEED]]'', which was taken from an atmospheric noise based true random number generator, which is passed through murmurhash3, to initialize the 1024 bytes of state. When one of the per-thread random number generator states is initialized, it obtains its seed from the xorshift1024*.
        <tr align="center">
          <th align="left">Queue</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
        </tr>


===The Standard Library and ''liblfds_uint_t''===
        <tr align="center">
The ''liblfds'' library is intended for both 32 and 64 bit platforms. As such, there is a need for an unsigned type which is 32 bits long on a 32 bit platform, and 64 bits long on a 64 bit platform - but remember that the Standard Library is not used, so we can't turn to it for a solution (and also that for C89, there wasn't really such a type anyway - ''size_t'' did in fact behave in this way on Windows and Linux, but semantically ''size_t'' means something else, and so it is only co-incidentally behaving in this way).
          <th align="left">Queue (BSCSP)<sup>1</sup></th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
</tr>


As such, ''liblfds'' in the platform abstraction layer typedefs ''liblfds_uint_t''. This is set to be an unsigned integer which is the natural length for the platform, i.e. the length of the processor register.
        <tr align="center">
          <th align="left">Ringbuffer</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
</tr>


===Alignment and Padding of Atomic Variables===
        <tr align="center">
There are two approaches in hardware, in processors, to implementing atomic operations, and there are two issues relating to these methods.
          <th align="left">Stack</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
</tr>
      </table>


The first approach is ''compare-and-swap'' (CAS), the second approach is ''load-linked/store-conditional'' (LL/SC).
<p><small>1. Bounded (fixed maximum number of elements, specified at instantiation time), single consumer, single producer<br></small></p>


Atomic operation involve writing a variable to memory.  CAS implements atomicity by locking the cache-line containing the variable.  As such, it has no failure mode - it will stall if it cannot lock the cache-line.  (The ''compare'' will come back saying equal or not-equal - the data may not be the same - but the ''comparing operation'' itself cannot fail).  LL/SC works by loading the variable into a register, for further work, but watching the memory location it comes from until the variable is stored again; if in the meantime another write occurred to that memory location, the store aborts.  The compare was performed by the user in his normal code after the load, and assuming the user decided he did want to store, then upon attempting to do so, the processor aborts the store because it knows the variable has in the meantime changes; the compare returned equal or not-equal as may be, but the compare operation itself failed - was aborted by the processor.  The granularity of the 'watching' varies a great deal.  On some platforms, memory is conceptually divided up into pages (which are known as "exclusive reservation granules", and on ARM, for example, these range in size from 8 to 2048 bytes, depending on implementation) and if a write occurs to the page which contains the target varible, then the store fails.  On other platforms (MIPS, for example), there is one 'watcher' per processor, so if ANY other writes occur on that processor between the load and store, the store fails.  Good luck with that :-)
Which means you end up with the following;


The first issue, which applies to both approaches, is that all writes to memory actually write out the full cache-line the varible sits in (this is true for all processors; when you write to a variable, the cache line it sits in is loaded to memory, the variable is modified (which may be only one byte, for example) and then the entire cache line is written back out to memory).  This causes the cache-line to be valid for and only for the writing logical core (i.e. all other copies of this line in other cores caches are now invalid, because they're out of date), and to be invalidated on all other logical cores.  Generally speaking, any varible which is the target of atomic operations will be being written to regularly or relatively regularly by many logical cores - why else would atomic operations be being used?  A concern here then is collateral damage - that we may have other important varibles in that same cache line, and they are continually being (inadvertantly) invalidated by the atomic operations.  A solution to this is to align and pad all variables subject to atomic operations such that they have a cache-line to themselves.
      <table border="1" cellpadding="4">
        <tr>
          <th colspan="3">Data Structures by Processors</th>
        </tr>


The second issue, which applies only to LL/SC, is somewhat similar; where a write to a granule will break any in-progress atomic operations within that granule, then its best to ensure that a variable subject to atomic operations is aligned and padded such that they have a granule to themselves.
        <tr>
          <th>&nbsp;</th>
          <th>ARM32/64, x86, x64</th>
          <th>Alpha, Itanium, MIPS32/64, PowerPC32/64, SPARC32/64</th>
        </tr>


These two issues then come down to the same - that variables subject to atomic operations need alignment and padding, although what that alignment and padding should be varies by platform. On CAS platforms, it should be to the cache-line length. On LL/SC platforms with granules, it should be to the larger of the cache-line length and the granule length (granule lengths are always a power of 2 and always are exactly divided by cache-line lengths). On LL/SC platforms where there is only one watcher per logical core, it should be to the cache-line length.
        <tr align="center">
          <th align="left">Binary Tree (add-only, unbalanced)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
</tr>


The kicker though is this - on platforms where the granule size varies, the only way to set it correctly is for the user to configure it; the library can't find out the size. The porting abstraction layer is configured out-of-the-box to use the worst-case value, and that worst-case value can be LARGE. On ARM, it's 2048 bytes - and a btree element structure has for example THREE atomic variables, i.e. it is about 6kb in size for one element!
        <tr align="center">
          <th align="left">Freelist</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
</tr>


'''So the moral of the story is this - on such platforms, be sure to set ''LFDS700_PAL_ATOMIC_ISOLATION_IN_BYTES'' correctly, or your structures will be painfully large.'''
        <tr align="center">
          <th align="left">Hash (add-only)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
</tr>


The platforms which need to care about this are;
        <tr align="center">
          <th align="left">List (add-only, ordered, singly-linked)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
</tr>


* ARM
        <tr align="center">
* POWERPC
          <th align="left">List (add-only, singly-linked, unordered)</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
</tr>
 
        <tr align="center">
          <th align="left">Queue</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
</tr>
 
        <tr align="center">
          <th align="left">Queue (BSCSP)<sup>1</sup></th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
</tr>
 
        <tr align="center">
          <th align="left">Ringbuffer</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
</tr>
 
        <tr align="center">
          <th align="left">Stack</th>
          <td><img src="http://www.liblfds.org/pages/images/button_green.png"/></td>
          <td><img src="http://www.liblfds.org/pages/images/button_red.png"/></td>
</tr>
      </table>
 
<p><small>1. Bounded (fixed maximum number of elements, specified at instantiation time), single consumer, single producer</small></p>
 
So, for example, if you port to some new x86 based platform and implement only memory barriers, CAS and exchange, then you can use the hash, the list and the bounded, single consumer, single producer queue.  The library will compile fully, but if you try to use the other data structures, you will call a dummy version of one of the other atomic operations, where those dummy functions call assert (if it's been provided) and then try to write 0 to memory location 0.  You were warned :-)


==See Also==
==See Also==
* [[r7.0.0:Release_7.0.0_Documentation|Release 7.0.0 Documentation]]
* [[r7.0.0:Release_7.0.0_Documentation|Release 7.0.0 Documentation]]

Latest revision as of 15:17, 26 December 2015

Introduction

To minimize the work involved is making liblfds work on any given platform a porting abstraction layer has been written. Porting simply involves implementing the porting abstraction layer; the library will then compile and work. Implementation involves providing values to a small set defines, macros and typedefs. The porting abstraction layer is implemented as a small set of header files in the public header file directory, thus;

└───liblfds700
    └───inc
        └───liblfds700
                lfds700_porting_abstraction_layer_atomic_compiler.h
                lfds700_porting_abstraction_layer_atomic_operating_system.h
                lfds700_porting_abstraction_layer_atomic_processor.h

Each header file uses #ifdefs and compiler defined macros to select the appropriate porting abstraction layer implementation for the local platform.

So, for example, in lfds700_porting_abstraction_layer_operating_system.h, for MSVC, x86, Windows user-mode, we see the following, which is for MSVC on Windows kernel-mode.

#if( defined _MSC_VER && _MSC_VER >= 1400 && defined _WIN32 && defined _KERNEL_MODE )
  [snip porting abstraction layer implementation]
#endif

So, to add a new platform, introduce a new #ifdef, which matches the appropriate compiler defined macros for your platform.

The Header Files

A porting abstraction layer is a small set of defines, macros and typedefs. Each of the compiler, the operating system and the processor (for the processor, in the form of information about the processor provided by the user in #defines) turns out to be a source of information for a porting abstraction layer.

In general then the porting abstraction layer is split up into these three files, so that everything in the porting layer which comes from any one source is in one place - i.e. everything which comes from the GCC compiler is in one section, for the GCC compiler, in the compiler header file.

In almost all cases, across all the platforms supported so far - i.e. all the compilers, all the operating systems, all the processors - the same sets of information come from the same sources, i.e. the compiler provides the atomic instrincs, the processor provides cache-line lengths, etc. However, this not quite a hard and fast rule. There has so far been one exception - on Windows, with SSE2 or greater, processor barriers are a compiler instrinc (and so are in the compiler file), but prior to this, they were a function call provided by the operating system (and so are in the operating system file).

Basically, put things where they come from - and the current divison seems to work well for this end.

lfds700_porting_abstraction_layer_atomic_compiler.h

A compiler port should consist of everything which is a compiler intrinsic and this always or almost always consists of the following;

#define LFDS700_PAL_COMPILER_STRING
#define LFDS700_PAL_ALIGN( alignment )
#define LFDS700_PAL_INLINE
#define LFDS700_PAL_BARRIER_COMPILER_LOAD
#define LFDS700_PAL_BARRIER_COMPILER_STORE
#define LFDS700_PAL_BARRIER_COMPILER_FULL
#define LFDS700_PAL_BARRIER_PROCESSOR_LOAD
#define LFDS700_PAL_BARRIER_PROCESSOR_STORE
#define LFDS700_PAL_BARRIER_PROCESSOR_FULL
#define LFDS700_PAL_ATOMIC_CAS( pointer_to_destination, pointer_to_compare, new_destination, cas_strength, result )
#define LFDS700_PAL_ATOMIC_DWCAS( pointer_to_destination, pointer_to_compare, pointer_to_new_destination, cas_strength, result )
#define LFDS700_PAL_ATOMIC_EXCHANGE( pointer_to_destination, pointer_to_exchange )

lfds700_porting_abstraction_layer_atomic_operating_system.h

An operating system port contains and #includes and implements the following;

#define LFDS700_PAL_OS_STRING
#define LFDS700_PAL_ASSERT( expression )

lfds700_porting_abstraction_layer_atomic_processor.h

A processor port contains user-provided information about the processor;

typedef [type] lfds700_pal_atom_t;
typedef [type] lfds700_pal_uint_t;
#define LFDS700_PAL_PROCESSOR_STRING
#define LFDS700_PAL_ALIGN_SINGLE_POINTER
#define LFDS700_PAL_ALIGN_DOUBLE_POINTER
#define LFDS700_PAL_CACHE_LINE_LENGTH_IN_BYTES
#define LFDS700_PAL_ATOMIC_ISOLATION_IN_BYTES

Partial Implementatons

It is not necessary to fully implement a porting abstraction layer; not everything in the porting abstration layer is mandatory, and each of the atomic operations has a dummy version provided, for the purpose of allowing the library to compile even if that atomic operation has not been implemented or is not available on the platform (as not all atomic operations are available on all processors). Naturally, if a data structure depends on a given atomic operation and that atomic operation has not in fact been implemented, then you cannot use that data structure.

Atomic Operation Support by Processor
  Memory
Barriers
CAS DWCAS Exchange
ARM32/64, x86, x64
Alpha, Itanium, MIPS32/64, PowerPC32/64, SPARC32/64

The data structures have the following requirements;

Data Structure by Atomic Operations
  Memory
Barriers
CAS DWCAS Exchange
Binary Tree (add-only, unbalanced)
Freelist
Hash (add-only)
List (add-only, ordered, singly-linked)
List (add-only, singly-linked, unordered)
Queue
Queue (BSCSP)1
Ringbuffer
Stack

1. Bounded (fixed maximum number of elements, specified at instantiation time), single consumer, single producer

Which means you end up with the following;

Data Structures by Processors
  ARM32/64, x86, x64 Alpha, Itanium, MIPS32/64, PowerPC32/64, SPARC32/64
Binary Tree (add-only, unbalanced)
Freelist
Hash (add-only)
List (add-only, ordered, singly-linked)
List (add-only, singly-linked, unordered)
Queue
Queue (BSCSP)1
Ringbuffer
Stack

1. Bounded (fixed maximum number of elements, specified at instantiation time), single consumer, single producer

So, for example, if you port to some new x86 based platform and implement only memory barriers, CAS and exchange, then you can use the hash, the list and the bounded, single consumer, single producer queue. The library will compile fully, but if you try to use the other data structures, you will call a dummy version of one of the other atomic operations, where those dummy functions call assert (if it's been provided) and then try to write 0 to memory location 0. You were warned :-)

See Also