Difference between pages "r7.1.1:Usage Guide (liblfds)" and "r7.1.1:Usage Guide (libshared)"

{{DISPLAYTITLE:Usage Guide (liblfds)}}
==Introduction==
This page describes how to use the ''liblfds'' library, and then covers the novel and peculiar issues which originate in the lock-free nature of the data structures in the library.


Wherever possible such issues have been hidden from the user, but there are some which simply cannot be hidden and as such the user has to be aware of them.
 
==Library Initialization and Cleanup==
No library initialization or cleanup is required.


==Usage==
To use ''liblfds'', include the header file ''liblfds711.h'' and link as normal to the library in your build.
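 
A minimal sketch; the header name is as given above, while the link flag assumes the library has been built as ''liblfds711'' and that the header and library are on the compiler's search paths, so adjust both to match your build;

 #include "liblfds711.h"
 
 int main( void )
 {
   /* liblfds711 types, functions and macros are now available to this translation unit */
   return( 0 );
 }

Compiling and linking might then look like;

 gcc -o example example.c -llfds711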
 
==Novel and Peculiar Issues==
 
===Memory Allocation===
The ''liblfds'' library performs no memory allocation or deallocation.  Accordingly, there are no ''new'' and ''delete'' functions, but rather ''init'' and ''cleanup''.
 
The user is responsible for all allocation and all deallocation.  As such, allocations can be from the heap or from the stack, or from user-mode or from the kernel; the library itself just uses what you give it, and doesn't know about, and so does not differentiate between, virtual and physical addresses.  Allocations can be shared memory, but note the virtual memory ranges must be the same in all processes - ''liblfds'' uses addresses directly, rather than using offsets.  Being able to use shared memory is particularly important for Windows, which lacks a high performance cross-process lock; the data structures in ''liblfds'', when used with shared memory, provide a process and thread safe cross-process communication channel (but they do not provide synchronization, so the reader cannot be signalled, by the library, as to ''when'' to read).
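 
An illustrative sketch, using the ''lfds711_stack'' API, with all store provided by the caller; the function names and signatures here are as the author understands the r7.1.1 stack API - consult ''lfds711_stack.h'' for the definitive versions;

 #include "liblfds711.h"
 
 #define NUMBER_ELEMENTS 10
 
 void example( void )
 {
   /* the caller provides every byte of store - here automatic (function-local)
      variables, but heap, static or shared memory store works equally well */
   struct lfds711_stack_state ss;
   struct lfds711_stack_element se[NUMBER_ELEMENTS];
   lfds711_pal_uint_t loop;
 
   lfds711_stack_init_valid_on_current_logical_core( &ss, NULL );
 
   for( loop = 0 ; loop < NUMBER_ELEMENTS ; loop++ )
     lfds711_stack_push( &ss, &se[loop] );
 
   /* ... the stack is now in use; other threads must issue the init-completion
      macro described below before operating on it ... */
 
   lfds711_stack_cleanup( &ss, NULL );
 
   /* only now, after cleanup, may ss and se be reused, freed or go out of scope */
 }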
 
===Memory Deallocation===
Any data structure element which has at any time been present in a lock-free data structure must not be passed to ''free'' until the data structure in question is no longer in use and has had its ''cleanup'' function called.
 
As such, typical usage is for data structure elements to be supplied from (and returned to) a lock-free freelist.
 
There is a single exception to this, which is the unbounded, many producer, many consumer queue.  It ''is'' safe to deallocate elements which have emerged from this data structure.
 
===Data Structure Initialization===
Passing a data structure state to its ''init'' function initializes that state, but that initialization is valid only for the logical core upon which it occurs.
 
The macro ''[[r7.1.1:LFDS711_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE|LFDS711_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE]]'' is used to make the initialization valid on other logical cores and it will make the initialization valid upon and only upon the logical core which calls the macro.
 
Expected use is that a thread will initialize a data structure and pass a pointer to its state to other threads, each of which will then call ''LFDS711_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE''.
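 
A sketch of that expected use, from the point of view of a receiving thread (the thread creation mechanism and the particular data structure are incidental - the stack is used again here for illustration);

 #include "liblfds711.h"
 
 /* thread start function; user_state points at a stack state which was
    initialized by another thread, on another logical core */
 void thread_function( void *user_state )
 {
   struct lfds711_stack_state *ss = (struct lfds711_stack_state *) user_state;
 
   /* make initializations performed before now on other logical cores valid on this one */
   LFDS711_MISC_MAKE_VALID_ON_CURRENT_LOGICAL_CORE_INITS_COMPLETED_BEFORE_NOW_ON_ANY_OTHER_LOGICAL_CORE;
 
   /* the stack pointed to by ss can now safely be operated upon from this thread */
 }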
 
===The Standard Library and ''lfds711_pal_uint_t''===
The ''liblfds'' library is intended for both 32 and 64 bit platforms.  As such, there is a need for an unsigned type which is 32 bits long on a 32 bit platform, and 64 bits long on a 64 bit platform - but remember that the Standard Library is not used, so we can't turn to it for a solution (and also that for C89, there wasn't really such a type anyway - ''size_t'' did in fact behave in this way on Windows and Linux, but semantically ''size_t'' means something else, and so it is only co-incidentally behaving in this way).
 
As such, ''liblfds'' in the platform abstraction layer typedefs ''lfds711_pal_uint_t'' (and a signed equivalent, ''lfds711_pal_int_t'').  This is set to be an unsigned integer which is the natural length for the platform, i.e. the length of the processor register, 32 bits on a 32 bit CPU and 64 bits on a 64 bit CPU.
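 
Illustratively - the actual typedefs live in the porting abstraction layer headers and vary by compiler and platform - on a typical 64 bit GCC target they come to something like;

 typedef unsigned long long int  lfds711_pal_uint_t;  /* 64 bits on a 64 bit platform */
 typedef          long long int  lfds711_pal_int_t;

with 32 bit targets using a 32 bit integer type instead.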
 
===Exclusive Reservation Granule (ARM, POWERPC)===
On ARM and POWERPC there is a define in the ''liblfds'' header file ''lfds711_porting_abstraction_layer_processor.h'', ''LFDS711_PAL_ATOMIC_ISOLATION_IN_BYTES'', which '''SHOULD BE SET CORRECTLY''', as the value in the shipped header file is necessarily the worst-case (longest) value, which in the case of ARM is 2048 bytes - which has the effect of making many of the data structure states '''HUGE'''.
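 
For example, for a core known to have a 64 byte ERG, the define in ''lfds711_porting_abstraction_layer_processor.h'' would be set like so (the 64 is illustrative - use the value for the processor actually in use);

 #define LFDS711_PAL_ATOMIC_ISOLATION_IN_BYTES 64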
 
There are two approaches in hardware, in processors, to implementing atomic operations.  The first approach is ''compare-and-swap'' (CAS), the second approach is ''load-linked/store-conditional'' (LL/SC).
 
Atomic operations involve writing a variable to memory.  CAS implements atomicity by locking the cache-line containing the variable.  LL/SC implements atomicity by loading the variable into a register, where anything can be done with it, while watching the memory location it came from until the variable is stored again; if in the meantime another write occurred to that memory location, the store aborts.
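 
The retry-loop shape this produces can be seen in ordinary C by way of the GCC/Clang atomic builtins (these builtins are not liblfds code - on ARM and POWERPC the compiler implements them with an LL/SC pair, LDREX/STREX and lwarx/stwcx. respectively);

 #include <stdint.h>
 
 void atomic_increment( uint32_t *target )
 {
   uint32_t expected, desired;
 
   do
   {
     expected = __atomic_load_n( target, __ATOMIC_RELAXED );
     desired = expected + 1;
   }
   while( !__atomic_compare_exchange_n(target, &expected, desired, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED) );
   /* the exchange fails and the loop retries if another write hit the location
      (or, under LL/SC, its granule) in the meantime */
 }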
 
The granularity of the 'watching' varies a great deal.  On some platforms, such as MIPS, only one 'watcher' is available per logical processor.  On other platforms, such as ARM and POWERPC, memory is conceptually divided up into pages (which are known as "Exclusive Reservation Granules", "ERG" for short) and if a write occurs to the page which contains the target variable, then the store fails.
 
On ARM the Exclusive Reservation Granule ranges in size from 8 to 2048 bytes (always a power of two - 8, 16, 32, 64, etc), depending on implementation.
 
Obviously, for ''liblfds'' to always work, the shipped header file has to use the 2048 byte value - but since in all the ''liblfds'' structures any variable which is the target of atomic operations has to be in its own granule, naturally the larger the granule size, the larger the structure.  Some structures have a number of variables which are the target of atomic operations, and so those structures can become ''very'' large.
 
This then leads to the question of finding out or determining the ERG length.


The ERG length can be obtained from the processor - it is stored in a register - but on ARM this is only possible when the processor is in supervisor mode, so ''liblfds'' cannot access this information.  The documentation for any given system should, somewhere deeply buried, indicate the ERG length.


There is however another way.  The ''libtest'' library offers a function, ''libtest_misc_determine_erg'', which attempts to empirically determine the ERG length: it runs an LL operation on one logical core, has every other logical core touch memory just inside the largest possible ERG size, and then tries the SC operation, repeating this with progressively smaller ERG sizes until the SC operations begin to fail; the smallest size at which they still succeed is the ERG length.


This function can only work on systems which have more than one ''physical'' processor (multiple logical processors in one physical processor is not enough).  This is because ARM implements per-processor 'local' watchers, which are typically much more relaxed than the 'global' (system-wide, i.e. multiple physical processor) watchers, and which normally do not fail the store even if a write occurs inside the ERG - i.e. with the local watcher only, it's not possible to make the LL/SC fail, so the code cannot work out the length.


The ''test'' binary has an argument, "-e", which runs a test using this function, like so;


  test -e


The output will look like this (plus a bunch of fixed explanatory text);


  ERG length in bytes : Number successful LL/SC ops
  =================================================
  4 bytes : 0
  8 bytes : 0
  16 bytes : 0
  32 bytes : 0
  64 bytes : 1024
  128 bytes : 1023
  256 bytes : 1023
  512 bytes : 1024
  1024 bytes : 1024
  2048 bytes : 12


The ERG size is on the left, the number of successful LL/SC ops on the right.  Each size is tested 1024 times.  The smallest size with 1024, or almost 1024, successful operations is the ERG size.  (LL/SC ops can fail naturally due to system activity, so it's expected that sometimes one or two LL/SC operations will fail by themselves and so the total value will be slightly lower).


We see here the ERG size for this platform is 64 bytes, which is correct - this is a Cortex A7 in a Raspberry Pi 2 Model B.


(That there are 12 successful ops for the 2048 byte size is not in fact understood.)  There is a 4 byte ERG test as a sanity check; the smallest possible ERG is 8 bytes, so if the 4 byte test passes, it is clear the test is not working properly.


==See Also==
* [[r7.1.1:Release_7.1.1_Documentation|Release 7.1.1 Documentation]]
* [[r7.1.1:Usage Guide (benchmarking)|Usage Guide (benchmarking)]]
* [[r7.1.1:Usage Guide (testing)|Usage Guide (testing)]]

Usage Guide (libshared) (latest revision as of 20:16, 17 February 2017)

Introduction

This page describes how to use the libshared library.

The library implements a great deal of functionality, almost all of which is used and only used by other liblfds components. From the point of view of an external caller to its API, there is in fact only one API, which handles user-allocated memory.

Usage

To use libshared, include the header file libshared.h and link as normal to the library in your build.

Dependencies

The libshared library depends on the liblfds711 library.

Source Files

└── test_and_benchmark
    └── libshared
        ├── inc
        │   ├── libshared
        │   │   └── libshared_memory.h
        └── src
            └── libshared_memory
                ├── libshared_memory_add.c
                ├── libshared_memory_cleanup.c
                └── libshared_memory_init.c

This is a small subset of the full set of files, and shows only those files used by the publicly exposed APIs.

Opaque Structures

struct libshared_memory_state;

Prototypes

void libshared_memory_init( struct libshared_memory_state *ms );
void libshared_memory_cleanup( struct libshared_memory_state *ms,
                               void (*memory_cleanup_callback)(enum flag known_numa_node_flag,
                                                               void *store,
                                                               lfds711_pal_uint_t size) );

void libshared_memory_add_memory( struct libshared_memory_state *ms,
                                  void *memory,
                                  lfds711_pal_uint_t memory_size_in_bytes );
void libshared_memory_add_memory_from_numa_node( struct libshared_memory_state *ms,
                                                 lfds711_pal_uint_t numa_node_id,
                                                 void *memory,
                                                 lfds711_pal_uint_t memory_size_in_bytes );

Overview

All liblfds libraries are written such that they perform no memory allocations. This is straightforward for liblfds, where the user passes in state structures and so on, but it is problematic for libtest and libbenchmark as the store they require varies depending on the number of logical cores in the system, where that number cannot be known in advance, and where the work being done is complex enough that it is impractical to require the user to pass in the required store to functions - rather, a generic method is needed, where the libraries can in effect perform dynamic memory allocation.

This is the purpose of the libshared_memory API. The caller of libtest or libbenchmark functionality initializes a struct libshared_memory_state, performs some memory allocation by whatever means are available, and adds the pointer to that memory and its size in bytes to the memory state. This memory state is then passed into those functions in libtest or libbenchmark which require it, and they in turn draw upon the memory so provided for dynamic allocations.
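
A sketch of that call sequence on a non-NUMA system follows; the one megabyte figure is arbitrary and purely illustrative, and error handling is omitted.

#include <stdlib.h>
#include "libshared.h"

static void cleanup_callback( enum flag known_numa_node_flag, void *store, lfds711_pal_uint_t size )
{
  /* every block added to the memory state is handed back here during cleanup */
  free( store );
}

int main( void )
{
  struct libshared_memory_state ms;
  void *store;

  libshared_memory_init( &ms );

  store = malloc( 1024 * 1024 );
  libshared_memory_add_memory( &ms, store, 1024 * 1024 );

  /* ... call the libtest or libbenchmark functions which take &ms ... */

  libshared_memory_cleanup( &ms, cleanup_callback );

  return EXIT_SUCCESS;
}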

The libtest library is not currently NUMA aware - it simply runs one thread per logical core and allocates everything from the allocation with the most free space at the time of the allocation request. The libbenchmark library is NUMA aware and on NUMA systems in fact requires an allocation from every NUMA node in the system.

On SMP systems, or on NUMA systems where a non-NUMA aware allocator is used (e.g. malloc rather than, say, numa_alloc_onnode), memory is added by the libshared_memory_add_memory function. On NUMA systems, memory is added with the libshared_memory_add_memory_from_numa_node function. Any number of allocations from any number of nodes (or from non-NUMA aware allocators) can be provided, although there's no obvious use case for this, since normal usage is to initialize and allocate per-NUMA node and then call a libtest or libbenchmark function.
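
On Linux with libnuma, for example, the per-node allocations might be made and added like so (a sketch; the sixteen megabyte figure is arbitrary, the numa_available() check and error handling are omitted, and the binary must be linked with -lnuma);

#include <numa.h>
#include "libshared.h"

void add_per_numa_node_memory( struct libshared_memory_state *ms )
{
  int node;

  for( node = 0 ; node <= numa_max_node() ; node++ )
    libshared_memory_add_memory_from_numa_node( ms, (lfds711_pal_uint_t) node,
                                                numa_alloc_onnode(16*1024*1024, node),
                                                16*1024*1024 );
}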

The libbenchmark_topology API offers an iterator API, which permits easy iteration over the NUMA nodes in a system, saving the caller the trouble of having to enumerate the processor/memory topology. Note that initializing a topology state requires an initialized and populated memory state; however, this state is not NUMA sensitive, and so it can be allocated using malloc and then, once obtained, a second memory state can be populated with per-NUMA node allocations.

See Also