lfds700_misc_prng_state parameter passing

  1.

    Hi,
    I am glad that the seventh version of liblfds finally came out!
    However, I see that in addition to the need to allocate memory on the user side, there is another important change in the API:

    for most operations on the data structures you now need to pass a per-thread lfds700_misc_prng_state parameter, which can be problematic, because it can be quite difficult to pass this parameter down through deeply nested calls.

    Would it be a reasonable option to create a PRNG instance in each thread, and then save its address in that thread's local storage,
    using functions such as pthread_setspecific()/pthread_getspecific()? Or are there any simpler solutions?
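
    For example, something like this is what I have in mind (just a sketch - I'm assuming the header is liblfds700.h, that the per-thread init function is called lfds700_misc_prng_init(), and that the state may live on the heap):

        #include <stdlib.h>
        #include <pthread.h>
        #include "liblfds700.h"

        static pthread_key_t prng_key;

        /* once, before any threads are created */
        void prng_tls_setup(void)
        {
            pthread_key_create(&prng_key, free);    /* free() runs at thread exit */
        }

        /* once per thread, during thread init */
        void prng_tls_thread_init(void)
        {
            struct lfds700_misc_prng_state *ps = malloc(sizeof(*ps));
            lfds700_misc_prng_init(ps);
            pthread_setspecific(prng_key, ps);
        }

        /* anywhere a data structure operation needs the per-thread state */
        struct lfds700_misc_prng_state *prng_tls_get(void)
        {
            return pthread_getspecific(prng_key);
        }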

  2. admin

    16 Jan 2016 Administrator

    I looked closely at thread local storage when I came to implement the PRNG state - in fact, in its original form, it was simply a per-thread state; the user didn't know the per-thread state was in fact only a PRNG state.

    Per-thread state was MUCH easier for the user - in fact, it would make the PRNG state invisible to the user; the user would run a per-thread init function, just as he runs the main library init function, and it would all work. The extra PRNG state argument would go away. Lovely!

    However, in fact, not lovely.

    The first problem is that thread local storage is most definitely NOT widely supported. In fact, GCC only supports this on a couple of architectures. There has to be a lot of compiler and linker support.

    So I couldn't implement it in the library, because it would mean the library would only work on those few platforms. You however ARE free to use it!

    The second problem is that performance can be poor - there are a number of different ways in which support ends up being implemented, depending in part on compiler options and in part on architecture. Given the key use of the state is *in the tight, fast lock-free loop*, being used for back-off, anything *slow* - forming a significant or even dominating factor in the delay length - was just out of the question.

    So if you do go the TLS route, look very carefully at your compiler options and make sure you understand what kind of TLS you're getting. The quickest TLS, as I recall, adds a single pointer lookup. The slowest is something awful which I can't even remember. Even a single pointer lookup could be doubling the intended backoff delay, especially for the first delay, which is always very short. The backoff code is written to the fullest extent possible to avoid memory accesses, because they are so slow, and the duration of the backoff period has a large effect on performance on highly loaded systems - but I have only benchmarked highly loaded systems (e.g. one test thread per logical core, each in a tight loop doing say a push and then a pop).

    I'm working on the benchmark application now. I'll start thinking about whether I can get a variant in there which uses TLS for the PRNG, so I can get an idea of what kind of impact it produces.

    On the whole, if your system is not heavily loaded, I would say you'll be fine to use TLS.

    There may be one other approach which might help you. The PRNG state is single-threaded. There is in fact no reason not to have any number of them being used by a single thread - the only consideration is that initialization of a PRNG state is slow, so only do it during thread init. Would it help you at all to have more than one of these states per thread?
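
    E.g. something like this (a sketch - the struct and field names are purely for illustration, and I'm again writing lfds700_misc_prng_init() for the per-state init):

        struct thread_state {
            struct lfds700_misc_prng_state prng_for_subsystem_a;
            struct lfds700_misc_prng_state prng_for_subsystem_b;
        };

        /* both inits are slow, but happen only once, at thread startup */
        void thread_init(struct thread_state *ts)
        {
            lfds700_misc_prng_init(&ts->prng_for_subsystem_a);
            lfds700_misc_prng_init(&ts->prng_for_subsystem_b);
        }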

    I also am glad 7.0.0 is finally out :-)

  3. Thank you for the clarification!

    At the moment I cannot figure out which method of storing the PRNG state
    would be optimal and, in my case, create the least overhead for the entire system.

    The architecture of my system is approximately as follows: there are as many threads as processor cores.
    One thread reads from the queue - and all the others write into it.

    All threads are created at once, so I have no difficulty initializing the PRNG state at the beginning,
    but it's hard to pass that state around, because the functions used in the writing threads have different nesting levels. Moreover, threads exchange their computing contexts (stack + processor registers), so for me the use of TLS seems the only option.

    Maybe if I write a couple of tests I can measure my problems/difficulties.

  4. admin

    21 Jan 2016 Administrator

    Can you explain the situation more, with the difficulty in passing the PRNG state around? I haven't yet understood exactly what the problem is, but that it exists concerns me - I may have missed a use case scenario or scenarios where the PRNG state is problematic, which makes the library problematic, and this may mean I need to redesign the APIs.

    So, the overall picture obviously is clear - many writer threads, one reader. Each can easily start with its own PRNG state. The writers though - "different nesting levels". What do you mean by this? Does it mean the PRNG state would end up having to be passed around a lot, down through many function calls, until finally you get to the point where you call a queue function?

    You know, writing about all this, I'm now thinking about callbacks. What if users want to call (say) queue functions in a callback? It's too expensive to init a PRNG state inside every call to the callback - they'd have to init them earlier and then use the queue user state pointer to point to them; and then, in the callback, know which thread is calling the callback and get that info from the queue user state, which is pointing to the PRNG states.

    Problem is, what data structure are we using to store the PRNG states? Either we have to look up a PRNG state from an OS thread ID, or we use an array and somehow pass a per-thread index into the callback (which is exactly the same as just passing in the PRNG state in the first place).
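
    I.e. something like this (a sketch - the callback signature, the thread_index parameter and NUMBER_OF_THREADS are invented for illustration):

        #define NUMBER_OF_THREADS 4

        /* the queue user state points at an array of PRNG states */
        struct prng_set {
            struct lfds700_misc_prng_state states[NUMBER_OF_THREADS];
        };

        void user_callback(void *queue_user_state, int thread_index, void *element)
        {
            struct prng_set *set = queue_user_state;
            struct lfds700_misc_prng_state *ps = &set->states[thread_index];
            /* ...use ps for any queue operations made inside the callback... */
        }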

  5. It is quite hard to describe the architecture of my system, because I (sometimes) don't fully understand it either :(
    I will try to do it using pseudo-code.
    Let's imagine there is an object my_obj which contains a pointer to the queue:

        struct my_obj {
            // other fields...
            liblfds_queue *queue;
        };

    The thread that reads from the queue works on a simple principle: it always waits for new messages in the queue and, if they are absent for a long time, it goes to sleep, waiting on a semaphore. Let's call it Thread_0 here.

    Its code:

        while (keep_running) {    // 'continue' is a C keyword, so a flag is used instead
            if (queue_dequeue(queue) == 1) {
                // processing
            }
            else {
                // no data - wait on the semaphore
                dwWaitResult = Semaphore_Wait(semaphore);
                switch (dwWaitResult) {
                case SEMAPHORE_POSTED:
                    if (queue_dequeue(queue) == 1) {
                        // processing
                    }
                    break;
                case SEMAPHORE_TIMEOUT:
                    // the semaphore was nonsignaled, so a time-out occurred
                    break;
                }
            }
        }

    Writing threads work somewhat differently; they truly are similar to callbacks, however, they are closer to co-routines (I use libconcurrent: https://code.google.com/p/libconcurrent ).

    At the start, as I told you, I create as many threads as there are cores. For example: if there are four cores, then besides Thread_0 there also exist Thread_1 ... Thread_3.

    Furthermore, the user gives me a pointer to one of his functions. I call that function in one of my own threads, giving it my_obj as a parameter. The user, in turn, calls my function (my_func). Its prototype is something like this:

    	int my_func(my_obj *obj, void *user_data);

    My function has to place the user's data into the queue. Accordingly, inside my_func I have to get the pointer to the queue and execute queue_enqueue() (as it was in version 6 of liblfds).

        int my_func(my_obj *obj, void *user_data) {
            liblfds_queue *queue = obj->queue;
            queue_enqueue(queue, user_data);
            return 0;    // pseudocode: report success
        }

    In addition, evidently, the user can call my_func in any of the remaining three threads. And in version 7 I will have to get the PRNG state from somewhere, thread-specifically. So that's how my architecture looks, mostly.

    In reality, the situation is even worse, because inside my_func the user context can be interrupted, saved, and even postponed. So, for example: Thread_1 executes the user's function user_func, which calls my_func. my_func decides to interrupt user_func and saves its context into user_context_1.

    After some time, another thread (Thread_2) restores the user's context and continues the execution of my_func, which then returns control to user_func. Graphically, it looks something like this:

        Moment 0:
        thread_1:
          user_func() {
            my_func() {
              stop, save ---> user_context_1


        Moment 1:
        thread_2:
              restore, start <--- user_context_1
              return from my_func()
            }
            return from user_func()
          }

    As you see, everything is quite convoluted and complicated, so I don't think you should waste your time dealing with my architectural complications.

    At this moment, I see thread-local storage as a nice solution for me. Another solution is to use an array, as you described in your last post, comparing my current thread ID with the ones saved in this array (inside a loop).
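
    Roughly like so (a sketch, assuming pthreads; lfds700_misc_prng_init() is again my guess at the init function's name):

        #define MAX_THREADS 4

        struct prng_slot {
            pthread_t tid;
            struct lfds700_misc_prng_state ps;
        };

        static struct prng_slot slots[MAX_THREADS];

        /* each thread registers itself once, at startup */
        void register_thread(int index)
        {
            slots[index].tid = pthread_self();
            lfds700_misc_prng_init(&slots[index].ps);
        }

        /* linear scan - must be called at the point of use, i.e. after any
           context migration, so it always finds the current thread's state */
        struct lfds700_misc_prng_state *lookup_prng(void)
        {
            pthread_t self = pthread_self();
            for (int i = 0; i < MAX_THREADS; i++)
                if (pthread_equal(slots[i].tid, self))
                    return &slots[i].ps;
            return NULL;
        }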

  6. admin

    2 Feb 2016 Administrator

    So - first, sorry my reply has taken so long; I've been working all the time on the benchmark app (see the blog for posts on the results) because that's needed to find out how much impact the PRNG state has on backoff - and indeed, how important backoff is - and to benchmark TLS.

    Main result so far - backoff really matters; it makes a big difference to performance.

    I'm about to perform the experiment where I replace the PRNG-based ethernet-like backoff with a fixed back-off period, and see what difference that makes.

    I think I have some understanding now of your architecture. The first and more superficial issue is that where the user threads are calling the callback you gave them, the only way they can have a PRNG state is if you hand it to them along with my_obj, which is a pain - the overall design is having to accommodate one particular library.

    The second and more serious issue is that it can in theory be *any* thread which actually, in the end, executes the callback given to the user - so how can you know you're using the correct PRNG state anyway? Even if you did pass in a PRNG state, it could end up being for the wrong thread.

    Certainly, I did not envision this class of arrangement when I was implementing! =-)

    Thinking it all over now.

  7. admin

    2 Feb 2016 Administrator

    It looks like I am going to remove the PRNG state. Back-off is vital, but the random selection of the back-off duration simply seems to make no difference to the benchmark results.

  8. Hello,
    firstly, I apologize for the long silence - I was distracted by another project.
    Thank you for finding the time and energy to understand my architecture.
    And also, removing the PRNG state will be really helpful to me :D

  9. admin

    20 Apr 2016 Administrator

    I'm almost ready to release 7.1.0 (i.e. no per-thread state).

    Hopefully one week.

 
