Shared Memory Queue, Adaptive pthread_mutex, and Dynamic Tracing

In one of my recent training classes, I was asked to demonstrate some practical uses of shared memory. My knee-jerk reply was that shared memory can be used for inter-process communication and message-passing. In fact, most IPC mechanisms are based on shared memory in their implementation. The question was whether it's worth the effort to build a message-passing interface on top of shared memory queues, or whether sockets or pipes could produce a better result in terms of performance with a minimal implementation effort. The basic requirement is that two processes pass arbitrary-sized messages to one another - each process waits for a message before sending a reply.

Well, your mileage may vary, but I decided to see how fast a naive queue implementation on top of shared memory can be. My original test was on Windows, but I ported it to Linux so that I could use dynamic tracing to see the impact of locking on the shared queue. The heart of the implementation is in the shmemq.c file. Here is the interface of the shared memory queue that we'll be using:

    shmemq_t* shmemq_new(char const* name, unsigned long max_count, unsigned int element_size);
    bool shmemq_try_enqueue(shmemq_t* self, void* element, int len);
    bool shmemq_try_dequeue(shmemq_t* self, void* element, int len);
    void shmemq_destroy(shmemq_t* self, int unlink);

Simply put, a shared queue is identified by its name. You can put elements of size element_size in it, up to a statically-defined maximum of max_count. If more than max_count elements are in the queue, you can't enqueue any more elements until someone dequeues them. The enqueue and dequeue functions are not blocking: they simply return false if the queue is full or empty, respectively.
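To give a feel for how this interface is meant to be used, here is a hypothetical ping-pong client built on top of it. The queue names, the capacity, and the two-queue (request/response) arrangement are my assumptions for illustration, not the original test program:

    /* Hypothetical ping-pong client for the interface above. */
    #include <stdbool.h>
    #include <string.h>
    #include "shmemq.h"   /* assumed header declaring the interface */

    enum { MSG_SIZE = 8192, QUEUE_CAPACITY = 64 };

    int main(void)
    {
        /* One queue per direction; names and capacity are illustrative. */
        shmemq_t* requests  = shmemq_new("/shmemq_req", QUEUE_CAPACITY, MSG_SIZE);
        shmemq_t* responses = shmemq_new("/shmemq_rsp", QUEUE_CAPACITY, MSG_SIZE);

        char msg[MSG_SIZE];
        memset(msg, 'x', sizeof(msg));

        for (int i = 0; i < 2000000; ++i) {
            /* The try_* calls never block, so a full or empty queue
               is handled by retrying until the other side catches up. */
            while (!shmemq_try_enqueue(requests, msg, MSG_SIZE))
                ;   /* queue full: retry  */
            while (!shmemq_try_dequeue(responses, msg, MSG_SIZE))
                ;   /* queue empty: retry */
        }

        shmemq_destroy(requests, 1);
        shmemq_destroy(responses, 1);
        return 0;
    }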
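For this to work across processes, the queue has to live in a named shared memory segment, and the mutex guarding it has to be initialized as process-shared. Here is a minimal sketch of that setup, assuming the usual POSIX shm_open/mmap approach and an adaptive mutex; this is my reconstruction, and the actual shmemq.c may differ in the details:

    #define _GNU_SOURCE           /* for PTHREAD_MUTEX_ADAPTIVE_NP */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Header placed at the start of the shared region; this layout
       is an assumption, not the actual contents of shmemq.c. */
    typedef struct {
        pthread_mutex_t lock;        /* coarse-grained queue lock      */
        unsigned long read_index;    /* next slot to dequeue from      */
        unsigned long write_index;   /* next free slot to enqueue into */
        char data[];                 /* max_count * element_size bytes */
    } shmemq_shared_t;

    static shmemq_shared_t* map_shared_queue(char const* name, size_t total_size)
    {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t)total_size) < 0) {
            close(fd);
            return NULL;
        }

        void* mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);                   /* the mapping survives the close */
        if (mem == MAP_FAILED)
            return NULL;

        shmemq_shared_t* q = mem;

        /* Real code must initialize the mutex exactly once, e.g. only
           in the process that created the segment; skipped here. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&q->lock, &attr);
        pthread_mutexattr_destroy(&attr);

        return q;
    }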
The internal implementation of the queue is based on a circular buffer. There are two indices: the write index and the read index. When enqueuing an element, it goes into the write index - unless the queue is full. When dequeuing an element, it is copied from the read index - unless the queue is empty. The enqueue and dequeue operations advance the respective index. In the following ASCII diagram, the elements marked with an x are waiting to be dequeued:

    [  |  |  | x| x| x| x|  |  |  ]
               ^           ^
              read        write

Here are the results of the ping-pong test - 2 million messages per run, with the message size going from 16 bytes up to 64KB:

    Total data passed between client and server: 30.52 MB, 2.00 million packets
    Total data passed between client and server: 244.14 MB, 2.00 million packets
    Total data passed between client and server: 1953.12 MB, 2.00 million packets
    Total data passed between client and server: 15625.00 MB, 2.00 million packets
    Total data passed between client and server: 125000.00 MB, 2.00 million packets

Well, it looks like small messages aren't very efficient - there are some queue management overheads that dominate the running time. When messages get bigger (and don't fit in any level of cache anymore), performance goes down. But still, take a look at the 8KB numbers. We transferred 15GB of data (2 million 8KB messages) in 3.2 seconds. And that's with a fairly naive queue implementation and coarse-grained locking.

The lock protecting the queue is a pthread mutex, and on Linux an adaptive mutex spins in user space for a bounded number of iterations before giving up and blocking in the kernel. How often is the mutex contended when a thread attempts to acquire it? Even if the mutex was contended, what's the typical spin count before getting it anyway - or giving up and making a system call? These are questions that are best answered with some form of dynamic tracing. Let's see if we can identify the location where pthread_mutex_lock gives up and stops spinning.

    $ perf probe -x /lib64/libpthread.so.0 -L 'pthread_mutex_lock'
    ...
          assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
    ...
    58      else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
    ...
    67            int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
    ...
    78          while (LLL_MUTEX_TRYLOCK (mutex) != 0);
    ...
    80          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
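To make sense of those line numbers, here is the surrounding logic, condensed from glibc's nptl/pthread_mutex_lock.c. The exact code varies between glibc versions, so treat this as a sketch rather than the literal code on your system:

    /* Condensed from glibc's adaptive-mutex path; not a standalone
       compilable unit, and details differ across glibc versions. */
    else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
                               == PTHREAD_MUTEX_ADAPTIVE_NP, 1))    /* 58 */
      {
        if (LLL_MUTEX_TRYLOCK (mutex) != 0)
          {
            int cnt = 0;
            /* The spin budget adapts to this mutex's past behavior. */
            int max_cnt = MIN (MAX_ADAPTIVE_COUNT,                  /* 67 */
                               mutex->__data.__spins * 2 + 10);
            do
              {
                if (cnt++ >= max_cnt)
                  {
                    /* Give up spinning: block in the kernel (futex). */
                    LLL_MUTEX_LOCK (mutex);
                    break;
                  }
              }
            while (LLL_MUTEX_TRYLOCK (mutex) != 0);                 /* 78 */

            /* Exponentially-smoothed update of the spin estimate. */
            mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;  /* 80 */
          }
      }

The give-up point is the LLL_MUTEX_LOCK call inside the loop, reached once the spin count exceeds max_cnt. With the relative line numbers from the listing, a probe could then be attached near that spot - something along the lines of perf probe -x /lib64/libpthread.so.0 'pthread_mutex_lock:78' - and its hit count compared with the total number of lock acquisitions.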