Multicore Data Structures Workshop - Priority Queue

By:

Omer Levy - omerlevy88@gmail.com  

Roi Tagar - roitagar@mail.tau.ac.il  

Adam Elimelech - adam.elimelech@gmail.com  

Tel Aviv University, Workshop in Computer Science, 2015

Instructors:

Prof. Nir Shavit

Moshe Sulamy

Abstract

In this project we implemented, developed and compared priority queues built with different approaches, in order to increase the parallelism and performance of multicore execution.

We've implemented the structures and the testbenches in Java and C++, using some external packages to support the multithreading features we needed.

The data structure we've tested is a scalable, relaxed priority queue called the Spray List.

You can find more information about this data structure here:

http://research.microsoft.com/apps/pubs/default.aspx?id=209108

At the core of the SprayList lies a SkipList. Our implementations are mostly based on different SkipList implementations from the book "The Art of Multiprocessor Programming" by Maurice Herlihy & Nir Shavit.

Introduction

The implementations we developed in Java:

  1. Java Priority Blocking Queue - the Java PriorityBlockingQueue from the java.util.concurrent package.
  2. Naive Lock Native Priority Queue - uses a global ReentrantLock that wraps Java's non-concurrent, heap-based priority queue (java.util.PriorityQueue).
  3. Global Lock Spray List Priority Queue - uses a global ReentrantLock that wraps a sequential Spray List priority queue (implemented in the class SeqSprayListPriorityQueue).
  4. Lazy Lock Spray List Priority Queue - a concurrent fine-grained-lock lazy implementation, based on the lazy skiplist from the book.
  5. Lock Free Spray List Priority Queue - a concurrent lock-free implementation, based on the lock-free skiplist from the book.
  6. TM Spray List Priority Queue - a software-transactional-memory-based implementation. We used the Deuce STM package.
    This implementation wraps the sequential implementation with transactions.
  7. Cool Spray List Priority Queue - our own implementation of the Spray List, explained later.
  8. Optimistic Cool Spray List Priority Queue - an improvement over the Cool Spray List Priority Queue.

Our two new implementations (Cool and Optimistic Cool) use read-write locks. We've tested both fair locks and unfair locks.

All SprayList implementations use a shared atomic counter of the number of active threads, used to determine the spray parameters, as stated in the SprayList specification.

The implementations we developed in C++:

  1. Global Lock Spray List Priority Queue
  2. Lazy Lock Spray List Priority Queue (works correctly, but with performance issues - see below)
  3. Cool Spray List Priority Queue (both fair and unfair)

Modifications to skiplist implementations from the book

We added "deleteMin" and "spray" methods to every skiplist implementation, as well as an atomic thread counter used to calculate the spray height/length.

deleteMin calls spray, then calls remove with the value it received, as sketched below.
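
A minimal sketch of that pattern (spray() and remove() are the skiplist's own methods; EMPTY is an illustrative "nothing found" sentinel, not our exact code):

```java
// Sketch only: keep spraying until a sprayed key is successfully removed.
public int deleteMin() {
    while (true) {
        int candidate = spray();       // random walk that lands on a low key
        if (candidate == EMPTY)        // EMPTY: illustrative sentinel
            return Integer.MAX_VALUE;  // queue looks empty
        if (remove(candidate))         // may fail if another thread removed it
            return candidate;          // otherwise retry with a fresh spray
    }
}
```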

LockFree skiplist additional changes

We noticed a repeating pattern where the benchmark would seem to be stuck; when debugging we found nodes that existed in some levels of the list, but not in the bottom level.

This issue is critical for the spray list priority queue: these ghost nodes are sometimes reachable by the "spray" method yet invisible to the "find" method, causing a high number of retries in "deleteMin".

We took two measures to reduce this negative effect (sketched after the list):

  1. An inserter thread resets the new node's successor to be its predecessor's next on that level, instead of keeping the successor from the first call to "find".
  2. An inserter thread that discovers its node is being removed before the insertion completes immediately stops connecting the node to upper levels of the list.

These changes noticeably reduced benchmark runtime.
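
A hedged sketch of how these two measures could look inside the upper-level linking loop of the book-style lock-free add(). The field and helper names (next as AtomicMarkableReference arrays, find() filling preds/succs) follow the book's conventions and are assumptions, not our verbatim code:

```java
// Sketch: the bottom level is assumed to be linked already.
for (int level = 1; level <= topLevel; level++) {
    while (true) {
        // Measure 2: if our node was already marked for removal, stop
        // connecting it to upper levels (this is what creates ghost nodes).
        if (newNode.next[0].isMarked())
            return true;

        Node pred = preds[level];
        Node succ = succs[level];
        // Measure 1: reset the new node's successor to the predecessor's
        // current next on this level, instead of reusing the stale successor
        // remembered from the first call to find().
        newNode.next[level].set(succ, false);
        if (pred.next[level].compareAndSet(succ, newNode, false, false))
            break;                        // linked at this level
        find(newNode.key, preds, succs);  // neighborhood changed: retry
    }
}
```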

An example of the lock-free ghost nodes problem is linked in the references.

The Cool Spray List Priority Queue  

A Bird's-Eye View

The Cool Spray List Priority Queue is our novel implementation for the Spray List.

The motivation was the observation that the main problem of all the concurrent structures above is the contention on the first elements of the SkipList, especially when an element is being deleted.
For example, the lazy lock implementation has to lock the victim's neighbours on all of its levels. If one or more of the node's predecessors or successors was deleted, or the node gained new predecessors or successors after the deleter's 'find' operation, the removal has to start over.

The lock-free implementation can likewise be forced to start over due to changes made in the victim's close environment.

Since we use the SkipList as a priority queue, we most definitely WILL make changes in the victim's close environment (deleteMin operates near the head of the queue). These approaches to node removal therefore cause high local contention, and harm the performance of deleteMin, the priority queue's most important method.

The basic idea of our new implementation is that when removing an element we only mark it logically; it is not physically deleted, neither by remove nor by find (in contrast to what the lazy and lock-free implementations do).
This approach reduces the contention (from the victim's whole environment to the single victim node) and reduces the number of actions required for a removal; both contribute to higher concurrent deleteMin performance.

After a certain portion of the list has been removed (many nodes are logically deleted, and they are probably concentrated near the head of the list due to deleteMin operations), we perform a grouped cleanup: we pick a large group of mostly-removed items and perform a single skiplist disconnection between the head of the list and the highest edge of the group.

This reduces the number of CAS operations required to disconnect the group (one per level of a single node, instead of one per node), thus reducing the total number of operations performed on the skiplist nodes over time.

Because of the nature of the structure, it might be hard to find a large enough group of items that are all logically deleted. We can think of this as fragmentation, where some "alive" nodes are scattered among the logically deleted ones.

In order to solve this issue, we allow a small number of live items to be physically removed from the skiplist together with the logically deleted ones. We then move the live items into a wait-free elimination array, where subsequent deleteMin calls will first look for items to remove.

Example: Cool Spray List during a cleanup process (a snapshot from our implementation):

Blue - normal items

Red - logically deleted items

Dark blue - “alive” items that are about to be moved to the elimination array

Dark red - logically deleted items that are about to be removed in a single batch

Gray - the selected cleanup group

The Structure and the Algorithm

Timing relations table

| action \ time | no cleaner, no elimination | cleanup group selection | disconnecting the group | fill elimination array | elimination array is alive |
|---|---|---|---|---|---|
| add, low* value | normal | wait | wait | wait | normal |
| add, high* value | normal | wait | normal | normal | normal |
| deleteMin | normal | normal | normal | normal | elimination |

* Note: low/high value means under/over the top limit of the current cleanup group, respectively.

Potential problems and their solutions

The Methods

find()

The find() method searches for a given key in the skiplist, and is used by the insert() and remove() methods. This method is wait-free.
find() traverses the SkipList (including logically deleted items). If the key is found, it checks whether the item is logically deleted.

The return value can be either (1) Found - found and alive, (2) Deleted - found and logically deleted, or (3) Not found.

find() also fills the predecessors and successors arrays given to it as out parameters; these are used by insert and remove, as in the other implementations.
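
A sketch of this contract, assuming a Node with an int key, an AtomicBoolean deleted flag and AtomicMarkableReference-based next links, as in the book's skiplists; FindResult and the exact names are illustrative:

```java
// Sketch: head/tail are sentinel nodes with Integer.MIN_VALUE /
// Integer.MAX_VALUE keys, so the inner loop always terminates.
enum FindResult { FOUND, DELETED, NOT_FOUND }

FindResult find(int key, Node[] preds, Node[] succs) {
    Node pred = head;
    for (int level = MAX_LEVEL; level >= 0; level--) {
        Node curr = pred.next[level].getReference();
        while (curr.key < key) {        // traverse, including logically deleted nodes
            pred = curr;
            curr = curr.next[level].getReference();
        }
        preds[level] = pred;            // out parameters for insert()/remove()
        succs[level] = curr;
    }
    if (succs[0].key != key)
        return FindResult.NOT_FOUND;    // (3) not found
    return succs[0].deleted.get()
            ? FindResult.DELETED        // (2) found, logically deleted
            : FindResult.FOUND;         // (1) found and alive
}
```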


insert()

The insert() method inserts a new item into the priority queue. It fails if the item's key already exists (no duplicates are allowed, as in the other implementations).
insert() is usually lock-free, except when a cleanup is taking place (see the timing table).

This method grabs two fair reader-writer locks (as a reader), in a way that allows multiple concurrent "insert" operations but prevents forbidden collisions with the cleanup operation (which grabs the corresponding writer locks).

The flow:

First, we grab a read lock to make sure that no delete-group decision process is in progress. That lock (lock2) blocks all inserters while a cleanup process is running and has not yet published the highest node key value (represented by the variable highestNodeKey), which is the highest key in the formed delete-group. If no delete-group exists, highestNodeKey is null.

Then we test the variable highestNodeKey. If it is not null, we check whether the key we want to insert is smaller than the highest key in the delete-group. If it is, we wait until the delete-group is ready (by grabbing another reader lock, lock3) and then insert our value. If the value we want to insert already exists in the skiplist but is marked as logically deleted, we can revive it: we simply unmark it, and return true if we succeed.

Otherwise - we perform the insertion similar to the lock-free spray list.

Now, if there still exists an item in the elimination array whose value is higher than the value we just inserted, it needs to be reinserted into the skiplist, to preserve the (relaxed) linearization of the priority queue. Let's remember that the requirement of the relaxed structure is to pick a key that, with high probability, is among the first O(p log^3 p) elements (where p is the number of active threads).

Every insert thread only attempts to reinsert a single item.

The original item's insertion time is used as a "backoff": a grace period for incoming deleteMin threads to eliminate the item, because we would like to avoid reinserts as much as we can.

In the end, we don't forget to release the reader locks we grabbed. The whole flow is sketched below.
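
Putting the flow together, a hedged sketch (the lock names follow the text; find()/FindResult are from the sketch above; lockFreeStyleInsert and tryReinsertOne are assumed helper names, not our verbatim code):

```java
public boolean insert(int key) {
    Node[] preds = new Node[MAX_LEVEL + 1], succs = new Node[MAX_LEVEL + 1];
    lock2.readLock().lock();               // no delete-group decision in progress
    boolean lowValue = false;
    try {
        Integer highest = highestNodeKey;  // published by the cleaner, null if none
        if (highest != null && key <= highest) {
            lock3.readLock().lock();       // wait until the delete-group is ready
            lowValue = true;
        }
        boolean ok;
        FindResult result = find(key, preds, succs);
        if (result == FindResult.FOUND)
            ok = false;                                       // duplicate key
        else if (result == FindResult.DELETED)
            ok = succs[0].deleted.compareAndSet(true, false); // revive the node
        else
            ok = lockFreeStyleInsert(key, preds, succs);      // as in the lock-free list
        tryReinsertOne(key);    // reinsert one higher elimination item if needed
        return ok;
    } finally {
        if (lowValue) lock3.readLock().unlock();
        lock2.readLock().unlock();
    }
}
```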

reinsert

Take a skiplist node from the elimination array, and attempt to insert it back to the skiplist.

The reinsert action is actually a repetition of insert on the node from the elimination array, with slightly different edge cases. It is performed within the locks grabbed by insert, before they are released.

The reinsert action aborts if the node is deleted (from the elimination array) before it succeeds connecting the node to the skiplist.

remove()

The remove() method performs only a logical removal of an item.
First, it calls find() to check whether the relevant key exists. If the key was not found, or was found but is already logically deleted, we return false.

Otherwise, we try to mark it as deleted (with compareAndSet). On a successful mark we return true; a sketch follows.
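
A sketch of this logical-only removal, reusing the assumed find()/deleted-flag names from above:

```java
public boolean remove(int key) {
    Node[] preds = new Node[MAX_LEVEL + 1], succs = new Node[MAX_LEVEL + 1];
    if (find(key, preds, succs) != FindResult.FOUND)
        return false;                       // absent, or already logically deleted
    // Logical delete only: flip the flag, leave the node physically linked.
    return succs[0].deleted.compareAndSet(false, true);
}
```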

clean()

The clean() method is the cleaner. Its duty is to find a delete-group, disconnect it, and create an elimination array from the alive elements of that delete-group.

The flow:

First it tries to grab a lock (lock1) using tryLock, in order to prevent multiple concurrent cleaners. If tryLock returns false, it means there is already a working cleaner and we don't need another one, so we return.

If lock1 was successfully locked, we grab the writer locks lock2 and lock3 in order to block the inserters.
After grabbing these locks we can be sure that no inserter is running, so we can start creating the deletion-group.

The creation of the deletion-group and the elimination array is divided into 4 steps:

step 1:

In this step we determine the upper limit of the group. This is done by scanning the SkipList at level 0 (i.e., a linear traversal) and counting the number of alive elements seen during the scan.
As we claimed before, the size of the elimination array has to be limited in order to keep the (relaxed) linearization, so the number of healthy nodes allowed in a group is capped.

The last element of the deletion-group will be the highest-leveled node seen during the scan.

step 2:

Now, after we've decided which elements will be included in the deletion-group, we publish the highest key in the group by writing its value to the variable highestNodeKey, and release lock2, which blocks the inserters. Note that lock3 is still locked, so the low-valued inserters are still blocked.

step 3:

We mark the references of the highest node in the deletion-group (using AtomicMarkableReference), to prevent high-valued inserts from interfering.

Now we can disconnect the group by connecting the head to the highest node's next references at all levels.

step 4:

We've got a disconnected deletion-group. Now we can mark each alive node as belonging to the elimination array, create the elimination array from the alive nodes, and then publish the elimination array and unlock lock3, allowing all the lower inserters to do their work.

In the end, we don't forget to release the cleaner lock we grabbed. The whole flow is sketched below.
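
A hedged sketch of the whole flow (the lock names follow the text; the step helpers and the final reset of highestNodeKey are our assumptions, not the verbatim project code):

```java
public void clean() {
    if (!lock1.tryLock()) return;              // another cleaner is already working
    try {
        lock2.writeLock().lock();              // block all inserters
        lock3.writeLock().lock();
        Node groupEnd = scanForDeleteGroup();  // step 1: scan level 0, cap live nodes
        highestNodeKey = groupEnd.key;         // step 2: publish the group's top key
        lock2.writeLock().unlock();            //         high-valued inserts resume
        markAndDisconnect(groupEnd);           // step 3: mark refs, relink head past group
        eliminationArray = buildEliminationArray(groupEnd); // step 4: collect live nodes
        highestNodeKey = null;                 // assumed: no active delete-group anymore
        lock3.writeLock().unlock();            //         low-valued inserts resume
    } finally {
        lock1.unlock();
    }
}
```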

spray()

spray() uses a random traversal to obtain a low key for deletion; a sketch follows.
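
A hedged sketch of such a traversal; startHeight() and maxJump() stand in for the SprayList paper's exact parameters, which are derived from the shared active-thread counter and are not reproduced here:

```java
// Sketch: start height and jump lengths grow with the number of active
// threads p; EMPTY is the illustrative sentinel used earlier.
int spray() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    int p = activeThreads.get();                    // shared atomic thread counter
    Node curr = head;
    for (int level = startHeight(p); level >= 0; level--) {
        int jump = rnd.nextInt(maxJump(p, level) + 1);  // uniform random steps
        while (jump-- > 0) {
            Node next = curr.next[level].getReference();
            if (next == tail) break;                // don't walk past the end
            curr = next;
        }
    }
    return curr == head ? EMPTY : curr.key;
}
```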

deleteMin()

deleteMin() removes a low key from the queue, and returns it. It fails if the queue is empty, and returns Integer.MAX_VALUE upon failure.

deleteMin() is usually lock-free, except when a cleanup is taking place (see the timing table).

deleteMin() repeatedly attempts to obtain a low key using different methods, until it either succeeds or fails, in this order (a sketch follows the list):

  1. If there is a live elimination array, attempt to quickly eliminate an item from it.
  2. With low probability, perform a cleanup. This slows down the current deleteMin() call, but optimizes the skiplist for future deleteMin() and insert() calls, and shortens future spray-remove-retry attempts.
  3. spray() → remove(), as in other spraylist implementations.
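
A hedged sketch of this loop (isAlive(), getNode(), isEmpty() and CLEAN_PROBABILITY are assumed names, not our verbatim code):

```java
public int deleteMin() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    while (true) {
        NodesEliminationArray arr = eliminationArray;
        if (arr != null && arr.isAlive()) {       // 1: fast wait-free elimination
            Node n = arr.getNode();
            if (n != null) return n.key;
        }
        if (rnd.nextInt(CLEAN_PROBABILITY) == 0)  // 2: occasional grouped cleanup
            clean();
        int candidate = spray();                  // 3: spray -> remove
        if (candidate != EMPTY && remove(candidate))
            return candidate;
        if (isEmpty()) return Integer.MAX_VALUE;  // failure: queue is empty
    }
}
```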

Inner/Helper classes

serviceClass

This class provides "random" services for all the skiplist implementations. It uses ThreadLocalRandom and provides two methods:

  1. randomStep - a simple uniformly-distributed value.
  2. randomLevel - a geometrically distributed value, used to achieve the desired skiplist height distribution (level i should skip ~2^i items per step); a sketch follows.
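
A sketch of such a geometric generator (illustrative, using ThreadLocalRandom as stated in the text):

```java
import java.util.concurrent.ThreadLocalRandom;

// Flip fair coins until the first "tails": the resulting level is
// geometrically distributed, so level i is reached with probability 2^-i.
static int randomLevel(int maxLevel) {
    int level = 0;
    while (level < maxLevel && ThreadLocalRandom.current().nextBoolean())
        level++;
    return level;
}
```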

NodesEliminationArray

This is the elimination array. It is filled by the clean() method, and consumed by deleteMin() calls and by some low-valued insert() calls.

The elimination array is meant to be used only once; it is replaced with a new copy on every successful clean() operation.

The elimination array has two phases in its lifetime:

1 - creation phase: only the cleaner thread may access it; there is no protection against concurrent access at this point.

2 - consumption phase: it can be accessed freely by multiple deleteMin() and insert() threads.

deleteMin() threads call getNode(), which uses an atomic integer (getAndDecrement) to obtain a unique token in a wait-free manner, and returns the node associated with that token, or fails immediately if the array is empty.

insert() threads call getNodeForReinsert(), which uses a different atomic integer in the same manner.

getNode() hands out the nodes in increasing order (for a correctly sorted deleteMin order), while getNodeForReinsert() hands them out in decreasing order (the higher values are the ones causing the linearization problem described above). A sketch of the token scheme follows.
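
A sketch of the token scheme for getNode(); the field names are illustrative, and getNodeForReinsert() works the same way from the other end of the array with its own counter:

```java
import java.util.concurrent.atomic.AtomicInteger;

// One shared counter per consumption direction; size and nodes are
// fixed at creation time by the cleaner.
private final AtomicInteger remaining = new AtomicInteger(size);

public Node getNode() {
    int token = remaining.getAndDecrement();  // unique token, wait-free
    if (token <= 0) return null;              // array exhausted: fail immediately
    return nodes[size - token];               // hand out nodes in increasing order
}
```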

The Optimistic Cool Spray List Priority Queue

The Optimistic Cool Spray List Priority Queue implementation is like the regular one, except for a change in the clean method.

This improvement solves the problem that while a delete-group is being decided on, all the inserters have to wait.
Instead of freezing all the inserter threads, we search for a delete-group optimistically, hoping it will not change too much. When we reach the highest node of the optimistic delete-group, we acquire lock3 and no longer allow inserts lower than the node we found. We then scan the list again and determine the final deletion-group; a sketch follows.
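
A hedged sketch of the optimistic clean(); the helper names are illustrative, and the brief lock2 acquisition for publishing highestNodeKey is our assumption, not the verbatim project code:

```java
public void optimisticClean() {
    if (!lock1.tryLock()) return;              // at most one cleaner
    try {
        Node guess = scanForDeleteGroup();     // optimistic scan: inserters keep running
        lock2.writeLock().lock();              // briefly flush in-flight inserters
        highestNodeKey = guess.key;            // publish the optimistic upper bound
        lock3.writeLock().lock();              // from now on, low-valued inserts wait
        lock2.writeLock().unlock();            // high-valued inserts resume immediately
        Node groupEnd = scanForDeleteGroup();  // final scan: the prefix may have changed
        markAndDisconnect(groupEnd);           // then proceed as in the regular clean()
        eliminationArray = buildEliminationArray(groupEnd);
        highestNodeKey = null;                 // no active delete-group anymore
        lock3.writeLock().unlock();
    } finally {
        lock1.unlock();
    }
}
```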

After this change, the timing table looks like this:

| action \ time | no cleaner, no elimination | cleanup optimistic group selection | cleanup group selection | disconnecting the group | fill elimination array | elimination array is alive |
|---|---|---|---|---|---|---|
| add, low* value | normal | normal | wait | wait | wait | normal |
| add, high* value | normal | normal | normal | normal | normal | normal |
| deleteMin | normal | normal | normal | normal | normal | elimination |

Future notes:

Implementation in C++

Our initial plan was to copy all the Java implementations to C++ as-is, replacing references with pointers. But things did not work out as expected.

The C++ Concurrency package allowed using all the concurrency constructs from Java, with very minor differences.

We took the task of translation from the simplest to the most complicated:

Global Lock ⇒ Lazy Lock ⇒ Cool

The Global lock spray list was easy to translate. Since only one thread can access the data structure at a time, this single thread can allocate and deallocate nodes freely, without any side effects on other threads.

The global lock keeps implementation simple, and everything works well.

Moving on to the Lazy lock spray list, we soon discovered that the main challenge with concurrent data structures in C++ is memory management: the problem of knowing when it is safe to destruct an object and deallocate its memory.

To solve this problem, one must know for certain that it is the last possible reference to that object before deallocating its memory. Otherwise other threads might read, write or otherwise use that memory after it is no longer allocated and no longer valid, causing unexpected behavior.

Our first attempt used std::shared_ptr (formerly of Boost).

Replacing the regular "next" pointer arrays in the Lazy Lock implementation with arrays of shared_ptr was straightforward, and was made possible because changing pointers required grabbing the node's lock.

std::shared_ptr ensured that whether a thread read the value before or after a change, the pointed-to object remained valid until its last reference was destroyed, at which point the object was destroyed as well.

The resulting Lazy Lock Spray List Priority Queue works correctly, and has no memory leaks.

Unfortunately it had performance issues, and was not included in the benchmark.

Lock-Free / Cool Spray List Priority Queue

Lock-Free insertion and removal of nodes require combining pointers and flags/timestamps into a single atomic object - allowing atomic Compare-And-Set, with the flag/timestamp preserving some order between critical operations.

The C++ Concurrency Package does provide us with all the atomic constructs of Java, and std::shared_ptr supports standard atomic operations.

But combining the two is impossible: AtomicMarkableReference mangles a raw pointer, while shared_ptr maintains its reference counts through its copy constructor.

Composition also gets you nowhere: if the AtomicMarkableReference points to the address of a heap-allocated shared_ptr, you are left with the problem of deleting that shared_ptr instance; if the shared_ptr points to an AtomicMarkableReference instance, it will only destroy the AtomicMarkableReference, and not the referenced list node.

At this point ThreadScan was introduced to us - an advanced library for C, meant specifically for building concurrent data structures.

ThreadScan only attempts to free objects for which the program called “threadscan_collect”.

With objects marked for collection, ThreadScan acts somewhat like Java/.NET garbage collection: the threads are the roots, and every pointer on their stacks or in a per-thread pre-allocated buffer is kept alive until no references to it are left.

ThreadScan is not recursive, though, so it still requires careful handling, and prevention of non-stack storage of objects marked for collection.

ThreadScan’s limitations required designing our data structure in an appropriate way. The simplified logic presented in Cool Spray List, and its invariants, were helpful in achieving this.

As a result, we decided not to translate the LockFree spray list, and to focus on the Cool spray list:

These are the changes we made to the Cool Spray List in order to support ThreadScan:

Future notes:

Working with external libraries

Deuce STM

A software transactional memory library for Java, by Guy Korland.

It introduces a simple approach: "Just write your code optimistically, and it should work".

After switching to the latest version of Deuce STM, this was really the case: it just magically worked, as if it were written with locks and correct synchronization.

Issues:

At first we used version 1.2.0, thinking it was 2.0. This gave us some strange exceptions regarding improper ReadWriteLock usage.

After switching to version 1.3.0 we had no issues at all; so while performance is not guaranteed, the practice and concept are definitely made possible with Deuce STM.

C++ Concurrency Package

A set of classes for simple development of concurrent C++ applications, with very high similarity to the Java interfaces, by Moran Tzafrir.

The CCP was very helpful in porting our Java code to C++ - both the data structures and the testing environment. The preserved similarity allows better comparability of results, and is less error-prone.

The library is portable across several operating systems (including Linux and Windows), allowing developers to write portable concurrent programs for multiple targets.

Changes and Issues:

The Linux version of ReentrantLock was not re-entrant: we got exceptions and strange behavior when attempting to use it for the Lazy-Lock-based implementation, which may lock a specific node at several levels. We fixed this issue locally, for Linux only, by setting the mutex attribute "PTHREAD_MUTEX_RECURSIVE". We did not test this on other platforms.

The class ReentrantReadWriteLock lacked a “public” access modifier, rendering the class unusable. We added the access modifier locally.

The class ReentrantReadWriteLock was always fair, unlike Java where fairness is controlled by a boolean initialization parameter. We added a boolean fairness parameter locally.

ThreadScan

A C/C++ library for memory deallocation of concurrently-used objects, preventing access to unallocated memory. by William M. Leiserson.

After examining a self-made implementation of memory deallocation, and after using std::shared_ptr, we reached a sad conclusion: it is impossible to use these deallocation methods in our scenario without global locks.

(One reason is the impossibility of combining std::shared_ptr with CCP::AtomicMarkableReference.)

That's where ThreadScan comes into play: it monitors normal pointer references on all active thread stacks, but only attempts to free the ones we explicitly asked it to. It provides a combination of the automatic behind-the-scenes magic of std::shared_ptr and the high level of control of calling "free" only when the data structure instructs it.

Issues:

ThreadScan is only compatible with 64-bit Linux. It uses Linux libraries, and has assembly code using 64-bit-only registers.

Startup bug: on startup, ThreadScan reads a special virtual Linux file (a memory map file). There was a bug in parsing this file that caused our program to crash before it started. We fixed this bug, and our fix was merged into the official ThreadScan repository.

ThreadScan is a C library. As such, it only supports malloc/free memory management, and not new/delete with constructors and destructors. Our code is object-oriented, based on the Java implementation, and uses the object-oriented C++ Concurrency Package.

To work around this issue, we made sure that every class collected with ThreadScan has no meaningful destructor, so that we can allocate it with "new" but deallocate it with "free" (inside ThreadScan), without any memory leaks or program errors.

ThreadScan does not collect all its memory on program exit, so valgrind shows a small memory leak.

ThreadScan does, however, collect memory over time, and the program's size does not grow.

Tests

In order to test our implementations we built a wide set of tests.

Each test uses a number of insert and delete workers.

There are different types of insertion workers:

  1. Separate groups -
    Each worker has its own unique range of numbers to insert.
  2. Interleaving -
    With this type of worker, the insertions are performed in an interleaving manner.
  3. Decreasing Order -
    Given a "highest" parameter, insertion starts at that value and proceeds downwards until reaching 1 or a timeout.
  4. Increasing Order -
    Same as Decreasing Order, but vice versa.
  5. Random -
    Workers insert random numbers into the queue.

There are different types of deletion workers:

  1. Perform deletes until the queue is empty.
  2. Perform deletes until an immediate stop signal is given by the test (using a boolean flag).

Another important property is that delete workers and insert workers can run either simultaneously or serially in a test; we use both modes in our tests.

Now that we have defined the different types of workers and roles, we can describe our tests by grouping them according to the workers they use:

(Note that each test has its own parameters, which define which types of roles it uses.)

Separate groups

Test Bench 2 (x, y, itemsPerThread)

Each insert worker inserts its own unique range. Insert with x threads and delete-min with y threads simultaneously, implicitly stopped; then execute delete-min until the queue is empty.

Test Bench 3 (x, y, itemsPerThread)

Each insert worker inserts its own unique range. Insert with x threads, then delete-min with y threads serially (NOT simultaneously), implicitly stopped; execute delete-min until the queue is empty. Measure BOTH the insertion and the deletion time.

 

Interleaving

Test Bench 5 (x, y, timeout)

Insert in an interleaving manner with x threads and delete-min with y threads simultaneously, using a boolean flag to stop; then execute delete-min until the queue is empty.

 

Test Bench 6 (x, y, timeout)

Insert in an interleaving manner with x threads, then delete-min with y threads serially (NOT simultaneously), using a boolean flag to stop; execute delete-min until the queue is empty. Measure BOTH the insertion and the deletion time.

 

Test bench 7 (x, y, timeout)

Execute inserts (interleaving) indefinitely with x threads.

Execute delete-min with y threads until the timeout.

 

Decreasing order

Test Bench 8 (x, y, timeout, highest)

Simultaneously insert (decreasing order) with x threads and delete with y threads, stopping at the timeout; then execute delete-min until the queue is empty.

 

Test Bench 9 (x, y, highest)

Simultaneously insert (decreasing order) with x threads and delete-min with y threads. Insert all values (highest down to 1), then delete-min until the queue is empty.

 

Test Bench 10 (x, y, highest)

Serially insert (decreasing order) & delete-min.

Insert all values (highest down to 1), then delete-min until the queue is empty.

Measure BOTH the insertion and the deletion time.

 

Increasing order

Test Bench 11 (x, y, timeout)

Simultaneously insert (increasing order) with x threads and delete with y threads, stopping at the timeout; then execute delete-min until the queue is empty.

 

Test Bench 12 (x, y, highest)

Simultaneously insert (increasing order) with x threads and delete-min with y threads. Insert all values (1 to highest), then delete-min until the queue is empty.

 

Test Bench 13 (x, y, highest)

Serially insert (increasing order) & delete-min.

Insert all values (1 to highest), then delete-min until the queue is empty.

Measure BOTH the insertion and the deletion time.

Test Bench 17 (x, y, highest)

Serially insert (increasing order).

Perform delete-min & insert so that the queue remains at the same size;

always insert a value that will be the highest in the queue.

Measure BOTH the insertion and the deletion time.

Also measure how effective the delete-min is.

 

Random

Test Bench 14 (x, y, timeout)

Insert random values with x threads and delete-min with y threads simultaneously, using a boolean flag to stop; then execute delete-min until the queue is empty.

Test Bench 15 (x, y, timeout)

Insert random values with x threads, then delete-min with y threads serially (NOT simultaneously), using a boolean flag to stop; execute delete-min until the queue is empty. Measure BOTH the insertion and the deletion time.

Test bench 16 (x, y, timeout)

Execute inserts (random values) indefinitely with x threads.

Execute delete-min with y threads until the timeout.

Results

All the benchmarks were run on a server with 40 Xeon processor cores (hyperthreaded: 80 hardware threads).

First, we can notice in all the graphs that the performance of the C++ implementations, both the Global and the Cool, is very poor compared to the corresponding Java implementations.

We conclude that there is a performance issue with the locks of C++ Concurrency Package compared to the Java locks.

For example, in testbench 2:

(for the full testbench results, see the references)

For this reason, our analysis will focus mainly on Java results.

Let’s examine the main tests:

Separate groups (Testbenches 2 and 3):
Insertion: We can notice that TM and LockFree lead in insertion throughput above 10 threads, and show a trend of improvement as the number of threads increases.

We can also notice the same trend in our Cool and Optimistic Cool, but their throughput is much lower than TM's and LockFree's.

In contrast, the Global shows very good throughput with a small number of threads (best at 2), but its throughput decreases as the number of threads increases, as we would expect.

Deletion: we can notice that the throughput of Cool and Optimistic Cool is better: higher than TM's, and as the number of processors grows they even beat LockFree. We can see that the lazy implementation is also a leader here.

The reason for these results is clear: because each thread is responsible for a specific range of keys, there is no contention; therefore TM, LockFree and Lazy are better at insertion, and the Global's throughput decreases.

About Cool and Optimistic Cool: we can see that the insertion throughput is not that high, but the deletion throughput is quite high compared to the other implementations, and almost equals the insertion rate.
The immediate conclusion is that the deletion rate is limited by the low insertion rate, in contrast to the other implementations, where the deletion rate is not even close to the insertion rate!

The serial test (TB3) confirms this conclusion, showing a negative logarithmic insert-delete ratio for the Cool spray list, and a positive ratio for the other implementations.

simultaneous tests:

serial tests:

Interleaving: (Testbenches 5 and 6)

Interleaving introduces both high insert contention and high insert-delete contention: all insert threads insert adjacent values, and the delete threads are "chasing" them.

The lazy + grouped-cleanup property of both Cool implementations reduces this contention, and we can see significant throughput and scalability (slope) advantages in delete-min over the other implementations.

simultaneous tests:

serial tests:

Increasing order: (Testbenches 11 and 13)

These are similar to interleaving, but with higher insert-insert contention: here the number generator is shared, and the adjacency of insertions is preserved throughout the test.

We can see very poor performance from all the implementations except the Global, both in insertion and in deletion. The reason is that all the other implementations are parallel, which means huge contention at the end of the list during insertion; as a result, poor insertion performance leads to poor deletion performance, because the number of elements that were inserted is low.

simultaneous tests:

serial tests:

Decreasing order:  (Testbenches 8, 10)

The decreasing simultaneous test (testbench 8) did not run correctly on the target machine and did not give comparable results, although it did on our own machines.

Since this mode introduces higher insert-delete contention (delete-min is supposed to remove the last inserted value, and newly inserted values are adjacent), we expected to see a drastic performance hit, especially in the Cool spray list implementations, where insert threads would have to wait much longer on the locks preventing the insertion of lower values.

Testbench 10 introduces very high insert contention, as in the "increasing order" tests, and still its insertion performance, even with one thread, is very high. The high performance is caused by short and easy insert operations: every "find" call only has to traverse down the skiplist height to insert a new minimal value, so inserts are quicker, retries are quicker, and total progress is faster.

serial tests:

Random: (Testbenches 14, 15)

These tests introduce uncontended inserts, even more so than the interleaving tests; all skiplist implementations perform very well and show similar results.

simultaneous tests:

serial tests:

Constant size: (Testbench 17)

This test separates inserts from deletes, reducing their contention.

The results show similarly high performance from nearly all implementations, and show that a large skiplist does not harm performance at all.

The fair-lock implementations show a major hit: here a single deleteMin operation blocks all the other threads, because the threads are not divided into groups and all of them must perform "insert" operations. This effectively blocks those threads from performing their next delete-min as well.

The TM implementation's low performance matches its low delete performance in the other scenarios.

serial tests:

General Points:

  • At first we could not compare the Cool implementation's performance to the LockFree implementation that lies at its base, because the LockFree implementation would get stuck too often.
    The fact that the Cool implementation never got stuck shows a large improvement over its base.
    Later, when we reduced the number of ghost nodes in the lock-free skiplist, performance became similar, but the LockFree implementation's results are much less stable than the Cool implementation's.

  • For the software-transactional-memory implementation, delete-min performance was poor: sometimes just below average, sometimes far less than that.
    Its insertion performance, however, was remarkably good on some tests and average on others (depending on the insertion pattern).

  • In most of the tests, the Java built-in implementations (lock-based, binary-heap-based) outperformed all skiplist implementations.
    Even though they use locks, they are far more efficient than the skiplists for small numbers of threads; but as the number of threads increases, the Java built-in performance becomes very poor, which is a reasonable result.

    We can see these results in testbench 2, for example:

Cool Spray List Results

Let's examine the results of the 4 implementations of the Cool SprayList, and compare them:
We can clearly notice that in almost all the testbenches, the performance of the unfair-lock Cool and Optimistic Cool is significantly better than that of the fair-lock implementations.

(for all the testbench results, see the references)

In addition, the throughput of the Optimistic Cool is a bit better than the Cool's, meaning that the optimistic improvement works well.

The fact that the unfair lock is better than the fair one is a little counter-intuitive, and surprised us at first sight. Moreover, the optimistic improvement brings a much greater gain with the unfair lock.

A possible explanation for this phenomenon is that in the fair-lock implementations, from the moment a writer (i.e., a cleaner) asks to acquire the write lock, no reader (i.e., an inserter) can enter the critical section until the writer finishes. This time includes:

  1. The period until all the current readers finish their job (which may take a very long time).
  2. The time it takes the writer to acquire the lock.
  3. The time it takes the writer to finish its job.

In our case, the period until the old inserters finish their job is probably much longer than the time it takes the cleaner to do its job, and therefore we get a significant performance degradation in the fair-lock implementations compared to the unfair ones.

In most of the tests, deleteMin performance was significantly better than insert performance.

We expected this to balance out and show better overall performance with asymmetric thread counts (20-10, 50-10).

Unfortunately, comparing the symmetric, asymmetric and serial tests shows that Cool Spray List insertion is bounded by some sequential bottleneck, around 500-700 operations per ms (depending on the insertion pattern).

Conclusions

The biggest surprise in these results was the positive effect of the unfair lock on the performance of our implementations. The effect was significant, even larger than the effect of the optimistic approach with its shorter inserter-blocking time.

Considering what we have seen in the comparison with the Java built-in priority queue implementations, it appears that overcoming the very high skiplist overhead requires both a large number of processors and very careful optimization to avoid sequential bottlenecks.

The Deuce software transactional memory library does make life easy, but it did not yield adequate performance under contention, and it is not ready to serve as a high-performance replacement for standard synchronization mechanisms.

It did perform very well under uncontended insert conditions, which shows it has great potential when combined with contention-reducing mechanisms, just not naively on its own.

An interesting experiment would be combining Deuce STM with the Cool spray list:

STM avoids long waiting on locks, which can drastically improve insertion performance.

The Cool spray list reduces the environmental effect of delete-min, which allows higher delete-min performance.

This could result in a priority queue with STM's scalable insert performance and Cool's scalable delete-min performance.

The hard parts would be (a) verifying that the TM mechanism can "prefer" clean threads over other, less important ones, and (b) correctly marking elimination-array nodes so that they fail the relevant insert transactions.

References

  1. "The Art of Multiprocessor Programming" by Maurice Herlihy & Nir Shavit, Chapter 14 - Skiplists and Balanced Search.
  2. The SprayList: A Scalable Relaxed Priority Queue, by Dan Alistarh, Justin Kopinsky, Jerry Li, and Nir Shavit. http://research.microsoft.com/apps/pubs/default.aspx?id=209108
  3. C++ Concurrency Package: http://mcg.cs.tau.ac.il/projects/projects/c-concurrency-package
    (discontinued; our code contains the framework source code with several fixes, as described above)
  4. ThreadScan repository: https://github.com/Willtor/ThreadScan
  5. Our Java implementations: https://github.com/roitagar/CSWorkshopPriorityQueue2015-java
  6. Our C++ implementations: https://github.com/roitagar/CSWorkshopPriorityQueue2015-cpp
  7. Example of the lock-free ghost nodes problem
  8. Raw graphs:
    1. all implementations
    2. all implementations without Java built-in implementations
    3. C++ implementations vs. corresponding Java implementations
    4. Java's Cool implementations