Concurrent Hopscotch Hash Map

 

Programmers:

Hila Goel, Tel Aviv University
hila.goel@gmail.com

Maya Gershovitz, Tel Aviv University
mgershovitz@gmail.com

 

This code was developed as part of the "Workshop on Multicore Algorithms" at Tel Aviv University, under the guidance of Prof. Nir Shavit and Moshe Sulamy.

Our source code can be found here - Source_Code

 

As part of this project we created eight different implementations of a concurrent hopscotch hash map: four in Java and four in C++.

The implementations are:

1. Global lock - C++, Java

2. Fine-grained locking - C++, Java

3. Transactional memory - C++, Java

4. Our improvement on the fine-grained locking - C++, Java

In our code, we built on an existing implementation of hopscotch hashing with a fine-grained locking system in C++, created by:

Sathya Hariesh Prakash - New York University

Royston Monteiro - New York University

Mohan Pandiyan - New York University

The source can be found here - https://github.com/harieshsathya/Hopscotch-Hashing.


 

About Hopscotch Hashing

 

Hopscotch hashing is a scheme for resolving hash collisions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by Maurice Herlihy, Nir Shavit and Moran Tzafrir in 2008.

 

The hash table is an array of buckets. The key characteristic of the hopscotch approach is the notion of a neighborhood of buckets around any specific bucket. This neighborhood has the property that the cost of finding the desired item in any of the buckets in the neighborhood is the same or very close to the cost of finding it in the bucket itself.
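
Concretely, each bucket carries, besides its key-data pair, a small bitmap describing its neighborhood. The following is a minimal sketch of the layout assumed by the code excerpts later on this page; the field names follow those excerpts, while _data's type and the array sizing are our assumptions (the excerpts index up to hash + HOP_RANGE - 1, so we assume the bucket array is allocated with spare slots past MAX_SEGMENTS):

class Bucket {
     int _key = -1;       // -1 marks an empty bucket
     int _data = -1;
     long _hop_info = 0;  // neighborhood bitmap: bit i is set when the
                          // bucket i slots away holds an item that hashed here
}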

 

This property is achieved and maintained by the add operation.

When adding a new item to the hash table, we only insert it into the neighborhood of the bucket it originally hashed to. If the neighborhood is full, items are moved around the table to clear space in that neighborhood.

 

The Hopscotch Hashing scheme allows for excellent performance with very high table density (up to 99%).

 

*Based on this article - http://mcg.cs.tau.ac.il/papers/disc2008-hopscotch.pdf by Herlihy, Shavit & Tzafrir.


 

More about the Code…

The purpose of this workshop was to understand the pros and cons of different methods for implementing concurrent data structures, as well as the differences between the two programming languages, Java and C++, each with its own advantages and disadvantages.

 

Main Differences between the Languages

In this section, we’d like to detail the main differences between Java and C++, and how they influenced the code and overall programming experience.

A. One of the most well-known differences between the languages is that C++ uses pointers, whereas Java uses references. Because of this, our handling of the data structure and the objects it contains differs between the two:

In C++, our implementation uses pointers to the data structure and to the individual objects in it.

In Java, there are no pointers, so we used index numbers to refer the functions to the proper places in the table.

B. Another crucial difference is memory management.

In C++, it is the programmer's duty to free memory, in the form of writing destructors that release any memory allocated during run time and no longer in use.

In Java, a built-in garbage collector is responsible for freeing this unused memory.

 

This might influence test results: in C++ we know exactly when memory is reclaimed, whereas in Java garbage collection can happen at any time, at the JVM's discretion rather than the programmer's, so it may interfere with test timing.

 

C. The third difference is the implementation of transactional memory.

In C++ we worked with hardware transactional memory (Intel TSX), provided by the Intel Haswell microarchitecture; the processor itself is responsible for executing the transactions atomically. Specifically, we used an Intel Core i7-4770 processor.

 

In Java, we worked with software transactional memory. We used a third-party package called Deuce (which can be found here - https://sites.google.com/site/deucestm/).

Using Deuce, we annotated certain functions as atomic. Deuce is responsible for performing the required computations and finally trying to commit the changes to memory, while checking whether any conflicts occurred.
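
For illustration, here is a minimal, hypothetical example of Deuce usage (not taken from our project); the @Atomic annotation lives in the org.deuce package:

import org.deuce.Atomic;

public class Counter {
     private int value = 0;

     // Deuce instruments @Atomic methods so that all reads and writes
     // inside them execute as one software transaction, which is retried
     // automatically if a conflicting commit is detected.
     @Atomic
     public void increment() {
          value++;
     }
}

The program is then run with Deuce's agent, which performs the required bytecode instrumentation.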

 

In both cases, we created fallback methods using the fine-grained locking system, which are used when transactions repeatedly fail.
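
Schematically, the retry-and-fallback logic in both languages looks like the sketch below. Every name in it (TransactionalMap, MAX_RETRIES, addTransactionally, TxAborted, addWithFineGrainedLocks) is an illustrative stand-in, not our actual API:

abstract class TransactionalMap {
     static final int MAX_RETRIES = 10; // assumed small retry bound

     // Transactional fast path: Intel TSX in C++, Deuce in Java.
     // Throws TxAborted when the transaction could not commit.
     abstract boolean addTransactionally(int key, int data) throws TxAborted;

     // The fine-grained-locking implementation, used as the fallback.
     abstract boolean addWithFineGrainedLocks(int key, int data);

     static class TxAborted extends Exception {}

     boolean addWithFallback(int key, int data) {
          for (int attempt = 0; attempt < MAX_RETRIES; ++attempt) {
               try {
                    return addTransactionally(key, data);
               } catch (TxAborted e) {
                    // A conflict aborted the transaction; try again.
               }
          }
          // Too many aborts: give up and take the locking path.
          return addWithFineGrainedLocks(key, data);
     }
}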

Main Functions

Our code has four main functions, which we test in the benchmarks below:

1. Contains

2. Remove

3. Add

4. Find Closer Bucket

 

We will not detail the Remove function, since it is very similar to Contains, with one difference: Contains returns true when it finds an object, while Remove deletes it. A sketch of Remove appears after the Contains code below.

 

The Contains function uses a very simple algorithm: it checks all the buckets in the neighborhood of start_bucket, and returns true if the given key is found in one of them.

 

boolean contains(int key) {
     // Map the key to its home bucket (MAX_SEGMENTS is a power of two).
     int hash = ((key) & (MAX_SEGMENTS - 1));
     Bucket start_bucket = segments_arys[hash];

     // hop_info is a bitmap: bit i is set when the bucket i slots away
     // holds an item that hashed to start_bucket.
     long hop_info = start_bucket._hop_info;
     long mask = 1; // long, so the shift cannot overflow when HOP_RANGE > 31
     for (int i = 0; i < HOP_RANGE; ++i, mask <<= 1) {
          if ((mask & hop_info) >= 1) {
               Bucket check_bucket = segments_arys[hash + i];
               if (key == check_bucket._key) {
                    return true;
               }
          }
     }
     return false;
}
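
As noted above, Remove mirrors Contains almost line for line. The following is our own sketch of it (not the project's exact code, and with synchronization omitted; the concurrent versions also take the appropriate lock or transaction):

boolean remove(int key) {
     int hash = ((key) & (MAX_SEGMENTS - 1));
     Bucket start_bucket = segments_arys[hash];

     long hop_info = start_bucket._hop_info;
     long mask = 1;
     for (int i = 0; i < HOP_RANGE; ++i, mask <<= 1) {
          if ((mask & hop_info) >= 1) {
               Bucket check_bucket = segments_arys[hash + i];
               if (key == check_bucket._key) {
                    // Where Contains returns true, Remove instead empties the
                    // bucket and clears its bit in the neighborhood bitmap.
                    check_bucket._key = -1;
                    check_bucket._data = -1;
                    start_bucket._hop_info &= ~(1L << i);
                    return true;
               }
          }
     }
     return false;
}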

 


 

The Add function consists of two parts. The first is simple, and similar to the algorithm above: it checks all the buckets in the neighborhood of start_bucket, looking for an empty bucket to put the new key-data pair in.

 

boolean add(int key, int data) {
     int val = 1;
     int hash = ((key) & (MAX_SEGMENTS - 1));
     Bucket start_bucket = segments_arys[hash];

     // Take the lock covering this bucket in the locking system.
     locks.lock_arys[locks.get_index(hash)].lock();

     if (contains(key)) {
          locks.lock_arys[locks.get_index(hash)].unlock();
          return false;
     }

     // Linear scan for the first empty bucket within ADD_RANGE.
     int free_bucket_index = hash;
     Bucket free_bucket = segments_arys[hash];
     int free_distance = 0;
     for (; free_distance < ADD_RANGE; ++free_distance) {
          if (-1 == free_bucket._key) {    // -1 marks an empty bucket
               free_bucket._key = BUSY;    // reserve it before publishing
               break;
          }
          ++free_bucket_index;
          free_bucket = segments_arys[free_bucket_index];
     }

 

If the empty bucket found lies inside the neighborhood, the Add function puts the key-data pair in it and returns true.

If the empty bucket lies outside the neighborhood, the Add function calls the Find Closer Bucket function, as sketched below.
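
Pieced together, the rest of Add looks roughly like the following sketch. It is simplified from the full code, and the exact role of val is our reading of the implementation we built on (val == 0 signals that no item could be moved):

     // If a free bucket was found within ADD_RANGE, keep pulling it
     // closer until it lands inside the neighborhood of start_bucket.
     if (free_distance < ADD_RANGE) {
          do {
               if (free_distance < HOP_RANGE) {
                    // The free bucket is inside the neighborhood: publish the
                    // pair and record its distance in the neighborhood bitmap.
                    start_bucket._hop_info |= (1L << free_distance);
                    free_bucket._data = data;
                    free_bucket._key = key;
                    locks.lock_arys[locks.get_index(hash)].unlock();
                    return true;
               }
               // Still too far away: move some item into the free bucket so
               // that a bucket closer to start_bucket becomes the free one.
               int[] res = find_closer_bucket(free_bucket_index, free_distance, val);
               free_distance = res[0];
               val = res[1];
               free_bucket_index = res[2];
               free_bucket = segments_arys[free_bucket_index];
          } while (0 != val);
     }
     locks.lock_arys[locks.get_index(hash)].unlock();
     return false; // no space could be cleared in the neighborhood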

 


 

The Find Closer Bucket function is responsible for the core algorithm of hopscotch hashing.

The function receives two main parameters (plus a helper flag, val):

free_bucket_index – the index of the first empty bucket the Add function reached.

free_distance – the distance between the original hash and that empty bucket.

 

The function scans the buckets that precede the given empty bucket, i.e., the buckets whose neighborhood still covers it, starting with the furthest one:

int move_bucket_index = free_bucket_index - (HOP_RANGE - 1);

 

For each candidate, the function checks whether one of the items in that candidate's neighborhood can be moved into the given empty bucket.

If it finds a suitable item, it moves it, and returns the index of the newly emptied bucket to the calling function, Add.

 

int[] find_closer_bucket(int free_bucket_index, int free_distance, int val) {
     // result: 0 - free distance, 1 - val, 2 - new free bucket index
     int[] result = new int[3];

     // Start with the furthest bucket whose neighborhood still covers
     // the free bucket.
     int move_bucket_index = free_bucket_index - (HOP_RANGE - 1);
     Bucket move_bucket = segments_arys[move_bucket_index];

     for (int free_dist = (HOP_RANGE - 1); free_dist > 0; --free_dist) {
          // Find the first item in move_bucket's neighborhood that sits
          // before the free bucket and can therefore be moved into it.
          long start_hop_info = move_bucket._hop_info;
          int move_free_distance = -1;
          long mask = 1;
          for (int i = 0; i < free_dist; ++i, mask <<= 1) {
               if ((mask & start_hop_info) >= 1) {
                    move_free_distance = i;
                    break;
               }
          }

          // ... (the actual move is elided here; see the sketch below) ...

          ++move_bucket_index;
          move_bucket = segments_arys[move_bucket_index];
     }
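
The elided middle of the loop is where the actual move happens. Below is our sketch of it, based on the description above and on the C++ implementation we built on (simplified; the real code also re-checks hop_info to detect concurrent modifications):

          if (-1 != move_free_distance) {
               // An item was found that can legally move to the free bucket
               // (the free bucket is still within its home neighborhood).
               Bucket free_bucket = segments_arys[free_bucket_index];
               Bucket new_free_bucket = segments_arys[move_bucket_index + move_free_distance];
               // Record the item's new slot in its home bucket's bitmap...
               move_bucket._hop_info |= (1L << free_dist);
               // ...copy the item into the bucket that used to be free...
               free_bucket._data = new_free_bucket._data;
               free_bucket._key = new_free_bucket._key;
               // ...and clear its old slot, which becomes the new free bucket.
               new_free_bucket._key = BUSY;
               move_bucket._hop_info &= ~(1L << move_free_distance);
               result[0] = free_distance - (free_dist - move_free_distance);
               result[1] = val;
               result[2] = move_bucket_index + move_free_distance;
               return result;
          }

If the whole loop finishes without finding a movable item, the function reports failure by returning val = 0, and Add gives up.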


 

Our “Fine Grained Improved” implementation

When creating our "Fine Grained Improved" implementation, we used the first benchmark below, designed to find the optimal number of locks for our locking system.

In the "regular" fine-grained code we had a lock in each of the table objects (i.e., buckets); all in all, the number of locks was identical to the maximal size of the table.

This way, the chance of collisions was very low, but the number of locks being locked and unlocked was very high, which took a toll on performance.

In the "improved" code, we decided to reduce the number of locks. We created an array of locks, smaller than the table, so that every x table objects shared the same lock, and set out to find the optimal number of locks.

In one of our earlier tests, we discovered that if we assign the same lock to adjacent objects in the table, we create a situation where, if the user inserts or accesses adjacent keys (normal behavior when using a data structure), they all map to the same lock and must wait for one another. So we decided to use a "slicing" method, so that any two buckets mapped to the same lock are as far away from each other as possible.

Example: with a table of 10 buckets and 2 locks, naive chunking would protect buckets 0-4 with lock 0 and buckets 5-9 with lock 1, so neighboring buckets usually share a lock. With slicing, even-indexed buckets take lock 0 and odd-indexed buckets take lock 1, so any two buckets that share a lock are as far apart as possible.
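
One plausible way to implement such a mapping (our sketch; the project's actual get_index may differ) is simple striping of bucket indices across the lock array:

// With striping, adjacent buckets always take different locks, and two
// buckets that share a lock are exactly NUM_LOCKS positions apart.
int get_index(int bucket_index) {
     return bucket_index % NUM_LOCKS;
}

With 5,000 locks, for example, buckets 42 and 43 map to locks 42 and 43, while the next bucket to share lock 42 is bucket 5,042.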

In order to find the optimal number of locks, we used Test 1, listed in the Benchmarks section.

 


 

Benchmarks

Test 1:

1. Create an array of 1,000,000 random numbers.

2. Create a new hopscotch table object.

3. Create an array of x threads (the number of threads is given as a parameter).

4. Assign each thread a unique segment of the random-numbers array.

5. Concurrently, each thread adds all the numbers in its segment to the hopscotch table, and then removes them.

 

This test was run 20 times, and the average runtime was calculated.

In this graph we can see the result of the test on the "fine grained improved" method in C++.

 

 

In this graph we can see the result of the test on the "fine grained improved" method in Java.

It is easy to see from the results that in both cases (Java and C++), the optimal number of locks, giving the best performance, is about 5,000 (between 1,000 and 10,000).

So, from this point on (unless stated otherwise), we use the fine-grained improved method with 5,000 locks.

 


 

Java vs C++

After discovering the optimal number of locks for our "fine grained improved" method, we ran a performance test comparing the two programming languages we used, Java and C++.

We used Test 2 (described below); the results are shown in the graph.

Overall comparison

For our first performance benchmark, we ran the exact same test on all of our implementations. The test is Test 1, which we used above for tuning our fine-grained improved method.

We ran all of the tests on the same server, which we used exclusively. The server is equipped with an Intel Core i7-4770 processor (four cores, eight hardware threads).

 

Four graphs follow, two for Java and two for C++.

The first graph details the performance of each locking mechanism in relation to the number of threads, and the second graph details the CPU usage of the server during run time (which, as stated above, we were the only ones using at the time).

Test 1 – Conclusions

In Test 1 we measured the time it took each implementation to add and remove 1,000,000 random numbers.

In the performance graph (both in C++ and Java) we can clearly see a few distinctive results:

1. When we used only 1 thread, all implementations gave about the same results. This is our "control group", since with one thread every implementation effectively runs serially.

2. TSX (or HLE) consistently did worse than all the other methods. The reason is that in this test, due to the implementation, the memory transactions failed repeatedly. The retry mechanism attempted each transaction many times before giving up and using the fallback method, fine-grained locking, so all in all this method performed badly.

3. As the number of threads went up:

a. The performance of Global Lock deteriorated: there was no concurrent work, only many collisions.

b. The performance of Fine Grained Improved and Fine Grained consistently got better.

 

In the CPU usage graph (both in C++ and Java) we can see a few things:

1. When using one thread, we got a constant CPU usage of 12%-13% in all implementations, since at any given time only one hardware thread (out of eight) was active.

2. As the number of threads went up:

a. The CPU usage of Global Lock remained poor: all hardware threads except one were idle.

b. The CPU usage of Fine Grained Improved and Fine Grained consistently got better.

c. The best CPU utilization was achieved with eight threads. This is of course the expected result, since the processor we used supports eight hardware threads.

d. With 16 threads, CPU usage went down: the server supports only eight hardware threads, so with 16 threads there were more collisions and more idle processor time.

 

In the TSX section of the CPU graph we can see a distinct difference between Java and C++. As detailed above, we used two different methods of implementing transactional memory in Java and in C++.

Due to the lockless nature of the transactional implementations (both in Java and in C++), we would have expected very high CPU usage (notice that in Java, TSX indeed had the highest CPU usage rates). With transactional memory, whether the transactions succeed or fail, the threads are constantly working and rarely waiting on locks.

This CPU utilization result surprised us, and we cannot provide an explanation for the relatively low CPU usage of TSX in C++.

In the cache misses graph (both in C++ and Java) we can see a few things:

1. Both the Global Lock and the Fine Grained Improved implementations performed about the same. The one implementation standing out with a higher number of cache misses is Fine Grained.

2. In Fine Grained (and, to a lesser degree, in Fine Grained Improved in Java) we can see that as the number of threads went up, we got fewer cache misses, whereas Global Lock remained constant.

Cache misses result from the program trying to access data that is not present in the cache, forcing the processor to fetch it from main memory. In terms of accessing the data in the table, all three implementations behaved the same, as they all accessed the same 1,000,000 random numbers, so the reason for the different results lies in the locking system:

Global Lock used only one lock, so it always remained in the cache.

Fine Grained Improved used 5,000 locks, so at any given time a large percentage of the locks were already in the cache.

Fine Grained used ~1,000,000 locks, so every action touched two effectively random memory locations, the data in the table and its lock, resulting in roughly double the number of cache misses.

As for the improvement we saw at high thread counts: when we used more threads, the program ran on more cores, and so more cache memory was available to it (each core contributing its own cache), giving a higher chance of finding data already in the cache.

 



 


 

Successful Deletion Rate

In this test, we measured the rate at which each implementation finds and deletes existing objects in the table.

Test 2:

1. Create an array of 1,000,000 random numbers.

2. Create a new hopscotch table object.

3. Add all the random numbers from the array to the table (so it is about 98% full).

4. Create an array of x threads (the number of threads is given as a parameter).

5. Assign each thread a unique segment of the random-numbers array.

6. Concurrently, each thread looks for the numbers in its segment and deletes them from the table.

7. Each thread keeps count of the number of objects it deleted.

8. After 1 millisecond, interrupt all the threads, retrieve the counters from all of them, and sum up the results.

 

 

 


 

 


 

Test 2 – Conclusions

In Test 2 we measured how many successful deletions each implementation could achieve in 1 millisecond.

In the performance graph we can see that, unlike in Test 1 (which included both adding and deleting objects), this time "fine grained improved" did slightly worse than "fine grained".

From this test we learn that the optimal number of locks we found in Test 1 is actually only optimal for the add operation; for the remove operation, a higher number of locks would have been better.

The reason for this odd effect is that adding involves two levels of locking: one in the Add function, and the other in the Find Closer Bucket function (which is responsible for moving objects around the table and clearing space for new objects).

Since the add operation does much more locking and unlocking, the accumulated cost of using many locks damages performance, which is why we got a relatively low optimal number of locks. The remove operation has only one level of locking, so the accumulated cost of collisions was the more significant of the two effects.


 

Unsuccessful Deletion Rate

In this test, we measured the rate at which each implementation looks for non-existent objects in the table and returns a negative result.

Test 3:

1. Create an array of 1,000,000 random numbers.

2. Create a new hopscotch table object.

3. Add all the random numbers from the array to the table (so it is about 98% full).

4. Create an array of x threads (the number of threads is given as a parameter).

5. Create a second array of 1,000,000 random numbers, all different from the numbers in the original array.

6. Assign each thread a unique segment of the second random-numbers array.

7. Concurrently, each thread looks for the numbers in its segment (none of which are in the table).

8. Each thread keeps count of the number of objects it could not locate.

9. After 1 millisecond, interrupt all the threads, retrieve the counters from all of them, and sum up the results.


 

 

 

 

 


 

Test 3 – Conclusions

In Test 3 we measured how many unsuccessful deletions each implementation could perform in 1 millisecond.

An unsuccessful deletion means that the given key, which we wanted to delete, was not present in the table. So, for every remove action, the function had to look through all the buckets in the neighborhood before returning.

In the performance graph we can see that the most successful implementation (both in C++ and Java) was TSX. The reason for this result is found in the way the Remove function works.

The Remove function consists of two parts: 1. finding the key; 2. deleting it.

In Global Lock, Fine Grained and Fine Grained Improved, we lock the appropriate section of the data structure at the beginning of the function (prior to the search).

In TSX, since searching for a key does not change the data, we use no locks at all for the search part of Remove. So TSX could return a negative result ("key not found") without taking any locks, thus completing the highest number of searches in the given time.
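
Schematically (our sketch; the method names are illustrative):

// In the TSX versions, the read-only search takes no lock; only an
// actual deletion runs transactionally (with the locking fallback).
boolean removeTSX(int key) {
     if (!contains(key)) {    // lock-free search: nothing is modified
          return false;       // negative result, no lock ever taken
     }
     // Hypothetical stand-in for the transactional deletion path,
     // which re-checks the key before removing it.
     return deleteTransactionally(key);
}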