Implementing a spinlock in C++

Motivation

When a thread tries to lock a mutex that is already locked, it is put to sleep and another thread can run in its place; it wakes only when the thread holding the lock unlocks the mutex. A spinlock behaves differently: when a thread fails to acquire it, it keeps retrying in a tight loop until it succeeds, without voluntarily giving up its place (though the operating system will still preempt it once its CPU time slice is exceeded). Putting a thread to sleep and waking it again means a context switch, which carries a performance penalty. If you wish to avoid that penalty and your lock times are short, a spinlock may be the answer for you. Be aware, though: most of the time a spinlock is not your answer:

Because you should never ever think that you’re clever enough to write your own locking routines.. Because the likelihood is that you aren’t (and by that “you” I very much include myself – we’ve tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

— Linus Torvalds, https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723

Implementing the spinlock

We need an atomic way to determine whether the lock is currently free or in use. Atomic means the operation is guaranteed to be isolated from other operations that may be happening at the same time, so when we test the variable's state, or even change it, we know a context switch cannot land in the middle. A nice definition I came across: an operation acting on shared memory is atomic if it completes in a single step relative to other threads.

In our case, one might think we could just use a plain boolean and flip it when we acquire the lock, but that does not guarantee atomicity: setting a boolean actually involves two steps, a load and a store. Luckily we have the atomic library, whose types guarantee that setting and checking a variable is atomic.

#include <atomic>


std::atomic_flag atomic_flag = ATOMIC_FLAG_INIT;

ATOMIC_FLAG_INIT – This is the only way to initialize std::atomic_flag to a definite value: the value held after any other initialization is unspecified. (Since C++20 the default constructor initializes the flag to clear and ATOMIC_FLAG_INIT is deprecated.)

std::atomic_flag is an atomic boolean type. Unlike all specializations of std::atomic, it is guaranteed to be lock-free. Unlike std::atomic<bool>, std::atomic_flag does not provide load or store operations (C++20 adds a read-only test() operation, which we will use later).

Now that we understand the differences, we will use the flag. When we try to acquire the lock and fail, we simply try again, over and over, until we succeed, meaning we need a loop whose exit condition is success:

while (atomic_flag.test_and_set(std::memory_order_acquire)) {}

The compiler and CPU can reorder memory accesses; that is, they can happen in a different order than what is specified in the code. That is fine most of the time. The problem arises when different threads try to communicate and may observe memory accesses in an order that breaks the invariants of the code.

Usually you would use locks for synchronization; the problem is that they are slow. Atomic operations are much faster because the synchronization happens at the CPU level (i.e. the CPU ensures that no other thread, even on another core, modifies the variable mid-operation). memory_order_acquire means no loads can be reordered with respect to the atomic load: if they come after the atomic load in the source code, they will happen after the atomic load at runtime too.

#include <atomic>

class Spinlock
{
private:
    std::atomic_flag atomic_flag = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        while (atomic_flag.test_and_set(std::memory_order_acquire))
        {
        }
    }
    void unlock()
    {
        atomic_flag.clear(std::memory_order_release);
    }
};

Testing the spinlock:

#include <ctime>    // time, difftime
#include <iomanip>
#include <iostream>
#include <thread>
#include <vector>
#include <unistd.h> // usleep

const int count_to = 1000000;

volatile int value = 0;

Spinlock l;

int task(int count_to)
{
    std::cout << "Started  " << count_to << std::endl;
    for (int i = 0; i < count_to; ++i)
    {
        l.lock();
        value++;
        usleep(1);
        l.unlock();
    }

    return 0;
}

int main(int, char **)
{
    time_t begin, end; // time_t is a datatype to store time values.
    time(&begin);      // note time before execution
    const int num_workers = 5;
    std::vector<std::thread> threads;
    std::cout << "SpinLock inc MyTest start" << std::endl;
    value = 0;
    for (int i = 0; i < num_workers; ++i)
        threads.emplace_back(task, count_to);

    for (auto it = threads.begin(); it != threads.end(); it++)
        it->join();
    time(&end); // note time after execution
    double difference = difftime(end, begin);
    std::cout << "job done" << std::endl;
    std::cout << "count_: " << value << std::endl;
    std::cout << "time taken: " << std::fixed << std::setprecision(2)
              << difference << " seconds" << std::endl;
    if (value == count_to * num_workers)
    {
        std::cout << "SpinLock passed" << std::endl;
        return 0; // success
    }
    else
    {
        std::cout << "SpinLock failed" << std::endl;
        return 1; // non-zero exit code signals failure
    }
}

Notice the test: we init a vector of threads, spin them off, and join to wait until they finish. The work each thread does is incrementing a globally available counter, shared memory if you will. When all the threads are complete, we check whether the counter holds the correct value.

Now run the code twice: once with the lock and unlock around the increment statement, and once with them commented out. You will notice that without the lock we get unpredictable results. Notice also the penalty we pay for using a lock and syncing the threads: about 23 seconds on average with the lock versus about 7 seconds without it, almost 4 times slower! So in many cases where performance is critical, try to explore alternative solutions. In this case we can let every thread count into a local variable, and at the end of its loop lock the shared memory once and add its local result to it, moving from n lock acquisitions to only 1:

int task(int count_to)
{
    int local_var = 0;
    std::cout << "Started  " << count_to << std::endl;
    for (int i = 0; i < count_to; ++i)
    {
       
        local_var++;
        usleep(1);
        
    }
    l.lock();
    value += local_var;
    l.unlock();
    return 0;
}

This time the code finished in 6-7 seconds on average, with the correct result. Much better!

Are we there yet? Understanding Cache coherence

It turns out there is another topic we need to understand in order to improve our solution: cache coherence, which stands for the uniformity of shared data that ends up stored in multiple caches. This is what happens when we run on multiple CPUs, each processor with its own cache: we end up with one copy in main memory and one in the local cache of each processor that requested that data.

Since there may be more than one thread spinning and trying to acquire the lock, a lot of cache-coherency traffic goes around to keep all the caches valid. We can leverage the fact that reads are cheaper, since they do not change the state of the cache line: wait for the lock holder to release the lock first, and only then try to write to it. This is the test-and-test-and-set tactic: first test, which is a read only, and only then attempt the test-and-set write, adding coherency traffic only when we have a hint that the lock is free. We can achieve this with a nested loop.

Further improvement

class Spinlock
{
private:
    std::atomic_flag atomic_flag = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        for (;;)
        {
            if (!atomic_flag.test_and_set(std::memory_order_acquire))
            {
                break;
            }
            while (atomic_flag.test(std::memory_order_relaxed)) // test() requires C++20
                ;
        }
    }
    void unlock()
    {
        atomic_flag.clear(std::memory_order_release);
    }
};

Running the same test on the benchmark improved the runtime by almost 20%! A great improvement, gained only by digging down and peeling back another layer under the hood.

Check this post by Linus Torvalds regarding spinlocks in userspace.

References:

  • https://stackoverflow.com/questions/15054086/what-does-atomic-mean-in-programming/15054186
  • https://stackoverflow.com/questions/12346487/what-do-each-memory-order-mean
  • https://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/
  • https://en.cppreference.com/w/cpp/atomic/atomic_flag
  • https://rigtorp.se/spinlock/
  • https://en.wikipedia.org/wiki/Cache_coherence
  • https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723


Also published on Medium.

Yoni Amishav

