This is a very simplified latch which allows several threads to block and busy-wait until they are released to begin work in parallel. I’m using this to do some multi-threaded stress testing, where I want to maximize the overlapping work in order to check for surprising non-deterministic behavior. There’s no count associated with the gate, nor is it reusable. All we need is the one bit of information: whether the gate is open or closed.
The simplest atomic boolean is std::atomic_flag. It’s guaranteed by the standard to be lock-free. Unfortunately, to provide that guarantee, it provides no way to do an atomic read or write of the flag, just clear, and test-and-set, which sets the flag and returns the prior value. This isn’t rich enough for multiple threads to wait on, although it is enough to implement a spinlock.
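As an aside, a minimal spinlock built on std::atomic_flag might look like the following sketch. It is illustrative only and not part of the SpinGate source:

#include <atomic>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

  public:
    void lock()
    {
        // Spin until we observe the flag clear and set it ourselves.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait
        }
    }

    void unlock()
    {
        // Release ordering publishes the writes made while holding the lock.
        flag_.clear(std::memory_order_release);
    }
};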
std::atomic&lt;bool&gt; isn’t guaranteed to be lock-free, but in practice, any architecture I’m interested in has at least 32-bit aligned atomics that are lock-free. Older hardware, such as ARMv5, SparcV8, and the 80386, is missing cmpxchg, so loads are generally implemented with a lock in order to maintain the guarantees in case of a simultaneous load and exchange. See, for example, the LLVM Atomic Instructions and Concurrency Guide. Modern ARM, x86, Sparc, and Power chips are fine.
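If you want to check your own platform, std::atomic provides is_lock_free(); a quick probe (not part of the SpinGate source) looks like this:

#include <atomic>
#include <iostream>

int main()
{
    std::atomic<bool> flag;
    std::cout << "std::atomic<bool> is "
              << (flag.is_lock_free() ? "" : "not ")
              << "lock-free on this platform\n";
}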
When the spin gate is constructed, we’ll mark the gate as closed. Threads then wait on the flag, spinning until the gate is opened. For this we use Release-Acquire ordering between open() and wait(). This ensures that any stores done before the gate is opened will be visible to the threads that were waiting.
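Here is a minimal sketch consistent with that description; the real source is in the GitHub repository linked at the end, and the member and function names follow the discussion below:

#include <atomic>

class SpinGate {
    std::atomic_bool flag_;

  public:
    SpinGate()
    {
        // Gate starts closed: true means "closed".
        flag_.store(true, std::memory_order_release);
    }

    void wait()
    {
        // Spin until open() stores false; the acquire load pairs with
        // the release store in open().
        while (flag_.load(std::memory_order_acquire)) {
            // busy-wait
        }
    }

    void open()
    {
        // Publish all writes made before opening the gate.
        flag_.store(false, std::memory_order_release);
    }
};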
Using a SpinGate is fairly straightforward. Create an instance of SpinGate and wait() on it in each of the worker threads. Once all of the threads are created, open the gate to let them run. In this example, I sleep for one second in order to check that none of the worker threads get past the gate before it is opened.
The synchronization is on the SpinGate’s std::atomic_bool, flag_. The flag_ is set to true in the constructor, with release memory ordering. The function wait() spins, loading the flag_ with acquire memory ordering, until open() is called, which sets the flag_ to false with release semantics. The other threads that were spinning may then proceed. The release-acquire ordering ensures that writes which happened-before the call to open() by the thread setting up the work will be read correctly by the threads that were spin waiting.
I’ve added a yield() call after the open, hinting that the thread opening the gate may be rescheduled, allowing the other threads to run. On the system I’m testing on, it doesn’t seem to make a difference, but it does seem like the right thing to do, and it does not seem to hurt.
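Put together, the driver looks roughly like this. It is a sketch assuming the SpinGate class above; the real main() is in the repository:

#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

int main()
{
    SpinGate gate;  // starts closed
    std::vector<std::thread> workers;

    for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
        workers.emplace_back([&gate]{
            gate.wait();          // block here until the gate opens
            // ... the actual overlapping work goes here ...
        });
    }

    // Give every worker time to reach the gate; none should get past it.
    std::this_thread::sleep_for(std::chrono::seconds(1));

    gate.open();                  // release all the workers at once
    std::this_thread::yield();    // hint that the opener can be rescheduled

    for (auto& thr : workers) {
        thr.join();
    }
}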
I’d originally had the body of the threads just print to std::cout that they were running; the absence of any output before the gate opened, plus the overlapping output afterwards, was evidence of the gate working. That looked like:
for (std::size_t n = 0; n < std::thread::hardware_concurrency(); ++n) {
    workers.emplace_back([&gate, n]{
        gate.wait();
        std::cout << "Output from gated thread " << n << std::endl;
    });
}
The gate is captured in the thread lambda by reference, the thread number by value, and when run, overlapping gibberish is printed to the console as soon as open() is called.
But then I became curious about how long the spin actually lasted. The guarantees for atomics with release-acquire semantics, or even sequentially consistent ordering, only say that once a change is visible, the changes that happened before it are also visible. How quickly the change becomes visible, and what the costs are of making the happened-before writes available, is really a function of the underlying hardware. I’d already observed better overlapping execution using the gate, as opposed to just letting the threads run, so for my initial purpose of making contention more likely, I was satisfied. Visibility, on my lightly loaded system, seems to be in the range of a few hundred to a couple of thousand nanoseconds, which is fairly good.
Checking how long it took to start let me do two things. First, play with the new-ish chrono library. Second, check that the release-acquire sync is working the way I expect. The lambdas that the threads run capture the start time value by reference. The start time is set just before the gate is opened, and well after the threads have started running. The spin gate’s synchronization ensures that if the state change caused by open() is visible, the setting of the start time is also visible.
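A sketch of that measurement, again assuming the SpinGate above; the variable names latency and start are mine, not necessarily those in the repository:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    using Clock = std::chrono::high_resolution_clock;

    SpinGate          gate;
    Clock::time_point start;  // written just before open(), read by the workers

    std::size_t const            count = std::thread::hardware_concurrency();
    std::vector<std::thread>     workers;
    std::vector<Clock::duration> latency(count);

    for (std::size_t n = 0; n < count; ++n) {
        workers.emplace_back([&gate, &start, &latency, n]{
            gate.wait();
            // The release-acquire pair guarantees the write to 'start'
            // is visible here, so this read is not a data race.
            latency[n] = Clock::now() - start;
        });
    }

    std::this_thread::sleep_for(std::chrono::seconds(1));
    start = Clock::now();  // set just before the gate is opened
    gate.open();

    for (auto& thr : workers) {
        thr.join();
    }
    for (std::size_t n = 0; n < count; ++n) {
        std::cout << "thread " << n << " started "
                  << std::chrono::duration_cast<std::chrono::nanoseconds>(latency[n]).count()
                  << "ns after open()\n";
    }
}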
Here is one set of results from running spingate:
The normal way of implementing a gate like this is with a condition variable, an associated mutex, and a plain bool. The mutex guarantees synchronization between wait and open, rather than the atomic variable in the SpinGate. The unlock/lock pair on the mutex has release-acquire semantics. The actual lock and unlock are done by the unique_lock guard.
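A sketch of what CVGate might look like, following that description; the member names are mine, and the actual source is in the repository:

#include <condition_variable>
#include <mutex>

class CVGate {
    std::mutex              lock_;
    std::condition_variable cv_;
    bool                    open_ = false;  // false means "closed"

  public:
    void wait()
    {
        std::unique_lock<std::mutex> lk(lock_);
        // Re-checks the flag on every wakeup, guarding against spurious wakes.
        cv_.wait(lk, [this]{ return open_; });
    }

    void open()
    {
        {
            std::unique_lock<std::mutex> lk(lock_);
            open_ = true;
        }
        // Wake everyone blocked in wait().
        cv_.notify_all();
    }
};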
This has the same interface as SpinGate, and is used exactly the same way.
Running it shows:
The overhead of the mutex and condition variable is significant. On the other hand, the system load while it’s waiting is much lower. SpinGate will use all available CPU, while CVGate yields, so useful work can be done by the rest of the system.
However, for the use I was originally looking at, releasing threads for maximal overlap, spinning is clearly better. There is much less overlap as the cv-blocked threads are woken up.
This CMake file pulls together the various directories I have scattered things into, as well as setting up the options for building with a locally installed clang and its libc++.
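Something along these lines; this is a reconstruction from the target and source file names in the build output below, not the exact file from the repository:

cmake_minimum_required(VERSION 3.1)
project(spingate CXX)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Uncomment to build with a locally installed clang and libc++, e.g.:
# set(CMAKE_CXX_COMPILER /usr/local/bin/clang++)
# add_compile_options(-stdlib=libc++)

find_package(Threads REQUIRED)

add_executable(spingate main.cpp spingate.cpp)
target_link_libraries(spingate Threads::Threads)

add_executable(cvgate cvmain.cpp cvgate.cpp)
target_link_libraries(cvgate Threads::Threads)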
And here we build a release version of the test executable:
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sdowney/src/spingate/build
Scanning dependencies of target cvgate
[ 16%] Building CXX object CMakeFiles/cvgate.dir/cvmain.cpp.o
[ 33%] Building CXX object CMakeFiles/cvgate.dir/cvgate.cpp.o
[ 50%] Linking CXX executable cvgate
[ 50%] Built target cvgate
Scanning dependencies of target spingate
[ 66%] Building CXX object CMakeFiles/spingate.dir/main.cpp.o
[ 83%] Building CXX object CMakeFiles/spingate.dir/spingate.cpp.o
[100%] Linking CXX executable spingate
[100%] Built target spingate
Exported from an org-mode doc. All of the source is available on GitHub at SpinGate.
(org-babel-tangle)