Has anyone here used: atomic_flag_test_and_set_explicit ?

xox

Joined Sep 8, 2017
838
atomic_flag_test_and_set_explicit has been a standard thing since C11; has anyone here used it in their code for anything?
They are really only useful for building synchronization primitives, but those have already been available for a long time now; IIRC, pthreads was standardized back in the 1990s. So unless you are working on some truly obscure, low-level problem (e.g. designing an operating system), you probably won't ever have to bother with such fiddly atomic operations.

For example, by default I always synchronize all of my threads with a SINGLE mutex. That isn't necessarily the most efficient way to do things, of course, but pretty much every time I have found the performance to be completely acceptable. Once or twice I think I had to use a handful of them. But again, not once have I needed an API to do atomic swaps or the like; there just aren't that many actual applications for it.
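The single-mutex approach described above can be sketched as follows. This is a minimal illustration, not anyone's actual code; the names (`g_lock`, `counter_add`, `counter_get`) and the shared counter are invented for the example:

```c
#include <pthread.h>

/* Hypothetical shared state, guarded by one global mutex -- a sketch of
 * the "single mutex for everything" approach described above. */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_counter = 0;

void counter_add(int delta)
{
    pthread_mutex_lock(&g_lock);   /* every thread funnels through here */
    g_counter += delta;
    pthread_mutex_unlock(&g_lock);
}

int counter_get(void)
{
    pthread_mutex_lock(&g_lock);
    int v = g_counter;
    pthread_mutex_unlock(&g_lock);
    return v;
}
```

One lock means no lock-ordering bugs and no deadlock between locks, at the cost of contention; that trade-off is exactly why the performance is usually "completely acceptable" until it isn't.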
 

Thread Starter

ApacheKid

Joined Jan 12, 2015
1,533
Well, I was simply wondering how I might lock a list, say, in code where interrupt handlers might access that list.

This is in the context of STM32 boards, where I am using HAL primitives and working in C.

Windows uses a concept of a DPC (deferred procedure call), where a high-priority interrupt can queue less critical work to run later, at a lower IRQL, after the ISR exits, so the ISR does only the minimum it needs to at the IRQ level. I was wondering how something like that might be written on an STM32 system.

Of course, adding entries to a callback list requires locking the list, but an ISR would never be able to acquire the lock if a lower-priority handler has already locked it and then been interrupted by a higher-priority IRQ; we'd get a deadlock.

This is an interesting problem, and something that would be extremely useful to have without needing a full-blown OS.
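For a sense of what a DPC-like "run later" mechanism could look like without an OS, here is a hedged sketch of a single-producer/single-consumer ring buffer of callbacks: one ISR enqueues, the main loop drains. All names (`dpc_*`, `DPC_SLOTS`) are invented for illustration, and this particular shape is only safe when exactly one context enqueues and one dequeues:

```c
#include <stddef.h>

typedef void (*dpc_fn)(void *arg);

#define DPC_SLOTS 8u  /* must be a power of two for the index wrap to work */

struct dpc_queue {
    dpc_fn fn[DPC_SLOTS];
    void  *arg[DPC_SLOTS];
    volatile unsigned head;  /* advanced only by the consumer (main loop) */
    volatile unsigned tail;  /* advanced only by the producer (ISR)       */
};

/* Called from the ISR: queue work, return 0 if the queue was full. */
int dpc_queue_push(struct dpc_queue *q, dpc_fn fn, void *arg)
{
    unsigned t = q->tail;
    if (t - q->head == DPC_SLOTS)
        return 0;                     /* full: caller decides what to do */
    q->fn[t % DPC_SLOTS]  = fn;
    q->arg[t % DPC_SLOTS] = arg;
    q->tail = t + 1;                  /* publish only after slot is filled */
    return 1;
}

/* Called from the main loop: run one deferred item, return 0 if empty. */
int dpc_queue_run_one(struct dpc_queue *q)
{
    unsigned h = q->head;
    if (h == q->tail)
        return 0;
    q->fn[h % DPC_SLOTS](q->arg[h % DPC_SLOTS]);
    q->head = h + 1;
    return 1;
}

/* Tiny demo callback used in the usage example. */
static void dpc_demo_add(void *arg) { *(int *)arg += 1; }
```

On a single-core Cortex-M, the one-writer-per-index discipline is what makes this workable without a lock; the lock-free list discussed below is the more general answer when multiple interrupt priorities can enqueue.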
 

michael8

Joined Jan 11, 2015
410
I'd want to use atomic_compare_exchange instead. With a single-word (pointer) version of this you can add an
element to the top of a list (push down) and also remove the whole list (you can't safely remove just one element).

https://en.cppreference.com/w/c/atomic/atomic_compare_exchange
https://en.wikipedia.org/wiki/Compare-and-swap

So the IRQ routine creates an element containing a link pointer (to be set), the address of the function to run later,
and a parameter pointer for the function. It uses atomic_compare_exchange to add this element to the top of the
"run later" list, so the new top of the list points to the new element and the new element points to the old top of
the list.

Later, some code removes the whole list (compare against the current top and exchange in 0; this is safe, whereas trying
to remove just the top element isn't). If run order is important, it can reverse the order of the elements,
then run them until it runs out of elements to run (and then go get another whole list).

There's no way to remove an element if you change your mind. Also, don't try adding an element
which is already queued.

I've used this many times on IBM 370 type systems...
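In C11 terms, the push/detach-all scheme above might look like this. The names (`struct later_node`, `later_push`, etc.) are invented for illustration; note that detaching the whole list can use a plain atomic exchange rather than a compare-exchange loop, since any old value is acceptable:

```c
#include <stdatomic.h>
#include <stddef.h>

struct later_node {
    struct later_node *next;        /* link pointer, set during push */
    void (*fn)(void *);             /* function to run later         */
    void *arg;                      /* parameter pointer for fn      */
};

static _Atomic(struct later_node *) g_later_top = NULL;

/* IRQ side: push one element onto the top of the "run later" list. */
void later_push(struct later_node *n)
{
    struct later_node *old =
        atomic_load_explicit(&g_later_top, memory_order_relaxed);
    do {
        n->next = old;              /* new element points at old top */
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_later_top, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Consumer side: atomically take the WHOLE list (exchange in NULL).
 * Popping just one node here would be unsafe (the ABA problem). */
struct later_node *later_detach_all(void)
{
    return atomic_exchange_explicit(&g_later_top, NULL,
                                    memory_order_acquire);
}

/* The detached list comes back newest-first; reverse it if order matters. */
struct later_node *later_reverse(struct later_node *head)
{
    struct later_node *rev = NULL;
    while (head) {
        struct later_node *next = head->next;
        head->next = rev;
        rev = head;
        head = next;
    }
    return rev;
}
```

The "can't pop just one element" caveat is the classic ABA problem: between reading the top and the CAS, the top node could be freed and re-pushed, so a single-node pop can corrupt the list. Detach-all avoids it entirely.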
 

Thread Starter

ApacheKid

Joined Jan 12, 2015
1,533
OK, yes, that is definitely better: no need to disable higher-priority interrupts during the insertion of the element.

Thx
 

michael8

Joined Jan 11, 2015
410
The tricky part is where do you get the storage for the element from?

In at least some cases, you only need one element (per later routine) and it can be static.

The atomic test-and-set can be used to know if the static element needs to be queued.

Assuming "set" means that the element is already queued (or about to be queued, or running),
then the non-later routine can do the test-and-set: if the flag was already set, it's done. If it was not set,
it can queue the element.

The "later" routine should clear the flag at some point during its run. There are several
choices for when to clear the flag. Be careful; it's best if the "later" routine
is idempotent:

see the computer science section in: https://en.wikipedia.org/wiki/Idempotence
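The flag-guard idea above maps directly onto C11's `atomic_flag`, which is where `atomic_flag_test_and_set_explicit` from the thread title actually earns its keep. A minimal sketch, with all names (`uart_work_*`) invented for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One static flag per "later" routine: set = its element is already
 * queued (or about to be queued, or running). */
static atomic_flag g_uart_work_queued = ATOMIC_FLAG_INIT;

/* Non-later side (e.g. the ISR): returns true exactly once until the
 * flag is cleared, i.e. true means "you won, go queue the element". */
bool uart_work_try_claim(void)
{
    /* test_and_set returns the PREVIOUS value: false = it was clear. */
    return !atomic_flag_test_and_set_explicit(&g_uart_work_queued,
                                              memory_order_acquire);
}

/* "Later" routine: clear the flag once the work has been picked up, so
 * the element can be queued again.  Exactly when to do this is one of
 * the design choices mentioned above. */
void uart_work_done(void)
{
    atomic_flag_clear_explicit(&g_uart_work_queued, memory_order_release);
}
```

Clearing the flag early (before the work runs) risks the routine being queued again mid-run, which is why idempotence matters; clearing it late risks missing a request that arrived during the run.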
 

Thread Starter

ApacheKid

Joined Jan 12, 2015
1,533
The tricky part is where do you get the storage for the element from?

In at least some cases, you only need one element (per later routine) and it can be static.
Yes, a very interesting question. I haven't really finalized that yet. I'm toying with the idea of allocating a "memory pool" from the heap: a do-once allocation at app start. The memory pool would then be used to "allocate" these callback nodes. The pool would effectively be N x node_size, and the pool would internally use a bitmap to record free/busy blocks.

With a pool like that, any "allocate" of N or fewer bytes can be satisfied. The original (true) heap allocate would be safe, because it is a do-once operation allocating a pool big enough for a reasonable maximum list length. I've used this memory-pool technique to great effect in the past, when we had huge numbers of identically sized blocks needing to be allocated and freed. The memory overhead is reduced because each allocation from the pool carries no block header.

It seems that on STM devices the heap begins at the other end of memory, growing towards the stack, so allocating these pools at startup seems like a reasonable idea; it's not a huge difference from making it all static, though.
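The bitmap pool described above could be sketched like this. Sizes and names (`POOL_BLOCKS`, `pool_alloc`, `pool_free`) are invented for the example, and a real version used from interrupt context would also need the bitmap update made interrupt-safe (critical section or atomic bit ops):

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS     32           /* N blocks...                  */
#define POOL_BLOCK_SIZE 32           /* ...of node_size bytes each   */

static uint8_t  g_pool[POOL_BLOCKS * POOL_BLOCK_SIZE];
static uint32_t g_pool_busy;         /* bit i set = block i in use   */

/* First-fit allocate of one fixed-size block; no per-block header. */
void *pool_alloc(void)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (!(g_pool_busy & (UINT32_C(1) << i))) {
            g_pool_busy |= UINT32_C(1) << i;
            return &g_pool[i * POOL_BLOCK_SIZE];
        }
    }
    return NULL;                     /* pool exhausted */
}

void pool_free(void *p)
{
    size_t i = ((uint8_t *)p - g_pool) / POOL_BLOCK_SIZE;
    g_pool_busy &= ~(UINT32_C(1) << i);
}
```

Because the block size is fixed, freeing needs only the pointer arithmetic shown (no header), which is exactly the memory-overhead saving mentioned above.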

The atomic test and set can be used to know if the static element needs to be queued.

Assuming "set" means that the element is already queued (or about to be queued, or running),
then the non-later routine can do the test-and-set: if the flag was already set, it's done. If it was not set,
it can queue the element.

The "later" routine should clear the flag at some point during its run. There are several
choices for when to clear the flag. Be careful; it's best if the "later" routine
is idempotent:

see the computer science section in: https://en.wikipedia.org/wiki/Idempotence
I'm going to need to refresh my memory on compare/exchange; I've really forgotten a lot about that operation. Really, all I'm doing is simple, but I can't help abstracting things like this. The small app I am writing does use interrupts, and one can easily just write crude code that "just works", but I want to be cognizant of these interrupts that are happening asynchronously and doing work.

Right now the code uses a simple volatile "flag": the handler sets the flag to indicate there's some "work" to do, and the app checks the flag and, if set, does the "work".
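That flag handshake can also be written with a C11 atomic instead of plain `volatile`. This is a sketch, not the poster's actual code, and the names are invented; `volatile` stops the compiler caching the flag, but an atomic additionally makes the handshake well-defined under the C memory model, and exchanging the flag back to false in one step avoids a lost-update window between "check" and "clear":

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool g_work_pending;   /* statically false */

void isr_handler(void)               /* interrupt context */
{
    /* ...minimal urgent work... */
    atomic_store_explicit(&g_work_pending, true, memory_order_release);
}

bool app_poll_work(void)             /* main loop */
{
    /* Atomically read-and-clear, so a set that races with the clear
     * can't be lost the way a separate check-then-clear could. */
    if (atomic_exchange_explicit(&g_work_pending, false,
                                 memory_order_acquire)) {
        /* ...do the deferred "work" here... */
        return true;
    }
    return false;
}
```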

Having a neat way to minimize the time spent inside handlers seems like good design, if I intend to attempt bigger projects.
 

Thread Starter

ApacheKid

Joined Jan 12, 2015
1,533
Ouch, I just discovered that the standard C "malloc" and "free" are not thread safe. Not surprising with hindsight, but we do need better than that. Do developers in C and C++ just rely on "malloc", or do they use custom, high-performance heap libraries?

I imagine if you're using C++ you're metaphorically f****d, because every use of "new" is a call into some form of heap allocation; at least in C all allocates/frees are explicit.
 

xox

Joined Sep 8, 2017
838
Any library function, for that matter, which relies on some global (i.e. static) state is, strictly speaking, not thread safe. For C programs, you should simply replace malloc/free calls with synchronized analogs. C++, on the other hand, allows you to override new and delete, which makes things quite a bit more straightforward.
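A "synchronized analog" here could be as simple as wrapping malloc/free in one mutex. The names (`xmalloc`, `xfree`) are invented for the sketch; note that desktop libcs such as glibc already lock internally (see the man-page excerpt below), so wrappers like this mainly matter on small libcs, e.g. bare-metal newlib builds with no locking of their own:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t g_heap_lock = PTHREAD_MUTEX_INITIALIZER;

void *xmalloc(size_t n)
{
    pthread_mutex_lock(&g_heap_lock);
    void *p = malloc(n);
    pthread_mutex_unlock(&g_heap_lock);
    return p;
}

void xfree(void *p)
{
    pthread_mutex_lock(&g_heap_lock);
    free(p);
    pthread_mutex_unlock(&g_heap_lock);
}
```

On an STM32 this still doesn't make malloc usable from an ISR (an ISR can't block on a mutex), which is another argument for the static pool discussed earlier.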
 

michael8

Joined Jan 11, 2015
410
from "man malloc" on linux debian 11 bullseye libc6:

To avoid corruption in multithreaded applications, mutexes are used internally to protect the memory-management data structures employed by these functions. In a multithreaded application in which threads simultaneously allocate and free memory, there could be contention for these mutexes. To scalably handle memory allocation in multithreaded applications, glibc creates additional memory allocation arenas if mutex contention is detected. Each arena is a large region of memory that is internally allocated by the system (using brk(2) or mmap(2)), and managed with its own mutexes.
 