Hardware locking mechanism on multicore processors -- how does it work?

Thread Starter


Joined Mar 31, 2012
In order to mitigate race conditions in multi-threaded/multi-process applications using shared memory, a common technique is to use an atomic TSL (Test and Set Lock) instruction. This instruction simultaneously reads the value of the lock variable and sets it to capture the lock if it is free; it then examines the returned (original) value of the variable to see whether the lock was, indeed, free.

On x86 processors, this is done using the XCHG instruction. When this instruction is executed between a register and a memory location, the hardware employs a locking mechanism to make the operation atomic. However, on multicore systems, threads running on different cores can execute XCHG on the same memory location at the same time. Rather than taking some separate lock around the operation, the cache-coherency protocols are relied on to keep the exchange atomic. These protocols are pretty complicated, and I imagine they involve some degree of proprietary magic.

The question I'm trying to get an answer to is how a 'winner' is decided in a truly parallel situation.

Imagine two threads running the same code on two different cores, and imagine that they are running in perfect lock step. It doesn't matter how unlikely this is to happen, only that it be possible.

So, the first question is, is this possible, or is there some mechanism that renders it impossible? I would think that this would not be the case.

Assuming that it is possible, what happens when they both attempt to access the same memory location on the same clock cycle? We are assuming that the two cores are in identical states, which may require that the requested memory address is not currently in the L1 cache for either core. So let's assume a cache miss, at least up to whatever level of cache/memory both cores are fed from.

If this happens, how is the tie broken?

I can imagine a couple of possibilities. One would be for each core to have a relative priority on resources so that when ties happen, the priority of the core dictates the winner. I could also imagine a very simple random process by which the winner is chosen. I think, in practice, either approach would be viable since if two processes competing for the same resource are tied to that degree, any kind of arbitrary tie breaker is valid as long as it is fast -- pick a winner and move on.

Whatever mechanism is used also has to be generic in allowing for the possibility that every core on the processor somehow managed to want to access the same resource on the same clock cycle.

Does anyone know how this situation is handled? Preferably on a x86, but it would be more than useful to have an explanation of how it is handled on ANY multi-core processor.


Joined Jan 30, 2016
According to ChatGPT, when asked "how is hardware locking between cores on an intel x86 cpu implemented?":

Hardware locking between cores on an Intel x86 CPU is implemented using several mechanisms:

  1. Atomic Instructions: Atomic instructions like XCHG are used to implement hardware locking. When this instruction is executed between a register and a memory location, the hardware employs a locking mechanism to make the operation atomic.
  2. Cache Coherency Mechanism: The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area. This operation is called “cache locking” and is used to ensure that the operation is carried out atomically.
  3. MESI Protocol: The Modified, Exclusive, Shared, Invalid (MESI) protocol is employed by x86 processors to maintain cache coherence between cores.
These mechanisms ensure that the operations are carried out atomically, preventing race conditions and ensuring data integrity when multiple cores are accessing and modifying the same memory location.

Is that helpful?

Probably not, seeing as it's referenced back to your question from yesterday!... hmmm.

how is hardware locking between cores on an ESP32 implemented?

Hardware locking between cores on an ESP32 is implemented using several mechanisms:

  1. Task Communication: The ESP32 uses the FreeRTOS operating system, which allows tasks to communicate with each other through mechanisms like queues, notifications, semaphores, and mutexes.
  2. Spinlocks: Spinlocks are used for resources that need exclusive access for a very short time or have very low contention.
  3. Disabling Hardware Locks: In some cases, hardware locks can be disabled by defining CONFIG_DISABLE_HAL_LOCKS as 1.
  4. Interrupts: Each core has its own interrupt service routine (ISR), and interrupts can be attached to specific cores.
These mechanisms ensure that operations are carried out atomically, preventing race conditions and ensuring data integrity when multiple cores are accessing and modifying the same memory location.
Last edited:


Joined Aug 27, 2009
The Linux kernel smp locking is one example

One complication with atomic operations is encountered in multi-core systems, where an atomic operation is no longer atomic at the system level (but still atomic at the core level).

To understand why, we need to decompose the atomic operation in memory loads and stores. Then we can construct race condition scenarios where the load and store operations are interleaved across CPUs, like in the example below where incrementing a value from two processors will produce an unexpected result:


On most platforms, the Linux kernel currently uses a “queued spinlock” that has a pretty complicated implementation. It is spread across several files but the bulk of the logic is here. The rest of this post will ignore that code, however, and rather focus on Linux’s spinlock for 32-bit ARM platforms.
Last edited:

Thread Starter


Joined Mar 31, 2012
The heart of my question is how simultaneous attempts to access a resource are handled. I'm not interested in the mechanisms at play when one access attempt is assumed to be slightly ahead of another, unless there is a mechanism that ensures this will always be the case (in which case, it is THAT mechanism I am interested in). In an asynchronous world, the assumption of non-simultaneity is much more valid, and we can design circuits that are guaranteed to pick a winner, perhaps after an undesirable period of metastability, no matter how close to simultaneous two events are. But what about in a synchronous world, in which two identical threads, in identical states, on different cores, attempt to access the same resource on the same clock cycle?


Joined Aug 27, 2009
Priority is likely the simplest chip-level hardware solution.
A common problem with DMA, or any other independent hardware module, is simultaneous attempts to access the same resource: the CPU wants to read a memory location at the same time the DMA wants to write to it.

A pretty simple example is the priority-based system arbitration hardware in a microcontroller:
7.1 System Arbitration
The system arbiter resolves memory access between the system level selections (i.e., Main, Interrupt Service Routine) and peripheral selection (e.g., DMA and Scanner) based on user-assigned priorities. A block diagram of the system arbiter can be found below. Each of the system level and peripheral selections has its own priority selection registers. Memory access priority is resolved using the number written to the corresponding Priority registers, 0 being the highest priority selection and the maximum value being the lowest priority. All system level and peripheral level selections default to the lowest priority configuration. If the same value is in two or more Priority registers, priority is given to the higher-listed selection according to the following table.
7.2 Memory Access Scheme
The user can assign priorities to both system level and peripheral selections based on which the system arbiter grants memory access. Consider the following priority scenarios between ISR, MAIN and peripherals.
In more complex controllers, a Bus Matrix can be used.


3.1 Shared Bus Fabric

A Shared Bus Fabric is a high speed link needed to communicate data between processors, caches, IO, and memory within a CMP system in a coherent fashion. It is the on-chip equivalent of the system bus for snoop-based shared memory multiprocessors [10, 31, 23]. We model a MESI-like snoopy write invalidate protocol with write-back L2s for this study [4, 15]. Therefore, the SBF needs to support several coherence transactions (request, snoop, response, data transfer, invalidates, etc.) as well as arbitrate access to the corresponding buses.

Due to large transfer distances on the chip and high wire delays, all buses must be pipelined, and therefore unidirectional. Thus, these buses appear in pairs; typically, a request traverses from the requester to the end of one bus, where it is queued up to be re-routed (possibly after some computation) across a broadcast bus that every node will eventually see, regardless of their position on the bus and distance from the origin. In the following discussion a bidirectional bus is really a combination of two unidirectional pipelined buses.

We are assuming, for this discussion, all cores have private L1 and L2 caches, and that the shared bus fabric connects the L2 caches (along with other units on the chip and off-chip links) to satisfy memory requests and maintain coherence. Below we describe a typical transaction on the fabric.
Last edited: