Article:CAS and LL/SC Implementation Details by Processor family

Introduction

There exist a small set of distinct processor families which support CAS or LL/SC. Implementation details vary by platform and are important for software implementation.

CAS

CAS (Compare-and-Swap) implementations offer a small set of atomic instructions, such as ADD or EXCHANGE. These and only these instructions are atomic. The instructions have no failure mode - they are issued, and they take however long they take, and then they are done. If a CAS operation is issued, it will always succeed in the sense that the instruction was issued and performed, but of course the compare being performed by the CAS may not be equal and so the target of the CAS then is not updated by the swap value.

CAS means to compare-and-swap one word. DWCAS means to compare-and-swap two contigious words of memory. Only the Compare-and-Swap instruction ever offers double-word - all the others, ADD, EXCHANGE, etc, are always one word.

LL/SC

LL/CS (Load-Linked and Conditional-Store) implementations offer two instructions, LL and SC (different processor families have different names for them), where the LL loads from memory into a register but in some way monitors the memory location loaded; the caller can then continue as usual, using the whole instruction set, until the caller finally comes to issue the SC, which has to store back to the address loaded from, and where the SC operation will fail if the contents of that address have changed since the load.

The nature of the monitoring of the load address varies by family. Some families allow one monitor per processor; others allow have one monitor per n bytes of memory (so that any write inside the zone (known as an "Exclusive Reservation Granule") containing the load address, by any processor, causes the store to fail), etc. For almost all processor families, only one LL can be held at a time, so issuing a second LL invalidates the first, and a context switch on a processor holding an LL normally causes the SC to fail always, when it comes to be performed.

LL/SC is usually used to emulate CAS, which is to say, LL/SC is used to write a tiny bit of code which loads a target memory address, compares it to a comparand, and then writes back a swap value to the target if the comparand and target values are equal.

Because the SC can fail spuriously (context switch, another processor writes into the monitor zone, etc), we see here unlike real CAS, we can see the instruction itself fail, as well as the compare itself "fail" (not be equal). This is what GCC (version 4.7.3 or greater, the new instrinics, not the old ones) is referring to in its atomic instrinics what it talks about "weak" and "strong" atomic operations. A weak operation means issue the LL/SC once, and come back with the result, even if the failure is actually a spurious failure (the target and comparand are equal, but the SC operation reported failure to write); a strong operation means loop on the LL/SC operation until the SC operation comes back with a failure or success which matches whether or not the target and comparand are equal.

Processor Families Implementating CAS

IA64
SPARC
x64
x86

Processor Families Implementating LL/SC

68k
Alpha
ARM
MIPS
POWERPC
RISC V

Implementation Details

IA64

CAS
no DWCAS as such, but there is an instruction to compare one word and then to write back two, and this is all you need for ABA with counters; you compare the counter, and if it differs, you swap both the pointer and counter

SPARC

CAS
no DWCAS

x64

CAS
DWCAS
Note Intel by default (can be disabled in some BIOSes) fetches both the cache line you've asked for AND THE FOLLOWING CACHE-LINE, ALWAYS. As such atomic isolation on Intel is actually twice the cache line size.

x86

CAS
DWCAS introduced either with the 386 or 486.

68k

CAS
DCAS - not DWCAS - but DCAS; the ability to atomically compare-and-swap two *distinct* memory addresses.

Alpha

LL/SC
Unknown DW LL/SC

ARM

LL/SC
DW LL/SC (from architecture ARM6k onwards)
One LL/SC pair active per processor at any one time, uses an Exclusive Reservation Granule from 8 to 2048 bytes, but also has the concept of a per-processor (local) monitor and a global (system-wide) monitor. If the LL/SC is on a cache-line which is exclusive, only the local monitor is invoked, and the local monitor can be very weak indeed - it may even only keep track of whether or not a second LL has been issued since the original!
ERG size is 8 to 2048 bytes for both ARM32 and ARM64. The size has not changed for ARM64.

MIPS

LL/SC
no DW LL/SC
One active LL/SC pair per processor.

POWERPC

LL/SC
no DW LL/SC