Locking Primitives and Memory Ordering on the Ampere Altra Family of Processors

October 2022

Locking Basics on Ampere Altra and Ampere Altra Max

Let us take care of some basics first. Arm introduced the Large System Extensions (LSE) in the Armv8.1-A architecture, replacing the multi-instruction sequences previously used for lock operations with single atomic instructions. While code built for older architecture versions works fine functionally, expect a performance hit as the core count increases and locks are contended more often. Ampere Altra and Ampere Altra Max support LSE and are well equipped for scalable lock performance.

To illustrate the difference in instructions used, let us look at the gcc built-in __atomic_fetch_add(), in this case decrementing a lock value by 1:

__atomic_fetch_add(&lockptr->lockval, -1, __ATOMIC_ACQ_REL);

Compiling with the -march=armv8.2-a option, the compiler generates code with an atomic instruction:

998: f8f60280 ldaddal x22, x0, [x20]

On the other hand, setting -march=armv8-a (no LSE support) generates a different sequence:

9a4: c85ffe60 ldaxr x0, [x19]
9a8: d1000400 sub x0, x0, #0x1
9ac: c801fe60 stlxr w1, x0, [x19]
9b0: 35ffffa1 cbnz w1, 9a4 <main+0x104>

An exclusive monitor is needed to make the sequence atomic. The ldaxr gets a tag for the address, [x19] in this case. Then a subtraction is performed followed by a store back to the memory location. The store, however, is only successful if that tag at the time of store matches the tag from the load. The conditional branch, cbnz, after the stlxr checks if the store was successful, meaning that the tags from load and store match. If not, jump back to the beginning of the sequence, at address 0x9a4 in this case.

What is noteworthy here is that without LSE instructions, this sequence of instructions might be executed several times before it is deemed successful. With LSE, the ldaddal instruction is guaranteed to finish with a single instruction and no loop is needed.

Figure 1 shows the difference in performance for exclusive lock gets per second with and without LSE as the thread count is increased from 1 to 80.

Figure 1 - Lock gets per second: LSE vs. no LSE

Typically, compare-and-exchange hardware instructions are used to implement locks in software. It is important to note that these instructions must be atomic.

What does atomic mean in this context? The instructions first get ownership of the cache line containing the lock and load it into the local cache of the CPU. Then the current value is compared with the compare value submitted with the instruction. If equal, the new value submitted as part of the instruction will replace the current value. If not equal, the current value remains. Atomicity in this regard means that the entire sequence is executed by a thread without other threads getting access to the cache line, guaranteed by hardware.
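To make the compare-and-exchange sequence concrete, here is a minimal sketch of an exclusive trylock built on the gcc __atomic built-ins discussed later in this paper. The names (simple_lock_t, simple_trylock) are illustrative, not from any Ampere or library API:

```c
#include <assert.h>

/* 0 = free, 1 = held. volatile so direct reads always come from memory. */
typedef struct { volatile long lockval; } simple_lock_t;

static int simple_trylock(simple_lock_t *l)
{
    long expected = 0;
    /* Atomically: if lockval == 0, replace it with 1 and return true;
     * otherwise leave it unchanged and return false. The hardware
     * guarantees no other thread touches the cache line mid-operation. */
    return __atomic_compare_exchange_n(&l->lockval, &expected, 1,
                                       0 /* strong */, __ATOMIC_ACQUIRE,
                                       __ATOMIC_RELAXED);
}

static void simple_unlock(simple_lock_t *l)
{
    __atomic_store_n(&l->lockval, 0, __ATOMIC_RELEASE);
}
```

A second trylock on a held lock fails without modifying the lock value, which is exactly the "if not equal, the current value remains" behavior described above.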

Lock Types

There are different types of locks that can be implemented in software, such as mutexes, ticket locks, and spinlocks. As mentioned previously, lock types are defined in software; the hardware only provides the atomic instructions, such as cmpxchg or fetchadd, used to build them. The same lock types run on different hardware, and only the underlying instructions differ.
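As an example of one of these lock types, the following sketch shows a ticket lock built on __atomic_fetch_add (which compiles to ldaddal with LSE, as shown earlier). The structure and function names are illustrative:

```c
#include <assert.h>

/* A ticket lock: threads take a ticket and wait until it is served.
 * This gives FIFO fairness, unlike a plain test-and-set spinlock. */
typedef struct {
    volatile unsigned long next;   /* next ticket to hand out */
    volatile unsigned long owner;  /* ticket currently being served */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* One atomic instruction hands out the ticket, no retry loop. */
    unsigned long my = __atomic_fetch_add(&l->next, 1, __ATOMIC_RELAXED);
    while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != my)
        ;  /* spin until our ticket is served */
}

static void ticket_unlock(ticket_lock_t *l)
{
    __atomic_store_n(&l->owner, l->owner + 1, __ATOMIC_RELEASE);
}
```

The same C code runs unchanged on x86 or AArch64; only the atomic instructions the compiler emits differ.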

How to Implement Locks

This is the million-dollar question. Let us break it down into two options: 1) using available libraries and 2) using atomic instructions to implement a proprietary locking algorithm.

Option 1 has several advantages: the libraries already exist, no custom implementation is needed, they are well tested, and they will typically be maintained in future library versions. Examples are the POSIX pthread_mutex and pthread_rwlock families. Sounds great, so what is the catch? Lack of statistics can be an issue: no information is returned to the application about spins or how many times a thread was scheduled out. Also, a library implementation might not be a good fit for a particular application, since libraries are by nature more generic.

Option 2 is more involved. It requires implementing the locking functions and maintaining them. However, it can reap benefits since it is designed specifically for the application. Locking primitives and atomic instructions can be implemented through inline assembly or utilizing compiler support. Again, writing code with inline assembly requires the application to maintain that code.

For compilers, gcc provides atomic built-in functions that give the application low-level operations which compile into Arm atomic instructions. These built-in functions provide different atomic operations and memory ordering directives, and the resulting code is more portable than inline assembly. It is important, however, to compile the application with the correct settings for -mcpu or -march to generate Arm LSE instructions. Ampere Altra and Ampere Altra Max are based on the Arm Neoverse N1 core, which includes LSE.

Nonetheless, implementing locks with atomic instructions requires design decisions. Is spinning justified if a lock is held? How many times to spin? Should the thread yield the CPU if unsuccessful after a certain number of spins? Is a backoff required in the spin loop, or is it a straight spin? These are just a few considerations that need to be addressed.
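The spin-policy questions above can be sketched in code. This is one possible set of answers under illustrative assumptions (a fixed spin limit, yielding via sched_yield after a failed spin round), not a recommended tuning:

```c
#include <assert.h>
#include <sched.h>  /* sched_yield() */

#define SPIN_LIMIT 100  /* illustrative; real values need measurement */

typedef struct { volatile long lockval; } spin_lock_t;  /* 0 = free */

static void spin_lock_acquire(spin_lock_t *l)
{
    for (;;) {
        long expected = 0;
        if (__atomic_compare_exchange_n(&l->lockval, &expected, 1, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return;  /* got the lock */
        /* Spin on plain reads first: the cache line stays shared,
         * avoiding invalidations caused by repeated cmpxchg attempts. */
        int spins = 0;
        while (spins < SPIN_LIMIT && l->lockval != 0)
            spins++;
        if (spins == SPIN_LIMIT)
            sched_yield();  /* give up the CPU after a failed spin round */
    }
}
```

Whether to yield, back off exponentially, or simply keep spinning depends on hold times and thread counts, which is why these decisions belong to the application.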

Other Design Decisions
  • Data type and size of the lock. Typically, applications use int or long for locks. Built-in functions for atomic operations read the lock value from memory. If an application also reads the lock value directly, outside the built-ins, the lock should be declared with the volatile qualifier, for example volatile long. With volatile, the compiler always generates instructions to read the data from memory. Otherwise, the value might be cached in a register and miss updates to the lock location.
  • Lock granularity. Coarse-grained locks have the potential to become a performance bottleneck due to contention. On the other hand, if every resource is protected by its own lock, a large amount of memory is needed to store the locks. The right granularity is a compromise that avoids either extreme.
  • Lock alignment. Compilers align structures naturally. However, if an application manages its own memory, a lock location can end up unaligned with regard to the size of the lock; in the worst case, the lock spans two cache lines. On AArch64, atomic operations on unaligned locks result in a SIGBUS (a bus error raised by the hardware to the operating system, signaling that a memory address cannot be accessed by the CPU, in this case due to unaligned access). On a positive note, a SIGBUS forces the alignment to be fixed, instead of leaving a hidden performance issue that is rarely discovered.
  • False sharing. What does false sharing mean? Independent data items sharing a cache line, with adverse performance effects. Arrays of locks fall into that category: the locks protect different critical regions, yet an atomic operation on any one of them impacts all locks on that cache line. It is important to state that atomicity is NOT for the lock itself but for the entire cache line that contains the lock.
  • Test before cmpxchg. Reading the lock value in a spinloop before executing the cmpxchg instruction might benefit contended locks. Cmpxchg requires ownership of the cache line, while test will get the cache line in shared mode, thus avoiding invalidations. However, this potentially increases the number of spins executed.
  • Use fetchadd instead of cmpxchg on lock free if possible. Freeing a lock requires reverting the operation the thread executed to get the lock. Cmpxchg, especially for shared or read-write locks, requires a loop with potential retries when the lock value changes underneath. A fetchadd, however, needs no loop: there is no compare, so it always succeeds in a single execution.
  • Lock hold times. Usually expressed as the number of instructions in the critical region or the time spent in it. Time is the better metric, since a critical region with few instructions can still be slow if those instructions read from memory. Nested locks fall into the same category: failing to get the inner lock, then spinning or sleeping, extends the hold time of the outer lock. It is always good to reduce the hold time of a critical region, and even better if the data it touches is in the local cache instead of memory.
  • Preemption. Unfortunately, it is possible that a thread will be rescheduled while holding a lock. If the lock is held in exclusive mode, that means no other threads will be able to acquire the lock. That is something to keep in mind when looking at performance issues. Shorter hold times will lower the possibility of preemption.
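Two of the bullets above, false sharing and the fetchadd-on-free rule, can be sketched together: pad each lock of an array to its own cache line, and free a shared (reader-count) lock with a fetch-subtract instead of a cmpxchg loop. The 64-byte line size matches Ampere Altra; the names are illustrative:

```c
#include <assert.h>

#define CACHE_LINE 64  /* cache line size on Ampere Altra */

/* Each lock occupies a full cache line, so atomic operations on one
 * lock in an array never invalidate a neighbor's line. */
typedef struct {
    volatile long lockval;  /* >0: reader count, as in a shared lock */
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE))) padded_lock_t;

static void shared_get(padded_lock_t *l)
{
    __atomic_fetch_add(&l->lockval, 1, __ATOMIC_ACQUIRE);
}

static void shared_free(padded_lock_t *l)
{
    /* No compare, no loop: the decrement succeeds in one instruction. */
    __atomic_fetch_sub(&l->lockval, 1, __ATOMIC_RELEASE);
}
```

The padding trades memory for scalability, which is the granularity compromise mentioned above in miniature.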
Memory Ordering

As mentioned previously, the correct memory ordering directive is important for correctness. AArch64 follows a relaxed memory model. With LSE, AArch64 instructions enforce certain memory ordering. For example, the cmpxchg instruction family has versions for acquire (CASA), release (CASL), and acquire and release (CASAL). Hardware guarantees that these instructions follow the memory model for their specific directive; it is up to software to use the appropriate one. Typically, an acquire is used on lock get and a release on lock free. However, if the application reads data after the lock free (for example, to check whether there are any waiters for the lock), then the release-only semantics of the free can cause issues: the read of the waiter structure can be hoisted above the lock free, leaving stale data in the register after the free. In those cases, it is better to use acquire-and-release semantics. Again, this depends on the application's implementation. The gcc built-in functions expose these directives directly through their memory-order arguments.
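The waiter-check scenario above can be sketched as follows. The structure and field names are illustrative; the point is the __ATOMIC_ACQ_REL directive on the free, whose acquire half keeps the subsequent waiters read from being hoisted above the unlock:

```c
#include <assert.h>

typedef struct {
    volatile long lockval;  /* 1 = held exclusively, 0 = free */
    volatile long waiters;  /* number of threads waiting (illustrative) */
} wlock_t;

static long unlock_and_check_waiters(wlock_t *l)
{
    /* Release half publishes the critical section's stores; acquire half
     * orders the waiters load below the unlock, so no stale value is read. */
    __atomic_fetch_sub(&l->lockval, 1, __ATOMIC_ACQ_REL);
    return __atomic_load_n(&l->waiters, __ATOMIC_RELAXED);
}
```

With a plain __ATOMIC_RELEASE here, the compiler or CPU would be free to perform the waiters load before the decrement, which is exactly the hazard described above.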

Summary

The Ampere family of CPUs is pushing the envelope with its ever-increasing core count and has all the ingredients for scalable performance for multithreaded applications that use locks. With LSE, atomic instructions are provided in hardware for great lock performance. As we have seen in this white paper, it is up to the application developer to make the best use of these instructions, either through locking libraries or correct implementation of locking algorithms.
