
Re: Test atomic operations on SPARC 32-bit



On 10 Dec 2017, at 10:01, Romain Dolbeau <romain@dolbeau.org> wrote:
> 2017-12-10 2:19 GMT+01:00 Petr Vorel <petr.vorel@gmail.com>:
>> I tested it (in my GitHub fork of) the LTP project [1], but unfortunately some
>> tests which heavily exercise it fail.
> 
> With a large number of threads, the test-and-test-and-set option will
> significantly cut down the coherency traffic between caches. It's a much
> better solution than a pure test-and-set.
> 
>> The tests just get slightly fewer increments than they should (6373821 instead of 6400000).
>> 
>> The implementation in github is for SPARC32, but I adjusted the code to test it on SPARC64
>> and (of course) it behaves the same.
>> 
>> Any idea what can be wrong?
> 
> Are you testing on true v8 hardware or on v9?
> On v9 (and v8+) you might be running in RMO (Relaxed Memory Order), in which
> case you probably need some extra "membar" instructions to force the update
> of the variable [1].
> By default, the code probably only works for PSO and TSO modes (the only two
> that exist in v8).
> 
> If adding a "membar #MemIssue" before and after the update of the variable
> solves the problem, then memory ordering is the culprit. (This forces far too
> strong an ordering, but is useful as a test.) "membar" is v9/v8+ only.
> 
> If you're on true v8 HW - darn. PSO could still be the problem. Try with
> "stbar" surrounding the variable update. "stbar" is a bit weaker than some
> variants of "membar", but it is in v8.
> 
> You might want to take a look at how many ordering instructions the kernel
> uses, after all :-(
> 
> Low-level parallelism is hard :-)

Indeed, if the code is being compiled for V9, you will be exposed to RMO. §J.6
Spin Locks of the V9 architecture manual provides an example implementation
using LDSTUB, but the important points are:

1. Before returning from your acquire_lock function, you must perform a
   "membar #StoreLoad | #StoreStore" (#LoadLoad and #StoreLoad, and #LoadStore
   and #StoreStore are interchangeable in this context, since LDSTUB counts as
   both a load and a store; in fact the V9 manual suggests
   "membar #LoadLoad | #LoadStore", but I personally think treating LDSTUB as a
   store is clearer).  This ensures that no memory accesses inside the critical
   section are reordered with the LDSTUB.

2. Before clearing the lock in release_lock, you must perform a
   "membar #LoadStore | #StoreStore" to ensure no memory accesses inside the
   critical section are reordered with the release.
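
To make that concrete, here is a minimal sketch in C with GCC inline assembly
(my own illustration, not the manual's code; it assumes a V9 or V8+ target
where "membar" assembles, and a byte lock where 0 means free):

    #include <stdint.h>

    typedef volatile uint8_t sparc_lock_t;      /* 0 = free, 0xFF = held */

    static void acquire_lock(sparc_lock_t *lock)
    {
        uint8_t old;
        do {
            /* LDSTUB atomically loads the byte and stores 0xFF to it. */
            __asm__ __volatile__("ldstub [%1], %0"
                                 : "=&r" (old)
                                 : "r" (lock)
                                 : "memory");
        } while (old != 0);                      /* non-zero: already held */
        /* Point 1: keep critical-section accesses after the LDSTUB. */
        __asm__ __volatile__("membar #StoreLoad | #StoreStore" ::: "memory");
    }

    static void release_lock(sparc_lock_t *lock)
    {
        /* Point 2: keep critical-section accesses before the release. */
        __asm__ __volatile__("membar #LoadStore | #StoreStore" ::: "memory");
        *lock = 0;
    }

(The "memory" clobbers also double as compile-time barriers; more on that
below.)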

That should be all that's required for V9 with RMO. For PSO, every load can be
treated as being followed by a "membar #LoadLoad | #LoadStore" (see §D.5).
Therefore, no barriers are needed in acquire_lock (LDSTUB counts as a load).
However, in release_lock, you must perform a "membar #StoreStore" (the
#LoadStore is dropped); if you're compiling for V8, this is what the
(deprecated as of V9) STBAR instruction does. For TSO, no barriers are needed,
as stores are not reordered (and the only barrier needed for PSO was a
#StoreStore).
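
Applied to the sketch above, the weaker models would change only the barriers
(again my own illustration):

    /* PSO on V9/V8+: loads already behave as if followed by
     * "membar #LoadLoad | #LoadStore", so acquire_lock needs no barrier
     * and release_lock needs only #StoreStore. */
    static void release_lock_pso_v9(sparc_lock_t *lock)
    {
        __asm__ __volatile__("membar #StoreStore" ::: "memory");
        *lock = 0;
    }

    /* PSO on true V8: STBAR is the V8 spelling of the same store-store
     * barrier. */
    static void release_lock_pso_v8(sparc_lock_t *lock)
    {
        __asm__ __volatile__("stbar" ::: "memory");
        *lock = 0;
    }

    /* TSO: a plain store suffices. */
    static void release_lock_tso(sparc_lock_t *lock)
    {
        *lock = 0;
    }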

TL;DR:

RMO (V9): "membar #StoreLoad | #StoreStore" after lock,
          "membar #LoadStore | #StoreStore" before unlock
PSO (V9): "membar #StoreStore" before unlock
PSO (V8): "stbar" before unlock
TSO (V8/V9): nothing

GCC defaults to RMO for 64-bit (V9 or above). I don't know if it defaults to
PSO or TSO for V8, and equally I don't know which of RMO, PSO and TSO it
defaults to for 32-bit V9 aka V8+. Also, this is all post-compilation; you
still need the necessary compile-time barriers to ensure the compiler doesn't
perform any unwanted reorderings, but that's the same on any architecture.
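
On GCC, the usual compile-time barrier is an empty asm statement with a
"memory" clobber (the inline-asm snippets above already have this effect,
since they all include the clobber):

    #define barrier() __asm__ __volatile__("" ::: "memory")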

Regards,
James

