Draft TLS/NPTL ABI for m68k and ColdFire, version 0.2

To: linux-m68k@vger.kernel.org, debian-68k@lists.debian.org
Subject: Draft TLS/NPTL ABI for m68k and ColdFire, version 0.2
From: "Joseph S. Myers" <joseph@codesourcery.com>
Date: Fri, 30 Nov 2007 19:05:05 +0000 (UTC)
Message-id: <Pine.LNX.4.64.0711301901100.20112@digraph.polyomino.org.uk>

CodeSourcery has been investigating implementing TLS (Thread-Local
Storage) and NPTL (Native POSIX Thread Library) for ColdFire
processors.  The proposed TLS ABI for ColdFire and m68k, including the
required kernel interfaces, is below; any comments?

We do not at present have a timescale for the implementation to be
available.  Toolchain patches will probably be contributed to the
respective development mainlines in the usual order (first binutils,
then GCC, then glibc).

ColdFire and m68k TLS and NPTL ABI draft version 0.2
====================================================

For background reading on TLS, see Ulrich Drepper's document
<http://people.redhat.com/drepper/tls.pdf>.

Design choices
--------------

* There are no spare registers available to designate as the thread
  register.  Therefore, kernel magic is needed to obtain the thread
  pointer from userspace.  Kernel helpers are provided in a vDSO
  because they need associated unwind information; see details below.
  Compiler-generated code will call an ABI-defined function,
  __m68k_read_tp, which handles the details of calling the vDSO.

* Use TLS variant I (TLS_DTV_AT_TP in glibc terms), where the TLS data
  goes after the TCB.

* The thread pointer points 0x7000 bytes (the value of TLS_TCB_OFFSET
  in glibc) past the start of the TLS data areas, as on Power and
  MIPS.  This makes more of the data accessible with signed 16-bit
  offsets from the thread pointer than an unbiased pointer would.
  (0x7000 is used instead of 0x8000 so that the TCB can also be
  accessed with 16-bit offsets from the thread pointer.)

* The DTP for a module points 0x8000 bytes (the value of
  TLS_DTV_OFFSET in glibc) past the start of the TLS data for that
  module, as on Power and MIPS.

* There are no linker optimizations to convert one TLS model into
  another; as such, the compiler can rearrange and optimize the
  instruction sequences shown.  The relocations can be applied to
  extension words in many different instructions.

* The __tls_get_addr function is declared as:

  typedef struct {
    unsigned long int ti_module;
    unsigned long int ti_offset;
  } tls_index;

  extern void *__tls_get_addr (tls_index *ti);

* All the static relocations for offsets from GOT, DTP or TP are
  defined in 8-bit, 16-bit and 32-bit forms, similarly to existing
  m68k/ColdFire relocations.  Both the 16-bit and 32-bit forms are
  likely to be of use in compiler-generated code.
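
The effect of the two bias values on 16-bit addressing can be checked
with a little arithmetic.  The following C fragment is only an
illustrative sketch: the helper names (tp_relative, dtp_relative,
fits_s16, check_bias_windows) are hypothetical and not part of the
proposed ABI; only the constants 0x7000 and 0x8000 come from the draft.

```c
#include <assert.h>
#include <stdint.h>

/* Bias values from the draft (TLS_TCB_OFFSET and TLS_DTV_OFFSET in glibc). */
#define TP_BIAS  0x7000
#define DTP_BIAS 0x8000

/* Hypothetical helper: TP-relative offset of the byte located `off` bytes
   past the start of the TLS data areas (negative `off` reaches the TCB). */
static int32_t tp_relative(int32_t off)  { return off - TP_BIAS; }

/* Hypothetical helper: DTP-relative offset of the byte located `off` bytes
   past the start of one module's TLS data. */
static int32_t dtp_relative(int32_t off) { return off - DTP_BIAS; }

/* True if `v` can be encoded as a signed 16-bit displacement. */
static int fits_s16(int32_t v) { return v >= INT16_MIN && v <= INT16_MAX; }

/* Checks the windows implied by the biases. */
static void check_bias_windows(void)
{
    /* TP: 0x1000 bytes before the TLS data and 0xf000 bytes of TLS data. */
    assert(fits_s16(tp_relative(-0x1000)) && !fits_s16(tp_relative(-0x1001)));
    assert(fits_s16(tp_relative(0xefff))  && !fits_s16(tp_relative(0xf000)));
    /* DTP: the full first 64k of a module's TLS data, nothing before it. */
    assert(fits_s16(dtp_relative(0))      && !fits_s16(dtp_relative(-1)));
    assert(fits_s16(dtp_relative(0xffff)) && !fits_s16(dtp_relative(0x10000)));
}
```

With an unbiased pointer only the first 32k of data would be reachable
with a signed 16-bit offset, which is the motivation given above.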

Kernel helpers
--------------

This TLS ABI defines a function __m68k_read_tp, provided by libc.
This returns the thread pointer in register a0 (not d0) and may
clobber other call-clobbered registers.  The compiler will generate
calls to this function for the initial exec and local exec models.

To implement this function and other requirements for NPTL, four
kernel helpers are to be provided in a vDSO (as provided by the kernel
on Power and other architectures).  The symbols indicated are exported
at symbol version LINUX_2.6.  Full DWARF unwind information for all
these functions must be included in the vDSO, as thread cancellation
may need to unwind from any point in any of these functions.  The
kernel informs glibc of the location of the vDSO by putting an
AT_SYSINFO_EHDR entry in the auxiliary vector passed to each process.
If glibc is configured for a subset of processors where the necessary
operations do not require a kernel helper, then it does not need to
use that helper; for example, glibc configured only for m68k
processors with a cas instruction does not need the
compare-and-exchange helper.  The kernel must nevertheless provide all
these helpers on all m68k and ColdFire processors so that
lowest-common-denominator glibc binaries can work across all
processors.
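
As an illustration of the AT_SYSINFO_EHDR mechanism, the following
sketch locates the vDSO image from a user program on an existing Linux
system.  It assumes a glibc that provides getauxval (declared in
<sys/auxv.h>); the function name vdso_present is hypothetical, and
looking up the helper symbols themselves (which would require parsing
the vDSO's dynamic symbol table) is not shown.

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns 1 if the auxiliary vector points at an ELF image, 0 if it points
   at something else, and -1 if no AT_SYSINFO_EHDR entry was supplied. */
static int vdso_present(void)
{
    /* The cast below relies on unsigned long being able to hold a pointer,
       which is the case for Linux ABIs. */
    assert(sizeof(unsigned long) >= sizeof(void *));

    unsigned long base = getauxval(AT_SYSINFO_EHDR);
    if (base == 0)
        return -1;

    /* A vDSO is an ordinary ELF shared object mapped by the kernel, so it
       starts with the ELF magic bytes. */
    return memcmp((const void *)(uintptr_t)base, ELFMAG, SELFMAG) == 0;
}
```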

The helper __kernel_read_tp returns the thread pointer in register a0
(not d0) and may clobber other call-clobbered registers.  (Because it
is only called from __m68k_read_tp, which is called through the PLT,
and the resolver may clobber call-clobbered registers, there seems to
be no advantage in restricting clobbers from this helper.)

Beyond the helper required for TLS, three further kernel helpers are
proposed for NPTL implementation: one to provide an atomic
compare-and-exchange operation (not available directly in the ColdFire
instruction set), one to provide a memory barrier (which can just
return to the user for non-SMP) and one to set the thread pointer.

The helper __kernel_atomic_cmpxchg_32 compares the 32-bit value at the
location pointed to by a0 with the value in d0.  If the values are
equal, it writes the value in d1 to the location pointed to by a0;
otherwise, it loads the value at the location pointed to by a0 into
d0.  In either case d0 therefore holds the original value of the
memory location on return.  The helper does not clobber any registers
other than d0 (as just described) and the condition codes.  (On m68k,
where this kernel helper would only be used if glibc is built for the
intersection of ColdFire and m68k, it could be implemented with a
single cas instruction and a return.)
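
The intended semantics can be modelled in portable C11.  This is a
sketch for discussion only, not the proposed implementation: the names
model_cmpxchg_32 and model_atomic_add are hypothetical, the arguments
stand in for a0, d0 and d1, and C11 atomics replace the kernel helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for __kernel_atomic_cmpxchg_32: `mem` plays the role of a0,
   `expected` of d0 on entry and `desired` of d1.  The return value plays
   the role of d0 on exit, i.e. the original contents of *mem. */
static uint32_t model_cmpxchg_32(_Atomic uint32_t *mem, uint32_t expected,
                                 uint32_t desired)
{
    assert(mem != 0);
    /* On failure C11 stores the current value into `expected`; on success
       `expected` is left alone and already equals the old value. */
    atomic_compare_exchange_strong(mem, &expected, desired);
    return expected;
}

/* Hypothetical caller: an atomic add built from a compare-and-exchange
   loop, as a libc might do on a processor without cas.  Returns the new
   value. */
static uint32_t model_atomic_add(_Atomic uint32_t *mem, uint32_t delta)
{
    uint32_t old = atomic_load(mem);
    for (;;) {
        uint32_t seen = model_cmpxchg_32(mem, old, old + delta);
        if (seen == old)
            return old + delta;   /* exchange succeeded */
        old = seen;               /* value changed under us: retry */
    }
}
```

A caller detects success by comparing the returned value with the value
it passed as `expected`, exactly as it would compare d0 before and
after the helper call.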

The helper __kernel_atomic_barrier provides a memory barrier.  It does
not clobber any registers other than the condition codes.  On non-SMP,
it can just return to the user; on SMP it needs to ensure memory
synchronization between processors.

The helper __kernel_write_tp sets the thread pointer to the value in
a0.  It does not clobber any registers other than the condition codes.

Offset length issues
--------------------

On ColdFire (and m68k before 68020), only 16-bit offsets can be used
in memory addresses.  On m68k (68020 and later), 32-bit offsets can be
used; a ".w" assembly suffix is used for 16-bit offsets, and otherwise
the offsets are 32 bits.

The use of 16-bit offsets limits GOT size to 8192 entries (32768 bytes
of non-negative offset at 4 bytes per entry; the toolchain does not
use negative GOT offsets on m68k/ColdFire).  On m68k (68020 and
later), GCC uses 32-bit offsets with -fPIC and 16-bit offsets with
-fpic (and does not need to use GOT accesses for non-PIC code at
present).
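
The 8192-entry figure follows directly from the displacement range.  A
minimal sketch of the check, assuming 4-byte GOT entries and using
hypothetical helper names (got_displacement, got_entry_fits_16bit):

```c
#include <assert.h>
#include <stdint.h>

#define GOT_ENTRY_SIZE 4   /* bytes per GOT entry on a 32-bit target */

/* Byte displacement of GOT entry `index` from the GOT pointer.  Only
   non-negative displacements are considered, as stated in the text. */
static int32_t got_displacement(uint32_t index)
{
    assert(index <= INT32_MAX / GOT_ENTRY_SIZE);   /* avoid overflow */
    return (int32_t)(index * GOT_ENTRY_SIZE);
}

/* True if the entry is reachable with a signed 16-bit displacement. */
static int got_entry_fits_16bit(uint32_t index)
{
    return got_displacement(index) <= INT16_MAX;
}
```

Entry 8191 sits at displacement 32764 and is the last one reachable;
entry 8192 would need displacement 32768, which does not fit.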

The proposals here do not address GOT size limitations, although an
example is given to illustrate a possible longer access sequence to
avoid those limitations on ColdFire.  The examples using offsets such
as #x@TLSGD in GOT accesses are shown for ColdFire and use the 16-bit
relocations shown.  For m68k (68020 and later), either the syntax
shown may be used, with a 32-bit relocation, or a ".w" suffix may be
used, with a 16-bit relocation.  It is proposed that the compiler, on
m68k (68020 and later), will use ".w" for -fpic and the 32-bit offsets
otherwise.  (No specific option is proposed to choose between 16-bit
and 32-bit offsets for the non-PIC, initial exec case, though such an
option could be added later.)

The same issue as for GOT accesses also applies to accesses to TLS
data using the local dynamic and local exec models.  The example code
sequences determine the address of the variable, but typically the
code will want to read or write the variable, and this may be done
more efficiently using offset addressing.  It is proposed that by
default the compiler will require the relevant TLS area to be
accessible using 16-bit offsets, and that an option -mxtls must be
used when compiling objects that use the local dynamic or local exec
models and will be linked into a module with too large a TLS area for
16-bit offset addressing.

Conventions
-----------

In the instruction sequences shown below, a5 is used to refer to the
GOT pointer (which must already have been loaded).  Apart from the
ABI-defined registers used for thread-pointer return (a0) and
__tls_get_addr return (d0), other registers may be used where
convenient.

The relocations shown on instructions are to be understood to be
applied to the extension word or words of those instructions.

Code sequences are shown in the form:

instruction                   relocation          against variable

General Dynamic TLS model
-------------------------

Code sequence:

pea #x@TLSGD(%a5)             R_68K_TLS_GD16      x
jbsr __tls_get_addr

Outstanding relocations:

GOT[n]                        R_68K_TLS_DTPMOD32  x
GOT[n+1]                      R_68K_TLS_DTPREL32  x

The R_68K_TLS_GD16 relocation causes the static linker to allocate two
consecutive GOT entries for a tls_index structure and apply the
indicated relocations to them.  The dynamic linker fills in those
entries at runtime.  The code sequence leaves the address of x in d0.
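
To make the roles of the two GOT entries concrete, here is a
deliberately simplified C model of the lookup.  It is a sketch under
stated assumptions, not glibc's implementation: model_dtv,
model_dtprel and model_tls_get_addr are hypothetical names, the DTV is
a fixed array, and lazy allocation of TLS blocks is ignored.  It also
assumes that, as on MIPS and Power, the DTPREL value stored in the GOT
is biased by 0x8000 and that the lookup undoes the bias.

```c
#include <assert.h>
#include <stdint.h>

#define DTP_BIAS 0x8000u          /* TLS_DTV_OFFSET in the draft */

typedef struct {
    unsigned long int ti_module;
    unsigned long int ti_offset;
} tls_index;

/* Hypothetical dynamic thread vector: one TLS block per module ID.
   Module IDs start at 1, so slot 0 is unused here. */
static char module1_block[64];
static char module2_block[64];
static char *model_dtv[] = { 0, module1_block, module2_block };

/* Value assumed to be stored for R_68K_TLS_DTPREL32: the variable's
   offset in its module's block minus the DTP bias (wraps as unsigned). */
static unsigned long model_dtprel(unsigned long offset_in_block)
{
    return offset_in_block - DTP_BIAS;
}

/* Simplified model of __tls_get_addr: undo the bias, then index the DTV. */
static void *model_tls_get_addr(const tls_index *ti)
{
    assert(ti->ti_module >= 1 && ti->ti_module <= 2);
    return model_dtv[ti->ti_module] + (ti->ti_offset + DTP_BIAS);
}
```

In this model the pair GOT[n], GOT[n+1] for a variable at offset 8 in
module 2 would be { 2, model_dtprel(8) }, and the lookup returns
module2_block + 8.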

On ColdFire, the example code sequence is limited to a 16-bit GOT
offset, as discussed above.  If a larger GOT is required on ColdFire,
a longer instruction sequence must be used; for example:

move.l %a5,%a0
add.l #x@TLSGD,%a0            R_68K_TLS_GD32      x
pea (%a0)
jbsr __tls_get_addr

Local Dynamic TLS model
-----------------------

Code sequence:

pea #x@TLSLDM(%a5)            R_68K_TLS_LDM16     x
jbsr __tls_get_addr
...
move.l %d0,%a1
add.l #x1@TLSLDO,%a1          R_68K_TLS_LDO32     x1

Outstanding relocations:

GOT[n]                        R_68K_TLS_DTPMOD32  x

The R_68K_TLS_LDM16 relocation causes the static linker to allocate
two consecutive GOT entries for a tls_index structure and apply the
indicated relocation to the first; the second has a value of 0 and no
relocation.  The dynamic linker fills in the first entry at runtime.
The first part of the code sequence leaves the address of the TLS
block for the current module (biased by 0x8000 as discussed above) in
d0.  The second part of the code sequence determines the address of
x1 based on the address of the TLS block; the static linker resolves
R_68K_TLS_LDO32 to the correct offset from the (biased) DTP value.
Other code sequences may be used to access the value of x1 rather than
computing its address, possibly with R_68K_TLS_LDO16 relocations
depending on whether the size of the TLS area for this module is known
to be at most 64k.

Note that the local dynamic model is generally only beneficial if a
function is accessing more than one TLS variable with this model and
so can reuse the TLS block address.
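
The reuse of the block address can be sketched in C as follows.  All
names here (module_block, model_module_base, model_ldo, sum_x1_x2) are
hypothetical, and the __tls_get_addr call is replaced by a stand-in
that simply returns the biased block address.

```c
#include <assert.h>
#include <stdint.h>

#define DTP_BIAS 0x8000

/* Hypothetical TLS block for the current module, holding x1 at offset 0
   and x2 at offset 4. */
static int32_t module_block[2];

/* Stand-in for the __tls_get_addr call on the R_68K_TLS_LDM16 GOT pair:
   the biased address of this module's TLS block. */
static intptr_t model_module_base(void)
{
    return (intptr_t)module_block + DTP_BIAS;
}

/* Value the static linker would resolve R_68K_TLS_LDO32 to. */
static int32_t model_ldo(int32_t offset_in_block)
{
    return offset_in_block - DTP_BIAS;
}

/* One base lookup, reused for two variables. */
static int32_t sum_x1_x2(void)
{
    intptr_t base = model_module_base();             /* d0 after the call */
    int32_t *x1 = (int32_t *)(base + model_ldo(0));  /* add.l #x1@TLSLDO  */
    int32_t *x2 = (int32_t *)(base + model_ldo(4));  /* add.l #x2@TLSLDO  */
    assert(x1 == &module_block[0] && x2 == &module_block[1]);
    return *x1 + *x2;
}
```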

The same comments about GOT size apply as for the general dynamic
model.

Initial Exec TLS model
----------------------

Code sequence:

jbsr __m68k_read_tp
...
move.l #x@TLSIE(%a5),%a1      R_68K_TLS_IE16      x
add.l %a0,%a1

Outstanding relocations (apart from those associated with calling
__m68k_read_tp through the PLT):

GOT[n]                        R_68K_TLS_TPREL32   x

The jbsr instruction loads the thread pointer into a0.  This may be
reused for each variable accessed with this model.  Each
R_68K_TLS_IE16 relocation causes the allocation of a single GOT entry
with the indicated relocation; this GOT entry is set up by the dynamic
linker with the offset for that TLS variable relative to the (biased)
thread pointer.  The second part of the code sequence loads this
offset from the GOT and adds the thread pointer to put the address of
x in a1.
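
The address computation can be modelled in C as below.  This is only a
sketch: static_tls, model_read_tp, got_tprel_x and address_of_x are
hypothetical names, the thread pointer is simulated rather than
obtained from the kernel, and x is assumed to live at offset 12 in the
static TLS area.

```c
#include <assert.h>
#include <stdint.h>

#define TP_BIAS 0x7000

/* Hypothetical static TLS area for one thread; x lives at offset 12. */
static char static_tls[32];

/* Stand-in for __m68k_read_tp: start of the TLS data plus the bias. */
static intptr_t model_read_tp(void)
{
    return (intptr_t)static_tls + TP_BIAS;
}

/* Stand-in for the GOT entry carrying R_68K_TLS_TPREL32 for x, as the
   dynamic linker would fill it in: offset relative to the biased TP. */
static int32_t got_tprel_x = 12 - TP_BIAS;

/* Mirrors the code sequence: read TP, load the offset, add. */
static void *address_of_x(void)
{
    intptr_t tp = model_read_tp();   /* jbsr __m68k_read_tp, result in a0 */
    intptr_t a1 = got_tprel_x;       /* move.l #x@TLSIE(%a5),%a1          */
    a1 += tp;                        /* add.l %a0,%a1                     */
    assert(a1 >= (intptr_t)static_tls);
    return (void *)a1;
}
```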

The same comments about GOT size apply as for the general dynamic
and local dynamic models.

Local Exec TLS model
--------------------

Code sequence:

jbsr __m68k_read_tp
...
move.l %a0,%a1
add.l #x@TLSLE,%a1            R_68K_TLS_LE32      x

No outstanding relocations (apart from those associated with calling
__m68k_read_tp through the PLT).

The jbsr instruction loads the thread pointer into a0.  This may be
reused for each variable accessed with this model or the initial exec
model.  The R_68K_TLS_LE32 relocation is resolved by the static linker
to the offset of x relative to the (biased) thread pointer.  The
second part of the code sequence puts the address of x in a1.  Other
code sequences may be used to access the value of x rather than
computing its address, possibly with R_68K_TLS_LE16 relocations
depending on whether all of the TLS area for the executable is known
to be within 32k of the thread pointer.

Debug information
-----------------

DWARF-2 sequence:

DW_OP_addr
.long #x@TLSLDO+0x8000        R_68K_TLS_LDO32     x
DW_OP_GNU_push_tls_address

No outstanding relocations.

The static linker resolves the relocation and offset to put the
unbiased address of x relative to the TLS block for its module in the
word of debug information.  GDB then uses this to locate the variable
at debug time.

ELF relocations
---------------

Static relocations:

#define R_68K_TLS_GD32      25
#define R_68K_TLS_GD16      26
#define R_68K_TLS_GD8       27
#define R_68K_TLS_LDM32     28
#define R_68K_TLS_LDM16     29
#define R_68K_TLS_LDM8      30
#define R_68K_TLS_LDO32     31
#define R_68K_TLS_LDO16     32
#define R_68K_TLS_LDO8      33
#define R_68K_TLS_IE32      34
#define R_68K_TLS_IE16      35
#define R_68K_TLS_IE8       36
#define R_68K_TLS_LE32      37
#define R_68K_TLS_LE16      38
#define R_68K_TLS_LE8       39

Dynamic relocations:

#define R_68K_TLS_DTPMOD32  40
#define R_68K_TLS_DTPREL32  41
#define R_68K_TLS_TPREL32   42
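
Because the static relocations are numbered in groups of three (32-bit,
16-bit, 8-bit) per access kind, a tool can derive the kind and field
width from the number.  The following C sketch shows one way to do so;
the enum and the function names reloc_model and reloc_width_bits are
hypothetical and not part of the proposal.

```c
#include <assert.h>

/* Access kinds, in the order the static relocations are numbered. */
enum tls_model { TLS_GD, TLS_LDM, TLS_LDO, TLS_IE, TLS_LE };

#define R_68K_TLS_GD32 25   /* first static TLS relocation */
#define R_68K_TLS_LE8  39   /* last static TLS relocation  */

/* Access kind of static TLS relocation `r` (25..39). */
static enum tls_model reloc_model(int r)
{
    assert(r >= R_68K_TLS_GD32 && r <= R_68K_TLS_LE8);
    return (enum tls_model)((r - R_68K_TLS_GD32) / 3);
}

/* Width in bits (32, 16 or 8) of the field that `r` is applied to. */
static int reloc_width_bits(int r)
{
    assert(r >= R_68K_TLS_GD32 && r <= R_68K_TLS_LE8);
    return 32 >> ((r - R_68K_TLS_GD32) % 3);
}
```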

-- 
Joseph S. Myers
joseph@codesourcery.com