
Re: libc recently more aggressive about pthread locks in stable ?



On Sat, 05 Nov 2016, Ian Jackson wrote:
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks.  On my partner's machine,

Per logs from message #15 on bug #842796:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15

SIGSEGV on __lll_unlock_elision is a signature (IME with very high
confidence) of an attempt to unlock an already unlocked lock while
running under hardware lock elision.


Well, unlocking an already unlocked lock is a pthreads API rule
violation, and it is going to crash the process on something that
implements hardware lock elision.

These would be Intel x86 processors with TSX enabled[1] for Debian
8/jessie.  For Debian 9/stretch and for unstable, I believe it also
includes IBM Power8, and s390x systems -- AFAIK they won't forgive an
attempt to unlock an unlocked lock any more than Intel TSX does.

[1] Broadwell-E, Skylake, and later processors, as well as Xeon *v5
    processors.  I am not sure if we blacklisted any of the Xeon *v4
    or not, and too tired to look their model numbers up right now.
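
To check whether a given x86 box advertises TSX at all, one can ask CPUID
directly.  What follows is only a minimal sketch, assuming GCC or clang
(it relies on their <cpuid.h>); the function name cpu_advertises_rtm is
made up for this example.  RTM is bit 11 of EBX in CPUID leaf 7,
subleaf 0.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper: returns 1 if CPUID advertises RTM (restricted
 * transactional memory), 0 otherwise.  Always returns 0 on non-x86
 * builds, where this check does not apply.
 */
int cpu_advertises_rtm(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 7 must exist before it can be queried. */
    if (__get_cpuid_max(0, 0) < 7)
        return 0;

    /* Leaf 7, subleaf 0: structured extended feature flags. */
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void)eax; (void)ecx; (void)edx;

    /* EBX bit 11 = RTM. */
    return (int)((ebx >> 11) & 1u);
#else
    return 0;
#endif
}
```

This only reports what the processor advertises (a microcode update may
have turned TSX off).  It says nothing about whether glibc actually
enabled elision on that processor.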

Unfortunately, when hardware lock elision support was added to glibc
upstream, libpthread was *not* changed to properly assert() this
forbidden condition on the non-hardware-elision codepaths.  Such an
assert() would have given us consistent behavior, thus flushing the bugs
out in the open... at the cost of a performance hit (I have no idea how
severe), and much screaming.
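
For anyone hunting these bugs in a package, POSIX already offers an
opt-in check: a mutex of type PTHREAD_MUTEX_ERRORCHECK must return an
error when it is unlocked while not locked.  The sketch below is only an
illustration of the bug class, not code from ghostscript or glibc, and
the two function names are made up for this example.  AFAIK glibc only
elides default-type mutexes, so switching a suspect mutex to this type
while debugging should give the same answer with or without TSX.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: set up *m as an error-checking mutex.
 * Returns 0 on success, or an errno-style code on failure. */
int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical demo: lock once, unlock twice, and return what the
 * second (buggy) unlock reported.  Returns -1 if setup failed. */
int double_unlock_result(void)
{
    pthread_mutex_t m;
    int rc;

    if (init_errorcheck_mutex(&m) != 0)
        return -1;

    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);        /* legitimate unlock */
    rc = pthread_mutex_unlock(&m);   /* the bug: mutex is not locked */

    pthread_mutex_destroy(&m);
    return rc;
}
```

With an error-checking mutex the second unlock returns EPERM instead of
silently "working" or crashing, so the offending call site can be
logged or turned into an assert().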

To be fair: it is likely nobody upstream had any idea of just how much
code got libpthread usage wrong... and we certainly didn't know better
in Debian, either.  Well, now we're going to find out :-(

BTW, AFAIK libpthread still doesn't have any such assert(), so there's
likely a lot of such buggy code in unstable still.  This is going to
cause trouble for Debian stretch, too.

> Has something changed in jessie's libc recently ?  I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

The required hardware was not widely available at the time, the
knowledge of how hardware lock elision would really behave was sparse
outside of Intel and IBM -- so people either didn't know, or did not
grasp the importance of the fact that the hardware would be utterly
intolerant to something that the old code was too lenient about -- and
libpthread was not instrumented to compensate for that.

I actually recommended that it would be safer to disable lock elision
for jessie[2]: the sharp corners in the glibc 2.19 code scared me, as
did just how messed up the implementation on Intel processors was at
the time.  Unfortunately, I didn't push for it at all: I didn't know how
correct I was at the time[3].

[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195#50

The hard truth is that nobody in Debian knew how deep those murky waters
were at the time[3], and I don't think glibc upstream developers did
either.  So, we limited ourselves in Debian to blacklisting the
processors where Intel (either for sure, or highly likely) screwed it up
beyond repair.

[3] A number of subtle Intel TSX errata were fixed by Skylake and
    Broadwell microcode updates, and the latest ones are quite recent.
    The until-then latent (or subtle) locking bugs in applications and
    libraries are becoming high-hitter crashers as more users get newer
    computers, etc.

Anyway, any library or application that hits this issue has broken
locking, plain and simple.

A package crashing from this issue very likely requires a stable update
to fix its locking (which won't always be a trivial fix, either).  That
holds even if we changed libpthread to disable lock elision support and
the crashes stopped: the locking would still be broken, and so could not
be trusted to ensure correct operation at all times.

> I will try to make a patch to fix ghostscript, or at least file a
> proper bug.  But, if there was a libc change, would it be possible to
> revert it or make some kind of workaround ?

If the problem is too widespread and too hard to fix on a large number
of packages, I suppose we could ask the glibc maintainers to consider
disabling hardware lock elision support in stable through a stable
update.

Such a change to glibc would likely require some patches to ensure it
*really* disabled Intel TSX opcode/instruction insertion, but I think we
already ship all of them as part of the Intel TSX blacklist.  The result
would need real-world testing on an up-to-date Skylake box as well as
objdump inspection to ensure *no* TSX-related instructions leaked into
the binaries.

And what should we do about Debian stretch, then?

Some references:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=824191
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195

-- 
  Henrique Holschuh

