Bug#800574: backport to sid/stable? (was RE: libc6: lock elision hazard on Intel Broadwell and Skylake)



On Fri, Oct 23, 2015, at 11:13, Carlos Alberto Lopez Perez wrote:
> I was having trouble (crashes with the NVIDIA proprietary driver) on a
> Debian system with an i7-5775C and libc6=2.19-18+deb8u1 (stable)

This is very likely to be brain damage in the NVIDIA driver, though.

Are you sure that driver is not doing something as idiotic as unlocking
an already unlocked mutex?

The proper fix in that case is _always_ to fix whatever is broken,
because eventually it will run on something that has working hardware
lock elision... and crash.

> I tried first to update the Intel microcode with the "unreleased" 0x13
> microcode version but it didn't disabled the TSX-NI instructions [1]
> neither the crashes.

Mobile Broadwell-H seems to disable TSX, while Desktop Broadwell-H
doesn't.  That's why we blacklisted the whole thing: inconsistent
behavior on the same microcode, and that behavior is itself inconsistent
with the errata sheet that says such processors shouldn't even be able
to advertise Intel TSX RTM in CPUID.
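
For anyone who wants to see what a given box advertises, here is a
minimal sketch that decodes the TSX bits from CPUID leaf 7, subleaf 0
(EBX bit 4 is HLE, EBX bit 11 is RTM).  The helper names are made up for
illustration, and it assumes a GCC or Clang toolchain that ships
<cpuid.h> with __get_cpuid_count().  It only reports what CPUID claims,
which says nothing about whether the feature actually works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):EBX feature bits. */
#define CPUID7_EBX_HLE (1u << 4)
#define CPUID7_EBX_RTM (1u << 11)

/* Hypothetical helpers: decode the two TSX bits from an EBX value. */
static bool ebx_has_hle(uint32_t ebx) { return (ebx & CPUID7_EBX_HLE) != 0; }
static bool ebx_has_rtm(uint32_t ebx) { return (ebx & CPUID7_EBX_RTM) != 0; }

/* Read EBX of leaf 7, subleaf 0.  Returns false when the leaf is not
 * available or when this is not an x86 build. */
static bool read_cpuid7_ebx(uint32_t *ebx_out)
{
    assert(ebx_out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    *ebx_out = ebx;
    return true;
#else
    return false;
#endif
}

/* Print what the processor advertises. */
static void report_tsx(void)
{
    uint32_t ebx;
    if (!read_cpuid7_ebx(&ebx)) {
        puts("CPUID leaf 7 not available");
        return;
    }
    printf("HLE advertised: %s\n", ebx_has_hle(ebx) ? "yes" : "no");
    printf("RTM advertised: %s\n", ebx_has_rtm(ebx) ? "yes" : "no");
}
```

Calling report_tsx() before and after a microcode update shows whether
the update changed what the processor advertises.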

At the moment, we don't even know what is wrong with RTM in
Broadwell/Broadwell-H/Broadwell-DE.  We do know some of what is wrong
with HLE in Broadwell/-H/-DE (and it is really nasty), but glibc does
not use HLE in the first place, and the HLE erratum is supposedly worked
around somehow (it is documented as such for the Xeon
D-1500/Broadwell-DE) by the batch of microcode updates available in the
kernel bugzilla report mentioned in this bug report.

Broadwell-H microcode 0x13 is useful anyway because it fixes other
critical errata that hang or oops the kernel: your box should be a _lot_
more stable with it.  And at least one person reported that microcode
0x12 did not fix all of the hangs, so you should probably keep using
microcode 0x13 (or newer, should one become available).

> Finally I upgraded to glibc=2.21-0experimental2 and it fixed the crashes.

It is just as likely that it merely "works around" a bug in the NVIDIA
drivers; see above.

If we instrumented non-lock-elision glibc to complain about operations
that are illegal on most processors implementing lock elision, we'd know
for sure.
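
As an application-level illustration of that kind of check (not the
glibc instrumentation itself), POSIX error-checking mutexes already
report this class of bug.  The sketch below assumes a POSIX threads
implementation; the function name is made up for illustration.  It
unlocks a mutex that nobody holds and returns the error code instead of
silently succeeding.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical demo: returns what pthread_mutex_unlock() reports when
 * called on an error-checking mutex that was never locked. */
int unlock_without_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    /* The bug under discussion: nothing holds m at this point. */
    rc = pthread_mutex_unlock(&m);

    pthread_mutex_destroy(&m);
    return rc;
}
```

With an error-checking mutex the bad unlock comes back as EPERM, so a
program can log or abort at the faulty call site.  With a default mutex
the same call is undefined behavior.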

> Should this patch be backported both to stable and unstable?

It needs to go to stable sooner rather than later, yes.  But it seems
wise to let it cook in unstable/testing for a bit, first.

I don't know what the plans for uploading new glibc to unstable are.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique de Moraes Holschuh <hmh@debian.org>

