
Bug#800574: Final analysis for Broadwell



Well, I've finally finished analysing things for Broadwell.

The amd64 (x86-64) glibc lock elision code keys on the RTM CPUID bit
because it is actually using RTM (and not HLE) to implement lock
elision.  I failed to keep this in mind when worrying about permanently
blacklisting Broadwell processors from glibc lock elision in unstable.
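
To make the distinction concrete, here is a minimal sketch of such a
check. It is not the actual glibc code: the helper names are invented
for illustration, and only the bit positions (HLE is bit 4 and RTM is
bit 11 of EBX in CPUID leaf 7, subleaf 0) and the compiler-provided
__get_cpuid_count() helper from <cpuid.h> are real.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits, as documented in the Intel SDM. */
#define CPUID7_EBX_HLE (UINT32_C(1) << 4)
#define CPUID7_EBX_RTM (UINT32_C(1) << 11)

/* Hypothetical helpers: decode the two TSX bits from a raw EBX value. */
static inline bool ebx_has_hle(uint32_t ebx)
{
    return (ebx & CPUID7_EBX_HLE) != 0;
}

static inline bool ebx_has_rtm(uint32_t ebx)
{
    return (ebx & CPUID7_EBX_RTM) != 0;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang: __get_cpuid_count() */

/* Ask the running CPU; reports false when leaf 7 is not supported. */
static inline bool cpu_reports_rtm(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ebx_has_rtm(ebx);
}
#endif
```

A processor can report HLE without RTM (or the reverse), so a check
that keys on RTM alone says nothing about the state of the HLE bit.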

Errata BDM53/BDW51/BDD51/BDE42 "Intel TSX Instructions not available"
_should_ mean that trying to use any of the TSX-NI opcodes (i.e. RTM) on
Broadwell would always result in an illegal opcode exception (SIGILL).
The specification updates also explicitly say in the descriptions of
these errata that the RTM bit in CPUID is not set.

Were those errata present on every microcode revision and core version
of those processors, it would make them "safe" as far as our (patched)
glibc lock elision is concerned.  We are not that lucky.

Searching the network for cpuinfo reports turned up /proc/cpuinfo dumps
of signature 0x306d4 with microcode rev 0x11 (Core i5-5300U) and rev
0x18 (Core M-5Y71), where both RTM and HLE are reported as enabled.

Other /proc/cpuinfo dumps of signature 0x306d4, with microcode rev 0x18
(Core i5-5300U and also Core M-5Y10c) and rev 0x1f (Core i5-5287U),
show both HLE and RTM already disabled.

The fact that revision 0x18 gave a different CPUID response on the Core
i5-5300U and Core M-5Y10c than on the Core M-5Y71 was a surprise.
Perhaps it also depends on the firmware doing a (hopefully boot-locked)
wrmsr to disable TSX.

Anyway, regardless of the reason, one cannot count on the RTM and HLE
bits being disabled in CPUID(7) on every Broadwell processor and
microcode revision out there.

OTOH, it does mean we can, and should, blacklist signature 0x306d4 (and
earlier) permanently, because RTM is extremely unlikely to be
fixed/fixable on these processors. Either it is disabled as it should be
per the errata documentation, or enabled and very dangerous (resulting
in either SIGILL or Haswell-style risk of unpredictable system
behavior).  Since signature 0x40671 also has the same "TSX unavailable"
type of errata (BDD51, BDE42), I guess we can assume the same applies to
Broadwell-H and Broadwell-DE, and blacklist lock elision there
permanently as well.
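
As a rough sketch of what such a signature-based blacklist could look
like (this is not the patch in the archive; the struct and function
names are made up for illustration): the signature is the EAX value
returned by CPUID(1), which decodes into family, model and stepping.
Matching steppings at or below 1 for 0x40671 is my assumption, since
only 0x306d4 is explicitly described as "and earlier" above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for a decoded CPUID(1).EAX signature. */
struct cpu_sig {
    unsigned int family;
    unsigned int model;
    unsigned int stepping;
};

/* Decode family/model/stepping using the standard Intel rules. */
static inline struct cpu_sig decode_sig(uint32_t sig)
{
    struct cpu_sig s;
    unsigned int base_family = (sig >> 8) & 0xf;

    s.stepping = sig & 0xf;
    s.family = base_family;
    if (base_family == 0xf)
        s.family += (sig >> 20) & 0xff;       /* extended family */
    s.model = (sig >> 4) & 0xf;
    if (base_family == 0x6 || base_family == 0xf)
        s.model |= ((sig >> 16) & 0xf) << 4;  /* extended model */
    return s;
}

/* True when lock elision should stay off for this signature. */
static inline bool elision_blacklisted(uint32_t sig)
{
    struct cpu_sig s = decode_sig(sig);

    if (s.family != 6)
        return false;
    if (s.model == 0x3d && s.stepping <= 4)   /* 0x306d4 and earlier */
        return true;
    if (s.model == 0x47 && s.stepping <= 1)   /* 0x40671; "<=" is assumed */
        return true;
    return false;
}
```

With this decoding, 0x306d4 is family 6, model 0x3d, stepping 4, and
0x40671 is family 6, model 0x47, stepping 1.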

I am still collecting data for Skylake-S, but it boils down to whether
up-to-date Skylake-S microcode (revisions 0x34 and higher) fixes or
disables TSX.  We do know that the microcode update stops glibc lock
elision from crashing with SIGSEGV, though.


Meanwhile, a suggestion by Samuel Thibault to try to use hwcap did
provide for a possible long-term plan to fine-tune the lock-elision
blacklist (and anything else of that sort).

We would have to (finally) extend x86-64 hwcap to cover cpuid(1) fully,
and also at least cpuid(7), which is anything but trivial and is a lot
of work.  This is _not_ worth the trouble if it is done just for lock
elision blacklisting purposes.

However, it would be useful for link-time optimization in libraries
(e.g. avx2 flavours of something that really benefits from it, etc), so
it is likely worth pursuing... but only if we get buy-in from upstream.

Once it is there for far better purposes than blacklisting, there is no
reason not to do the trivial work to have the kernel blacklist whatever
capabilities should be avoided, and switch glibc to use the hwcap
extension instead of doing cpuid directly wherever available, thus
making it usable _also_ for blacklisting things.
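
A very rough sketch of the idea, under loud assumptions: getauxval()
and AT_HWCAP are real (on x86 Linux the AT_HWCAP word carries only
CPUID(1).EDX, as far as I know), but no hwcap word for cpuid(7) exists
today.  The "extended" word is therefore just a function parameter
here, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <sys/auxv.h>  /* getauxval(), AT_HWCAP (Linux, glibc 2.16+) */

/* RTM bit position, borrowed from CPUID.(EAX=7,ECX=0):EBX bit 11. */
#define HWCAP7_RTM (1UL << 11)

/* What exists today: the hwcap word the kernel passed to the process. */
static inline unsigned long current_hwcap(void)
{
    return getauxval(AT_HWCAP);
}

/*
 * Hypothetical: 'hwcap7' stands for a kernel-provided copy of the
 * cpuid(7) EBX word with blacklisted features already masked out.
 * No such aux vector entry exists yet; it is an assumption.
 */
static inline bool elision_allowed(unsigned long hwcap7)
{
    return (hwcap7 & HWCAP7_RTM) != 0;
}
```

The point of the design is that the blacklisting decision would live in
one place (the kernel clearing the bit), and glibc would simply trust
the word it is handed instead of running cpuid itself.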

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique de Moraes Holschuh <hmh@debian.org>

