
Bug#800574: More details and references



On Thu, 01 Oct 2015, Henrique de Moraes Holschuh wrote:
> We have a fix for the HLE BDW50 errata confirmed for Broadwell-H,
> through updated microcode.
> 
> Broadwell-H errata BDW50 fix:
> signature 0x40671, pf_mask 0x22, revision >= 0x12
> 
> Which would allow us to selectively blacklist only Broadwell with
> "outdated" microcode.  I can (and will) make a major ruckus about this

...

> There is also a Skylake microcode update available (dated 2015-08-08), I

Which has been confirmed to fix the HLE issue there, as well.

Further datapoints: other than the E7-v3 Xeons, it looks like HLE is
critically broken on anything running microcode older than 2015-04.  We can
probably depend on Broadwell-DE (Xeon D-1500) motherboards to ship with
new-enough microcode.

My search was not completely exhaustive and I don't have privileged access
to any Intel documentation, so I might have missed a processor family or
two, etc.

Still, I could not find *anything* supposed to support TSX/HLE that didn't
have either the old Haswell "TSX may cause unpredictable system behavior"
erratum, or the newer errata "TSX not available" and "reading the memory
destination of an instruction that begins an HLE transaction may return the
original value" listed.  Skylake's spec update doesn't list them yet as of
2015-10-04, but we *know* it has either the same or very similar errata, and
that it got fixed by a recent microcode update.


So, for non-free and Ubuntu, microcode updates through the intel-microcode
package are likely to be a viable way to fix this: it all depends on the
required microcode updates being made available in the first place.

But non-free is not Debian, people rarely update their firmware unless you
push hard for it, and it takes at least six months for fixed microcode to be
reasonably available through firmware updates.

Just ignoring the issue (read: passively documenting it), while still an
option, should be left as the least desirable choice IMO.


Unfortunately, blacklisting HLE by microcode revision would currently
require parsing /proc/cpuinfo, which is not really desirable for the HLE
blacklist code, to put it lightly.  So, it looks like any blacklisting done
in the library code will have to be all-or-nothing: fixing the processor by
a microcode update will not lift the blacklist.
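
To make "parsing /proc/cpuinfo" concrete, here is a minimal sketch in Python
of what a revision-aware check would have to extract.  It is only an
illustration: the function names are made up for this example, it assumes
the x86 Linux field names "cpu family", "model", "stepping" and "microcode",
and it is not something the library itself could reasonably do.

```python
def cpu_signature(family, model, stepping):
    """Rebuild the CPUID(1).EAX signature from the displayed values.

    /proc/cpuinfo shows the already-combined family and model, so the
    extended fields have to be split out again.
    """
    if family >= 15:
        base_family, ext_family = 15, family - 15
    else:
        base_family, ext_family = family, 0
    # The extended model field is only used for family 6 and family 15+.
    ext_model = (model >> 4) if (family == 6 or family >= 15) else 0
    return ((ext_family << 20) | (ext_model << 16) | (base_family << 8)
            | ((model & 0xF) << 4) | (stepping & 0xF))


def parse_cpuinfo(text):
    """Return one (signature, microcode_revision) tuple per logical CPU."""
    cpus = []
    # Each logical CPU is a block of "key : value" lines ended by a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "microcode" not in fields:
            continue  # field missing: nothing to compare against
        sig = cpu_signature(int(fields["cpu family"]),
                            int(fields["model"]),
                            int(fields["stepping"]))
        cpus.append((sig, int(fields["microcode"], 16)))
    return cpus
```

On a live system this would be called as
parse_cpuinfo(open("/proc/cpuinfo").read()).  Family 6, model 71, stepping 1
comes out as signature 0x40671, the Broadwell-H value quoted above.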

Also, processors that share the same CPU signature have to be blacklisted
as a group, even if they take different microcode (which would also be a
problem for microcode-revision-based whitelists: we *might* need to know
the processor's microcode platform flags in some cases).

I recommend that, for Debian stable (jessie), we switch to a whitelist-based
approach for HLE support, currently only whitelisting the latest stepping of
Haswell-EX (Xeon E7-v3) and Broadwell-DE (Xeon D-1500).  We can revisit that
decision in six months or one year, and possibly switch back to blacklisting
instead of whitelisting.

Only processors that are known to never have been widely deployed with HLE
errata would be eligible to be whitelisted.

This means at least Broadwell, Broadwell-H, and Skylake-H/S would never get
HLE support reenabled in Debian jessie, which includes several Xeon
processors.  Obviously, if we ever find a way to make the blacklist
microcode-revision aware, we can do better.

For unstable, we could adopt the same whitelisting approach in the short
term (three to six months), while we work on something more flexible that
would allow processors that got a later-than-launch errata fix to get
delisted from the HLE "whitelist-based blacklist".


One should keep in mind that, if we add such blacklisting, we also need to
decide how we will deal with removals from the blacklist in the future due
to fixed microcode being made available: should we lift the blacklisting for
a processor signature, it will regress systems still missing the microcode
update (fixable by installing non-free intel-microcode and rebooting before
upgrading glibc).

It is possible to add preinst logic to abort glibc install/upgrades for the
"we are removing this processor signature from the blacklist, and
/proc/cpuinfo lists a microcode revision known to be broken" case.  This
takes care of regressions (in a rather user-unfriendly way, though) should
we decide that users ought to either install firmware updates, or tolerate
installing non-free intel-microcode.  Something would need to be done for
the Debian installer as well (to address new installs).
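
The decision such a preinst would have to make can be sketched as follows.
This is Python for readability only (a real maintainer script would more
likely be shell), the function names are invented for the example, and it
assumes the (signature, revision) pairs have already been extracted from
/proc/cpuinfo by some other means.

```python
def broken_cpus(cpus, delisted):
    """List CPUs that would regress once their signature is delisted.

    cpus:     iterable of (signature, microcode_revision) pairs
    delisted: dict mapping a signature being removed from the blacklist
              to the first microcode revision known to carry the fix
    """
    return [(sig, rev) for sig, rev in cpus
            if sig in delisted and rev < delisted[sig]]


def preinst_exit_code(cpus, delisted):
    """Return the exit status the preinst should use.

    A non-zero status makes dpkg abort the install/upgrade.
    """
    for sig, rev in broken_cpus(cpus, delisted):
        print("CPU signature 0x%x has microcode 0x%x, need >= 0x%x"
              % (sig, rev, delisted[sig]))
    return 1 if broken_cpus(cpus, delisted) else 0
```

For example, delisting 0x40671 with a minimum revision of 0x12 would abort
the upgrade on a system reporting revision 0x11, and let it proceed on one
reporting 0x12 or on any processor with a different signature.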


I really wish we had without-HLE and with-HLE variants of glibc for x86-64,
with non-HLE being the preferred/default choice for now (the preferred
choice being something to revisit in the future, as working HLE becomes more
widespread).

Then, we could have as-complex-as-required blacklisting logic in the preinst
of the HLE variant, which could easily be made microcode-revision aware,
etc.  It would be really user-unfriendly when tripped (refuse the install /
abort the upgrade), but at least it would be safer.


Comments?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

