
Bug#900087: xserver-xorg-video-amdgpu: AMD RX 550 often locks up



On 2019-08-29 11:29+0200 Moritz Mühlenhoff wrote:

I'm seeing the same [instabilities] with RX 570 (which according to Wikipedia should
be the same architecture as your RX 550) and the stable Buster release,
both with the standard 4.19 kernel and the 5.2 kernel from buster-backports.

The short story is that, after a BIOS upgrade, my box has been stable for 27 days and counting.

For the longer story read on....

I am now of the opinion that this whole mess is due to
incompatibilities between real AMD motherboard and graphics chipsets
*as sold with their initial BIOS* to unsuspecting Linux users in the last
couple of years and the AMD documentation of those chipsets (which
Linux kernel developers must rely on).  So the net result of these
initially buggy BIOSes is general Linux-AMD instability.  (Do a Google
search for the terms (without quotes) "amd linux instability" and you
will find many sorry tales.)

Up to a month ago, I had solved virtually all the lockups that
occurred at night, when the box was almost completely idle, by using
the kernel parameters idle=nomwait rcu_nocbs=0-15 (where the 15
corresponds to one less than the 16 hardware threads I have on my
Ryzen 7 1700 system).  But the rcu_nocbs parameter requires a special
kernel build with a different configuration (CONFIG_RCU_NOCB_CPU=y)
than what Debian supplies.  With that custom kernel (4.18.10-custom)
the idle lockups dropped to just one over many months, but the active
(as opposed to idle) instability issues still caused lockups, with up
times between them that ranged from 0.5 days to 24 days and averaged
a week or so.
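
For anyone wanting to try the same parameters, here is a minimal
Python sketch of how the rcu_nocbs range follows from the hardware
thread count, and how one might check whether a kernel config has
CONFIG_RCU_NOCB_CPU=y.  The helper names are hypothetical, and the
config path /boot/config-<release> is an assumption about where the
distribution installs the kernel config; adjust it for your system.

```python
import os
import platform
from pathlib import Path


def rcu_nocbs_param(n_threads):
    """Return the rcu_nocbs= parameter covering CPUs 0 .. n_threads-1."""
    if n_threads < 1:
        raise ValueError("need at least one hardware thread")
    return f"rcu_nocbs=0-{n_threads - 1}"


def config_enables_nocb(config_text):
    """True if the kernel config text contains the line CONFIG_RCU_NOCB_CPU=y."""
    return any(line.strip() == "CONFIG_RCU_NOCB_CPU=y"
               for line in config_text.splitlines())


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs (hardware threads), e.g. 16 on a Ryzen 7 1700.
    n = os.cpu_count() or 1
    print("idle=nomwait", rcu_nocbs_param(n))

    # Assumed location of the running kernel's config file.
    cfg = Path(f"/boot/config-{platform.release()}")
    try:
        print("CONFIG_RCU_NOCB_CPU=y:", config_enables_nocb(cfg.read_text()))
    except OSError:
        print("could not read kernel config at", cfg)
```

If the config check reports False, the rcu_nocbs parameter will
presumably have no effect on that kernel, which is why I needed the
custom build.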

Soon after I reported this box instability to kernel developers I got
the advice from them to try a BIOS update.  But I put that off for
more than a year because such upgrades are considered to be a last
resort.  The reason is that they can turn your motherboard into a
brick with low but still nonzero probability due to a number of
different causes (such as AMD/motherboard vendor screw-ups with the
BIOS upgrade, internet download errors with the BIOS, power outages
during the BIOS upgrade, etc.).  Also, there was huge churn in the BIOS
upgrades with ASUS (my motherboard vendor) putting out 10 (!) of them
since the BIOS I received when I bought the box.  So I decided to wait
until that churn had settled down, i.e., ASUS had gotten
asymptotically closer to the definitive BIOS for my box.  Meanwhile,
the average up time of ~1 week mentioned above was livable.

The upshot was that 27 days ago I finally arranged for some
professionals to do the BIOS upgrade.  That made the AMD CBS Power
Supply Idle Control option available for the first time, and I set
that control from "auto" to "Typical Current Idle" (which apparently is an
alternative to setting the custom kernel parameter rcu_nocbs=0-15 for
dealing with idle instability issues).  The net result of this update
is that the current up time (still with the same 4.18.10-custom kernel and
kernel parameters) is 27 days and counting, which is a new record.  So
I am beginning to hope that this BIOS upgrade has solved the active
lockup issues not covered by rcu_nocbs=0-15, and might even (with
Power Supply Idle Control set to "Typical Current Idle") solve the
idle lockup issues if I drop rcu_nocbs=0-15.
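
When comparing runs with and without rcu_nocbs=0-15, it helps to
confirm which parameters the running kernel was actually booted with.
The following is a minimal Python sketch that reads /proc/cmdline;
the function name is hypothetical, and the parsing is deliberately
simplistic (it assumes no quoted values containing spaces).

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    try:
        params = parse_cmdline(Path("/proc/cmdline").read_text())
    except OSError:
        params = {}
    # Report the two parameters discussed in this message.
    for name in ("idle", "rcu_nocbs"):
        if name in params:
            print(f"{name}={params[name]} is set")
        else:
            print(f"{name} is not set")
```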

Therefore, if the current up time experiment with the 4.18.10-custom
kernel is able to continue for another month or so, my plan is to try
a similar experiment with the stock Debian kernel (i.e., without the
rcu_nocbs=0-15 parameter, which needs the custom kernel), and if I
get, say, ~60 days up time with that kernel, I will likely conclude
this problem has been completely solved, and I will close this bug
report.

Meanwhile, I hope if you decide as a last resort to try updating your
own BIOS (after careful consideration of the known risks), that will
completely solve this issue for you.

Alan
__________________________
Alan W. Irwin

Programming affiliations with the FreeEOS equation-of-state
implementation for stellar interiors (freeeos.sf.net); the Time
Ephemerides project (timeephem.sf.net); PLplot scientific plotting
software package (plplot.org); the libLASi project
(unifont.org/lasi); the Loads of Linux Links project (loll.sf.net);
and the Linux Brochure Project (lbproject.sf.net).
__________________________

Linux-powered Science
__________________________

