Re: Weird behaviour on System under high load
> -------- Original Message --------
> From: David Christensen <dpchrist@holgerdanske.com>
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 03:11:43 -0700
>
> On 5/21/23 01:14, Christian wrote:
>
> > > -------- Original Message --------
> > > From: David Christensen <dpchrist@holgerdanske.com>
> > > To: debian-user@lists.debian.org
> > > Subject: Re: Weird behaviour on System under high load
> > > Date: Sat, 20 May 2023 18:00:48 -0700
> > >
> > > On 5/20/23 14:46, Christian wrote:
> > > > Hi there,
> > > >
> > > > I am having trouble with a newly built system. It works normally
> > > > and is stable until I put extreme stress on it, e.g. using all 12
> > > > cores with the stress tool.
> > > >
> > > > The system will suddenly lose its network connection and become
> > > > unresponsive. Only a reset works. I am not sure what is going on,
> > > > but it is reproducible: put stress on the system and it fails. It
> > > > seems that something is getting out of step.
> > > >
> > > > The stuff below is what I found in the logs. I tried quite a bit,
> > > > even upgraded to bookworm, to see if the newer kernel works.
> > > >
> > > > If anyone knows how to analyze this issue, it would be very
> > > > helpful.
>
>
> Please use inline posting style and proper indentation.
Phew... will be quite hard to read. But here you go.
>
>
> > > Have you verified that your PSU has sufficient capacity for the
> > > load on each and every rail?
>
> > Hi there,
> >
> > Let's go through the different topics:
> > - Setup: It is an AMD 5600G
>
> https://www.amd.com/en/products/apu/amd-ryzen-5-5600g
>
> 65 W
>
>
> > on an ASRock B550M-ITX/ac,
>
>
> https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp
>
>
> > powered by a BeQuiet SP7 300W
> >
> > - Power: From the specifications it should fit. As it takes 5-20
> > minutes for the error to occur, I would take that as an indication
> > that the power supply is OK. Otherwise I would expect it to fail
> > right away? Is there a way to measure/test if there is any issue
> > with it? I also tested limiting PPT to 45W, which makes no
> > difference either.
>
>
> If all you have is a motherboard, a 65W CPU, and an SSD, that looks
> like a good quality 300W PSU and I would think it should support
> long-term full loading of the CPU. But, there is no substitute for
> doing the engineering.
>
>
> I do PSU calculations using a spreadsheet. This requires finding
> power specifications (or making estimates) for everything in the
> system, which can be tough.
>
>
> BeQuiet has a PSU calculator. I suggest using it:
>
> https://www.bequiet.com/en/psucalculator
>
>
> Measuring actual power supply output and system usage would involve
> building or buying suitable test equipment. The cost would be
> non-trivial.
>
>
> An easy A/B test would be to connect a known-good, high-quality PSU
> with a higher power rating (say, 500-1000W). I use:
>
> https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/
>
I used the calculator; however, it might be that the onboard graphics is
not properly accounted for. I will try to get a 500W PSU for testing.
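
To cross-check the calculator, here is a rough budget sketch in Python.
Only the 65 W CPU rating and the 300 W PSU rating come from this
thread; every other figure is an assumed placeholder that would need to
be replaced with real specifications. It checks the total only, not the
individual rails.

```python
# Rough steady-state power budget. Only the 65 W CPU TDP and the 300 W PSU
# rating come from this thread; every other figure is an assumed placeholder.
PSU_RATED_W = 300

# Assumption: worst-case draw per component, in watts.
components_w = {
    "cpu_package_incl_igpu": 88,  # assumed default package limit of a 65 W part
    "motherboard_and_ram": 40,    # assumed
    "ssd": 10,                    # assumed
    "fans": 10,                   # assumed
}


def budget(components, rated_w):
    """Return (total draw in W, load as a fraction of the PSU rating)."""
    total = sum(components.values())
    return total, total / rated_w


total_w, load = budget(components_w, PSU_RATED_W)
print(f"estimated draw: {total_w} W = {load:.0%} of {PSU_RATED_W} W")
```

If these guesses are anywhere near reality, the total stays well below
the rating, which is why I would like the A/B test with another PSU to
settle it.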
>
> > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > ducts, etc., recently?
>
>
> ?
As written in the original post, the system is new. Only the PSU is
used, so it is clean.
>
>
> > > Have you tested the thermal solution(s) recently?
>
> > - Thermal: I am observing the temperatures during the stress test.
> > If I am correct in reading Smbusmaster0, temps haven't been above
> > 71°C, but the error also occurs earlier, way below 70.
>
>
> Okay.
>
>
> What is your CPU thermal solution?
>
What is a thermal solution?
>
> What stresstest are you using?
>
stress running in s-tui
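
Since the s-tui display is lost when the box hangs, a small logger that
forces every sample to disk could preserve the last temperatures before
the lockup. This is only a sketch: it assumes the sensors are exposed
under /sys/class/hwmon as temp*_input files in millidegrees Celsius,
and the function names and the log file name are made up.

```python
import glob
import os
import time

HWMON_PATTERN = "/sys/class/hwmon/hwmon*/temp*_input"  # assumed sensor location


def read_temps(pattern=HWMON_PATTERN):
    """Return {sysfs path: degrees Celsius} for every readable sensor."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0  # millidegrees
        except (OSError, ValueError):
            continue  # skip sensors that refuse reads or return garbage
    return temps


def log_temps(logfile, interval_s=1.0, samples=None, pattern=HWMON_PATTERN):
    """Append one line per sample and force it to disk before sleeping."""
    taken = 0
    with open(logfile, "a") as out:
        while samples is None or taken < samples:
            fields = []
            for path, temp in read_temps(pattern).items():
                chip = os.path.basename(os.path.dirname(path))
                fields.append(f"{chip}/{os.path.basename(path)}={temp:.1f}")
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write(f"{stamp} {' '.join(fields)}\n")
            out.flush()
            os.fsync(out.fileno())  # so the last sample survives a hard reset
            taken += 1
            time.sleep(interval_s)
```

Calling log_temps("temps.log") in a second terminal while stress runs
would log until interrupted; after the reset, the tail of the file
shows the last readings.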
>
> > > Have you tested the power supply recently?
>
It was working before without issues, so not explicitly tested.
>
> I suffered a rash of bad PSUs recently. I was able to figure it out
> because I bought an inexpensive PSU tester years ago. It has saved my
> sanity more than once. I suggest that you buy something like it:
>
> https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=antec+atx12+tester&_sacat=0
>
I am not building regularly, so I would need to borrow such equipment
somewhere.
> > > Have you tested the memory recently?
>
> > - Memory: Yes was tested right after the build with no errors
>
>
> Okay.
>
>
> Did you do multi-threaded/stress tests?
>
Yes, stress is running multiple threads. So far it has only been stable
with 2 threads. However, it takes longer for the errors to come up when
using fewer threads, so it might be that I did not test long enough.
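
To test this more systematically, the worker count could be stepped up
with a fixed, long duration per step. A minimal sketch, assuming the
stress tool is on the PATH; the function names are made up, and the 30
minute default is my guess based on the 5-20 minutes it takes for the
error to show up.

```python
import subprocess


def stress_cmd(workers, seconds):
    """Build a stress command line: N CPU workers for a fixed time."""
    return ["stress", "--cpu", str(workers), "--timeout", str(seconds)]


def step_up(max_workers=12, seconds=1800, runner=subprocess.run):
    """Run stress with 2, 4, ... workers; return the counts that completed."""
    completed = []
    for workers in range(2, max_workers + 1, 2):
        # Announce the step first: if the machine hangs, this is the last
        # line visible on the console or SSH session.
        print(f"starting {workers} workers for {seconds} s", flush=True)
        result = runner(stress_cmd(workers, seconds))
        if result.returncode != 0:
            break
        completed.append(workers)
    return completed
```

Calling step_up() would take up to three hours in total. If the system
locks up, the last "starting" line tells which worker count triggered
it.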
>
> > > Are you running Debian stable?
> > >
> > >
> > > Are you running Debian stable packages only? Were they all
> > > installed with the same package manager?
I have docker and log2ram as additional sources, and now debmatic.
>
> > - OS: I was running Debian stable in quite a minimal configuration
> > (fresh install as most services are dockerized) when I first
> > observed the error. I have now moved to Debian 12/Bookworm to see
> > if a newer kernel makes any difference (it does not). I also
> > exchanged r8169 for r8168. That changes the error messages, but the
> > system instability stays.
>
>
> Did you see the problems when running Debian stable OOTB, before
> adding anything?
I would need to do this with a LiveUSB to have it run OOTB.
>
>
> Did you stress test the system before adding anything (other than the
> stress test)?
No, I did the basic setup of my system first, then encountered the
error. I will try with a LiveUSB.
>
>
> > > If all of the above are okay and the system is still locking up,
> > > I would disable or remove all disks in the system, install a
> > > zeroed SSD, install Debian stable choosing only "SSH server" and
> > > "standard system utilities", install only the stable packages
> > > required for your workload, put the workload on it, and see what
> > > happens.
>
> > I could disconnect the disks and see if it makes any difference.
> > However, when reproducing this error, disks other than the system
> > disk were unmounted. So I would guess this would be a test to see
> > if it is about power?
>
>
> Stripping the system down to minimum hardware and software is a good
> starting point. You will need a tool to load the system and some
> means to watch what happens. Assuming the base configuration passes
> all tests, then add something, test, and repeat until testing fails.
>
>
> Here is a Perl script I wrote for loading the CPU. It should run on
> a base install of Debian OOTB:
>
> 2023-05-21 02:24:44 dpchrist@taz ~/home
> $ cat exercise-cpu
> #!/usr/bin/env perl
> # $Id: exercise-cpu,v 1.1 2023/04/10 02:05:22 dpchrist Exp $
> # by David Paul Christensen dpchrist@holgerdanske.com
> # Public Domain
> #
> # Exercise central processing unit
>
> use threads;
> use strict;
> use warnings;
>
> use File::Basename;
> use Time::HiRes qw( sleep time );
>
> die sprintf "Usage: %s PERCENT DURATION\n", basename($0)
>     unless @ARGV == 2;
>
> my $a = 0.01 * shift;           # periodic exercise duration
> my $b = 1 - $a;                 # periodic sleep duration
>
> $_ = qx/lscpu/;                 # Debian GNU/Linux
> my ($c) = /CPU.s.:\s+(\d+)/;    # number of virtual cores
>
> my $e = time + shift;           # time to end
>
> my @thr;                        # threads
>
> push @thr, async {
>     while (time < $e) {
>         my $d = time + $a / 10;
>         1 while time < $d;
>         sleep $b/10;
>     }
> } for 1..$c;
>
> $_->join for @thr;
>
>
> Run it like this:
>
> 2023-05-21 02:50:06 dpchrist@taz ~/home
> $ ./exercise-cpu
> Usage: exercise-cpu PERCENT DURATION
>
> 2023-05-21 02:50:52 dpchrist@taz ~/home
> $ ./exercise-cpu 25 10
>
> 2023-05-21 02:51:33 dpchrist@taz ~/home
> $ ./exercise-cpu 50 10
>
> 2023-05-21 02:51:48 dpchrist@taz ~/home
> $ ./exercise-cpu 75 10
>
> 2023-05-21 02:52:01 dpchrist@taz ~/home
> $ ./exercise-cpu 100 10
>
>
> I install Xfce when installing Debian and use the Xfce plugins to
> watch CPU loading and CPU temperature. The above tests loaded all
> virtual cores at the specified percentage for the specified duration.
> CPU temperature peaked at 32 C, 38 C, 66 C and 72 C, respectively.
>
>
> Having a Debian install on a USB 3.0 flash drive is very useful for
> trouble-shooting and for imaging, backup/restore, archiving,
> integrity checking, migration, validation, etc.
>
As said above, I will try with a LiveUSB.
>
> David
>
>