
Re: System unusably slow after Debian upgrade.



Hi there,

Thanks to Dan and Greg (again), Stefan, Tixy and Thomas, and as before,
keep those ideas coming.  I'm still baffled, but we'll get there.  I have to
say that I'm well impressed by the quality of all the responses, and
even where some of the suggestions have been to try things that I've
already done, please don't think that I'm not grateful for the ideas.
I am.  Very grateful indeed, to all of you.  Being on the digest list
it's a little more difficult to be sure that I haven't missed any of
the replies; I've tried to make sure that I've responded to everyone
but if I have missed anything, or perhaps attributed something to the
wrong author, please don't hesitate to pipe up.

======================================================================
On Fri, Feb 28, 2020,  Greg Wooledge wrote:
On Thu, Feb 27, 2020, G.W. Haywood wrote:
> On Thu, 27 Feb 2020, Dan Ritter wrote:
> > Go to /etc/nsswitch.conf
> > If these lines look like this
> >
> > passwd: compat systemd
> > group:          compat systemd
> > shadow:         compat systemd
> >
> > remove the systemd references.
> >
> > If performance improves immensely immediately after the edit,
> > that was the problem.
>
> I'm no lover of systemd, but as with the graphics drivers it's hard to
> imagine how the Name Service Switch could affect the time taken to
> gzip a file.  Nevertheless I gave it a shot.  The 'passwd' and 'group'
> entries were as you described but the 'shadow' entry was not.  I've
> removed the two 'systemd' references, and at least remotely from the
> command line it doesn't look like it's helped - see the timings below,

I'd never heard of this particular "issue" either.  My buster
workstation has no performance problems, and my nsswitch.conf
looks like this (minus the comments):
...

Unsurprisingly your version is identical with my backup copy.  The
change suggested by Dan is still in there and seems to have had no
effect at all.
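
For anyone who wants to try the same edit, it boils down to something
like this.  A sketch only: the sed line assumes 'systemd' is the last
word on the lines in question, and the test file name is just an
example, so check your own nsswitch.conf first and keep the .bak copy:

Farm-1:~# >>> grep -E '^(passwd|group|shadow):' /etc/nsswitch.conf
Farm-1:~# >>> sed -i.bak -E 's/[[:space:]]+systemd[[:space:]]*$//' /etc/nsswitch.conf
Farm-1:~# >>> time gzip -9 < /some/test/file > /dev/null   # repeat the timing test afterwards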

======================================================================
On Fri, Feb 28, 2020,  Greg Wooledge wrote:

Are there any error messages *anywhere* ...

Oh, how I wish!

Obvious places to look would be dmesg, journalctl, and whichever
file(s) in /var/log/ are most recently modified.

Yeah, I looked.  The only remotely relevant thing in dmesg on the
problem machine is

[  555.366313] perf: interrupt took too long (2547 > 2500), lowering kernel.perf_event_max_sample_rate to 78500
[  734.905116] perf: interrupt took too long (3196 > 3183), lowering kernel.perf_event_max_sample_rate to 62500
[ 1008.832272] perf: interrupt took too long (3998 > 3995), lowering kernel.perf_event_max_sample_rate to 50000
[ 1471.380189] perf: interrupt took too long (4998 > 4997), lowering kernel.perf_event_max_sample_rate to 40000
[ 2451.380812] perf: interrupt took too long (6248 > 6247), lowering kernel.perf_event_max_sample_rate to 32000

but I see those on the non-problem machine too:

[ 7756.908521] perf: interrupt took too long (2515 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 9979.598528] perf: interrupt took too long (5032 > 5000), lowering kernel.perf_event_max_sample_rate to 25000

The five messages in the problem machine's dmesg above are literally
the only messages since the few at boot, which was just shy of three
days ago.  There are some USB connect/disconnect messages in the non-
problem box dmesg (as expected, because the users are using it).
Nothing else.
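
Those perf messages just mean the kernel is backing off its own
sampling rate; if anyone wants to look at or reset the value, it's an
ordinary sysctl.  A sketch - 100000 is the usual default as far as I
know:

Farm-1:~# >>> sysctl kernel.perf_event_max_sample_rate
Farm-1:~# >>> sysctl -w kernel.perf_event_max_sample_rate=100000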

Yesterday, more out of desperation than anything else, I moved the
Postgres database (mainly the security camera stuff) from Postgres 9.4
to the version 11 instance which Buster installed and started.  There
were in fact three instances of Postgres running (9.4, 9.6 and 11.7)
after the two jumps from Jessie to Stretch to Buster.  Although two of
them weren't doing anything, so just a few extra sleeping processes
and some RAM (of which there's more than enough), I wondered if it
might be an issue.  It wasn't.  There's just the one Postgres 11.7
instance running now.
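
For what it's worth, Debian's postgresql-common tools show which
clusters an upgrade has left lying around and can drop the idle ones.
Roughly what I mean is below - the '9.4 main' and '9.6 main' names are
only examples, so check what each cluster actually holds before
dropping anything:

Farm-1:~# >>> pg_lsclusters                  # version, cluster, port, status, data directory
Farm-1:~# >>> pg_dropcluster --stop 9.4 main
Farm-1:~# >>> pg_dropcluster --stop 9.6 main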

It would also be good to look after the basics, like running "uptime"
to check the load average, "top" to see if there are processes running
amok, "df" to see if a file system is unexpectedly full, and so on.

As I mentioned in my OP I run Nagios/Icinga routinely.  It will send
me an email if it sees anything untoward on any of the machines for
which I'm responsible - there are too many to look at them ad hoc -
and in a couple of seconds I can flip to a tab on one of my browser
windows and see an idiot-light style list of most of the interesting
metrics and processes on all the machines.  It's a list of several
hundred items, so scrolling through it is a routine morning chore even
if I haven't had any overnight Nagios emails.  One click gives me
graphs for each metric for (at least) a day, a week, a month and a
year.  If Nagios hadn't existed I'd have had to write it; I can't tell
you how much effort it saves.  Lots.

Load average is one of the routinely graphed metrics for all machines,
and there's nothing exceptional on the problem machine, although as I
said the averages are a little elevated, which isn't surprising given
the problems with the box.  Other metrics include disc I/O, free space
and temperatures; CPU load (as distinct from the usual 1m, 5m and 15m
load averages) and temperature; free RAM and swap; LAN traffic; SSH
connection response time; and the system time offset from the time on
my local time servers, which is typically no more than 20ms.  All the
numeric values are graphed, and generally if a box gets into trouble
that last figure will go off the chart (full scale on the chart is at
25ms, which gives you an idea of how fussy I am about the system times
on my boxes).  This one seems just a little higher and just a little
more variable than its partner.  Nothing remarkable.
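
If anyone wants to spot-check that offset without a monitoring
system, a one-shot query against a time server does it.  A sketch -
'ntp.local' stands in for whatever server you use, and it assumes
ntpdate (or chrony) is installed:

Farm-1:~# >>> ntpdate -q ntp.local           # query only, doesn't step the clock
Farm-1:~# >>> chronyc tracking               # if the box runs chrony instead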

======================================================================
On Fri, 28 Feb 2020 Stefan Monnier wrote:
Greg Wooledge wrote:
> It would also be good to look after the basics, like running "uptime"
> to check the load average, "top" to see if there are processes running
> amok, "df" to see if a file system is unexpectedly full, and so on.

Yes, I'd recommend running `atop` on both machines during your test to
try and see if something jumps out.

Yes - all good advice, thanks - but see above.  One of my Nagios
plugins uses 'atop' for some of the monitoring, and 'atop' runs
continuously on most of the machines.  Atop logs data to files in
/var/log/atop/, from where the plugins can take their data:

Farm-1:~# >>> ps -ef | grep atop
root       482     1  0 Feb26 ?        00:00:20 /usr/sbin/atopacctd
root      3711     1  0 00:00 ?        00:02:52 /usr/bin/atop -R -w /var/log/atop/atop_20200229 600
root      6408 20203  0 11:37 pts/0    00:00:00 grep atop
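
If anyone wants to dig into those raw logs by hand rather than via a
plugin, atop can replay a recorded file for a chosen window.  A sketch
- the file name is just today's and the times are arbitrary:

Farm-1:~# >>> atop -r /var/log/atop/atop_20200229 -b 10:00 -e 11:00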

======================================================================
G.W. Haywood wrote:
On Fri, 28 Feb 2020, Dan Ritter wrote:

> A ridiculously decelerated gzip is evidence of one of the
> following:
>
> - CPU throttling
> - disk errors
> - something interfering with the disk reading or writing

You've probably seen my reply to your first by now; it seems that disc
access problems can be ruled out if 'rm /file/in/ram' is ridiculously
slow too!  But I hadn't thought that something might be throttling the
CPU by accident, so I need to look into that too, thanks.
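
To make the disc-versus-CPU distinction explicit, the sort of test I
mean runs entirely in RAM, along these lines.  Only a sketch - the
mount point and sizes are arbitrary:

Farm-1:~# >>> mkdir -p /mnt/ramtest && mount -t tmpfs -o size=256m tmpfs /mnt/ramtest
Farm-1:~# >>> dd if=/dev/urandom of=/mnt/ramtest/blob bs=1M count=100
Farm-1:~# >>> time gzip /mnt/ramtest/blob      # no disc involved at any point
Farm-1:~# >>> time rm /mnt/ramtest/blob.gz
Farm-1:~# >>> umount /mnt/ramtest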

I just installed cpufreq-info on the machines (the newer package which
is meant to replace it isn't available on Jessie) and it seems to be
telling me that both CPUs are running at the same speed.  The only
notable difference is in what it calls "maximum transition latency",
which I take to mean the time taken by the CPU to respond to an
instruction from software telling it to increase or decrease the CPU
clock frequency.  But both clocks seem to be running at the maximum of
1.47GHz, so I think that lets CPU throttling off the hook, at least
for the moment.  If anyone has seen this odd several-second delay in
the response to a CPU frequency change instruction before, I'm all
ears.  It might lead somewhere.

----------------------------------------------------------------------
Farm-2:~# >>> cpufreq-info
  driver: intel_pstate
  maximum transition latency: 0.97 ms.
  hardware limits: 533 MHz - 1.47 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 533 MHz and 1.47 GHz.
    The governor "powersave" may decide which speed to use...
  current CPU frequency is 1.47 GHz (asserted by call to hardware).
----------------------------------------------------------------------
Farm-1:/etc# >>> cpufreq-info
  driver: intel_pstate
  maximum transition latency: 4294.55 ms.
  hardware limits: 533 MHz - 1.47 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 533 MHz and 1.47 GHz.
    The governor "powersave" may decide which speed to use...
  current CPU frequency is 1.47 GHz.
----------------------------------------------------------------------
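
For a second opinion, independent of cpufreq-info, the kernel exposes
the same numbers under sysfs.  A sketch - cpu0 is the only core on
these boxes anyway:

Farm-1:~# >>> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency, kHz
Farm-1:~# >>> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
Farm-1:~# >>> grep MHz /proc/cpuinfo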

======================================================================
On Fri, 28 Feb 2020 Thomas Schmitt <scdbackup@gmx.net> wrote:
G.W. Haywood wrote:
> The system load averages are elevated to an extent,
> but 'top' doesn't show any particular processes hogging CPU.

If top does not show processes which cause visible high overall CPU load,
then this might indicate many short running processes.

You could estimate the number of processes created in a second by a
bash run like

  ( echo $BASHPID ) ; sleep 1 ; ( echo $BASHPID )

which gives me e.g.

  4482
  4485

and thus indicates that my machine does not busily fork processes.

As mentioned, no obvious problem there.  It's just like the machine is
suddenly being powered by an 8080 instead of an E3815...
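
For completeness, another way to estimate the fork rate Thomas
mentions is the kernel's running total in /proc/stat.  A sketch - the
ten-second window is arbitrary:

  p0=$(awk '/^processes/ {print $2}' /proc/stat)
  sleep 10
  p1=$(awk '/^processes/ {print $2}' /proc/stat)
  echo "$(( (p1 - p0) / 10 )) processes created per second"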

> Intel E3815

A single core CPU. That's unusual in our time.

:)

======================================================================
On Fri, 28 Feb 2020 Greg Wooledge wrote:
On Fri, Feb 28, 2020 at 01:55:12PM -0600, David Wright wrote:
> Like G.W. Haywood, I run fvwm with
All of the cool people do!

I'm waaaay too old to be cool... :(

--

73,
Ged.

