
Re: Server REALLY slow after console messages



On Tue, Jun 27, 2006 at 05:24:02PM -0400, Carl Fink wrote:
> You're out of memory, just like the messages say.  Presumably some process
> on that server has used it all, including all your swap.  Eventually the
> process should be killed automatically or the program might segfault.  If
> you can get on as root and stay on long enough to type some commands, you
> could do:
> 
> 	dd if=/dev/zero of=/var/spool/swapfile bs=1024 count=262144
> 
> 	swapon /var/spool/swapfile

Realistically, this isn't likely to help...  He's already used up 5GB
of virtual memory -- 2GB of RAM and 3GB of swap space.  At that
point, the problem is that the system is thrashing the swap disk... that
is, it is trying to rapidly pull processes back from swap space as the
kernel switches context between all the runnable processes.

People still advocate having swap that's anywhere from 1.5 to 3 times
your physical RAM...  That made sense on ancient hardware with 8MB of
RAM, when memory was relatively a lot slower and way more expensive,
but I think on modern hardware, that idea is totally brain-dead.  Part
of the problem is that memory speeds have not kept up with CPU
improvements (so context switches kill you), but mostly I think it's
that memory is way, way faster than disk (especially as compared to 20
years ago), so virtual memory doesn't buy you as much as it used to on
paleolithic hardware.

If you're actively using 3GB of swap, there's no way your disks can
keep up with the CPU's context switches, and your system is dead in the
water (note: emphasis on ACTIVELY -- If you have a 3GB process swapped
to disk, but it's just sitting around doing nothing, it's not going to
kill your system... at least not until someone decides they need to
use it again).
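
To put a number on how much swap is in use, here's a minimal sketch that
reads it from /proc/meminfo.  The function name swap_used_kb is made up
for illustration, and it assumes a Linux system whose meminfo has the
usual SwapTotal and SwapFree lines (both in kB):

```shell
# Print the amount of swap currently in use, in kB.
# $1: meminfo-format file to read (defaults to /proc/meminfo; the
#     argument exists only so the function can be tried on a saved copy).
swap_used_kb() {
    awk '
        /^SwapTotal:/ { total = $2 }
        /^SwapFree:/  { free = $2 }
        END { print total - free }
    ' "${1:-/proc/meminfo}"
}
```

A big number here doesn't by itself mean you're thrashing (see the note
about ACTIVELY above).  In the output of vmstat, it's sustained nonzero
values in the si and so columns that indicate pages are really moving
to and from disk.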

The only real solution is to buy more RAM, particularly if this
problem keeps recurring.  Someone did suggest a memory leak, though...
and there's a real possibility that one of the processes (or more
than one) actually has one.  That's where getting output
from top while the system is thrashing would be useful.  It's
difficult to get given the state of the system, but totally necessary
to figure out what's really going on.  Steps that might help:

  1. Log in on the machine's console.  There's less work for the
     system to do, compared with logging in over the network, so
     logging in locally should be easier.

  2. Boost the priority of your shell (you must be root).  This
     command will do it (including the $$):

     # renice -20 $$

If the system is at all capable of being responsive, this should make
your shell usable.  The $$ is an automatic shell variable which
expands to the process id of your shell.  Here's an example.

First, let's show what the process id of my shell is:

[root@archonis ddm]# echo $$
13357

Now, notice the "NI" value for that PID in the output of ps -elf, below:

[root@archonis ddm]# ps -elf |egrep "$$|PPID"
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root     13357 13355  0  76   0 -  1389 wait   22:44 pts/5    00:00:00 -bash
4 R root     13439 13357  0  76   0 -  1415 -      22:47 pts/5    00:00:00 ps -elf

It's 0, which is the normal nice value for any process.  This means
the process has the default priority, same as every other normal
process on the system.  But by reducing the nice value, we increase
the priority.  Not exactly intuitive, I know... but just remember that
by reducing the NICE value, we are making our process "less nice" than
before.  :)

[root@archonis ddm]# renice -20 $$
13357: old priority 0, new priority -20

Now, notice the new NICE value in the output of ps:

[root@archonis ddm]# ps -elf |egrep "$$|PPID"
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root     13357 13355  0  60 -20 -  1389 wait   22:44 pts/5    00:00:00 -bash
4 R root     13460 13357  0  60 -20 -  1415 -      22:51 pts/5    00:00:00 ps -elf
0 R root     13461 13357  0  60 -20 -  1235 -      22:51 pts/5    00:00:00 egrep 13357|PPID


We've changed the nice value to -20, as low as it can go, i.e. it's
the "least nice" we can make our process.  You must be root to reduce
the nice value...  Regular users can only increase it.  The idea is
that a user running a long background job can make it be nice to
other users, but only root can make a process greedier.
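
For contrast, here's a small sketch of the direction a regular user IS
allowed to go: raising the nice value.  The wrapper name run_nicer is
hypothetical; it only calls the standard nice command, which, when run
with no arguments, prints the niceness it is running at:

```shell
# Run a command with its nice value raised by a positive adjustment.
# $1: adjustment to add to the current nice value (e.g. 5)
# remaining arguments: the command to run
run_nicer() {
    adj=$1
    shift
    nice -n "$adj" "$@"
}
```

From a shell at the default nice value of 0, "run_nicer 5 nice" should
print 5.  The result is capped at 19, the highest nice value there is.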

So, once you log in, make sure "renice -20 $$" is the first thing you
do.  After that, the system may respond better for you... but also
realize that all the other processes will run worse for everyone else.
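
Once the shell is usable, the next job is finding out who has the
memory.  Here's a rough sketch of one way to do that.  Note that this is
not top: it reads the per-process status files under /proc directly, and
it assumes a Linux kernel that reports VmRSS (and, on newer kernels,
VmSwap) there.  The function name mem_hogs and its arguments are made up
for illustration:

```shell
# List the processes holding the most resident memory, largest first.
# Output columns: RSS in kB, swap in kB, PID, first word of process name.
# $1: how many processes to show (default 10)
# $2: proc directory to scan (default /proc; overridable for testing)
mem_hogs() {
    count=${1:-10}
    procdir=${2:-/proc}
    for f in "$procdir"/[0-9]*/status; do
        [ -r "$f" ] || continue
        # Kernel threads have no VmRSS line and are skipped.
        # A missing VmSwap line is printed as 0.
        awk '
            /^Name:/   { name = $2 }
            /^Pid:/    { pid = $2 }
            /^VmRSS:/  { rss = $2 }
            /^VmSwap:/ { swap = $2 }
            END { if (rss != "") printf "%d %d %d %s\n", rss, swap, pid, name }
        ' "$f" 2>/dev/null
    done | sort -rn | head -n "$count"
}
```

Running "mem_hogs 5 > /root/hogs.txt" saves the five biggest processes
to a file (the path is just an example) so you can look at them after
the reboot.  If top itself will run, "top -b -n 1" gives you a single
batch-mode snapshot that can be redirected to a file the same way.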

If your system is thrashing like this, about the only solution is to
stop and restart processes (or just reboot)... but the above is meant
to give you a way to see WHY the system is falling over, so hopefully
you can do something to prevent it after you do finally reboot the
system. ;-)

-- 
Derek D. Martin
http://www.pizzashack.org/
GPG Key ID: 0x81CFE75D
