Re: Logging question
On 2012-04-27 at 21:53 -0700, cletusjenkins wrote:
(resending to the list)
> ---- On Fri, 27 Apr 2012 09:06:28 -0700 Camaleón wrote ----
> >On Thu, 26 Apr 2012 16:06:15 -0700, cletusjenkins wrote:
> >> I have a machine that is locking up every few days. It doesn't seem to
> >> be doing much when it happens, nor do I see anything in the syslog or
> >> messages files. Is there any way to enable extra logging to try to catch
> >> what is going wrong? Thanks.
> >The lock could come from different sources, either software based locks
> >(X server, kernel soft/hard lock...) or hardware ones (a device failure,
> >such as bad RAM, CPU overheating, a problem with the power supply, a
> >hard disk issue...).
> >I would start by ruling out X first (of course, if you are not running an
> >X server there's no need to try this ;-) ), so can you "ssh" to the
> >machine when it freezes?
> No, I can't ssh to it once it occurs.
Then it is a hard lock :-/
> I'll see if I can reproduce it without X running.
You can try it, but if it were X crashing you would still be able to
log in over SSH, which does not seem to be the case.
> It's a desktop, so it was always logged in when it occurs. But the
> failure seems to occur at night when no one is
> actively using it (but left logged on). I have triggered it under load,
> say when copying several GB of files over the network or even from
> one disk to another.
That is interesting. The fact that the system becomes unstable when
running intensive tasks may point to a software-based problem rather
than a hardware one.
To rule out a problem involving the hard disk buses and the NIC, have
you tried stressing the system in a way that neither uses the NIC nor
copies files from one disk to another? I mean something such as
compiling a kernel or "tar-ing" big files located on the same disk,
just to check whether the system still locks up in that situation.
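If you prefer something scripted, here is a rough sketch of that kind of single-disk test in Python. It is only an illustration: the function name, sizes and pass count are made up by me, and it is not a substitute for a real stress tool. It writes one large file, then re-reads and hashes it several times, so the load stays on one disk and the CPU and never touches the network.

```python
import hashlib
import os
import tempfile


def stress_single_disk(directory, size_mb=1024, passes=5, chunk_mb=1):
    """Write one file into `directory`, then re-read and hash it `passes` times.

    Hypothetical helper: everything stays on a single disk, no NIC involved.
    Returns one SHA-256 digest per pass; on a healthy system they all match.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory, prefix="stress-")
    digests = []
    try:
        with os.fdopen(fd, "wb") as out:
            for _ in range(size_mb // chunk_mb):
                out.write(chunk)
            out.flush()
            os.fsync(out.fileno())  # push the data out to the disk
        for _ in range(passes):
            digest = hashlib.sha256()
            with open(path, "rb") as src:
                # Note: reads may come from the page cache unless the
                # file is larger than the installed RAM.
                for block in iter(lambda: src.read(len(chunk)), b""):
                    digest.update(block)
            digests.append(digest.hexdigest())
    finally:
        os.remove(path)
    return digests
```

Calling it on a directory of the suspect disk with a size larger than your RAM, and checking that all returned digests are equal, gives a crude load test. If the machine locks up during it, the NIC and disk-to-disk copies are probably not the cause.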
> I did find a problem where PCI slot 3 shares a DMA
> with the IDE controller, the NIC was in that slot. It is a 3com 3905B
> which is supposed to be able to share DMAs (and so does the
> controller), but after taking the card out the number of lockups went
> down, but still occur. Occasionally when it locks up I can still move
> the mouse and even type commands into an xterm, but if you do anything
> that hits the harddrive it locks up totally. At least once I was able
> to enter a shutdown command that worked, but usually it locks up before
> that happens.
> I replaced the disks and cables, same problem. I moved the OS disk to
> another controller and it still locks up (eventually). I can do a
> fresh install of Debian without any lockups. I even took all the
> drives off the motherboard's controllers, disabled the controller in
> the BIOS and used a disk/cable along with a PCI IDE card that worked
> in a spare machine. Still it eventually locked up.
> I just don't see anything in the logs now. Before I found the
> NIC/controller DMA issue, I would see a DMA timeout in the logs (the
> last entry before the machine was reset).
A hardware problem does not tend to leave any trace in the logs, which
makes it harder to debug, but the fact that the system runs fine when no
intensive tasks are running does not point to a hardware fault :-?
What you can try in the meantime is to log whatever is available by
sending it out to a second computer. You can follow the
instructions given here:
Debugging system freezes
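
On the second computer something has to receive and store those messages. As a rough sketch only, and not taken from the instructions above: assuming the failing machine is set up to send its log messages as UDP datagrams (the kernel's netconsole module or a syslog forwarding rule can do that), a small Python listener like this would append each one to a file. The function names, port 6666 and the file name are my own placeholders.

```python
import socket
import time


def open_listener(port=6666, host="0.0.0.0"):
    """Open a UDP socket on the second computer (port is an assumed value)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_logs(sock, logfile, max_messages=None):
    """Append every datagram received on `sock` to `logfile`.

    Each line is flushed right away, so the last messages sent before a
    freeze are already stored on this machine. Runs forever unless
    `max_messages` is given. Returns the number of messages written.
    """
    count = 0
    with open(logfile, "a", encoding="utf-8") as out:
        while max_messages is None or count < max_messages:
            data, (sender, _port) = sock.recvfrom(65535)
            text = data.decode("utf-8", errors="replace").rstrip("\n")
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write(f"{stamp} {sender} {text}\n")
            out.flush()
            count += 1
    return count
```

Start it with something like `receive_logs(open_listener(), "remote-freeze.log")` and leave it running. After the next lockup, the tail of that file should show the last thing the frozen machine managed to send.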