Re: random system hang

On Thu, June 16, 2005 15:56, Charles Leggett said:
> My dual opteron system hangs at random intervals. Sometimes it's stable
> for a week, sometimes it hangs after just a few hours. The symptoms
> are always the same - NONE. Carefully scanning the system logs shows
> absolutely nothing occurred to cause a hang. No kernel oopses, no error
> messages. It's been this way ever since I installed Debian 6 months ago -
> before that I was running CentOS, and it never died then, so I'm pretty
> sure it's not a hardware problem.

I don't really have a solution; this is more of a "me too" reply.

I have also been seeing seemingly random lockups on a dual Opteron system.
The screen freezes and there is no keyboard response. Once it has locked
up, however, I have been able to log in via ssh and look around, although
everything is extremely slow: most commands take 1-2 minutes to complete
(just typing "vim somefile" can take as long as 2 minutes), and logging
in via ssh sometimes takes 30 seconds or more.

During one of the lockups I started killing off processes and finally
determined that X would not stop when sent the HUP and SEGV signals.
Running "/etc/init.d/kdm stop" (I use KDE) ended in an error about the
X server not responding. Luckily for me, X did stop with a KILL signal
(-9), and I was then able to restart it with "/etc/init.d/kdm restart".
That brought the local console and the frozen screen back to normal, and
the machine operated normally from that point on.
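
For anyone who wants to script the "try a gentle signal first, fall back
to KILL" sequence described above, here is a minimal sketch in Python. It
is not what I actually ran (I did it by hand from the ssh session). The
function names, the default signal order (including TERM, which I did not
mention above) and the timeouts are all assumptions made for illustration.
Given how slow a locked-up machine is, the wait would probably need to be
much longer in practice.

```python
import os
import signal
import time


def pid_exists(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks, it delivers nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, but belongs to another user
    return True


def stop_with_escalation(pid,
                         signals=(signal.SIGHUP, signal.SIGTERM, signal.SIGKILL),
                         wait=5.0, poll=0.1, is_alive=pid_exists):
    """Send each signal in turn, giving the process `wait` seconds to exit.

    Returns the signal after which the process was gone, or None if it was
    not running to begin with. Raises TimeoutError if it survived them all.
    `is_alive` is a hypothetical hook so callers can supply their own check.
    """
    last_sent = None
    for sig in signals:
        if not is_alive(pid):
            return last_sent
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return last_sent  # exited between the check and the signal
        last_sent = sig
        deadline = time.monotonic() + wait
        while time.monotonic() < deadline:
            if not is_alive(pid):
                return sig
            time.sleep(poll)
    raise TimeoutError(f"process {pid} survived {[s.name for s in signals]}")
```

Called as stop_with_escalation(pid_of_x, wait=120), where pid_of_x is a
placeholder for whatever PID the X server has, it would only reach KILL
if the earlier signals were ignored, which is what I saw.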

There doesn't seem to be any visible connection between the lockups
apart from X and friends. The interval between them can be anywhere from
a few hours (rare) to a week. So far most of the lockups have happened
while I wasn't even in the office - one was in the middle of the night
and another was during the day when I was away.

I have been seeing weird behavior from xscreensaver, which could be the
root of the problem: sometimes it doesn't want to start; some
screensavers (especially the OpenGL ones) bleed onto the screen in
preview mode, forcing me to restart X to regain control; and
occasionally it doesn't want to stop on keyboard or mouse activity. I
have not yet tried running without xscreensaver, but if the lockups
continue I may try disabling it. I configured xscreensaver to use the
"slide show" screensaver and have only seen one lockup since (knock on
wood). The machine has never locked up while in use, only when idle.

This system has otherwise been rock solid. It runs everything extremely
well, including games like ut2004 (amd64) and doom3 (i386 chroot). When a
lockup happens there is nothing in any of the logs. I am running sid and
keep it up to date. This machine has been heavily tested with a wide range
of tasks (games, compiling, benchmarks, etc.) and none have shown signs of
any problems. I like to compile large packages, both for comparison with
other machines and to look for any signs of instability. Compiling
xserver-xfree86 (33 minutes) produced no errors or unusual behavior
(although I did not actually try using the resulting packages). Other
packages I have compiled built and ran fine (although none are as large
as X), so I'm leaning towards a problem with X or xscreensaver.

Hardware list (to look for any possible common connections):
2 x Opteron 252
Tyan S2875
2GB DDR400 (2 x 1GB)
SATA (one Seagate drive)
IDE (one Sony DVDRW)
EVGA NVIDIA GeForce 6800 Ultra 256MB (AGP 8x, FW, and SBA enabled)
no PCI devices installed

Right now the machine has been up 7 days without a lockup and I'll
continue to track it - but narrowing the problem down is difficult when
the lockups only happen about a week or two apart...
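
Since the logs show nothing, one idea for pinning down when a lockup
starts is a heartbeat: have cron (or similar) append an ISO 8601
timestamp to a file once a minute, then look for gaps afterwards. I have
not set this up yet, so the following Python is only a sketch. The
one-line-per-timestamp file format, the function name and the 30 second
slack are assumptions, not anything an existing tool produces.

```python
from datetime import datetime, timedelta


def find_gaps(lines, expected=timedelta(minutes=1),
              slack=timedelta(seconds=30)):
    """Return (earlier, later) pairs where consecutive heartbeats are
    further apart than expected + slack. Blank lines are ignored."""
    stamps = [datetime.fromisoformat(line.strip())
              for line in lines if line.strip()]
    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        if later - earlier > expected + slack:
            gaps.append((earlier, later))
    return gaps
```

Run over the heartbeat file after a lockup, each pair returned would
bracket a period in which the machine was too stalled to write its
timestamp, which could then be compared against what X and xscreensaver
were doing at that time.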


