
crashes without trace--module problem?



I have a new system running Lenny, amd64 architecture and 8 core Xeon
chips.  It has been crashing regularly, often after less than 24 hours
uptime.

There are indications the problem might be related to the ath5k wireless
driver; details below.  A Google search for "ath5k oops" returns many
hits, but they seem to concern errors while loading or unloading the
driver.

Could the kernel or Debian be loading or unloading this, or any other,
driver without user intervention?
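One way to check this might be to snapshot /proc/modules periodically
and compare the snapshots.  A minimal Python sketch, assuming the usual
/proc/modules layout where the module name is the first field on each
line; the function names are made up for illustration:

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # Assumption: the first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def module_changes(before, after):
    """Return (newly_loaded, newly_unloaded) as sorted lists of module names."""
    return sorted(after - before), sorted(before - after)
```

Reading /proc/modules every minute or so, passing consecutive snapshots
to module_changes, and sending any non-empty result to the remote log
host should show whether ath5k or any other module is loaded or unloaded
without user action.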

The crashes are particularly frustrating because there is generally no
indication in the logs of their cause.  I have routed the logs to
another machine, but they still don't show anything.  We've been
unable to get a serial console working.

The system ran memtest86+ without error for several days; we pulled the
disks and put on a different OS (CentOS?) and that ran for a couple of
days too.

Any other ideas about what could be causing this, or at least how we
could get debug information?
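One thing that might at least narrow down the time of each crash is a
heartbeat file that is forced to disk on every write.  A rough Python
sketch, assuming the load figures come from the first three fields of
/proc/loadavg; the function names and the log path are placeholders,
not anything already on the system:

```python
import os
import time


def heartbeat_line(now, loadavg_text):
    """Format one heartbeat record from a Unix time and /proc/loadavg text."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    # Assumption: the first three fields are the 1, 5 and 15 minute load averages.
    load = " ".join(loadavg_text.split()[:3])
    return "%s UTC load=%s\n" % (stamp, load)


def write_heartbeat(path, line):
    """Append a record and fsync it so it is on disk before a hard hang."""
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())
```

Calling write_heartbeat from a loop or a cron entry once a minute would
leave the last record within about a minute of the hang, which could
then be compared against the times at which scheduled jobs start.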

We suspected some anacron-triggered job might be causing trouble.  Is
there a way to find out which jobs will be run when, or at least to get
them logged when they start?
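As a starting point, the jobs anacron knows about can be listed by
reading /etc/anacrontab.  A small Python sketch, assuming the documented
"period delay job-identifier command" line format; parse_anacrontab is a
name invented here, and lines that do not match (comments, environment
assignments) are simply skipped:

```python
def parse_anacrontab(text):
    """Return a list of job dicts parsed from anacrontab-style text."""
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # too short to be a job line, e.g. SHELL=/bin/sh
        period, delay, job_id, command = fields
        # Period is a number of days or a name such as @monthly.
        if not (period.isdigit() or period.startswith("@")):
            continue
        if not delay.isdigit():
            continue
        jobs.append({"period": period, "delay_minutes": int(delay),
                     "id": job_id, "command": command})
    return jobs
```

Printing the result for the real file shows each job's period in days
and its delay in minutes after anacron starts; it does not by itself
say when a job last ran.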


WIRELESS DETAILS

The system has a wireless card because of local network policies; it
also has ethernet.  The best evidence that the driver is the culprit is
that since we blacklisted ath5k and removed it with modprobe -r almost
5 days ago, the system has stayed up.

We tried this because once, but only once, the logs showed that driver
crashing 10--20 minutes before a system crash. Our wireless network has a
password, and I have yet to configure my machine to use it.  For both
these reasons it seemed unlikely that the wireless was the culprit, but
we were out of ideas.

CRASH DETAILS

By crashing I mean that when I come in the power is on but the screen is
black (I think not getting a signal); there's no response to the
keyboard or mouse (including VT switching or restart sequences); and it
can't be reached over the network.  The system needs a hardware reset.

OTHER SUSPECTS

The disk setup is complex.  There are 2 identical hard drives, with 3
partitions each.  The first partitions of the two drives are combined
with software RAID 1, as are the third partitions.  The result is 2
separate md devices.  The second partitions are non-RAID swap.  The md
device from the third partitions is under LVM, and some of the LVM
volumes are LUKS password encrypted.  Swap is encrypted with a random
key.  One swap partition was initially set up as LUKS encrypted, a
mistake I later fixed.

The partitions were already on the disks; otherwise setup was through
the released Lenny installer.

