
Re: How to diagnose kernel panic?



On Sun, Jul 09, 2006 at 03:56:12PM -0400, Mark Copper wrote:
> I have a server that is brought down by a kernel panic every two weeks
> on average.  Nothing untoward gets in the logs and the on-screen panic
> message starts with something like
>    Kernel panic - not syncing: Fatal exception in interrupt
>    
>    Call trace:
>    [<c026bc42>] scsi_request_fn+0xf6/0x294
> I wasn't able to get any more at the data center...
> 
> So I brought the machine home and am running folding@home on it and so
> far I have not been able to induce the panic.

There was something unique about what the machine was doing previously
that caused this. You are not doing whatever that is now (folding@home
mostly loads the CPU, not the disks) and thus are not inducing the
error.

>  The replacement machine
> is similar, but not identical.  The main difference being a switch from
> software to hardware RAID1.  Also, the new machine, except for the
> hardware driver, uses stable while the problematic machine uses testing.
> And the replacement has run so far without problem.

Well, the call trace above points to a disk problem, and you've changed
the disk setup in the new machine by putting a piece of hardware
between the disks and the motherboard, so your problem may be gone
because of that. It's unclear what exactly you've done here. Does the
new machine use the old disks through the hardware RAID, or are you
dealing with all new disks? Either way, you've changed a lot from the
old machine to the new one, and it's not surprising that you've
eliminated the problem as a result.

> 
> The only other thing I can add is that the bad machine would seem to
> start getting "sluggish" before it froze, but for the life of me, I
> couldn't see why.


Maybe the kernel was repeatedly retrying some disk operation that
failed, which used up CPU time and caused the sluggish behaviour?
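
One way to check that guess the next time the box turns sluggish is to
look at how much time the CPUs spend waiting on I/O. Here is a minimal
Python sketch, assuming a 2.6 kernel where the aggregate cpu line in
/proc/stat carries an iowait column. The function names are my own,
nothing standard, and the 5 second interval is arbitrary.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat on Linux 2.6 and later.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not the aggregate cpu line: %r" % line)
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def busy_shares(before, after):
    """Return each field's share of the ticks elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: deltas[name] / total for name in FIELDS}


def read_cpu_sample(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as stat:
        return parse_cpu_line(stat.readline())


def report(interval=5):
    """Sample /proc/stat twice and print how the elapsed ticks were spent."""
    first = read_cpu_sample()
    time.sleep(interval)  # sampling interval in seconds, chosen arbitrarily
    shares = busy_shares(first, read_cpu_sample())
    for name in FIELDS:
        print("%-8s %5.1f%%" % (name, 100 * shares[name]))
```

Calling report() while the machine is sluggish should show where the
time goes. A large iowait share with little user or system time would
suggest the box is stuck waiting on the disks rather than burning CPU.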
> 

> I am posting because I'm hopeful that list participants might have
> suggestions how I might start to chase down or, better yet, eliminate
> this problem.

Can you reproduce the exact setup that was causing problems before,
including the usage levels?
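
If the real workload is hard to recreate, a crude stand-in is to keep
the disks busy and see whether the panic comes back. The following is
only a sketch under my own assumptions: exercise_disk and its
parameters are made-up names, and the sizes are arbitrary. It writes
random data, asks the kernel to flush it, reads it back and compares
checksums.

```python
import hashlib
import os


def exercise_disk(path, size_mb=64, passes=1, chunk_kb=1024):
    """Write random data to `path`, flush it, read it back and compare.

    Returns the number of passes whose read-back checksum did not match.
    """
    chunk_size = chunk_kb * 1024
    chunks = (size_mb * 1024 * 1024) // chunk_size
    mismatches = 0
    for _pass in range(passes):
        written = hashlib.sha256()
        with open(path, "wb") as out:
            for _chunk in range(chunks):
                block = os.urandom(chunk_size)
                written.update(block)
                out.write(block)
            out.flush()
            os.fsync(out.fileno())  # ask the kernel to flush the file to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as src:
            while True:
                block = src.read(chunk_size)
                if not block:
                    break
                read_back.update(block)
        if written.digest() != read_back.digest():
            mismatches += 1
    os.remove(path)  # clean up the scratch file
    return mismatches
```

For example, exercise_disk("/some/scratch/file", size_mb=4096,
passes=100) on the software RAID1 volume. Note that the read-back will
mostly be served from the page cache unless the file is larger than
RAM, so this loads the write path much harder than the read path.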

A
