
Re: Lost interrupt, page allocation failure, and kernel oops



> > Bad RAM, perhaps? Or other hardware dying?
>
> The harddrives themselves are fine: they are less than two
> months old and smartmontools' smartctl reports no errors
> at all.

The BadCRC seems to indicate otherwise, though that error should also
show up in the disk's SMART log.

> As to RAM, how can I test it? http://www.memtest86.com/ seems
> to be for Intel architectures only.

I wish I knew.

> I would also be glad for more background information: what
> does a lost interrupt mean? What is a order-0 page allocation

Lost interrupt means just that: the IDE driver posted a read or write
command and expected to get notified at completion by an interrupt. That
interrupt never came in time, and the drive status seems to indicate the
drive mechanism positioned the heads OK ("SeekComplete") but failed to
retrieve the data properly (bad sector or block checksum, CRC).
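
To make the "lost interrupt" idea concrete, here is a rough user-space
analogy in Python. It is not the IDE driver's code: the function name, the
timer standing in for the drive, and the timeout values are all made up
for illustration. It only shows the pattern of posting a command and then
waiting a bounded time for the completion notification.

```python
import threading


def issue_command(complete_after, timeout):
    """Post a fake command and wait for its completion notification.

    complete_after: seconds until the pretend drive signals completion.
    timeout: how long the pretend driver is willing to wait.
    Both are hypothetical parameters for this sketch.
    """
    done = threading.Event()

    # The timer stands in for the drive raising its interrupt.
    timer = threading.Timer(complete_after, done.set)
    timer.daemon = True
    timer.start()

    # wait() returns True if the event was set before the timeout expired.
    arrived = done.wait(timeout)
    timer.cancel()
    return "completed" if arrived else "lost interrupt"


if __name__ == "__main__":
    print(issue_command(0.01, 1.0))   # notification arrives in time
    print(issue_command(1.0, 0.05))   # notification comes too late
```

The second call gives up before the notification arrives, which is the
situation the driver reports as a lost interrupt.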

The failed order-0 allocation means either that the kernel has no more free
memory of that size (order 0 is a single page, 4k IIRC) or that the free
memory lists got corrupted (and the kernel should have noticed that). The page
allocation error happened in alloc_skb which was called by bmac_rx_intr,
so the kernel ran out of room receiving data from the network. Seems
non-fatal.
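
For what "order" means in numbers: the page allocator hands out blocks of
2^order contiguous pages, so order 0 is one page. A small sketch of the
arithmetic, assuming 4 KiB pages (the helper name is made up, and the page
size is an assumption that depends on the kernel configuration):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def block_size(order, page_size=PAGE_SIZE):
    """Size in bytes of an allocation of the given order (2**order pages)."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return page_size << order


if __name__ == "__main__":
    for order in range(4):
        print(f"order {order}: {block_size(order)} bytes")
```

So a failed order-0 request means the allocator could not come up with
even a single free page at that moment.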

The oops happened in do_select which, again, is the network stack - maybe
there still was a memory problem. Please note that the lost disk interrupt
happened several minutes earlier, so the disk error likely wasn't the
cause of the later oops. The signal 11 (segmentation fault) means a
pointer to memory wasn't really pointing at an existing (or permitted)
memory location. A user program doing this just gets itself killed. The
kernel, well, just kills itself there.
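
The user-space half of that can be tried safely. The sketch below assumes
a Linux (or other POSIX) system with a standard CPython: it starts a child
Python process that reads from address 0 via ctypes, then looks at how the
child died. The helper name is hypothetical.

```python
import signal
import subprocess
import sys

# Child program: read a C string from address 0, i.e. follow a pointer
# that does not point at any mapped memory.
CHILD = "import ctypes; ctypes.string_at(0)"


def run_bad_pointer():
    """Run the child and return its exit status as seen by subprocess."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # On POSIX a negative return code -N means "killed by signal N".
    return proc.returncode


if __name__ == "__main__":
    rc = run_bad_pointer()
    if rc < 0:
        print("child killed by", signal.Signals(-rc).name)
    else:
        print("child exited with status", rc)
```

Only the child dies; the parent carries on and can report the signal. The
kernel has no such parent to fall back on.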

> failure, and the mentioned kernel oops, is it more serious
> than its name "oops"?

It depends. Many times I see the kernel happily carry on after an oops
(2.6, that is; earlier kernels would panic right there). Accessing a bad
memory area while in the kernel still seems to be a Bad Thing (tm). The
kernel assumes that kernel code or data got corrupted by a kernel error,
and you would not want the kernel to run on, perhaps writing corrupt
buffered data destined for the disk back out.

> Also, does this look like something non-powerpc specific
> so that I should seek for help on another list, too?

It might all be caused by bad hardware, or it might be bad error handling
on behalf of the bmac driver (not a likely chance; I bet BenH crossed all
the i's and dotted the t's there). Both would be powerpc specific.

The bad data written to disk (duplicate blocks galore) suggests something
badly scribbled over kernel data. Should never happen except when the
hardware fails.

HTH,

	Michael


