
Re: My first Linux crash



Michael P. Soulier said:

> Would not desirable behaviour be to log as many errors as possible, but
> recover from the hardware problem? I see no reason why any software, user
> space or kernel space, should crash due to errors in a peripheral. Bad
> RAM is one thing, but errors on a CD? I disagree. This is incorrect
> behaviour for any OS.


to the system it's not an error on the CD. it's getting flooded with
I/O errors on the disk controller. the system usually tries to cope by
resetting the controller and trying again, but it reaches a point where
the controller is screwed and the system stops responding. this is beyond
the control of the software.
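
a rough sketch of that reset-and-retry idea, in Python rather than real
driver code. everything here (FlakyController, read_with_retry, the
max_resets limit) is made up for illustration and is not how any actual
kernel driver is written. it only shows the shape of the logic: retry
after a reset, and give up once the resets stop helping.

```python
class DeviceIOError(Exception):
    """Stand-in for an I/O error reported by a hypothetical disk controller."""


class FlakyController:
    """Toy controller: the first `bad_reads` reads fail, later ones succeed."""

    def __init__(self, bad_reads):
        self.bad_reads = bad_reads
        self.resets = 0

    def read(self, sector):
        if self.bad_reads > 0:
            self.bad_reads -= 1
            raise DeviceIOError(f"read failed at sector {sector}")
        return b"data"

    def reset(self):
        # A real reset would reinitialise the hardware; here we only count it.
        self.resets += 1


def read_with_retry(ctrl, sector, max_resets=3):
    """Try a read; on error reset the controller and retry, up to max_resets."""
    for attempt in range(max_resets + 1):
        try:
            return ctrl.read(sector)
        except DeviceIOError:
            if attempt == max_resets:
                # Out of retries: the error goes up to the caller, which is
                # the point where the software has run out of options.
                raise
            ctrl.reset()
```

with a controller that recovers after a couple of resets the read goes
through. with one that never recovers, the error comes back to the caller
after the last retry, and there is nothing more the software can do.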

linux is not alone, i've crashed at least half a dozen different linux/unix
and non-unix systems doing the exact same thing. in my case it's always been
due to a CD-R disc. All my CDs are very clean, just sometimes a CDROM drive
freaks out when reading very large files from CD-R media.

think of it this way. the software is at the mercy of the hardware. the
software cannot prevent you from pulling the power plug, it cannot prevent
a disk failure, it cannot prevent I/O errors. there are some things it
can do to try to work around the problem, but PC hardware is so limited
that sometimes all workarounds fail and the system crashes.

want to hear something that really sucks? recently at my former company
one of the raid systems I built .. 6 x 80GB raid 10 hardware array connected
to a 3ware 8 port raid card. So this is hardware raid. Transparent to
the OS .. every single time a disk fails it crashes the system (kernel
panic). there is no reason for it to crash, the disk failure should be
handled transparently in the background by the controller. the OS doesn't
care if one of the 2 disks in a raid1 pair fails, the data is still there
in its entirety on the 2nd disk. So why does it crash every time? At the
mercy of the hardware, buggy hardware .. i worked with 3ware and my vendor
for 6 months last year to try to fix these things by changing hdd brands,
upgrading power supplies etc.. and a year later the problem is still there.
fuckin 3ware.

now on good hardware, perhaps some high end sun or RS6000 stuff where
there is a lot more redundancy (e.g. multiple independent PCI busses,
multiple disk controllers, multiple redundant cpus), the software has
much more flexibility in preventing a complete failure when hit with
such a situation. But even then, someone blowing holes in the front
of the system with a shotgun is gonna be beyond the control of the
software to prevent a crash :)


nate




