
Re: My first Linux crash



    "Andrew" == Andrew Perrin <clists@perrin.socsci.unc.edu> writes:

    Andrew> FWIW, I'm skeptical of Nate's claim that excessive I/O
    Andrew> errors must bring down the system. I'm certainly not a
    Andrew> kernel hacker, but I see no reason why the kernel couldn't
    Andrew> do what it does in other roughly analogous situations:

I'm not a Linux kernel programmer, but I've worked on device drivers
and firmware for many systems. There are always some hardware errors
you cannot recover from, though the details will vary based on the
situation. For every strategy you can devise, you can find another
class of hardware errors you simply cannot recover from.

For example, if I program a DMA controller to transfer bytes from
address x to address y but the controller sends them somewhere else, I'm
hosed. When I program a PCI bus master to do a burst transfer, I
*expect* it to obey the rules. I do not checksum all of my memory and
then verify that it did not change. 
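
To make the cost of that concrete, here is a rough sketch in C of what
"checksum everything and verify" would look like. It is only an
illustration: memory is a plain array, the DMA engine is simulated
with memmove(), and every name (MEM_SIZE, region_checksum, fake_dma,
transfer_and_verify, the skew parameter) is hypothetical and not taken
from any real driver or kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 4096 /* hypothetical size of the simulated memory */

/* Simple polynomial checksum over a region (illustrative only; a real
 * system would more likely use a CRC). */
static uint32_t region_checksum(const uint8_t *mem, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + mem[i];
    return sum;
}

/* Simulated DMA engine: copies len bytes from offset src to offset
 * dst + skew. skew == 0 models hardware that obeys the rules; a
 * nonzero skew models a controller that writes somewhere else.
 * The caller must keep dst + skew + len within MEM_SIZE. */
static void fake_dma(uint8_t *mem, size_t src, size_t dst,
                     size_t len, size_t skew)
{
    memmove(mem + dst + skew, mem + src, len);
}

/* Checksums everything outside the destination window before and
 * after the transfer. Returns 1 if that memory is unchanged, 0 if
 * the transfer touched something it should not have. */
static int transfer_and_verify(uint8_t *mem, size_t src, size_t dst,
                               size_t len, size_t skew)
{
    size_t hi_off = dst + len;
    size_t hi_len = MEM_SIZE - hi_off;

    uint32_t lo_before = region_checksum(mem, dst);
    uint32_t hi_before = region_checksum(mem + hi_off, hi_len);

    fake_dma(mem, src, dst, len, skew);

    uint32_t lo_after = region_checksum(mem, dst);
    uint32_t hi_after = region_checksum(mem + hi_off, hi_len);

    return lo_before == lo_after && hi_before == hi_after;
}
```

Even in this toy version, every transfer walks all of memory twice,
and the checking code itself lives in the same memory that a
misbehaving controller could overwrite.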

If the hardware breaks the rules, there is very little you can do to
recover. For example, depending on the OS and architecture, the I/O
error might erase the very code that is supposed to recover from the
error!

Most programming involves a chain of trust. It is just one of those
compromises you have to make. TCPA notwithstanding ;-)

Cheers!
Shyamal


