
filesystem corruption



I'm trying to migrate from Solaris 8 to Debian on SPARC, and I'm running
into some filesystem corruption.  I've got two disks in a SPARC V100
(one on each IDE bus), and each disk is partitioned like this:

   Device Flag    Start       End    Blocks   Id  System
/dev/hda1             0        65    522112+  83  Linux native
/dev/hda2  u         65       130    522112+  82  Linux swap
/dev/hda3             0      9727  78132127+   5  Whole disk
/dev/hda4           130       782   5237190   83  Linux native
/dev/hda5           782      9727  71850712+  83  Linux native

/dev/hdc is partitioned identically.  /dev/hda5 and /dev/hdc5 are
configured as a RAID1 pair, but I don't think this is relevant, as I'll
discuss in a moment.
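
As a sanity check on the table above, the Blocks column can be
recomputed from the cylinder ranges.  This is only a sketch: it assumes
512-byte sectors and the 255-head, 63-sector geometry that the kernel
prints in the boot log further down (CHS=9729/255/63), and it treats
End as exclusive.  The names are made up for illustration.

```python
# Assumed geometry: taken from the kernel's "CHS=9729/255/63" boot message.
HEADS = 255
SECTORS_PER_TRACK = 63
SECTOR_BYTES = 512  # assumption: standard 512-byte sectors

# (device, start cylinder, end cylinder) as listed in the table above.
PARTITIONS = [
    ("hda1", 0, 65),
    ("hda2", 65, 130),
    ("hda3", 0, 9727),
    ("hda4", 130, 782),
    ("hda5", 782, 9727),
]


def blocks_field(start_cyl, end_cyl):
    """Recompute the Blocks column (1 KiB units) for a cylinder range.

    A trailing "+" marks a leftover half block, as in the table above.
    """
    sectors = (end_cyl - start_cyl) * HEADS * SECTORS_PER_TRACK
    blocks, leftover = divmod(sectors * SECTOR_BYTES, 1024)
    return f"{blocks}+" if leftover else str(blocks)


if __name__ == "__main__":
    for device, start, end in PARTITIONS:
        print(f"/dev/{device}  {start:>5} {end:>5}  {blocks_field(start, end)}")
```

The recomputed values agree with the Blocks column, and the ranges for
hda1, hda2, hda4 and hda5 are contiguous, so apart from the whole-disk
slice hda3 the partitions do not overlap.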

Before pulling the old Solaris disk, I copied its contents, including
a Cyrus spool, onto another machine.  After installing Debian, I copied
them back into a subdirectory of the RAID partition.  After installing
Cyrus, I tried to use rsync to copy the spool from that backup to the
place where Cyrus will actually read it, which is elsewhere on the same
RAID partition.  When I do this, I get nondeterministic read errors
like these (/u2/ppg-solaris is where the backup lives):

readlink_stat "/u2/ppg-solaris/u1/imap/user/marc/folder/3371." failed: Input/output error
readlink_stat "/u2/ppg-solaris/u1/imap/user/marc/folder/3372." failed: Input/output error

but not on the same part of the disk each time, and sometimes not at
all.

Significantly, I can get different results from successive invocations
of the same rsync command.

I've also seen similar I/O errors when running
"find /u2/ppg-solaris/u1/imap/user/marc -type f | wc"
(the Cyrus spool contains about 2 GB of email in 150,000 files across
80 directories).
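
One way to pin down how nondeterministic the errors are would be to
scan the tree several times and compare which paths fail on each pass.
The following is a rough sketch, not something I have run on this
machine.  It assumes Python is available, it uses lstat() on the guess
that rsync's readlink_stat message comes from a failed stat, and the
function names are hypothetical.

```python
import errno
import os


def scan_once(root):
    """Walk root, lstat() every entry, and return the paths that hit EIO."""
    failed = set()

    def on_walk_error(err):
        # os.walk reports directories it could not list through this hook.
        if err.errno == errno.EIO:
            failed.add(err.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)
            except OSError as err:
                if err.errno == errno.EIO:
                    failed.add(path)
    return failed


def compare_passes(root, passes=3):
    """Scan root several times; split failures into persistent and flaky."""
    results = [scan_once(root) for _ in range(passes)]
    always = set.intersection(*results)  # failed on every pass
    flaky = set.union(*results) - always  # failed on only some passes
    return results, always, flaky


if __name__ == "__main__":
    results, always, flaky = compare_passes("/u2/ppg-solaris/u1/imap/user/marc")
    for number, failed in enumerate(results, start=1):
        print(f"pass {number}: {len(failed)} paths with I/O errors")
    print(f"failed every pass: {len(always)}, failed intermittently: {len(flaky)}")
```

Paths that fail on every pass would point at real on-disk damage, while
paths that fail only sometimes would point at the read path (driver,
DMA, interrupts).  Later passes may be served partly from cache, so a
clean later pass does not by itself prove the disk reads were good.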

I commented /dev/md0 out of fstab, disabled RAID, rebooted, and
manually mounted /dev/hdc5 read-only on /u2.  rsync gives me errors
there, too.  I did the same thing with /dev/hda5, and also got
intermittent errors.  Since this happens on both disks, without RAID,
it does not seem that the problem is in the RAID code, the disks, the
IDE controllers, or the cabling.  Since the memory is ECC, I assume the
problem is not there, either.  I never experienced any sort of
corruption on this machine when it was running Solaris, so I am tempted
to blame Linux.  Perhaps with all the IDE thrashing on the same disk,
I'm losing an interrupt somewhere?

I'm running the 2.4.21 kernel from the kernel-image-2.4.21-sparc64
package in testing.  The filesystem is ext3.

I tried backing out to kernel-image-2.4.19-sun4u from stable, but that
wouldn't even boot.  It got as far as

    ALI15X3: IDE controller at PCI slot 00:0d.0
    ALI15X3: chipset revision 195
    ALI15X3: 100% native mode on irq 4,7cc
        ide0: BM-DMA at 0x1fe02010220-0x1fe02010227, BIOS settings: hda:pio, hdb:pio
        ide1: BM-DMA at 0x1fe02010228-0x1fe0201022f, BIOS settings: hdc:pio, hdd:pio
    hda: WDC WD800JB-00ETA0, ATA DISK drive
    ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
    hdc: ST380011A, ATA DISK drive
    hdd: CD-224E, ATAPI CD/DVD-ROM drive
    ide0 at 0x1fe02010200-0x1fe02010207,0x1fe0201021a on irq 4,7cc
    ide1 at 0x1fe02010210-0x1fe02010217,0x1fe0201020a on irq 4,7cc
    hda: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=9729/255/63, UDMA(66)
    hdc: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=9729/255/63, UDMA(66)
    hdd: ATAPI 24X CD-ROM drive, 128kB Cache, UDMA(33)
    Uniform CD-ROM driver Revision: 3.12
    Partition check:
     hda:

but then hung right there.  2.4.18 boots, but doesn't have ipip
support, which I need.

Does anybody have any suggestions for how I might further diagnose and
fix this problem?

Thanks!

                Marc


