filesystem corruption
I'm trying to migrate from Solaris 8 to Debian on SPARC, and I'm running
into some filesystem corruption. I've got two disks in a SPARC V100
(one on each IDE bus), and each disk is partitioned like this:
Device     Flag  Start    End     Blocks   Id  System
/dev/hda1            0     65     522112+  83  Linux native
/dev/hda2  u        65    130     522112+  82  Linux swap
/dev/hda3            0   9727   78132127+   5  Whole disk
/dev/hda4          130    782    5237190   83  Linux native
/dev/hda5          782   9727   71850712+  83  Linux native
/dev/hdc is partitioned the same way. /dev/hda5 and /dev/hdc5 are
configured as a RAID1 pair, but I don't think this is relevant, as I'll
discuss in a moment.
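As a sanity check that the partition table itself is self-consistent (no overlaps, sizes matching the geometry), the Blocks column can be recomputed from the Start and End columns. This is only a sketch under stated assumptions: Start and End are cylinder numbers with End exclusive, the geometry is the 255 heads by 63 sectors reported in the boot log further down, sectors are 512 bytes, Blocks are 1 KiB units, and a trailing "+" marks one leftover sector. The function names are made up for illustration.

```python
# Assumed geometry: 255 heads x 63 sectors per track, 512-byte sectors.
SECTORS_PER_CYLINDER = 255 * 63

def blocks(start_cyl, end_cyl):
    """Return (whole 1 KiB blocks, leftover 512-byte sectors) for a partition."""
    sectors = (end_cyl - start_cyl) * SECTORS_PER_CYLINDER
    return sectors // 2, sectors % 2

# (device, start, end, blocks as listed, True if the listing shows a "+")
TABLE = [
    ("/dev/hda1", 0, 65, 522112, True),
    ("/dev/hda2", 65, 130, 522112, True),
    ("/dev/hda3", 0, 9727, 78132127, True),
    ("/dev/hda4", 130, 782, 5237190, False),
    ("/dev/hda5", 782, 9727, 71850712, True),
]

def check_table(table=TABLE):
    """Return the devices whose listed size disagrees with the geometry."""
    mismatches = []
    for device, start, end, listed, plus in table:
        whole, leftover = blocks(start, end)
        if whole != listed or bool(leftover) != plus:
            mismatches.append(device)
    return mismatches

if __name__ == "__main__":
    print(check_table() or "partition sizes match the assumed geometry")
```

Under those assumptions every row agrees with the listing, and each partition starts where the previous one ends, so the layout itself does not look like the source of the trouble.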
Before pulling the old Solaris disk, I copied its contents, including
a Cyrus spool, onto another machine. After installing Debian, I copied
everything back into a subdirectory of the RAID partition. After
installing Cyrus, I tried to use rsync to copy the spool from that
backup to the place Cyrus will actually read it from, which is
elsewhere on the same RAID partition. When I do this, I get
nondeterministic read errors like these (/u2/ppg-solaris is where the
backup lives):
readlink_stat "/u2/ppg-solaris/u1/imap/user/marc/folder/3371." failed: Input/output error
readlink_stat "/u2/ppg-solaris/u1/imap/user/marc/folder/3372." failed: Input/output error
The errors do not occur in the same place each time, and sometimes do
not occur at all. Significantly, I can get different results from
successive invocations of the same rsync command.
I've also seen similar I/O errors when running
    find /u2/ppg-solaris/u1/imap/user/marc -type f | wc
(the Cyrus spool contains about 2 GB of email in 150,000 files in 80
directories).
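One way to pin down how nondeterministic the errors are is to walk the tree several times, record which paths fail on each pass, and compare the sets. The following is a minimal sketch in Python, not something that was run on this machine; the function names are hypothetical, and it only exercises directory reads and lstat, which is roughly what the find command above does.

```python
import os

def scan_errors(root):
    """Walk root, lstat every file, and return {path: errno} for failures."""
    failures = {}

    def on_walk_error(err):
        # os.walk reports unreadable directories through this callback.
        failures[err.filename] = err.errno

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)
            except OSError as err:
                failures[path] = err.errno
    return failures

def compare_passes(root, passes=3):
    """Return (paths failing on every pass, paths failing only sometimes)."""
    results = [set(scan_errors(root)) for _ in range(passes)]
    always = set.intersection(*results)
    sometimes = set.union(*results) - always
    return always, sometimes

if __name__ == "__main__":
    always, sometimes = compare_passes("/u2/ppg-solaris/u1/imap/user/marc")
    print("failed on every pass:", len(always))
    print("failed on some passes only:", len(sometimes))
```

Paths that fail on every pass would point toward real on-disk damage, while paths that fail only on some passes would point toward a transient problem in the I/O path. Results from later passes may be served from cache, so a clean later pass does not by itself prove the disk was read successfully.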
I commented /dev/md0 out of fstab, disabled RAID, rebooted, and
manually mounted /dev/hdc5 read-only on /u2. rsync gives me errors
there, too. I did the same thing with /dev/hda5, and also got
intermittent errors. Since this happens on both disks, without RAID,
the problem does not seem to be in the RAID code, either disk, the
IDE controllers, or the cabling. Since the memory is ECC memory, I
would assume the problem is not there, either. I have never
experienced any sort of corruption on this machine when it was running
Solaris, so I am tempted to blame Linux. Perhaps with all the IDE
thrashing on the same disk, I'm losing an interrupt somewhere?
I'm running the 2.4.21 kernel from the kernel-image-2.4.21-sparc64
package from testing. The filesystem is ext3.
I tried backing out to kernel-image-2.4.19-sun4u from stable, but that
wouldn't even boot. It got as far as
ALI15X3: IDE controller at PCI slot 00:0d.0
ALI15X3: chipset revision 195
ALI15X3: 100% native mode on irq 4,7cc
ide0: BM-DMA at 0x1fe02010220-0x1fe02010227, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0x1fe02010228-0x1fe0201022f, BIOS settings: hdc:pio, hdd:pio
hda: WDC WD800JB-00ETA0, ATA DISK drive
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hdc: ST380011A, ATA DISK drive
hdd: CD-224E, ATAPI CD/DVD-ROM drive
ide0 at 0x1fe02010200-0x1fe02010207,0x1fe0201021a on irq 4,7cc
ide1 at 0x1fe02010210-0x1fe02010217,0x1fe0201020a on irq 4,7cc
hda: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=9729/255/63, UDMA(66)
hdc: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=9729/255/63, UDMA(66)
hdd: ATAPI 24X CD-ROM drive, 128kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
Partition check:
hda:
but then it hung right there. 2.4.18 boots, but it doesn't have ipip
support, which I need.
Does anybody have any suggestions for how I might further diagnose and
fix this problem?
Thanks!
Marc