Re: grep & tar segfault - broken system
On Monday 03 January 2005 18:45, Alexandros Papadopoulos wrote:
> On Monday 03 January 2005 12:59, Alexandros Papadopoulos wrote:
> <snip>
>
> > [0] GREP segfaults:
> > helios:/# grep
> > Segmentation fault
>
> <snip>
>
> > [1] TAR segfaults too:
> > helios:/# tar -cf boot.tar boot/
> > Segmentation fault
>
> On closer inspection, I realised that "find" segfaulted too. I also
> checked the md5sums of the binaries and compared them with md5sums
> from another sarge machine:
> helios:~# md5sum grep
> 3e39a37478852cbc407a48cbb87742b1 grep
> helios:~# md5sum /bin/grep
> 32f9b2c685911afe25d6000c07c6f3a7 /bin/grep
>
> helios:~# md5sum tar
> 4a1f9c9a1679faaf66073c96f1435284 tar
> helios:~# md5sum /bin/tar
> 4a1f9c9a1679faaf66073c96f1435284 /bin/tar
>
> helios:~# md5sum /usr/bin/find
> d046e60434e9d1b7a21781fddc0799af /usr/bin/find
> helios:~# md5sum find
> f88ace1e9fd6f456cfff178e29189c32 find
>
> So, it seems that /usr/bin/find and /bin/grep are different on the
> problematic machine!
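The checksum comparison quoted above can be repeated for a whole list of binaries at once. Here is a minimal Python sketch, assuming a known-good copy of the same sarge tree is reachable under some directory. The helper names `md5_of` and `mismatched`, and the reference directory in the usage note, are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched(local_root, reference_root, relative_paths):
    """List the relative paths whose MD5 differs between the two trees."""
    bad = []
    for rel in relative_paths:
        local_sum = md5_of(Path(local_root) / rel)
        reference_sum = md5_of(Path(reference_root) / rel)
        if local_sum != reference_sum:
            bad.append(rel)
    return bad
```

A call such as `mismatched("/", "/mnt/good-sarge", ["bin/grep", "bin/tar", "usr/bin/find"])` would return the files that differ; `/mnt/good-sarge` is only a placeholder for wherever the reference copy lives.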
Further findings:
helios:~# cmp -l /usr/bin/perl /usr/bin/perl_SUSPECT
1055193 377 337
helios:~# cmp -l /usr/bin/find /usr/bin/find_SUSPECT
49561 377 337
helios:~# cmp -l /bin/tar /bin/tar_SUSPECT
163993 377 337
Each pair differs in a single byte (octal 377 versus 337). I guess
this points to filesystem corruption, but fsck.ext -f didn't come up
with anything.
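To see what the cmp -l lines have in common, the three of them can be decoded with a short Python sketch. It relies on cmp -l printing a 1-based decimal offset followed by the two byte values in octal; the names `CMP_LINES` and `parse_cmp_l` are made up for this example.

```python
# The three lines printed by cmp -l above:
# 1-based decimal offset, then the differing bytes in octal.
CMP_LINES = """\
1055193 377 337
49561 377 337
163993 377 337
"""


def parse_cmp_l(text):
    """Parse `cmp -l` output into (zero_based_offset, byte_a, byte_b) tuples."""
    rows = []
    for line in text.splitlines():
        offset, a, b = line.split()
        rows.append((int(offset) - 1, int(a, 8), int(b, 8)))
    return rows


for offset, a, b in parse_cmp_l(CMP_LINES):
    flipped = a ^ b  # bits that differ between the two copies
    print(f"offset {offset:>8}  {a:#04x} vs {b:#04x}  "
          f"xor {flipped:#04x}  offset % 64 = {offset % 64}")
```

For these three files every difference is 0xff versus 0xdf, which is a single bit (0x20), and all three zero-based offsets are 24 modulo 64. Three samples prove nothing, but a pattern that regular looks more like a fault somewhere in the data path than like random filesystem damage.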
My suspicions then fell on the SATA disks. This is a 2-disk software
RAID-1 bootable array, on 2 SATA disks connected with a PCI-to-SATA
card using the SiI3112 chipset. I dug a little deeper with smartctl
and came up with the following for the two disks:
For /dev/hde, a long test came up with the overall result PASSED, but
the error log contains the following entry:
Error 1 occurred at disk power-on lifetime: 164 hours (6 days + 20 hours)
When the command that caused the error occurred, the device was in an
unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 40 40 a4 b4 e6 Error: ICRC, ABRT at LBA = 0x06b4a440 = 112501824
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 03 80 00 a4 b4 e6 00 00:09:00.928 WRITE DMA
ca 03 80 80 a3 b4 e6 00 00:09:00.912 WRITE DMA
ca 03 80 00 a3 b4 e6 00 00:09:00.912 WRITE DMA
ca 03 80 80 a2 b4 e6 00 00:09:00.912 WRITE DMA
ca 03 80 00 a2 b4 e6 00 00:09:00.912 WRITE DMA
For /dev/hdg, the same test also came out with an overall PASSED, but
the log contains 9 errors, all looking like this:
Error 9 occurred at disk power-on lifetime: 747 hours (31 days + 3 hours)
When the command that caused the error occurred, the device was in an
unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 c7 b9 8a e0 Error: ICRC, ABRT at LBA = 0x008ab9c7 = 9091527
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 45 c7 b9 8a e0 00 05:09:27.296 READ DMA
ef 03 45 c7 b9 8a e0 00 05:08:01.312 SET FEATURES [Set transfer mode]
ef 03 45 c7 b9 8a e0 00 05:08:01.312 SET FEATURES [Set transfer mode]
c8 00 08 c0 b9 8a e9 00 05:09:02.112 READ DMA
c8 00 01 ff b3 8a e9 00 05:09:02.080 READ DMA
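As a cross-check on how smartctl arrives at "ICRC, ABRT at LBA = ..." from the register dump, here is a small Python sketch. It assumes the usual 28-bit LBA layout (SN, CL and CH as the low three bytes, the low nibble of DH as the top four bits) and names only a few of the error-register bits. The function names and the abbreviated `ERROR_BITS` table are this example's own, not taken from smartmontools.

```python
# A subset of the ATA error register bits, enough for the entries above.
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}


def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the SN, CL, CH and DH registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def error_flags(er):
    """Name the set bits of the error register, highest bit first."""
    return [name for bit, name in sorted(ERROR_BITS.items(), reverse=True)
            if er & (1 << bit)]


def decode(register_line):
    """Decode one 'ER ST SC SN CL CH DH' line of hex bytes."""
    fields = [int(field, 16) for field in register_line.split()[:7]]
    er, _st, _sc, sn, cl, ch, dh = fields
    return error_flags(er), lba28(sn, cl, ch, dh)


for line in ("84 51 40 40 a4 b4 e6", "84 51 00 c7 b9 8a e0"):
    flags, lba = decode(line)
    print(", ".join(flags), f"at LBA {lba:#010x} = {lba}")
```

Both lines decode to ICRC plus ABRT, at LBA 112501824 and 9091527 respectively, matching the smartctl text. ICRC stands for an interface CRC error, meaning the transfer was found to be damaged on the link between the controller and the drive, so these entries do not by themselves indicate bad sectors.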
A fellow user suggested setting the DMA mode to udma5 (hdparm -X
udma5 /dev/hdg) and then monitoring the behavior of the system, but it
made no difference: the same read/write errors still show up after
the SMART tests. I've also changed the SATA cables, to no avail.
Anyone have advice on this?
Thanks
-A