
Re: grep & tar segfault - broken system



On Monday 03 January 2005 18:45, Alexandros Papadopoulos wrote:
> On Monday 03 January 2005 12:59, Alexandros Papadopoulos wrote:
> <snip>
>
> > [0] GREP segfaults:
> > helios:/# grep
> > Segmentation fault
>
> <snip>
>
> > [1] TAR segfaults too:
> > helios:/# tar -cf boot.tar boot/
> > Segmentation fault
>
> On closer inspection, I realised that "find" segfaulted too. I also
> checked the md5sums of the binaries and compared them with md5sums
> from another sarge machine:
> helios:~# md5sum grep
> 3e39a37478852cbc407a48cbb87742b1  grep
> helios:~# md5sum /bin/grep
> 32f9b2c685911afe25d6000c07c6f3a7  /bin/grep
>
> helios:~# md5sum tar
> 4a1f9c9a1679faaf66073c96f1435284  tar
> helios:~# md5sum /bin/tar
> 4a1f9c9a1679faaf66073c96f1435284  /bin/tar
>
> helios:~# md5sum /usr/bin/find
> d046e60434e9d1b7a21781fddc0799af  /usr/bin/find
> helios:~# md5sum find
> f88ace1e9fd6f456cfff178e29189c32  find
>
> So, it seems that /usr/bin/find and /bin/grep are different on the
> problematic machine!

Further findings:
helios:~# cmp -l /usr/bin/perl /usr/bin/perl_SUSPECT
1055193 377 337
helios:~# cmp -l /usr/bin/find /usr/bin/find_SUSPECT
49561 377 337
helios:~# cmp -l /bin/tar /bin/tar_SUSPECT
163993 377 337

I guess this points to filesystem corruption, but fsck.ext -f didn't 
come up with anything. 
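
A side note on reading the cmp -l output above: the second and third
columns are the two differing byte values in octal. A minimal sketch for
seeing which bits differ, assuming a POSIX shell (the function name
octal_xor is made up for illustration):

```shell
# Hypothetical helper: XOR two octal byte values as printed by `cmp -l`
# and print the result in hex. The leading 0 makes the shell's
# arithmetic expansion read each argument as an octal constant.
octal_xor() {
    printf '0x%02x\n' "$(( 0$1 ^ 0$2 ))"
}

octal_xor 377 337
```

This prints 0x20 for all three files (octal 377 is 0xff, octal 337 is
0xdf). So each file differs in exactly one byte, and in that byte in
exactly one bit, rather than in whole garbled blocks.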

So my suspicions fell on the SATA disks (this is a 2-disk software
RAID-1 bootable array on two SATA disks, connected through a PCI-to-SATA
card using the SiI3112 chipset). I dug a little deeper with smartctl
and came up with the following for the two disks:

For /dev/hde, a long test came up with the overall result PASSED, but
the error log shows the following:

Error 1 occurred at disk power-on lifetime: 164 hours (6 days + 20 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 40 40 a4 b4 e6  Error: ICRC, ABRT at LBA = 0x06b4a440 = 112501824

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 03 80 00 a4 b4 e6 00      00:09:00.928  WRITE DMA
  ca 03 80 80 a3 b4 e6 00      00:09:00.912  WRITE DMA
  ca 03 80 00 a3 b4 e6 00      00:09:00.912  WRITE DMA
  ca 03 80 80 a2 b4 e6 00      00:09:00.912  WRITE DMA
  ca 03 80 00 a2 b4 e6 00      00:09:00.912  WRITE DMA

For /dev/hdg, the same test also came out with an overall PASSED, but
the error log shows 9 errors, all looking like this:

Error 9 occurred at disk power-on lifetime: 747 hours (31 days + 3 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 c7 b9 8a e0  Error: ICRC, ABRT at LBA = 0x008ab9c7 = 9091527

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 45 c7 b9 8a e0 00      05:09:27.296  READ DMA
  ef 03 45 c7 b9 8a e0 00      05:08:01.312  SET FEATURES [Set transfer mode]
  ef 03 45 c7 b9 8a e0 00      05:08:01.312  SET FEATURES [Set transfer mode]
  c8 00 08 c0 b9 8a e9 00      05:09:02.112  READ DMA
  c8 00 01 ff b3 8a e9 00      05:09:02.080  READ DMA
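
The LBA that smartctl prints can be rebuilt by hand from the register
dumps above, which is a quick way to confirm how the columns are being
read. A minimal sketch, assuming 28-bit LBA addressing (the low 4 bits
of DH are LBA bits 24-27, followed by CH, CL and SN); the function name
lba28 is hypothetical:

```shell
# Hypothetical helper: rebuild a 28-bit LBA from the SN, CL, CH and DH
# register values (hex, in the order shown in the smartctl error log).
lba28() {
    sn=$1 cl=$2 ch=$3 dh=$4
    echo "$(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))"
}

lba28 40 a4 b4 e6   # /dev/hde, Error 1
lba28 c7 b9 8a e0   # /dev/hdg, Error 9
```

These print 112501824 and 9091527, matching the LBA values in the two
logs. The ER value 84 is bit 7 (ICRC) plus bit 2 (ABRT); ICRC is a CRC
error on the Ultra DMA transfer between controller and drive, not a bad
sector on the platter.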

A fellow user suggested setting the DMA mode to udma5 (hdparm -X
udma5 /dev/hdg) and then monitoring the behavior of the system, but it
made no difference. After re-running the SMART tests, the same
read/write errors still occur. I've also swapped the SATA cables, to no
avail.
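
One way to tell whether a cable or transfer-mode change helps is to
compare the drive's CRC error counter before and after a large copy. A
minimal sketch, assuming the drive reports SMART attribute 199 (usually
named UDMA_CRC_Error_Count, though not every drive has it) and that the
raw value is the tenth column of the smartctl -A table; crc_count is a
made-up name:

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value of attribute 199, if the drive reports one.
crc_count() {
    awk '$1 == 199 { print $10 }'
}

# Usage (needs root and smartmontools):
#   smartctl -A /dev/hdg | crc_count
```

If the number keeps growing under load, the errors are still happening
on the link between the controller and the drive.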

Anyone have advice on this?

Thanks

-A


