
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels



On Wed, Feb 03, 2010 at 11:22:06AM +0100, Cesare Leonardi wrote:
> M. Dietrich wrote:
> > my system had serious filesystem corruption with several -bigmem
> > kernels in the past (from 2.6.28 to 2.6.32).
> 
> Does this mean that with a normal 686 or 486 kernel the corruption
> doesn't happen?

yes.
> 
> However, many years ago i experienced frequent filesystem
> corruption but i couldn't figure out why. Eventually i discovered
> it was caused by some hdparm settings...
> It was very hard to find, so i hope this could help you.  ;-)

no special settings are applied with hdparm:

/dev/sda:
 multcount     =  0 (off)
 IO_support    =  1 (32-bit)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 30401/255/63, sectors = 488397168, start = 0

> > for sure i can't guarantee that this isn't related to some hardware
> > fault like broken ram or the like but i checked ram with memtest86+.
> 
> If i were you, i would also install smartmontools and try something
> like: smartctl -a /dev/yourdisk I'd pay particular attention to the
> "Vendor Specific SMART Attributes with Thresholds" table to find
> anything strange.

it's already installed, this is the output:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   085   069   034    Pre-fail  Always       -       98867399
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   001   001   020    Old_age   Always   FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       40211526
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       269350284038985
 10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       448
184 End-to-End_Error        0x0032   100   253   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x003a   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0022   100   100   045    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   052   000    Old_age   Always       -       29 (Lifetime Min/Max 10/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0022   062   062   000    Old_age   Always       -       77434
193 Load_Cycle_Count        0x001a   001   001   000    Old_age   Always       -       320283
194 Temperature_Celsius     0x0012   029   048   000    Old_age   Always       -       29 (0 10 0 0)
195 Hardware_ECC_Recovered  0x0010   070   061   000    Old_age   Offline      -       98881899
196 Reallocated_Event_Count 0x003e   096   096   000    Old_age   Always       -       3645 (28548, 0)
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0000   100   253   000    Old_age   Offline      -       0

i wonder how to interpret that. Start_Stop_Count is flagged FAILING_NOW
(value 001 against a threshold of 020), maybe because hdaps is stopping
the device often? why is that bad? hm.

but everything else looks fine, right?
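
for anyone who wants to check a table like the one above mechanically, here
is a minimal python sketch. it assumes the ten-column layout printed by
smartctl as shown above, and it applies the usual smartmontools rule that an
attribute counts as failed when its normalized VALUE is less than or equal
to THRESH. the function names parse_smart_attributes and failing are made up
for this example, and the sketch does not interpret RAW_VALUE at all, since
that field is vendor specific.

```python
def parse_smart_attributes(text):
    """Parse attribute rows of a smartctl table into dictionaries.

    Assumes the ten-column layout shown above; other lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        # Split into at most 10 fields so RAW_VALUE keeps embedded spaces.
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID; the header starts with "ID#".
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        })
    return rows


def failing(rows):
    """Return rows whose normalized value is at or below the threshold."""
    return [r for r in rows if r["value"] <= r["thresh"]]


# Two rows copied from the table above, plus the header line.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   085   069   034    Pre-fail  Always       -       98867399
  4 Start_Stop_Count        0x0032   001   001   020    Old_age   Always   FAILING_NOW 248712
"""

for r in failing(parse_smart_attributes(SAMPLE)):
    # prints: Start_Stop_Count value=1 thresh=20 FAILING_NOW
    print(f"{r['name']} value={r['value']} thresh={r['thresh']} {r['when_failed']}")
```

note that this only compares VALUE with THRESH. an Old_age attribute such as
Start_Stop_Count crossing its threshold indicates wear, while a Pre-fail
attribute doing so would be the more alarming case.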

> And listen for any suspicious noise from the disk.

it doesn't - silent as a sleeping baby.
> 
> If you have the slightest suspicion about the ram, try to temporarily
> remove some banks, if you have more than one, or replace it completely
> if you can. In the past i've seen at least two cases where memtest ran
> ok for about a day but the system had sporadic freezes and BSODs
> (Windows PCs). When i replaced the ram the problems disappeared.
> 
removing ram would reduce the memory size and make the bigmem kernel
unnecessary. replacing it isn't possible right now. point is: i never had
strange memory-related behaviour like kernel freezes or program core dumps,
and i use the system quite a lot for big (cross-)compiles and everything
else that uses a lot of memory...

thing is that i discovered the fs corruption by accident - git complained
about a corrupt repo. then i forced a fsck run at boot and that failed.
maybe all bigmem users should force a fsck and see if they already
suffer from similar corruption. if they don't, this bug should be closed
because it then seems to be hw related. but i don't know how & where to
search, especially because this computer is the tool i do my work on.

best regards,
	michael


