
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels



M. Dietrich wrote:
Does this mean that with normal 686 or 486 kernel the corruption
doesn't happen?

yes.

So it could be a kernel bug. Or the bigmem kernel triggers the problem earlier or more frequently. Have you already searched the internet to see if anyone else has hit your problem? I suspect it's not a kernel problem (see below)...

there are no special settings installed using hdparm:

/dev/sda:
 multcount     =  0 (off)
 IO_support    =  1 (32-bit)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 30401/255/63, sectors = 488397168, start = 0

This is the output of the command, but it doesn't show everything you might have changed from the defaults. Have you customized /etc/hdparm.conf?
For example, I've set apm=254, but the output above doesn't report it.

My suggestion is: try to comment out everything you have customized about the disk.
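To make the apm=254 example concrete, a per-device stanza in /etc/hdparm.conf looks roughly like this (the device name and value are examples; adapt them to your system and see the comments shipped in Debian's own /etc/hdparm.conf for the exact syntax):

```
# /etc/hdparm.conf fragment (example only)
/dev/sda {
    # 254 = maximum performance without disabling APM entirely;
    # lower values allow more aggressive power saving / head parking
    apm = 254
}
```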

it's already installed, this is the output:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   085   069   034    Pre-fail  Always       -       98867399
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   001   001   020    Old_age   Always   FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       40211526
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       269350284038985
 10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       448
184 End-to-End_Error        0x0032   100   253   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x003a   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0022   100   100   045    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   052   000    Old_age   Always       -       29 (Lifetime Min/Max 10/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0022   062   062   000    Old_age   Always       -       77434
193 Load_Cycle_Count        0x001a   001   001   000    Old_age   Always       -       320283
194 Temperature_Celsius     0x0012   029   048   000    Old_age   Always       -       29 (0 10 0 0)
195 Hardware_ECC_Recovered  0x0010   070   061   000    Old_age   Offline      -       98881899
196 Reallocated_Event_Count 0x003e   096   096   000    Old_age   Always       -       3645 (28548, 0)
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0000   100   253   000    Old_age   Offline      -       0

I wonder how to interpret that. Start_Stop_Count has FAILING_NOW, maybe because
hdaps is stopping the device often? Why is that bad? Hm.

Good question. I suggest downloading a diagnostic tool from your disk vendor's site and seeing whether it also reports the drive as failing.

One of the problems with SMART is that the semantics of the table above are not consistent between manufacturers. So I suggest you look at the Wikipedia SMART page, in particular the "Known ATA S.M.A.R.T. attributes" table, but take it with a grain of salt:
http://en.wikipedia.org/wiki/S.M.A.R.T.
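One part of the table that *is* standardized: the FAILING_NOW flag is simply smartctl comparing the normalized VALUE against THRESH. A minimal sketch, using the Start_Stop_Count numbers from the table above:

```shell
# An attribute is flagged as failed when its normalized VALUE has
# dropped to or below the manufacturer's THRESH.
value=1     # Start_Stop_Count normalized VALUE from the table above
thresh=20   # its THRESH column
if [ "$value" -le "$thresh" ]; then
    status=FAILING_NOW
else
    status=-
fi
echo "$status"
```

So the flag itself is mechanical; what it *means* (how the raw count maps to that normalized value) is still manufacturer-specific.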

That said, from your SMART table I'd do some searching on these attributes together with your disk manufacturer:
* Raw_Read_Error_Rate
* Start_Stop_Count
* Seek_Error_Rate
* Power-Off_Retract_Count
* Load_Cycle_Count
* Hardware_ECC_Recovered
* Reallocated_Event_Count

I'd check whether the raw values of Raw_Read_Error_Rate and Seek_Error_Rate, as used by your manufacturer, are worrying or not. Same thing for Hardware_ECC_Recovered. At work we have at least 4 Maxtor drives that show high and ever-increasing raw values, but they have worked without problems for years.
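As an illustration of why these raw values need manufacturer-specific interpretation: Seagate drives are commonly reported to pack two counters into a 48-bit raw value, an error count in the upper bits and an operation count in the lower 32 bits. This encoding is an assumption, not something the vendor documents officially, so confirm it with the vendor's own tool. Applied to the Seek_Error_Rate raw value from the table above:

```shell
# Hypothetical decoding of a Seagate-style 48-bit raw value:
# upper bits = actual error count, lower 32 bits = operation count.
# This split is a commonly reported convention, not vendor-confirmed.
raw=40211526                    # Seek_Error_Rate raw value from above
errors=$(( raw >> 32 ))         # upper bits: actual seek errors
seeks=$(( raw & 0xFFFFFFFF ))   # lower 32 bits: total seek operations
echo "seek errors: $errors out of $seeks seeks"
```

If this reading is right, the scary-looking 40211526 would just be the total number of seeks, with zero actual errors.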

The Reallocated_Event_Count also deserves some investigation: why is it so high while Reallocated_Sector_Ct and Current_Pending_Sector are zero?

Last, from your SMART table it seems that your drive often goes into standby/sleep mode. This can be seen in the high values of Start_Stop_Count, Load_Cycle_Count and Power-Off_Retract_Count. And in your initial report you said that you use suspend/resume. I think you should reduce these values: they are very high, and all these start/stop cycles will reduce (or already have reduced) the life of your disk. Maybe something on your system forces overly aggressive power saving on the disk. Is laptop-mode-tools installed?

It is a common problem, however, if you do some searching.
It is the reason I've put "apm=254" in my hdparm configuration: without it my disk parked its heads a bit too often *during normal PC usage*, and I could notice this as clicks and very brief unresponsiveness of the system. With that parameter I've forced my disk to run at full power without parking and going to sleep automatically.
Your disk could require different settings.
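To get a feel for how much of the drive's mechanical life these cycles may represent, here is a rough back-of-the-envelope check. The 600,000 figure is a hypothetical load/unload rating used purely as an example; the real number is in your drive's datasheet:

```shell
# Rough head load/unload wear estimate.
# rated_cycles=600000 is an assumed example rating, NOT your drive's
# actual spec -- check the datasheet for the real figure.
load_cycles=320283   # Load_Cycle_Count raw value from the table above
rated_cycles=600000
used_pct=$(( 100 * load_cycles / rated_cycles ))
echo "about ${used_pct}% of the assumed load/unload budget used"
```

If the real rating is in that ballpark, more than half the budget would already be gone, which supports taming the power-saving settings.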

I never had strange behaviour
related to memory, like kernel freezes or program core dumps, and I use the system
quite a lot with big (cross-)compiles and everything else that uses a lot of memory...

In your initial report you said that you noticed the problem starting with 2.6.28, but that you found it accidentally. Another test would be to try a previous kernel and see if it works, for example 2.6.26 from Lenny.
You can fetch older kernels from:
http://snapshot.debian.net/

I understand you have difficulty removing RAM, but it is another of the suspects. Is it the original from Lenovo?

A hard problem to solve.
Good luck.

Cesare.


