
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels



M. Dietrich wrote:
Does this mean that with normal 686 or 486 kernel the corruption
doesn't happen?

yes.

So it could be a kernel bug. Or the bigmem kernel triggers the problem earlier or more frequently. Have you already searched the internet to see if anyone else has hit your problem? I suspect it's not a kernel problem (see below)...

there are no special settings installed using hdparm:

/dev/sda:
 multcount     =  0 (off)
 IO_support    =  1 (32-bit)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 30401/255/63, sectors = 488397168, start = 0

This is the output of the command, but it doesn't show everything you might have changed from the defaults. Have you customized /etc/hdparm.conf?
For example, I've set apm=254, but the output above doesn't report it.

My suggestion is: try to comment out everything you have customized about the disk.
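To make the apm=254 example concrete, a per-device stanza in /etc/hdparm.conf looks roughly like this (the device name and value are examples; adapt them to your system and see the comments shipped in Debian's own /etc/hdparm.conf for the exact syntax):

```
# /etc/hdparm.conf fragment (example only)
/dev/sda {
    # 254 = maximum performance without disabling APM entirely;
    # lower values allow more aggressive power saving / head parking
    apm = 254
}
```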

it's already installed, this is the output:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   085   069   034    Pre-fail  Always       -       98867399
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   001   001   020    Old_age   Always   FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       40211526
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       269350284038985
 10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       448
184 End-to-End_Error        0x0032   100   253   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x003a   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0022   100   100   045    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   052   000    Old_age   Always       -       29 (Lifetime Min/Max 10/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0022   062   062   000    Old_age   Always       -       77434
193 Load_Cycle_Count        0x001a   001   001   000    Old_age   Always       -       320283
194 Temperature_Celsius     0x0012   029   048   000    Old_age   Always       -       29 (0 10 0 0)
195 Hardware_ECC_Recovered  0x0010   070   061   000    Old_age   Offline      -       98881899
196 Reallocated_Event_Count 0x003e   096   096   000    Old_age   Always       -       3645 (28548, 0)
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0000   100   253   000    Old_age   Offline      -       0

I wonder how to interpret that. Start_Stop_Count has FAILING_NOW, maybe because
hdaps is stopping the device often? Why is that bad? Hm.

Good question. I suggest downloading a diagnostic tool from your disk vendor's site and seeing whether it also reports the drive as failing.

One of the problems with SMART is that the semantics of the table above are not consistent between manufacturers. So I suggest you look at the Wikipedia SMART page, in particular the "Known ATA S.M.A.R.T. attributes" table, but take it with a grain of salt:
http://en.wikipedia.org/wiki/S.M.A.R.T.
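One part of the table that *is* standardized: the FAILING_NOW flag is simply smartctl comparing the normalized VALUE against THRESH. A minimal sketch, using the Start_Stop_Count numbers from the table above:

```shell
# An attribute is flagged as failed when its normalized VALUE has
# dropped to or below the manufacturer's THRESH.
value=1     # Start_Stop_Count normalized VALUE from the table above
thresh=20   # its THRESH column
if [ "$value" -le "$thresh" ]; then
    status=FAILING_NOW
else
    status=-
fi
echo "$status"
```

So the flag itself is mechanical; what it *means* (how the raw count maps to that normalized value) is still manufacturer-specific.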

That said, from your SMART table I'd do some searching on these attributes together with your disk manufacturer:
* Raw_Read_Error_Rate
* Start_Stop_Count
* Seek_Error_Rate
* Power-Off_Retract_Count
* Load_Cycle_Count
* Hardware_ECC_Recovered
* Reallocated_Event_Count

I'd check whether the raw values of Raw_Read_Error_Rate and Seek_Error_Rate, as used by your manufacturer, are worrying or not. Same thing for Hardware_ECC_Recovered. At work we have at least 4 Maxtor drives that show high and ever-increasing raw values, but they have worked without problems for years.
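As an illustration of why these raw values need manufacturer-specific interpretation: Seagate drives are commonly reported to pack two counters into a 48-bit raw value, an error count in the upper bits and an operation count in the lower 32 bits. This encoding is an assumption, not something the vendor documents officially, so confirm it with the vendor's own tool. Applied to the Seek_Error_Rate raw value from the table above:

```shell
# Hypothetical decoding of a Seagate-style 48-bit raw value:
# upper bits = actual error count, lower 32 bits = operation count.
# This split is a commonly reported convention, not vendor-confirmed.
raw=40211526                    # Seek_Error_Rate raw value from above
errors=$(( raw >> 32 ))         # upper bits: actual seek errors
seeks=$(( raw & 0xFFFFFFFF ))   # lower 32 bits: total seek operations
echo "seek errors: $errors out of $seeks seeks"
```

If this reading is right, the scary-looking 40211526 would just be the total number of seeks, with zero actual errors.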

The Reallocated_Event_Count also deserves some investigation: why is it so high while Reallocated_Sector_Ct and Current_Pending_Sector are zero?

Last, from your SMART table it seems that your drive often goes into standby/sleep mode. This can be seen in the high values of Start_Stop_Count, Load_Cycle_Count and Power-Off_Retract_Count. And in your initial report you said that you use suspend/resume. I think you should reduce these values: they are very high, and all these start/stop cycles will reduce (or already have reduced) the life of your disk. Maybe something on your system forces overly aggressive power saving on the disk. Is laptop-mode-tools installed?

It is a common problem, however, if you do some searching.
It is the reason I've put "apm=254" in my hdparm configuration: without it my disk parked its heads a bit too often *during normal PC usage*, and I could notice this as clicks and very brief unresponsiveness of the system. With that parameter I've forced my disk to run at full power without parking and going to sleep automatically.
Your disk could require different settings.
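To get a feel for how much of the drive's mechanical life these cycles may represent, here is a rough back-of-the-envelope check. The 600,000 figure is a hypothetical load/unload rating used purely as an example; the real number is in your drive's datasheet:

```shell
# Rough head load/unload wear estimate.
# rated_cycles=600000 is an assumed example rating, NOT your drive's
# actual spec -- check the datasheet for the real figure.
load_cycles=320283   # Load_Cycle_Count raw value from the table above
rated_cycles=600000
used_pct=$(( 100 * load_cycles / rated_cycles ))
echo "about ${used_pct}% of the assumed load/unload budget used"
```

If the real rating is in that ballpark, more than half the budget would already be gone, which supports taming the power-saving settings.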

I never had strange behaviour
related to memory, like kernel freezes or program core dumps, and I use the system
quite a lot with big (cross-)compiles and everything else that uses a lot of memory...

In your initial report you said that you noticed the problem starting with 2.6.28, but that you found it accidentally. Another test would be to try a previous kernel and see if it works, for example 2.6.26 from Lenny.
You can fetch older kernels from:
http://snapshot.debian.net/

I understand you have difficulty removing RAM, but it is another of the suspects. Is it the original from Lenovo?

A hard problem to solve.
Good luck.

Cesare.


