Disk corruption and performance issue.
This is rather long, so if you're replying to just one bit, please
consider trimming the parts you're not responding to, to make
everybody's life a little bit easier!
Some time ago I wrote about a data corruption issue. I've still not
managed to track it down, but I have two new datapoints (one inspired
by a recent thread) and I'm hoping someone will have ideas about how I
should move forward. By avoiding heavy disk load (and important
tasks/jobs!) on the problem machine I've had no more data corruption.
There are no errors/warnings anywhere. Part of me suspects a faulty SSD!
I have new disks on order, so I can replace the existing disks soon if
that's what it takes to fix this.
Inspired by the recent thread:
On the server that has no issues:
sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 512 bytes
These are then GPT partitioned: a small BIOS boot partition, an EFI
partition, and then a big "Linux filesystem" partition that is part of
an mdadm RAID:
md0 : active raid1 sda3[3] sdb3[2]
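(For anyone wanting to compare on their own systems, this is the sort
of thing that produces the output above:)

# logical/physical sector size for each disk
fdisk -l /dev/sda /dev/sdb | grep 'Sector size'
# or the same via lsblk
lsblk -o NAME,LOG-SEC,PHY-SEC
# RAID membership
cat /proc/mdstat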
On the server that has performance issues, and where I get occasional
data corruption (both reading and writing) under heavy disk load:
sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 4096 bytes
I'm wondering if that physical sector size is the issue. All the
partitions start on a 4k boundary, but the big partition is not an
exact multiple of 4k. Inside the RAID is an LVM PV, so I think
everything is 4k aligned anyway except the filesystems themselves, and
the "heavy load" filesystem that triggered the issue uses 4k blocks.
But I don't know whether something somewhere adds "padding" so that the
actual data doesn't start on a 4k boundary on the disk. There are a LOT
of partitions and filesystems in a complicated layered LVM setup, so it
will be easier for me to check with instructions than to try to provide
the data for someone else to check - if someone can give me
instructions to work out exactly where the data ends up on the disk (a
sketch of what I'm imagining is below). (All partitions are formatted
with ext3.)
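To be concrete, this is the sort of per-layer check I have in mind,
though I'm not sure these are the right incantations - corrections
welcome (the <vg>/<lv> names below are placeholders):

# partition start sectors (each should be a multiple of 8 x 512-byte
# sectors, i.e. 4k) and parted's own alignment verdict
parted /dev/sdb unit s print
parted /dev/sdb align-check opt 3

# offset of the md data area inside the partition
mdadm --examine /dev/sdb3 | grep -i offset

# offset of the first physical extent inside the PV
pvs -o pv_name,pe_start

# filesystem block size
tune2fs -l /dev/mapper/<vg>-<lv> | grep 'Block size'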
The rest of the setup is identical.
The new disks are the same make and model as sdb in this server - I hope
that's not a problem!
The second datapoint: my VMs all use iSCSI to provide their disks.
Normally a VM runs on the same server as its iSCSI target, but today I
did a kernel upgrade on a pair of VMs (the one on the "problem" machine
took about twice as long), then "cross booted" them and purged the old
kernel. I actually took timings of the purge here:
Booted on the problem machine but physical disk still on the OK machine:
real 0m35.731s
user 0m5.291s
sys 0m4.677s
Booted on the good machine but physical disk still on the problem
machine:
real 0m57.721s
user 0m5.446s
sys 0m4.783s
I was running these at the same time, which I think rules out CPU
contention. (I've done other tests that also suggest CPU/memory isn't
the issue; it seems to be disk, cabling, etc.)
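My next thought is to take the filesystem/LVM/iSCSI stack out of the
picture entirely with a raw read from each disk, something along these
lines (bypassing the page cache so the two machines are comparable):

# sequential read straight off the device, no caching
dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct
dd if=/dev/sdb of=/dev/null bs=1M count=1024 iflag=direct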
The SMART attributes from the problem machine:
sda:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 18280
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 54
177 Wear_Leveling_Count 0x0013 087 087 000 Pre-fail Always - 129
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 067 049 000 Old_age Always - 33
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 39
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 62154466086
sdb:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18697
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 067 067 000 Old_age Always - 433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 12
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 45
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 074 052 000 Old_age Always - 26 (Min/Max 0/48)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 1
202 Percent_Lifetime_Remain 0x0030 067 067 001 Old_age Offline - 33
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 63148678276
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1879223820
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1922002147
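(Those are from smartctl -A. I'm also planning to kick off long
self-tests, in case that shakes something out:)

# what's shown above
smartctl -A /dev/sda
smartctl -A /dev/sdb

# extended self-test; check the log once it finishes
smartctl -t long /dev/sdb
smartctl -l selftest /dev/sdb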
Does anything leap out at anyone? Anything I should try next? Normally
I try to avoid pairing together disks of the same brand bought at the
same time, but I'll give that a try if it will fix this.
Tim.