
Dying hard drive?



Hello,

I have 4 HDDs in software RAID10 in my backup server. I had help on this list
when I first configured it, and everything worked great for a year.

In the last few days I have noticed some load issues. During rsnapshot's backup
rotation the load would go very high. I watched iotop and jbd2 was at the top
most often. That was strange, since jbd2 is the kernel thread that writes the
ext4 journal, and I'm using ext4 only for the root and /boot partitions, not
for backup. I created a RAID1 array for /boot, a RAID10 array for root and
another, bigger RAID10 device for backup. The backup partition uses xfs; boot
and root use ext4 (I don't know why I used ext4 for boot, I know a journal
isn't necessary on the boot partition). The Debian installer set up LVM by
default, so I left it that way for boot and root.
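
In case the exact layout matters, it can be double-checked with nothing more
than the standard tools (no custom device names assumed, just a sketch):

  cat /proc/mdstat   # which sdX partitions back each md array, and their sync state
  lsblk -f           # which filesystems sit on the md devices and LVM volumes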

Here is a graph where you can see that iowait suddenly went up:
http://img163.imageshack.us/img163/8453/jef4.png


Since the backup partition is where most of the work happens, I thought the
high load was LVM's fault. I moved MySQL onto the xfs backup partition and that
improved the situation. Still, the load is much higher than it was 10 days ago:
idle, it used to be 0.1-0.3, now it's 1.5-3. iotop shows some mysqld and jbd2
processes even when the server is idle and not much data is being read or
written. Why that load, then? I was thinking of reinstalling Debian without LVM
on boot and root.
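
To pin down what is actually writing while the box is idle, I figure I can
watch accumulated per-process I/O and per-device latency with something like
this (iostat needs the sysstat package, which I'm assuming here):

  iotop -o -a    # only show processes doing I/O, with accumulated totals
  iostat -dx 2   # extended per-device stats (await, %util) every 2 seconds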

I remembered the atop command and that it also shows disk usage, so here is the
relevant part:

DSK |          sda | busy     80%  | read       2 | write    204  | KiB/r      4 | KiB/w     16 | MBr/s   0.00  | MBw/s   0.33 | avq     6.24  | avio 38.5 ms |
DSK |          sdd | busy     12%  | read       0 | write    215  | KiB/r      0 | KiB/w     16 | MBr/s   0.00  | MBw/s   0.36 | avq     5.25  | avio 5.51 ms |
DSK |          sdb | busy      9%  | read       0 | write    203  | KiB/r      0 | KiB/w     16 | MBr/s   0.00  | MBw/s   0.33 | avq     7.45  | avio 4.49 ms |
DSK |          sdc | busy      8%  | read       0 | write    215  | KiB/r      0 | KiB/w     16 | MBr/s   0.00  | MBw/s   0.36 | avq     8.91  | avio 3.89 ms |

Although all four disks see roughly the same amount of reads and writes, sda is
much busier and its average time per request ('avio') is far higher.
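
While waiting for the long self-test, I suppose I can also compare the raw
SMART attributes of sda with one of the disks that behaves normally; non-zero
reallocated or pending sectors, or a growing UDMA CRC error count, would point
at the drive or its cabling (attribute names may differ a bit per vendor):

  smartctl -A /dev/sda         # Reallocated_Sector_Ct, Current_Pending_Sector, UDMA_CRC_Error_Count
  smartctl -A /dev/sdb         # the same attributes on a healthy disk, for comparison
  dmesg | grep -i 'ata\|sda'   # kernel-side ATA errors, resets or timeouts involving sda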

So my next assumption is that sda is malfunctioning. I used smartctl to see if
I could get any useful information about that. After running smartctl -t short
/dev/sda, the health summary reports:

SMART overall-health self-assessment test result: PASSED

I have now started "smartctl -t long /dev/sda", but it will take four hours to
finish. Until I have those results, I thought I would ask for your opinion. Can
I assume the hard drive is failing? Could there be some other cause for this
strange sda behavior?
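
Once the long test finishes, my plan is to check the self-test log and the full
report, which should show the LBA of the first error if the test fails:

  smartctl -l selftest /dev/sda   # self-test history, including any failure and its first error LBA
  smartctl -a /dev/sda            # full report: health status, attributes, error log, self-test log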

Regards,
Veljko

