Dying hard drive?
Hello,
I have 4 HDDs in software RAID10 for my backup server. I had help on this list
when I started to configure it, and everything worked great for a year.
Over the last few days I noticed some load issues: during rsnapshot's backup
rotation, the load would go very high. I watched iotop, and jbd2 was at the
top most often. That was strange, since jbd2 is the ext4 journaling thread
and I'm using ext4 only for the boot and root partitions. I created a RAID1
for /boot, a RAID10 for root, and another, larger RAID10 device for backups.
The backup partition uses xfs; boot and root use ext4 (I don't know why I
used ext4 for boot, since I know a journal isn't necessary on a boot
partition). The Debian installer set up LVM by default, so I left it that
way for boot and root.
Here is a graph where you can see iowait suddenly going up:
http://img163.imageshack.us/img163/8453/jef4.png
Since the backup partition is where most of the work happens, I first thought
the high load was LVM's fault. I moved mysql to the xfs backup partition,
which improved the situation. Still, the load is much higher than it was 10
days ago: idle, it used to be 0.1-0.3; now it's 1.5-3. iotop shows some
mysqld and jbd2 processes even when the server is idle and not much data is
being read or written. So where is that load coming from? I was thinking of
reinstalling Debian without LVM on boot and root.
Then I remembered that atop also shows disk usage. Here is the relevant part:
DSK | sda | busy 80% | read 2 | write 204 | KiB/r 4 | KiB/w 16 | MBr/s 0.00 | MBw/s 0.33 | avq 6.24 | avio 38.5 ms |
DSK | sdd | busy 12% | read 0 | write 215 | KiB/r 0 | KiB/w 16 | MBr/s 0.00 | MBw/s 0.36 | avq 5.25 | avio 5.51 ms |
DSK | sdb | busy 9% | read 0 | write 203 | KiB/r 0 | KiB/w 16 | MBr/s 0.00 | MBw/s 0.33 | avq 7.45 | avio 4.49 ms |
DSK | sdc | busy 8% | read 0 | write 215 | KiB/r 0 | KiB/w 16 | MBr/s 0.00 | MBw/s 0.36 | avq 8.91 | avio 3.89 ms |
Although all four disks see essentially the same amount of data being read
and written, sda is much busier, and its average number of milliseconds per
request ('avio') is roughly an order of magnitude higher.
So my next assumption is that sda is malfunctioning. I used smartctl to see
if I could get any useful information. After a short self-test ('smartctl -t
short /dev/sda'), the overall health check still reports:
SMART overall-health self-assessment test result: PASSED
I have now started "smartctl -t long /dev/sda", but it will take four hours
to finish. Until I have those results, I thought I'd ask for your opinions.
Can I assume the drive is failing, or could there be some other cause for
this strange sda behavior?
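One more thing I plan to look at while the long test runs: as far as I
understand, PASSED only means no attribute has crossed its vendor threshold,
so the raw attribute counters from 'smartctl -A /dev/sda' can be more
telling. A sketch of the filter I'd use; the attribute lines in the here-doc
are invented sample output for illustration, not from my disk:

```shell
# Hypothetical sketch: pick out the SMART attributes that most often betray
# a dying disk. Real usage would be:
#   smartctl -A /dev/sda | egrep -i 'Realloc|Pending|Uncorrect|CRC_Error'
# Here the pipeline is fed invented sample lines instead.
bad=$(egrep -i 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|CRC_Error' <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036   Pre-fail  Always   -   0
197 Current_Pending_Sector  0x0012   100   100   000   Old_age   Always   -   0
199 UDMA_CRC_Error_Count    0x003e   200   200   000   Old_age   Always   -   0
EOF
)
echo "$bad"
```

Nonzero raw values in those counters (the last column) would point at a
failing drive or cable even with an overall PASSED verdict.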
Regards,
Veljko