
Re: Worst Admin Mistake? was --> Re: /usr broken, will the machine reboot ?



Bryan Irvine wrote:
Which brings me to another fun question. What's your worst administration mistake and how did you recover? -Bryan

Discovered, the hard way, the symptoms of a failing drive in a RAID array - and ended up completely rebuilding an O/S install and restoring from backup.

Had a server that was running slower... and slower... and slower. Still running, but taking forever to respond to even the simplest commands. I couldn't figure out what was wrong - some things made it look like hardware, some like software.

Long story short: it turned out one of the drives in a 4-drive RAID array had a high, and increasing, raw read error rate. Since the drive's firmware kept re-reading, and eventually succeeding, the drive simply slowed down - and dragged the response time of the entire array down with it. That's when I discovered (after the fact) that the Linux md driver doesn't treat long delays as a reason to fail a drive out of an array.
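For what it's worth, here's the kind of check that would have caught it early. This is just a sketch, not what I was actually running at the time - it assumes smartmontools is installed, and the device names are made up:

import subprocess

DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # hypothetical 4-drive array

for dev in DEVICES:
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # "smartctl -A" attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Raw_Read_Error_Rate":
            # Caveat: some vendors (Seagate, notably) pack extra data into
            # this raw value, so watch the trend, not just the number.
            print(f"{dev}: Raw_Read_Error_Rate raw value = {fields[9]}")

Run that from cron and graph the numbers, and a drive that's quietly retrying its way to death shows up long before the array crawls.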

Worse: when you're running a high-availability configuration (Xen, Pacemaker, DRBD, etc.), one slow drive in an array on one server drags down the DRBD mirror as well. The good news: when I powered down the failing system, the backup took over just fine. The bad news: I trashed some stuff before figuring this out. Sigh...
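In hindsight, the tell-tale would have been per-disk read latency: one member of the array taking far longer per read than its siblings. Something like this sketch - sampling /proc/diskstats twice and comparing - would have shown it (again, the member names are made up):

import time

MEMBERS = ["sda", "sdb", "sdc", "sdd"]  # hypothetical array members

def read_stats():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] in MEMBERS:
                # field 4 = reads completed, field 7 = ms spent reading
                stats[parts[2]] = (int(parts[3]), int(parts[6]))
    return stats

before = read_stats()
time.sleep(10)
after = read_stats()

for name in MEMBERS:
    d_reads = after[name][0] - before[name][0]
    d_ms = after[name][1] - before[name][1]
    if d_reads:
        print(f"{name}: {d_ms / d_reads:.1f} ms per read on average")

A healthy disk sits in the single-digit milliseconds; the one doing internal retries sticks out by an order of magnitude.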

If I had known, I could have pulled the one drive, plugged in a new one, let the array rebuild, and kept on going. Unfortunately, what I did instead was lots of diagnostics and lots of trial and error, trashing my system and some user data along the way (not a lot - good backups), and ultimately I had to reinstall the O/S and restore from backup.
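For the record, the swap itself is straightforward with mdadm. Roughly these steps - sketched here in Python around the mdadm calls, with a hypothetical array and device names:

import subprocess

ARRAY = "/dev/md0"   # hypothetical
BAD = "/dev/sdb1"    # the failing member (hypothetical)
NEW = "/dev/sde1"    # the replacement, partitioned to match (hypothetical)

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("mdadm", "--manage", ARRAY, "--fail", BAD)    # mark the slow drive as failed
run("mdadm", "--manage", ARRAY, "--remove", BAD)  # drop it from the array
# ... physically swap the disk here ...
run("mdadm", "--manage", ARRAY, "--add", NEW)     # start the rebuild
run("mdadm", "--detail", ARRAY)                   # check state / resync progress

Then it's just a matter of watching /proc/mdstat until the resync finishes.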

Four lessons learned:

- RAID and high-availability configurations are still vulnerable to a single failing drive.
- Keep a close eye on the raw read error rates of your drives (anything over 0 raises questions).
- Be sure to purchase server-grade drives - they assume failures will be handled by the RAID layer, so they spend less time trying to recover from a read error.
- When one disk starts going, replace them all (assuming they went online at the same time)... it's amazing how similar the lifetimes of all the disks in an array turn out to be.

Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra


