
Re: Worst Admin Mistake? was --> Re: /usr broken, will the machine reboot ?



Bryan Irvine wrote:
Which brings me to another fun question. What's your worst administration mistake and how did you recover? -Bryan

Discovered, the hard way, the symptoms of a failing drive in a RAID array - and ended up completely rebuilding an O/S install and restoring from backup.

Had a server that was running slower... and slower... and slower. Still running, but taking forever to respond to even the simplest commands. I couldn't figure out what was wrong - some things made it look like hardware, some like software.

Long story short: it turned out one of the drives in a 4-drive RAID array had a high, and increasing, raw read error rate. Since the drive's firmware kept re-reading, and eventually succeeding, the drive simply slowed down - and dragged the response time of the entire array down with it. That's when I discovered (after the fact) that the Linux md driver doesn't treat long delays as a reason to fail a drive out of an array.
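For what it's worth, here's the kind of check that would have caught it early. This is just a sketch, not what I was actually running at the time - it assumes smartmontools is installed, and the device names are made up:

import subprocess

DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # hypothetical 4-drive array

for dev in DEVICES:
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # "smartctl -A" attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Raw_Read_Error_Rate":
            # Caveat: some vendors (Seagate, notably) pack extra data into
            # this raw value, so watch the trend, not just the number.
            print(f"{dev}: Raw_Read_Error_Rate raw value = {fields[9]}")

Run that from cron and graph the numbers, and a drive that's quietly retrying its way to death shows up long before the array crawls.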

Worse: when you're running a high-availability configuration (Xen, Pacemaker, DRBD, etc.), one slow drive in an array on one server drags down the DRBD mirror as well. The good news: when I powered down the failing system, the backup took over just fine. The bad news: I trashed some stuff before figuring this out. Sigh...
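In hindsight, the tell-tale would have been per-disk read latency: one member of the array taking far longer per read than its siblings. Something like this sketch - sampling /proc/diskstats twice and comparing - would have shown it (again, the member names are made up):

import time

MEMBERS = ["sda", "sdb", "sdc", "sdd"]  # hypothetical array members

def read_stats():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] in MEMBERS:
                # field 4 = reads completed, field 7 = ms spent reading
                stats[parts[2]] = (int(parts[3]), int(parts[6]))
    return stats

before = read_stats()
time.sleep(10)
after = read_stats()

for name in MEMBERS:
    d_reads = after[name][0] - before[name][0]
    d_ms = after[name][1] - before[name][1]
    if d_reads:
        print(f"{name}: {d_ms / d_reads:.1f} ms per read on average")

A healthy disk sits in the single-digit milliseconds; the one doing internal retries sticks out by an order of magnitude.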

If I had known, I could have pulled the one drive, plugged in a new one, let the array rebuild, and kept on going. Unfortunately, what I did instead was lots of diagnostics and lots of trial and error, trashing my system and some user data along the way (not a lot - good backups), and ultimately I had to reinstall the O/S and restore from backup.
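For the record, the swap itself is straightforward with mdadm. Roughly these steps - sketched here in Python around the mdadm calls, with a hypothetical array and device names:

import subprocess

ARRAY = "/dev/md0"   # hypothetical
BAD = "/dev/sdb1"    # the failing member (hypothetical)
NEW = "/dev/sde1"    # the replacement, partitioned to match (hypothetical)

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("mdadm", "--manage", ARRAY, "--fail", BAD)    # mark the slow drive as failed
run("mdadm", "--manage", ARRAY, "--remove", BAD)  # drop it from the array
# ... physically swap the disk here ...
run("mdadm", "--manage", ARRAY, "--add", NEW)     # start the rebuild
run("mdadm", "--detail", ARRAY)                   # check state / resync progress

Then it's just a matter of watching /proc/mdstat until the resync finishes.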

Four lessons learned:

- RAID and high-availability configurations are still vulnerable to a single failing drive.
- Keep a close eye on the raw read error rates of your drives (anything over 0 raises questions).
- Be sure to purchase server-grade drives - they assume failures will be handled by the RAID layer, so they spend less time trying to recover from a read error.
- When one disk starts going, replace them all (assuming they went online at the same time)... it's amazing how similar the lifetimes of all the disks in an array turn out to be.

Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra


