Re: problems with server dying.



On Tue, Jun 06, 2000 at 10:41:09AM +1000, Marc-Adrian Napoli wrote:
> Hi all, as you can guess from the subject i have a server (debian 2.0,
> pentium 200, 500mb ram, 30 gig or so) that is dying on me at random times
> early in the morning!!
> 
> (quite annoying).
> 
> I've gathered the following from the logs:
> 
> Jun  5 06:43:42 godzilla kernel: EXT2-fs error (device 16:00): ext2_readdir: bad entry in directory #12628: rec_len % 4 != 0 - offset=0, inode=3326860705, rec_len=34410, name_len=31708
> Jun  5 06:43:42 godzilla kernel: Remounting filesystem read-only
> Jun  5 06:43:42 godzilla kernel: EXT2-fs error (device 16:00): ext2_find_entry: bad entry in directory #12628: rec_len % 4 != 0 - offset=0, inode=3326860705, rec_len=34410, name_len=31708
> Jun  5 06:43:42 godzilla kernel: Remounting filesystem read-only
> Jun  5 07:02:50 godzilla kernel: hdc: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> Jun  5 07:02:50 godzilla kernel: hdc: read_intr: error=0x40 UncorrectableError }, LBAsect=4512574, sector=4512574
> Jun  5 07:02:50 godzilla kernel: end_request: I/O error, dev 16:00, sector 4512574
> Jun  5 07:03:00 godzilla kernel: hdc: irq timeout: status=0xd0 { Busy }
> Jun  5 07:03:01 godzilla kernel: ide1: reset: success
> Jun  5 07:03:08 godzilla kernel: hdc: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> Jun  5 07:03:08 godzilla kernel: hdc: read_intr: error=0x40 UncorrectableError }, LBAsect=4512720, sector=4512720
> Jun  5 07:03:08 godzilla kernel: end_request: I/O error, dev 16:00, sector 4512720
> Jun  5 07:03:19 godzilla kernel: hdc: irq timeout: status=0xd0 { Busy }
> Jun  5 07:03:21 godzilla kernel: ide1: reset: success
> Jun  5 07:03:31 godzilla kernel: hdc: irq timeout: status=0xd0 { Busy }
> Jun  5 07:03:35 godzilla kernel: ide1: reset: success
> Jun  5 07:03:43 godzilla kernel: hdc: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> Jun  5 07:03:43 godzilla kernel: hdc: read_intr: error=0x01 AddrMarkNotFound }, LBAsect=4512574, sector=4512574
> 
> When the techie on call at that time put a monitor on the box he saw
> "Couldn't get free page..." all the way down the screen and couldn't get a
> prompt. (Forcing us to hard reboot the system).
> 

Looks like /dev/hdc is in trouble.  Other devices on the same
IDE cable or on your PCI bus could cause errors like these, but
all of the messages seem to point at the same device ("device
16:00" is major 22, minor 0, i.e. hdc, and ide1 is the channel
it sits on), so that drive is the most likely source.
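
One rough way to check that every message really does point at the
same drive is to tally the log lines by device.  The sketch below is
only an illustration: the function names are made up, and it assumes
the classic IDE numbering (major 3 for hda/hdb, major 22 for hdc/hdd,
64 minors per drive) with the kernel printing devices as hex
"major:minor" pairs, as in the log above.

```python
import re
from collections import Counter

# Assumed classic IDE block majors: 3 = first channel, 22 = second channel.
IDE_MAJORS = {3: ["hda", "hdb"], 22: ["hdc", "hdd"]}

DEV_PAIR = re.compile(r"\b(?:device|dev) ([0-9a-f]{2}:[0-9a-f]{2})")
HD_NAME = re.compile(r"kernel: (hd[a-h]):")


def decode_dev(hexpair):
    """Turn a hex 'MM:mm' pair into a name like 'hdc' or 'hdb1'.

    Pairs that are not on a known IDE major are returned unchanged.
    """
    major, minor = (int(part, 16) for part in hexpair.split(":"))
    unit = minor >> 6              # 64 minors per drive
    if major not in IDE_MAJORS or unit >= len(IDE_MAJORS[major]):
        return hexpair
    partition = minor & 63         # 0 means the whole disk
    name = IDE_MAJORS[major][unit]
    return name + (str(partition) if partition else "")


def tally(lines):
    """Count log lines per device, from either 'dev MM:mm' or 'hdX:' forms."""
    counts = Counter()
    for line in lines:
        pair = DEV_PAIR.search(line)
        if pair:
            counts[decode_dev(pair.group(1))] += 1
            continue
        named = HD_NAME.search(line)
        if named:
            counts[named.group(1)] += 1
    return counts
```

Feeding it the unquoted, unwrapped log lines (for example
`tally(open("/var/log/kern.log"))`, where the path is a guess for
your system) gives one counter entry per device.  If only hdc shows
up, that supports blaming the drive rather than something else on
the bus.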

Most likely causes:
  - /dev/hdc is dying.
  - Overheating.  Especially if you have several drives, with
    maybe not as much air space between them as they might
    prefer, and have only a CPU and PSU fan.  Especially if
    there's an extended period of disk activity (say, some 7am 
    cron jobs) before things come unstuck.
  - Bad or poorly fitted IDE or power cable.
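
To see whether the errors cluster around a particular time of day
(which would fit the overheating-after-cron-jobs theory), the error
lines can be bucketed by hour.  This is a minimal sketch: the
function name and the list of error patterns are assumptions based
only on the messages quoted above, and it expects plain syslog lines
with any "> " quoting already stripped.

```python
import re
from collections import Counter

# Syslog timestamp at the start of a line, e.g. "Jun  5 07:02:50 ".
STAMP = re.compile(r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}\s")
# Message fragments treated as disk trouble (taken from the quoted log).
ERROR_HINT = re.compile(r"I/O error|EXT2-fs error|read_intr|irq timeout")


def errors_by_hour(lines):
    """Count disk-error log lines per hour of the day (0-23)."""
    hours = Counter()
    for line in lines:
        stamp = STAMP.match(line)
        if stamp and ERROR_HINT.search(line):
            hours[int(stamp.group(1))] += 1
    return hours
```

If nearly all of the counts land in the same hour or two across
several days, compare that against the crontab to see what heavy
disk job runs just beforehand.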

I'd take a close look at the cooling for /dev/hdc, and maybe
give it more room or install a drive fan, but I'd also line up
a replacement drive just in case.



John P.
-- 
huiac@camtech.net.au
john@huiac.apana.org.au
http://www.mdt.net.au/~john Debian Linux admin & support:technical services