
Etch: Stock kernel issues with ATI SB600 SATA controller and 2xSamsung HD403LJ ?



Hi,

I have run into problems with the I/O subsystem of a new server running
Debian Etch (2.6.18-6-vserver-amd64 (Debian 2.6.18.dfsg.1-18etch1)) and
am wondering whether they are related to SB600 patches that are not yet
part of the Debian kernel.

The problem occurs with both disks and persists even after the _entire_
hardware of the server was replaced. A hardware defect in one of the
individual disks, the motherboard or the cables therefore seems
unlikely.

Controller:
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
( full lspci -vv: http://pastesite.com/326/123 )

Disks:
two SAMSUNG HD403LJ (FW: CT100-12) in a software raid1
( full smartctl/hdparm output: http://pastesite.com/328/123 )

Complete bootup dmesg output:
http://pastesite.com/327/123


The system runs without problems, but after a few days or weeks of
heavy I/O load one of the following two situations can occur.

a) An error occurs during access to a disk, after which the DMA/PIO
mode of the disk is downgraded step by step.
  ata2.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2 frozen
  ata2.00: tag 0 cmd 0x61 Emask 0x44 stat 0x40 err 0x0 (timeout)
  ata2: soft resetting port
  ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  ata2.00: configured for UDMA/133
  ata2: EH complete
  ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
  ...
  ata2.00: configured for UDMA/100
  ...
  ata2.00: configured for UDMA/66
  ...
  ...
  ata2.00: configured for PIO4

All of that happens within one second (according to the syslog
timestamps). Somewhere in between, raid1 gives up and drops the disk
from the array. The disk nevertheless remains accessible after the
downgrade and can still be used normally.
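
To make the downgrade sequence in a) easier to spot in a long syslog, the
"configured for" lines can be collected per device. The following is only a
sketch: the function names and the sample text (condensed from the excerpts
in this mail) are made up for illustration, and it assumes the libata
message format shown above.

```python
import re

# Matches libata lines such as "ata2.00: configured for UDMA/133".
CONFIGURED = re.compile(r"\b(ata\d+\.\d+): configured for (\S+)")


def transfer_modes(log_text):
    """Return {device: [mode, ...]} in the order the modes were logged."""
    modes = {}
    for line in log_text.splitlines():
        match = CONFIGURED.search(line)
        if match:
            device, mode = match.groups()
            modes.setdefault(device, []).append(mode)
    return modes


def downgraded(modes):
    """Return the devices whose last logged mode differs from their first."""
    return sorted(dev for dev, seq in modes.items() if seq[0] != seq[-1])


# Hypothetical sample, condensed from the log excerpts in this mail.
sample = """\
ata2.00: configured for UDMA/133
ata2: EH complete
ata2.00: configured for UDMA/100
ata2.00: configured for UDMA/66
ata2.00: configured for PIO4
ata1.00: configured for UDMA/133
"""

modes = transfer_modes(sample)
print(modes)
print(downgraded(modes))
```

Comparing only the first and last mode is a deliberate simplification; it
flags a device that ended up in a different mode than it started in, which
is what situation a) looks like in the log.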


b) An error occurs during access to a disk, but the DMA/PIO mode is not
changed. The disk becomes totally inaccessible until the system is
rebooted.
  ata1.00: exception Emask 0x40 SAct 0xf SErr 0x800 action 0x2 frozen
  ata1.00: tag 0 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
  ata1.00: tag 1 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
  ata1.00: tag 2 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
  ata1.00: tag 3 cmd 0x60 Emask 0x44 stat 0x40 err 0x0 (timeout)
  ata1: soft resetting port
  ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  ata1.00: configured for UDMA/133
  ata1: EH complete
  ata1.00: exception Emask 0x0 SAct 0xc SErr 0x0 action 0x2 frozen
  ata1.00: tag 2 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
  ata1.00: tag 3 cmd 0x60 Emask 0x4 stat 0x40 err 0x0 (timeout)
  ata1: soft resetting port
  sda, sector 362827470
  sd 0:0:0:0: SCSI error: return code = 0x00040000
  end_request: I/O error, dev sda, sector 362827470

After that the disk can't be accessed in any way, e.g.:
  # smartctl -d ata -a /dev/sda
  Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

  # hdparm -I /dev/sda
  /dev/sda:
   HDIO_DRIVE_CMD(identify) failed: Input/output error

  # hdparm -w /dev/sda
  /dev/sda:
   HDIO_DRIVE_RESET failed: Inappropriate ioctl for device

Full error output:
  http://pastesite.com/320/123
  http://pastesite.com/329/123
  http://pastesite.com/330/123
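
For anyone reading the excerpts: as far as I understand the libata output,
SAct is a bitmask of the NCQ tags that were outstanding, cmd is the ATA
opcode, and the SCSI return code carries a host byte in bits 16-23. The
sketch below decodes the values quoted in this mail. The helper names are
made up for illustration, and the two lookup tables are small subsets taken
from the ATA command set and the Linux SCSI host byte codes as I know them,
so treat the output as a reading aid rather than a diagnosis.

```python
# ATA opcodes seen in the excerpts (subset of the ATA command set).
ATA_COMMANDS = {
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
}

# Host byte values used by the Linux SCSI layer (subset).
SCSI_HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x07: "DID_ERROR",
}


def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct value."""
    return [tag for tag in range(32) if sact & (1 << tag)]


def host_byte(result):
    """Extract the host byte (bits 16-23) from a SCSI return code."""
    return (result >> 16) & 0xFF


# Values copied from the log excerpts in this mail.
for sact in (0x1, 0xF, 0xC):
    print(hex(sact), "-> tags", active_tags(sact))

for cmd in (0x60, 0x61):
    print(hex(cmd), "->", ATA_COMMANDS[cmd])

code = host_byte(0x00040000)
print(hex(code), "->", SCSI_HOST_BYTES.get(code, "unknown"))
```

Read this way, the second exception in b) (SAct 0xc) covers tags 2 and 3
only, which matches the two tag lines logged right after it.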


Has anyone experienced similar problems?
I noticed that the SB600 driver has received a number of
modifications/patches since 2.6.18. Does anyone know whether such driver
updates/fixes are usually backported to the Debian stock kernel?

Greetings
Hans

