
SATA SB700/SB800 I/O errors



Hey there,

Our DRBD primary machine experienced a rather spontaneous reboot some time ago.

We were happily starting and stopping KVM virtual machines and syncing a
new DRBD resource when this happened:

...
Feb 29 06:53:47 node2 kernel: [217385.578661] ata3.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.578703] sd 2:0:0:0: [sda]
Unhandled error code
Feb 29 06:53:47 node2 kernel: [217385.578707] sd 2:0:0:0: [sda]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.578712] sd 2:0:0:0: [sda] CDB:
Read(10): 28 00 19 74 18 00 00 01 38 00
Feb 29 06:53:47 node2 kernel: [217385.661238] sd 2:0:0:0: [sda] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.661977] sd 2:0:0:0: [sda]
START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.661981] sd 2:0:0:0: [sda]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.662391] ata4.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.668821] sd 3:0:0:0: [sdb] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.668864] sd 3:0:0:0: [sdb]
START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.668867] sd 3:0:0:0: [sdb]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.669000] ata5.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.686506] md: super_written gets
error=-5, uptodate=0
Feb 29 06:53:47 node2 kernel: [217385.755989] md: super_written gets
error=-5, uptodate=0
Feb 29 06:53:47 node2 kernel: [217385.756202] sd 4:0:0:0: [sdc] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.756257] sd 4:0:0:0: [sdc]
START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.756260] sd 4:0:0:0: [sdc]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.756779] ata6.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.816675] md: super_written gets
error=-5, uptodate=0
Feb 29 06:53:47 node2 kernel: [217385.900415] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.900418]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.900421]  disk 0, o:0, dev:sda
Feb 29 06:53:47 node2 kernel: [217385.900424]  disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.900426]  disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.900429]  disk 3, o:0, dev:sdd
Feb 29 06:53:47 node2 kernel: [217385.900771] sd 5:0:0:0: [sdd] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.901157] sd 5:0:0:0: [sdd]
START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.901162] sd 5:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.901487] ahci 0000:00:11.0: PCI
INT A disabled
Feb 29 06:53:47 node2 kernel: [217385.902756] pci-stub 0000:00:11.0:
claimed by stub
Feb 29 06:53:47 node2 kernel: [217385.904721] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.904727]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.904732]  disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.904735]  disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.904738]  disk 3, o:0, dev:sdd
Feb 29 06:53:47 node2 kernel: [217385.904752] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.904755]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.904757]  disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.904759]  disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.904762]  disk 3, o:0, dev:sdd
Feb 29 06:53:47 node2 kernel: [217385.916029] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.916035]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.916040]  disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.916042]  disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.916056] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.916058]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.916060]  disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.916062]  disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.932427] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.932432]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.932437]  disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.932450] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.932452]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.932455]  disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.948162] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.948168]  --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.949817] block drbd0: Barriers
not supported on meta data device - disabling
Feb 29 06:53:47 node2 kernel: [217385.950177] block drbd0: read:
error=-5 s=232535040s
Feb 29 06:53:47 node2 kernel: [217385.950184] block drbd0: Resync aborted.
Feb 29 06:53:47 node2 kernel: [217385.950189] block drbd0: conn(
SyncSource -> Connected ) disk( UpToDate -> Failed )
Feb 29 06:53:47 node2 kernel: [217385.981468] block drbd0: read:
error=-5 s=232536064s
Feb 29 06:53:47 node2 kernel: [217385.981479] block drbd0: read:
error=-5 s=232534016s
Feb 29 06:53:47 node2 kernel: [217385.981648] block drbd0: read:
error=-5 s=232535048s

<snip ~ 600 more lines like this...>

Feb 29 06:53:47 node2 kernel: [217385.985444] block drbd0: p write: error=-5
Feb 29 06:53:47 node2 kernel: [217386.016978] block drbd0: p write: error=-5
Feb 29 06:53:47 node2 kernel: [217386.136316] block drbd0: helper
command: /sbin/drbdadm pri-on-incon-degr minor-0
Feb 29 06:53:47 node2 kernel: [217386.153546] block drbd0: read:
error=-5 s=232539272s

Feb 29 06:53:47 node2 notify-pri-on-incon-degr.sh[25841]: invoked for lv0

Feb 29 06:53:48 node2 kernel: [217386.403458] lost page write due to
I/O error on drbd0
Feb 29 06:53:48 node2 kernel: [217386.471193] lost page write due to
I/O error on drbd0

Feb 29 06:53:48 node2 kernel: [217386.511306] block drbd1: p write: error=-5
Feb 29 06:53:48 node2 kernel: [217386.526164] block drbd1: disk(
UpToDate -> Failed )
Feb 29 06:53:48 node2 kernel: [217386.585614] block drbd1: p write: error=-5
Feb 29 06:53:48 node2 kernel: [217386.624749] block drbd1: disk(
Failed -> Diskless )
Feb 29 06:53:48 node2 kernel: [217386.624764] block drbd1: Notified
peer that my disk is broken.

Feb 29 06:53:48 node2 kernel: [217386.917071] ahci 0000:00:11.0: PCI
INT A -> GSI 19 (level, low) -> IRQ 19
Feb 29 06:53:48 node2 kernel: [217386.917872] ahci 0000:00:11.0: AHCI
0001.0200 32 slots 4 ports 3 Gbps 0xf impl SATA mode
Feb 29 06:53:48 node2 kernel: [217386.917879] ahci 0000:00:11.0:
flags: 64bit ncq sntf ilck pm led clo pmp pio slum part
Feb 29 06:53:48 node2 kernel: [217386.918363] scsi7 : ahci
Feb 29 06:53:48 node2 kernel: [217386.918492] scsi8 : ahci
Feb 29 06:53:48 node2 kernel: [217386.918571] scsi9 : ahci
Feb 29 06:53:48 node2 kernel: [217386.919291] scsi10 : ahci
Feb 29 06:53:48 node2 kernel: [217386.919361] ata7: SATA max UDMA/133
abar m1024@0xfe4ffc00 port 0xfe4ffd00 irq 30
Feb 29 06:53:48 node2 kernel: [217386.919367] ata8: SATA max UDMA/133
abar m1024@0xfe4ffc00 port 0xfe4ffd80 irq 30
Feb 29 06:53:48 node2 kernel: [217386.919372] ata9: SATA max UDMA/133
abar m1024@0xfe4ffc00 port 0xfe4ffe00 irq 30
Feb 29 06:53:48 node2 kernel: [217386.919377] ata10: SATA max UDMA/133
abar m1024@0xfe4ffc00 port 0xfe4ffe80 irq 30
Feb 29 06:53:49 node2 kernel: [217387.404053] ata9: SATA link up 3.0
Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.404091] ata7: SATA link up 3.0
Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.404116] ata8: SATA link up 3.0
Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.404141] ata10: SATA link up 3.0
Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.409780] ata9.00: ATA-8: SAMSUNG
HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.409786] ata9.00: 1953525168
sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.409823] ata8.00: ATA-8: SAMSUNG
HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.409828] ata8.00: 1953525168
sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.410142] ata7.00: ATA-8: SAMSUNG
HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.410149] ata7.00: 1953525168
sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.410198] ata10.00: ATA-8: SAMSUNG
HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.410203] ata10.00: 1953525168
sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.415580] ata9.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.415615] ata8.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.415939] ata7.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.415980] ata10.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.428686] scsi 7:0:0:0:
Direct-Access     ATA      SAMSUNG HD103SJ  1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.429015] sd 7:0:0:0: [sdf]
1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.429450] scsi 8:0:0:0:
Direct-Access     ATA      SAMSUNG HD103SJ  1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.429756] scsi 9:0:0:0:
Direct-Access     ATA      SAMSUNG HD103SJ  1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.430666] sd 9:0:0:0: [sdh]
1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.430741] sd 9:0:0:0: [sdh] Write
Protect is off
Feb 29 06:53:49 node2 kernel: [217387.430774] sd 9:0:0:0: [sdh] Write
cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.430974]  sdh:
Feb 29 06:53:49 node2 kernel: [217387.431199] sd 8:0:0:0: [sdg]
1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.431278] sd 8:0:0:0: [sdg] Write
Protect is off
Feb 29 06:53:49 node2 kernel: [217387.431313] sd 8:0:0:0: [sdg] Write
cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.436193]  sdg:
Feb 29 06:53:49 node2 kernel: [217387.436382] scsi 10:0:0:0:
Direct-Access     ATA      SAMSUNG HD103SJ  1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.436580] sd 10:0:0:0: [sdi]
1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.436649] sd 10:0:0:0: [sdi] Write
Protect is off
Feb 29 06:53:49 node2 kernel: [217387.436682] sd 10:0:0:0: [sdi] Write
cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.436888]  sdi:
Feb 29 06:53:49 node2 kernel: [217387.437033] sd 7:0:0:0: [sdf] Write
Protect is off
Feb 29 06:53:49 node2 kernel: [217387.437064] sd 7:0:0:0: [sdf] Write
cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.437212]  sdf: unknown partition table
Feb 29 06:53:49 node2 kernel: [217387.439934] sd 9:0:0:0: [sdh]
Attached SCSI disk
Feb 29 06:53:49 node2 kernel: [217387.445677]  unknown partition table
Feb 29 06:53:49 node2 kernel: [217387.446324] sd 10:0:0:0: [sdi]
Attached SCSI disk
Feb 29 06:53:49 node2 kernel: [217387.451006]  unknown partition table
Feb 29 06:53:49 node2 kernel: [217387.451309] sd 8:0:0:0: [sdg]
Attached SCSI disk
Feb 29 06:53:49 node2 kernel: [217387.451325]
Feb 29 06:53:49 node2 kernel: [217387.452053] sd 7:0:0:0: [sdf]
Attached SCSI disk

<snip DRBD probably does the right thing and initiates a reboot>

Feb 29 06:53:50 node2 notify-emergency-reboot.sh[25900]: invoked for lv0
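
For anyone who wants to sift through a log like the one above, here is a minimal Python 3 sketch. It re-joins the lines the mail client wrapped, lists the ATA ports that were disabled, counts DRBD I/O errors per device (error=-5 is -EIO), and decodes the Read(10) CDB. The function names and the embedded sample are only illustrative; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# A few records copied from the log above (already on one line each).
SAMPLE = """\
Feb 29 06:53:47 node2 kernel: [217385.578661] ata3.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.662391] ata4.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.950177] block drbd0: read: error=-5 s=232535040s
Feb 29 06:53:47 node2 kernel: [217385.985444] block drbd0: p write: error=-5
Feb 29 06:53:48 node2 kernel: [217386.511306] block drbd1: p write: error=-5
"""

# A new syslog record starts with a timestamp such as "Feb 29 06:53:47 ".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")


def unwrap(text):
    """Re-join records that were wrapped onto continuation lines."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if RECORD_START.match(line) or not records:
            records.append(line.rstrip())
        else:
            records[-1] += " " + line.strip()
    return records


def summarize(text):
    """Return (list of disabled ATA ports, Counter of EIO errors per DRBD device)."""
    disabled = []
    eio = Counter()
    for rec in unwrap(text):
        m = re.search(r"\b(ata\d+)\.\d+: disabled", rec)
        if m:
            disabled.append(m.group(1))
        m = re.search(r"block (drbd\d+): .*error=-5\b", rec)
        if m:
            eio[m.group(1)] += 1
    return disabled, eio


def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB: bytes 2-5 are the LBA, bytes 7-8 the block count."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


if __name__ == "__main__":
    ports, errors = summarize(SAMPLE)
    print("disabled ports:", ports)
    print("EIO per DRBD device:", dict(errors))
    # CDB taken from the failed read on sda in the log above.
    print("READ(10) lba/blocks:", decode_read10("28 00 19 74 18 00 00 01 38 00"))
```

With the CDB from the log, the failed read on sda starts at LBA 427038720 and covers 312 blocks (156 KiB at 512 bytes per block).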

Setup

Both nodes run the stock Debian squeeze kernel, 2.6.32-5-amd64.

node2 drbd primary
HP Proliant Micro Server
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA
Controller [AHCI mode] (rev 40)

4 sata disks sd[a-d]
1 vg "data" 2.73TB
1 lv "export" 500GB / /dev/drbd1
1 lv "lv0" 500GB    / /dev/drbd0

node3 drbd secondary
HP Proliant Micro Server
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA
Controller [AHCI mode] (rev 40)

4 sata disks sd[a-d]
1 vg "data" 2.73TB
1 lv "export" 500GB / /dev/drbd1
1 lv "lv0" 500GB    / /dev/drbd0

drbd resources

resource r0 {
device    /dev/drbd1;
disk      /dev/mapper/data-export;
meta-disk internal;
startup { wfc-timeout 90; }
# net { on-disconnect reconnect; }
disk { on-io-error detach; }
on node2 { address   10.1.5.2:7789; }
on node3 { address   10.1.5.3:7789; }
}

resource lv0 {
device    /dev/drbd0;
disk      /dev/mapper/data-lv0;
meta-disk internal;
startup { wfc-timeout 90; }
# net { on-disconnect reconnect; }
disk { on-io-error detach; }
on node2 { address   10.1.5.2:7790; }
on node3 { address   10.1.5.3:7790; }
}
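
With "on-io-error detach", DRBD drops the failing backing device and carries on diskless, which matches the disk( UpToDate -> Failed ) and disk( Failed -> Diskless ) transitions in the log. A minimal Python 3 sketch for pulling those transitions out of the kernel messages follows. The function names are hypothetical, and the sample lines are taken from the log above with the syslog prefix stripped.

```python
import re

# State-change messages from the log above, without the syslog prefix.
SAMPLE = [
    "block drbd0: conn( SyncSource -> Connected ) disk( UpToDate -> Failed )",
    "block drbd1: disk( UpToDate -> Failed )",
    "block drbd1: disk( Failed -> Diskless )",
]

# Matches fragments such as "disk( UpToDate -> Failed )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def transitions(line):
    """Return (device, field, old_state, new_state) tuples found in one message."""
    m = re.search(r"block (drbd\d+):", line)
    if not m:
        return []
    dev = m.group(1)
    return [(dev, field, old, new) for field, old, new in TRANSITION.findall(line)]


def final_disk_states(lines):
    """Track the last reported disk state per DRBD device."""
    state = {}
    for line in lines:
        for dev, field, _old, new in transitions(line):
            if field == "disk":
                state[dev] = new
    return state


if __name__ == "__main__":
    for line in SAMPLE:
        for t in transitions(line):
            print(t)
    print(final_disk_states(SAMPLE))
```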

Switched gigabit Ethernet hooks all this together.

I have since been able to reproduce the problem on another HP MicroServer,
identical to this one but with slower and bigger disks in it. Same SATA
controller, though.

Other people might be having issues with this SATA controller too:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/550559

The machines are deployed at the customer's site and currently run mostly
stable, as long as we keep the I/O load down...

Any help is appreciated to get this sorted out.

Cheers Robert

