
Bug#406581: marked as done (disk failures during access on SATA drives, Xen only)



Your message dated Thu, 4 Feb 2010 02:09:12 +0100
with message-id <20100204010912.GD2665@stro.at>
and subject line Re: disk failures during access on SATA drives, Xen only
has caused the Debian Bug report #406581,
regarding disk failures during access on SATA drives, Xen only
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
406581: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=406581
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: linux-image-2.6.18-3-xen-amd64
Version: 2.6.18-7
Severity: important

Our Xen test machine has two SATA controllers:

  01:05.0 Mass storage controller: Promise Technology, Inc. PDC20375 (SATA150 TX2plus) (rev 02)
  01:08.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)

and a total of three SATA drives (all SAMSUNG SP2004C) connected to
them. Two are connected to the SATA378 controller (the second one
listed, which is onboard), and the third is connected to the SATA150
one (which is a PCI card). The system is an AMD Opteron, running etch
on native amd64. Each of the three drives holds 8 partitions, which
are combined into 8 RAID arrays: two RAID1 and six RAID5.

dmesg output right after boot is attached. So are lspci, cpuinfo and
mdstat. Please contact me for more information. I will be away from
the system for the next couple of weeks, but it'll be running the
non-Xen kernel and be accessible, and if needed, I can get
a colleague to do work on it for you.

The problem occurs sporadically, but only when booting the Xen
kernel. I have not once managed to reproduce it with the
2.6.18-3-amd64 kernel. I can reproduce it with the
2.6.18-3-xen-amd64 kernel more or less at will.

Disk activity seems to trigger it. For instance, booting and letting
a RAID5 array that spans the three disks resynchronise almost always
causes the problem to appear. This is what the log says in such
a case:

  kernel: ata3: command timeout
  kernel: ata3: no sense translation for status: 0x40
  kernel: ata3: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
  kernel: ata3: status=0x40 { DriveReady }
  kernel: sd 2:0:0:0: SCSI error: return code = 0x08000002
  kernel: sdb: Current: sense key: Aborted Command
  kernel: Additional sense: No additional sense information
  kernel: end_request: I/O error, dev sdb, sector 48044091
  kernel: raid5:md4: read error not correctable (sector 41425248 on sdb7).
  kernel: raid5: Disk failure on sdb7, disabling device. Operation continuing on 1 devices
  kernel: raid5:md4: read error not correctable (sector 41425256 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425264 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425272 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425280 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425288 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425296 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425304 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425312 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425320 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425328 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425336 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425344 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425352 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425360 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425368 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425376 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425384 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425392 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425400 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425408 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425416 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425424 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425432 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425440 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425448 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425456 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425464 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425472 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425480 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425488 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425496 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425504 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425512 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425520 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425528 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425536 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425544 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425552 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425560 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425568 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425576 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425584 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425592 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425600 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425608 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425616 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425624 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425632 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425640 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425648 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425656 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425664 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425672 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425680 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425688 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425696 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425704 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425712 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425720 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425728 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425736 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425744 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425752 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425760 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425768 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425776 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425784 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425792 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425800 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425808 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425816 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425824 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425832 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425840 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425848 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425856 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425864 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425872 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425880 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425888 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425896 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425904 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425912 on sdb7).
  kernel: raid5:md4: read error not correctable (sector 41425920 on sdb7).
  kernel: ata4: command timeout
  kernel: ata4: no sense translation for status: 0x40
  kernel: ata4: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
  kernel: ata4: status=0x40 { DriveReady }
  kernel: sd 3:0:0:0: SCSI error: return code = 0x08000002
  kernel: sdc: Current: sense key: Aborted Command
  kernel: Additional sense: No additional sense information
  kernel: end_request: I/O error, dev sdc, sector 48043483
  kernel: raid5: Disk failure on sdc7, disabling device. Operation continuing on 1 devices
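
The two hex values in the excerpt above can be decoded by hand. The following is a minimal sketch, not part of the original report, that assumes the usual Linux SCSI mid-layer layout of the result word (driver, host, message and status bytes, from most to least significant) and the standard ATA status register bits. The function names are made up for illustration.

```python
# Sketch: split a SCSI result word into its four bytes and name the
# bits that are set in an ATA status register value.
# Function names are illustrative, not taken from the kernel.

ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready (logged as "DriveReady")
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data
    0x02: "IDX",   # index
    0x01: "ERR",   # error
}


def decode_scsi_result(result):
    """Return the driver, host, message and status bytes of a result word."""
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status value."""
    return [name for bit, name in ATA_STATUS_BITS.items() if status & bit]


if __name__ == "__main__":
    print(decode_scsi_result(0x08000002))
    print(decode_ata_status(0x40))
```

Under those assumptions, 0x08000002 splits into driver byte 0x08 (sense data present) and status byte 0x02 (CHECK CONDITION), and 0x40 has only the ready bit set with no error bit. That matches the "no sense translation" message: the command timed out while the drive itself reported no error.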

Note that the affected disks and ports vary between occurrences:
sometimes it's ata3/4 and sdb/c, at other times it's ata1/3 and
sda/b. The disks themselves have no SMART errors.

Here is another occurrence:

  kernel: ata4: command timeout
  kernel: ata4: no sense translation for status: 0x40
  kernel: ata4: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
  kernel: ata4: status=0x40 { DriveReady }
  kernel: sd 3:0:0:0: SCSI error: return code = 0x08000002
  kernel: sdc: Current: sense key: Aborted Command
  kernel: Additional sense: No additional sense information
  kernel: end_request: I/O error, dev sdc, sector 56772315
  kernel: raid5: Disk failure on sdc7, disabling device. Operation continuing on 2 devices
  kernel: ata1: command timeout
  kernel: ata1: no sense translation for status: 0x40
  kernel: ata1: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
  kernel: ata1: status=0x40 { DriveReady }
  kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
  kernel: sda: Current: sense key: Aborted Command
  kernel: Additional sense: No additional sense information
  kernel: end_request: I/O error, dev sda, sector 56772907
  kernel: raid5:md4: read error not correctable (sector 50154064 on sda7).
  kernel: raid5: Disk failure on sda7, disabling device. Operation continuing on 1 devices
  kernel: raid5:md4: read error not correctable (sector 50154072 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154080 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154088 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154096 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154104 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154112 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154120 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154128 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154136 on sda7).
  kernel: raid5:md4: read error not correctable (sector 50154144 on sda7).
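
Long runs of "read error not correctable" lines like the ones above are easier to compare once condensed. The following is a minimal sketch, not part of the original report, that groups such lines by array and member and reports the count, the first and last sector, and the step between consecutive sectors. The regular expression is written against the message format shown in these excerpts only, and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the raid5 message format seen in the excerpts above.
READ_ERROR = re.compile(
    r"raid5:(?P<array>md\d+): read error not correctable "
    r"\(sector (?P<sector>\d+) on (?P<member>\w+)\)"
)

# Three lines copied from the second excerpt, used as sample input.
SAMPLE_LOG = [
    "kernel: raid5:md4: read error not correctable (sector 50154064 on sda7).",
    "kernel: raid5: Disk failure on sda7, disabling device. Operation continuing on 1 devices",
    "kernel: raid5:md4: read error not correctable (sector 50154072 on sda7).",
    "kernel: raid5:md4: read error not correctable (sector 50154080 on sda7).",
]


def summarise_read_errors(lines):
    """Group uncorrectable-read messages by (array, member)."""
    sectors = defaultdict(list)
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            key = (match["array"], match["member"])
            sectors[key].append(int(match["sector"]))

    summary = {}
    for key, values in sectors.items():
        values.sort()
        # Distinct gaps between neighbouring sector numbers.
        strides = sorted({b - a for a, b in zip(values, values[1:])})
        summary[key] = {
            "count": len(values),
            "first": values[0],
            "last": values[-1],
            "strides": strides,
        }
    return summary


if __name__ == "__main__":
    for key, info in summarise_read_errors(SAMPLE_LOG).items():
        print(key, info)
```

In both excerpts the reported sectors advance in steps of 8, which is 4 KiB if the sectors are 512 bytes each. The failed reads therefore form one contiguous range rather than scattered bad spots.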

Following the above, other partitions report failures and the system
hardlocks. After a reboot, everything is normal again (the RAID
recovery restarts), and no data seems to be lost.

See below for the list of modules loaded at the time of the crash.
Note that sata_nv is being loaded (by udev), although there are no
SATA ports other than the two on-board Promise ports and the two
ports on the PCI card. The sata_nv module can be freely removed.

Modules loaded:
Module                  Size  Used by
bridge                 63408  0 
netloop                11392  0 
tun                    16256  0 
ipv6                  285920  18 
ipt_MASQUERADE          8320  1 
iptable_nat            12292  1 
ipt_REJECT             10112  1 
xt_tcpudp               7936  22 
ipt_addrtype            6528  1 
ipt_LOG                11264  1 
xt_limit                7424  1 
xt_conntrack            7168  6 
ip_nat_ftp              8064  0 
ip_nat                 24492  3 ipt_MASQUERADE,iptable_nat,ip_nat_ftp
ip_conntrack_ftp       13136  1 ip_nat_ftp
ip_conntrack           63140  6 ipt_MASQUERADE,iptable_nat,xt_conntrack,ip_nat_ftp,ip_nat,ip_conntrack_ftp
nfnetlink              11976  2 ip_nat,ip_conntrack
iptable_filter          7808  1 
ip_tables              25192  2 iptable_nat,iptable_filter
x_tables               21896  9 ipt_MASQUERADE,iptable_nat,ipt_REJECT,xt_tcpudp,ipt_addrtype,ipt_LOG,xt_limit,xt_conntrack,ip_tables
dm_crypt               16400  0 
psmouse                44560  0 
serio_raw              12036  0 
i2c_nforce2            12544  0 
pcspkr                  7808  0 
shpchp                 42028  0 
pci_hotplug            20872  1 shpchp
i2c_core               27776  1 i2c_nforce2
evdev                  15360  0 
ext3                  138256  6 
jbd                    65392  1 ext3
mbcache                14216  1 ext3
dm_mirror              25344  0 
dm_snapshot            20536  0 
dm_mod                 62928  5 dm_crypt,dm_mirror,dm_snapshot
raid456               123680  7 
xor                    11024  1 raid456
raid1                  27136  2 
md_mod                 83484  11 raid456,raid1
ide_generic             5760  0 [permanent]
sd_mod                 25856  27 
ide_disk               20736  6 
generic                10756  0 [permanent]
amd74xx                19504  0 [permanent]
ide_core              148224  4 ide_generic,ide_disk,generic,amd74xx
sata_promise           18052  24 
tulip                  57760  0 
libata                107040  2 sata_promise
scsi_mod              153008  2 sd_mod,libata
ehci_hcd               36232  0 
ohci_hcd               24964  0 
fan                     9864  0 

-- System Information:
Debian Release: 4.0
  APT prefers unstable
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Shell:  /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18-3-xen-amd64
Locale: LANG=en_GB, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)

-- 
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems

Attachment: dmesg.bz2
Description: Binary data

Attachment: lspci.bz2
Description: Binary data

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 5
model name	: AMD Opteron(tm) Processor 242
stepping	: 10
cpu MHz		: 1600.035
cache size	: 1024 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips	: 4001.50
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts ttp
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid5 sda10[0] sdc10[2] sdb10[1]
      1991808 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md6 : active raid5 sda9[0] sdc9[2] sdb9[1]
      995712 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md5 : active raid5 sda8[0] sdc8[2] sdb8[1]
      16000512 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
        resync=DELAYED

md4 : active raid5 sda7[0] sdc7[3] sdb7[1]
      365108992 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
      [>....................]  recovery =  0.2% (440960/182554496) finish=192.7min speed=15748K/sec

md3 : active raid5 sda6[0] sdc6[2] sdb6[1]
      1991808 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md0 : active raid1 sda1[0] sdc1[2] sdb1[1]
      64128 blocks [3/3] [UUU]            

md2 : active raid5 sda5[0] sdc5[2] sdb5[1]
      497792 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid1 sda2[0] sdc2[2] sdb2[1]
      2000000 blocks [3/3] [UUU]          

unused devices: <none>
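
The mdstat listing above shows one array, md4, rebuilding with a member missing. The following is a minimal sketch, not part of the original report, for picking such arrays out of /proc/mdstat text automatically. It relies only on the "[total/active] [UU_]" status field visible in the listing, and the function name is made up for illustration.

```python
import re

# Array header, e.g. "md4 : active raid5 sda7[0] sdc7[3] sdb7[1]".
ARRAY_LINE = re.compile(r"^(?P<name>md\d+) : (?P<state>\w+) (?P<level>\w+) ")
# Status field, e.g. "[3/2] [UU_]" (total members / active members).
STATUS = re.compile(r"\[(?P<total>\d+)/(?P<active>\d+)\] \[(?P<flags>[U_]+)\]")

# Two arrays copied from the listing above, used as sample input.
SAMPLE_MDSTAT = """\
md5 : active raid5 sda8[0] sdc8[2] sdb8[1]
      16000512 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
        resync=DELAYED

md4 : active raid5 sda7[0] sdc7[3] sdb7[1]
      365108992 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
      [>....................]  recovery =  0.2% (440960/182554496) finish=192.7min speed=15748K/sec
"""


def degraded_arrays(mdstat_text):
    """Return {array: (active, total)} for arrays with missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header["name"]
            continue
        status = STATUS.search(line)
        if status and current is not None:
            total = int(status["total"])
            active = int(status["active"])
            if active < total:
                degraded[current] = (active, total)
            current = None  # one status field per array
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

Run against the full listing, this sketch would flag only md4. All other arrays report [3/3].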

Attachment: signature.asc
Description: Digital signature (GPG/PGP)


--- End Message ---
--- Begin Message ---
closing as a bit aged Xen bug without any activity.

as we all know they are not yet merged, so not much point
in leaving that bug report hanging.

happy kvm hacking.

thanks for the report anyway
-- 
maks


--- End Message ---
