Bug#596528: Debuging large transfer under usb3 device

To: guillaume.jaouen@free.fr
Cc: 596528@bugs.debian.org, linux-usb@vger.kernel.org
Subject: Bug#596528: Debuging large transfer under usb3 device
From: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Date: Tue, 28 Feb 2012 12:19:01 -0800
Message-id: <[🔎] 20120228201901.GG6078@xanatos>
Reply-to: Sarah Sharp <sarah.a.sharp@linux.intel.com>, 596528@bugs.debian.org
In-reply-to: <1631973011.35885137.1330370293927.JavaMail.root@zimbra47-e8.priv.proxad.net>
References: <1811627919.35826750.1330367633689.JavaMail.root@zimbra47-e8.priv.proxad.net> <1631973011.35885137.1330370293927.JavaMail.root@zimbra47-e8.priv.proxad.net>
On Mon, Feb 27, 2012 at 08:18:13PM +0100, guillaume.jaouen@free.fr wrote:
> Hello Jonathan,
> 
> I followed your advices and try to track the usb 3.0 bug occuring during large transfer on usb 3.0 device with kernel 3.2.0.1.
> 
> You'll find in attachements as asked :
> 
> - full "dmesg" output from booting
> - "lsusb -vvv" output

Can you also run `sudo lspci -vvv` for me, so I know which host
controller you have?

> - syslog output (demonstrating what's happening exactly when file transfer crash at Feb 27 10:28:04)
> - steps to reproduce : Just start 10~20 ftp sessions in order to transfer 60GB of data to an usb3.0 hdd

I assume this is your USB 3.0 hard drive with the Lucid Port controller?

> Bus 007 Device 002: ID 1759:5002 LucidPort Technology, Inc. 
> Device Descriptor:
>   bLength                18
>   bDescriptorType         1
>   bcdUSB               3.00
>   bDeviceClass            0 (Defined at Interface level)
>   bDeviceSubClass         0 
>   bDeviceProtocol         0 
>   bMaxPacketSize0         9
>   idVendor           0x1759 LucidPort Technology, Inc.
>   idProduct          0x5002 

And it looks like you have an identical device attached?

> Bus 007 Device 003: ID 1759:5002 LucidPort Technology, Inc. 
> Device Descriptor:
>  bLength                18
>  bDescriptorType         1
>  bcdUSB               3.00
>  bDeviceClass            0 (Defined at Interface level)
>  bDeviceSubClass         0 
>  bDeviceProtocol         0 
>  bMaxPacketSize0         9
>  idVendor           0x1759 LucidPort Technology, Inc.
>  idProduct          0x5002 
>  bcdDevice           21.54
>  iManufacturer           1 USB 3.0
>  iProduct                2 MassStorage Device
>  iSerial                 3       JP2921HQ0T1MHA

Are you trying to run raid across the two USB 3.0 drives?

> - how reproducible it is : 100% of my tests on my computer produces a
> crash like the one describe in previous syslog details

By "crash" do you mean error messages from SMART like these?

> Feb 27 07:36:53 debian kernel: [47463.948457] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
> Feb 27 07:36:53 debian kernel: [47463.949672] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
> Feb 27 07:36:53 debian anacron[12322]: Job `cron.daily' terminated (exit status: 1) (mailing output)
> Feb 27 07:39:01 debian /USR/SBIN/CRON[13561]: (root) CMD (  [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
> Feb 27 07:41:06 debian anacron[12322]: Normal exit (1 job run)
> Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118
> Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 53 to 54
> Feb 27 07:57:59 debian kernel: [48730.038307] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
> Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sdb [SAT], failed to read SMART Attribute Data
> Feb 27 07:57:59 debian kernel: [48730.040680] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
> Feb 27 07:57:59 debian kernel: [48730.043291] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
> Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sdb [SAT], Read SMART Self Test Log Failed
> Feb 27 07:57:59 debian kernel: [48730.047108] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
> Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sdb [SAT], Read Summary SMART Error Log failed
> Feb 27 07:57:59 debian kernel: [48730.048271] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
> Feb 27 07:57:59 debian kernel: [48730.049410] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint

Or do you mean an actual kernel crash (oops)?  Your dmesg didn't contain
a crash (or any other error messages), and I can't get enough info out
of your syslog files.  Can you please only capture dmesg after running
`sudo dmesg -n 8`?

> - symptoms : During transfer the kernel xhci_hcd driver complain about
> "Stalled endpoint, then the transfer became to time out "Transfer
> error on endpoint"

That means your USB device stalled several SCSI commands.  As I said,
that's really not an "error".  The USB device just didn't like a
particular SCSI command, so it stalled the command.  The SCSI core kind
of strugs its shoulders and continues talking to the device.  So the
warnings about stalled endpoints are harmless.  You could enable
usb-storage debugging and see which SCSI commands the device is
stalling, but I don't think that's your underlying issue with the
failure with the drives.

BTW, what are these errors in your dmesg about?  If some userspace
program is sending bad SCSI commands to the device, then it's no wonder
the device was stalling the commands:

> [    2.265783] md: raid10 personality registered for level 10
> [    2.284315] mdadm: sending ioctl 800c0910 to a partition!
> [    2.284377] mdadm: sending ioctl 800c0910 to a partition!
> [    2.284436] mdadm: sending ioctl 1261 to a partition!
> [    2.284487] mdadm: sending ioctl 1261 to a partition!
> [    2.284965] mdadm: sending ioctl 1261 to a partition!
> [    2.285016] mdadm: sending ioctl 1261 to a partition!
> [    2.285379] mdadm: sending ioctl 1261 to a partition!
> [    2.285431] mdadm: sending ioctl 1261 to a partition!
> [    2.285809] mdadm: sending ioctl 1261 to a partition!
> [    2.285860] mdadm: sending ioctl 1261 to a partition!

Later something like electrical noise caused a transfer error on the bus
(or the device might have been simply sending garbage).  It's possible
the xHCI host controller actually caused the transfer error as well.
You could try the USB 3.0 mass storage devices under EHCI, but the
results wouldn't be too conclusive, since you're probably not stressing
the drive enough because the USB 2.0 speed is 10x slower than USB 3.0.

> - how they differ from what you expect : i would expect that transfer
> of large data sets even if they're done slowly (1Mb/s due to isp
> bandwith restriction) wouldn't time out and should complete correctly.

It's possible your USB storage device is just broken, or your host
controller.  Also, did you mention the USB 3.0 host controller is an
Express Card?  Because Express Cards are limited to only one PCIe lane,
so they will restrict your USB 3.0 bandwidth.  A PCIe add-in card is
better, and integrated PCIe is best for increased bandwidth.

> - background : I had this bug since I used 2.6 kernel branch. Under
> 2.6, these bug make the whole kernel freeze and the only way to
> recover from it was to hard reset the computer.
> - link to the bug log for the full story : http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596528

That bug report says you were running 2.6.32.  That was the first kernel
to have USB 3.0 support, and everything was beta.  A ton of new features
were added and bug fixes when into the kernel in the last two and a half
years since 2.6.32 has been released, and I'm not sure how much Debian
has backported.  I would suggest just using the latest stable kernel
instead.  Let's focus on fixing your problem with 3.2.

> Since I'm under 3.2 kernel branch, things are better as now I only
> have to unmount the devices under root, shutdown external hdd, remove
> the usb 3.0 express card in order to make the xhci_hcd driver reload
> into the kernel, restart external hdd and mount them back.

Does it work if you don't eject the express card and just plug the USB
3.0 devices back in?  Does it work if you just unload the xHCI driver
and reload it?  If not, I would suspect your host controller is hosed
(although the xHCI driver possibly could have done something to wedge
it).  You shouldn't have to eject the express card.

I really needed your dmesg when this happened, not the rsyslog:

Feb 27 10:28:00 debian smartd[3607]: Device: /dev/sdb [SAT], Read SMART Self Test Log Failed
Feb 27 10:28:00 debian smartd[3607]: Device: /dev/sdb [SAT], Read Summary SMART Error Log failed
Feb 27 10:28:00 debian kernel: [57730.360624] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
Feb 27 10:28:00 debian kernel: [57730.361826] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
Feb 27 10:28:04 debian kernel: [57734.270107] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint
Feb 27 10:28:04 debian kernel: [57734.270478] xhci_hcd 0000:0d:00.0: WARN: transfer error on endpoint
Feb 27 10:28:24 debian kernel: [57754.380054] xhci_hcd 0000:0d:00.0: Timeout while waiting for address device command
Feb 27 10:28:44 debian kernel: [57774.584078] xhci_hcd 0000:0d:00.0: Timeout while waiting for address device command
Feb 27 10:28:44 debian kernel: [57774.788116] usb 7-1: device not accepting address 3, error -62
Feb 27 10:29:04 debian kernel: [57794.900060] xhci_hcd 0000:0d:00.0: Timeout while waiting for reset device command
Feb 27 10:29:04 debian kernel: [57794.900074] usb 7-1: Cannot reset HCD device state
Feb 27 10:29:24 debian kernel: [57815.124102] xhci_hcd 0000:0d:00.0: Timeout while waiting for reset device command
Feb 27 10:29:24 debian kernel: [57815.124116] usb 7-1: Cannot reset HCD device state
Feb 27 10:29:45 debian kernel: [57835.236089] xhci_hcd 0000:0d:00.0: Timeout while waiting for reset device command
Feb 27 10:29:45 debian kernel: [57835.236103] usb 7-1: Cannot reset HCD device state
Feb 27 10:29:45 debian kernel: [57835.236217] usb 7-1: USB disconnect, device number 3
Feb 27 10:29:45 debian kernel: [57835.236417] sd 5:0:0:0: Device offlined - not ready after error recovery
Feb 27 10:29:45 debian kernel: [57835.236901] scsi 5:0:0:0: [sdd] killing request
Feb 27 10:29:45 debian kernel: [57835.236911] scsi 5:0:0:0: [sdd] killing request
Feb 27 10:29:45 debian kernel: [57835.243312] scsi 5:0:0:0: [sdd] Unhandled error code
Feb 27 10:29:45 debian kernel: [57835.243320] scsi 5:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Feb 27 10:29:45 debian kernel: [57835.243329] scsi 5:0:0:0: [sdd] CDB: Write(10): 2a 00 dc b9 be a8 00 00 f0 00
Feb 27 10:29:45 debian kernel: [57835.243349] end_request: I/O error, dev sdd, sector 3703160488
Feb 27 10:29:45 debian kernel: [57835.243417] scsi 5:0:0:0: [sdd] Unhandled error code
Feb 27 10:29:45 debian kernel: [57835.243422] scsi 5:0:0:0: [sdd]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Feb 27 10:29:45 debian kernel: [57835.243429] scsi 5:0:0:0: [sdd] CDB: Write(10): 2a 00 dc b9 bf 98 00 00 f0 00
Feb 27 10:29:45 debian kernel: [57835.243448] end_request: I/O error, dev sdd, sector 3703160728

I'm guessing a bit, but it looks like a transfer error happened, and
then the usb-storage driver tried to reset the device.  Then it looks
like the host controller just stopped responding to commands (including
the Set Address command and Reset Device command).  That's what the
"Timeout...command" messages are about.

I can't be sure that your host controller is the thing that's broken
unless you rebuild your kernel with CONFIG_USB_DEBUGGING and
CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and capture the full dmesg
starting just before that transfer error.  You'll really want to be
running 3.3, since that cleaned up a lot of the xHCI driver debugging,
and the log file will be much smaller.

It's possible, although not likely, that we're over writing the link TRB
on the command ring and causing the host controller to step off into
lala land and access bad memory.  My other theory is that your express
card is just broken and can't handle the throughput.  Or perhaps it's an
early prototype that made it out into the market without having good
transfer error support.

Sarah Sharp
Reply to:
Prev by Date: Processed: Re: Bug#633423: Problem with autofs 64-bit kernel/32-bit userspace compatibility
Next by Date: Processing of linux-2.6_3.2.7-1_amd64.changes
Previous by thread: Bug#655376: kernel: pl2302 kernel driver bug
Next by thread: Processing of linux-2.6_3.2.7-1_amd64.changes
Index(es):
- Date
- Thread