Bug#808953: xhci regression for large transfers (commit e210c422b)

To: Ron <ron@debian.org>
Cc: linux-usb@vger.kernel.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, 808953@bugs.debian.org
Subject: Bug#808953: xhci regression for large transfers (commit e210c422b)
From: Mathias Nyman <mathias.nyman@intel.com>
Date: Fri, 08 Jan 2016 16:10:09 +0200
Message-id: <[🔎] 568FC341.8070202@intel.com>
Reply-to: Mathias Nyman <mathias.nyman@intel.com>, 808953@bugs.debian.org
In-reply-to: <[🔎] 20160107184724.GZ10797@hex.shelbyville.oz>
References: <[🔎] 20160102063255.GQ10797@hex.shelbyville.oz> <[🔎] 568E8661.9040202@intel.com> <[🔎] 20160107162228.GX10797@hex.shelbyville.oz> <[🔎] 20160107163920.GY10797@hex.shelbyville.oz> <[🔎] 20160107184724.GZ10797@hex.shelbyville.oz>

On 07.01.2016 20:47, Ron wrote:

On Fri, Jan 08, 2016 at 03:09:20AM +1030, Ron wrote:

On Fri, Jan 08, 2016 at 02:52:28AM +1030, Ron wrote:

On Thu, Jan 07, 2016 at 05:38:09PM +0200, Mathias Nyman wrote:

Hi

On 02.01.2016 08:32, Ron wrote:


Hi,

It appears the commit e210c422b6fdd2dc123bedc588f399aefd8bf9de
"xhci: don't finish a TD if we get a short transfer event mid TD"
is causing transfers larger than 16kB to be unreliable.

If I limit transfers to be no larger than 16kB, then it also works as
expected in an XHCI port with an unmodified build of Linus' current
head (v4.4-rc7-76-g9c982e8), but transfers larger than that do not.
I see an alternating cycle of a successful transfer, followed by two
that will time out waiting in libusb (with a 5 second timeout set),
before getting another successful transfer and the cycle repeating.

I can run more tests and dig into this deeper if the reason for it
isn't immediately obvious in hindsight.


Thanks for the info,
I can't spot anything obvious, but my brain might still be in vacation mode.

Could you reproduce it with the attached patch, it only adds extra debugging?

We should either see no output, or the following sequence:

  1. "mid bulk/intr SP, wait for last TRB event"
  2. "last trb has length set"
  3. "and last trb is SHORT_TX, OK"



I guess one out of 3 ain't good ...  all I see logged is:

  [   60.015708] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
  [   65.017374] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
  [   70.455451] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
  [   75.456248] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event

I'm passing 5 seconds to libusb as the requested timeout.


And if I limit the maximum transfer size to 16kB (the above was with
64kB transfers), then I see nothing logged at all.

So if that code does indeed look sane, perhaps the issue is really
in the code that's splitting large transfers doing something funny?


So, tracing this a bit deeper, what I'm seeing is that in the case of
a transfer of 16kB or less, there's only one TRB in the TD, so it's
always the last_trb and the wait code never triggers (ie. basically
the same code path as with this patch reverted).

In the case of a larger transfer, there are N / 16kB TRBs in the TD,
and in the case where it stalls, I get a short transfer on the first
TRB, and then it just hangs waiting until it times out, reporting:

  xhci_hcd ep 0x81 - asked for 65536 bytes, 16382 bytes untransferred

with event_trb == td->first_trb.


What's still not clear to me yet is which assumption is broken.
Clearly we aren't actually getting more TRB events after the short
transfer.

And the other thing which possibly makes this odd is that the FTDI will
always return 2 bytes in a packet, even when there "no data" in it,
since it prefixes each packet with 2 "modem status" bytes.  Which means
the 16382 untransferred above is it returning 2 status bytes and none
of the data that we actually requested from it (which is a normal thing
for it to do sometimes, and if we issue another read the data will
eventually come - but I don't yet know if that's also contributing here).


Sorry for the trickle of small facts, I'm still wrapping my head around
everything I need to know to understand this properly.


This is all good information.

We mark all TRBs for incoming bulk transfers as "interrupt on short transfer"
and the last TRB of the TD as "interrupt on completion"

If the we get a short packet event mid TD xhc should create two events, the first
one for the trb that was short, and a second one one the last trb, with the "interrupt on completion"

There was an errata added to the specs specifying in more detail what this second events status should be,
but the second event should be there anyways. controller will "fast forward" through the remains of the TD after
a short packet, generating the event.

It's possible that something goes wrong in updating the dequeue pointer of either the endpoint ring or
event ring, leading to handling the wrong event, or wrong trb for that endpoint (perhaps handling the same "mid TD" TRB twice?)

I have yet another patch. This one prints out the content of the short transfer event, and the event following a short transfer.

Could you give it a run?

Thanks
- Mathias

>From 2551cef961bdf1ca0e647b41bf4dc0d71e514315 Mon Sep 17 00:00:00 2001
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Fri, 8 Jan 2016 15:56:12 +0200
Subject: [PATCH] xhci: FOR TESTING ONLY, display event trb for short packet
 related events

display content of the event trb in case of short packets events, or
in case the event for this endpoint follows a short packet event.

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index ed6ac7e..daabcfe 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2342,6 +2342,20 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 	event_dma = le64_to_cpu(event->buffer);
 	trb_comp_code = GET_COMP_CODE(le32_to_cpu(event->transfer_len));
 	/* Look for common error cases */
+
+	if (trb_comp_code == COMP_SHORT_TX ||
+	    ep_ring->last_td_was_short) {
+		xhci_warn(xhci, "@%016llx %08x %08x %08x %08x %s\n",
+			  (unsigned long long) xhci_trb_virt_to_dma(
+				  xhci->event_ring->deq_seg,
+				  xhci->event_ring->dequeue),
+			  lower_32_bits(le64_to_cpu(event->buffer)),
+			  upper_32_bits(le64_to_cpu(event->buffer)),
+			  le32_to_cpu(event->transfer_len),
+			  le32_to_cpu(event->flags),
+			  ep_ring->last_td_was_short ? "last td short" : "");
+	}
+
 	switch (trb_comp_code) {
 	/* Skip codes that require special handling depending on
 	 * transfer type
-- 
1.9.1

Reply to:

Follow-Ups:
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>

References:
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Mathias Nyman <mathias.nyman@intel.com>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>

Prev by Date: Processed: Re: mod flags for 809815
Next by Date: Bug#808953: xhci regression for large transfers (commit e210c422b)
Previous by thread: Bug#808953: xhci regression for large transfers (commit e210c422b)
Next by thread: Bug#808953: xhci regression for large transfers (commit e210c422b)
Index(es):
- Date
- Thread