Bug#808953: xhci regression for large transfers (commit e210c422b)

To: Mathias Nyman <mathias.nyman@intel.com>
Cc: Ron <ron@debian.org>, linux-usb@vger.kernel.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, 808953@bugs.debian.org
Subject: Bug#808953: xhci regression for large transfers (commit e210c422b)
From: Ron <ron@debian.org>
Date: Tue, 12 Jan 2016 07:33:11 +1030
Message-id: <[🔎] 20160111210311.GL10797@hex.shelbyville.oz>
Reply-to: Ron <ron@debian.org>, 808953@bugs.debian.org
In-reply-to: <[🔎] 56939CC2.8070501@intel.com>
References: <[🔎] 20160102063255.GQ10797@hex.shelbyville.oz> <[🔎] 568E8661.9040202@intel.com> <[🔎] 20160107162228.GX10797@hex.shelbyville.oz> <[🔎] 20160107163920.GY10797@hex.shelbyville.oz> <[🔎] 20160107184724.GZ10797@hex.shelbyville.oz> <[🔎] 568FC341.8070202@intel.com> <[🔎] 20160108151916.GB10797@hex.shelbyville.oz> <[🔎] 56939CC2.8070501@intel.com>

On Mon, Jan 11, 2016 at 02:14:58PM +0200, Mathias Nyman wrote:
> On 08.01.2016 17:19, Ron wrote:
> >On Fri, Jan 08, 2016 at 04:10:09PM +0200, Mathias Nyman wrote:
> >>On 07.01.2016 20:47, Ron wrote:
> >>>On Fri, Jan 08, 2016 at 03:09:20AM +1030, Ron wrote:
> >>>>On Fri, Jan 08, 2016 at 02:52:28AM +1030, Ron wrote:
> >>>>>On Thu, Jan 07, 2016 at 05:38:09PM +0200, Mathias Nyman wrote:
> >>>>>>Hi
> >>>>>>
> >>>>>>On 02.01.2016 08:32, Ron wrote:
> >>>>>>>
> >>>>>>>Hi,
> >>>>>>>
> >>>>>>>It appears the commit e210c422b6fdd2dc123bedc588f399aefd8bf9de
> >>>>>>>"xhci: don't finish a TD if we get a short transfer event mid TD"
> >>>>>>>is causing transfers larger than 16kB to be unreliable.
> >>>>>>>
> >>>>>>>If I limit transfers to be no larger than 16kB, then it also works as
> >>>>>>>expected in an XHCI port with an unmodified build of Linus' current
> >>>>>>>head (v4.4-rc7-76-g9c982e8), but transfers larger than that do not.
> >>>>>>>I see an alternating cycle of a successful transfer, followed by two
> >>>>>>>that will time out waiting in libusb (with a 5 second timeout set),
> >>>>>>>before getting another successful transfer and the cycle repeating.
> >>>>>>>
> >>>>>>>I can run more tests and dig into this deeper if the reason for it
> >>>>>>>isn't immediately obvious in hindsight.
> >>>>>>>
> >>>>>>
> >>>>>>Thanks for the info,
> >>>>>>I can't spot anything obvious, but my brain might still be in vacation mode.
> >>>>>>
> >>>>>>Could you reproduce it with the attached patch, it only adds extra debugging?
> >>>>>>
> >>>>>>We should either see no output, or the following sequence:
> >>>>>>
> >>>>>>  1. "mid bulk/intr SP, wait for last TRB event"
> >>>>>>  2. "last trb has length set"
> >>>>>>  3. "and last trb is SHORT_TX, OK"
> >>>>>
> >>>>>
> >>>>>I guess one out of 3 ain't good ...  all I see logged is:
> >>>>>
> >>>>>  [   60.015708] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
> >>>>>  [   65.017374] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
> >>>>>  [   70.455451] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
> >>>>>  [   75.456248] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
> >>>>>
> >>>>>I'm passing 5 seconds to libusb as the requested timeout.
> >>>>
> >>>>And if I limit the maximum transfer size to 16kB (the above was with
> >>>>64kB transfers), then I see nothing logged at all.
> >>>>
> >>>>So if that code does indeed look sane, perhaps the issue is really
> >>>>in the code that's splitting large transfers doing something funny?
> >>>
> >>>So, tracing this a bit deeper, what I'm seeing is that in the case of
> >>>a transfer of 16kB or less, there's only one TRB in the TD, so it's
> >>>always the last_trb and the wait code never triggers (ie. basically
> >>>the same code path as with this patch reverted).
> >>>
> >>>In the case of a larger transfer, there are N / 16kB TRBs in the TD,
> >>>and in the case where it stalls, I get a short transfer on the first
> >>>TRB, and then it just hangs waiting until it times out, reporting:
> >>>
> >>>  xhci_hcd ep 0x81 - asked for 65536 bytes, 16382 bytes untransferred
> >>>
> >>>with event_trb == td->first_trb.
> >>>
> >>>
> >>>What's still not clear to me yet is which assumption is broken.
> >>>Clearly we aren't actually getting more TRB events after the short
> >>>transfer.
> >>>
> >>>And the other thing which possibly makes this odd is that the FTDI will
> >>>always return 2 bytes in a packet, even when there "no data" in it,
> >>>since it prefixes each packet with 2 "modem status" bytes.  Which means
> >>>the 16382 untransferred above is it returning 2 status bytes and none
> >>>of the data that we actually requested from it (which is a normal thing
> >>>for it to do sometimes, and if we issue another read the data will
> >>>eventually come - but I don't yet know if that's also contributing here).
> >>>
> >>>
> >>>Sorry for the trickle of small facts, I'm still wrapping my head around
> >>>everything I need to know to understand this properly.
> >>>
> >>
> >>This is all good information.
> >>
> >>We mark all TRBs for incoming bulk transfers as "interrupt on short transfer"
> >>and the last TRB of the TD as "interrupt on completion"
> >>
> >>If the we get a short packet event mid TD xhc should create two events, the first
> >>one for the trb that was short, and a second one one the last trb, with the "interrupt on completion"
> >
> >There is one thing we need to be careful of here, with the FTDI at least.
> >Once we get a short packet, we do actually need to end the transaction
> >there and not append any further data to it.  The reason for that is
> >those modem status bytes.  The only way to find them and filter them from
> >the actual data you want is to split the received transfer back into
> >wMaxPacketSize'd blocks and remove the first 2 bytes from each packet,
> >since it prepends them to every packet.
> >
> >If a subsequent trb after the short was to add more data to that, we'd
> >lose that framing and not be able to extract the data from it reliably
> >(since we wouldn't know where the short in the middle ended).
> >
> >I'm not sure if the events we expect in this case already cover that,
> >but it seemed worth mentioning in case they might return more data.
> >
> >>There was an errata added to the specs specifying in more detail what this second events status should be,
> >>but the second event should be there anyways. controller will "fast forward" through the remains of the TD after
> >>a short packet, generating the event.
> >
> >If I understand that part though, then we shouldn't actually get more
> >data, just a final completion event for the whole TD ...
> >
> >>It's possible that something goes wrong in updating the dequeue pointer of either the endpoint ring or
> >>event ring, leading to handling the wrong event, or wrong trb for that endpoint (perhaps handling the same "mid TD" TRB twice?)
> >>
> >>I have yet another patch. This one prints out the content of the short transfer event, and the event following a short transfer.
> >
> >Ok, here's a longer dump of what that outputs, which I'll annotate to
> >show what's actually going on with the device at each point.
> >
> >The extra values in parens of the "asked for" lines are
> >(td->last_trb - td->first_trb), and (event_trb - td->first_trb)
> >which was how I tracked which trb it was stalling on earlier.
> >
> >
> 
> 
> The logs shows that we never get the second event for the TD,
> 
> The transfer ring and corresponding events:
> EP TRB addr,	event,		comment
> 
> d207cce0,	01(Success), 	Previous event was a SHORT
> d207ccf0,	0d(SHORT  ), 	only TRB of 1 long TD, returned
>   -new TD-
> d207cd00,	0d(SHORT  ), 	1st TRB of 4 long TD. Previous event was SHORT. wait
> d207cd10,	,		no event
> d207cd20,	,		no event
> d207cd30,	,		no event
>  -new TD-
> d207cd40,	0d(SHORT),	1st TRB of 4 long TD. Previous event was short. wait
> d207cd50,	,		no event
> d207cd60,	,		no event
> d207cd70,	,		no event
>  -new TD-
> d207cd80,	,		no event
> d207cd90, 	,		no event	
> d207cda0, 	,		no event
> d207cdb0,	01(Success),	Previous event was short
>  - new TD -
> d207cdc0, 0d(SHORT), only TRB in  1 long TD
>  - new TD -
> ...
> 
> There are no events for the TDs last TRB at d207cd30 or d207cd70.
> An event would have been logged as the previous events were short transfers (at d207cd00 and d207cd40)
> 
> Looks like some controllers are not following the specs and we can't rely on the second event.
> 
> So the patch should be reverted, and instead handle the controllers that work according to specs
> as a special case. So we need to return the TD after the first SHORT event, and then make sure a possible
> second event doesn't mess up anything.
> 
> Just for reference, xhci specs 1.0 section 4.10.1.1 specify that:
> 
> "If the Short Packet occurred while processing a Transfer TRB with only an ISP flag set, then
> two events shall be generated for the transfer; one for the Transfer TRB that the short
> Packet occurred on, and a second for the last TRB with the IOC flag set.
> 
> Software shall not interpret an Short Packet Event as indicating that the TD that it is
> associated with is “complete”, unless the TRB Pointer field of the Transfer Event references
> the last TRB of the TD."
> 
> ISP = interrupt on short packet ( we set for all bulk in TRBs)
> IOC = interupt on completion    ( we set for last TRB in TD)


One last data point then, if it's looking like a controller snafu, the
logs above were from testing on a bare metal host, and I'd first seen
this bug on a different machine, also bare metal (both motherboards are
Gigabyte, but different ages and one has an 'Intel chipset' on it, the
other AMD) ...

But I've also reproduced this in a QEMU/KVM VM using the nec-xhci
controller that it provides.  In that case the device was actually being
passed through to the guest from an EHCI socket on the host machine.

I haven't yet seen a machine where this does work as expected.

  Thanks!
  Ron

Reply to:

References:
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Mathias Nyman <mathias.nyman@intel.com>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Mathias Nyman <mathias.nyman@intel.com>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Mathias Nyman <mathias.nyman@intel.com>

Prev by Date: Bug#810726: linux-image-4.3.0-1-amd64: Kernel 4.3 breaks Hauppauge HD-PVR recording
Next by Date: Bug#810767: linux-image-3.16.0-4-amd64: kernel crashes with skb_under_panic when trying to use tc_mirred
Previous by thread: Bug#808953: xhci regression for large transfers (commit e210c422b)
Next by thread: Processed: limit source to linux, tagging 809083, tagging 807624, tagging 808623
Index(es):
- Date
- Thread