
Bug#409349: usbhid: control queue full; hung apcupsd task



On Sat, 22 Jan 2011 19:39:24 +0000 Steven Chamberlain wrote:
> Usually I see the stream of the "control queue full" errors first, but
> this time apcupsd hung very soon after boot:
> INFO: task apcupsd:21604 blocked for more than 120 seconds.
> [<ffffffffa01bdeb9>] ? ohci_urb_dequeue+0xd9/0xe9 [ohci_hcd]
> [<ffffffffa003f10c>] ? usb_kill_urb+0x9d/0xbb [usbcore]
> [<ffffffff8106681a>] ? autoremove_wake_function+0x0/0x2e
> [<ffffffffa0209feb>] ? usbhid_init_reports+0x8c/0xee
> [<ffffffffa020b396>] ? hiddev_ioctl+0x2bf/0x632 [usbhid]
> This was despite raising apcupsd's POLLTIME from 10 to 60 seconds.

My usbmon dump showed requests being made that the UPS was neither
answering nor failing.  That sounds like a job for a bus analyzer and
a read through the USB specification.

usbhid_init_reports is called from usbhid_start, which hid.c describes
as "called on probe to start the device", and as such wouldn't be
called each time apcupsd polls the UPS.  So increasing the POLLTIME
wouldn't have helped.  I don't know how many reports it is requesting,
but the timeout in usbhid_wait_io is 10 seconds, so 12 of them would
add up to the 120 seconds in that message.  Even so I would expect it
to eventually time out.  If it was actually sitting in
ohci_urb_dequeue for 120 seconds, that is a problem.
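
For reference, the wait in usbhid_wait_io looks roughly like this in
2.6.32-era drivers/hid/usbhid/hid-core.c (paraphrased from memory, so
check your exact source tree):

	static int usbhid_wait_io(struct hid_device *hid)
	{
		struct usbhid_device *usbhid = hid->driver_data;

		/* Block until both the ctrl and out queues have drained,
		 * or give up after 10 seconds. */
		if (!wait_event_timeout(usbhid->wait,
				(!test_bit(HID_CTRL_RUNNING, &usbhid->iofl) &&
				 !test_bit(HID_OUT_RUNNING, &usbhid->iofl)),
				10*HZ)) {
			dbg_hid("timeout waiting for ctrl or out queue to clear\n");
			return -1;
		}

		return 0;
	}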

On Wed, Apr 04, 2012 at 10:43:42PM +0100, Steven Chamberlain wrote:
> found 409349 linux-2.6/2.6.32-41
> thanks
> 
> On 04/04/12 22:23, Jonathan Nieder wrote:
> > Ping.  Is nobody trying to use an APC UPS connected by USB any more? :)

I have three.  I only had problems with one of them, on an OHCI SiS
rev 07 PCI 1039:7001 controller, and since adding the "introduce
timeout for stuck ctrl/out URBs" patch I haven't seen the issue again.

> Yep!  This week on 2.6.32-41 I noticed apcupsd hang after about a day,
> followed by the 'control queue full' errors an hour later.  That was
> with the patches suggested in #631287 applied, just in case they made
> any difference to this.

My guess is the 'control queue full' problem just enlarges the window
of opportunity for the 'BUG during access to hiddev' bug, report
#631287.  I think those patches would help avoid crashing once it gets
into this situation, but from their titles I don't think they would
help avoid getting into it in the first place.

The queue is 256 entries long, and with apcupsd querying every 10
seconds, it would take 256*10/60 = ~42 minutes to fill once an entry
gets stuck.  With the timeout set to 5 seconds in the "introduce
timeout" patch and apcupsd polling every 10 seconds, the queue
shouldn't be able to fill up even if every single request failed to
complete.  But that assumes apcupsd submits only one control request
per 10-second poll, and looking at my old logs that isn't the case.
My UPS seemed to ignore a transfer only rarely; maybe some transfers
get ignored frequently enough to be a problem, in which case the
timeout patch wouldn't be able to keep up.  It would be nice to have
a usbmon dump of it happening to see whether that is the case.
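
To make the arithmetic concrete, the warning comes from the
ring-buffer full test in __usbhid_submit_report
(drivers/hid/usbhid/hid-core.c), which in 2.6.32-era kernels looks
roughly like this (paraphrased, so check your tree; the
HID_CONTROL_FIFO_SIZE constant is 256):

	int head;

	/* 256-entry ring of pending control requests; one slot is kept
	 * empty to distinguish full from empty.  At one stuck request
	 * per 10-second apcupsd poll, 256*10/60 = ~42 minutes to fill. */
	if ((head = (usbhid->ctrlhead + 1) & (HID_CONTROL_FIFO_SIZE - 1)) ==
			usbhid->ctrltail) {
		dev_warn(&hid->dev, "control queue full\n");
		return;
	}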

> > What kernels work and don't work?
> 
> I already tagged affected versions in the BTS;  the first report of this
> was on 2.6.17 and I myself recall seeing it as far back as 2.6.26.
> 
> 
> I hadn't been able to test the UPS on a 3.y kernel until this week.
> Right now I have the UPS hooked up to a new server running 3.2.0-2-amd64
> (Debian 3.2.12-1) to see if it recurs.

Which UPS do you have that's causing the problem?  Mine was:
BackUPS Pro 500
firmware 16.3.D USB FW:4

> Then in a few weeks, I'll be taking the original hardware (on which the
> issue was easily reproducible) out of production use, so I can try 3.y
> kernels on that too.  (Until now, I couldn't try 3.y kernels on that box
> as it needs to run OpenVZ containers).

What kind of timeframe counts as easily reproducible?  Can you run
information-gathering patches on the box?  If so, I'll put something
together.  Then again, this has been going on for years; if you prefer
to wait the few weeks until it is out of production use, that's fine.

Back when I was first debugging this I added some printk messages for
when the queue length increased.  If you're willing to give them a
try, I'll dig them up and have them print out the jiffies value as
well.
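
Roughly the sort of thing I mean (a hypothetical sketch, not the
actual patch I used; it reuses the ctrlhead/ctrltail ring indices from
drivers/hid/usbhid/hid-core.c):

	/* Hypothetical instrumentation: log the control queue depth and
	 * jiffies each time a request is queued, so we can watch the
	 * queue grow long before "control queue full" fires. */
	printk(KERN_DEBUG
	       "usbhid: ctrl queue depth %u (head %u tail %u) jiffies %lu\n",
	       (usbhid->ctrlhead - usbhid->ctrltail) &
			(HID_CONTROL_FIFO_SIZE - 1),
	       usbhid->ctrlhead, usbhid->ctrltail, jiffies);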

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/


