Bug#692607: linux-image-3.2.0-4-686-pae: Kernel crash when coming out of screen saver
Steinar Bang <sb@dod.no> writes:
>>>>>> Ben Hutchings <ben@decadent.org.uk>:
>
>> Please send a readable photograph of this text.
>
> The problem occurred for the third time, and I couldn't find the camera,
> so I'm typing in what's shown on the console.
>
> This time it happened while the machine was sitting unattended, so I
> can't say it had anything to do with the screen saver, unless someone
> unintentionally moved the mouse.
>
> I also note that it says "invalid opcode". This machine has an Intel P4
> CPU. Is it too old for the current kernels?
>
> Console text follows:
> [523708.506472] ------------[ cut here ]-----------
> [523708.506472] kernel BUG at /build/build-linux_3.2.32-1-i386-Z3rOrf/linux-3.2.32/kernel/workqueue.c:1040!
IMHO this should not be a BUG, and newer kernels do in fact make it
easier to debug:
commit f5b2552b4ebbeadcadde1532d7bbd3f850719046
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Fri Apr 13 22:06:58 2012 +0300

    workqueue: change BUG_ON() to WARN_ON()

    This BUG_ON() can be triggered if you call schedule_work() before
    calling INIT_WORK().  It is a bug definitely, but it's nicer to just
    print a stack trace and return.

    Reported-by: Matt Renzelmann <mjr@cs.wisc.edu>
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5abf42f..66ec08d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1032,7 +1032,10 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 	cwq = get_cwq(gcwq->cpu, wq);
 	trace_workqueue_queue_work(cpu, cwq, work);

-	BUG_ON(!list_empty(&work->entry));
+	if (WARN_ON(!list_empty(&work->entry))) {
+		spin_unlock_irqrestore(&gcwq->lock, flags);
+		return;
+	}

 	cwq->nr_in_flight[cwq->work_color]++;
 	work_flags = work_color_to_flags(cwq->work_color);

Is there any chance this could be included in the Debian wheezy kernels,
even though I guess it does not meet the stable requirements?
> [523708.506472] invalid opcode: 0000 [#1] SMP
> [523708.506472] Modules linked in: mperf speedstep_lib ip6table_filter ip6_tables cpufreq_powersave iptable_filter ip_tables cpufreq_stats cpufreq_conservative cpufreq_userspace ebtable_nat ebtables x_tables ppdev lp bnep rfcomm bluetooth rfkill crc16 binfmt_misc fuse nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_intel8x0 snd_ac97_codec i915 snd_pcm_oss snd_mixer_oss snd_pcm video snd_page_alloc drm_kms_helper snd_seq_midi snd_seq_midi_event psmouse snd_rawmidi snd_seq snd_seq_device snd_timer snd pcspkr drm i2c_i801 i2c_algo_bit soundcore ac97_bus i2c_core iTCO_wdt serio_raw evdev parport_pc iTCO_vendor_support parport processor thermal_sys rng_core button shpchp usbhid hid ext3 mbcache jbd dm_mod sg sd_mod sr_mod cdrom crc_t10dif ata_generic floppy ata_piix libata uhci_hcd ehci_hcd tg3 usbcore libphy scsi_mod usb_common [last unloaded: scsi_wait_scan]
> [523708.506472]
> [523708.506472] Pid: 0, comm: swapper/0 Not tainted 3.2.0-4-686-pae #1 Debian 3.2.32-1 Hewlett-Packard HP d530 CMT(DZ036T)/085Ch
> [523708.506472] EIP: 0060:[<c10494b1>] EFLAGS: 00010013 CPU: 0
> [523708.506472] EIP is at __queue_work+0x193/0x1f4
> [523708.506472] EAX: f739e56c EBX: f708c800 ECX: 00000020 EDX: f739e568
> [523708.506472] ESI: c14b5240 EDI: 00000010 EBP: 00000046 ESP: f5809f60
> [523708.506472] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [523708.506472] Process swapper/0 (pid: 0, ti=f5808000 task=c13defe0 task.ti=c13d8000)
> [523708.506472] Stack:
> [523708.506472] f739e568 f085fe80 00000000 f085fe80 00000000 00000010 f7398000 c1049555
> [523708.506472] f739e568 f739e000 f0871400 f85abe17 c11e601f 0c00a511 00008000 00001930
> [523708.506472] f739e568 00000006 f739e028 00000046 00000046 f71147c0 f58068d4 00000010
> [523708.506472] Call Trace:
> [523708.506472] [<c1049555>] ? queue_work_on+0x25/0x30
> [523708.506472] [<f85abe17>] ? i8xx_irq_handler+0x6b/0x1dc [i915]
I took a quick look at this, and my guess is that i8xx_irq_handler
tries to queue an error event through i915_handle_error() here.
The error_work work_struct is initialized in intel_irq_init(), so I
cannot see how the error can happen unless something scribbles over it
at some point. Which may be what happens here? That would be a lot
easier to see if we could have queue_work fail with a warning instead.
Maybe add a few extra debugging tests to i915_handle_error() to see if
this is indeed what happens here? Completely untested of course:
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 32e1bda..614f3f4 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1414,6 +1414,19 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
 	}
 }

+/* debugging helper only... */
+static bool safe_queue_work(struct workqueue_struct *wq, struct work_struct *work)
+{
+	if (WARN_ON(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
+		    !list_empty(&work->entry))) {
+		pr_err("work->data=0x%08lx, &work->entry=%p, work->entry.next=%p, work->entry.prev=%p\n",
+		       *work_data_bits(work), &work->entry, work->entry.next, work->entry.prev);
+		return false;
+	}
+
+	return queue_work(wq, work);
+}
+
/**
* i915_handle_error - handle an error interrupt
* @dev: drm device
@@ -1444,7 +1457,7 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
 			wake_up_all(&ring->irq_queue);
 	}

-	queue_work(dev_priv->wq, &dev_priv->error_work);
+	safe_queue_work(dev_priv->wq, &dev_priv->error_work);
 }

static void i915_pageflip_stall_check(struct drm_device *dev, int pipe)
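
For anyone who wants to see why the list_empty() test fires, here is a
small userspace sketch of the idea. It is only a model under my own
assumptions: struct model_work, model_queue_work() and check_cases() are
made-up names, the list helpers are simplified copies of the kernel ones,
and there is no locking and no pending bit. It shows that a work entry
which was never initialised, or which was overwritten after
initialisation, does not look "empty", and that warning and returning
lets the caller carry on instead of dying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* An entry is "empty" only when it points back at itself. */
static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* Hypothetical model of a work item: only the list linkage matters here. */
struct model_work {
	struct list_head entry;
};

/*
 * Model of the WARN_ON-and-return variant: refuse to queue an entry
 * that is not empty, print its pointers, and let the caller continue.
 */
static bool model_queue_work(struct list_head *queue, struct model_work *work)
{
	if (!list_empty(&work->entry)) {
		fprintf(stderr, "WARNING: entry=%p next=%p prev=%p\n",
			(void *)&work->entry, (void *)work->entry.next,
			(void *)work->entry.prev);
		return false;
	}
	list_add_tail(&work->entry, queue);
	return true;
}

static void check_cases(void)
{
	struct list_head queue;
	struct model_work ok, zeroed, scribbled;

	init_list_head(&queue);

	/* Properly initialised work is accepted the first time. */
	init_list_head(&ok.entry);
	assert(model_queue_work(&queue, &ok));

	/* Still linked into the queue, so a second attempt is refused. */
	assert(!model_queue_work(&queue, &ok));

	/* Never initialised (zero-filled): next is NULL, not itself. */
	memset(&zeroed, 0, sizeof(zeroed));
	assert(!model_queue_work(&queue, &zeroed));

	/* Initialised, then overwritten by a stray write. */
	init_list_head(&scribbled.entry);
	scribbled.entry.next = &queue;
	assert(!model_queue_work(&queue, &scribbled));
}
```

Calling check_cases() from a main() prints three warnings on stderr and
returns normally. The printed pointers are the same kind of information
the pr_err() in the debugging helper above is meant to collect.
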
Bjørn