
Bug#692607: linux-image-3.2.0-4-686-pae: Kernel crash when coming out of screen saver



Steinar Bang <sb@dod.no> writes:
>>>>>> Ben Hutchings <ben@decadent.org.uk>:
>
>> Please send a readable photograph of this text.
>
> The problem occurred for the third time, and I couldn't find the camera,
> so I'm typing in what's shown on the console.
>
> This time it happened while the machine was sitting unattended, so I
> can't say it had anything to do with the screen saver, unless someone
> has unintentionally moved the mouse.
>
> I also note that it says "invalid opcode".  This machine has an Intel P4
> CPU.  Is it too old for the current kernels?
>
> Console text follows:
> [523708.506472] ------------[ cut here ]-----------
> [523708.506472] kernel BUG at /build/build-linux_3.2.32-1-i386-Z3rOrf/linux-3.2.32/kernel/workqueue.c:1040!

This should not be a BUG IMHO, and it has in fact been made easier to
debug in newer kernels:


commit f5b2552b4ebbeadcadde1532d7bbd3f850719046
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Fri Apr 13 22:06:58 2012 +0300

    workqueue: change BUG_ON() to WARN_ON()
    
    This BUG_ON() can be triggered if you call schedule_work() before
    calling INIT_WORK().  It is a bug definitely, but it's nicer to just
    print a stack trace and return.
    
    Reported-by: Matt Renzelmann <mjr@cs.wisc.edu>
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5abf42f..66ec08d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1032,7 +1032,10 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
        cwq = get_cwq(gcwq->cpu, wq);
        trace_workqueue_queue_work(cpu, cwq, work);
 
-       BUG_ON(!list_empty(&work->entry));
+       if (WARN_ON(!list_empty(&work->entry))) {
+               spin_unlock_irqrestore(&gcwq->lock, flags);
+               return;
+       }
 
        cwq->nr_in_flight[cwq->work_color]++;
        work_flags = work_color_to_flags(cwq->work_color);



Any chance that could be included in Debian wheezy kernels, although I
guess it does not meet stable requirements?



> [523708.506472] invalid opcode: 0000 [#1] SMP
> [523708.506472] Modules linked in: mperf speedstep_lib ip6table_filter ip6_tables cpufreq_powersave iptable_filter ip_tables cpufreq_stats cpufreq_conservative cpufreq_userspace ebtable_nat ebtables x_tables ppdev lp bnep rfcomm bluetooth rfkill crc16 binfmt_misc fuse nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_intel8x0 snd_ac97_codec i915 snd_pcm_oss snd_mixer_oss snd_pcm video snd_page_alloc drm_kms_helper snd_seq_midi snd_seq_midi_event psmouse snd_rawmidi snd_seq snd_seq_device snd_timer snd pcspkr drm i2c_i801 i2c_algo_bit soundcore ac97_bus i2c_core iTCO_wdt serio_raw evdev parport_pc iTCO_vendor_support parport processor thermal_sys rng_core button shpchp usbhid hid ext3 mbcache jbd dm_mod sg sd_mod sr_mod cdrom crc_t10dif ata_generic floppy ata_piix libata uhci_hcd ehci_hcd tg3 usbcore libphy scsi_mod usb_common [last unloaded: scsi_wait_scan]
> [523708.506472] 
> [523708.506472] Pid: 0, comm: swapper/0 Not tainted 3.2.0-4-686-pae #1 Debian 3.2.32-1 Hewlett-Packard HP d530 CMT(DZ036T)/085Ch
> [523708.506472] EIP: 0060:[<c10494b1>] EFLAGS: 00010013 CPU: 0
> [523708.506472] EIP is at __queue_work+0x193/0x1f4
> [523708.506472] EAX: f739e56c EBX: f708c800 ECX: 00000020 EDX: f739e568
> [523708.506472] ESI: c14b5240 EDI: 00000010 EBP: 00000046 ESP: f5809f60
> [523708.506472]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [523708.506472] Process swapper/0 (pid: 0, ti=f5808000 task=c13defe0 task.ti=c13d8000)
> [523708.506472] Stack:
> [523708.506472]  f739e568 f085fe80 00000000 f085fe80 00000000 00000010 f7398000 c1049555
> [523708.506472]  f739e568 f739e000 f0871400 f85abe17 c11e601f 0c00a511 00008000 00001930
> [523708.506472]  f739e568 00000006 f739e028 00000046 00000046 f71147c0 f58068d4 00000010
> [523708.506472] Call Trace:
> [523708.506472]  [<c1049555>] ? queue_work_on+0x25/0x30
> [523708.506472]  [<f85abe17>] ? i8xx_irq_handler+0x6b/0x1dc [i915]


I took a quick look at this, and my guess is that i8xx_irq_handler
tries to queue an error event through i915_handle_error() here.

The error_work work_struct is initialized in intel_irq_init(), so I
cannot see how the error can happen unless something scribbles over it
at some point.  Which may be what happens here?  That would be a lot
easier to see if we could have queue_work fail with a warning instead.
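
To make the "scribbled over" theory concrete, here is a small userspace
model of the condition that trips the check in __queue_work().  It is
only a sketch: struct fake_work, fake_init_work(), scribble_entry() and
would_trip_check() are made-up stand-ins, not kernel APIs, and bit 0 of
"data" merely models the pending bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical stand-in for struct work_struct. */
struct fake_work {
	unsigned long data;		/* bit 0 models the pending bit */
	struct list_node entry;
};

/* Models INIT_WORK(): clear the flags, point the entry at itself. */
static void fake_init_work(struct fake_work *w)
{
	assert(w != NULL);
	w->data = 0;
	w->entry.next = &w->entry;
	w->entry.prev = &w->entry;
}

/* Models list_empty(): an empty node points back at itself. */
static bool node_empty(const struct list_node *n)
{
	return n->next == n;
}

/* Simulates a stray write that zeroes the list pointers. */
static void scribble_entry(struct fake_work *w)
{
	memset(&w->entry, 0, sizeof(w->entry));
}

/*
 * True when queueing would reach the list_empty() check and fail it:
 * the work is not marked pending, yet its entry does not look empty.
 */
static bool would_trip_check(const struct fake_work *w)
{
	bool pending = (w->data & 1UL) != 0;

	return !pending && !node_empty(&w->entry);
}
```

A freshly initialised fake_work passes the check; after scribble_entry()
the entry no longer points at itself, so would_trip_check() reports
true.  The same state would be seen if the structure had never been
initialised at all, so the model cannot tell those two cases apart.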

Maybe add a few extra debugging tests to i915_handle_error() to see if
this is indeed what happens here?  Completely untested of course:


diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 32e1bda..614f3f4 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1414,6 +1414,19 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
 	}
 }
 
+/* debugging helper only... */
+static bool safe_queue_work(struct workqueue_struct *wq, struct work_struct *work)
+{
+	if (WARN_ON(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
+	    !list_empty(&work->entry))) {
+		pr_err("work->data=0x%08lx, &work->entry=%p, work->entry.next=%p, work->entry.prev=%p\n",
+			*work_data_bits(work), &work->entry, work->entry.next, work->entry.prev );
+		return false;
+	}
+
+	return queue_work(wq, work);
+}
+
 /**
  * i915_handle_error - handle an error interrupt
  * @dev: drm device
@@ -1444,7 +1457,7 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
 			wake_up_all(&ring->irq_queue);
 	}
 
-	queue_work(dev_priv->wq, &dev_priv->error_work);
+	safe_queue_work(dev_priv->wq, &dev_priv->error_work);
 }
 
 static void i915_pageflip_stall_check(struct drm_device *dev, int pipe)




Bjørn

