
Bug#636797: linux-image-2.6.32-5-amd64: avoid divide-by-zero ("divide error: 0000") in scheduler



Hi Ben--

Thanks for the quick followup!

On 08/07/2011 12:36 PM, Ben Hutchings wrote:
> On Fri, 2011-08-05 at 18:36 -0400, Daniel Kahn Gillmor wrote:
>> We've applied the attached patch (a simple workaround to ensure no
>> division-by-zero) to the debian packages for several weeks in production
>> (over a month on some machines) and haven't seen a recurrence of the
>> problem.
>
> This doesn't really fix the bug - division by zero is just a symptom of
> a more fundamental problem which has yet to be identified.

yep, that's why i called it a workaround :)

> As a result,
> it hasn't been accepted upstream and won't be accepted in Debian.
> 
> That said, I would consider applying a variant that WARNs before 'fixing
> up' the zero divisor, as a *temporary* measure to aid in understanding
> the bug (more like
> <https://bugzilla.kernel.org/show_bug.cgi?id=16991#c13>).

That sounds reasonable to me.  Are you up for preparing such a patch or
do you need me to do it?
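
For illustration, the "WARN before fixing up the zero divisor" idea can be sketched in userspace C roughly as follows. This is only a sketch under stated assumptions: it is not the attached patch and not the code from the bugzilla comment, and the names guarded_div, warn_zero_divisor, zero_divisor_hits, load and power are hypothetical. In the kernel, the reporting would presumably go through a WARN-style macro rather than fprintf.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counter: how many times a zero divisor was fixed up. */
static unsigned long zero_divisor_hits;

/*
 * Userspace stand-in for a warn-once kernel macro: report the first
 * occurrence on stderr, stay quiet afterwards, and count every hit.
 */
static void warn_zero_divisor(const char *where)
{
	static int warned;

	zero_divisor_hits++;
	if (!warned) {
		warned = 1;
		fprintf(stderr, "WARNING: zero divisor in %s, using 1 instead\n",
			where);
	}
}

/*
 * Divide load by power. If power is unexpectedly zero, warn and
 * substitute 1 instead of trapping with a divide error. The warning
 * is the point: it leaves a trace of where the bad value showed up.
 */
static uint64_t guarded_div(uint64_t load, uint64_t power, const char *where)
{
	if (power == 0) {
		warn_zero_divisor(where);
		power = 1;
	}
	return load / power;
}
```

A caller would use guarded_div(load, power, __func__) in place of a bare load / power. The result in the zero case is arbitrary, so this only keeps the machine up while the real cause is tracked down.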

> I notice your 'oops' messages show 'Tainted: G W' which indicates there
> was an earlier kernel warning.  What was the previous warning?

hmm, we've seen this on multiple machines, and they didn't all have a
prior warning.  on the referenced machine, though, the earlier warning
was a netdev watchdog timeout, 5 months earlier.  It doesn't seem related
to me, but i'm happy to include the dump here in case anyone else can
extract meaning from it:

>> 2011-01-04_10:28:18.85061 [3129874.324489] ------------[ cut here ]------------
>> 2011-01-04_10:28:18.89235 [3129874.329286] WARNING: at /build/buildd-linux-2.6_2.6.32-28-amd64-EUJiNq/linux-2.6-2.6.32/debian/build/source_amd64_none/net/sched/sch_generic.c:261 dev_watchdog+0xe2/0x194()
>> 2011-01-04_10:28:18.89236 [3129874.344808] Hardware name: PowerEdge R410
>> 2011-01-04_10:28:18.89237 [3129874.348981] NETDEV WATCHDOG: eth0 (bnx2): transmit queue 1 timed out
>> 2011-01-04_10:28:18.89238 [3129874.355561] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext4 jbd2 crc16 ext2 bridge stp kvm_intel kvm tun loop snd_pcm snd_timer snd soundcore snd_page_alloc dcdbas pcspkr psmouse serio_raw evdev button power_meter processor ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc dm_crypt dm_mod raid1 md_mod sd_mod crc_t10dif sg sr_mod cdrom ata_generic uhci_hcd mpt2sas ehci_hcd thermal ata_piix thermal_sys usbcore nls_base scsi_transport_sas libata scsi_mod bnx2 [last unloaded: scsi_wait_scan]
>> 2011-01-04_10:28:18.89240 [3129874.408913] Pid: 0, comm: swapper Not tainted 2.6.32-5-amd64 #1
>> 2011-01-04_10:28:18.89240 [3129874.415063] Call Trace:
>> 2011-01-04_10:28:18.89242 [3129874.417740]  <IRQ>  [<ffffffff81261c12>] ? dev_watchdog+0xe2/0x194
>> 2011-01-04_10:28:18.89243 [3129874.424219]  [<ffffffff81261c12>] ? dev_watchdog+0xe2/0x194
>> 2011-01-04_10:28:18.89244 [3129874.430018]  [<ffffffff8104dd6c>] ? warn_slowpath_common+0x77/0xa3
>> 2011-01-04_10:28:18.89245 [3129874.436423]  [<ffffffff81261b30>] ? dev_watchdog+0x0/0x194
>> 2011-01-04_10:28:18.89246 [3129874.442131]  [<ffffffff8104ddf4>] ? warn_slowpath_fmt+0x51/0x59
>> 2011-01-04_10:28:18.89247 [3129874.448276]  [<ffffffff81041b41>] ? enqueue_task_fair+0x3e/0x82
>> 2011-01-04_10:28:18.89248 [3129874.454420]  [<ffffffff8103fbfa>] ? task_rq_lock+0x46/0x79
>> 2011-01-04_10:28:18.89249 [3129874.460132]  [<ffffffff8104a252>] ? try_to_wake_up+0x2a7/0x2b9
>> 2011-01-04_10:28:18.89250 [3129874.466191]  [<ffffffff81261b04>] ? netif_tx_lock+0x3d/0x69
>> 2011-01-04_10:28:18.89250 [3129874.471989]  [<ffffffff8124c97c>] ? netdev_drivername+0x3b/0x40
>> 2011-01-04_10:28:18.89251 [3129874.478132]  [<ffffffff81261c12>] ? dev_watchdog+0xe2/0x194
>> 2011-01-04_10:28:18.89252 [3129874.483930]  [<ffffffff8103a9cd>] ? __wake_up_common+0x44/0x72
>> 2011-01-04_10:28:18.89253 [3129874.489992]  [<ffffffff81057560>] ? cascade+0x5f/0x77
>> 2011-01-04_10:28:18.89253 [3129874.495278]  [<ffffffff8105a337>] ? run_timer_softirq+0x1c9/0x268
>> 2011-01-04_10:28:18.89254 [3129874.501594]  [<ffffffff81053aaf>] ? __do_softirq+0xdd/0x1a2
>> 2011-01-04_10:28:18.89256 [3129874.507398]  [<ffffffff8102419a>] ? lapic_next_event+0x18/0x1d
>> 2011-01-04_10:28:18.89256 [3129874.513458]  [<ffffffff81011cac>] ? call_softirq+0x1c/0x30
>> 2011-01-04_10:28:18.89257 [3129874.519166]  [<ffffffff8101322b>] ? do_softirq+0x3f/0x7c
>> 2011-01-04_10:28:18.89261 [3129874.524774]  [<ffffffff8105391e>] ? irq_exit+0x36/0x76
>> 2011-01-04_10:28:19.85162 [3129874.530164]  [<ffffffff81024c68>] ? smp_apic_timer_interrupt+0x87/0x95
>> 2011-01-04_10:28:19.85163 [3129874.536911]  [<ffffffff81011673>] ? apic_timer_interrupt+0x13/0x20
>> 2011-01-04_10:29:45.93714 x9d/0xb8 [processor]
>> 2011-01-04_10:29:45.93717 [3129874.551277]  [<ffffffffa01c024c>] ? acpi_idle_enter_c1+0x78/0xb8 [processor]
>> 2011-01-04_10:29:45.93718 [3129874.558550]  [<ffffffff81238f62>] ? cpuidle_idle_call+0x94/0xee
>> 2011-01-04_10:29:45.93719 [3129874.564695]  [<ffffffff8100feb1>] ? cpu_idle+0xa2/0xda
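
As a side note on the 'Tainted: G W' question: the dump above still reads "Not tainted", which would be consistent with this watchdog warning being the first one since boot. A minimal sketch of reading the flag letters follows. The function names are hypothetical, and the table covers only a few well-known flags (G, P, W, D), not the kernel's full list.

```c
#include <stddef.h>

/*
 * Describe a single taint flag letter. Only a few flags are covered
 * here; unknown letters (and the padding spaces) return NULL.
 */
static const char *taint_flag_meaning(char flag)
{
	switch (flag) {
	case 'G': return "only GPL-compatible modules loaded";
	case 'P': return "proprietary module loaded";
	case 'W': return "a kernel warning was issued earlier";
	case 'D': return "kernel has oopsed (died) before";
	default:  return NULL;
	}
}

/* Return 1 if a flag string such as "G W" contains the W flag. */
static int had_prior_warning(const char *flags)
{
	for (; *flags; flags++)
		if (*flags == 'W')
			return 1;
	return 0;
}
```

With the string from the oops, had_prior_warning("G W") returns 1, which is what prompted the question about an earlier warning.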

hth,

	--dkg


