Bug#623275: marked as done (linux-2.6: [x86] Null pointer dereference in hrtick_start_fair)

To: Moritz Mühlenhoff <jmm@inutil.org>
Subject: Bug#623275: marked as done (linux-2.6: [x86] Null pointer dereference in hrtick_start_fair)
From: owner@bugs.debian.org (Debian Bug Tracking System)
Date: Mon, 24 Jun 2013 17:30:36 +0000
Message-id: <handler.623275.D623275.13720948921320.ackdone@bugs.debian.org>
References: <20130624172757.GA22021@pisco.westfalen.local> <1303165215.23072.24.camel@ank32.eng.vmware.com>

Your message dated Mon, 24 Jun 2013 19:27:57 +0200
with message-id <20130624172757.GA22021@pisco.westfalen.local>
and subject line Closing
has caused the Debian Bug report #623275,
regarding linux-2.6: [x86] Null pointer dereference in hrtick_start_fair
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
623275: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=623275
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems

--- Begin Message ---

To: submit@bugs.debian.org
Subject: linux-2.6: [x86] Null pointer dereference in hrtick_start_fair
From: Alok Kataria <akataria@vmware.com>
Date: Mon, 18 Apr 2011 15:20:15 -0700
Message-id: <1303165215.23072.24.camel@ank32.eng.vmware.com>
Reply-to: akataria@vmware.com

Package: linux-2.6
Version: 2.6.26-*

This is the same bug that was reported in PR 538332; that bug was
archived, so I am submitting the new findings here.


Sorry for not replying on the other bug earlier. I had forgotten about
this bug report until I received multiple reports of it being hit
more often while running reboot-loop tests on the Debian (5.x) kernel,
and started looking at it closely.

First of all, this is the panic message that we see:

<1>[ 1.890083] BUG: unable to handle kernel NULL pointer dereference at 00000000
<1>[ 1.890083] IP: [<c0119118>] hrtick_start_fair+0x63/0x12c
<4>[ 1.890083] *pde = 00000000
<0>[ 1.890083] Oops: 0000 [#1] SMP
<4>[ 1.890083] Modules linked in:
<4>[ 1.890083]
<4>[ 1.890083] Pid: 11, comm: khelper Not tainted (2.6.26-2-686 #1)
<4>[ 1.890083] EIP: 0060:[<c0119118>] EFLAGS: 00010046 CPU: 0
<4>[ 1.890083] EIP is at hrtick_start_fair+0x63/0x12c
<4>[ 1.890083] EAX: 00000000 EBX: c1413ffc ECX: 00000001 EDX: 00000001
<4>[ 1.890083] ESI: df47d900 EDI: c1413fc0 EBP: df4bd228 ESP: df499f20
<4>[ 1.890083] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
<0>[ 1.890083] Process khelper (pid: 11, ti=df498000 task=df48e8c0 task.ti=df498000)
<0>[ 1.890083] Stack: df4bd200 df4bd200 c02c4d40 df47d900 c1413fc0 00000001 c0118966 c1413fc0
<0>[ 1.890083] df47d900 00000001 c011898c df47d900 c1413fc0 c011b6fa 00000003 00000002
<0>[ 1.890083] df47feb0 df47fed4 00000001 00000001 c0118511 00000000 00000003 df47fedc
<0>[ 1.890083] Call Trace:
<0>[ 1.890083] [<c0118966>] enqueue_task+0x52/0x5d
<0>[ 1.890083] [<c011898c>] activate_task+0x1b/0x26
<0>[ 1.890083] [<c011b6fa>] try_to_wake_up+0xaf/0xf1
<0>[ 1.890083] [<c0118511>] __wake_up_common+0x2e/0x58
<0>[ 1.890083] [<c011a686>] complete+0x28/0x36
<0>[ 1.890083] [<c012ebfa>] __call_usermodehelper+0x0/0x4b
<0>[ 1.890083] [<c012f0ae>] run_workqueue+0x74/0xf2
<0>[ 1.890083] [<c012f789>] worker_thread+0x0/0xbd
<0>[ 1.890083] [<c012f83c>] worker_thread+0xb3/0xbd
<0>[ 1.890083] [<c0131a44>] autoremove_wake_function+0x0/0x2d
<0>[ 1.890083] [<c0131983>] kthread+0x38/0x5d
<0>[ 1.890083] [<c013194b>] kthread+0x0/0x5d
<0>[ 1.890083] [<c01044f7>] kernel_thread_helper+0x7/0x10
<0>[ 1.890083] =======================
<0>[ 1.890083] Code: 00 b8 51 09 31 c0 e8 17 95 00 00 f6 05 40 45 37 c0 40 0f 84 d5 00 00 00 f6 87 28 04 00 00 04 0f 85 c8 00 00 00 8b 87 4c 04 00 00 <8b> 00 83 78 7c 00 0f 84 b6 00 00 00 83 7b 08 01 0f 86 ac 00 00
<0>[ 1.890083] EIP: [<c0119118>] hrtick_start_fair+0x63/0x12c SS:ESP 0068:df499f20
<4>[ 1.890083] ---[ end trace a7919e7f17c0a725 ]---


I think there may be a race between hrtimer_start and hrtick_start_fair
that can cause this.

The timer->base for the rq (run queue) of every possible CPU is set up in
__hrtimer_init, and at that time every CPU's rq.timer->base points to
cpu0's hrtimer_bases.

When cpu1 starts running, the first call to hrtimer_start on this CPU
tries to change the base to the local CPU's base, because after
__hrtimer_init it still points to cpu0's base. The switch is done in
switch_hrtimer_base.

While this is happening on cpu1, cpu0 may access the timer base of cpu1's
runqueue (rq.timer->base) without calling lock_timer_base, and in that
case it can see the NULL value. The stack trace suggests this scenario:
cpu0 is waking up a task that sits on cpu1's runqueue, and it reads
rq.timer->base as NULL.
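
To make the suspected access pattern concrete, here is a minimal
user-space sketch. It assumes the faulting read is a chain of the form
timer->base->cpu_base->hres_active; that chain, the struct layouts, and
every name ending in _model are simplifications introduced for
illustration, not the kernel's actual definitions. The second function
shows one defensive way to read the pointer; it is not the fix that
Debian or upstream applied.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's real definitions. */
struct cpu_base_model   { int hres_active; };
struct clock_base_model { struct cpu_base_model *cpu_base; };
struct timer_model      { _Atomic(struct clock_base_model *) base; };

/* Unlocked read chain: dereferences NULL if it runs while base is
 * temporarily cleared by a migration on another CPU. */
int hres_active_unlocked(struct timer_model *t)
{
    struct clock_base_model *base = atomic_load(&t->base);
    return base->cpu_base->hres_active;   /* faults when base == NULL */
}

/* Defensive variant (an assumption, not the Debian or upstream fix):
 * load base exactly once and treat NULL as "not active". */
int hres_active_checked(struct timer_model *t)
{
    assert(t != NULL);
    struct clock_base_model *base = atomic_load(&t->base);
    if (base == NULL)
        return 0;
    return base->cpu_base->hres_active;
}
```

With base pointing at a valid clock base both functions return the same
value; with base set to NULL only the checked variant returns (with 0)
instead of faulting.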


static inline struct hrtimer_clock_base *
switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
{
....
      /* See the comment in lock_timer_base() */
      timer->base = NULL;   /* <== after this, cpu0 might see the base as NULL for cpu1's runqueue */
      spin_unlock(&base->cpu_base->lock);
      spin_lock(&new_base->cpu_base->lock);
      timer->base = new_base;
....
}
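
The comment in the excerpt points at lock_timer_base(). The sketch below
is a user-space model of the retry idea such a helper is generally
understood to rely on: read the base, lock it, and re-check that the
timer still belongs to it. All names ending in _model, the atomic_flag
spinlock, and the bounded spin count (max_spins) are assumptions made so
the example terminates and can be run outside the kernel; the real
helper is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's real definitions. */
struct cpu_base_model   { atomic_flag lock; int hres_active; };
struct clock_base_model { struct cpu_base_model *cpu_base; };
struct timer_model      { _Atomic(struct clock_base_model *) base; };

void model_lock(struct cpu_base_model *c)
{
    while (atomic_flag_test_and_set(&c->lock))
        ;                               /* spin until the flag is free */
}

void model_unlock(struct cpu_base_model *c)
{
    atomic_flag_clear(&c->lock);
}

/* Reader side: returns the base with its lock held, or NULL if the
 * timer was still mid-migration after max_spins attempts. */
struct clock_base_model *lock_timer_base_model(struct timer_model *t,
                                               unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; i++) {
        struct clock_base_model *base = atomic_load(&t->base);
        if (base != NULL) {
            model_lock(base->cpu_base);
            if (base == atomic_load(&t->base))
                return base;            /* still the owner; lock is held */
            model_unlock(base->cpu_base);   /* timer migrated meanwhile */
        }
    }
    return NULL;
}

/* Writer side, mirroring the steps quoted in the report. The caller
 * holds old_base's lock on entry and new_base's lock on return. */
void switch_base_model(struct timer_model *t,
                       struct clock_base_model *old_base,
                       struct clock_base_model *new_base)
{
    assert(old_base != NULL && new_base != NULL);
    atomic_store(&t->base, NULL);       /* window opens: readers see NULL */
    model_unlock(old_base->cpu_base);
    model_lock(new_base->cpu_base);
    atomic_store(&t->base, new_base);   /* window closes */
}
```

A reader that goes through lock_timer_base_model() never dereferences a
NULL base: it either gets a locked, still-valid base or is told to try
again, which is the property the unlocked access described in the report
lacks.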


I didn't dig into when the race was introduced, but it seems to exist in
mainline 2.6.26 too. In recent kernels the hrtimer code has been
revamped quite a bit, and the race does not exist in those versions.

Could you please take a look at this analysis and let me know if you
have any comments?

Thanks,
Alok

--- End Message ---

--- Begin Message ---

To: 538081-done@bugs.debian.org, 539059-done@bugs.debian.org, 541073-done@bugs.debian.org, 542440-done@bugs.debian.org, 623019-done@bugs.debian.org, 623066-done@bugs.debian.org, 623239-done@bugs.debian.org, 623275-done@bugs.debian.org
Subject: Closing
From: Moritz Mühlenhoff <jmm@inutil.org>
Date: Mon, 24 Jun 2013 19:27:57 +0200
Message-id: <20130624172757.GA22021@pisco.westfalen.local>

Hi,
your bug has been filed against the "linux-2.6" source package and was filed for
a kernel older than the recently released Debian 7.x / Wheezy with a severity
less than important.

We don't have the resources to reproduce the complete backlog of all older kernel
bugs, so we're closing this bug for now. If you can reproduce the bug with Debian Wheezy
or a more recent kernel from testing or unstable, please reopen the bug by sending
a mail to control@bugs.debian.org with the following three commands included in the
mail:

reopen BUGNUMBER
reassign BUGNUMBER src:linux
thanks

Cheers,
        Moritz
--- End Message ---
