
Bug#623275: linux-2.6: [x86] Null pointer dereference in hrtick_start_fair



Package: linux-2.6
Version: 2.6.26-*

This is the same bug that was reported in PR 538332. That bug was
archived, so I am submitting the new information here.


Sorry for not replying on the other bug earlier. I totally forgot about
this bug report until I received multiple reports of people hitting this
bug more often while running reboot loop tests on the Debian (5.x)
kernel, and started looking at it closely.

First of all, this is the panic message that we see:

<1>[ 1.890083] BUG: unable to handle kernel NULL pointer dereference at 00000000
<1>[ 1.890083] IP: [<c0119118>] hrtick_start_fair+0x63/0x12c
<4>[ 1.890083] *pde = 00000000
<0>[ 1.890083] Oops: 0000 [#1] SMP
<4>[ 1.890083] Modules linked in:
<4>[ 1.890083]
<4>[ 1.890083] Pid: 11, comm: khelper Not tainted (2.6.26-2-686 #1)
<4>[ 1.890083] EIP: 0060:[<c0119118>] EFLAGS: 00010046 CPU: 0
<4>[ 1.890083] EIP is at hrtick_start_fair+0x63/0x12c
<4>[ 1.890083] EAX: 00000000 EBX: c1413ffc ECX: 00000001 EDX: 00000001
<4>[ 1.890083] ESI: df47d900 EDI: c1413fc0 EBP: df4bd228 ESP: df499f20
<4>[ 1.890083] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
<0>[ 1.890083] Process khelper (pid: 11, ti=df498000 task=df48e8c0 task.ti=df498000)
<0>[ 1.890083] Stack: df4bd200 df4bd200 c02c4d40 df47d900 c1413fc0 00000001 c0118966 c1413fc0
<0>[ 1.890083] df47d900 00000001 c011898c df47d900 c1413fc0 c011b6fa 00000003 00000002
<0>[ 1.890083] df47feb0 df47fed4 00000001 00000001 c0118511 00000000 00000003 df47fedc
<0>[ 1.890083] Call Trace:
<0>[ 1.890083] [<c0118966>] enqueue_task+0x52/0x5d
<0>[ 1.890083] [<c011898c>] activate_task+0x1b/0x26
<0>[ 1.890083] [<c011b6fa>] try_to_wake_up+0xaf/0xf1
<0>[ 1.890083] [<c0118511>] __wake_up_common+0x2e/0x58
<0>[ 1.890083] [<c011a686>] complete+0x28/0x36
<0>[ 1.890083] [<c012ebfa>] __call_usermodehelper+0x0/0x4b
<0>[ 1.890083] [<c012f0ae>] run_workqueue+0x74/0xf2
<0>[ 1.890083] [<c012f789>] worker_thread+0x0/0xbd
<0>[ 1.890083] [<c012f83c>] worker_thread+0xb3/0xbd
<0>[ 1.890083] [<c0131a44>] autoremove_wake_function+0x0/0x2d
<0>[ 1.890083] [<c0131983>] kthread+0x38/0x5d
<0>[ 1.890083] [<c013194b>] kthread+0x0/0x5d
<0>[ 1.890083] [<c01044f7>] kernel_thread_helper+0x7/0x10
<0>[ 1.890083] =======================
<0>[ 1.890083] Code: 00 b8 51 09 31 c0 e8 17 95 00 00 f6 05 40 45 37 c0 40 0f 84 d5 00 00 00 f6 87 28 04 00 00 04 0f 85 c8 00 00 00 8b 87 4c 04 00 00 <8b> 00 83 78 7c 00 0f 84 b6 00 00 00 83 7b 08 01 0f 86 ac 00 00
<0>[ 1.890083] EIP: [<c0119118>] hrtick_start_fair+0x63/0x12c SS:ESP 0068:df499f20
<4>[ 1.890083] ---[ end trace a7919e7f17c0a725 ]---


I think there may be a race between hrtimer_start and hrtick_start_fair
that can cause this.

The timer->base for the rq (run queue) of every possible cpu is set up in
__hrtimer_init, and at this time every cpu's rq.timer->base points to
cpu0's hrtimer_bases.

When cpu1 starts running, the first time hrtimer_start gets called on
this cpu it will try to change the base to the local cpu's base, because,
as seen in __hrtimer_init, it is still pointing to cpu0's base. The
switch is done in switch_hrtimer_base.

While this is happening on cpu1, if cpu0 accesses the timer base of
cpu1's runqueue (rq.timer->base) without calling lock_timer_base, it may
see a NULL value. As the stack trace shows, this can happen when cpu0 is
waking up a task that is on cpu1's runqueue: it may then read
rq.timer->base as NULL.
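To make the window concrete, here is a minimal single-threaded userspace
sketch that replays the interleaving described above. It is not kernel
code: the model_* structures, the helper names and replay_race() are
hypothetical stand-ins, and the unguarded read returns -1 at the point
where the real kernel would dereference NULL. The outcome is consistent
with the oops, where EAX is 00000000 and the faulting bytes "<8b> 00"
are a load through EAX.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures; only the
 * fields needed to show the dereference chain are modelled. */
struct model_cpu_base   { int hres_active; };
struct model_clock_base { struct model_cpu_base *cpu_base; };
struct model_timer      { struct model_clock_base *base; };

/* What an unguarded reader on another cpu does: follow
 * timer->base->cpu_base->hres_active with no lock held.
 * Returns -1 where the real kernel would dereference NULL. */
static int read_hres_active_unguarded(const struct model_timer *t)
{
    const struct model_clock_base *base = t->base;

    if (base == NULL)
        return -1;
    return base->cpu_base->hres_active;
}

/* First half of the base switch: the base is cleared before the lock of
 * the new base has been taken. */
static void switch_base_begin(struct model_timer *t)
{
    t->base = NULL;
}

/* Second half of the base switch: the new base becomes visible. */
static void switch_base_finish(struct model_timer *t,
                               struct model_clock_base *new_base)
{
    assert(t->base == NULL);    /* only valid after switch_base_begin() */
    t->base = new_base;
}

/* Replays "cpu0 reads while cpu1 switches" as three ordered steps and
 * records what the unguarded reader sees before, during and after. */
static void replay_race(int observed[3])
{
    struct model_cpu_base cpu0 = { 1 }, cpu1 = { 1 };
    struct model_clock_base base0 = { &cpu0 }, base1 = { &cpu1 };
    struct model_timer hrtick = { &base0 };  /* init: points at cpu0 */

    observed[0] = read_hres_active_unguarded(&hrtick); /* before switch */
    switch_base_begin(&hrtick);                        /* cpu1          */
    observed[1] = read_hres_active_unguarded(&hrtick); /* cpu0, window  */
    switch_base_finish(&hrtick, &base1);               /* cpu1          */
    observed[2] = read_hres_active_unguarded(&hrtick); /* after switch  */
}
```

Only the middle read, taken between the two halves of the switch, sees
the NULL base; the reads before and after it succeed.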


static inline struct hrtimer_clock_base *
switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
{
....
      /* See the comment in lock_timer_base() */
      timer->base = NULL;      /* <== after this, cpu0 might see the base as NULL for cpu1's runqueue */
      spin_unlock(&base->cpu_base->lock);
      spin_lock(&new_base->cpu_base->lock);
      timer->base = new_base;
....
}
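For comparison, a reader that goes through a lock_timer_base style helper
does not trip over the transient NULL, because it retries until the base
is set and unchanged under the lock. The following is a rough userspace
sketch of that retry pattern, not the kernel implementation: the model_*
types, lock_base_retry(), the relax callback (standing in for
cpu_relax(), and the only place where "the other cpu" makes progress in
this single-threaded model) and the int-flag lock are all assumptions
made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified structures; "locked" models the spinlock. */
struct model_cpu_base   { int locked; int hres_active; };
struct model_clock_base { struct model_cpu_base *cpu_base; };
struct model_timer      { struct model_clock_base *base; };

static void model_lock(struct model_cpu_base *cb)
{
    assert(!cb->locked);
    cb->locked = 1;
}

static void model_unlock(struct model_cpu_base *cb)
{
    assert(cb->locked);
    cb->locked = 0;
}

typedef void (*relax_fn)(void *ctx);

/* Retry until the base is non-NULL and still the same after its lock
 * has been taken. Returns with the lock of the returned base held. */
static struct model_clock_base *
lock_base_retry(struct model_timer *t, relax_fn relax, void *ctx, int *spins)
{
    for (;;) {
        struct model_clock_base *base = t->base;

        if (base != NULL) {
            model_lock(base->cpu_base);
            if (base == t->base)
                return base;              /* base is stable */
            model_unlock(base->cpu_base); /* timer migrated meanwhile */
        }
        (*spins)++;
        relax(ctx);                       /* let the switch progress */
    }
}

/* Hypothetical "other cpu": completes a pending base switch. */
struct pending_switch {
    struct model_timer *timer;
    struct model_clock_base *new_base;
};

static void finish_switch(void *ctx)
{
    struct pending_switch *p = ctx;

    p->timer->base = p->new_base;
}

/* Starts with the timer mid-switch (base already NULL) and reads
 * hres_active through the guarded helper. */
static int guarded_read_during_switch(int *spins)
{
    struct model_cpu_base cpu1 = { 0, 1 };
    struct model_clock_base base1 = { &cpu1 };
    struct model_timer hrtick = { NULL };
    struct pending_switch p = { &hrtick, &base1 };
    struct model_clock_base *base;
    int val;

    *spins = 0;
    base = lock_base_retry(&hrtick, finish_switch, &p, spins);
    val = base->cpu_base->hres_active;
    model_unlock(base->cpu_base);
    return val;
}
```

In this model the guarded reader spins once while the base is NULL and
then reads the value through cpu1's base. Whether taking the lock is the
right fix for the hrtick_start_fair path is a separate question that the
sketch does not try to answer.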


I didn't dig into when the race was introduced, but it seems to exist in
mainline 2.6.26 too. Looking at recent kernels, the hrtimer code has
been revamped quite a bit here and the race doesn't exist in those
versions.

Can you please take a look at the analysis and let me know if you have
any comments?

Thanks,
Alok




