
Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)



On Fri, May 11, 2012 at 12:25:01PM -0300, gustavo panizzo <gfa> wrote:
> adding debian-boot
> 
> 
> i've installed unstable on the box (using debootstrap) and it boots
> 3.2.0-2-sparc64 successfully, networking works
> 
> obp diags shows no errors
> 
> but when i boot from network using 
> http://d-i.debian.org/daily-images/sparc/daily/netboot/boot.img 11-05-2012
> 
> i get the following error
> 
>   ┌───────────────┤ Detecting link on eth0; please wait... ├────────────────┐
> [  246.994391] Unable to handle kernel NULL pointer dereference
> [  247.074490] tsk->{mm,active_mm}->context = 000000000000019f
> [  247.164534] tsk->{mm,active_mm}->pgd = fffff8001d48c000
> [  247.240508] Kernel panic - not syncing: Aiee, killing interrupt handler!
> [  247.328648] Call Trace:
> [  247.360793]  [000000000045dcd4] do_exit+0x94/0x708
> [  247.423821]  [0000000000427550] die_if_kernel+0x2a0/0x2c8
> [  247.494864]  [0000000000768c84] unhandled_fault+0x8c/0x98
> [  247.565915]  [000000000076936c] do_sparc64_fault+0x6dc/0x780
> [  247.640377]  [0000000000407880] sparc64_realfault_common+0x10/0x20
> [  247.721722]  [0000000010015680] gem_poll+0x9fc/0x1328 [sungem]
> [  247.798478]  [0000000000697110] net_rx_action+0x9c/0x234
> [  247.868369]  [00000000004607f0] __do_softirq+0xdc/0x1c4
> [  247.937125]  [000000000042a76c] do_softirq+0x54/0x80
> [  248.002442]  [0000000000460a6c] irq_exit+0x38/0x94
> [  248.065474]  [000000000042df38] timer_interrupt+0x90/0xa8
> [  248.136516]  [00000000004209d4] tl0_irq14+0x14/0x20
> [  248.200692]  [000000000049e764] touch_softlockup_watchdog+0x4/0xc
> [  248.280888]  [00000000008f07e4] start_kernel+0x390/0x3a0
> [  248.350783]  [0000000000750b88] tlb_fixup_done+0x80/0x88
> [  248.420672]  [0000000000000000]           (null)
> [  248.481416] Press Stop-A (L1-A) to return to the boot prom

Interesting, so we are doing something funky during link detection to 
trip this bug. The code which does it is in netcfg:

http://anonscm.debian.org/gitweb/?p=d-i/netcfg.git;a=tree

Here's the relevant code from netcfg-common.c:

    debconf_capb(client, "progresscancel");
    debconf_subst(client, "netcfg/link_detect_progress", "interface", if_name);
    debconf_progress_start(client, 0, 100, "netcfg/link_detect_progress");
    for (count = 0; count < link_waits; count++) {
        usleep(250000);
        if (debconf_progress_set(client, 50 * count / link_waits) == 30) {
            /* User cancelled on us... bugger */
            rv = 0;
            break;
        }
        if (ethtool_lite(if_name) == 1) /* ethtool-lite's CONNECTED */ {
            if (gateway.s_addr && !is_wireless_iface(if_name)) {
                for (count = 0; count < gw_tries; count++) {
                    if (di_exec_shell_log(arping) == 0)
                        break;
                    if (debconf_progress_set(client, 50 + 50 * count / gw_tries) == 30)
                        break;
                }
            }
            rv = 1;
            break;
        }
        debconf_progress_set(client, 100);
    }

Only two non-trivial things happen here: the call to ethtool_lite(if_name)
and the invocation of arping. I would put my money on the former (defined
in ethtool-lite.c), because it uses low-level ioctls to query the
interface state.

You can test whether running it triggers the failure on your
machine by downloading ethtool-lite.c and building it as a standalone
binary. The following commands appear to do the trick:

$ sudo apt-get build-dep netcfg
[...]
$ gcc -o ethtool-lite -DTEST ethtool-lite.c -ldebconfclient -ldebian-installer
$ sudo ./ethtool-lite eth0
ethtool-lite: eth0 is connected.
$

If that triggers a NULL pointer dereference on your machine (try it both
with and without the network brought up, and check dmesg afterwards), we
will be in a very good position to report it upstream for fixing.

Best regards,
-- 
Jurij Smakov                                           jurij@wooyd.org
Key: http://www.wooyd.org/pgpkey/                      KeyID: C99E03CC


