Bug#671895: [sparc] Kernel NULL pointer dereference in sungem/gem_poll() (Re: updates)
On Fri, May 11, 2012 at 12:25:01PM -0300, gustavo panizzo <gfa> wrote:
> adding debian-boot
>
>
> i've installed unstable on the box (using debootstrap) and it boots
> 3.2.0-2-sparc64 successfully, networking works
>
> obp diags shows no errors
>
> but when i boot from network using
> http://d-i.debian.org/daily-images/sparc/daily/netboot/boot.img 11-05-2012
>
> i get the following error
>
> Detecting link on eth0; please wait... 100%
>
> [ 246.994391] Unable to handle kernel NULL pointer dereference
> [ 247.074490] tsk->{mm,active_mm}->context = 000000000000019f
> [ 247.164534] tsk->{mm,active_mm}->pgd = fffff8001d48c000
> [ 247.240508] Kernel panic - not syncing: Aiee, killing interrupt handler!
> [ 247.328648] Call Trace:
> [ 247.360793] [000000000045dcd4] do_exit+0x94/0x708
> [ 247.423821] [0000000000427550] die_if_kernel+0x2a0/0x2c8
> [ 247.494864] [0000000000768c84] unhandled_fault+0x8c/0x98
> [ 247.565915] [000000000076936c] do_sparc64_fault+0x6dc/0x780
> [ 247.640377] [0000000000407880] sparc64_realfault_common+0x10/0x20
> [ 247.721722] [0000000010015680] gem_poll+0x9fc/0x1328 [sungem]
> [ 247.798478] [0000000000697110] net_rx_action+0x9c/0x234
> [ 247.868369] [00000000004607f0] __do_softirq+0xdc/0x1c4
> [ 247.937125] [000000000042a76c] do_softirq+0x54/0x80
> [ 248.002442] [0000000000460a6c] irq_exit+0x38/0x94
> [ 248.065474] [000000000042df38] timer_interrupt+0x90/0xa8
> [ 248.136516] [00000000004209d4] tl0_irq14+0x14/0x20
> [ 248.200692] [000000000049e764] touch_softlockup_watchdog+0x4/0xc
> [ 248.280888] [00000000008f07e4] start_kernel+0x390/0x3a0
> [ 248.350783] [0000000000750b88] tlb_fixup_done+0x80/0x88
> [ 248.420672] [0000000000000000] (null)
> [ 248.481416] Press Stop-A (L1-A) to return to the boot prom
Interesting, so we are doing something funky during link detection to
trip this bug. The code which does it is in netcfg:
http://anonscm.debian.org/gitweb/?p=d-i/netcfg.git;a=tree
Here's the relevant code from netcfg-common.c:
    debconf_capb(client, "progresscancel");
    debconf_subst(client, "netcfg/link_detect_progress", "interface", if_name);
    debconf_progress_start(client, 0, 100, "netcfg/link_detect_progress");
    for (count = 0; count < link_waits; count++) {
        usleep(250000);
        if (debconf_progress_set(client, 50 * count / link_waits) == 30) {
            /* User cancelled on us... bugger */
            rv = 0;
            break;
        }
        if (ethtool_lite(if_name) == 1) /* ethtool-lite's CONNECTED */ {
            if (gateway.s_addr && !is_wireless_iface(if_name)) {
                for (count = 0; count < gw_tries; count++) {
                    if (di_exec_shell_log(arping) == 0)
                        break;
                    if (debconf_progress_set(client, 50 + 50 * count / gw_tries) == 30)
                        break;
                }
            }
            rv = 1;
            break;
        }
        debconf_progress_set(client, 100);
    }
Only two non-trivial things happen here: the call to ethtool_lite(if_name)
and the invocation of arping. I would put my money on the former (defined
in ethtool-lite.c), because it uses low-level ioctls to query the
interface state.
You can test whether running it triggers the failure on your machine
by downloading ethtool-lite.c and building it as a standalone binary.
The following commands appear to do the trick:
$ sudo apt-get build-dep netcfg
[...]
$ gcc -o ethtool-lite -DTEST ethtool-lite.c -ldebconfclient -ldebian-installer
$ sudo ./ethtool-lite eth0
ethtool-lite: eth0 is connected.
$
If that triggers a NULL pointer dereference on your machine (try it
both with and without the network brought up, and check dmesg
afterwards), we will be in a very good position to report it upstream
for fixing.
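If you would rather not eyeball the dmesg output, the check can be
scripted. The sketch below is only an illustration: it assumes the
kernel log can be read with klogctl() (which may need root), and the
names log_has_oops() and STANDALONE are invented for this example. The
marker string is the first line of the oops quoted above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/klog.h>

/* First line of the oops from the report above. */
#define OOPS_MARKER "Unable to handle kernel NULL pointer dereference"

/* Hypothetical helper: returns 1 if the log text contains the marker. */
int log_has_oops(const char *log)
{
    return strstr(log, OOPS_MARKER) != NULL;
}

#ifdef STANDALONE
int main(void)
{
    char *buf;
    int size, n;

    /* Action 10 returns the size of the kernel log buffer. */
    size = klogctl(10, NULL, 0);
    if (size <= 0) {
        perror("klogctl");
        return 2;
    }
    buf = malloc((size_t)size + 1);
    if (buf == NULL)
        return 2;

    /* Action 3 reads the whole buffer without clearing it. */
    n = klogctl(3, buf, size);
    if (n < 0) {
        perror("klogctl");
        free(buf);
        return 2;
    }
    buf[n] = '\0';

    puts(log_has_oops(buf) ? "oops found in kernel log"
                           : "no oops in kernel log");
    free(buf);
    return 0;
}
#endif
```

Build it with -DSTANDALONE and run it right after the ethtool-lite
test. It is only useful if the machine survives the test; if the
kernel panics as in the trace above, the console output is all you get.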
Best regards,
--
Jurij Smakov jurij@wooyd.org
Key: http://www.wooyd.org/pgpkey/ KeyID: C99E03CC