Bug#805971: linux-image-3.16.0-4-amd64: [PATCH] Xen domU "unable to handle kernel NULL pointer dereference"

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#805971: linux-image-3.16.0-4-amd64: [PATCH] Xen domU "unable to handle kernel NULL pointer dereference"
From: Sebastian Pipping <sebastian@pipping.org>
Date: Tue, 24 Nov 2015 13:18:16 +0100
Message-id: <20151124121816.7035.99576.reportbug@localhost>
Reply-to: Sebastian Pipping <sebastian@pipping.org>, 805971@bugs.debian.org

Package: linux-image-3.16.0-4-amd64
Version: 3.16.7-ckt11-1+deb8u6
Severity: important

Hi!


Inside a Xen domU, with the combination of

 * latest kernel of jessie (3.16.7-ckt11-1+deb8u6)
   or related kernel from wheezy-backports (3.16.7-ckt11-1+deb8u6~bpo70+1) and

 * 2 network interfaces and

 * 24 VCPUs ..

I see the error "unable to handle kernel NULL pointer dereference" during start-up
...

  [    0.755434] xen_netfront: can't alloc rx grant refs
  [    0.758359] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
  [    0.761622] IP: [<ffffffffa018bc09>] netback_changed+0x989/0xf00 [xen_netfront]
  [    0.761622] PGD 0
  [    0.761622] Oops: 0000 [#1] SMP
  [    0.761622] Modules linked in: ata_piix xen_blkfront(+) xen_netfront(+) libata crc32c_intel floppy scsi_mod
  [    0.761622] CPU: 1 PID: 129 Comm: xenwatch Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u6~bpo70+1
  [    0.761622] Hardware name: Xen HVM domU, BIOS 4.4.1 10/26/2015
  [    0.761622] task: ffff88003bbd53f0 ti: ffff88003bbd8000 task.ti: ffff88003bbd8000
  [    0.761622] RIP: 0010:[<ffffffffa018bc09>]  [<ffffffffa018bc09>] netback_changed+0x989/0xf00 [xen_netfront]
  [    0.761622] RSP: 0018:ffff88003bbdbde8  EFLAGS: 00010202
  [    0.761622] RAX: 0000000000000000 RBX: ffff880032398d00 RCX: 0000000000000001
  [    0.761622] RDX: 00000000000322a7 RSI: ffff880032398d98 RDI: 0000000000005729
  [    0.761622] RBP: 0000000000098d00 R08: 0000000000000001 R09: ffffffff8172b600
  [    0.761622] R10: ffffea0000af94c0 R11: ffffea0000af9b38 R12: ffff880036a61000
  [    0.761622] R13: ffff8800322a6000 R14: ffff880036a618c0 R15: ffff8800322a7000
  [    0.761622] FS:  0000000000000000(0000) GS:ffff88003ce20000(0000) knlGS:0000000000000000
  [    0.761622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [    0.761622] CR2: 0000000000000018 CR3: 0000000001811000 CR4: 00000000001406e0
  [    0.761622] Stack:
  [    0.761622]  ffff88003b5e0c20 ffff880032391381 ffff8800323912c4 ffff880000000018
  [    0.761622]  ffff88003b5e0c00 0000001400000001 ffff880032398d98 ffff88003b5e0c00
  [    0.761622]  0000000000000000 ffff8800328798f1 0000000800000001 0000003800000001
  [    0.761622] Call Trace:
  [    0.761622]  [<ffffffff81381d50>] ? xenbus_thread+0x2a0/0x2a0
  [    0.761622]  [<ffffffff81381dea>] ? xenwatch_thread+0x9a/0x140
  [    0.761622]  [<ffffffff810b13b0>] ? __wake_up_sync+0x20/0x20
  [    0.761622]  [<ffffffff81090741>] ? kthread+0xc1/0xe0
  [    0.761622]  [<ffffffff81090680>] ? flush_kthread_worker+0xb0/0xb0
  [    0.761622]  [<ffffffff8154be58>] ? ret_from_fork+0x58/0x90
  [    0.761622]  [<ffffffff81090680>] ? flush_kthread_worker+0xb0/0xb0
  [    0.761622] Code: 63 38 fe e9 5c fb ff ff 48 8b 7c 24 20 48 c7 c2 cb d2 18 a0 be f4 ff ff ff 31 c0 e8 72 4a 1f e1 eb a2 48 8b 43 20 48 8b 74 24 30 <48> 8b 78 18 e8 8e 4b 1f e1 85 c0 0f 88 d5 fd ff ff 48 8b 43 20
  [    0.761622] RIP  [<ffffffffa018bc09>] netback_changed+0x989/0xf00 [xen_netfront]
  [    0.761622]  RSP <ffff88003bbdbde8>
  [    0.761622] CR2: 0000000000000018
  [    0.761622] ---[ end trace 6123087ce2740115 ]---

... and the second network interface ends up unusable.

It turns out, what's happening is that:

 * by default, the hypervisor allocates 32 grant table entries and

 * a network interface can need more than 32.

 * The function talk_to_netback (drivers/net/xen-netfront.c) calls
   xennet_create_queues (same file) to create num_queues queues.

 * xennet_create_queues goes on as long as it can and stores
   the number of queues created at info->netdev->real_num_tx_queues.

 * talk_to_netback then continues with the (wrong) assumption that
   num_queues queues are in place, while there may be fewer than that.
   So syncing num_queues with info->netdev->real_num_tx_queues fixes the
   problem.
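
The mismatch described above can be illustrated with a small user-space
sketch.  This is not the kernel code and not Viktor's patch: struct
fake_netdev, grants_left, alloc_queue_grants, create_queues and
setup_queues are hypothetical names made up for illustration; only
num_queues and real_num_tx_queues echo names from the driver.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the net device; not the real kernel struct. */
struct fake_netdev {
    unsigned int real_num_tx_queues; /* queues that actually exist */
};

/* Hypothetical pool standing in for the limited grant table. */
unsigned int grants_left;

/* Take `needed` grants from the pool, or fail without taking any. */
bool alloc_queue_grants(unsigned int needed)
{
    if (grants_left < needed)
        return false;
    grants_left -= needed;
    return true;
}

/* Create queues until an allocation fails, then record how many exist. */
int create_queues(struct fake_netdev *dev, unsigned int wanted,
                  unsigned int grants_per_queue)
{
    unsigned int i;

    for (i = 0; i < wanted; i++) {
        if (!alloc_queue_grants(grants_per_queue)) {
            /* Report the number created, not the number wanted. */
            fprintf(stderr, "only created %u queues\n", i);
            break;
        }
    }
    dev->real_num_tx_queues = i;
    return i == 0 ? -1 : 0;
}

/*
 * Returns the queue count the caller goes on to use.  With sync == false
 * it keeps the requested count (the assumption that goes wrong); with
 * sync == true it re-reads the count that was actually created.
 */
unsigned int setup_queues(struct fake_netdev *dev, unsigned int wanted,
                          unsigned int grants_per_queue, bool sync)
{
    unsigned int num_queues = wanted;

    if (create_queues(dev, num_queues, grants_per_queue) < 0)
        return 0;
    if (sync)
        num_queues = dev->real_num_tx_queues;
    return num_queues;
}
```

With a pool of 10 grants and 3 grants per queue, a request for 4 queues
only yields 3.  Without the sync step the caller would go on to touch a
fourth queue that was never set up, which is the kind of access that
ends in the NULL pointer dereference above.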

Viktor Dukhovni already published a patch on 2015-09-09 at
http://lists.xenproject.org/archives/html/xen-users/2015-09/txtbaRgWqxpT4.txt .
His patch also fixes the "only created %d queues" message: unpatched, it
mistakenly prints the requested number of queues rather than the number of
queues actually created.

I'm hoping for an updated kernel package including Viktor's patch, soon.


As a workaround, one can pass something like gnttab_max_nr_frames=256 to the
hypervisor (via GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub) to increase the
size of the grant table.  It is no more than a workaround, though, and requires
rebooting the hypervisor (which upgrading the domU to a fixed kernel does not).

Many thanks in advance,



Sebastian



-- System Information:
Debian Release: 7.9
  APT prefers oldstable-updates
  APT policy: (500, 'oldstable-updates'), (500, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.2.0-4-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
