
Bug#666021: more data



We were frequently seeing the same problem as reported here. Our logs would show something like this:

[301933.088794] swapper/3: page allocation failure: order:0, mode:0x20
[301933.088817] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2
[301933.088852] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 1106 10/17/2011
[301933.088886]  0000000000000000 ffffffff8150e835 0000000000000020 ffff88043fac3c80
[301933.088921]  ffffffff81142d3f 0000000000000000 0000000000000000 0000003100000002
[301933.088955]  ffff88043fdede68 ffff88043fdeec00 ffff88043fad5ee8 ffff88043fad5ed8
[301933.088990] Call Trace:
[301933.089004]  <IRQ>  [<ffffffff8150e835>] ? dump_stack+0x5d/0x78
[301933.089029]  [<ffffffff81142d3f>] ? warn_alloc_failed+0xdf/0x130
[301933.089051]  [<ffffffff81146fea>] ? __alloc_pages_nodemask+0x8ca/0xb30
[301933.089074]  [<ffffffff8118630d>] ? alloc_pages_current+0x9d/0x150
[301933.089095]  [<ffffffff8140d11b>] ? __netdev_alloc_frag+0x8b/0x140
[301933.089117]  [<ffffffff814134af>] ? __netdev_alloc_skb+0x6f/0xf0
[301933.089145]  [<ffffffffa00d2ae2>] ? rtl8169_poll+0x2b2/0x690 [r8169]
[301933.089169]  [<ffffffff8141f7b0>] ? net_rx_action+0x140/0x240
[301933.089192]  [<ffffffff8106c6a1>] ? __do_softirq+0xf1/0x290
[301933.089212]  [<ffffffff8106ca75>] ? irq_exit+0x95/0xa0
[301933.089232]  [<ffffffff81517822>] ? do_IRQ+0x52/0xe0
[301933.089252]  [<ffffffff8151566d>] ? common_interrupt+0x6d/0x6d
[301933.089272]  <EOI>  [<ffffffff8108ae2d>] ? __hrtimer_start_range_ns+0x1cd/0x390
[301933.089307]  [<ffffffff813e011f>] ? cpuidle_enter_state+0x4f/0xc0
[301933.089329]  [<ffffffff813e0118>] ? cpuidle_enter_state+0x48/0xc0
[301933.089351]  [<ffffffff810a8398>] ? cpu_startup_entry+0x2f8/0x400
[301933.089372]  [<ffffffff81042cbf>] ? start_secondary+0x20f/0x2d0

The stack trace would *always* at least partly cross network functions, in particular netdev_alloc, except ...

... except for one other group of failures that would be coming from the rbd/ceph driver and would look like this:

May 27 03:46:32 vil kernel: kworker/4:1: page allocation failure: order:1, mode:0x204020
May 27 03:46:32 vil kernel: CPU: 4 PID: 426028 Comm: kworker/4:1 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2
May 27 03:46:32 vil kernel: Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 1106 10/17/2011
May 27 03:46:32 vil kernel: Workqueue: rbd2 rbd_request_workfn [rbd]
May 27 03:46:32 vil kernel:  0000000000000000 ffffffff8150e835 0000000000204020 ffff880233303a80
May 27 03:46:32 vil kernel:  ffffffff81142d3f 0000000000000000 0000000000000000 ffffffff00000002
May 27 03:46:32 vil kernel:  000000013fdede00 ffff88043fdeec00 0000000000000046 0000000000000001
May 27 03:46:32 vil kernel: Call Trace:
May 27 03:46:32 vil kernel:  [<ffffffff8150e835>] ? dump_stack+0x5d/0x78
May 27 03:46:32 vil kernel:  [<ffffffff81142d3f>] ? warn_alloc_failed+0xdf/0x130
May 27 03:46:32 vil kernel:  [<ffffffff81146fea>] ? __alloc_pages_nodemask+0x8ca/0xb30
May 27 03:46:32 vil kernel:  [<ffffffff8118d08b>] ? kmem_getpages+0x5b/0x110
May 27 03:46:32 vil kernel:  [<ffffffff8118e6bf>] ? fallback_alloc+0x1cf/0x210
May 27 03:46:32 vil kernel:  [<ffffffff8118f2a0>] ? kmem_cache_alloc+0x1f0/0x450
May 27 03:46:32 vil kernel:  [<ffffffffa063d0f3>] ? ceph_osdc_alloc_request+0x53/0x2f0 [libceph]
May 27 03:46:32 vil kernel:  [<ffffffffa05ba33f>] ? rbd_osd_req_create.isra.25+0x6f/0x170 [rbd]
May 27 03:46:32 vil kernel:  [<ffffffffa05bafb6>] ? rbd_img_request_fill+0x2b6/0x910 [rbd]
May 27 03:46:32 vil kernel:  [<ffffffffa05be16b>] ? rbd_request_workfn+0x24b/0x390 [rbd]
May 27 03:46:32 vil kernel:  [<ffffffff810817c2>] ? process_one_work+0x172/0x420
May 27 03:46:32 vil kernel:  [<ffffffff81081e53>] ? worker_thread+0x113/0x4f0
May 27 03:46:32 vil kernel:  [<ffffffff81510d61>] ? __schedule+0x2b1/0x700
May 27 03:46:32 vil kernel:  [<ffffffff81081d40>] ? rescuer_thread+0x2d0/0x2d0
May 27 03:46:32 vil kernel:  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
May 27 03:46:32 vil kernel:  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
May 27 03:46:32 vil kernel:  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
May 27 03:46:32 vil kernel:  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180

We have now set:

  # cat /etc/sysctl.conf
  ...
  vm.min_free_kbytes = 65536
  ...
  # sysctl -p /etc/sysctl.conf

Since Ben Hutchings' explanation in [1] in this bug report makes a lot of sense and matches quite closely what we are seeing, I am assuming that the setting above will fix our problem. If not, I'll report back here.

Note that from what I can see our system is *NOT* under memory pressure.

I think that this problem is fundamentally a *kernel bug*:

I don't think that an admin should be required to have intricate knowledge about memory allocation procedures in the network stack in order to be able to operate a server.

On the contrary, it's the network stack that should be able to communicate to the virtual memory subsystem that it needs memory pages badly, and the virtual memory subsystem should take that criticality into account and swap out pages or invoke the OOM killer.

Also, there doesn't seem to be enough information accompanying the error message "swapper/3: page allocation failure: order:0, mode:0x20" for a sysadmin to figure out why the allocation failed.
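
For what it's worth, here is how I ended up reading the "mode" value. This is only a sketch of my own, with the flag values copied from what I believe is the 3.16 include/linux/gfp.h (please double-check against the actual tree):

  /* Not from the kernel logs above -- just decoding the gfp_mask ("mode")
   * printed in the two failure messages, assuming 3.16 flag values.     */
  #include <stdio.h>

  #define ___GFP_WAIT 0x10u  /* allocation may sleep and wait for reclaim */
  #define ___GFP_HIGH 0x20u  /* may use emergency reserves; in 3.16,      */
                             /* GFP_ATOMIC is defined as just __GFP_HIGH  */

  int main(void)
  {
      unsigned int mode[] = { 0x20u, 0x204020u };   /* the two traces above */

      for (int i = 0; i < 2; i++)
          printf("mode 0x%x: atomic(__GFP_HIGH)=%s  may-reclaim(__GFP_WAIT)=%s\n",
                 mode[i],
                 (mode[i] & ___GFP_HIGH) ? "yes" : "no",
                 (mode[i] & ___GFP_WAIT) ? "yes" : "no");
      return 0;
  }

If I read that correctly, both failures are atomic allocations that are not allowed to wait for reclaim, which would be consistent with the min_free_kbytes workaround above: the only thing that can satisfy such an allocation is pages that are already free.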

Note that the original reason I ended up at this bug report was that some *VM*'s filesystem would be remounted read-only for no apparent reason.

The course of events that would lead up to that symptom was apparently:

* some app accesses the file system
* which makes ext4 do something
* which makes the in kernel jbd2 journaling daemon/thread do an operation
* which is passed down to the rbd block device
* which makes rbd do something over the network (?)
* which triggers rbd to allocate memory
* which fails
* so rbd reports the error back to jbd2
* jbd2 reports back to ext4
* ext4 remounts the file system read only

This is from our logs:

May 27 03:57:07 vil kernel: rbd: rbd2: write 1000 at 1883c1000 result -12
May 27 03:57:07 vil kernel: end_request: I/O error, dev rbd2, sector 12852744
May 27 03:57:07 vil kernel: Buffer I/O error on device rbd2, logical block 1606593
May 27 03:57:07 vil kernel: lost page write due to I/O error on rbd2
May 27 03:57:07 vil kernel: JBD2: Error -5 detected when updating journal superblock for rbd2-8.
May 27 03:57:31 vil kernel: EXT4-fs error (device rbd2): ext4_journal_check_start:56: Detected aborted journal
May 27 03:57:31 vil kernel: EXT4-fs (rbd2): Remounting filesystem read-only
May 27 03:57:31 vil kernel: EXT4-fs (rbd2): ext4_writepages: jbd2_start: 13312 pages, ino 796517; err -30
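
To make the numbers in that log easier to read: they appear to be plain negated errno values. A trivial check (nothing kernel-specific, just strerror() on the codes quoted above):

  /* Translate the error numbers quoted in the log, assuming they are
   * ordinary negated errno values.                                    */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      int codes[] = { 12, 5, 30 };  /* rbd "result -12", JBD2 "Error -5", ext4 "err -30" */

      for (int i = 0; i < 3; i++)
          printf("-%d = %s\n", codes[i], strerror(codes[i]));
      return 0;
  }

i.e. -12 is ENOMEM (the failed allocation), -5 is EIO (what jbd2 gets to see), and -30 is EROFS (what ext4 returns once it has remounted read-only), which matches the chain of events above.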

Thanks,
*t

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=666021#90

