Bug#700755: huge slab_unreclaimable in Xen domU
On Wed, 2013-02-20 at 12:18 +0100, Josip Rodin wrote:
> On Wed, Feb 20, 2013 at 10:27:02AM +0000, Ian Campbell wrote:
> > On Sun, 2013-02-17 at 00:22 +0100, Josip Rodin wrote:
> > > Package: linux-image-2.6.32-5-xen-amd64
> >
> > This is in a guest, right? Is it possible to try the non-Xen amd64
> > flavour? I forget the exact status in Squeeze but IIRC most of the domU
> > functionality is present in the -amd64 flavour with the -xen-amd64
> > flavour only being required for dom0 and some of the more advanced domU
> > features.
> >
> > The reason I ask is that the non-Xen flavour is closer to mainline,
> > which should make the issue easier to track down.
> >
> > If you are also able to try this separately with the Wheezy kernel, that
> > would be very useful too.
>
> OK, I can install both (it's got PV-GRUB); which would you prefer I test first?
> I'm asking because it'll likely take a few weeks for the bug to appear,
> judging by what it did before.
At this stage I'm probably more interested in making sure Wheezy is going
to be OK, so I'd try that first.
> > > The thing I noticed was the slab_unreclaimable explosion, by a factor
> > > of 122. That... doesn't sound like something that should be happening.
> >
> > Indeed. Is the system responsive enough to log in and
> > examine /proc/slabinfo? There is probably one cache which has exploded in
> > size; it may even be sufficient to observe this over time and see if one
> > seems to be slowly creeping upwards towards $doom.
> >
> > > I'm going to try to run slabtop the next time I catch it in this state,
> > > in order to try to glean some more information.
> >
> > That would be great.
>
> I did post two consecutive slabtop results...
Sorry, I must have missed those.
> I thought they had all the
> relevant info from /proc/slabinfo.
>
> The two large elements that grew both in the total number of objects and
> the active number were (extracted from my previous message):
>
>  OBJS ACTIVE  USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
> first readout:
> 65419  65419 100%    4.00K 14179        8    453728K kmalloc-4096
> 65390  65390 100%    2.06K 13338       15    426816K net_namespace
> second readout:
> 65428  65428 100%    4.00K 14181        8    453792K kmalloc-4096
> 65391  65391 100%    2.06K 13339       15    426848K net_namespace
>
> How do I trace which process is calling this?
I'm not sure. The net_namespace one should be easy enough to track down in
the code, since the cache is created with:

    net_cachep = kmem_cache_create("net_namespace", sizeof(struct net),

so the users of net_cachep must be responsible, and I'd expect there to be
only a few of those. Are you actually using network namespaces in the guest?
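Here is a minimal C sketch for checking this from userspace. It assumes a kernel that exposes /proc/<pid>/ns/net, which means Linux 3.0 and later: the Wheezy kernel has it, but Squeeze's 2.6.32 does not. The helper names netns_id() and same_netns() are made up for illustration. The idea is that two processes share a network namespace when those links point at the same target:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read the /proc/<pid>/ns/net link target, e.g.
 * "net:[4026531956]", into buf. Assumes Linux >= 3.0, where these links
 * exist. pid is a string so that "self" also works.
 * Returns 0 on success, -1 on error. */
int netns_id(const char *pid, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    assert(len > 1);
    snprintf(path, sizeof path, "/proc/%s/ns/net", pid);
    n = readlink(path, buf, len - 1);   /* readlink() does not NUL-terminate */
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Hypothetical helper: 1 if both processes are in the same network
 * namespace, 0 if not, -1 if either link could not be read. */
int same_netns(const char *pid_a, const char *pid_b)
{
    char a[64], b[64];

    if (netns_id(pid_a, a, sizeof a) < 0 || netns_id(pid_b, b, sizeof b) < 0)
        return -1;
    return strcmp(a, b) == 0;
}
```

Reading another process's link needs root. Run that way, a result of 0 from same_netns("1", pid) would show that the process lives outside init's namespace. A namespace that leaked with no process left in it would not show up this way, though.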
The kmalloc-4096 cache is much more generic, so tracking down its users is
going to be harder, I should think.
The Debian kernels have SLUB:
/boot/config-2.6.32-5-xen-amd64:CONFIG_SLUB_DEBUG=y
/boot/config-2.6.32-5-xen-amd64:CONFIG_SLUB=y
/boot/config-2.6.32-5-xen-amd64:# CONFIG_SLUB_DEBUG_ON is not set
/boot/config-2.6.32-5-xen-amd64:# CONFIG_SLUB_STATS is not set
(same as native). Documentation/vm/slub.txt has some information on enabling
debugging there, e.g. by adding slub_debug to the kernel command line. It
doesn't look like rebuilding with either of the other two options would be
useful initially (CONFIG_SLUB_DEBUG_ON is equivalent to the command line
option anyway).
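If I read slub.txt correctly, booting with slub_debug=U,kmalloc-4096 turns on user (allocation/free) tracking for just that cache. With that enabled, /sys/kernel/slab/kmalloc-4096/alloc_calls lists allocation call sites, each with a count in front. Below is a minimal C sketch that picks the busiest site out of that file. It assumes each line starts with "<count> <caller>". The names struct alloc_site, parse_alloc_call() and busiest_alloc_site() are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one call site from an alloc_calls file. */
struct alloc_site {
    long count;          /* assumed: objects currently allocated from here */
    char caller[128];    /* e.g. "function+0x10/0x20" */
};

/* Assumes lines look like "<count> <caller> age=... pid=...".
 * Returns 1 if both fields were parsed, 0 otherwise (e.g. "No data"). */
int parse_alloc_call(const char *line, struct alloc_site *site)
{
    return sscanf(line, "%ld %127s", &site->count, site->caller) == 2;
}

/* Scan an open alloc_calls file and keep the site with the highest count.
 * Returns 1 if any site was found. Lines longer than the buffer are split
 * by fgets(); the tail fragments are assumed not to start with a number. */
int busiest_alloc_site(FILE *fp, struct alloc_site *best)
{
    char line[512];
    struct alloc_site cur;
    int found = 0;

    assert(fp != NULL && best != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (!parse_alloc_call(line, &cur))
            continue;
        if (!found || cur.count > best->count) {
            *best = cur;
            found = 1;
        }
    }
    return found;
}
```

Run this as root on fopen("/sys/kernel/slab/kmalloc-4096/alloc_calls", "r"). The caller it returns should be the kernel function doing most of the 4096-byte allocations. As far as I know, without the boot option, reading that file fails rather than returning call sites.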
Ian.
>
> In comparison, now, under seemingly normal circumstances, slabtop looks like
> this on that machine:
>
>  OBJS ACTIVE  USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
> 56124  25272  45%    0.11K  1559       36      6236K buffer_head
> 24843  12898  51%    0.19K  1183       21      4732K dentry
> 23100  16107  69%    1.01K  1540       15     24640K nfs_inode_cache
> 11456   6403  55%    0.06K   179       64       716K kmalloc-64
> 10208   8864  86%    0.12K   319       32      1276K kmalloc-128
>  7308   5275  72%    0.55K   522       14      4176K radix_tree_node
>  4947   4940  99%    0.08K    97       51       388K sysfs_dir_cache
>  3584   3573  99%    0.01K     7      512        28K kmalloc-8
>  3200   2016  63%    0.79K   160       20      2560K ext3_inode_cache
>  2068   1981  95%    0.18K    94       22       376K vm_area_struct
>  1792   1790  99%    0.02K     7      256        28K kmalloc-16
>  1692   1631  96%    0.63K   141       12      1128K proc_inode_cache
>  1632   1588  97%    1.00K   102       16      1632K kmalloc-1024
>  1472   1442  97%    0.25K    92       16       368K kmalloc-256
>  1428   1129  79%    0.19K    68       21       272K kmalloc-192
>  1296   1284  99%    4.00K   162        8      5184K kmalloc-4096
>  1275   1270  99%    2.06K    85       15      2720K net_namespace
> [...]
>
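The "slowly creeping upwards" check can be made less dependent on catching slabtop at the right moment. One way is to save /proc/slabinfo periodically and compare object counts per cache. Below is a minimal C sketch of that. It assumes the "slabinfo - version: 2.1" line layout: name, active_objs, num_objs, objsize, objperslab and pagesperslab, followed by the tunables and slabdata sections. The names struct slab_row, parse_slabinfo_line(), find_row() and growth_ratio() are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Leading counters of one /proc/slabinfo data line (format 2.1 assumed):
 * name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ... */
struct slab_row {
    char name[64];
    unsigned long active_objs, num_objs, objsize, objperslab, pagesperslab;
};

/* Returns 1 for a data line, 0 for the version/header lines or garbage. */
int parse_slabinfo_line(const char *line, struct slab_row *row)
{
    assert(line != NULL && row != NULL);
    return sscanf(line, "%63s %lu %lu %lu %lu %lu",
                  row->name, &row->active_objs, &row->num_objs,
                  &row->objsize, &row->objperslab, &row->pagesperslab) == 6;
}

/* Find a cache by name in an array of parsed rows; NULL if absent. */
const struct slab_row *find_row(const struct slab_row *rows, size_t n,
                                const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(rows[i].name, name) == 0)
            return &rows[i];
    return NULL;
}

/* Growth of active objects between two snapshots of the same cache,
 * e.g. 2.0 means it doubled. Returns -1.0 if the earlier count is 0. */
double growth_ratio(const struct slab_row *before, const struct slab_row *after)
{
    if (before->active_objs == 0)
        return -1.0;
    return (double)after->active_objs / (double)before->active_objs;
}
```

Take the active counts from the slabtop readouts above. net_namespace went from 1270 in the normal state to 65390 in the bad one, and kmalloc-4096 from 1284 to 65419. That is roughly fifty-fold in both cases. A cache whose ratio stays well above 1 across snapshots taken days apart would be the one to chase.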
--
Ian Campbell
Current Noise: Old Man's Child - Twilight Damnation
"Once they go up, who cares where they come down? That's not my department."
-- Wernher von Braun