Need help analyzing (kernel?) memory usage and reclaiming RAM (Debian Stretch)
Hello,
(please let me know if this is more appropriate somewhere else, e.g. on
debian-kernel)
I need help debugging/solving a weird memory problem. The symptoms are
the usual ones for high memory usage: free/available memory is getting
low, systems start swapping, disk I/O increases, performance drops.
However, from what I can see, the memory is not used up by user space
processes but by the kernel (NOT caches/buffers); see the command output
at the end.
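To put a number on "kernel, but not caches/buffers", here is a rough
sketch that sums the categories /proc/meminfo does report and subtracts
them from MemTotal. The function name and the choice of categories are my
own assumptions; some of them may overlap slightly (e.g. SwapCached and
AnonPages), so the result is only an estimate.

```shell
#!/bin/sh
# Estimate how much RAM is not attributed to any /proc/meminfo category.
# Assumption: the summed categories below do not overlap significantly.
# Optional argument: a saved copy of /proc/meminfo (default: the live file).
meminfo_unaccounted() {
    awk '
        { gsub(":", "", $1); v[$1] = $2 }   # field name -> value in kB
        END {
            accounted  = v["MemFree"] + v["Buffers"] + v["Cached"] + v["SwapCached"]
            accounted += v["AnonPages"] + v["Slab"] + v["PageTables"] + v["KernelStack"]
            printf "%-12s %d kB\n", "MemTotal:", v["MemTotal"]
            printf "%-12s %d kB\n", "accounted:", accounted
            printf "%-12s %d kB\n", "unaccounted:", v["MemTotal"] - accounted
        }
    ' "${1:-/proc/meminfo}"
}
```

Calling meminfo_unaccounted without arguments reads the live file. By my
arithmetic, the /proc/meminfo values at the end of this mail leave about
844068 kB unaccounted while the problem persists, versus about 38260 kB
directly after rebooting.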
I'm still puzzled about what exactly eats all the RAM and how to reclaim
it (without rebooting the machine, of course!). Any help would be highly
appreciated!
Some findings so far:
- same problem on many systems, all Debian 9 Stretch, all running stock
4.9 kernel from the official package, all amd64 virtual machines on
several (different) VMware ESXi hosts.
- not all Stretch systems seem to be affected, but we haven't yet found
the common ground.
- problem can occur after some days or some weeks, not at the same time
on all affected machines, and not at the same time for all VMs on the
same host.
- problem only occurs on Stretch systems, not Jessie, even running on
the same host.
- we haven't yet seen the problem on real hardware machines, only VMs
(but since the vast majority of our systems are VMs, this may not be
relevant)
- problem does not seem directly related to the machine's load: it occurs
on machines that are mostly idle as well as on more heavily loaded
systems
- problem occurs the same on single-core VMs as well as on multi-core
VMs
- problem occurs the same on VMs running on single-socket hosts as well
as on multi-socket hosts
- problem occurs the same on VMs running on hosts with different
hypervisor releases, both VMware ESXi 5.5 and 6.5, both standalone and
in a vSphere cluster.
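Since the problem takes days or weeks to show up and we have not found
what the affected machines have in common, recording a few counters at
regular intervals on every Stretch VM might help. A minimal sketch; the
function name, the selection of fields and the log path in the note
below are hypothetical.

```shell
#!/bin/sh
# Print one line: epoch timestamp followed by MemFree, MemAvailable,
# SUnreclaim and SwapFree (all in kB), suitable for appending to a log.
# Optional argument: a saved copy of /proc/meminfo (default: the live file).
meminfo_sample() {
    awk -v ts="$(date +%s)" '
        { gsub(":", "", $1); v[$1] = $2 }   # field name -> value in kB
        END {
            printf "%s %d %d %d %d\n", ts,
                v["MemFree"], v["MemAvailable"], v["SUnreclaim"], v["SwapFree"]
        }
    ' "${1:-/proc/meminfo}"
}
```

Run from cron, e.g. as "meminfo_sample >> /var/log/meminfo-trend.log",
this would show when the decline starts on each machine and how fast
it progresses.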
Here is the output of some commands that I hope will be helpful:
The machine in this example is a RADIUS server that has not even gone
into production yet: there are no incoming client requests. (But the
problem is not related to the RADIUS server software - OSC Radiator -
since the same symptoms show on different machines: not only RADIUS
servers but also nameservers, shell servers, jumphosts, etc.)
[values while the problem persists:]
------------------------------------------------------------------------
root@rad-m2m-srv02:~# free -thwl
              total        used        free      shared     buffers       cache   available
Mem:           987M        910M         59M          0B        704K         16M         13M
Low:           987M        927M         59M
High:            0B          0B          0B
Swap:          2,0G        345M        1,7G
Total:         3,0G        1,2G        1,7G
root@rad-m2m-srv02:~# smem -twk
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory        914.9M      11.1M     903.8M
userspace memory              13.0M       5.5M       7.4M
free memory                   59.4M      59.4M          0
----------------------------------------------------------
                             987.3M      76.1M     911.2M
root@rad-m2m-srv02:~# smem -uktr
User         Count      Swap       USS       PSS       RSS
root            39    332.8M     10.4M     12.4M     44.7M
msch             6      7.0M         0    607.0K      8.3M
_chrony          1    360.0K      4.0K     20.0K    572.0K
messagebus       1    580.0K      4.0K     17.0K    480.0K
postfix          2      1.6M         0     13.0K    568.0K
daemon           1    208.0K      4.0K      6.0K     72.0K
---------------------------------------------------
                50    342.5M     10.4M     13.0M     54.7M
root@rad-m2m-srv02:~# sort -k2,2nr /proc/meminfo
VmallocTotal: 34359738367 kB
CommitLimit: 2602636 kB
SwapTotal: 2097148 kB
SwapFree: 1741028 kB
MemTotal: 1010976 kB
DirectMap4k: 1007488 kB
Committed_AS: 465128 kB
Slab: 79680 kB
SUnreclaim: 69268 kB
MemFree: 61068 kB
DirectMap2M: 40960 kB
SReclaimable: 10412 kB
Active: 6944 kB
Inactive: 6660 kB
AnonPages: 6608 kB
PageTables: 5804 kB
Cached: 5748 kB
Mapped: 4660 kB
SwapCached: 3988 kB
Active(file): 3920 kB
Inactive(anon): 3828 kB
Active(anon): 3024 kB
KernelStack: 2992 kB
Inactive(file): 2832 kB
Hugepagesize: 2048 kB
Buffers: 1020 kB
Dirty: 8 kB
AnonHugePages: 0 kB
Bounce: 0 kB
HardwareCorrupted: 0 kB
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
HugePages_Total: 0
MemAvailable: 0 kB
Mlocked: 0 kB
NFS_Unstable: 0 kB
Shmem: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
Unevictable: 0 kB
VmallocChunk: 0 kB
VmallocUsed: 0 kB
Writeback: 0 kB
WritebackTmp: 0 kB
root@rad-m2m-srv02:~# ps aux --sort=-rss | head -15
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 34718 12.0 0.5 29596 5672 ? D 09:01 0:00 /usr/bin/python3 -Es /usr/bin/lsb_release --short --description
root 26491 3.1 0.2 79328 2860 ? D 08:04 1:50 apt-get update -qq
root 32551 6.8 0.2 119036 2800 ? D 08:51 0:43 /usr/bin/python3 /usr/bin/unattended-upgrade
root 34719 0.0 0.2 41164 2232 pts/1 R+ 09:02 0:00 ps aux --sort=-rss
msch 33960 0.1 0.1 23720 1844 pts/0 Ss 08:58 0:00 -bash
root 34492 0.2 0.1 23816 1812 pts/1 S 09:00 0:00 -bash
msch 33996 0.0 0.1 23576 1768 pts/1 Ss 08:58 0:00 bash -i
root 12792 2.2 0.1 159720 1748 ? D 06:06 3:54 /usr/bin/perl -w /usr/bin/apt-show-versions -i
root 34521 0.7 0.1 95180 1712 ? Ss 09:01 0:00 sshd: root@notty
root 15502 2.4 0.1 167660 1608 ? D 06:25 3:51 /usr/bin/perl -w /usr/bin/apt-show-versions -i
root 34527 1.7 0.1 14096 1596 ? Ss 09:01 0:00 /bin/bash /usr/bin/check_mk_agent
root 33947 0.0 0.1 95180 1564 ? Ss 08:58 0:00 sshd: msch [priv]
root 26486 0.0 0.1 9600 1436 ? S 08:04 0:00 /bin/bash 3600/mk_apt
root 26483 0.0 0.1 9588 1424 ? S 08:04 0:00 /bin/bash
root@rad-m2m-srv02:~# lsof | wc -l
1943
root@rad-m2m-srv02:~# df -Th -t tmpfs
Filesystem Type Size Used Avail Use% Mounted on
tmpfs tmpfs 99M 12M 87M 12% /run
tmpfs tmpfs 494M 0 494M 0% /dev/shm
tmpfs tmpfs 5,0M 0 5,0M 0% /run/lock
tmpfs tmpfs 494M 0 494M 0% /sys/fs/cgroup
tmpfs tmpfs 1,0G 0 1,0G 0% /tmp
tmpfs tmpfs 99M 0 99M 0% /run/user/0
tmpfs tmpfs 99M 0 99M 0% /run/user/2029
root@rad-m2m-srv02:~# vmware-toolbox-cmd stat balloon
0 MB
root@rad-m2m-srv02:~# cat /sys/kernel/debug/vmmemctl
balloon capabilities: 0x1e
used capabilities: 0x1e
is resetting: n
target: 0 pages
current: 0 pages
rateSleepAlloc: 2048 pages/sec
timer: 3968363
doorbell: 0
start: 7 ( 0 failed)
guestType: 7 ( 0 failed)
2m-lock: 0 ( 0 failed)
lock: 0 ( 0 failed)
2m-unlock: 0 ( 0 failed)
unlock: 0 ( 0 failed)
target: 3968363 ( 6 failed)
prim2mAlloc: 0 ( 0 failed)
primNoSleepAlloc: 0 ( 0 failed)
primCanSleepAlloc: 0 ( 0 failed)
prim2mFree: 0
primFree: 0
err2mAlloc: 0
errAlloc: 0
err2mFree: 0
errFree: 0
doorbellSet: 6
doorbellUnset: 7
root@rad-m2m-srv02:~# nice vmstat -w 1 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 5 356620 60868 1140 16280 37 19 704 31 3 2 1 2 97 1 0
1 4 356180 60372 320 16224 3008 624 6180 1236 1109 1915 2 18 0 80 0
2 5 356632 61476 320 15568 2776 1452 3128 2012 1146 1802 1 14 0 85 0
1 3 356592 62228 324 15244 2848 952 3784 1564 1029 1780 0 11 0 89 0
2 4 356732 61492 612 15544 2864 1144 3932 1720 1164 1839 2 9 0 89 0
1 4 357252 62836 556 15248 4000 1800 4432 3048 1398 2359 1 15 0 84 0
0 4 356700 61744 448 15248 3368 668 3368 1276 1093 2039 0 9 0 91 0
2 4 356708 61372 456 16272 1940 868 4744 888 876 1377 0 12 0 88 0
0 4 356704 61744 1156 14700 2740 660 4828 1940 1123 1768 0 14 0 86 0
0 4 357556 62240 680 15568 2908 1476 5436 2064 1062 1804 1 15 0 84 0
root@rad-m2m-srv02:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 9.8 (stretch)
Release: 9.8
Codename: stretch
root@rad-m2m-srv02:~# uname -a
Linux rad-m2m-srv02 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux
root@rad-m2m-srv02:~# w
09:02:30 up 45 days, 22:20, 1 user, load average: 5,13, 5,03, 6,58
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
msch pts/0 10.208.105.87 08:58 4.00s 0.26s 0.03s script memdebug
root@rad-m2m-srv02:~#
[values directly after rebooting:]
------------------------------------------------------------------------
root@rad-m2m-srv02:~# w
09:23:02 up 4 min, 1 user, load average: 0,01, 0,08, 0,04
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
msch pts/0 10.208.105.87 09:21 6.00s 0.26s 0.02s sshd: msch [priv]
root@rad-m2m-srv02:~# free -thwl
              total        used        free      shared     buffers       cache   available
Mem:           987M        112M        610M        4,3M         16M        247M        735M
Low:           987M        377M        610M
High:            0B          0B          0B
Swap:          2,0G          0B        2,0G
Total:         3,0G        112M        2,6G
root@rad-m2m-srv02:~# smem -twk
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory        287.1M     226.6M      60.5M
userspace memory              93.8M      37.8M      56.0M
free memory                  606.4M     606.4M          0
----------------------------------------------------------
                             987.3M     870.8M     116.5M
root@rad-m2m-srv02:~# smem -uktr
User         Count      Swap       USS       PSS       RSS
root            19         0     62.9M     72.8M    128.7M
postfix          6         0      7.9M     12.1M     42.9M
msch             4         0      3.7M      7.3M     19.4M
messagebus       1         0      1.2M      1.5M      3.8M
_chrony          1         0    896.0K   1020.0K      2.8M
daemon           1         0    228.0K    309.0K      2.1M
---------------------------------------------------
                32         0     76.9M     95.0M    199.7M
root@rad-m2m-srv02:~# sort -k2,2nr /proc/meminfo
VmallocTotal: 34359738367 kB
CommitLimit: 2602636 kB
SwapFree: 2097148 kB
SwapTotal: 2097148 kB
MemTotal: 1010976 kB
DirectMap2M: 983040 kB
MemAvailable: 753520 kB
MemFree: 624520 kB
Cached: 234508 kB
Active: 161672 kB
Inactive: 142964 kB
Inactive(file): 138936 kB
Committed_AS: 124808 kB
Active(file): 108028 kB
DirectMap4k: 65408 kB
Active(anon): 53644 kB
AnonPages: 53300 kB
Slab: 36968 kB
Mapped: 36760 kB
SReclaimable: 19424 kB
SUnreclaim: 17544 kB
Buffers: 16836 kB
Shmem: 4392 kB
Inactive(anon): 4028 kB
PageTables: 3836 kB
KernelStack: 2748 kB
Hugepagesize: 2048 kB
Dirty: 60 kB
AnonHugePages: 0 kB
Bounce: 0 kB
HardwareCorrupted: 0 kB
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
HugePages_Total: 0
Mlocked: 0 kB
NFS_Unstable: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
SwapCached: 0 kB
Unevictable: 0 kB
VmallocChunk: 0 kB
VmallocUsed: 0 kB
Writeback: 0 kB
WritebackTmp: 0 kB
root@rad-m2m-srv02:~# ps aux --sort=-rss | head -15
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 651 0.1 2.6 78748 26992 ? S 09:18 0:00 /usr/bin/perl /opt/radiator/bin/radiusd -daemon -pid_file /var/run/radiator.pid -config_file /opt/radiator/etc/radiator.cfg -I /opt/radiator/share/perl/5.24.1/
root 411 0.0 1.7 153488 18144 ? Ss 09:18 0:00 /usr/bin/VGAuthService
root 221 0.1 1.0 136488 10464 ? Ss 09:18 0:00 /usr/bin/vmtoolsd
postfix 2033 0.1 0.8 88652 8968 ? S 09:22 0:00 smtp -t unix -u
postfix 2034 0.0 0.8 87480 8132 ? S 09:22 0:00 tlsmgr -l -t unix -u
root 1 0.3 0.6 57052 6736 ? Ss 09:18 0:00 /sbin/init
root 1462 0.0 0.6 95180 6736 ? Ss 09:21 0:00 sshd: msch [priv]
postfix 2031 0.0 0.6 83352 6700 ? S 09:22 0:00 cleanup -z -t unix -u
postfix 649 0.0 0.6 83296 6600 ? S 09:18 0:00 qmgr -l -t unix -u
postfix 2032 0.0 0.6 83260 6600 ? S 09:22 0:00 trivial-rewrite -n rewrite -t unix -u
postfix 648 0.0 0.6 83248 6284 ? S 09:18 0:00 pickup -l -t unix -u
root 527 0.0 0.6 69952 6168 ? Ss 09:18 0:00 /usr/sbin/sshd -D
msch 1464 0.0 0.6 64832 6144 ? Ss 09:21 0:00 /lib/systemd/systemd --user
root 251 0.0 0.5 47844 5872 ? Ss 09:18 0:00 /lib/systemd/systemd-udevd
root@rad-m2m-srv02:~# lsof | wc -l
1605
root@rad-m2m-srv02:~# df -Th -t tmpfs
Filesystem Type Size Used Avail Use% Mounted on
tmpfs tmpfs 99M 4,3M 95M 5% /run
tmpfs tmpfs 494M 0 494M 0% /dev/shm
tmpfs tmpfs 5,0M 0 5,0M 0% /run/lock
tmpfs tmpfs 494M 0 494M 0% /sys/fs/cgroup
tmpfs tmpfs 1,0G 0 1,0G 0% /tmp
tmpfs tmpfs 99M 0 99M 0% /run/user/2029
root@rad-m2m-srv02:~# vmware-toolbox-cmd stat balloon
0 MB
root@rad-m2m-srv02:~# cat /sys/kernel/debug/vmmemctl
balloon capabilities: 0x1e
used capabilities: 0x1e
is resetting: n
target: 0 pages
current: 0 pages
rateSleepAlloc: 2048 pages/sec
timer: 292
doorbell: 0
start: 1 ( 0 failed)
guestType: 1 ( 0 failed)
2m-lock: 0 ( 0 failed)
lock: 0 ( 0 failed)
2m-unlock: 0 ( 0 failed)
unlock: 0 ( 0 failed)
target: 292 ( 0 failed)
prim2mAlloc: 0 ( 0 failed)
primNoSleepAlloc: 0 ( 0 failed)
primCanSleepAlloc: 0 ( 0 failed)
prim2mFree: 0
primFree: 0
err2mAlloc: 0
errAlloc: 0
err2mFree: 0
errFree: 0
doorbellSet: 1
doorbellUnset: 1
root@rad-m2m-srv02:~# nice vmstat -w 1 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 622948 16868 254624 0 0 728 254 104 231 4 2 88 5 0
0 0 0 622948 16868 254624 0 0 0 0 53 98 0 0 100 0 0
0 0 0 622948 16876 254600 0 0 0 20 50 96 0 0 100 0 0
0 0 0 622948 16876 254600 0 0 0 0 50 91 0 0 100 0 0
0 0 0 622948 16876 254600 0 0 0 0 43 84 0 0 100 0 0
0 0 0 622948 16876 254604 0 0 0 0 57 105 1 0 99 0 0
0 0 0 622948 16876 254600 0 0 0 0 53 106 0 1 99 0 0
0 0 0 622948 16876 254600 0 0 0 0 50 91 1 0 99 0 0
1 0 0 622948 16876 254600 0 0 0 0 49 96 0 0 100 0 0
0 0 0 622948 16876 254600 0 0 0 12 50 94 0 1 99 0 0
root@rad-m2m-srv02:~#
------------------------------------------------------------------------
Anything else I could check to help pinpoint the memory hog?
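One more comparison I can run myself is a field-by-field diff of two
saved copies of /proc/meminfo (for example, one taken right after boot
and one taken while the problem persists). A rough sketch; the function
name and the snapshot file names in the note below are hypothetical.

```shell
#!/bin/sh
# Compare two saved copies of /proc/meminfo and list every field whose
# value changed: name, old value, new value, difference.
# Output is sorted by difference, largest increase first.
meminfo_diff() {
    awk '
        { gsub(":", "", $1) }               # strip the colon from the name
        NR == FNR { old[$1] = $2; next }    # first file: remember values
        ($1 in old) && ($2 != old[$1]) {    # second file: report changes
            printf "%-16s %12d %12d %12d\n", $1, old[$1], $2, $2 - old[$1]
        }
    ' "$1" "$2" | sort -k4,4nr
}
```

Usage would be something like "cat /proc/meminfo > meminfo.boot" after a
reboot, "cat /proc/meminfo > meminfo.bad" once the symptoms are back,
then "meminfo_diff meminfo.boot meminfo.bad". Fields that grew end up at
the top, fields that shrank at the bottom.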
Thanks in advance!
Martin
--
Martin Schwarz * Karlsruhe, Germany * http://kuroi.de/