Please note, the way I am detecting if the oom_adj bug is present
is by checking if the sshd user processes are -17. If they are, then
the bug is present.

ATTEMPT 1:

apt-get install hashalot cryptsetup make g++ cpp automake ncurses-dev bison flex curl build-essential kernel-package locales locales-all
- oom problem not found

(pre-built from previous install)
dpkg -i linux-headers-2.6.32.41-grsec_2.6.32.41-grsec-10.00.Custom_amd64.deb linux-image-2.6.32.41-grsec_2.6.32.41-grsec-10.00.Custom_amd64.deb
- oom problem not found

(booted into new kernel)
shutdown -r now
- oom problem DETECTED!!!

(booted into old kernel)
dpkg --purge linux-image-2.6.32.41-grsec linux-headers-2.6.32.41-grsec
shutdown -r now
- oom problem DETECTED!!!

Hmmmm....

apt-get purge hashalot cryptsetup make g++ cpp automake ncurses-dev bison flex curl build-essential kernel-package locales locales-all
shutdown -r now
- oom problem STILL DETECTED!!! WTF!!!!!!!!!!!!!!!

So, it would appear that either the kernel install or the "apt-get" caused irreversible changes on the system...

REINSTALL

ATTEMPT 2:
(this time, I'm going to use tripwire to detect filesystem changes between each reboot)

Installed openssh-server and configured tripwire.
Made a test modification to /usr/sbin and ran tripwire, to ensure it's logging correctly.
- oom problem not found

rebooted
- oom problem DETECTED!!

Purged openssh-server
- oom problem not found

Reinstalled openssh-server
- oom problem not found

rebooted
- oom problem DETECTED

So, we've now pinpointed it down to the openssh package...

This makes sense, because the oom_adj is inherited from whatever process a new process is forked from.. In this case, the LXCs are started up from an ssh session, which is why they inherit the ssh -17 oom_adj.
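
The inheritance described above can be checked by walking from a process up through its parents and printing each one's oom_adj. The following is only a rough sketch, not something from the original investigation: the function name oom_adj_chain and the PROC_ROOT override are made up for illustration (PROC_ROOT exists purely so the function can be pointed at a fake tree), and it assumes /proc/PID/status carries the usual "Name:" and "PPid:" fields.

```shell
#!/bin/sh
# Sketch only: oom_adj_chain and PROC_ROOT are hypothetical names.
# PROC_ROOT defaults to the real /proc; override it to test on a fake tree.

# Print "PID NAME OOM_ADJ" for the given PID and each of its ancestors.
oom_adj_chain() {
    pid="$1"
    root="${PROC_ROOT:-/proc}"
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] && [ -r "$root/$pid/status" ]; do
        # The process name and parent PID both come from the status file.
        name=$(awk '$1 == "Name:" { print $2; exit }' "$root/$pid/status")
        # Print "?" when oom_adj is missing or unreadable.
        adj=$(cat "$root/$pid/oom_adj" 2>/dev/null || echo '?')
        printf '%s %s %s\n' "$pid" "$name" "$adj"
        # Step up to the parent; PID 1 reports a parent of 0, which ends the loop.
        pid=$(awk '$1 == "PPid:" { print $2; exit }' "$root/$pid/status")
    done
}
```

Running "oom_adj_chain $$" from inside an ssh login shell should list the shell, the sshd processes above it, and init, which makes it easy to see at which level the -17 first appears.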
Here is what tripwire is reporting has changed:

Modified:
"/usr/sbin"
"/usr/sbin/sshd"
"/usr/lib"
"/usr/lib/openssh"
"/usr/lib/openssh/sftp-server"
"/usr/lib/sftp-server"
"/var/run/acpid.pid"
"/var/run/acpid.socket"
"/var/run/atd.pid"
"/var/run/crond.pid"
"/var/run/crond.reboot"
"/var/run/exim4/exim.pid"
"/var/run/mdadm/monitor.pid"
"/var/run/portmap.pid"
"/var/run/portmap_mapping"
"/var/run/rpc.statd.pid"
"/var/run/rsyslogd.pid"
"/var/run/sm-notify.pid"
"/var/log/dmesg.0"
"/etc"
"/etc/adjtime"
"/etc/default"
"/etc/default/ssh"
"/etc/lvm/cache"
"/etc/lvm/cache/.cache"
"/etc/mtab"
"/etc/network/if-up.d"
"/etc/network/if-up.d/openssh-server"
"/etc/network/run"
"/etc/network/run/ifstate"
"/etc/pam.d"
"/etc/pam.d/sshd"
"/etc/passwd-"
"/etc/shadow-"
"/etc/ssh"
"/etc/ssh/ssh_host_dsa_key"
"/etc/ssh/ssh_host_dsa_key.pub"
"/etc/ssh/ssh_host_rsa_key"
"/etc/ssh/ssh_host_rsa_key.pub"
"/etc/ssh/sshd_config"
"/etc/tripwire"
"/etc/ufw"
"/etc/ufw/applications.d"
"/etc/ufw/applications.d/openssh-server"
"/etc/init.d"
"/etc/init.d/.depend.boot"
"/etc/init.d/.depend.start"
"/etc/init.d/.depend.stop"
"/etc/init.d/ssh"
"/etc/rc2.d"
"/etc/rc2.d/S18ssh"
"/etc/rc3.d"
"/etc/rc3.d/S18ssh"
"/etc/rc4.d"
"/etc/rc4.d/S18ssh"
"/etc/rc5.d"
"/etc/rc5.d/S18ssh"
"/etc/passwd"
"/etc/shadow"
"/root"
"/root/.nano_history"

Added:
"/var/log/dmesg.1.gz"
"/var/log/dmesg.2.gz"
"/root/.bash_history"

Removed:
"/etc/nologin"

I thought maybe the reason this is happening is because ssh (when being started from rc.d) is inheriting the -17 from init..
but it appears init doesn't have -17:

root@vicky:~# cat /proc/1/oom_adj
0

Then I thought maybe it's because the ssh versions are different, but they are not:

root@vicky:/home/foxx# dpkg -l | grep openssh-server
ii openssh-server 1:5.5p1-6 secure shell (SSH) server, for secure access from remote machines

root@courtney.internal [/home/foxx] > dpkg -l | grep openssh-server
ii openssh-server 1:5.5p1-6 secure shell (SSH) server, for secure access from remote machines

foxx@vicky:~$ md5sum /usr/sbin/sshd
f8c11462e8f2a7bf80e212e06041492b /usr/sbin/sshd

root@courtney.internal [/home/foxx] > md5sum /usr/sbin/sshd
f8c11462e8f2a7bf80e212e06041492b /usr/sbin/sshd

Then I made sure that the sshd_configs matched, and that I was using the same login process for both (shared key only)... they both matched, yet the problem still happens..

Then I thought maybe it's inheriting from the sshd server process, but it turns out that isn't it either:

(broken server)
root 1583 0.0 0.0 49168 1140 ? Ss 12:42 0:00 /usr/sbin/sshd
root@vicky:~# cat /proc/1583/oom_adj
-17

(working server)
root 2105 0.0 0.0 49184 1152 ? Ss 00:47 0:00 /usr/sbin/sshd
root@courtney.internal [/home/foxx] > cat /proc/2105/oom_adj
-17

So, I looked through the process tree to see where it was inheriting from at what stage..

(working server)
root@courtney.internal [/home/foxx] > ps faux | grep sshd
(-17) root 2105 0.0 0.0 49184 1152 ? Ss 00:47 0:00 /usr/sbin/sshd
(0) root 4735 0.0 0.0 76668 3356 ? Ss 12:47 0:00 \_ sshd: foxx [priv]
foxx 4740 0.0 0.0 76668 1644 ? S 12:47 0:00 \_ sshd: foxx@pts/0
root 4757 0.0 0.0 112344 876 pts/0 S+ 12:48 0:00 \_ grep sshd

(broken server)
foxx@vicky:~$ ps faux | grep sshd
(-17) root 1583 0.0 0.0 49168 1140 ? Ss 12:42 0:00 /usr/sbin/sshd
(-17) root 1616 0.0 0.0 70488 3376 ? Ss 12:43 0:00 \_ sshd: root@pts/0
(-17) root 1685 0.2 0.0 70488 3292 ? Ss 12:50 0:00 \_ sshd: foxx [priv]
foxx 1688 0.0 0.0 70488 1576 ? S 12:50 0:00 \_ sshd: foxx@pts/1
foxx 1715 0.0 0.0 7544 840 pts/1 S+ 12:50 0:00 \_ grep sshd

As you can see, the "sshd: foxx [priv]" process is where the two servers differ: it has an oom_adj of 0 on the working server but -17 on the broken one. According to the documentation, this appears to be where the privilege separation takes place.

So, now I started to check the ssh packages themselves, and make sure the repos are exactly the same on both servers. At this point, I realise that the working server is slightly out of date on the following packages:

root@courtney.internal [/home/foxx] > md5sum /etc/apt/sources.list
00bcf3cf28e2994f9b512f0a8ffb0765 /etc/apt/sources.list

root@vicky:/etc# md5sum /etc/apt/sources.list
00bcf3cf28e2994f9b512f0a8ffb0765 /etc/apt/sources.list

root@courtney.internal [/home/foxx] > apt-get upgrade
The following packages will be upgraded:
bind9-host dnsutils exim4 exim4-base exim4-config exim4-daemon-light host isc-dhcp-client isc-dhcp-common libbind9-60 libdns69 libisc62 libisccc60 libisccfg62 liblwres60 linux-base linux-image-2.6.32-5-amd64 linux-libc-dev sd-agent

The one that springs immediately to my attention is linux-base.

(working server)
root@courtney.internal [/home/foxx] > dpkg -l | grep linux-base
ii linux-base 2.6.32-31 Linux image base package

(broken server)
root@vicky:/etc# dpkg -l | grep linux-base
ii linux-base 2.6.32-34squeeze1 Linux image base package

Sooooo, I bite the bullet, and perform an upgrade of linux-base on the working server...

root@courtney.internal [/home/foxx] > apt-get install linux-base
Setting up linux-base (2.6.32-34squeeze1) ...

I then re-run the dry-run upgrade command to make sure it's upgraded:

bind9-host dnsutils exim4 exim4-base exim4-config exim4-daemon-light host isc-dhcp-client isc-dhcp-common libbind9-60 libdns69 libisc62 libisccc60 libisccfg62 liblwres60 linux-image-2.6.32-5-amd64 linux-libc-dev sd-agent

(as you can see, it has disappeared from the list)

I then reboot the server.. and wait for the longest 3 minutes of my life.. But guess what...
it didn't break :/

So, I bite another bullet, and upgrade the remaining packages on the server:

bind9-host dnsutils exim4 exim4-base exim4-config exim4-daemon-light host isc-dhcp-client isc-dhcp-common libbind9-60 libdns69 libisc62 libisccc60 libisccfg62 liblwres60 linux-image-2.6.32-5-amd64 linux-libc-dev sd-agent

I then make sure both servers are running the exact same stock kernel from Debian (as the working server was using an old kernel).

root@vicky:/etc# dpkg -l | grep linux | grep image
ii linux-base 2.6.32-34squeeze1 Linux image base package
ii linux-image-2.6-amd64 2.6.32+29 Linux 2.6 for 64-bit PCs (meta-package)
ii linux-image-2.6.32-5-amd64 2.6.32-34squeeze1 Linux 2.6.32 for 64-bit PCs

root@courtney.internal [/home/foxx] > dpkg -l | grep linux | grep image
ii linux-base 2.6.32-34squeeze1 Linux image base package
ii linux-image-2.6-amd64 2.6.32+29 Linux 2.6 for 64-bit PCs (meta-package)
ii linux-image-2.6.32-5-amd64 2.6.32-34squeeze1 Linux 2.6.32 for 64-bit PCs

root@vicky:/etc# uname -a
Linux vicky 2.6.32-5-amd64 #1 SMP Wed May 18 23:13:22 UTC 2011 x86_64 GNU/Linux

root@courtney.internal [/home/foxx] > uname -a
Linux courtney.internal 2.6.32-5-amd64 #1 SMP Wed May 18 23:13:22 UTC 2011 x86_64 GNU/Linux

After another long 3 minute wait, I test for the oom bug... and guess what.. it's STILL NOT DOING IT!!! :(

So now I check for differences in /etc/pam.d... I notice pam_cap.so is missing in common-auth on the broken server:

root@courtney.internal [/etc/pam.d] > dpkg -l | grep cap
ii libcap2 1:2.19-3 support for getting/setting POSIX.1e capabilities
ii libcap2-bin 1:2.19-3 basic utility programs for using capabilities

Broken server:
PAM profiles to enable:
[*] Unix authentication

Working server:
[*] Unix authentication
[*] Inheritable Capabilities Management

So, I install 'libcap2-bin' on the broken server, reboot.. and still no god damn luck.

At this point /etc/pam.d on both servers is matching (md5sum matches up on all files)..
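
For anyone wanting to repeat this kind of per-directory md5sum comparison between two machines, one possible approach is sketched below. It is not necessarily how the comparison above was done: the helper names md5_manifest and differing_paths are invented for illustration, and the sketch assumes GNU md5sum, find and sort are available on both servers.

```shell
#!/bin/sh
# Sketch only: md5_manifest and differing_paths are hypothetical helpers.

# Print "md5  ./relative/path" for every regular file under directory $1,
# sorted by path so that manifests from two machines line up.
md5_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Given two manifest files, print each path whose checksum differs or
# which is present in only one of them.
differing_paths() {
    # Lines that occur exactly once across both manifests are the ones
    # that do not match; strip the checksum column to leave the path.
    sort "$1" "$2" | uniq -u | sed 's/^[^ ]* [ *]//' | sort -u
}
```

A manifest would be generated on each server (for example "md5_manifest /etc/pam.d > /tmp/pam-md5-$(hostname)"), copied across, and passed to differing_paths; empty output would mean every file matched.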
So now, I decide to check all files relating to openssh-server.. again, all matches up fine.

Then I start to get really pissed off, and check the md5sum for all files in /etc:

(left is working server, right is broken)

root@courtney.internal [/etc/pam.d] > diff /tmp/etcmd5-courtney /tmp/etcmd5-vicky -y --suppress-common-lines
a81fbd39142e18d5ed1fb8a7b3ecce71 /etc/adjtime | fa9192c6cdaab85ec952576ab3139fd1 /etc/adjtime
7fcee51274f69cdf5d4c8b7be799637b /etc/apt/trustdb.gpg | 1319acca28ae6475a915ca0684d0cd62 /etc/apt/trustdb.gpg
a3710991fcce0b1574586450c81095e1 /etc/apt/trusted.gpg | d802712c9255f13bea3bea87b83180b1 /etc/apt/trusted.gpg
366d165a9f5de024d3a21b9dc51de057 /etc/bash.bashrc | 5b3c3bc73d236e4e1b6f9b6c1ed5964e /etc/bash.bashrc
109789e71a8cf8e21302bf44b5b716f7 /etc/blkid.tab | aa0de4c3c85ae212f6c59b6b89b21b9a /etc/blkid.tab
2de357d9da09d738c179e8d269696b9c /etc/blkid.tab.old | aa0de4c3c85ae212f6c59b6b89b21b9a /etc/blkid.tab.old
22133b5bd4023d48b50863d2b1c7b25e /etc/console-setup/cached.k | bdd92d16a8172f9e7ea3f866b59b4fc4 /etc/console-setup/cached.k
b88b0f0a4d3b4beec0cc4b09b6c2aaea /etc/cron.daily/ntp <
4e5aa59f38b520c5a45d3fdc7cdec46c /etc/cron.daily/sysstat | d1e8b20b26a33a6a0781d59bc312442e /etc/cron.daily/tripwire
455c3c071b6daabb4e4490828975034c /etc/cron.d/sysstat <
1cffe509bba095a0f7ece99a971e6e9a /etc/crypttab <
756141f7eacf1a272a2f6e51646b3aa4 /etc/default/cryptdisks <
6bba39eb6c39aef755f1fadb48ded5a5 /etc/default/lxc <
cd7a62fbb18fa8fe5893dee93064b328 /etc/default/ntp <
e0d7efac23e911c65f44b08de446e837 /etc/default/rsync <
21614b7a3d91ee06750feedbfdaec898 /etc/default/sysstat <
fbc234ecd0f7e8bc1c394bbde5867be1 /etc/dhcp/dhclient-exit-hoo <
1a2b9d0a869e2aa885ae3621c557fb95 /etc/dpkg/shlibs.default <
84b1e69080569cc5c613a50887af5200 /etc/dpkg/shlibs.override <
297521889d690871ec9d89c5eeff745a /etc/emacs/site-start.d/50a <
c10e4cfb6fb21c281f04105d9482736a /etc/exim4/update-exim4.con | 2109f7e59c5d7ab2431aad0f095e2e34 /etc/exim4/update-exim4.con
0bec0044c716f14083f72c42af543d16 /etc/fstab | ab91b08889eb01fa6cd2364eba136ae8 /etc/fstab
c9457cf5b2196da67d5ac816d1c86a4f /etc/fuse.conf <
e030dc891a3af1f7779429b5c0554c98 /etc/gdb/gdbinit <
5d151dd5c443ed7b2a5ded95740bf00d /etc/glusterd/glusterd.info <
f8ab4b0d63d43e8385e8e0a7b0b0fdba /etc/group | 6c8ccd77ad88953d80a0c8230feb43b0 /etc/group
b2eb12a6eb86aa16730fcc78f3856f99 /etc/group- | f8366cb252dab0be81d315dbb0bfd54d /etc/group-
1721fcab19363ee66aa66829b9876f0e /etc/gshadow | 9a53c91ec405b6c2589b257d1f610e11 /etc/gshadow
b43f2208e196045e2e4eff32a32a43cb /etc/gshadow- | 5db46dc414e73d989833f1718646ec40 /etc/gshadow-
289108902ba56c6d3d10392b994f5063 /etc/hostname | be7724203e323a7c97fe531e3662521c /etc/hostname
f6c39850a5646ce96a62f8bbfadcab12 /etc/hosts | e7358f34e94f27ce975c2beb64a5fd31 /etc/hosts
e9d8dadacde9e17f0a9b19951109bd15 /etc/init.d/cryptdisks | 94c8893c0233f51b5b35c44afcc9d064 /etc/init.d/.depend.boot
a544f8db0b5b71722ebf28cd29d5c99f /etc/init.d/cryptdisks-earl | 018bfdbf3ce7d5000d4771861558084c /etc/init.d/.depend.start
47f49c3084a87495a4b21b16d62f08ce /etc/init.d/.depend.boot | efcaecb9b1dbf7ee6999ce6d7fe6cbce /etc/init.d/.depend.stop
c32f1bac3bf2ead96ef4328f8fa8b6a4 /etc/init.d/.depend.start <
7a5306deeb6f58cf9fddc44176e944b2 /etc/init.d/.depend.stop <
0a911b5d7bdf62f4a27f59544859a25c /etc/init.d/enc <
d0b8cce6d932e1cd90812ce32c3f81a4 /etc/init.d/fuse <
88def01f8a173be24e1347d215d713f1 /etc/init.d/github01 <
a8e1b7caac5f373ae3ee68cfc6703c4c /etc/init.d/lxc <
1593209e2edaef7930940759b07caee1 /etc/init.d/ntp <
9d74671cca3077de30a6cbed26d4cd0e /etc/init.d/rsync <
7702ad8bd63cbe13b8bb455199435191 /etc/init.d/screen-cleanup <
ee350831ec30475b16b8bda31a3f24de /etc/init.d/sd-agent <
8fb5289db2c7f67aa9347ae7e8b445dd /etc/init.d/sws01 <
e3cf21d607c6852e2e5013524c657c6e /etc/init.d/sysstat <
bc93dd93f82749814f2bde70d9428c0d /etc/initramfs-tools/conf.d <
98fc27159733746de9890831d51d95d4 /etc/iptables.up.rules <
02e24efb09d5343d249780465b59bfd6 /etc/ld.so.cache | ac8d6701fefa22e490c2c956a047f224 /etc/ld.so.cache
3459aad5fab462e1c126997b4ac449bb /etc/logcheck/ignore.d.serv <
32d3f9199b3438cd41ed3cb1122135b7 /etc/lvm/backup/default | c0adb70d988e7ad895b42aa23644dce0 /etc/lvm/backup/mech
a5ab01460cb3dba4daedd86002bdba67 /etc/lvm/cache/.cache | 4436f6d9c98cb918c4757f48446ddefc /etc/lvm/backup/ssd
> a96425623ae854ba979f7fd8b901bd21 /etc/lvm/cache/.cache
f51730e7056489397b8b2a9a9b99662c /etc/mailname | 1cc22bdbd9f0dd61c3dbdf484f5e1954 /etc/mailname
2b0e1a3f52757145d65c353ca49d7756 /etc/mdadm/mdadm.conf | 75ddc14817703ef468326041acd8bfb1 /etc/mdadm/mdadm.conf
2ed9e1f60da7f3f87faca9879d0a6531 /etc/mtab | d6eb5404b14026e3cb6377fdbfa36a07 /etc/mtab
0a38058aafd42b7b4105505379194e1b /etc/nanorc | fc57b93c907fefbccf12317d40b4a204 /etc/nanorc
d8e3272886cfc3ca3e103840f17c28b3 /etc/network/interfaces | 11aed0887d5bd14a426e2a4f7c7d4f4a /etc/network/interfaces
0925a154d88cf0bc2d0f588ab7c670c9 /etc/network/run/ifstate | 40a731c20283789c9916b1978b8b56b8 /etc/network/run/ifstate
1e47bfedf84da1cdfaa45ee016fb3609 /etc/networks | 5293c479ab40a68b4fa7a6eeff755256 /etc/networks
3e250ecaf470e1d3a2b68edd5de46bfd /etc/ntp.conf <
a3bf39d554578558648717be72405bb4 /etc/passwd | ec13c66df3dee36481a8c3d432e54d8f /etc/passwd
c495660bf88840c35e5c3ede628e5e5d /etc/passwd- | ec13c66df3dee36481a8c3d432e54d8f /etc/passwd-
b08c4faf56551a861d7ae6858ac52b2e /etc/profile | b94c2e3df2a779ac12080942df4d86ea /etc/profile
fe0b86955e4eb444f17f54d086580b1f /etc/resolv.conf | 2bc8c1c0361ac0fae5581bcaf8d7f136 /etc/resolv.conf
c22ef5f592ae97d3152e1d58657e2c8a /etc/rssh.conf <
f4aa40956bb6f150815b4d60a505760c /etc/screenrc <
78b737784042d973d6bed47e7667b1bb /etc/sd-agent/config.cfg <
4eccd6267f438812bfa1d4eb8ac05217 /etc/shadow | e2f45652caa1cbb84c778adc75f7545b /etc/shadow
676a49b9dbe67ce8be7a2921f7e10570 /etc/shadow- | e2f45652caa1cbb84c778adc75f7545b /etc/shadow-
3c1144bd2727cf00af012021fa3de4c5 /etc/shells | 0e85c87e09d716ecb03624ccff511760 /etc/shells
9fa92b39192a027af603fbff3d2f42eb /etc/siege/siegerc <
fb778297a8e612868e41225cf4db7c9d /etc/siege/urls.txt <
813856cf9d8c29095b3a4e19d92d3da0 /etc/ssh/ssh_host_dsa_key | 3f4beaeb582ce81b42cca475e65dc75a /etc/ssh/ssh_host_dsa_key
75d221c8d4abe42699ff813e5a1e8cc7 /etc/ssh/ssh_host_dsa_key.p | 55e34345f7a2e1ac5ec7ce78543487e7 /etc/ssh/ssh_host_dsa_key.p
b85a52219856a7ecf95d625a1bee5068 /etc/ssh/ssh_host_rsa_key | 70af0ef16b661edd96c317058ef55a78 /etc/ssh/ssh_host_rsa_key
3aea4190a19facc76222e69c5700f5ac /etc/ssh/ssh_host_rsa_key.p | b8524bc48e4d5c71c69888b452e8d6ae /etc/ssh/ssh_host_rsa_key.p
16e9567a6298125264967d276e6a139f /etc/sudoers | c5dab0f2771411ed7e67d6dab60a311f /etc/sudoers
a86605ae7354f25d8060bcb5ad83edf7 /etc/sysctl.conf | 2c6f89fdb09aeac5735144497a261782 /etc/sysctl.conf
e52dbe02e5da26d9be965373676e9355 /etc/sysstat/sysstat <
fa92b01baa2130e26822c30fb27ac56e /etc/sysstat/sysstat.ioconf <
> d87271b624ab8e93c2e51cd59bade631 /etc/tripwire/site.key
> 8f6ebb12f511b46fbf0203d978c3bf01 /etc/tripwire/tw.cfg
> 1821c7a0d207a168f1d7c766f238e816 /etc/tripwire/twcfg.txt
> 717b4afa4f0f8614f3947441a3ddf422 /etc/tripwire/tw.pol
> 92c9b38e95c90eebf1d746633a81909c /etc/tripwire/tw.pol.bak
> d08d31fa833b50d2fb9a37f97b07cbd0 /etc/tripwire/twpol.txt
> fdbfa3e0879f0d959bbdfd5601ef4d4f /etc/tripwire/vicky-local.k
aeb6fe5dcfc873b0632ba92345ed16e2 /etc/udev/rules.d/70-persis | 0fdf03b558e118edcf8ce29abaf296f1 /etc/udev/rules.d/70-persis
24cc33b9f96e3189b7e34cf5484cb99f /etc/udev/rules.d/70-persis | 4a49e7ddeacbb3ded8bb3968f219362c /etc/udev/rules.d/70-persis

I patched up the /etc/init.d/.depend.* files so the 'ssh' entries are matching.. still. no. luck

It's now 24 hours of trying to fix this issue, and I'm getting extremely pissed off :@

If anyone could offer up some advice, it would be VERY much appreciated.

On 30/05/2011 01:24, Cal Leeming [Simplicity Media Ltd] wrote:
Another quick update on this...