bind9, openswan crash wheezy VPS
Hello everyone.
I have a VPS running a fresh install of wheezy, installed by me from
scratch (including kernel). Everything seems to be running fine,
except for bind9 and openswan which literally crash the vps as
explained below.
I'll start with bind9, since I have more info there. It's set up as a
name server authoritative for two zones. Querying both zones works fine
from localhost and the internet over ipv4, and ipv6. The problem comes
up when I try to use bind9 to resolve other domains from
localhost. When resolving certain domains, the vps literally
crashes. I have to send it a boot request, and it boots up again
starting with grub, to the login prompt. It doesn't matter if I use
dig to query localhost by hand, or if I have nameserver ::1, or
nameserver 127.0.0.1 in resolv.conf. It doesn't matter if I query A
records, or AAAA records (if those exist). The results are the same,
bind9 resolves some domains, and crashes on others. There are no
errors in logs. If I use dig by hand, type in:
dig @localhost www.debian.org.
and press enter, the crash happens right there and then, I have to
send the vps a boot request at that point.
Here's a list of domains that work fine, and those which crash the
machine.
crashes:
www.ietf.org.
www.linux-speakup.org.
ftp.us.debian.org.
www.debian.org.
 
works fine:
www.yahoo.com.
www.google.com.
www.fsf.org.
There are probably many more from both categories. In the case of a
query that works, I can get a cname record, and query that until I get
answers for a and aaaa records without problems. It doesn't matter if
I do, or don't use forwarders. If I put my vps provider's name servers
in resolv.conf, I can query everything just fine. 
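Since nothing reaches the logs before the machine goes down, one way to
narrow down which names trigger the crash is to record each name on disk
before querying it. The following is only a minimal sketch in Python,
assuming dig is installed and bind9 answers on localhost. The function
names and the file name query-progress.log are made up for illustration.
Whether the fsync'd line really survives depends on how the host handles
disk writes for the guest, so treat the result as a hint.

```python
import os
import subprocess


def record_then_query(name, log_path, query):
    """Append `name` to log_path, push it to disk, then run query(name)."""
    with open(log_path, "a") as log:
        log.write(name + "\n")
        log.flush()
        # Ask the kernel to write the line out before the query is sent,
        # so it has a chance of surviving a hard crash.
        os.fsync(log.fileno())
    return query(name)


def dig_localhost(name):
    """Query the local bind9 with dig, the same way as done by hand."""
    result = subprocess.run(
        ["dig", "@localhost", name],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout
```

Calling record_then_query(name, "query-progress.log", dig_localhost) for
each name in turn should leave the name of the query that was in flight
as the last line of the file after a crash and reboot.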
When using the stock wheezy kernel, the machine would sometimes crash
during boot right after printing "starting bind9," before the ok that
comes after. This was true especially if starting named without the -4
flag to disable ipv6. There were also random crashes every couple of
days or so when I wasn't logged into the machine watching for
them. All this seems to have gone away after I upgraded to
linux 3.9 from wheezy-backports, and just the query crashes remain.
I know someone who is with the same VPS provider and runs fedora 16 in
his VPS. I have a shell account on his system, and have been able to
verify for myself by using dig that it's possible to query all the
domains I listed above using his local bind9 on his machine with no
crashes. As far as I can tell (lspci, /proc/cpuinfo), his vps is
configured exactly like mine as far as hardware, except for RAM and HD
capacity. That's all the info I have on the bind9 problem.
As for openswan, it's set up with one connection, configured as
responder using the native netkey stack. When openswan starts, I get
this in /var/log/syslog:
Aug  9 23:07:16 vserver kernel: [  504.009595] NET: Registered
protocol family 15
Aug  9 23:07:16 vserver ipsec_setup: Starting Openswan IPsec
U2.6.37-g955aaafb-dirty/K3.9-0.bpo.1-amd64...
Aug  9 23:07:16 vserver ipsec_setup: Using NETKEY(XFRM) stack
Aug  9 23:07:16 vserver kernel: [  504.132588] Initializing XFRM
netlink socket
Aug  9 23:07:16 vserver kernel: [  504.194202] AVX instructions are
not detected.
Aug  9 23:07:16 vserver kernel: [  504.202914] AVX instructions are
not detected.
Aug  9 23:07:16 vserver ipsec_setup: ...Openswan IPsec started
Aug  9 23:07:16 vserver ipsec__plutorun: adjusting ipsec.d to
/etc/ipsec.d
Aug  9 23:07:16 vserver pluto: adjusting ipsec.d to /etc/ipsec.d
Aug  9 23:07:16 vserver ipsec__plutorun: 002 loading certificate from
/etc/ipsec.d/certs/servercert.pem
Aug  9 23:07:16 vserver ipsec__plutorun: 002   loaded host cert file
'/etc/ipsec.d/certs/servercert.pem' (1505 bytes)
Aug  9 23:07:16 vserver ipsec__plutorun: 002   no subjectAltName
matches ID '%fromcert', replaced by subject DN
Aug  9 23:07:16 vserver ipsec__plutorun: 002 added connection
description "l2tp"
The machine crashes when I try to initiate a connection from a win7
client. Nothing gets written to the logs here, so the output below is
the last screen full I get when logged into the vps via the serial
console using out of band access, with the vps running in run level 1,
and invoke-rc.d ipsec start done by hand:
pluto[2266]: packet from 10.0.0.1:500: received Vendor ID
payload [draft-ietf-ipsec-nat-t-ike-02_n] meth=106, but already using
method 109
pluto[2266]: packet from 10.0.0.1:500: ignoring Vendor ID
payload [FRAGMENTATION]
pluto[2266]: packet from 10.0.0.1:500: ignoring Vendor ID
payload [MS-Negotiation Discovery Capable]
pluto[2266]: packet from 10.0.0.1:500: ignoring Vendor ID
payload [Vid-Initial-Contact]
pluto[2266]: packet from 10.0.0.1:500: ignoring Vendor ID
payload [IKE CGA version 1]
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: responding to Main Mode from
unknown peer 10.0.0.1
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: OAKLEY_GROUP 20 not
supported.  Attribute OAKLEY_GROUP_DESCRIPTION
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: OAKLEY_GROUP 19 not
supported.  Attribute OAKLEY_GROUP_DESCRIPTION
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: transition from state
STATE_MAIN_R0 to state STATE_MAIN_R1
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: STATE_MAIN_R1: sent MR1,
expecting MI2
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: NAT-Traversal:
Result using RFC 3947 (NAT-Traversal): peer is NATed
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: transition from state
STATE_MAIN_R1 to state STATE_MAIN_R2
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: STATE_MAIN_R2: sent MR2,
expecting MI3
That's all the info I have on the openswan issue. This vps is of
course running lots more than just bind9 and openswan. Apache,
proftpd, postfix, spamassassin, clamav, opendkim, just to name a
few. All of those appear to be running without problems.
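For the pluto console output above, the last state reached before the
crash can be pulled out mechanically, which may help when comparing
several captures. This is just a sketch in Python. The CAPTURE text is
copied from the console output in this message, and the function names
are invented for illustration.

```python
import re

# Part of the console capture from this message, still line-wrapped.
CAPTURE = """\
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: transition from state
STATE_MAIN_R0 to state STATE_MAIN_R1
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: STATE_MAIN_R1: sent MR1,
expecting MI2
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: transition from state
STATE_MAIN_R1 to state STATE_MAIN_R2
pluto[2266]: "l2tp"[1] 10.0.0.1 #1: STATE_MAIN_R2: sent MR2,
expecting MI3
"""

# \s+ also matches newlines, so wrapped lines are handled.
TRANSITION = re.compile(r"transition from state\s+(\S+)\s+to state\s+(\S+)")


def transitions(text):
    """Return (old, new) state pairs in the order they appear in text."""
    return TRANSITION.findall(text)


def last_state(text):
    """Return the last state reached, or None if no transition was logged."""
    pairs = transitions(text)
    return pairs[-1][1] if pairs else None
```

For the capture in this message, last_state(CAPTURE) gives
STATE_MAIN_R2, i.e. the responder sent MR2 and was waiting for MI3 when
the output stopped.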
As for the vps itself, it is based on KVM/QEMU with one cpu, and
one gig of RAM. The network card uses the virtio_net module, and the
HD shows up as /dev/vda (I assume using the virtio_blk module, which
is also automatically loaded). Based on the login banner I get when
using out of band access, the host seems to be running openbsd. I'm
not sure if the machine providing the out of band account and the
host my vps is running on are actually one and the same though.
According to /proc/cpuinfo, the KVM/QEMU version seems to be 0.9.1.
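The assumption about which virtio modules are loaded can be checked
against /proc/modules. Below is a small sketch in Python with made-up
function names. It only looks at the first column of each line, which is
the module name. A driver compiled into the kernel rather than built as
a module will not show up there, so a "missing" result is not proof that
the driver is absent.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first column is the module name
    return names


def missing_modules(proc_modules_text, wanted=("virtio_net", "virtio_blk")):
    """Return the wanted modules that do not appear in the text, sorted."""
    return sorted(set(wanted) - loaded_modules(proc_modules_text))
```

On the vps itself this would be called as
missing_modules(open("/proc/modules").read()), and an empty list would
mean both modules are loaded.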
Any help in at least figuring out what is causing this, if not
actually having a fully functional bind9 and openswan is much
appreciated. If more info is necessary, I'll see what I can do.
Greg
-- 
web site: http://www.gregn.net
gpg public key: http://www.gregn.net/pubkey.asc
skype: gregn1
(authorization required, add me to your contacts list first)
--
Free domains: http://www.eu.org/ or mail dns-manager@EU.org