Bug#756343: Fix gethostbyname() sending data on random file descriptors in wheezy, already done in jessie

Package: libc6

Version: 2.13-38+deb7u1

Severity: normal

Hello,

On test systems running stress workloads we were regularly encountering a bug

in gethostbyname that is fixed in libc6 in jessie. For completeness I've

included the entire repro/investigation process; however, we are fairly sure the

bug is the same as Debian bug #722075. I'm writing to ask whether this fix can

be backported to wheezy (stable).

We encountered this bug on fractional core VMs running workloads that stress

disk, cpu, and networking. As part of that testing we make many concurrent HTTP

requests in Python, the relevant code being similar to:

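> import threading
> import urllib2
>
> def GetURL(**kwargs):
>     # Fetch a fixed URL and return the response body.
>     url = 'http://www.example.com/'
>     request = urllib2.Request(url)
>     return urllib2.urlopen(request, **kwargs).read()
>
> def HammerGetHostByID():
>     # Issue requests forever, ignoring timeouts and network errors.
>     while True:
>         try:
>             GetURL(timeout=1)
>         except Exception:
>             pass
>
> # Run ten hammering threads concurrently.
> for _ in xrange(10):
>     thread = threading.Thread(target=HammerGetHostByID)
>     thread.start()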

Running a workload like this on 500 wheezy VMs would yield on the order of 8

failures over 24 hours, each with the following output:

*** glibc detected *** /usr/bin/python: double free or corruption (out)

Digging a little deeper with a debugger we found that whenever these were hit,

the stack would contain _nss_dns_gethostbyname4_r and have garbage stack frames

above that. The gethostbyname() call most likely comes from the above urlopen.

Given this observation, we suspected a connection to Debian bug #722075, and

attempted the following patch to libc6:

diff -rupN eglibc-2.13/resolv/res_send.c eglibc-2.13-mod/resolv/res_send.c
--- eglibc-2.13/resolv/res_send.c 2010-03-26 14:08:35.000000000 -0700
+++ eglibc-2.13-mod/resolv/res_send.c 2014-07-02 10:23:28.521088097 -0700
@@ -1330,6 +1330,7 @@ send_dg(res_state statp,
             retval = reopen (statp, terrno, ns);
             if (retval <= 0)
                 return retval;
+            pfd[0].fd = EXT(statp).nssocks[ns];
         }
         goto wait;
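
To illustrate why the added line matters, here is a minimal standalone Python sketch (not eglibc code). It assumes, as the patch suggests, that send_dg keeps the socket's descriptor number by value in pfd[0].fd, so that the copy goes stale once reopen() replaces the socket. The function name stale_fd_demo and the use of os.devnull as the unrelated file are made up for the illustration.

```python
import os
import socket


def stale_fd_demo():
    """Return (cached_fd, other_fd, new_fd) after a simulated socket reopen."""
    # The resolver socket, with its descriptor number cached by value
    # (our stand-in for pfd[0].fd in send_dg).
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    cached_fd = sock.fileno()

    # The socket is closed as the first half of a reopen...
    sock.close()

    # ...and before the new socket exists, unrelated code (another thread in
    # the real workload) opens a file. POSIX hands out the lowest free
    # descriptor number, so this normally reuses the number just released.
    other_fd = os.open(os.devnull, os.O_WRONLY)

    # The reopened socket therefore receives a different descriptor number.
    new_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    new_fd = new_sock.fileno()

    os.close(other_fd)
    new_sock.close()
    return cached_fd, other_fd, new_fd


if __name__ == "__main__":
    cached, other, new = stale_fd_demo()
    print("cached fd: %d, unrelated file: %d, reopened socket: %d"
          % (cached, other, new))
```

On a POSIX system the cached number ends up naming the unrelated file rather than the reopened socket, which would be consistent with data landing on an arbitrary file descriptor unless the cached value is refreshed after the reopen.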

With this single-line patch we no longer hit the 'double free or corruption'

message even when running 100 VMs for over 5 days. I extracted the above code

fix from https://lists.debian.org/debian-glibc/2014/06/msg00013.html, but

modified the diff to apply to 2.13-38+deb7u1.
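
As a back-of-the-envelope check on how meaningful the clean run is, the sketch below compares the two runs in VM-hours. It assumes failures occur independently at a constant rate per VM-hour (a Poisson model, which is a simplification on our part) and takes the pre-patch figure as 8 failures. The function names are illustrative only.

```python
import math


def expected_failures(base_failures, base_vms, base_hours, test_vms, test_hours):
    """Scale an observed failure count to a different number of VM-hours."""
    rate_per_vm_hour = base_failures / float(base_vms * base_hours)
    return rate_per_vm_hour * test_vms * test_hours


def probability_of_zero(expected):
    """Poisson probability of seeing no failures given the expected count."""
    return math.exp(-expected)


if __name__ == "__main__":
    # Pre-patch: about 8 failures on 500 VMs over 24 hours.
    # Post-patch: 100 VMs for 5 days (120 hours), no failures observed.
    expected = expected_failures(8, 500, 24, 100, 5 * 24)
    print("expected failures at the old rate: %.1f" % expected)
    print("chance of zero failures by luck: %.5f" % probability_of_zero(expected))
```

Both runs cover the same 12,000 VM-hours, so at the pre-patch rate we would have expected about 8 failures in the patched run. Under this model the chance of seeing none purely by luck is well under 0.1%.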

If a fix similar to this could be included in wheezy stable at some point it

would be much appreciated.

We were running kernel: Debian 3.14.5-1~bpo70+1, libc6: 2.13-38+deb7u1

Thanks,

Marcus Ewert