
Bug#756343: Fix gethostbyname() sending data on random file descriptors in wheezy, already done in jessie



Package: libc6
Version: 2.13-38+deb7u1
Severity: normal

Hello,

On test systems running stress workloads we were regularly encountering a bug
in gethostbyname that is fixed in libc6 in jessie. For completeness I've
included the entire repro/investigation process; however, we are fairly sure the
bug is the same as debian bug #722075. I'm writing to inquire if this bugfix can
be backported to wheezy (stable).

We encountered this bug on fractional-core VMs running workloads that stress
disk, CPU, and networking. As part of that testing we make many concurrent HTTP
requests in Python; the relevant code is similar to:

> import threading
> import urllib2
>
> def GetURL(**kwargs):
>   url = 'http://www.example.com/'
>   request = urllib2.Request(url)
>   return urllib2.urlopen(request, **kwargs).read()
>
> def HammerGetHostByID():
>   while True:
>     try:
>       GetURL(timeout=1)
>     except:
>       pass
>
> for _ in xrange(10):
>   thread = threading.Thread(target=HammerGetHostByID)
>   thread.start()

Running a workload like this in 500 VMs running wheezy would yield on the order
of 8 failures over 24 hours, each with the following output:

*** glibc detected *** /usr/bin/python: double free or corruption (out)

Digging a little deeper with a debugger we found that whenever these were hit,
the stack would contain _nss_dns_gethostbyname4_r and have garbage stack frames
above that. The gethostbyname() call most likely comes from the above urlopen.

Given this observation, we suspected a connection to debian bug #722075, and
attempted the following patch to libc6:

diff -rupN eglibc-2.13/resolv/res_send.c eglibc-2.13-mod/resolv/res_send.c
--- eglibc-2.13/resolv/res_send.c 2010-03-26 14:08:35.000000000 -0700
+++ eglibc-2.13-mod/resolv/res_send.c 2014-07-02 10:23:28.521088097 -0700
@@ -1330,6 +1330,7 @@ send_dg(res_state statp,
  retval = reopen (statp, terrno, ns);
  if (retval <= 0)
  return retval;
+ pfd[0].fd = EXT(statp).nssocks[ns];
  }
  }
  goto wait;
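
To illustrate why a missing refresh of pfd[0].fd could matter, here is a minimal
Python sketch of the general stale-descriptor hazard. It is not glibc code and
does not reproduce the crash; it only assumes that a cached descriptor number
can be reused by an unrelated file once the original socket is closed. The pipe
standing in for "some other file" and the function name stale_fd_demo are
hypothetical.

```python
import os
import socket


def stale_fd_demo():
    # Hypothetical stand-in for an unrelated file opened elsewhere in the process.
    pipe_r, pipe_w = os.pipe()

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    cached_fd = sock.fileno()  # plays the role of pfd[0].fd

    # "Reopen": the old socket is closed ...
    sock.close()
    # ... its descriptor number gets reused by something else
    # (forced here with dup2 so the demo is deterministic) ...
    os.dup2(pipe_w, cached_fd)
    # ... and the replacement socket lands on a different number.
    new_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    new_fd = new_sock.fileno()

    # Without a refresh, I/O through the cached number reaches the wrong file.
    os.write(cached_fd, b"dns-query")
    leaked = os.read(pipe_r, 64)

    # The analogue of the one-line fix is to use new_fd from here on
    # instead of cached_fd.
    result = {"stale_fd": cached_fd, "new_fd": new_fd, "leaked": leaked}

    # Clean up every descriptor the demo opened.
    os.close(cached_fd)
    os.close(pipe_r)
    os.close(pipe_w)
    new_sock.close()
    return result


if __name__ == "__main__":
    print(stale_fd_demo())
```

In this sketch the bytes written through the stale number end up in the pipe
rather than in the new socket, which is the same kind of misdirected traffic
the bug title describes.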

With this single-line patch we no longer hit the 'double free or corruption'
message even when running 100 VMs for over 5 days. I extracted the above code
fix from https://lists.debian.org/debian-glibc/2014/06/msg00013.html, but
modified the diff so that it applies to 2.13-38+deb7u1.
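
As a rough sanity check on how much weight the clean run carries, the numbers
above can be compared directly. This is a back-of-the-envelope sketch, assuming
failures are independent, occur at a constant rate per VM-hour (a Poisson
model), and that "on the order of 8" is taken as exactly 8; none of these
assumptions were verified.

```python
import math


def clean_run_probability():
    # Unpatched baseline: roughly 8 failures across 500 VMs in 24 hours.
    baseline_vm_hours = 500 * 24
    baseline_failures = 8.0

    # Patched run: 100 VMs for (at least) 5 days.
    patched_vm_hours = 100 * 5 * 24

    # Assumed constant failure rate per VM-hour.
    rate = baseline_failures / baseline_vm_hours

    # Failures expected in the patched run if the patch had no effect.
    expected = rate * patched_vm_hours

    # Poisson probability of observing zero failures by chance.
    p_zero = math.exp(-expected)
    return expected, p_zero


if __name__ == "__main__":
    expected, p_zero = clean_run_probability()
    print("expected failures without fix: %.1f" % expected)
    print("chance of a clean run anyway:  %.5f" % p_zero)
```

Both runs cover the same 12000 VM-hours, so under these assumptions an
unpatched run of that length would be expected to show about 8 failures, and a
clean run by chance alone would be very unlikely (well under 0.1%).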

If a fix similar to this could be included in wheezy stable at some point it
would be much appreciated.

We were running kernel: Debian 3.14.5-1~bpo70+1, libc6: 2.13-38+deb7u1

Thanks,
Marcus Ewert
