Bug#756343: Fix gethostbyname() sending data on random file descriptors in wheezy, already done in jessie
Hi,
On Mon, Jul 28, 2014 at 04:43:37PM -0700, Marcus Ewert wrote:
> Package: libc6
> Version: 2.13-38+deb7u1
> Severity: normal
>
> Hello,
>
> On test systems running stress workloads we were regularly encountering a
> bug in gethostbyname that is fixed in libc6 in jessie. For completeness I've
> included the entire repro/investigation process; however, we are fairly
> sure the bug is the same as Debian bug #722075. I'm writing to inquire if
> this bugfix can be backported to wheezy (stable).
>
> We encountered this bug on fractional core VMs running workloads that stress
> disk, CPU, and networking. As part of that testing we make many concurrent
> HTTP requests in Python, the relevant code being similar to:
>
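> > import threading
> > import urllib2
> >
> > def GetURL(**kwargs):
> >     url = 'http://www.example.com/'
> >     request = urllib2.Request(url)
> >     return urllib2.urlopen(request, **kwargs).read()
> >
> > def HammerGetHostByID():
> >     # Loop forever, issuing one HTTP request per iteration.
> >     while True:
> >         try:
> >             GetURL(timeout=1)
> >         except Exception:
> >             # Timeouts and HTTP errors are expected under load; keep going.
> >             pass
> >
> > # Start 10 threads that issue requests concurrently.
> > for _ in xrange(10):
> >     thread = threading.Thread(target=HammerGetHostByID)
> >     thread.start()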
>
> Running a workload like this in 500 VMs running wheezy would yield on the
> order of 8 failures over 24 hours with the following output:
>
> *** glibc detected *** /usr/bin/python: double free or corruption (out)
>
> Digging a little deeper with a debugger we found that whenever these were
> hit, the stack would contain _nss_dns_gethostbyname4_r and have garbage
> stack frames above that. The gethostbyname() call most likely comes from
> the above urlopen.
>
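To narrow the reproduction down to the resolver, the HTTP layer can be taken out of the loop. The following is only a sketch and was not part of the original report: it assumes that calling the resolver directly from several threads exercises the same code path, it is written for Python 3 rather than the Python 2 used above, and the name hammer_resolver and its parameters are made up for illustration.

```python
import socket
import threading


def hammer_resolver(host, threads=10, iterations=100):
    """Resolve `host` repeatedly from several threads; return (ok, failed)."""
    counts = {"ok": 0, "failed": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(iterations):
            try:
                # socket.getaddrinfo() goes through the C library resolver.
                socket.getaddrinfo(host, 80)
                key = "ok"
            except OSError:
                # Lookup failures (socket.gaierror) are counted, not fatal.
                key = "failed"
            with lock:
                counts[key] += 1

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return counts["ok"], counts["failed"]


if __name__ == "__main__":
    # Hypothetical invocation; the host name is taken from the report above.
    print(hammer_resolver("www.example.com"))
```

Unlike the original loop this one terminates, so it can be run for a fixed number of lookups per thread. Whether it hits the bug as often as the urllib2 workload is untested.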
> Given this observation, we suspected a connection to debian bug #722075, and
> attempted the following patch to libc6:
>
> diff -rupN eglibc-2.13/resolv/res_send.c eglibc-2.13-mod/resolv/res_send.c
> --- eglibc-2.13/resolv/res_send.c 2010-03-26 14:08:35.000000000 -0700
> +++ eglibc-2.13-mod/resolv/res_send.c 2014-07-02 10:23:28.521088097 -0700
> @@ -1330,6 +1330,7 @@ send_dg(res_state statp,
>                  retval = reopen (statp, terrno, ns);
>                  if (retval <= 0)
>                      return retval;
> +                pfd[0].fd = EXT(statp).nssocks[ns];
>              }
>          }
>          goto wait;
>
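As I read the patch, the added line refreshes the descriptor number that the wait loop polls after the socket has been reopened. Without it, the cached number can refer to whatever the process opened in the meantime. The following Python 3 sketch illustrates that general hazard outside of glibc. It is an analogy under that assumption, not the resolver code itself, and stale_fd_demo and the dictionary keys are names invented for this example.

```python
import os
import socket


def stale_fd_demo():
    """Show how a cached descriptor number can outlive the socket it named."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    cached_fd = sock.fileno()  # plays the role of pfd[0].fd
    sock.close()               # the socket is closed before being reopened

    # Something else in the process opens a file in the meantime. The kernel
    # hands out the lowest free number, usually the one that was just released.
    other_fd = os.open(os.devnull, os.O_RDONLY)

    # Stand-in for reopen(): the new socket gets a different number.
    new_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    fresh_fd = new_sock.fileno()

    result = {
        "cached_fd": cached_fd,
        "other_fd": other_fd,
        "fresh_fd": fresh_fd,
        # Unpatched behaviour: the cached number is reused as is.
        "cached_names_other_file": cached_fd == other_fd,
        # Patched behaviour: the cached number is refreshed after reopening.
        "refreshed_fd": fresh_fd,
    }
    os.close(other_fd)
    new_sock.close()
    return result


if __name__ == "__main__":
    print(stale_fd_demo())
```

In a single-threaded run the cached number typically ends up naming the unrelated file, which matches the "random file descriptors" symptom in the bug title.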
> With this single-line patch we no longer hit the 'double free or corruption'
> message even when running 100 VMs for over 5 days. I extracted the above
> code fix from https://lists.debian.org/debian-glibc/2014/06/msg00013.html,
> but modified the diff to fit on 2.13-38+deb7u1.
>
> If a fix similar to this could be included in wheezy stable at some point it
> would be much appreciated.
>
I have just committed the change to our stable branch [1]. We'll upload
the package shortly before the next Debian stable release, provided the
release team agrees with the changes (which is likely in this case).
[1] http://anonscm.debian.org/viewvc/pkg-glibc?view=revision&revision=6227
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net