[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: NPTL not properly cleaning up threads on SMP systems?



# submitters and upstream can reproduce it
tags 673596 - unreproducible
clone 673596 -1
retitle -1 pthread_join returns before child thread has been reaped
forwarded -1 http://thread.gmane.org/gmane.comp.lib.glibc.user/1356/focus=1359
# difficult
severity -1 wishlist
forwarded 673596 http://thread.gmane.org/gmane.comp.lib.glibc.user/1356
block 673596 by -1
tags 673596 + patch
quit

Thibaut Girka wrote[1]:

> I've recently tried to build eglibc from Debian's source package,
> but it aborted because of a failed test[1]: nptl/tst-eintr1.
> This test failed because one of the pthread_create calls returned EAGAIN.
> Indeed, after some time, instead of 12~23 threads,
> “grep Threads /proc/$PID/status” reveals much more threads (up to several thousands).

Siddhesh Poyarekar explained[2]:

> On a serious note, I see this all the time on my tests runs and it is
> in fact a kernel issue. The problem here is that the kernel needs to
> signal the joining thread before it exits and the time taken from that
> point to actual freeing of task resources seems to be longer than the
> time it takes for the process to spawn another thread after joining
> and for that thread to return.
>
> A very simple way to eliminate this is to introduce a very small
> usleep within the thread function.

Debian eglibc maintainers: how about this patch (untested)?

It modifies tst-eintr1 to try again instead of failing when hitting
the race.

	if (e == EAGAIN)
	  continue;

An alternative would be to mark tst-eintr1 as an expected failure, but
I prefer this workaround since it means other kinds of failure in the
test can still be caught.

It is not clear to me whether the underlying problem is a bug in
pthread_join() or just a quality of implementation issue (in which
case it would also be a bug in the testsuite).  POSIX defines
pthread_join() as waiting for the target thread's termination:

	The pthread_join() function shall suspend execution of the
	calling thread until the target thread terminates, unless the
	target thread has already terminated. 

It seems to imply that joined threads do not count towards an
implementation's limits:

	It is unspecified whether a thread that has exited but remains
	unjoined counts against {PTHREAD_THREADS_MAX}.

However, the (non-normative) rationale suggests that pthread_join()
can be emulated through a procedure that would be subject to the same
race that breaks tst-eintr1:

	It is true that a programmer could simulate this function if it
	were not provided by passing extra state as part of the argument
	to the start_routine().  The terminating thread would set a flag
	to indicate termination and broadcast a condition that is part
	of that state; a joining thread would wait on that condition
	variable. 
[...]
	Thus, while not a primitive, including pthread_join() in this
	volume of POSIX.1-2008 was considered valuable.

Meh.

Hope that helps,
Jonathan

[1] http://thread.gmane.org/gmane.comp.lib.glibc.user/1356
[2] http://thread.gmane.org/gmane.comp.lib.glibc.user/1356/focus=1358
Index: debian/changelog
===================================================================
--- debian/changelog	(revision 5284)
+++ debian/changelog	(working copy)
@@ -11,6 +11,11 @@
   * patches/hurd-i386/libpthread_nort.diff: Add patch to revert upstream librt
     usage.
 
+  [ Jonathan Nieder ]
+  * patches/any/unsubmitted-tst-eintr1-eagain.diff: new patch to work around a
+    race that lets pthread_create hit resource limits when the kernel takes
+    too long to clean up after joined threads.  Closes: #673596.
+
  -- Aurelien Jarno <aurel32@debian.org>  Sun, 03 Jun 2012 22:51:53 +0200
 
 eglibc (2.13-33) unstable; urgency=medium
Index: debian/patches/any/unsubmitted-tst-eintr1-eagain.diff
===================================================================
--- debian/patches/any/unsubmitted-tst-eintr1-eagain.diff	(revision 0)
+++ debian/patches/any/unsubmitted-tst-eintr1-eagain.diff	(working copy)
@@ -0,0 +1,25 @@
+2012-06-06  Jonathan Nieder  <jrnieder@gmail.com>
+
+	* nptl/tst-eintr1.c (tf1): Tolerate EAGAIN from pthread_create.
+
+---
+
+--- a/nptl/tst-eintr1.c
++++ b/nptl/tst-eintr1.c
+@@ -49,6 +49,16 @@
+ 	      puts ("pthread_create returned EINTR");
+ 	      exit (1);
+ 	    }
++	  if (e == EAGAIN)
++	    {
++	      /* The kernel might not have processed the last few
++	         pthread_join()s yet.  Tolerate that, but record the
++	         event in test output so attentive people reading
++	         logs can notice if pthread_join() stops working
++	         altogether.  */
++	      write (STDOUT_FILENO, "!", 1);
++	      continue;
++	    }
+ 
+ 	  char buf[100];
+ 	  printf ("tf1: pthread_create failed: %s\n",
Index: debian/patches/series
===================================================================
--- debian/patches/series	(revision 5284)
+++ debian/patches/series	(working copy)
@@ -361,3 +361,4 @@
 any/local-sunrpc-dos.diff
 any/cvs-ld.so-rpath-origin.diff
 any/cvs-pthread-builtin-expect.diff
+any/unsubmitted-tst-eintr1-eagain.diff

Reply to: