--- Begin Message ---
Package: libc6
Version: 2.7-5
Severity: normal
In my multithreaded application I'm finding that threads blocked in
pthread_cond_wait are occasionally not woken by pthread_cond_broadcast.
Some possibly relevant factors:
* This is a single CPU, single core box
* There are typically 1-3 threads calling pthread_cond_wait
* There's a single global cond used, but each thread has its own lock
* A maintenance thread (although the roles change) acquires all of the
threads' locks, ensuring they're all asleep. It then calls
pthread_cond_broadcast, followed by releasing all their locks
* The maintenance thread does this repeatedly, successfully waking up
other threads from the cond, as well as repeatedly acquiring and
releasing the hung thread's lock
* I've verified with my own logging and strace that the maintenance
thread is acquiring the same lock passed to pthread_cond_wait by the
hung thread
* A snippet from strace's log (full size 43 megs):
http://pastebin.com/f41c0c791
* I've verified with gdb that the hung thread is in "#0 0xb7edf820 in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0"
* Attaching and detaching gdb causes the hung thread to wake up and
finish normally.
I was told of a patch on IRC, but I was later told it did not affect
x86 (which I'm using). For posterity, here's what I had written:
Additionally, I was told on IRC of a patch set to glibc's locking code
that came out after 2.7. I haven't verified if these would fix it, or
if they're even related, but it's something to consider.
http://sources.redhat.com/bugzilla/show_bug.cgi?id=5240
Three changed files are linked there. I was given a 4th on IRC, which
I'm told is a correction.
http://sourceware.org/cgi-bin/cvsweb.cgi/libc/nptl/sysdeps/unix/sysv/linux/lowlevellock.c.diff?cvsroot=glibc&r1=1.18&r2=1.19
-- System Information:
Debian Release: lenny/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.22-2-k7 (SMP w/1 CPU core)
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash
Versions of packages libc6 depends on:
ii libgcc1 1:4.2.2-1 GCC support library
libc6 recommends no packages.
-- debconf information excluded
--
Adam Olsen, aka Rhamphoryncus
--- End Message ---
--- Begin Message ---
- To: 465652-done@bugs.debian.org
- Subject: Re: Bug#465652: Info received (Bug#465652: Acknowledgement (libc6: Occasional failed wakeup in pthread_cond_wait))
- From: "Adam Olsen" <rhamph@gmail.com>
- Date: Tue, 19 Feb 2008 11:11:57 -0700
- Message-id: <aac2c7cb0802191011m42f74984la4d9ec1fab26008b@mail.gmail.com>
- In-reply-to: <handler.465652.B465652.120336219127169.ackinfo@bugs.debian.org>
- References: <aac2c7cb0802181116w161ab5b5ia018286f5326ca1a@mail.gmail.com> <handler.465652.B465652.120336219127169.ackinfo@bugs.debian.org>
Sorry folks, user error. :( I finally noticed the relevant paragraph in SUSv2:
"The effect of using more than one mutex for concurrent
pthread_cond_wait() or pthread_cond_timedwait() operations on the same
condition variable is undefined; that is, a condition variable becomes
bound to a unique mutex when a thread waits on the condition variable,
and this (dynamic) binding ends when the wait returns."
In my defence, the man page is ambiguous. I interpreted it as only
explaining how to use pthread_cond_wait, not as a property of the
condition itself (otherwise why wouldn't pthread_cond_init take a
mutex argument?):
A condition variable must always be associated with a mutex, to avoid
the race condition where a thread prepares to wait on a condition
variable and another thread signals the condition just before the first
thread actually waits on it.
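
For anyone who lands on this report with the same symptom, here is a
minimal sketch of the arrangement the SUSv2 paragraph calls for: every
waiter passes the same mutex to pthread_cond_wait, and each wait sits in
a loop that re-checks a predicate. All names (lock, wake, all_parked,
generation, parked, run_demo) are hypothetical and chosen for
illustration; this is not the code from the application in this report,
and the counter-based "everyone is asleep" check is an assumed
replacement for acquiring each thread's private lock.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_WAITERS 3

/* A single mutex is the only one ever paired with these condition variables. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;       /* workers wait here */
static pthread_cond_t all_parked = PTHREAD_COND_INITIALIZER; /* maintenance waits here */

static unsigned generation = 0; /* bumped once per broadcast; the workers' predicate */
static int parked = 0;          /* workers currently waiting on 'wake' */
static int woken = 0;           /* workers that have returned from their wait */

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    unsigned seen = generation;          /* remember the state we went to sleep in */
    if (++parked == NUM_WAITERS)
        pthread_cond_signal(&all_parked);
    while (generation == seen)           /* loop guards against spurious wakeups */
        pthread_cond_wait(&wake, &lock); /* always the same mutex */
    parked--;
    woken++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Plays the maintenance role: waits until every worker is parked, then
 * wakes them all. Returns how many workers woke up. */
int run_demo(void)
{
    pthread_t threads[NUM_WAITERS];
    int rc, result;

    pthread_mutex_lock(&lock);
    woken = 0;
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WAITERS; i++) {
        rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }

    pthread_mutex_lock(&lock);
    while (parked < NUM_WAITERS)
        pthread_cond_wait(&all_parked, &lock);
    generation++;                        /* change the predicate first ... */
    pthread_cond_broadcast(&wake);       /* ... then wake everyone */
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WAITERS; i++)
        pthread_join(threads[i], NULL);

    pthread_mutex_lock(&lock);
    result = woken;
    pthread_mutex_unlock(&lock);
    return result;
}
```

Because the predicate is changed and the broadcast is issued while
holding the same mutex the waiters use, a worker cannot miss the wakeup
between checking generation and blocking, which is the race the man
page paragraph describes. Build with -pthread.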
--
Adam Olsen, aka Rhamphoryncus
--- End Message ---