Bug#884776: libc6: pthread_cond_broadcast() blocks indefinitely on process-shared cond-var



Hi,

On 2017-12-19 16:41, Florian Schmidt wrote:
> Package: libc6
> Version: 2.25-3
> Severity: important
> Tags: upstream
> 
> TL;DR: pthread_cond_broadcast() on a process shared condition variable
> will block indefinitely when another process that
> pthread_cond_wait()'ed on this condition gets killed and restarted.
> 
> starting with libc6:armd64 version 2.25-3 from debian testing/buster, our
> in-house robotics communication middleware (links_and_nodes) does no
> longer behave as expected:
> 
> this middleware uses process shared mutex'es and condition variables
> in shared memory (pthread_mutex_t, pthread_cond_t, shm_open) for
> synchronization between processes.
> 
> attached is a simple self-contained test-case where a
> "publisher"-process repeatedly increments a counter in shm (while
> holding a pshared mutex) and then broadcasts the condition.

Thanks a lot for this test case. It makes the issue much easier to
understand and will definitely help in getting it solved.

> another process (lets call it "subscriber") is blocking waiting on
> that same condition variable (with the same mutex).
> 
> this works as expected. until the subscriber is killed/terminated by
> any signal while it is waiting.
> when the subscriber is then started a 2nd time, the publisher gets
> blocked in its call to pthread_cond_broadcast()!

In practice the issue is also present with two or more subscribers: as
soon as one subscriber is killed, the problem occurs. It seems that the
condvar code is not able to detect that one of the waiters got killed
while the mutex is locked, though it is able to detect that there are
no waiters at all. That's why pthread_cond_broadcast waits indefinitely
in futex_wait.
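
To make the scenario concrete, here is a minimal sketch of the pattern
involved. It is not the attached test case: the struct layout and the
names init_shared, map_shared, publish and wait_for_update are
assumptions made for illustration, and it assumes the broadcast is done
while the mutex is still held.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical layout of the shared segment (not from the test case). */
struct shared {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    unsigned long   counter;
};

/* Initialise mutex and condvar as process-shared; returns 0 on success. */
int init_shared(struct shared *s)
{
    pthread_mutexattr_t ma;
    pthread_condattr_t ca;
    int rc;

    if ((rc = pthread_mutexattr_init(&ma)) != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(&s->mutex, &ma);
    pthread_mutexattr_destroy(&ma);
    if (rc != 0)
        return rc;

    if ((rc = pthread_condattr_init(&ca)) != 0) {
        pthread_mutex_destroy(&s->mutex);
        return rc;
    }
    rc = pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_cond_init(&s->cond, &ca);
    pthread_condattr_destroy(&ca);
    if (rc != 0) {
        pthread_mutex_destroy(&s->mutex);
        return rc;
    }
    s->counter = 0;
    return 0;
}

/* Map the named shm segment, creating it if needed; NULL on failure. */
struct shared *map_shared(const char *name)
{
    void *p;
    int fd = shm_open(name, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, sizeof(struct shared)) != 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, sizeof(struct shared), PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/* Publisher step: bump the counter and wake all waiters. */
int publish(struct shared *s)
{
    int rc = pthread_mutex_lock(&s->mutex);

    if (rc != 0)
        return rc;
    s->counter++;
    /* With the new condvar this call blocks once a waiter was killed. */
    rc = pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->mutex);
    return rc;
}

/* Subscriber step: block until the counter differs from last_seen. */
unsigned long wait_for_update(struct shared *s, unsigned long last_seen)
{
    unsigned long seen;

    pthread_mutex_lock(&s->mutex);
    while (s->counter == last_seen)
        pthread_cond_wait(&s->cond, &s->mutex); /* killed while in here */
    seen = s->counter;
    pthread_mutex_unlock(&s->mutex);
    return seen;
}
```

In this sketch only the process that creates the segment would call
init_shared() on the result of map_shared(); the publisher then loops
over publish() and each subscriber over wait_for_update(). Killing a
subscriber while it sits in pthread_cond_wait() and starting it again
corresponds to the situation reported here.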

> with glibc <= 2.24 this caused no problems. i saw that there is a new
> condvar impl in glibc 2.25 -- so this is probably something for
> upstream.

I confirm this is due to the new condvar implementation, more precisely
to the following commit:

| commit ed19993b5b0d05d62cc883571519a67dae481a14
| Author: Torvald Riegel <triegel@redhat.com>
| Date:   Wed May 25 23:43:36 2016 +0200
|
|     New condvar implementation that provides stronger ordering guarantees.

Your test case works just before this commit and fails once it is applied.
 
> with the attached test case i can reproduce this problem with debians
> libc6 version 2.25-3 (buster), 2.25-5 (sid) and 2.26-0experimental2
> (experimental).

I have also tested that the problem is still present in upstream git
HEAD.

It looks to me like the best course is to take this issue upstream. Do
you want me to forward your bug report and your example to the upstream
bugzilla [1], or do you prefer to do it yourself?

Aurelien

[1] https://sourceware.org/bugzilla

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net

