
Bug#884776: libc6: pthread_cond_broadcast() blocks indefinitely on process-shared cond-var



Package: libc6
Version: 2.25-3
Severity: important
Tags: upstream

TL;DR: pthread_cond_broadcast() on a process shared condition variable
will block indefinitely when another process that
pthread_cond_wait()'ed on this condition gets killed and restarted.

starting with libc6:amd64 version 2.25-3 from debian testing/buster, our
in-house robotics communication middleware (links_and_nodes) no
longer behaves as expected:

this middleware uses process-shared mutexes and condition variables
in shared memory (pthread_mutex_t, pthread_cond_t, shm_open) for
synchronization between processes.

attached is a simple self-contained test-case where a
"publisher"-process repeatedly increments a counter in shm (while
holding a pshared mutex) and then broadcasts the condition.

another process (let's call it "subscriber") is blocking waiting on
that same condition variable (with the same mutex).

this works as expected until the subscriber is killed/terminated by
any signal while it is waiting.
when the subscriber is then started a 2nd time, the publisher gets
blocked in its call to pthread_cond_broadcast()!

with glibc <= 2.24 this caused no problems. i saw that there is a new
condvar impl in glibc 2.25 -- so this is probably something for
upstream.

with the attached test case i can reproduce this problem with debian's
libc6 version 2.25-3 (buster), 2.25-5 (sid) and 2.26-0experimental2
(experimental).

on stable with 2.24-11+deb9u1 this problem does not occur!
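
to tell at runtime whether a process is running on one of the glibc
versions reported as affected here, something like the following could
be used. this is only a sketch: gnu_get_libc_version() is the
glibc-specific call from <gnu/libc-version.h>, while has_new_condvar()
and parse_glibc_version() are hypothetical names, and the ">= 2.25"
cutoff simply mirrors the versions tested in this report, not any
statement from upstream.

```cpp
#include <cstdio>
#include <gnu/libc-version.h>  // glibc-specific header

// Parses a "major.minor[...]" version string; returns false if malformed.
// (hypothetical helper, not part of the attached test case)
bool parse_glibc_version(const char* s, int* major, int* minor) {
	return std::sscanf(s, "%d.%d", major, minor) == 2;
}

// True if the given glibc version is 2.25 or later, i.e. within the range
// where this report observed the blocking pthread_cond_broadcast().
// Assumption: the cutoff only reflects the versions tested in this report.
bool has_new_condvar(const char* version) {
	int major = 0, minor = 0;
	if(!parse_glibc_version(version, &major, &minor))
		return false;
	return major > 2 || (major == 2 && minor >= 25);
}

// Same check, applied to the glibc the process is actually running on.
bool running_glibc_has_new_condvar() {
	return has_new_condvar(gnu_get_libc_version());
}
```

note that gnu_get_libc_version() reports the upstream version (e.g.
"2.24"), not the debian package revision.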

attached is a Makefile, main.cpp and howto-reproduce.txt which should
be all that's needed to reproduce it (tested with gcc version 7.2.1
20171205 (Debian 7.2.0-17) and others...):

  $ make
  $ ./condvar-test publisher

in another terminal:

  $ ./condvar-test subscriber

all works fine. now kill the subscriber via a signal, e.g. SIGINT with
Ctrl-C. publisher is still happy. now restart the subscriber:

  $ ./condvar-test subscriber

this will cause the publisher to get blocked in
pthread_cond_broadcast()!

known workarounds: when the subscriber gets killed/terminated anywhere
outside of the critical section / not while blocked in _wait(), the
problem does not occur!
e.g. capturing the signal, and then doing a clean shutdown after
pthread_cond_wait() returned with EAGAIN.
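
a minimal sketch of that workaround, under some assumptions: POSIX does
not require pthread_cond_wait() to return after a signal handler has run
(the wait may simply resume), so this variant polls a flag using
pthread_cond_timedwait() against CLOCK_REALTIME, the clock the test
case's condition variable is configured with. install_stop_handlers(),
wait_for_change() and stop_requested are hypothetical names, not part of
the attached test case, and none of this can help against SIGKILL.

```cpp
#include <cerrno>
#include <csignal>
#include <ctime>
#include <pthread.h>

// Set from the signal handler, polled by the wait loop (hypothetical name).
static volatile sig_atomic_t stop_requested = 0;

static void on_stop_signal(int) { stop_requested = 1; }

// Installs the flag-setting handler for SIGINT and SIGTERM.
bool install_stop_handlers() {
	struct sigaction sa = {};
	sa.sa_handler = on_stop_signal;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGINT, &sa, nullptr) == 0
	    && sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Caller must hold *mutex. Waits until *counter differs from last_seen.
// Returns true if the counter changed, false if a stop was requested or
// pthread_cond_timedwait() failed with anything other than a timeout.
bool wait_for_change(pthread_cond_t* cond, pthread_mutex_t* mutex,
                     const unsigned int* counter, unsigned int last_seen,
                     long poll_ms) {
	while(*counter == last_seen) {
		if(stop_requested)
			return false; // leave before blocking again
		struct timespec deadline;
		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += poll_ms / 1000;
		deadline.tv_nsec += (poll_ms % 1000) * 1000000L;
		if(deadline.tv_nsec >= 1000000000L) {
			deadline.tv_sec += 1;
			deadline.tv_nsec -= 1000000000L;
		}
		int ret = pthread_cond_timedwait(cond, mutex, &deadline);
		if(ret != 0 && ret != ETIMEDOUT)
			return false;
	}
	return true;
}
```

the subscriber loop would call wait_for_change() with the mutex held
and, when it returns false, unlock the mutex and exit normally, so the
process never terminates while it is still blocked on the condition.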

this will be a major problem for us, because this synchronization is
provided by means of a shared library, and we can hardly control how
processes terminate. (telling the average user how to do signal
handling is not very convincing, and letting the library catch
any/all signals to be able to return cleanly from _wait() is also not a
good option...)

or is this usage of pthread_mutex_t/pthread_cond_t considered to be bad/wrong? if so, how?

still, it's an unexpected change in behaviour and i currently don't see
a clean way to solve this.

(i could also successfully reproduce this issue on different machines
with glibc >= 2.25, also with older 3.x and newer 4.9 kernels)

-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 4.8.0-rt1+ (SMP w/4 CPU cores; PREEMPT)
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE=C.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: unable to detect

Versions of packages libc6 depends on:
ii  libgcc1  1:7.2.0-17

libc6 recommends no packages.

Versions of packages libc6 suggests:
ii  debconf [debconf-2.0]  1.5.65
pn  glibc-doc              <none>
pn  libc-l10n              <none>
pn  locales                <none>

-- debconf information excluded

Attachment: main.cpp

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>

#include <pthread.h>

#define SHM_NAME "test_shm"

#define check_error_ret_null(func, ...) do {								\
		int ret = func(__VA_ARGS__);								\
		if(ret != 0) {										\
			fprintf(stderr, "error calling " #func "(): %d %s!\n", ret, strerror(ret));	\
			return NULL;									\
		}											\
	} while(0)


unsigned int t = 10;

typedef struct {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	
	unsigned int triggered;
} shm_t;

shm_t* mmap_fd(int fd) {
	shm_t* ret = (shm_t*)mmap(NULL, sizeof(shm_t), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE | MAP_LOCKED, fd, 0);
	close(fd);
	if((void*)ret == MAP_FAILED) {
		fprintf(stderr, "could not mmap: %d %s!\n", errno, strerror(errno));
		return NULL;
	}
	return ret;
}

shm_t* create_shm() {	
	int fd = shm_open(SHM_NAME, O_RDWR | O_CREAT, 0777);
	if(fd == -1) {
		fprintf(stderr, "could not shm_open(" SHM_NAME ", O_CREAT): %d, %s\n", errno, strerror(errno));
		return NULL;
	}
	if(ftruncate(fd, sizeof(shm_t))) {
		fprintf(stderr, "could not ftruncate shm %d to %u\n", fd, (unsigned int)sizeof(shm_t));
		return NULL;
	}
	shm_t* shm = mmap_fd(fd);
	if(!shm)
		return NULL;
	memset(shm, 0, sizeof(shm_t));
	{ // init mutex
		pthread_mutexattr_t attr;
		check_error_ret_null(pthread_mutexattr_init,          &attr);
		check_error_ret_null(pthread_mutexattr_setpshared,    &attr, PTHREAD_PROCESS_SHARED); // pshared!
		check_error_ret_null(pthread_mutexattr_setrobust_np,  &attr, PTHREAD_MUTEX_ROBUST); // if owner dies _lock() returns EOWNERDEAD
		check_error_ret_null(pthread_mutexattr_setprotocol,   &attr, PTHREAD_PRIO_INHERIT);
		check_error_ret_null(pthread_mutex_init, &shm->mutex, &attr);
		check_error_ret_null(pthread_mutexattr_destroy,       &attr);
	}
	

	{ // init condition variable
		pthread_condattr_t attr;
		check_error_ret_null(pthread_condattr_init,         &attr);
		check_error_ret_null(pthread_condattr_setpshared,   &attr, PTHREAD_PROCESS_SHARED); // pshared!
		check_error_ret_null(pthread_condattr_setclock,     &attr, CLOCK_REALTIME);
		check_error_ret_null(pthread_cond_init, &shm->cond, &attr);
		check_error_ret_null(pthread_condattr_destroy,      &attr);
	}
	
	return shm;
}

shm_t* open_shm() {
	int fd = shm_open(SHM_NAME, O_RDWR, 0777);
	if(fd == -1) {
		fprintf(stderr, "could not shm_open(" SHM_NAME "): %d, %s\n", errno, strerror(errno));
		return NULL;
	}
	return mmap_fd(fd);
}

int publisher() {
	shm_t* shm = create_shm();
	if(!shm)
		return 1;
	
	while(true) {
		int ret = pthread_mutex_lock(&shm->mutex);
		if(ret != 0) {
			fprintf(stderr, "pthread_mutex_lock() returned %d: %s\n", ret, strerror(ret));
			if(ret == EOWNERDEAD)
				pthread_mutex_consistent_np(&shm->mutex);
			else
				return 1;
		}
		
		shm->triggered ++; // do something
		printf("tick %u!\n", shm->triggered);
		pthread_mutex_unlock(&shm->mutex);
		
		pthread_cond_broadcast(&shm->cond);
    
		sleep(1);
	}
	return 0;
}
int subscriber() {
	shm_t* shm = open_shm();
	if(!shm)
		return 1;
	
	unsigned int last_triggered = shm->triggered;
	while(true) {
		int ret = pthread_mutex_lock(&shm->mutex);
		if(ret != 0) {
			fprintf(stderr, "pthread_mutex_lock() returned %d: %s\n", ret, strerror(ret));
			if(ret == EOWNERDEAD)
				pthread_mutex_consistent_np(&shm->mutex);
			else
				return 1;
		}		
		while(shm->triggered == last_triggered) {
			pthread_cond_wait(&shm->cond, &shm->mutex);
			if(shm->triggered == last_triggered)
				printf(" ...spurious wakeup\n");
		}
		last_triggered = shm->triggered;
		pthread_mutex_unlock(&shm->mutex);
		
		printf("tock %u\n", last_triggered);
	}
	return 0;
}


int do_start(const char* what) {
	if(!strcmp(what, "publisher"))
		return publisher();
	if(!strcmp(what, "subscriber"))
		return subscriber();
	fprintf(stderr, "invalid arg: '%s'\n", what);
	return 1;
}

int main(int argc, char* argv[]) {
	if(argc != 2 || do_start(argv[1])) {
		printf("usage:\n"
		       "first start publisher:\n"
		       "  %s publisher\n"
		       "then, in another shell start subscriber:\n"
		       "  %s subscriber\n"
		       "... this should work as expected.\n"
		       "now kill the subscriber and restart it:\n"
		       "  %s subscriber\n"
		       "on glibc-2.25 it will block itself AND the publisher!\n",
		       argv[0],
		       argv[0],
		       argv[0]);
		return 1;
	}
	return 0;
}

Attachment: howto-reproduce.txt

i can reproduce the following problem on a debian testing(buster) with

  $ dpkg -s libc6:amd64 | grep Version
  Version: 2.25-3

the same test executed on a debian stable(stretch) with

  $ dpkg -s libc6:amd64 | grep Version
  Version: 2.24-11+deb9u1

does not show any problems and works as expected.

kernel is 4.8.0 on an intel core2 quad cpu with ht disabled.
gcc version 7.2.1 20171205 (Debian 7.2.0-17).

## how to reproduce

build the executable by calling make:

  $ make

start the publisher. it will create a shared-memory segment with a
process-shared mutex and condition variable:

  $ ./condvar-test publisher

this publisher will broadcast the condition every second with a new
counter value.

in another terminal start the subscriber. it will grab the mutex and,
if there is no new counter value, do a blocking wait on the
condition variable until it sees a new counter value. this value is
then stored, the mutex is unlocked, and the stored value is printed:

  $ ./condvar-test subscriber

this is the normal case and it should work as expected.

now stop the subscriber by sending an appropriate signal, you could
for example press Ctrl-C on your terminal to send SIGINT:

  ...
  tock XY
  ^C
  $

now the publisher keeps running, which is fine.

my expectation would now be that i can restart the subscriber and it
would again print counter values. but with glibc-2.25 it does not:

  $ ./condvar-test subscriber

it does not output anything, but what's even worse is that now the
publisher blocks in the call to pthread_cond_broadcast()!

a gdb backtrace of the publisher process looks like this:
(gdb) bt
#0  0x00007f23ce759847 in futex_wait (private=<optimized out>, expected=3, futex_word=0x7f23cf638038) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=3, futex_word=0x7f23cf638038) at ../sysdeps/nptl/futex-internal.h:135
#2  __condvar_quiesce_and_switch_g1 (private=<optimized out>, g1index=<synthetic pointer>, wseq=<optimized out>, cond=0x7f23cf638028) at pthread_cond_common.c:413
#3  __pthread_cond_broadcast (cond=0x7f23cf638028) at pthread_cond_broadcast.c:73
#4  0x0000563d0df3f755 in publisher () at main.cpp:109
#5  0x0000563d0df3f895 in do_start (what=0x7ffe636bdc07 "publisher") at main.cpp:146
#6  0x0000563d0df3f97e in main (argc=2, argv=0x7ffe636bd1c8) at main.cpp:167

while the subscriber process hangs (as somewhat expected) in pthread_cond_wait():
(gdb) bt
#0  0x00007ffff7118b26 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffff7ff4054) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7ffff7ff4000, cond=0x7ffff7ff4028) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7ffff7ff4028, mutex=0x7ffff7ff4000) at pthread_cond_wait.c:655
#3  0x0000555555555820 in subscriber () at main.cpp:131
#4  0x00005555555558b3 in do_start (what=0x7fffffffebfc "subscriber") at main.cpp:148
#5  0x000055555555597e in main (argc=2, argv=0x7fffffffe978) at main.cpp:167

Attachment: Makefile

CXXFLAGS ?= -Wall -g -O0 -pthread
LDFLAGS ?= -pthread -lrt

condvar-test: main.cpp Makefile
	$(CXX) -o $@ $< $(CXXFLAGS) $(LDFLAGS)

