
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)



On 31/03/2019 at 22:53, Alexis Murzeau wrote:
> On 31/03/2019 at 15:19, Aurelien Jarno wrote:
>> This bug is very likely present in old glibc versions. It was brought
>> to light by enabling TLS support in openblas, not by a new glibc
>> version.
>>
>> Right now the bug has been worked around by disabling TLS support in
>> openblas. The way to handle this bug is to write a small test case that
>> can be forwarded upstream. It's not an easy task though.
>>
> 
> Hi,
> 
> I've made a test case here [0].
> I haven't tested it against the latest glibc commit, but it does
> reproduce the deadlock with glibc 2.28 on Linux.
> 
> To run the test case, do this:
> ```
> gcc test_compiler_tls.c -o test_compiler_tls -ldl -g -pthread
> gcc test_compiler_tls_lib.c -shared -o test_compiler_tls_lib.so \
>  -g -pthread -fPIC
> ./test_compiler_tls ./test_compiler_tls_lib &
> gdb --pid $! -ex 'thr a a bt'
> ```
> 
> This reproduces the deadlock that I found in openblas:
> 1- The test_thread opens the library, which calls its constructor
> 2- The library's constructor creates a thread,
>    `thread_that_use_tls_after_sleep`
> 3- The thread `thread_that_use_tls_after_sleep` sleeps for 100ms (this
>    needs to be long enough that dlclose is called before the sleep ends)
> 4- The test_thread closes the library with dlclose
> 5- dlclose locks `dl_load_lock` and calls the library's destructor
> 6- The library's destructor waits for `thread_that_use_tls_after_sleep`
>    to finish
> 7- The `thread_that_use_tls_after_sleep` thread tries to read the TLS
>    variable, which causes a call to `__tls_get_addr`
> 8- `__tls_get_addr` blocks in `tls_get_addr_tail`, trying to take the
>    same `dl_load_lock` that dlclose holds
> 9- Nothing happens: the dlclose thread holds the lock while waiting for
>    the `thread_that_use_tls_after_sleep` thread to finish, and that
>    thread is waiting for the same lock, so it never exits.
> 
> See [1] for the stacktrace.
> 
> Thread 3 is the library's thread, created in its constructor and joined
> in its destructor.
> Thread 2 is the thread that calls dlopen and dlclose.
> Thread 1 is a "monitoring" thread that implements a 10s timeout (useful
> if this test needs to run on a CI system).
> 
> Where dlclose locks `dl_load_lock`: [2]
> Where tls_get_addr_tail locks `dl_load_lock`: [3]
> 
> [0]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7
> [1]:
> https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7#file-gdb_stacktrace-txt
> [2]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-close.c#L812
> [3]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-tls.c#L761
> 

Related links:
https://bugzilla.redhat.com/show_bug.cgi?id=1409899
https://sourceware.org/bugzilla/show_bug.cgi?id=2377


Actually, the hang is caused by a C++ exception here, but that's the same
deadlock (throwing the C++ exception requires the `dl_load_lock` lock).
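
To make the lock ordering concrete, here is a minimal model of it in
plain pthreads. It is only a sketch and not glibc code: `model_load_lock`
stands in for `dl_load_lock`, and `model_close` and `model_worker` are
hypothetical names for the dlclose path and the library thread. The
worker uses a timed lock so the model reports the would-be deadlock
instead of hanging, and it is started after the lock is taken so the
ordering does not depend on a sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Stand-in for glibc's dl_load_lock (assumption: a plain mutex is enough
 * to show the ordering problem between two different threads). */
static pthread_mutex_t model_load_lock = PTHREAD_MUTEX_INITIALIZER;
static int worker_lock_result = -1;

/* Models the library thread: its TLS access needs the loader lock.
 * A real thread would block forever; here it gives up after 200 ms. */
static void *model_worker(void *arg)
{
    struct timespec deadline;

    (void)arg;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 200L * 1000 * 1000;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    worker_lock_result = pthread_mutex_timedlock(&model_load_lock, &deadline);
    if (worker_lock_result == 0)
        pthread_mutex_unlock(&model_load_lock);
    return NULL;
}

/* Models the dlclose path: take the loader lock, then run a "destructor"
 * that joins the worker while the lock is still held. */
static int model_close(void)
{
    pthread_t worker;
    int rc;

    pthread_mutex_lock(&model_load_lock);
    rc = pthread_create(&worker, NULL, model_worker, NULL);
    assert(rc == 0);
    pthread_join(worker, NULL);   /* the join happens under the lock */
    pthread_mutex_unlock(&model_load_lock);
    return worker_lock_result;    /* ETIMEDOUT: the worker never got the lock */
}
```

Calling `model_close()` returns `ETIMEDOUT`: the worker cannot take the
lock for as long as the closing thread holds it across the join. Without
the timeout, both threads would wait on each other forever.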

From the first link, it seems that creating or joining threads in
library constructors and destructors is risky and not well supported,
and that applications should just avoid doing this.
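
One way to avoid it could look like the sketch below: the library
exposes explicit init and shutdown functions instead of relying on its
constructor and destructor, so the thread is joined before dlclose takes
`dl_load_lock`. The names `mylib_init`, `mylib_shutdown` and
`worker_main` are hypothetical and are not taken from openblas or from
the test case; `__thread` assumes GCC or Clang.

```c
#include <assert.h>
#include <pthread.h>

/* TLS variable read by the worker, similar in spirit to the test case. */
static __thread int tls_value;
static pthread_t worker;
static int worker_result;

static void *worker_main(void *arg)
{
    (void)arg;
    tls_value = 42;              /* TLS access on the worker thread */
    worker_result = tls_value;
    return NULL;
}

/* Hypothetical explicit entry point: the application calls it after
 * dlopen returns, instead of the library starting the thread in its
 * constructor. Returns 0 on success. */
int mylib_init(void)
{
    return pthread_create(&worker, NULL, worker_main, NULL);
}

/* Hypothetical explicit exit point: the application calls it before
 * dlclose, so the join does not run while the loader lock is held.
 * Returns the value the worker read from TLS, or -1 if the join failed. */
int mylib_shutdown(void)
{
    if (pthread_join(worker, NULL) != 0)
        return -1;
    return worker_result;
}
```

The application would then call `mylib_init()` right after dlopen and
`mylib_shutdown()` right before dlclose. The cost is that callers have
to remember to do both, which is exactly what a constructor/destructor
pair was meant to spare them.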

I didn't find a closely related bug in the sourceware bugzilla. Maybe we
should forward our bug to them?

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F


