On 31/03/2019 at 22:53, Alexis Murzeau wrote:
> On 31/03/2019 at 15:19, Aurelien Jarno wrote:
>> This bug is very likely a bug present in old glibc versions. It has been
>> brought to light by enabling TLS support in openblas, not by a new
>> glibc version.
>>
>> Right now the bug has been worked around by disabling TLS support in
>> openblas. The way to handle this bug is to write a small testcase that
>> can be forwarded upstream. It's not an easy task though.
>>
>
> Hi,
>
> I've made a test case here [0].
> I've not tested it against the latest glibc commit,
> but it does reproduce the deadlock with glibc 2.28 on Linux.
>
> To run the test case, do this:
> ```
> gcc test_compiler_tls.c -o test_compiler_tls -ldl -g -pthread
> gcc test_compiler_tls_lib.c -shared -o test_compiler_tls_lib.so \
>   -g -pthread -fPIC
> ./test_compiler_tls ./test_compiler_tls_lib &
> gdb --pid $! -ex 'thr a a bt'
> ```
>
> This reproduces the deadlock that I found in openblas:
> 1- The test_thread opens the library, which calls its constructor
> 2- The library's constructor creates a thread,
>    `thread_that_use_tls_after_sleep`
> 3- The `thread_that_use_tls_after_sleep` thread sleeps for 100ms (this
>    needs to be long enough for dlclose to be called before the sleep ends)
> 4- The test_thread closes the library with dlclose
> 5- dlclose locks `dl_load_lock` and calls the library's destructor
> 6- The library's destructor waits for `thread_that_use_tls_after_sleep`
>    to finish
> 7- The `thread_that_use_tls_after_sleep` thread tries to read the TLS
>    variable, which causes a call to `__tls_get_addr`
> 8- `__tls_get_addr` deadlocks in `tls_get_addr_tail`, trying to
>    lock the same `dl_load_lock` that dlclose holds
> 9- Nothing happens: the dlclose thread holds the lock while waiting for
>    the `thread_that_use_tls_after_sleep` thread to finish, and that
>    thread is trying to take the same lock, so it never exits.
>
> See [1] for the stacktrace.
>
> Thread 3 is the library's thread, created in its constructor and joined
> in its destructor.
> Thread 2 is the thread that calls dlopen and dlclose.
> Thread 1 is a "monitoring" thread that implements a 10s timeout (useful
> if this test needs to run on a CI system).
>
> Where dlclose locks `dl_load_lock`: [2]
> Where tls_get_addr_tail locks `dl_load_lock`: [3]
>
> [0]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7
> [1]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7#file-gdb_stacktrace-txt
> [2]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-close.c#L812
> [3]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-tls.c#L761
>

Related links:
https://bugzilla.redhat.com/show_bug.cgi?id=1409899
https://sourceware.org/bugzilla/show_bug.cgi?id=2377

In those reports the hang is actually triggered by a C++ exception, but it
is the same deadlock (handling the C++ exception requires the
`dl_load_lock` lock).

Judging from the first link, using threads in constructors and destructors
is risky and not well supported, and applications should simply avoid
doing it.

I didn't find a closely related bug in the sourceware bugzilla; maybe we
should forward our bug to them?

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F