
Bug#934752: libc6: SEGFAULTs caused by tcache after upgrade to Buster



Package: glibc
Version: 2.28-10:amd64

Dear Maintainer,

We run manually compiled Apache and OpenSSL in Debian-based chroots on Debian servers. After upgrading the chroots from Stretch to Buster we started to see strange SEGFAULTs.
Strangely, they appear on only 2 of our 6 servers.
Servers with Xeon E5606 and Pentium G6950 CPUs run fine, while those with Xeon E3-1220 v6 crash.
It does not matter whether the host Debian is Stretch or Buster.

I was able to collect coredumps and get backtraces. They look like:
(gdb) bt
#0  tcache_get (tc_idx=0) at malloc.c:2934
#1  __GI___libc_malloc (bytes=3) at malloc.c:3042
#2  0x00007fd8cc0961be in CRYPTO_malloc (num=3, file=0x7fd8cc2a548c "ssl/statem/extensions_clnt.c", line=1376) at crypto/mem.c:222
#3  0x00007fd8cc26c7b9 in tls_parse_stoc_ec_pt_formats (s=0x7fd8640592d0, pkt=0x7fd864061810, context=256, x=0x0, chainidx=0)
    at ssl/statem/extensions_clnt.c:1376
#4  0x00007fd8cc266af5 in tls_parse_extension (s=0x7fd8640592d0, idx=TLSEXT_IDX_ec_point_formats, context=256, exts=0x7fd864061770, x=0x0, chainidx=0)
    at ssl/statem/extensions.c:715
#5  0x00007fd8cc266bbb in tls_parse_all_extensions (s=0x7fd8640592d0, context=256, exts=0x7fd864061770, x=0x0, chainidx=0, fin=1)
    at ssl/statem/extensions.c:748
#6  0x00007fd8cc2798b6 in tls_process_server_hello (s=0x7fd8640592d0, pkt=0x7fd83cff8440) at ssl/statem/statem_clnt.c:1698
#7  0x00007fd8cc277fc7 in ossl_statem_client_process_message (s=0x7fd8640592d0, pkt=0x7fd83cff8440) at ssl/statem/statem_clnt.c:1039
#8  0x00007fd8cc275499 in read_state_machine (s=0x7fd8640592d0) at ssl/statem/statem.c:636
#9  0x00007fd8cc274f15 in state_machine (s=0x7fd8640592d0, server=0) at ssl/statem/statem.c:434
#10 0x00007fd8cc274a1b in ossl_statem_connect (s=0x7fd8640592d0) at ssl/statem/statem.c:250
#11 0x00007fd8cc25b098 in SSL_do_handshake (s=0x7fd8640592d0) at ssl/ssl_lib.c:3599
#12 0x00007fd8cc257199 in SSL_connect (s=0x7fd8640592d0) at ssl/ssl_lib.c:1653
#13 0x00007fd8c957c934 in ssl_io_filter_handshake (filter_ctx=0x7fd85809a090) at ssl_engine_io.c:1243
#14 0x00007fd8c957deca in ssl_io_filter_output (f=0x7fd85809a0e8, bb=0x7fd85406b8b0) at ssl_engine_io.c:1760
..

(gdb) bt
#0  tcache_get (tc_idx=0) at malloc.c:2934
#1  __GI___libc_malloc (bytes=16) at malloc.c:3042
#2  0x00007fd8cc0961be in CRYPTO_malloc (num=16, file=0x7fd8cc159913 "crypto/bio/bss_mem.c", line=115) at crypto/mem.c:222
#3  0x00007fd8cc0961f1 in CRYPTO_zalloc (num=16, file=0x7fd8cc159913 "crypto/bio/bss_mem.c", line=115) at crypto/mem.c:230
#4  0x00007fd8cbf9ca0a in mem_init (bi=0x7fd860044130, flags=0) at crypto/bio/bss_mem.c:115
#5  0x00007fd8cbf9cb3d in mem_new (bi=0x7fd860044130) at crypto/bio/bss_mem.c:138
#6  0x00007fd8cbf9541a in BIO_new (method=0x7fd8cc204980 <mem_method>) at crypto/bio/bio_lib.c:94
#7  0x00007fd8cc2454a3 in ssl3_init_finished_mac (s=0x7fd8600a7be0) at ssl/s3_enc.c:322
#8  0x00007fd8cc281eae in tls_setup_handshake (s=0x7fd8600a7be0) at ssl/statem/statem_lib.c:91
#9  0x00007fd8cc274ea2 in state_machine (s=0x7fd8600a7be0, server=0) at ssl/statem/statem.c:419
#10 0x00007fd8cc274a1b in ossl_statem_connect (s=0x7fd8600a7be0) at ssl/statem/statem.c:250
#11 0x00007fd8cc25b098 in SSL_do_handshake (s=0x7fd8600a7be0) at ssl/ssl_lib.c:3599
#12 0x00007fd8cc257199 in SSL_connect (s=0x7fd8600a7be0) at ssl/ssl_lib.c:1653
#13 0x00007fd8c957c934 in ssl_io_filter_handshake (filter_ctx=0x7fd8580e8b78) at ssl_engine_io.c:1243
#14 0x00007fd8c957deca in ssl_io_filter_output (f=0x7fd8580e8bd0, bb=0x55b212b0d518) at ssl_engine_io.c:1760
..

The SSLv3 and TLS code paths look too distinct to be causing the same problem themselves; both backtraces end in tcache_get.
Since the SEGFAULTs seem related to memory allocation in the new libc and to CPU performance, I searched and found
http://51.15.138.76/patch/17499/
where Wilco Dijkstra discusses some problems with tcache, one of which "leads to various crashes in benchtests".

As a workaround I tried setting
export GLIBC_TUNABLES=glibc.malloc.tcache_count=0
in the Apache startup script, and I have seen no SEGFAULTs since.

I have coredumps, but they contain production private keys for Apache, so I can't share them, and to make things even worse they are 1.6 GB each.

I understand this is a heisenbug which you probably won't be able to reproduce. The CPU model dependency is beyond my comprehension. I'm curious whether you are familiar with the new tcache and whether you think the patch under discussion can help. I'll try to build a libc6 package with it to confirm the fix, but I'm confused by the patch tree so far.

-- System Information:
Debian Release: Buster
Architecture: amd64 (x86_64)
Kernel: Linux 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 801ba1f499b566e677b763fc84f8ba86f4f7ccd0..4db7283cc27118cd7d39410febf7be8f7633780a 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2915,10 +2915,12 @@ typedef struct tcache_entry
    time), this is for performance reasons.  */
 typedef struct tcache_perthread_struct
 {
-  char counts[TCACHE_MAX_BINS];
+  unsigned short counts[TCACHE_MAX_BINS];
   tcache_entry *entries[TCACHE_MAX_BINS];
 } tcache_perthread_struct;
 
+#define MAX_TCACHE_COUNT 65535	/* Maximum value of counts[] entries.  */
+
 static __thread bool tcache_shutting_down = false;
 static __thread tcache_perthread_struct *tcache = NULL;
 
@@ -5114,8 +5116,11 @@ do_set_tcache_max (size_t value)
 static __always_inline int
 do_set_tcache_count (size_t value)
 {
-  LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
-  mp_.tcache_count = value;
+  if (value <= MAX_TCACHE_COUNT)
+    {
+      LIBC_PROBE (memory_tunable_tcache_count, 2, value, mp_.tcache_count);
+      mp_.tcache_count = value;
+    }
   return 1;
 }
 
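
To make the diff above easier to follow, here is a minimal sketch of what it appears to change, as far as I can read it: the per-bin counter is widened from char to unsigned short, and tcache_count tunable values above 65535 are silently ignored. This is not glibc code. The sketch_* names and the starting value are hypothetical stand-ins; only the 65535 limit and the "ignore if too large" logic are taken from the diff.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest value the old counts[] element (plain char) can hold.
   This is 127 on amd64, where plain char is signed. */
#define SKETCH_OLD_COUNT_MAX CHAR_MAX

/* Same value as MAX_TCACHE_COUNT in the diff. */
#define SKETCH_NEW_COUNT_MAX 65535

/* The new limit must fit the widened counter type (unsigned short). */
static_assert(SKETCH_NEW_COUNT_MAX <= USHRT_MAX, "limit must fit the counter");

/* Hypothetical stand-in for mp_.tcache_count; the starting value is
   illustrative only. */
static size_t sketch_tcache_count = 7;

/* Mirrors the patched do_set_tcache_count: a value that the counter
   cannot represent is ignored, and the function still returns 1. */
static int sketch_set_tcache_count(size_t value)
{
  if (value <= SKETCH_NEW_COUNT_MAX)
    sketch_tcache_count = value;
  return 1;
}

/* True if a per-bin limit can be represented by a counter whose
   largest value is counter_max. */
static bool sketch_limit_fits(size_t limit, size_t counter_max)
{
  return limit <= counter_max;
}
```

With these helpers, sketch_limit_fits(300, SKETCH_OLD_COUNT_MAX) is false while sketch_limit_fits(300, SKETCH_NEW_COUNT_MAX) is true, which is the range problem the wider counter seems meant to address. Note that the workaround value 0 is within range for both counter widths, so the sketch says nothing about whether this diff addresses the crashes reported here.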
