Hi,

While trying to debug a segfault in one of Ganeti's Haskell daemons (#751886), I came across a memory corruption bug which I can only assume comes from the GHC RTS "hijacking" all of GMP's memory management in order to manage it via the storage manager (SM)[1].

As outlined in #751886[2], the said daemon uses FFI calls to libcurl to initiate TLS-encrypted communications. Currently, the Haskell bindings are linked against the GnuTLS version of libcurl, which was recently updated to link against gnutls28 instead of gnutls26. gnutls28 uses nettle (and thus GMP) for operations on cryptographic material, and what *presumably* happens is the following:

 1. A curl multi handle is constructed, with the SSL key and certificate loaded via gnutls/nettle. Nettle uses GMP's data types to store key parameters, and GMP's memory is allocated from GHC's heap[1].

 2. Going back and forth between Haskell and C space, a GC run is eventually triggered. The GC cannot find any Haskell object references to the memory allocated by GMP calls made via FFI, and thus marks it as free.

 3. Some other object takes over the heap chunk, and by the next FFI call the SSL keys have been overwritten by unrelated data. The result is an unrecoverable SSL error ("Decrypt error") or, worse, a segfault.

Now, this looks like a pretty ugly situation, primarily because GMP is a widely-used library and also because FFI is widely used to interface with a lot of external libraries. There are many ways around or out of this situation, all of them with their disadvantages:

 1. Have haskell-curl depend on the OpenSSL version of libcurl. This looks more like an ugly workaround and will likely have licensing implications. However, it will solve #751886 for the time being.

 2. Patch GHC's FFI implementation to reset GMP's memory allocator to/from malloc when jumping between Haskell and FFI. This is almost certainly not thread-safe for a start, and I have no idea what other implications it may have.

 3. Build GHC with integer-simple as INTEGER_LIBRARY, suffering an unspecified performance hit for really large numbers. I tried this with 7.6.3-10 from testing and the result was an FTBFS (unfortunately I don't have the error message handy). Also, upstream GHC states that they do not test their builds with integer-simple, so I expect QA to be an issue in this case.

There are almost certainly more options that I didn't consider. Could someone with better insight into GHC internals please share their views on this issue?

Thanks,
Apollon

[1] https://ghc.haskell.org/trac/ghc/wiki/ReplacingGMPNotes/TheCurrentGMPImplementation
[2] https://bugs.debian.org/751886#15
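
P.S. To make the presumed failure sequence (steps 1-3 above) a bit more concrete, here is a toy model in C. It is only a sketch of the suspected mechanism: the "heap", toy_alloc(), toy_gc() and key_survives() are all hypothetical names made up for illustration, and none of this is how the GHC storage manager or GMP is actually implemented. The only thing it mimics is "memory without a managed-side reference gets handed out again".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: a "managed heap" made of a few fixed-size chunks. */
enum { CHUNKS = 4, CHUNK_SIZE = 16 };

static unsigned char heap[CHUNKS][CHUNK_SIZE];
static int in_use[CHUNKS];  /* chunk is currently handed out */
static int rooted[CHUNKS];  /* managed ("Haskell") side holds a reference */

/* Hand out the first free chunk; is_rooted records whether the managed
   side knows about the allocation. */
static void *toy_alloc(int is_rooted)
{
    for (int i = 0; i < CHUNKS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            rooted[i] = is_rooted;
            return heap[i];
        }
    }
    return NULL; /* toy heap exhausted */
}

/* "Collect": every chunk without a managed-side reference is marked free. */
static void toy_gc(void)
{
    for (int i = 0; i < CHUNKS; i++)
        if (in_use[i] && !rooted[i])
            in_use[i] = 0;
}

/* Replays steps 1-3. Returns 1 if the key is intact afterwards, 0 if it
   was overwritten. */
static int key_survives(int key_is_rooted)
{
    memset(in_use, 0, sizeof in_use);
    memset(rooted, 0, sizeof rooted);

    /* Step 1: the C library stores key material in memory obtained from
       the managed heap; only a C pointer refers to it. */
    char *key = toy_alloc(key_is_rooted);
    assert(key != NULL);
    strcpy(key, "secret-key");

    /* Step 2: a collection runs while control is back on the managed side. */
    toy_gc();

    /* Step 3: an unrelated object is allocated and may reuse the chunk. */
    char *other = toy_alloc(1);
    assert(other != NULL);
    strcpy(other, "random-data");

    return strcmp(key, "secret-key") == 0;
}
```

In this model key_survives(0) returns 0, because the chunk holding the key is reused by the second allocation, while key_survives(1) returns 1, because a rooted chunk is never reclaimed. That difference is the whole point: the C side still holds a perfectly valid-looking pointer, but nothing on the managed side keeps the memory alive.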