Hi,
While trying to debug a segfault in one of Ganeti's Haskell daemons
(#751886), I came across a memory corruption bug which I can only assume
comes from the GHC RTS "hijacking" all of GMP's memory management to
manage it via the storage manager (SM)[1].
As outlined in #751886[2], the daemon in question uses FFI calls to libcurl
to initiate TLS-encrypted communications. Currently, the Haskell bindings
are linked against the GnuTLS version of libcurl, which was recently
updated to link against gnutls28 instead of gnutls26. gnutls28 uses
nettle (and thus GMP) for crypto material operations, and what
*presumably* happens is the following:
1. A curl multi handle is constructed, with the SSL key and certificate
loaded via gnutls/nettle. Nettle uses GMP's data types to store key
parameters, and GMP's memory is allocated from GHC's heap[1].
2. Going back and forth between Haskell and C space, eventually a GC
run is triggered. The GC cannot find any Haskell object references to
the memory allocated by GMP calls via FFI and thus treats it as free.
3. Some other object takes over the heap chunk and on the next FFI
call, the SSL keys have been overwritten by random data. The result
is an unrecoverable SSL error ("Decrypt error"), or worse, a
segfault.
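
To make the suspected sequence concrete, here is a toy model of the three
steps above in plain C. It is only a sketch under loud assumptions: it uses
neither GMP nor the GHC RTS, and every name in it (managed_alloc,
managed_collect, set_lib_alloc, toy_key, key_survives_gc) is hypothetical.
The "managed heap" is a bump allocator whose "collection" simply treats the
whole arena as free, which is far cruder than what the real storage manager
does. It only shows why a process-wide allocator hook is dangerous when the
collector cannot see the references held on the C side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a garbage-collected heap: a bump allocator over a
 * fixed arena. Hypothetical; this is NOT how GHC's storage manager works. */
#define ARENA_SIZE 256
static unsigned char arena[ARENA_SIZE];
static size_t arena_used = 0;

static void *managed_alloc(size_t n) {
    if (n > ARENA_SIZE - arena_used) return NULL;
    void *p = arena + arena_used;
    arena_used += n;
    return p;
}

/* Toy "collection": the collector knows of no managed reference into the
 * arena, so it treats all of it as free and reuses it from the start. */
static void managed_collect(void) { arena_used = 0; }

/* Process-wide allocator hook, standing in for the one GMP exposes. */
typedef void *(*alloc_fn)(size_t);
static alloc_fn lib_alloc = malloc;
static void set_lib_alloc(alloc_fn f) { lib_alloc = f; }

/* Toy "C library" that keeps key material in memory taken from the hook. */
typedef struct { unsigned char *bytes; size_t len; } toy_key;

static int toy_key_load(toy_key *k, const unsigned char *src, size_t len) {
    assert(k != NULL && src != NULL);
    k->bytes = lib_alloc(len);
    if (k->bytes == NULL) return -1;
    memcpy(k->bytes, src, len);
    k->len = len;
    return 0;
}

/* Returns 1 if the key is intact after a collection followed by a fresh
 * managed allocation, 0 if it was overwritten, -1 on allocation failure. */
int key_survives_gc(int hijack_allocator) {
    static const unsigned char secret[] = "0123456789abcdef";
    toy_key key;
    int intact = -1;

    arena_used = 0;
    set_lib_alloc(hijack_allocator ? managed_alloc : malloc);

    /* Step 1: the library stores key material via the allocator hook. */
    if (toy_key_load(&key, secret, sizeof secret) == 0) {
        /* Step 2: a collection runs; it cannot see the library's pointer. */
        managed_collect();

        /* Step 3: another managed object takes over the same chunk. */
        void *other = managed_alloc(sizeof secret);
        if (other != NULL) {
            memset(other, 0xAA, sizeof secret);
            intact = memcmp(key.bytes, secret, key.len) == 0;
        }
        if (!hijack_allocator) free(key.bytes);
    }

    set_lib_alloc(malloc);
    return intact;
}
```

In this model key_survives_gc(0) returns 1, because the key lives in
malloc'ed memory that the toy collector never touches, while
key_survives_gc(1) returns 0, because the key's bytes have been overwritten
by the later managed allocation. Whether the real RTS behaves analogously is
exactly the open question here.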
Now, this looks like a pretty ugly situation, primarily because GMP is a
widely-used library and also because FFI is widely used to interface
with a lot of external libraries.
There are several ways around or out of this situation, all of them with
their disadvantages:
1. Have haskell-curl depend on the OpenSSL version of libcurl. This
looks more like an ugly workaround and will likely have licensing
implications. However, it will solve #751886 for the time being.
2. Patch GHC's FFI implementation to reset GMP's memory allocator
to/from malloc when jumping between Haskell and FFI. This is almost
certainly not thread-safe for a start, and I have no idea what other
implications it may have.
3. Build GHC with integer-simple as INTEGER_LIBRARY, suffering an
unspecified performance hit for really large numbers. I tried this
with 7.6.3-10 from testing and the result was an FTBFS (unfortunately I
don't have the error message handy). Also, upstream GHC states that
they do not test their builds with integer-simple, so I expect QA to
be an issue in this case.
There are almost certainly more options that I didn't consider. Could
someone with better insight into GHC internals please share their views on
this issue?
Thanks,
Apollon
[1] https://ghc.haskell.org/trac/ghc/wiki/ReplacingGMPNotes/TheCurrentGMPImplementation
[2] https://bugs.debian.org/751886#15