
Bug#723982: [amd64/g++] Suspected toolchain bug causing dlopen to segfault



reassign 723982 tulip
thanks

On Sat, Feb 01, 2014 at 02:27:49PM +0100, Yann Dirson wrote:
> [resend with bugs CC'd]
> 
> Hello,
> 
> Context:
> 
> http://bugs.debian.org/734318 - tulip: [amd64] segfaults inside dlopen when loading plugins
> http://bugs.debian.org/723982 - dlopen: segfaults right inside call_init
> 
> What we get here is a number of plugins that when dlopen'd cause an
> obscure segfault inside libc code.  Upstream (CC'd) say they have
> heard of such problems (on Ubuntu 13.10), that people have worked
> around by downgrading the compiler.
> 
> This sounds like either a toolchain regression, or possibly some
> edge-case that worked by chance with old compilers and now fail.

This is exactly that: the bug is in tulip, and up to now it worked only by
chance on x86_64. The segfault occurs in dl-init.c when call_init is
calling all the init functions from DT_INIT_ARRAY. This is done in C by
this code:

|      addrs = (ElfW(Addr) *) (init_array->d_un.d_ptr + l->l_addr);
|      for (j = 0; j < jm; ++j)
|        ((init_t) addrs[j]) (argc, argv, env);

which is compiled into the following assembly code:

|    0x00007ffff7deb926 <+134>:   lea    0x8(%rbx,%rax,8),%r14
|    0x00007ffff7deb92b <+139>:   nopl   0x0(%rax,%rax,1)
|    0x00007ffff7deb930 <+144>:   mov    %r13,%rdx
|    0x00007ffff7deb933 <+147>:   mov    %r12,%rsi
|    0x00007ffff7deb936 <+150>:   mov    %ebp,%edi
|    0x00007ffff7deb938 <+152>:   callq  *(%rbx)
|    0x00007ffff7deb93a <+154>:   add    $0x8,%rbx
|    0x00007ffff7deb93e <+158>:   cmp    %r14,%rbx
|    0x00007ffff7deb941 <+161>:   jne    0x7ffff7deb930 <call_init+144>
|    0x00007ffff7deb943 <+163>:   pop    %rbx
|    0x00007ffff7deb944 <+164>:   pop    %rbp
|    0x00007ffff7deb945 <+165>:   pop    %r12
|    0x00007ffff7deb947 <+167>:   pop    %r13
|    0x00007ffff7deb949 <+169>:   pop    %r14
|    0x00007ffff7deb94b <+171>:   retq


As you can see, the value of addrs is stored in %rbx and is incremented
by 8 on each iteration. The segfault occurs at address 0x00007ffff7deb938
when trying to dereference %rbx. When it happens, %rbx has its upper
32 bits clobbered and thus contains only the lower 32 bits of the address
of addrs[j].

Tracing that with GDB, it appeared that %rbx is clobbered in the
System::init constructor from tulip. Among other things, this code probes
the CPU with the CPUID instruction, using inline assembly:

|        __asm__ __volatile__ ("xchgl    %%ebx,%0\n\t"
|                                                "cpuid  \n\t"
|                                                "xchgl  %%ebx,%0\n\t"
|                                                : "+r" (b), "=a" (a), "=c" (c), "=d" (d)
|                                                : "1" (infoType), "2" (c));

As you can see, %ebx is saved with xchgl before the cpuid instruction
and restored the same way afterwards. While that works correctly on x86,
on x86_64 each 32-bit write to %ebx zeroes the upper 32 bits of %rbx.
BOOM!
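
To make the zeroing visible outside of tulip, here is a minimal sketch
(x86_64 and GCC-style inline assembly only). The function name
value_after_xchgl is made up for this illustration and does not come from
tulip or glibc. It does the same save/restore dance with xchgl on a 64-bit
value and returns what is left of it:

```c
#include <stdint.h>

/* Hypothetical helper, x86_64 only: swap the low 32 bits of `value`
 * with a scratch register and swap them back, as the tulip code does
 * with %ebx. Any 32-bit register write zero-extends into the full
 * 64-bit register, so the upper half of `value` does not survive. */
static uint64_t value_after_xchgl(uint64_t value)
{
    uint64_t scratch = 0;

    __asm__ ("xchgl %k0, %k1\n\t"   /* "save": both upper halves are zeroed */
             "xchgl %k0, %k1"       /* "restore": only the low 32 bits come back */
             : "+r" (value), "+r" (scratch));
    return value;
}
```

Feeding it a pointer-like value such as 0x00007ffff7deb938 gives back
0x00000000f7deb938, which is the kind of truncated address %rbx ends up
holding in call_init.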

I would suggest using <cpuid.h> (which is available since GCC 4.4)
instead of this buggy assembly code to probe the CPU. In the meantime I
am reassigning the bug to tulip.
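
As a rough sketch of what that could look like (not a patch against
tulip; the helper name cpu_vendor is invented for this example), GCC's
<cpuid.h> provides __get_cpuid(), which takes care of preserving %rbx
where needed. Reading the vendor string from leaf 0 would be something
like:

```c
#include <cpuid.h>
#include <string.h>

/* Hypothetical helper: copy the 12-byte CPU vendor string into `vendor`
 * and NUL-terminate it. Returns 1 on success, 0 if CPUID leaf 0 is not
 * supported. */
static int cpu_vendor(char vendor[13])
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 when the requested leaf is unsupported. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;

    /* Leaf 0 returns the vendor string in EBX, EDX, ECX, in that order. */
    memcpy(vendor, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return 1;
}
```

The other leaves that System::init needs can presumably be queried the
same way by changing the first argument, with no hand-written register
juggling left in the application.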

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net

