Re: mdadm gives segmentation fault on wheezy. RAID array now incomplete.



Hendrik Boom wrote:
> I ran
> mdadm /dev/md1 --add /dev/sdd2
> and got a segmentation fault.

Ouch.  Scary!

> april:/farhome/hendrik# cat /proc/mdstat
> Personalities : [raid1] 
> md1 : active raid1 sdb2[1]
>       2391295864 blocks super 1.2 [2/1] [_U]
>       
> md0 : active raid1 sda4[0] sdc4[1]
>       706337792 blocks [2/2] [UU]
>       
> unused devices: <none>
> april:/farhome/hendrik# mdadm /dev/md1 --add /dev/sdd2
> Segmentation fault
> april:/farhome/hendrik# 

I read the subsequent email responses but I think they went in the
wrong direction.  The segfault was in mdadm, not in the disk.  The
solution is therefore to find and fix what is wrong with mdadm, not
to go looking for a problem with the disk.

Or it is in a library loaded by mdadm.  But there are only three and
they are used by every program.

  $ ldd -d -r /sbin/mdadm
        linux-vdso.so.1 (0x00007fffe6bb4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f78df260000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f78df632000)

I suspect one of several possibilities.  One is some type of file
system corruption that has damaged the mdadm binary.  I would checksum
the package and see if it points to something.  If you are lucky it
will, and then you will know that is it.

  # debsums mdadm
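
Since mdadm links against libc, it may be worth checksumming that
package too.  Here is a small sketch of how that could be wrapped up.
The function name verify_pkgs and the DEBSUMS override are made up
for illustration; the only real debsums feature used is -s, which
reports errors only and exits non-zero when a check fails.

```shell
# Sketch: checksum several packages and say which ones fail.
# verify_pkgs and the DEBSUMS variable are illustrative names,
# not part of debsums itself.
verify_pkgs() {
    rc=0
    for pkg in "$@"; do
        # -s: silent, only report errors; non-zero exit on failure
        if "${DEBSUMS:-debsums}" -s "$pkg"; then
            echo "ok: $pkg"
        else
            # a mismatch, or debsums could not check the package
            echo "FAILED: $pkg"
            rc=1
        fi
    done
    return $rc
}
```

Run as root it would be called as 'verify_pkgs mdadm libc6'.  The
libc6 package name is my assumption for a Wheezy amd64 system.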

It could also be some type of API mismatch between versions of the
program, libraries, and kernel system calls.  I don't know.  I am
reaching on this one.

First I would make sure your system is up to date all around.  You
said you are using Wheezy.  I have seen people think they were up to
date but forget to run 'update' first, and so they actually were not.
I have seen people have failures during the upgrade but not notice
them, and so they had broken packages and did not know it.

  apt-get update
  apt-get upgrade
  apt-get dist-upgrade
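
To catch the case of a half-finished upgrade, one possible check is
to look at the Status line printed by 'dpkg -s'.  This is only a
sketch, and the helper name pkg_status_ok is made up for illustration.

```shell
# Sketch: succeed only if the `dpkg -s PKG` output on stdin says the
# package is fully installed.  pkg_status_ok is an illustrative name.
pkg_status_ok() {
    grep -q '^Status: install ok installed$'
}
```

It would be used as 'dpkg -s mdadm | pkg_status_ok || echo broken'.
Separately, 'dpkg --audit' lists packages that were left only
partially installed, and prints nothing when there are none.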

You might try re-installing just the mdadm package.

  apt-get install --reinstall mdadm

> What now?

I strongly suspect a broken system, because mdadm on Wheezy is
working fine for zillions of other people.  If it were a bug in mdadm
then I suspect many others would have hit it, and they have not.  So I
suspect something uniquely wrong on your system.  That is why I think
you should start by figuring out what specifically is wrong on your
system and fixing it there.

If I could not fix the problem by any other means then two more
difficult options would be available.  First, I would shut down,
remove the disks from the faulty system, attach them to a different
known good system, and use the working mdadm there to repair the
array.  This would actually be a good test of something else too.  If
the problem followed the disks to the known good system then it is
clearly a data dependent bug in mdadm.  If not then the original
system is broken in some way.  Afterward you could move the disks back
to the original machine.  Since the raid would have been sync'd on the
other machine, it should also come up sync'd back on the original
machine.  That would not really address the mdadm segfault problem.
However you might not care at that point, unless some other problem
pops up.
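
On either machine, whether an array is still missing a member can be
read off the [n/m] counter in /proc/mdstat, as in the [2/1] shown for
md1 above.  A rough sketch of automating that check follows.  The
function name degraded_arrays is made up, and the parsing assumes the
mdstat layout shown in the quoted output.

```shell
# Sketch: print the name of every md array whose [n/m] counter shows
# fewer active members (m) than expected (n).
# degraded_arrays is an illustrative name.
degraded_arrays() {
    # $1: file to read, defaults to /proc/mdstat
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            # strip the brackets, then split "n/m" on the slash
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

With the mdstat contents quoted above this prints md1 and nothing for
md0.  Once the resync has finished it should print nothing at all.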

The second option would be to get the source to mdadm and compile it
locally on the system, then step through the program in the debugger.
While running in the debugger the segfault will be trapped and you
should be able to see what part of the code is triggering the problem.

  # apt-get build-dep mdadm
  $ apt-get source mdadm
  $ cd mdadm-3.2.5
  $ ./debian/rules build
  $ ./mdadm --version
  mdadm - v3.2.5 - 18th May 2012

Do the build-dep as root.  Do the rest as yourself, non-root.  How to
run the debugger from there is a very long howto that will vary
depending upon many things.  I run gdb within emacs.  And for mdadm it
all needs to be run as root to have the right access.  After that I
must leave it there.  But debugging the program should allow you to
figure out what is actually segfaulting.  If it is a program bug then
it could be reported and fixed.
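
If stepping through the code is more than you want to take on, one
shorter route to just the backtrace is gdb in batch mode.  This is a
sketch only.  The function name segv_backtrace and the GDB override
are made up for illustration; -batch, -ex and --args are standard gdb
options.

```shell
# Sketch: run a command under gdb non-interactively and print a
# backtrace after it stops.  segv_backtrace and GDB are illustrative
# names, not part of gdb.
segv_backtrace() {
    # -batch: exit when the commands are done
    # -ex run: start the program; -ex bt: backtrace once it stops
    # --args: everything after this is the program and its arguments
    "${GDB:-gdb}" -batch -ex run -ex bt --args "$@"
}
```

From the build directory, as root, that would be
'segv_backtrace ./mdadm /dev/md1 --add /dev/sdd2'.  The backtrace is
only readable if the build included debugging symbols.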

Bob
