
A Very Bad umount



This has all the earmarks of a race condition because it is
totally intermittent.  It succeeds maybe 80% of the time.

	I am using rsync to back up a Linux system to a pair of
thumb drives, both of which appear to be healthy.  The mounting
process goes as follows:

# Combine two 256-GB drives into one 512-GB drive.

mount /rsnapshot1
mount /rsnapshot2
mhddfs /rsnapshot1,/rsnapshot2 /var/cache/rsnapshot -o mlimit=100M 

	If one does

# df -h /var/cache/rsnapshot
Filesystem               Size  Used Avail Use% Mounted on
/rsnapshot1;/rsnapshot2  463G  173G  267G  40% /var/cache/rsnapshot

	That all works as it should.  One can run rsnapshot and
get a backup of today's file system.

	The /etc/rsnapshot.conf file is set to call the mount
process before rsync runs and then do the umount after it
finishes:

cmd_preexec	/usr/local/etc/mtbkmedia

# Specify the path to a script (and any optional arguments) to run right
# after rsnapshot syncs files
#
cmd_postexec	/usr/local/etc/umbkmedia
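
	For reference, mtbkmedia is essentially just the mount
sequence shown above wrapped in a script.  The actual file is not
reproduced here, so the following is only a sketch of what it
presumably contains:

#!/bin/sh
# Hypothetical reconstruction of the cmd_preexec script: mount the
# two member drives, then overlay them with mhddfs.
mount /rsnapshot1 || exit 1
mount /rsnapshot2 || exit 1
mhddfs /rsnapshot1,/rsnapshot2 /var/cache/rsnapshot -o mlimit=100M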

	My problem may be with how I am unmounting everything,
so umbkmedia follows:

#!/bin/sh
umount /var/cache/rsnapshot /rsnapshot2 /rsnapshot1 
exit 0
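
	One thing worth noting about that script: the exit 0
discards umount's exit status, so cmd_postexec always looks
successful to rsnapshot even when the unmount has blown up.  A
variant that flushes dirty data first and lets the real status
propagate might look like this (just a sketch, not what currently
runs):

#!/bin/sh
# Flush dirty data before unmounting, and let umount's own exit
# status be the script's exit status so rsnapshot sees failures.
sync
umount /var/cache/rsnapshot /rsnapshot2 /rsnapshot1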

Normally, this simply works and /var/cache/rsnapshot ends up
empty, but when one of these intermittent explosions happens, I
receive the following:

Date:    Tue, 11 Sep 2018 00:06:23 -0500
From:    root@wb5agz (Cron Daemon)
Subject: Cron <root@wb5agz> /usr/local/etc/daily_backup

From root@wb5agz Tue Sep 11 00:06:24 2018

/bin/rm: cannot remove '/var/cache/rsnapshot/halfday.1/wb5agz/home/usr/lib/i386-linux-gnu': Transport endpoint is not connected
/bin/rm: cannot remove '/var/cache/rsnapshot/halfday.1/wb5agz/home/usr/lib/libgpgme-pth.so.11': Transport endpoint is not connected

	That is the beginning of what was, that day, a 152-line
message in which every error ended in
"Transport endpoint is not connected".

	When I have discovered one of these crashes, I have
re-run the script as root and it usually runs perfectly the
second time, defying the definition of madness as doing the same
thing and expecting different results.  Here, one frequently does
get different results, in the form of a proper backup.

	Today, I manually re-ran the backup and this time it
actually failed from the command line, with the same error message
for each file mentioned.  The spew frequently names a different
set of directories from one failure to the next.

	Looking at the two drives later, they are fine, except
that the latest backup is missing: rsync saw the errors, so you
are left with the last good backup.

	I did an ls of /var/cache/rsnapshot after the big spew
and again got an error about "Transport endpoint is not connected".

	I have actually tried 

umount /rsnapshot2 /rsnapshot1 /var/cache/rsnapshot

as well as

umount /var/cache/rsnapshot /rsnapshot2 /rsnapshot1

	I was thinking that the order might make a difference but
I have gotten about as many good runs with either order.
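
	Since mhddfs is a FUSE file system, another ordering I
could try is taking the overlay down with fusermount -u before
touching the member drives, for example (again, only a sketch):

#!/bin/sh
# Unmount the FUSE overlay first, then the underlying drives.
fusermount -u /var/cache/rsnapshot || exit 1
umount /rsnapshot2
umount /rsnapshot1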

	If one looks in /var/log/syslog, one sees the mounting
of the two drives with no errors, and no errors are reported if
one watches it happen.

	Are there any ideas on how to do the umount so as to
ensure that all the inodes are in the state they should be in
before the umount is done?  Normally, umount blocks until every
inode is settled and then succeeds.
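
	Something along these lines is the kind of thing I have
in mind, assuming fuser is available (hypothetical, not what
currently runs):

#!/bin/sh
# Wait for the mhddfs mount to go quiet before unmounting: flush
# dirty data, retry while something still has files open on the
# mount, then fall back to a lazy unmount as a last resort.
sync
for i in 1 2 3 4 5; do
    fuser -m /var/cache/rsnapshot >/dev/null 2>&1 || break
    sleep 2
done
umount /var/cache/rsnapshot || umount -l /var/cache/rsnapshot
umount /rsnapshot2
umount /rsnapshot1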

	I have been chasing this rabbit for quite a while now,
and it can sometimes go weeks without a spew, just long enough to
think that the last rejiggering of the unmount order or some
other futile rearranging of the Titanic's deck chairs actually
made a difference.

	Any constructive ideas are appreciated.  If I left the
drives mounted all the time, there would be no spew, but since
these are backup drives, keeping them mounted all the time is
quite risky.

Martin McCormick WB5AGZ

