
A question about deleting a big file structure from a big disk in Jessie: Why does this work? I'm really worried.



For several years I have been making daily backups of my four Debian
computers using Rsync and a small script of my own devising. The data
has been accumulating on an external USB drive in a partition with the
label, gfx5. Some time ago I decided to make a copy of these data,
so I would have more than one copy. I had to use Rsync to do this
because if I were to use cp, the copies of files labeled by different
dates and hard-linked together on gfx5 would exceed the capacity of the
target disk (which was/is labeled gfx2). This is a simple one line
command to Rsync.  When I tried, the job would always crash well
before completion. Sometimes, a simple repeat invocation would make
further progress, sometimes not. I became curious. As I tried
different variations of how to observe the progress of transfer as it
happened, I acquired copies of failed transfers, and then discovered
that I could not reliably delete a failed copy by using the obvious,
'rm -rfv ... '
I discovered that the command 'find -depth -print -delete'
sometimes worked when 'rm -rfv ...' did not. But in both cases, when the
deletion failed it was because 'gfx2' had been remounted read-only, which
makes it impossible to update the target directory tree.
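
To illustrate the hard-link point above, here is a minimal sketch, not
the actual backup or copy command. It assumes GNU cp and GNU stat, and
the temporary directory and dated directory names are made up for the
demonstration. It builds two dated directories sharing one hard-linked
file, copies the tree with and without hard-link preservation, and
counts the distinct inodes in each result. For Rsync, the option that
preserves hard links is -H (--hard-links), which -a does not imply.

```shell
#!/bin/sh
# Count distinct inodes among regular files under a directory.
# Assumes GNU stat (the -c %i format is not portable to BSD stat).
count_inodes() {
    find "$1" -type f -exec stat -c %i {} + | sort -u | wc -l
}

# Hypothetical miniature of a dated, hard-linked backup tree.
demo=$(mktemp -d)
mkdir -p "$demo/src/2015-01-01" "$demo/src/2015-01-02"
echo "unchanged data" > "$demo/src/2015-01-01/file"
ln "$demo/src/2015-01-01/file" "$demo/src/2015-01-02/file"

cp -r "$demo/src" "$demo/plain"    # plain recursive copy: links become separate files
cp -a "$demo/src" "$demo/linked"   # archive mode: GNU cp keeps the hard links

src_inodes=$(count_inodes "$demo/src")
plain_inodes=$(count_inodes "$demo/plain")
linked_inodes=$(count_inodes "$demo/linked")
echo "source: $src_inodes  cp -r: $plain_inodes  cp -a: $linked_inodes"

rm -rf "$demo"
```

A copy that reports more inodes than the source has stored the shared
data more than once, which is how a hard-linked backup set can outgrow
a target disk of the same size.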

I have not tried it, but from my investigation I'm sure that a
massive delete of some obsolete file structure from the HD that
was /dev/sda1 during Debian install would trigger a remount-ro,
which surely would lead to a system crash in short order.

I investigated further. These investigations were done on a computer
which I call 'gq'. I set up experiments on 'gq' by using ssh to issue
commands in 'gq' from my main desktop computer, 'big'. I set up several
ssh windows into 'gq'. My first discovery was that after a crash while
attempting to delete with 'find -depth -print -delete ', there was a
long delay in remounting 'gfx2' while the mount command emptied the
journal (ext4) on 'gfx2'.

Next I tried 'find -depth -print -delete ', with some extra windows into
'gq' in which I issued the command 'sync'. The return from 'sync' was
delayed, sometimes as much as a minute, and if I didn't issue 'sync'
commands frequently enough, there was never a return from 'sync', just
the crash of the 'find' command. So frequent sync commands delayed the
crash.

I found two other ways to delay the crash:
1) using nice as in: ' nice -n 19 find -depth -print -delete'
   (this, I think, slows down the main running job in relation to the
   running of the kernel.)
2) using Ctrl-Z to pause the 'find' job for a while
   (which I think also allows the kernel to catch up with the journal)
   I could also monitor the progress of the journal run, by issuing a
   sync command in a separate ssh window. 
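
The workarounds above (frequent sync, slowing the delete down) could be
combined into one script. What follows is only a sketch under stated
assumptions: the function name throttled_delete and the batch size are
hypothetical, file names are assumed to contain no newlines, and nothing
here is verified to prevent the remount-ro. It walks the tree depth-first
as 'find -depth' does, removes one entry at a time, and calls sync after
every batch so the journal gets a chance to drain.

```shell
#!/bin/sh
# Hypothetical helper: delete a tree depth-first, calling sync after
# every BATCH removed entries. Assumes no newlines in file names.
throttled_delete() {
    target=$1
    batch=${2:-1000}     # entries removed between sync calls (a guess)
    count=0
    # -depth prints a directory only after all of its contents, so by
    # the time a directory name arrives here it is already empty.
    find "$target" -depth -print | while IFS= read -r path; do
        if [ -d "$path" ] && [ ! -L "$path" ]; then
            rmdir -- "$path"
        else
            rm -f -- "$path"
        fi
        count=$((count + 1))
        if [ $((count % batch)) -eq 0 ]; then
            sync         # block until pending writes reach the disk
        fi
    done
}

# Example call (path is a placeholder, not a real one from this report):
# nice -n 19 sh -c '. ./throttled_delete.sh; throttled_delete /path/to/failed-copy 500'
```

A smaller batch means more time spent waiting in sync and a slower
delete, which is the same trade-off as issuing sync by hand from a
second window.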

I'm worried about what I found. I want to interest someone who has far
more knowledge about how the kernel actually works internally to look
into this. I have done other experiments that are more complicated to
report. I can't find anything comforting about this situation. If you think it's OK,
you probably don't understand, IMHO.

Kind regards,
-- 
Paul E Condon           
pecondon@mesanetworks.net

