
Re: Delete 4 million files



On Wed, Mar 25, 2009 at 03:10:42PM +0000, Tzafrir Cohen wrote:
> On Wed, Mar 25, 2009 at 07:53:06AM -0500, Ron Johnson wrote:
> > On 2009-03-25 05:16, Tzafrir Cohen wrote:
> >>> Tapani Tarvainen wrote:
> >>>>> kj wrote:
> >>>>>> Now, I've been running the usual find . -type f -exec rm {} \;
> >>>>>> but this is going at about 700,000 per day.  Would simply doing
> >>>>>> an rm -rf on the Maildir be quicker?  Or is there a better
> >>>>>> way?
> >>>> While rm -rf would certainly be quicker and is obviously preferred
> >>>> when you want to remove everything in the directory, the find version
> >>>> could be sped up significantly by using xargs, i.e.,
> >>>> find . -type f -print0 | xargs -0 rm
> >>>> This is especially useful if you want to remove files selectively
> >>>> instead of everything at once.
> >>
> >> And this requires traversing the directory not just a single time but 4
> >> million times (once per rm process). Not to mention merely spawning 4
> >> million such processes is not fun (but spawning those would probably fit
> >> nicely within 10 minutes or so)
> >
> > But isn't that (preventing the spawning of 4M processes) the reason why 
> > xargs was created?
> 
> $ for i in `seq 4000000`; do echo hi; done | xargs /bin/echo | wc
>     252 4000000 12000000
...
> So it's indeed not 4M processes, but still quite a few. But worse:
> you're traversing the directory many times. And you're telling rm the
> explicit order in which to remove files, rather than simply the native
> order of the files in the directory (or whatever is convenient for the
> implementor). Which probably requires rm to do a number of extra
> lookups in the directory.

can you explain what you mean by "traversing"?  i haven't
confirmed with strace, but i assume the only process doing
open(".", O_DIRECTORY) and getdents is the single find process.
then, each of the (approx. 1000) rm processes is making about
4000 unlinks.
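
(fwiw, here's a quick way one could check that guess with strace
on a scratch directory -- a sketch assuming GNU strace, and the
exact syscall names (getdents vs. getdents64, unlink vs. unlinkat)
vary by kernel and libc:

    # -f follows the forked rm children, -c prints per-syscall counts
    strace -f -c -e trace=getdents,getdents64,unlink,unlinkat \
        sh -c 'find . -type f -print0 | xargs -0 rm'

the summary should show roughly one unlink per file, with far fewer
getdents calls feeding them.)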

syscall-wise, the only differences between rm -r and find | xargs
rm would be the ~1000 extra forks and a bunch of writes and reads
of the list of filenames from the pipe.  compared to the 4000000
unlinks in either case, that overhead hardly seems like the worst
part ;)
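
(to put a number on it, a rough benchmark sketch on a scratch
filesystem -- file counts here are hypothetical, and mind the page
cache between runs:

    # build two identical directories of empty files
    mkdir a b
    ( cd a && seq 100000 | xargs touch )
    ( cd b && seq 100000 | xargs touch )

    # recursive rm vs. the find | xargs pipeline
    time rm -r a
    time sh -c 'find b -type f -print0 | xargs -0 rm && rm -r b'

if the unlinks dominate as i'd expect, the two times should come
out close.)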

unless your filesystem has an optimization for removing subtrees
and your tool knows to ask for it, i'd guess you're probably
spending most of your time waiting for the filesystem to remove
entries and invalidate caches.

--Rob*

-- 
/-------------------------------------------------------------\
| "If we couldn't laugh we would all go insane"               |
|              --Jimmy Buffett,                               |
|                "Changes in Latitudes, Changes in Attitudes" |
\-------------------------------------------------------------/

