
Re: Delete 4 million files



On Wed, Mar 25, 2009 at 03:10:42PM +0000, Tzafrir Cohen (tzafrir@cohens.org.il) wrote:

> $ for i in `seq 4000000`; do echo something quite longer; done | xargs
> /bin/echo | wc
>     756 12000000 92000000
[...]
> So it's indeed not 4M processes, but still quite a few.

Even 756 is much less than 4M.
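(With GNU xargs you can see the limits it batches within; a quick
check, assuming GNU findutils:

$ xargs --show-limits < /dev/null

The batch size is bounded by the system's argument-list limit, which
is why 4M short arguments collapse into only a few hundred echo runs.)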

> But worse:
> you're traversing the directory many times. And you're telling rm in
> which explicit order to remove files, rather than simply the native
> order of the files in the directory (or whatever is convenient for the
> implementor). Which probably requires rm to do a number of extra
> lookups in the directory.

Interesting point; I hadn't thought of that.

How much fork() costs in comparison to reading a directory entry
depends on things like disk and CPU speed, available memory,
filesystem type, &c.
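(To put a rough number on the process-creation side alone, here's a
quick sketch, assuming only that /bin/true exists and does nothing:

$ time for i in `seq 1000`; do /bin/true; done

Divide by 1000 for a per-process fork()+exec() estimate.)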

To get an idea of which way it falls I did a quick test with 500k
files (created by seq 500000 | xargs touch) on my box.
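(For anyone wanting to reproduce it, the setup was essentially:

$ mkdir testd
$ cd testd && seq 500000 | xargs touch && cd ..

with each variant timed via time(1) on a freshly created testd.)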

First on an ext3 filesystem:

rm -rf testd                          4m11.909s
find testd -type f |xargs rm          4m42.025s
find testd -type f -exec rm {} \;    62m59.030s
find testd -type f -delete            4m19.340s

Then on tmpfs:

rm -rf testd                           0m2.507s
find testd -type f |xargs rm           0m6.318s
find testd -type f -exec rm {} \;    58m34.645s
find testd -type f -delete             0m3.362s

So, it would seem the number of rm calls indeed dominates
the time needed, not directory traversal.

Of course here xargs was helped by the fact that filenames
were short (at most 12 characters including the directory name),
but the speedup over -exec is still rather impressive.
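(With awkward filenames -- embedded spaces or newlines -- the robust
variant is the null-separated form GNU find and xargs both support:

$ find testd -type f -print0 | xargs -0 rm

which should perform essentially the same as the plain pipe above.)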

If anyone can come up with a scenario where -exec
is significantly faster than xargs, I'd be interested.
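(One candidate worth timing: find can do its own batching with the
POSIX plus form,

$ find testd -type f -exec rm {} +

which should land close to the xargs figures above; I haven't
measured it here.)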

-- 
Tapani Tarvainen

