[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: reading an empty directory after reboot is very slow

Petter Adsen wrote:
> Can someone please enlighten me as to why the entry for this directory
> is so large, even though it is empty? Since it's apparently obvious to
> everyone else, I would very much like to know :)

In the old days directories were simply an array of fixed sized
integers and fixed sized names.  The fixed maximum size of the name
was 14 characters.  The fixed size integer was the inode address.
That it was fixed sized allowed directories to be read into an array
of structures very efficiently using readdir(2) system call.
Individual entries were written by lseek(2)'ing to the proper slot and
writing just that fixed size entry leaving the rest of the (special
directory) file untouched.  Directories were an array of file entries.

That strategy was in use for many years in Unix systems.  But it has
unpleasant limitations.  Short fixed sized file names 14 characters or
less were one.  The linear nature of the array data structure was
another.  The array influenced things considerably.  When a file was
deleted the entree was simply zero'd out.  lseek() to the proper
entry and zero it out.  That meant that directories could grow in size
but never shrink.  If a directory became full it was easy to extend it
by writing the array longer.  But if an early entry in the array was
deleted the system would zero it out rather than move each and every
entry in the file system down a slot.  (I always wondered why they
didn't simply take the *last* entry and move it down to the deleted
entry and simply keep the array always compacted.  I wonder.  But they
didn't do it that way.)

The array makes it obvious why large directores are slow.  If you are
opening a single named file then the way to find the associated inode
number in the directory is to start reading at the start of the array
and continue looking for a match on the file name.  When a match is
found then you have the inode number associated with it.  The matched
file is always the last one you look for because you stop reading at
that time.  If there is a million files in the directory and it was
the last entry in the list then you must read the entire million
entries before finding the one you want.

Then along came B-tree data structures for directories.  It is an on
disk specialization of a tree data structure.  Really cool.


Using B-trees for directories was a huge revolution in performance and
efficiency.  This is the "dir_index" capability in ext3 and ext4.
Other file systems use B-trees similarly but call it different things.
Now even directories with a huge number of entries are fast to read
individual files.  But only if the directory has been converted or
created using B-trees.  The use of B-trees solves most of the
performance problems.

But!  Even with B-trees ext3 and ext4 do not shrink the directory size
when files are removed from them.  Doing so would expend effort.
Depending upon the deletion it would cause rotations in the data
structure and may take a bit of time to accomplish.  It is certainly
more complicated to code.


Therefore these new ext3 and ext4 file systems take advantage of the
fact that previously file system's directories had never been shrunk
before.  The new ones don't shrink either.  They use it as an speed
optimization and simply zero out the entries too.  Therefore in the
wild directories grow in size but don't shrink in size.

Other file systems such as xfs designed for large files and large
numbers of files DO shrink when files are removed.  That is one of the
reasons why xfs is recommended for industrial strength use.  It was
designed to handle those kinds of workloads.  However that means that
it must spend time when files are deleted to clean everything up.
Everything has a cost.  But xfs is one of the file systems that do
shrink directories when files are deleted.


Attachment: signature.asc
Description: Digital signature

Reply to: