Cam Hutchison wrote:
> I don't think dir_index has anything to do with it. An index speeds up
> lookups. You are not doing lookups; you are traversing the entire data
> structure.

I think you might be right about Vincent's problem. It seems there is a
large amount of uncached data that needs to be read from disk, and it
takes a long time to read a lot of data from a large number of files.

> A B-tree data structure can take longer to traverse than a
> contiguous array data structure due to prefetching generally being
> beneficial to arrays, but less so to pointer-based structures.
>
> It's slow because every block of the directory needs to be read to get
> the contents, even if every block contains empty entries. You don't know
> that until you've read it.

That depends on exactly what you are doing, how you are doing it, and
where you draw the boundary of the problem. For the usual case of
opening a single file the answer is no: it doesn't need to read the
entire thing. That is the key point of B-trees and other external index
data structures. When looking up a name, only a much reduced set of
disk blocks needs to be read.

With a B-tree the root node is stored first and read as the first disk
block. That block contains as many index entries as will fit, say at
least 1,000. Those entries are sorted and form a multi-level index of
other nodes. If the directory is small, a single-level index may hold
all of the entries. If the directory is large, those entries point to
other directory nodes in a multi-level index.

So far one page has been loaded, for the root node, and root nodes will
usually be kept in the page cache, making their accesses very fast.
Using the information in the root node allows seeking directly to the
next directory node that contains the filename being opened. Assuming
1,000 entries per page, this additional page load brings in another
1,000 entries.
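The page-read arithmetic can be sketched in a few lines of Python. The
1,000-entries-per-page fanout is just the round figure used above, not
the actual fanout of ext3/ext4's hashed directory index:

```python
def btree_page_reads(num_entries, fanout=1000):
    """Pages read to find one name in a B-tree with the given fanout.

    Each page read narrows the search by a factor of `fanout`, so the
    cost is the tree depth: ceil(log base fanout of num_entries).
    """
    pages = 1
    capacity = fanout          # names reachable with `pages` page reads
    while capacity < num_entries:
        capacity *= fanout
        pages += 1
    return pages

def linear_page_reads(num_entries, entries_per_block=1000):
    """Average pages read scanning a linear directory: half the blocks."""
    blocks = -(-num_entries // entries_per_block)  # ceiling division
    return max(1, blocks // 2)

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13} names: B-tree {btree_page_reads(n)} page(s), "
          f"linear scan ~{linear_page_reads(n)} page(s)")
```

Even a billion names costs only three page reads through the index,
while a linear scan of the same directory averages half a million.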
A directory that is huge by today's standards may therefore require
only two or three page reads to open any file in it. Opening a file
does not require reading the entire directory.

What is the lookup order for a binary tree? Log base 2. A B-tree with
1,000 entries per page has a page-lookup order of log base 1,000, the
number of entries per page. This makes for a very shallow tree
structure, and it is the magic that allows a very few page reads to
open any individual file, as compared to a linear array search.

On the other hand, if you draw the boundary at the highest level and
look at the work needed to read every file in the directory, then of
course you will be traversing all of the data in both cases. The work
outside of opening files is going to be the same. The file-opening part
will still be much better in the B-tree case, because in the linear
directory array case opening each file takes much longer.

> >drwx------ 2 vlefevre vlefevre 8409088 2015-03-24 14:04:33 Mail/oldarc/cur/
>
> Your large directory is about 3.5 times the size of this one, so we
> would expect all things being equal that it would take 30s to read the
> larger directory based on the time of reading your maildir.
>
> One thing that is likely not equal is fragmentation. It is quite
> possible that your 30MB directory is fragmented across the disk and
> involves many seeks to read it all. If you really want to know if this
> is the case, use debugfs(8) to have a look:

Use filefrag to determine the file fragmentation:

  $ filefrag /var/tmp
  /var/tmp: 2 extents found

  $ filefrag -v ~/Mail/Lists/debian/user/cur
  ...lots of information...

Bob