
dpkg list-file performance



Hi,

So, I've been having difficulties with dpkg's performance on a cold disk cache. The list files in /var/lib/dpkg/info are inefficient to read. Before most operations, dpkg calls ensure_allinstfiles_available(), which reads the contents of every list file into a global hash table. As there are thousands of these small files scattered across the disk, reading them one after another is very expensive.
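To make the access pattern concrete, here is a rough Python sketch of what loading every list file into one hash table amounts to. It is not dpkg's C code: the function name load_all_list_files is made up for illustration, and it assumes each file is named <package>.list and holds one installed path per line.

```python
import os
from collections import defaultdict


def load_all_list_files(info_dir):
    """Read every *.list file in info_dir into one dict: path -> set of packages.

    Illustration only (hypothetical helper, not dpkg's implementation).
    Each file is opened separately, which is what hurts on a cold cache.
    """
    owners = defaultdict(set)
    for name in sorted(os.listdir(info_dir)):
        if not name.endswith(".list"):
            continue
        package = name[:-len(".list")]
        list_path = os.path.join(info_dir, name)
        with open(list_path, encoding="utf-8", errors="surrogateescape") as f:
            for line in f:
                path = line.rstrip("\n")
                if path:
                    owners[path].add(package)
    return owners
```

With the table in memory, a search is a single lookup such as load_all_list_files("/var/lib/dpkg/info").get("/bin/ls"); the cost is almost entirely in opening the thousands of files first.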

On my machine (Ubuntu Intrepid), dpkg --search takes nearly 30 seconds on cold cache:

% dump-disk-cache
% time dpkg-query --search /bin/ls
coreutils: /bin/ls
dpkg-query --search /bin/ls  0.47s user 0.45s system 3% cpu 29.536 total

dpkg also reads the list files when installing packages, so we are affected there.

The current list files are good for the query "given a package, what did it install". They also have fairly fast updates. However, they are extremely poorly suited for the query "given a file, what package(s) installed it", or for any case where everything has to be read in at once. So, I propose adding a cache for the data.

As a proof-of-concept, I have a series of patches[1] which implement a simple cache: putting everything into a tar file. The first two refactor some of the code; they can probably be merged now. The other two add a cache. Note: the latter two are purely proof-of-concept and have numerous technical problems, are incomplete, etc. Please do not consider them for merging.
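The idea of the tar cache can be sketched in a few lines of Python. This is only an illustration under assumptions, not the patches themselves: it uses Python's tarfile module rather than piping to the system tar binary, the names build_cache and load_from_cache are hypothetical, and it assumes list files are named <package>.list with one path per line.

```python
import os
import tarfile
from collections import defaultdict


def build_cache(info_dir, cache_path):
    """Pack every *.list file from info_dir into a single uncompressed tar."""
    with tarfile.open(cache_path, "w") as tar:
        for name in sorted(os.listdir(info_dir)):
            if name.endswith(".list"):
                tar.add(os.path.join(info_dir, name), arcname=name)


def load_from_cache(cache_path):
    """Rebuild the path -> packages table from one sequential read of the tar."""
    owners = defaultdict(set)
    with tarfile.open(cache_path, "r") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".list"):
                continue
            package = member.name[:-len(".list")]
            data = tar.extractfile(member).read()
            text = data.decode("utf-8", errors="surrogateescape")
            for path in text.split("\n"):
                if path:
                    owners[path].add(package)
    return owners
```

The resulting table is the same as the one built from the individual files; only the on-disk layout changes, from thousands of small files to one contiguous file.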

Some numbers:

[dpkg master (1b5a009da6fdd38b2b51bd551c09880f890566f7)]
% dump-disk-cache
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.51s user 0.53s system 3% cpu 30.324 total
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.33s user 0.08s system 93% cpu 0.435 total

[dpkg tarfile-proof-of-concept]
% dump-disk-cache
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.47s user 0.07s system 37% cpu 1.461 total
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.42s user 0.08s system 83% cpu 0.587 total

There is a performance regression on warm cache, but I suspect this is largely because I'm piping to the system tar binary. Of course, a real implementation would be more reasonable.

The time to do a search on cold cache (very strongly bottlenecked on reading the list files) goes down from over 30 seconds to under 1.5 seconds. I think this is a clear improvement. Only ensure_allinstfiles_available() is touched and it gives the same result, so this should not affect the rest of the program.

I'm not sure which implementation would be the most acceptable. Ideally, a database of sorts that avoids reading in all the data would be best, but it seems that will be difficult to integrate with the existing code without touching many code paths. A tar file may have problems with --delete rewriting the entire file.
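For comparison, here is a rough sketch of what the "database of sorts" option could look like. SQLite is purely my assumption for illustration, not something the patches use, and the schema and function names (open_index, set_package_files, search) are hypothetical. It shows the two properties in question: a search that does not load everything, and a per-package update that does not rewrite the whole store.

```python
import sqlite3


def open_index(db_path):
    """Open (or create) a path -> package index. Schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT NOT NULL,"
        " package TEXT NOT NULL,"
        " PRIMARY KEY (path, package))"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS files_by_package ON files (package)"
    )
    return conn


def set_package_files(conn, package, paths):
    """Replace one package's file list in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM files WHERE package = ?", (package,))
        conn.executemany(
            "INSERT OR IGNORE INTO files (path, package) VALUES (?, ?)",
            [(p, package) for p in paths],
        )


def search(conn, path):
    """Return the packages that installed path, without reading other rows."""
    rows = conn.execute(
        "SELECT package FROM files WHERE path = ? ORDER BY package", (path,)
    )
    return [row[0] for row in rows]
```

Removing a package is set_package_files(conn, package, []), which touches only that package's rows. The integration cost mentioned above still applies: callers that expect the full in-memory table would have to change.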

Thoughts?


David Benjamin


[1] http://github.com/davidben/dpkg/tree/tarfile-proof-of-concept

