[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: dpkg list-file performance

On 08/31/2009 02:19 AM, Raphael Hertzog wrote:
--install will also install a new .list file and updating a single big
file containing the full list of files would not be very efficient when
done multiple times in a single apt-get run.

This is true, which is why I do not think tar is a good solution. I intended it purely as a proof-of-concept, to demonstrate that there is indeed something wrong with the .list files.

I had some thoughts of doing something like blanking out expired entries in this single file and appending new ones. (And if we want to be really fancy, we could keep track of what blank regions there are and insert new entries in spots that fit.) The file could be compacted periodically when it is more than X% blank.

That said, taking this to its logical conclusion would simply result in reimplementing a large portion of some kind of database (be it full-blown SQL or something simpler). In which case, I think the correct solution is to use an existing implementation. I don't know how objectionable such a dependency would be, so I wanted to involve upstream before spending a lot of time on something people categorically object to. Hence this email and proof-of-concept.

A database also means we have some hope of later moving to an implementation that reads data in on-demand instead of all at the beginning. This should give considerable memory savings since we only store information about files we looked at.

I also remember some experimental work of Sean Finney who put a cache in
a sqlite db. You might be interested to look up in Google and/or in the
BTS if you can find out this old patch.

Thanks! I will look at this more closely. At first glance, Sean seems to have had the same thoughts as I do. If a sqlite (or whatever) cache is deemed acceptable to include into dpkg, I would love to help with stabilizing it.

At a glance, I did notice some complaints as to potential reliability loss. I would argue that, if such a change is implemented as a cache and not a data replacement, there should be no reliability problems. As you noted, updating the .list files is cheap, so retaining them should not be expensive. As long as we can reliably detect when the database becomes inaccurate (with false positives acceptable), we can regenerate the cache and move on. I think this should not be a difficult problem.

David Benjamin

Reply to: