Re: dpkg list-file performance
On 08/31/2009 02:19 AM, Raphael Hertzog wrote:
--install will also install a new .list file and updating a single big
file containing the full list of files would not be very efficient when
done multiple times in a single apt-get run.
This is true, which is why I do not think tar is a good solution. I
intended it purely as a proof-of-concept, to demonstrate that there is
indeed something wrong with the .list files.
I had some thoughts of doing something like blanking out expired entries
in this single file and appending new ones. (And if we want to be really
fancy, we could keep track of what blank regions there are and insert
new entries in spots that fit.) The file could be compacted periodically
when it is more than X% blank.
That said, taking this to its logical conclusion would simply result in
reimplementing a large portion of some kind of database (be it
full-blown SQL or something simpler). In which case, I think the correct
solution is to use an existing implementation. I don't know how
objectionable such a dependency would be, so I wanted to involve
upstream before spending a lot of time on something people categorically
object to. Hence this email and proof-of-concept.
A database also means we have some hope of later moving to an
implementation that reads data in on-demand instead of all at the
beginning. This should give considerable memory savings since we only
store information about files we looked at.
I also remember some experimental work of Sean Finney who put a cache in
a sqlite db. You might be interested to look up in Google and/or in the
BTS if you can find out this old patch.
Thanks! I will look at this more closely. At first glance, Sean seems to
have had the same thoughts as I do. If a sqlite (or whatever) cache is
deemed acceptable to include into dpkg, I would love to help with
At a glance, I did notice some complaints as to potential reliability
loss. I would argue that, if such a change is implemented as a cache and
not a data replacement, there should be no reliability problems. As you
noted, updating the .list files is cheap, so retaining them should not
be expensive. As long as we can reliably detect when the database
becomes inaccurate (with false positives acceptable), we can regenerate
the cache and move on. I think this should not be a difficult problem.