Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)
On 2013-03-26 01:20, Nick Black wrote:
> Stefan Fritsch left as an exercise for the reader:
>> Well, the contents files are much larger than the package files and
>> are usually used less frequently. So some users may prefer to download
>> the contents files only when necessary. Apart from that, I don't see
>
> then can't they just leave apt-file uninstalled? especially as installing it
> would perform the initial apt-get update?
>
>> any problem. But that's not my decision anymore :-)
>
> yeah i'm not wedded to any particular solution, but this one seems right to
me. if it's something that's been threaded out at length, though, no need to
> entertain my suggestions.
>
Sounds like we should have a config variable that people can set (or
clear) to disable Contents fetching via apt-get update. Assuming the
APT side of this would support such a use-case, I think we can have it.
But to be honest, I would really like to remove the "apt-file update"
command entirely if I can get away with it. It always seemed like
something APT ought to do... though I suppose if I end up delegating
the entire thing to apt-get update, it will not really have any
maintenance overhead.
>> - Significant speedup could be attained by recompressing the local
>> file with lzop instead of gzip. You write "processing time is roughly
>
I experimented a bit with lzop, and indeed it seems to halve my
runtimes for search and show (at least the non-regex variants). Though
it comes at a lower compression ratio, which is not a problem at the
moment but might become one when multi-arch support is "added" (also
see my comment about redundancy below).
> Absolutely. If we can make local changes, there's all kinds of things we can
> do. I left this entire class of optimizations aside.
>
> For that matter, if we stripped the free-form header section, that would,
> perhaps surprisingly, *vastly* simplify my code. Again, I wanted to do an
> implementation which conformed precisely to current disk layouts and such,
> since I want to deploy this in SprezzOS independently of Debian's decision.
>
There are also things we could do at update time:
* prepending / to all paths, as people expect a leading slash. To this
  end, apt-file currently tries to rewrite people's search patterns to
  match reality, but I hope we can eventually avoid that (because it
  does not work in all cases, etc.).
* remove redundancy between Contents-* files. Between unstable and
  testing (or i386 and amd64) there is a huge overlap in files.
  Removing it would likely allow us to scale better as the number of
  enabled architectures and distributions increases.
  (related bugs include #658794, #578727 and #632254)
* make optimized caches for certain use-cases like "list/show". Maybe
even "match pattern X against programs in default PATH".
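As a sketch of the first item, the normalization could happen once when the Contents file is processed at update time (the line format below is the standard Contents one: path, whitespace, section/package; the function name is just illustrative, not actual apt-file code):

```python
def add_leading_slash(contents_line: str) -> str:
    """Prepend '/' to the path field of a raw Contents line, so that
    searches like '/usr/bin/perl' match directly without any pattern
    rewriting at query time."""
    if contents_line and not contents_line.startswith("/"):
        return "/" + contents_line
    return contents_line

print(add_leading_slash("usr/bin/perl  perl/perl"))
# -> /usr/bin/perl  perl/perl
```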
The second item probably requires merging the Contents files, which we
need to do in a very efficient manner. I believe the files are
pre-sorted, so we could exploit this to do the "merge" part of
mergesort without loading the whole thing into memory (which is sadly
quickly measured in GB).
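A minimal sketch of that streaming merge, using Python's heapq.merge as a stand-in (the real tool would read the decompressed Contents files; the stream names here are illustrative):

```python
import heapq
import io
from itertools import groupby

def merge_contents(*streams):
    """Lazily k-way merge pre-sorted Contents streams, emitting each
    distinct line once.  heapq.merge keeps only one pending line per
    input stream in memory, so this never loads a whole file."""
    for line, _duplicates in groupby(heapq.merge(*streams)):
        yield line

# Two in-memory "files" standing in for e.g. Contents-amd64 and Contents-i386:
amd64 = io.StringIO("bin/ls  utils/coreutils\nusr/bin/perl  perl/perl\n")
i386 = io.StringIO("bin/ls  utils/coreutils\nusr/bin/python  python/python\n")
for line in merge_contents(amd64, i386):
    print(line, end="")
```

Note that heapq.merge assumes each input is individually sorted, which matches the pre-sorted files; as a side effect, identical lines from different Contents files collapse to one, which also addresses the redundancy item above.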
>> - Try benchmarks with a single core, too. It's nice if you can use
>> more cores but you don't want to have too much regression on single
>> core systems.
>
> Yep, I will send those around. I'm not doing anything stupid like launching
> a fixed-sized pool; libblossom offers us per-CPU workers etc.
>
>> - apt-file regex search performance sucks because it doesn't use grep.
>> Nowadays grep has -P, so grep could be used, too. Which regex type do
>> you use?
>
Also possibly because Perl (Python, Java etc.) uses an expensive regular
expression implementation[1].
> Hold on for a bit of theory:
>
> - I'm matching multiple patterns using an "Advanced Aho-Corasick
> automaton". The set of all AACAs is clearly a subset of all DFA (discrete
> finite automatons).
>
I think you mean s/discrete/deterministic/, as NFAs (which can be used
to match any regular language as well) are "non-deterministic finite
automata".
> - The set of languages recognized by DFAs is equivalent to the set of
> regular languages.
>
>
> - Thus, any regular operation can be encoded into a DFA, though possibly at
> a high cost in states. See Sipser or Hopcroft+Ullman.
>
> - Thus, we just encode the regular operations as alternates in our AACA.
> Since we already match the AACA in time independent of the number of
> patterns, adding these alternate patterns costs us no time in the main,
> but only in the preprocessing.
>
> I'm doing basically the exact same thing grep/egrep does: Boyer-Moore-Galil
> for one pattern, or AAC for multiple patterns.
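To make the automaton idea concrete, here is a minimal textbook Aho-Corasick sketch (an illustration of the classical algorithm, not the libblossom-based code discussed in this thread):

```python
from collections import deque

def build(patterns):
    """Build the goto/fail/output tables of an Aho-Corasick automaton."""
    goto, fail, output = [{}], [0], [set()]
    # Phase 1: a trie over all patterns.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
                fail.append(0)
                output.append(set())
            state = goto[state][ch]
        output[state].add(pat)
    # Phase 2: failure links via BFS (a generalized KMP fallback).
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] |= output[fail[t]]   # inherit matches ending in a suffix
    return goto, fail, output

def search(text, tables):
    """Yield (start, pattern) for every match, in one pass over text --
    time independent of the number of patterns."""
    goto, fail, output = tables
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            yield (i - len(pat) + 1, pat)

tables = build(["he", "she", "his", "hers"])
print(sorted(search("ushers", tables)))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```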
>
>> [...]
> Hack on!
>
> --nick
>
>
True, but the perl "regular expression" is in fact more powerful than
an NFA. Admittedly, I believe the only real feature that exceeds NFAs
is backrefs, which are thankfully not used that often.
I have no concerns about compiling the "perl regex" case into a
DFA/NFA where possible, but we have to either handle the backref case
or explicitly document that backrefs are not supported.
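For illustration, a backref pattern recognizes a non-regular language, which is why it cannot be compiled to a DFA/NFA (Python's re engine supports backrefs like Perl's; the pattern here is just an example):

```python
import re

# (a+)b\1 matches strings of the form a^n b a^n: the backref forces the
# second run of a's to be exactly as long as the first, a length
# constraint no finite automaton can enforce for unbounded n.
pattern = re.compile(r"^(a+)b\1$")

print(bool(pattern.match("aabaa")))   # True: both runs have length 2
print(bool(pattern.match("aabaaa")))  # False: lengths 2 and 3 differ
```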
I am planning on doing an apt-file 3 release post-Wheezy where I
permit backwards-incompatible changes (e.g. exit codes and making -F
the default with show/list), so whichever way we choose to do it will
be fine.
~Niels
[1] http://swtch.com/~rsc/regexp/regexp1.html
The article is from 2007, so things could have changed, though it is
my understanding that they haven't.
NB: The first two graphs do not have the same unit on the Y-axis (i.e.
seconds vs. microseconds).