
Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)



On 2013-03-26 01:20, Nick Black wrote:
> Stefan Fritsch left as an exercise for the reader:
>> Well, the contents files are much larger than the package files and 
>> are usually used less frequently. So some users may prefer to download 
>> the contents files only when necessary. Apart from that, I don't see 
> 
> then can't they just leave apt-file uninstalled? especially as installing it
> would perform the initial apt-get update?
> 
>> any problem. But that's not my decision anymore :-)
> 
> yeah i'm not wedded to any particular solution, but this one seems right to
> me. if it's something that's been thrashed out at length, though, no need to
> entertain my suggestions.
> 

Sounds like we should have some config variable that people can set (or
clear) to disable Contents fetching via apt-get update.  Assuming the
APT side of this would support such a use case, I think we can have it.

But to be honest, I would really like to remove "apt-file update"
entirely if I can get away with it.  It always seemed like something APT
ought to do... though I suppose that if I end up delegating the entire
thing to apt-get update, it will not really have any maintenance
overhead.
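
Purely as illustration, such a knob could look like the stanza below.
The option name is made up, and it assumes APT grew support for fetching
Contents files during apt-get update in the first place:

  // Hypothetical /etc/apt/apt.conf.d/50contents -- the option name is
  // invented for illustration and does not exist in APT today.
  // "true": apt-get update also fetches the Contents-* files;
  // "false": skip them to save the bandwidth.
  Acquire::Contents "true";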

>> - Significant speedup could be attained by recompressing the local 
>> file with lzop instead of gzip. You write "processing time is roughly
> 

I tried a bit with lzop and indeed it seems to halve my runtimes for
search and show (at least the non-regex variants).  Though it comes at
the cost of a lower compression ratio, which is not a problem at the
moment but might be once multi-arch support is "added" (also see my
comment about redundancy below).
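
For reference, roughly how I measured it -- a small Python sketch that
just streams each file through the decompressor and counts lines.  The
file names are only examples, and it assumes the gzip and lzop tools are
installed:

  #!/usr/bin/env python3
  # Compare line-scanning throughput of a gzip- vs an lzop-compressed
  # Contents file.  Assumes "gzip" and "lzop" are in PATH and that the
  # two input files (names are examples only) hold the same data.
  import subprocess
  import time

  def scan(decompress_cmd, path):
      """Stream the decompressed file, returning (seconds, line count)."""
      start = time.monotonic()
      with subprocess.Popen(decompress_cmd + [path],
                            stdout=subprocess.PIPE) as proc:
          lines = sum(1 for _ in proc.stdout)
      return time.monotonic() - start, lines

  gz_time, n = scan(["gzip", "-dc"], "Contents-amd64.gz")
  lzo_time, _ = scan(["lzop", "-dc"], "Contents-amd64.lzo")
  print("%d lines: gzip %.2fs, lzop %.2fs" % (n, gz_time, lzo_time))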

> Absolutely. If we can make local changes, there's all kinds of things we can
> do. I left this entire class of optimizations aside.
> 
> For that matter, if we stripped the free-form header section, that would,
> perhaps surprisingly, *vastly* simplify my code. Again, I wanted to do an
> implementation which conformed precisely to current disk layouts and such,
> since I want to deploy this in SprezzOS independently of Debian's decision.
> 

There are also things we could do at update time:

 * prepend / to all paths, as people expect there to be a leading
   slash.  To this end, apt-file currently tries to rewrite people's
   search patterns to match reality, but I hope we can eventually avoid
   that (because it does not work in all cases, etc.).
 * remove redundancy between the Contents-* files.  Between unstable and
   testing (or i386 and amd64) there is a huge overlap in files.
   Removing it would likely allow us to scale better as the number of
   enabled architectures and distributions increases.
   (related bugs include #658794, #578727 and #632254)
 * make optimized caches for certain use-cases like "list/show".  Maybe
   even "match pattern X against programs in default PATH".

The second item probably requires merging the Contents files, which we
would need to do in a very efficient manner.  I believe the files are
pre-sorted, so we could exploit this to do the "merge" part of mergesort
without loading the whole thing into memory (which is sadly quickly
measured in GB); see the sketch below.
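
A minimal sketch of that streaming merge in Python.  The file names are
examples, the line format is assumed to be "path whitespace pkg,pkg,..."
with the free-form header already stripped, and it also prepends the
leading slash from the first item above:

  #!/usr/bin/env python3
  # Streaming merge of several pre-sorted Contents files without loading
  # them into memory, collapsing lines that share a path (the redundancy
  # between e.g. testing/unstable or i386/amd64).
  import gzip
  import heapq
  import itertools

  def entries(path):
      """Yield (file path, package field) pairs from one Contents file."""
      with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
          for line in fh:
              name, pkgs = line.rsplit(None, 1)
              # Prepend the leading slash users expect.
              yield "/" + name, pkgs

  def merge(*paths):
      """Merge the sorted streams, unioning package lists of duplicates."""
      stream = heapq.merge(*(entries(p) for p in paths))
      for name, group in itertools.groupby(stream, key=lambda e: e[0]):
          pkgs = sorted(set(itertools.chain.from_iterable(
              e[1].split(",") for e in group)))
          yield "%s %s" % (name, ",".join(pkgs))

  for line in merge("Contents-i386.gz", "Contents-amd64.gz"):
      print(line)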

>> - Try benchmarks with a single core, too. It's nice if you can use 
>> more cores but you don't want to have too much regression on single 
>> core systems.
> 
> Yep, I will send those around. I'm not doing anything stupid like launching
> a fixed-sized pool; libblossom offers us per-CPU workers etc.
> 
>> - apt-file regex search performance sucks because it doesn't use grep. 
>> Nowadays grep has -P, so grep could be used, too. Which regex type do 
>> you use?
> 

Also possibly because Perl (and Python, Java, etc.) uses an expensive,
backtracking regular expression implementation[1].
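
A quick way to see the problem from [1] in Python (also a backtracking
engine); the pattern is the classic pathological case, and the runtime
roughly doubles with every extra character of input:

  #!/usr/bin/env python3
  # A backtracking engine (Perl, Python, Java, ...) takes exponential
  # time on this non-matching input; a Thompson-NFA/DFA engine such as
  # grep stays linear.
  import re
  import time

  for n in range(16, 24):
      start = time.monotonic()
      re.match(r"(a+)+$", "a" * n + "b")  # can never match
      print("n=%d: %.3fs" % (n, time.monotonic() - start))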

> Hold on for a bit of theory:
> 
>  - I'm matching multiple patterns using an "Advanced Aho-Corasick
>    automaton". The set of all AACAs is clearly a subset of all DFA (discrete
>    finite automatons).
> 

I think you mean s/discrete/deterministic/, as NFAs (which can be used
to match any regular language as well) are "non-deterministic finite
automata".

>  - The set of languages recognized by DFAs is equivalent to the set of
>    languages described by regular expressions.
> 
>  - Thus, any regular operation can be encoded into a DFA, though possibly at
>    a high cost in states. See Sipser or Hopcroft+Ullman.
> 
>  - Thus, we just encode the regular operations as alternates in our AACA.
>    Since we already match the AACA in time independent of the number of
>    patterns, adding these alternate patterns costs us no time in the main,
>    but only in the preprocessing.
> 
> I'm doing basically the exact same thing grep/egrep does: Boyer-Moore-Galil
> for one pattern, or AAC for multiple patterns.
> 
>> [...]
> Hack on!
> 
> --nick
> 
> 

True, but the perl "regular expression" is in fact more powerful than an
NFA.  Admittedly, I believe the only real feature that exceeds regular
languages is backreferences ("backrefs"), which are thankfully not used
that often.

I have no concerns about compiling the "perl regex" case into a DFA/NFA
where possible, but we would have to either handle the backref case or
explicitly document that backrefs are not supported.

I am planning on doing an apt-file 3 release post-Wheezy in which I
permit backwards-incompatible changes (e.g. exit codes and making -F the
default with show/list), so whichever way we choose to do it will be
fine.
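
For reference, a minimal Python sketch of the plain Aho-Corasick
matching Nick describes above -- fixed strings only, none of the
"advanced" extensions, regex alternates or backrefs:

  #!/usr/bin/env python3
  # Minimal Aho-Corasick automaton: matches a set of fixed patterns in
  # time independent of the number of patterns.
  from collections import deque

  def build(patterns):
      """Build the goto/fail/output tables for the given patterns."""
      goto, fail, out = [{}], [0], [set()]
      for pat in patterns:
          state = 0
          for ch in pat:                # insert the pattern into the trie
              if ch not in goto[state]:
                  goto.append({})
                  fail.append(0)
                  out.append(set())
                  goto[state][ch] = len(goto) - 1
              state = goto[state][ch]
          out[state].add(pat)
      queue = deque(goto[0].values())   # BFS to fill in the failure links
      while queue:
          state = queue.popleft()
          for ch, nxt in goto[state].items():
              queue.append(nxt)
              f = fail[state]
              while f and ch not in goto[f]:
                  f = fail[f]
              fail[nxt] = goto[f].get(ch, 0)
              out[nxt] |= out[fail[nxt]]
      return goto, fail, out

  def search(text, tables):
      """Yield (position, pattern) for every match in text."""
      goto, fail, out = tables
      state = 0
      for i, ch in enumerate(text):
          while state and ch not in goto[state]:
              state = fail[state]
          state = goto[state].get(ch, 0)
          for pat in out[state]:
              yield i - len(pat) + 1, pat

  tables = build(["bin/ls", "bin/cp", "sbin/ip"])
  print(list(search("usr/bin/ls usr/sbin/ip", tables)))
  # -> [(4, 'bin/ls'), (15, 'sbin/ip')]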

~Niels

[1] http://swtch.com/~rsc/regexp/regexp1.html

The article is from 2007, so things could have changed, though it is my
understanding that they have not.

NB: The first two graphs do not have the same unit on the Y-axis
(seconds vs. microseconds).

