[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#703366: RFH: apt-file -- search for files within Debian packages (command-line interface)



Stefan Fritsch left as an exercise for the reader:
> Well, the contents files are much larger than the package files and 
> are usually used less frequently. So some users may prefer to download 
> the contents files only when necessary. Apart from that, I don't see 

then can't they just leave apt-file uninstalled? especially as installing it
would perform the initial apt-get update?

> any problem. But that's not my decision anymore :-)

yeah i'm not wedded to any particular solution, but this one seems right to
me. if it's something that's been thraded out at length, though, no need to
entertain my suggestions.

> - Significant speedup could be attained by recompressing the local 
> file with lzop instead of gzip. You write "processing time is roughly

Absolutely. If we can make local changes, there's all kinds of things we can
do. I left this entire class of optimizations aside.

For that matter, if we stripped the free-form header section, that would,
perhaps surprisingly, *vastly* simplify my code. Again, I wanted to do an
implementation which conformed precisely to current disk layouts and such,
since I want to deploy this in SprezzOS independently of Debian's decision.

> - Try benchmarks with a single core, too. It's nice if you can use 
> more cores but you don't want to have too much regression on single 
> core systems.

Yep, I will send those around. I'm not doing anything stupid like launching
a fixed-sized pool; libblossom offers us per-CPU workers etc.

> - apt-file regex search performance sucks because it doesn't use grep. 
> Nowadays grep has -P, so grep could be used, too. Which regex type do 
> you use?

Hold on for a bit of theory:

 - I'm matching multiple patterns using an "Advanced Aho-Corasick
   automaton". The set of all AACAs is clearly a subset of all DFA (discrete
   finite automatons).

 - The set of languages recognized by DFAs is equivalent to the set of
   languages recognized by regular languages.

 - This, any regular operation can be encoded into a DFA, though possibly at
   a high cost in states. See Sipser or Hopcroft+Ullman.

 - Thus, we just encode the regular operations as alternates in our AACA.
   Since we already match the AACA in time independent of the number of
   patterns, adding these alternate patterns costs us no time in the main,
   but only in the preprocessing.

I'm doing basically the exact same thing grep/egrep does: Boyer-Moore-Galil
for one pattern, or AAC for multiple patterns.

> - Are you limiting the used memory? Remember there may still be VMs 
> with 256MB RAM and you shouldn't cause swapping on such systems.

Even if they only have 256MB of physical RAM, they still have however large
a virtual address space. mmap() is not going to map in all the requested
pages; it's just associating a VMA in our process with them. They'll be
faulted in as they're referenced. Thus it's not how much we have mmap()d
(which can be large, equivalent to the sum of the compressed files plus
dynamic state). Our dynamic state per thread is limited, however, so we'll
never be say trying to uncompress 256MB of text (this would be bad for
parallelism anyway).

Hope that answers these questions. Feel free to hit me with more.

Hack on!

--nick


-- 
nick black     http://www.sprezzatech.com -- unix and hpc consulting
to make an apple pie from scratch, you need first invent a universe.


Reply to: