
Re: apt-file performance characterization



nick black left as an exercise for the reader:
> I've written up a performance characterization of apt-file(1) as stands. The
> preliminary version is available here:
> 
> 	http://nick-black.com/raptorial-file

OK, finished this writeup. I've updated the file at the link above, and
included it below. I'm going to go ahead and knock out bugs 705[0] and
698[1], which will together implement everything described herein.
Hopefully that'll be done by tomorrow morningish, EDT.

Hack on!

--rigorously, nick

[0] https://www.sprezzatech.com/bugs/show_bug.cgi?id=705
[1] https://www.sprezzatech.com/bugs/show_bug.cgi?id=698

Optimizing apt-file
===================

All timings were taken on a quadcore Core i7 2600K overclocked to 5.0GHz,
reading files from an Intel 320 SSD. All data sets fit within main memory.

apt-file has a more complex performance space than did apt-show-versions.
Understanding this space is critical to optimizing apt-file, and evaluating
any optimizations made. apt-file doesn't perform string searches itself; it
uses grep, zgrep and zfgrep. Furthermore, while apt-file is not multithreaded,
it does run multiple processes (apt-file -> gzip | grep), and thus natively
takes advantage of multiple CPUs within a given Contents file. In this top(1)
output, the last column is the CPU number:

 8529 dank   7952  848  688 R  21.9  0:00.66 grep                3 
 8522 dank  66508  17m 3436 S  11.6  0:00.35 apt-file            0 
 8528 dank   8840  668  508 R  10.6  0:00.32 gzip                2 

Of course, on a unicore machine, this use of three different processes will
provide no computational benefit (it might provide I/O benefit).

It is worth noting that the -x (regex) functionality of apt-file appears to be
broken or at least incomplete. Neither of the following searches, for instance,
yields any matches for "compiz" (nor, indeed, all of the matches for "frog"):

	[skynet](0) $ apt-file search -x frog\|compiz
	frog: /usr/bin/frog
	wmfrog: /usr/bin/wmfrog
	[skynet](0) $ apt-file search -x \(frog\)\|\(compiz\)
	frog: /usr/bin/frog
	wmfrog: /usr/bin/wmfrog
	[skynet](0) $ 

Running "apt-file frog compiz" finds all of the frog matches, but none of the
compiz matches. It appears that apt-file cannot use multiple patterns off the
command line, but it doesn't bother to warn the user about that. We can create
a pattern file, however, and that works:

	[skynet](0) $ echo -e "frog\ncompiz" > searches 
	[skynet](0) $ apt-file -f search searches  | wc -l
	3874
	[skynet](0) $ apt-file search compiz  | wc -l
	1112
	[skynet](0) $ apt-file search frog | wc -l
	2762
	[skynet](0) $ echo $((3874 - 1112))
	2762
	[skynet](0) $ 

Characterizing apt-file performance
===================================

Background material:

	"Flexible Pattern Matching in Strings" by Navarro and Raffinot
	"Introduction to Algorithms" by Cormen et al

String search algorithms tend to be optimized for one use or another. Searching
within DNA, for instance, is a very different game than searching within
UTF-32, which is very different from searching within UTF-8. We will consider
only exact string matching, and not "fuzzy" string matching (i.e., matching
within some string distance). Parameterizing the problem space under this
condition yields:

 - alphabet size
 - needle length
 - haystack length
 - number of needles
 - needle heterogeneity
 - frequency of occurrence

Note that in the presence of Kleene closure, there might be an infinite number
of needles. In the case of an online algorithm, the haystack is of infinite
size. All other parameters are finite. The alphabet size and haystack length
are independent of apt-file(1). Let's explore performance when we vary the
other parameters:

Varying needle length
=====================

It would be nice to have short needles of varying length which hold the
frequency of occurrence constant, but that'll have to come another day. It'll
be easy, as we'll see, to vary frequency without varying length.

[skynet](0) $ for i in compizabd compizab compiza compiz compi comp com co c ; do time apt-file search $i | wc -l ; done
0

real	0m1.655s
0

real	0m1.673s
0

real	0m1.654s
1112

real	0m1.854s
17293

real	0m2.850s
86277

real	0m4.971s
225415

real	0m7.667s
998349

real	0m22.491s
4019431

real	1m30.639s
[skynet](0) $ 

apt-file(1) clearly performs better for longer strings, although this could be
due to their lesser frequency of occurrence. Times are approximately equal for
the three longest strings, but they also have equal frequency (0 occurrences).
This points to a backwards-skipping algorithm such as Boyer-Moore: with longer
strings, we get longer skips, and the ratio of skipped text for adjacent
string lengths (N, N-1) converges towards 1 as N grows large (i.e., there's
a much bigger difference between skipping 1 vs 2 characters than between
skipping 10 vs 11). This could all be due to I/O, though. To check, we comment
out the print command in the innermost loop of print_winners(); I'll spare you
the block of timings and just assert that the absolute timings fell, but not
by much. It does indeed appear that a backwards-skipping algorithm is in use.

Needle heterogeneity
====================

We'll hold length and frequency constant, and vary the homogeneity:

[skynet](0) $ for i in abcd aaaa aaaa abcd  ; do time apt-file search $i | wc -l ; done
113

real	0m1.739s
user	0m2.047s
sys	0m0.163s
122

real	0m1.827s
user	0m2.227s
sys	0m0.117s
122

real	0m1.813s
user	0m2.157s
sys	0m0.167s
113

real	0m1.715s
user	0m2.003s
sys	0m0.167s
[skynet](0) $ 

The slightly longer times on "aaaa" vs "abcd", despite nearly equal frequency,
suggest that a backwards-skipping rather than a forwards-skipping algorithm is
in use, as a homogeneous needle hurts the former more (though this inference
is admittedly questionable).

Number of needles
=================

Frequency of occurrence
=======================

We'll hold length constant, and vary the number of matches:

[skynet](0) $ for i in th gh es zz ; do time apt-file search $i | wc -l ; done
590993

real	0m14.883s
user	0m20.290s
sys	0m0.953s
186325

real	0m7.139s
user	0m10.440s
sys	0m0.453s
1626325

real	0m36.047s
user	0m43.220s
sys	0m2.497s
19455

real	0m2.698s
user	0m3.927s
sys	0m0.200s
[skynet](0) $ 

All other things being equal, we'd expect more matches to be faster, as
matching allows more text to be skipped (we can stop scanning a given haystack
line as soon as we find a match). That this is not the case suggests that the
postprocessing step of apt-file(1) (a perl "uniq sort") probably has crap
running time. Once again, I tested this with no output, and got very similar
numbers.


Speeding it up
==============

First, a constraint: we're going to use unmodified zlib to inflate the files,
which precludes matching as we inflate (there is no way to insert such a
function into zlib 1.2.7). There will thus necessarily be a strong dependency
between chunk inflation and chunk processing. This has two effects, both
negative: we can't process a chunk in parallel with that chunk's inflation, and
we are likely to suffer both a write-miss (during inflation) and a read-miss
(during processing) rather than merely a write-miss, or possibly no cache miss
at all for non-matching text (if we could inflate and process non-matches
wholly within registers).

Let's first look at how to best parallelize this, as that tends to affect
program organization overall, forcing shape on the other code.

There is one Contents file per |repository X architecture| (the "source"
pseudo-architecture does not currently provide a Contents file). Most
installations are likely to have between one and three Contents files ("main",
"contrib", and "non-free" for some primary repository), where one of these
files is very much larger than the other two. As an example:

[skynet](0) $ ls -l /var/cache/apt/apt-file/
    82015 ftp.us.debian.org_debian_dists_unstable_contrib_Contents-amd64.gz
 25868596 ftp.us.debian.org_debian_dists_unstable_main_Contents-amd64.gz
   794733 ftp.us.debian.org_debian_dists_unstable_non-free_Contents-amd64.gz
     1330 www.sprezzatech.com_apt_dists_unstable_contrib_Contents-amd64.gz
  4551278 www.sprezzatech.com_apt_dists_unstable_main_Contents-amd64.gz
    28211 www.sprezzatech.com_apt_dists_unstable_non-free_Contents-amd64.gz
[skynet](0) $ 

Ratios of length are: 5.68, 5.73, 9.69, 2.91, 21.21. Leaving out the second set
of repositories, they're 32.55 and 9.69 for Debian and 161.33 and 21.21 for
SprezzOS. Either way, there's about three orders of magnitude difference
between the smallest and the largest. Remember that these are compressed
versions of the files -- we'll be scanning proportionately more text.

So, we can parallelize across the files, but that's of limited utility; the
largest file(s) are likely to strongly dominate total execution time.
Nonetheless, we have gone ahead and done this. This has the added benefit that
if there is a problem reading any particular file (e.g., an I/O delay due to a
bad sector), the other files can still be processed.

Much more benefit can be had by parallelizing within a file (recall that
apt-file(1) already enjoys producer:consumer parallelism within a file by its
multiprocess approach). Remembering that we never want more threads in the
running state than there are available processors, there are three major
approaches we could pursue:

 (a) one thread inflates to multiple buffers in succession. worker threads
      start in stepwise fashion as buffers become available.
 (b) worker threads lock on the file being inflated; upon taking ownership,
      they inflate a chunk to their own buffer, and work on it.
 (c) worker threads statically chunk up the file to be inflated, then seek out
      the first zlib breakpoint following the start of their chunk. this is
      similar to how apt-show-versions works, since it doesn't know stanza
      boundaries ahead of time.

Options (a) and (b) are very similar -- they are equivalent for small files
("small" here meaning less than the product of chunk size and processor count).
The difference is that when all worker threads are occupied in (a), the
inflating thread continues to run, whereas inflation pauses when all worker
threads are running in (b). A fundamental advantage of (b) is that it uses a
fixed amount of memory, small relative to large files, whereas (a) can require
up to the full inflated size (this occurs if the inflating thread runs ahead of
the worker threads). Option (b) has threads waiting only when several finish at
the same time and contend for the stream; option (a) has threads waiting
whenever any thread finishes before the inflation thread has made a chunk
available. It seems clear that (b) is the better choice of the two.

Option (c) is interesting. It does a bit more work than (b), but eliminates all
delays (most importantly the compulsory startup delays) provided a friendly
chunk size during deflation. Should processing time be similar to inflation
time, this could become important. Currently, processing time is roughly
characterized as at least twice inflation time, and I'm not very knowledgeable
regarding zlib chunking characteristics, so this added complexity is not
planned. It's worth considering later, though, especially as the serial nature
of inflation will become more and more apparent as processor count increases.

Finally, we want to ensure we launch an optimal number of threads -- we don't
want to launch 8 threads on either a dual-core, power-limited tablet or a
1024-core supercomputer. Libblossom allows us to launch a thread per processing
unit. Rather than parallelizing directly within a directory or within a file,
then, we parallelize on a queue of file-offset objects. This queue has a fixed
upper size: one descriptor per thread (we'll never have 9 files open if we have
only 8 threads), where a descriptor is a map, a total map length, and the
offset through which processing has been scheduled. Once a map is fully
scheduled, its descriptor is removed. If there are no descriptors on the queue,
perform a readdir_r() on the DIR pointer (serialized by the kernel), and
possibly add a new descriptor to the queue (if the file returned by readdir_r()
is too large to process in one chunk).

We need to unlock before calling readdir_r(), but in this case we must relock
to add a new queue element. We could eliminate this by always performing a
readdir_r(), and simply adding the new element if there's already an element to
work on when we acquire the queue, cutting some overhead. This would violate
our assumption that we never have more files open than we have threads,
though, meaning that we'd no longer have a fixed queue size. That's no great
loss, however; the fixed size was more a nice property that fell out of our
design than a real goal. We can use a dynamic array and indices rather than
pointers, or even dynamically-allocated queue nodes, without much cost.

Let's take this idea further (d): rather than adding the new descriptor to the
queue and taking the front work element, the thread could immediately begin
working on the new descriptor. Enqueue a new node if the new descriptor is not
completely processed, and repeat this process until readdir_r() is exhausted.
This implies state equal to the aggregate length of the files, but means that
there is never delay blocking on another thread within a file.

Of course, we achieved this in (c) anyway, at the cost of a bit extra work.
Option (d) only improves on (c) in the fairly unlikely event that there are
more files than there are threads, burns probably at least as much extra work
with all those extraneous readdir_r() calls, and eats up more memory when it
does provide a benefit. We will go with (c).

Within a thread, we will use Boyer-Moore-Galil, or perhaps Apostolico-Giancarlo
or Volnitsky (the cases for which AG beats BMG can be considered pathological
for apt-file(1), but hey, completeness is good). Volnitsky is Wu-Manber-ish
(read: hash-based) and difficult to characterize. BMG is well-adapted to large
alphabets and low repetition of letters within a search term, which accurately
describes our search space.

Closing note
============

This has all assumed purely ASCII needles and haystacks, which seems valid
from what I've seen of the Contents files. This seems a pretty shortsighted
policy, though. If someone wants to point out chapter and verse where packages
are restricted to ASCII filenames in Debian, that'd comfort me a bit. I'd go
look it up, but I've been writing this for like four hours.

-- 
nick black     http://www.sprezzatech.com -- unix and hpc consulting
to make an apple pie from scratch, you need first invent a universe.
