[PoC Patch] Parallelizing file-info (for huge packages)
Hi
I have written a Proof of Concept patch for running file(1) in parallel
(in collection/file-info). The parallelism should probably only be
enabled under some conditions (e.g. for huge packages). The patch also
drops the file-info entries for directories, since they are unused and
always read "(setgid )?directory".
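The dispatch scheme in the patch (skip directory entries, then hand the
remaining names to the workers in round-robin order) can be sketched
outside of Perl roughly as follows. This is only an illustration, not the
patch itself: distribute_index is a hypothetical helper name, the index is
assumed to have tar-style lines whose first character is the entry type and
whose sixth field is the name, and the sketch writes newline-separated
names, leaving out the octal-escape decoding and NUL termination that
collection/file-info performs.

```shell
# distribute_index INDEX PREFIX [JOBS]
# Hypothetical helper: split the names in INDEX over JOBS work lists
# (PREFIX.0, PREFIX.1, ...) in round-robin order, skipping directories.
distribute_index() {
    index=$1
    prefix=$2
    jobs=${3:-4}    # the patch hard-codes 4 workers
    awk -v n="$jobs" -v p="$prefix" '
        /^d/ { next }    # directory entries are not interesting
        {
            name = $0
            # strip the five leading fields (mode, owner, size, date, time)
            for (i = 0; i < 5; i++)
                sub(/^[ \t]*[^ \t]+[ \t]+/, "", name)
            sub(/ link to .*/, "", name)    # hard link target
            sub(/ -> .*/, "", name)         # symlink target
            print name > (p "." (k % n))    # k starts at 0: round-robin
            k++
        }
    ' "$index"
}
```

Each work list could then be fed to its own file invocation, for example
with tr '\n' '\0' < PREFIX.0 | xargs -0r file -F '' --print0 --, which is
the same command line the patch spawns per worker.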
The rationale for the patch is that I did some trivial benchmarks to
find our bottleneck(s). For the test I used the eclipse source
package[1] and tmpfs. On my machine this results in lintian finishing
its check after 1 minute and 10-15 seconds, where most of this time
(~1 minute) is spent running collections.
The slowest two appeared to be unpacked (~11-12 seconds) and file-info
(~52-54 seconds). The rest of the source collections complete within
the same second they are started.
Using this patch I can reduce file-info to about 24 seconds[2]. The
eclipse binary packages seem to be gaining next to nothing from this
patch. I assume it has something to do with the source package
containing over 38k files, while the binary packages "only" had
300-400ish files[3].
The numbers seem to hold even if I remove the tmpfs (within +/- 3
seconds). All timing was done with "time" (thus all numbers have a
precision measurable in seconds).
~Niels
[1] eclipse 3.7~exp-2.dsc
Reason for choosing it: it was big and it was available!
[2] The machine did have plenty of cores and RAM to spare.
[3] As determined by tar vjtf $file and dpkg --contents $file piped
through wc -l. I only checked the largest source tarball and the
largest binary package.
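The entry counts from [3] could be reproduced with a small wrapper along
these lines. It is a sketch under assumptions: count_entries is a
hypothetical name, the .tar.gz branch is an addition for convenience, and
the -v flag used in [3] is dropped because it does not change the number
of lines.

```shell
# count_entries FILE: print the number of entries in a tarball or .deb.
count_entries() {
    case "$1" in
        *.deb)     dpkg --contents "$1" | wc -l ;;
        *.tar.bz2) tar -tjf "$1" | wc -l ;;
        *.tar.gz)  tar -tzf "$1" | wc -l ;;
        *)         echo "unsupported file type: $1" >&2; return 1 ;;
    esac
}
```

Note that the count includes directory entries, so the number of names
actually passed to file would be somewhat lower with the patch applied.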
From d3864e610edba18b25dbff2d8dc836b4cfc62fba Mon Sep 17 00:00:00 2001
From: Niels Thykier <niels@thykier.net>
Date: Wed, 31 Aug 2011 15:14:37 +0200
Subject: [PATCH] Parallelize file-info with up to 4 invocations
---
collection/file-info | 35 ++++++++++++++++++++++++++---------
1 files changed, 26 insertions(+), 9 deletions(-)
diff --git a/collection/file-info b/collection/file-info
index e61acb4..2a90959 100755
--- a/collection/file-info
+++ b/collection/file-info
@@ -22,7 +22,7 @@
use strict;
use warnings;
-use Cwd qw(realpath);
+use Cwd qw(cwd realpath);
use FileHandle;
use lib "$ENV{'LINTIAN_ROOT'}/lib";
use Util;
@@ -35,6 +35,7 @@ my $last = '';
my $helper = realpath("$0-helper");
my $outfile = realpath('./file-info');
+my $dir = cwd;
unlink($outfile);
@@ -48,28 +49,44 @@ open(INDEX, '<', 'index')
chdir('unpacked')
or fail("cannot chdir to unpacked directory: $!");
+my $i = 0;
+my @jobs;
+for ( ; $i < 4 ; $i++) {
# We ignore failures from file because sometimes file returns a non-zero exit
# status when it can't parse a file. So far, the resulting output still
# appears to be usable (although will contain "ERROR" strings, which Lintian
# doesn't care about), and the only problem was the exit status.
-my %opts = ( pipe_in => FileHandle->new,
- out => $outfile,
- fail => 'never' );
-spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
-$opts{pipe_in}->blocking(1);
+ my %opts = ( pipe_in => FileHandle->new,
+ out => "$outfile.$i",
+ fail => 'never' );
+ spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
+ $opts{pipe_in}->blocking(1);
+ push @jobs, \%opts;
+}
+
+$i = 0;
while (<INDEX>) {
chomp;
+ # skip directories as the output is uninteresting and not used anyway.
+ # (index has a type which is easier to check as well)
+ next if /^d/o;
$_ = (split(' ', $_, 6))[5];
s/ link to .*//;
s/ -> .*//;
s/(\G|[^\\](?:\\\\)*)\\(\d{3})/"$1" . chr(oct $2)/ge;
s/\\\\/\\/;
- printf {$opts{pipe_in}} "%s\0", $_;
+ printf {$jobs[$i]->{pipe_in}} "%s\0", $_;
+ $i = ($i + 1) & 3;
}
+
close(INDEX) or fail("cannot close index file: $!");
-close $opts{pipe_in};
-reap(\%opts);
+foreach my $opts (@jobs) {
+ close $opts->{pipe_in};
+ reap($opts);
+}
+system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
+
--
1.7.5.4