
[PoC Patch] Parallelizing file-info (for huge packages)



Hi

I have written a proof-of-concept patch that runs file(1) in parallel
(in collection/file-info).  The parallelism should probably only be
enabled under certain conditions (e.g. for huge packages).  The patch
also drops the file-info output for directories, since it is unused and
is always "(setgid )?directory".

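To illustrate the idea without spawning any processes, here is a minimal
sketch of the round-robin split over the index lines.  The helper name
distribute_round_robin and the sample index lines are made up for this
example; the patch itself writes each name straight to the pipe of one
of four xargs/file jobs instead of collecting them in arrays.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the patch): assign the file names
# found in the index lines to $workers buckets in round-robin order.
sub distribute_round_robin {
    my ($lines, $workers) = @_;
    my @buckets;
    push @buckets, [] for 1 .. $workers;
    my $i = 0;
    for my $line (@$lines) {
        # Directory entries are skipped, as in the patch.
        next if $line =~ /^d/;
        # The sixth whitespace-separated field is the file name.
        my $name = (split(' ', $line, 6))[5];
        next unless defined $name;
        $name =~ s/ link to .*//;    # strip hard link target
        $name =~ s/ -> .*//;         # strip symlink target
        push @{ $buckets[$i] }, $name;
        $i = ($i + 1) % $workers;
    }
    return \@buckets;
}

# Made-up index lines in the "tar tv"-like layout the loop expects.
my @index = (
    'drwxr-xr-x root/root 0 2011-08-31 15:14 ./src/',
    '-rw-r--r-- root/root 12 2011-08-31 15:14 ./src/a.c',
    '-rw-r--r-- root/root 34 2011-08-31 15:14 ./src/b.c',
    'lrwxrwxrwx root/root 0 2011-08-31 15:14 ./src/c.c -> a.c',
);

my $buckets = distribute_round_robin(\@index, 4);
for my $n (0 .. $#$buckets) {
    printf "worker %d: %s\n", $n, join(', ', @{ $buckets->[$n] });
}
```

The sketch uses "%" so that any worker count works; the patch uses
"& 3", which is equivalent for its fixed count of 4.  One consequence
of this scheme is that the concatenated result is grouped per job, so
its order no longer follows the order of the index.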

The rationale for the patch is that I did some trivial benchmarks to
find our bottleneck(s).  For the test I used the eclipse source
package[1] and tmpfs.  On my machine this results in lintian finishing
its check after 1 minute and 10-15 seconds, where most of this time
(~1 minute) is spent running collections.
  The slowest two appeared to be unpacked (~11-12 seconds) and file-info
(~52-54 seconds).  The rest of the source collections complete
within the same second they are started.
  Using this patch I can reduce file-info to about 24 seconds[2].  The
eclipse binary packages seem to be gaining next to nothing from this
patch.  I assume it has something to do with the source package
containing over 38k files, while the binary packages "only" had
300-400ish files[3].

The numbers seem to hold even if I remove the tmpfs (within +/- 3
seconds).  All timing was done with "time" (thus all numbers have a
precision measurable in seconds).
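
If finer-grained numbers were wanted for a single step, one option
would be to time it from Perl with the core Time::HiRes module.  This is
only a sketch and not how the numbers above were obtained; the helper
name time_call and the sleep used as a stand-in workload are
assumptions made for the example.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference and return the elapsed
# wall-clock time in (fractional) seconds.
sub time_call {
    my ($code) = @_;
    my $start = [gettimeofday];
    $code->();
    # With a single argument, tv_interval measures up to "now".
    return tv_interval($start);
}

# Stand-in workload: a short sleep instead of a real collection run.
my $elapsed = time_call(sub { Time::HiRes::sleep(0.25) });
printf "elapsed: %.3f s\n", $elapsed;
```

This measures wall-clock time only, so it would still be affected by
whatever else the machine is doing during the run.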

~Niels

[1] eclipse 3.7~exp-2.dsc

Reason for choosing it: it was big and it was available!

[2] The machine did have plenty of cores and RAM to spare.

[3] As determined by tar vjtf $file and dpkg --contents $file piped
through wc -l.  I only checked the largest source tarball and the
largest binary package.

From d3864e610edba18b25dbff2d8dc836b4cfc62fba Mon Sep 17 00:00:00 2001
From: Niels Thykier <niels@thykier.net>
Date: Wed, 31 Aug 2011 15:14:37 +0200
Subject: [PATCH] Parallelize file-info with up to 4 invocations

---
 collection/file-info |   35 ++++++++++++++++++++++++++---------
 1 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/collection/file-info b/collection/file-info
index e61acb4..2a90959 100755
--- a/collection/file-info
+++ b/collection/file-info
@@ -22,7 +22,7 @@
 use strict;
 use warnings;
 
-use Cwd qw(realpath);
+use Cwd qw(cwd realpath);
 use FileHandle;
 use lib "$ENV{'LINTIAN_ROOT'}/lib";
 use Util;
@@ -35,6 +35,7 @@ my $last = '';
 
 my $helper = realpath("$0-helper");
 my $outfile = realpath('./file-info');
+my $dir = cwd;
 
 unlink($outfile);
 
@@ -48,28 +49,44 @@ open(INDEX, '<', 'index')
 chdir('unpacked')
     or fail("cannot chdir to unpacked directory: $!");
 
+my $i = 0;
+my @jobs;
+for ( ; $i < 4 ; $i++) {
 # We ignore failures from file because sometimes file returns a non-zero exit
 # status when it can't parse a file.  So far, the resulting output still
 # appears to be usable (although will contain "ERROR" strings, which Lintian
 # doesn't care about), and the only problem was the exit status.
-my %opts = ( pipe_in => FileHandle->new,
-	     out => $outfile,
-	     fail => 'never' );
-spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
-$opts{pipe_in}->blocking(1);
+    my %opts = ( pipe_in => FileHandle->new,
+                 out => "$outfile.$i",
+                 fail => 'never' );
+    spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
+    $opts{pipe_in}->blocking(1);
+    push @jobs, \%opts;
+}
+
+$i = 0;
 
 while (<INDEX>) {
     chomp;
+    # skip directories as the output is uninteresting and not used anyway.
+    # (index has a type which is easier to check as well)
+    next if /^d/o;
     $_ = (split(' ', $_, 6))[5];
     s/ link to .*//;
     s/ -> .*//;
     s/(\G|[^\\](?:\\\\)*)\\(\d{3})/"$1" . chr(oct $2)/ge;
     s/\\\\/\\/;
-    printf {$opts{pipe_in}} "%s\0", $_;
+    printf {$jobs[$i]->{pipe_in}} "%s\0", $_;
+    $i = ($i + 1) & 3;
 }
+
 close(INDEX) or fail("cannot close index file: $!");
 
-close $opts{pipe_in};
-reap(\%opts);
+foreach my $opts (@jobs) {
+    close $opts->{pipe_in};
+    reap($opts);
+}
+system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
+
 
 
-- 
1.7.5.4

