Re: [PoC Patch] Parallelizing file-info (for huge packages)

To: debian-lint-maint@lists.debian.org
Subject: Re: [PoC Patch] Parallelizing file-info (for huge packages)
From: Niels Thykier <niels@thykier.net>
Date: Sat, 03 Sep 2011 10:38:48 +0200
Message-id: <4E61E798.3060804@thykier.net>
In-reply-to: <4E5E3C99.2070900@thykier.net>
References: <4E5E3C99.2070900@thykier.net>

On 2011-08-31 15:52, Niels Thykier wrote:
> [...]

I have looked at this some more, and the original patch can be improved.
  xargs has two options of interest.  The first is --max-args, which
makes it run the command in smaller batches.
  The second is --max-procs, which makes xargs handle the
parallelization.  I strongly suspect that xargs does a much better job
here than my previous patch.
  All in all, the file-info script (with --max-procs=4) is now
down to ~15 seconds (from ~24) and the total unpack time for the eclipse
source is ~27 seconds[0].  The attached patch goes on top of my
previous patch[1].
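
A minimal sketch of how the two options interact, assuming GNU xargs (the
long option names are GNU-specific).  The batch size and process count here
are deliberately small so the effect is visible; the patch itself uses
--max-procs=4 and --max-args=65.  The "batch" word only fills in $0 for the
inner shell.

```shell
# Feed 10 NUL-terminated names to xargs.  Each invocation receives at most
# 4 names (--max-args) and at most 2 invocations run at once (--max-procs).
# Every invocation prints how many names it was given.
sizes=$(printf '%s\0' a b c d e f g h i j |
    xargs -0r --max-procs=2 --max-args=4 sh -c 'echo "$#"' batch |
    sort -n)

# 10 names in batches of 4 means three runs: two with 4 names, one with 2.
# The runs finish in no particular order, hence the sort above.
echo $sizes
```

Each run only prints one short line here, so the output stays readable; with
larger per-process output the lines from parallel runs can interleave.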

The downside of --max-procs is that the output from the sub-processes
becomes garbled, so we have to write to separate files manually and
merge the output.
  This is not difficult; my first approach is simply to use the pid to
give each process a unique file (opened in append mode to avoid
truncating an existing file in case a pid is reused for a later process).
  This works great, except that those files add up: the lab ends up with
(in my test) 500+ small "file-info" parts.  Merging them is fairly
trivial, but I do not like all the "parts".
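
The per-pid scheme can be sketched in plain shell roughly as follows.  This
is only an illustration, assuming GNU xargs: the "fake-info" prefix, the
"merged" file and the numbers standing in for file names are made up, and
the inner shell stands in for file-info-helper.

```shell
lab=$(mktemp -d)

# 100 fake "file names", 7 per invocation, up to 4 invocations in parallel.
# $0 of the inner shell is the output prefix; $$ is that shell's own pid.
seq 1 100 | tr '\n' '\0' |
    xargs -0r --max-procs=4 --max-args=7 sh -c '
        out="$0.$$"
        for f in "$@"; do
            # Append, so a part left behind by an earlier process that
            # happened to have the same pid is not truncated.
            printf "%s\n" "$f" >> "$out"
        done
    ' "$lab/fake-info"

# xargs only exits once all its children are done, so the parts are complete.
parts=$(ls "$lab"/fake-info.* | wc -l)

# Merge the parts.  The sort is only here to make the result deterministic.
cat "$lab"/fake-info.* | sort -n > "$lab/merged"
rm -f "$lab"/fake-info.*
lines=$(wc -l < "$lab/merged")
echo "$parts parts, $lines lines"
```

The number of parts grows with the number of invocations (input size divided
by --max-args), not with --max-procs, which is why they pile up on huge
packages.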

Do you have an idea for keeping the number of parts down to a reasonable
level?  My best alternative is to make a "merging daemon" and have the
file-info-helper processes feed it with their output.  That would remove
all the file parts, but at the price of complexity and an extra process.
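
One possible shape for such a daemon, sketched in shell with a named pipe.
This is an assumption about how it could look, not something the patch does:
the "pipe" and "merged" names are hypothetical, GNU xargs is assumed, and it
relies on each helper's output being small, since POSIX only guarantees that
writes of at most PIPE_BUF bytes to a pipe are not interleaved.  Real
file-info output for a batch of 65 files could be larger than that, so a
real daemon would need more care.

```shell
lab=$(mktemp -d)
mkfifo "$lab/pipe"

# The "merging daemon": a single reader that copies everything it receives
# into one file.
cat "$lab/pipe" > "$lab/merged" &

# Keep one writer open in the parent so the reader does not see end-of-file
# in a moment where no helper happens to have the pipe open.
exec 3> "$lab/pipe"

# The helpers write their (small) output straight into the pipe.
seq 1 100 | tr '\n' '\0' |
    xargs -0r --max-procs=4 --max-args=7 sh -c 'printf "%s\n" "$@" > "$0"' \
        "$lab/pipe"

# Close our writer; the reader now gets end-of-file and exits.
exec 3>&-
wait

lines=$(sort -n "$lab/merged" | uniq | wc -l)
echo "$lines distinct lines merged"
```

There are no part files left to clean up, but the reader is an extra process
and the open/close ordering has to be right or one side blocks.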

~Niels

[0] Only tested on tmpfs this time.

[1] The missing 0002 is my "poor man's benchmark" code in frontend/lintian.

From 4488929c17957cc76cc35dbb6dca7b182e4396a1 Mon Sep 17 00:00:00 2001
From: Niels Thykier <niels@thykier.net>
Date: Sat, 3 Sep 2011 10:14:42 +0200
Subject: [PATCH 3/3] Use xargs's parallelization in coll/file-info with
 --max-args

This is a couple of seconds faster on huge packages; xargs can do
a better job at scheduling the individual runs.  The downside is
that the output becomes garbled (unless written to "per-process
files" or similar).  For huge packages the number of such files
easily exceeds 100.
---
 collection/file-info        |   27 ++++++++-------------------
 collection/file-info-helper |   12 ++++++++++--
 2 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/collection/file-info b/collection/file-info
index 2a90959..313e29a 100755
--- a/collection/file-info
+++ b/collection/file-info
@@ -49,22 +49,14 @@ open(INDEX, '<', 'index')
 chdir('unpacked')
     or fail("cannot chdir to unpacked directory: $!");
 
-my $i = 0;
-my @jobs;
-for ( ; $i < 4 ; $i++) {
 # We ignore failures from file because sometimes file returns a non-zero exit
 # status when it can't parse a file.  So far, the resulting output still
 # appears to be usable (although will contain "ERROR" strings, which Lintian
 # doesn't care about), and the only problem was the exit status.
-    my %opts = ( pipe_in => FileHandle->new,
-                 out => "$outfile.$i",
-                 fail => 'never' );
-    spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
-    $opts{pipe_in}->blocking(1);
-    push @jobs, \%opts;
-}
-
-$i = 0;
+my %opts = ( pipe_in => FileHandle->new,
+             fail => 'never' );
+spawn(\%opts, ['xargs', '-0r', '--max-procs=4', '--max-args=65', $helper, $outfile]);
+$opts{pipe_in}->blocking(1);
 
 while (<INDEX>) {
     chomp;
@@ -76,17 +68,14 @@ while (<INDEX>) {
     s/ -> .*//;
     s/(\G|[^\\](?:\\\\)*)\\(\d{3})/"$1" . chr(oct $2)/ge;
     s/\\\\/\\/;
-    printf {$jobs[$i]->{pipe_in}} "%s\0", $_;
-    $i = ($i + 1) & 3;
+    printf {$opts{pipe_in}} "%s\0", $_;
 }
 
 close(INDEX) or fail("cannot close index file: $!");
 
-foreach my $opts (@jobs) {
-    close $opts->{pipe_in};
-    reap($opts);
-}
-system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
+close $opts{pipe_in};
+reap(\%opts);
 
+system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
 
 
diff --git a/collection/file-info-helper b/collection/file-info-helper
index 3c7bde0..f6583da 100755
--- a/collection/file-info-helper
+++ b/collection/file-info-helper
@@ -3,7 +3,13 @@
 use strict;
 use warnings;
 
-while ( my $line = <> ) {
+my $ofile = shift;
+$ofile .= ".$$";
+open my $out, '>>', $ofile or die "opening $ofile: $!";
+
+open my $cmd, '-|', 'file', '-F', '', '--print0', '--', @ARGV or die "file: $!";
+
+while ( my $line = <$cmd> ) {
     my ($file, $type) = $line =~ (m/^(.*?)\x00(.*)$/o);
     if ($file =~ m/\.gz$/o && -e $file && ! -l $file && $type !~ m/compressed/o){
         # While file could be right, it is unfortunately
@@ -30,6 +36,8 @@ while ( my $line = <> ) {
         }
         $type = "$type, $text" if $text;
     }
-    printf "%s%c%s\n", $file , 0, $type;
+    printf $out "%s%c%s\n", $file , 0, $type;
 }
 
+close $cmd;
+close $out or die "closing $ofile: $!";
-- 
1.7.5.4
