Re: [PoC Patch] Parallelizing file-info (for huge packages)
On 2011-08-31 15:52, Niels Thykier wrote:
> [...]
I have looked at this some more and the original patch can be improved.
xargs has two options of interest. The first is --max-args, which makes it
run the command in smaller batches.
The second is --max-procs, which makes xargs handle the
parallelization. I strongly suspect that xargs does a much better job
here than my previous patch.
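To make the two options concrete, here is a minimal shell sketch (it is not part of the patch). It uses -n and -P, the short forms of --max-args and --max-procs; the input names a..e and the batch size of 2 are made up for illustration.

```sh
# Feed five NUL-separated names to xargs: at most 2 arguments per
# invocation (-n 2), up to 4 invocations running at once (-P 4).
# Each invocation reports how many arguments it received.
batches=$(printf '%s\0' a b c d e |
    xargs -0r -n 2 -P 4 sh -c 'echo "$#"' sh |
    sort -n | tr '\n' ' ')
echo "arguments per invocation: $batches"
```

The counts are sorted because the order in which parallel invocations finish is not deterministic.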
All in all, the file-info script (with --max-procs=4) is now
down to ~15 seconds (from ~24) and the total unpack time for the eclipse
source is ~27 seconds[0]. The attached patch goes on top of my
previous patch[1].
The downside of --max-procs is that the output from the sub-processes
becomes garbled, so we have to manually write to separate files and
merge the output.
This is not difficult; my first approach is simply to use the pid to
give them a unique file (using append to avoid truncating an existing
file in case a pid is reused for a later process).
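The per-pid scheme can be sketched in shell roughly as follows. This is only an illustration of the idea, not the helper from the patch: the inline sh worker, the temporary directory and the names a..f are all assumptions.

```sh
workdir=$(mktemp -d)
# Each worker appends to "<base>.<its own pid>"; '>>' means a reused pid
# adds to the earlier part instead of truncating it.
printf '%s\0' a b c d e f |
    xargs -0r -n 2 -P 3 sh -c 'printf "%s\n" "$@" >> "$0.$$"' "$workdir/file-info"
parts=$(ls "$workdir" | wc -l)
lines=$(cat "$workdir"/file-info.* | wc -l)
echo "$lines lines spread over $parts part file(s)"
rm -r "$workdir"
```

Three batches give at most three parts here; with --max-args=65 on a huge package the number of parts grows with the number of batches.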
This works great except those files add up... the lab ends up with (in
my test) 500+ small "file-info" parts. Merging them is fairly
trivial, but I do not like all the "parts".
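For reference, the merge step amounts to something like the sketch below. The part names and their contents are invented stand-ins; removing the parts afterwards is an assumption about what one would want, the patch itself only concatenates.

```sh
workdir=$(mktemp -d)
# Stand-ins for the per-process parts (the pid suffixes are made up).
printf 'a: text\n' > "$workdir/file-info.1001"
printf 'b: data\n' > "$workdir/file-info.1002"
# The glob "file-info.*" does not match the target "file-info" itself.
cat "$workdir"/file-info.* > "$workdir/file-info"
rm -f "$workdir"/file-info.*
merged=$(cat "$workdir/file-info")
remaining=$(ls "$workdir")
rm -r "$workdir"
```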
Do you have an idea for keeping the number of parts at a reasonable level? My
best alternative is to make a "merging daemon" and have the
file-info-helper processes feed it with their output. That would remove
all the file parts, but at the price of complexity and an extra process.
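A rough shell sketch of the "merging daemon" idea, using a named pipe with a single reader that owns the output file. Everything in it (the FIFO name, the inline sh workers, the names a..f) is hypothetical. It also relies on each worker emitting short records, since POSIX only guarantees that writes of at most PIPE_BUF bytes to a pipe are not interleaved.

```sh
workdir=$(mktemp -d)
fifo="$workdir/merge.fifo"
mkfifo "$fifo"
# The "daemon": one reader that writes everything to the final file.
cat "$fifo" > "$workdir/file-info" &
reader=$!
# Hold a write end open so the reader does not see EOF between workers.
exec 3> "$fifo"
printf '%s\0' a b c d e f |
    xargs -0r -n 2 -P 3 sh -c 'printf "%s\n" "$@" > "$0"' "$fifo"
# All workers are done; closing the last write end lets the reader finish.
exec 3>&-
wait "$reader"
merged=$(sort "$workdir/file-info" | tr '\n' ' ')
rm -r "$workdir"
```

This trades the pile of part files for one extra process and the FIFO setup, which is the complexity cost mentioned above.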
~Niels
[0] Only tested on tmpfs this time.
[1] The missing 0002 is my "poor man's benchmark" code in frontend/lintian.
>From 4488929c17957cc76cc35dbb6dca7b182e4396a1 Mon Sep 17 00:00:00 2001
From: Niels Thykier <niels@thykier.net>
Date: Sat, 3 Sep 2011 10:14:42 +0200
Subject: [PATCH 3/3] Use xargs's parallelization in coll/file-info with
--max-args
This is a couple of seconds faster on huge packages; xargs can do
a better job at scheduling the individual runs. The downside is
that the output becomes garbled (unless written to "per-process
files" or similar). For huge packages the number of such files easily
exceeds 100.
---
collection/file-info | 27 ++++++++-------------------
collection/file-info-helper | 12 ++++++++++--
2 files changed, 18 insertions(+), 21 deletions(-)
diff --git a/collection/file-info b/collection/file-info
index 2a90959..313e29a 100755
--- a/collection/file-info
+++ b/collection/file-info
@@ -49,22 +49,14 @@ open(INDEX, '<', 'index')
chdir('unpacked')
or fail("cannot chdir to unpacked directory: $!");
-my $i = 0;
-my @jobs;
-for ( ; $i < 4 ; $i++) {
# We ignore failures from file because sometimes file returns a non-zero exit
# status when it can't parse a file. So far, the resulting output still
# appears to be usable (although will contain "ERROR" strings, which Lintian
# doesn't care about), and the only problem was the exit status.
- my %opts = ( pipe_in => FileHandle->new,
- out => "$outfile.$i",
- fail => 'never' );
- spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
- $opts{pipe_in}->blocking(1);
- push @jobs, \%opts;
-}
-
-$i = 0;
+my %opts = ( pipe_in => FileHandle->new,
+ fail => 'never' );
+spawn(\%opts, ['xargs', '-0r', '--max-procs=4', '--max-args=65', $helper, $outfile]);
+$opts{pipe_in}->blocking(1);
while (<INDEX>) {
chomp;
@@ -76,17 +68,14 @@ while (<INDEX>) {
s/ -> .*//;
s/(\G|[^\\](?:\\\\)*)\\(\d{3})/"$1" . chr(oct $2)/ge;
s/\\\\/\\/;
- printf {$jobs[$i]->{pipe_in}} "%s\0", $_;
- $i = ($i + 1) & 3;
+ printf {$opts{pipe_in}} "%s\0", $_;
}
close(INDEX) or fail("cannot close index file: $!");
-foreach my $opts (@jobs) {
- close $opts->{pipe_in};
- reap($opts);
-}
-system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
+close $opts{pipe_in};
+reap(\%opts);
+system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
diff --git a/collection/file-info-helper b/collection/file-info-helper
index 3c7bde0..f6583da 100755
--- a/collection/file-info-helper
+++ b/collection/file-info-helper
@@ -3,7 +3,13 @@
use strict;
use warnings;
-while ( my $line = <> ) {
+my $ofile = shift;
+$ofile .= ".$$";
+open my $out, '>>', $ofile or die "opening $ofile: $!";
+
+open my $cmd, '-|', 'file', '-F', '', '--print0', '--', @ARGV or die "file: $!";
+
+while ( my $line = <$cmd> ) {
my ($file, $type) = $line =~ (m/^(.*?)\x00(.*)$/o);
if ($file =~ m/\.gz$/o && -e $file && ! -l $file && $type !~ m/compressed/o){
# While file could be right, it is unfortunately
@@ -30,6 +36,8 @@ while ( my $line = <> ) {
}
$type = "$type, $text" if $text;
}
- printf "%s%c%s\n", $file , 0, $type;
+ printf $out "%s%c%s\n", $file , 0, $type;
}
+close $cmd;
+close $out or die "closing $ofile: $!";
--
1.7.5.4