Re: mining system information of bugs
On Thu, Feb 23, 2017 at 11:53:33AM +0800, Paul Wise wrote:
> I'd be interested in stats of Debian releases, preferred suites, apt
> policies and how many bugs were filed from systems running Debian
> derivatives, or other distros if any.
Alas, this data looks really unreliable. Entries often name a given
release while APT prefers a derivative's suite, or claim to be stable
rather than testing despite being filed many months before the release
(a one-shot "apt -t testing dist-upgrade"?), etc.
The only part I'd somewhat trust is "APT prefers unstable", and possibly
testing, but the converse, "APT prefers stable", is notoriously a lie.
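For illustration, contradictions like these could be flagged mechanically.
The Python sketch below is not part of the original analysis: the function
name suspicious() and both rules are assumptions, and the codename set simply
lists the Ubuntu suites that show up in the "APT prefers" column.

```python
# Illustrative only: flag rows whose "Debian Release" and "APT prefers"
# values contradict each other. The rules and names are assumptions.

# Ubuntu codenames that appear in the "APT prefers" column.
UBUNTU_SUITES = {"xenial", "yakkety", "trusty", "wily", "vivid",
                 "precise", "zesty", "oneiric"}


def suspicious(debrel, maxapt):
    """Return a reason string if the pair looks inconsistent, else None."""
    if debrel is None or maxapt is None:
        return None  # nothing to compare
    if maxapt in UBUNTU_SUITES:
        return "APT prefers a derivative's suite"
    if debrel.endswith("/sid") and maxapt in ("stable", "oldstable"):
        return "testing/unstable release string, but APT prefers " + maxapt
    return None


rows = [("stretch/sid", "stable"), ("8.6", "xenial"), ("stretch/sid", "unstable")]
flags = [suspicious(d, a) for d, a in rows]
```

A check like this only marks rows for manual review; it cannot tell which of
the two fields is the wrong one.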
All data below is 2016+, nulls not removed.
"Debian Release:"
Raw data:
stretch/sid    | 9839
8.3            |  647
8.6            |  639
8.5            |  611
8.4            |  541
8.2            |  308
jessie/sid     |  123
               |  108
Jessie*        |   71
8.0            |   58
7.11           |   55
7.9            |   52
7.10           |   26
8.1            |   19
sid (unstable) |   15
wheezy/sid     |   15
7.4            |   13
sid            |    9
6.0.10         |    9
7.8            |    7
squeeze/sid    |    6
7.7            |    5
jessie         |    4
7.0            |    4
stretch        |    4
7.1            |    2
5.0.4          |    2
lenny/sid      |    2
6.0.7          |    2
7.6            |    1
wheezy/testing |    1
7.2            |    1
6.0.4          |    1
Grouped by first digit:
stretch/sid    | 9839
8              | 2827
7              |  166
jessie/sid     |  123
               |  108
Jessie*        |   71   <-- why the star?
sid            |   24
wheezy/sid     |   15
6              |   12
squeeze/sid    |    6
stretch        |    4
lenny/sid      |    2
5              |    2
wheezy/testing |    1
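Collapsing point releases into their major version, as in the grouped table
above, could look roughly like this in Python. This is a sketch, not the
query actually used; group_key() is a made-up name, and it does not attempt
the other merges visible in the table (such as "sid" with "sid (unstable)").

```python
import re
from collections import Counter


def group_key(debrel):
    """Collapse '8.3', '8.6', ... into '8'; leave other strings untouched."""
    m = re.match(r"(\d+)\.", debrel)
    return m.group(1) if m else debrel


# A few rows taken from the raw "Debian Release:" counts.
raw = {"8.3": 647, "8.6": 639, "7.11": 55, "7.9": 52, "jessie/sid": 123}

grouped = Counter()
for debrel, n in raw.items():
    grouped[group_key(debrel)] += n
```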
"APT prefers"
unstable        | 4456
testing         | 4385
stable          | 3023
                |  687
xenial          |  152
oldstable       |  142
unreleased      |   71
buildd-unstable |   49
yakkety         |   45
trusty          |   44
wily            |   34
proposed        |   26
oldoldstable    |   25
squeeze         |   20
experimental    |   20
jessie          |   11
vivid           |    3
precise         |    3
zesty           |    2
oneiric         |    1
local           |    1
> apt policies
I did not collect this part.
> Are you planning on producing these graphs/stats continuously/automatically?
>
> Got any information on how they were produced?
I did not have the foresight to automate the steps involved.
First, I logged into bugs-mirror.debian.org and tarred up the .report files,
both open and archived. This took over two hours, and I would really prefer
not to burden project machines this way. Picking up only new reports can be
done, preferably without having to stat() everything.
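One possible way to pick up only new reports without a stat() per file is to
go by bug number alone. The Python sketch below assumes that report files are
named "<bug number>.report" (as the regex in the attached script suggests) and
that new bugs always get higher numbers; new_reports() and the sample names
are hypothetical. The names could come from os.listdir(), which reads the
directory entries without stat()ing each one.

```python
import re

REPORT_RE = re.compile(r"(\d+)\.report$")


def new_reports(names, last_imported):
    """Yield (bug, name) for report files numbered above last_imported."""
    for name in names:
        m = REPORT_RE.search(name)
        if m and int(m.group(1)) > last_imported:
            yield int(m.group(1)), name


# Made-up directory listing; 855001 stands for the highest bug already imported.
names = ["855001.report", "855002.log", "855900.report", "notes.txt"]
picked = sorted(new_reports(names, 855001))
```

This would miss nothing that was filed later, but it also ignores any changes
to files that were already imported.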
I then threw them into a single directory and weeded out those without
"^-- System Information:$". Then I parsed and imported them into SQL (script
attached).
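As a rough Python counterpart to the filtering and parsing step (the attached
script is Perl; this is only a sketch of the same idea), the field patterns
below mirror the ones in that script. parse_report() and the sample report are
made up, and allowing any leading whitespace before "APT prefers" is an
assumption.

```python
import re

# Field name -> pattern, modelled on the patterns in the attached script.
FIELDS = {
    "debrel": r"^Debian Release: (.*)",
    "maxapt": r"^[ \t]*APT prefers (.*)",
    "arch": r"^Architecture: (.*)",
    "kernel": r"^Kernel: (.*)",
    "shell": r"^Shell: (.*)",
    "init": r"^Init: (.*)",
}
MARKER = re.compile(r"^-- System Information:$", re.M)


def parse_report(text):
    """Return a dict of fields, or None if the marker line is missing."""
    m = MARKER.search(text)
    if m is None:
        return None  # such reports are filtered out
    block = text[m.end():]
    out = {}
    for name, pattern in FIELDS.items():
        hit = re.search(pattern, block, re.M)
        out[name] = hit.group(1).strip() if hit else None
    return out


# Made-up sample report, for illustration only.
sample = """Package: foo
Version: 1.0

-- System Information:
Debian Release: stretch/sid
  APT prefers unstable
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-1-amd64 (SMP w/4 CPU cores)
"""
info = parse_report(sample)
```

Missing fields come back as None, which maps naturally onto SQL NULL.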
Then came the worst part: fixing mangled data. There are a lot of common ways
to corrupt it and a bunch more uncommon ones. Looking in my .psql_history,
I see for example:
select arch,count(*) from si group by arch;
update si set arch='s390' where arch like 's390Locale%';
update si set arch=null where arch like 'dpkg: %';
update si set arch=null where arch like 'sh: %';
update si set arch=null where arch like 'All%';
update si set arch=null where arch like 'packages%';
update si set arch=null where arch like 'Any%';
update si set arch=trim(arch),farch=trim(farch);
update si set arch=null where arch like 'should%';
update si set arch=null where arch like '3.4.10%';
update si set arch=null where arch like 'Linux%';
update si set arch=regexp_replace(arch,'\).*',')');
and so on. I.e., once you have run select distinct and manually checked that
there's no entry like "death to systemd!", you can assume that any prose or
mail-caused corruption containing the substring "systemd" means systemd, but
writing such rules can't be automated. Then there are random strings inserted
in the middle of a word; this somehow happens a lot.
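Once such rules have been found by hand, they can at least be written down in
one place. The Python sketch below restates the arch updates quoted above as a
single function; clean_arch() is a hypothetical name, and the rule order is
simplified (it trims first), so it is an approximation of the SQL rather than
a replacement for it.

```python
import re

# Prefixes that mark the value as garbage (the "set arch=null" updates).
NULL_PREFIXES = ("dpkg: ", "sh: ", "All", "packages", "Any",
                 "should", "3.4.10", "Linux")


def clean_arch(arch):
    """Apply the hand-written cleanup rules to one arch value."""
    if arch is None:
        return None
    arch = arch.strip()
    if arch.startswith("s390Locale"):
        return "s390"  # the next header got glued onto the value
    if arch.startswith(NULL_PREFIXES):
        return None
    # Drop anything after the first closing parenthesis.
    return re.sub(r"\).*", ")", arch, count=1)


cleaned = [clean_arch(a) for a in
           [" amd64 (x86_64) ", "s390Locale: C", "dpkg: error", "or1k"]]
```

Values the rules do not know about, such as "or1k", pass through unchanged,
so rare but valid architectures are not lost.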
Only then do you get to analyze the data. group by date_trunc('month',
timestamp) is the workhorse.
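For readers without the database at hand, the same monthly bucketing can be
sketched in Python. The helper names are made up, the init values are sample
data, and truncating in UTC is an assumption (date_trunc works in the session
time zone).

```python
from collections import Counter
from datetime import datetime, timezone


def month_of(epoch_seconds):
    """Truncate a Unix timestamp to 'YYYY-MM' (UTC is an assumption)."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m")


def count_by_month(rows):
    """rows: iterable of (epoch_seconds, value) pairs."""
    counts = Counter()
    for stamp, value in rows:
        counts[(month_of(stamp), value)] += 1
    return counts


def utc_ts(year, month, day):
    """Helper for building sample timestamps."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Made-up sample rows: (report time, init system).
rows = [(utc_ts(2016, 1, 15), "systemd"),
        (utc_ts(2016, 1, 20), "systemd"),
        (utc_ts(2016, 2, 1), "sysvinit")]
monthly = count_by_month(rows)
```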
In hindsight, I could have gone the other way: assume that any data
that's not valid after a few common rules can be thrown away -- but then,
even that single "Jessie*" value that survived data massaging is over 0.5%
of the data for 2016. And we really want to see reports with arch="or1k" and
such.
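The "over 0.5%" figure can be checked against the grouped "Debian Release:"
table. A small Python sketch, with the counts copied from that table and the
empty row keyed as None:

```python
# Counts from the grouped "Debian Release:" table; None is the empty row.
counts = {
    "stretch/sid": 9839, "8": 2827, "7": 166, "jessie/sid": 123,
    None: 108, "Jessie*": 71, "sid": 24, "wheezy/sid": 15, "6": 12,
    "squeeze/sid": 6, "stretch": 4, "lenny/sid": 2, "5": 2,
    "wheezy/testing": 1,
}

total = sum(counts.values())
share = counts["Jessie*"] / total
print(f"{counts['Jessie*']} of {total} = {share:.2%}")
```

So a value that looks like noise still accounts for more than one report in
two hundred.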
Meow!
--
Autotools hint: to do a zx-spectrum build on a pdp11 host, type:
./configure --host=zx-spectrum --build=pdp11
#!/usr/bin/perl -w
use strict;
use DBI;

# Import the "-- System Information:" block of every si/*.report into table si.
my $dbh = DBI->connect("dbi:Pg:dbname=bugs;host=narchost")
    or die "Can't connect: $DBI::errstr\n";
$dbh->do("begin");
$dbh->do(<<END);
create table si
(
    bug varchar,
    timestamp timestamp,
    debrel varchar,
    maxapt varchar,
    arch varchar,
    farch varchar,
    kernel varchar,
    locale varchar,
    shell varchar,
    init varchar
)
END
my $ins=$dbh->prepare("insert into si values(?,to_timestamp(?),?,?,?,?,?,?,?,?)");

undef $/;    # slurp whole files
opendir(my $dh, "si") or die "opendir: $!\n";
while (defined(my $name = readdir $dh))
{
    my $f="si/$name";
    $f=~/(\d+)\.report/ or next;
    my $bug=$1;
    my $time=(stat($f))[9];    # file mtime is used as the report's timestamp
    open my $fh, "<", $f or die "Can't open $f: $!\n";
    $_=<$fh>;
    close $fh;
    # Keep only what follows the marker line.
    s/.*^-- System Information:$//sm or die "No System Information in $f\n";
    s/^ --.*//sm;    # cut everything from the first line starting with " --"
    s/=3D/=/g;       # undo quoted-printable "="
    s/=3D/=/g;       # a second pass also decodes double-encoded "=3D3D"
    s/fran.ais/français/g;
    # A list-context match leaves the field undef when its line is missing;
    # "my $x=$1 if ..." could carry over a stale value instead.
    my ($debrel)=/^Debian Release: (.*)/m;
    my ($maxapt)=/^ APT prefers (.*)/m;
    my ($arch)=/^Architecture: (.*)/m;
    my ($farch)=/^Foreign Architectures: (.*)/m;
    my ($kernel)=/^Kernel: (.*)/m;
    my ($locale)=/^Locale: (.*)/m;
    my ($shell)=/^Shell: (.*)/m;
    my ($init)=/^Init: (.*)/m;
    $ins->execute($bug,$time,$debrel,$maxapt,$arch,$farch,$kernel,$locale,$shell,$init)
        or die;
}
closedir $dh;
$dbh->do("commit");