Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources

To: Johannes Schauer <j.schauer@email.de>, 757551@bugs.debian.org, Jakub Wilk <jwilk@debian.org>
Subject: Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
From: Niels Thykier <niels@thykier.net>
Date: Sun, 10 Aug 2014 10:39:48 +0200
Message-id: <[🔎] 53E72FD4.7040602@thykier.net>
Reply-to: Niels Thykier <niels@thykier.net>, 757551@bugs.debian.org
In-reply-to: <[🔎] 20140810080524.19236.68906@hoothoot>
References: <[🔎] 20140809074306.24380.17110.reportbug@hoothoot> <[🔎] 20140809212154.GA3876@jwilk.net> <[🔎] 20140810072845.19236.28120@hoothoot> <[🔎] 20140810074131.GA8080@jwilk.net> <[🔎] 20140810080231.19236.32539@hoothoot> <[🔎] 20140810080524.19236.68906@hoothoot>

On 2014-08-10 10:05, Johannes Schauer wrote:
> Quoting Johannes Schauer (2014-08-10 10:02:31)
>> > that's very useful and I added this as a test case. The updated patch is
>> > attached.
> and it would help to attach the right file...
> 
> 

Hi,

Thanks for working on this and providing a patch.  :)

I got a few questions to it though based on a very quick review.  My
comments are interleaved in the patch.

> 0001-check-whether-the-dep-5-debian-copyright-wildcards-m.patch
> 
> 
> From 040fdbf1b96d1b6dcfa55c67c52e72b3a58ab8ce Mon Sep 17 00:00:00 2001
> From: Johannes Schauer <j.schauer@email.de>
> Date: Fri, 8 Aug 2014 12:00:39 +0200
> Subject: [PATCH] check whether the dep-5 debian/copyright wildcards match all
>  files
> 
>  [...]
> 
> diff --git a/checks/source-copyright.desc b/checks/source-copyright.desc
> index 921e7b9..b912b2a 100644
> --- a/checks/source-copyright.desc
> +++ b/checks/source-copyright.desc
> [...]
> index 6b9a0c5..cfdaef7 100644
> --- a/checks/source-copyright.pm
> +++ b/checks/source-copyright.pm
> @@ -24,6 +24,9 @@ use strict;
>  use warnings;
>  use autodie;
>  
> +use Cwd qw(getcwd);

This does not seem to be used anywhere in the diff?

> +use File::Find qw();
> +
>  use List::MoreUtils qw(any);
>  use Text::Levenshtein qw(distance);
>  
> @@ -252,6 +255,8 @@ sub _parse_dep5 {
>      }
>  
>      my @commas_in_files;
> +    my %file_coverage = map { $_ => 0 } get_all_files($info);
> +    my %file_para_coverage = ();

Minor nit pick: Consider merging this with the @commas_in_files
declaration, a la:

  my (@commas_in_files, %file_para_coverage);


Alternatively, move it up under @commas_in_files and remove the "= ();".
 Then 5.20 should be able to this optimisation for us.

>      my $i = 0;
>      my $current_line = 0;
>      for my $para (@dep5) {
> @@ -297,6 +302,31 @@ sub _parse_dep5 {
>                  @commas_in_files = ($i, $files_fname);
>              }
>  
> +            # only attempt to evaluate globbing if commas could be legal
> +            if (not @commas_in_files or any { m/,/xsm } $info->sorted_index) {

The "any { m/,/xsm } $info->sorted_index" part is an invariant in the
"for my $para (@dep5)" loop.  I think we should move it out of the loop,
so we don't loop over all files in the index repeatedly just to conclude
that none of them have any commas.

> +                my @wildcards = split /[\n\t ]+/, $files;
> +                for my $wildcard (@wildcards) {
> +                    my ($regex, $wildcard_error) = wildcard_to_regex($wildcard);
> +                    if (defined $wildcard_error) {
> +                        tag 'invalid-escape-sequence-in-dep5-copyright', substr($wildcard_error, 0, 2) . " (paragraph at line $current_line)";
> +                        next;
> +                    }
> +
> +                    my $used = 0;
> +                    $file_para_coverage{$current_line} = 0;
> +                    for my $srcfile (keys %file_coverage) {
> +                        if ($srcfile =~ $regex) {
> +                            $used = 1;
> +                            $file_coverage{$srcfile} = $current_line;
> +                            $file_para_coverage{$current_line} = 1;
> +                        }
> +                    }
> +                    if (not $used) {
> +                        tag 'wildcard-matches-nothing-in-dep5-copyright', "$wildcard (paragraph at line $current_line)";
> +                    }
> +                }
> +            }
> +
>              my ($found_license, $full_license, $short_license, @short_licenses)
>                = parse_license($license,$current_line);
>              if ($found_license) {
> @@ -325,12 +355,21 @@ sub _parse_dep5 {
>                $current_line;
>          }
>      }
> -    if (@commas_in_files) {
> +    if (@commas_in_files and not any { m/,/xsm } $info->sorted_index) {

The "any {...} $info->sorted_index" here could be optimised away as well
if the change above is implemented.

>          my ($paragraph_no, $field_name) = @commas_in_files;
> [...]
>      }
>      while ((my $license, $i) = each %required_standalone_licenses) {
> @@ -400,6 +439,55 @@ sub get_field {
>      return;
>  }
>  
> +sub wildcard_to_regex {
> +    my ($regex) = @_;
> +    $regex =~ s,^\./+,,;
> +    $regex =~ s,//+,/,g;
> +    my $error;
> +    eval {
> +        $regex =~ s{
> +            (\*) |
> +            (\?) |
> +            ([^*?\\]+) |
> +            (\\[\\*?]) |
> +            (.+)
> +        }{
> +            if (defined $1) {
> +                '.*';
> +            } elsif (defined $2) {
> +                '.'
> +            } elsif (defined $3) {
> +                quotemeta($3);
> +            } elsif (defined $4) {
> +                $4;

Shouldn't this be quotemeta'ed as well? Otherwise "file.png" would
become eqv. to "file?png".

> +            } else {
> +                $error = $5;
> +                die;
> +            }
> +        }egx;
> +    };
> +    if ($@) {
> +        return (undef, $error);
> +    } else {
> +        return (qr/^(?:$regex)$/, undef);
> +    }
> +}
> +
> [...]

Reply to:

Follow-Ups:
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Jakub Wilk <jwilk@debian.org>
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Johannes Schauer <j.schauer@email.de>

References:
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Johannes Schauer <j.schauer@email.de>
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Jakub Wilk <jwilk@debian.org>
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Johannes Schauer <j.schauer@email.de>
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Jakub Wilk <jwilk@debian.org>
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Johannes Schauer <j.schauer@email.de>
- Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
  - From: Johannes Schauer <j.schauer@email.de>

Prev by Date: Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
Next by Date: Processed: tagging 734402
Previous by thread: Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
Next by thread: Bug#757551: lintian: check if DEP-5 debian/copyright covers all files in the unpacked sources
Index(es):
- Date
- Thread