Bug#796170: lintian: [new check] warn on non-UTF8 text files

To: Adam Borowski <kilobyte@angband.pl>, 796170@bugs.debian.org
Cc: Axel Beckert <abe@debian.org>
Subject: Bug#796170: lintian: [new check] warn on non-UTF8 text files
From: Niels Thykier <niels@thykier.net>
Date: Thu, 20 Aug 2015 09:58:25 +0200
Message-id: <55D588A1.5090401@thykier.net>
Reply-to: Niels Thykier <niels@thykier.net>, 796170@bugs.debian.org
In-reply-to: <20150819214359.9385.53342.reportbug@umbar.angband.pl>
References: <20150819214359.9385.53342.reportbug@umbar.angband.pl>

On 2015-08-19 23:43, Adam Borowski wrote:
> Package: lintian
> Version: 2.5.36.1
> Severity: wishlist
> Tags: patch
> 
> 

Hi,

Thanks for the suggestion and the prototype patch.

I think it is an interesting proposal, and we could try it as an
experimental check to see whether it is feasible.

I have some minor technical remarks on the patch, interleaved below.
It will also need a test case in the t/tests suite.


Axel (XTaran) and I are at DebConf if you need help with anything, etc. :)

> [...]
> 
> 
> 0001-New-experimental-tag-text-file-uses-obsolete-encodin.patch
> 
> 
> From 902283f122c71c88b968abfc3c778686200c9361 Mon Sep 17 00:00:00 2001
> From: Adam Borowski <kilobyte@angband.pl>
> Date: Wed, 19 Aug 2015 23:32:39 +0200
> Subject: [PATCH] New experimental tag: text-file-uses-obsolete-encoding
> 
> Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> ---
>  checks/files.desc   | 11 +++++++++++
>  checks/files.pm     | 10 +++++++++-
>  lib/Lintian/Util.pm | 22 ++++++++++++++++++++++
>  3 files changed, 42 insertions(+), 1 deletion(-)
> 
> [...]
> diff --git a/checks/files.pm b/checks/files.pm
> index b816ed8..0940bbf 100644
> --- a/checks/files.pm
> +++ b/checks/files.pm
> @@ -27,7 +27,7 @@ use Lintian::Data;
>  use Lintian::Output qw(warning);
>  use Lintian::Tags qw(tag);
>  use Lintian::Util qw(drain_pipe fail is_string_utf8_encoded open_gz
> -  signal_number2name strip normalize_pkg_path);
> +  signal_number2name strip normalize_pkg_path file_is_non_utf8_text);
>  use Lintian::SlidingWindow;
>  
>  use constant BLOCKSIZE => 16_384;
> @@ -1514,6 +1514,14 @@ sub run {
>                    if $info->index($1);
>              }
>  
> +            # ---------------- encoding
> +            if (   $fname =~ m{^(?:usr/)?s?bin/}

Is this already guarded by a "$file->is_regular_file" (or
$file->is_file)?  Otherwise, this will break horribly with lintian
trying to open directories, named pipes, etc.
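
To illustrate the kind of guard I mean, here is a minimal sketch in
plain Perl.  It uses the core -f/-l file tests rather than lintian's own
$file->is_regular_file, and the helper name first_line_if_regular is
made up for this example:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read the first line, but only from regular files.
# Directories, named pipes, symlinks etc. are skipped without opening them.
sub first_line_if_regular {
    my ($path) = @_;
    return if -l $path or not -f $path;
    open(my $fd, '<', $path) or die "open $path: $!";
    my $line = <$fd>;
    close($fd);
    return $line;
}

# Set up one regular file inside a temporary directory.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/README.txt";
open(my $out, '>', $file) or die "open $file: $!";
print {$out} "hello\n";
close($out) or die "close $file: $!";

for my $path ($file, $dir) {
    my $line = first_line_if_regular($path);
    printf "%s: %s\n", $path, defined($line) ? 'read' : 'skipped';
}
```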

> +                or $fname =~ m{\.(?:pm|py|pl|txt)$}

Minor nit:  Perhaps \.(?:p[myl]|txt(?:\.gz)?)$

 * Would also test gzip-compressed text files
 * p[myl] is slightly more optimal than pm|py|pl
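
As a rough sketch of the difference between the two patterns (the sample
paths are invented for illustration and not taken from any package):

```perl
use strict;
use warnings;

# Made-up file names, for illustration only.
my @names = qw(
  usr/share/doc/foo/README.txt
  usr/share/doc/foo/changelog.txt.gz
  usr/share/perl5/Foo.pm
  usr/lib/python3/foo.py
  usr/share/foo/helper.pl
  usr/share/foo/data.png
);

my $original  = qr{\.(?:pm|py|pl|txt)$};
my $suggested = qr{\.(?:p[myl]|txt(?:\.gz)?)$};

# Return a reference to the list of names matching the given pattern.
sub matching {
    my ($re) = @_;
    return [grep { $_ =~ $re } @names];
}

my $old = matching($original);
my $new = matching($suggested);

printf "original:  %s\n", join(' ', @{$old});
printf "suggested: %s\n", join(' ', @{$new});
```

Both patterns agree on the .pm/.py/.pl/.txt names; only the suggested
one also picks up changelog.txt.gz.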

> [...]
> index 0b8fa5a..f68ea71 100644
> --- a/lib/Lintian/Util.pm
> +++ b/lib/Lintian/Util.pm
> @@ -62,6 +62,7 @@ BEGIN {
>            slurp_entire_file
>            file_is_encoded_in_non_utf8
>            is_string_utf8_encoded
> +          file_is_non_utf8_text
>            fail
>            strip
>            lstrip
> @@ -859,6 +860,27 @@ sub file_is_encoded_in_non_utf8 {
>      return $line;
>  }
>  
> +=item file_is_non_utf8_text (...)
> +
> +Both binary files and text files encoded in proper UTF8 give a negative
> +answer.
> +
> +=cut
> +
> +sub file_is_non_utf8_text {
> +    my ($file) = @_;
> +
> +    my $fd = ($file->file_info =~ m/gzip compressed/) ? $file->open_gz : $file->open;
> +    my $bad=0;
> +    while (<$fd>) {

I would like to see this use a named variable here (to keep the uses of
"$_" down).

> +        return close($fd), 0 if (!m/^[\t\n\f\r -\x{ff}]+$/);

The $fd might be a pipe to a subprocess (in the open_gz case).  Returning
early like this would need a "drain_pipe" call before the close to avoid
spurious errors caused by gzip dying from a "broken pipe".
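
Roughly what I have in mind, as a standalone sketch: the drain helper
below is only a stand-in for Lintian::Util's drain_pipe (I am assuming
it simply reads and discards until EOF), and a perl child process plays
the role of gzip.  It also reads into a named variable instead of "$_":

```perl
use strict;
use warnings;

# Stand-in for drain_pipe: read and discard everything up to EOF.
sub drain {
    my ($fd) = @_;
    my $buf;
    1 while read($fd, $buf, 4096);
    return;
}

# Read only the first line from a chatty child, then close the pipe.
# Returns the line and the child's wait status ($?).
sub read_first_line {
    # The child writes far more than a pipe buffer can hold.
    my $code = 'print "x" x 80, "\n" for 1 .. 100_000';
    open(my $fd, '-|', $^X, '-e', $code) or die "cannot fork: $!";
    my $line = <$fd>;
    drain($fd);    # without this, the child may die from SIGPIPE
    close($fd);    # waits for the child and sets $?
    return ($line, $?);
}

my ($line, $status) = read_first_line();
printf "read %d bytes, child status %d\n", length($line), $status;
```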

> [...]
>
