Bug#796170: lintian: [new check] warn on non-UTF8 text files
On 2015-08-19 23:43, Adam Borowski wrote:
> Package: lintian
> Version: 2.5.36.1
> Severity: wishlist
> Tags: patch
>
>
Hi,
Thanks for the suggestion and the prototype patch.
I think it is an interesting proposal and I think we could try it as an
experimental check to see if it is a feasible check.
I have some minor technical remarks to the patch interleaved to below.
It will also need a test case in the t/tests suite.
Axel (XTaran) and I are at DebConf if you need help with anything, etc. :)
> [...]
>
>
> 0001-New-experimental-tag-text-file-uses-obsolete-encodin.patch
>
>
>>From 902283f122c71c88b968abfc3c778686200c9361 Mon Sep 17 00:00:00 2001
> From: Adam Borowski <kilobyte@angband.pl>
> Date: Wed, 19 Aug 2015 23:32:39 +0200
> Subject: [PATCH] New experimental tag: text-file-uses-obsolete-encoding
>
> Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> ---
> checks/files.desc | 11 +++++++++++
> checks/files.pm | 10 +++++++++-
> lib/Lintian/Util.pm | 22 ++++++++++++++++++++++
> 3 files changed, 42 insertions(+), 1 deletion(-)
>
> [...]
> diff --git a/checks/files.pm b/checks/files.pm
> index b816ed8..0940bbf 100644
> --- a/checks/files.pm
> +++ b/checks/files.pm
> @@ -27,7 +27,7 @@ use Lintian::Data;
> use Lintian::Output qw(warning);
> use Lintian::Tags qw(tag);
> use Lintian::Util qw(drain_pipe fail is_string_utf8_encoded open_gz
> - signal_number2name strip normalize_pkg_path);
> + signal_number2name strip normalize_pkg_path file_is_non_utf8_text);
> use Lintian::SlidingWindow;
>
> use constant BLOCKSIZE => 16_384;
> @@ -1514,6 +1514,14 @@ sub run {
> if $info->index($1);
> }
>
> + # ---------------- encoding
> + if ( $fname =~ m{^(?:usr/)?s?bin/}
Is this already guarded by a "$file->is_regular_file" (or
$file->is_file)? Otherwise, this will break horribly with lintian
trying to open directories, named pipes, etc.
> + or $fname =~ m{\.(?:pm|py|pl|txt)$}
Minor nit: Perhaps \.(?:p[myl]|txt(?:\.gz)?)$
* Would also test gzip-compressed text files
* p[myl] is slightly more optimial than pm|py|pl
> [...]
> index 0b8fa5a..f68ea71 100644
> --- a/lib/Lintian/Util.pm
> +++ b/lib/Lintian/Util.pm
> @@ -62,6 +62,7 @@ BEGIN {
> slurp_entire_file
> file_is_encoded_in_non_utf8
> is_string_utf8_encoded
> + file_is_non_utf8_text
> fail
> strip
> lstrip
> @@ -859,6 +860,27 @@ sub file_is_encoded_in_non_utf8 {
> return $line;
> }
>
> +=item file_is_non_utf8_text (...)
> +
> +Both binary files and text files encoded in proper UTF8 give a negative
> +answer.
> +
> +=cut
> +
> +sub file_is_non_utf8_text {
> + my ($file) = @_;
> +
> + my $fd = ($file->file_info =~ m/gzip compressed/) ? $file->open_gz : $file->open;
> + my $bad=0;
> + while (<$fd>) {
I would like to see this use a named variable here (to keep the uses of
"$_" down).
> + return close($fd), 0 if (!m/^[\t\n\f\r -\x{ff}]+$/);
The $fd might be a pipe to a subprocess (in the open_gz case). It would
need a "drain_pipe" call to avoid spurious errors caused by gzip dying
by "broken pipe".
> [...]
>
Reply to: