Re: Reading a file with unknown encoding

To: Francesco Poli <invernomuto@paranoici.org>
Cc: Debian-Ruby <debian-ruby@lists.debian.org>
Subject: Re: Reading a file with unknown encoding
From: Antonio Terceiro <terceiro@debian.org>
Date: Sat, 8 Sep 2018 11:11:08 -0300
Message-id: <20180908141108.GB16084@debian.org>
Mail-followup-to: Francesco Poli <invernomuto@paranoici.org>, Debian-Ruby <debian-ruby@lists.debian.org>
In-reply-to: <20180906001257.1ca794cdb6e9e27027d02f0e@paranoici.org>
References: <20180906001257.1ca794cdb6e9e27027d02f0e@paranoici.org>

On Thu, Sep 06, 2018 at 12:12:57AM +0200, Francesco Poli wrote:
> Proposed strategy
> =================
> 
> I've been thinking about a way to prevent apt-listbugs from
> barfing in those unusual cases.
> 
> Since the non US-ASCII characters, if present at all, will be in
> the comment lines (assuming the format of the file is valid!),
> it does not really matter much whether apt-listbugs is able to
> correctly represent those non US-ASCII characters.
> The comment lines will be skipped, as soon as detected as such.
>  
> Hence I thought I could do the following:
> 
>   $ cat read_ignore_bugs_encode.rb 
>   #!/usr/bin/ruby
>   
>   p ["Default external encoding:", Encoding.default_external]
>   puts "========="
>   
>   noncomments = []
>   
>   open("ignore_bugs").each { |line|
>     enc = line.encode(Encoding.default_external, undef: :replace, invalid: :replace)
>     p [line.encoding, line, enc.encoding, enc]
>     if /^\s*#/ =~ enc
>       next
>     end
>     if /^\s*(\S+)/ =~ enc
>       noncomments << $1
>     end
>   }
>   
>   puts "========="
>   noncomments.each { |elem|
>     p [elem.encoding, elem]
>   }
> 
> 
> This seems to work normally, when run in the same locale where the
> "ignore_bugs" file was created:
> 
>   $ ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:UTF-8>]
>   =========
>   [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
>   [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
>   [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
>   [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
>   [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
>   [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
>   =========
>   [#<Encoding:UTF-8>, "123456"]
>   [#<Encoding:UTF-8>, "234567"]
>   [#<Encoding:UTF-8>, "my-package0+"]
> 
> but also when run in a more limited locale:
> 
>   $ LC_ALL=C ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:US-ASCII>]
>   =========
>   [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
>   [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
>   [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
>   [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
>   [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
>   [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
>   =========
>   [#<Encoding:US-ASCII>, "123456"]
>   [#<Encoding:US-ASCII>, "234567"]
>   [#<Encoding:US-ASCII>, "my-package0+"]
> 
> 
> What do you think?  Is the above described strategy reasonable?  Or do
> you see a flaw which will backfire in the future?

Looks OK to me, but it also looks a little too cautious and complex.
You only care about the uncommented lines, and those contain only ASCII,
so you can read the file as binary and just ignore everything else:

----------------8<----------------8<----------------8<-----------------
$ cat /tmp/ignore_bugs 
123456
# secönd bug
234567
# a package
my-package0+
$ cat /tmp/read_bugs.rb 
ARGV.each do |f|
  File.readlines(f, encoding: Encoding::BINARY).each do |line|
    puts line if line !~ /^\s*#/
  end
end
$ ruby /tmp/read_bugs.rb /tmp/ignore_bugs 
123456
234567
my-package0+
$ LANG=C ruby /tmp/read_bugs.rb /tmp/ignore_bugs 
123456
234567
my-package0+
----------------8<----------------8<----------------8<-----------------
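
If you also want to keep only the first token of each line, as your
script does with /^\s*(\S+)/, the same binary read can be extended
roughly as follows. This is only a sketch: the method name
read_ignore_entries is made up for illustration and is not part of
apt-listbugs, and silently dropping tokens that contain non-ASCII bytes
is an assumption about what you would want.

```ruby
# Hypothetical helper, not part of apt-listbugs: return the first token of
# every non-comment line, reading the file as raw bytes so that the locale's
# default external encoding never comes into play.
def read_ignore_entries(path)
  entries = []
  File.readlines(path, encoding: Encoding::BINARY).each do |line|
    next if line =~ /\A\s*#/                 # comment line, skipped undecoded
    next unless line =~ /\A\s*(\S+)/         # blank line, nothing to keep
    token = $1
    # Assumption: valid entries are plain ASCII, so anything else is dropped.
    next unless token.ascii_only?
    entries << token.force_encoding(Encoding::US_ASCII)
  end
  entries
end
```

Something like "puts read_ignore_entries('/tmp/ignore_bugs')" should then
list the same three entries under any locale, since no transcoding is
ever attempted.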

