On Thu, Sep 06, 2018 at 12:12:57AM +0200, Francesco Poli wrote:
> Proposed strategy
> =================
>
> I've been thinking about a way to prevent apt-listbugs from
> barfing in those unusual cases.
>
> Since the non US-ASCII characters, if present at all, will be in
> the comment lines (assuming the format of the file is valid!),
> it does not really matter much whether apt-listbugs is able to
> correctly represent those non US-ASCII characters.
> The comment lines will be skipped, as soon as detected as such.
>
> Hence I thought I could do the following:
>
> $ cat read_ignore_bugs_encode.rb
> #!/usr/bin/ruby
>
> p ["Default external encoding:", Encoding.default_external]
> puts "========="
>
> noncomments = []
>
> open("ignore_bugs").each { |line|
>   enc = line.encode(Encoding.default_external, undef: :replace, invalid: :replace)
>   p [line.encoding, line, enc.encoding, enc]
>   if /^\s*#/ =~ enc
>     next
>   end
>   if /^\s*(\S+)/ =~ enc
>     noncomments << $1
>   end
> }
>
> puts "========="
> noncomments.each { |elem|
>   p [elem.encoding, elem]
> }
>
>
> This seems to work normally, when run in the same locale where the
> "ignore_bugs" file was created:
>
> $ ./read_ignore_bugs_encode.rb
> ["Default external encoding:", #<Encoding:UTF-8>]
> =========
> [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
> [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
> [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
> [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
> [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
> [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
> =========
> [#<Encoding:UTF-8>, "123456"]
> [#<Encoding:UTF-8>, "234567"]
> [#<Encoding:UTF-8>, "my-package0+"]
>
> but also when run in a more limited locale:
>
> $ LC_ALL=C ./read_ignore_bugs_encode.rb
> ["Default external encoding:", #<Encoding:US-ASCII>]
> =========
> [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
> [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
> [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
> [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
> [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
> [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
> =========
> [#<Encoding:US-ASCII>, "123456"]
> [#<Encoding:US-ASCII>, "234567"]
> [#<Encoding:US-ASCII>, "my-package0+"]
>
>
> What do you think? Is the above described strategy reasonable? Or do
> you see a flaw which will backfire in the future?

Looks OK to me, but it also looks a bit too cautious and more complex
than it needs to be. In this case you only care about the uncommented
lines, and those contain only ASCII, so you can read the file as binary
and just ignore everything else:
----------------8<----------------8<----------------8<-----------------
$ cat /tmp/ignore_bugs
123456
# secönd bug
234567
# a package
my-package0+
$ cat /tmp/read_bugs.rb
ARGV.each do |f|
  File.readlines(f, encoding: Encoding::BINARY).each do |line|
    puts line if line !~ /^\s*#/
  end
end
$ ruby /tmp/read_bugs.rb /tmp/ignore_bugs
123456
234567
my-package0+
$ LANG=C ruby /tmp/read_bugs.rb /tmp/ignore_bugs
123456
234567
my-package0+
----------------8<----------------8<----------------8<-----------------
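
If you also want to keep extracting the first token of each line, as
your script does, the same binary read could be combined with your
second regexp. This is only a rough sketch, not tested against
apt-listbugs itself: the helper name read_ignore_entries is made up for
illustration, and the sample file content is copied from your example.

```ruby
require "tempfile"

# Hypothetical helper (not part of apt-listbugs): return the first token
# of every non-comment line, reading the file as raw bytes so that the
# locale's default external encoding never comes into play.
def read_ignore_entries(path)
  entries = []
  File.readlines(path, encoding: Encoding::BINARY).each do |line|
    next if line =~ /^\s*#/               # comment: may contain any bytes
    entries << $1 if line =~ /^\s*(\S+)/  # first whitespace-delimited token
  end
  entries
end

# Sample data: the "ö" is written as its two UTF-8 bytes.
Tempfile.create("ignore_bugs") do |f|
  f.binmode
  f.write("# first bug\n123456\n# sec\xC3\xB6nd bug\n234567\n# a package\nmy-package0+\n")
  f.flush
  p read_ignore_entries(f.path)
end
```

The returned strings are tagged as ASCII-8BIT (binary). Since valid
entries are plain ASCII, they still compare equal to ordinary string
literals, so I would not expect this to matter for lookups.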