Hello Debian Ruby experts,

I have a question related to encodings in Ruby.
Maybe the question is better suited to the Ruby language mailing lists,
but, since the issue arises in apt-listbugs (which is a Debian native
package) and you are all nice, helpful and knowledgeable, I thought I
could ask here...


Description of the issue
========================

apt-listbugs reads a file ("ignore_bugs") where some bug numbers and/or
package names are written, along with comments beginning with the '#'
character.

A generic file in the same format could look like:

  $ cat ignore_bugs
  # first bug
  123456
  # secönd bug
  234567
  # a package
  my-package0+

This file is usually encoded in the same encoding used by the
environment where apt-listbugs runs, so there's no special encoding
issue.

  $ file ignore_bugs
  ignore_bugs: UTF-8 Unicode text

The code that reads this file is similar to the following minimal
example script (except for the "p" debug statements, of course):

  $ cat read_ignore_bugs.rb
  #!/usr/bin/ruby

  p ["Default external encoding:", Encoding.default_external]
  puts "========="

  noncomments = []

  open("ignore_bugs").each { |line|
    p [line.encoding, line]
    if /^\s*#/ =~ line
      next
    end
    if /^\s*(\S+)/ =~ line
      noncomments << $1
    end
  }
  puts "========="

  noncomments.each { |elem|
    p [elem.encoding, elem]
  }

Running this script in a UTF-8 locale does not pose any issues:

  $ ./read_ignore_bugs.rb
  ["Default external encoding:", #<Encoding:UTF-8>]
  =========
  [#<Encoding:UTF-8>, "# first bug\n"]
  [#<Encoding:UTF-8>, "123456\n"]
  [#<Encoding:UTF-8>, "# secönd bug\n"]
  [#<Encoding:UTF-8>, "234567\n"]
  [#<Encoding:UTF-8>, "# a package\n"]
  [#<Encoding:UTF-8>, "my-package0+\n"]
  =========
  [#<Encoding:UTF-8>, "123456"]
  [#<Encoding:UTF-8>, "234567"]
  [#<Encoding:UTF-8>, "my-package0+"]

However, there may be unusual cases where the file is written in one
encoding, but then read by apt-listbugs in an environment with
different locale settings, implying a different default external
encoding.
For instance, the file may be encoded in UTF-8 (either because it was
written by hand with an editor running in a UTF-8 locale, or because it
was written by apt-listbugs, when running in a UTF-8 locale), but then
read by a subsequent run of apt-listbugs in a US-ASCII locale (maybe
because LC_ALL=C was set).

This encoding mismatch may cause an ArgumentError to be raised if the
file contains a byte sequence that is invalid in the current default
external encoding.

  $ LC_ALL=C ./read_ignore_bugs.rb
  ["Default external encoding:", #<Encoding:US-ASCII>]
  =========
  [#<Encoding:US-ASCII>, "# first bug\n"]
  [#<Encoding:US-ASCII>, "123456\n"]
  [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n"]
  Traceback (most recent call last):
          2: from ./read_ignore_bugs.rb:8:in `<main>'
          1: from ./read_ignore_bugs.rb:8:in `each'
  ./read_ignore_bugs.rb:10:in `block in <main>': invalid byte sequence in US-ASCII (ArgumentError)

The problem is that the actual encoding of the file is unknown and
unpredictable...


Proposed strategy
=================

I've been thinking about a way to prevent apt-listbugs from barfing in
those unusual cases.

Since the non US-ASCII characters, if present at all, will be in the
comment lines (assuming the format of the file is valid!), it does not
really matter much whether apt-listbugs is able to correctly represent
those non US-ASCII characters. The comment lines will be skipped as
soon as they are detected as such.

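The failure mode and the observation about ASCII-only lines can be
reproduced without switching locale. The following is only a minimal
sketch: it assumes that reading the file under LC_ALL=C amounts to
tagging the raw bytes as US-ASCII, which is emulated here with
force_encoding. The helper name try_match is made up for this
illustration and is not part of apt-listbugs.

```ruby
# Hypothetical helper (not apt-listbugs code): tag the raw bytes as
# US-ASCII, as assumed to happen under LC_ALL=C, then try the same
# comment-detecting match used in the example script.
def try_match(raw)
  # dup so that the caller's string is not modified by force_encoding
  line = raw.dup.force_encoding(Encoding::US_ASCII)
  (/^\s*#/ =~ line) ? :comment : :not_comment
rescue ArgumentError
  # Raised by the match when the line holds bytes invalid in US-ASCII
  :invalid_bytes
end

samples = ["# first bug\n", "123456\n", "# sec\xC3\xB6nd bug\n"]
samples.each do |raw|
  p [raw, try_match(raw)]
end
```

Only the line containing bytes outside the US-ASCII range should end up
in the rescue branch; the pure ASCII lines are matched as usual, whatever
encoding label they carry.
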
Hence I thought I could re-encode each line before matching it,
replacing whatever cannot be represented:

  $ cat read_ignore_bugs_encode.rb
  #!/usr/bin/ruby

  p ["Default external encoding:", Encoding.default_external]
  puts "========="

  noncomments = []

  open("ignore_bugs").each { |line|
    # Re-encode to the default external encoding, replacing invalid
    # byte sequences and characters undefined in the target encoding.
    enc = line.encode(Encoding.default_external,
                      undef: :replace, invalid: :replace)
    p [line.encoding, line, enc.encoding, enc]
    if /^\s*#/ =~ enc
      next
    end
    if /^\s*(\S+)/ =~ enc
      noncomments << $1
    end
  }
  puts "========="

  noncomments.each { |elem|
    p [elem.encoding, elem]
  }

This seems to work normally when run in the same locale where the
"ignore_bugs" file was created:

  $ ./read_ignore_bugs_encode.rb
  ["Default external encoding:", #<Encoding:UTF-8>]
  =========
  [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
  [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
  [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
  [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
  [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
  [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
  =========
  [#<Encoding:UTF-8>, "123456"]
  [#<Encoding:UTF-8>, "234567"]
  [#<Encoding:UTF-8>, "my-package0+"]

but also when run in a more limited locale:

  $ LC_ALL=C ./read_ignore_bugs_encode.rb
  ["Default external encoding:", #<Encoding:US-ASCII>]
  =========
  [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
  [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
  [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
  [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
  [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
  [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
  =========
  [#<Encoding:US-ASCII>, "123456"]
  [#<Encoding:US-ASCII>, "234567"]
  [#<Encoding:US-ASCII>, "my-package0+"]


What do you think?
Is the strategy described above reasonable?
Or do you see a flaw which will backfire in the future?

Thanks for reading so far and for any help you may provide!

P.S.: Please Cc me on replies, as I am not subscribed to the list.
Thanks for your understanding!

--
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82 3925 3E1C 27E1 1F69 BFFE