
Re: Reading a file with unknown encoding



On Thu, Sep 06, 2018 at 12:12:57AM +0200, Francesco Poli wrote:
> Proposed strategy
> =================
> 
> I've been thinking about a way to prevent apt-listbugs from
> barfing in those unusual cases.
> 
> Since the non US-ASCII characters, if present at all, will be in
> the comment lines (assuming the format of the file is valid!),
> it does not really matter much whether apt-listbugs is able to
> correctly represent those non US-ASCII characters.
> The comment lines will be skipped, as soon as detected as such.
>  
> Hence I thought I could do the following:
> 
>   $ cat read_ignore_bugs_encode.rb 
>   #!/usr/bin/ruby
>   
>   p ["Default external encoding:", Encoding.default_external]
>   puts "========="
>   
>   noncomments = []
>   
>   open("ignore_bugs").each { |line|
>     enc = line.encode(Encoding.default_external, undef: :replace, invalid: :replace)
>     p [line.encoding, line, enc.encoding, enc]
>     if /^\s*#/ =~ enc
>       next
>     end
>     if /^\s*(\S+)/ =~ enc
>       noncomments << $1
>     end
>   }
>   
>   puts "========="
>   noncomments.each { |elem|
>     p [elem.encoding, elem]
>   }
> 
> 
> This seems to work normally, when run in the same locale where the
> "ignore_bugs" file was created:
> 
>   $ ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:UTF-8>]
>   =========
>   [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
>   [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
>   [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
>   [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
>   [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
>   [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
>   =========
>   [#<Encoding:UTF-8>, "123456"]
>   [#<Encoding:UTF-8>, "234567"]
>   [#<Encoding:UTF-8>, "my-package0+"]
> 
> but also when run in a more limited locale:
> 
>   $ LC_ALL=C ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:US-ASCII>]
>   =========
>   [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
>   [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
>   [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
>   [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
>   [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
>   [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
>   =========
>   [#<Encoding:US-ASCII>, "123456"]
>   [#<Encoding:US-ASCII>, "234567"]
>   [#<Encoding:US-ASCII>, "my-package0+"]
> 
> 
> What do you think?  Is the above described strategy reasonable?  Or do
> you see a flaw which will backfire in the future?

Looks OK to me, but it also looks a bit too cautious and complex. In
this case you only care about the uncommented lines, which contain only
ASCII, so you can read the file as binary and just ignore everything else:

----------------8<----------------8<----------------8<-----------------
$ cat /tmp/ignore_bugs 
123456
# secönd bug
234567
# a package
my-package0+
$ cat /tmp/read_bugs.rb 
ARGV.each do |f|
  File.readlines(f, encoding: Encoding::BINARY).each do |line|
    puts line if line !~ /^\s*#/
  end
end
$ ruby /tmp/read_bugs.rb /tmp/ignore_bugs 
123456
234567
my-package0+
$ LANG=C ruby /tmp/read_bugs.rb /tmp/ignore_bugs 
123456
234567
my-package0+
----------------8<----------------8<----------------8<-----------------
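
If you also want to keep the "first token of each non-comment line"
behaviour from your script, the same binary read could feed it. This is
only a rough sketch: read_ignore_entries is a name I made up for
illustration, not anything that exists in apt-listbugs, and the returned
strings stay tagged as ASCII-8BIT (binary) rather than the locale
encoding.

```ruby
# Hypothetical helper (name invented for this example): returns the first
# whitespace-delimited token of every non-comment line in the file.
def read_ignore_entries(path)
  entries = []
  # Read raw bytes so that no locale-dependent decoding takes place.
  File.readlines(path, encoding: Encoding::BINARY).each do |line|
    # Comment lines are skipped regardless of which bytes they contain.
    next if line =~ /^\s*#/
    # Keep the first token; blank lines do not match and are ignored.
    entries << $1 if line =~ /^\s*(\S+)/
  end
  entries
end
```

Calling read_ignore_entries("/tmp/ignore_bugs") on the file above should
give the three entries 123456, 234567 and my-package0+ in any locale,
since the comment bytes are never interpreted.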


