
Re: Questions regarding utf-8



Bob Hilliard wrote:
     1.  How can I determine what character encoding is used in a
         document without manually scanning the entire file?

In general, you can't do that automatically. If you know what text you
expect, and you know what bytes are in the file, you can try a number
of encodings and see which of them gives the characters you expect. As
a manual procedure, this is best done with the help of the charmap
files in /usr/share/i18n/charmaps. Each of them lists the Unicode code
point, the encoding-specific byte sequence, and the character name.
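For illustration, here is a rough Python sketch of that procedure; the
candidate encodings and the sample text are made up for the example,
not an authoritative list:

    # Decode the raw bytes with several candidate encodings and report
    # which ones produce the text we expect to see.
    raw = b"fa\xe7ade"            # bytes as read from the file
    expected = "fa\u00e7ade"      # the text we believe it contains ("façade")
    candidates = ["utf-8", "iso-8859-1", "iso-8859-2", "cp1252"]

    for enc in candidates:
        try:
            decoded = raw.decode(enc)
        except UnicodeDecodeError:
            continue              # bytes are not even valid in this encoding
        if decoded == expected:
            print(enc, "matches the expected text")

Note that several encodings may match; the bytes alone do not single
out one of them.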

So if you know you have \xe7, and you know it is c-cedilla, it could
be iso-8859-1. It could also be iso-8859-{2,3,9,14,15,16}, cp125{0,2,4,6}, DEC-MCS, SAMI-WS2, etc.
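The same byte can stand for different characters in different
encodings. A quick, purely illustrative way to see what 0xE7 maps to
under a handful of encodings (the list is arbitrary):

    import unicodedata

    for enc in ["iso-8859-1", "iso-8859-2", "iso-8859-15",
                "cp1252", "koi8-r"]:
        ch = b"\xe7".decode(enc)   # in koi8-r this byte is a Cyrillic letter
        print(f"{enc:12} 0xE7 -> U+{ord(ch):04X} {unicodedata.name(ch)}")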

     2.  What is the best available filter to convert from encoding X
         to 7 bit ASCII?

cat(1). It can't get much better than that. If you have a file that
contains non-ASCII characters in some encoding, you can't convert it
to ASCII without loss. Your choice is to
a) lose some information, e.g. transliterate non-representable
   characters, or replace them with a replacement character ('?'), or
b) break the target encoding, i.e. emit bytes that are not valid in
   the target encoding (ASCII).
Neither option is good, so I wouldn't claim that any particular filter
is best. cat(1) implements option b); both options are sketched below.
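Tools like GNU iconv with a //TRANSLIT target take route a); cat takes
route b). A rough Python sketch of both, with a made-up sample string:

    import unicodedata

    text = "fa\u00e7ade"                              # "façade"

    # Option a): lose information - replace the character with '?',
    # or decompose it and drop the combining cedilla (a crude form
    # of transliteration).
    print(text.encode("ascii", "replace"))            # b'fa?ade'
    print(unicodedata.normalize("NFKD", text)
              .encode("ascii", "ignore"))             # b'facade'

    # Option b): break the target encoding - pass the original byte
    # through even though 0xE7 is not valid ASCII (in effect what
    # cat(1) does, since it never touches the bytes at all).
    print(text.encode("iso-8859-1"))                  # b'fa\xe7ade'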

     3.  What is the difference between utf-8 and en_US.utf8?

The former is an encoding, the latter a locale. It is like
apples and oranges: both are fruit.
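To make the distinction concrete, a small Python sketch; the locale
must actually be installed on the system, and its name may be spelled
en_US.utf8 or en_US.UTF-8 depending on the platform:

    import locale

    # UTF-8 is an encoding: a rule for turning characters into bytes.
    print("fa\u00e7ade".encode("utf-8"))        # b'fa\xc3\xa7ade'

    # en_US.utf8 is a locale: language + territory + encoding.  Besides
    # the character encoding it selects collation order, date and
    # number formats, message language, and so on.
    locale.setlocale(locale.LC_ALL, "en_US.UTF-8")  # locale.Error if missing
    print(locale.getlocale())                       # e.g. ('en_US', 'UTF-8')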

Regards,
Martin


