Re: Questions regarding utf-8
Bob Hilliard wrote:
1. How can I determine what character encoding is used in a
document without manually scanning the entire file?
You can't do that automatically, in general. If you know what text
you expect, and you know the bytes you have in the file, you can
try a number of encodings and see which of them gives the
characters you expect. As a manual procedure, this is best done with the
help of /usr/share/i18n/charmaps. This lists the Unicode character
position, the encoding-specific byte sequence, and the character name.
So if you know you have \xe7, and you know it is c-cedilla, it could
be iso-8859-1. It could also be iso-8859-{2,3,9,14,15,16},
cp125{0,2,4,6}, DEC-MCS, SAMI-WS2, etc.
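The manual procedure above can be sketched in Python: decode the same
bytes with several candidate encodings and see which result looks
plausible. The byte \xe7 and the candidate encodings are taken from the
example above; the candidate list is illustrative, not exhaustive.

```python
# Try decoding the same bytes under several candidate encodings.
data = b"\xe7"  # the example byte from above

candidates = ["iso-8859-1", "iso-8859-2", "cp1252", "utf-8"]
for enc in candidates:
    try:
        print(enc, "->", data.decode(enc))
    except UnicodeDecodeError:
        # A decode failure rules the encoding out for these bytes.
        print(enc, "-> cannot decode")
```

Note that several Latin encodings map \xe7 to c-cedilla, so a single
byte rarely identifies the encoding uniquely; you need enough text to
discriminate.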
2. What is the best available filter to convert from encoding X
to 7 bit ASCII?
cat(1). It can't get much better than that. If you have an encoding and
a file that has non-ASCII characters, you can't convert correctly to
ASCII. So your choice is to
a) lose some information, e.g. transliterate non-representable
characters, or replace them with a replacement character ('?')
b) break the encoding, i.e. use bytes not supported in the target
encoding (ASCII).
Neither option is good, so I wouldn't claim that either is best.
cat(1) implements option b)
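Both lossy options from a) can be sketched in Python. The word
"façade" here is just an illustrative input; the transliteration via
Unicode decomposition is one possible approach, not the only one.

```python
import unicodedata

text = "façade"  # hypothetical input containing a non-ASCII character

# Option a), variant 1: replace non-representable characters with '?'.
replaced = text.encode("ascii", errors="replace").decode("ascii")
print(replaced)  # the cedilla becomes '?'

# Option a), variant 2: transliterate by decomposing accented characters
# and dropping the combining marks (loses the cedilla, keeps the 'c').
decomposed = unicodedata.normalize("NFKD", text)
transliterated = decomposed.encode("ascii", errors="ignore").decode("ascii")
print(transliterated)
```

Option b), by contrast, is what cat(1) does: it passes the original
bytes through unchanged, leaving bytes that are not valid ASCII.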
3. What is the difference between utf-8 and en_US.utf8?
The former is an encoding, the latter a locale. It is like
apples and oranges: both are fruit.
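The distinction shows up directly in Python's codec machinery: "utf-8"
names a codec (an encoding), while "en_US.utf8" is a locale name and is
not registered as a codec at all.

```python
import codecs

# utf-8 is an encoding: the codec registry knows it.
codecs.lookup("utf-8")

# en_US.utf8 is a locale name, not an encoding, so lookup fails.
try:
    codecs.lookup("en_US.utf8")
except LookupError:
    print("en_US.utf8 is a locale name, not an encoding")
```

A locale bundles an encoding together with language and regional
conventions (collation, date formats, and so on); the encoding is only
one component of it.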
Regards,
Martin