Questions regarding UTF-8
The DICT protocol (RFC 2229) provides that databases shall be
encoded in UTF-8. Since US-ASCII is a subset of UTF-8, pure ASCII is
acceptable for the databases.
Some third-party dictionaries, such as FOLDOC and The Jargon File,
occasionally include 8-bit characters, such as 0xe7 for c-cedilla. In
order to fix these easily, I would like to know:
1. How can I determine what character encoding is used in a
document without manually scanning the entire file?
2. What is the best available filter to convert from encoding X
to 7-bit ASCII?
3. What is the difference between utf-8 and en_US.utf8?
Pointers to the appropriate documentation would be very welcome,
since I feel a need to become more knowledgeable about this subject.
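
To make the problem concrete, here is a minimal Python sketch of the kind
of check I have in mind. It is not taken from any DICT tool, and the names
find_non_ascii_lines and latin1_to_utf8 are hypothetical. It flags lines
containing bytes of 0x80 or above and reports whether each such line is
already valid UTF-8. The conversion step assumes the stray bytes are
ISO-8859-1, which fits 0xe7 for c-cedilla but cannot be proven from the
bytes alone.

```python
def find_non_ascii_lines(data: bytes):
    """Return (line_number, is_valid_utf8, raw_line) for lines with a byte >= 0x80."""
    report = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if all(byte < 0x80 for byte in line):
            continue  # pure ASCII, which is also valid UTF-8
        try:
            line.decode("utf-8")
            valid = True
        except UnicodeDecodeError:
            valid = False
        report.append((number, valid, line))
    return report


def latin1_to_utf8(line: bytes) -> bytes:
    # Assumption: the input is ISO-8859-1. Every byte sequence decodes
    # as Latin-1 without error, so this never fails even when the guess is wrong.
    return line.decode("latin-1").encode("utf-8")


# Hypothetical sample: one ASCII line, one Latin-1 line, one UTF-8 line.
sample = b"plain ASCII line\nfa\xe7ade in Latin-1\nfa\xc3\xa7ade in UTF-8\n"
for number, valid, line in find_non_ascii_lines(sample):
    print(number, "valid UTF-8" if valid else "not UTF-8", line)
```

A line reported as "not UTF-8" is a candidate for conversion; a line
reported as valid should be left alone, since converting it a second time
would corrupt it.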
Regards,
Bob
--
   _
  |_)  _  |_       Robert D. Hilliard    <hilliard@debian.org>
  |_) (_) |_)      1294 S.W. Seagull Way <bob@bobhilliard.net>
                   Palm City, FL 34990 USA    GPG Key ID: 390D6559