Re: Questions regarding utf-8
On Fri, May 09, 2003 at 02:51:23AM +0200, Andreas Bombe wrote:
> On Thu, May 08, 2003 at 07:50:50PM -0400, Bob Hilliard wrote:
> > Some third-party dictionaries, such as foldoc and The Jargon File,
> > occasionally include 8-bit characters, such as 0xe7 for c-cedilla. In
> > order to fix these easily, I would like to know:
> >
> > 1. How can I determine what character encoding is used in a
> > document without manually scanning the entire file?
>
> Hm, I can't think of an easy way. Maybe someone knows of available tools
> to do that.
It's not really possible. Arbitrary 8-bit data could be anything;
without analysis based on dictionary files for every language likely to
occur in the file, I can't see how you could guess the encoding of
pre-UTF-8 text. ISO-8859-1 is probably the most likely candidate for
predominantly English texts like foldoc and the Jargon File, though.
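
One narrow check is cheap, though. Text that contains bytes >= 0x80 and
still decodes cleanly as UTF-8 is rarely a legacy 8-bit encoding, so that
case can usually be told apart. The rough Python sketch below does only
that; it is not a real encoding detector. Falling back to ISO-8859-1 is
just the guess made above, not something the code can verify, and the
function names are made up for illustration.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'iso-8859-1' as a best guess.

    'iso-8859-1' is only an assumed fallback for mostly-English text;
    the bytes could equally be some other 8-bit encoding.
    """
    try:
        data.decode("ascii")
        return "ascii"          # no 8-bit characters at all
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"          # high bytes form valid UTF-8 sequences
    except UnicodeDecodeError:
        return "iso-8859-1"     # assumption, see docstring


def high_bytes(data: bytes) -> list[tuple[int, int]]:
    """List (offset, byte value) for every byte >= 0x80, for manual review."""
    return [(offset, value) for offset, value in enumerate(data) if value >= 0x80]
```

Running high_bytes() over a file read in binary mode lists where the
8-bit characters sit (for example a lone 0xe7), which avoids scanning the
whole file by eye even when the encoding itself remains a guess.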
Cheers,
--
Colin Watson [cjwatson@flatline.org.uk]