
Re: Questions regarding utf-8

I have a neural net program ( http://www.nongnu.org/libann/doc/libann_6.html#SEC26 ) which does something similar:

Given a text file, it will attempt to guess the natural language in which it was written.
I'm sure it would be fairly simple to modify it to guess the charset. If you point me
to a reasonably large set of example files, I'll see what I can do. It would never
be 100% accurate, but it would probably make a good guess.

Bob Hilliard wrote:

    1.  How can I determine what character encoding is used in a
        document without manually scanning the entire file?
> You can't do that automatically, in general. If you know what text
> you expect, and you know the bytes you have in the file, you can
> try a number of encodings, and see which of the encodings gives the
> characters you expect.
> As a manual procedure, this is best done with the help of
> /usr/share/i18n/charmaps. This lists the Unicode character position,
> the encoding-specific byte [sequence], and the character name.

> So if you know you have \xe7, and you know it is c-cedilla, it could
> be iso-8859-1. It could also be iso-8859-{2,3,9,14,15,16}, cp125{0,2,4,6}, DEC-MCS, SAMI-WS2, etc.
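
The trial-decoding procedure described above can be sketched in Python. This is only an illustration, not something from the original thread: the function name matching_encodings and the CANDIDATES list are made up for the example, and the list is limited to codecs that Python's standard library ships (DEC-MCS and SAMI-WS2 are not among them, so they are left out).

```python
# Hypothetical candidate list; any codec names known to Python can be used.
CANDIDATES = [
    "iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7", "iso-8859-9",
    "iso-8859-15", "cp1250", "cp1252", "cp1254", "cp1256", "koi8-r", "utf-8",
]


def matching_encodings(raw, expected, candidates=CANDIDATES):
    """Return the candidate encodings under which `raw` decodes to `expected`."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this encoding, or the codec is unknown.
            continue
        if decoded == expected:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # We have the byte \xe7 and expect it to be c-cedilla (U+00E7).
    for name in matching_encodings(b"\xe7", "\u00e7"):
        print(name)
```

As the quoted reply says, several encodings usually survive such a test, so a single known character narrows the field without settling it. Checking a longer expected string rules out more candidates.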
