
Re: DDTSS suggestions



On Tue, Jun 14, 2011 at 09:08, Martijn van Oosterhout <kleptog@gmail.com> wrote:
> What I want to know is where these come from and how they are
> inserted. The comment says "rosetta", but is it a program doing the
> talking, or are people typing directly? If I add a check to reject
> invalidly encoded input, are users going to see it? If not, it may be
> better to simply "fix" them (that is, replace broken characters by
> question marks).

I don't know where they come from, but I have a better suggestion for
how to deal with them.
iconv can convert between character encodings, and it can even
approximate characters which don't exist in the target encoding.
This is done by appending //TRANSLIT to the target encoding, e.g.
utf-8//TRANSLIT. But you have to tell iconv the input encoding, which
we don't know. Luckily, there is a detector which can guess the input
encoding:
http://chardet.feedparser.org/
This is better than just replacing characters with question marks
because it should usually get the encoding right (unless the input is
extremely short), and if it doesn't, translators have to do it again
anyway.
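
A rough sketch of what I mean, in Python, assuming the chardet package
is installed. The function name to_utf8, the Latin-1 fallback and the
"replace" error handling are my own assumptions for illustration, not
anything that exists in DDTSS. Python's codecs have no //TRANSLIT
equivalent, so this only covers the guess-and-convert part:

```python
def to_utf8(raw, detect=None):
    """Return raw bytes re-encoded as UTF-8, guessing the source encoding.

    detect is an optional callable mapping bytes to an encoding name
    (or None); it defaults to chardet's detector.
    """
    # Input that is already valid UTF-8 is passed through untouched.
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        pass

    if detect is None:
        import chardet  # third-party package, assumed to be installed

        def detect(data):
            # chardet.detect returns a dict; "encoding" may be None.
            return chardet.detect(data)["encoding"]

    # Assumption: fall back to Latin-1 when the detector has no guess.
    guessed = detect(raw) or "latin-1"

    # Bytes that are invalid in the guessed encoding become U+FFFD
    # instead of raising an exception.
    text = raw.decode(guessed, errors="replace")
    return text.encode("utf-8")
```

Calling to_utf8(description_bytes) before storing a description would
then leave valid UTF-8 alone and convert everything else on a best
guess. Very short inputs are where the guess is most likely to be
wrong.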

Regards,
~~helix84

