
Re: html2text with utf8 support: please test



Eugene V. Lyubimkin wrote:
> Utility html2text, version 1.3.2a-6, with "utf8" patch was just
> uploaded to experimental.  The patch allows to process UTF-8 files
> when '-utf8' option supplied. Input should be in UTF-8 and output will
> be in UTF-8 too.
>
> Please test this functionality - I believe that UTF-8 support is a
> good feature, especially for processing non-English documents.

Mmm, the way it is done looks wrong to me: there is no reason why the
input and output charsets should be related at all.  For the input,
html2text should recognize the meta http-equiv tag, which should work
for a lot of pages; otherwise an input-charset option can be provided.
For the output, the current locale's charset should be used (as returned
by nl_langinfo(CODESET) after calling setlocale(LC_CTYPE, "")), which
should work in almost all cases; otherwise an output-charset option can
be provided.
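
To make the output side concrete, here is a rough sketch of what I mean,
not a patch against html2text.  The helper names locale_charset() and
convert_charset() are made up for illustration; only setlocale(),
nl_langinfo() and the iconv family are real POSIX calls.  It assumes the
input charset has already been determined (from the meta tag or an
option) and that the output buffer is large enough.

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: name of the charset of the user's locale,
 * e.g. "UTF-8" or "ISO-8859-1". */
static const char *locale_charset(void)
{
    setlocale(LC_CTYPE, "");      /* adopt LC_CTYPE from the environment */
    return nl_langinfo(CODESET);  /* charset name of that locale */
}

/* Hypothetical helper: convert in[0..in_len) from charset `from` to
 * charset `to`.  Writes a NUL-terminated result into `out` and returns
 * the number of bytes written (excluding the NUL), or (size_t)-1 on
 * error (unknown charset, invalid input, or output buffer too small). */
static size_t convert_charset(const char *from, const char *to,
                              const char *in, size_t in_len,
                              char *out, size_t out_cap)
{
    if (out_cap == 0)
        return (size_t)-1;

    iconv_t cd = iconv_open(to, from);   /* note: target comes first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;              /* iconv() takes non-const pointers */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap - 1;       /* keep room for the final NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    *dst = '\0';
    return (size_t)(dst - out);
}
```

The rendered text would then go through something like
convert_charset(input_charset, locale_charset(), ...) just before being
written out, so a Latin-1 page read in a UTF-8 locale comes out as UTF-8
with no option needed.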

Yes, that means conversions.  But without them you cannot put a
"utf-8 support" sticker on it, only "limited utf-8 support".

Samuel

