
Bug#344304: qa.debian.org: no charset specified when browsing news



On Sun, Feb 19, 2006 at 11:05:49PM -0500, Jeff Breidenbach wrote:
> > The new mhonarc config should do charset conversion if possible,
> > or just output the text as-is in case the charset of the mail is utf8
> > or unknown.
> 
> It's not that simple. Leaving a '<' character can cause security
> issues. Anyway, the relevant portion of the mhonarc manual is the
> <CHARSETCONVERTERS> resource. Take a look at mhonarc::htmlize
> versus MHonArc::CharEnt::str2sgml and possibly discuss this on the
> upstream mailing list.
> 
> http://www.mhonarc.org/MHonArc/doc/resources/charsetconverters.html
> 
> However, I personally recommend that mhonarc be set to convert
> everything to UTF-8, no exceptions. That simplifies a lot of things,
> including the use of mixed languages in a single message. Mixed
> language index pages. Easier linguistic analysis and data mining
> of the HTML.  Etc.  Bending over backwards for incorrectly labelled
> character sets on inbound email seems more trouble than it is worth.

Thanks. I did look into the manual you quote above, but it didn't work
for me. Now I suddenly got it to work: apparently, if you specify a
default but don't specify 'override', the default you supply is simply
ignored... *sigh* In addition, the whole mhonarc resource file is ignored
without any error message if you specify
<lang>en_GB.UTF-8</lang> *doublesigh*. It is not helpful at all that the
documentation about charsetconverters doesn't even mention the existence
of the 'override' parameter or what it means; I only happened to
stumble upon it in an example. </rant>

The only problem I have now is that I want to convince mhonarc that if
no charset is specified, it should assume UTF-8. Within the Debian context,
with our utf8 changelogs etc., that's a saner decision than assuming
latin1, which is what MHonArc::UTF8::to_utf8 apparently does, totally
ignoring the current locale. The RFCs say one should assume us-ascii,
which is simply undefined for non-7bit data, so it isn't wrong to assume
utf8 then. If anyone has a tip, please let me know. Otherwise, it might
be easier to use a different workaround: adding a Content-Type header
that declares UTF-8 when there's no Content-Type header at all in a mail.
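
For illustration, a pre-filter along these lines could run before the mail
is handed to mhonarc. This is only a rough sketch in Python, not something
taken from the PTS code: the function name ensure_content_type is made up,
and it assumes each mail arrives as raw bytes, optionally preceded by an
mbox-style "From " line. It leaves mails that already carry a Content-Type
header untouched.

```python
from email import message_from_bytes


def ensure_content_type(raw: bytes, charset: str = "utf-8") -> bytes:
    """Return the mail with a Content-Type header added if it had none.

    Hypothetical helper: the name and the text/plain default are
    assumptions made for this sketch.
    """
    msg = message_from_bytes(raw)
    if "Content-Type" in msg:
        # The mail already declares a type (and possibly a charset).
        return raw

    header = f"Content-Type: text/plain; charset={charset}\n".encode("ascii")

    if raw.startswith(b"From "):
        # Keep an mbox "From " separator as the very first line.
        first, sep, rest = raw.partition(b"\n")
        return first + sep + header + rest

    # Header order does not matter, so prepending is enough.
    return header + raw
```

The body bytes are never re-encoded; only one header line is added, so a
mail that was already valid UTF-8 stays byte-for-byte identical below the
headers.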

> Incidentally, I was probably put on CC: because I'm the mhonarc package
> maintainer. But I should also mention that one of my other hats is
> helping run mail-archive.com, which provides secondary archival service
> for all Debian mailing lists, with permission of the (former) DPL. The service
> is also available for any other Debian team or group currently wrestling
> with mhonarc configuration. So that is a possible fallback if needed.

The context here is the web part of the PTS, which actually isn't a
mailing list. But thanks for the offer, and your help!

--Jeroen

-- 
Jeroen van Wolffelaar
Jeroen@wolffelaar.nl (also for Jabber & MSN; ICQ: 33944357)
http://Jeroen.A-Eskwadraat.nl


