
Bug#567781: Converting English wml files to utf-8?


* Alexander Reichle-Schmehl <tolimar@debian.org> [2010-06-02 10:56:17 CEST]:
> On 01.06.2010 17:05, Gerfried Fuchs wrote:
> >  As written in #debian-www: if people stay aware that entities still
> > need to be used in the special areas for news entries and similar,
> > where data gets incorporated into other languages (this won't change
> > before _all_ languages are converted to utf8!), then I am all for it
> > and see it as a step in the right direction.
> Which parts of the news entries would that be?

 For the News entries, it is the <define-tag pagetitle> (and similarly in
other news parts, e.g. d-i); for the DPN, the SUMMARY= part of the
wml::debian::projectnews::header; the DSAs have their <define-tag
description>; and then there are the regular .data files for various
other purposes.

 These areas are taken verbatim into translated pages and thus have to
stay 7bit clean, i.e. everything outside of the ASCII range has to use
entities. This limitation has to stay in place, and be kept in mind, as
long as we haven't settled on utf8 for everything in CVS.
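
 To illustrate the "everything outside of the ASCII range has to use
entities" rule, here is a minimal sketch in Python. It is not part of the
webwml tooling; the function name to_entities is hypothetical, and it
assumes the HTML 4 named entities from Python's html.entities table are
acceptable, falling back to numeric character references otherwise.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Return text with every non-ASCII character replaced by an entity.

    Hypothetical helper for illustration only: named HTML 4 entities are
    used where Python knows one, numeric references everywhere else.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII passes through untouched (including & < > ").
            out.append(ch)
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)


if __name__ == "__main__":
    print(to_entities("Müller"))
```

 The result is pure ASCII, so it reads the same no matter which 8bit
encoding the including page ends up declaring.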

> I only know some RSS feeds created from wml files, but I guess it
> would be possible to solve that problem by telling the RSS creation
> script that the created RSS feed is utf-8 encoded.

 That's the easy part. :P  Though, thinking about it, having 8bit
characters in the English files in parts aggregated into RSS feeds might
cause trouble: the language for which the RSS feed is generated might
use an encoding other than utf8, and would thus receive b0rked
characters in the parts that haven't been translated yet.
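
 The kind of breakage meant here can be reproduced in a few lines of
Python. This is only a sketch: the helper name misread and the choice of
iso-8859-1 as the "other" encoding are assumptions made for the example.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one encoding and decode the bytes as another.

    Hypothetical helper that mimics an untranslated English string being
    copied byte for byte into a feed that declares a different encoding.
    """
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # A raw 8bit character gets mangled ...
    print(misread("für", "utf-8", "iso-8859-1"))
    # ... while the entity form is plain ASCII and survives unchanged.
    print(misread("f&uuml;r", "utf-8", "iso-8859-1"))
```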

 I think we need to limit ourselves to entities for parts that get pulled
into RSS feeds, too. Unless, of course, the RSS feed generator code is
good enough to pick up both the encoding of the English subtree and the
encoding of the language itself, and to convert between them. But that
still depends on the assumption that every 8bit character used in the
English files is representable in the encoding of the specific
language.
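
 A minimal sketch of such an encoding-aware conversion, in Python rather
than in the actual feed generator's code: the names convert_for_feed and
is_representable are hypothetical, and it assumes both encodings are
known to the caller. Characters the target encoding cannot represent are
replaced by numeric character references via Python's standard
"xmlcharrefreplace" error handler, which is one possible way around the
representability problem.

```python
def is_representable(text: str, encoding: str) -> bool:
    """Tell whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def convert_for_feed(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Re-encode raw bytes from the source to the target encoding.

    Hypothetical helper: anything the target encoding lacks is emitted as
    a numeric character reference (&#NNN;) instead of raising an error.
    """
    text = raw.decode(source_encoding)
    return text.encode(target_encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    english = "Dvořák €".encode("utf-8")
    print(is_representable("ř", "iso-8859-1"))
    print(convert_for_feed(english, "utf-8", "iso-8859-1"))
```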

 So long,
