
http://www.debian.org/sitemap.ja.html generation error etc.



Hi,

I have changed the sitemap and DWN generation scripts so that they
generate the Japanese pages correctly, as described below.  Please
rebuild the pages if needed and poke me if I made mistakes.

The cvs diff looked good, so I committed it.

/cvsroot/webwml/webwml/english/sitemap.wml,v  <--  sitemap.wml
new revision: 1.41; previous revision: 1.40
/cvsroot/webwml/webwml/english/News/weekly/dwn-to-rdf.pl,v  <--  News/weekly/dwn-to-rdf.pl
new revision: 1.11; previous revision: 1.10

=================================================

I thought the UTF-8 transition had gone well after we regenerated some
news pages.  Alas... I found an issue.

 http://www.debian.org/sitemap.ja.html

Each line ends with an "ESC ( B" sequence.

This is the ISO 2022 ( http://en.wikipedia.org/wiki/ISO/IEC_2022 ) escape
sequence that switches to ASCII (1 byte per character).

It must have made sense when this page used 7-bit ISO 2022, but it no
longer makes sense now that the page is in UTF-8.
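
To check a generated page for this, here is a minimal Python sketch.  It
assumes the page is read as raw bytes; the function names are my own
for illustration and are not part of webwml.

```python
# ESC ( B is the three-byte ISO 2022 sequence 0x1B 0x28 0x42,
# which designates ASCII as the G0 character set.
ESC_TO_ASCII = b"\x1b(B"


def lines_with_trailing_escape(data: bytes) -> list:
    """Return the 1-based numbers of lines that end with ESC ( B."""
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        # Ignore a trailing CR so CRLF files are handled too.
        if line.rstrip(b"\r").endswith(ESC_TO_ASCII):
            hits.append(number)
    return hits


def strip_trailing_escape(data: bytes) -> bytes:
    """Remove a trailing ESC ( B from every line, keeping line endings."""
    cleaned = []
    for line in data.split(b"\n"):
        body = line.rstrip(b"\r")
        ending = line[len(body):]  # the CR characters, if any
        if body.endswith(ESC_TO_ASCII):
            body = body[: -len(ESC_TO_ASCII)]
        cleaned.append(body + ending)
    return b"\n".join(cleaned)
```

Stripping the bytes from the output only hides the symptom; the real
fix is to stop the generator from emitting them.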

I do not know how to fix it.  My japanese/.wmlrc is already updated as follows:

-D CUR_LANG=Japanese
-D CUR_ISO_LANG=ja
-D CUR_LOCALE=ja_JP.UTF-8
-D CHARSET=utf-8
-D HOME~.
-D INTRO~intro
-D DEVEL~devel
-D DOC~doc
-D DISTRIB~distrib
-D MISC~misc
-D BUGS~Bugs
-D PICS~Pics
-D STYLE~style
-D VOTE~vote

This sequence is clearly added by webwml when it generates
sitemap.ja.html from each file's header.

.... aha... sitemap.wml has a funny special case.  I am removing it now.

I checked the english source with "grep -R Japanese *".

english/News/weekly/dwn-to-rdf.pl contains oddly encoded Japanese text too.

It is in EUC-JP.  It should be "セキュリティ上の更新。" in UTF-8.
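
For illustration, a small Python sketch of that conversion.  It uses
Python's standard "euc_jp" codec; the helper name is hypothetical, and
the legacy bytes are produced here by encoding the expected string, not
copied from dwn-to-rdf.pl.

```python
def eucjp_to_utf8(raw: bytes) -> bytes:
    """Re-encode EUC-JP bytes as UTF-8 (raises if raw is not valid EUC-JP)."""
    return raw.decode("euc_jp").encode("utf-8")


expected = "セキュリティ上の更新。"

# Stand-in for the legacy bytes: each of these characters takes
# two bytes in EUC-JP and three bytes in UTF-8.
legacy_bytes = expected.encode("euc_jp")
converted = eucjp_to_utf8(legacy_bytes)
```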

This is difficult to edit since it is a mixed-encoding file.  Since Vim
is too smart for this, I used mcedit, a dumb 8-bit-clean editor.
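
An alternative to hand editing would be to normalize such a file line by
line.  The sketch below is only a heuristic under my own assumptions:
every line is either valid UTF-8 (ASCII included) or EUC-JP, and no
single line mixes the two.  The function names are made up for this
example.

```python
def line_to_utf8(line: bytes) -> bytes:
    """Return the line unchanged if it is valid UTF-8, else treat it as EUC-JP."""
    try:
        line.decode("utf-8")
        return line
    except UnicodeDecodeError:
        # Assumption: anything that is not UTF-8 is EUC-JP.  A line that
        # is neither raises UnicodeDecodeError here instead of being
        # silently mangled.
        return line.decode("euc_jp").encode("utf-8")


def normalize_mixed(data: bytes) -> bytes:
    """Convert a mixed UTF-8 / EUC-JP byte string to UTF-8 only."""
    # keepends=True preserves the original line endings.
    return b"".join(line_to_utf8(line) for line in data.splitlines(keepends=True))
```

The result should still be reviewed with a diff before committing, since
a line that mixes both encodings would be rejected rather than repaired.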

Osamu

