
Re: how to move to UTF-8 ? (was: An encoding problem)



Simon Paillard wrote:
> On Wed, Jul 29, 2009 at 06:27:02PM +0200, Frans Pop wrote:
>> FYI, I've just converted the Dutch translation to UTF-8.
> 
> Could you please describe the steps you have performed and how ?
> 
> For what we have identified:
> - recode wml files (using recode from recode package)
> find . -type d -exec recode latin1..utf8 {} \;

I actually used (sponge is from moreutils):
$ for i in $(find -type f); do \
	iconv -f iso-8859-15 -t utf-8 $i | sponge $i; \
  done

I then checked the result with 'cvs diff -u'. That showed some pages
(incorrectly) already had utf-8 encoded chars, so I reverted those.
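The revert step could probably be avoided by skipping files that are already valid UTF-8. What follows is only a sketch of that idea, not what I actually ran: the function names is_utf8, to_utf8 and convert_tree are made up for illustration, and it assumes every non-UTF-8 file is ISO-8859-15 and that no file name contains a newline.

```shell
#!/bin/sh
# Sketch: convert only files that are not already valid UTF-8.
# is_utf8, to_utf8 and convert_tree are hypothetical helper names.

is_utf8() {
    # iconv exits non-zero when the input is not valid UTF-8
    iconv -f utf-8 -t utf-8 "$1" >/dev/null 2>&1
}

to_utf8() {
    # Assumption: the file is ISO-8859-15. Convert through a temporary
    # file and only overwrite the original if iconv succeeded.
    tmp=$(mktemp) || return 1
    if iconv -f iso-8859-15 -t utf-8 "$1" >"$tmp"; then
        cat "$tmp" >"$1"    # writing through cat keeps the file's permissions
    fi
    rm -f "$tmp"
}

convert_tree() {
    # Walk the tree, leaving CVS metadata directories alone
    find "${1:-.}" -name CVS -prune -o -type f -print |
    while IFS= read -r f; do
        is_utf8 "$f" || to_utf8 "$f"
    done
}
```

Running "convert_tree ." in the translation directory would then leave pure ASCII and already-converted pages untouched. A Latin-9 file whose bytes happen to form valid UTF-8 would be skipped as well, so checking the result with 'cvs diff -u' remains a good idea.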

It turned out that this mangled the generated $Date fields (2007-01-01
had become 2007/01/01). Fixing that was possibly not strictly necessary,
as the server would update the fields anyway on commit, but I wanted my
diffs clean, so I corrected it with:
$ for i in $(find -type f); do \
	sed -ri "s% ([0-9]{4})/([0-9]{2})/([0-9]{2}) ([0-9]{2}:)% \1-\2-\3 \4%" $i; \
  done
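To see what that expression does without touching any files, it can be tried on a single sample line first. This is a small sketch; fix_cvs_dates is a made-up name, the sample line is invented, and -E is used as the portable spelling of GNU sed's -r.

```shell
#!/bin/sh
# Sketch: apply the date rewrite to stdin instead of editing files in place.
# fix_cvs_dates is a hypothetical helper name.
fix_cvs_dates() {
    # Rewrite " YYYY/MM/DD HH:" as " YYYY-MM-DD HH:"; other text is untouched
    sed -E 's% ([0-9]{4})/([0-9]{2})/([0-9]{2}) ([0-9]{2}:)% \1-\2-\3 \4%'
}

# Invented sample line in the style of an expanded CVS keyword
printf '%s\n' '# $Date: 2007/01/01 12:34:56 $' | fix_cvs_dates
# prints: # $Date: 2007-01-01 12:34:56 $
```

Because the pattern requires the leading space and the trailing "HH:", slash-separated dates elsewhere in the page text should normally be left alone.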

> - update the .wmlrc file
> -D CUR_LOCALE=fr_FR.UTF-8
> -D CHARSET=utf-8

Correct.

> - convert charset of po files
> cd po ; for file in *po ; do msgconv -t UTF-8 -o $file $file ; done

Not strictly necessary, but I did indeed do that as well.

> - some references to ISO-8859-15 (or old coding) in webpages about
>   website.
>   * for the website, devel/website/examples.wml and
>     international/french/web.wml
>   * for translation, international/french/traduire.wml
> - *.UTF-8 locale on www-master -> OK, checked
> - redirections pages with specified charset
>   (devel/debian-installer/gtk-frontend.wml and distrib/cd.wml)

To be honest I did not check any of that, but then we don't have the first
few pages translated.

For the last, also: distrib/floppyinst.wml, distrib/netboot.wml.
I have updated those now. Thanks for the hint!


I also did a cleanup, replacing entities by encoded characters, e.g:
$ for i in $(find -type f); do sed -ri "s/&auml;/ä/g" $i; done

In the past it made sense to use entities to avoid encoding issues, but
with the switch to utf-8 that's less relevant and the regular characters
make the source more readable.
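For more than one entity, the substitutions can be collected in a single sed call. This is a sketch only: decode_entities is a made-up name and the list of entities is an illustrative selection, not the set I actually replaced. Markup-significant entities such as &amp;, &lt; and &gt; are deliberately left alone, since replacing them would change the meaning of the page.

```shell
#!/bin/sh
# Sketch: replace a few named HTML entities by their UTF-8 characters.
# decode_entities is a hypothetical helper name; the entity list is
# an example and would need extending for real use.
decode_entities() {
    sed -e 's/&auml;/ä/g' \
        -e 's/&euml;/ë/g' \
        -e 's/&iuml;/ï/g' \
        -e 's/&ouml;/ö/g' \
        -e 's/&uuml;/ü/g' \
        -e 's/&eacute;/é/g' \
        -e 's/&egrave;/è/g'
}

printf '%s\n' 'caf&eacute; &amp; co&ouml;rdinatie' | decode_entities
# prints: café &amp; coördinatie
```

This should only be run on files that have already been converted to UTF-8, otherwise the result is a mix of two encodings in one file.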

Cheers,
FJP

