
Re: Bug#292330: project: UTF-8 as default



On Sun, 30 Jan 2005, Roger Leigh wrote:
> Marco d'Itri <md@Linux.IT> writes:
> > rleigh@whinlatter.ukfsn.org wrote:
> >>I think the locales package is the place to start this.  For etch, I
> >>would like the UTF-8 locales to be the default for all languages (with
> > This would be stupid, pointless and would piss off a lot of people.
> 
> Please could you explain why?

Do your homework about Unicode and locales.  Hint for your searching:
Unicode CJK unification problems.

Also, I can assure you that 80% of the mail I see going through the mail
servers I administer is either latin-1 encoded or in that Windows CP1252
monstrosity (often mistagged as latin-1).  Too much of it has no charset
declaration at all, since too many people use extremely crappy software.
It is even worse for web pages.
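
To make the mistagging problem concrete, here is a minimal Python sketch.
The sample bytes and the helper name looks_like_cp1252 are made up for
illustration, and the check is only a heuristic for text that was declared
as latin-1, not something any standard defines.

```python
# Bytes 0x80-0x9F are printable characters in CP1252 but C1 control codes
# in latin-1, so a wrong label silently turns punctuation into controls.
raw = b"\x93quoted\x94"  # hypothetical body bytes from a mail tagged latin-1

as_latin1 = raw.decode("latin-1")  # never fails, but yields control characters
as_cp1252 = raw.decode("cp1252")   # yields curly quotation marks


def looks_like_cp1252(data: bytes) -> bool:
    """Heuristic (an assumption): text declared as latin-1 that uses
    bytes in 0x80-0x9F was most likely written as CP1252."""
    return any(0x80 <= b <= 0x9F for b in data)
```

The heuristic says nothing useful about data that is really UTF-8, since
UTF-8 continuation bytes fall in the same range.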

> > But since your native language is english I suppose that it may be
> > hard to you to understand the reason for this.
> 
> Please could you explain why English is different?

ASCII, and the fact that most other charsets are backwards-compatible with
ASCII for the first 128 codepoints.

Try living in an EBCDIC world for a little while, even if you only use
English.  You will understand quite fast.
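
A quick way to see the point about the first 128 codepoints is the Python
sketch below.  The list of charsets is just a sample chosen for
illustration, and cp037 stands in for "an EBCDIC world" (it is one of
several EBCDIC code pages).

```python
text = "plain English text"
ascii_bytes = text.encode("ascii")

# A sample of charsets that keep ASCII unchanged in their lower half.
ascii_compatible = ["latin-1", "iso-8859-15", "cp1252", "koi8-r", "utf-8"]

# For pure-ASCII text, every one of these produces identical bytes.
same = {name: text.encode(name) == ascii_bytes for name in ascii_compatible}

# EBCDIC (here: code page 037) assigns different byte values to the
# same letters, so even English-only text has to be converted.
ebcdic_bytes = text.encode("cp037")
ebcdic_differs = ebcdic_bytes != ascii_bytes
```

An English-only user can switch between any of the ASCII-compatible
charsets without touching a single file, which is exactly what users of
other languages cannot do.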

> When I made the transition myself, I had to recode a number of files
> to UTF-8 from the local encoding I was using previously (ISO-8859-1).
> How does this differ for other languages and encodings?

It doesn't, really.  Not in that way.  The problem is usually data exchange,
and, for CJK countries, that they often need extra language tagging which is
not available in Unicode, but which IS implied by the other charsets.  For
XML documents, this is easy (if rather troublesome) to fix.  For regular text
files, well...
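
As a sketch of what the XML fix looks like, the Python below attaches the
standard xml:lang attribute to an element.  The element name "p", the
helper name tagged, and the sample character U+76F4 (a unified Han
character commonly cited as being drawn differently in Chinese and
Japanese typography) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace that the reserved "xml:" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def tagged(text: str, lang: str) -> str:
    """Wrap text in a <p> element that carries an explicit xml:lang."""
    el = ET.Element("p")
    el.set(f"{{{XML_NS}}}lang", lang)
    el.text = text
    return ET.tostring(el, encoding="unicode")


# Same codepoint, two different language tags for the renderer to act on.
japanese = tagged("\u76f4", "ja")
chinese = tagged("\u76f4", "zh")
```

A plain text file has no place to carry that attribute, which is the gap
the legacy charsets filled implicitly.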

> Why?  It's an undeniable fact that there is a cost associated with the
> migration, but to avoid the migration will not be of long term benefit
> to users of those locales.

You are not in a position to know that yet, IMHO.  Do some research, and
then we can continue arguing if you still believe a UTF-8 default locale
for all countries is a good idea.

> emails without a specific charset which are not plain ASCII are most
> likely broken in the first place.  It's not our place to work around

They ARE broken, according to the MIME standard.  But there are too many of
them.

If I start killing anything with non-ASCII in the headers, which has also
been illegal since RFC 822, I stop about 30% of the mail flow (and no, most
of it is NOT spam).  That should give you an idea of the state of things in
Brazil.

I imagine in many other non-ASCII countries, things are just as bad if not
worse.

Heck, I keep rejecting emails with embedded NULs and more than 8192
characters per line, which has been unacceptable since RFC 822, when the
email world was young and there was no spam.
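
For the curious, the three checks mentioned above can be sketched in a few
lines of Python.  This is not the filter actually running on those servers;
the function name find_problems, the problem labels, and the CRLF handling
are assumptions, and the 8192 limit is simply the figure quoted above.

```python
MAX_LINE = 8192  # the per-line limit quoted in this message


def find_problems(raw: bytes) -> list[str]:
    """Return labels for the problems found in a raw message (CRLF endings)."""
    problems = []
    # Headers end at the first empty line; with no body, all of it is header.
    header_block, _, _body = raw.partition(b"\r\n\r\n")
    if any(b > 0x7F for b in header_block):
        problems.append("non-ascii-header")
    if b"\x00" in raw:
        problems.append("embedded-nul")
    if any(len(line) > MAX_LINE for line in raw.split(b"\r\n")):
        problems.append("long-line")
    return problems


ok = find_problems(b"Subject: hello\r\n\r\nbody\r\n")
bad = find_problems(b"Subject: Ol\xe1\r\n\r\nbody\x00\r\n")
```

Rejecting on the first label alone is what would stop the roughly 30% of
mail mentioned above, so in practice only the last two are enforced.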

> This goes against the general long-term plans for GNU/Linux i18n/l10n,
> since UTF-8 is intended to unify the locale encodings, not to
> perpetuate their mutual incompatibilities.

That does not fit the current reality of the system locale for many of us.
Maybe in a few years.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh


