
Bug#292330: project: UTF-8 as default



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Steve Langasek <vorlon@debian.org> writes:

> On Wed, Jan 26, 2005 at 02:53:52PM +0000, Roger Leigh wrote:
>
>> > By this, I'm not talking about enforcing this character code on the
>> > whole Debian system, but see to that: 1) Installing systems with
>> > UTF-8 is easier, also with locales not strictly in need of
>> > this. UTF-8 as default is not necessarily my ultimate goal (as the
>> > title suggests), but having the option of using UTF-8 (or other
>> > encodings) system-wide, no matter what languages are chosen.
>
>> I think the locales package is the place to start this.  For etch, I
>> would like the UTF-8 locales to be the default for all languages (with
>> language-specific encodings being offered as alternatives).
>
> Then please begin coordinating with the respective language teams involved
> with the debian installer, to ensure that we have a usable UTF-8 based
> console environment for all languages.  (Or hand us a d-i based graphical
> installer sprung fully-formed from your forehead, whichever you find
> easier.)

I wasn't trying to cause offence with my comments.  I fully appreciate
this isn't a trivial task.

For the last few weeks, I've been working on just that.  I'm slowly
writing a full framebuffer-based terminal emulator which will support
all the bi-di string specifications of ECMA-48, with full separation
between data and presentation components.  It will use FreeType (or
maybe even Pango) for the font rendering, and so should provide the
same level of text rendering support (and quality) you get under X,
though I plan for it to be a bit faster than the X terminals through
more intelligent glyph caching.

http://www.whinlatter.ukfsn.org/gtk/uterm-0.1.0.tar.bz2
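
To illustrate the glyph caching idea mentioned above, here is a minimal
sketch of a least-recently-used cache keyed by (code point, pixel size).
It is not taken from uterm: the GlyphCache class and the fake_render
stand-in are hypothetical, and a real implementation would call into
FreeType rather than return a string.

```python
from collections import OrderedDict


class GlyphCache:
    """LRU cache of rendered glyphs keyed by (code point, pixel size)."""

    def __init__(self, render, capacity=256):
        self._render = render          # hypothetical rasteriser callback
        self._capacity = capacity
        self._glyphs = OrderedDict()
        self.misses = 0                # number of times we had to render

    def get(self, codepoint, size):
        key = (codepoint, size)
        if key in self._glyphs:
            self._glyphs.move_to_end(key)      # mark as most recently used
            return self._glyphs[key]
        self.misses += 1
        bitmap = self._render(codepoint, size)
        self._glyphs[key] = bitmap
        if len(self._glyphs) > self._capacity:
            self._glyphs.popitem(last=False)   # evict least recently used
        return bitmap


def fake_render(codepoint, size):
    # Stand-in for a real rasteriser; returns a label instead of a bitmap.
    return "bitmap(U+%04X@%d)" % (codepoint, size)
```

Repeated requests for the same glyph at the same size are served from
the cache, so the rasteriser only runs on a miss.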

There's not much to see yet.  I've written some of the basic classes,
plus most of the ECMA-35 and -43 support.  Over the last week or so
I've become a little side-tracked writing a code table editor, for
charset/element/area mapping/designation/invocation, but I hope to
have something usable within a few months.  Once the basic table
parser (input handling) and terminal classes are done, we can start on
the framebuffer driver.
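
For readers unfamiliar with the designation/invocation terminology, the
following is a rough sketch of the ECMA-35 (ISO 2022) mechanism for
94-character sets only. It is an illustration, not uterm code: the
decode function, the set labels and the assumed initial state (ASCII in
G0 to G3, G0 invoked into GL) are all choices made for this example.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Final bytes of two registered 94-character sets (labels are illustrative).
SETS_94 = {ord("B"): "ASCII", ord("A"): "ISO646-GB"}

# Intermediate byte selecting which G-set a 94-set is designated to.
DESIGNATE_94 = {ord("("): 0, ord(")"): 1, ord("*"): 2, ord("+"): 3}


def decode(data):
    """Return (charset_name, byte) pairs for each graphic byte in data."""
    g = ["ASCII"] * 4      # assumed initial designations of G0..G3
    gl = 0                 # index of the G-set currently invoked into GL
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if (b == ESC and i + 2 < len(data)
                and data[i + 1] in DESIGNATE_94
                and data[i + 2] in SETS_94):
            # Designation: ESC I F assigns a character set to a G-set.
            g[DESIGNATE_94[data[i + 1]]] = SETS_94[data[i + 2]]
            i += 3
            continue
        if b == SO:        # LS1: invoke G1 into GL
            gl = 1
        elif b == SI:      # LS0: invoke G0 into GL
            gl = 0
        elif 0x21 <= b <= 0x7E:
            out.append((g[gl], b))
        # Anything else (including escape sequences this sketch does not
        # recognise) is skipped one byte at a time, which is not correct
        # in general.
        i += 1
    return out
```

With this, the byte 0x23 decodes as "#" while G0 (ASCII) is invoked,
and as the pound sign after ESC ) A designates the British set to G1
and SO invokes it.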

(If anyone out there can provide any examples, either code or simple
explanation, of how the ECMA-48 data component and presentation
components normally interact, that would be of great benefit.  This is
required for bidirectional nested string handling, but it's not clear
what the implications are for line wrapping and the mappings between
the two components.  I'm also looking to get hold of several ISO
standards documents, but they are rather expensive.  If anyone can
help me get hold of any copies of these standards, that would also be
of immense help.)
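
To make the question concrete, here is one naive reading of nested
reversed strings, using the ECMA-48 SRS control function (CSI Ps [ with
Ps=1 to open and Ps=0 to close, written here in its 7-bit ESC [ form).
The data component is kept in logical order and the presentation order
is derived by flipping direction at each nesting level. Whether this is
what the standard intends is exactly the open point; parse and present
are hypothetical names, and line wrapping is not addressed at all.

```python
import re

# SRS (START REVERSED STRING): CSI Ps [ ; Ps=1 opens, Ps=0 or omitted closes.
SRS = re.compile(r"\x1b\[([01]?)\[")


def parse(data):
    """Split the data stream into nested lists of characters."""
    root = []
    stack = [root]
    pos = 0
    for m in SRS.finditer(data):
        stack[-1].extend(data[pos:m.start()])  # plain characters, logical order
        pos = m.end()
        if m.group(1) == "1":
            child = []                         # open a reversed string
            stack[-1].append(child)
            stack.append(child)
        elif len(stack) > 1:
            stack.pop()                        # close the innermost string
    stack[-1].extend(data[pos:])
    return root


def present(items, reverse=False):
    """Derive presentation order, flipping direction at each nesting level."""
    seq = reversed(items) if reverse else items
    return "".join(
        present(x, not reverse) if isinstance(x, list) else x for x in seq
    )
```

Under this model, the data "ab", SRS 1, "cd", SRS 1, "ef", SRS 0, "gh",
SRS 0, "ij" would be presented as "abhgefdcij": the outer reversed
string is flipped, and the inner one is flipped back.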

Once I've got the basics written, I'll be making the arch repo
available.  If anyone's interested, feel free to get in touch.

> There's more to providing a working UTF-8 capable second-stage
> installer than just setting "UTF-8" in the locale name, and this is
> a major issue that makes UTF-8 a non-viable default for sarge.

I'm not suggesting this should be done for sarge, which is why I said
I'd like it for etch.  I'll be honest: I hadn't actually considered
the implications for the installer; I was rather more interested in
the working system after installation.

>> > 2) See to that all Debian packages handles UTF-8 properly.
>
>> This is a policy issue.  Not all packages need to handle it, so this
>> should be a recommendation rather than a requirement.  For example,
>> there are specialised packages that only work with certain specific
>> encodings, and these should probably not be a priority to change.
>> Certainly, all general-purpose packages should be UCS-aware, though.
>
> I hope you're just conflating UCS-2 with UTF-8 here.  UCS-2 is a crap
> charset, which there's no reason at all for most Unix programs to support.

No.  UCS == Universal Character Set, a.k.a. ISO/IEC 10646.  I wasn't
referring to any specific encoding thereof, hence the lack of any
qualification.

A package's UCS support might involve using wide characters and
streams, particularly for more sophisticated processing and layout.
In this case it's more than just "UTF-8", even if that's what is used
for input and output.
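
A small sketch of the distinction, in Python for brevity: UTF-8 is used
at the input and output boundary, but the processing step works on
decoded code points (loosely analogous to wide characters in C). The
two function names are made up for this example.

```python
def truncate_bytes(raw, limit):
    """Cut a UTF-8 byte string at a byte offset (may split a character)."""
    return raw[:limit]


def truncate_chars(raw, limit):
    """Decode, cut at a code point boundary, then re-encode as UTF-8."""
    text = raw.decode("utf-8")         # bytes -> sequence of code points
    return text[:limit].encode("utf-8")


raw = "na\u00efve caf\u00e9".encode("utf-8")   # 10 characters, 12 bytes
```

Cutting raw to three bytes splits the two-byte sequence for U+00EF and
leaves invalid UTF-8, whereas truncate_chars(raw, 3) yields the first
three characters intact. Note that even this ignores combining
sequences, which is part of why layout needs more than byte handling.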


Regards,
Roger

- -- 
Roger Leigh
                Printing on GNU/Linux?  http://gimp-print.sourceforge.net/
                Debian GNU/Linux        http://www.debian.org/
                GPG Public Key: 0x25BFB848.  Please sign and encrypt your mail.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>

iD8DBQFB+AM2VcFcaSW/uEgRAiHjAKCe7XdTeTLyC/FCIoBFDnZ/DCEJqgCdHYOc
BUgTP63kDQ/K7lKUJkSbDls=
=1w/v
-----END PGP SIGNATURE-----


