On Mon, Aug 10, 2009 at 02:06:44PM +0200, Giacomo A. Catenazzi wrote:
> Thomas Koch wrote:
> >I've an issue, that I forgot to set the character encoding of
> >tomcat to utf-8 after reinstalling a server.
> >Now, before I report a wishlist(?) bug to tomcat, I want to ask
> >(and invite to discuss) shouldn't utf8 be the default character
> >set everywhere? So when installing a package from Debian I can
> >assume that where a character encoding can be set, it't set to
> >utf8.
> >MySQL would be another example, which to my knowledge uses isoXYZ
> >as default character encoding.
>
> Future debian systems will have a UTF-8 charset as default.
> Look at debian-policy archives.

For system users, yes, assuming you are talking about the C.UTF-8
proposal.  For normal users, UTF-8 has been the default since Lenny.

If having a C.UTF-8 locale always available for system services is
required for them to fully support UTF-8, then that needs adding to
glibc.  For a locale that is available once /usr is mounted, a simple
localedef invocation is all that's needed; for one that is available
at all times after init starts, the tables need compiling into glibc,
as is done for the standard C locale.

I've been looking at how to do the latter, but I'm not an expert on
the "3-level" locale tables and other glibc internals, so if anyone
who knows the details of glibc locales could provide me with
assistance/guidance here, that would be much appreciated.  For
reference, this is bug #522776.

This would be great to have as a release goal for Squeeze, and
(speculatively) a native C UTF-8 locale for Squeeze+1 to give us a
default pure UTF-8 system from end to end.

> A lot of debian files will be encoded in utf-8 (control, changelog
> and manpages), and transformed in the needed charset runtime.

I think "will" here implies it's something to be done in the future,
but it's a requirement right now, and all but a few exceptions are
already converted.

> But for databases there are different issues. I think the best solution
> is to do it as mediawiki: the UTF-8 data in put as binary blob: it is
> difficult to have database engines and system libraries syncronized, and
> it is also difficult to implement support for all Unicode characters.

PostgreSQL seems to manage it without problems.  Putting text in as a
binary blob obviates most of the reasons for having it in a database
in the first place.  Sorting, indexing and querying all require the
database to be able to read it!

Note that there are separate client and server (database) encodings
for text as well.  You may well get recoding between what the user
sees and what's actually stored in the database, potentially at
several points.  Having UTF-8 on the server does not require it on the
client (and vice versa).

> But let to concentrate to the first task: having a good UTF-8 support
> in all programs/terminals/etc.

I think that part was already done quite some time ago.  Any program
that doesn't support UTF-8 is an exception, and should be fixed or
removed.

For the specific case of databases, what's being proposed here is
making the default UTF-8.  Existing databases should not be affected,
since they would retain their current encoding.  New databases should,
however, use UTF-8.  If a specific application needs a specific
encoding in order to function correctly, then it's that application's
responsibility to specify it when creating the database, i.e. to
override the default.  If it doesn't do that already, it's already
broken, since the encoding it gets is currently unspecified.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
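
P.S. To make the binary-blob and client/server encoding points above
concrete, here is a rough sketch in Python.  It is an illustration
only: the sample names are made up, and text_key() is a crude stand-in
for a real collation, not what PostgreSQL or any other database
actually does internally.

```python
import unicodedata

# Hypothetical sample data (not from the thread), written with escapes
# so the example does not depend on the encoding of this message.
names = ["Zo\u00eb", "\u00c9mile", "Zoe", "emil"]

# "Server" side: the text is stored encoded as UTF-8.
stored = [n.encode("utf-8") for n in names]

# Treated as opaque blobs, the only ordering available is by raw bytes,
# so every non-ASCII letter sorts after all ASCII letters.
blob_order = [b.decode("utf-8") for b in sorted(stored)]


def text_key(s):
    """Crude accent- and case-insensitive sort key (illustrative assumption)."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


# Treated as text, the data can be ordered the way a reader expects.
text_order = sorted(names, key=text_key)

# Blob length counts bytes, text length counts characters.
byte_len, char_len = len(stored[0]), len(names[0])

# Client/server recoding: UTF-8 on the server, Latin-1 on the client.
latin1_for_client = stored[0].decode("utf-8").encode("latin-1")

print("blob order:", blob_order)
print("text order:", text_order)
print("bytes vs characters:", byte_len, char_len)
print("recoded for a Latin-1 client:", latin1_for_client)
```

The recoding step only works because the server knows the stored bytes
are UTF-8; with an opaque blob, every client would have to know and
agree on that by itself.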