
Translations, locales and gconv



From the debian-devel thread, I want to explore the details of how
Debian can achieve a sane translation handling configuration that also
includes splitting out the gconv and zoneinfo data into loosely
aggregated sets.

(gconv from glibc is several MB of libraries that are not all needed at
the same time. zoneinfo data comes from tzdata and is similarly mostly
"unused" in a typical installation. Emdebian needs to reduce the
download and installed size of both of these data sets by about 70%.)

This is based on the tdeb proposal:
http://wiki.debian.org/i18n/TranslationDebs

but is focused on embedded usage where file space issues are critical -
e.g. Emdebian would have problems with the current proposal because
localised manpages should not be installed in Emdebian (which has no
manpages).

Emdebian currently splits every translation into a dedicated package,
one per source package per locale. If a source package has more than one
.mo file in a LC_MESSAGES/ directory, all of them go into the same package:

$ dpkg -c /opt/emdebian/trunk/a/apt/trunk/apt-locale-de_0.7.8em1_all.deb 
drwxr-xr-x root/root         0 2007-10-31 17:45 ./
drwxr-xr-x root/root         0 2007-10-31 17:45 ./usr/
drwxr-xr-x root/root         0 2007-10-31 17:45 ./usr/share/
drwxr-xr-x root/root         0 2007-10-31 17:45 ./usr/share/locale/
drwxr-xr-x root/root         0 2007-10-31 17:45 ./usr/share/locale/de/
drwxr-xr-x root/root         0 2007-10-31 17:45 ./usr/share/locale/de/LC_MESSAGES/
-rwxr-xr-x root/root     31950 2007-10-31 17:45 ./usr/share/locale/de/LC_MESSAGES/apt.mo
-rwxr-xr-x root/root      6022 2007-10-31 17:45 ./usr/share/locale/de/LC_MESSAGES/libapt-inst1.1.mo
-rwxr-xr-x root/root     24556 2007-10-31 17:45 ./usr/share/locale/de/LC_MESSAGES/libapt-pkg4.6.mo

This is done with a tool called 'emlocale' which roughly equates to the
(unwritten) dpkg-gentdebsrc from the tdeb proposal. 
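
The naming pattern in the listing above can be sketched in C. This is
only an illustration, not how emlocale itself works: the function name
locale_package_name() is hypothetical, and lowercasing the locale and
replacing '_' with '-' is my assumption, made because Debian package
names cannot contain uppercase letters or underscores.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of emlocale): build the name
 * "<package>-locale-<locale>" from the path of a .mo file.
 * Returns 0 on success, -1 if the path is not a message catalogue
 * under usr/share/locale or if the buffer is too small. */
int locale_package_name(const char *package, const char *mo_path,
                        char *out, size_t outlen)
{
    static const char prefix[] = "/usr/share/locale/";
    static const char suffix[] = "/LC_MESSAGES/";
    const char *start, *end;
    size_t i;
    int n;

    start = strstr(mo_path, prefix);
    if (start == NULL)
        return -1;
    start += sizeof(prefix) - 1;        /* first character of the locale */

    end = strstr(start, suffix);
    if (end == NULL || end == start)
        return -1;
    /* The locale must be a single path component. */
    if (memchr(start, '/', (size_t)(end - start)) != NULL)
        return -1;

    n = snprintf(out, outlen, "%s-locale-%.*s",
                 package, (int)(end - start), start);
    if (n < 0 || (size_t)n >= outlen)
        return -1;

    /* Assumption: normalise the locale part so the result is a valid
     * package name (lowercase, '-' instead of '_' or '@'). */
    for (i = strlen(package) + strlen("-locale-"); out[i] != '\0'; i++) {
        unsigned char c = (unsigned char)out[i];
        out[i] = isalnum(c) ? (char)tolower(c) : '-';
    }
    return 0;
}
```

Called with "apt" and "./usr/share/locale/de/LC_MESSAGES/apt.mo" this
produces "apt-locale-de", matching the package shown above. All three
.mo files in that directory map to the same name, which is why they end
up in one package.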

Emdebian also needs a way of splitting the gconv files out of glibc so
that only the necessary gconv files are packaged and installed -
depending on the configuration specified in emdebian-tools and
depending on user setup.

Similarly with the zoneinfo files from tzdata.

Together, the translations, the gconv support and the tzdata support
need to form a set of packages that can be omitted from certain builds,
added in their entirety for other builds and offered in various
combinations for users who need them. The scalable way of doing this is
to use a secondary archive structure that is not part of the main dpkg or
apt cache data. The archive would need to be partitioned so that each
device would simply add a source for the support that the user needs,
e.g. one source per continent containing packages for all gconv and
zoneinfo support and translations.

Adding support for additional locales would mean adding new source
lists - this is needed to allow embedded devices to have a small cache
of this secondary data. (Otherwise there is little or no advantage over
simply including all these new packages in the main apt cache because
apt will simply collate them into one big list anyway.)

The tricky part is deciding which translation goes where because
Emdebian does need to limit the size of this secondary apt/dpkg cache
yet there is no absolute mapping between geography and languages
spoken. A possible method is to stick to the geographical split: if users
in North America want languages from Europe, that source can simply be
added. At least that way, that particular user does not have cache data
or locale packages for Oceania, Asia or Africa which cuts the size of
the cache data by 75%. Note that, unlike Raphael's suggestion on the
tdeb page, Emdebian DOES need one package for one translation. Wasted
space is *not* an option when that space is wasted again and again for
each package installed. Collating the gconv and tzdata is acceptable
because it is only installed once. Collating the translations is not.

Emdebian is primarily concerned with file sizes - package sizes and
cache data sizes - because storage space is very expensive for
Emdebian, unlike Debian itself.

The user would be asked which timezone to use (as now) as well as which
language to use (as now). At a later date, the user could choose to add
new timezone and new language support. In Emdebian, the option would
also exist to have no timezone, no locale and no translation support
(e.g. for devices that do not produce user output).

In effect, dpkg-reconfigure locales would simply involve installing the
necessary support prior to configuring it.

There is already support for separate repositories for package
description translations. What Emdebian needs is a development of the
tdebs proposal that includes splitting out the gconv and tzdata files
alongside the translations themselves so that selecting a locale and
timezone installs the necessary packages prior to configuration, rather
than forcing all users to have all data on all systems, whether
configured or not.

Depending on how this is done, the gconv data and the tzdata zoneinfo
data might not need to be in the translation repository itself, just
not installed by default.

I'm looking for ideas and help setting this up. I have the time and
inclination to get this sorted out and it is long overdue. If Emdebian
is to get off the ground, this is just one of those issues that *must*
be solved.

So apologies for the really long mail, but here's a summary of how
Emdebian needs to handle tdebs:

1. Users must be able to download and install pt without pt_BR
2. Users must be able to fall back to pt if pt_BR is the preference but
	absent for a specific package.
3. No translation files are installed without explicit user intervention 
	or device configuration.
4. gconv and zoneinfo data split into continental groups
5. Nothing except the .mo file(s) in tdebs for Emdebian (and I'd rather
	not have to rebuild *all* your tdebs to do that). 
6. Whatever processes need to be run on the *user device* to achieve
	all this can only use C or C++. Perl is *not* part of Emdebian.
	There is no python support or any other interpreted language support.
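
As a minimal sketch of point 2 within the constraint of point 6, the
fallback order can be computed in plain C. The function name
locale_candidates() and the TDEB_LOCALE_MAX limit are hypothetical; the
sketch only produces the list of names to try, most specific first, and
leaves it to the caller to check which tdeb or .mo file actually exists
for a given package.

```c
#include <stddef.h>
#include <string.h>

#define TDEB_LOCALE_MAX 32   /* hypothetical limit on a locale name */

/* Fill candidates[] with the locale names to try, most specific first:
 * "pt_BR.UTF-8" gives "pt_BR" then "pt"; "pt" gives just "pt".
 * Returns the number of entries written (0 for an unusable name). */
size_t locale_candidates(const char *locale,
                         char candidates[][TDEB_LOCALE_MAX], size_t max)
{
    size_t count = 0;
    size_t full = strcspn(locale, ".@");    /* length of "ll_CC" or "ll" */
    size_t lang = strcspn(locale, "_.@");   /* length of "ll" */

    if (lang == 0 || full >= TDEB_LOCALE_MAX)
        return 0;

    if (count < max) {
        memcpy(candidates[count], locale, full);
        candidates[count][full] = '\0';
        count++;
    }
    /* Add the bare language only when a territory was present. */
    if (lang < full && count < max) {
        memcpy(candidates[count], locale, lang);
        candidates[count][lang] = '\0';
        count++;
    }
    return count;
}
```

The caller walks the list and takes the first candidate for which a
translation is available, so a user who prefers pt_BR gets pt for any
package that ships no pt_BR translation, and a user who asks for pt
never pulls in pt_BR (point 1).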

Note that simply filtering out the localised manpages etc. is not ideal
because the larger tdeb still has to be downloaded and unpacked and
there simply might not be room to do that. My target device is likely
to have less than 7 MB free at any time.

-- 


Neil Williams
=============
http://www.data-freedom.org/
http://www.nosoftwarepatents.com/
http://www.linux.codehelp.co.uk/
