Bug#456360: texlive-lang: Inaccurate package names
On Sat, Dec 15, 2007 at 11:16:07AM +0100, Jordà Polo wrote:
> OK. I have a few ideas, but I'm pretty busy at the moment. I'll think about
> it and try to come up with a consistent proposal. I'm not sure I'll succeed,
> but at least I'll try. Whatever the outcome, I'll surely reply to this bug
> in 2 months.
I finally had some time to take a look at this issue. At first I thought
it would be a matter of moving languages to the right group/family. But
it is not that easy since some collections are based on packages that
include resources for more than one language.
Frank Küster mentioned sizes, so let's take a look at the numbers. This
is the list of texlive-lang-* packages, sorted by Installed-Size (in KiB):
164 texlive-lang-italian
168 texlive-lang-latin
172 texlive-lang-danish
176 texlive-lang-manju
204 texlive-lang-finnish
204 texlive-lang-spanish
216 texlive-lang-ukenglish
248 texlive-lang-dutch
304 texlive-lang-portuguese
336 texlive-lang-swedish
356 texlive-lang-norwegian
356 texlive-lang-other
504 texlive-lang-hungarian
540 texlive-lang-hebrew
812 texlive-lang-croatian
1096 texlive-lang-german
1752 texlive-lang-armenian
1820 texlive-lang-french
2240 texlive-lang-tibetan
3960 texlive-lang-czechslovak
5828 texlive-lang-mongolian
10080 texlive-lang-arab
10092 texlive-lang-polish
10164 texlive-lang-african
12212 texlive-lang-indic
12800 texlive-lang-vietnamese
14996 texlive-lang-cyrillic
16456 texlive-lang-greek
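For anyone who wants to play with these numbers, here is a minimal Python sketch (not part of any packaging tool) that parses lines in the "size name" format above and counts how many packages fall below a given size. The 512 and 1024 KiB cut-offs and the function names are only illustrative assumptions.

```python
# The "size name" lines from the list above (Installed-Size, in KiB).
LISTING = """
164 texlive-lang-italian
168 texlive-lang-latin
172 texlive-lang-danish
176 texlive-lang-manju
204 texlive-lang-finnish
204 texlive-lang-spanish
216 texlive-lang-ukenglish
248 texlive-lang-dutch
304 texlive-lang-portuguese
336 texlive-lang-swedish
356 texlive-lang-norwegian
356 texlive-lang-other
504 texlive-lang-hungarian
540 texlive-lang-hebrew
812 texlive-lang-croatian
1096 texlive-lang-german
1752 texlive-lang-armenian
1820 texlive-lang-french
2240 texlive-lang-tibetan
3960 texlive-lang-czechslovak
5828 texlive-lang-mongolian
10080 texlive-lang-arab
10092 texlive-lang-polish
10164 texlive-lang-african
12212 texlive-lang-indic
12800 texlive-lang-vietnamese
14996 texlive-lang-cyrillic
16456 texlive-lang-greek
"""


def parse_listing(text):
    """Return a list of (size_kib, package) tuples from 'size name' lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        size, name = line.split()
        rows.append((int(size), name))
    return rows


def count_below(rows, limit_kib):
    """Count packages whose size is strictly below limit_kib."""
    return sum(1 for size, _ in rows if size < limit_kib)


rows = parse_listing(LISTING)
for limit in (512, 1024):  # hypothetical cut-offs, in KiB
    print(f"{count_below(rows, limit)} of {len(rows)} packages are below {limit} KiB")
```

With the numbers above, this reports 13 of 28 packages below 512 KiB and 15 of 28 below 1024 KiB.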
And listed below is a summary with random comments about the most
interesting packages (packages not listed below are "simple" packages
that only include hyphen files or fonts for a single language). There
are also a few comments in parentheses that aren't really relevant to
the discussion, but may be of interest for the maintainers. Also, note
that in the following lines "package" actually refers to CTAN packages,
not Debian packages.
* texlive-lang-manju: Basically includes manjutex, a package that offers
support for Manju, a language with very few speakers and a writing
system derived from the Mongolian script[1]. (Btw, the documentation,
written in April 2001, says: «This package is founded on MonTeX and
will finally merge with MonTeX in order to provide all Mongolian
writings.» The MonTeX documentation dates from 2002/07/01. You can
also read the following at ctan.org[2]: «This catalogue entry
describes the ‘original’ ManjuTeX; its functionality has now been
subsumed into monTeX, though the obsolete ManjuTeX remains on the
archive.» Does it make sense to include such obsolete packages in
Debian?)
1. http://en.wikipedia.org/wiki/Manchu_language
2. http://www.ctan.org/tex-archive/help/Catalogue/entries/manjutex.html
* texlive-lang-spanish: Hyphenation files for Catalan, Spanish and
Galician. It is interesting that Catalan and Galician are the only
languages that don't have their own texmf/tpm/hyphen-*.tpm file. Both
are included in texmf/tpm/hyphen-spanish.tpm, which is probably what
led to the wrong description in the Debian package.
* texlive-lang-other: Hyphenation files for Coptic, Esperanto, Estonian,
Icelandic, Indonesian, Interlingua, Romanian, Serbian, Slovene,
Turkish, Sorbian and Welsh (title and description in
texmf/tpm/hyphen-welsh.tpm are wrong btw, Welsh != Czechoslovak).
* texlive-lang-german: Basically German resources. (It also includes
umlaute, which is obsolete according to the documentation: «This
package is obsolete! This package was superseeded by the inputenc
package which is included in any LaTeX 2ε system since December 1994.
Therefore this package is no longer supported; so please don’t use
umlaute, just use inputenc instead.» In Debian, there is also ginpenc,
which is included in texlive-latex-extra.)
* texlive-lang-french: Mostly French related packages, but it also
includes the Basque hyphenation files.
* texlive-lang-czechslovak: Based on packages that include resources for
both Czech and Slovak, so it probably makes sense as it is unless
someone wants to split them. The description and title in
hyphen-czechslovak.tpm are wrong though: that file doesn't include
«Fonts for typesetting some Czechslovak scripts» but «Czech and Slovak
hyphenation files» (the word czechslovak doesn't exist AFAIK).
* texlive-lang-mongolian: Includes hyphenation files and support for
writing Mongolian languages in various scripts, but it also supports
Manju (montex package). It also includes another package for Soyombo,
an ancient script.
* texlive-lang-arab: Includes arabi and arabtex. The former is used to
write Arabic and Farsi, while the latter is focused on Arabic but also
provides limited support for other languages written in the Arabic
alphabet: Farsi, Dari, Urdu, Pashto, Maghribi. (Would it be a good
idea to rename this collection to arabic, which is the name of the
script?)
* texlive-lang-african: On the one hand, it includes fonts for many
African scripts (see doc/fonts/fc/fc.rme:71 for a full list). On the
other hand, it includes fonts for the Ethiopic alphabet. Ethiopic fonts
(ethiop, ethiop-t1) are approximately twice as large as the other
African fonts (fc). Perhaps it would be a good idea to split this
package.
* texlive-lang-indic: This package is part of texlive-bin, not
texlive-lang.
* texlive-lang-cyrillic: Hyphenation files for Bulgarian, Russian and
Ukrainian; fonts and support for Cyrillic languages; document classes
to make documents in accordance with Russian standards, etc.
* texlive-lang-greek: Includes both classical and modern Greek. (It is
rather large, so, would it make sense to split ancient/classical Greek
from modern Greek?)
After reviewing the collections I can understand some of them, especially
the larger ones. But I still can't understand why, among the tiny
language packs, some are individual and some are not.
What makes a language worth its own collection? How is it that Italian or
Manju have a collection while Romanian or Galician don't? It isn't a
matter of number of speakers (with ~60 speakers you can hardly beat
Manju), nor is it a problem of size, since most of these language packs
are similarly small. That's, IMHO, the problem that should be addressed.
If the number of packages and their sizes are that important, then these
factors should be taken into account. One option would be to include
languages smaller than X (1MiB? 0.5? I don't really know where to draw
the line) in "family" collections. So, for example, Estonian, Finnish
and Hungarian would become part of the uralic collection.
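To make the idea a bit more concrete, here is a rough Python sketch of such a threshold rule. It is only an illustration: the sizes are a subset of the Installed-Size figures listed earlier in this mail, while the language-to-family mapping, the 1024 KiB limit and the function name are my own assumptions. Estonian is left out because it currently lives in texlive-lang-other and has no size of its own.

```python
# Installed-Size in KiB, taken from the list earlier in this mail (subset).
SIZES = {
    "finnish": 204,
    "hungarian": 504,
    "italian": 164,
    "spanish": 204,
    "portuguese": 304,
    "german": 1096,
    "french": 1820,
}

# Hypothetical language -> family mapping, for illustration only.
FAMILIES = {
    "finnish": "uralic",
    "hungarian": "uralic",
    "italian": "romance",
    "spanish": "romance",
    "portuguese": "romance",
    "french": "romance",
    "german": "germanic",
}


def plan_collections(sizes, families, limit_kib):
    """Group languages below limit_kib by family; keep the rest standalone."""
    plan = {}
    for lang, size in sorted(sizes.items()):
        if size < limit_kib and lang in families:
            key = families[lang]  # small enough: goes into its family
        else:
            key = lang  # large (or unclassified): keeps its own collection
        plan.setdefault(key, []).append(lang)
    return plan


plan = plan_collections(SIZES, FAMILIES, limit_kib=1024)
for name, langs in sorted(plan.items()):
    print(f"texlive-lang-{name}: {', '.join(langs)}")
```

With a 1024 KiB limit, Finnish and Hungarian end up together in uralic, Italian, Portuguese and Spanish in romance, and French and German keep their own collections. Moving the limit changes the outcome, which is exactly the "where to draw the line" question.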
Anyway, this is not a real proposal yet, I just wanted to share my
thoughts since I know more people are following this bug report. Note
that IANAL (I am not a _linguist_), so I probably made some mistakes.
Comments and suggestions are welcome.