Compendia generation

To: debian-i18n@lists.debian.org
Subject: Compendia generation
From: Slavko <linux@slavino.sk>
Date: Tue, 14 Aug 2012 10:03:35 +0200
Message-id: <20120814100335.00370503@bonifac.skk>

Hi all,

I am starting a new thread for this, because it is a separate problem.

On Sun, 12 Aug 2012 17:30:38 +0200, Christian PERRIER <bubulle@debian.org>
wrote:

> > What is your opinion, please?  
> 
> All these ideas seem to be good ideas.

OK, I am now able to set up my machine to generate the compendia locally,
for testing purposes. I think I understand the ideas behind the script (I
hope), but I have some questions.

There are several ways to generate the compendia, and because I am looking
for the proper solution, I want to ask everyone which way is best. I see
these options:

1, Leave the compendium as it is (One, as is)
=============================================

I will summarize the pros and cons of the current state. This mixes
well-known facts with my own opinions.

Currently, the script that generates the compendium takes all available PO
files for the given language (with some conversion where needed, which I
want to discuss separately) and merges them into one big PO file with the
msgcat tool, without any further manipulation.

(1) The resulting compendium contains:
 * information about all input files,
 * the comments from all input files,
 * the merged headers from all input files,
 * the source file references from all input files.

I consider all lines that start with "#" (except flags) useless for
initializing a translation from scratch or for updating an existing
translation.
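As a rough illustration of that filtering, here is a minimal Python sketch.
It is not the sed command that was actually used; the function name and the
sample entry are hypothetical, and keeping the "#~" lines (obsolete entries)
is an assumption made so that obsolete messages survive this step.

```python
def strip_po_comments(po_text: str) -> str:
    """Drop PO comment lines, keeping flag lines ("#,").

    Assumption: "#~" lines (obsolete entries) are kept as well, so that
    obsolete messages can be handled in a separate step.
    """
    kept = []
    for line in po_text.splitlines():
        if line.startswith("#") and not line.startswith(("#,", "#~")):
            # Translator comments, extracted comments, references
            # and previous-msgid lines are dropped here.
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical sample entry, for illustration only.
sample = '''# Translator comment
#. Extracted comment
#: src/main.c:42
#, fuzzy
#| msgid "Old text"
msgid "New text"
msgstr "Nový text"
'''

print(strip_po_comments(sample), end="")
```

Only the flag line and the msgid/msgstr pair remain after the filter, which
is the part a translation tool needs for initializing or updating.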

(2) Apart from this information, the final compendium merges all the
different translations of each message. The result is a fuzzy message that
lists all files where the msgid occurs and all translation forms (really
all; identical ones appear multiple times). It also contains all
untranslated strings from all input files.

These fuzzy messages (except the really outdated ones) are good indicators
of differences between translations but, once again, they are useless for
initializing and updating translations.

A side effect of merging different translations (but IMO an important one)
is that a lot of translated messages are switched to fuzzy.
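To make this effect concrete, here is a simplified Python model. It is not
msgcat's real algorithm: each input file is reduced to a hypothetical
msgid-to-msgstr mapping, and a message counts as "conflicting" (the case
that ends up fuzzy in the merged compendium) when the inputs give more than
one distinct translation.

```python
from collections import defaultdict


def find_conflicts(catalogs):
    """Return the msgids that have more than one distinct translation."""
    seen = defaultdict(list)
    for catalog in catalogs:
        for msgid, msgstr in catalog.items():
            # Untranslated (empty) strings do not count as a translation form.
            if msgstr and msgstr not in seen[msgid]:
                seen[msgid].append(msgstr)
    return {msgid: forms for msgid, forms in seen.items() if len(forms) > 1}


# Hypothetical catalogs from two packages, for illustration only.
package_a = {"File": "Súbor", "Open": "Otvoriť"}
package_b = {"File": "Súbor", "Open": "Otvor"}

conflicts = find_conflicts([package_a, package_b])
print(conflicts)
```

In this model "File" stays a normal translated message, while "Open" has
two forms and would no longer be usable as a plain translation.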

(3) Finally, it contains a lot of obsolete messages. Unlike the previous
items, I consider that these can (but need not) be useful for translators,
so I leave them for a later discussion.

To give some sense of the amount of the information mentioned above, here
are some statistics that I took from compendium-sk-stamp20120810.po:

                    size (B)    % of orig
original          40 080 124        100
without (1)       23 055 240         57.5   (a)
without (2)       25 931 817         64.7   (b)
without (1+2)     17 999 483         44.9   (c)
without (1+2+3)   14 967 094         37.3   (d)

 * (a) was made from the original by removing comments, references and
   contexts with sed and stripping the PO header manually (I really don't
   know how to clean it with a tool).
 * (b) was generated from the original with "msgattrib --no-fuzzy".
 * (c) was generated from (a) with "msgattrib --no-fuzzy".
 * (d) was generated from (a) with "msgattrib --translated --no-fuzzy";
   the same result should come from (c) with "msgattrib --translated".
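The percentage column can be re-derived from the byte sizes in the table.
The short Python sketch below does only that arithmetic; the variable and
function names are illustrative.

```python
# Byte sizes taken from the table above.
original = 40_080_124
variants = {
    "without (1)": 23_055_240,
    "without (2)": 25_931_817,
    "without (1+2)": 17_999_483,
    "without (1+2+3)": 14_967_094,
}


def percent_of_original(size: int, base: int = original) -> float:
    """Size as a percentage of the original, rounded to one decimal place."""
    return round(100 * size / base, 1)


for label, size in variants.items():
    print(f"{label:<18}{size:>12,}  {percent_of_original(size):5.1f}")
```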

2, Generate compendium with '--use-first' option (One, use first)
=================================================================

With this option, msgcat takes all data for a message only from its first
occurrence. The resulting compendium will have comments and header
information and will preserve the translated status of each message, but
all only from the first occurrence (file), and it is terribly hard to
define which file will be first :-)

I cannot give the size of this file yet, because I have not downloaded the
full translation material, but in my opinion the result will be about 80 %
of the original size.

The resulting compendium can be useful for initializing and updating
translations, but it has one problem, caused by the "random" nature of the
term "first occurrence": some translated messages can contain an unwanted
form of the translation.
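The order dependence can be shown with a small Python model of a
"first occurrence wins" merge. This is only a sketch of the idea, not
msgcat's implementation; the catalogs are hypothetical msgid-to-msgstr
mappings with translated entries only.

```python
def merge_use_first(catalogs):
    """Merge catalogs, keeping the translation from the first file that has the msgid."""
    merged = {}
    for catalog in catalogs:
        for msgid, msgstr in catalog.items():
            # setdefault stores the value only if the msgid is not there yet.
            merged.setdefault(msgid, msgstr)
    return merged


# Hypothetical catalogs from two packages, for illustration only.
package_a = {"File": "Súbor", "Open": "Otvoriť"}
package_b = {"Open": "Otvor", "Close": "Zavrieť"}

print(merge_use_first([package_a, package_b]))
print(merge_use_first([package_b, package_a]))
```

The two calls give different translations for "Open", depending only on
which file happens to be processed first.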

3, Generate as is, but remove comments and fuzzy (One, nocomments,no-fuzzy)
===========================================================================

With this option, the compendium is generated as it is now, but the
comments and fuzzy messages are stripped after generation.

The resulting compendium will be circa 50 % of the original size (this can
depend on the language), but it will be fully usable for initializing and
updating translations, because it will contain only translated messages.
All the differing translations will be lost and will not poison the
users :-)

4, Generate as is, but remove comments, fuzzy and obsolete (One, no-fuzzy,
no-obsolete)
==========================================================================

This option is similar to the previous one, but the obsolete messages are
removed too. I do not know enough to determine how useful obsolete messages
are for initializing and updating translations, but removing them can make
the output smaller (circa another 10 %).

5, Generate two files - one from a previous option and one with fuzzy
messages (Two, some and fuzzy)
=====================================================================

This option is for fine-tuning translations: it provides two files. One is
generated by one of the previous options (to be selected later), and the
other contains only the fuzzy messages. Most of these fuzzy messages (I
hope) can be used to find differences between translations and help the
translation teams make their translations better.
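The split itself could look like the following Python sketch. In practice
msgattrib can do this with its --no-fuzzy and --only-fuzzy options; the
code below only illustrates the idea on PO text, under the assumption that
entries are separated by blank lines. The function names and the sample are
hypothetical.

```python
def is_fuzzy(entry: str) -> bool:
    """True if the entry has a flag line ("#,") that contains the fuzzy flag."""
    for line in entry.splitlines():
        if line.startswith("#,"):
            flags = [flag.strip() for flag in line[2:].split(",")]
            if "fuzzy" in flags:
                return True
    return False


def split_fuzzy(po_text: str):
    """Return (usable_entries, fuzzy_entries) as two lists of PO entries."""
    usable, fuzzy = [], []
    # Assumption: entries are separated by exactly one blank line.
    for entry in po_text.strip().split("\n\n"):
        (fuzzy if is_fuzzy(entry) else usable).append(entry)
    return usable, fuzzy


# Hypothetical sample with one translated and one fuzzy entry.
sample = '''msgid "File"
msgstr "Súbor"

#, fuzzy
msgid "Open"
msgstr "Otvor"
'''

usable, fuzzy = split_fuzzy(sample)
print(len(usable), "usable,", len(fuzzy), "fuzzy")
```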

6, Something other
==================

If you have other solutions or ideas, please share them.

=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

There is another point of view not mentioned yet: the server load (CPU,
disks, etc.) during generation, and the code changes needed. Here is how I
see it:

 * option 1 is the simplest to implement and adds no server load :-D
 * option 2 is very simple to implement and IMO adds no server load
 * options 3 and 4 are simple to implement, but add server load
 * option 5 depends on the finally selected solution, but will not be hard
   to implement, and it seems that it can add some server load

=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

Finally, I want to know the opinion of other translators (and
non-translators too), so I have prepared a simple poll. The poll choices
are the labels given in brackets after the options above.

Please give your opinion here:
http://slavino.sk/debian-kompendium. Note that it is my personal page and
therefore mostly in Slovak; I am sorry for that :-)

regards

-- 
Slavko
http://slavino.sk
