
Version 0.11 of po4a, and future directions


I'm pleased to introduce a new version of po4a. This is version 0.11
(yup, same version as gettext ;)


First of all, I moved back to a single-package organization. The former
organization (i.e. liblocale-po4a-perl, po-pod and po-man) was a nightmare to
maintain because most of the code was identical between po-pod and po-man.
For example, the diff between pod-gettextize and man-gettextize was one line
long, the only difference being which module gets loaded
(Locale::Po4a::Man.pm or Locale::Po4a::Pod.pm).

So now there is only one package, and only one set of binaries for all
modules. Of course, each binary takes a new argument specifying which module
you want to use.

That's much easier to maintain, but I'll run into problems when a po4a
module depends on extra libraries or programs, like po-debiandoc does with
nsgmls, for example. I guess po4a will recommend those extra packages, and try
to fail gracefully if they are not present on the system, just like debconf
does when curses isn't installed.

I still have some problems including the translated documentation of po4a
in the generated deb file, but I think it's a detail, and I won't go further
into that issue here. Anyway, this translation is yet to be done ;)


I've added a new binary to the package called po4a-identity, which is useful
for testing modules. It takes a document to translate as input, and produces
its translation without using any po file. If the module is idempotent, the
produced document should be exactly the same as the original one. This
allows me to speak now about the status of each module.

In fact, no module can be idempotent at the source level, since we rewrap
paragraphs. So, to test the idempotence of a module, one has to compare the
text output generated from the original document to the one obtained with
po4a-identity.


I ran tests on all the pod files on my machine, and Pod.pm runs
almost perfectly. Here are the known problems:

1) The wrapping is sometimes changed.
It's just weird. Sometimes, the two spaces after a ';' for example are kept
by pod2man (or groff?), which seems plainly wrong to me.

2) Text is sometimes split at the wrong position
I have another problem with /usr/lib/perl5/Tk/MainWindow.pod (and some other
pages, see below) which contains: 
  C<" #n"> 
Bad luck: in the po4a-identity version, this was split on the space by
the wrapping. As a result, in the original version, the man page contains
 " #n"
and mine contains
 "" #n""
which is logical, since C<blabla> is rewritten as "blabla".

Complete list of pages having this problem on my box (out of 564 pages; note
that it depends on the chosen wrapping column):

Besides these two minor issues, Pod.pm seems quite usable now.
Along the way, I had to fix a bug here and mask another there:

3) Handling of the string "0" is error-prone in Perl
I submitted the following patch to the Perl bug tracker:
--- Man.pm	2002-12-18 22:35:43.000000000 +0100
+++ /usr/share/perl/5.8.0/Pod/Man.pm	2002-12-18 22:36:27.000000000 +0100
@@ -759,7 +759,7 @@
         $index = $_;
         $index =~ s/^\s*[-*+o.]?(?:\s+|\Z)//;
-    $_ = '*' unless $_;
+    $_ = '*' unless length($_);
     if (@{ $$self{SHIFTS} } == @{ $$self{INDENTS} }) {
         $self->output (".RE\n");
Without this, "=item 0" was changed to "=item *" because the string "0"
evaluates to false in Perl, even though it's not the empty string...

4) pod2man lies and doesn't wrap anything; groff is smarter than the pod spec
I had a whole bunch of problems with pages containing stuff like:

(there are a lot of them in the Tk bindings documentation). The problem is
that the first line isn't indented, so the pod specification says that the
paragraph should be wrapped. But in fact pod2man doesn't wrap anything
itself, and lets groff do that for it. And in groff, the rule is that no
indented line is wrapped. So, on the previous example, groff will indent the
line "FUNCTION", and that's it.

To mask this bug, I made the Po4a::Pod.pm parser treat as verbatim any
paragraph containing at least one indented line. That way, I treat too many
paragraphs as verbatim, but it should be harmless.
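
To make the workaround concrete, here is a minimal sketch of that heuristic.
It is written in Python for illustration only and is not the actual Pod.pm
code; the function names and the sample text are made up.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    return [p for p in re.split(r"\n\s*\n", text.strip("\n")) if p.strip()]


def is_verbatim(paragraph):
    """Return True if at least one non-empty line starts with a space or tab.

    This is deliberately over-eager: a paragraph whose first line is not
    indented is still treated as verbatim as soon as any later line is.
    """
    return any(
        line[:1] in (" ", "\t")
        for line in paragraph.splitlines()
        if line.strip()
    )


# Hypothetical input: an unindented first line followed by indented lines,
# then an ordinary paragraph.
sample = "FUNCTION\n    first argument\n    second argument\n\nplain text\non two lines\n"
flags = [is_verbatim(p) for p in split_paragraphs(sample)]
```

With this rule the first paragraph is left alone (never rewrapped), while the
second one can still be wrapped safely.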


The Man.pm module isn't idempotent at the source level, because of wrapping,
and because I wanted to make the translator's life easier. So, if I see this
chunk in the original:
 | this is a stupid text, but
 | .B be carefull
 | it's not that easy to handle.
The translator will see this text in the po file (note the use of a pod sequence):
 | this is a stupid text, but B<be carefull> it's not that easy to handle.

And the produced text will contain

 | this is a stupid text, but \fBbe carefull\fR it's not that easy to handle.
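
As an illustration of this round trip, here is a toy sketch in Python. It is
not how Man.pm is implemented; it only handles the .B macro from the example
above, and both function names are invented.

```python
import re


def fold_bold_macros(lines):
    """Join roff lines into one paragraph, turning `.B text` into B<text>."""
    parts = []
    for line in lines:
        if line.startswith(".B "):
            parts.append("B<" + line[3:].strip() + ">")
        else:
            parts.append(line.strip())
    return " ".join(parts)


def pod_sequences_to_roff(text):
    """Rewrite each B<...> sequence as an inline \\fB...\\fR font escape."""
    return re.sub(
        r"B<([^<>]*)>",
        lambda m: "\\fB" + m.group(1) + "\\fR",
        text,
    )


original = [
    "this is a stupid text, but",
    ".B be carefull",
    "it's not that easy to handle.",
]
msgid = fold_bold_macros(original)       # what the translator sees
produced = pod_sequences_to_roff(msgid)  # what goes into the new page
```

The produced source differs from the original (one line with inline escapes
instead of three lines with a macro), which is why the comparison has to be
done on the rendered output.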

Raw results of tests

Before I comment on them, here are the raw results:
 # of pages         : 4323

 Ignored pages      : 1432 (33%)
 parser fails       :  850 (20% of all; 29% of unignored)

 works perfectly    : 1660 (38% of all; 57% of unignored; 81% of processed)
 change wrapping    :  239 ( 5% of all;  8% of unignored; 12% of processed)

 undetected problems:  142 ( 3% of all;  5% of unignored;  7% of processed)
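
For readers who want to check how the three percentages on each line relate,
here is a small Python sketch of the arithmetic. The counts are copied from
the table above; the variable names are mine.

```python
counts = {
    "ignored": 1432,
    "parser fails": 850,
    "works perfectly": 1660,
    "change wrapping": 239,
    "undetected problems": 142,
}
total = 4323

# "unignored" = pages po4a agreed to look at;
# "processed" = unignored pages the parser did not choke on.
unignored = total - counts["ignored"]
processed = unignored - counts["parser fails"]


def share(n, base):
    """Percentage of n relative to base, as a float."""
    return 100.0 * n / base


for name, n in counts.items():
    print(f"{name:20s} {share(n, total):5.1f}% of all")

# Pages that are ignored, work perfectly, or only change wrapping.
# The 76% quoted in the conclusion matches summing the rounded
# shares 33 + 38 + 5; the unrounded share is printed here.
translatable = (
    counts["ignored"] + counts["works perfectly"] + counts["change wrapping"]
)
print(f"translatable: {share(translatable, total):.1f}% of all")
```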

Pages are ignored when they contain a comment indicating that they
were produced from the pod format. In that case, po4a refuses to go further,
and recommends that the user translate the source file, not the generated
one. For pages generated by other means (like docbook2man), po4a will emit a
warning and process the page.

The parser fails on pages based on mdoc(7), pages using conditionals with
.if, pages defining new macros with .de, and more generally, pages that are
too clever in nroff for our simple parser (which is not a real interpreter).

To detect wrap changes, we run diff on the generated cat files, and if a
change is detected, we run a modified version of wdiff(1), which also ignores
hyphenation changes. If wdiff doesn't detect any difference, we assume that
the changes are harmless.
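
The two-stage check can be sketched as follows. This Python version is only a
stand-in for the diff + modified wdiff pipeline, not the script I actually
use, and the category names are invented.

```python
import re


def words(text):
    """Split text into words, rejoining words hyphenated at a line end.

    Limitation of this sketch: a genuine hyphen that happens to fall at
    the end of a line (as in "well-" / "known") is removed as well.
    """
    return re.sub(r"-\n\s*", "", text).split()


def classify(original, generated):
    """Compare two rendered pages and say how they differ."""
    if original == generated:
        return "identical"
    if words(original) == words(generated):
        return "wrapping only"
    return "real difference"


same = classify("a b\n", "a b\n")
rewrapped = classify(
    "the docu-\nmentation is here\n",
    "the documentation\nis here\n",
)
changed = classify("bold text\n", "italic text\n")
```

A page lands in the "change wrapping" row when the first comparison fails but
the word-level one succeeds.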

But for 3% of the pages, po4a isn't idempotent and can be considered
buggy. Most of the time, the changes are about fonts, with some characters
being bold instead of italic, or the like. But the difference may also be
more serious. Reporting any such problem is good, but if you could come up
with a fix, it would be even better. ;)

Arguable macro handling

Here is a list of macros, their definitions from groff(7) or man(7), and what
I do with them. It's not optimal, but I don't have any better idea for them:
 .de macro: Define or redefine macro until .. is encountered.
   Since we're not a real groff interpreter, we can't handle such cases. A
   possible improvement would be to read the macro name and its definition,
   compare this to well known user macros, and accept it if the definition
 .ie cond anything: If cond then anything else goto .el.
 .if cond anything: If cond then anything; otherwise do nothing.
   Same problem, but I've really no idea here.
 .so filename: Include source file.
   Not sure what we should do here. For now, I offer translators the ability
   to translate the filename. But maybe we shouldn't have to translate this
   at all, letting man search for the translated version of the

Macros used but not understood
Here is a list of such macros (partial list since the program fails on
the first unknown macro):
 ..               ."              .AT             .b              .bank
 .BE              ..br            .Bu             .BUGS           .BY
 .ce              .dbmmanage      .do             .DS             .En
 .EP              .EX             .Fi             .hw             .i
 .Id              .l              .LO             .mf             .mso
 .N               .na             .NF             .nh             .nl
 .Nm              .ns             .NXR            .OPTIONS        .PB
 .pp              .PR             .PRE            .PU             .REq
 .RH              .rn             .S<             .sh             .SI
 .splitfont       .Sx             .T              .TF             .The
 .TT              .UC             .ul             .Vb             .zZ

Any input welcome.

Specific problems about some pages


Here is the diff at the output level:
-              d3b07384d113edec49eaa6238ad5ff00  md5-test-file
+              d3b07384d113edec49eaa6238ad5ff00 md5-test-file
Here is the diff at the macro level:
-.B d3b07384d113edec49eaa6238ad5ff00\  md5-test-file
+.B d3b07384d113edec49eaa6238ad5ff00 md5-test-file

The author wants to put an extra space at the end of the macro argument, but
it fails because of the wrapping. Please turn the wrapping off, either by
indenting the paragraph or by using the .nf/.fi groff macros.

Conclusion about Man.pm

Since ignored pages are translatable with po4a::pod and since wrapping
changes are acceptable in most cases, it looks like the current version of
po4a can translate 76% of the man pages on my machine. Moreover, most of the
untranslatable pages could be fixed with the simple tricks given above.
Isn't that coooool?

As you can see, this module seems mature enough for wide use. I would still
prefer some more testing before releasing the beast.


Another new module handles the configuration help of each compilation
option of the Linux kernel. There are several projects here and there to
translate the /usr/src/linux/Configuration.help file, but no assisting tool.
And this format is quite prehistoric (chunks are separated by '\n\n\n'; each
chunk is of the form 'short desc\nvariable\nlongdesc, paragraphs separated
by one empty line'). Moreover, since the documentation for all kernel
options is stored in a single file, managing translations must be a
nightmare. This would explain why 2.4 kernels aren't translated to French
while 2.2 kernels were.
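
To show how simple the format is, here is a minimal Python sketch of a parser
for it, following only the description above. It is not the code of the po4a
module; the HelpEntry type, the function name, and the CONFIG_FOO/CONFIG_BAR
sample are all made up for the example.

```python
from typing import List, NamedTuple


class HelpEntry(NamedTuple):
    short_desc: str        # first line of the chunk
    variable: str          # second line: the option name
    paragraphs: List[str]  # long description, one string per paragraph


def parse_help(text):
    """Parse chunks separated by two empty lines into HelpEntry records."""
    entries = []
    for chunk in text.strip("\n").split("\n\n\n"):
        lines = chunk.split("\n")
        if len(lines) < 2:
            continue  # not a 'short desc / variable' chunk, skip it
        body = "\n".join(lines[2:])
        paragraphs = [p for p in body.split("\n\n") if p.strip()]
        entries.append(HelpEntry(lines[0], lines[1], paragraphs))
    return entries


# Hypothetical two-option file, not taken from a real kernel tree.
sample = (
    "Foo support\nCONFIG_FOO\n  Say Y here.\n\n  If unsure, say N.\n"
    "\n\n"
    "Bar support\nCONFIG_BAR\n  Enables bar.\n"
)
entries = parse_help(sample)
```

Each short description and each paragraph of the long description would then
become one string to translate; the variable name is left untouched.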

The module is done and seems to work. The only problem I know of is that
wrapping is turned off for now, because the file contains tables which are
not specifically indented and would get messed up if I turned wrapping on
without getting the original changed.

But that's a detail. The main problem is that we should also patch make
xconfig and the like to use the translation. We'll see how much time I'll
have to argue with the developers ;)


Once I get Man.pm running well enough (i.e., no really harmful diffs
anymore), I'll release it for general consumption, and die from the numerous
bug reports I'll certainly get... For example, it would be more than great
if I could convince Gerard Delafond, who coordinates the translation of man
pages to French, to use po4a. It should be possible: he comes from the KDE
translation team, and they are already convinced of the value of po-based
translation tools.

I now dream of a texinfo module, but this one seems harder (i.e., longer) to
do. Not as hard as Man.pm, because I can steal code from texi2html, but this
Perl script is pretty long and indigestible...

Another idea would be to embed the addendum files into po files, as regular
comments. It would be rather useful for short addenda, like the ones for
man pages, containing only the names of the translators. For bigger
addenda, it would still be possible to use separate files.


Until the end of the Xmas break, I can't synchronize my package pool. So, for
now, get the package from there:


Thanks for reading 'till the end, Mt.

Each language has its purpose, however humble.  Each language expresses the
Yin and Yang of software.  Each language has its place within the Tao.
But do not program in COBOL if you can avoid it.
          -- The Tao of programming
