Re: Work on a centralized infrastructure for i18n/l10n
> I usually split the translation process as:
>
>        string
>      extraction               translation
>    ------------->           ------------->
>                   Database                 Translator
>    <-------------           <-------------
>      displaying               translated
>         tool                    strings
Sure, everybody does.
> * Some string extraction tools exist: xgettext, poxml, po4a, xml2po,
>   some tools generating XLIFF
Fair enough. Any textual data can be converted to any kind of format.
> * The database can be a PO file, an XLIFF file, or (why not) another
>   database.
Definitely, any placeholder with a structure is good.
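To make the comparison concrete, here is the same unit in both formats (a sketch; the source reference and the id are illustrative, not taken from any real catalog). First as a PO entry:

```
#: src/open.c:42
msgid "Unable to open file"
msgstr "Impossible d'ouvrir le fichier"
```

and as an XLIFF 1.2 trans-unit:

```xml
<trans-unit id="open.c-42">
  <source xml:lang="en">Unable to open file</source>
  <target xml:lang="fr">Impossible d'ouvrir le fichier</target>
</trans-unit>
```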
> * The translation tool is a tool able to deal with the format of the
>   database and help the translator (merging old translations, displaying
>   the strings that need to be updated, etc.)
Sounds like a lecture for freshmen CAT tool users, but go ahead.
> * Displaying the translated strings is done by gettext or by generating
>   the documentation (this is usually done by the tool used for the
>   string extraction).
Oui...
> I don't exclude manual translation: the extraction is done by a human,
> who stores an original string in her brain, translates it and writes it
> in the translated document.
...
> Thus IMO, advantages of the XLIFF format should be demonstrated by
> considering XLIFF as the database. I won't consider having good
> translation tools or string extraction tools that deal with XLIFF files
> as an advantage of the XLIFF format.
Well, then, if you won't, why bother going that far trying to demonstrate that anything is just as good as anything else?
> My preferred translation tool is vi. Thus, as a translator, I prefer PO
> to XLIFF or a MySQL database.
And here lies the problem: you consider vi a translation tool. Now tell me, how many people who do translations on a daily basis would consider vi a translation tool?
I think you are mixing two things here: how to manage the back end, and what format to propose to end translators. The back end could be managed using any format, even a format developed only for Debian. There are plenty of ways to do that and I don't question the Debian way (or what is going to be the Debian way).
What I am saying (maybe for the 3rd or 4th mail) is that opening the translation process to people outside the GNU/Linux world could be the result of adopting a translation industry standard: it benefits the end translator because more tools exist, with more options, on more platforms, with less nerdness to overcome before actually starting to translate. And since that end-translator format _happens_ to double as a TM management format (and can easily be transformed to TMX, the TM "exchange" format, which is also an industry standard), why not use that format as the storage format? That would save transformations.
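As a sketch of that TMX transformation target, here is a minimal TMX 1.4 file holding one translation unit (the header attribute values and segment texts are illustrative, not from any real memory):

```xml
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1"
          segtype="sentence" o-tmf="plain" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <!-- one tu can carry any number of language variants (tuv) -->
    <tu>
      <tuv xml:lang="en"><seg>Unable to open file</seg></tuv>
      <tuv xml:lang="fr"><seg>Impossible d'ouvrir le fichier</seg></tuv>
    </tu>
  </body>
</tmx>
```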
> It seems you like to have sentences separated. This is not related to
> using XLIFF. This is a string extraction issue.
> Note however that most of the complaints I receive about po4a (about
> its string extraction features) are that there is too little context in
> the strings proposed to the translators, not that paragraphs should be
> split into sentences.
Now you are talking about another problem: the suitability of the format for the task. And I think your translators are very much aware of that: sentence translation is not appropriately handled by PO-based tools. Just like you mention in your other mail: multilingual bodies are not properly handled by PO-based tools either.
Context was an issue in the translation world before PO ever existed, and before computer people realized there was a need for localizing their strings. And the proof that they did not quite get it right is the character set issues that are only now starting to get solved, with Unicode being pretty much generally accepted.
Now it happens that quite a number of translation-based groups (using computers and not the other way round) have analysed this problem and have come up with a number of reasonable standards that are also starting to gain common acceptance: XLIFF for the translation itself, TMX for its exchange, TBX for glossaries, SRX for segmentation (not limited to "sentence" segmenting). The standards work very well together and offer a stable common ground on which back ends are created, tools are developed, translations are accomplished, etc.
And, of course, all the above processes include from the earliest premises what it took the computer people so long to figure out, because computer people are not essentially aware of localization and translation issues, mostly because they don't need to be. And that is fair enough.
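To illustrate how SRX separates the *description* of segmentation from the tools that apply it, here is a minimal SRX 2.0 rule file (the structure follows the SRX 2.0 specification; the two rules themselves are just examples):

```xml
<srx version="2.0" xmlns="http://www.lisa.org/srx20">
  <header segmentsubflows="no" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="default">
        <!-- do not break after a few common abbreviations -->
        <rule break="no">
          <beforebreak>\b(Mr|Dr|e\.g)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <!-- break after sentence-final punctuation followed by space -->
        <rule break="yes">
          <beforebreak>[.?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern=".*" languagerulename="default"/>
    </maprules>
  </body>
</srx>
```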
> It seems you prefer XLIFF for HTML translations. I don't know why, but
> it is probably not related to the XLIFF format. Maybe it is just that
> the tool you use with your XLIFF files is better than what a PO tool
> would have done (at what step: string extraction, translation?).
No, I don't prefer XLIFF for HTML transformations; I don't transform, I translate.
I understand that gettext, PO, and all the related tralala do what they are asked to do: extract text from the source, put it in a nice package for display in a string editor, and insert the finished product back.
Now my idea, but I may have gotten that wrong, is that all this was developed in the very specialized context of _localizing certain applications_, not of translating. Localization and translation are two different processes, and although they overlap they are not equivalent. PO comes from a specific localization context, and extending it forever to the complexity of current document formats and workflows seems like a mistake to me, since it was not designed for that in the first place. Cf. the two examples above (segmentation and multilingualism). The translation world, meanwhile, has produced standards that also apply to localization processes.
> One of the advantages of XLIFF could be for storing the old original
> and translated strings (this is convenient to check what changed in the
> original string; i.e. was the original string updated just to fix a
> typo, or did the meaning change?). With a PO, a versioning system can
> help to check what changed; but this doesn't always work smoothly (e.g.
> when a string moves inside a PO).
Because it is not designed for that.
> This is not directly an advantage for the translator.
??? Oh really ?
> It is just something that can help the translation tool to propose
> another feature.
Propose to whom? Someone other than the translator?
> With a PO, this could be done by providing the old and new PO to a tool
> (I would love to have such a tool).
Well, that already exists with tools that support a standard that's
relevant to translation, which is not the case for po as you say.
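The old-PO/new-PO comparison wished for above is easy to sketch. In the snippet below, PO parsing is stubbed out as `{msgid: msgstr}` dicts, and the `diff_po` helper is hypothetical; a real script could build the dicts with a PO parser such as polib (an assumption about tooling, not something from this thread):

```python
import difflib

def diff_po(old: dict, new: dict) -> dict:
    """Compare two {msgid: msgstr} mappings, immune to entries
    moving around inside the files."""
    old_ids, new_ids = set(old), set(new)
    added = sorted(new_ids - old_ids)
    removed = sorted(old_ids - new_ids)
    # Pair each vanished msgid with a near-identical new one, so a typo
    # fix in the original shows up as an edit, not as a brand new string.
    edited = {}
    for msgid in removed:
        close = difflib.get_close_matches(msgid, added, n=1, cutoff=0.8)
        if close:
            edited[msgid] = close[0]
    # Identical msgids whose translation changed.
    changed = sorted(m for m in old_ids & new_ids if old[m] != new[m])
    return {"edited": edited, "changed": changed}

old = {"Choose a colour:": "Choisissez une couleur :", "Quit": "Quitter"}
new = {"Choose a color:": "Choisissez une couleur :", "Quit": "Quitter"}
print(diff_po(old, new))  # the spelling fix is paired as an edit of the old msgid
```

This answers exactly the "typo fix or meaning change?" question from the quoted paragraph: near-identical msgids are reported as edits, so the translator can see how small the change really was.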
> One feature I don't like in XLIFF is having multiple languages in the
> same XLIFF file. This was proved wrong with the debconf translations
> (multiple translation updates can't be committed).
1) An XLIFF file is not required to have more than two translation unit variants.
2) If you use formats that are not designed to support multiple TUVs, chances are the result will not be satisfying.
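A sketch of the point in 1): an XLIFF 1.2 `<file>` element is bilingual by design, with one source-language and one target-language, and older revisions travel in `<alt-trans>` rather than as extra language variants (identifiers and texts are illustrative):

```xml
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="template.pot" source-language="en" target-language="fr"
        datatype="plaintext">
    <body>
      <trans-unit id="u42">
        <source>Choose a color:</source>
        <target>Choisissez une couleur :</target>
        <!-- the previous revision travels with the unit;
             it is not a third language variant -->
        <alt-trans origin="previous-version">
          <source>Choose a colour:</source>
          <target>Choisissez une couleur :</target>
        </alt-trans>
      </trans-unit>
    </body>
  </file>
</xliff>
```

This is also the mechanism behind storing "the old original and translated strings" mentioned earlier: the history lives inside the unit itself instead of in a version control system.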
> If you claim that XLIFF translation tools are better integrated with
> translation memories, it is "just" a translation tool issue. And a PO
> translation tool could also use TMX, or XLIFF memories could be
> converted to gettext compendia.
You are saying PO "could" be used for things it is not designed for. I am saying standards exist that are designed for that specific task, and tools exist that do not require the end user (the translator) to accept nerdiness.
> Then if XLIFF based tools are really much better than PO based tools,
> maybe these tools should be used (PO could be created temporarily if
> needed). The links you provided did not convince me.
The tools are not "much better", the tools are designed to support a
format that is designed for the tasks needed in the Debian i18n
framework. If you want to write something from scratch using formats
that are not designed for the job you are bound to face a number of
difficulties. You already described a number of po limitations, and
those are limitations not because of the tools, but because po is not
the proper format to accomplish the task.
Now you may always argue that since po has been around for a while,
it is always better to adapt the existing framework, and I'd
understand this conservative approach, which is very Debianese, I
really have nothing against that. But in the end, if it is about
creating from scratch a framework that deals with all sorts of
translation centered issues, it is better to use a format that has
been designed from scratch to deal with those issues.
The two links were not meant to "convince" anybody, but to give a glimpse of 1) how XLIFF can be made to fit a PO-centered system, with a transition to XLIFF in mind, and 2) how XLIFF is used in a broader perspective.
There are plenty of people who have written extensively on localization issues (see the IBM developer site, OASIS, LISA) and I can assure you that they have plenty of good reasons to use such standards and not PO.
Joyeux Noël anyway :)
Jean-Christophe Helary