Re: [translate-pootle] Wordforge
Hi Aigars,
I think that this discussion is very important.
We need to ensure that Pootle is capable of handling the large amounts
of information that Debian needs, which is probably not the problem, but
it must also handle all the processes that Debian needs. The solution
might lie either in file handling or in databases.
First, it is important to understand the complexity of the data that we
are handling. We are not only talking about a set of source strings and
their translations, associated in files. We are talking about managing
process information to optimize the result of the work of the
translators. Each XLIFF file not only contains strings and information
about them. It might also contain a glossary, translation memory
information, comments from translators or reviewers, information about
the results of tests run on each string, data for connection to SVN... and
process information: a series of phases through which each file has
already gone
(translation-review-approval-update-translation-review-approval...),
associating each message with a given phase. We can also have translations
of the same message into other languages, as reference. Also, XLIFF files
might include counters that give information about the state of the
file, without having to recalculate.
All this information is easy to store in XML, but it would require quite
a complex database.
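To make the kind of metadata I mean concrete, here is a rough sketch in Python, using only the standard library, of a minimal XLIFF 1.x file carrying phase information and a reviewer note. The element and attribute names follow XLIFF 1.x conventions, but the content (file names, phases, strings) is invented for illustration; real files would be much richer.

```python
# Sketch: a minimal XLIFF 1.x document with process metadata.
# All concrete values (file names, phases, strings) are illustrative.
import xml.etree.ElementTree as ET

def build_xliff():
    xliff = ET.Element("xliff", version="1.1")
    f = ET.SubElement(xliff, "file", {"original": "hello.pot",
                                      "source-language": "en",
                                      "target-language": "lv"})
    header = ET.SubElement(f, "header")
    # Process information: the phases this file has already gone through.
    phases = ET.SubElement(header, "phase-group")
    ET.SubElement(phases, "phase", {"phase-name": "translate-1",
                                    "process-name": "translation"})
    ET.SubElement(phases, "phase", {"phase-name": "review-1",
                                    "process-name": "review"})
    body = ET.SubElement(f, "body")
    # Each message is associated with the phase it is currently in.
    tu = ET.SubElement(body, "trans-unit", {"id": "1",
                                            "phase-name": "review-1"})
    ET.SubElement(tu, "source").text = "Hello, world"
    ET.SubElement(tu, "target").text = "Sveika, pasaule"
    # A reviewer comment travels with the string itself.
    ET.SubElement(tu, "note", {"from": "reviewer"}).text = "Check capitalisation"
    return xliff

xliff = build_xliff()
print(ET.tostring(xliff, encoding="unicode"))
```

Storing the same structure relationally would already need tables for files, phases, units, and notes, which is the complexity trade-off in question.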
My belief is that the process that will use most time in Pootle is
the process of merging two files, which must happen when a file is
committed to SVN, when one is uploaded to Pootle by a translator... or
when a new POT/XLIFF file is uploaded to Pootle for updating all
translations of a given package (much more efficient than doing all the
languages one by one against CVS). If the data is in a database, then at
least one file does not need to be parsed every time the process runs,
and the process would probably be faster, but there are many other
factors that could become more complicated because of the DB. Updates
take place at non-critical times, but user demands for files must be
responded to immediately. If all the files need to be created before being
served to the user, this process might take longer than the user is
prepared to wait (I don't know).
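Stripped of all the hard parts (fuzzy matching, plural forms, comments), the merge step can be sketched in a few lines of Python. This is a toy model, not Pootle's actual merge code: translations are kept only when their source string survives unchanged in the new template.

```python
# Toy sketch of merging an updated template with an existing translation.
# Real merging is far more involved; the data here is invented.
def merge(template_sources, old_translations):
    """template_sources: source strings in the new POT/XLIFF.
    old_translations: dict mapping source string -> translation."""
    merged = {}
    for source in template_sources:
        # Keep the existing translation if the source string is unchanged,
        # otherwise leave the entry untranslated.
        merged[source] = old_translations.get(source, "")
    return merged

old = {"File": "Fails", "Edit": "Labot", "Obsolete": "Novecojis"}
new_template = ["File", "Edit", "Preferences"]
print(merge(new_template, old))
# {'File': 'Fails', 'Edit': 'Labot', 'Preferences': ''}
```

The point of the sketch is the cost model: with plain files, both sides must be parsed on every merge; with a database, one side is already parsed, but everything else around the operation gets more complex.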
My personal conclusion is that this is something that we really need to
look at, and I am very happy that you and other people are getting into
it... but it is not something that should be resolved now in order to
start Guntaitas' project; there is too much at stake to rush a design
decision that will affect the whole future of Pootle. I would very much
prefer that we -in this list- analyse the issue much further and come
out with the right conclusion, which we will then implement, as we are as
interested as you are in making sure that Pootle scales and can respond
to Debian's needs, which means that it will be able to respond to the
needs of any other FOSS project.
As Christian has proposed, I think that if we can get separation of
front-end and back-end now, and write the API, we will be able later (or
in parallel) to store in databases all the information that we think
might help create a better Pootle.
I also think that we should immediately start an analysis of what
information might be interesting to have in a database and which
information should stay in files. It might even be interesting to have
the same information in both formats (every time an XLIFF file is
created or modified, the info is stored in a database, which would work
as a cache).
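The cache idea could look, very roughly, like the following Python sketch: every time an XLIFF file is written, its units are also written into a small SQLite table, so state queries (counters, untranslated strings) never need to re-parse the file. The table and column names are invented for illustration.

```python
# Sketch of "database as cache": mirror each saved XLIFF file's units
# into SQLite so counters can be queried without re-parsing the file.
# Schema and names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE units (
    path TEXT, source TEXT, target TEXT, phase TEXT)""")

def cache_file(path, units):
    # Refresh the cached rows for this file in one transaction.
    with conn:
        conn.execute("DELETE FROM units WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO units VALUES (?, ?, ?, ?)",
            [(path, u["source"], u["target"], u["phase"]) for u in units])

cache_file("lv/hello.xlf", [
    {"source": "Hello", "target": "Sveiki", "phase": "review-1"},
    {"source": "Quit", "target": "", "phase": "translate-1"},
])

# A counter (untranslated strings) comes from the cache, not the file.
untranslated, = conn.execute(
    "SELECT COUNT(*) FROM units WHERE path = ? AND target = ''",
    ("lv/hello.xlf",)).fetchone()
print(untranslated)  # 1
```

The file stays the authoritative copy; the database row set is disposable and can always be rebuilt from the files.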
More comments below
Aigars Mahinovs wrote:
In my opinion it would be quite problematic to implement the
distributed version of this system by distributing the backend - that
would totally bypass all the permissions and would cause all sorts of
trust issues.
It would be much more logical to have XML-RPC or something like that,
have the synchronisation processes launched by cron on a regular
basis, and have the incoming data streams processed in accordance with
the local rules. For example, messages from a trusted localisation
team server could be integrated directly, but messages from Rosetta
would go via some kind of approval dependent on the localisation team's
practices.
I think that you are right, this might be a very good way of doing it.
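The per-origin trust rules Aigars describes could be sketched like this in Python. The origins and the policy table are invented for illustration, and the transport (XML-RPC or otherwise) is deliberately left out; only the dispatch idea is shown.

```python
# Sketch: incoming submissions are handled according to per-origin
# trust rules. Origin names and the policy table are hypothetical.
POLICY = {
    "l10n-team-server": "integrate",   # trusted: apply directly
    "rosetta": "queue-for-review",     # untrusted: needs team approval
}

def handle_submission(origin, source, translation, store, review_queue):
    # Unknown origins default to the cautious path.
    action = POLICY.get(origin, "queue-for-review")
    if action == "integrate":
        store[source] = translation
    else:
        review_queue.append((origin, source, translation))
    return action

store, review_queue = {}, []
print(handle_submission("l10n-team-server", "File", "Fails",
                        store, review_queue))   # integrate
print(handle_submission("rosetta", "Edit", "Labot",
                        store, review_queue))   # queue-for-review
```

Whatever the wire protocol ends up being, keeping the policy table local to each server is what preserves the permission model.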
I imagine that the number of times we need to write one string to the file
(making or updating a translation) outnumbers the number of times we
need to get the full file (downloading the result) on the order of
1000:1. And I also imagine that creating a PO file from said XLIFF
will take just as much time as making it from a database (or even
more).
I think that people will tend to work offline, therefore managing files.
The system is being developed for native use of XLIFF files, which makes
translation editors much easier to use for translators; creating PO
files would only be for people who still do not want to change, for
whatever reasons.
The CPU may be more occupied in doing fuzzy matching of strings. I'm not
sure the fuzzy matching algorithm can use some kind of cache in a
database. (The number of fuzzy matching operations is more than
proportional to the number of strings - which IMHO better reflects the
size of the translation server than the number of simultaneous users
triggering write operations.)
The CPU is most occupied at startup, indexing and checking files. This
would not change at all with a DB; that work needs to be backgrounded.
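The fuzzy-matching cost mentioned above can be sketched with the standard library's difflib (standing in for whatever matcher Pootle actually uses): each new source string is compared against every existing one, so the work grows roughly quadratically with the number of strings. The example data is invented.

```python
# Sketch: find the closest existing source string for a new one.
# difflib is a stand-in for Pootle's real matcher; data is illustrative.
import difflib

def best_fuzzy_match(new_source, old_translations, cutoff=0.6):
    """Return (old_source, translation) for the closest match, or None."""
    best, best_ratio = None, cutoff
    # One comparison per stored string: N new strings against M old
    # strings costs N*M matcher runs, hence the superlinear growth.
    for old_source, translation in old_translations.items():
        ratio = difflib.SequenceMatcher(None, new_source, old_source).ratio()
        if ratio > best_ratio:
            best, best_ratio = (old_source, translation), ratio
    return best

old = {"Open file": "Atvērt failu", "Close window": "Aizvērt logu"}
print(best_fuzzy_match("Open a file", old))
# ('Open file', 'Atvērt failu')
```

Whether a database can shortcut this is exactly the open question: similarity is computed pairwise, so a DB index on exact strings does not directly help.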
This would be completely eliminated by the DB, because the DB engine
would be doing those tasks using highly optimised C and assembler
code.
We need to understand the processes here a little better and understand
the need. Of course, any operation inside a database would be faster,
there is no doubt of that, but there are a number of other things that
need to be taken into account. If the result is that a DB is faster and
does not make things too complicated, databases it should be...
Well, we need to think of the database schema in the way to use as
much processing as possible on the database side.
One other thing about the database backend is that you can easily move
the database to a different server from Pootle itself, and the
database software can also easily be distributed across several servers if
there is any kind of bottleneck there.
This is definitely true, and can make the Pootle/DB server pair
very powerful. We really need to look at this, and try to make a plan to
offload as many tasks as possible to a DB server, while making sure that
we do not end up with an over-complicated structure that later becomes too
complicated to use or maintain.
Javier