
Re: [translate-pootle] Wordforge



Hi Aigars,

I think that this discussion is very important.

We need to ensure that Pootle is capable of handling the large amounts of information that Debian needs, which is probably not the problem, but it must also handle all the processes that Debian needs. The solution might lie either in file handling or in databases.

First, it is important to understand the complexity of the data that we are handling. We are not only talking about a set of source strings and their translations, associated in files. We are talking about managing process information to optimize the result of the translators' work. Each XLIFF file not only contains strings and information about them. It might also contain a glossary, translation memory information, comments from translators or reviewers, information about the results of tests run on each string, data for connection to SVN... and process information: a series of phases through which the file has already gone (translation-review-approval-update-translation-review-approval...), associating each message with a given phase. We can also have translations of the same message into other languages, as reference. XLIFF files might also include counters that give information about the state of the file, without having to recalculate it.
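
To give an idea of how much of this lives inside a single file, here is a small sketch in Python of reading the phase and state information out of an XLIFF file (the element names come from the XLIFF 1.1 specification, but the file name, layout and namespace version are assumptions for the example, not WordForge or Pootle code):

    # Sketch: extract process information from an XLIFF file.
    import xml.etree.ElementTree as ET

    NS = {"x": "urn:oasis:names:tc:xliff:document:1.1"}
    root = ET.parse("messages.xlf").getroot()   # hypothetical file name

    # Phases the file has already gone through
    # (translation - review - approval - ...).
    for phase in root.findall(".//x:phase-group/x:phase", NS):
        print(phase.get("phase-name"), phase.get("process-name"))

    # Per-message data: state, phase, reviewer notes, alternative translations.
    for unit in root.findall(".//x:trans-unit", NS):
        target = unit.find("x:target", NS)
        state = target.get("state") if target is not None else None
        phase = target.get("phase-name") if target is not None else None
        notes = [note.text for note in unit.findall("x:note", NS)]
        alts = len(unit.findall("x:alt-trans", NS))
        print(unit.get("id"), state, phase, notes, alts)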

All this information is easy to store in XML, but it would require quite a complex database.

My belief is that the process that will use the most time in Pootle is merging two files, which must happen when a file is committed to SVN, when one is uploaded to Pootle by a translator... or when a new POT/XLIFF template is uploaded to Pootle to update all the translations of a given package (much more efficient than doing all the languages one by one against CVS). If the data is in a database, then at least one file does not need to be parsed every time the process runs, and the process would probably be faster, but there are many other factors that the DB could make more complicated. Updates take place at non-critical times, but user requests for files must be answered immediately. If files have to be generated on the fly before they can be served to the user, this might take longer than the user is prepared to wait (I don't know).
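
To make the cost concrete, this is roughly what the merge has to do whichever way the data is stored (only a sketch with plain Python dictionaries and an invented function name, not the actual Pootle code); with files both sides have to be parsed first, with a database one side is already in tables:

    # Sketch: merge an existing set of translations with a new template.
    # "old" maps source string -> translation, "template" is the list of
    # source strings coming from the new POT/XLIFF template.
    def merge(old, template):
        merged = {}
        for source in template:
            # Keep the translations we already have; strings that are new
            # (or changed) start out empty and would be flagged for review.
            merged[source] = old.get(source, "")
        return merged

    old = {"Yes": "Oui", "No": "Non"}
    template = ["Yes", "No", "Cancel"]
    print(merge(old, template))   # {'Yes': 'Oui', 'No': 'Non', 'Cancel': ''}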

My personal conclusion is that this is something that we really need to look at, and I am very happy that you and other people are getting into it... but it is not something that has to be resolved now in order to start Guntaitas' project; there is too much at stake to rush a design decision that will affect the entire future of Pootle. I would very much prefer that we (on this list) analyse the issue much further and come to the right conclusion, which we will then implement, as we are as interested as you are in making sure that Pootle scales and can respond to Debian's needs, which means that it will be able to respond to the needs of any other FOSS project.

As Christian has proposed, I think that if we can get separation of the front-end and back-end now, and write the API, we will be able later (or in parallel) to store in databases all the information that we think might help create a better Pootle.
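
Just to show the kind of separation I mean (the names are invented and this is only a sketch of the idea in Python, not a proposed API), the front-end would only ever talk to something like this, and the back-end behind it could be files today and a database tomorrow:

    # Sketch: a storage interface that hides whether the data lives in
    # XLIFF files on disk or in database tables.
    from abc import ABC, abstractmethod

    class TranslationStore(ABC):
        @abstractmethod
        def get_unit(self, unit_id):
            """Return one message with its source, target, state and notes."""

        @abstractmethod
        def update_target(self, unit_id, target, phase):
            """Store a new translation and record the phase it was done in."""

        @abstractmethod
        def serialise(self):
            """Return the whole store as an XLIFF (or PO) file."""

    class XliffFileStore(TranslationStore):
        """Back-end that parses and rewrites an XLIFF file on disk."""

    class DatabaseStore(TranslationStore):
        """Back-end that reads and writes rows in a database."""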

I also think that we should immediately start an analysis of what information might be interesting to have in a database and which information should stay in the XLIFF files. It might even be interesting to have the same information in both formats (every time an XLIFF file is created or modified, the information is also stored in a database, which would work as a cache).
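
A very rough sketch of the cache idea (SQLite and the table and column names are only placeholders for the illustration): every time an XLIFF file is saved, a summary row is refreshed in the database, so that pages showing statistics never need to re-parse the file:

    # Sketch: keep per-file statistics in a small database as a cache.
    import sqlite3

    db = sqlite3.connect("pootle-cache.db")   # hypothetical cache database
    db.execute("""CREATE TABLE IF NOT EXISTS file_stats (
                      path TEXT PRIMARY KEY,
                      language TEXT,
                      total INTEGER, translated INTEGER, fuzzy INTEGER,
                      last_phase TEXT)""")

    def refresh_cache(path, language, total, translated, fuzzy, last_phase):
        db.execute("INSERT OR REPLACE INTO file_stats VALUES (?,?,?,?,?,?)",
                   (path, language, total, translated, fuzzy, last_phase))
        db.commit()

    # Called right after the XLIFF file has been written to disk:
    refresh_cache("debian/apt/fr.xlf", "fr", 1200, 950, 40, "review")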

More comments below

Aigars Mahinovs wrote:

In my opinion it would be quite problematic to implement the
distributed version of this system by distributing the backend - that
would totally bypass all the permissions and would cause all sorts of
trust issues.

It would be much more logical to have XML-RPC or something like that
and have the synchronisation processes launched by cron on a regular
basis and have the incoming data streams processed in accordance with
the local rules. For example, messages from a trusted localisation
team server could be integrated directly, but messages from Rosetta
would go via some kind of approval dependent on the localisation
team's practices.
I think that you are right, this might be a very good way of doing it.


I imagine that the number of times we need to write one string to the
file (making or updating a translation) outnumbers the number of times
we need to get the full file (download of the result) on the order of
1000:1. And I also imagine that creating a PO file from said XLIFF
will take just as much time as making it from a database (or even
more).
I think that people will tend to work offline, therefore managing files. The system is being developed for native use of XLIFF files, which makes translation editors much easier for translators to use; creating PO files would only be for people who still do not want to change, for whatever reason.
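
For the people who do want PO files, the export itself is cheap once the units are in memory; a minimal sketch (using the polib library as an assumption for the example, not what Pootle itself uses for conversion):

    # Sketch: write already-parsed units out as a PO file.
    import polib

    units = {"Yes": "Oui", "No": "Non", "Cancel": ""}   # source -> target

    po = polib.POFile()
    po.metadata = {"Content-Type": "text/plain; charset=UTF-8"}
    for source, target in units.items():
        po.append(polib.POEntry(msgid=source, msgstr=target))
    po.save("fr.po")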

The CPU may be more occupied in doing fuzzy matching of strings. I'm not
sure the fuzzy matching algorithm can use some kind of cache in a
database. (The number of fuzzy matching operations is more than
proportional to the number of strings - which IMHO better reflects the
size of the translation server than the number of simultaneous users
triggering write operations.)
The CPU is most occupied at startup, indexing and checking files.  This
would not change at all with a DB.  That needs to be backgrounded.

This would be completely eliminated by the DB, because the DB engine
would be doing those tasks using highly optimised C and assembler
code.
We need to understand the processes here a little better and understand the need. Of course, any operation inside a database would be faster, there is no doubt of that, but there are a number of other things that need to be taken into account. If the result is that a DB is faster and does not make things too complicated, then databases it should be...


Well, we need to think of the database schema in a way that pushes as
much processing as possible to the database side.

One other thing about the database backend is that you can easily move
the database to a server separate from Pootle itself, and database
software can also be easily distributed across several servers if
there is any kind of bottleneck there.
This is definitely true, and can make the Pootle/DB server pair very powerful. We really need to look at this, and try to make a plan to export as many tasks as possible to a DB server, while making sure that we do not end up with an over-complicated structure that later becomes too complicated to use or maintain.
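
As one example of the kind of task that could be pushed to the database side (again only a sketch, reusing the hypothetical file_stats table from above), per-language statistics can be computed by the database engine in a single aggregate query instead of looping over files in Python:

    # Sketch: let the database do the counting, not the Pootle process.
    import sqlite3

    db = sqlite3.connect("pootle-cache.db")
    rows = db.execute("""SELECT language,
                                COUNT(*)        AS files,
                                SUM(translated) AS translated,
                                SUM(total)      AS total
                           FROM file_stats
                          GROUP BY language""").fetchall()
    for language, files, translated, total in rows:
        print(language, files, translated, total)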

Javier


