Re: [translate-pootle] Wordforge
Hi Aigars,
I think that this discussion is very important.
We need to ensure that Pootle is capable of handling the large amounts
of information that Debian needs, which is probably not the problem, but
it must also handle all the processes that Debian needs. The solution
might lie either in file handling or in databases.
First, it is important to understand the complexity of the data that we
are handling. We are not only talking about a set of source strings and
their translations, associated in files. We are talking about managing
process information to optimize the result of the work of the
translators. Each XLIFF file not only contains strings and information
about them. It might also contain a glossary, translation memory
information, comments from translators or reviewers, information about
the results of tests run on each string, data for connection to SVN... and
process information: a series of phases through which each file has
already gone
(translation-review-approval-update-translation-review-approval...),
associating each message with a given phase. We can also have translations
of the same message into other languages, as reference. Also, XLIFF files
might include counters that give information about the state of the
file, without having to recalculate.
All this information is easy to store in XML, but it would require quite
a complex database.
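To make the kind of metadata I mean concrete, here is a rough sketch in Python, using only the standard library, of a minimal XLIFF 1.x file carrying phase information and a reviewer note. The element and attribute names follow XLIFF 1.x conventions, but the content (file names, phases, strings) is invented for illustration; real files would be much richer.

```python
# Sketch: a minimal XLIFF 1.x document with process metadata.
# All concrete values (file names, phases, strings) are illustrative.
import xml.etree.ElementTree as ET

def build_xliff():
    xliff = ET.Element("xliff", version="1.1")
    f = ET.SubElement(xliff, "file", {"original": "hello.pot",
                                      "source-language": "en",
                                      "target-language": "lv"})
    header = ET.SubElement(f, "header")
    # Process information: the phases this file has already gone through.
    phases = ET.SubElement(header, "phase-group")
    ET.SubElement(phases, "phase", {"phase-name": "translate-1",
                                    "process-name": "translation"})
    ET.SubElement(phases, "phase", {"phase-name": "review-1",
                                    "process-name": "review"})
    body = ET.SubElement(f, "body")
    # Each message is associated with the phase it is currently in.
    tu = ET.SubElement(body, "trans-unit", {"id": "1",
                                            "phase-name": "review-1"})
    ET.SubElement(tu, "source").text = "Hello, world"
    ET.SubElement(tu, "target").text = "Sveika, pasaule"
    # A reviewer comment travels with the string itself.
    ET.SubElement(tu, "note", {"from": "reviewer"}).text = "Check capitalisation"
    return xliff

xliff = build_xliff()
print(ET.tostring(xliff, encoding="unicode"))
```

Storing the same structure relationally would already need tables for files, phases, units, and notes, which is the complexity trade-off in question.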
My belief is that the process that will use most time in Pootle is
the process of merging two files, which must happen when a file is
committed to SVN, when one is uploaded to Pootle by a translator... or
when a new POT/XLIFF file is uploaded to Pootle for updating all
translations of a given package (much more efficient than doing all the
languages one by one against CVS). If the data is in a database, then at
least one file does not need to be parsed every time the process runs,
and the process would probably be faster, but there are many other
factors that could become more complicated because of the DB. Updates
take place at non-critical times, but user demands for files must be
responded to immediately. If all the files need to be created before being
served to the user, this process might take longer than the user is
prepared to wait (I don't know).
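Stripped of all the hard parts (fuzzy matching, plural forms, comments), the merge step can be sketched in a few lines of Python. This is a toy model, not Pootle's actual merge code: translations are kept only when their source string survives unchanged in the new template.

```python
# Toy sketch of merging an updated template with an existing translation.
# Real merging is far more involved; the data here is invented.
def merge(template_sources, old_translations):
    """template_sources: source strings in the new POT/XLIFF.
    old_translations: dict mapping source string -> translation."""
    merged = {}
    for source in template_sources:
        # Keep the existing translation if the source string is unchanged,
        # otherwise leave the entry untranslated.
        merged[source] = old_translations.get(source, "")
    return merged

old = {"File": "Fails", "Edit": "Labot", "Obsolete": "Novecojis"}
new_template = ["File", "Edit", "Preferences"]
print(merge(new_template, old))
# {'File': 'Fails', 'Edit': 'Labot', 'Preferences': ''}
```

The point of the sketch is the cost model: with plain files, both sides must be parsed on every merge; with a database, one side is already parsed, but everything else around the operation gets more complex.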
My personal conclusion is that this is something that we really need to
look at, and I am very happy that you and other people are getting into
it... but it is not something that should be resolved now in order to
start Guntaitas' project; there is too much at stake to rush a design
decision that will affect the whole future of Pootle. I would very much
prefer that we -in this list- analyse the issue much further and come
out with the right conclusion, which we will then implement, as we are as
interested as you are in making sure that Pootle scales and can respond
to Debian's needs, which means that it will be able to respond to the
needs of any other FOSS project.
As Christian has proposed, I think that if we can get separation of
front-end and back-end now, and write the API, we will be able later (or
in parallel) to store in databases all the information that we think
might help create a better Pootle.
I also think that we should immediately start an analysis of what
information might be interesting to have in a database and which
information should stay in files. It might even be interesting to have
the same information in both formats (every time an XLIFF file is
created or modified, the info is stored in a database, which would work
as a cache).
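The cache idea could look, very roughly, like the following Python sketch: every time an XLIFF file is written, its units are also written into a small SQLite table, so state queries (counters, untranslated strings) never need to re-parse the file. The table and column names are invented for illustration.

```python
# Sketch of "database as cache": mirror each saved XLIFF file's units
# into SQLite so counters can be queried without re-parsing the file.
# Schema and names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE units (
    path TEXT, source TEXT, target TEXT, phase TEXT)""")

def cache_file(path, units):
    # Refresh the cached rows for this file in one transaction.
    with conn:
        conn.execute("DELETE FROM units WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO units VALUES (?, ?, ?, ?)",
            [(path, u["source"], u["target"], u["phase"]) for u in units])

cache_file("lv/hello.xlf", [
    {"source": "Hello", "target": "Sveiki", "phase": "review-1"},
    {"source": "Quit", "target": "", "phase": "translate-1"},
])

# A counter (untranslated strings) comes from the cache, not the file.
untranslated, = conn.execute(
    "SELECT COUNT(*) FROM units WHERE path = ? AND target = ''",
    ("lv/hello.xlf",)).fetchone()
print(untranslated)  # 1
```

The file stays the authoritative copy; the database row set is disposable and can always be rebuilt from the files.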
More comments below
Aigars Mahinovs wrote:
In my opinion it would be quite problematic to implement the
distributed version of this system by distributing the backend - that
would totally bypass all the permissions and would cause all sorts of
trust issues.
It would be much more logical to have XML-RPC or something like that,
have the synchronisation processes launched by cron on a regular
basis, and have the incoming data streams processed in accordance with
the local rules. For example, messages from a trusted localisation
team server could be integrated directly, but messages from Rosetta
would go via some kind of approval dependent on the localisation team's
practices.
I think that you are right, this might be a very good way of doing it.
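The per-origin trust rules Aigars describes could be sketched like this in Python. The origins and the policy table are invented for illustration, and the transport (XML-RPC or otherwise) is deliberately left out; only the dispatch idea is shown.

```python
# Sketch: incoming submissions are handled according to per-origin
# trust rules. Origin names and the policy table are hypothetical.
POLICY = {
    "l10n-team-server": "integrate",   # trusted: apply directly
    "rosetta": "queue-for-review",     # untrusted: needs team approval
}

def handle_submission(origin, source, translation, store, review_queue):
    # Unknown origins default to the cautious path.
    action = POLICY.get(origin, "queue-for-review")
    if action == "integrate":
        store[source] = translation
    else:
        review_queue.append((origin, source, translation))
    return action

store, review_queue = {}, []
print(handle_submission("l10n-team-server", "File", "Fails",
                        store, review_queue))   # integrate
print(handle_submission("rosetta", "Edit", "Labot",
                        store, review_queue))   # queue-for-review
```

Whatever the wire protocol ends up being, keeping the policy table local to each server is what preserves the permission model.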
I imagine that the number of times we need to write one string to the file
(making or updating a translation) outnumbers the number of times we
need to get the full file (downloading the result) on the order of
1000:1. And I also imagine that creating a PO file from said XLIFF
will take just as much time as making it from a database (or even
more).
I think that people will tend to work offline, therefore managing files.
The system is being developed for native use of XLIFF files, which makes
translation editors much easier to use for translators; creating PO
files would only be for people who still do not want to change, for
whatever reasons.
The CPU may be more occupied in doing fuzzy matching of strings. I'm not
sure the fuzzy matching algorithm can use some kind of cache in a
database. (The number of fuzzy matching operations is more than
proportional to the number of strings - which IMHO better reflects the
size of the translation server than the number of simultaneous users
triggering write operations.)
The CPU is most occupied at startup, indexing and checking files. This
would not change at all with a DB; that work needs to be backgrounded.
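The fuzzy-matching cost mentioned above can be sketched with the standard library's difflib (standing in for whatever matcher Pootle actually uses): each new source string is compared against every existing one, so the work grows roughly quadratically with the number of strings. The example data is invented.

```python
# Sketch: find the closest existing source string for a new one.
# difflib is a stand-in for Pootle's real matcher; data is illustrative.
import difflib

def best_fuzzy_match(new_source, old_translations, cutoff=0.6):
    """Return (old_source, translation) for the closest match, or None."""
    best, best_ratio = None, cutoff
    # One comparison per stored string: N new strings against M old
    # strings costs N*M matcher runs, hence the superlinear growth.
    for old_source, translation in old_translations.items():
        ratio = difflib.SequenceMatcher(None, new_source, old_source).ratio()
        if ratio > best_ratio:
            best, best_ratio = (old_source, translation), ratio
    return best

old = {"Open file": "Atvērt failu", "Close window": "Aizvērt logu"}
print(best_fuzzy_match("Open a file", old))
# ('Open file', 'Atvērt failu')
```

Whether a database can shortcut this is exactly the open question: similarity is computed pairwise, so a DB index on exact strings does not directly help.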
This would be completely eliminated by the DB, because the DB engine
would be doing those tasks using highly optimised C and assembler
code.
We need to understand the processes here a little better and understand
the need. Of course, any operation inside a database would be faster,
there is no doubt of that, but there are a number of other things that
need to be taken into account. If the result is that a DB is faster and
does not make things too complicated, databases it should be...
Well, we need to think of the database schema in the way to use as
much processing as possible on the database side.
One other thing about the database backend is that you can easily move
the database to a different server from Pootle itself, and the
database software can also easily be distributed across several servers if
there is any kind of bottleneck there.
This is definitely true, and can make the Pootle/DB server pair
very powerful. We really need to look at this, and try to make a plan to
offload as many tasks as possible to a DB server, while making sure that
we do not end up with an over-complicated structure that later becomes too
complicated to use or maintain.
Javier