
Re: [translate-pootle] Wordforge



Hi all,

I've been a bit busy lately and have only been passively reading the list. I 
may be a little too technical here, but that's intentional. ;)

There are some things I would like to add to this discussion:

Databases are not the only thing that scales; file systems can scale too, 
e.g. you can set up an NFS cluster, so flat files aren't necessarily a 
bottleneck. Of course, only Pootle would be given access to the files.

XML files do not need to be reparsed on every access, as Python has 
"pickle" [1], which can store a Python object in binary form in a file. This 
may be a slightly risky, hackish trick, but it could greatly accelerate XML 
handling; the plain XML would of course still be stored alongside the pickled 
version. Also, since Pootle is converting to Kid templates, I'm not worried 
that it would be slow, since Kid uses the ElementTree XML parsing library 
[2], which has a C implementation. Pickle could also be used for RPC.
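To make the idea concrete, here is a minimal sketch (not actual Pootle code) of the pickle-sidecar trick: a parsed PO/XLIFF representation is cached in a `.pickle` file next to the plain file and reused as long as the source has not changed. The function and file names are mine, for illustration only.

```python
import os
import pickle

def load_cached(xml_path, parse_fn):
    """Return the parsed object for xml_path, using a .pickle sidecar
    file when it is at least as new as the XML file itself."""
    cache_path = xml_path + ".pickle"
    try:
        if os.path.getmtime(cache_path) >= os.path.getmtime(xml_path):
            with open(cache_path, "rb") as f:
                return pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        pass  # missing or corrupt cache: fall through and reparse
    # Cache miss: parse the plain XML, then store the pickled result
    # alongside it for the next access.
    obj = parse_fn(xml_path)
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return obj
```

The risk mentioned above is real: a stale or corrupt pickle must never win over the plain XML, which is why the XML file stays authoritative and the cache is rebuilt on any doubt.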

There are proven solutions that should be reused, e.g. memcached [3]. It can 
serve as a distributed cache spread across the RAM of a number of servers 
and has been in production use for a long time. The current cache system is 
less flexible and written in Python, while memcached is a daemon written in 
C. A Python API binding is available [4], though it is not yet packaged in 
Debian.
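As a sketch of how Pootle could use such a cache, here is the usual cache-aside pattern for per-file statistics. In production the `cache` object would be a `memcache.Client(['127.0.0.1:11211'])` from the binding in [4]; here any object with `get`/`set` works, so the sketch includes a dict-backed stand-in. All names are hypothetical.

```python
def cached_stats(cache, po_path, compute_stats, ttl=300):
    """Fetch per-file statistics from the cache, computing and
    storing them on a miss (cache-aside pattern)."""
    key = "stats:" + po_path
    stats = cache.get(key)
    if stats is None:
        stats = compute_stats(po_path)
        cache.set(key, stats, ttl)
    return stats

class DictCache:
    """Stand-in for memcache.Client in this sketch (ignores the TTL)."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value, ttl=0):
        self.data[key] = value
```

The point of the pattern is that the expensive `compute_stats` call runs only on a miss; every web-frontend process sharing the memcached daemon then benefits from one process's work.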

Fuzzy matching should be implemented separately. It is usually a 
CPU-intensive task, so it would be handy to have 'fuzzy servers' separate 
from the main Pootle server, which then remains responsive at all times. 
This would also make it possible to have, e.g., one fuzzy server for Debian 
translations, one for GNOME, one for KDE ... if needed, of course. And if 
some communities want to disable fuzzy matching for whatever reason, 
implementing it separately makes perfect sense.
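To show why this is the CPU-heavy part, here is a minimal sketch of the core computation such a fuzzy server would run, using only the standard library (difflib). A real service would sit behind RPC and likely use a faster C implementation; the function name and threshold are mine.

```python
import difflib

def fuzzy_matches(source, translation_memory, threshold=0.75):
    """Return (score, tm_source, tm_target) tuples for entries of the
    translation memory whose source string resembles `source`,
    best matches first. O(len(memory)) similarity computations."""
    hits = []
    for tm_source, tm_target in translation_memory:
        score = difflib.SequenceMatcher(None, source, tm_source).ratio()
        if score >= threshold:
            hits.append((score, tm_source, tm_target))
    hits.sort(reverse=True)  # highest similarity first
    return hits
```

Since every lookup scans the whole translation memory, the cost grows with both the number of queries and the size of the memory, which is exactly why pushing this off the main server keeps Pootle responsive.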

Indexing would be split: statistics, translation states and similar metadata 
would go into a relational database for faster access, but for the PO/XLIFF 
files themselves I think I'd rather stick with (pickled) flat files.


So we'd have indexing, caching, a storage backend, fuzzy matching, 
middleware (the part of Pootle handling mail and the like) and a web 
frontend. The Pootle server would glue these together into a simple-to-use 
package.

I hope I'll have more time in July; I'm having fun with exams now.

Gasper

[1] http://docs.python.org/lib/module-pickle.html
[2] http://effbot.org/zone/element-index.htm
[3] http://www.danga.com/memcached/
[4] ftp://ftp.tummy.com/pub/python-memcached/

On Friday 09 June 2006 11:58, Javier SOLA wrote:
> Hi Aigars,
>
> I think that this discussion is very important.
>
> We need to ensure that Pootle is capable of handling the large amounts
> of information that Debian needs, which is probably not the problem, but
> it must also handle all the processes that Debian needs. The solution
> might lie either in file handling or in databases.
>
> First, it is important to understand the complexity of the data that we
> are handling. We are not only talking about a set of source strings and
> their translations, associated in files. We are talking about managing
> process information to optimize the result of the work of the
> translators. Each XLIFF file not only contains strings and information
> about them. It might also contain a glossary, translation memory
> information, comments from translators or reviewers, information about
> the results of tests run on each string, data for connection to SVN... and
> process information: a series of phases through which each file has
> already gone
> (translation-review-approval-update-translation-review-approval...),
> associating each message with a given phase. We can also have translations
> of the same message into other languages, as reference. Also, XLIFF files
> might include counters that give information about the state of the
> file, without having to recalculate.
>
> All this information is easy to store in XML, but it would require quite
> a complex database.
>
> My belief is that the process that will use the most time in Pootle is
> the merging of two files, which must happen when a file is
> committed to SVN, when it is uploaded to Pootle by a translator... or
> when a new POT/XLIFF file is uploaded to Pootle to update all
> translations of a given package (much more efficient than doing all the
> languages one by one against CVS). If the data is in a database, then at
> least one file does not need to be parsed every time the process runs,
> and the process would probably be faster, but there are many other
> factors that could become more complicated because of the DB. Updates
> take place at non-critical times, but user requests for files must be
> responded to immediately. If all the files need to be created before being
> served to the user, this process might take longer than what the user is
> prepared to wait (I don't know).
>
> My personal conclusion is that this is something that we really need to
> look at, and I am very happy that you and other people are getting into
> it... but it is not something that should be resolved now in order to
> start Guntaitas' project; there is too much at stake to rush a design
> decision that will affect the whole future of Pootle. I would very much
> prefer that we -in this list- analyse the issue much further and come
> to the right conclusion, which we will then implement, as we are as
> interested as you are in making sure that Pootle scales and can respond
> to Debian's needs, which means that it will be able to respond to the
> needs of any other FOSS project.
>
> As Christian has proposed, I think that if we can get separation of
> front-end and back-end now, and write the API, we will be able later (or
> in parallel) to store in databases all the information that we think
> might help creating a better Pootle.
>
> I also think that we should immediately start an analysis of what
> information might be interesting to have in a database and which
> information should stay in files. It might even be interesting to have
> the same information in both formats (every time an XLIFF file is
> created or modified, the info is stored in a database, which would work
> as a cache).
>
> More comments below
>
> Aigars Mahinovs wrote:
> >In my opinion it would be quite problematic to implement the
> >distributed version of this system by distributing the backend - that
> >would totally bypass all the permissions and would cause all sorts of
> >trust issues.
> >
> >It would be much more logical to have XML RPC or something like that,
> >have the synchronisation processes launched by cron on a regular
> >basis, and have the incoming data streams processed in accordance with
> >the local rules. For example, messages from a trusted localisation
> >team server could be integrated directly, but messages from Rosetta
> >would go via some kind of approval dependent on the localization
> >team's practices.
>
> I think that you are right, this might be a very good way of doing it.
>
> >I imagine that the number of times we need to write one string to the
> >file (making or updating a translation) outnumbers the number of times
> >we need to get the full file (downloading the result) by on the order
> >of 1000:1. And I also imagine that creating a PO file from said XLIFF
> >will take just as much time as making it from a database (or even
> >more).
>
> I think that people will tend to work offline, and therefore manage files.
> The system is being developed for native use of XLIFF files, which makes
> translation editors much easier for translators to use; creating PO
> files would only be for people who still do not want to change, for
> whatever reason.
>
> >>>The CPU may be more occupied in doing fuzzy matching of strings. I'm not
> >>>sure the fuzzy matching algorithm can use some kind of cache in a
> >>>database. (The number of fuzzy matching operations is more than
> >>>proportional to the number of strings - which IMHO better reflects the
> >>>size of the translation server than the number of simultaneous users
> >>>triggering write operations)
> >>
> >>The CPU is most occupied at startup, indexing and checking files.  This
> >>would not change at all with a DB.  That needs to be backgrounded.
> >
> >This would be completely eliminated by the DB, because the DB engine
> >would be doing those tasks using highly optimised C and assembler
> >code.
>
> We need to understand the processes here a little better and understand
> the need. Of course, any operation inside a database would be faster,
> there is no doubt of that, but there are a number of other things that
> need to be taken into account. If the result is that a DB is faster and
> does not make things too complicated, databases it should be...
>
> >Well, we need to design the database schema in a way that does as
> >much processing as possible on the database side.
> >
> >One other thing about the database backend is that you can easily move
> >the database to a server separate from Pootle itself, and the
> >database software can also easily be distributed across several servers
> >if there is any kind of bottleneck there.
>
> This is definitely true, and can make the Pootle/DB server pair
> very powerful. We really need to look at this, and try to make a plan to
> offload as many tasks as possible to a DB server, while making sure that
> we do not end up with an over-complicated structure that later becomes
> too complicated to use or maintain.
>
> Javier
