
Re: [translate-pootle] Wordforge



I personally feel strongly that the move to storing everything in a
database is long overdue in Pootle, but I am very cautious about it
out of respect for development decisions that the Pootle developers
made earlier, for reasons I do not know at this point. I would really
like those reasons to be brought up so that this technical decision
can be discussed on its real technical merits.

On 6/8/06, Dwayne Bailey <dwayne@translate.org.za> wrote:
> Memory: Have you actually validated high memory usage, or is that just
> guessing?

I have repeatedly heard that this is one of the main problems with
Pootle, but I have not checked it myself.

> Distribution: centralising a database is not distribution of
> translations at all.  It would work to a point; it's the CVS model vs
> the Bazaar model.  Thus it allows no independence of teams (project,
> language, etc.).  That is what Pootle hopes to achieve in its
> distribution strategies.

In my opinion it would be quite problematic to implement the
distributed version of this system by distributing the backend: that
would completely bypass the permission system and cause all sorts of
trust issues.

It would be much more logical to have XML-RPC or something like that,
with synchronisation processes launched by cron on a regular basis and
incoming data streams processed in accordance with local rules. For
example, messages from a trusted localisation team server could be
integrated directly, while messages from Rosetta would go through some
kind of approval process, depending on the localisation team's
practices.
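To make the idea concrete, here is a minimal sketch of that trust-based
routing. The source names, the in-memory store and the approval queue
are all hypothetical, not Pootle's actual API:

```python
# Hypothetical set of peers whose changes are integrated directly.
TRUSTED_SOURCES = {"l10n-team.example.org"}

def process_incoming(source, updates, store, approval_queue):
    """Integrate updates from trusted peers directly; queue everything
    else for review by the local team, per the local rules."""
    for unit_id, target in updates:
        if source in TRUSTED_SOURCES:
            store[unit_id] = target                           # direct integration
        else:
            approval_queue.append((source, unit_id, target))  # needs approval
```

The cron job would fetch the update streams over XML-RPC and feed them
through a router like this, so each server keeps its own policy.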

> Indexing and locking: indexing is manual, and we wanted to move it to
> a DB anyway (stats, untranslated, fuzzy, etc.).  Locking is mostly
> solved at the file level.  But now we see contention issues relating
> to who was first with a change, which a DB wouldn't solve.

A database does very nice, heavily optimised indexing, and row-level
locking can be done; but at that low level the ownership principle
should be the solution.

> I'm not averse to a DB.  But I feel it should be based on a real
> problem, not perceived problems.  I'd also like to see that, if we do
> work with a DB in some way, the files remain our authority.

I am not really seeing the reasoning behind having files as the
primary storage. Yes, we need to store enough metainformation in the
database to be able to rebuild PO files byte for byte, and also to
produce files in other formats carrying the same metainformation; but
I really see no reason to treat files as the primary storage.
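As a sketch of what "enough metainformation" could mean, here is an
illustrative schema (table and column names are my own, not a proposed
Pootle design) where every piece of PO metadata survives in the rows,
so the file can be regenerated exactly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE unit (
        store     TEXT    NOT NULL,  -- which catalog the unit belongs to
        position  INTEGER NOT NULL,  -- original order inside the file
        source    TEXT    NOT NULL,  -- msgid
        target    TEXT,              -- msgstr
        comments  TEXT,              -- '#' translator/extracted comments, verbatim
        locations TEXT,              -- '#:' source references, verbatim
        flags     TEXT               -- '#,' flags such as 'fuzzy'
    )""")
conn.executemany("INSERT INTO unit VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("app.po", 2, "Close", "Aizvert", "", "#: ui.c:20", ""),
    ("app.po", 1, "Open",  "Atvert",  "", "#: ui.c:10", "fuzzy"),
])
# Reconstruction simply walks the units in their original file order.
ordered = [row[0] for row in conn.execute(
    "SELECT source FROM unit WHERE store = 'app.po' ORDER BY position")]
```

The same rows can then be serialised as PO, XLIFF or anything else,
since no format-specific information was thrown away.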

> > I'm not that sure a database would bring that important a
> > performance improvement.
> >
> > Writing to the database will probably be faster than writing an
> > XLIFF file.

Writing a single changed string to a database will be several orders
of magnitude faster than reading an XML file, parsing it, changing it
and writing the parsed version back. And the database code will take
care of the locking, caching and indexing in the process.
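That difference is easy to see in miniature (table and column names
are illustrative): one changed string is one UPDATE touching one row,
with locking, journaling and index maintenance handled by the engine,
and no parse/serialise cycle over the whole file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unit (id INTEGER PRIMARY KEY, source TEXT, target TEXT)")
conn.execute("INSERT INTO unit VALUES (1, 'File', NULL)")

# The whole write path for one translation update: a single row changes.
conn.execute("UPDATE unit SET target = ? WHERE id = ?", ("Datne", 1))
conn.commit()
target = conn.execute("SELECT target FROM unit WHERE id = 1").fetchone()[0]
```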

> > But users may want to retrieve XLIFF files. This operation will be
> > faster if the strings are already stored in an XLIFF file.

I imagine that the number of times we need to write one string to the
file (making or updating a translation) outnumbers the number of times
we need to get the full file (downloading the result) on the order of
1000:1. And I also imagine that creating a PO file from said XLIFF
would take just as much time as making it from a database, or even
more.

> > The CPU may be more occupied doing fuzzy matching of strings. I'm
> > not sure the fuzzy matching algorithm can use some kind of cache in
> > a database. (The number of fuzzy matching operations is more than
> > proportional to the number of strings, which IMHO better reflects
> > the size of the translation server than the number of simultaneous
> > users triggering write operations.)

> The CPU is most occupied at startup, indexing and checking files.
> This would not change at all with a DB.  That needs to be
> backgrounded.

This would be completely eliminated by the DB, because the DB engine
would be doing those tasks using highly optimised C and assembler
code.

> We are not currently doing fuzzy matching in Pootle.  But our ideas
> for doing that would not be live.  This is where a DB does make
> sense, but that deserves another mail.

If the database engine supports fuzzy full-text search, then it will
be better than anything that we can write in Python or any other
high-level language.
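As a small illustration of delegating the matching to the engine, here
is a sketch assuming an SQLite build with the FTS5 extension (a real
deployment might use PostgreSQL full-text search instead; the data is
made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE tm USING fts5(source, target)")
conn.executemany("INSERT INTO tm VALUES (?, ?)", [
    ("Open the file",    "Atvert datni"),
    ("Close the window", "Aizvert logu"),
    ("Save the file as", "Saglabat datni ka"),
])
# The engine tokenises, indexes and ranks the candidates itself,
# all in optimised C code rather than in Python.
matches = [row[0] for row in conn.execute(
    "SELECT source FROM tm WHERE tm MATCH ? ORDER BY rank", ("file",))]
```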

> We've had to revisit the file vs DB decision countless times; we're
> used to it.  Most people who propose DBs list all sorts of irrelevant
> reasons, text matching being my favourite.  Plus their idea of a DB
> schema usually shows a complete lack of understanding of how to use a
> DB.  And it's this idea of trying to squeeze documents into an RDBMS.

Well, we need to design the database schema so that as much processing
as possible happens on the database side.
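One example of pushing work onto the database side: the per-file
statistics Pootle needs (translated / fuzzy / untranslated) become a
single aggregate query instead of a Python loop over parsed files. The
schema and data here are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unit (store TEXT, target TEXT, fuzzy INTEGER)")
conn.executemany("INSERT INTO unit VALUES (?, ?, ?)", [
    ("app.po", "Sveiki", 0),   # translated
    ("app.po", "Datne",  1),   # fuzzy
    ("app.po", "",       0),   # untranslated
])
# All three counts in one pass inside the engine.
stats = conn.execute("""
    SELECT SUM(target != '' AND NOT fuzzy) AS translated,
           SUM(fuzzy)                      AS fuzzy,
           SUM(target = '')                AS untranslated
    FROM unit WHERE store = 'app.po'""").fetchone()
```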

One other thing about a database backend is that you can easily move
the database to a server separate from Pootle itself, and the database
software can easily be distributed across several servers if there is
any kind of bottleneck there.

--
Best regards,
   Aigars Mahinovs        mailto:aigarius@debian.org
#--------------------------------------------------------------#
|     .''`.         Debian GNU/Linux              LAKA         |
|    : :' :      http://www.debian.org  &  http://www.laka.lv  |
|    `. `'                                                     |
|      `-                                                       |
#--------------------------------------------------------------#


