
Re: [translate-pootle] Wordforge



On Thu, 2006-06-08 at 01:32 +0200, Nicolas François wrote:
> Hello Gintautas,
> 
> On Thu, Jun 08, 2006 at 12:17:28AM +0300, Gintautas Miliauskas wrote:
> > Hi,
> > 
> > I have a strong opinion about the direction Pootle's backend should be
> > headed. I think that at the moment you have a 'loose' system based on
> > files, which is simple and transparent.  However, it is inefficient
> > in memory usage, speed and ease of distribution.  Since memory usage
> > depends on the database size, I take it that you are using some sort of
> > memory cache to speed up the system.

If we're not working with a file, it's not in memory.

> > I think that an obvious solution here is to use a relational database
> > (I would suggest PostgreSQL).  Unlike ordinary files, it allows
> > extremely speedy random writes which is exactly what we need here.
> > I would expect the problem of high memory usage to disappear completely
> > too.  In fact, if we can put all important data on the database,
> > distribution of the system would then become trivial -- several
> > instances of the application (possibly on different computers) would
> > simply use a single instance of the database.  I would say that by doing
> > everything (indexing, locking, etc.) manually we're reinventing the
> > wheel, badly.

Memory: Have you actually validated high memory usage or is that just
guessing?

Distribution: centralising a database is not distribution of
translations at all.  It would work to a point, but it's the CVS model
vs the Bazaar model: it allows no independence of teams (project,
language, etc).  That independence is what Pootle hopes to achieve in
its distribution strategy.

Indexing and locking:  Both are manual at the moment.  Indexing we
wanted to move to a DB anyway (stats, untranslated, fuzzy, etc).
Locking is mostly solved at the file level.  But we now see contention
issues over who was first with a change, which a DB wouldn't solve.

I'm not averse to a DB.  But I feel that it should be based on a real
problem, not a perceived one.  I'd also like to see that if we do
work in some way with a DB, the files remain our authority.  Just as we
respect changes from CVS, I'd like to see us respect files.  The DB
should be of the plastic, disposable sort: we can drop it and simply
rebuild it from the files.
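
To make that concrete, here is a minimal sketch of such a disposable
index: a throwaway SQLite table rebuilt by walking the files.  I'm
assuming the toolkit's factory and unit API (getobject, source/target,
isfuzzy/istranslated); the schema itself is only illustrative.

    import os
    import sqlite3
    from translate.storage import factory

    def rebuild_index(podir, dbpath):
        """Drop the disposable DB and rebuild it from the files."""
        db = sqlite3.connect(dbpath)
        db.execute("DROP TABLE IF EXISTS units")
        db.execute("""CREATE TABLE units
                      (path TEXT, source TEXT, target TEXT,
                       fuzzy INTEGER, translated INTEGER)""")
        for dirpath, _dirs, filenames in os.walk(podir):
            for name in filenames:
                if not name.endswith((".po", ".xlf")):
                    continue
                path = os.path.join(dirpath, name)
                store = factory.getobject(path)  # parse with the matching storage class
                for unit in store.units:
                    db.execute("INSERT INTO units VALUES (?, ?, ?, ?, ?)",
                               (path, str(unit.source), str(unit.target),
                                int(unit.isfuzzy()), int(unit.istranslated())))
        db.commit()
        return db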

> I'm not so sure a database would bring that significant a performance
> improvement.
> 
> Writing to the database will probably be faster than writing an XLIFF
> file.
> But users may want to retrieve XLIFF files. That operation will be faster
> if the strings are already stored in an XLIFF file.
> 
> Also, I'm not sure a Pootle server mostly does write operations
> (the number of write operations is probably proportional to the number
> of users).

Writes and memory are proportional to the number of users, not to the
number of files being managed.

XLIFF does worry me in terms of performance though, as working with XML
seems to be a bit heavy.  Users wanting XLIFF files would probably
not be as demanding as those working live.  Also, caching the files
could mitigate that hit.

> The CPU may be more occupied doing fuzzy matching of strings. I'm not
> sure the fuzzy matching algorithm can use some kind of cache in a
> database. (The number of fuzzy matching operations is more than
> proportional to the number of strings - which IMHO better reflects the
> size of the translation server than the number of simultaneous users
> triggering write operations)

The CPU is most occupied at startup, indexing and checking files.  This
would not change at all with a DB; that work needs to be backgrounded.

We are not currently doing fuzzy matching in Pootle.  But our ideas for
doing it would not run live.  This is where a DB does make sense, but
that deserves another mail.

> > I also think that using XLIFF, an XML format, for the backend is a bad
> > idea.  I think that XML is great for serializing data and sharing it
> > between completely disparate systems, but it's awful for random writes
> > and places where performance is important (such as the data storage
> > backend for a heavily used system). Nobody cares whether the backend
> > storage is compliant with some standard, it's only the interface where
> > standards-compliance matters. The backend must simply be as efficient
> > as possible and not get in the way.

I care if it's compliant!  Compliant with the needs of the files we work
on, that is.  But yes, treating XML as an exchange format might be the
correct move.
 
> > I do not mean to thrash your design decisions or stall work on the
> > backend.  Files as backend are great for small projects where
> > performance is unimportant, because then you don't need to set up an
> > SQL server.  I just want to suggest designing the API in such a way that
> > does not depend on files, i.e., such that a relational database would
> > not be too much trouble to plug in.

We really haven't tested high-scale usage.  I've used Pootle all day
with over 40 translators working on Mozilla, which is about 300 files,
though of course only 40 of them in memory.  I ran it on my laptop over
a computer lab LAN.

Debian would raise that bar considerably... or would it?  Let's play
with some figures and come up with some usage estimates.

180 000 files (not really relevant except on startup)
200 languages
4 translators live per language
800 total translators live
Let's assume they are all editing independent files
200K per text file
500K assumed per file object
400M of files in memory

Is that extreme?  Or are my sums wrong?
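
A quick check of those sums in Python (the 500K per parsed file object
is the assumption from the list above):

    languages = 200
    live_per_language = 4
    translators = languages * live_per_language  # 800 live translators
    per_file_object = 500 * 1024                 # assumed 500K per parsed file object
    in_memory = translators * per_file_object    # one independent file each
    print(in_memory // (1024 * 1024), "MB")      # -> 390 MB, roughly the 400M above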

We are not good at releasing that memory; that would be a problem.  But
it's 400M when most people would deploy a high-end Pootle server with at
least 2G.

> I'm not that used to Pootle. Maybe the base.TranslationStore (or
> TranslationUnit) API can be used for database storage.

That is what I would suggest.  Adding SQL as a class derived from base
would allow us to import files into SQL.  Keep the SQL storage
independent of anything else, making it easy to import any file that
complies with the base class.
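
A minimal sketch of what I mean, assuming the base API (a units list
and addunit); the class name and table layout are only illustrative:

    import sqlite3
    from translate.storage import base

    class SQLStore(base.TranslationStore):
        """Hypothetical SQL-backed store; any file whose storage class
        complies with base can be imported into it."""

        def __init__(self, dbpath):
            base.TranslationStore.__init__(self)
            self.db = sqlite3.connect(dbpath)
            self.db.execute("""CREATE TABLE IF NOT EXISTS units
                               (source TEXT, target TEXT)""")

        def addunit(self, unit):
            base.TranslationStore.addunit(self, unit)
            self.db.execute("INSERT INTO units VALUES (?, ?)",
                            (str(unit.source), str(unit.target)))

        def importstore(self, store):
            """Import any base-compliant store, e.g. from factory.getobject()."""
            for unit in store.units:
                self.addunit(unit)
            self.db.commit()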

Pootle could work in SQL.  When a request to download files is received,
Pootle could use the converter logic to merge translations from the DB
into the files.
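
A sketch of that merge-on-download step, reusing the disposable units
table from earlier; savefile is the base-class write method, and the
lookup is assumed, not existing Pootle code:

    from translate.storage import factory

    def merge_on_download(db, path):
        """Fill a file's untranslated units from the DB before serving it."""
        store = factory.getobject(path)
        for unit in store.units:
            if unit.istranslated():
                continue
            row = db.execute("SELECT target FROM units WHERE source = ?",
                             (str(unit.source),)).fetchone()
            if row and row[0]:
                unit.target = row[0]
        store.savefile(path)  # write the merged file, then serve it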

However, what I want to avoid at all costs is reliance on a DB.  We
should be able to dump the DB and just regenerate where we were.  The DB
should be about performance, not storage.

> It could later be interesting to investigate the performance gain given
> by such a storage. So if you think a method is missing or should be
> generalized (to help with faster searches in a database), it would be
> nice to know in advance.
> 
> > I would be happy to hear your thoughts.  I hope the letter did not come
> > out too harsh.  There may be more options here, or you may have some
> > plans that I am simply not aware of.  However, this is critical for my
> > work for Debian and I want to cover this ASAP.
> 
> If you can work on an API for the storage method, this won't be critical,
> and could be sorted out during an optimization phase (Pootle 1.5.1 in the
> Wordforge roadmap)

We've had to revisit the file vs DB decision countless times; we're used
to it.  Most people who propose DBs list all sorts of irrelevant
reasons, text matching being my favourite.  Plus their idea of a DB
schema usually shows a complete lack of understanding of how to use a
DB: it's this idea of trying to squeeze documents into an RDBMS.

I have some suggestions on what could work:

1) SQL on base class as Nicolas suggested
2) Migrate metadata to a DB (we wanted to do this anyway).  This would
be an incremental approach and leave the file format handling untouched
3) Do some real testing using the benchmarking scripts to check where
the bottlenecks might be
4) TM matches stored using a DB

2 & 4 give us an immediately useful addition for a DB and could help
simplify code.
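
For 2) and 4), this is the kind of plastic, disposable schema I have in
mind; the database path, table names and columns are only illustrative:

    import sqlite3

    db = sqlite3.connect("pootle-cache.db")  # hypothetical cache path
    # 2) per-file metadata: the stats we currently recompute at startup
    db.execute("""CREATE TABLE IF NOT EXISTS filestats
                  (path TEXT PRIMARY KEY, total INTEGER,
                   translated INTEGER, fuzzy INTEGER, mtime REAL)""")
    # 4) TM matches: precomputed suggestions keyed on the source string
    db.execute("""CREATE TABLE IF NOT EXISTS tm_matches
                  (source TEXT, candidate TEXT, target TEXT, score INTEGER)""")
    # mtime lets us invalidate a row when the file changes, keeping the
    # DB disposable: drop it and rebuild from the files at any time.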

We'll have a live Pootle server soon, so we can see the real problems
regarding memory; that is where 3) comes into play.

I'd say 4) is good, but I'd prefer you to be working against hard data.
I can guess better where the performance bottlenecks will be, and I
don't think you're guessing right yet.  These bottlenecks might in fact
be very easy to resolve, thus saving a major DB move.

OK.  So my suggestion is that you work on DB interventions where we've
already identified that they would be useful.  At the same time, let's
test where the bottlenecks are, and then, if needed, we look at a DB
without losing the usefulness of working directly with files.

-- 
Dwayne Bailey
Translate.org.za

+27-12-460-1095 (w)
+27-83-443-7114 (cell)


