[Pkg-ime-devel] RFS: scim-waitzar, libwaitzar (re-submission) Attn: Paul Wise

Subject: [Pkg-ime-devel] RFS: scim-waitzar, libwaitzar (re-submission) Attn: Paul Wise
From: sorlok_reaves@yahoo.com (S'orlok Reaves)
Date: Tue, 20 Jan 2009 23:09:08 -0800 (PST)
Message-id: <772381.13161.qm@web30001.mail.mud.yahoo.com>
In-reply-to: <e13a36b30901202232o312cf9f2xde3226e17613a091@mail.gmail.com>

> So which of these are used for creating the Myanmar.model
> file?
None of them, actually. Creating Myanmar.model takes a few steps:
1) Copy all Burmese words from Myanmar_List_v2.txt into Myanmar.model
2) For each word, create and store a reverse-look-up in Myanmar.model
(The next few steps are optional)
3) For a given corpus, scan each word and count its frequency. Then, compute bigram and trigram frequencies. (I currently use a Java script for this).
4) Prune out uni/bi/trigrams which are considered "useless" (matter of opinion; again, I use a Java script to help me). Store uni/bi/trigrams in Myanmar.model
(The next steps are quality assurance)
5) Go over the model by hand, checking for errors and out-of-order encoding.
6) Use the KaNaung code to convert each word into our three output encodings. Visually check that these all look the same.
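
For illustration only, steps 3 and 4 might look roughly like the Python sketch below. This is not the Java helper code mentioned above (which is not released). The function names, the toy corpus, and the minimum-count pruning rule are all assumptions; the real criterion for "useless" n-grams is a matter of judgment, and the corpus is assumed to be already split into words.

```python
from collections import Counter

def count_ngrams(sentences, max_n=3):
    """Count unigrams, bigrams and trigrams over pre-segmented sentences.

    `sentences` is a list of word lists; segmentation is assumed done elsewhere.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for words in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of length n across the sentence.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    """Drop n-grams seen fewer than `min_count` times.

    A plain frequency cutoff is a stand-in (assumption) for the
    hand-tuned notion of "useless" described in step 4.
    """
    return {
        n: Counter({gram: c for gram, c in ctr.items() if c >= min_count})
        for n, ctr in counts.items()
    }

# Toy corpus with placeholder tokens instead of real Burmese words.
corpus = [["a", "b", "c"], ["a", "b", "d"]]
model = prune(count_ngrams(corpus))
```

The surviving counts would then be written into Myanmar.model in whatever layout that file uses; that serialization step is not shown because the format is not described here.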

Two unattractive properties of this process:
1) It requires a lot of manual intervention (for QA, which I feel is important).
2) The Java scripts I use were written before Unicode 5.1 came out, so I used the Zawgyi-One encoding internally. This encoding is non-standard, and requires a great deal of knowledge to use properly. (This is one of the main reasons I am not comfortable releasing my Java helper scripts: I don't feel right promoting the use of a broken, non-standard encoding.)

I suppose a long-term goal would be to release a set of Unicode 5.1, waitzar-specific trigram generators; however, this is really just a pipe dream for now. It would be a huge amount of work, with very little benefit.

Cheers,
-->Seth
PS: When I say "a Java script", I mean "a script written in the Java programming language".
