
Re: glimpse sucks.



On Fri, Sep 22, 2000 at 09:20:29AM +0900, NOKUBI Takatsugu wrote:
> Hmm... I don't know about Chinese. I think it is hard to determine
> word boundaries in Chinese, so some word segmentation tool is needed
> for processing Chinese (like kakasi or chasen for Japanese). I looked
> at the output of "apt-get search chinese", but it seems there is no
> such tool...
The way udmsearch works is that it has a big array of characters.  A run
of characters from that array makes up a word; anything not in the array
is treated as whitespace.  Easy stuff for single byte character sets.
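
A rough sketch of that idea in Python. This is not udmsearch's actual
code; the table contents (ASCII letters and digits) and the name
split_words are assumptions made for illustration.

```python
import string

# Assumption: the "big array" holds ASCII letters and digits only.
WORD_CHARS = frozenset((string.ascii_letters + string.digits).encode("ascii"))

def split_words(data: bytes) -> list[bytes]:
    """Split a single-byte string into words using the character table."""
    words = []
    current = bytearray()
    for byte in data:
        if byte in WORD_CHARS:
            # Byte is in the table: it extends the current word.
            current.append(byte)
        elif current:
            # Byte is not in the table: it acts as whitespace and ends the word.
            words.append(bytes(current))
            current.clear()
    if current:
        words.append(bytes(current))
    return words
```

Every byte is classified on its own, which is why this only works when
one byte is one character.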

I don't understand dual byte character sets at all, but I'm guessing
you could do a similar thing: have an array of the byte pairs which
make up characters.  Beyond that, I really don't know what to do here.
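
A guess at what the byte-pair step could look like, as a sketch only.
It assumes EUC-CN (GB2312), where a Chinese character is two bytes that
both fall in 0xA1..0xFE; other double-byte encodings use other ranges.
The names is_high and split_chars are made up for this example.  Note
that it only recovers characters, not word boundaries, so it does not
solve the segmentation problem raised above.

```python
def is_high(byte: int) -> bool:
    # Assumption: EUC-CN, where both bytes of a pair lie in 0xA1..0xFE.
    return 0xA1 <= byte <= 0xFE

def split_chars(data: bytes) -> list[bytes]:
    """Split a byte string into one-byte and two-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        if is_high(data[i]) and i + 1 < len(data) and is_high(data[i + 1]):
            # Two high bytes in a row form one double-byte character.
            chars.append(data[i:i + 2])
            i += 2
        else:
            # Anything else is treated as a single-byte character.
            chars.append(data[i:i + 1])
            i += 1
    return chars
```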

> There is another solution: the "letter indexing approach". However,
> that approach is more difficult to implement than the "word indexing
> approach". It should be hard to implement in Glimpse.
Sounds hard to implement at all.
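
For reference, a toy version of letter indexing as I read the term:
index every character with its position, then match a query by
checking for consecutive positions.  This is a minimal sketch on
already-decoded text, not how Glimpse or udmsearch work; build_index
and find are hypothetical names, and the query is assumed to contain no
whitespace.

```python
from collections import defaultdict

def build_index(text: str) -> dict[str, list[int]]:
    """Map each non-whitespace character to the positions where it occurs."""
    index = defaultdict(list)
    for pos, ch in enumerate(text):
        if not ch.isspace():
            index[ch].append(pos)
    return index

def find(index: dict[str, list[int]], query: str) -> list[int]:
    """Return start positions where the query's characters appear in a row."""
    if not query:
        return []
    starts = set(index.get(query[0], []))
    for offset, ch in enumerate(query[1:], start=1):
        positions = set(index.get(ch, []))
        # Keep only starts whose character at this offset also matches.
        starts = {s for s in starts if s + offset in positions}
    return sorted(starts)
```

No word boundaries are needed, which is the attraction for Chinese, but
the index gets one entry per character rather than one per word, so a
real implementation has much more to store and intersect.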

Is there any solution that searches well for both single and dual byte
character sets?
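
One possible shape for such a thing, purely as a sketch: keep whole
words for alphabetic text and fall back to one token per character for
CJK text.  The Unicode ranges below (kana and CJK Unified Ideographs)
and the names is_cjk and tokenize are assumptions for illustration, and
the input is assumed to be decoded already.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: hiragana/katakana (U+3040..U+30FF) and
    # CJK Unified Ideographs (U+4E00..U+9FFF) are indexed per character.
    return "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"

def tokenize(text: str) -> list[str]:
    """Words for alphanumeric runs, single characters for CJK text."""
    tokens = []
    word = []
    for ch in text:
        if not is_cjk(ch) and ch.isalnum():
            # Part of an ordinary word: keep collecting (lowercased).
            word.append(ch.lower())
            continue
        if word:
            # Any other character ends the word in progress.
            tokens.append("".join(word))
            word = []
        if is_cjk(ch):
            tokens.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens
```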

  - Craig
-- 
Craig Small VK2XLZ  GnuPG:1C1B D893 1418 2AF4 45EE  95CB C76C E5AC 12CA DFA5
Eye-Net Consulting http://www.eye-net.com.au/        <csmall@eye-net.com.au>
MIEEE <csmall@ieee.org>                 Debian developer <csmall@debian.org>


