
Re: glimpse sucks.



In article <20000922155516.A808@eye-net.com.au>
csmall@eye-net.com.au writes:

>> On Fri, Sep 22, 2000 at 09:20:29AM +0900, NOKUBI Takatsugu wrote:
>> > Hmm... I don't know about Chinese. I think, it is hard to determine
>> > word boundary in Chinese. So some word segmentation tools need for
>> > processing Chinese (like kakasi, chasen in Japanese). I looked in the
>> > output of "apt-get search chinese", but it seems there are no such
>> > tool...
>> The way udmsearch works is it has a big array of characters.  Those
>> characters make up a word, anything not in that array makes up
>> whitespace.  Easy stuff for single byte character sets.
>> 

>> I totally don't understand dual byte character sets at all, but I'm
>> guessing you could do a similar thing. Have an array of dual bytes which
>> make up characters.  I really don't know what to do here.
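
For reference, the character-table splitting described in the quote above
can be sketched roughly as follows. This is only an illustration in Python,
not udmsearch's actual code; WORD_CHARS and split_words are names made up
for this example, and the table contents are an assumption.

```python
# Hypothetical character table (an assumption for this sketch);
# udmsearch's real table is not reproduced here.
WORD_CHARS = frozenset(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
)


def split_words(text, word_chars=WORD_CHARS):
    """Split text into words; any character outside word_chars separates words."""
    words = []
    current = []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A non-word character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Japanese and Chinese text has no separator characters between words, so a
table like this either extracts nothing or, if the table is extended with
those characters, turns a whole sentence into one "word".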

When we think about a search engine, we are really talking about
language processing. It involves grammar, vocabulary, and more,
not only character sets.

It is difficult to treat a sentence as a simple byte stream. However,
such an approach does exist.

Sufary is software that takes this approach. However, it creates an
index file four times the size of the original data. In addition,
Sufary can handle only one file.
Sufary is packaged as "sufary" in Debian.
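
As far as I know, Sufary is built on a suffix array: a sorted list of
offsets into the text. The sketch below shows that general idea in Python.
It is not Sufary's code or file format; build_suffix_array and find_all are
hypothetical names, and the construction is deliberately naive.

```python
def build_suffix_array(data):
    """Return all start offsets of data, sorted by the suffix that begins there.

    Naive construction for illustration only; real tools use faster methods.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def find_all(data, sa, pattern):
    """Return the sorted offsets at which pattern occurs in data.

    Binary-searches the suffix array for the range of suffixes whose
    first len(pattern) bytes equal pattern.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

The search works on raw bytes, so no word segmentation is needed. If one
offset is stored for every byte and each offset takes 4 bytes (an
assumption about the layout), the index comes to 4 times the size of the
data, which would match the size mentioned above.
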
-- 
NOKUBI Takatsugu
E-mail: knok@daionet.gr.jp
	knok@debian.org


