Re: glimpse sucks.
In article <20000922155516.A808@eye-net.com.au>
csmall@eye-net.com.au writes:
>> On Fri, Sep 22, 2000 at 09:20:29AM +0900, NOKUBI Takatsugu wrote:
>> > Hmm... I don't know about Chinese. I think, it is hard to determine
>> > word boundary in Chinese. So some word segmentation tools need for
>> > processing Chinese (like kakasi, chasen in Japanese). I looked in the
>> > output of "apt-get search chinese", but it seems there are no such
>> > tool...
>> The way udmsearch works is it has a big array of characters. Those
>> characters make up a word, anything not in that array makes up
>> whitespace. Easy stuff for single byte character sets.
>>
>> I totally don't understand dual byte character sets at all, but I'm
>> guessing you could do a similar thing. Have an array of dual bytes which
>> make up characters. I really don't know what to do here.
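The character-table idea quoted above can be sketched roughly as
follows. This is only an illustration, not udmsearch's actual code: the
table contents (ASCII letters and digits) and the names WORD_BYTES,
IS_WORD and split_words are my own assumptions.

```python
import string

# Assumption: the "big array of characters" is modelled as a 256-entry
# lookup table indexed by byte value. The real udmsearch table may differ.
WORD_BYTES = (string.ascii_letters + string.digits).encode("ascii")
IS_WORD = [b in WORD_BYTES for b in range(256)]


def split_words(data: bytes) -> list:
    """Split bytes into words; any byte not in the table acts as whitespace."""
    words = []
    current = bytearray()
    for b in data:
        if IS_WORD[b]:
            current.append(b)
        elif current:
            # A non-word byte ends the word collected so far.
            words.append(bytes(current))
            current.clear()
    if current:
        words.append(bytes(current))
    return words
```

With a table like this, text in a multi-byte encoding such as EUC-JP
contains no "word" bytes at all, and even a table extended with two-byte
values would not say where one word ends and the next begins.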
When we think about a search engine, we have to treat it as language
processing. It involves grammar, vocabulary, and so on, not only
character sets.
It is difficult to treat a sentence as a simple byte stream. However,
such an approach does exist.
Sufary is software that takes this approach. But it creates an index
file 4 times the size of the original data. In addition, Sufary can
handle only one file.
Sufary is packaged for Debian as "sufary".
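
To illustrate the byte-stream approach, here is a minimal sketch of a
suffix array, the kind of index I understand Sufary to be built on. It
is not Sufary's code or file format; build_suffix_array and find_all are
hypothetical names, and the construction is the naive one. Storing one
4-byte offset per byte of text would be consistent with an index about
4 times the size of the data.

```python
def build_suffix_array(text: bytes) -> list:
    """Return all start offsets of text, sorted by the suffix starting there."""
    # Naive construction for illustration only; real tools use faster methods.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text: bytes, sa: list, pattern: bytes) -> list:
    """Return the sorted offsets at which pattern occurs in text."""
    m = len(pattern)

    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


# Example: no word boundaries are needed, any byte sequence can be searched.
text = b"banana"
sa = build_suffix_array(text)
hits = find_all(text, sa, b"ana")
```

No word segmentation is involved, so this works the same for any
encoding, at the cost of the large index.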
--
NOKUBI Takatsugu
E-mail: knok@daionet.gr.jp
knok@debian.org