Re: glimpse sucks.

To: csmall@eye-net.com.au
Cc: debian-www@lists.debian.org
Subject: Re: glimpse sucks.
From: knok@daionet.gr.jp (NOKUBI Takatsugu)
Date: Mon, 25 Sep 2000 16:53:35 JST
Message-id: <200009250753.QAA12550@ns1.eal.or.jp>
In-reply-to: Your message of "Fri, 22 Sep 2000 15:55:16 +1100". <20000922155516.A808@eye-net.com.au>

In article <20000922155516.A808@eye-net.com.au>
csmall@eye-net.com.au writes:

>> On Fri, Sep 22, 2000 at 09:20:29AM +0900, NOKUBI Takatsugu wrote:
>> > Hmm... I don't know about Chinese. I think, it is hard to determine
>> > word boundary in Chinese. So some word segmentation tools need for
>> > processing Chinese (like kakasi, chasen in Japanese). I looked in the
>> > output of "apt-get search chinese", but it seems there are no such
>> > tool...
>> The way udmsearch works is it has a big array of characters.  Those
>> characters make up a word, anything not in that array makes up
>> whitespace.  Easy stuff for single byte character sets.
>> 

>> I totally don't understand dual byte character sets at all, but I'm
>> guessing you could do a similar thing. Have an array of dual bytes which
>> make up characters.  I really don't know what to do here.
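
The table-driven splitting described in the quote can be sketched
roughly as follows. This is only an illustration: the table contents
and the name split_words are assumptions, not udmsearch's actual code.

```python
import string

# Hypothetical word-character table; udmsearch's real table is not
# shown in this thread.
WORD_CHARS = frozenset(string.ascii_letters + string.digits)

def split_words(text, word_chars=WORD_CHARS):
    """Split text into words: runs of characters found in word_chars.

    Every character outside the table is treated as a separator.
    """
    words = []
    current = []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Works for space-separated text.
english = split_words("glimpse sucks, use udmsearch!")

# Japanese has no spaces, so even with every character in the table
# the whole sentence comes back as a single "word".
japanese = split_words("東京都に住む", frozenset("東京都に住む"))
```

Extending the table to multi-byte characters does not help by itself,
because the sentence still contains no separators to split on.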

When we think about a search engine, we should treat it as language
processing. It involves grammar, vocabulary, and so on, not only
character sets.

It is difficult to treat a sentence as a simple byte stream. However,
there is such an approach.

Sufary is software that takes this approach. However, its index file
is four times the size of the original data. In addition, Sufary can
handle only one file.
Sufary is packaged as "sufary" in Debian.
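
As far as I know, Sufary is built on a suffix array. A minimal sketch
of that idea is below. The function names are made up for
illustration, and this is not Sufary's code or file format. If one
4-byte offset is stored per byte of text (an assumption), the index
comes to four times the data size.

```python
def build_suffix_array(data):
    """Return the start offsets of all suffixes of data, sorted
    lexicographically by suffix. Simple but slow; fine for a sketch."""
    return sorted(range(len(data)), key=lambda i: data[i:])

def find_all(data, sa, pattern):
    """Return the sorted offsets where pattern occurs in data,
    using binary search over the suffix array sa."""
    m = len(pattern)
    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

data = b"abracadabra"
sa = build_suffix_array(data)
hits = find_all(data, sa, b"abra")

# One offset per byte of input; at 4 bytes per offset (assumed),
# the index is 4 * len(data) bytes.
index_bytes = 4 * len(sa)
```

No word boundaries are needed here, since any byte sequence can be
looked up directly.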
-- 
NOKUBI Takatsugu
E-mail: knok@daionet.gr.jp
	knok@debian.org

References:
- Re: glimpse sucks.
  - From: csmall@eye-net.com.au (Craig Small)