Re: Status of new search engine

To: debian-www@lists.debian.org
Subject: Re: Status of new search engine
From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Date: Tue, 17 Dec 2002 20:46:36 +0900 (JST)
Message-id: <20021217.204636.98158618.debian@tmail.plala.or.jp>
In-reply-to: <20021217111651.GA20475@enc.com.au>
References: <20021217083054.GB15269@enc.com.au> <20021217.182244.42411994.debian@tmail.plala.or.jp> <20021217111651.GA20475@enc.com.au>

Hi,

From: csmall@enc.com.au (Craig Small)
Subject: Re: Status of new search engine
Date: Tue, 17 Dec 2002 22:16:51 +1100

> > brokenly.  I found the search page http://search.debian.org/new/search.cgi
> > have the following line:
> >    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> I've fixed that now.  I'm not sure how to permanently do this as it
> comes from the generic webwml templates.

OK, I have confirmed it.  Could you then please modify the program that
assembles the result page so that it converts all results to UTF-8?

If we are lucky, we can use the output of the search engine as is,
because you said the core part of the search engine works in UTF-8,
which means that the web pages are *already* converted to UTF-8 in
order to be searched.

Otherwise, if you are using Perl, you can use the libtext-iconv-perl
package.  The default encoding for each language (i.e., the "from"
encoding for the conversion) is given in the webwml/<language>/.wmlrc
file.  I think you can hard-code the language/encoding pairs as
constants, because new languages are rarely added to the Debian web
pages.
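
To illustrate the hard-coded table idea, here is a minimal sketch.  It
is written in Python rather than Perl, and the table name, the function
name and the language/encoding pairs are all assumptions for
illustration; the real pairs would have to be taken from the .wmlrc
files.

```python
# Hypothetical table: language code -> "from" encoding.
# These pairs are illustrative assumptions, NOT values read from
# webwml/<language>/.wmlrc.
SOURCE_ENCODINGS = {
    "en": "iso-8859-1",
    "ja": "euc_jp",
    "ru": "koi8_r",
}


def to_utf8(raw, language):
    """Re-encode one result snippet from its language's encoding to UTF-8."""
    # Fall back to ISO-8859-1 for languages missing from the table.
    source = SOURCE_ENCODINGS.get(language, "iso-8859-1")
    # errors="replace" keeps the page renderable if a snippet is mislabeled.
    return raw.decode(source, errors="replace").encode("utf-8")
```

The same shape should carry over to Perl with one converter per "from"
encoding.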

> Now, if I pick it up from the search page, I get
> http://search.debian.org/new/search.en.cgi?q=%E4%B9%85%E4%BF%9D%E7%94%B0+%E6%99%BA%E5%BA%83
> and results look sensible.

This works fine.

Ah, now I can enter my name in the web form and the search works.
I think that, since the web page is now UTF-8, Internet Explorer submits
the query in UTF-8.
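
One way to check that a submitted query really is UTF-8 is to
percent-decode it strictly.  This is only a sketch in Python; the
variable names are mine, and the query string is the one from the URL
quoted above.

```python
from urllib.parse import unquote_plus

# Query string copied from the search URL quoted above.
query = "%E4%B9%85%E4%BF%9D%E7%94%B0+%E6%99%BA%E5%BA%83"

# errors="strict" raises UnicodeDecodeError if the bytes are not valid
# UTF-8, e.g. if the browser had submitted the form in another encoding.
decoded = unquote_plus(query, encoding="utf-8", errors="strict")
print(decoded)  # 久保田 智広
```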

> I then searched セキュリティ情報 which is something to do with security
> and got
> http://search.debian.org/new/search.en.cgi?q=%E3%82%BB%E3%82%AD%E3%83%A5%E3%83%AA%E3%83%86%E3%82%A3%E6%83%85%E5%A0%B1&ps=10&o=0&m=and&lang=
> with no results

I imagine this is a limitation of the search engine.  Since Japanese
sentences do not separate words with whitespace, the search engine
cannot extract words from them.

I confirmed this by searching for %E4%B9%85%E4%BF%9D , which is the
first two of the three characters %E4%B9%85%E4%BF%9D%E7%94%B0 from my
name.  It returned zero results.
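
The effect can be reproduced with a toy index.  The sketch below is in
Python and assumes the engine splits text on whitespace only, which is
my guess and not something I have checked in its source.  Character
bigrams are shown as one possible alternative; I am not claiming that
Google or any other engine works this way.

```python
def whitespace_tokens(text):
    """Index terms when only whitespace separates words."""
    return set(text.split())


def char_bigrams(text):
    """Index terms as overlapping two-character windows per chunk."""
    grams = set()
    for chunk in text.split():
        if len(chunk) < 2:
            grams.add(chunk)  # keep single-character chunks searchable
        grams.update(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return grams


document = "久保田 智広"
query = "久保"

print(query in whitespace_tokens(document))  # False: only "久保田" is indexed
print(query in char_bigrams(document))       # True: "久保" is a bigram
```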

> and
> http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%E3%82%BB%E3%82%AD%E3%83%A5%E3%83%AA%E3%83%86%E3%82%A3%E6%83%85%E5%A0%B1&btnG=Google+Search
> with lots of results.

Google seems to have a better sentence analyzer.

I have heard that "namazu" can be used for this purpose, i.e., building
a full-text search engine for Japanese.  It is free software and is
available as a Debian package.  Namazu is very popular, not only in the
Japanese free software community but also in commercial use.

For example, the Debian JP Project uses Namazu for full-text search of
the mailing list archive:  http://www.debian.or.jp/search/

However, please don't ask me about Namazu because I have never used it.

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/
