
Re: Status of new search engine



Hi,

From: csmall@enc.com.au (Craig Small)
Subject: Re: Status of new search engine
Date: Tue, 17 Dec 2002 22:16:51 +1100

> > brokenly.  I found the search page http://search.debian.org/new/search.cgi
> > have the following line:
> >    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> I've fixed that now.  I'm not sure how to permanently do this as it
> comes from the generic webwml templates.

OK, I confirmed it.  Then, could you please modify the program that
assembles the result page so that it converts all results into UTF-8?

If we are lucky, we can use the output of the search engine directly,
because you said the core part of the search engine works in UTF-8,
which means that the web pages are *already* converted into UTF-8 in
order to be searched.

Otherwise, if you are using Perl, you can use the libtext-iconv-perl
package.  The default encoding for each language (i.e., the "from"
encoding for conversion) is given in the webwml/<language>/.wmlrc
file.  I think you can keep the language/encoding pairs as hard-coded
constants, because a new language is rarely added to the Debian web
pages.
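
The hard-coded table idea can be sketched roughly as follows.  This is
only an illustration in Python, not Perl, and not the actual search.cgi
code.  The names SOURCE_ENCODINGS, DEFAULT_ENCODING and to_utf8 are
hypothetical, and the encodings listed are assumptions; the real values
would have to be copied from each webwml/<language>/.wmlrc file.

```python
# Hypothetical language -> source-encoding table.  The values below are
# illustrative assumptions; the real ones would be taken from
# webwml/<language>/.wmlrc.
SOURCE_ENCODINGS = {
    "en": "iso-8859-1",
    "ja": "euc-jp",
    "ru": "koi8-r",
}

# Assumed fallback for languages missing from the table.
DEFAULT_ENCODING = "iso-8859-1"


def to_utf8(raw: bytes, lang: str) -> bytes:
    """Re-encode one result snippet from its per-language encoding to UTF-8."""
    encoding = SOURCE_ENCODINGS.get(lang, DEFAULT_ENCODING)
    # errors="replace" keeps the result page renderable even if a snippet
    # contains bytes that are invalid in the assumed source encoding.
    text = raw.decode(encoding, errors="replace")
    return text.encode("utf-8")
```

Each result would be passed through to_utf8() with the language of the
page it came from before being written into the result page.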


> Now, if I pick it up from the search page, I get
> http://search.debian.org/new/search.en.cgi?q=%E4%B9%85%E4%BF%9D%E7%94%B0+%E6%99%BA%E5%BA%83
> and results look sensible.

This works fine.

Ah, now I can enter my name in the web form and the search works.
I think that, since the webpage is now UTF-8, Internet Explorer submits
the query in UTF-8.
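
One way to check that the submitted query really is UTF-8 is to
percent-decode the q= value from the URL quoted above.  A small sketch
in Python (decode_query is a hypothetical helper name, not part of the
search engine):

```python
from urllib.parse import unquote_plus


def decode_query(q: str) -> str:
    """Percent-decode a query-string value, treating the bytes as UTF-8."""
    # errors="strict" raises UnicodeDecodeError if the browser sent the
    # query in some other encoding that is not valid UTF-8.
    return unquote_plus(q, encoding="utf-8", errors="strict")


query = "%E4%B9%85%E4%BF%9D%E7%94%B0+%E6%99%BA%E5%BA%83"
decoded = decode_query(query)
print(decoded)
# Show the code points, which is useful when the terminal cannot
# display Japanese.
print([hex(ord(ch)) for ch in decoded])
```

The query decodes without error into five kanji separated by one
space, so the bytes are valid UTF-8.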


> I then searched セキュリティ情報 which is something to do with security
> and got
> http://search.debian.org/new/search.en.cgi?q=%E3%82%BB%E3%82%AD%E3%83%A5%E3%83%AA%E3%83%86%E3%82%A3%E6%83%85%E5%A0%B1&ps=10&o=0&m=and&lang=
> with no results

I imagine this is a fault of the search engine.  Since Japanese
sentences do not separate words with whitespace, the search engine
cannot extract words from them.

I confirmed this by searching for %E4%B9%85%E4%BF%9D , which is the
first two characters of my three-character name
%E4%B9%85%E4%BF%9D%E7%94%B0 .  It returned zero results.
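
The effect can be reproduced with a toy example.  The sketch below
assumes the engine indexes whitespace-separated tokens and matches them
exactly; that is my guess about its behaviour, not something I have
checked in its source.  Character bigrams are shown only as one common
workaround for text without word separators, not as what Google or
Namazu actually do.  All function names are hypothetical.

```python
def whitespace_tokens(text: str) -> set:
    """Index terms as a whitespace-based tokenizer would see them."""
    return set(text.split())


def char_bigrams(text: str) -> set:
    """Index every pair of adjacent characters within each run of text."""
    grams = set()
    for run in text.split():
        grams.update(run[i:i + 2] for i in range(len(run) - 1))
    return grams


# Family name (3 characters) and given name (2 characters).
document = "\u4e45\u4fdd\u7530 \u667a\u5e83"
# The first two characters of the family name.
query = "\u4e45\u4fdd"

found_by_whitespace = query in whitespace_tokens(document)
found_by_bigrams = query in char_bigrams(document)
print(found_by_whitespace, found_by_bigrams)
```

With whitespace tokens the three-character name is a single index term,
so the two-character query does not match it; with bigrams it does.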


> and
> http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%E3%82%BB%E3%82%AD%E3%83%A5%E3%83%AA%E3%83%86%E3%82%A3%E6%83%85%E5%A0%B1&btnG=Google+Search
> with lots of results.

Google seems to have a better sentence analyzer.

I heard that "namazu" can be used for this purpose, i.e., building a
full-text search engine for Japanese.  It is free software and is
available as a Debian package.  Namazu is very popular not only in the
Japanese free software community but also among commercial users.

For example, the Debian JP Project uses Namazu for full-text search of
its mailing list archive:  http://www.debian.or.jp/search/

However, please don't ask me about Namazu because I have never used it.

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/



