
Re: search.debian.org is online



Hi,

From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Subject: Re: search.debian.org is online
Date: Sun, 12 Jan 2003 16:51:31 +0900 (JST)

> > 1. handling of two-byte characters
> > 2. extraction of words from sentences without whitespaces
> 
> I think I found the cause of problem 1.  Though mnogosearch
> supports multibyte languages, it doesn't support them by default.
> To support them, recompilation is needed.
> 
> 
> mnogosearch-3.2.7$ ./configure --help
>                      .....
>   --with-extra-charsets=CHARSET[,CHARSET,...]
>                           Use additional non-default charsets:
>                           none, all or a list from this set:
>                           big5 gb2312 gbk japanese euc-kr gujarati tscii
>                      .....

I'd like the mnoGoSearch on search.debian.org to be recompiled
with extra charsets enabled, because I expect it to benefit Korean
immediately.  (Note that Korean doesn't have problem 2.)
Since this doesn't need the newer version of mnoGoSearch with ChaSen
support (CVS version 3.2.8, needed to solve problem 2), it can be done now!
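
To illustrate why problem 2 affects Japanese but not Korean, here is a
small Python sketch.  The sample sentences are my own, and the plain
whitespace split is only a stand-in for word extraction; it is not how
mnoGoSearch works internally.

```python
# Illustration only: naive whitespace tokenization of two sample sentences.
# The sentences are made-up examples, not taken from any Debian page.

korean = "검색 엔진을 사용한다"    # "use a search engine", written with spaces
japanese = "検索エンジンを使う"    # same meaning, written without spaces

korean_tokens = korean.split()
japanese_tokens = japanese.split()

# Korean yields one token per space-separated word.
print(len(korean_tokens), korean_tokens)

# Japanese yields a single token covering the whole sentence, which is
# why a morphological analyzer such as ChaSen is needed to find words.
print(len(japanese_tokens), japanese_tokens)
```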

I think --with-extra-charsets=all or
--with-extra-charsets=big5,gb2312,japanese,euc-kr is a good idea
because it enables a sane "search results" page.


> Note that "japanese" means Shift_JIS, which is not the encoding for
> Debian Japanese web pages.  Debian Japanese web pages are written
> using ISO-2022-JP, which does not seem to be supported by mnogosearch.

While browsing the source code of mnoGoSearch, I found that
version 3.2.6 seems to support the ISO-2022-JP encoding, which is
used for Debian Japanese pages, though this is not documented.
(Of course, the "japanese" extra charset must be enabled at
./configure time.)
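
As a side note on why the encoding matters, the following Python sketch
shows that the same Japanese text becomes quite different byte sequences
in Shift_JIS, EUC-JP and ISO-2022-JP.  It uses Python's standard codecs
and a sample string of my own choosing; it says nothing about how
mnoGoSearch itself decodes pages.

```python
# Illustration only: one sample string in three common Japanese encodings.
text = "日本語"  # sample string chosen for this example

encoded = {
    name: text.encode(name)
    for name in ("shift_jis", "euc_jp", "iso2022_jp")
}

for name, data in encoded.items():
    # Show the byte length and the raw bytes in hex for comparison.
    print(f"{name:11} {len(data):2d} bytes  {data.hex(' ')}")

# ISO-2022-JP is a 7-bit encoding: it switches character sets with
# escape sequences instead of using bytes with the high bit set.
is_seven_bit = all(b < 0x80 for b in encoded["iso2022_jp"])
print("ISO-2022-JP is 7-bit clean:", is_seven_bit)
```

An indexer that assumes the wrong one of these encodings will see
unrelated bytes, so the charset support has to match what the pages
actually use.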

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/



