
enable searching East Asian words at search.debian.org



Hi,

So far search.debian.org doesn't support East Asian languages
(Chinese, Japanese, and Korean).  That is, it cannot search for
Chinese, Japanese, or Korean words.

I recently looked into this problem and I think I have found
how to fix it.  I tested the fix on my personal machine (which has
no permanent Internet connection) and it works almost fine.

 1. Install the libchasen-dev, libchasen0, and ipadic packages.
 2. Recompile mnogosearch (version 3.2.8 or later), passing the
    --enable-chasen and --with-extra-charsets=all options to ./configure.
 3. Run "indexer -C" and then "indexer" to rebuild the search database.
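
The three steps above can be written down as a command list.  The
following is only a rough sketch in Python that prints the commands
without running them.  The package names, the configure options, and
the two indexer invocations come from the steps above; the use of
"apt-get install" and the "make" / "make install" steps are my
assumptions about how the install and recompile would be done.

```python
# Package names as given in step 1.
PACKAGES = ["libchasen-dev", "libchasen0", "ipadic"]


def build_commands():
    """Return the commands for steps 1-3 as argument lists (nothing is executed)."""
    return [
        # Step 1 (assumption: packages are installed with apt-get).
        ["apt-get", "install"] + PACKAGES,
        # Step 2: configure options as given in the steps above.
        ["./configure", "--enable-chasen", "--with-extra-charsets=all"],
        # Assumption: a conventional make / make install follows configure.
        ["make"],
        ["make", "install"],
        # Step 3: the two indexer invocations, in this order.
        ["indexer", "-C"],
        ["indexer"],
    ]


commands = build_commands()
for command in commands:
    print(" ".join(command))
```

Printing rather than executing keeps the sketch safe to try on any
machine; the order of the last two commands is the important part.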

Could someone do this?  Alternatively, could I be given write access
to the database (PostgreSQL) on klecker so that I can demonstrate it?


Explanation:

The chasen packages are needed to extract words from Japanese text,
because Japanese text doesn't use whitespace between words.  The
--enable-chasen option (available since version 3.2.8) makes
mnogosearch use chasen for this.

Though mnogosearch is Unicode-based software and potentially supports
East Asian languages, support of these languages is disabled by default.
To enable this, --with-extra-charsets=all is needed.

Since the current search database at search.debian.org doesn't contain
any East Asian words, the whole database needs to be rebuilt.
(Of course it would be enough to rebuild the database only for the
*.{ja,ko,zh-cn,zh-hk,zh-tw}.html pages, but I don't know whether it is
possible to do this.)
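
For reference, the set of pages meant by that pattern can be picked
out as in the Python sketch below.  The language suffixes are the
ones listed above; the sample paths are made up for illustration, and
the sketch says nothing about whether mnogosearch's indexer can be
restricted in this way.

```python
import re

# Language suffixes from the pattern above: ja, ko, zh-cn, zh-hk, zh-tw.
EAST_ASIAN_PAGE = re.compile(r"\.(ja|ko|zh-cn|zh-hk|zh-tw)\.html$")


def is_east_asian_page(path):
    """Return True if the path ends in one of the East Asian language suffixes."""
    return EAST_ASIAN_PAGE.search(path) is not None


# Hypothetical example paths.
sample_paths = [
    "index.en.html",
    "index.ja.html",
    "intro/about.zh-tw.html",
    "News/index.ko.html",
    "index.de.html",
]

selected = [path for path in sample_paths if is_east_asian_page(path)]
print(selected)
```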

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/



