
Re: search.debian.org is online



Hi,

From: csmall@enc.com.au (Craig Small)
Subject: Re: search.debian.org is online
Date: Mon, 30 Dec 2002 11:07:32 +1100

> > Note that, if this problem is fixed, Korean people will benefit very
> > much even if the word-separation problem is not fixed.
> I don't understand.  Are you saying that Korean uses two-byte characters
> but doesn't have spaces in words and should be ok now?

The current version of the Debian search site has two problems with East
Asian languages:

1. handling of two-byte characters
2. extraction of words from sentences written without whitespace

Problem 1 affects Chinese, Japanese, and Korean.  Problem 2, however,
affects only Chinese and Japanese, because modern Korean puts whitespace
between words.
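
To illustrate why only Chinese and Japanese are affected, here is a
minimal Python sketch.  It assumes a naive indexer that splits text on
whitespace; this is not how mnogosearch actually works, and the two
sample phrases are made up for this example.

```python
# Illustration only: naive whitespace tokenization, NOT mnogosearch's indexer.
# Both sample phrases are hypothetical and mean roughly "Debian search site".
korean = "데비안 검색 사이트"       # Korean: words separated by spaces
japanese = "デビアンの検索サイト"    # Japanese: written without spaces


def split_words(text):
    """Split text on whitespace, as a simple indexer for Western text might."""
    return text.split()


korean_words = split_words(korean)
japanese_words = split_words(japanese)

print(len(korean_words), korean_words)
print(len(japanese_words), japanese_words)
```

The Korean phrase yields one token per word, while the whole Japanese
phrase comes back as a single token, which is why a word segmenter is
needed for Japanese but not for Korean.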

You said that problem 2 will be solved in a future version of
mnogosearch (version 3.2.8) by using "chasen".  That is good news, though
we have to wait for that release.
http://lists.debian.org/debian-www/2002/debian-www-200212/msg00268.html
I think you are aware of this problem.

However, I am not sure that you are aware of problem 1.  I think
problem 1 exists *in addition to* problem 2.  My reasoning is given in
the following mail.
http://lists.debian.org/debian-www/2002/debian-www-200212/msg00267.html

Since Korean is a two-byte language and uses whitespace between words,
solving problem 1 will immediately benefit Korean users.



The following is a detailed description of problem 1, as reported at the
above URL.  If you already understand the problem, you don't need to
read it.



The word "news", translated into each language, clearly appears in
http://www.debian.org/index.<language>.html .  Since it appears as a
section title, it stands alone (i.e., it isn't affected by the "word
separation without whitespace" problem) and should be searchable.
However, the search fails for Chinese, Japanese, and Korean.

This means that the search fails even when a Japanese (or Chinese or
Korean) word is set apart by whitespace.  Thus, there is another problem
distinct from problem 2.

However, two-byte search doesn't always fail.  For example, I reported
in http://lists.debian.org/debian-www/2002/debian-www-200212/msg00256.html
that I can search for my name.  My guess is that whether a search
succeeds or fails depends on whether the Japanese word is written in
plain EUC-JP encoding or as HTML "&#xxxx;" references, where xxxx is the
Unicode code point.  When the word is written as "&#xxxx;" references,
the search succeeds; when it is written in plain EUC-JP encoding, the
search fails.
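
The two representations can be compared with a short Python sketch.
The sample word is only an assumed example (a Japanese word for "news"),
not necessarily the exact text on the Debian page, and the code says
nothing about what mnogosearch does internally.  It only shows that the
same word can reach an indexer either as raw two-byte EUC-JP data or as
plain ASCII character references.

```python
import html

# Hypothetical sample word: "news" in Japanese katakana.
word = "ニュース"

# Form 1: raw EUC-JP bytes, two bytes per character for this word.
euc_bytes = word.encode("euc_jp")

# Form 2: HTML numeric character references built from Unicode code points.
ncr = "".join("&#{};".format(ord(ch)) for ch in word)

print(euc_bytes)
print(ncr)

# Both forms carry the same text once decoded correctly.
decoded_from_euc = euc_bytes.decode("euc_jp")
decoded_from_ncr = html.unescape(ncr)
print(decoded_from_euc == decoded_from_ncr)
```

If the indexer resolves the character references but does not decode
the EUC-JP bytes with the right charset, only the second form would be
indexed as the intended word, which would match the behavior I observed.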

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/



