Bug#905126: www.debian.org: Website search box unhelpful for common names (e.g. Buster) in certain character sets
On Wed, Aug 01, 2018 at 01:41:24PM +0200, Laura Arjona Reina wrote:
> On Tue, 31 Jul 2018 21:40:18 +0800 Jonathan Wiltshire <jmw@debian.org>
> wrote:
> > A number of search languages end up with no results for contextually
> > common search terms, for example "debian" or "buster".
> >
> > To reproduce:
> > - use the search box for the term "buster" in English. There are a
> > number of results including release information, news items and
> > errata.
> > - set the language to Vietnamese, Chinese or similar and search again
> > - there are no results.
>
> I can reproduce that. However, searching in Vietnamese for "Debian" or
> "Buster" shows results.
>
> E.g. the search for "Buster" in Vietnamese produces this link as first
> match:
>
> https://www.debian.org/releases/index.vi.html
> 100% relevant, matching: buster
>
> Interestingly, it says "matching: buster" (lowercase, although I
> searched for Buster).
>
> If I search for "buster" (with quotes), I also get the results.
The matching isn't case-sensitive (but capitalising a word in the query
suppresses stemming, as does putting it in quotes).
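
As a rough illustration of that rule, here is a toy model in Python. It is
not Omega's or Xapian's actual code: toy_stem and plan_terms are made-up
names, and the "stemmer" only strips a trailing "s".

```python
import re


def toy_stem(word):
    """Stand-in for a real stemmer (assumption): strip one trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def plan_terms(query):
    """Return (term, was_stemmed) pairs for a query string.

    Matching is case-insensitive, so every term is lowercased, but a word
    that is capitalised or inside double quotes is left unstemmed.
    """
    terms = []
    # Each match is either a double-quoted phrase or a bare word.
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        if bare == "":
            # Quoted phrase: keep every word as typed, apart from case.
            terms += [(w.lower(), False) for w in quoted.split()]
        elif bare[0].isupper():
            # Capitalised word: lowercase it, but do not stem it.
            terms.append((bare.lower(), False))
        else:
            terms.append((toy_stem(bare.lower()), True))
    return terms
```

With this sketch, "releases" is reduced to a stemmed term, while both
"Releases" and the quoted form keep the full word.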
> The relevant code in the Debian website for this bug is in the file
> webwml/english/search.xml.in, which I think just sends the search term
> to the search engine (which is at search.debian.org):
>
> <: my $ext = lc('$(CUR_ISO_LANG)'); $ext =~ s/-/_/;
> print
> 'template="https://search.debian.org/cgi-bin/omega?P={searchTerms}&HITSPERPAGE={count?}&DB='.$ext.'[CN:-cn:][TW:-tw:][HK:-hk:]"/>';
> :>
It looks to me like the problem is there's no explicit stemmer mapping
for zh-cn so it uses the English stemmer, but that stemmer wasn't used
at index time. Those mappings are in:
/srv/search.debian.org/xapian/templates/inc/stemmer
(At least on wolkenstein. The host key for search.d.o doesn't seem
to match for me, so I haven't looked there yet; I probably need to update
my Debian hosts list. The setup is that search.debian.org is
cgi-grnet-01.debian.org, but the indexing actually happens on
wolkenstein.debian.org and the databases are replicated to the front-end
machine.)
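
To make the mismatch concrete, here is a toy model in Python. It is not
Omega's code: build_index, search and the suffix-stripping stemmer are made
up for illustration. It relies on the way Xapian stores stemmed forms under
a "Z" prefix alongside the unstemmed terms, so a query-time stemmer asks for
a "Z" term that an index built without a stemmer never contains.

```python
def toy_stem(word):
    """Stand-in for a real stemmer (assumption): strip one trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_index(docs, stemmer=None):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # The unstemmed form is always indexed.
            index.setdefault(word, set()).add(doc_id)
            # The stemmed form is only added when a stemmer is configured.
            if stemmer is not None:
                index.setdefault("Z" + stemmer(word), set()).add(doc_id)
    return index


def search(index, word, stemmer=None):
    """Look up one query word and return the matching ids, sorted."""
    if stemmer is None or word[0].isupper():
        term = word.lower()  # unstemmed lookup
    else:
        term = "Z" + stemmer(word.lower())  # stemmed lookup
    return sorted(index.get(term, ()))


docs = {1: "buster release information"}
unstemmed_index = build_index(docs)            # no stemmer at index time
print(search(unstemmed_index, "buster", toy_stem))  # stemmed query
print(search(unstemmed_index, "Buster", toy_stem))  # capitalised query
```

In this model the lowercase query finds nothing while the capitalised one
finds document 1, which is the same pattern as in the report.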
I'm not sure how the stemmer mapping file is generated, but I'll look
into it today if I can. I think we should be able to just specify a
default of "none", but I suspect this file is generated, so I need to
fix the script, not just the current output.
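
A minimal sketch of the "default of none" idea, in Python. The table below
is hypothetical: the real mapping file's format and entries are not shown in
this thread, so STEMMER_BY_DB and stemmer_for are illustrative names only.

```python
# Hypothetical mapping from database name to stemmer name.
STEMMER_BY_DB = {
    "en": "english",
    "fr": "french",
    "de": "german",
}


def stemmer_for(db, default="none"):
    """Return the stemmer for a database, or the default if it is unmapped."""
    return STEMMER_BY_DB.get(db, default)


print(stemmer_for("en"))
print(stemmer_for("zh-cn"))
```

The point of the design is that an unmapped database gets no stemming at
query time instead of silently inheriting the English stemmer.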
> I couldn't find a canonical repository or pseudopackage related to
> search.debian.org. From what I've found, it is "a slightly patched
> xapian-omega instance". I've logged in to the machine and the code there
> has two remote repositories. I'm CC'ing Raphael Geissert (shown as
> contact for comments in the search result pages) and Olly Betts (shown
> as the author of the last commits in the repo that is currently deployed
> in search.debian.org). I hope they can help or tell us how to proceed.
Thanks for looping me in.
I think that "slightly patched" is out of date and we've been using the
standard xapian-omega package for some time now.
Cheers,
Olly