
Bug#905126: www.debian.org: Website search box unhelpful for common names (e.g. Buster) in certain character sets



On Wed, Aug 01, 2018 at 01:41:24PM +0200, Laura Arjona Reina wrote:
> On Tue, 31 Jul 2018 21:40:18 +0800 Jonathan Wiltshire <jmw@debian.org>
> wrote:
> > A number of search languages end up with no results for contextually
> > common search terms, for example "debian" or "buster".
> > 
> > To reproduce:
> >  - use the search box for the term "buster" in English. There are a
> >    number of results including release information, news items and
> >    errata.
> >  - set the language to Vietnamese, Chinese or similar and search again
> >  - there are no results.
> 
> I can reproduce that. However, searching in Vietnamese for "Debian" or
> "Buster" shows results.
> 
> E.g. the search for "Buster" in Vietnamese produces this link as first
> match:
> 
> https://www.debian.org/releases/index.vi.html
> 100% relevant, matching: buster
> 
> Interestingly, it says "matching: buster" (lowercase, although I searched
> for Buster).
>
> If I search for "buster" (with quotes), I also get the results.

The matching isn't case-sensitive (but capitalising a word in the query
suppresses stemming, as does putting it in quotes).
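
As a rough illustration of those rules, here is a toy model in Python.  It
is not Omega's real code: toy_stem, index_terms and has_hits are
hypothetical names, only single quoted words are handled, and the "Z"
prefix for stemmed terms is assumed to mirror Xapian's convention.

```python
def query_terms(query, stem=None):
    """Toy model of the query rules described above (not Omega's real code)."""
    terms = []
    for word in query.split():
        quoted = len(word) >= 2 and word[0] == '"' and word[-1] == '"'
        bare = word.strip('"')
        if not bare:
            continue
        capitalised = bare[0].isupper()
        term = bare.lower()  # matching itself is case-insensitive
        if stem is not None and not quoted and not capitalised:
            # Stemmed terms get a "Z" prefix (assumed Xapian-style convention).
            term = "Z" + stem(term)
        terms.append(term)
    return terms


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# A database indexed without a stemmer holds only unstemmed terms.
index_terms = {"buster", "debian", "releases"}


def has_hits(query, stem=None):
    # True if every term produced for the query exists in the toy index.
    return all(t in index_terms for t in query_terms(query, stem))
```

In this sketch a plain lowercase query run through a stemmer produces
only a stemmed term, which an index built without stemming does not
contain, while the capitalised or quoted form falls through to the
unstemmed term.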

> The relevant code in the Debian website about this bug is in the file
> webwml/english/search.xml.in, which I think just sends the search term
> to the search engine (which is in search.debian.org):
> 
> <:	my $ext = lc('$(CUR_ISO_LANG)');  $ext =~ s/-/_/;
> 	print
> 'template="https://search.debian.org/cgi-bin/omega?P={searchTerms}&amp;HITSPERPAGE={count?}&amp;DB='.$ext.'[CN:-cn:][TW:-tw:][HK:-hk:]"/>';
> :>

It looks to me like the problem is that there's no explicit stemmer mapping
for zh-cn, so the query falls back to the English stemmer, but that stemmer
wasn't used at index time.  Those mappings are in:

/srv/search.debian.org/xapian/templates/inc/stemmer

(At least on wolkenstein - the host key for search.d.o doesn't seem
to match for me so I didn't look there yet - probably I need to update
my debian hosts list.  The setup is that search.debian.org is
cgi-grnet-01.debian.org, but the indexing actually happens on
wolkenstein.debian.org and the databases are replicated to the front-end
machine).

I'm not sure how the stemmer mapping file is generated, but I'll look
into it today if I can.  I think we should be able to just specify a
default of "none", but I suspect this file is generated, so I need to
fix the script, not just the current output.
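
To make the idea of a default concrete, a minimal sketch in Python.  The
real mapping file's format and contents aren't shown in this thread, so
the STEMMERS table and both function names are hypothetical; only the
lower-casing and the "-" to "_" substitution follow the Perl snippet
quoted above.

```python
# Hypothetical mapping from database name to stemmer name; the entries
# are illustrative and not taken from the real file.
STEMMERS = {"en": "english", "de": "german", "fr": "french"}


def db_name(iso_lang):
    # Mirrors the quoted Perl: lower-case, then replace the first "-" with "_".
    return iso_lang.lower().replace("-", "_", 1)


def stemmer_for(iso_lang, default="none"):
    # Databases without an explicit entry get the default rather than
    # silently picking up the English stemmer.
    return STEMMERS.get(db_name(iso_lang), default)
```

With a default of "none", stemmer_for("zh-CN") gives "none"; passing
default="english" instead reproduces the fallback suspected above.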

> I couldn't find a canonical repository or pseudopackage related to
> search.debian.org. From what I've found, it is "a slightly patched
> xapian-omega instance". I've logged in to the machine and the code there
> has two remote repositories. I'm CC'ing Raphael Geissert (shown as
> contact for comments in the search result pages) and Olly Betts (shown
> as the author of the last commits in the repo that is currently deployed
> in search.debian.org). I hope they can help or tell us how to proceed.

Thanks for looping me in.

I think that "slightly patched" is out of date and we've been using the
standard xapian-omega package for some time now.

Cheers,
    Olly

