Bug#1015198: better search results



On Wed, Aug 10, 2022 at 09:17:02AM +0200, Thomas Lange wrote:
> in #1015198 I also reported useless search results
> similar to #658227 (still open since 2012).

Sorting by date is unlikely to help #658227 - it would prefer the newest
documents which mention the DFSG, and the social contract page was
created a long time ago, and presumably changes very rarely.

Interestingly it's not just our search which struggles with the DFSG
case.  Testing with a couple of popular web search engines in a private
window (to try to minimise any bias from previous searches, etc, though
there are likely geographical and maybe other variations still):

On DuckDuckGo (which I think is currently Bing underneath), "Diabetic
Foot Study Group" is top.  Second is the Wikipedia page for "our" DFSG.
First d.o hit is https://wiki.debian.org/DFSGLicenses at #5, second d.o
hit is the social contract page which isn't until #10.

Google ranks the Wikipedia page top; the first d.o hit is
https://people.debian.org/~bap/dfsg-faq.html at #3 and the social
contract page is #4.

I suspect it doesn't help that the canonical "right answer" here is
actually a page which is primarily about the "Debian Social Contract"
(that's the page title and top heading, and what's talked about most in
the initial part of the page, which search engines likely give extra
weight to).

> I found this in the xapian docs. Do you think this would be the best
> solution to get the results sorted by date? I'm not sure if it would
> be easy to index all our html documents by date, since the time stamps
> on the files do not reflect the date of the last modification of the
> content. Do you know of any other solutions?

That's the most efficient way to sort by date, but requires extra work
at index time.  You can also store the date in a "value slot" and sort
by that, which is easier at index time, but slower at search time.

The last modified date is actually already available to sort by behind
the scenes - here's what it looks like for your `debconf` example:

https://search.debian.org/?q=debconf&HITSPERPAGE=100&DB=en&SORT=-0

Note that paging through results doesn't preserve this sort setting
because the template wasn't written expecting this, so I've set it to
show 100 results, which are dominated by WNPP reports, translation
reports, etc.
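
For anyone wanting to try other queries with the same settings, here is a
minimal sketch that rebuilds the URL above.  The parameter names (q,
HITSPERPAGE, DB, SORT) and the value "-0" are copied from that URL; the
function name `build_search_url` is just made up for illustration, and I'm
not documenting what other SORT values mean here.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.debian.org/"


def build_search_url(query, hits_per_page=100, db="en", sort="-0"):
    """Build a search URL using the parameters seen in the example URL.

    The defaults mirror the debconf example; the helper itself is
    hypothetical and not part of the search service.
    """
    params = {
        "q": query,
        "HITSPERPAGE": hits_per_page,
        "DB": db,
        "SORT": sort,
    }
    # urlencode takes care of escaping spaces and other special characters.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("debconf"))
```

Since paging drops the sort setting, asking for more hits per page is the
simple way to see further down the sorted list.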

Aside from such auto-generated documents (which could probably be
excluded or penalised in the ranking), last modified isn't entirely
helpful anyway - we don't want a page about debconf 10 to beat the one
about debconf 22 just because someone recently fixed a typo or updated a
link on the old page.

Sorting by creation date probably doesn't really help either.  The
Social Contract page was created long ago, but the most recent Debconf
page was created fairly recently, so neither ascending nor descending
order works across the two cases you've highlighted.

I suspect boosting based on some link analysis within d.o would help
a lot - both the social contract page and the latest debconf page
will tend to have more incoming links compared to other pages matching
the same terms, and the various autogenerated pages are unlikely to be
linked to a lot from elsewhere.  I did some initial work on that at
Debconf 16, but ran out of time and sadly haven't managed to get back to
it since.
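
To make the link analysis idea a bit more concrete, here is a rough sketch
of one way such a boost could work.  This is not the code from Debconf 16;
the page paths, the text scores and the log-based boost formula are all
assumptions picked purely for illustration.

```python
import math
from collections import Counter


def count_incoming_links(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` maps a page to the pages it links to.  Self-links and repeated
    links from the same page are ignored.
    """
    incoming = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                incoming[target] += 1
    return incoming


def rerank(text_scores, incoming):
    """Order pages by text score multiplied by a damped link boost.

    The boost 1 + log(1 + inlinks) is an illustrative choice: it rewards
    well-linked pages without letting link counts swamp the text score.
    """
    boosted = {
        page: score * (1.0 + math.log1p(incoming[page]))
        for page, score in text_scores.items()
    }
    return sorted(boosted, key=boosted.get, reverse=True)


# Hypothetical link graph and text scores (not real d.o data).
links = {
    "/index": ["/social_contract", "/devel/"],
    "/intro/about": ["/social_contract"],
    "/devel/": ["/social_contract", "/devel/wnpp/report-1"],
    "/devel/wnpp/report-1": [],
}
text_scores = {"/devel/wnpp/report-1": 1.2, "/social_contract": 1.0}

incoming = count_incoming_links(links)
print(rerank(text_scores, incoming))
```

In this made-up data the autogenerated report has the better text score,
but the social contract page has more incoming links and ends up ranked
first once the boost is applied.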

Cheers,
    Olly

