
[DAK web API] Too many duplicate entries in list all source packages



Hi,

With dakweb.queries.source's[1] "source_by_metadata" API, I'm getting a lot of duplicate entries. For example, for the "Go-Import-Path" metadata key[2], 2184 out of 4212 entries are duplicates**, i.e. ~51% of the total.

[1]: https://ftp-team.pages.debian.net/dak/epydoc/dakweb.queries.source-module.html
[2]: https://api.ftp-master.debian.org/source/by_metadata/Go-Import-Path

This is happening because we store the results in a list[3] and keep appending to it. I don't think these duplicate entries, or even their count, are useful to anyone. Please correct me if I'm wrong.

[3]: https://ftp-team.pages.debian.net/dak/epydoc/dakweb.queries.source-pysrc.html#source_by_metadata (See line 239.)

A good solution would be to use a set instead of a list. Using a set shouldn't make any real difference in terms of performance, as both list.append() and set.add() are (on average) constant-time operations in Python. And on the bright side, we wouldn't return duplicate entries. If everyone is okay with it, I can prepare a patch.
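To illustrate the idea, here's a minimal sketch using made-up rows (not the actual DAK query code): collecting (source, metadata_value) pairs into a set drops duplicates automatically, and the result can still be serialized as a list of dicts for the JSON response.

```python
# Hypothetical rows, standing in for the DAK query results; the real
# code iterates over database rows, not a literal list like this.
rows = [
    ("golang-github-pkg-errors", "github.com/pkg/errors"),
    ("golang-github-pkg-errors", "github.com/pkg/errors"),  # duplicate
    ("golang-x-text", "golang.org/x/text"),
]

# Current approach: appending to a list keeps the duplicate.
as_list = [{"source": s, "metadata_value": v} for s, v in rows]

# Proposed approach: (source, metadata_value) tuples are hashable,
# so a set silently drops the repeated pair.
unique = {(s, v) for s, v in rows}
as_set = [{"source": s, "metadata_value": v} for s, v in sorted(unique)]

print(len(as_list))  # 3
print(len(as_set))   # 2
```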


** To count the number of duplicate entries, I wrote a small script:

$ cat count.py
import json
import urllib.request

url = "https://api.ftp-master.debian.org/source/by_metadata/Go-Import-Path"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
    pkg_list = json.load(response)

count = 0
pkgs = set()

for pkg in pkg_list:
    # Treat the (source, metadata_value) pair as the identity of an entry.
    t = (pkg["source"], pkg["metadata_value"])
    if t in pkgs:
        count += 1
    else:
        pkgs.add(t)

print(count)
$ python3 count.py
2184
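Equivalently, the duplicate count is the total number of entries minus the number of distinct (source, metadata_value) pairs. A collections.Counter version, shown here on a small made-up sample rather than the live API data:

```python
from collections import Counter

# Made-up sample in the same shape as the API's JSON response.
pkg_list = [
    {"source": "a", "metadata_value": "x"},
    {"source": "a", "metadata_value": "x"},  # duplicate
    {"source": "b", "metadata_value": "y"},
]

counts = Counter((p["source"], p["metadata_value"]) for p in pkg_list)
duplicates = sum(c - 1 for c in counts.values())
print(duplicates)  # 1
```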


Cheers,
Vipul

