[DAK web API] Too many duplicate entries in list all source packages
Hi,
With the "source_by_metadata" API in dakweb.queries.source[1], I'm
getting too many duplicate entries. For example, for the
"Go-Import-Path" metadata key[2], 2184 of the 4212 entries** are
duplicates, which is ~52% of all entries.
[1]:
https://ftp-team.pages.debian.net/dak/epydoc/dakweb.queries.source-module.html
[2]: https://api.ftp-master.debian.org/source/by_metadata/Go-Import-Path
This happens because the results are stored in a list[3] and every row
is appended to it. I don't think these duplicate entries, or even their
count, are useful to anyone. Please correct me if I'm wrong.
[3]:
https://ftp-team.pages.debian.net/dak/epydoc/dakweb.queries.source-pysrc.html#source_by_metadata
(See line 239.)
A good solution would be to use a set instead of a list. Using a set
shouldn't make any real difference in performance, as both
list.append() and set.add() are constant-time operations on average in
Python. On the bright side, we would no longer return duplicate
entries. If everyone is okay with it, I can create a patch.
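A minimal sketch of the idea, not the actual dak code: it assumes the
query yields (source, value) pairs, and the function name "dedupe_rows"
and the sample package names are hypothetical. Since dicts are not
hashable, the set holds tuples and the dicts are built only for pairs
that have not been seen before.

```python
def dedupe_rows(rows):
    """Return one dict per distinct (source, value) pair, in first-seen order."""
    seen = set()   # (source, value) tuples already emitted
    result = []    # de-duplicated entries, ready for JSON serialization
    for source, value in rows:
        key = (source, value)
        if key not in seen:
            seen.add(key)
            result.append({"source": source, "metadata_value": value})
    return result


# Hypothetical rows standing in for the database query result.
rows = [
    ("golang-foo", "example.org/foo"),
    ("golang-foo", "example.org/foo"),
    ("golang-bar", "example.org/bar"),
]
unique = dedupe_rows(rows)
print(len(rows) - len(unique))  # number of duplicates dropped
```

Keeping a list next to the set preserves the order of the results; if
order does not matter, the set alone would be enough.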
** To count the number of duplicate entries, I wrote a script:
$ cat count.py
import json
import urllib.request

url = "https://api.ftp-master.debian.org/source/by_metadata/Go-Import-Path"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
    pkg_list = json.load(response)

count = 0
pkgs = set()
for pkg in pkg_list:
    t = (pkg["source"], pkg["metadata_value"])
    if t in pkgs:
        count = count + 1
    else:
        pkgs.add(t)
print(count)
$ python3 count.py
2184
Cheers,
Vipul