
Validating debian/upstream/metadata for debian-med projects



Hello,

A recent thread on debian-science@ [1] motivated me to look deeper into
enforcing quality standards for the debian/upstream/metadata files (a.k.a.
DEP 12) we ship with Debian packages. I learnt that lintian already runs
a YAML syntax check on debian/upstream/metadata files, but as far as I
know no further validation is performed. I have therefore developed a
formal validation tool [2] to check the contents of these YAML files,
mostly the syntax of URLs and of some fields that are defined to
correspond to BibTeX as per [3].

Yesterday I downloaded the debian/upstream/metadata files from all
>1300 projects under https://salsa.debian.org/debian-med/ and ran them
through my validator. The resulting validation messages can be grouped
into the following categories:

1. Highly probable typos: reference year '200' (bagpipe), '20015'
(rambo-k), URLs with spaces (bio-tradis) and so on. This category is the
one I was actually aiming at.

2. URLs with trailing newlines (adapterremoval, aevol, amos, just to
name a few). This is most likely due to the YAML behaviour of appending
a newline to the end of multiline strings, which can be avoided quite
easily [4]. On the other hand, trailing newlines in URLs could simply be
ignored, as they are clearly not intentional.

3. Numeric months in references (augustus, cluster3, haploview, just to
name a few). According to [3], "[Reference] keys that correspond to
standard BibTeX entries must provide the same content", and the 1988 BibTeX
manual from CTAN [5] says "[month:] You should use the standard
three-letter abbreviation". Of course "should" is not "must" (in terms
of RFC 2119), but machine-reading would be easier with a consistent
definition.

4. E-mail addresses in Bug-Submit (htslib, last-align, nanook, just to
name a few). Per [3], values of Bug-Submit are URLs. Maybe [3] could be
amended to cover e-mails too?

5. Unclear scalar/list status of some fields. Only Screenshots is
defined as "One or more URLs", while in reality lists also appear for
Webservice (clustalw, primer3) and Bug-Submit (mira, although that entry
seems broken). Maybe these too could be defined as "One or more URLs"?

6. Empty templates (agat, intake, libpll-2, just to name a few). I would
suggest removing the templates, as they do not carry anything meaningful.

7. DOIs written as URLs (fast, libnewuoa). This is debatable, and [5]
does not talk about DOIs at all.
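
To make categories 1, 3, 4 and 7 more concrete, here is a minimal Python
sketch of such checks. It is not the code of the validator in [2]; the
function names and the exact rules (for example, requiring a scheme and
a host in every URL, or accepting month abbreviations in any letter
case) are assumptions made only for this illustration.

```python
import re
from urllib.parse import urlsplit

# Standard three-letter BibTeX month abbreviations.
BIBTEX_MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec"}


def check_year(value):
    """Return a message if the value is not a four-digit year, else None."""
    if not re.fullmatch(r"\d{4}", str(value)):
        return f"year '{value}' is not a four-digit number"
    return None


def check_url(value):
    """Return a message if the value does not look like a clean URL."""
    text = str(value)
    if re.search(r"\s", text):
        return "URL contains whitespace (space or newline)"
    try:
        parts = urlsplit(text)
    except ValueError:
        return "URL cannot be parsed"
    # Assumption: a usable URL has both a scheme and a host part.
    # A bare e-mail address in Bug-Submit fails this test.
    if not parts.scheme or not parts.netloc:
        return "value has no scheme or host, so it is not a URL"
    return None


def check_month(value):
    """Return a message unless the month is a three-letter abbreviation."""
    text = str(value).strip()
    if text.isdigit():
        return f"month '{text}' is numeric"
    if text.lower() not in BIBTEX_MONTHS:
        return f"month '{text}' is not a three-letter abbreviation"
    return None


def check_doi(value):
    """Flag DOIs given as resolver URLs (debatable, see category 7)."""
    if re.match(r"https?://(dx\.)?doi\.org/", str(value)):
        return "DOI is written as a URL"
    return None
```

With these definitions, check_year('200') and check_year('20015') both
return a message, while check_year('2015') returns None.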

I would be interested in implementing formal validation of
debian/upstream/metadata in lintian to catch typos and similar errors.
However, there are a few ambiguities in the specification, which would
be good to discuss and resolve.
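
Until such ambiguities are settled, a validator could be lenient about
two of them: trailing newlines in URLs (category 2) and the scalar/list
status of URL fields (category 5). The following Python sketch shows one
possible way to do that. The helper names are hypothetical and the
policy (accept both forms, but report what was tolerated) is only an
assumption, not something defined in [3].

```python
def as_list(value):
    """Treat a missing value as [] and a scalar as a one-element list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def normalize_urls(value):
    """Return (urls, notes): cleaned URLs and notes on what was tolerated."""
    urls, notes = [], []
    if isinstance(value, (list, tuple)):
        notes.append("list given where the specification may expect a scalar")
    for item in as_list(value):
        text = str(item)
        # Only trailing line breaks are dropped; other whitespace is kept
        # so that a separate URL check can still report it.
        cleaned = text.rstrip("\r\n")
        if cleaned != text:
            notes.append("trailing newline removed")
        urls.append(cleaned)
    return urls, notes
```

On the YAML side, a value written with the '|' or '>' block indicator
keeps its final newline, while '|-' or '>-' strips it, so the newline
can also be avoided at the source.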

Please do not take any part of this message as criticism of anyone.
Package names are mentioned here only for the purpose of illustration.

[1] https://lists.debian.org/debian-science/2021/01/msg00050.html
[2] https://github.com/merkys/Debian-DEP12, no stable release yet
[3] https://wiki.debian.org/UpstreamMetadata
[4] https://yaml-multiline.info/
[5] https://mirror.datacenter.by/pub/mirrors/CTAN/biblio/bibtex/base/btxdoc.pdf

Best wishes,
Andrius

