Re: "Entry: NA" in debian/upstream/metadata
On 2021-03-03 19:58, Steffen Möller wrote:
> On 2021-03-03 17:39, Matus Kalas wrote:
>> Hey all again, and thanks for your thoughts Andrius and Andreas!
>> On 2021-03-03 09:36, Andreas Tille wrote:
>>> Hi Andrius,
>>> On 2021-03-03 08:54, Andrius Merkys wrote:
>>>> Dear Matus,
>>>> On 2021-03-02 19:56, Matus Kalas wrote:
>>>>> I'd suggest hearing from the folks who have done most of the work
>>>>> with manually including those IDs, and letting them approve/decide.
>> Steffen et al., your opinions on this matter?
> Sorry for being late on this.
> So, "NA" indeed means "hey, I checked but this was not found". This
> information should not be lost.
> An empty entry, as if from a template, does not have the same meaning.
> Whether it is NA (which is how R expects it and which I found likely to
> be easier to parse) or N/A - I would not bother to make all these
> changes and would just leave it. Indeed, in the Excel sheet I am using N/A.
> As it happens, we had a quick exchange of thoughts on Zoom today and I
> tend to think that the general idea is that these NAs have to disappear,
> i.e. the missing entries get added to bio.tools.
Thank you for confirming the distinction between an empty value and "NA".
>>>>> I can imagine that for purely practical reasons in the process of the
>>>>> manual curation, it might make sense to allow explicitly:
>>>>> - Name: OMICtools
>>>>>   Entry: N/A   (Meaning: I have checked and there was no record)
>>>>> - Name: bio.tools
>>>>>   Entry: ""    (Meaning: I or someone else should check this
>>>>>                or perhaps: I checked but wasn't conclusive yet)
>>>>> The latter might be useful for contributors who aren't used to all
>>>>> IDs, to make them more visible (including where the gaps are). But on
>>>>> the other hand, if those are well present in an upstream/metadata
>>>>> template and very clear in the documentation of upstream/metadata,
>>>>> it is not necessary and I'd then tend to like your suggestion Andrius.
>>>> To me, three flavors of "unknown" look like overkill. Most of the
>>>> metadata in Debian does not even have the two flavors of "unknown":
>>>> missing Bug-Submit field in d/u/metadata, Homepage in d/control and
>>>> Upstream-Contact in d/copyright means that this piece of information is
>>>> either nonexistent or simply not entered (for example, due to the lack
>>>> of time). Thus I am not sure whether the added value is worth the
>>>> infrastructure/effort here. But again, this is solely my opinion,
>>>> certainly not aimed at reflecting those of the people who enter and use
>>>> the data in d/u/metadata.
> Hm. I see the following:
> * empty - nobody cared, yet
> * "N/A" or "NA" or "<N/A>" or "<NA>" - checked but not found (I would
> prefer the latter two but do not really care; they may be too difficult
> in YAML since < is a special character)
> * "<rejected>" - bio.tools decided against referencing that package. We
> are likely to see a few of these in the near future.
Just a suggestion: maybe a "Status" field could be of use here? If more
special values of "Entry" are about to be introduced, it is better to
use a separate field to make this more machine-readable.
Suggested values for "Status":
* "confirmed" (default) - an entry in the registry is confirmed, and its
ID is stored in the "Entry" field;
* "not-found" - the registry was checked for a match, but none was
found at that point in time (here a timestamp field could be of value);
* "rejected" - the registry explicitly rejected an attempt to register
the package;
* "pending" - the package has been submitted to the registry, no response yet.
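
To illustrate, here is a minimal Python sketch of how a consumer could
derive such a status from the "Entry" values in use today. The function
name entry_status and the set of accepted spellings are my assumptions
(taken from the variants listed above), not anything defined in DEP 12.
"pending" cannot be derived from existing data, so it is left out.

```python
from typing import Optional

# Spellings of "checked but not found" mentioned in this thread (assumed set).
NOT_FOUND_VALUES = {"NA", "N/A", "<NA>", "<N/A>"}


def entry_status(entry: Optional[str]) -> Optional[str]:
    """Map a raw "Entry" value to one of the proposed "Status" values.

    Returns None for an empty or missing entry, i.e. nobody has checked yet.
    """
    if entry is None or entry.strip() == "":
        return None
    value = entry.strip()
    if value in NOT_FOUND_VALUES:
        return "not-found"
    if value == "<rejected>":
        return "rejected"
    # Anything else is taken to be a real registry ID.
    return "confirmed"
```

With an explicit "Status" field this guessing would not be needed, which
is the point of the suggestion.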
>>> <all easy for Andreas>
>>>> If the three-flavor option is preferred, I would also suggest adding
>>>> date fields for each entry to signal at which point in time the
>>>> registry was inspected.
>>> As I wrote above, a later addition of some software to some registry
>>> can spoil the different meanings of unknown. This could be cured by
>>> such a date field, but I don't think its value outweighs the time
>>> drained from the people maintaining that extra field. Thus I do not
>>> think we should do this.
>> We definitely don't need a date, git blame does that. Also in the form
>> of the Blame button in Salsa. Without a possibility for inconsistency.
> This may be material for another paper: Means to synchronize between
> volunteer databases.
> * Provenance is accepted
> * data transfer status - this is not yet happening in routine but this
> is what we are doing here.
> @Andrius - If I do not need to be involved and if no information is
> lost, then I promise to be very happy with whatever you come up with,
> whatever this may be. The chance to have a reference named "NA", though,
> especially with all caps, that is darn close to zero and I wish you
> would invest/sink your valuable time into something else.
I do not want to interfere with the current practice nor cause loss of
valuable data. Since the special value "NA" is not mentioned in DEP 12,
I assumed it had the same meaning as an empty value - thanks for
confirming I was wrong. I am fine with leaving it that way - as you say,
the chance of having an entry of that name in the registry may be small.
However, I know nothing about the naming conventions of the registries,
and my experience with structured data makes me uneasy about special
values. In any case such values should be described, and I volunteer to
update DEP 12 to reflect the current usage.
>> There is one closely related issue, which we just briefly touched upon
>> with Steffen and Hervé in a telcon: What to do with those "NA"
>> packages that are missing in e.g. bio.tools?
>> The registration in bio.tools (and surely also SciCrunch) could be
>> automated, but there are at least a couple of things needing human
>> curation:
>> - Which src packages represent one tool (often e.g. libs | language
>> bindings form separate Debian pkgs). How to mark this and where? Is
>> there an existing Debian mechanism? Or do we need to abuse the
>> d/u/metadata "Entry" for that, before they're added? (3rd or 4th
>> flavour of info then 😀 ; btw. git branches could help here 😉 ; and
>> not in google spreadsheet perhaps 😜 as it has to be machine-readable)
>> - Choosing an available, reasonable biotoolsID and tool name.
>> Ideally the tool name and biotoolsID are identical, with the ID in
>> all lowercase and spaces removed/replaced.
>> - Any other things needing human curation?
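
For the naming point, the "lowercase, spaces removed/replaced" rule
could be sketched roughly as below. The function name
suggest_biotools_id and its replacement parameter are hypothetical, and
this is not bio.tools' actual validation; a human would still have to
check that the resulting ID is available and reasonable.

```python
import re


def suggest_biotools_id(tool_name: str, replacement: str = "") -> str:
    """Suggest an ID candidate from a tool name (illustrative only)."""
    # Lower-case the name and turn each run of whitespace into `replacement`.
    return re.sub(r"\s+", replacement, tool_name.strip().lower())
```

For example, suggest_biotools_id("My Tool") gives "mytool", and with
replacement="_" it gives "my_tool".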
>> Thank you all, I'm very happy seeing this progressing!
>> P.S.: Could you please leave all the contents in when replying to the
>> thread, so that others can reply to previously mentioned points
>> without having to read every single email in the thread and possibly
>> breaking linearity of it? I agree that it's not ecological to
>> broadcast the same text all around the globe again and again, but
>> there are other solutions than emails that handle that without
>> compromising. Many thanks!