Re: RRID -> SciCrunch

2017-10-25 18:19 GMT+03:00 Matus Kalas <Matus.Kalas@uib.no>:

On 2017-10-25 15:12, Michael Crusoe wrote:

2017-10-25 16:04 GMT+03:00 Steffen Möller <steffen_moeller@gmx.de>:

On 25.10.17 13:47, Michael Crusoe wrote:

2017-10-25 14:34 GMT+03:00 Steffen Möller <steffen_moeller@gmx.de>:

On 25.10.17 10:56, Michael Crusoe wrote:

Sorry, I missed the bit where we are deprecating RRID. Can someone
explain?

Because of

https://arxiv.org/ftp/arxiv/papers/1707/1707.03659.pdf [1]

and some web googling from which I gathered that the "Research Resource
IDentifiers" are not only provided by SciCrunch. Admittedly, I fail to
find that page now that I want to find it :o/

There is no conflict here. scicrunch.org [2] is the post-pilot phase of
what is described in that paper.

Personally, I could not care less; let those catalog providers fight
that out among themselves. However, I find that the notion of SciCrunch
clearly identifies the provenance of that information, while "RRID" to
me is more of a concept coined by FORCE11 (https://www.force11.org),
not a provider. And with several initiatives following the same
purpose, I found that by using SciCrunch, not RRID, we would be the most
provider-neutral. And then again, it is only something local to the
Debian packaging, not publicly visible, so nobody should truly care, and
the use of SciCrunch imho serves us best on a technical level.

RRIDs share a single namespace that allows for multiple providers,
SciCrunch being the current main provider for software tools and
databases, with other registries responsible for the other types. If we
refer to RRIDs generically, there is no conflict.

See https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000558 [3] for an
overview.

Please rename this field to RRID, or better yet just have a list of
URIs like we do in CWL so you don't have to care if it is a RRID, DOI
or whatever :-)

http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage [4]

This is what we are doing. The field is called "Registry" (not RRID, so
we can also refer to Wikis and other catalogs) and allows for an
arbitrary unordered number of (Name, Entry) tuples, in complete analogy
to the CWL, I tend to think.
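A minimal sketch of what such a "Registry" list of (Name, Entry) tuples could look like when written out as YAML for debian/upstream/metadata. The exact layout (a list of mappings with `Name` and `Entry` keys) is an assumption based on the description above, and `render_registry` is a hypothetical helper, not an existing Debian tool.

```python
def render_registry(entries):
    """Render (name, entry) pairs as a YAML-style 'Registry' list.

    Assumed layout: one list item per pair, with 'Name' and 'Entry' keys.
    """
    lines = ["Registry:"]
    for name, entry in entries:
        lines.append(f" - Name: {name}")
        lines.append(f"   Entry: {entry}")
    return "\n".join(lines)


# The SciCrunch entry for khmer is the identifier mentioned in this thread.
print(render_registry([("SciCrunch", "SCR_001156")]))
```

The pair form keeps the provider and the identifier in separate fields, which is exactly the point debated in the replies that follow.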

Well, no. In CWL we don't separate the provider from the identifier.
That's the whole point about COOL URIs.
I've CC'd Stian as he explains this better than I do.

The problem is that, of the IDs in use, only Bio.Tools provides COOL URIs. Other providers should, but they don't, at least not yet. Thus a provider + ID pair is, unfortunately, necessary.

RRIDs have a COOL URI form: https://identifiers.org/rrid/RRID:SCR_001156
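As an illustration only, the URI form above can be produced mechanically from a bare SciCrunch entry. The helper names below are hypothetical; the only thing taken from this thread is the `https://identifiers.org/rrid/RRID:...` pattern.

```python
IDENTIFIERS_ORG_RRID = "https://identifiers.org/rrid/"


def rrid_to_uri(rrid):
    """Accept 'SCR_001156' or 'RRID:SCR_001156' and return the URI form."""
    local = rrid if rrid.startswith("RRID:") else "RRID:" + rrid
    return IDENTIFIERS_ORG_RRID + local


def uri_to_rrid(uri):
    """Inverse of rrid_to_uri; raises ValueError for any other URI."""
    if not uri.startswith(IDENTIFIERS_ORG_RRID):
        raise ValueError(f"not an identifiers.org RRID URI: {uri}")
    return uri[len(IDENTIFIERS_ORG_RRID):]


print(rrid_to_uri("SCR_001156"))
```

With such a mapping, a (SciCrunch, SCR_001156) pair and the single URI carry the same information, so either storage form can be converted to the other.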

As a side-track: You mentioned DOIs, Michael.
Would it make sense if Debian (Med) adds DOIs as citable links to upstream releases, in addition to the upstream version and upstream repo information?
And in any case, DOIs for both the upstream project as a whole (i.e. all releases), and/or for the particular releases, can be added as citations for a src package, if package maintainer or upstream wish so.

Of course, the more links the merrier. In fact, one could write a simple script to autofill most of the upstream/metadata fields from any and all available identifiers and DOIs.

A quick recap for those following along:

Software identifiers are for the concept of a particular piece of software. They are persistent regardless of 1) the version of the software, 2) the release of a paper for major new functionality, or 3) switching to a new repository.

khmer will always be identifiable by https://identifiers.org/rrid/RRID:SCR_001156 regardless of new releases, new papers, or new hosting platforms.

DOIs are currently used to identify point-in-time digital objects like papers, or a certain source code release.

It is true that some services that issue DOIs for software releases, like Zenodo and FigShare, do have a "primary" DOI that each release derives from. But that becomes insufficient as one might switch between those services or to another provider.

Back to your suggestion: the next step is to determine the best place to put these per-source DOIs within Debian.

Do we add them to

1) debian/upstream/metadata as part of a version:DOI dictionary?

2) debian/changelog for each release or just for the ${upstreamversion}-1 release?

3) the binary package control file in the binary packages? https://www.debian.org/doc/debian-policy/#s-binarycontrolfiles

4) the debian source control file (.dsc) ? https://www.debian.org/doc/debian-policy/#debian-source-control-files-dsc

and/or

5) someplace else?
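To make option 1 concrete, here is a rough sketch of how a version:DOI dictionary could be queried for an installed package. Everything in it is an assumption for illustration: the dictionary contents are placeholders (not real DOIs), and `upstream_version` / `doi_for` are hypothetical helpers. The only rule relied upon is the Debian version layout `[epoch:]upstream_version[-debian_revision]`.

```python
def upstream_version(debian_version):
    """Strip the epoch and the Debian revision from a Debian version string."""
    version = debian_version
    if ":" in version:
        # The epoch is everything before the first colon.
        version = version.split(":", 1)[1]
    if "-" in version:
        # The Debian revision is everything after the last hyphen.
        version = version.rsplit("-", 1)[0]
    return version


def doi_for(debian_version, dois_by_upstream_version):
    """Return the DOI recorded for this package version, or None."""
    return dois_by_upstream_version.get(upstream_version(debian_version))


# Placeholder values only; a real dictionary would hold DOIs minted upstream.
EXAMPLE_DOIS = {
    "2.0": "10.0000/placeholder-2.0",
    "2.1": "10.0000/placeholder-2.1",
}

print(doi_for("2.1-3", EXAMPLE_DOIS))
```

One thing this sketch glosses over: repacked versions (for example with a `+dfsg` suffix) would not match an upstream key without extra normalisation.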

We can help ourselves answer this question by determining how and for what purposes we might want to access these DOIs

1) From a running system, as part of a citation/provenance query?

2) From Umegaya / the Ultimate Debian Database (UDD) ?

3) Someplace else and/or some other purpose?

Related to your comment (and very, very close to my heart) is the
question of whether we do everything sufficiently well to map the CWL
workflows to Debian packages. We could for instance add references to
CWL-workflow-database-entries for the workflows that a particular
Debian package is used in, so we can test them when the package
updates - er, before the package updates in the distribution.

We are good here; you can determine the packages used in any given CWL
description that includes a SoftwareRequirement that is mappable
directly or indirectly to a package.
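A hedged sketch of that lookup: given a CWL description already loaded into a Python dict (for example from its JSON form), collect the packages named in any SoftwareRequirement. It assumes the list form of `requirements`/`hints` and of `packages`; the map shorthand is simply skipped. The example document is made up for illustration and reuses the khmer identifier from earlier in this thread.

```python
def software_packages(cwl_doc):
    """Return (package, specs) pairs from SoftwareRequirement entries."""
    found = []
    for section in ("requirements", "hints"):
        for item in cwl_doc.get(section, []):
            if not isinstance(item, dict):
                continue  # map shorthand is not handled in this sketch
            if item.get("class") != "SoftwareRequirement":
                continue
            for pkg in item.get("packages", []):
                if isinstance(pkg, dict):
                    found.append((pkg.get("package"), pkg.get("specs", [])))
    return found


# Illustrative tool description, not taken from a real repository.
example_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "hints": [
        {
            "class": "SoftwareRequirement",
            "packages": [
                {
                    "package": "khmer",
                    "specs": ["https://identifiers.org/rrid/RRID:SCR_001156"],
                }
            ],
        }
    ],
}

print(software_packages(example_tool))
```

Mapping the returned `specs` URIs to Debian packages would then be a matter of matching them against the identifiers recorded in debian/upstream/metadata.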

For automated testing you would need a way to specify "normal" or
expected results; CWL v1.0.x doesn't have that concept. A
researchobject.org [16] RO that contains/references those results with
the corresponding CWL workflow would however fulfill this role.

And another side-track: In addition to CWL workflows and using them as test (requiring some input-output pairs and equality relation), would it make sense for Debian to link to some kind of "CWL wrappers" for the single tools?

Instead of linking we can include them in the package, like we do unix manual pages.

See the section of the spec about where to find CWL tool descriptions:

http://www.commonwl.org/v1.0/CommandLineTool.html#Discovering_CWL_documents_on_a_local_filesystem

Perhaps this should be added to the Debian-Med policy as a bonus item for packages? samtools already ships some descriptions:

$ apt-file search /usr/share/commonwl
samtools: /usr/share/commonwl/samtools-faidx.cwl
samtools: /usr/share/commonwl/samtools-index.cwl
samtools: /usr/share/commonwl/samtools-rmdup.cwl
samtools: /usr/share/commonwl/samtools-sort.cwl
samtools: /usr/share/commonwl/samtools-view.cwl
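For illustration, a small sketch of how a tool might look for such installed descriptions at run time. `/usr/share/commonwl` comes from the listing above; the other two search locations (`/usr/local/share/commonwl` and `$XDG_DATA_HOME/commonwl`) are stated here as an assumption and should be checked against the spec section linked earlier. The function names are hypothetical.

```python
import os
from pathlib import Path


def commonwl_dirs(environ=None):
    """Candidate directories for installed CWL documents (assumed list)."""
    env = os.environ if environ is None else environ
    data_home = env.get("XDG_DATA_HOME") or os.path.join(
        os.path.expanduser("~"), ".local", "share"
    )
    return [
        Path("/usr/share/commonwl"),
        Path("/usr/local/share/commonwl"),
        Path(data_home) / "commonwl",
    ]


def find_cwl_documents(dirs):
    """Return all *.cwl files directly inside the directories that exist."""
    found = []
    for directory in dirs:
        if directory.is_dir():
            found.extend(sorted(directory.glob("*.cwl")))
    return found


for path in find_cwl_documents(commonwl_dirs()):
    print(path)
```

On a system with the samtools package installed, this would be expected to list the same files that `apt-file` reports.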

That is again similar to the elsewhere-discussed proposal of generating (and/or linking to) software containers (Docker, Singularity, rkt?)...

Software containers can be generated fairly automatically and don't really benefit from upstream's participation.

CWL tool descriptions can and should be maintained collectively; preferably they are offered to upstream for inclusion just like other Debian instigated patches and manual pages are sent up.

Back to the topic: I agree with Steffen that if we mean the link pairs as Provider + ID (as opposed to ID_type + ID_value), then SciCrunch makes more sense than RRID.

Cheers,
Matus

Michael R. Crusoe
Co-founder & Lead,
Common Workflow Language project
https://impactstory.org/u/0000-0002-2961-9670
mrc@commonwl.org
+1 480 627 9108