Re: RRID -> SciCrunch

2017-10-25 18:54 GMT+03:00 Steffen Möller <steffen_moeller@gmx.de>:

On 25.10.17 17:19, Matus Kalas wrote:
> On 2017-10-25 15:12, Michael Crusoe wrote:
>> 2017-10-25 16:04 GMT+03:00 Steffen Möller <steffen_moeller@gmx.de>:
>>
>>> On 25.10.17 13:47, Michael Crusoe wrote:
>>>>
>>>>
>>>> 2017-10-25 14:34 GMT+03:00 Steffen Möller <steffen_moeller@gmx.de>:
>>>>
>>>>
>>>> On 25.10.17 10:56, Michael Crusoe wrote:
>>>>> Sorry, I missed the bit where we are deprecating RRID. Can someone
>>>>> explain?
>>>>
>>>> Because of https://arxiv.org/ftp/arxiv/papers/1707/1707.03659.pdf
>>>> and some web googling from which I gathered that the "Research
>>>> Resource IDentifiers" are not only provided by SciCrunch.
>>>> Admittedly, I fail to find that page now that I want to find it :o/
>>>>
>>>>
>>>> There is no conflict here. scicrunch.org <http://scicrunch.org> is
>>>> the post-pilot phase of what is described in that paper.
>>>>
>>>>
>>>>
>>>> Personally, I could not care less, let those catalog providers
>>>> fight that out among themselves. However, I find that the notion of
>>>> SciCrunch clearly identifies the provenance of that information,
>>>> while "RRID" to me is more of a concept coined by
>>>> (https://www.force11.org), not a provider. And with several
>>>> initiatives following the same purpose, I found that by using
>>>> SciCrunch not RRID, we would be the most provider-neutral. And then
>>>> again, it is only something local to the Debian packaging, not
>>>> publicly visible, so nobody should truly care and the use of
>>>> SciCrunch imho serves us best on a technical level.
>>>>
>>>>
>>>> RRIDs share a single namespace that allows for multiple providers,
>>>> SciCrunch being the current main provider for software tools and
>>>> databases, with other registries responsible for the other types.
>>>> Referring to RRIDs generically therefore causes no conflict.
>>>>
>>>> See https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000558
>>>> for an overview.
>>>>
>>>> Please rename this field to RRID, or better yet just have a list of
>>>> URIs like we do in CWL so you don't have to care if it is a RRID, DOI
>>>> or whatever :-)
>>>>
>>>> http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage
>>>>
>>> This is what we are doing. The field is called "Registry" (not RRID,
>>> so we can also refer to wikis and other catalogs) and allows for an
>>> arbitrary, unordered number of (Name, Entry) tuples, in complete
>>> analogy to the CWL, I tend to think.
>>
>> Well, no. In CWL we don't separate the provider from the identifier.
>> That's the whole point about COOL URIs.
>> I've CC'd Stian as he explains this better than I do.
>
> The problem is that, of the IDs in use, only Bio.Tools provides COOL
> URIs. Other providers should, but they don't, at least not yet. Thus a
> provider + ID pair is, unfortunately, necessary.
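
To make the two positions concrete, here is a minimal sketch (Python, purely illustrative) of a provider + ID pair next to the single-URI form. The `RegistryEntry` type, the `to_uri` helper, the URL templates and the example identifier are all assumptions made up for this illustration; none of them come from the Debian metadata tooling or from CWL.

```python
from typing import NamedTuple, Optional

# Assumed URL templates, for illustration only; the thread does not define
# them and they have not been checked against the providers.
URL_TEMPLATES = {
    "bio.tools": "https://bio.tools/{entry}",
    "SciCrunch": "https://scicrunch.org/resolver/{entry}",
}


class RegistryEntry(NamedTuple):
    """One (Name, Entry) pair as discussed for the "Registry" field."""
    name: str   # the provider, e.g. "SciCrunch"
    entry: str  # the identifier within that provider


def to_uri(item: RegistryEntry) -> Optional[str]:
    """Collapse a provider + ID pair into one URI, if a template is known."""
    template = URL_TEMPLATES.get(item.name)
    if template is None:
        return None  # no resolvable URI known for this provider
    return template.format(entry=item.entry)


# "example-tool" is a hypothetical identifier, not a real registry entry.
print(to_uri(RegistryEntry("bio.tools", "example-tool")))
```

A provider without a known template yields `None`, which is exactly the case where the pair cannot be replaced by a plain URI.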
>
> As a side-track: You mentioned DOIs, Michael.
> Would it make sense if Debian (Med) adds DOIs as citable links to
> upstream releases, in addition to the upstream version and upstream
> repo information?
> And in any case, DOIs for both the upstream project as a whole (i.e.
> all releases), and/or for the particular releases, can be added as
> citations for a src package, if package maintainer or upstream wish so.

The task page already presents traditional references and at least the
metadata file also harvests their DOIs. I have read that an entry in
OMICtools would also have a DOI, but I admit not to have seen that yet,
and because I am confused enough already, I have not looked for it, either.

If we now add DOIs that indicate a particular version of the software,
I admit I embrace that for the nice semantics behind it (like a
feature having been added at a particular point in time), but then again
... ouch. We will have some individuals who reference software by its
DOI and others by a version. And how do you express with a DOI that a
particular feature is no longer present? That is trivial with a version
number since it is ordered. And now someone tells me that DOIs are also
ordered - they most likely are, but close to nobody thinks about it that
way, I presume.
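
The ordering argument can be sketched roughly as below. This is a simplification under stated assumptions: it handles plain dotted numeric versions only (not Debian's full version comparison with epochs and tildes), and `parse_version` and `has_feature` are hypothetical helper names.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "1.10.2" into (1, 10, 2) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def has_feature(version: str, added: str, removed: Optional[str] = None) -> bool:
    """True if `version` lies in the half-open range [added, removed)."""
    current = parse_version(version)
    if current < parse_version(added):
        return False  # feature not yet introduced
    return removed is None or current < parse_version(removed)


print(has_feature("1.4.0", added="1.2"))               # feature present
print(has_feature("2.0", added="1.2", removed="2.0"))  # feature dropped again
```

With an opaque DOI per release there is nothing to compare, so the same question would need an extra lookup table that maps each DOI back to a version.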

>
>
>>> Related to your comment (and very, very close to my heart) is the
>>> question if we do everything sufficiently well to map the CWL
>>> workflows to Debian packages. We could for instance add references to
>>> CWL-workflow-database-entries for the workflows that a particular
>>> Debian package is used in, so we can test them when the package
>>> updates - er, before the package updates in the distribution.
>>
>> We are good here; you can determine the packages used in any given CWL
>> description that includes a SoftwareRequirement that is mappable
>> directly or indirectly to a package.
>>
>> For automated testing you would need a way to specify "normal" or
>> expected results; CWL v1.0.x doesn't have that concept. A
>> researchobject.org RO that contains/references those results with
>> the corresponding CWL workflow would however fulfill this role.
>>
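
As a rough sketch of the lookup described above: the function below walks an already-parsed CWL document (a plain Python dict) and collects the packages named in any SoftwareRequirement. It assumes the list form of `requirements`/`hints` and of `packages`; the map forms that CWL also permits are not handled. The tool name and the spec URI in the example are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Package = Tuple[str, List[str], List[str]]  # (package, versions, specs)


def software_packages(cwl_doc: Dict[str, Any]) -> List[Package]:
    """Collect (package, version, specs) from SoftwareRequirement entries."""
    found: List[Package] = []
    for section in ("requirements", "hints"):
        for req in cwl_doc.get(section, []):
            if req.get("class") != "SoftwareRequirement":
                continue
            for pkg in req.get("packages", []):
                found.append(
                    (pkg["package"], pkg.get("version", []), pkg.get("specs", []))
                )
    return found


# Hypothetical tool description; names and URIs are placeholders.
example = {
    "class": "CommandLineTool",
    "hints": [
        {
            "class": "SoftwareRequirement",
            "packages": [
                {
                    "package": "exampletool",
                    "version": ["1.0"],
                    "specs": ["https://example.org/registry/exampletool"],
                }
            ],
        }
    ],
}
print(software_packages(example))
```

Mapping the collected `specs` (or, failing that, the bare package name) to a Debian package is the "directly or indirectly" part and is left out here.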

I am a fan. Yes, please! RO, go, go, go! Let us complete one. While
writing this down I somehow sensed that I got the dependencies wrong.
So, please correct me. I initially saw:

* Debian package that features the CWL, happily an auto-created Debian
package from a CWL-database.

- using only Debian packages to perform the workflow

- we are a bit weak on the Debianisation of public data that we are
likely to need for those tests

* Auto-created test(s) for that workflow added from the RO collection.

* Submitted to Debian as a package

* Control the results from ci.debian.org

I am not so confident that we have a 1:n relationship between CWL
workflows and ROs. It is more like one RO integrating a subset of
workflows, right? This will render it all a bit more complicated, as in
CWL-representing Debian packages depending on each other, but I still
like it.

A research object (RO) is a data container or data manifest. It is the recommended method of communicating the results of a CWL workflow. As a single CWL workflow can produce varying outputs based upon varying inputs, there can naturally be many different ROs describing those different outputs.

How should we name those Debian packages that are auto-created to
represent a CWL workflow? This depends on the database from which we
derive the package, right? Do we agree that the ROs are not appearing as
packages themselves?

I do not think it makes sense to package particular workflows as a Debian package, except as a shared example workflow to be Recommended by the various CWL runners.

> That is again similar to the elsewhere-discussed proposal of
> generating (and/or linking to) software containers (Docker,
> Singularity, rkt?)...

We should reference them, too. I am a bit uncertain if we should
distinguish between a software container that installs a Debian package
and a container that installs a Conda package.

What's the value of referencing an outside software container from a Debian package? They are easy enough to make by hand and are soon going to be autogenerated.

> Back to the topic: I agree with Steffen that if we mean the link pairs
> as Provider + ID (as opposed to ID_type + ID_value), then SciCrunch
> makes more sense than RRID.

Nice to hear, in the sense that we would not have to change them all back
again. Let us wait to learn whether Stian has an opinion about it.

@Matúš, I'd like to open another thread on how to proceed with the EDAM
annotation of packages in Debian to help structuring our task pages. My
immediate thought was to just take the Topic and then store the whole
path (there was a path, right?) of your ontology until it gets to that
topic as a header for all the tools that share that topic.

The annoying bit would then be that the upload into the UDD (that Debian
database from which the task page is derived) would need to know about
the EDAM ontology. But that would not scale with all the other Debian
packages of other disciplines. So we would need to generate the full
path somewhere in an automated fashion as a derived field (since you may
reorganise your ontology) or we live with the specifier of the topic
alone. But then we are weaker on the semantics. Ideas welcome.
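
A minimal sketch of the "derived field" idea, assuming each topic has a single parent (EDAM can have several, which this ignores). The topic identifiers, labels and the `topic_path` helper are placeholders invented for this illustration, not real EDAM terms or existing UDD code.

```python
from typing import Dict, List, Optional, Set

# Placeholder hierarchy: child -> parent. Not real EDAM identifiers.
PARENT: Dict[str, str] = {
    "topic_c": "topic_b",
    "topic_b": "topic_a",
}
LABEL: Dict[str, str] = {
    "topic_a": "Root topic",
    "topic_b": "Middle topic",
    "topic_c": "Leaf topic",
}


def topic_path(topic: str) -> str:
    """Derive the full path from the root down to `topic` as one string."""
    path: List[str] = []
    seen: Set[str] = set()
    current: Optional[str] = topic
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        path.append(LABEL.get(current, current))
        current = PARENT.get(current)
    return " / ".join(reversed(path))


print(topic_path("topic_c"))
```

Only the topic specifier would be stored with the package; the path is recomputed whenever the ontology is reorganised, so the import into UDD itself would not need to know about EDAM.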

Steffen

Michael R. Crusoe
Co-founder & Lead,
Common Workflow Language project
https://impactstory.org/u/0000-0002-2961-9670
mrc@commonwl.org
+1 480 627 9108