
Re: RRID -> SciCrunch



On 25.10.17 17:49, Michael Crusoe wrote:
>
>
> 2017-10-25 18:19 GMT+03:00 Matus Kalas <Matus.Kalas@uib.no
> <mailto:Matus.Kalas@uib.no>>:
>
>     On 2017-10-25 15:12, Michael Crusoe wrote:
>
>         2017-10-25 16:04 GMT+03:00 Steffen Möller
>         <steffen_moeller@gmx.de <mailto:steffen_moeller@gmx.de>>:
>
>             On 25.10.17 13:47, Michael Crusoe wrote:
>
>
>
>                 2017-10-25 14:34 GMT+03:00 Steffen Möller
>                 <steffen_moeller@gmx.de <mailto:steffen_moeller@gmx.de>
>                 <mailto:steffen_moeller@gmx.de
>                 <mailto:steffen_moeller@gmx.de>>>:
>
>
>                 On 25.10.17 10:56, Michael Crusoe wrote:
>
>                     Sorry, I missed the bit where we are deprecating
>                     RRID. Can someone explain?
>
>
>                 Because of
>                 https://arxiv.org/ftp/arxiv/papers/1707/1707.03659.pdf [1]
>                 and some web googling from which I gathered that the
>                 "Research Resource IDentifiers" are not only provided by
>                 SciCrunch. Admittedly, I fail to find that page now that
>                 I want to find it :o/
>
>
>                 There is no conflict here. scicrunch.org
>                 <http://scicrunch.org> [2] is the post-pilot phase of
>                 what is described in that paper.
>
>
>
>                 Personally, I could not care less, let those catalog
>                 providers fight that out among themselves. However, I
>                 find that the notion of SciCrunch clearly identifies the
>                 provenance of that information, while "RRID" to me is
>                 more of a concept coined by (https://www.force11.org),
>                 not a provider. And with several initiatives following
>                 the same purpose, I found that by using SciCrunch not
>                 RRID, we would be the most provider-neutral. And then
>                 again, it is only something local to the Debian
>                 packaging, not publicly visible, so nobody should truly
>                 care and the use of SciCrunch imho serves us best on a
>                 technical level.
>
>
>                 RRIDs share a single name space that allows for multiple
>                 providers, SciCrunch being the current main provider for
>                 software tools and databases and other registries
>                 responsible for the other types. By referring to RRIDs
>                 generically there is no conflict.
>
>                 See
>                 https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000558
>                 [3] for an overview.
>
>                 Please rename this field to RRID, or better yet just
>                 have a list of URIs like we do in CWL so you don't have
>                 to care if it is a RRID, DOI or whatever :-)
>
>                 http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage
>                 [4]
>
>
>             This is what we are doing. The field is called "Registry"
>             (not RRID, so we can also refer to Wikis and other catalogs)
>             and allows for an arbitrary unordered number of
>             (Name, Entry) tuples, in complete analogy to the CWL, I
>             tend to think.
>
>
>         Well, no. In CWL we don't separate the provider from the
>         identifier.
>         That's the whole point about COOL URIs.
>         I've CC'd Stian as he explains this better than I do.
>
>
>     The problem is that from the applied IDs, only Bio.Tools provide
>     COOL URIs. Other providers should, but they don't, at least not
>     yet. Thus a provider + ID pair is, unfortunately, necessary.
>
>
> RRIDs have a COOL URI form: https://identifiers.org/rrid/RRID:SCR_001156


This is the one we are generating on the task page from the information
Name: SciCrunch, Entry: SCR_001156. And as of today, the same would be
generated if the entry was with Name: RRID.
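
A minimal sketch of that mapping in Python. The function name and the set of accepted names are illustrative assumptions, not the actual code behind the task pages:

```python
# Hypothetical helper: both registry names resolve to the same
# identifiers.org URI, as described above.
RRID_NAMES = {"SciCrunch", "RRID"}  # assumed set of accepted Name values


def registry_uri(name, entry):
    """Return the identifiers.org URI for a (Name, Entry) pair, or None."""
    if name in RRID_NAMES:
        return "https://identifiers.org/rrid/RRID:" + entry
    return None  # other registries are not covered by this sketch


print(registry_uri("SciCrunch", "SCR_001156"))
```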

I think it would be a mistake to give up the distinction, not least
because we could then no longer query the UDD for, e.g., entries that
have an RRID but no bio.tools entry.
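
To illustrate the kind of query I mean, here is a rough Python sketch over (package, Name, Entry) rows. The rows are made up apart from the khmer identifier quoted above, and this is not the UDD schema, only the set logic:

```python
# Hypothetical rows; "pkg-a" and its entries are placeholders.
rows = [
    ("khmer", "SciCrunch", "SCR_001156"),
    ("pkg-a", "SciCrunch", "SCR_EXAMPLE"),
    ("pkg-a", "bio.tools", "example-tool"),
]


def listed_in_but_not_in(rows, wanted, excluded):
    """Packages with an entry named `wanted` but none named `excluded`."""
    have_wanted = {pkg for pkg, name, _ in rows if name == wanted}
    have_excluded = {pkg for pkg, name, _ in rows if name == excluded}
    return sorted(have_wanted - have_excluded)


print(listed_in_but_not_in(rows, "SciCrunch", "bio.tools"))
```

This only works as long as the Name field tells the registries apart, which is the distinction I would like to keep.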


>  
>
>
>     As a side-track: You mentioned DOIs, Michael.
>     Would it make sense if Debian (Med) adds DOIs as citable links to
>     upstream releases, in addition to the upstream version and
>     upstream repo information?
>     And in any case, DOIs for both the upstream project as a whole
>     (i.e. all releases), and/or for the particular releases, can be
>     added as citations for a src package, if package maintainer or
>     upstream wish so.
>
>
> Of course, the more links the merrier. In fact, one could write a
> simple script to autofill most of the upstream/metadata fields from
> any and all available identifiers and DOIs.
>
> A quick recap for those following along:
> Software identifiers are for the concept of a particular piece of
> software. They are persistent regardless of 1) the version of the
> software 2) the release of a paper for major new functionality or 3)
> switching to new repository
>
> khmer will always be identifiable
> by https://identifiers.org/rrid/RRID:SCR_001156 regardless of new
> releases, new papers, or new hosting platforms
>
> DOIs are currently used to identify point in time digital objects like
> papers, or a certain source code release.
> It is true that some services that issue DOIs for software releases,
> like Zenodo and FigShare, do have a "primary" DOI that each release
> derives from. But that becomes insufficient as one might switch
> between those services or to another provider.
>
> Back to your suggestion: the next step is to determine the best place
> to put these per-source DOIs within Debian.
>
> Do we add them to
> 1) debian/upstream/metadata as part of a version:DOI dictionary?
Interesting. I do not immediately see how this fits that format, though.
> 2) debian/changelog for each release or just for the
> ${upstreamversion}-1 release?
${upstreamversion}-1,  preferred over 1)
> 3) the binary package control file in the binary
> packages? https://www.debian.org/doc/debian-policy/#s-binarycontrolfiles
may be required for indexing
> 4) the debian source control file (.dsc)
> ? https://www.debian.org/doc/debian-policy/#debian-source-control-files-dsc
yes, again for indexing
> and/or
> 5) someplace else?

Just to have it discussed: debian/copyright

But today the version of the software is not specified in that document
and we do not like changing the copyright file when the copyright has
not changed. So, debian/copyright is inferior to debian/changelog IMHO.

>
> We can help ourselves answer this question by determining how and for
> what purposes we might want to access these DOIs
> 1) From a running system, as part of a citation/provenance query?
Yes, and there is some prior art to this. The package "devscripts"
provides the tool "wnpp-alert" that shows all those packages that are
orphaned or requested to be adopted and installed on that system. We
could come up with the same for any CWL workflow/wrapper - but for that
we would not need this information shipping with the Debian packages
themselves but could work with the catalogs directly.
> 2) From Umegaya / the Ultimate Debian Database (UDD) ?

yes, this is the place to consult for the DOI assignments.

Especially when the workflow is hidden away in some container, the query
should be performed in an OS-independent manner - like via the UDD.

> 3) Someplace else and/or some other purpose?

If those DOIs help with the communication between the various catalogs /
software repositories / papers, then I presume that this is mostly
outside of our immediate control.


>  
>
>
>
>
>             Related to your comment (and very, very close to my heart)
>             is the question if we do everything sufficiently well to
>             map the CWL workflows to Debian packages. We could for
>             instance add references to CWL-workflow-database-entries
>             for the workflows that a particular Debian package is used
>             in, so we can test them when the package updates - er,
>             before the package updates in the distribution.
>
>
>         We are good here; you can determine the packages used in any
>         given CWL
>         description that includes a SoftwareRequirement that is mappable
>         directly or indirectly to a package.
>
>         For automated testing you would need a way to specify "normal" or
>         expected results; CWL v1.0.x doesn't have that concept. A
>         researchobject.org <http://researchobject.org> [16] RO that
>         contains/references those results with
>         the corresponding CWL workflow would however fulfill this role.
>
>
>     And another side-track: In addition to CWL workflows and using
>     them as test (requiring some input-output pairs and equality
>     relation), would it make sense for Debian to link to some kind of
>     "CWL wrappers" for the single tools?
>
>
> Instead of linking we can include them in the package, like we do unix
> manual pages.
That is what I had also suggested. Maybe we can come up with a dh-cwl
helper that auto-updates them when there is internet access?
>
> See the section to the spec about where to find CWL tool descriptions:
> http://www.commonwl.org/v1.0/CommandLineTool.html#Discovering_CWL_documents_on_a_local_filesystem
>
> Perhaps this should be added to the Debian-Med policy as a bonus item
> for packages? samtools already ships some descriptions
>
> $ apt-file search /usr/share/commonwl
> samtools: /usr/share/commonwl/samtools-faidx.cwl
> samtools: /usr/share/commonwl/samtools-index.cwl
> samtools: /usr/share/commonwl/samtools-rmdup.cwl
> samtools: /usr/share/commonwl/samtools-sort.cwl
> samtools: /usr/share/commonwl/samtools-view.cwl
I was not aware of those - excellent!
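
For those following along, a small Python sketch of how such shipped descriptions could be found on a running system. The directory list is my reading of the spec section linked above and should be treated as an assumption; the function names are made up:

```python
import os
from pathlib import Path


def cwl_search_dirs():
    """Candidate directories (assumed from the CWL v1.0 discovery section)."""
    data_home = os.environ.get("XDG_DATA_HOME") or os.path.join(
        os.path.expanduser("~"), ".local", "share")
    return [
        "/usr/share/commonwl",
        "/usr/local/share/commonwl",
        os.path.join(data_home, "commonwl"),
    ]


def find_cwl_descriptions(dirs=None):
    """Return the *.cwl files found in the given (or default) directories."""
    found = []
    for d in dirs if dirs is not None else cwl_search_dirs():
        path = Path(d)
        if path.is_dir():
            found.extend(sorted(str(f) for f in path.glob("*.cwl")))
    return found


if __name__ == "__main__":
    for description in find_cwl_descriptions():
        print(description)
```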
>  
>
>     That is again similar to the elsewhere-discussed proposal of
>     generating (and/or linking to) software containers (Docker,
>     Singularity, rkt?)...
>
> Software containers can be generated fairly automatically and don't
> really benefit from upstream's participation.
Let us see how this develops. For instance, I anticipate that most
issues that Debian packages run into when there are new versions out,
will also affect the BioConda community. Via OMICtools we have an
indirect mapping from Debian packages to BioConda. We could make that a
more direct one. That way we could mutually learn about issues with
particular new versions that affect various auto-generated
Docker/Singularity images.
> CWL tool descriptions can and should be maintained collectively;
> preferably they are offered to upstream for inclusion just like other
> Debian instigated patches and manual pages are sent up.
I agree. And in a way this is why I find it problematic to statically
ship those wrappers when there are newer versions already available on
the CWL GitHub. We need an update mechanism, I think, not only at build
time but also for the already installed packages - but then again, this
very much contradicts the concept of a stable release. So, I still need
to make up my mind about all this.
>
>     Back to the topic: I agree with Steffen that if we mean the link
>     pairs as Provider + ID (as opposed to ID_type + ID_value), then
>     SciCrunch makes more sense than RRID.
>
>
>     Cheers,
>     Matus
>
>
>
>
> -- 
> Michael R. Crusoe
> Co-founder & Lead,
> Common Workflow Language project <http://www.commonwl.org/>
> https://impactstory.org/u/0000-0002-2961-9670
> mrc@commonwl.org <mailto:mrc@commonwl.org>
> +1 480 627 9108

