
Re: RRID -> SciCrunch



On 25.10.17 18:52, Michael Crusoe wrote:
>
>
> 2017-10-25 19:21 GMT+03:00 Steffen Möller <steffen_moeller@gmx.de>:
>
>
>     On 25.10.17 17:49, Michael Crusoe wrote:
>     >
>     >
>     > 2017-10-25 18:19 GMT+03:00 Matus Kalas <Matus.Kalas@uib.no>:
>     >
>     >     On 2017-10-25 15:12, Michael Crusoe wrote:
>     >
>     >         2017-10-25 16:04 GMT+03:00 Steffen Möller
>     >         <steffen_moeller@gmx.de>:
>     >
>     >             On 25.10.17 13:47, Michael Crusoe wrote:
>     >
>     >
>     >
>     >                 2017-10-25 14:34 GMT+03:00 Steffen Möller
>     >                 <steffen_moeller@gmx.de>:
>     >
>     >
>     >                 On 25.10.17 10:56, Michael Crusoe wrote:
>     >
>     >                     Sorry, I missed the bit where we are deprecating
>     >                     RRID. Can someone explain?
>     >
>     >
>     >                 Because of
>     >                 https://arxiv.org/ftp/arxiv/papers/1707/1707.03659.pdf [1]
>     >                 and some web googling from which I gathered that the
>     >                 "Research Resource IDentifiers" are not only provided
>     >                 by SciCrunch. Admittedly, I fail to find that page now
>     >                 that I want to find it :o/
>     >
>     >
>     >                 There is no conflict here. scicrunch.org
>     >                 <http://scicrunch.org> [2] is the post-pilot phase of
>     >                 what is described in that paper.
>     >
>     >
>     >
>     >                 Personally, I could not care less, let those catalog
>     >                 providers fight that out among themselves. However, I
>     >                 find that the notion of SciCrunch clearly identifies
>     >                 the provenance of that information, while "RRID" to me
>     >                 is more of a concept coined by
>     >                 (https://www.force11.org), not a provider. And with
>     >                 several initiatives following the same purpose, I found
>     >                 that by using SciCrunch not RRID, we would be the most
>     >                 provider-neutral. And then again, it is only something
>     >                 local to the Debian packaging, not publicly visible, so
>     >                 nobody should truly care and the use of SciCrunch imho
>     >                 serves us best on a technical level.
>     >
>     >
>     >                 RRIDs share a single name space that allows for
>     >                 multiple providers, SciCrunch being the current main
>     >                 provider for software tools and databases and other
>     >                 registries responsible for the other types. By
>     >                 referring to RRIDs generically, there is no conflict.
>     >
>     >                 See
>     >                 https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000558
>     >                 [3] for an overview.
>     >
>     >                 Please rename this field to RRID, or better yet just
>     >                 have a list of URIs like we do in CWL so you don't have
>     >                 to care if it is a RRID, DOI or whatever :-)
>     >
>     >                 http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwarePackage
>     >                 [4]
>     >
>     >
>     >             This is what we are doing. The field is called "Registry"
>     >             (not RRID, so we can also refer to Wikis and other
>     >             catalogs) and allows for an arbitrary unordered number of
>     >             (Name, Entry) tuples, in complete analogy to the CWL, I
>     >             tend to think.
>     >
>     >
>     >         Well, no. In CWL we don't separate the provider from the
>     >         identifier.
>     >         That's the whole point about COOL URIs.
>     >         I've CC'd Stian as he explains this better than I do.
>     >
>     >
>     >     The problem is that, among the IDs we apply, only Bio.Tools
>     >     provides COOL URIs. Other providers should, but they don't, at
>     >     least not yet. Thus a provider + ID pair is, unfortunately,
>     >     necessary.
>     >
>     >
>     > RRIDs have a COOL URI form:
>     > https://identifiers.org/rrid/RRID:SCR_001156
>
>
>     This is the one we are generating on the task page from the
>     information Name: SciCrunch, Entry: SCR_001156. And as of today, the
>     same would be generated if the entry was with Name: RRID.
>
>     I think we would make a mistake to give up the distinction, not least
>     since we could then no longer query the UDD for, e.g., entries that
>     have an RRID but no bio.tools entry.
>
>
> I'm not convinced that this proposal makes that impossible.
It is beyond what basic SQL can do. In particular, I am fond of the "NA"
assignment for when you searched for a package in a resource but could
not find it.
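
To illustrate: with the (Name, Entry) pairs kept apart, such a query stays a
plain self-join. This is only a minimal sketch. The table
registry(package, name, entry) is a hypothetical stand-in and not the real
UDD schema, and the package names and SCR numbers are made-up placeholders.

```python
import sqlite3

# Hypothetical, simplified stand-in for a UDD table of registry entries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registry (package TEXT, name TEXT, entry TEXT)")
conn.executemany(
    "INSERT INTO registry VALUES (?, ?, ?)",
    [
        ("pkg-a", "SciCrunch", "SCR_000001"),  # placeholder identifier
        ("pkg-a", "bio.tools", "NA"),          # searched for, but not found
        ("pkg-b", "SciCrunch", "SCR_000002"),  # placeholder identifier
        ("pkg-b", "bio.tools", "pkg-b"),
    ],
)

# Packages with a SciCrunch entry but no usable bio.tools entry.
# "NA" is treated as "looked up, nothing there", i.e. as missing.
QUERY = """
SELECT s.package
FROM registry AS s
LEFT JOIN registry AS b
       ON b.package = s.package
      AND b.name = 'bio.tools'
      AND b.entry <> 'NA'
WHERE s.name = 'SciCrunch'
  AND s.entry <> 'NA'
  AND b.package IS NULL
ORDER BY s.package
"""

missing = [row[0] for row in conn.execute(QUERY)]
print(missing)
```

The "NA" rows stay in the table, so "searched but not found" remains
distinguishable from "never looked".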
>  
>
>     >  
>     >
>     >
>     >     As a side-track: You mentioned DOIs, Michael.
>     >     Would it make sense if Debian (Med) adds DOIs as citable links to
>     >     upstream releases, in addition to the upstream version and
>     >     upstream repo information?
>     >     And in any case, DOIs for both the upstream project as a whole
>     >     (i.e. all releases), and/or for the particular releases, can be
>     >     added as citations for a src package, if package maintainer or
>     >     upstream wish so.
>     >
>     >
>     > Of course, the more links the merrier. In fact, one could write a
>     > simple script to autofill most of the upstream/metadata fields from
>     > any and all available identifiers and DOIs.
>     >
>     > A quick recap for those following along:
>     > Software identifiers are for the concept of a particular piece of
>     > software. They are persistent regardless of 1) the version of the
>     > software, 2) the release of a paper for major new functionality, or
>     > 3) switching to a new repository.
>     >
>     > khmer will always be identifiable by
>     > https://identifiers.org/rrid/RRID:SCR_001156 regardless of new
>     > releases, new papers, or new hosting platforms
>     >
>     > DOIs are currently used to identify point-in-time digital objects
>     > like papers, or a certain source code release.
>     > It is true that some services that issue DOIs for software releases,
>     > like Zenodo and FigShare, do have a "primary" DOI that each release
>     > derives from. But that becomes insufficient as one might switch
>     > between those services or to another provider.
>     >
>     > Back to your suggestion: the next step is to determine the best
>     > place to put these per-source DOIs within Debian.
>     >
>     > Do we add them to
>     > 1) debian/upstream/metadata as part of a version:DOI dictionary?
>     Interesting. I do not immediately see how this fits that format,
>     though.
>     > 2) debian/changelog for each release or just for the
>     > ${upstreamversion}-1 release?
>     ${upstreamversion}-1,  preferred over 1)
>     > 3) the binary package control file in the binary packages?
>     > https://www.debian.org/doc/debian-policy/#s-binarycontrolfiles
>     may be required for indexing
>     > 4) the debian source control file (.dsc)?
>     > https://www.debian.org/doc/debian-policy/#debian-source-control-files-dsc
>     yes, again for indexing
>     > and/or
>     > 5) someplace else?
>
>     Just to have it discussed: debian/copyright
>
>     But today the version of the software is not specified in that
>     document and we do not like changing the copyright file when the
>     copyright has not changed. So, debian/copyright is inferior to
>     debian/changelog IMHO.
>
>     >
>     > We can help ourselves answer this question by determining how and
>     > for what purposes we might want to access these DOIs
>     > 1) From a running system, as part of a citation/provenance query?
>     Yes, and there is some prior art to this. The package "devscripts"
>     provides the tool "wnpp-alert" that shows all those packages that are
>     orphaned or requested to be adopted and installed on that system. We
>     could come up with the same for any CWL workflow/wrapper - but for
>     that we would not need this information shipping with the Debian
>     packages themselves but could work with the catalogs directly.
>     > 2) From Umegaya / the Ultimate Debian Database (UDD) ?
>
>     yes, this is the place to consult for the DOI assignments.
>
>     Especially when the workflow is hidden away in some container, the
>     query should be performed in an OS-independent manner - like via the
>     UDD.
>
>     > 3) Someplace else and/or some other purpose?
>
>     If those DOIs help with the communication between the various
>     catalogs / software repositories / papers, then I presume that this
>     is mostly outside of our immediate control.
>
>
> Reminder, we are talking about per-version DOIs here. The easiest way
> to communicate between databases about the idea of a piece of software
> is a software identifier.


If the version is part of the DOI then it can be understood by a human.
Otherwise it will be only for machines. If it is for machines, then the
annotation is best achieved in some automated manner, or we will have
difficulty persuading our human maintainers to add that information.


>  
>
>
>
>     >  
>     >
>     >
>     >
>     >
>     >             Related to your comment (and very, very close to my heart)
>     >             is the question if we do everything sufficiently well to
>     >             map the CWL workflows to Debian packages. We could for
>     >             instance add references to CWL-workflow-database-entries
>     >             for the workflows that a particular Debian package is used
>     >             in, so we can test them when the package updates - er,
>     >             before the package updates in the distribution.
>     >
>     >
>     >         We are good here; you can determine the packages used in any
>     >         given CWL description that includes a SoftwareRequirement
>     >         that is mappable directly or indirectly to a package.
>     >
>     >         For automated testing you would need a way to specify "normal"
>     >         or expected results; CWL v1.0.x doesn't have that concept. A
>     >         researchobject.org <http://researchobject.org> [16] RO that
>     >         contains/references those results with the corresponding CWL
>     >         workflow would however fulfill this role.
>     >
>     >
>     >     And another side-track: In addition to CWL workflows and using
>     >     them as tests (requiring some input-output pairs and an equality
>     >     relation), would it make sense for Debian to link to some kind of
>     >     "CWL wrappers" for the single tools?
>     >
>     >
>     > Instead of linking we can include them in the package, like we do
>     > with unix manual pages.
>     It is what I had also suggested. Maybe we can come up with an
>     auto-update with a dh-cwl helper when there is internet access?
>     >
>     > See the section of the spec about where to find CWL tool
>     > descriptions:
>     > http://www.commonwl.org/v1.0/CommandLineTool.html#Discovering_CWL_documents_on_a_local_filesystem
>     >
>     > Perhaps this should be added to the Debian-Med policy as a bonus
>     > item for packages? samtools already ships some descriptions
>     >
>     > $ apt-file search /usr/share/commonwl
>     > samtools: /usr/share/commonwl/samtools-faidx.cwl
>     > samtools: /usr/share/commonwl/samtools-index.cwl
>     > samtools: /usr/share/commonwl/samtools-rmdup.cwl
>     > samtools: /usr/share/commonwl/samtools-sort.cwl
>     > samtools: /usr/share/commonwl/samtools-view.cwl
>     I was not aware of those - excellent!
>     >  
>     >
>     >     That is again similar to the elsewhere-discussed proposal of
>     >     generating (and/or linking to) software containers (Docker,
>     >     Singularity, rkt?)...
>     >
>     > Software containers can be generated fairly automatically and don't
>     > really benefit from upstream's participation.
>     Let us see how this develops. For instance, I anticipate that most
>     issues that Debian packages run into when there are new versions out,
>     will also affect the BioConda community. Via OMICtools we have an
>     indirect mapping from Debian packages to BioConda. We could make that
>     a more direct one. That way we could mutually learn about issues with
>     particular new versions that affect various auto-generated
>     Docker/Singularity images.
>
>
> Ah, now you're talking about linking to other packaging systems which
> I support.
> However, with software identifiers being adopted by both debian-med
> and bioconda the linking becomes implicit

agreed

The Debian package tracker shows information on the equally named Ubuntu
package. Whether in some automated fashion or through manual edits, we
could extend the same cross-distro watch to all our packages. While the
computational biology community would mostly be concerned with BioConda,
the physicists would mostly compare with Scientific Linux or so.

So, while we can think of an automated annotation with such cross-links
because of the RRIDs we assigned, other communities may need to perform
this manually.
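
Such an automated annotation amounts to a join on the shared identifier. A
minimal sketch, with made-up package and recipe names and placeholder SCR
numbers (none of these are real assignments), and with cross_links being a
name I invented for the illustration:

```python
from collections import defaultdict

# Hypothetical mappings from package/recipe name to its assigned identifier.
debian = {"pkg-a": "SCR_000001", "pkg-b": "SCR_000002"}
bioconda = {"recipe-a": "SCR_000001", "recipe-c": "SCR_000003"}


def cross_links(left, right):
    """Map each name in `left` to the names in `right` sharing its identifier."""
    by_id = defaultdict(list)
    for name, rrid in right.items():
        by_id[rrid].append(name)
    # Only names whose identifier also occurs on the other side are linked.
    return {
        name: sorted(by_id[rrid])
        for name, rrid in left.items()
        if rrid in by_id
    }


links = cross_links(debian, bioconda)
print(links)
```

Packages without a counterpart simply do not show up, which is where the
manual work would remain.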

https://distrowatch.com/?language=EN for those not aware of it.


>     > CWL tool descriptions can and should be maintained collectively;
>     > preferably they are offered to upstream for inclusion just like
>     > other Debian instigated patches and manual pages are sent up.
>     I agree. And in a way this is why I find it problematic to statically
>     ship those wrappers when there are newer versions already available on
>     the CWL github. We need an update mechanism, I think, not only at
>     build time but also for the already installed packages - but then
>     again, this very much contradicts the concepts of a stable release.
>     So, I still need to make my mind up about this all.
>
>
> CWL tool descriptions will stabilize quickly enough. CWL executors are
> not required to use the descriptions in /usr/share/commonwl (or any
> other location); they merely assist users in getting started with the
> software already on their system. At any time users can write their
> own, download a different one, or copy and improve the system installed
> version.

Do the CWL wrappers ship with a URL from which to download the latest
version as part of the CWL description? Like a "this document lives
here" line that you may refer to as an "origin" field or so?

I skimmed through the CWL spec and could not find it. There was an issue
https://github.com/common-workflow-language/common-workflow-language/issues/170
that I interpreted as requesting the very same but I admit to have
somewhat failed to explicitly grasp how that schema would be adapted for
automated updates.

>  
>
>     >
>     >     Back to the topic: I agree with Steffen that if we mean the link
>     >     pairs as Provider + ID (as opposed to ID_type + ID_value), then
>     >     SciCrunch makes more sense than RRID.
>
>
> From https://identifiers.org/rrid/RRID:SCR_001156
>
> "Proper citation
>
> khmer, RRID:SCR_001156"
>
> So please don't strip off RRID :-)

I may be in need of some further brainwashing here. When put in a paper,
the "RRID:" prefix needs to appear. But for our Debian-internal
referencing that is in a single formal field not in a free text area,
and not together with any other identifiers, I was even tempted to omit
the "SCR_" and leading 0s :o)
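
Whichever form the field stores, the citable form and the identifiers.org
link can be derived mechanically, as long as the "SCR_" part and the zeros
are kept. A minimal sketch, assuming entries look like "SCR_" followed by
digits, optionally with the "RRID:" prefix. The function names are made up
for the illustration.

```python
import re

# Accepts "SCR_001156" or "RRID:SCR_001156"; anything else is rejected.
_SCR = re.compile(r"(?:RRID:)?(SCR_\d+)")


def canonical_rrid(entry: str) -> str:
    """Return the citable form, e.g. 'RRID:SCR_001156'."""
    match = _SCR.fullmatch(entry.strip())
    if match is None:
        raise ValueError(f"not a SciCrunch-style RRID entry: {entry!r}")
    return "RRID:" + match.group(1)


def rrid_url(entry: str) -> str:
    """Return the identifiers.org URL for the entry."""
    return "https://identifiers.org/rrid/" + canonical_rrid(entry)


print(canonical_rrid("SCR_001156"))
print(rrid_url("RRID:SCR_001156"))
```

An entry stripped down to "1156" would be rejected here, since the number of
leading zeros can no longer be recovered from it.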

Steffen

