
Re: getData&DataLad - could it be heaven? Was: Sepp : including a dataset?



While crafting an initial reply to Jun, I just stumbled across
https://genexa.ch/sars2-bioinformatics-resources/ in the Nanopore
Covid-19 forum, which offers readily usable Covid-19 resources for a
set of tools that we already have in our distribution.

For one, Yaroslav may have an idea about the kind of technology they
should be using to redistribute their downloads. :o)

While writing these lines, I realize that I would not want to use any
of these .zip files/tarballs without knowing exactly how they were
created. My initial reaction was that I would never use this. And then
I looked at the descriptions and thought: something as trivial as a
wget "some search URL"|clustalw -if - or so, I mean, who would want to
redo that? Not many. But there is no way to use it without knowing
exactly what has been done.

@Yaroslav, how would you prepare that genexa site instead? Could there
be some "mouseover" event to inspect the provenance? Is there any other
reasonable way to have a GitHub-like web presentation for a set of
resources that also informs about their provenance?

In case you have not spotted it: I am basically declaring a partial
defeat. I do not need to redo what is trivial, as long as its
provenance is inspectable and I _could_ redo it.
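
A rough sketch of what I have in mind, in case the preparation itself
went through datalad (dataset name, URL and the alignment call below
are purely illustrative, not taken from the genexa site):

    # create a dataset and record the preparation step with its provenance
    datalad create covid-resources && cd covid-resources
    datalad run -m "fetch sequences and align them" \
        "wget -O seqs.fasta 'some search URL' && clustalw -INFILE=seqs.fasta"
    # the exact command now sits in the git history ...
    git log
    # ... and anyone who clones the dataset could re-execute it
    datalad rerun

With something like that behind the downloads, a mouseover would only
need to show the recorded command.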

Best,

Steffen

On 10.10.20 02:08, Yaroslav Halchenko wrote:
> On Fri, 09 Oct 2020, Steffen Möller wrote:
>>> On Fri, 09 Oct 2020, Steffen Möller wrote:
>>>> I added datalad-crawler to
>>>> https://docs.google.com/spreadsheets/d/1tApLhVqxRZ2VOuMH_aPUgFENQJfbLlB_PFH_Ah_q7hM/edit#gid=401910682
>>>> .
>>> thanks!  requested write access to contribute (no worries -- I will not
>>> be as vocal there ;))
>> Jun has proven to be exceptionally fast.
> yeap, I already fixed the typo ;)
>
>>>>  * redistribution of such prepared files and folders with datalad.
>>> sure -- could be done from "generic" to specific ("datalad").  But as
>>> for Debian users: they do not want to learn anything generic or
>>> specific.  They know and want "apt install", so it is a question of
>>> "what would be best". (I also like the apt-file command for finding
>>> which package might contain a file I am looking for...)
>> Yip. Using that, too. For the moment I do not see how to mix a
>> getData-installed setup and a datalad-provided one. That may be trivial
>> but I don't see it, not knowing enough.
> If "getData" (sorry, didn't look) can output just a list of URLs to
> download and any metadata to decide on filenames (besides the ones from
> the URLs, and as contained in the archives, if desired) it could be as
> easy as
>
>     datalad create blah && cd blah && getData --url-records blah | datalad addurls - '{url}' '{_url_basename}'
>
> to get yourself a datalad dataset with all the files.  If there are
> tarballs, a subsequent call to  datalad add-archive-content  would add
> the extracted files.  At some point we should "marry" datalad-crawler
> and "datalad addurls" to make it even easier to crawl/update.  Actually,
> if `getData` returns such records, it should be easy to create a
> getData-specific crawler pipeline which would do all that is needed and
> be efficient for updates.
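>
> For the tarball case, the follow-up could look roughly like this (file
> name made up; the dataset is assumed to already contain the tarball,
> e.g. via addurls above):
>
>     # extract the archive's content into the dataset; git-annex keeps a
>     # record that each extracted file can be re-obtained from the archive
>     datalad add-archive-content raw-data.tar.gz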
>
>>> If getData could account for the aforementioned possible gotchas and
>>> provide a resilient solution, sure -- why not? ;)
>> Well. We should talk to the various upstreams about whether they would
>> offer their (raw) data via git-annex in some way.
> that is the beauty of git-annex which sparked the whole datalad effort:
> git-annex can access data from a wide range of sources, including plain
> urls, and allows for custom "external special remotes".  This way we
> added support for getting files from tarballs (we first add the tarball
> to git-annex, and individual files are obtained via the datalad-archives
> git-annex special remote), or from portals with odd authentication
> schemes (we have the datalad special remote), where we first ask the
> user for credentials and authenticate on their behalf.  E.g. in some
> cases that means interacting with their auth server to first get a
> token which is then used to access S3.
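>
> For the plain-url case it is really just git-annex, no special remote
> needed -- a minimal sketch (URL and file name invented):
>
>     # track a file by its URL; the content gets annexed, the URL recorded
>     git init demo && cd demo && git annex init
>     git annex addurl --file ref.fasta 'https://example.org/ref.fasta'
>     # local content can be dropped and later re-fetched from that URL
>     git annex drop ref.fasta
>     git annex get ref.fasta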
>
> The point is -- no need to talk to any human/upstream!  Just alert
> them whenever their data is actually not accessible or broken -- I ran
> into a good number of cases of broken archives etc.  They post data,
> nobody cares to download it, and thus they do not even know that what
> they host is junk.
>
>> You discuss transport reliability. I don't care too much about that; I
>> will just manually (?) re-initiate whatever has failed, and only check
>> data into what datalad then redistributes after getData was successful.
> I am talking more about upstream changing or moving the data, or
> portals going offline (I had a use case for
> https://github.com/datalad/nih--videocast when the government shutdown
> happened and their original portal went down, but the videos were still
> accessible from the original urls, which otherwise nobody could get
> to ;)).
>
>> Maybe that got lost a bit. We have talked a lot about the complexity of
>> workflows. But maintaining the reference datasets has some complexity
>> to it, too. I lack immediate examples, but it is not unthinkable that
>> the same search tool is used by different downstream tools, each with
>> different parameters, which may require preparing different indexes.
> yes. that is what Michael Hanke (with whom we started DataLad) had in
> mind while developing the python3-whoosh based search -- the ability to
> establish an individual index depending on the purpose... in my simple
> human life, I am looking more for a 'google-like' approach -- a simple
> query or, if desired, a more specialized one, and the internal
> implementation should take care of optimizing (well - the current
> default search backend in datalad is dumb simple and not optimizing --
> a very simple loop through the records ;))
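>
> In case it helps to picture the difference: with the default backend a
> query is really just
>
>     # a plain query; the default backend simply loops over the metadata
>     # records of the dataset(s), no index involved
>     datalad search covid
>
> while a whoosh-based backend first builds an index and then answers
> queries against it.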
>
>> Or - different genome sizes may suggest different hash sizes/minimal
>> index lengths or whatever, and possibly downstream tools need to know
>> about these parameters?
> sorry, I have little to no clue about bioinformatics, so I might
> misinterpret which sizes we are talking about ;)
>
>> I really like all the meta information that datalad can pass with its
>> data, but we fall short of everything if we only discuss redistribution.
> EXACTLY! that is why datalad is not just for re-distribution, although
> born to fill that niche.
>
>>> And datalad is also in debian already.  The git format of debian
>>> source packages has been supported for a while, so in principle you
>>> could just wrap pre-created "generic" git/git-annex (datalad) datasets
>>> into debian source packages and just provide debhelpers for
>>> post-install/uninstall
>> Here I disagree (for now). We get into trouble if we remove some TB of
>> data and hours of indexing just because a package was deleted on a
>> server, while the clients want to keep using those indexes. I would
>> separate data and package management.
> ;-)  use-cases vary! Nothing and nobody can prevent someone from
> removing TBs of data in general.  So it is more about avoiding
> unintended actions. By default 'datalad remove' is quite slow since it
> first verifies that the data is still available in the original
> location and could be re-obtained if desired -- takes time -- gives
> time for a 2nd thought ;) Other prevention mechanisms could be
> implemented to ask the user for confirmation before uninstalling
> something that massive.
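>
> To make the distinction concrete (paths invented, only the default
> behaviour sketched -- both commands have options to skip the checks):
>
>     # free local space only; verifies first that the content is still
>     # obtainable from some other location (e.g. the original URL)
>     datalad drop bigindex/
>     # remove the dataset altogether, again with availability checks first
>     datalad remove -d bigindex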
>
>> Again, RL strongly suggests that I keep doing something else for the
>> immediate future.
> understood.  RL is indeed a ... RL ;)
>
>> Do you have neurological data from COVID-19 patients? Many are
> nope :-/  in research environments it is unlikely that any useful
> dataset would come out, since they would not have the means to recruit
> the relevant population, and many are still on quarantine (like ours)
> and not scanning.  Hospitals would be the main source of good relevant
> data, e.g. google led me to
> http://www.ajnr.org/content/early/2020/09/10/ajnr.A6717
> I have emailed the corresponding author, let's see...
>
> But I feel it is "very very unlikely" that any clinical neuroimaging
> data would be shared. Eh, I forgot to mention to the authors our other
> project https://open-brain-consent.readthedocs.io/ which could have
> come in handy (or could in the future).
>
> The best candidate would be https://www.ukbiobank.ac.uk/ once they
> collect more longitudinal neuroimaging data on participants who also
> went through COVID.  That database is not free to access and the
> process is somewhat tedious, but it is quite rich.
>
>> discussing depression etc. And would there possibly be OMICS data from
>> the same individuals? I have not checked the literature yet.
> me neither ;)
>
>> Should be available somewhere. We could then use such data as a joint
>> exercise and come up with a not completely unreasonable small project,
>> say (just something from the hip) to rank patients by their Ca++
>> transporter expression (in the blood, which may also affect neurons,
>> either directly or via the blood) and to look for associated
>> phenotypes in the EEG.  Once we get through this together, we should
>> also know how getData or datalad can help us.
> Sounds like indeed a good project to pursue.  I will keep my ears/eyes
> open for any neuro-related data popping up somewhere, but I have stated
> my concerns above ;)
>

