
Re: getData&DataLad - could it be heaven? Was: Sepp : including a dataset?



Hi Yaroslav,

On 09.10.20 02:21, Yaroslav Halchenko wrote:
> Please pardon me in advance -- came out long again.
>
> TL;DR summary:
>
> - would be great to marry getData and datalad to benefit from the knowledge
>   getData possesses ATM on data sources, but then gain (many)
>   benefits from git/git-annex/datalad, especially while thinking about
>   "research process", data management, reproducibility etc
Great. To me, the orchestration and the formalisation+realisation of
procedural and/or conceptual knowledge are important. And I very much
like the concepts behind datalad that you(r group) introduce to our
BioMed world.
> - I shared some initial prototype for the "dh-annex" helper I started
>   years back but might be over-complicated/under-engineered, and
>   definitely not yet complete
Real life is a bit demanding on me these days. I have worked overtime to
get as much as possible packaged up and shoved into NEW, which would get
us a complete workflow. Once that makes it into our distribution and we
are thus all synced up with our software infrastructure, I will address
the Debianisation/Dataladification of inputs.
> NB BTW https://wiki.debian.org/getData still points to SVN which is
> gone?
Updated. Thank you tons for spotting that. GetData is also in the
distribution, btw.
> "and here we go..."
>
> <preaching>
> On Fri, 09 Oct 2020, Steffen Möller wrote:
>> I added datalad-crawler to
>> https://docs.google.com/spreadsheets/d/1tApLhVqxRZ2VOuMH_aPUgFENQJfbLlB_PFH_Ah_q7hM/edit#gid=401910682
>> .
> thanks!  requested write access to contribute (no worries -- I will not
> be as vocal there ;))
Jun has proven to be exceptionally fast. PM me with an address you
prefer to use if there are unexpected delays.
>
>> The problem I still have with a datalad-only solution is that it
>> alienates the folks that have always done it by themselves, i.e.
>> fetching the databases from upstream, unpacking and indexing it all for
>> all the tools that possibly ask for it. What I see is
>>  * some automated routine that prepares all the downloads/indexes somewhere
>>  * whoever wants to redo/improve that process themselves please copy
>> that automated routine
>>  * redistribution of the files and folders prepared this way, with datalad.
> sure -- could be done from "generic" to specific ("datalad").  But as
> for Debian users, they do not want to learn any generic or
> specific.  They know and want "apt install" so it is the question of
> "what would be the best". (I also like apt-file command to often find a
> file which might be in some package but I do not know which...)
Yip. Using that, too. For the moment I do not see how to mix a
getData-installed setup and a datalad-provided one. That may be trivial
but I don't see it, not knowing enough.

I do not think that "apt(-get) install" should initiate the downloads
directly.
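
What I could imagine instead: the package ships only a lightweight
dataset description, and an explicit, admin-triggered step performs the
actual fetch. A minimal sketch against datalad's Python API; the paths
are made up, and that a content-less "skeleton" would live under
/usr/share is purely my assumption:

    import datalad.api as dl

    # the .deb would install only a content-less git/git-annex clone;
    # the heavy download happens only when the admin asks for it
    ds = dl.clone(source='/usr/share/ref-data/hg38',  # hypothetical path
                  path='/srv/ref-data/hg38')
    ds.get('.')  # git-annex pulls the content from whatever URLs it knows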

> Sure it could be also accomplished with non-datalad "backend" to
> actually fetch the data BUT you would need to re-implement what
> git-annex does for us already (checksumming, redundant access urls,
> etc), and might quickly get into trouble of unfetchable data so you
> would need to keep copies, but become efficient in how you manage them
> to not duplicate across versions of the same dataset when those start to
> appear, so you would arrive at annex's .git/annex/objects
> keystore...  so - yes, could be done, but might be actually more work in
> the long run IMHO and AFAIK from experience of working with more "live"
> (not just a dump at the end of the project, but being cooked as more
> data comes in, analyzed/reanalyzed etc)
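
Interjecting just to spell out for the list what that keystore buys us:
the keys are content-addressed, so two versions of a dataset that share
a file store its bytes only once under .git/annex/objects. Roughly, and
glossing over git-annex's actual key backends:

    import hashlib
    import os

    def annex_like_key(path):
        # sketch in the spirit of git-annex's SHA256-s<size>--<digest>
        # keys: identical content yields an identical key, hence the
        # deduplication across dataset versions
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return 'SHA256-s%d--%s' % (os.path.getsize(path), h.hexdigest())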
>
>> Somewhere somehow we need to link this to the packages that are
>> installed/installable. I don't think we should have any redistribution
>> with datalad without that automated processing. For me, strongly biased,
>> the automation comes via getData.
> If getData could account for aforementioned possible gotchas and provide
> resilient solution, sure -- why not? ;)
Well. We should talk to diverse upstreams about whether they would offer
their (raw) data via git-annex in some way. You discuss transport
reliability; I do not care too much about that and will just (manually?)
re-initiate whatever has failed, and check results into what datalad
then redistributes only after getData was successful.

Maybe that got lost a bit: we have talked a lot about the complexity of
workflows, but maintaining the reference datasets has some complexity to
it, too. I lack immediate examples, but it is not unthinkable that the
same search tool is used by different downstream tools, each invoking it
with different parameters, which may require different indexes to be
prepared.

Or: different genome sizes may suggest different hash sizes, minimal
index lengths, or the like, and downstream tools possibly need to know
about these parameters?
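
One way to not lose such parameters: record them next to the index in a
sidecar file that every downstream tool can read. Purely a sketch; the
file name, fields and values are invented:

    import json

    # hypothetical sidecar describing how an index was built
    params = {
        'tool': 'bowtie2-build',  # example values only
        'reference': 'GRCh38',
        'seed_length': 22,
    }
    with open('GRCh38.index.params.json', 'w') as f:
        json.dump(params, f, indent=2)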

I really like all the metadata that datalad can pass along with its
data, but we fall short of everything if we only discuss redistribution.

>
>
> What is great about having getData
There are other tools out there, also in our distribution, which do the
same job but which I fail to grasp. Let us just agree for a minute to
keep using the term getData, but as a bit of a metaphor and as a
reminder that we want something simple and extendable.
> is it is a great resource to indeed
> feed datalad (there is also datalad addurls which is an "alternative" to
> datalad-crawler extension; we will marry them eventually) to establish
> datalad datasets and thus possibly provide redundant availability to
> that data if hosting it somewhere (or nowhere and just keeping it until
> original host disappears or changes data) and then publish ;)  (I do the
> same... see also note about containers below and take it along with the
> news that docker hub soon will start pruning elderly images).
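
That addurls route looks like the natural hand-over point to me: getData
already knows the source URLs, so a table of them could feed addurls
directly. A rough sketch; the CSV with its 'url'/'name' columns is an
assumed layout, not something getData produces today:

    import datalad.api as dl

    # 'sources.csv' would be generated from getData's knowledge of the
    # upstream locations, one row per file
    dl.create(path='ref-dataset')
    dl.addurls(urlfile='sources.csv', urlformat='{url}',
               filenameformat='{name}', dataset='ref-dataset')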
>
> And datalad is also in debian already.  The git format of debian source
> packages has been supported for a while, so in principle you could just
> wrap pre-created "generic" git/git-annex (datalad) datasets to
> become debian source packages and just provide debhelpers for
> post-install/uninstall
Here I disagree (yet). We get into trouble if we remove some TB of data
and hours of indexing just because a package was deleted on a server,
while the clients want to keep using those indexes. I would separate
data and package management.
> and be done, gain resilience through redundant
> availability and exact versioning, etc.  but damn me -- why haven't I
> done it already if it is all so "easy"? ;)  I think I was over-engineering
> starting with a too complex project to start with, and thus not
> finishing at all :-/  And then just started to use datalad directly too
> much.  But I found that thing I started to work on
>
> https://github.com/yarikoptic/dh-annex
> which has the script
> https://github.com/yarikoptic/dh-annex/blob/master/tools/generate_sample_repo
> which I believe was my CI for the
> https://github.com/yarikoptic/dh-annex/blob/master/tools/dh_annex
> ;)
Sorry, RL does not allow me to think along just yet.
> Back to the topic of "why may be datalad":
> And then there is the whole question of empowering 'git knowledgeable'
> users... any installed datalad dataset then could become a subdataset
> within "study" dataset; throw some containers inside (could also be a
> debian package -- have a look at my collection of singularity containers
> for neuroimaging which is also datalad dataset:
> https://github.com/ReproNim/containers/), use datalad
> container-run (datalad-container and singularity-container are in
> debian), and make the whole "research" project if not fully
> auto-reproducible  but with a thorough history and exact versioning of
> all ingredients. Have a look e.g. at
> http://handbook.datalad.org/en/latest/basics/basics-yoda.html for more
> information.
>
> </preaching>
> <sorry>
>   I will try to not pollute the list with <preaching> much more
> </sorry>
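
No need to apologise. Before RL pulls me away again, roughly how I
understand that study-dataset assembly would look, with every path and
URL here being a placeholder:

    import datalad.api as dl

    study = dl.create(path='study')
    # reference data and containers become exactly versioned subdatasets
    study.clone(source='https://github.com/ReproNim/containers',
                path='code/containers')
    study.clone(source='/srv/ref-data/hg38',  # hypothetical local dataset
                path='inputs/hg38')
    # with datalad-container installed, analyses would then run via
    # "datalad containers-run", recording provenance in the history

Correct me if I misread the YODA layout.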

Again, RL strongly suggests I keep doing something else for the
immediate future.

Do you have neurological data from COVID-19 patients? Many are
discussing depressions etc. And would there possibly be OMICS data from
the same individuals? I have not checked the literature yet; it should
be available somewhere. We could then use such data as a joint exercise
and come up with a not completely unreasonable small project, say (just
something from the hip) to rank patients by their Ca++ transporter
expression (in the blood, which may also affect neurons, either directly
or via the blood) and look for associated phenotypes on the EEG. When we
get through this together, we should then also know how getData or
datalad can help us.

Best,

Steffen

