
getData&DataLad - could it be heaven? Was: Sepp : including a dataset?



Please pardon me in advance -- this came out long again.

TL;DR summary:

- it would be great to marry getData and datalad: benefit from the
  knowledge getData possesses ATM about data sources, and then gain the
  (many) benefits of git/git-annex/datalad, especially while thinking
  about the "research process", data management, reproducibility, etc.

- I shared an initial prototype of the "dh-annex" helper I started
  years back; it might be over-engineered, and is definitely not yet
  complete

NB BTW https://wiki.debian.org/getData still points to SVN, which is
gone?

"and here we go..."

<preaching>
On Fri, 09 Oct 2020, Steffen Möller wrote:
> I added datalad-crawler to
> https://docs.google.com/spreadsheets/d/1tApLhVqxRZ2VOuMH_aPUgFENQJfbLlB_PFH_Ah_q7hM/edit#gid=401910682
> .

Thanks!  I requested write access to contribute (no worries -- I will
not be as vocal there ;))

> The problem I still have with a datalad-only solution is that it
> alienates the folks that have always done it by themselves, i.e.
> fetching the databases from upstream, unpacking and indexing it all for
> all the tools that possibly ask for it. What I see is

>  * some automated routine that prepares all the downloads/indexes somewhere
>  * whoever wants to redo/improve that process themselves please copy
> that automated routine
>  * redistribution of such prepared files and folders with datalad.

Sure -- it could be done from the "generic" to the specific ("datalad").
But Debian users do not want to learn anything generic or specific: they
know and want "apt install", so the question is "what would be best".
(I also like the apt-file command, often to find a file which might be
in some package but I do not know which...)
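
E.g., finding which package ships a given file (the file name here is
just an example):

    apt-file update                  # refresh the contents index
    apt-file search bin/samtools     # should list, e.g., the samtools package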

Sure, it could also be accomplished with a non-datalad "backend" to
actually fetch the data, BUT then you would need to re-implement what
git-annex already does for us (checksumming, redundant access URLs,
etc).  You might also quickly get into trouble with unfetchable data, so
you would need to keep copies, and then become efficient in how you
manage them so as not to duplicate content across versions of the same
dataset once those start to appear -- at which point you would have
arrived at annex's .git/annex/objects keystore...  So yes, it could be
done, but IMHO it might actually be more work in the long run, speaking
from experience of working with more "live" data (not just a dump at the
end of a project, but data being cooked as more comes in,
analyzed/reanalyzed, etc.).
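
To give a flavor of what git-annex already provides there -- a minimal
sketch (file name and URLs are made up):

    # inside an existing git-annex repository
    git annex addurl --file data/chr1.fa.gz \
        https://mirror1.example.com/chr1.fa.gz
    # content is downloaded, checksummed into a key (SHA256E by default),
    # and stored deduplicated under .git/annex/objects/
    git annex addurl --file data/chr1.fa.gz \
        https://mirror2.example.com/chr1.fa.gz
    # same key, second URL registered -- redundant access "for free"
    git annex whereis data/chr1.fa.gz    # lists all known sources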

> Somewhere somehow we need to link this to the packages that are
> installed/installable. I don't think we should have any redistribution
> with datalad without that automated processing. For me, strongly biased,
> the automation comes via getData.

If getData could account for the aforementioned possible gotchas and
provide a resilient solution, sure -- why not? ;)


What is great about getData is that it is a perfect resource to feed
datalad (there is also datalad addurls, an "alternative" to the
datalad-crawler extension; we will marry them eventually) to establish
datalad datasets, and thus possibly provide redundant availability of
that data by hosting it somewhere (or nowhere -- just keeping it until
the original host disappears or changes the data) and then publishing ;)
(I do the same... see also the note about containers below, and take it
along with the news that Docker Hub will soon start pruning elderly
images).
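
To make the "feed datalad" bit concrete -- a minimal sketch, assuming
getData's knowledge of a data source were exported into a (hypothetical)
sources.csv with url/filename columns:

    # sources.csv (hypothetical):
    #   url,filename
    #   https://example.com/db/v1/part1.gz,part1.gz
    datalad create mydataset
    cd mydataset
    datalad addurls ../sources.csv '{url}' '{filename}'
    # files get annexed with checksums and their URLs registered, so
    # 'datalad get' can (re)fetch them later from any known source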

And datalad is also in Debian already.  The git format of Debian source
packages has been supported for a while, so in principle you could just
wrap pre-created "generic" git/git-annex (datalad) datasets into Debian
source packages, provide debhelpers for post-install/uninstall, and be
done -- gaining resilience through redundant availability, exact
versioning, etc.  But damn me -- why haven't I done it already, if it is
all so "easy"? ;)  I think I was over-engineering, starting with too
complex a project, and thus not finishing at all :-/  And then I just
started to use datalad directly too much.  But I found that thing I
started to work on:

https://github.com/yarikoptic/dh-annex
which has the script
https://github.com/yarikoptic/dh-annex/blob/master/tools/generate_sample_repo
which I believe was my CI for the 
https://github.com/yarikoptic/dh-annex/blob/master/tools/dh_annex
;)
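
Just to illustrate the idea -- a purely hypothetical maintainer-script
sketch of how such a package could fetch annexed content on install
(made-up path; NOT what dh_annex actually implements yet):

    #!/bin/sh
    # debian/postinst -- hypothetical sketch
    set -e
    DS=/usr/share/datasets/mydataset    # made-up install location
    if [ "$1" = "configure" ]; then
        # fetch annexed content from whichever registered URL still works
        datalad get -r "$DS"
    fi
    #DEBHELPER#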

Back to the topic of "why maybe datalad": there is also the whole
question of empowering 'git knowledgeable' users... any installed
datalad dataset could then become a subdataset within a "study" dataset;
throw some containers inside (could also be a Debian package -- have a
look at my collection of Singularity containers for neuroimaging, which
is itself a datalad dataset: https://github.com/ReproNim/containers/),
use datalad containers-run (datalad-container and singularity-container
are in Debian), and make the whole "research" project, if not fully
auto-reproducible, then at least one with a thorough history and exact
versioning of all ingredients.  Have a look e.g. at
http://handbook.datalad.org/en/latest/basics/basics-yoda.html for more
information.
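
A minimal sketch of that layout (the second clone URL, the container
name, and the analysis command are just illustrative):

    datalad create study            # the top-level "study" dataset
    cd study
    # register subdatasets
    datalad clone -d . https://github.com/ReproNim/containers code/containers
    datalad clone -d . https://example.com/some/dataset inputs/data
    # run an analysis inside a tracked container; command, inputs,
    # outputs, and exact container version all get recorded in history
    datalad containers-run -n repronim-reproin \
        --input inputs/data --output outputs \
        'mycmd {inputs} {outputs}'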

</preaching>
<sorry>
  I will try not to pollute the list with <preaching> much more
</sorry>

> Concerning Sepp, my hunch is that tipp and the data are not needed for
> immediate aims and I would skip that for now, leaving your comment on
> the excel sheet.

coolio.
-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik        

