Re: getData&DataLad - could it be heaven? Was: Sepp : including a dataset?
On Fri, 09 Oct 2020, Steffen Möller wrote:
> > On Fri, 09 Oct 2020, Steffen Möller wrote:
> >> I added datalad-crawler to
> >> https://docs.google.com/spreadsheets/d/1tApLhVqxRZ2VOuMH_aPUgFENQJfbLlB_PFH_Ah_q7hM/edit#gid=401910682
> >> .
> > thanks! requested write access to contribute (no worries -- I will not
> > be as vocal there ;))
> Jun has proven to be exceptionally fast.
yeap, I already fixed the typo ;)
> >> * redistribution of such prepared files and folders with datalad.
> > sure -- could be done from "generic" to specific ("datalad"). But as
> > for Debian users, they do not want to learn anything generic or
> > specific. They know and want "apt install", so it is a question of
> > "what would be the best". (I also like the apt-file command for
> > finding a file which might be in some package but I do not know which...)
> Yip. Using that, too. For the moment I do not see how to mix a
> getData-installed setup and a datalad-provided one. That may be trivial
> but I don't see it, not knowing enough.
If "getData" (sorry, didn't look) can output just a list of URLs to
download and any metadata to decide on filenames (besides the ones from
the URLs, and as contained in the archives, if desired) it could be as
easy as
datalad create blah && cd blah && getData --url-records blah | datalad addurls - '{url}' '{_url_basename}'
to get yourself a datalad dataset with all the files. If there are
tarballs, a subsequent call to datalad add-archive-content would add the
extracted files. At some point we should "marry" datalad-crawler and
"datalad addurls" to make it even easier to crawl/update. Actually, if
`getData` returns such records, it should be easy to create a
getData-specific crawler pipeline which would do all that is needed and
be efficient for updates.
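To make the tarball part concrete, roughly (the archive path below is
just a made-up placeholder):

    # after addurls has populated the dataset, fetch a registered tarball
    # and register its extracted content via the datalad-archives
    # git-annex special remote
    datalad get raw/data.tar.gz
    datalad add-archive-content raw/data.tar.gz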
> > If getData could account for the aforementioned possible gotchas and
> > provide a resilient solution, sure -- why not? ;)
> Well. We should talk to diverse upstreams if they would offer their
> (raw) data via git-annex in some way.
that is the beauty of git-annex which sparked the whole datalad effort:
git-annex can access data from a wide range of sources, including plain
URLs, and allows for custom "external special remotes". This way we
added support for getting files from tarballs (we first add the tarball
to git-annex, and individual files are obtained via the datalad-archives
git-annex special remote), or from portals with odd authentication
schemes (we have the datalad special remote for that), where we would
first ask the user for credentials and authenticate on their behalf.
E.g. for some it is an interaction with their auth server to first get a
token which is then used to access S3.
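For instance, at the git-annex level a plain URL gets registered and
queried like this (filename and URL below are made up):

    # register a remote file; git-annex records the URL as one source
    git annex addurl --file mydata.nii.gz https://example.com/mydata.nii.gz
    # list all locations known to hold the file ('web' among them)
    git annex whereis mydata.nii.gz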
The point is -- no need to talk to any human/upstream! Just alert
them whenever their data is actually not accessible or broken -- I ran
into a good number of cases of broken archives etc. They post data,
nobody cares to download it, and thus they do not even know that what
they host is junk.
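(That kind of check can even be scripted, e.g. something along the lines of

    # verify, without re-downloading content, that files annexed from the
    # web are still reported present at their recorded URLs
    git annex fsck --fast --from web

run periodically would catch URLs which silently went dead.)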
> You discuss transport-reliability. I don't care too
> much, will just manually (?) reinitiate what has failed and check in what
> datalad then redistributes only after getData was successful.
I am talking more about upstream changing or moving the data, or
portals going offline (I had a use case for
https://github.com/datalad/nih--videocast when the government shutdown
happened and their original portal went down, but the videos were still
accessible from the original URLs which otherwise nobody could get to ;)).
> Maybe that got lost a bit. We have talked a lot about the complexity of
> workflows. But maintaining the reference datasets has some complexity to
> it, too. I lack immediate examples, but it is not unthinkable that the
> same search tool is used by different tools, but each downstream tool
> uses it with different parameters, which may require preparing
> different indexes.
yes. that is what Michael Hanke (with whom we started DataLad) had in
mind while developing the python3-whoosh based search -- the ability to
establish an individual index depending on the purpose... in my simple
human life, I am looking more for a 'google-like' approach -- a simple
query or, if desired, a more specialized one, with the internal
implementation taking care of optimizing (well -- the current default
search backend in datalad is dumb simple and not optimizing -- a very
simple loop through the records ;))
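The intended interface stays as simple as (the query term below is just
an arbitrary example):

    # query the aggregated metadata of the dataset at hand
    datalad search T1w

with any smarter index-building happening behind that same call.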
> Or - different genome sizes may be suggestive of different hash
> sizes/min index lengths whatever and possibly downstream tools need to
> know about these parameters?
sorry, I have little to no clue about bioinformatics, so I might
misinterpret what sizes we are talking about ;)
> I really like all the meta information that datalad can pass with its
> data, but we fall short of everything if we only discuss redistribution.
EXACTLY! that is why datalad is not just for re-distribution, although it
was born to fill that niche.
> > And datalad is also in debian already. git format of debian source
> > packages has been supported for a while, so in principle you could just
> > wrap pre-created "generic" git/git-annex (datalad) datasets to
> > become debian source packages and just provide debhelpers for
> > post-install/uninstall
> Here I disagree (yet). We get in trouble if we just remove some TB of
> data and hours of indexing because of a deleted package on a server when
> the clients want to keep using those indexes. I would separate data and
> package management.
;-) use-cases vary! Nothing and nobody can prevent someone from removing
TBs of data in general. So it is more about avoiding unintended
actions. By default 'datalad remove' is quite slow since it first
verifies that the data is still available in the original location and
could be re-obtained if desired -- that takes time -- which gives time
for a 2nd thought ;) Other prevention mechanisms could be implemented to
ask the user for confirmation before uninstalling something that massive.
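E.g. (paths below are made up):

    # drop only the local content; git-annex refuses unless it can verify
    # the data remains available elsewhere (e.g. at its original URL)
    datalad drop big/file.dat
    # remove a whole subdataset, again checking availability first
    datalad remove -d . some-subdataset

so the expensive-to-recreate bits do not vanish by accident.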
> Again, RL strongly suggests I keep doing something else for the immediate
> future.
understood. RL is indeed a ... RL ;)
> Do you have neurological data from COVID-19 patients? Many are
nope :-/ in research environments it is unlikely that any useful dataset
would come out, since they would not have the means to recruit the
relevant population, and many (like ours) are still under quarantine and
not scanning. Hospitals would be the main source of good relevant data,
e.g. google led me to
http://www.ajnr.org/content/early/2020/09/10/ajnr.A6717
I have emailed the corresponding author, let's see...
But I feel that it is "very very unlikely" any clinical neuroimaging
data would be shared. Eh, forgot to mention to the authors our other
project https://open-brain-consent.readthedocs.io/ which could have come
in handy (or could in the future).
The best candidate would be https://www.ukbiobank.ac.uk/ once they
collect more longitudinal neuroimaging data on participants who also
went through COVID. Access to that database is not free and the
process is somewhat tedious, but the database is quite rich.
> discussing depressions etc. And would there possibly be OMICS data from
> the same individuals? Have not checked the literature, yet.
me neither ;)
> Should be
> available, somewhere. We could then use such data as a joint exercise
> and come up with a not completely unreasonable small project, say (Just
> something from the hip) to rank patients for their Ca++ transporter
> expressions (in the blood which may also affect neurons, either directly
> or via the blood) and look for associated phenotypes on the EEG. When
> we get through this together, we should then also know about how getData
> or datalad can help us.
Sounds indeed like a good project to pursue. I will keep my ears/eyes
open for any neuro-related data popping up somewhere, but I have stated
my concerns above ;)
--
Yaroslav O. Halchenko
Center for Open Neuroscience http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW: http://www.linkedin.com/in/yarik