
Re: How to package human, mouse and viral genomes?



On Thu, 03 Sep 2020, Steffen Möller wrote:

> I looked at datasets.datalad.org. I could well imagine to use your
> technology for other (larger) databases like Pfam or UniProt or PDB. For
> cute little genomes my initial reaction was that I felt overwhelmed.
> Your pointer will certainly help to define what we want. Many thanks!

FWIW, a few more notes since you seem to be interested ;):  we do
have the elderly https://github.com/datalad/datalad-crawler/ , the "origin
of it all", which is now just an extension to datalad.  It allows for
efficient "updates" and crawling of external resources.  See e.g. this
asciinema/script: https://www.datalad.org/for/data-consumers

But in many use cases a straight "datalad addurls" command
(http://docs.datalad.org/en/stable/generated/man/datalad-addurls.html ;
the handbook has an example:
http://handbook.datalad.org/en/latest/usecases/HCP_dataset.html?highlight=addurls#dataset-creation-with-datalad-addurls)
could be sufficient to "quickly" (depending on bandwidth and/or whether
you use the --fast option) populate a datalad dataset with files specified
in a spreadsheet/structured records.

So if you have some kind of .json or .csv/.tsv with records -- you could
try it quickly.  addurls also automagically adds columns as git-annex
metadata for each file, so one could "toy around" (so far I have
underused the feature) with "git annex view":
https://git-annex.branchable.com/git-annex-view/
or later use that metadata to facilitate extraction/aggregation/search.
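
To make that a bit more concrete, here is a minimal sketch (not a
tested recipe) of what such a table and the matching addurls call could
look like.  The records, the example.org URLs, the column names
(species, name, url) and the file name genomes.csv are all made up for
illustration; only "datalad addurls", its --fast option and its three
positional arguments (URL file, URL format, filename format) come from
the addurls documentation.

```python
import csv
import io
import shlex

# Hypothetical records: URLs and column names are placeholders, not real data.
RECORDS = [
    {"species": "human", "name": "chr1.fa.gz",
     "url": "https://example.org/genomes/human/chr1.fa.gz"},
    {"species": "mouse", "name": "chr1.fa.gz",
     "url": "https://example.org/genomes/mouse/chr1.fa.gz"},
]


def records_to_csv(records):
    """Serialize the records into CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["species", "name", "url"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def addurls_command(url_file, fast=True):
    """Build the argument list: datalad addurls [--fast] FILE URLFMT NAMEFMT."""
    cmd = ["datalad", "addurls"]
    if fast:
        # --fast registers the URLs without downloading the content
        cmd.append("--fast")
    # "{url}" and "{species}/{name}" refer to the (assumed) column names above
    cmd += [url_file, "{url}", "{species}/{name}"]
    return cmd


if __name__ == "__main__":
    print(records_to_csv(RECORDS), end="")
    print(shlex.join(addurls_command("genomes.csv")))
```

You would save the CSV text as genomes.csv inside an existing dataset
(e.g. one made with "datalad create") and run the printed command
there.  With these assumed columns, something like
"git annex view species=*" might then be a starting point for playing
with the metadata.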

A sample dataset (the original reason addurls was written) is available
from http://datasets.datalad.org/?dir=/labs/openneurolab/metasearch  if
you decide to explore (that data is open, so no authorization is needed
for access).

The problem you might encounter in your case is the (not that great)
scalability of git/git-annex when a single repo has to contain hundreds
of thousands of files.  So you might want to split them into subdatasets
(git submodules) or provide custom views, as I had mentioned before.

addurls makes it easy by establishing a subdataset whenever it
encounters // (instead of /) for path separation in the provided
filename.
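
As an illustration of that // convention, the little function below
shows how I would read such a filename: every // marks the end of a
subdataset path, and the rest is an ordinary path.  This is only a
sketch of the interpretation, not DataLad's actual code; the function
name and the sample paths are hypothetical.

```python
def subdataset_boundaries(filename):
    """Split an addurls-style filename on '//' boundaries.

    Returns (subdatasets, plain_path): the list of subdataset paths
    implied by each '//' and the same path written with single '/'.
    """
    parts = filename.split("//")
    subdatasets = []
    prefix = ""
    # Every part except the last one ends at a subdataset boundary.
    for part in parts[:-1]:
        prefix = f"{prefix}/{part}" if prefix else part
        subdatasets.append(prefix)
    plain_path = "/".join(parts)
    return subdatasets, plain_path


if __name__ == "__main__":
    for name in ["human//chr1.fa.gz", "viral/sars2.fa.gz"]:
        print(name, "->", subdataset_boundaries(name))
```

So a filename format like '{species}//{name}' (assumed column names)
would give one subdataset per species, whereas '{species}/{name}' keeps
everything in a single repo.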

PS I'll shut up now ;) Sorry for the flood of info.  We are just very
excited about DataLad even though we have been working on it for over 6
years and should be sick of it and git-annex by now ;-)  but we are
not!

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik        

