
Re: Sepp : including a dataset?



I added datalad-crawler to
<https://docs.google.com/spreadsheets/d/1tApLhVqxRZ2VOuMH_aPUgFENQJfbLlB_PFH_Ah_q7hM/edit#gid=401910682>.

The problem I still have with a datalad-only solution is that it
alienates the folks who have always done it themselves, i.e.
fetching the databases from upstream, unpacking them, and indexing
everything for all the tools that might ask for it. What I see is:

 * some automated routine that prepares all the downloads/indexes somewhere
 * whoever wants to redo/improve that process is welcome to copy that
   automated routine
 * redistribution of the files and folders thus prepared, with datalad.

Somewhere, somehow, we need to link this to the packages that are
installed/installable. I don't think we should have any redistribution
with datalad without that automated processing. For me, strongly biased,
the automation comes via getData.
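
To make this concrete, here is a minimal sketch of that flow; the getData
invocation and all names/paths below are assumptions for illustration, not
an actual interface:

	# 1. automated routine: fetch from upstream, unpack, build the indexes
	getData tipp-reference                        # hypothetical invocation
	# 2. redistribute the prepared files and folders with datalad
	datalad create tipp-reference-prepared
	cd tipp-reference-prepared
	cp -a /var/cache/getData/tipp-reference/. .  # assumed output location
	datalad save -m "add prepared tipp reference data"
	datalad push --to origin                      # assumed configured sibling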

Concerning Sepp, my hunch is that tipp and its data are not needed for
the immediate aims, and I would skip them for now, leaving your comment
in the spreadsheet.

Best,

Steffen

On 08.10.20 16:43, Yaroslav Halchenko wrote:
> On Wed, 07 Oct 2020, Andreas Tille wrote:
>> On Wed, Oct 07, 2020 at 10:30:34PM +0200, Pierre Gruet wrote:
>>> I have almost finished the initial packaging of sepp [0]. Besides the
>>> sepp program, upstream also provides the tipp program in the same
>>> tarball. Basically, tipp classifies sequences using sepp together with
>>> a collection of alignment/placement data and statistical methods.
>>> People installing tipp are invited to download a dataset (approx. 240 MB)
>>> [1] which does not belong to the same GitHub repository and has no
>>> license information inside it.
>>> Technically, I guess we might consider creating a sepp-data package with
>>> those data, but I also imagine this is not really feasible if we don't
>>> have much information about where those data come from, who collected
>>> them, ...
>>> Based on your experience, would you have some advice on this? My
>>> proposal is to leave tipp aside and only focus on sepp, which is ready.
>> If the data are not part of the source tarball, it might be an option
>> to provide both executables and add the documentation you are quoting
>> above.
> Continuing the thread of "what about datalad'ing it" that I started in
> "How to package human, mouse and viral genomes?",
>
> here is a quick demonstration of the `datalad-crawler` extension (a bit old
> but it still works; it is what started datalad years back).  Some notes before
> the cut/pasted dumps from the terminal:
>
> - anyone can now run
>    datalad install -g https://github.com/yarikoptic/tipp-reference-datalad
>   which will take care of downloading the tarball and extracting it
>
> - we could provide access to the "extracted" files in a clone somewhere on
>   debian infrastructure (and also mirror it e.g. on
>   http://datasets.datalad.org) -- all for redundant availability etc.
>
> - note that the .zip also contains a .tgz with the data besides the extracted files
>
> - the datalad install above would not remove the downloaded archive or even
>   its extracted "local cache" copy, so if this is to be "packaged", a
>   postinstall hook needs to take care of dropping downloaded archives and
>   running "datalad clean" (a sketch follows these notes)
>
> - since I said to crawl for all .zip files (not just releases etc.), I also
>   got a copy of master with the README.md extracted ;)
>
> - similar procedures (or via `datalad addurls`; a tiny example follows these
>   notes) could be done to instantiate datalad datasets (which are
>   git/git-annex repositories) for any dataset
>
> - I did approach "minting" debian packages from datalad dataset hierarchies
>   but never got it finished. Just FTR, the original elderly issue in datalad:
>   https://github.com/datalad/datalad/issues/156
>   I will be happy to collaborate etc.
>
> - pardon the typo in the name in the dump below
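>
>   To illustrate the postinstall note above, a minimal hypothetical hook
>   sketch; the install path is an assumption, and refpkg.tar.gz is the
>   annexed archive visible in the listing further below:
>
> 	#!/bin/sh
> 	# hypothetical postinst sketch: free the space taken by downloads
> 	set -e
> 	cd /usr/share/tipp-reference   # assumed install location of the dataset
> 	datalad drop refpkg.tar.gz     # drop the still-annexed archive content
> 	datalad clean                  # remove extracted "local cache" copies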
>
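>   And a tiny `datalad addurls` example of the same idea; the CSV file name
>   and its contents are made up for illustration, the URL is the one crawled
>   in the dump below:
>
> 	# urls.csv contains a "name,url" header plus one row per file, e.g.:
> 	#   tipp.zip,https://github.com/tandyw/tipp-reference/releases/download/v2.0.0/tipp.zip
> 	datalad addurls urls.csv '{url}' '{name}'
>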
>
> Ok, demo (datalad is in debian, but the datalad-crawler extension is not yet
> :-/ please help to package/maintain it -- it is on pypi)
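>
> e.g., to get the extension from PyPI in the meantime:
>
> 	pip install datalad-crawler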
>
> 	/tmp > datalad create tipp-refernece
> 	[INFO   ] Creating a new annex repo at /tmp/tipp-refernece
> 	[INFO   ] Scanning for unlocked files (this may take some time)
> 	create(ok): /tmp/tipp-refernece (dataset)
>
> 	/tmp > cd tipp-refernece
>
> 	/tmp/tipp-refernece > datalad crawl-init --save --template=simple_with_archives url=https://github.com/tandyw/tipp-reference/ a_href_match_=.*\.zip
> 	[INFO   ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/
> 	[INFO   ] Initiating special remote datalad-archives
>
> 	/tmp/tipp-refernece > datalad crawl
> 	[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
> 	[INFO   ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/
> 	[INFO   ] Running pipeline [<function Annexificator.switch_branch.<locals>.switch_branch at 0x7f69058ade50>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7f69050cbe80>, a_href_match(query='.*.zip'), <function fix_url at 0x7f6904882550>, <datalad_crawler.nodes.annex.Annexificator object at 0x7f69050cbf40>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7f6901c038b0>, [<function Annexificator.merge_branch.<locals>.merge_branch at 0x7f6901c039d0>, [find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function Annexificator.add_archive_content.<locals>._add_archive_content at 0x7f6901c03940>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7f6901c03a60>, <function Annexificator.merge_branch.<locals>.merge_branch at 0x7f6901c03af0>, <function Annexificator.finalize.<locals>._finalize at 0x7f6901c03b80>]
> 	[INFO   ] Found branch non-dirty -- nothing was committed
> 	[INFO   ] Checking out master into a new branch incoming
> 	[INFO   ] Fetching 'https://github.com/tandyw/tipp-reference/'
> 	[INFO   ] Need to download 607 Bytes from https://github.com/tandyw/tipp-reference/archive/master.zip. No progress indication will be reported
> 	[INFO   ] Need to download 246.5 MB from https://github.com/tandyw/tipp-reference/releases/download/v2.0.0/tipp.zip. No progress indication will be reported
> 	[INFO   ] Repository found dirty -- adding and committing
> 	[INFO   ] Checking out master into a new branch incoming-processed
> 	[INFO   ] Initiating 1 merge of incoming using strategy theirs
> 	[INFO   ] Adding content of the archive ./tipp.zip into annex AnnexRepo(/tmp/tipp-refernece)
> 	[INFO   ] Finished adding ./tipp.zip: Files processed: 725, renamed: 725, +annex: 725
> 	[INFO   ] Adding content of the archive ./tipp-reference-master.zip into annex AnnexRepo(/tmp/tipp-refernece)
> 	[INFO   ] Finished adding ./tipp-reference-master.zip: Files processed: 1, renamed: 1, +git: 1
> 	[INFO   ] Repository found dirty -- adding and committing
> 	[INFO   ] Checking out an existing branch master
> 	[INFO   ] Initiating 1 merge of incoming-processed using strategy None
> 	[INFO   ] Found branch non-dirty -- nothing was committed
> 	[INFO   ] House keeping: gc, repack and clean
> 	[INFO   ] Finished running pipeline: URLs processed: 2, downloaded: 2, size: 246.5 MB,  Files processed: 730, renamed: 726, +git: 1, +annex: 727,  Branches merged: incoming->incoming-processed
> 	[INFO   ] Total stats: URLs processed: 2, downloaded: 2, size: 246.5 MB,  Files processed: 730, renamed: 726, +git: 1, +annex: 727,  Branches merged: incoming->incoming-processed,  Datasets crawled: 1
> 	datalad crawl  35.87s user 7.15s system 60% cpu 1:11.01 total
>
> 	/tmp/tipp-refernece > ls
> 	README.md  blast/  refpkg/  refpkg.tar.gz@  taxonomy/
>
> 	/tmp/tipp-refernece > less refpkg.tar.gz
>
> 	/tmp/tipp-refernece > ls refpkg
> 	16S_archaea.refpkg/   COG0090.refpkg/  COG0172.refpkg/  COG0533.refpkg/       rplB.refpkg/  rpsB.refpkg/
> 	16S_bacteria.refpkg/  COG0091.refpkg/  COG0184.refpkg/  COG0541.refpkg/       rplC.refpkg/  rpsC.refpkg/
> 	16S_silva.refpkg/     COG0092.refpkg/  COG0185.refpkg/  COG0552.refpkg/       rplD.refpkg/  rpsE.refpkg/
> 	COG0012.refpkg/       COG0093.refpkg/  COG0186.refpkg/  COG9999.refpkg/       rplE.refpkg/  rpsI.refpkg/
> 	COG0016.refpkg/       COG0094.refpkg/  COG0197.refpkg/  dnaG.refpkg/          rplF.refpkg/  rpsJ.refpkg/
> 	COG0018.refpkg/       COG0096.refpkg/  COG0200.refpkg/  frr.refpkg/           rplK.refpkg/  rpsK.refpkg/
> 	COG0048.refpkg/       COG0097.refpkg/  COG0201.refpkg/  infC.refpkg/          rplL.refpkg/  rpsM.refpkg/
> 	COG0049.refpkg/       COG0098.refpkg/  COG0202.refpkg/  nusA.refpkg/          rplM.refpkg/  rpsS.refpkg/
> 	COG0052.refpkg/       COG0099.refpkg/  COG0215.refpkg/  pgk.refpkg/           rplN.refpkg/  smpB.refpkg/
> 	COG0080.refpkg/       COG0100.refpkg/  COG0256.refpkg/  pyrG1.refpkg/         rplP.refpkg/  train.refpkg/
> 	COG0081.refpkg/       COG0102.refpkg/  COG0495.refpkg/  pyrg.refpkg/          rplS.refpkg/
> 	COG0087.refpkg/       COG0103.refpkg/  COG0522.refpkg/  rdp_bacteria.refpkg/  rplT.refpkg/
> 	COG0088.refpkg/       COG0124.refpkg/  COG0525.refpkg/  rplA.refpkg/          rpmA.refpkg/
>
>
> Then I published it to GitHub:
> https://github.com/yarikoptic/tipp-reference-datalad (I had to manually set
> "master" to be the default branch; GitHub behaviors changed and datalad does
> not account for that yet, so I filed
> https://github.com/datalad/datalad/issues/4997)
>
> 	/tmp/tipp-refernece > datalad create-sibling-github tipp-reference-datalad
> 	.: github(-) [https://github.com/yarikoptic/tipp-reference-datalad.git (git)]
> 	'https://github.com/yarikoptic/tipp-reference-datalad.git' configured as sibling 'github' for Dataset(/tmp/tipp-refernece)
>
> 	/tmp/tipp-refernece > datalad push --to github
> 	publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/master->github:refs/heads/master [new branch]]
> 	publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]
>
> now, to update, just rerun `datalad crawl`; if there is nothing new, it does nothing:
>
> 	/tmp/tipp-refernece > datalad crawl
> 	[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
> 	[INFO   ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/
> 	[INFO   ] Running pipeline [<function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebff8ee0>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7fe9ef46f460>, a_href_match(query='.*.zip'), <function fix_url at 0x7fe9eebab550>, <datalad_crawler.nodes.annex.Annexificator object at 0x7fe9ef46f880>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebfa98b0>, [<function Annexificator.merge_branch.<locals>.merge_branch at 0x7fe9ebfa99d0>, [find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function Annexificator.add_archive_content.<locals>._add_archive_content at 0x7fe9ebfa9940>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebfa9a60>, <function Annexificator.merge_branch.<locals>.merge_branch at 0x7fe9ebfa9af0>, <function Annexificator.finalize.<locals>._finalize at 0x7fe9ebfa9b80>]
> 	[INFO   ] Found branch non-dirty -- nothing was committed
> 	[INFO   ] Checking out an existing branch incoming
> 	[INFO   ] Fetching 'https://github.com/tandyw/tipp-reference/'
> 	[INFO   ] Found branch non-dirty -- nothing was committed
> 	[INFO   ] Checking out an existing branch incoming-processed
> 	[INFO   ] Found branch non-dirty -- nothing was committed
> 	[INFO   ] Checking out an existing branch master
> 	[INFO   ] Finished running pipeline: URLs processed: 2,  Files processed: 2, skipped: 2
> 	[INFO   ] Total stats: URLs processed: 2,  Files processed: 2, skipped: 2,  Datasets crawled: 1
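>
> And to illustrate the consumer side of the published dataset: individual
> items can then be fetched on demand (16S_bacteria.refpkg is just one entry
> from the listing above):
>
> 	datalad install https://github.com/yarikoptic/tipp-reference-datalad
> 	cd tipp-reference-datalad
> 	datalad get refpkg/16S_bacteria.refpkg   # fetches only this refpkg's content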
>
>

