
Re: Sepp : including a dataset?



On Wed, 07 Oct 2020, Andreas Tille wrote:
> On Wed, Oct 07, 2020 at 10:30:34PM +0200, Pierre Gruet wrote:
> > I have almost finished the initial packaging of sepp [0]. Beside the
> > sepp program, upstream also provides the tipp program in the same
> > tarball. Basically, tipp classifies sequences using sepp and a
> > collection of alignments and placements data and statistical methods.
> > People installing tipp are invited to download a dataset (approx. 240 MB)
> > [1] which does not belong to the same Github repository and has no
> > license information inside it.

> > Technically, I guess we might consider creating a sepp-data package with
> > those data, but I also imagine this is not really feasible if we don't
> > have much information about where those data come from, who collected
> > them, ...

> > Based on your experience, would you have some advice on this? My
> > proposal is to leave tipp aside and only focus on sepp, which is ready.

> If the data are not part of the source tarball it might be an option
> to provide both executables and add the documentation you are quoting
> above.

Continuing the thread "what about datalad'ing it?" that I started in
"How to package human, mouse and viral genomes?",

here is a quick demonstration of the `datalad-crawler` extension (a bit old but
still works; it is what started datalad years back).  Some notes before the
cut/pasted dumps from the terminal:

- anyone can now run
   datalad install -g https://github.com/yarikoptic/tipp-reference-datalad
  which takes care of downloading the tarball and extracting it

- we could provide access to the "extracted" files in a clone somewhere on Debian
  infrastructure (and also mirror it e.g. on http://datasets.datalad.org) -- all
  for redundant availability etc.

- note that the .zip also contains a .tgz with the data, besides the extracted files

- the `datalad install` above would not remove the downloaded archive or even its
  extracted copy in the local cache, so if this is to be "packaged", a postinstall
  hook needs to take care of dropping the downloaded archives and running
  `datalad clean`

- since I said to crawl all .zip files (not just the release one etc.), I also got
  a copy of master with the README.md extracted ;)

- similar procedures (or ones via `datalad addurls`) could be used to instantiate
  datalad datasets (which are git/git-annex repositories) for any dataset

- I did approach "minting" Debian packages from datalad dataset hierarchies but
  never got it finished. Just FTR, the original elderly issue in datalad:
  https://github.com/datalad/datalad/issues/156
  I will be happy to collaborate etc.

- pardon the typo in the dataset name ("tipp-refernece") in the dump below
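
To make the postinstall idea above concrete, here is a minimal sketch (not a
tested maintainer script): the install location and the TIPP_DEMO guard variable
are my own inventions for illustration, and the whole thing is gated so that
running it as-is performs no network operations.

```shell
#!/bin/sh
# Sketch of what a postinstall hook could do: install the dataset with
# its data, then clean up the cached archives datalad keeps around.
# DIR is a hypothetical install location; TIPP_DEMO is a made-up guard
# so this sketch is a no-op unless explicitly enabled.
DIR=${TIPP_DIR:-/srv/tipp-reference}
if [ -n "${TIPP_DEMO:-}" ] && command -v datalad >/dev/null 2>&1; then
    datalad install -g -s https://github.com/yarikoptic/tipp-reference-datalad "$DIR"
    # remove the downloaded/extracted archive caches left by the install
    datalad clean -d "$DIR"
else
    echo "skipping (set TIPP_DEMO=1 with datalad installed to run)"
fi
```

Whether `datalad clean` alone is enough to reclaim the space, or the archive
keys also need an explicit drop, would need checking against the datalad docs.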


Ok, demo (datalad is in Debian, but the datalad-crawler extension is not yet :-/
please help to package/maintain it -- it is on PyPI)

	/tmp > datalad create tipp-refernece
	[INFO   ] Creating a new annex repo at /tmp/tipp-refernece 
	[INFO   ] Scanning for unlocked files (this may take some time) 
	create(ok): /tmp/tipp-refernece (dataset)

	/tmp > cd tipp-refernece

	/tmp/tipp-refernece > datalad crawl-init --save --template=simple_with_archives url=https://github.com/tandyw/tipp-reference/ a_href_match_=.*\.zip                      
	[INFO   ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/ 
	[INFO   ] Initiating special remote datalad-archives 

	/tmp/tipp-refernece > datalad crawl
	[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg 
	[INFO   ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/ 
	[INFO   ] Running pipeline [<function Annexificator.switch_branch.<locals>.switch_branch at 0x7f69058ade50>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7f69050cbe80>, a_href_match(query='.*.zip'), <function fix_url at 0x7f6904882550>, <datalad_crawler.nodes.annex.Annexificator object at 0x7f69050cbf40>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7f6901c038b0>, [<function Annexificator.merge_branch.<locals>.merge_branch at 0x7f6901c039d0>, [find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function Annexificator.add_archive_content.<locals>._add_archive_content at 0x7f6901c03940>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7f6901c03a60>, <function Annexificator.merge_branch.<locals>.merge_branch at 0x7f6901c03af0>, <function Annexificator.finalize.<locals>._finalize at 0x7f6901c03b80>] 
	[INFO   ] Found branch non-dirty -- nothing was committed 
	[INFO   ] Checking out master into a new branch incoming 
	[INFO   ] Fetching 'https://github.com/tandyw/tipp-reference/' 
	[INFO   ] Need to download 607 Bytes from https://github.com/tandyw/tipp-reference/archive/master.zip. No progress indication will be reported 
	[INFO   ] Need to download 246.5 MB from https://github.com/tandyw/tipp-reference/releases/download/v2.0.0/tipp.zip. No progress indication will be reported 
	[INFO   ] Repository found dirty -- adding and committing 
	[INFO   ] Checking out master into a new branch incoming-processed 
	[INFO   ] Initiating 1 merge of incoming using strategy theirs 
	[INFO   ] Adding content of the archive ./tipp.zip into annex AnnexRepo(/tmp/tipp-refernece) 
	[INFO   ] Finished adding ./tipp.zip: Files processed: 725, renamed: 725, +annex: 725 
	[INFO   ] Adding content of the archive ./tipp-reference-master.zip into annex AnnexRepo(/tmp/tipp-refernece) 
	[INFO   ] Finished adding ./tipp-reference-master.zip: Files processed: 1, renamed: 1, +git: 1 
	[INFO   ] Repository found dirty -- adding and committing 
	[INFO   ] Checking out an existing branch master 
	[INFO   ] Initiating 1 merge of incoming-processed using strategy None 
	[INFO   ] Found branch non-dirty -- nothing was committed 
	[INFO   ] House keeping: gc, repack and clean 
	[INFO   ] Finished running pipeline: URLs processed: 2, downloaded: 2, size: 246.5 MB,  Files processed: 730, renamed: 726, +git: 1, +annex: 727,  Branches merged: incoming->incoming-processed 
	[INFO   ] Total stats: URLs processed: 2, downloaded: 2, size: 246.5 MB,  Files processed: 730, renamed: 726, +git: 1, +annex: 727,  Branches merged: incoming->incoming-processed,  Datasets crawled: 1 
	datalad crawl  35.87s user 7.15s system 60% cpu 1:11.01 total

	/tmp/tipp-refernece > ls 
	README.md  blast/  refpkg/  refpkg.tar.gz@  taxonomy/

	/tmp/tipp-refernece > less refpkg.tar.gz 

	/tmp/tipp-refernece > ls refpkg
	16S_archaea.refpkg/   COG0090.refpkg/  COG0172.refpkg/  COG0533.refpkg/       rplB.refpkg/  rpsB.refpkg/
	16S_bacteria.refpkg/  COG0091.refpkg/  COG0184.refpkg/  COG0541.refpkg/       rplC.refpkg/  rpsC.refpkg/
	16S_silva.refpkg/     COG0092.refpkg/  COG0185.refpkg/  COG0552.refpkg/       rplD.refpkg/  rpsE.refpkg/
	COG0012.refpkg/       COG0093.refpkg/  COG0186.refpkg/  COG9999.refpkg/       rplE.refpkg/  rpsI.refpkg/
	COG0016.refpkg/       COG0094.refpkg/  COG0197.refpkg/  dnaG.refpkg/          rplF.refpkg/  rpsJ.refpkg/
	COG0018.refpkg/       COG0096.refpkg/  COG0200.refpkg/  frr.refpkg/           rplK.refpkg/  rpsK.refpkg/
	COG0048.refpkg/       COG0097.refpkg/  COG0201.refpkg/  infC.refpkg/          rplL.refpkg/  rpsM.refpkg/
	COG0049.refpkg/       COG0098.refpkg/  COG0202.refpkg/  nusA.refpkg/          rplM.refpkg/  rpsS.refpkg/
	COG0052.refpkg/       COG0099.refpkg/  COG0215.refpkg/  pgk.refpkg/           rplN.refpkg/  smpB.refpkg/
	COG0080.refpkg/       COG0100.refpkg/  COG0256.refpkg/  pyrG1.refpkg/         rplP.refpkg/  train.refpkg/
	COG0081.refpkg/       COG0102.refpkg/  COG0495.refpkg/  pyrg.refpkg/          rplS.refpkg/
	COG0087.refpkg/       COG0103.refpkg/  COG0522.refpkg/  rdp_bacteria.refpkg/  rplT.refpkg/
	COG0088.refpkg/       COG0124.refpkg/  COG0525.refpkg/  rplA.refpkg/          rpmA.refpkg/


Then I published it to GitHub:
https://github.com/yarikoptic/tipp-reference-datalad (I had to manually set
"master" to be the default branch; GitHub's behavior changed and datalad does
not account for that yet -- I filed
https://github.com/datalad/datalad/issues/4997)

	/tmp/tipp-refernece > datalad create-sibling-github tipp-reference-datalad
	.: github(-) [https://github.com/yarikoptic/tipp-reference-datalad.git (git)]
	'https://github.com/yarikoptic/tipp-reference-datalad.git' configured as sibling 'github' for Dataset(/tmp/tipp-refernece)

	/tmp/tipp-refernece > datalad push --to github                     
	publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/master->github:refs/heads/master [new branch]]            
	publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]  

Now, to update, just rerun `datalad crawl`; if there is nothing new, it does nothing:

	/tmp/tipp-refernece > datalad crawl
	[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg 
	[INFO   ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/ 
	[INFO   ] Running pipeline [<function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebff8ee0>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7fe9ef46f460>, a_href_match(query='.*.zip'), <function fix_url at 0x7fe9eebab550>, <datalad_crawler.nodes.annex.Annexificator object at 0x7fe9ef46f880>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebfa98b0>, [<function Annexificator.merge_branch.<locals>.merge_branch at 0x7fe9ebfa99d0>, [find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function Annexificator.add_archive_content.<locals>._add_archive_content at 0x7fe9ebfa9940>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebfa9a60>, <function Annexificator.merge_branch.<locals>.merge_branch at 0x7fe9ebfa9af0>, <function Annexificator.finalize.<locals>._finalize at 0x7fe9ebfa9b80>] 
	[INFO   ] Found branch non-dirty -- nothing was committed 
	[INFO   ] Checking out an existing branch incoming 
	[INFO   ] Fetching 'https://github.com/tandyw/tipp-reference/' 
	[INFO   ] Found branch non-dirty -- nothing was committed 
	[INFO   ] Checking out an existing branch incoming-processed 
	[INFO   ] Found branch non-dirty -- nothing was committed 
	[INFO   ] Checking out an existing branch master 
	[INFO   ] Finished running pipeline: URLs processed: 2,  Files processed: 2, skipped: 2 
	[INFO   ] Total stats: URLs processed: 2,  Files processed: 2, skipped: 2,  Datasets crawled: 1 
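
On the consumer side, a clone does not have to pull the full 246.5 MB up front;
something like the following sketch should fetch just one reference package
(again gated behind the made-up TIPP_DEMO variable so it does not touch the
network when run as-is; the refpkg path is taken from the listing above).

```shell
#!/bin/sh
# Sketch: install the published dataset without data, then `get` only
# the reference package actually needed.  TIPP_DEMO is a made-up guard.
if [ -n "${TIPP_DEMO:-}" ] && command -v datalad >/dev/null 2>&1; then
    datalad install https://github.com/yarikoptic/tipp-reference-datalad
    cd tipp-reference-datalad || exit 1
    # fetches only this refpkg's content instead of everything
    datalad get refpkg/16S_bacteria.refpkg
else
    echo "skipping (set TIPP_DEMO=1 with datalad installed to run)"
fi
```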


-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik        


