On Wed, 07 Oct 2020, Andreas Tille wrote:
> On Wed, Oct 07, 2020 at 10:30:34PM +0200, Pierre Gruet wrote:
> > I have almost finished the initial packaging of sepp [0]. Beside the
> > sepp program, upstream also provides the tipp program in the same
> > tarball. Basically, tipp classifies sequences using sepp and a
> > collection of alignments and placements data and statistical methods.
> > People installing tipp are invited to download a dataset (approx. 240 MB)
> > [1] which does not belong to the same Github repository and has no
> > license information inside it.
> >
> > Technically, I guess we might consider creating a sepp-data package with
> > those data, but I also imagine this is not really feasible if we don't
> > have much information about where those data come from, who collected
> > them, ...
> >
> > Based on your experience, would you have some advice on this? My
> > proposal is to leave tipp aside and only focus on sepp, which is ready.
>
> If the data are not part of the source tarball it might be an option
> to provide both executables and add the documentation you are quoting
> above.

Continuing the thread of "what about datalad'ing it?" which I started in
"How to package human, mouse and viral genomes?", here is a quick
demonstration of the `datalad-crawler` extension (a bit old but it still
works; it is what started datalad years back).

Some notes before the cut/pasted dumps from the terminal:

- Anyone can now run

      datalad install -g https://github.com/yarikoptic/tipp-reference-datalad

  which will take care of downloading the tarball and extracting it.
- We could provide access to the "extracted" files in a clone somewhere on
  Debian infrastructure (and also mirror them, e.g. on
  http://datasets.datalad.org), all for redundant availability etc.
- Note that the .zip also contains a .tgz with the data, besides the
  extracted files.
- The "datalad install" above would not remove the downloaded archive, or
  even its copy extracted to the local cache. So if this is to be
  "packaged", a postinstall hook would need to take care of dropping the
  downloaded archives and running "datalad clean".
- Since I told it to crawl for all .zip files (not just releases etc.), I
  also got a copy of master with the README.md extracted ;)
- Similar procedures (or `datalad addurls`) could be used to instantiate
  datalad datasets (which are git/git-annex repositories) for any dataset.
- I did approach "minting" Debian packages from datalad dataset hierarchies
  but never got it finished. Just FTR, here is the original elderly issue
  in datalad: https://github.com/datalad/datalad/issues/156

Will be happy to collaborate etc. Pardon the typo in the dataset name in
the dumps below.

Ok, demo (datalad is in Debian, but the datalad-crawler extension is not
yet :-/ please help to package/maintain it; it is on pypi):

/tmp > datalad create tipp-refernece
[INFO ] Creating a new annex repo at /tmp/tipp-refernece
[INFO ] Scanning for unlocked files (this may take some time)
create(ok): /tmp/tipp-refernece (dataset)

/tmp > cd tipp-refernece

/tmp/tipp-refernece > datalad crawl-init --save --template=simple_with_archives url=https://github.com/tandyw/tipp-reference/ a_href_match_=.*\.zip
[INFO ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/
[INFO ] Initiating special remote datalad-archives

/tmp/tipp-refernece > datalad crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/
[INFO ] Running pipeline [<function Annexificator.switch_branch.<locals>.switch_branch at 0x7f69058ade50>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7f69050cbe80>, a_href_match(query='.*.zip'), <function fix_url at 0x7f6904882550>, <datalad_crawler.nodes.annex.Annexificator object at 0x7f69050cbf40>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7f6901c038b0>, [<function Annexificator.merge_branch.<locals>.merge_branch at 0x7f6901c039d0>,
[find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function Annexificator.add_archive_content.<locals>._add_archive_content at 0x7f6901c03940>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7f6901c03a60>, <function Annexificator.merge_branch.<locals>.merge_branch at 0x7f6901c03af0>, <function Annexificator.finalize.<locals>._finalize at 0x7f6901c03b80>]
[INFO ] Found branch non-dirty -- nothing was committed
[INFO ] Checking out master into a new branch incoming
[INFO ] Fetching 'https://github.com/tandyw/tipp-reference/'
[INFO ] Need to download 607 Bytes from https://github.com/tandyw/tipp-reference/archive/master.zip. No progress indication will be reported
[INFO ] Need to download 246.5 MB from https://github.com/tandyw/tipp-reference/releases/download/v2.0.0/tipp.zip. No progress indication will be reported
[INFO ] Repository found dirty -- adding and committing
[INFO ] Checking out master into a new branch incoming-processed
[INFO ] Initiating 1 merge of incoming using strategy theirs
[INFO ] Adding content of the archive ./tipp.zip into annex AnnexRepo(/tmp/tipp-refernece)
[INFO ] Finished adding ./tipp.zip: Files processed: 725, renamed: 725, +annex: 725
[INFO ] Adding content of the archive ./tipp-reference-master.zip into annex AnnexRepo(/tmp/tipp-refernece)
[INFO ] Finished adding ./tipp-reference-master.zip: Files processed: 1, renamed: 1, +git: 1
[INFO ] Repository found dirty -- adding and committing
[INFO ] Checking out an existing branch master
[INFO ] Initiating 1 merge of incoming-processed using strategy None
[INFO ] Found branch non-dirty -- nothing was committed
[INFO ] House keeping: gc, repack and clean
[INFO ] Finished running pipeline: URLs processed: 2, downloaded: 2, size: 246.5 MB, Files processed: 730, renamed: 726, +git: 1, +annex: 727, Branches merged: incoming->incoming-processed
[INFO ] Total stats: URLs processed: 2, downloaded: 2, size: 246.5 MB, Files processed: 730, renamed: 726, +git: 1, +annex: 727, Branches merged: incoming->incoming-processed, Datasets crawled: 1
datalad crawl  35.87s user 7.15s system 60% cpu 1:11.01 total

/tmp/tipp-refernece > ls
README.md  blast/  refpkg/  refpkg.tar.gz@  taxonomy/

/tmp/tipp-refernece > less refpkg.tar.gz

/tmp/tipp-refernece > ls refpkg
16S_archaea.refpkg/   COG0090.refpkg/  COG0172.refpkg/  COG0533.refpkg/       rplB.refpkg/  rpsB.refpkg/
16S_bacteria.refpkg/  COG0091.refpkg/  COG0184.refpkg/  COG0541.refpkg/       rplC.refpkg/  rpsC.refpkg/
16S_silva.refpkg/     COG0092.refpkg/  COG0185.refpkg/  COG0552.refpkg/       rplD.refpkg/  rpsE.refpkg/
COG0012.refpkg/       COG0093.refpkg/  COG0186.refpkg/  COG9999.refpkg/       rplE.refpkg/  rpsI.refpkg/
COG0016.refpkg/       COG0094.refpkg/  COG0197.refpkg/  dnaG.refpkg/          rplF.refpkg/  rpsJ.refpkg/
COG0018.refpkg/       COG0096.refpkg/  COG0200.refpkg/  frr.refpkg/           rplK.refpkg/  rpsK.refpkg/
COG0048.refpkg/       COG0097.refpkg/  COG0201.refpkg/  infC.refpkg/          rplL.refpkg/  rpsM.refpkg/
COG0049.refpkg/       COG0098.refpkg/  COG0202.refpkg/  nusA.refpkg/          rplM.refpkg/  rpsS.refpkg/
COG0052.refpkg/       COG0099.refpkg/  COG0215.refpkg/  pgk.refpkg/           rplN.refpkg/  smpB.refpkg/
COG0080.refpkg/       COG0100.refpkg/  COG0256.refpkg/  pyrG1.refpkg/         rplP.refpkg/  train.refpkg/
COG0081.refpkg/       COG0102.refpkg/  COG0495.refpkg/  pyrg.refpkg/          rplS.refpkg/
COG0087.refpkg/       COG0103.refpkg/  COG0522.refpkg/  rdp_bacteria.refpkg/  rplT.refpkg/
COG0088.refpkg/       COG0124.refpkg/  COG0525.refpkg/  rplA.refpkg/          rpmA.refpkg/

Then I published it to github: https://github.com/yarikoptic/tipp-reference-datalad
(I had to manually set "master" to be the default branch; github's behavior
changed and datalad does not yet account for that, so I filed
https://github.com/datalad/datalad/issues/4997)

/tmp/tipp-refernece > datalad create-sibling-github tipp-reference-datalad
.: github(-) [https://github.com/yarikoptic/tipp-reference-datalad.git (git)]
'https://github.com/yarikoptic/tipp-reference-datalad.git' configured as sibling 'github' for Dataset(/tmp/tipp-refernece)
/tmp/tipp-refernece > datalad push --to github
publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/master->github:refs/heads/master [new branch]]
publish(ok): /tmp/tipp-refernece (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]

Now, to update, just rerun "datalad crawl"; if there is nothing new, it
does nothing:

/tmp/tipp-refernece > datalad crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline to crawl data files from https://github.com/tandyw/tipp-reference/
[INFO ] Running pipeline [<function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebff8ee0>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7fe9ef46f460>, a_href_match(query='.*.zip'), <function fix_url at 0x7fe9eebab550>, <datalad_crawler.nodes.annex.Annexificator object at 0x7fe9ef46f880>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebfa98b0>, [<function Annexificator.merge_branch.<locals>.merge_branch at 0x7fe9ebfa99d0>, [find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function Annexificator.add_archive_content.<locals>._add_archive_content at 0x7fe9ebfa9940>]], <function Annexificator.switch_branch.<locals>.switch_branch at 0x7fe9ebfa9a60>, <function Annexificator.merge_branch.<locals>.merge_branch at 0x7fe9ebfa9af0>, <function Annexificator.finalize.<locals>._finalize at 0x7fe9ebfa9b80>]
[INFO ] Found branch non-dirty -- nothing was committed
[INFO ] Checking out an existing branch incoming
[INFO ] Fetching 'https://github.com/tandyw/tipp-reference/'
[INFO ] Found branch non-dirty -- nothing was committed
[INFO ] Checking out an existing branch incoming-processed
[INFO ] Found branch non-dirty -- nothing was committed
[INFO ] Checking out an existing branch master
[INFO ] Finished running pipeline: URLs processed: 2, Files processed: 2, skipped: 2
[INFO ] Total stats: URLs processed: 2, Files processed: 2, skipped: 2, Datasets crawled: 1

--
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW: http://www.linkedin.com/in/yarik
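P.S. For completeness, here is a minimal sketch of the consumer-side step
the notes above allude to: a postinstall-hook-style script that installs
the dataset and then clears the local archive cache. It is a sketch under
assumptions, not the actual hook: it assumes datalad and git-annex are
installed and uses the dataset URL published above, and it simply skips if
datalad is not available.

```shell
#!/bin/sh
# Sketch of a postinstall-hook-style cleanup step.
# Assumptions: datalad + git-annex installed; dataset URL as published above.
set -eu

url="https://github.com/yarikoptic/tipp-reference-datalad"
status=skipped

if command -v datalad >/dev/null 2>&1; then
    # obtain the dataset together with its content (-g == --get-data)
    datalad install -g "$url" tipp-reference-datalad
    # drop the locally cached extracted copies of the archives to save space
    ( cd tipp-reference-datalad && datalad clean )
    status=done
fi

echo "cleanup: $status"
```

A real sepp-data package would of course pin a specific dataset version
rather than whatever the crawled master currently points to.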