
Re: data sets and/or access to data sets



I don't disagree, in principle.  There are many nice aspects to Debian packaging, as you indicate.  We don't want to replicate the hundreds of terabytes of data into the Debian repository, so any "package" would not contain the real data but would download it from its source during package installation, perhaps through pre/post-install scripts.  I'm not overly familiar with those capabilities, but it seems plausible to me.
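
Just to make that idea concrete, here is a minimal sketch of a fetch-on-install step.  This is an illustration only: the dataset name, URL, paths, and checksum are all invented, and an actual maintainer script would typically be a small shell wrapper around something like it (written here in Python for readability).

    #!/usr/bin/env python3
    # Hypothetical fetch-on-install helper that a package's postinst could
    # invoke.  The URL, destination path, and checksum are placeholders.
    import hashlib
    import os
    import urllib.request

    DATA_URL = "ftp://ftp.example.org/genomes/example/genome.fa.gz"  # invented
    DEST_DIR = "/var/lib/bio-data/example-genome"                    # invented
    DEST = os.path.join(DEST_DIR, "genome.fa.gz")
    SHA256 = "0" * 64  # a real package would ship the published checksum

    def fetch():
        os.makedirs(DEST_DIR, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, DEST)
        # Verify against the checksum shipped in the package, so a corrupted
        # or tampered download is caught at install time.
        digest = hashlib.sha256()
        with open(DEST, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != SHA256:
            os.remove(DEST)
            raise SystemExit("checksum mismatch; refusing to keep the download")

    if __name__ == "__main__":
        fetch()

A nice side effect of downloading in the maintainer scripts is that the package itself stays tiny, and a matching removal script could delete the data again when the package is purged.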

However, it does leave open an interesting question.  Exactly what granularity of data belongs in a "package"?  A genome sounds good, but there are already thousands of genomes.  There are thousands of microarray experiments.  And there are millions of sequence entries in GenBank.  It is plausible that the user would want access to individual sequences.  So the idea of managing thousands of "packages" starts to sound pretty cumbersome.

Versioning of data is definitely an important issue, and one that is somewhat overlooked, especially when scientists want to reproduce results from another researcher or from a paper: if you try to redo an experiment from many years ago, newer data could produce different results.  Galaxy[1] is one effort to get scientists to catalog reproducible workflows, and while it has some support for acquiring data, its main focus is on the analysis process.  I think the issue of "workflow governance" is still an open question.
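
To make the versioning point concrete: if core data sets were packaged, an analysis tool could pin the exact data release its published results were produced against.  In hypothetical debian/control terms (the package names and version numbers here are invented), that might look like:

    Package: my-analysis-pipeline
    Depends: bio-data-hg19 (= 19.0-1)
    Description: pipeline validated against one specific data release
     Results are only claimed to be reproducible against the exact
     bio-data-hg19 release named in Depends.

That is essentially what the fsl-feeds regression tests you mention below do for software; the same mechanism would serve data-driven experiments.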

cheers
Scott

[1] http://galaxy.psu.edu/

On Feb 15, 2011, at 6:18 PM, Yaroslav Halchenko wrote:

> well -- this issue is tangentially related to the software: why should
> we care about having Debian packages while there are CRAN, easy_install,
> etc. -- all those great tools to deploy software, domain-specific and
> created by specialists?  Although such a comparison is a stretch, I
> think it has its own merits.  Encapsulating (at least core sets of)
> data in Debian packages integrates it nicely within the world of
> software within Debian, with clear and uniform means of specifying
> dependencies on data, installing it, finding legal information,
> locating related software and data, etc.  Versioned dependencies
> become an especially relevant aspect in the construction of regression
> tests for software that depends on the corresponding data packages,
> e.g. http://neuro.debian.net/pkgs/fsl-feeds.html.
> 
> I am not suggesting that we replace all those data provider systems
> created by professionals ;)  I am talking about complementing them
> whenever feasible/sensible for Debian's needs/purposes.
> 
> On Tue, 15 Feb 2011, Scott Christley wrote:
> 
> 
>> I think putting the data itself into the Debian repository is problematic.  Regardless of any licensing issues, the sheer amount of data is too great.  Better to let the professionals who are paid to manage the data (NCBI, KEGG, etc.) do so, and to download directly from those sites.  Pretty much all of them offer ftp/http access for acquiring data.
> 
>> I like the getData effort: have a set of "data descriptors" with information about how/where to get data, then perform the download when requested.  This is very much the architecture I was thinking about, and I see a number of ways the project could be expanded.  I would like to hear thoughts from Steffen and Charles about getData before I jump in with a bunch of additions.
> 
>> The biomaj project looks interesting as well.  One possibility is to use it as the underlying data-retrieval layer, but it may also be "too complex" for basic retrieval functions.
> 
>> Scott
> -- 
> =------------------------------------------------------------------=
> Keep in touch                                     www.onerussian.com
> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic

