
Re: data sets and/or access to data sets



I like what I saw about biomaj. What it cannot do at the
moment (from what I understood) is express a runtime
dependency on a particular database version and have the
newly installed package trigger biomaj to perform
that step. Correct me if I am wrong, please. Could that be
added?

What we had in mind for getData is that it would be installed
as a pre-dependency of other Debian packages whose only
aim is to install data. When those packages are installed, they
place a description of how to download themselves in
/etc/getData.d . Their postinst script would then just
go and execute that (a rough sketch of the idea is below).
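
To make this a bit more concrete, here is a minimal sketch of what such a
trigger could look like. The key=value descriptor format, the download
destination under /var/lib/getData, and the read_descriptor()/fetch_all()
helpers are all assumptions for illustration, not getData's actual interface:

#!/usr/bin/env python3
# Hypothetical sketch: a data-only package ships a small descriptor in
# /etc/getData.d, and its postinst runs a command like this one to fetch
# the data it describes.  The key=value format used here is an assumption,
# not the format getData actually uses.

import os
import urllib.request

GETDATA_DIR = "/etc/getData.d"     # descriptors dropped in by data packages
TARGET_DIR = "/var/lib/getData"    # hypothetical download destination


def read_descriptor(path):
    """Parse a simple key=value descriptor file (assumed format)."""
    entry = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                entry[key.strip()] = value.strip()
    return entry


def fetch_all():
    """Download every data set described in /etc/getData.d."""
    os.makedirs(TARGET_DIR, exist_ok=True)
    for name in sorted(os.listdir(GETDATA_DIR)):
        entry = read_descriptor(os.path.join(GETDATA_DIR, name))
        url = entry.get("url")
        if not url:
            continue
        dest = os.path.join(TARGET_DIR, entry.get("filename", name))
        print("Fetching %s -> %s" % (url, dest))
        urllib.request.urlretrieve(url, dest)


if __name__ == "__main__":
    fetch_all()

The postinst of a data-only package would then do little more than call
such a command (or getData itself) once its descriptor has been unpacked.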


Many greetings

Steffen



On 02/16/2011 06:28 PM, Olivier Sallou wrote:
> Data versioning is very difficult, as data sources generally do not keep
> "old" versions online, only the current one.
> 
> With biomaj we propose to keep old versions (or a number of old
> versions), but this is local; it cannot help to reproduce an
> experiment with exactly the same data if the remote source changed them....
> 
> Granularity is indeed an issue. In our use of Biomaj, we see a lot of
> different requests from our users (biologists). Some indeed just want a
> few chromosomes, others expect a full database (GEO, Uniprot etc...)
> Furthermore, you need to be sure to have the infrastructure to hold the
> data (downloading the full GenBank... is quite big).
> 
> While you will often check this when you download data manually, the
> ease of use of a package could let the user "skip" this check. At the
> very least they should be warned at install time about disk requirements....
> 
> 
> Olivier
> 
> 
> 
> On 2/16/11 6:16 PM, Scott Christley wrote:
>> I don't disagree, in principle.  There are many nice aspects to the
>> debian packaging as you indicate.  We don't want to replicate the 100s
>> of terabytes of data into the debian repository, so any "package"
>> would not have the real data but would download the data from its
>> source during the package install.  Maybe through pre/post install
>> scripts?  I'm not overly familiar with those capabilities but it seems
>> plausible to me.
>>
>> However, it does leave open an interesting question.  Exactly what
>> granularity of data belongs in a "package"?  A genome sounds good, but
>> there are already thousands of genomes.  There are thousands of
>> microarray experiments.  And there are millions of sequence entries in
>> GenBank.  It is plausible that the user would want access to
>> individual sequences.  So the idea of managing thousands of "packages"
>> starts to sound pretty cumbersome.
>>
>> Versioning of data is definitely an important issue that is somewhat
>> overlooked, especially if scientists want to reproduce results from
>> another researcher or from a paper: if you try to redo an experiment
>> from many years ago, newer data could produce different results. 
>> Galaxy[1] is one effort to get scientists to catalog reproducible
>> workflows, and while it has some support for acquiring data, its main
>> focus is on the analysis process.  I think the issue of "workflow
>> governance" is still an open question.
>>
>> cheers
>> Scott
>>
>> [1] http://galaxy.psu.edu/
>>
>> On Feb 15, 2011, at 6:18 PM, Yaroslav Halchenko wrote:
>>
>>> well -- this issue is tangentially related to the software: why should
>>> we care about having Debian packages while there are CRAN, easy_install,
>>> etc. -- all those great tools to deploy software, domain-specific and
>>> created by specialists?  Although such a comparison is a stretch, I think
>>> it has its own merits.  Encapsulating (at least core sets of) data into
>>> Debian packages integrates it nicely into the world of
>>> software within Debian, with clear and uniform means of specifying
>>> dependencies on data, how to install it, where to look for legal
>>> information, the same canonical location for related software and data,
>>> etc.  Versioned dependencies become an especially relevant aspect in the
>>> construction of regression tests for software depending on
>>> corresponding data packages, e.g.
>>> http://neuro.debian.net/pkgs/fsl-feeds.html.
>>>
>>> I am not suggesting we replace all those data provider systems created
>>> by professionals ;)  I am talking about complementing them
>>> whenever feasible/sensible for Debian's needs/purposes.
>>>
>>> On Tue, 15 Feb 2011, Scott Christley wrote:
>>>
>>>
>>>> I think putting the data itself into the debian repository is
>>>> problematic.  Regardless of any licensing issue, the sheer amount of
>>>> data is too great.  Better to let the professionals who are getting
>>>> paid to manage the data (NCBI, KEGG, etc.) do so, and download directly
>>>> from those sites.  Pretty much all of them have ftp/http access for
>>>> acquiring data.
>>>> I like the getData effort.  Have a set of "data descriptors" with
>>>> information about how/where to get data, then perform the download
>>>> when requested.  This is very much the architecture I was
>>>> thinking about.  I see a number of ways the project could be
>>>> expanded.  I would like to hear thoughts from Steffen and Charles
>>>> about getData before I jump in with a bunch of additions.
>>>> The biomaj project looks interesting as well.  One possibility is
>>>> to use it as the underlying data retrieval layer, but it also may be
>>>> "too complex" for basic retrieval functions.
>>>> Scott
>>> -- 
>>> =------------------------------------------------------------------=
>>> Keep in touch                                     www.onerussian.com
>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
>>
> 

