
Re: data sets and/or access to data sets



A "bank" package could contain a biomaj property file (each bank has its own property file), put in biomaj bank directory (or symlink) and postinstall could trigger biomaj to make the update if required

A bank package would need:
- a property file describing where to download the data and which processes to apply;
- optionally, post-process files specific to the bank's treatment, or wrappers that use the environment variables Biomaj provides.
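
For illustration only, the postinst of such a bank package could look roughly like this (the bank name, the property-file location and the exact Biomaj command line are assumptions; check the Biomaj documentation for the real invocation):

  #!/bin/sh
  # postinst sketch for a hypothetical "biomaj-bank-uniprot" package.
  # The package ships its property file into the Biomaj bank directory;
  # the postinst only asks Biomaj to update that bank.
  set -e
  case "$1" in
      configure)
          # assumed command line, may differ between Biomaj versions
          biomaj.sh --update --bank uniprot \
              || echo "Biomaj update failed, please run it manually" >&2
          ;;
  esac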

Binding to a particular version is more a problem of the remote site, which does not always allow it.... Biomaj could hold one bank property file per bank version if needed; in that case, the remote URL would point to a location specific to that bank version.
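
As a rough sketch of that idea, a per-version property file would mainly differ in its remote location (key names and the release path below are illustrative, not verbatim Biomaj syntax):

  # uniprot_2011_02.properties -- illustrative key names only
  db.name=uniprot_2011_02
  protocol=ftp
  server=ftp.example.org
  # release-specific directory; only works if the provider keeps it online
  remote.dir=/pub/databases/uniprot/releases/2011_02/
  remote.files=uniprot_sprot.dat.gz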

Olivier

On 2/17/11 11:00 AM, Steffen Möller wrote:
I like what I saw about biomaj. What it cannot do for the
moment (from what I understood) is to express a runtime
dependency against a particular database version and have
the then-installed package trigger biomaj to perform
that step. Correct me if I am wrong, please. Could that be
added?

What we had in mind for getData is that it would be installed
as a pre-dependency of other Debian packages that only
aim at installing data. When they are installed, they
place a description of how to download themselves in
/etc/getData.d. Their post-inst script would then just
go and execute that.
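
A minimal sketch of what such a data package could ship and do (the descriptor format, file names and the getData calling convention are made up here, just to show the shape of the idea):

  # /etc/getData.d/hypothetical-genome -- made-up descriptor format
  name=hypothetical-genome
  url=ftp://ftp.example.org/pub/genomes/hypothetical/current/
  destination=/var/lib/data/hypothetical-genome

  #!/bin/sh
  # postinst sketch of the data package: let getData act on the descriptor
  set -e
  if [ "$1" = "configure" ]; then
      getData hypothetical-genome   # assumed calling convention
  fi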


Many greetings

Steffen



On 02/16/2011 06:28 PM, Olivier Sallou wrote:
Data versioning is very difficult, as data sources do not always keep "old"
versions online, only a current one.

With biomaj we propose to keep old versions (or a number of old
versions), but this is only local; it cannot help to reproduce an
experiment with exactly the same data if the remote source has changed it....

Granularity is indeed an issue. In our use of Biomaj, we see a lot of
different requests from our users (biologists). Some indeed just want a
few chromosomes, others expect a full database (GEO, Uniprot etc...).
Furthermore, you need to be sure to have the infrastructure to hold the
data (downloading the full GenBank... is quite big).

While you will often check this when you download data manually, the
ease of use of a package could let the user "skip" this check. Or the user
should be warned at install time about the disk requirements....
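
Such a warning could be as simple as a free-space check in the maintainer scripts, along these lines (the data directory and the required size are placeholders):

  #!/bin/sh
  # preinst sketch: warn if the filesystem that will hold the bank has
  # less free space than the (example) 200 GB the bank is expected to need
  set -e
  DATADIR=/var/lib/biomaj              # placeholder path
  REQUIRED_KB=$((200 * 1024 * 1024))   # 200 GB expressed in KB
  AVAIL_KB=$(df -Pk "$DATADIR" 2>/dev/null | awk 'NR==2 {print $4}')
  if [ -n "$AVAIL_KB" ] && [ "$AVAIL_KB" -lt "$REQUIRED_KB" ]; then
      echo "Warning: less than 200 GB free under $DATADIR" >&2
  fi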


Olivier



On 2/16/11 6:16 PM, Scott Christley wrote:
I don't disagree, in principle.  There are many nice aspects to the
debian packaging as you indicate.  We don't want to replicate the 100s
of terabytes of data into the debian repository, so any "package"
would not have the real data but would download the data from its
source during the package install.  Maybe through pre/post install
scripts?  I'm not overly familiar with those capabilities but it seems
plausible to me.

However, it does leave open an interesting question.  Exactly what
granularity of data belongs in a "package"?  A genome sounds good, but
there are already thousands of genomes.  There are thousands of
microarray experiments.  And there are millions of sequence entries in
GenBank.  It is plausible that the user would want access to
individual sequences.  So the idea of managing thousands of "packages"
starts to sound pretty cumbersome.

Versioning of data is definitely an important issue that is somewhat
overlooked, especially when scientists want to reproduce results from
another researcher or from a paper: if you try to redo an experiment
from many years ago, newer data could produce different results.
Galaxy[1] is one effort to get scientists to catalog reproducible
workflows, and while it has some support for acquiring data, its main
focus is on the analysis process.  I think the issue of "workflow
governance" is still an open question.

cheers
Scott

[1] http://galaxy.psu.edu/

On Feb 15, 2011, at 6:18 PM, Yaroslav Halchenko wrote:

well -- this issue is tangentially related to the software: why should
we care about having Debian packages while there are CRAN, easy_install,
etc -- all those great tools to deploy software -- domain specific and
created by specialists.  Although such a comparison is a stretch, I think
it has its own merits.  Encapsulating data (at least core sets) into
Debian packages makes it nicely integrated within the world of
software within Debian, with clear and uniform means for specifying
dependencies on data, for installing it, for finding legal
information, the same canonical location for related software and data,
etc.  Versioned dependencies become an especially relevant aspect in the
construction of regression tests for software depending on
corresponding data packages, e.g.
http://neuro.debian.net/pkgs/fsl-feeds.html.
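
For instance (package names and versions invented), a test-suite package could pin the exact data it was validated against in debian/control:

  Package: somesoftware-testsuite
  Depends: somesoftware (>= 4.1), somesoftware-data (= 4.1.0-1)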

I am not suggesting to replace all those data provider systems created
by professionals ;)  I am talking about complementing them
whenever feasible/sensible for the Debian needs/purposes.

On Tue, 15 Feb 2011, Scott Christley wrote:


I think putting the data itself into the Debian repository is
problematic.  Regardless of any licensing issue, the sheer amount of
data is too great.  Better to let the professionals who are getting
paid to manage the data (NCBI, KEGG, etc.) do so, and download directly
from those sites.  Pretty much all of them have ftp/http access to
acquire data.
I like the getData effort.  Have a set of "data descriptors" with
information about how/where to get data, then perform the download
when requested.  This is very much the architecture I was
thinking about.  I see a number of ways the project could be
expanded.  I would like to hear thoughts from Steffen and Charles
about getData before I jump in with a bunch of additions.
The biomaj project looks interesting as well.  One possibility is
to use it as the underlying data retrieval layer, but it also may be
"too complex" for basic retrieval functions.
Scott
--
=------------------------------------------------------------------=
Keep in touch                                     www.onerussian.com
Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic


--
gpg key id: 4096R/326D8438  (pgp.mit.edu)
Key fingerprint = 5FB4 6F83 D3B9 5204 6335  D26D 78DC 68DB 326D 8438


