I don't disagree, in principle. There are many nice aspects to Debian
packaging, as you indicate. We don't want to replicate the hundreds of
terabytes of data into the Debian repository, so any "package" would
not contain the real data but would download it from its source during
the package install. Maybe through pre/post-install scripts? I'm not
overly familiar with those capabilities, but it seems plausible to me.
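Roughly what I have in mind, as a sketch only (the package name, paths,
and URL below are made up), is a tiny helper that a package's postinst
could call to pull the real data from upstream at install time, so the
.deb itself never carries the data:

    #!/usr/bin/env python3
    # Hypothetical helper a data package's postinst might invoke: fetch
    # the real data from its upstream source at install time. All names,
    # paths, and the URL are placeholders.
    import os
    import urllib.request

    SOURCE_URL = "http://example.org/data/some-genome.fa.gz"
    DEST_DIR = "/var/lib/example-data"

    def fetch(url=SOURCE_URL, dest_dir=DEST_DIR):
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, os.path.basename(url))
        if not os.path.exists(dest):               # skip if already present (re-install)
            urllib.request.urlretrieve(url, dest)  # a real helper would verify checksums
        return dest

    if __name__ == "__main__":
        print(fetch())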
However, it does leave open an interesting question: exactly what
granularity of data belongs in a "package"? A genome sounds good, but
there are already thousands of genomes. There are thousands of
microarray experiments. And there are millions of sequence entries in
GenBank. It is plausible that the user would want access to
individual sequences. So the idea of managing thousands of "packages"
starts to sound pretty cumbersome.
Versioning of data is definitely an important issue that is somewhat
overlooked, especially when scientists want to reproduce results from
another researcher or from a paper: if you try to redo an experiment
from many years ago, newer data could produce different results.
Galaxy[1] is one effort to get scientists to catalog reproducible
workflows, and while it has some support for acquiring data, its main
focus is on the analysis process. I think the issue of "workflow
governance" is still an open question.
cheers
Scott
[1] http://galaxy.psu.edu/
On Feb 15, 2011, at 6:18 PM, Yaroslav Halchenko wrote:
well -- this issue is tangentially related to the software one: why should
we care about having Debian packages while there are CRAN, easy_install,
etc. -- all those great tools to deploy software, domain-specific and
created by specialists? Although such a comparison is a stretch, I think
it has its own merits. Encapsulating (at least core sets of) data into
Debian packages integrates them nicely within the world of software in
Debian, with clear and uniform means of specifying dependencies on data,
installing it, and finding legal information, plus the same canonical
locations for related software and data, etc. Versioned dependencies
become an especially relevant aspect in the construction of regression
tests for software that depends on corresponding data packages, e.g.
http://neuro.debian.net/pkgs/fsl-feeds.html.
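For illustration only (the package names below are made up; this is not
the actual fsl-feeds control file), such a test package could pin the
exact data it was validated against in its control file:

    Package: somepipeline-tests
    Depends: somepipeline, somepipeline-data (>= 1.2)

so rerunning the regression tests against newer data becomes an explicit,
visible change rather than a silent one.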
I am not suggesting that we replace all those data provider systems created
by professionals ;) I am talking about complementing them
whenever feasible/sensible for Debian's needs/purposes.
On Tue, 15 Feb 2011, Scott Christley wrote:
I think putting the data itself into the Debian repository is
problematic. Regardless of any licensing issues, the sheer amount of
data is too great. Better to let the professionals who are paid to
manage the data (NCBI, KEGG, etc.) keep doing so, and download
directly from those sites. Pretty much all of them have ftp/http
access for acquiring data.
I like the getData effort. It has a set of "data descriptors" with
information about how and where to get data, and then performs the
download when requested. This is very much the architecture I was
thinking about. I see a number of ways the project could be
expanded. I would like to hear thoughts from Steffen and Charles
about getData before I jump in with a bunch of additions.
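To make concrete what I mean by that architecture -- this is my own
sketch, not getData's actual format, and the dataset name, URL, and
paths are placeholders:

    # Sketch of a "data descriptor" plus a lazy retriever (my own
    # illustration, not getData's real implementation).
    import os
    import tarfile
    import urllib.request

    DESCRIPTORS = {
        "example-genome": {
            "source": "http://example.org/genomes/example-genome.tar.gz",
            "target": "/var/lib/data/example-genome",
        },
    }

    def retrieve(name):
        """Download and unpack the named dataset only when requested."""
        d = DESCRIPTORS[name]
        os.makedirs(d["target"], exist_ok=True)
        archive = os.path.join(d["target"], os.path.basename(d["source"]))
        if not os.path.exists(archive):
            urllib.request.urlretrieve(d["source"], archive)
            with tarfile.open(archive) as tar:
                tar.extractall(d["target"])
        return d["target"]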
The biomaj project looks interesting as well. One possibility is
to use it as the underlying data retrieval layer, but it may also be
"too complex" for basic retrieval functions.
Scott
--
=------------------------------------------------------------------=
Keep in touch www.onerussian.com
Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic