Packaging scientific datasets for Debian

To: Debian Science List <debian-science@lists.debian.org>
Subject: Packaging scientific datasets for Debian
From: Michael Hanke <michael.hanke@gmail.com>
Date: Sat, 8 May 2010 10:05:45 -0400
Message-id: <20100508140545.GA19587@meiner>
Mail-followup-to: Debian Science List <debian-science@lists.debian.org>

Dear Debian scientists,


I want to resurrect the discussion about dataset packaging in Debian. I
believe the latest state is reflected by this document:

  http://ftp-master.debian.org/wiki/projects/data/

Although it gives the impression that everything is already done, I
don't know if that is actually true. Does anyone know about the current
state of this effort?

Anyway, I want to discuss a different aspect of dataset packaging. In
neuroimaging research we have a number of 'standard datasets' that are
used by many tools (brain atlases, ...). With an increasing number of
neuroimaging packages, we see these datasets appearing in multiple
packages. I was wondering what could be a good approach to standardize
packaging of this kind of data.

So far I have included them in application-specific places, i.e.
/usr/share/<appname>/data. But that obviously duplicates data
unnecessarily. Moreover, it would be nice to have data of some type
(e.g. brain-atlases, MRI data, ...) grouped together in a common place
that could be used as default data path for relevant applications.

I would appreciate your comments on a good approach to this problem,
ideally one that scales well to other fields of science too. I'm sure
that this type of integration of many packages into a common environment
has been attempted multiple times before, and maybe one of the solutions
can be adapted to this case too.

Ideally, we could come up with a mini-policy for dataset packages. It
should not be overengineered, but it might cover things like:

- package naming, e.g. 'dataset-<meaningful name>'
- predictable and common location in the filesystem (maybe it should make
  it easy for an admin to relocate all packaged datasets to a dedicated storage
  device)
- some grouping by purpose (although many datasets can be used for
  different things) or type of data (MRI, pictures, sound
  databases, genome, ...)
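
To make the first two points more concrete, here is a minimal Python
sketch of how a package name and an install location could be derived.
Everything in it is an assumption for illustration only: the root
/usr/share/datasets, the DATASET_ROOT environment variable, and the
function names are hypothetical and not part of any existing policy.

```python
import os
import re
from pathlib import PurePosixPath

# Hypothetical defaults; none of these names are established policy.
DEFAULT_ROOT = "/usr/share/datasets"
ROOT_ENV_VAR = "DATASET_ROOT"  # would let an admin point at dedicated storage
PREFIX = "dataset-"


def package_name(name):
    """Build a 'dataset-<meaningful name>' package name from free-form text."""
    # Keep only lowercase letters, digits, '+' and '.'; collapse the rest to '-'.
    slug = re.sub(r"[^a-z0-9+.]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError("dataset name has no usable characters")
    return PREFIX + slug


def dataset_dir(name, group, environ=None):
    """Return <root>/<group>/<name slug>, honouring a relocated root."""
    env = os.environ if environ is None else environ
    root = env.get(ROOT_ENV_VAR, DEFAULT_ROOT)
    slug = package_name(name)[len(PREFIX):]
    return str(PurePosixPath(root) / group / slug)
```

An application could call dataset_dir("Example Brain Atlas",
"brain-atlases") to obtain its default data path, and an admin could
relocate every dataset at once by setting the (hypothetical) DATASET_ROOT
variable.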

A related long-term goal is to have a supply of datasets that can be
used for regression testing of scientific applications. Such a supply
would make it easier to come up with a dedicated package that implements
the regression test and declares dependencies on the necessary data
package(s).
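
As a rough sketch of what such a regression-test package might do first,
the Python snippet below checks that the datasets it relies on are
actually installed. The default root /usr/share/datasets and the function
name are assumptions made for this example; in a real package the
dependencies would of course also be declared in the package metadata.

```python
import os


def missing_datasets(required, root="/usr/share/datasets"):
    """Return the required dataset subdirectories not found under root.

    'required' holds paths relative to the (hypothetical) common root,
    e.g. 'brain-atlases/example-atlas'.
    """
    return [rel for rel in required
            if not os.path.isdir(os.path.join(root, rel))]
```

A test runner could skip or fail early, with a clear message, whenever
this list is not empty.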


Thanks in advance for your input,

Michael

-- 
GPG key:  1024D/3144BE0F Michael Hanke
http://mih.voxindeserto.de
