
Packaging scientific datasets for Debian



Dear Debian scientists,


I want to resurrect the discussion about dataset packaging in Debian. I
believe the latest state is reflected by this document:

  http://ftp-master.debian.org/wiki/projects/data/

Although it gives the impression that everything is already done, I
don't know whether that is actually true. Does anyone know the current
state of this effort?

Anyway, I want to discuss a different aspect of dataset packaging. In
neuroimaging research we have a number of 'standard datasets' that are
used by many tools (brain atlases, ...). With an increasing number of
neuroimaging packages, we see these datasets appearing in multiple
packages. I was wondering what could be a good approach to standardize
packaging of this kind of data.

So far I have included them in application-specific places, i.e.
/usr/share/<appname>/data. But that obviously duplicates data
unnecessarily. Moreover, it would be nice to have data of the same type
(e.g. brain atlases, MRI data, ...) grouped together in a common place
that could be used as the default data path for relevant applications.

I would appreciate your comments on a good approach to this problem,
ideally one that scales well to other fields of science too. I'm sure
that this kind of integration of many packages into a common environment
has been tackled multiple times before, and maybe one of those solutions
can be adapted to this case too.

Ideally, we could come up with a mini-policy for dataset packages. It
should not be overengineered, but it might cover things like:

- package naming, e.g. 'dataset-<meaningful name>'
- predictable and common location in the filesystem (maybe it should make
  it easy for an admin to relocate all packaged datasets to a dedicated storage
  device)
- some grouping by purpose (although many datasets can be used for
  different things) or type of data (MRI, pictures, sound
  databases, genome, ...)
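
To make these points a bit more concrete, here is a minimal sketch of
how an application could resolve the location of a packaged dataset.
Everything specific in it is an assumption made up for illustration and
not an existing convention: the base directory /usr/share/datasets, the
DATASETS_ROOT environment variable (meant as the admin's relocation
hook), the <root>/<group>/<name> layout, and the function names.

```python
import os
from pathlib import Path

# Hypothetical defaults: neither the directory nor the variable name is
# an existing Debian convention.
DEFAULT_ROOT = Path("/usr/share/datasets")
ROOT_ENV_VAR = "DATASETS_ROOT"


def dataset_root(environ=None):
    """Return the root directory that holds all packaged datasets.

    An admin who moved the datasets to dedicated storage would set the
    environment variable; otherwise the default location is used.
    """
    env = os.environ if environ is None else environ
    return Path(env.get(ROOT_ENV_VAR) or DEFAULT_ROOT)


def dataset_path(group, name, environ=None):
    """Return <root>/<group>/<name>, e.g. <root>/brain-atlases/<name>."""
    for part in (group, name):
        # Reject anything that would escape the dataset root.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return dataset_root(environ) / group / name


def package_name(name):
    """Map a dataset name to the 'dataset-<meaningful name>' form."""
    return "dataset-" + name.lower().replace("_", "-")
```

With such a helper, every application would look in the same place by
default, and relocating all datasets would only require changing one
setting instead of patching each package.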

A related long-term goal is to have a supply of datasets that can be
used for regression testing of scientific applications. It should make
it easier to come up with a dedicated package that implements the
regression test and declares dependencies on the necessary data
package(s).


Thanks in advance for your input,

Michael

-- 
GPG key:  1024D/3144BE0F Michael Hanke
http://mih.voxindeserto.de

