
Re: Huge data files in Debian

On Fri, 17 Jul 2015, Ole Streicher wrote:

> Hi all,

> again a question where I find it difficult to put it into one single
> box. However, please reply to debian-science.

> I am trying to get the package "astrometry.net" into Debian. This
> package exists for Ubuntu [1], but (with some minor changes) could be
> uploaded to Debian as well. I already contacted the creator of the
> package (no reply yet).

> The package, however, is accompanied with a number of data files from
> which at least some are needed to run the package. Fortunately, these
> data files are DFSG, and already available as Debian packages [2].

> But: These packages sum up to ~25 GB, with the maximal package size of
> 3.5 GB. What is the best way to deal with them? Loosely following the
> discussion about the Icedove icons, it is probably not a wise idea
> ("privacy breach") to let them downloaded from a third party server; at
> least as long as they are DFSG-free. But can (and shall) our Debian
> servers store these files? Is 25 GB much for us or not these days?

Unfortunately it is unlikely that we (as Debian) would be able to
afford providing generic storage and distribution of such large data
packages.  It just wouldn't scale -- where would the cutoff be? (Some
datasets we deal with in neuroimaging are already tens of TBs.)

But it is also not just about "storage" -- the conventional organization
(.orig.tar.* + .deb), with the data duplicated in both, doesn't scale
well either.
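
To put rough numbers on that duplication, here is a back-of-the-envelope
sketch.  It assumes the data does not compress noticeably and that every
uploaded version keeps one copy in the source tarball and one in the
binary package; the function name and parameters are made up for
illustration, and the 25 GB figure is the one from the question above.

```python
def archive_footprint_gb(data_gb, versions=1, copies_per_version=2):
    """Rough archive footprint in GB.

    Assumes each version stores the data twice (source tarball plus
    binary package) and ignores compression and mirroring.
    """
    return data_gb * versions * copies_per_version


# One upload of the 25 GB data set, stored as .orig.tar.* and .deb
print(archive_footprint_gb(25))              # 50
# Three versions kept around at the same time
print(archive_footprint_gb(25, versions=3))  # 150
```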

The "ultimate" solution we are aiming for (see http://datalad.org for more
information) is to utilize git-annex and "ship" either mere pointers to
git-annex sources or lean (without data) git-annex repositories which
fetch data from the original data providers (or their mirrors).

Meanwhile, we (http://neuro.debian.net) have started to use git-annex to
at least avoid bloated .orig.tar.gz, see e.g.
sources: http://git.debian.org/?p=pkg-exppsy/pymvpa2-tutorialdata.git

So, the .orig is in 3.0 (git) format and is just a lean annex repository.
When building binary packages, the data load gets fetched and brought
into the .deb binary packages.
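
As a minimal sketch of that fetch step (not our actual packaging
machinery): clone the lean repository, then ask git-annex for the file
contents.  The helper names, the repository URL and the directory name
below are placeholders; only "git clone" and "git annex get" are real
commands.  By default the commands are just printed, not run.

```python
import subprocess


def annex_fetch_commands(repo_url, workdir):
    """Return the commands that clone a lean annex repository and
    then fetch the contents of its annexed files."""
    return [
        ["git", "clone", repo_url, workdir],
        # 'git annex get .' downloads the content of all annexed files
        ["git", "-C", workdir, "annex", "get", "."],
    ]


def fetch(repo_url, workdir, dry_run=True):
    """Print the commands (dry_run=True) or execute them in order."""
    cmds = annex_fetch_commands(repo_url, workdir)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Needs git and git-annex installed; stops on first failure
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # Placeholder URL and directory, for illustration only
    fetch("https://example.org/lean-annex-repo.git", "data-src")
```

Until the second command runs, the clone holds only the small pointers
to the annexed files, which is what keeps the source package lean.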

NB: Some of the older packages on NeuroDebian still come with the
bloated structure (orig + deb) or some other workarounds. But none of
them is really scalable, since we can't host all the data we might want
to provide.

Going to DebConf15?  Maybe we could have a BoF or just a lunch chat to
discuss this face-to-face? ;)

Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Research Scientist,            Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        
