Re: Huge data files in Debian

Andreas Tille <andreas@an3as.eu> writes:
> On Fri, Jul 17, 2015 at 09:58:32PM -0400, Yaroslav Halchenko wrote:
>> > +1
>> > Would you register a large data set BoF in Summit?
>> something like
>> https://summit.debconf.org/debconf15/meeting/333/bof-big-data-packages/
> Yes.
>> ?  not sure if I would actually like to be "The Speaker" ;)
> Why not?  Your mail contained some constructive details that could
> kickstart a discussion.  Nothing more is needed for a BoF.

I'd support this -- we could also discuss the "distributed filesystem
approach" there. Could this happen between Friday (DebCamp) and Tuesday?
I have to leave on Wednesday...

For the package in question (astrometry.net), I have been looking into the
contents of the packages, and I think I can make a compromise here: The
tables mainly contain star positions, and the table sizes range
from the brightest 1000 stars (for the smallest table) to ~200 million
stars, roughly doubling with each table. The first 12 tables alone are
quite useful for many applications (not for my instrument, however :-( ),
covering 1.45 million stars, and have a size of 114 MB; I think this
is acceptable for a Debian package.

However, there may be a licensing issue: The data are officially under
GPL-2+; but this is sort-of impossible: the GPL requires the
"sources" to be available (and defines source as "the format that a human
prefers to edit" or so). There is no such "source" for these files: they
come from a survey (2MASS) and the final "source" is the positions of
the stars on the sky. These positions are obviously not editable by
humans yet :-) And while the process of generating the star catalogs is
somewhat documented, I doubt that we can or should reproduce the catalogs
from the original exposures in Debian -- this would require disk space,
computing power and manpower which we obviously don't have.

And the files have a quite straightforward, documented structure, so
anyone could patch them if needed.

So, I would treat these files (despite the fact that they are *generated*
by another program) as source files. Maybe we could discuss this as well
at DebConf? At some point we should think about how to get this into
our Social Contract.



