
Re: Huge data files in Debian



Andreas Tille <andreas@an3as.eu> writes:
> On Fri, Jul 17, 2015 at 09:58:32PM -0400, Yaroslav Halchenko wrote:
>> > +1
>> > Would you register a large data set BoF in Summit?
>> 
>> something like
>> https://summit.debconf.org/debconf15/meeting/333/bof-big-data-packages/
>
> Yes.
>
>> ?  not sure if I would actually like to be "The Speaker" ;)
>
> Why not?  Your mail contained some constructive details that could
> kickstart a discussion.  Nothing more is needed for a BoF.

I'd support this -- we could also discuss the "distributed filesystem
approach" there. Could this happen between Friday (DebCamp) and Tuesday?
I have to leave on Wednesday...

For the specified package (astrometry.net), I have been looking into the
content of the packages, and I think I can make a compromise here: The
tables mainly contain star positions, and the table sizes range
from the brightest 1000 stars (for the smallest table) to ~200 million
stars, roughly doubling with each table. The first 12 tables alone are
quite useful for many applications (not for my instrument, however :-( ),
covering 1.45 million stars, and have a size of 114 MB; I think this
is acceptable for a Debian package.

However, there may be a licensing issue: The data are officially under
GPL-2+; but this is sort-of impossible: the GPL requires the "sources"
to be available (and defines source as "the format that a human
prefers to edit" or so). There is no such "source" for these files: they
come from a survey (2MASS) and the final "source" is the positions of
the stars on the sky. These positions are obviously not editable by
humans yet :-) And while the process of generating the star catalogs is
somewhat documented, I doubt that we can or should reproduce the catalogs
from the original exposures in Debian -- this would require disk space,
computing power and man power which we obviously don't have.

And the files have a quite straightforward, documented structure so
that anyone could patch them if needed.

So, I would treat these files (despite the fact that they are *generated*
by another program) as source files. Maybe we could discuss this as well
at DebConf? At some point we should think about how to get this into
our Social Contract.

Best

Ole

