
Large data packages in the archive



Hi,

one important question lately has been "What should we do with large
packages containing data?", such as game data, huge icon/wallpaper sets,
some scientific data sets, etc. Naturally, this is a decision ftpmaster
has to make, so here are our thoughts on it, to facilitate discussion and
to see whether we missed any important points. We do reserve the right to
have the last word on how it gets done. :)


Basic Problem: "What to do with large data packages?"

That already has a problem: How to define "large"? One way, which we
chose for now, is simply "everything > 50MB".
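
To make the cut-off concrete, here is a minimal sketch of how such a check
could look. It is not part of the archive software: the function names are
made up for illustration, and reading "50MB" as 50 * 1024 * 1024 bytes is an
assumption on my side.

```python
# Hypothetical helper, not taken from the archive software.
# Assumption: "50MB" is read as 50 MiB; the mail does not say which unit.
LARGE_THRESHOLD_BYTES = 50 * 1024 * 1024


def is_large(size_bytes: int, threshold: int = LARGE_THRESHOLD_BYTES) -> bool:
    """Return True if a package is strictly larger than the threshold."""
    return size_bytes > threshold


def split_by_size(packages: dict[str, int]) -> tuple[list[str], list[str]]:
    """Split a {name: size in bytes} mapping into (normal, large) name lists."""
    normal = sorted(name for name, size in packages.items() if not is_large(size))
    large = sorted(name for name, size in packages.items() if is_large(size))
    return normal, large
```

A package of exactly 50 MiB still counts as normal here, since the rule is
"> 50MB" and not ">= 50MB".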


While the archive software is written in Python, this problem sounds
like a Perl one as "There is more than one way to do (solve) it":

a.) We can simply say that we don't want this in Debian and that people
    should use external hosting for such packages. After all, they are
    usually of interest to only a very small minority.

b.) We can just add another component "data" besides
    main/contrib/non-free.

c.) We can host a separate archive for it under the control of ftpmaster.


The first two seem to have grave problems:

a.) This is basically not a (good) option. It is our job to maintain the
    archive, and if there is enough demand we should make it possible to
    also host things like these data packages. Additionally, it would
    require moving everything that needs those data packages into
    contrib, as there wouldn't be a good basis for a Policy exception.

b.) While that would be the simplest solution, it has other problems,
    large enough that we decided against it. The biggest one is that it
    violates the principle of least surprise for our mirrors. The whole
    point of this discussion is to avoid bloating the main archive too
    much. If we just add another component, the data will end up being
    mirrored widely, even if we send an announcement weeks in advance.
    Requiring every mirror admin to decide whether they want to mirror
    or exclude it, and then to adjust their scripts, is a simple no-go
    for us.

So the way to go for us seems to be c.), hosting the archive ourselves
(somewhere below data.debian.org probably).


For the rest of this mail I am talking about solution c.), unless
otherwise stated.


So, assuming we go for solution c.) (which is what happens unless someone
has a *very* strong reason not to, which I currently can't imagine), we
will set up a separate archive for this. It will work the same way as
our main archive does, with a few notable points:

 - It will be solely arch:all, not split per architecture. Or, if
   someone presents *good* reasons why a data archive needs to be
   architecture-aware, we will also offer this, but *NO* autobuilder
   support will be provided.
   This is meant as a place for large datasets, and those should be
   arch independent. Building them would also kill many autobuilders
   (think of binary packages of 500, 800 or more megabytes!)

 - It is a separate archive, so it needs full source uploads to work:
   every data package you create will be a full source package, and you
   have to split the source between this archive and the rest that goes
   into the normal Debian one.

 - We need to change policy. It currently forbids packages in main to
   Depend/Recommend something outside of it (which is good). As that
   would make the data archive less useful, I propose to change this to
   something to the effect of "Packages in main are allowed to
   recommend packages in the data archive".
   Dependencies should *not* be allowed, but read the next point.

 - Packages in main need to be installable and not cause their (indirect)
   reverse build-depends to FTBFS in the absence of data.debian.org.
   If the data is necessary for the package to work and there is a small
   dataset (like 5 to 10 MB) that can be reasonably substituted for the
   complete data package, the smaller dataset should be included in
   main and the package then may depend on "foo-data | foo-data-small".
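
The last two points can be sketched as a small check. This is only an
illustration under assumptions, not how the archive software does it: the
function names are hypothetical, the set of data-archive package names is
assumed to be known, and the parser handles only the basic "a, b | c" syntax
with version and architecture restrictions stripped. Recommends are not
checked at all, since those would be allowed.

```python
import re


def parse_relation(field: str) -> list[list[str]]:
    """Split a Depends-style field into groups of alternative package names."""
    groups = []
    for clause in field.split(","):
        alternatives = []
        for alt in clause.split("|"):
            # Drop version constraints "(>= 1.0)" and arch restrictions "[amd64]".
            name = re.sub(r"\(.*?\)|\[.*?\]", "", alt).strip()
            # Drop a multiarch qualifier such as ":any".
            name = name.split(":")[0]
            if name:
                alternatives.append(name)
        if alternatives:
            groups.append(alternatives)
    return groups


def broken_without_data(depends: str, data_packages: set[str]) -> list[list[str]]:
    """Return dependency groups whose every alternative is in the data archive.

    Such a group cannot be satisfied when data.debian.org is absent.
    """
    return [
        group
        for group in parse_relation(depends)
        if all(name in data_packages for name in group)
    ]
```

With data_packages = {"foo-data"}, a Depends of "foo-data | foo-data-small"
yields no broken groups, while a plain "foo-data" is reported.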


Any comments?

Timeframe for this? I expect it to be ready within 2 weeks.

-- 
bye, Joerg
Some AM after a mistake:
Sigh.  One shouldn't AM in the early AM, as it were.  <grin>
