Re: Draft: Proposal Alternative: Traning data is not source code

To: Stefano Zacchiroli <zack@debian.org>, Aigars Mahinovs <aigarius@debian.org>
Cc: Debian vote <debian-vote@lists.debian.org>
Subject: Re: Draft: Proposal Alternative: Traning data is not source code
From: Sam Hartman <hartmans@debian.org>
Date: Wed, 07 May 2025 08:20:40 -0600
Message-id: <tslh61wbiyv.fsf@suchdamage.org>
In-reply-to: <20250507092756.b6f2dbubiavfziwy@upsilon.cc>
References: <CABpYwDV3Cx93zOEXgvoOx21PaUPAxiJK6q_tYgNNCh-8FtYMrw@mail.gmail.com> <20250507092756.b6f2dbubiavfziwy@upsilon.cc>

>>>>> "Stefano" == Stefano Zacchiroli <zack@debian.org> writes:

    Stefano> Thanks for this proposal, Aigars.  How would you compare it
    Stefano> with Sam's proposal? As I can see it the general idea
    Stefano> behind both proposals is quite similar, even though the
    Stefano> wording is different. The main content different I can see
    Stefano> is that you focus on the notion of "data information",
    Stefano> whereas Sam's proposal is more general and focus on the
    Stefano> practicality of being able to make modifications.

I think Aigars's proposal handles the case of models where the training
data cannot be distributed better than mine does (spam messages and ham
messages for a spam classifier, Creative Commons works that do not allow
modification for an LLM).
I think my proposal allows people to be more sloppy so long as practical
modification is possible.  One of my concerns about the OSI AI
definition is that the requirements around training information sound
like a minimum quality bar for free models.
In my experience we have not required software to be high-quality
software in order to count as free.
I wouldn't generally say that software is non-free because the
documentation of its build scripts is buggy and I cannot get them to
run.
I appreciate that the SFC has held people to fairly high standards for
documenting their build systems as part of GPL enforcement actions, but
I would argue that if the people had tried to follow the GPL in the
first place, this level of rigor would have been inconsistent with
requirements for freedom.
Obviously it would have been desirable, but I don't like it when other
things get mixed in with freedom.

I think my proposal should be fixed to handle the case where upstream
distributes training information because they cannot distribute training
data even if they have it.
One possibility would be to remove the last sentence from my proposal
and assume ftpmaster will judge appropriately when upstreams are clearly
acting in bad faith.

Right now though, I don't see enough support for either my proposal or Aigars's
proposal to move forward.
I am quite disappointed by that, because I think the current ballot
option undermines part of the core of what software freedom is to me.

To me, software freedom is about an achievable set of standards we
commit to in order to empower our users.
Software freedom may be a sacrifice in terms of not being able to use
some convenient market options.
But it's never before been a sacrifice about potential. Russ's comment
that he doesn't think a Bayesian classifier can be free software hit me
hard--my immediate reaction was "If that's true, then software freedom is
wrong."

Users might want a Bayesian classifier--I do enough that I've trained
one. Software in main like a mail reader or a mail system might well
want to include a classifier.
Saying that even if someone is as dedicated to freedom as they can be,
they can never live up to our standards and include that reasonable
functionality in Debian main makes me think we have lost sight of our
users.

I appreciate that if you take a position less strong than Russ's you could
ship a crappy Bayesian classifier trained only on DFSG-licensed spam and
ham messages.
I think it's clear that such a classifier would function significantly
less effectively than other classifiers.

In the past, we have said that specific software is not free because the
copyright holder is unwilling (but able) to make the necessary grants.
Or perhaps the copyright holder did a poor job of tracking their
licensing and is unable to document what the license is.
But none of that has restricted the type of software that can be free.

In this discussion, I have been convinced that training data for some
of the models we might want will never be licensed under a DFSG-free
license. This is in part because the copyright holder of the training
data is often not the person training a model. In many cases (spam,
virus detection, etc), the interests of the copyright holder in the
training data are not aligned with the interest of those training the
model.
Yet in the same discussion, I have been convinced that it is much more
likely such training is fair use under copyright law.
I do not think there is consensus in our community on the ethics of
training on news articles or books even if it is legally fair use.
For me at least, I have no ethical problem with using spam and ham
messages to train a classifier.

I am sad that it looks like there is not even support to put an option
on the ballot that would empower our users to have a spam classifier in
main.

(For what it's worth, I'd be totally fine moving model data to a new
archive section (or expanding the definition of contrib) that did not
have the negative connotations of non-free.)

