Re: Non-LLM example where we do not in practice use original training data
On Wed, May 07, 2025 at 02:20:44PM +0200, Simon Josefsson wrote:
Thanks for the answers! Surprisingly, I now find myself agreeing that
your approach is reasonable and consistent with existing Debian
practices. I just wish the existing practices were more libre and more
consistent with documented policies, but I also realize this is not
the popular opinion.
So, let's delve deeper into the practical impact of such consistency,
or the lack of it. Let's say we have a hypothetical package called
gnipgnop-rattrap. It's an accessibility tool that tracks elements of
your face using pretrained Haar cascade classifier models and, based
on where you look, moves the "mouse" pointer. The models shipped with
it were trained solely on 75 gigabytes of images captured from Disney
films; those images are not available anywhere, because the people
who trained the models are afraid of being sued.
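
For concreteness, the core of such a tool is small. Here is a rough
sketch in Python against OpenCV's CascadeClassifier API (names and
structure illustrative, pointer movement stubbed out). To keep it
runnable it loads one of the pretrained cascades that opencv-python
itself ships, which as far as I know has the same property: you get
the XML coefficients, not the training images.

  # Rough sketch of the kind of core loop gnipgnop-rattrap would
  # contain.  The hypothetical Disney-trained models would be loaded
  # the same way as the demo cascade below.
  import cv2

  # The model file is the artifact under discussion: a few hundred
  # kilobytes of XML coefficients, minus the images it was trained on.
  cascade = cv2.CascadeClassifier(
      cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

  cap = cv2.VideoCapture(0)          # default webcam
  while True:
      ok, frame = cap.read()
      if not ok:
          break
      gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
      faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                       minNeighbors=5)
      for (x, y, w, h) in faces:
          # A real accessibility tool would map the face position to
          # pointer movement through an X11/Wayland input API; here
          # we just report the detected centre.
          print("face centre:", x + w // 2, y + h // 2)
      cv2.imshow("gnipgnop-rattrap", frame)
      if cv2.waitKey(1) == 27:       # Esc quits
          break
  cap.release()
  cv2.destroyAllWindows()

The point is that the program itself is trivially free software; the
contentious part is entirely in that one opaque model file.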
What should Debian do? Remove the package from the archive so no
one can use it? Patch it to download the models from a random
URL which may or may not be accessible? Construct 75 gigabytes of
DFSG-free annotated training data to stuff into the source package?