Re: Non-LLM example where we do not in practice use original training data
On Wed, May 07, 2025 at 02:20:44PM +0200, Simon Josefsson wrote:
Thanks for the answers! Surprisingly, I now find myself agreeing that
your approach is reasonable and consistent with existing Debian
practices. I just wish the existing practices were more libre and more
consistent with documented policies, but I also realize this is not
the popular opinion.
So, let's delve deeper into the practical impact of such consistency,
or the lack of it. Let's say we have a hypothetical package called
gnipgnop-rattrap. It's an accessibility tool that tracks elements of
your face using pretrained Haar cascade classifier models and, based
on where you look, moves the "mouse" pointer. The models shipped with
it were trained solely on 75 gigabytes of images captured from Disney
films; those images are not available anywhere, because the people
who trained the models are afraid of being sued.
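
For concreteness, the core of such a tool is small. Here is a rough
sketch in Python against OpenCV's CascadeClassifier API (names and
structure illustrative, pointer movement stubbed out). To keep it
runnable it loads one of the pretrained cascades that opencv-python
itself ships, which as far as I know has the same property: you get
the XML coefficients, not the training images.

  # Rough sketch of the kind of core loop gnipgnop-rattrap would
  # contain.  The hypothetical Disney-trained models would be loaded
  # the same way as the demo cascade below.
  import cv2

  # The model file is the artifact under discussion: a few hundred
  # kilobytes of XML coefficients, minus the images it was trained on.
  cascade = cv2.CascadeClassifier(
      cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

  cap = cv2.VideoCapture(0)          # default webcam
  while True:
      ok, frame = cap.read()
      if not ok:
          break
      gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
      faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                       minNeighbors=5)
      for (x, y, w, h) in faces:
          # A real accessibility tool would map the face position to
          # pointer movement through an X11/Wayland input API; here
          # we just report the detected centre.
          print("face centre:", x + w // 2, y + h // 2)
      cv2.imshow("gnipgnop-rattrap", frame)
      if cv2.waitKey(1) == 27:       # Esc quits
          break
  cap.release()
  cv2.destroyAllWindows()

The point is that the program itself is trivially free software; the
contentious part is entirely in that one opaque model file.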
What should Debian do? Remove the package from the archive so no
one can use it? Patch it to download the models from a random
URL which may or may not be accessible? Construct 75 gigabytes of
DFSG-free annotated training data to stuff into the source package?