Re: Non-LLM example where we do not in practice use original training data

To: Simon McVittie <smcv@debian.org>
Cc: debian-vote@lists.debian.org
Subject: Re: Non-LLM example where we do not in practice use original training data
From: Aigars Mahinovs <aigarius@gmail.com>
Date: Wed, 7 May 2025 17:13:06 +0200
Message-id: <CABpYwDU_a0U2E8dSavXJewi5pVvTqL8OwMJM5Jd6E779m9WhAA@mail.gmail.com>
In-reply-to: <aBtqiCerNM0tg2Et@remnant.pseudorandom.co.uk>
References: <tslecx2dcr4.fsf@suchdamage.org> <00e0aaaedf5c050a7b09b53c880aecbf9b9220b0.camel@debian.org> <tslzffqbwrb.fsf@suchdamage.org> <878qnaiwvd.fsf@hope.eyrie.org> <20250506115857.mr44k5tqywisx75h@upsilon.cc> <aBoegK0AYm6i3wMd@remnant.pseudorandom.co.uk> <vve1ci$8ttn$1@posted-at.bofh.it> <aBtqiCerNM0tg2Et@remnant.pseudorandom.co.uk>

On Wed, 7 May 2025 at 16:13, Simon McVittie <smcv@debian.org> wrote:
>
> On Tue, 06 May 2025 at 22:10:28 -0000, Marco d'Itri wrote:
> >smcv@debian.org wrote:
> >>Debian is unusual in the way we interpret our mission statement as
> >>extending to everything we distribute being Free, not just our
> >>executable code. Many other FOSS distributions apply the DFSG, the OSD,
> >>the FSF's guidelines or similar principles to executable code (only),
> >>and do not see a problem with having non-executable data that Debian
> >>would consider to be non-Free.
> >
> >I have been a Debian developer for almost 30 years, and I remember that
> >when I joined the project we had no plans to apply the DFSG to e.g.
> >documentation.
> >Then the "editorial changes" (not) GR happened, and some people were
> >very surprised by the practical outcome.
>
> Yes, I didn't mean to imply that I think our interpretation is
> necessarily the one that brings most benefit to our users and Free
> Software, only that it's the one that the project enforces.
>
> I personally think there's a risk that we put too much emphasis on
> following the chain of "true" source code to justifiable but impractical
> conclusions

There are things that are technically trivial to modify and are
distributed as simple plain text, ripe for modification, yet actually
modifying them would involve a very complex process stretching over
months and requiring the consent of hundreds of people across many
companies and many countries. And yet we ship them in main. I am
talking about the text of the DFSG, for example.

Then there are complex binary data blobs (which Debian currently does
not distribute, but others do, with a free license) whose significant
modification would involve a literal act of God (as defined by
insurance contracts) or massive use of heavy explosives - I am talking
about, for example, topographic map data.

Similarly, documents like the RFCs defining the Internet protocols are
nothing more than representations of facts of life, agreed and
codified by many parties. The concept of modification does not really
apply to them. You can derive from them, for example, by producing a
mobile-friendly HTML version, but a modified text of RFC 2795 does not
really make sense as a concept.

In some cases it makes sense to dig deeper for source code. Sometimes
things that look like source code in fact have deeper source code
behind them (I am looking at some Bison generated parsers, for
example). Sometimes they are an aggregation of multiple sources. But
sometimes we have to conclude that one or more of our branches of
digging have hit a foundation - they have hit something that is not
source code, but instead a *fact*. Searching for the source code of a
fact does not make sense. We can try to search for the source code of
the specific way that this fact was communicated, but that is a
distraction, as it does not actually change or modify the fact itself,
just its format of expression.

And similarly, a license for the fact itself also does not make sense.
The statistical frequency of occurrence of words in the texts of a
particular language is a fact and has no copyright or license,
regardless of how it was computed and what data sources were used to
arrive at it.
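
A minimal sketch of the word frequency example, under stated
assumptions: the helper name word_frequencies and the two sample texts
are hypothetical and only serve as illustration. It shows that two
differently written inputs can yield exactly the same frequency table,
and that the table itself keeps no trace of which texts it was
computed from.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a dict mapping each lowercased word to its relative frequency.

    Hypothetical helper for illustration only.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # An empty input gives an empty Counter, so no division happens.
    return {word: count / total for word, count in counts.items()}


# Two made-up "data sources" with the same words in a different order.
corpus_a = "the cat sat on the mat"
corpus_b = "on the mat the cat sat"

freq_a = word_frequencies(corpus_a)
freq_b = word_frequencies(corpus_b)

# The resulting tables are identical even though the inputs differ.
same_table = freq_a == freq_b
```

Given only freq_a, there is no way to tell whether it came from
corpus_a or corpus_b; the table records the frequencies, not the texts
they were measured on.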

We are too used to thinking in terms of "source code" (license) ->
compilation -> "binary, derived work" (same license) and we
automatically try to apply that thinking to everything around us. But
that is a simplification that only applies within a very narrow scope.
Outside of it, there are many exceptions and many areas where it does
not apply at all.
-- 
Best regards,
    Aigars Mahinovs
