On Wednesday, May 14, 2025 5:31:53 AM Mountain Standard Time Aigars Mahinovs wrote:
> On Wed, 14 May 2025 at 00:03, Soren Stoutner <soren@debian.org> wrote:
> > On Tuesday, May 13, 2025 12:06:05 PM Mountain Standard Time Ilu wrote:
> > > 2. What is the preferred form of modification? This is IMHO the
> > > deciding, relevant question.
> > > Aigars says weights and I've heard that from several other people active
> > > in machine learning. OSI says the same.
> > > Mo Zhu says training data is. I haven't heard that from anybody else.
> >
> > I thought several other people besides Mo Zhu had also said that on this
> > list, but just in case they haven’t, I would like to go on the record that
> > I also feel that training data is one of the preferred forms of
> > modification in machine learning and should thus be considered for
> > anything being included in main.
>
> Could you expand a bit on this topic, so I can understand this position
> better?
>
> Say that we are talking about an otherwise-free LLM model trained on a
> multi-gigabyte data set. Data from the dataset may be downloaded from
> the Internet (but may not be redistributed by Debian). Let's assume that
> the source code of the LLM also includes a script that would, if
> executed, do all the downloading and formatting of the training data
> from Internet sources for you. The data *may* even be binary identical
> to the original training data (if it is only trained on snapshotted
> data mining collections that one can download from torrent via a
> magnet link, for example), or it may be in a newer state than when it
> was trained originally (if you choose to switch to newer snapshots or
> if data collection happens directly from source servers or their
> proxies). You can add, remove, or filter data sources to modify the
> contents of the training data at a high or granular level.
>
> Would that be a sufficient definition of training data to satisfy the
> preferred form of modification criteria for you?
If Debian cannot redistribute the training dataset (part of your description above), then it cannot be in main. If the LLM model source code is DFSG-free but depends on this non-DFSG-free training data or on weights derived from it, then it is fine if it goes in contrib. The weights derived from this non-DFSG-free training data can go in non-free as long as Debian can redistribute them.

If there is a scenario where the LLM can work with several different sets of weights derived from different training data, and some of those data sets are DFSG-free while others are not, then the free data sets and the model can go in main. The model can depend on, recommend, or suggest the DFSG-free weights and data sets in main, but it can only suggest those in non-free.

I find this understanding accomplishes two things.

1. It is a consistent application of DFSG principles to machine learning applications.

2. It makes the benefits of non-DFSG ML applications available in non-free to those who would like to use them.

> If any use of the original training data (or of its description as
> above) requires 100 000 Nvidia H100 cards running for a month using a
> few billion USD of investment and several million dollars of
> electricity, does that training data *still* satisfy the criteria for
> "preferred form of modification"?

I find discussions about how much hardware it takes to process the training data to be orthogonal to the question of whether an ML training dataset is DFSG-free, so I don’t feel it is useful to discuss that here.

> And, to ask explicitly, is raw training data a better form of
> modification for you compared to a description of that same training
> data, in automated form that would generate the training data for you
> on request?

1. Raw training data is non-negotiably required for me to consider an ML application DFSG-free.

2. A description of that training data would also be nice, but I don’t think it would be non-negotiably required.
However, I might be open to arguments that both should be required.

> Is it important for you if the training data *only* comes to you from
> Debian mirrors? Or is the same data coming to you from other sources
> also fine?

For main, yes, I think it must come from Debian mirrors. For non-free, I don’t see a difference between Debian mirrors and a script that downloads the data from some other source on the internet, as you describe in your example above.

I should note that I do not feel as strongly about this point as I do about the training data being available under a DFSG-free license to be in main. So, if Debian decides that hosting the training data at some Debian-approved location that is not an official Debian mirror is acceptable, I wouldn’t push back against that, as long as the training data itself was DFSG-free so that it could be included in main in the future if we ever decided to do so.

> > In my opinion, it is fine to include otherwise distributable ML
> > applications without available training data in non-free.
>
> Technically - yes, and I would be fine to include OSI-free AI in
> Debian non-free, but IMHO it does nothing to resolve ethical concerns.
> If we limit that to only OSI-free AI then that would also be giving
> the same kind of guidance to the AI community - with both upsides and
> downsides.

I would go beyond that to say that we can host things in non-free that even OSI does not consider free, as long as we have the rights to distribute them. We already do that for a number of other things in non-free.

I think the practical result of the policy I describe above would be that most ML applications would end up in non-free. But I also think that a smaller number of ML applications would end up in main, especially as developers start intentionally creating fully DFSG-free ML applications and training data sets.
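To make the archive layout I describe above concrete, the package split could look roughly like the following sketch of debian/control stanzas (all package names here are invented for illustration):

    # In main: a DFSG-free engine plus DFSG-free weights.
    # Policy allows a main package to suggest, but not depend on
    # or recommend, a package outside main.
    Package: foo-llm
    Recommends: foo-llm-weights-free
    Suggests: foo-llm-weights-nonfree

    Package: foo-llm-weights-free

    # In non-free: redistributable weights whose training data
    # is not DFSG-free.
    Package: foo-llm-weights-nonfree
    Section: non-free/science

A model that only worked with the non-free weights would instead go in contrib, with a hard Depends: foo-llm-weights-nonfree.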
Also, as has been mentioned on this list, the discussion about LLMs has caused us to look more closely at other MLs already in main, like some games or image and audio processing applications. In the past, I, along with others, didn’t think too deeply about the training data used to create the weights used by those MLs.

If the above standards are adopted, it would require moving some of these games and image and audio processing applications to contrib, with their weights in non-free. In other cases, it might be possible to retrain them on DFSG-free training data sets, especially if upstream is interested in doing so. I think this would actually benefit the free-software movement in the long term, even if it requires a bunch of work to address it now.

If such a change were made, I would be in favor of doing so at the beginning of a release cycle, so that Debian and upstream developers have a couple of years to figure out how to either keep the software in main or move it to contrib and non-free in such a way that it is not disruptive to users.

-- 
Soren Stoutner
soren@debian.org