>>>>> "Stefano" == Stefano Zacchiroli <zack@debian.org> writes:

    Stefano> Thanks for this proposal, Aigars. How would you compare it
    Stefano> with Sam's proposal? As I can see it the general idea
    Stefano> behind both proposals is quite similar, even though the
    Stefano> wording is different. The main content different I can see
    Stefano> is that you focus on the notion of "data information",
    Stefano> whereas Sam's proposal is more general and focus on the
    Stefano> practicality of being able to make modifications.

I think Aigars's proposal handles the case of models whose training data cannot be distributed better than mine does (spam and ham messages for a spam classifier, or Creative Commons works that do not allow modification for an LLM). I think my proposal allows people to be sloppier, so long as practical modification is possible.

One of my concerns about the OSI AI definition is that the requirements around training information sound like a minimum quality bar for free models. In my experience, we have not required free software to be high-quality software in order to be free. I wouldn't generally say that software is non-free because the documentation of its build scripts is buggy and I cannot get them to run. I appreciate that the SFC has held people to fairly high standards for documenting their build systems as part of GPL enforcement actions, but I would argue that had those people tried to follow the GPL in the first place, this level of rigor would have gone beyond what freedom requires. Obviously it would have been desirable, but I don't like it when other things get mixed in with freedom.

I think my proposal should be fixed to handle the case where upstream distributes training information because they cannot distribute the training data, even if they have it. One possibility would be to remove the last sentence from my proposal and assume ftpmaster will judge appropriately when upstreams are clearly acting in bad faith.
Right now, though, I don't see enough support for either my proposal or Aigars's proposal to move forward. I am quite disappointed by that, because I think the current ballot option undermines part of the core of what software freedom is to me.

To me, software freedom is about an achievable set of standards we commit to in order to empower our users. Software freedom may be a sacrifice in terms of not being able to use some convenient market options. But it has never before been a sacrifice of potential. Russ's comment that he doesn't think a Bayesian classifier can be free software hit me hard--my immediate reaction was "If that's true, then software freedom is wrong." Users might want a Bayesian classifier--I want one enough that I've trained one. Software in main, like a mail reader or a mail system, might well want to include a classifier. Saying that even someone as dedicated to freedom as they can be can never live up to our standards and include that reasonable functionality in Debian main makes me think we have lost sight of our users. I appreciate that if you take a position less strong than Russ's, you could ship a crappy Bayesian classifier trained only on DFSG-licensed spam and ham messages. I think it's clear that such a classifier would function significantly less effectively than other classifiers.

In the past, we have said that specific software is not free because the copyright holder is unwilling (but able) to make the necessary grants. Or perhaps the copyright holder did a poor job of tracking their licensing and is unable to document what the license is. But none of that has restricted the type of software that can be free. In this discussion, I have been convinced that the training data for some of the models we might want will never be licensed under a DFSG-free license. This is in part because the copyright holder of the training data is often not the person training the model.
In many cases (spam, virus detection, etc.), the interests of the copyright holder of the training data are not aligned with the interests of those training the model. Yet in the same discussion, I have been convinced that it is much more likely that such training is fair use under copyright law. I do not think there is consensus in our community on the ethics of training on news articles or books, even if it is legally fair use. For me at least, I have no ethical problem using spam and ham messages to train a classifier.

I am sad that it looks like there is not even support to put an option on the ballot that would empower our users to have a spam classifier in main. (For what it's worth, I'd be totally fine moving model data to a new archive section (or expanding the definition of contrib) that did not have the negative connotations of non-free.)
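To make concrete the kind of Bayesian classifier I mean, here is a minimal sketch: multinomial naive Bayes with Laplace smoothing, trainable on whatever spam and ham messages one can lawfully redistribute. The tiny corpus below is invented purely for illustration; it is not a claim about how any real classifier is implemented.

```python
# Minimal naive Bayes spam/ham classifier sketch (Laplace-smoothed).
# The toy corpus is made up for illustration only.
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: iterable of (label, text). Returns a model dict."""
    word_counts = {}          # label -> Counter of words
    doc_counts = Counter()    # label -> number of documents
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest posterior log-probability."""
    total_docs = sum(model["docs"].values())
    v = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["words"].items():
        score = math.log(model["docs"][label] / total_docs)  # log prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Laplace-smoothed per-word log likelihood
            score += math.log((counts[word] + 1) / (total_words + v))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

corpus = [
    ("spam", "win money now click here"),
    ("spam", "free money offer click now"),
    ("ham", "meeting agenda for tomorrow"),
    ("ham", "lunch tomorrow with the team"),
]
model = train(corpus)
print(classify(model, "free money click"))   # -> spam
print(classify(model, "agenda for lunch"))   # -> ham
```

The point is that the code is trivially free; the question this GR is wrestling with is only the status of the counts the training step produces.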