
Re: Concern for: A humble draft policy on "deep learning v.s. freedom"



Hi Osamu,

On 2019-06-09 08:28, Osamu Aoki wrote:
> Although I understand the intent of "SemiFree" or "Tainted" (by Yao), I
> don't think these are a good choice.  We need to draw a line between
> FREE(=main) and NON-FREE(non-free) as a organization.  I think there are

There is no such line, because a big grey area exists. Pure-free models
plus pure-non-free models don't cover all the possible cases, but
Free + SemiFree + NonFree covers all possible cases.

SemiFree lies in a grey area because the ways people interpret it vary:

1. If one regards a model as a sort of human artifact, such as artwork
   or a font, then a free-software-licensed SemiFreeModel is free even
   if it's trained from non-free data. (Ah, yes, there is an MIT
   license! It's a free blob made by a human.)

2. If one regards a model as the product of a mathematical process
   such as training or compilation, then a free-software-licensed
   SemiFreeModel is actually non-free. (Oops, where did these
   MIT-licensed digits come from, and how can I reproduce them? Can I
   trust the source? What if the MIT-licensed model is trained from
   evil data but we don't know?)

I'm not going to draw a line across this grey area, or rather,
minefield. Personally I prefer the second interpretation.

> 2 FREE models we are allowing for "main" as the current practice.
> 
>  * Pure      Free Model from pure free pre-train data only
>  * Sanitized Free Model from free and non-free mixed pre-train data

Please don't make the definition of FreeModel complicated.
FreeModel should be literally and purely free.
We can divide SemiFreeModel into several categories according to
future case studies and make DL-Policy properly match the practice.

> And, we don't allow Non-Free Model in "main"

I think no one would argue about NonFreeModel.

> Question is when do you call it "sanitized" (or "distilled") to be clean
> enough to qualify for "main" ;-)

I expect a model, once sanitized, to be purely free, for example by
removing all non-free data from the training dataset and using only
free training data. A single piece of non-free data pulls the model
into the minefield.
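
To illustrate what I mean by sanitization, here is a minimal Python
sketch, assuming every training sample carries license metadata; the
FREE_LICENSES set and the sample format are my own assumptions, not
part of any existing dataset:

    # Hypothetical sketch: keep only freely-licensed samples before
    # (re-)training. The license tags and the sample format are
    # assumptions, not a reference to a real dataset.
    FREE_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0"}

    def sanitize(samples):
        """Drop every sample whose license is not known to be free."""
        kept = [s for s in samples if s.get("license") in FREE_LICENSES]
        print("kept %d of %d samples" % (len(kept), len(samples)))
        return kept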

>> 2. It's not required to re-train a FreeModel with our infra, because
>>    the outcome/cost ratio is impractical. The outcome is nearly zero
>>    compared to directly using a pre-trained FreeModel, while the cost
>>    is increased carbon dioxide in our atmosphere and wasted developer
>>    time. (Deep learning is producing much more carbon dioxide than we
>>    thought).
>>
>>    For classical probablistic graph models such as MRF or the mentioned
>>    CRF, the training process might be trivial, but re-training is still
>>    not required.
> 
> ... but re-training is highly desirable in line with the spirit of the
> free software.

I guess you didn't catch my point. In my definition of FreeModel and
the SemiFree/ToxicCandy model, providing a training script is
mandatory. Any model without a training script must be non-free. This
requirement also implies that the upstream must provide all information
about the datasets and the training process. Software freedom can be
guaranteed even if we don't always re-train the free models, as that
would only waste electricity. On the other hand, developers should
check whether a model provides such freedom, and local re-training as a
verification step is encouraged.
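
To make the expectation concrete, here is a rough sketch of the kind
of training entry point an upstream could ship with a FreeModel; the
file names, the manifest format and the train() stub are assumptions
of mine, not an existing interface:

    #!/usr/bin/env python3
    # Hypothetical training script shipped with a FreeModel: it records
    # the dataset checksum, the seed and the hyperparameters, so that a
    # developer can re-train locally as a verification step even though
    # Debian itself does not have to.
    import hashlib
    import json
    import random

    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def train(dataset_path, seed, epochs):
        random.seed(seed)           # fixed seed for reproducibility
        # ... the real training loop would go here ...
        return "model.bin"          # placeholder for the trained weights

    if __name__ == "__main__":
        manifest = {
            "dataset": "train.csv",              # assumed file name
            "dataset_sha256": sha256("train.csv"),
            "seed": 0,
            "epochs": 10,
        }
        train(manifest["dataset"], manifest["seed"], manifest["epochs"])
        with open("training-manifest.json", "w") as f:
            json.dump(manifest, f, indent=2)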

Enforcing re-training would be a painful decision and would drive
energetic contributors away, especially when a contributor refuses to
use Nvidia suckware.

> Let's use SanitizedModel to be neutral.

Once sanitized, a model should turn into a free model. If it doesn't,
then why does one sanitize the model?

> We need to have some guideline principle for this sanitization process.
> (I don't have an answer now)

I need case studies at this point.

> This sanitization mechanism shouldn't be used to include obfuscated
> binary blob equivalents.  It's worse than FIRMWARE case since it runs on
> the same CPU as the program code.
> 
> Although "Further Discussion" was the outcome, B in
> https://www.debian.org/vote/2006/vote_004 is worth looking at:
>   Strongly recommends that all non-programmatic works distribute the form
>   that the copyright holder or upstream developer would actually use for
>   modification. Such forms need not be distributed in the orig.tar.gz
>   (unless required by license) but should be made available on upstream
>   websites and/or using Debian project resources.
> 
> Please note this is "Strongly recommends ... should be made
> available..." and not "must be made available ...".

Umm....

> Aside from Policy/Guideline for FREE/NON-FREE discussion, we also need
> to address for the spirit of the reproducible build.  It is nice to have
> checking mechanism for the validity and health of these MODELs.  I know
> one of the Japanese keyboard input method "Anthy" is suffering some
> regression in the upcoming release.  The fix was found too late so I
> uploaded to experimental since it contained too many changes while
> impact was subtle.  If we had a test suite with numerical score outputs,
> we could have detected such regressions by the upstream.  It may be
> unrealistic to aim for exact match for such probabilistic model but
> objectively traceable measure is very desirable to have. 

Isn't this checking mechanism a part of upstream work? When developing
machine learning software, model reproducibility (two different runs
should produce very similar results) is important.

This reproducibility issue is quite different from that of code. A
software upstream doesn't compile a C++ program twice to see whether
the same hashsum is produced; any mismatch would be a compiler bug. For
a machine learning program, if the first training run produced a model
with 95% accuracy but the second run reached merely 30% accuracy,
that's a fatal bug in the program itself. (94% for the second run may
be acceptable.)
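
If upstream wanted to automate such a check, it could look roughly
like the following sketch; train_and_evaluate() is an assumed function
that performs one training run and returns its accuracy:

    # Hypothetical reproducibility check: two independent training runs
    # should reach similar accuracy. A huge gap (95% vs. 30%) fails the
    # check, while a small one (95% vs. 94%) is tolerated.
    def check_reproducibility(train_and_evaluate, tolerance=0.02):
        first = train_and_evaluate(seed=1)
        second = train_and_evaluate(seed=2)
        gap = abs(first - second)
        assert gap <= tolerance, (
            "accuracy gap %.2f%% exceeds tolerance %.2f%%"
            % (gap * 100, tolerance * 100)
        )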

