[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Concern for: A humble draft policy on "deep learning v.s. freedom"



Hi Mo,

On Sat, Jun 08, 2019 at 10:07:13PM -0700, Mo Zhou wrote:
> Hi Osamu,
> 
> On 2019-06-08 18:43, Osamu Aoki wrote:
> >> This draft is conservative and overkilling, and currently
> >> only focus on software freedom. That's exactly where we
> >> start, right?
> > 
> > OK but it can't be where we end-up-with.
> 
> That's why I said the two words "conservative" and "overkilling".
> In my blueprint we can actually loosen these restrictions bit
> by bit with further case study.

Yes, we agree here!

> > Before scientific "deep learning" data, we already have practical "deep
> > learning" data in our archive.
> 
> Thanks for pointing them out. They are good case study
> for me to revise the DL-Policy.
> 
> > Please note one of the most popular Japanese input method mozc will be
> > kicked out from main as a starter if we start enforcing this new
> > guideline.
> 
> I'm in no position of irresponsibly enforcing an experimental
> policy without having finished enough case study.

I noticed that you were thinking about this deeply, but I saw some
danger of other people making decisions too quickly based on the
"labeling".

Please check our history on the following GRs:
 https://www.debian.org/vote/2004/vote_003
 https://www.debian.org/vote/2006/vote_004

We are stuck with "Further discussion" at this moment.

> >> Specifically, I defined 3 types of pre-trained machine
> >> learning models / deep learning models:
> >>
> >>   Free Model, ToxicCandy Model, Non-free Model
> >>
> >> Developers who'd like to touch DL software should be
> >> cautious to the "ToxicCandy" models. Details can be
> >> found in my draft.
> > 
> > With a labeling like "ToxicCandy Model" for the situation, it makes bad
> > impression on people and I am afraid people may not be make rational
> > decision.  Is this characterization correct and sane one?  At least,
> > it looks to me that this is changing status-quo of our policy and
> > practice severely.  So it is worth evaluating idea without labeling.
> 
> My motivation for the naming "ToxicCandy" is pure: to warn developers
> about this special case as it may lead to very difficult copyright
> or software freedom questions. I admit that this name looks not
> quite friendly. Maybe "SemiFree" look better?

Although I understand the intent of "SemiFree" or "Tainted" (by Yao), I
don't think these are a good choice.  We need to draw a line between
FREE (=main) and NON-FREE (=non-free) as an organization.  I think there
are two FREE model types we are allowing in "main" under current
practice:

 * Pure      Free Model from pure free pre-train data only
 * Sanitized Free Model from free and non-free mixed pre-train data

And we don't allow Non-Free Models in "main".

The question is when we can call a model "sanitized" (or "distilled")
clean enough to qualify for "main" ;-)

> > As long as the "data" comes in the form which allows us to modify it and
> > re-train it to make it better with a set of free software tools to do it,
> > we shouldn't make it non-free, for sure.  That is my position and I
> > think this was what we operated as the project.  We never asked how they
> > are originally made.  The touchy question is how easy it should be to
> > modify and re-train, etc.
> >
> > Let's list analogy cases.  We allow a photo of something on our archive
> > as wallpaper etc.  We don't ask object of photo or tool used to make it
> > to be FREE.  Debian logo is one example which was created by Photoshop
> > as I understand.  Another analogy to consider is how we allow
> > independent copyright and license for the dictionary like data which
> > must have processed previous copyrighted (possibly non-free) texts by
> > human brain and maybe with some script processing.  Packages such as
> > opendict, *spell-*, dict-freedict-all, ... are in main.

...

> Thank you Osamu. These cases inspired me on finding a better
> balance point for DL-Policy. I'll add these cases to the case
> study section, and I'm going to add the following points to DL-Policy:
> 
> 1. Free datasets used to train FreeModel are not required to upload
>    to our main section, for example those Osamu mentioned and wikipedia
>    dump. We are not scientific data archiving organization and these
>    data will blow up our infra if we upload too much.
> 
> 2. It's not required to re-train a FreeModel with our infra, because
>    the outcome/cost ratio is impractical. The outcome is nearly zero
>    compared to directly using a pre-trained FreeModel, while the cost
>    is increased carbon dioxide in our atmosphere and wasted developer
>    time. (Deep learning is producing much more carbon dioxide than we
>    thought).
> 
>    For classical probabilistic graph models such as MRF or the mentioned
>    CRF, the training process might be trivial, but re-training is still
>    not required.

... but re-training is highly desirable, in line with the spirit of
free software.

> For SemiFreeModel  I still hesitate to make any decision. Once we let
      SanitizedModel
> them enter the main section there will be many unreproducible
> or hard-to-reproduce but surprisingly "legal" (in terms of DL-Policy)
> files. Maybe this case is to some extent similar to artworks and fonts.
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                           YES.
> Further study needed. And it's still not easy to find a balance point
> for SemiFreeModel  between usefulness and freedom.
      SanitizedModel

Let's use SanitizedModel to be neutral.

We need to have some guideline principle for this sanitization process.
(I don't have an answer now)

This sanitization mechanism shouldn't be used to include obfuscated
binary-blob equivalents.  That would be worse than the FIRMWARE case,
since such a blob runs on the same CPU as the program code.

Although "Further Discussion" was the outcome, B in
https://www.debian.org/vote/2006/vote_004 is worth looking at:
  Strongly recommends that all non-programmatic works distribute the form
  that the copyright holder or upstream developer would actually use for
  modification. Such forms need not be distributed in the orig.tar.gz
  (unless required by license) but should be made available on upstream
  websites and/or using Debian project resources.

Please note this says "Strongly recommends ... should be made
available ..." and not "must be made available ...".

Aside from the Policy/Guideline for the FREE/NON-FREE discussion, we
also need to address the spirit of the reproducible build.  It would be
nice to have a checking mechanism for the validity and health of these
MODELs.  I know one of the Japanese keyboard input methods, "Anthy", is
suffering a regression in the upcoming release.  The fix was found too
late, so I uploaded it to experimental since it contained too many
changes while the impact was subtle.  If we had a test suite with
numerical score outputs, we could have detected such regressions by the
upstream.  It may be unrealistic to aim for an exact match for such a
probabilistic model, but an objectively traceable measure is very
desirable to have.
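To make the idea concrete, such a test suite could compare a numerical
score against a recorded baseline with a tolerance, instead of demanding
byte-exact output.  Here is a minimal sketch in Python; all the names
(score_model, GOLD_PAIRS, BASELINE_SCORE, check_regression) and the toy
gold data are hypothetical illustrations, not part of Anthy or any
existing test suite:

```python
# Hypothetical sketch of a numerical regression check for a
# probabilistic model (e.g. an input-method conversion engine).
# Instead of requiring exact output, we track an accuracy score
# and flag the package when the score drops below a recorded
# baseline minus a tolerance.

def score_model(convert, gold_pairs):
    """Return the fraction of reference inputs converted correctly."""
    correct = sum(1 for inp, expected in gold_pairs
                  if convert(inp) == expected)
    return correct / len(gold_pairs)

# Gold data: (input, expected conversion) pairs for a test corpus.
# These examples are illustrative only.
GOLD_PAIRS = [
    ("konnichiha", "こんにちは"),
    ("nihongo", "日本語"),
    ("kyou", "今日"),
]

BASELINE_SCORE = 1.0   # score recorded for the previous release
TOLERANCE = 0.05       # allowed drop before we call it a regression

def check_regression(convert):
    """True if the model still scores within tolerance of the baseline."""
    return score_model(convert, GOLD_PAIRS) >= BASELINE_SCORE - TOLERANCE
```

Run at package build or autopkgtest time, a check like this would have
produced an objectively traceable number for the Anthy regression even
though the model's output is not expected to match exactly.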

Osamu


Reply to: