
Re: Concern for: A humble draft policy on "deep learning v.s. freedom"



Hi Osamu,

On 2019-06-08 18:43, Osamu Aoki wrote:
>> This draft is conservative and overkilling, and currently
>> only focuses on software freedom. That's exactly where we
>> start, right?
> 
> OK, but it can't be where we end up.

That's why I chose the words "conservative" and "overkilling":
in my blueprint we can loosen these restrictions bit by bit
as further case studies accumulate.

> Even before scientific "deep learning" data, we already had practical
> "deep learning" data in our archive.

Thanks for pointing these out. They are good case studies
for revising the DL-Policy.

> Please note that mozc, one of the most popular Japanese input methods,
> will be kicked out of main right away if we start enforcing this new
> guideline.

I'm in no position to irresponsibly enforce an experimental
policy before enough case studies have been completed.

>> Specifically, I defined 3 types of pre-trained machine
>> learning models / deep learning models:
>>
>>   Free Model, ToxicCandy Model, Non-free Model
>>
>> Developers who'd like to touch DL software should be
>> cautious about the "ToxicCandy" models. Details can be
>> found in my draft.
> 
> Labeling the situation "ToxicCandy Model" makes a bad impression on
> people, and I am afraid people may not make rational decisions.  Is
> this characterization a correct and sane one?  At least, it looks to
> me like this severely changes the status quo of our policy and
> practice.  So it is worth evaluating the idea without the labeling.

My motivation for the name "ToxicCandy" is simple: to warn developers
about this special case, as it may lead to very difficult copyright
or software-freedom questions. I admit the name does not look
quite friendly. Maybe "SemiFree" looks better?

> As long as the "data" comes in a form which allows us to modify it and
> re-train it to make it better with a set of free software tools, we
> shouldn't treat it as non-free, for sure.  That is my position, and I
> think it is how we have operated as a project.  We never asked how
> things were originally made.  The touchy question is how easy it
> should be to modify and re-train, etc.
>
> Let's list some analogous cases.  We allow a photo of something in our
> archive as a wallpaper etc.  We don't require the subject of the photo,
> or the tool used to make it, to be FREE.  The Debian logo is one
> example; it was created with Photoshop, as I understand it.  Another
> analogy to consider is how we allow independent copyright and licensing
> for dictionary-like data, which must have been made by processing
> earlier copyrighted (possibly non-free) texts with a human brain and
> maybe some script processing.  Packages such as opendict, *spell-*,
> dict-freedict-all, ... are in main.
> 
> I agree it is nice to have the base data in the package.  If you can,
> please include the training data if it is a FREE set.  But it may be
> unrealistic for Debian to get into the business of distributing many
> GB of training data for this purpose.  You may be talking about data
> sizes of over tens of GB.  This is another thing you should realize:
> mandating its inclusion is impractical, since it is not a focus point
> on which Debian needs to spend its resources.
>
> Let's talk about actual cases in main.
> 
> "mecab" is free a tool for Japanese text morphological analysis which
> can create CRF optimized parameters from the marked-up training data.
> 
> (This is also the basis of mozc, which uses such data to produce the
> desired output when typing normal Japanese text from the keyboard.)
> 
> One of the dictionaries for mecab is an 800 MB compressed deb in main:
> unidic-mecab, which is 2.2 GB of data in text format containing
> CRF-optimized parameters and other text data obtained by training.
> These texts and parameters are triple-licensed BSD/LGPL/GPL.
> Re-training it is a very straightforward application of the mecab tool
> with additional data only.  So this is as FREE as it can be under
> current practice, and we have it in main.
>   https://unidic.ninjal.ac.jp/
> 
> When these CRF parameters were initially made, the training used
> non-free data (funded by the Japanese government) available on
> multiple DVDs with a hefty price and restrictions on its use and
> redistribution.  This base training data is as NON-FREE as it can be,
> so we don't distribute it.
>   https://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html
> 
> In the case of MOZC, the original training data is only available
> inside Google and has not been published by them.  Tweaking the data
> is possible, but consistently re-training it in MOZC may not be a
> trivial application of the mecab tool.  We are placing this in main
> now anyway, since its data (CRF-optimized parameters and other text
> data) is licensed under BSD-3-clause, and so we have MOZC in main.

Thank you, Osamu. These cases helped me find a better balance
point for the DL-Policy. I'll add them to the case study
section, and I'm going to add the following points to the DL-Policy:

1. Free datasets used to train a FreeModel are not required to be
   uploaded to our main section, for example those Osamu mentioned and
   the Wikipedia dump. We are not a scientific data archiving
   organization, and such data would overwhelm our infrastructure if we
   uploaded too much of it.

2. Re-training a FreeModel on our infrastructure is not required,
   because the outcome/cost ratio is impractical. The outcome is nearly
   zero compared to directly using the pre-trained FreeModel, while the
   cost is extra carbon dioxide in our atmosphere and wasted developer
   time. (Deep learning produces much more carbon dioxide than we tend
   to think.)

   For classical probabilistic graphical models such as MRFs or the
   CRFs mentioned above, the training process might be trivial (see the
   sketch below), but re-training is still not required.
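
To make the CRF point concrete, here is a minimal sketch of how light
such a (re-)training run can be. It uses the Python library
sklearn-crfsuite as a stand-in for mecab's own trainer, with a tiny
invented corpus, so it only illustrates the shape of the process, not
the actual unidic-mecab/mozc pipeline:

  # Toy CRF training run; sklearn-crfsuite stands in for mecab's
  # trainer, and the two-sentence corpus is invented for illustration.
  import sklearn_crfsuite

  def features(tokens, i):
      # Tiny per-token feature dict; real morphological analysis
      # would use far richer features.
      return {
          'word': tokens[i],
          'is_first': i == 0,
          'prev_word': '' if i == 0 else tokens[i - 1],
      }

  # Hand-labelled sentences: token sequence -> tag sequence.
  corpus = [
      (['the', 'cat', 'sleeps'], ['DET', 'NOUN', 'VERB']),
      (['a', 'dog', 'barks'], ['DET', 'NOUN', 'VERB']),
  ]
  X = [[features(t, i) for i in range(len(t))] for t, _ in corpus]
  y = [tags for _, tags in corpus]

  # Training (or re-training, after extending the corpus with
  # additional data) is a single fit call.
  crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
  crf.fit(X, y)
  print(crf.predict([[features(['a', 'cat', 'barks'], i)
                      for i in range(3)]]))

The whole "model" here is just the fitted parameters; reproducing it
takes seconds, which is exactly why a mandatory re-training step would
add so little for this class of models.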

For SemiFreeModel I still hesitate to make any decision. Once we let
these models enter the main section, there will be many unreproducible
or hard-to-reproduce yet surprisingly "legal" (in terms of the
DL-Policy) files. Maybe this case is to some extent similar to artwork
and fonts; further study is needed. And it's still not easy to find a
balance point between usefulness and freedom for SemiFreeModel.

Thanks,
Mo.

