Concern for: A humble draft policy on "deep learning v.s. freedom"

To: debian-devel@lists.debian.org
Cc: debian-science@lists.debian.org
Subject: Concern for: A humble draft policy on "deep learning v.s. freedom"
From: Osamu Aoki <osamu@debian.org>
Date: Sun, 9 Jun 2019 03:43:09 +0900
Message-id: <20190608184309.GA10146@goofy.osamu.debian.net>
In-reply-to: <f544829dcd6c0f92ea11cdb25543bdac@debian.org>
References: <f544829dcd6c0f92ea11cdb25543bdac@debian.org>

Hi,

On Tue, May 21, 2019 at 12:11:14AM -0700, Mo Zhou wrote:
> Hi people,

I see your good intention, but this basically changes the status quo for
what is required to be in main.

>   https://salsa.debian.org/lumin/deeplearning-policy
>   (issue tracker is enabled)

I read it ;-)

> This draft is conservative and overkilling, and currently
> only focus on software freedom. That's exactly where we
> start, right?

OK, but it can't be where we end up.

Before scientific "deep learning" data, we already have practical "deep
learning" data in our archive.

Please note that mozc, one of the most popular Japanese input methods,
will be kicked out of main for a start if we begin enforcing this new
guideline.

> Specifically, I defined 3 types of pre-trained machine
> learning models / deep learning models:
> 
>   Free Model, ToxicCandy Model. Non-free Model
> 
> Developers who'd like to touch DL software should be
> cautious to the "ToxicCandy" models. Details can be
> found in my draft.

A label like "ToxicCandy Model" makes a bad impression on people, and I
am afraid it may keep people from making a rational decision.  Is this
characterization correct and sane?  At the least, it looks to me like a
severe change to the status quo of our policy and practice.  So it is
worth evaluating the idea without the label.

As long as the "data" comes in a form which allows us to modify it and
re-train it to make it better, using a set of free software tools, we
shouldn't make it non-free, for sure.  That is my position, and I think
this is how we have operated as a project.  We never asked how such data
was originally made.  The touchy question is how easy it should be to
modify and re-train, etc.

Let's list some analogous cases.  We allow a photo of something in our
archive as wallpaper etc.  We don't require the subject of the photo or
the tool used to make it to be FREE.  The Debian logo is one example; it
was created with Photoshop, as I understand.  Another analogy to consider
is how we allow an independent copyright and license for dictionary-like
data, which must have been produced by processing previously copyrighted
(possibly non-free) texts with the human brain and maybe some scripts.
Packages such as opendict, *spell-*, dict-freedict-all, ... are in main.

I agree it is nice to have the base data in the package.  If you can,
please include the training data if it is a FREE set.  But it may be
unrealistic for Debian to get into the business of distributing many GB
of training data for this purpose.  You may be talking about data sizes
of tens of GB.  This is another thing you should realize: mandating its
inclusion is impractical, since it is not the point on which Debian
needs to spend its resources.

Let's talk about actual cases in main.

"mecab" is a free tool for Japanese text morphological analysis which
can create CRF optimized parameters from marked-up training data.

(This is also the base of mozc which uses such data to create desirable
typing output in normal Japanese text input from the keyboard.)

One of the dictionaries for mecab is an 800MB compressed deb in main:
unidic-mecab, which is 2.2GB of data in text format containing CRF
optimized parameters and other text data obtained by training.  These
texts and parameters are triple licensed BSD/LGPL/GPL.  Re-training it is
a very straightforward application of the mecab tool, needing only
additional data.  So this is as FREE as it can be in current practice,
and we have it in main.
  https://unidic.ninjal.ac.jp/

When these CRF parameters were initially made, non-free data was used
(Japanese Government funded), available on multiple DVDs with a hefty
price and restrictions on its use and redistribution.  This base data
for training is as NON-FREE as it can be, so we don't distribute it.
  https://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html

In the case of MOZC, the original training data is available only inside
Google and is not published by them.  Actually, tweaking the data is
possible, but consistently retraining this data in MOZC may not be a
trivial application of the mecab tool.  We are keeping it in main now
anyway, since its data (CRF optimized parameters and other text data) is
licensed under BSD-3-clause.

Regards,

Osamu
