
Re: Concern for: A humble draft policy on "deep learning v.s. freedom"



Hi Osamu,

On 2019-06-09 13:48, Osamu Aoki wrote:
> Let's think in a bit different perspective.
> 
> What is the outcome of "Deep Lerning".  That's "knowledge".

Don't mix everything into a single obscure word, "knowledge".
Such a thing is not representable in a programming language or
mathematical notation, because we cannot define what "knowledge"
is in an unambiguous way. Squashing everything into "knowledge"
does exactly the inverse of what I'm doing.

> If the dictionary of "knowledge" is expressed in a freely usable
> software format with free license, isn't it enough?

A free license doesn't solve all my concerns. If we just treat
models as a sort of artwork, what if

1. upstream happened to license a model trained from non-free
   data under the GPL. Is upstream violating the GPL by not
   releasing the "source" (the material necessary to reproduce
   the work)?

2. upstream trained a model on a private dataset that contains
   deliberately malicious data, and released it under the MIT
   license. (Has malware then just sneaked into main?)

I have to consider all possible models and applications across
the whole machine learning and deep learning area. The experience
gained from input methods cannot cover all possible cases.

A pile of numbers from a classical machine learning model is
generally interpretable. That means a human can understand what
each number means (e.g. a conditional probability, a frequency, etc.).

A pile of numbers from a deep neural network is basically not
interpretable -- humans cannot fully understand them. Something
malicious could hide in this pile of numbers due to the complexity
of the non-linear mapping that the network has learned.
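
Here is a minimal sketch of what I mean (assuming scikit-learn and a
toy bag-of-words dataset; every name below is illustrative, not taken
from any real package): the naive Bayes parameters read directly as
log conditional probabilities, while the MLP weight matrices carry no
such per-number meaning.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier

    X = np.random.randint(0, 5, size=(200, 20))  # toy word counts
    y = np.random.randint(0, 2, size=200)        # toy binary labels

    nb = MultinomialNB().fit(X, y)
    # Each entry is log P(word | class): a human-readable quantity.
    print(nb.feature_log_prob_.shape)            # (2, 20)

    mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=50).fit(X, y)
    # Each entry is just a learned weight; no single number has a
    # standalone meaning, so auditing it by inspection is hopeless.
    print([w.shape for w in mlp.coefs_])         # [(20, 64), (64, 2)]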

Proposed updates:

1. If a SemiFreeModel doesn't raise any security concern, we
   can accept it into the main section. For an imagined example,
   upstream foobar writes an input method and trains a probabilistic
   model on the developer's personal diary. Upstream releases the
   model under a free license but doesn't release the diary.
   Such a model is fine as it doesn't incur any security problem.

2. A security-sensitive SemiFreeModel is prohibited from entering
   the main section. Why should we trust it if we cannot inspect
   everything about it?

Let me emphasize this again: don't forget security when talking
about machine learning and deep learning models. Data used to
train an input method does no harm, but data used to train a
model that controls authentication is another matter...
Security concerns are inevitable as deep learning moves into
industrial applications.

Maybe I'm just too sensitive after reading ~100 papers about
attacking/fooling machine learning models. Here is a ridiculous
example: [Adversarial Reprogramming of Neural Networks]
(https://arxiv.org/abs/1806.11146)
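
For a more pedestrian illustration than that paper, here is a hedged
sketch of the classic fast gradient sign attack (not the reprogramming
attack above), assuming PyTorch and a toy linear classifier standing
in for a packaged model: a small, targeted perturbation of the input
can flip the prediction.

    import torch
    import torch.nn.functional as F

    model = torch.nn.Linear(784, 10)   # stand-in for a real packaged model
    model.eval()

    x = torch.rand(1, 784, requires_grad=True)  # a "clean" input
    target = torch.tensor([3])                  # its true label

    loss = F.cross_entropy(model(x), target)
    loss.backward()

    # FGSM: nudge every input dimension a small step in the direction
    # that increases the loss for the true label.
    epsilon = 0.05
    x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

    print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))  # may differ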

> If you want more for your package, that's fine.  Please promote such
> program for your project.  (FYI: the reason I spent my time for fixing
> "anthy" for Japanese text input is I didn't like the way "mozc" looked
> as a sort of dump-ware by Google containing the free license dictionary
> of "knowledge" without free base training data.)  But placing some kind
> of fancy purist "Policy" wording to police other software doesn't help
> FREE SOFTWARE.  We got rid of Netscape from Debian because we now have
> good functional free alternative.
> 
> If you can make model without any reliance to non-free base training
> data for your project, that's great.

I'll create a subcategory under SemiFreeModel as an umbrella for input
methods and the like, to reduce the strictness of the DL-Policy, after
reviewing the code myself. It may take some time because I have
to understand how things work.

> I think it's a dangerous and counter productive thing to do to deprive
> access to useful functionality of software by requesting to use only
> free data to obtain "knowledge".

The policy needs to balance not only usefulness/productivity but also
software freedom (as per its definition), reproducibility, security,
feasibility, and practical difficulty.

When we can only choose one, the first priority is software freedom
rather than productivity, even if users will complain.
That's why our official ISO cannot ship the ZFS kernel module
or very useful non-free firmware and the like.

> Please note that the re-training will not erase "knowledge".  It usually
> just mix-in new "knowledge" to the existing dictionary of "knowledge".
> So the resulting dictionary of "knowledge" is not completely free of
> the original training data.  We really need to treat this kind of
> dictionary of "knowledge" in line with artwork --- not as a software
> code.

My interpretation of "re-train" is "train from scratch again" rather
than "train incrementally". For neural networks, the incremental
training process is called "fine-tuning".

I understand that you don't want the DL-Policy to kick out input
methods and the like and demoralize their developers; this will be
sorted out soon...

> Training process itself may be mathematical, but the preparation of
> training data and its iterative process of providing the re-calibrating
> data set involves huge human inputs.

I don't buy that, because I cannot set my concerns aside.

>> Enforcing re-training will be a painful decision...
> 
> Hmmm... this may depends on what kind of re-training.

Within DL-Policy's scope of discussion, the word "re-training"
has a global effect.

> At least for unidic-mecab, re-training to add many new words to be
> recognized by the morphological analyzer is an easier task.  People has
> used unidic-mecab and web crawler to create even bigger dictionary with
> minimal work of re-training (mostly automated, I guess.)
>   https://github.com/neologd/mecab-unidic-neologd/
> 
> I can't imagine to re-create the original core dictionary of "knowledge"
> for Japanese text processing purely by training with newly provided free
> data since it takes too much human works and I agree it is unrealistic
> without serious government or corporate sponsorship project.
> 
> Also, the "knowledge" for Japanese text processing should be able to
> cover non-free texts.  Without using non-free texts as input data, how
> do you know it works on them.

Understood. The information you provided is enough to help DL-Policy
set up an area for input methods and keep them from being kicked
out of the archive (provided the fundamental requirements hold).

>> Isn't this checking mechanism a part of upstream work? When developing
>> machine learning software, the model reproduciblity (two different runs
>> should produce very similar results) is important.
> 
> Do you always have a luxury of relying on such friendly/active upstream?
> If so, I see no problem.  But what should we do if not?

Generally speaking, deep learning software that fails to reproduce
in any way is rubbish and should not be packaged. Special cases such
as input methods or board-game models trained collectively by a
community may exist, but they cannot be used to derive the general
rule.
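
To be concrete about what "reproduce" means here, a toy sketch
(assuming PyTorch and synthetic data): two independent training runs
should land on comparable results, and wild divergence between runs
is a red flag.

    import torch

    def train_once(seed):
        torch.manual_seed(seed)              # pin the randomness we control
        model = torch.nn.Linear(8, 1)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(64, 8), torch.randn(64, 1)
        for _ in range(200):
            opt.zero_grad()
            loss = ((model(x) - y) ** 2).mean()
            loss.backward()
            opt.step()
        return loss.item()

    # Two runs with different seeds should give similar final losses;
    # if they differ wildly, the training is not reproducible in any
    # useful sense.
    print([train_once(seed) for seed in (0, 1)])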

> Anthy's upstream is practically Debian repo now.
> 
> Osamu

