
Re: Concern for: A humble draft policy on "deep learning v.s. freedom"



Hi,

On Sun, Jun 09, 2019 at 09:27:42AM -0700, Mo Zhou wrote:
> Hi Osamu,
> 
> On 2019-06-09 13:48, Osamu Aoki wrote:
> > Let's think in a bit different perspective.
...
... (I have some explanation for GPL-contamination concern later)
...
> Let me emphasize this again: Don't forget security when talking
> about machine learning models and deep learning models. Data
> used to train input method don't harm in any way, but data
> used to train a model that controls authentication is ...
> Security concern is inevitable along with industrial application
> of deep learning.

Very true.  Are you talking about things like facial recognition?

My immediate concern cases were relatively shallow learning models where
the meaning of each resulting parameter is mostly self-explanatory and
the data were in ASCII.  In other words, transparency is there.

> Maybe I'm just too sensitive after reading ~100 papers about
> attacking/fooling machine learning models. Here is a ridiculous
> example: [Adversarial Reprogramming of Neural Networks]
> (https://arxiv.org/abs/1806.11146)

I see.

I am not even sure that having FREE and open base data is enough, after
reading the first few lines of the linked text.  The input data
distributed by upstream may contain steganography which deep learning
picks up while a human reviewer of the input data overlooks it.  Then
bad things can happen with decisions which use this set of seemingly
nice data with deep learning.

I think one of the important questions is the transparency
(= the ability to inspect and test with independent data) of the
resulting data.  I am no expert on this subject, though.  This is just
my gut feeling.

(Of course, another thing is trust in the people who offer the base
data.)

This reminds me of the "trusting C code" situation.  We need the source,
YES.  We need the source of the compiler, YES.  But these aren't enough.
If the bootstrapping of the compiler was tainted, the trustworthiness of
the C code can't be secured.  We must use GDB to inspect the compiled
result to be sure.

Even shallow data like Japanese input strings can contain an intentional
twist which could be biased.  The string "64" might be linked to the
string "Tiananmen".   I don't see that yet ;-)  This kind of rogue data,
which may upset some people, is a minor concern since it is like an
Easter egg in C code.  We have enough transparency of data in this case.

I am wondering about another shallow-learning data case such as a
Bayesian spam filter.  I can't agree to packaging pre-trained binary
data made from an unknown source text data set, even if upstream tells
us this data is under a FREE license.  This is like binary firmware.  If
the pre-trained binary data is dumped in a readable and meaningful ASCII
text data format, it may be OK if it is licensed under a FREE license.
Here again, transparency of data is important.
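
To make the point concrete, here is a minimal sketch (the dump format
and the token counts are made up for illustration, not any real spam
filter's data) of why such a readable ASCII dump keeps a Bayesian filter
transparent: every parameter is just a token with its spam/ham counts,
so anyone can inspect, audit, or re-derive the classifier from it:

```python
# Made-up example: a Bayesian spam filter whose entire "model" is a
# human-readable ASCII table of per-token spam/ham counts.

from math import log

# Hypothetical ASCII dump: "token spam_count ham_count", one per line.
ascii_dump = """\
winner 50 1
prize 30 5
debian 2 60
meeting 1 40
"""

counts = {}
for line in ascii_dump.splitlines():
    token, spam, ham = line.split()
    counts[token] = (int(spam), int(ham))

total_spam = sum(s for s, _ in counts.values())
total_ham = sum(h for _, h in counts.values())

def spam_score(tokens):
    """Naive Bayes log-odds of spam, with +1 Laplace smoothing."""
    score = 0.0
    for t in tokens:
        s, h = counts.get(t, (0, 0))
        score += log((s + 1) / (total_spam + len(counts)))
        score -= log((h + 1) / (total_ham + len(counts)))
    return score

print(spam_score(["winner", "prize"]) > 0)    # True: spam-like
print(spam_score(["debian", "meeting"]) > 0)  # False: ham-like
```

A binary blob with the same counts would behave identically, but only
the ASCII form lets a reviewer see exactly what "knowledge" is shipped.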

> > If you want more for your package, that's fine.  Please promote such
> > program for your project.  (FYI: the reason I spent my time for fixing
> > "anthy" for Japanese text input is I didn't like the way "mozc" looked
> > as a sort of dump-ware by Google containing the free license dictionary
> > of "knowledge" without free base training data.)  But placing some kind
> > of fancy purist "Policy" wording to police other software doesn't help
> > FREE SOFTWARE.  We got rid of Netscape from Debian because we now have
> > good functional free alternative.
> > 
> > If you can make model without any reliance to non-free base training
> > data for your project, that's great.
> 
> I'll create a subcategory under SemiFreeModel as an umbrella for input
> methods and the like to reduce the overkill level of DL-Policy, after
> reviewing the code by myself. It may take some time because I have
> to understand how things work.
> 
> > I think it's a dangerous and counter productive thing to do to deprive
> > access to useful functionality of software by requesting to use only
> > free data to obtain "knowledge".
> 
> The policy needs to balance not only usefulness/productivity but also
> software freedom (as per definition), reproducibility, security,
> doability, possibility and difficulties.
> 
> The first priority is software freedom instead of productivity
> when we can only choose one, even if users will complain.
> That's why our official ISO cannot ship ZFS kernel module
> and very useful non-free firmware or alike.
> 
> > Please note that the re-training will not erase "knowledge".  It usually
> > just mix-in new "knowledge" to the existing dictionary of "knowledge".
> > So the resulting dictionary of "knowledge" is not completely free of
> > the original training data.  We really need to treat this kind of
> > dictionary of "knowledge" in line with artwork --- not as a software
> > code.
> 
> My interpretation of "re-train" is "train from scratch again" instead
> of "train incrementally". For neural networks the "incremental training"
> process is called "fine-tuning".

I see.

> I understand that you don't wish DL-Policy to kick off input methods
> or alike and make developers down, and this will be sorted out soon...
> 
> > Training process itself may be mathematical, but the preparation of
> > training data and its iterative process of providing the re-calibrating
> > data set involves huge human inputs.
> 
> I don't buy it because I cannot neglect my concerns.
> 
> >> Enforcing re-training will be a painful decision...
> > 
> > Hmmm... this may depends on what kind of re-training.
> 
> Based on DL-Policy's scope of discussion, that "re-training" word
> has a global effect.

I see.

> > At least for unidic-mecab, re-training to add many new words to be
> > recognized by the morphological analyzer is an easier task.  People have
> > used unidic-mecab and a web crawler to create an even bigger dictionary
> > with minimal re-training work (mostly automated, I guess.)
> >   https://github.com/neologd/mecab-unidic-neologd/
> > 
> > I can't imagine to re-create the original core dictionary of "knowledge"
> > for Japanese text processing purely by training with newly provided free
> > data since it takes too much human works and I agree it is unrealistic
> > without serious government or corporate sponsorship project.
> > 
> > Also, the "knowledge" for Japanese text processing should be able to
> > cover non-free texts.  Without using non-free texts as input data, how
> > do you know it works on them?
> 
> Understood. The information you provided is enough to help DL-Policy
> set up an area for input methods and prevent them from being kicked
> out of the archive (given the fundamental requirements hold).
> 
> >> Isn't this checking mechanism a part of upstream work? When developing
> >> machine learning software, the model reproducibility (two different runs
> >> should produce very similar results) is important.
> > 
> > Do you always have a luxury of relying on such friendly/active upstream?
> > If so, I see no problem.  But what should we do if not?
> 
> Generally speaking, deep learning software that fails to reproduce
> in any way is rubbish and should not be packaged. Special cases such
> as input methods or board game models trained collectively by a
> community may exist, but they cannot be used to conclude the general
> law.

I am not sure if it is right to segregate things just by use case.
(It's OK to use that as a guideline for deciding where to spend time.)

I also think it is important to have transparency test for any of these
data.  It's like binary vs. source code.
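
The reproducibility test mentioned above (two independent runs from
scratch producing the same result) can be sketched as follows; this is a
toy seeded model made up for illustration, not any real package's
training code:

```python
# Toy reproducibility check: train the same model twice from scratch
# with a fixed seed and compare the resulting parameters.

import random

def train(seed, steps=500):
    """Fit y = w*x + b to seeded noisy data with SGD; fully deterministic."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(100)]
    data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]
    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y       # prediction error on one sample
        w -= lr * err * x           # stochastic gradient step
        b -= lr * err
    return w, b

print(train(seed=42) == train(seed=42))  # True: identical parameters
```

With a fixed seed the two runs here are bitwise identical; for real deep
learning one would instead compare evaluation metrics within a
tolerance, since GPU nondeterminism makes exact equality too strict.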

Just to be sure:

mecab is not a Japanese input method tool. mecab is a generic
morphological analysis tool which considers only the nearest-neighbor
word.

mozc is the input method which seems to use a mecab code variant within
it, with its own dictionary data.

mecab with any one of the mecab dictionaries can be used to tokenize
Japanese text into words (some dictionaries even generate proper
pronunciation information).  This is often the first step in scanning
Japanese text for any analytical process.
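
For illustration, here is a toy sketch of that kind of nearest-neighbor
(bigram-cost) dictionary segmentation; the mini-dictionary, the costs,
and the zero connection cost are all made up and are not mecab's real
data or algorithm:

```python
# Toy dictionary-based tokenization where only adjacent (nearest-neighbor)
# word pairs influence the cost, in the spirit of mecab's lattice search.

# Hypothetical mini-dictionary: word -> unigram cost (lower = preferred).
WORD_COST = {"東京": 10, "都": 10, "京都": 15, "東": 20, "に": 5, "住む": 10}

def conn_cost(prev, cur):
    # Hypothetical connection cost between adjacent words (flat here;
    # a real analyzer looks this up in a trained cost matrix).
    return 0

def tokenize(text):
    """Minimal Viterbi search over all dictionary segmentations."""
    n = len(text)
    best = {0: (0, [])}           # position -> (total cost, tokens so far)
    for i in range(n):
        if i not in best:
            continue
        cost_i, toks = best[i]
        for j in range(i + 1, n + 1):
            w = text[i:j]
            if w in WORD_COST:
                prev = toks[-1] if toks else None
                c = cost_i + WORD_COST[w] + conn_cost(prev, w)
                if j not in best or c < best[j][0]:
                    best[j] = (c, toks + [w])
    return best[n][1]

print(tokenize("東京都に住む"))  # ['東京', '都', 'に', '住む']
```

Because every parameter is a per-word or per-pair cost, such dictionary
data can in principle be dumped and inspected as plain text, which is
exactly the transparency property discussed above.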

They are also used to crawl the net to find unknown words and add them
to a new, updated, fine-tuned dictionary with default property settings.

The upstream corpus of mecab-unidic contains not only government papers
but also ordinary published books, journals, and newspaper articles.
Thus the corpus can't be distributed as Free.  Also, the initial base
words and some pronunciation/intonation instruction data came from
proprietary dictionaries, so it was not FREE data initially.  But after
good efforts by the government agency and the people around it to
negotiate with the original dictionary data suppliers, the dictionary
publisher agreed to distribute this scope of data under 3 licenses
(BSD/LGPL/GPL) to make it available and compatible in most use cases.
So there is no GPL-contamination issue here.  Also, this dictionary has
no definitions of word meanings, which the dictionary publisher didn't
wish to release under a FREE license.  That is the sanitization process
of these mecab data.  GPL-compatibility is not an issue with
mecab-unidic.

Its extension mecab-unidic-neologd is under the Apache license, which I
guess is OK if we take mecab-unidic under its BSD license.

Regards,

Osamu

