
Re: Concern for: A humble draft policy on "deep learning v.s. freedom"



Hi,

Let's think about this from a slightly different perspective.

What is the outcome of "Deep Learning"?  It's "knowledge".

If the dictionary of "knowledge" is expressed in a freely usable
software format under a free license, isn't that enough?

If you want more for your package, that's fine.  Please promote such a
program for your project.  (FYI: the reason I spent my time fixing
"anthy" for Japanese text input is that I didn't like the way "mozc"
looked: a sort of dump-ware from Google containing a freely licensed
dictionary of "knowledge" without free base training data.)  But
placing some kind of fancy purist "Policy" wording to police other
software doesn't help FREE SOFTWARE.  We got rid of Netscape from
Debian because we now have a good, functional free alternative.

If you can make a model for your project without any reliance on
non-free base training data, that's great.

I think it is dangerous and counterproductive to deprive users of
access to useful software functionality by demanding that only free
data be used to obtain "knowledge".

Please note that re-training will not erase "knowledge".  It usually
just mixes new "knowledge" into the existing dictionary of "knowledge".
So the resulting dictionary of "knowledge" is not completely free of
the original training data.  We really need to treat this kind of
dictionary of "knowledge" in line with artwork --- not as software
code.

The training process itself may be mathematical, but preparing the
training data and iteratively providing the re-calibration data sets
involves a huge amount of human input.

> Enforcing re-training will be a painful decision...

Hmmm... this may depend on what kind of re-training.

At least for unidic-mecab, re-training to add many new words to be
recognized by the morphological analyzer is an easier task.  People
have used unidic-mecab and a web crawler to create an even bigger
dictionary with minimal re-training work (mostly automated, I guess):
  https://github.com/neologd/mecab-unidic-neologd/
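
To illustrate what I mean by the dictionary being "just data": the free
analyzer can be pointed at either dictionary at run time.  A minimal
sketch using the mecab-python3 binding; the dictionary paths below are
only assumptions for illustration, adjust them for your installation.

  import MeCab

  # Same free analyzer, different dictionaries of "knowledge"
  # (paths are assumptions; adjust for your installation).
  tagger_unidic = MeCab.Tagger("-d /var/lib/mecab/dic/unidic")
  tagger_neologd = MeCab.Tagger("-d /var/lib/mecab/dic/mecab-unidic-neologd")

  text = "自由ソフトウェアと深層学習"
  print(tagger_unidic.parse(text))   # morphemes from the core dictionary
  print(tagger_neologd.parse(text))  # same code, bigger dictionary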

I can't imagine re-creating the original core dictionary of "knowledge"
for Japanese text processing purely by training on newly provided free
data, since it would take too much human work, and I agree that it is
unrealistic without a serious government- or corporate-sponsored
project.

Also, the "knowledge" for Japanese text processing should be able to
cover non-free texts.  Without using non-free texts as input data, how
do you know it works on them?

> Isn't this checking mechanism a part of upstream work? When developing
> machine learning software, the model reproducibility (two different runs
> should produce very similar results) is important.

Do you always have the luxury of relying on such a friendly/active
upstream?  If so, I see no problem.  But what should we do if not?
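
(For reference, a rough sketch of the kind of check I understand is
meant --- train twice and compare the results.  The toy one-parameter
fit below is purely illustrative, not any particular upstream's actual
procedure.)

  import random

  def train(seed):
      # Toy "training": fit y = w*x to noisy data generated with w = 2.
      rng = random.Random(seed)
      data = [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in range(1, 101)]
      return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

  w1 = train(seed=1)
  w2 = train(seed=2)
  # "Two different runs should produce very similar results."
  assert abs(w1 - w2) < 0.01, (w1, w2)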

Anthy's upstream is practically the Debian repo now.

Osamu

