
ML-Policy and tesseract-ocr



Hi -devel,

src:tesseract-ocr has two concrete bugs that are related to
the ML-Policy[1]. Let's discuss possible solutions.

BTW, today I drew a diagram[2] for the ML-Policy. Those
who don't want to read the full text can have a look
at the diagram instead.

-- #933878: training files are split across libtesseract-dev ...

We've already had several rounds of discussion on this topic,
and we know that the training scripts/programs are critical
for software freedom. So it's natural to require that
an ML model must co-exist with its training program.

To this end, I wrote policy #5 [3]:

   A package that includes a machine learning model must also include
   the corresponding training program, or depend on the package that
   provides the corresponding training program.

Does that make sense? If it looks good, then the solution
to this bug is already obvious.
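For illustration only (the package names below are completely
hypothetical, not the actual tesseract packages), the dependency
side of policy #5 could look roughly like this in debian/control:

   Package: foo-ocr-model-eng
   Architecture: all
   Depends: foo-ocr-training, ${misc:Depends}
   Description: pretrained English model for foo-ocr
    The pretrained model co-exists with its training program via
    a dependency on the package that provides it (policy #5).

The alternative, of course, is to ship the model and its training
program in the same binary package.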

-- #699609: please provide source for language files ...

We have had divergent opinions on how to interpret the pretrained
model itself -- is it a sort of artistic creation, or the
result of compilation?

In the past, the ML-Policy divided ML models into 3 types:
Free, Non-Free and ToxicCandy. However, the definition of
ToxicCandy might be too rigid and catch too much. Hence I
split the original "ToxicCandy" into 2 types: "ToxicCandy"
and "Sourceless". The dividing line between them is "does
the model take part in a critical decision[4], or in the
training process of another model?"

There are many ML applications where we can loosen the
restriction a little bit, in order to balance the software
freedom interpretation against usefulness. For example,
input methods (which involve sequence modeling: given the
previous input tokens t_1, t_2, ..., t_n, predict the next
token t_{n+1}), or image super-resolution (which involves
learning a parametric model/mapping that maps a low-res
matrix, say 4x4, into a high-res one, say 8x8; plain linear
upsampling/upscaling of an image produces much inferior
results compared to ML/DL-based solutions). For these
examples I think FOSS-licensed models trained on non-free
datasets are acceptable, as long as the training program is
provided (the user could then train a somewhat similar and
working model from other datasets), and the models are not
involved in critical decisions or in the training of
another model[4]. I call such a model a "Sourceless Model"
(better name?).

Models that are involved in critical decisions, which may
easily threaten security (e.g. authentication) or even life
(e.g. autopilot), are classified as ToxicCandy when
their training data is not free.

In other words, my ML-Policy suggests that non-critical
models could be treated as some sort of artistic
creation, while critical ones should be treated more
strictly. In this way the ML-Policy will be less of
an overkill.
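In pseudo-code, my reading of the resulting taxonomy is roughly
the following (the predicate names are mine and the Free/Non-Free
conditions are simplified; the policy text[1] is authoritative):

   def classify(license_is_free, training_data_is_free,
                critical_or_trains_other_model):
       # Policy #5 separately requires the training program to be
       # shipped or depended upon, whatever the class below.
       if not license_is_free:
           return "Non-Free"
       if training_data_is_free:
           return "Free"
       # FOSS-licensed, but trained on non-free data:
       if critical_or_trains_other_model:
           return "ToxicCandy"
       return "Sourceless"

   # e.g. a FOSS-licensed OCR model trained on non-free scans, not
   # used for authentication/autopilot or to train other models:
   print(classify(True, False, False))   # -> "Sourceless"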

Does this look good to you? According to the proposed
ML-Policy, the training data doesn't have to be uploaded
to the Debian archive (we are not a scientific data
archiving organization), the package build process doesn't
have to re-train the model (training is too expensive),
and the pre-trained tesseract-ocr models are[5] "Sourceless"
models since they are not designed for critical tasks.
If everybody agrees with my proposal, #699609 can be
closed without action.

-- towards ML-Policy standardization

I talked to the Policy maintainers previously. The ML-Policy
itself is too experimental/prospective, and involves
content about "software freedom interpretation" that
doesn't fit into the Policy document.

I don't know where this topic/concern/field will eventually
go, but I'm sure I'll maintain the ML-Policy "out-of-tree"
for a while as a reference for the community.

---

Any comments will be appreciated.

[1] https://salsa.debian.org/lumin/ml-policy
[2] https://salsa.debian.org/lumin/ml-policy/blob/master/diag.svg
[3] https://salsa.debian.org/lumin/ml-policy/blob/master/README.rst#L89
[4] https://salsa.debian.org/lumin/ml-policy/blob/master/README.rst#L40
[5] I haven't looked into the deep details of the model. I might be wrong.

