Re: ML-Policy and tesseract-ocr

To: Marvin Renich <mrvn@renich.org>
Cc: debian-devel@lists.debian.org
Subject: Re: ML-Policy and tesseract-ocr
From: Mo Zhou <lumin@debian.org>
Date: Mon, 12 Aug 2019 17:28:48 -0700
Message-id: <743dc9c77c5b842123124d6d41f58b7c@debian.org>
In-reply-to: <20190812183520.oqyumnve5m6eqjm6@basil.wdw>
References: <33417ce2bcf9b6a0efaf4771b83c6df1@debian.org> <20190812183520.oqyumnve5m6eqjm6@basil.wdw>

Hi Marvin,

On 2019-08-12 18:35, Marvin Renich wrote:
> * Mo Zhou <lumin@debian.org> [190812 10:31]:
>> To this end, I wrote the policy #5 [3]:
>>
>>    A package that includes a machine learning model, must also include
>>    the corresponding training program, or depend on the package that
>>    provides the corresponding training program.
>>
>> Does that make sense? If it looks good, then the solution
>> for this bug is already obvious enough.
> 
> Perhaps I am not interpreting what you are saying correctly, but I would
> say it is wrong.  The corresponding training program must be packaged in
> Debian, but it seems unlikely that there would be a binary package
> dependency from the model to the training program

The original "policy" was based on a rather strong restriction: the
training script must be present whenever an ML model is installed.
I meant "Depends" in the original text, but perhaps "Suggests" is better,
since "Depends" may introduce a circular dependency or the
arch-all-dep-on-arch-any problem.

That means "depend on ..." could be revised to "`Suggests:`".

> (result of running the training program with
> specific input data, if I understand correctly?) 

Yes, correct.

> The source package would need to Build-Depend on the training
> program and its inputs, but in general there would not need to be a
> normal Depends.

I see. The idea is that an ELF binary (ML model) doesn't have to
Depend on its compiler (training program) and source (input data).
This makes sense to me, and the "Suggests:" restriction may be better.
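
As a rough sketch of how that split could look in debian/control
(only a fragment, other required fields omitted; the package names
foobar-trainer and foobar-training-data are made up for illustration
and do not refer to real packages):

```
# Hypothetical example: all package names here are placeholders.
Source: foobar
# The training program and its input data are only needed at build time.
Build-Depends: debhelper-compat (= 12),
               foobar-trainer,
               foobar-training-data

Package: foobar
Architecture: all
# Weak pointer to the training program instead of a hard "Depends:".
Suggests: foobar-trainer
Description: pretrained ML model (hypothetical example)
 The model is assumed to be produced at build time by running
 foobar-trainer on foobar-training-data.
```

With apt's default configuration, packages listed in "Suggests:" are
not installed automatically, so installing the model would not pull
in the trainer unless the user asks for it.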

The "Suggests:" relationship implicitly points the user towards
answers to questions such as:
1. what is the binary blob /usr/.../foobar.ml-model installed by the
   package foobar?
2. where did these digits come from?
3. how can I understand how this model was created by the
   original author?
4. how do I obtain a similar model with my own dataset?
etc.

I think most users will not try to dig into the details of
the model, or even try to understand what it is. So changing
the model -> training script relationship from "Depends" to
"Suggests" could also avoid pulling in the whole stack of
training software when installing the model.

> Perhaps you were just being sloppy about Build-Depends vs Depends, but
> when writing policy it is important to be very specific about that.

Thanks, I'll keep that in mind.
