
Re: ML-Policy and tesseract-ocr



On 2019-08-13 00:44, Sam Hartman wrote:
>>>>>> "Mo" == Mo Zhou <lumin@debian.org> writes:
> 
> 
>     >> (result of running the training program with specific input data,
>     >> if I understand correctly?)
> 
>     Mo> Yes, correct.
> 
>     >> The source package would need to Build-Depend on the training
>     >> program and its inputs, but in general there would not need to be
>     >> a normal Depends.
> 
>     Mo> I see. The idea is that an ELF binary (ML model) doesn't have to
>     Mo> Depend on its compiler (training program) and source (input
>     Mo> data).  This makes sense to me and the "Suggests:" restriction
>     Mo> may be better.
> 
>     Mo> The "Suggests:" relationship implicitly hints the user about the
>     Mo> following questions: 1. what is the binary blob
>     Mo> /usr/.../foobar.ml-model installed by the package foobar?
>     Mo> 2. where did these digits come from?  3. how can I well
>     Mo> understand how this model is created by the original author?
>     Mo> 4. how do I obtain a similar model with my own dataset?  etc.
> 
> As a user, if I want to understand how some binary thing gets created,
> I'll
> apt source <package_containing_binary_thing>
> 
> rather than looking at suggests.
> 
> In cases where the model is created in the build process, I think
> build-depends is better than suggests.
> 
> In cases where the model is not recreated, but where software in Debian
> could create the model, I think a README file is better than a package
> relationship.

How about these:

* A source package that produces a binary package containing an ML
  model must either contain the corresponding training program/script
  or Build-Depend on the package that ships it.

  (The training program will then be available if the user wants to
  investigate.)
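
To make the first rule concrete, here is a rough sketch of how one
could check it against debian/control. It is only an illustration: the
package names (foobar-model, foobar-train, foobar-dataset) and both
helper functions are hypothetical, and the parser handles just the
simple case of a well-formed source stanza.

```python
import re

# Hypothetical debian/control; all package names are made up for illustration.
EXAMPLE_CONTROL = """\
Source: foobar-model
Build-Depends: debhelper-compat (= 12),
 foobar-train (>= 1.0),
 foobar-dataset

Package: foobar-model
Architecture: all
Depends: ${misc:Depends}
"""


def build_depends(control_text):
    """Return the package names listed in Build-Depends of the source stanza."""
    names = set()
    in_field = False
    for line in control_text.strip().splitlines():
        if not line.strip():
            break  # a blank line ends the source stanza
        if line.startswith("#"):
            continue  # comment line
        if line[0] in " \t":
            # continuation line: belongs to the field seen most recently
            if not in_field:
                continue
            value = line
        else:
            field, _, value = line.partition(":")
            in_field = field.strip().lower() == "build-depends"
            if not in_field:
                continue
        for item in value.split(","):
            for alternative in item.split("|"):
                # keep the bare name, drop version/arch/profile restrictions
                match = re.match(r"\s*([a-z0-9][a-z0-9+.-]*)", alternative)
                if match:
                    names.add(match.group(1))
    return names


def satisfies_training_rule(control_text, training_pkg, shipped_in_source=False):
    """True if the training program is in the source or is a Build-Depends."""
    return shipped_in_source or training_pkg in build_depends(control_text)
```

With the example above, satisfies_training_rule(EXAMPLE_CONTROL,
"foobar-train") holds because foobar-train is a Build-Depends, while a
package that neither ships nor Build-Depends on its training program
would fail the check.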

* A package that contains an ML model should state the type of model
  it ships in README.Debian, and answer at least the following
  questions: (1) what is the binary blob / pile of digits? (2) where
  is the training program located (which binary package, which path)?
  (3) on what dataset was the model trained? (4) what is the license
  of the original dataset?

  (This is the information a user needs in order to study a model.)
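
As a sketch of what such a README.Debian entry might carry, the record
below has one field per question. Nothing here is an agreed format: the
class, the field names, the layout of the rendered text and all the
example values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ModelReadme:
    """Hypothetical record holding the answers to the four questions."""
    model_file: str        # (1) which file is the blob
    model_type: str        # type of model, per the (pending) definitions
    description: str       # (1) what the blob is
    training_package: str  # (2) binary package shipping the training program
    training_path: str     # (2) path of the training program
    dataset: str           # (3) dataset the model was trained on
    dataset_license: str   # (4) license of the original dataset

    def render(self):
        """Return the answers as plain text suitable for README.Debian."""
        lines = [
            f"Model file: {self.model_file} ({self.model_type})",
            f"What it is: {self.description}",
            f"Training program: {self.training_path} "
            f"(package {self.training_package})",
            f"Training dataset: {self.dataset}",
            f"Dataset license: {self.dataset_license}",
        ]
        return "\n".join(lines) + "\n"


# Example values are invented; the model path is left elided as in the thread.
example = ModelReadme(
    model_file="/usr/.../foobar.ml-model",
    model_type="Free Model",
    description="weights produced by the training program",
    training_package="foobar-train",
    training_path="/usr/bin/foobar-train",
    dataset="foobar-dataset",
    dataset_license="(license of foobar-dataset goes here)",
)
```

Calling example.render() gives five lines of text, one per answer, which
a maintainer could paste into README.Debian and adjust by hand.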

* In a complex system where several ML models interact with each
  other and contribute to the system output, any ML model of a
  different type taints a Free Model. A Non-Free model taints
  any other model it is used together with.

  (well, I need to update the definition of the model types)
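
One way to read the tainting rule is "the combined system is as
restricted as its most restricted model". The sketch below encodes that
reading only; since the model type definitions still need updating, the
OTHER member is a placeholder for whatever types end up between Free and
Non-Free, and the names ModelType and effective_type are hypothetical.

```python
from enum import IntEnum


class ModelType(IntEnum):
    """Model types ordered from least to most restricted (assumed order)."""
    FREE = 0
    OTHER = 1      # placeholder: any type that is neither Free nor Non-Free
    NON_FREE = 2


def effective_type(models):
    """Return the type of a system built from the given model types.

    A Free Model stays Free only alongside other Free Models, and a single
    Non-Free model makes the whole system Non-Free.
    """
    types = list(models)
    if not types:
        raise ValueError("a system needs at least one model")
    return max(types)
```

For example, effective_type([ModelType.FREE, ModelType.OTHER]) is
ModelType.OTHER, and adding one ModelType.NON_FREE model to any
combination yields ModelType.NON_FREE.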

