Re: Training and large data sets?
On Mon, Sep 27, 2021 at 08:26:04PM -0400, M. Zhou wrote:
> Hi Wouter,
> On Mon, 2021-09-27 at 19:14 +0200, Wouter Verhelst wrote:
> > >
> > > The motivation is wonderful but I'd like to point this out:
> > > https://github.com/MycroftAI/mimic2/blob/master/requirements.txt#L2
> > > This software packaging involves a huge number of dependencies,
> > > including tough ones like tensorflow, which has not yet been
> > > fully prepared (the version in the NEW queue does not produce the
> > > python package).
> > I'm aware of that. There are two reasons why this isn't a major issue
> > as far as I'm concerned:
> > - The set of packages required to get MycroftAI into Debian is fairly
> > large, and mimic2 is far far far down the road for now. I'm not there
> > yet, but I do want to make sure it's *possible* for me at some point
> > down the line to upload mimic2.
> From that point of view -- tensorflow will eventually be conquered,
> so I think the decision is fully up to you -- whether you are willing
> to maintain it or not.
> > - Even if we don't get there, from a cursory view it looks like
> > MycroftAI uses a configurable tts middle layer. If I'm not mistaken,
> > then while results are best with mimic2, using mimic2 is not
> > required (it should be possible to use some other tts implementation
> > in the meantime).
> Being able to switch the backend would be quite cool. I haven't
> investigated this, but supporting either a tensorflow or a
> pytorch backend would be very cool. These two are the dominant
> deep learning frameworks, although the speech recognition
> community has its own toolkits as well.
Oh, no, not at that level :-D
Mycroft can use any one of a number of text-to-speech engines; mimic2
is only one of them. I care about mimic2 only insofar as it is part of
the Mycroft suite. If uploading and training that is not going to
happen, it will still be possible to use Mycroft -- just not with mimic2.
But mimic2 has a hard requirement on tensorflow. No two ways about that.
> > > My general recommendation is to only consider things related to
> > > datasets
> > > when a machine learning or deep learning framework is ready in the
> > > archive. Dealing with the framework itself is already difficult
> > > enough.
> > > Good AI software upstreams will try their best to keep datasets
> > > available for the long term.
> > I am absolutely not planning to upload a dataset tomorrow :-) but I
> > *will* need to upload something at some point in the mid- to
> > long-term future,
> > and I just wanted to see what the recommended strategy for that is when
> > I get to it some time down the line. I don't have any good ideas here,
> > that's why I'm asking.
> > It might be relevant for what I do while packaging the framework, after
> > all.
> My recommendation is basically unchanged from term 7 "external-data"
> and term 5 "reproduce-rules" in ML-Policy:
> Uploading a giant dataset does not make sense. A presumably better
> way to solve the problem is to upload a pretrained model, as long as
> (1) the maintainer can reproduce it; (2) there is no license
> problem for any component involved in the process; and (3) the data
> is always publicly and anonymously available (i.e., one does
> not have to register and log in before downloading).
> I'm open to revising the policy :-)
> Any suggestion is welcome.
I think that if we are going to require maintainers to upload pretrained
models, especially models that can only be trained on Nvidia GPUs,
then we've essentially given up. That can't be the intent.
We are finally at a point where *all* software in Debian stable was
built on Debian hardware, most of it reproducibly. Yes, training models
on Debian hardware is a hard problem to tackle; but saying "so let's not
do it" is throwing in the towel, and we shouldn't do that, at least IMO.
Even if not, I personally just do not have the infrastructure to train
the model myself (I'd have to keep my laptop running for days on end,
and, well, it's a laptop, so...), so if that's going to be required,
it's going to be a no from me.
> Separate archive -- that's my blind spot. You put forward a very good
> point. As some datasets can be used for multiple purposes and shared by
> different frameworks / applications, we can indeed set up something to
> track the status of datasets needed by something in our archive, and
> automate things. Automation matters.
There has been talk of a "data" repository in Debian for essentially
forever, but it has also never happened, mostly because there wasn't
a very strong need. I think machine learning changes that, and I think
we should really push for getting that.
It's fine if such a data repository contains large data sets. The reason
we don't want to do that today is that it would explode the Debian
archive, which would upset our mirrors; if the data sets are in a
separate repository that we don't expect the same mirrors to carry, it's
not as much of a problem.
If a data repository is not going to happen, we can make do with
packages that consist of, essentially, a script to download the data.
There are plenty of packages like that in contrib for non-free game
data. I don't think that's ideal, but it's a possible workaround.
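The download-script approach could be sketched roughly as below. This is a minimal illustration only, assuming a pattern like the contrib game-data packages; the URL, checksum, paths, and the `fetch_and_verify` helper name are all hypothetical, and a real package would pin the actual dataset values and likely hook into debhelper machinery:

```python
#!/usr/bin/python3
"""Sketch of a dataset-downloader helper, in the spirit of the contrib
packages that fetch non-free game data at install time.

The URL and checksum below are placeholders, not real Mycroft/mimic2
artifacts; fetch_and_verify is a hypothetical helper name."""
import hashlib
import urllib.request


def fetch_and_verify(url, sha256, dest):
    """Download url to dest, then refuse to keep the file unless it
    matches the checksum pinned in the package."""
    urllib.request.urlretrieve(url, dest)
    with open(dest, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != sha256:
        raise SystemExit(f"checksum mismatch for {dest}: got {digest}")


if __name__ == "__main__":
    # Placeholder values; a real package would pin the actual dataset
    # URL and its published checksum.
    fetch_and_verify(
        "https://example.org/dataset.tar.gz",
        "0" * 64,
        "/var/lib/example/dataset.tar.gz",
    )
```

The point of pinning a checksum is that the package in the archive stays tiny and fully auditable, while the bulk data lives outside the mirrors and its integrity is still verified at install time.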
As anecdotal evidence of this: when I took over the Linux Gazette
packages as the first thing I was going to do for Debian, the person
who had been maintaining them before me, but had run out of time to
do so, said something about "perhaps this should be moved to that
data repository that's being talked about"... and that's over 20
years ago now ;-)
> Just created a new issue regarding this for ML-Policy:
> > So my question is, is there any strategy planned for how to manage
> > datasets and their links with software that is trained on these
> > datasets? I'm sure there will be other datasets that might be used to
> > train free software, and that 2.6G will not be an exception,
> > size-wise?
> As suggested above.
> > Even if that assumption is wrong, I think training models on a
> > dataset
> > is not something we should do for every trivial upload, and so a
> > standard "let's do it on the buildd hosts" seems wrong.
> Unless Debian is willing to add the Nvidia proprietary driver to
> the build servers, re-training anything is not feasible for us.
> (Excluding training on small toy datasets for sanity testing.)
I think Debian *should* do this; otherwise machine learning is a
no-go IMO (but I accept that not everyone necessarily agrees).
I don't think we necessarily need to train everything from day one while
we figure out the details, but I do think it should be the eventual end
goal.