
Re: Training and large data sets?



Hi Wouter,

On Mon, 2021-09-27 at 19:14 +0200, Wouter Verhelst wrote:
> > 
> > The motivation is wonderful but I'd like to point this out:
> > https://github.com/MycroftAI/mimic2/blob/master/requirements.txt#L2
> > This software packaging involves a huge amount of dependencies
> > including hard bones like tensorflow, which has not yet been
> > fully prepared (the one in NEW queue does not produce the python
> > package). 
> 
> I'm aware of that. There's two reasons why this isn't a major issue as
> far as I'm concerned:
> 
> - The set of packages required to get MycroftAI into Debian is fairly
>   large, and mimic2 is far far far down the road for now. I'm not there
>   yet, but I do want to make sure it's *possible* for me at some point
>   down the line to upload mimic2.

From that point of view: tensorflow will eventually be conquered, so
I think the decision is entirely up to you -- whether you are willing
to maintain it or not.

> - Even if we don't get there, from a cursory view it looks like
>   MycroftAI uses a configurable tts middle layer. If I'm not mistaken,
>   then while results are best with mimic2, using mimic2 is not
>   required (it should be possible to use some other tts implementation
>   in the mean time).

Being able to switch the backend would be quite cool. I haven't
investigated this, but supporting either a tensorflow or a pytorch
backend would be very useful, as those two are the dominant deep
learning frameworks -- although the speech recognition community has
its own toolkits as well.
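
To illustrate what I mean by a switchable backend, here is a rough
Python sketch of a pluggable TTS layer. It is purely illustrative --
the class and function names are made up and this is not Mycroft's
actual API:

from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Interface that any TTS engine (mimic2, espeak, ...) would implement."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio (e.g. WAV bytes) for the given text."""


class EspeakBackend(TTSBackend):
    """Placeholder backend; a real one would call a locally packaged engine."""

    def synthesize(self, text: str) -> bytes:
        raise NotImplementedError("call the actual engine here")


def get_backend(name: str) -> TTSBackend:
    """Select a backend from configuration, so mimic2 stays optional."""
    backends = {"espeak": EspeakBackend}
    return backends[name]()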

> > 
> > My general recommendation is to only consider things related to
> > dataset
> > when a machine learning or deep learning framework is ready in the
> > archive. Dealing with the framework itself is already difficult
> > enough.
> > Good AI software upstreams will try their best to keep dataset
> > available for long term.
> 
> I am absolutely not planning to upload a dataset tomorrow :-) but I
> *will* need to upload something at some point in the mid to long
> future,
> and I just wanted to see what the recommended strategy for that is when
> I get to it some time down the line. I don't have any good ideas here,
> that's why I'm asking.
> 
> It might be relevant for what I do while packaging the framework, after
> all.

My recommendation is basically unchanged from term 7 "external-data"
and term 5 "reproduce-rules" in ML-Policy:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.pdf

Uploading a giant dataset does not make sense. A presumably better
way to solve the problem is to upload a pretrained model, as long as
(1) the maintainer can reproduce it; (2) there is no license problem
for any component involved in the process; and (3) the data is always
publicly and anonymously available (e.g., one does not have to
register and log in before downloading).

I'm open to revising the policy :-)
Any suggestions are welcome.
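
As a concrete (and purely hypothetical) example of condition (3), a
package or CI job could check that the external data is still
anonymously downloadable and bit-identical to what the maintainer
trained on. The URL and checksum below are placeholders, not the real
LJ Speech values:

import hashlib
import urllib.request

DATA_URL = "https://example.org/datasets/speech-corpus.tar.bz2"  # placeholder
EXPECTED_SHA256 = "0" * 64                                        # placeholder


def verify_external_data(url: str, expected_sha256: str) -> bool:
    """Download anonymously (no login) and compare the SHA-256 checksum."""
    sha = hashlib.sha256()
    with urllib.request.urlopen(url) as response:  # no credentials involved
        for chunk in iter(lambda: response.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest() == expected_sha256


if __name__ == "__main__":
    print(verify_external_data(DATA_URL, EXPECTED_SHA256))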

> > 
> > 2.6GB of our archive space is not trivial.
> 
> Yes, that's entirely my point :-)

Got your point.

> > Only when its importance 
> > and number of audience outweighs the bulky size should we discuss it.
> > 
> > > So, although I'm not ready yet to start packaging mimic2, I think
> > > it's
> > > reasonable to see how I would go about that.
> > 
> > Tensorflow is expected to be a big blocker.
> 
> I know, but that's just software, and a "mere" matter of packaging it.
> It's not trivial, but we've got good procedures for doing that and we
> know how to tackle a problem of that type.
> 
> Datasets are a different matter. I know that 2.6G is not trivial. I'm
> not suggesting we upload it to the archive; I think it does not make
> sense to do that. Instead, I think we require some form of a separate
> archive with separate tracking, where the dataset and the metadata may
> even need to be in separate files. The LJ Speech dataset was not
> written
> by the Mycroft team; I suspect it might be used by other projects, too
> (but I am not certain of that). Even if that turns out to not be the
> case, mimic2 supports being trained on different datasets, so it should
> be possible to use a different speech dataset, one that will be used by
> other software in the archive, to train it.

A separate archive -- that is my blind spot. You put forward a very
good idea. Since some datasets can serve multiple purposes and be
shared by different frameworks and applications, we could indeed set
up something to track the status of the datasets needed by packages
in our archive, and automate things. Automation matters.

Just created a new issue regarding this for ML-Policy:
https://salsa.debian.org/deeplearning-team/ml-policy/-/issues/17
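
To make the idea a bit more tangible, such a tracker would mostly need
a small metadata record per dataset. This is only a sketch; the field
names are my assumptions, not an existing Debian schema:

from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One entry in a hypothetical dataset-tracking service."""
    name: str                     # e.g. "lj-speech"
    version: str                  # upstream release identifier
    upstream_url: str             # anonymous download location
    sha256: str                   # checksum of the upstream archive
    size_bytes: int               # so mirrors know what they would carry
    license: str                  # must be acceptable for the trained artifacts
    used_by: list[str] = field(default_factory=list)  # packages trained on it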

> So my question is, is there any strategy planned for how to manage
> datasets and their links with software that is trained on these
> datasets? I'm sure there will be other datasets that might be used to
> train free software, and that 2.6G will not be an exception, size
> wise?

As suggested above.

> Even if that assumption is wrong, I think training models on a
> dataset
> is not something we should do for every trivial upload, and so a
> standard "let's do it on the buildd hosts" seems wrong.

Unless Debian is willing to add the proprietary Nvidia driver to the
build servers, re-training anything is not feasible for us (excluding
training on small toy datasets for sanity testing).
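
For the "toy dataset" case, something like the following could run on
a CPU-only buildd or in autopkgtest: train a throwaway model on random
data for one epoch, just to prove the framework works. This is only a
sanity-check sketch, not a real training recipe:

import numpy as np
import tensorflow as tf

# Random toy data: 64 samples, 10 features, binary labels.
x = np.random.rand(64, 10).astype("float32")
y = np.random.randint(0, 2, size=(64, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=1, verbose=0)  # one epoch is enough for a smoke test
print("sanity training finished")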

> > > Is there any infrastructure available, or in the process of being
> > > set
> > > up, to upload datasets and train models? Are there any plans for
> > > that?
> > 
> > Some people including me in the deep learning team had some ideas
> > about infrastructured equipped with some decent GPU. But it seems
> > uneasy to take into action, as GPU-related stuff often involves
> > non-free blobs which makes the development work in Debian not
> > quite comfortable.
> 
> Yes, I can see how that might be a problem.
> 
> > > I suspect it can't be the intention to have to retrain a model
> > > with
> > > every upload of the software, no matter how minor, nor that I
> > > should
> > > have to upload a 2.6G .deb just so the model can be trained...
> > 
> > Don't worry about that. Worry about the framework itself first.
> > Then you will encounter more problems than that a dataset incurs.
> 
> Oh, I'm sure about that, but I'm also sure that the problem of
> training
> models is one that we'll eventually need to tackle. I wanted to know
> if
> it's being taken care of.
> 
> I'm not sure that that is the case right now. Perhaps some initial
> stuff
> has been looked at, but it feels to me like there's a lot to be done
> still?
> 
> If not, then yay, happy days. If so, then I'll be happy to help work
> on
> this -- but I thought I'd ask first before I start talking to random
> people :)
> 
> [...]

Thank you for the inspiring point.

Have a good day.

