Re: Training and large data sets?

To: "M. Zhou" <lumin@debian.org>
Cc: debian-ai@lists.debian.org
Subject: Re: Training and large data sets?
From: Wouter Verhelst <w@uter.be>
Date: Mon, 27 Sep 2021 19:14:04 +0200
Message-id: <YVH73HIMVbo3eT+X@pc181009.grep.be>
In-reply-to: <e3c97417ff70514911155fb30429a9d7d2b606a0.camel@riseup.net>
References: <YSpWqvYVO279japJ@pc181009.grep.be> <e3c97417ff70514911155fb30429a9d7d2b606a0.camel@riseup.net>

Hi,

On Sun, Sep 26, 2021 at 07:50:53PM -0400, M. Zhou wrote:
> Hi Wouter,
> 
> Was just too busy to respond... sorry.

No worries, it happens :)

> On Sat, 2021-08-28 at 17:30 +0200, Wouter Verhelst wrote:
> > Hi folks,
> > 
> > I've started working on packaging the MycroftAI software for Debian.
> > This is a large set of python packages which together form a voice
> > assistant.
> 
> The motivation is wonderful but I'd like to point this out:
> https://github.com/MycroftAI/mimic2/blob/master/requirements.txt#L2
> This software packaging involves a huge amount of dependencies
> including hard bones like tensorflow, which has not yet been
> fully prepared (the one in NEW queue does not produce the python
> package). 

I'm aware of that. There are two reasons why this isn't a major issue as
far as I'm concerned:

- The set of packages required to get MycroftAI into Debian is fairly
  large, and mimic2 is far, far, far down the road for now. I'm not there
  yet, but I do want to make sure it's *possible* for me at some point
  down the line to upload mimic2.
- Even if we don't get there, from a cursory view it looks like
  MycroftAI uses a configurable TTS middle layer. If I'm not mistaken,
  then while results are best with mimic2, using mimic2 is not
  required (it should be possible to use some other TTS implementation
  in the meantime).

I should know more about this as I work my way through the MycroftAI
stack, i.e., not now :-)

> > One of the things I'll eventually have to look at is "mimic2", a
> > speech synthesiser based on Google's tacotron paper. It can be
> > trained on most of the free speech datasets out there, and they also
> > wrote something called "mimic-recording-studio" with which you can
> > just create your own dataset.
> 
> My general recommendation is to only consider dataset-related matters
> once a machine learning or deep learning framework is ready in the
> archive. Dealing with the framework itself is already difficult enough.
> Good AI software upstreams will try their best to keep datasets
> available for the long term.

I am absolutely not planning to upload a dataset tomorrow :-) but I
*will* need to upload something at some point in the medium to long term,
and I just wanted to see what the recommended strategy for that is when
I get to it some time down the line. I don't have any good ideas here,
that's why I'm asking.

It might be relevant for what I do while packaging the framework, after
all.

> > The trouble is that it requires significant resources (no surprise
> > there): the LJ Speech dataset is 2.6G (3.6G unpacked), and the mimic2
> > README states that it requires about 40G of storage and several hours
> > of runtime to fully train the model. The good news though, is that it
> > falls completely under the "Free Model" definition of the ML policy;
> > in fact the software doesn't ship with any model and you have to
> > train it yourself if you want to do any development (and I suspect it
> > would be Type-F reproducible, at least, although I haven't
> > investigated that yet).
> 
> 2.6GB of our archive space is not trivial.

Yes, that's entirely my point :-)

> Only when its importance and the size of its audience outweigh its
> bulk should we discuss it.
> 
> > So, although I'm not ready yet to start packaging mimic2, I think
> > it's reasonable to see how I would go about that.
> 
> Tensorflow is expected to be a big blocker.

I know, but that's just software, and a "mere" matter of packaging it.
It's not trivial, but we've got good procedures for doing that and we
know how to tackle a problem of that type.

Datasets are a different matter. I know that 2.6G is not trivial. I'm
not suggesting we upload it to the archive; I think it does not make
sense to do that. Instead, I think we require some form of a separate
archive with separate tracking, where the dataset and the metadata may
even need to be in separate files. The LJ Speech dataset was not written
by the Mycroft team; I suspect it might be used by other projects, too
(but I am not certain of that). Even if that turns out to not be the
case, mimic2 supports being trained on different datasets, so it should
be possible to use a different speech dataset, one that will be used by
other software in the archive, to train it.

So my question is, is there any strategy planned for how to manage
datasets and their links with software that is trained on these
datasets? I'm sure there will be other datasets that might be used to
train free software, and that 2.6G will not be an exception, size-wise.

Even if that assumption is wrong, I think training models on a dataset
is not something we should do for every trivial upload, and so a
standard "let's do it on the buildd hosts" seems wrong.

> > Is there any infrastructure available, or in the process of being set
> > up, to upload datasets and train models? Are there any plans for that?
> 
> Some people including me in the deep learning team had some ideas
> about infrastructure equipped with some decent GPU. But it seems
> hard to put into action, as GPU-related stuff often involves
> non-free blobs which makes the development work in Debian not
> quite comfortable.

Yes, I can see how that might be a problem.

> > I suspect it can't be the intention to have to retrain a model with
> > every upload of the software, no matter how minor, nor that I should
> > have to upload a 2.6G .deb just so the model can be trained...
> 
> Don't worry about that. Worry about the framework itself first.
> Then you will encounter more problems than a dataset incurs.

Oh, I'm sure about that, but I'm also sure that the problem of training
models is one that we'll eventually need to tackle. I wanted to know if
it's being taken care of.

I'm not sure that that is the case right now. Perhaps some initial stuff
has been looked at, but it feels to me like there's still a lot to be
done.

If not, then yay, happy days. If so, then I'll be happy to help work on
this -- but I thought I'd ask first before I start talking to random
people :)

[...]
-- 
     w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}
