
Re: Training and large data sets?



Hi Wouter,

Sorry, I was just too busy to respond earlier.

On Sat, 2021-08-28 at 17:30 +0200, Wouter Verhelst wrote:
> Hi folks,
> 
> I've started working on packaging the MycroftAI software for Debian.
> This is a large set of python packages which together form a voice
> assistant.

The motivation is wonderful, but I'd like to point this out:
https://github.com/MycroftAI/mimic2/blob/master/requirements.txt#L2
Packaging this software involves a huge number of dependencies,
including hard nuts like tensorflow, which has not yet been fully
prepared (the version in the NEW queue does not produce the python
package).
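
As a purely illustrative sketch of how one might triage that
requirements list against the archive -- assuming the usual
python3-<name> binary naming convention and that "apt-cache show"
fails for unknown packages, neither of which holds for every entry --
something like this would do:

    #!/usr/bin/env python3
    # check_reqs.py -- hypothetical helper, not part of mimic2
    import re
    import subprocess

    def in_archive(pypi_name):
        # Guess the Debian binary package name (naive; many packages
        # deviate from the python3-<name> pattern).
        deb = "python3-" + pypi_name.lower().replace("_", "-")
        result = subprocess.run(["apt-cache", "show", deb],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0

    with open("requirements.txt") as f:
        for line in f:
            line = line.split("#")[0].strip()
            if not line:
                continue
            # Strip version pins such as "tensorflow==1.8.0".
            name = re.split(r"[<>=!~\[]", line)[0].strip()
            status = "in archive" if in_archive(name) else "MISSING"
            print(f"{name:30s} {status}")

Running that against mimic2's requirements.txt gives a quick feel for
how much packaging work is still ahead before the dataset question
even comes up.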

> One of the things I'll eventually have to look at is "mimic2", a
> speech
> synthesiser based on Google's tacotron paper. It can be trained on
> most
> of the free speech datasets out there, and they also wrote something
> called "mimic-recording-studio" with which you can just create your
> own
> dataset.

My general recommendation is to only consider anything related to the
dataset once the machine learning or deep learning framework is ready
in the archive. Dealing with the framework itself is already difficult
enough. Good AI software upstreams will do their best to keep their
datasets available for the long term.

> The trouble is that it requires significant resources (no surprise
> there): the LJ Speech dataset is 2.6G (3.6G unpacked), and the mimic2
> README states that it requires about 40G of storage and several hours
> of
> runtime to fully train the model. The good news though, is that it
> falls
> completely under the "Free Model" definition of the ML policy; in
> fact
> the software doesn't ship with any model and you have to train it
> yourself if you want to do any development (and I suspect it would be
> Type-F reproducible, at least, although I haven't investigated that
> yet).

2.6GB of our archive space is not trivial. We should only discuss it
once its importance and the size of its audience outweigh its bulk.

> So, although I'm not ready yet to start packaging mimic2, I think
> it's
> reasonable to see how I would go about that.

Tensorflow is expected to be a big blocker.

> Is there any infrastructure available, or in the process of being set
> up, to upload datasets and train models? Are there any plans for
> that?

Some people in the deep learning team, including me, have had ideas
about infrastructure equipped with decent GPUs. But it seems hard to
put into action, as GPU-related stuff often involves non-free blobs,
which makes development work in Debian rather uncomfortable.
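
For what it's worth, CPU-only operation is at least possible: a
minimal probe (hypothetical, and assuming a working TensorFlow 2.x,
which we do not have in the archive yet) would look like this:

    #!/usr/bin/env python3
    # gpu_probe.py -- hypothetical probe; assumes tensorflow imports
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        print("GPU devices visible (these typically need non-free drivers):")
        for gpu in gpus:
            print("  ", gpu)
    else:
        # With no usable GPU, TensorFlow runs everything on the CPU,
        # which is fine for smoke tests but very slow for full training.
        print("No GPU visible; training would run on CPU only.")

CPU-only training avoids the non-free driver problem entirely, but for
something the size of tacotron it would turn "several hours" into much
longer runs.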

> I suspect it can't be the intention to have to retrain a model with
> every upload of the software, no matter how minor, nor that I should
> have to upload a 2.6G .deb just so the model can be trained...

Don't worry about that yet. Worry about the framework itself first;
you will run into more problems there than a dataset will ever cause.

> Thanks for any info,

Thank you for your motivation -- it's appreciated. I pointed out
some pitfalls and hope the information helps.

