
Training and large data sets?



Hi folks,

I've started working on packaging the MycroftAI software for Debian.
This is a large set of Python packages which together form a voice
assistant.

One of the things I'll eventually have to look at is "mimic2", a speech
synthesiser based on Google's Tacotron paper. It can be trained on most
of the free speech datasets out there, and they also wrote something
called "mimic-recording-studio" with which you can just create your own
dataset.

The trouble is that it requires significant resources (no surprise
there): the LJ Speech dataset is 2.6G (3.6G unpacked), and the mimic2
README states that it requires about 40G of storage and several hours of
runtime to fully train the model. The good news, though, is that it falls
completely under the "Free Model" definition of the ML policy; in fact
the software doesn't ship with any model and you have to train it
yourself if you want to do any development (and I suspect it would be
Type-F reproducible, at least, although I haven't investigated that
yet).
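
Just to illustrate the kind of preflight check a packaging or CI script
could do before kicking off a training run, here's a rough sketch on my
side (the paths and the tarball name are placeholders, and the sizes are
simply the figures quoted above):

#!/usr/bin/env python3
# Rough preflight sketch: check that a machine has enough resources to
# train mimic2 on LJ Speech before starting a multi-hour run.
# The 2.6G download / 3.6G unpacked / ~40G training figures are the ones
# quoted above; the paths and the dataset file name are placeholders.

import shutil
import sys
from pathlib import Path

GIB = 1024 ** 3

DATASET_TARBALL = Path("/srv/datasets/LJSpeech-1.1.tar.bz2")  # placeholder
WORK_DIR = Path("/srv/mimic2-training")                       # placeholder

EXPECTED_TARBALL_SIZE = int(2.6 * GIB)   # ~2.6G compressed download
UNPACKED_SIZE = int(3.6 * GIB)           # ~3.6G once unpacked
TRAINING_SCRATCH = 40 * GIB              # ~40G per the mimic2 README


def main() -> int:
    if not DATASET_TARBALL.exists():
        print(f"dataset not found: {DATASET_TARBALL}", file=sys.stderr)
        return 1

    # A truncated download is a common failure mode for multi-gigabyte
    # files; a real check would verify a checksum rather than the size.
    actual = DATASET_TARBALL.stat().st_size
    if actual < 0.95 * EXPECTED_TARBALL_SIZE:
        print(f"dataset looks truncated: {actual / GIB:.1f}G",
              file=sys.stderr)
        return 1

    WORK_DIR.mkdir(parents=True, exist_ok=True)
    free = shutil.disk_usage(WORK_DIR).free
    needed = UNPACKED_SIZE + TRAINING_SCRATCH
    if free < needed:
        print(f"not enough space in {WORK_DIR}: "
              f"{free / GIB:.1f}G free, ~{needed / GIB:.0f}G needed",
              file=sys.stderr)
        return 1

    print("resource check passed; a full training run still takes hours")
    return 0


if __name__ == "__main__":
    sys.exit(main())

Something along those lines would at least let a build fail fast instead
of dying hours into a training run.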

So, although I'm not yet ready to start packaging mimic2, I think it's
reasonable to start looking into how I would go about that.

Is there any infrastructure available, or in the process of being set
up, to upload datasets and train models? Are there any plans for that?

I suspect the intention can't be that the model has to be retrained on
every upload of the software, no matter how minor, nor that I should
have to upload a 2.6G .deb just so the model can be trained...

Thanks for any info,

-- 
     w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

