Re: Training and large data sets?
If there's nothing there yet, then so be it, but I thought I'd ask here
before I start talking to DSA. Also, my experience with deep learning
amounts to "I know it exists", so I'm probably not the best person to
start listing requirements, etc.
On Sat, Aug 28, 2021 at 05:30:50PM +0200, Wouter Verhelst wrote:
> Hi folks,
> I've started working on packaging the MycroftAI software for Debian.
> This is a large set of Python packages which together form a voice
> assistant.
> One of the things I'll eventually have to look at is "mimic2", a speech
> synthesiser based on Google's Tacotron paper. It can be trained on most
> of the free speech datasets out there, and they also wrote something
> called "mimic-recording-studio" with which you can just create your own
> dataset.
> The trouble is that it requires significant resources (no surprise
> there): the LJ Speech dataset is 2.6G (3.6G unpacked), and the mimic2
> README states that it requires about 40G of storage and several hours of
> runtime to fully train the model. The good news though, is that it falls
> completely under the "Free Model" definition of the ML policy; in fact
> the software doesn't ship with any model and you have to train it
> yourself if you want to do any development (and I suspect it would be
> Type-F reproducible, at least, although I haven't investigated that
> yet).
> So, although I'm not ready yet to start packaging mimic2, I think it's
> reasonable to see how I would go about that.
> Is there any infrastructure available, or in the process of being set
> up, to upload datasets and train models? Are there any plans for that?
> I suspect it can't be the intention to have to retrain a model with
> every upload of the software, no matter how minor, nor that I should
> have to upload a 2.6G .deb just so the model can be trained...
> Thanks for any info,
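
As a concrete illustration of the storage figure quoted above, a quick
pre-flight check before training might look like the following sketch.
The 40G number is the figure from the mimic2 README as quoted; checking
the current directory, and GNU coreutils `df`, are assumptions made for
illustration, not anything upstream prescribes:

```shell
# Pre-flight disk check before training mimic2 (illustrative sketch).
# 40G is the storage requirement quoted from the mimic2 README above.
required_gb=40

# Available space in the current directory, in whole gigabytes
# (GNU df: -BG prints sizes with a trailing "G", which tr strips).
avail_gb=$(df --output=avail -BG . | tail -n 1 | tr -dc '0-9')

if [ "$avail_gb" -lt "$required_gb" ]; then
    echo "Need ${required_gb}G free for training, only ${avail_gb}G available" >&2
else
    echo "Enough space: ${avail_gb}G available"
fi
```

This only addresses disk space, of course; the multi-hour training
runtime mentioned in the README is a separate concern.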