
Training and large data sets?



Hi folks,

I've started working on packaging the MycroftAI software for Debian.
This is a large set of Python packages which together form a voice
assistant.

One of the things I'll eventually have to look at is "mimic2", a speech
synthesiser based on Google's Tacotron paper. It can be trained on most
of the free speech datasets out there, and they also wrote something
called "mimic-recording-studio" with which you can just create your own
dataset.

The trouble is that it requires significant resources (no surprise
there): the LJ Speech dataset is 2.6G (3.6G unpacked), and the mimic2
README states that it requires about 40G of storage and several hours of
runtime to fully train the model. The good news, though, is that it falls
completely under the "Free Model" definition of the ML policy; in fact
the software doesn't ship with any model and you have to train it
yourself if you want to do any development (and I suspect it would be
Type-F reproducible, at least, although I haven't investigated that
yet).
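
Just to illustrate the kind of preflight check a packaging or CI script
could do before kicking off a training run, here's a rough sketch on my
side (the paths and the tarball name are placeholders, and the sizes are
simply the figures quoted above):

#!/usr/bin/env python3
# Rough preflight sketch: check that a machine has enough resources to
# train mimic2 on LJ Speech before starting a multi-hour run.
# The 2.6G download / 3.6G unpacked / ~40G training figures are the ones
# quoted above; the paths and the dataset file name are placeholders.

import shutil
import sys
from pathlib import Path

GIB = 1024 ** 3

DATASET_TARBALL = Path("/srv/datasets/LJSpeech-1.1.tar.bz2")  # placeholder
WORK_DIR = Path("/srv/mimic2-training")                       # placeholder

EXPECTED_TARBALL_SIZE = int(2.6 * GIB)   # ~2.6G compressed download
UNPACKED_SIZE = int(3.6 * GIB)           # ~3.6G once unpacked
TRAINING_SCRATCH = 40 * GIB              # ~40G per the mimic2 README


def main() -> int:
    if not DATASET_TARBALL.exists():
        print(f"dataset not found: {DATASET_TARBALL}", file=sys.stderr)
        return 1

    # A truncated download is a common failure mode for multi-gigabyte
    # files; a real check would verify a checksum rather than the size.
    actual = DATASET_TARBALL.stat().st_size
    if actual < 0.95 * EXPECTED_TARBALL_SIZE:
        print(f"dataset looks truncated: {actual / GIB:.1f}G",
              file=sys.stderr)
        return 1

    WORK_DIR.mkdir(parents=True, exist_ok=True)
    free = shutil.disk_usage(WORK_DIR).free
    needed = UNPACKED_SIZE + TRAINING_SCRATCH
    if free < needed:
        print(f"not enough space in {WORK_DIR}: "
              f"{free / GIB:.1f}G free, ~{needed / GIB:.0f}G needed",
              file=sys.stderr)
        return 1

    print("resource check passed; a full training run still takes hours")
    return 0


if __name__ == "__main__":
    sys.exit(main())

Something along those lines would at least let a build fail fast instead
of dying hours into a training run.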

So, although I'm not yet ready to start packaging mimic2, I think it's
reasonable to start looking into how I would go about that.

Is there any infrastructure available, or in the process of being set
up, to upload datasets and train models? Are there any plans for that?

I suspect the intention can't be that the model has to be retrained on
every upload of the software, no matter how minor, nor that I should
have to upload a 2.6G .deb just so the model can be trained...

Thanks for any info,

-- 
     w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

