Re: Viable speech recognition tools?

On 5/16/21 8:19 AM, Richard Owlett wrote:

> I notice PocketSphinx in the Debian repositories.
> How suitable is it for dictation by a single speaker?
> I realize it is designed to be speaker independent.
I wouldn't say it is designed to be speaker independent. They do have
several speaker independent acoustic models for different languages, but
I would highly recommend adapting those to your voice, which isn't
particularly difficult once you have collected a set of labeled
recordings. The biggest factors in deciding whether PocketSphinx is a
good choice would be the size of your dictionary, how quickly you need
it to respond, and how accurate you need it to be. PocketSphinx is very,
very fast. It works best if you only have a few words you want it to
recognize (around 100 seems optimal). Its precision is okay but recall
is pretty bad, and it will probably hear you say words you have not, or
even recognize sounds as words (it can be pretty fascinating to listen
to some of those recordings and compare them to the transcript it
proposed), so you need some way of letting it know you are talking to it
(an intercom button or Voice Activity Detection). It also has a keyword
search mode that allows you to set specific confidence levels before it
will react to a word. It and Julius both work well as a command interface.
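
To make the keyword search mode a little more concrete: as far as I
know, the keyword list is a plain text file with one phrase per line,
each followed by its detection threshold between slashes. Here is a
minimal sketch that writes such a file. The phrases, the threshold
values and the file name are made up for illustration, and the
thresholds in particular need tuning per phrase on your own recordings.

```python
from pathlib import Path


def write_kws_list(path, thresholds):
    """Write a keyword list with one 'phrase /threshold/' entry per line.

    `thresholds` maps each phrase to its detection threshold.
    Returns the lines that were written (without trailing newlines).
    """
    lines = [f"{phrase} /{threshold:g}/" for phrase, threshold in thresholds.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical command phrases and thresholds, for illustration only.
COMMANDS = {
    "lights on": 1e-20,
    "lights off": 1e-20,
    "open the garage door": 1e-40,
}

if __name__ == "__main__":
    for line in write_kws_list("commands.list", COMMANDS):
        print(line)
```

With the 5prealpha command-line tools, I believe a file like this is
passed with the -kws option, for example
"pocketsphinx_continuous -inmic yes -kws commands.list", but check the
options of the version you have installed.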

It also depends on what sort of dictation you want. Most dictation
relies heavily on a language model, so you need a lot of samples of the
sort of language you are expecting. PocketSphinx uses a pretty simple
language model and doesn't really prioritize words based on their
relations to each other.
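
To illustrate what "prioritizing words based on their relations to each
other" means, here is a toy bigram model in plain Python. It only counts
which word follows which in some sample text and turns the counts into
probabilities. This is a sketch of the general idea, not how
PocketSphinx or any other engine implements its language model, and the
sample sentences are invented.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows the previous one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" marks the start of a sentence.
        words = ["<s>"] + sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def next_word_prob(counts, prev, word):
    """Estimate P(word | prev) from the raw counts, without smoothing."""
    followers = counts.get(prev)
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


if __name__ == "__main__":
    # Invented sample text; a real model needs far more of it.
    samples = [
        "please write a letter",
        "write a short note",
        "turn right at the corner",
    ]
    model = train_bigrams(samples)
    # "write" and "right" sound alike; the following word helps decide.
    print(next_word_prob(model, "write", "a"))
    print(next_word_prob(model, "right", "a"))
```

The point is that a recognizer with a model like this can prefer "write"
over "right" when the next word is "a", which is why dictation needs so
many samples of the kind of language you expect.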

For dictation, you should probably check out Mozilla DeepSpeech. It can
also be adapted to a single user pretty easily, can handle a much larger
vocabulary, generally does better at both precision and recall, and uses
the KenLM language model by default, which is pretty easy to work with.
It runs almost in real time on a CPU (either ARMv7/8 or x86_64) but
gets much faster if you can use CUDA.
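
If you want to compare engines on precision and recall with your own
recordings, the arithmetic is simple once you have counted the hits and
misses. A small sketch follows; the counts in the example are invented
and are not measurements of any engine.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw detection counts.

    precision: share of the words the engine reported that were really said
    recall:    share of the words really said that the engine reported
    """
    reported = true_positives + false_positives
    spoken = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / spoken if spoken else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented counts: 40 correct detections, 10 false alarms, 30 misses.
    p, r = precision_recall(40, 10, 30)
    print(f"precision={p:.2f} recall={r:.2f}")
```

False alarms (words it "heard" that you never said) pull precision
down, while missed words pull recall down, so it is worth counting both
separately.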

If you don't need real-time transcriptions, FlashlightASR could be a
good choice. It seems to be more accurate but also takes more resources.

There are also a number of recognition engines that run on Kaldi, and
those pretty much span the range between speed and accuracy.

So what kind of dictation are you trying to do exactly?
