Re: Viable speech recognition tools?
On 05/16/2021 01:00 PM, Aaron wrote:
On 5/16/21 8:19 AM, Richard Owlett wrote:
>> [I'm subscribed to the list ;]
>> I notice PocketSphinx in the Debian repositories.
>> How suitable is it for dictation by a single speaker?
>> I realize it is designed to be speaker independent.
> I wouldn't say it is designed to be speaker independent.
When I read the description I may have "seen" what I wanted to see. I
haven't investigated speech recognition since I was using Windows.
>> I'm assuming training to my voice and speaking style. I want
>> continuous speech and as large a vocabulary as possible.
> They do have several speaker-independent acoustic models for different
> languages, but I would highly recommend adapting those to your voice,
> which isn't particularly difficult once you have collected a set of
> labeled recordings. The biggest factors in deciding whether
> PocketSphinx is a good choice would be the size of your dictionary,
As large as possible.
> how quickly you need it to respond,
Not an issue. It will be running on a dedicated laptop.
> and how accurate you need it to be.
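On the adaptation point: CMU Sphinx's adaptation tools work from two
plain-text listing files, a .fileids file naming each recording (no
extension) and a matching .transcription file with the labels. A minimal
sketch of generating them; the recording names and transcripts below are
invented for illustration:

```python
# Sketch: write the two listing files CMU Sphinx's adaptation workflow
# (sphinx_fe, bw, map_adapt) reads for a set of labeled recordings.
# The recording IDs and transcripts are invented examples.
recordings = [
    ("note_0001", "open my inbox"),
    ("note_0002", "start a new message"),
    ("note_0003", "save this note"),
]

with open("adapt.fileids", "w") as ids, \
     open("adapt.transcription", "w") as txt:
    for file_id, transcript in recordings:
        # one recording per line, without the .wav extension
        ids.write(file_id + "\n")
        # transcript bracketed by <s> </s>, followed by (file_id)
        txt.write("<s> %s </s> (%s)\n" % (transcript, file_id))
```

Twenty or so such recordings of your own voice are usually enough to see
an improvement.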
> PocketSphinx is very, very fast. It works best if you only have a few
> words you want it to recognize (around 100 seems optimal). Its
> precision is okay, but recall is pretty bad, so it will probably hear
> you say words you have not, or even recognize sounds as words (it can
> be pretty fascinating to listen to some of those recordings and
> compare them to the transcript it proposed), so you need some way of
> letting it know you are talking to it (an intercom button or Voice
> Activity Detection). It also has a keyword search mode that allows you
> to set specific confidence levels before it will react to a word. It
> and Julius both work well as a command interface.
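The keyword search mode reads a keyword-list file: one phrase per line,
with its detection threshold between slashes. A small sketch of
generating one; the phrases and threshold values are invented examples
and need tuning against your own recordings:

```python
# Sketch: write a PocketSphinx keyword-list file. Each line is
# "phrase /threshold/"; the phrases and thresholds here are made up.
keywords = {
    "take a note": 1e-20,    # shorter phrases tend to need stricter thresholds
    "read that back": 1e-30,
    "computer stop": 1e-25,
}

with open("keyphrase.list", "w") as f:
    for phrase, threshold in keywords.items():
        f.write("%s /%g/\n" % (phrase, threshold))
```

You then point the decoder at it, e.g.
pocketsphinx_continuous -inmic yes -kws keyphrase.list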
I'm not interested in "command and control".
> It also depends on what sort of dictation you want. Most dictation
> relies heavily on a language model, so you need a lot of samples of
> the sort of language you are expecting. PocketSphinx uses a pretty
> simple language model and doesn't really prioritize words based on
> their relations to each other.
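To make the language-model point concrete, here is a toy bigram model
over a few invented in-domain sentences. A real dictation model is
trained on far more text (e.g. your own sent mail), but the mechanics
are the same: words the model has seen follow each other get boosted.

```python
from collections import Counter

# Tiny invented in-domain corpus; a real LM needs much more text.
corpus = [
    "send the mail to bob",
    "send the draft to alice",
    "save the draft",
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_next(prev, word):
    # Maximum-likelihood estimate of P(word | prev); no smoothing.
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("the", "draft"))     # 2/3: "draft" often follows "the" here
print(p_next("the", "sandwich"))  # 0.0: never seen, so never preferred
```

This is why acoustically similar candidates ("draft" vs. "daft") get
resolved in favor of whatever the training text actually contained.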
> For dictation, you should probably check out Mozilla DeepSpeech. It
> can also be adapted to a single user pretty easily, can handle a much
> larger vocabulary, generally does better at both precision and recall,
> and uses the KenLM language model by default, which is pretty easy to
> work with. It is almost real-time running on a CPU (either ARMv7/8 or
> x86_64) but gets much faster if you can use CUDA.
> If you don't need real-time transcriptions, FlashlightASR could be a
> good choice. It seems to be more accurate but also takes more
> resources.
> There are also a number of recognition engines that run on Kaldi, and
> those pretty much span the range between speed and accuracy.
Does Kaldi, DeepSpeech, or FlashlightASR have end-user support for
Linux? Recommended links for evaluating appropriateness?
> So what kind of dictation are you trying to do exactly?
Primarily verbal note-taking and composing emails. I'm a poor
two-finger typist.