
Re: Viable speech recognition tools?



On 5/19/21 5:48 AM, Richard Owlett wrote:
> On 05/16/2021 01:00 PM, Aaron wrote:
>> On 5/16/21 8:19 AM, Richard Owlett wrote:
>
> [I'm subscribed to the list ;]
>
>>
>>> I notice PocketSphinx in the Debian repositories.
>>> How suitable is it for dictation by a single speaker?
>>> I realize it is designed to be speaker independent.
>>> TIA
>>>
>> I wouldn't say it is designed to be speaker independent.
>
> When I read the description I may have "seen" what I wanted to see.
> I haven't investigated speech recognition since I was using Windows a
> decade ago.
>
> I'm assuming training to my voice and speaking style. I want
> continuous speech and as large a vocabulary as possible.
>
Thank you for getting in touch. I feel like I have a somewhat better
idea of what you are trying to do.

Kaldi, DeepSpeech, and FlashlightASR all recommend Linux as the host
environment. I'm not sure whether any of them can run on Windows or OSX.

PocketSphinx is definitely not going to work for taking dictation for
letters.

The easiest way to get speech recognition is to use an online service
like Google Cloud Speech-to-Text. It has the full weight of Google's
language models behind it, and they handle all the optimization on their
side automatically. I think there is still a free tier for this service.
The main reasons to avoid it are, of course, privacy, followed by the
fact that it requires an internet connection. I only mention it because
it is so much easier to get set up right now, and you didn't explicitly
state what your requirements are.
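If you do go that route, the Python client is only a few lines. Here is
a minimal sketch, assuming the google-cloud-speech package is installed,
credentials are configured via GOOGLE_APPLICATION_CREDENTIALS, and you
have a 16 kHz mono WAV recording (the filename is just a placeholder):

    from google.cloud import speech

    client = speech.SpeechClient()

    # Read the recorded dictation (16 kHz, 16-bit, mono PCM assumed).
    with open("letter.wav", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    # Synchronous recognition is fine for short clips; for recordings
    # longer than about a minute you'd use long_running_recognize().
    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)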

Kaldi, Mozilla DeepSpeech, and FlashlightASR are all viable options.
They are all free and open source, run locally, and interface well with
Python for scripting the training and recognition processes (they also
have C++ interfaces, but I'd at least prototype things in Python first).
Kaldi and FlashlightASR are currently aimed at researchers, so they are
not easy to set up and the documentation is full of intimidating
formulas and technical jargon. Mozilla DeepSpeech is somewhat gentler to
work with and seems to have better support, plus it can be installed
with a simple "pip install deepspeech". Mozilla DeepSpeech and
FlashlightASR both use KenLM language models by default, while Kaldi
supports a variety of language models.
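To give you a feel for the DeepSpeech side, here is a minimal sketch of
transcribing one file with the 0.9.x Python package and Mozilla's
released model files (the filenames are just examples; substitute
whatever acoustic model and scorer you download or train):

    import wave
    import numpy as np
    import deepspeech

    # Acoustic model plus the KenLM-based external scorer.
    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    # DeepSpeech expects 16 kHz, 16-bit, mono PCM audio.
    with wave.open("letter.wav", "rb") as w:
        frames = w.readframes(w.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)

    print(model.stt(audio))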

I'm currently working on a research project where I am trying to compare
the current state of different speech recognition engines and classify
them according to strengths and weaknesses. If I can be helpful, please
let me know.

