
Bug#1051812: RFP: vosk-api -- Offline speech recognition API



Package: wnpp
Severity: wishlist

* Package name    : vosk-api
  Version         : 0.3.45
  Upstream Contact: Alpha Cephei <https://github.com/alphacep/>
* URL             : https://alphacephei.com/vosk/
* License         : Apache-2.0
  Programming Lang: Jupyter, C++, Python, Java, ...
  Description     : Offline speech recognition API

Vosk is an offline open source speech recognition toolkit. It enables
speech recognition for 20+ languages and dialects - English, Indian
English, German, French, Spanish, Portuguese, Chinese, Russian,
Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi,
Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi,
Czech, Polish.

Vosk models are small (50 MB) but provide continuous large vocabulary
transcription, zero-latency response with a streaming API,
reconfigurable vocabulary and speaker identification.

Speech recognition bindings are implemented for various programming
languages, including Python, Java, Node.JS, C#, C++, Rust, Go and others.

Vosk supplies speech recognition for chatbots, smart home appliances
and virtual assistants. It can also create subtitles for movies and
transcriptions of lectures and interviews.

Vosk scales from small devices like a Raspberry Pi or an Android
smartphone to big clusters.

----

Debian has been shipping speech recognition software for a
while, mostly in the form of Sphinx, which is... well, it's not as
good as one would imagine those things to be.

Historically, such programs were extremely inaccurate and largely in
the realm of sci-fi and playthings, but recent advances in machine
learning have brought tremendous progress in this area, which makes it
possible to use (free!) software to enable voice-driven applications
of all sorts.

vosk is an API layer that can be used by other programs to implement
such solutions, and I think it would be a great addition to Debian.
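
To give an idea of what "API layer" means here, this is a minimal
sketch of a program feeding audio to the recognizer through the Python
bindings. It assumes the `vosk` Python module (with its `Model` and
`KaldiRecognizer` classes), a model directory unpacked from the
upstream models page, and 16-bit mono PCM WAV input. The function
names `transcribe_stream`, `wav_chunks` and `transcribe_wav` are made
up for this example and are not part of vosk.

```python
import json
import wave


def transcribe_stream(recognizer, chunks):
    """Feed audio chunks to a recognizer, yielding each finished utterance.

    `recognizer` is assumed to behave like vosk.KaldiRecognizer:
    AcceptWaveform(bytes) returns a true value at an utterance boundary,
    and Result() / FinalResult() return JSON strings with a "text" field.
    """
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield text
    # Flush whatever audio is still buffered once the stream ends.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        yield text


def wav_chunks(path, frames_per_chunk=4000):
    """Yield raw PCM chunks from a WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


def transcribe_wav(wav_path, model_dir):
    """Transcribe a WAV file using a locally unpacked model directory."""
    # Imported lazily so the helpers above work without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wav:
        sample_rate = wav.getframerate()
    recognizer = KaldiRecognizer(Model(model_dir), sample_rate)
    return list(transcribe_stream(recognizer, wav_chunks(wav_path)))
```

Something like `transcribe_wav("test.wav", "model")` would then return
a list of recognized utterances, with both paths being placeholders.
Keeping the chunk loop separate from the vosk import is just a choice
made for this sketch: the same loop could be fed from a microphone
instead of a file.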

The models are small and all free, although the licenses vary:

https://alphacephei.com/vosk/models

Also, it could be possible to just package the API bits without
shipping the models in Debian, which of course would be less useful,
but more useful than nothing.

I'm not exactly sure what our policy is on models, actually: the
license of the models above is "free" in the sense that you can get
the binary and do what you want with it, but I'm not sure it would
pass the "wait, but where's the training data?" kind of smell
test. I leave that to people more familiar with those sticky issues
and focus this RFP on the software side of things.

Also, be warned that I'm only peripherally familiar with ML and current
developments in AI.  I mostly stumbled (again) on vosk because of Numen:

https://numenvoice.org/

There are other projects out there that might be better targets. For
example, <https://ggml.ai/> "is a tensor library for machine learning
to enable large models and high performance on commodity hardware. It
is used by llama.cpp and whisper.cpp". But the latter two are on
relatively shakier legal ground, as far as Debian is concerned.

There's also Mozilla's <https://github.com/mozilla/DeepSpeech>, "an
open source embedded (offline, on-device) speech-to-text engine which
can run in real time on devices ranging from a Raspberry Pi 4 to high
power GPU servers."

And there's a whole area of "home assistants" that have their own way
of doing things. This is just one of them, and I would be happy to
hear what the best solution is for this problem space.

