Whisper STT Service uses [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to perform offline speech-to-text in openHAB.
It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity detection, isolating the single command to transcribe and speeding up execution.
[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that makes it easy to integrate into different platforms and applications.
Whisper enables speech recognition for multiple languages and dialects.
The add-on uses the [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and the [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
You should place the downloaded .bin model in `<openHAB userdata>/whisper/` so the add-on can find it.
Remember to check that you have enough RAM to load the model; the estimated RAM consumption of each model is listed on the Hugging Face page.
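For example, on Linux you could fetch the tiny model like this (a sketch; `/var/lib/openhab` is the usual userdata path for package installations, but yours may differ):

```shell
# download the tiny model into the openHAB userdata folder
# (adjust the path to your installation's userdata directory)
sudo mkdir -p /var/lib/openhab/whisper
sudo wget -O /var/lib/openhab/whisper/ggml-tiny.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
```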
## Using an alternative whisper.cpp library
It's possible to use your own build of the whisper.cpp shared library with this add-on.
On `Linux/macOS` you need to place the `libwhisper.so`/`libwhisper.dylib` file at `/usr/local/lib/`.
On `Windows` the `whisper.dll` file needs to be placed in a directory listed in the `$env:PATH` variable, for example `X:\Windows\System32\`.
In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can find information about the flags required to enable the different acceleration methods in the CMake build, as well as other relevant details.
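As a sketch, building and installing the shared library on Linux might look like this (acceleration flag names change between whisper.cpp versions, so confirm them in its README):

```shell
# build whisper.cpp as a shared library and install it system-wide
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON  # append acceleration flags here, e.g. -DGGML_CUDA=1
cmake --build build --config Release
sudo cmake --install build  # installs libwhisper under /usr/local by default
```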
Note: You need to restart openHAB to reload the library.
## Grammar
The whisper.cpp library allows you to define a grammar to alter the transcription results without fine-tuning the model.
Internally, whisper works by inferring a matrix of possible tokens from the audio and then resolving the final transcription from it using either the Greedy or Beam Search algorithm.
The grammar feature allows you to modify the probabilities of the inferred tokens by adding a penalty to the tokens outside the grammar, so that the transcription is resolved in a different way.
It's a way to make the smallest models perform better over a limited grammar.
The grammar should be defined using [BNF](https://en.wikipedia.org/wiki/Backus–Naur_form), and the root variable should resolve the full grammar.
It allows using regex and optional parts to make it more dynamic.
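For instance, a minimal sketch of a grammar for switching devices (the GBNF-style notation and the device names here are illustrative; check the whisper.cpp grammar documentation for the exact syntax supported):

```
# hypothetical command grammar; root resolves the full phrase
root    ::= command
command ::= "turn " state " the " device
state   ::= "on" | "off"
device  ::= "light" | "fan" | "heating"
```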
## Configuration

- **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here, but they are required in the filename (ex: tiny.en -> ggml-tiny.en.bin).
- **Preload Model** - Keep whisper model loaded.
- **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
- **Min Transcription Seconds** - Forces min audio duration passed to whisper, in seconds.
- **Max Transcription Seconds** - Max seconds of audio before force-triggering the transcription, without waiting for silence detection.
- **Initial Silence Seconds** - Max initial seconds without any voice activity before aborting the transcription.
- **Max Silence Seconds** - Max consecutive seconds of silence that trigger the transcription.
- **Remove Silence** - Remove start and end silence from the audio to transcribe.
- **Audio Step** - Audio processing step in seconds for the voice activity detection.
- **Voice Activity Detection Mode** - Selected VAD mode; higher modes are more aggressive at filtering out non-speech.
- **Voice Activity Detection Sensitivity** - Fraction in the range 0-1 of voice activity within one second required to consider it voice.
- **Voice Activity Detection Step** - VAD detector internal step in ms (only 10, 20, or 30 are allowed). Audio Step / Voice Activity Detection Step = number of VAD executions per audio step; for example, a 1 s audio step with a 20 ms VAD step yields 50 VAD executions.
If you prefer to set up these settings via a text file, you can edit the file `runtime.cfg` in `$OPENHAB_ROOT/conf/services` and set the following entries:
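A sketch of what those entries could look like (the service PID `org.openhab.voice.whisperstt` and the exact parameter keys are assumptions; verify them against the add-on's documentation):

```
# hypothetical entries; confirm the PID and parameter keys before use
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:preloadModel=true
org.openhab.voice.whisperstt:singleUtteranceMode=true
org.openhab.voice.whisperstt:maxSilenceSeconds=2
```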