[WhisperSTT] Initial contribution (#15166)
Signed-off-by: Miguel Álvarez <miguelwork92@gmail.com> Signed-off-by: GiviMAD <GiviMAD@users.noreply.github.com> Signed-off-by: Ciprian Pascu <contact@ciprianpascu.ro>
parent
84b1c525ed
commit
097b64ec16
@ -451,6 +451,7 @@
/bundles/org.openhab.voice.voicerss/ @lolodomo
/bundles/org.openhab.voice.voskstt/ @GiviMAD
/bundles/org.openhab.voice.watsonstt/ @GiviMAD
/bundles/org.openhab.voice.whisperstt/ @GiviMAD
/itests/org.openhab.automation.groovyscripting.tests/ @wborn
/itests/org.openhab.automation.jsscriptingnashorn.tests/ @wborn
/itests/org.openhab.binding.astro.tests/ @gerrieg

@ -2251,6 +2251,11 @@
      <artifactId>org.openhab.voice.watsonstt</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.openhab.addons.bundles</groupId>
      <artifactId>org.openhab.voice.whisperstt</artifactId>
      <version>${project.version}</version>
    </dependency>
  </dependencies>

</project>

35
bundles/org.openhab.voice.whisperstt/NOTICE
Normal file
@ -0,0 +1,35 @@
This content is produced and maintained by the openHAB project.

* Project home: https://www.openhab.org

== Declared Project Licenses

This program and the accompanying materials are made available under the terms
of the Eclipse Public License 2.0 which is available at
https://www.eclipse.org/legal/epl-2.0/.

== Source Code

https://github.com/openhab/openhab-addons

== Third-party Content

io.github.givimad: whisper-jni
* License: Apache 2.0 License
* Project: https://github.com/GiviMAD/whisper-jni
* Source: https://github.com/GiviMAD/whisper-jni/tree/main/src/

native dependency: whisper.cpp
* License: MIT License https://github.com/ggerganov/whisper.cpp/blob/master/LICENSE
* Project: https://github.com/ggerganov/whisper.cpp
* Source: https://github.com/ggerganov/whisper.cpp

io.github.givimad: libfvad-jni
* License: Apache 2.0 License https://github.com/GiviMAD/libfvad-jni/blob/main/LICENSE
* Project: https://github.com/GiviMAD/libfvad-jni
* Source: https://github.com/GiviMAD/libfvad-jni/tree/main/src/

native dependency: libfvad
* License: BSD License https://github.com/dpirch/libfvad/blob/master/LICENSE
* Project: https://github.com/dpirch/libfvad
* Source: https://github.com/dpirch/libfvad
248
bundles/org.openhab.voice.whisperstt/README.md
Normal file
@ -0,0 +1,248 @@
# Whisper Speech-to-Text

Whisper STT Service uses [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to perform offline speech-to-text in openHAB.
It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity detection to isolate a single command to transcribe, which speeds up execution.

[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that is easy to integrate into different platforms and applications.

Whisper enables speech recognition for multiple languages and dialects:

english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
italian, indonesian, hindi, finnish, vietnamese, hebrew, ukrainian, greek, malay, czech, romanian, danish, hungarian, tamil, norwegian,
thai, urdu, croatian, bulgarian, lithuanian, latin, maori, malayalam, welsh, slovak, telugu, persian, latvian, bengali, serbian, azerbaijani,
slovenian, kannada, estonian, macedonian, breton, basque, icelandic, armenian, nepali, mongolian, bosnian, kazakh, albanian, swahili, galician,
marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, georgian, belarusian, tajik, sindhi, gujarati, amharic, yiddish, lao,
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
hausa, bashkir, javanese and sundanese.

## Supported platforms

This add-on relies on native binaries.
The wrappers it uses are the [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and the [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).

The following platforms are supported:

* Windows 10 x86_64
* Debian GLIBC x86_64/arm64 (min GLIBC version 2.31, the version shipped with Ubuntu Focal)
* macOS x86_64/arm64 (min version v11.0)

The native binaries for those platforms are included in this add-on as provided with the openHAB distribution.

## CPU compatibility

To use this add-on, a device at least as powerful as the Raspberry Pi 5, with a modern CPU, is recommended.
Execution times on a Raspberry Pi 4 are roughly twice as long, so only the tiny model runs in under 5 seconds.

If you are going to use the add-on on an `x86_64` host, the CPU should support the flags `avx2`, `fma`, `f16c` and `avx`.
You can check those flags on Linux from the terminal with `lscpu`.
You can check those flags on Windows using a program like `CPU-Z`.

If you are going to use the add-on on an `arm64` host, the CPU should support the flag `fphp`.
You can check this flag on Linux from the terminal with `lscpu`.

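The flag check can also be scripted. The following is a minimal sketch, not part of the add-on: the class `CpuFlagCheck` and its methods are hypothetical names, and it assumes a Linux host where `/proc/cpuinfo` exposes a `flags` line (x86_64) or a `Features` line (arm64).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not shipped with the add-on) that checks Linux CPU flags. */
class CpuFlagCheck {
    /** Flags this README lists for x86_64 hosts. */
    static final List<String> X86_64_FLAGS = List.of("avx2", "fma", "f16c", "avx");
    /** Flag this README lists for arm64 hosts. */
    static final List<String> ARM64_FLAGS = List.of("fphp");

    /** Returns the required flags that do not appear in the given /proc/cpuinfo text. */
    static List<String> missingFlags(String cpuInfo, List<String> required) {
        Set<String> present = new HashSet<>();
        for (String line : cpuInfo.split("\n")) {
            // x86_64 kernels print a "flags" line, arm64 kernels a "Features" line.
            if (line.startsWith("flags") || line.startsWith("Features")) {
                int colon = line.indexOf(':');
                if (colon >= 0) {
                    present.addAll(Arrays.asList(line.substring(colon + 1).trim().split("\\s+")));
                }
            }
        }
        List<String> missing = new ArrayList<>();
        for (String flag : required) {
            if (!present.contains(flag)) {
                missing.add(flag);
            }
        }
        return missing;
    }

    /** Reads /proc/cpuinfo on the current host (Linux only). */
    static List<String> missingFlagsOnThisHost(List<String> required) throws IOException {
        return missingFlags(Files.readString(Path.of("/proc/cpuinfo")), required);
    }
}
```

Calling `CpuFlagCheck.missingFlagsOnThisHost(CpuFlagCheck.X86_64_FLAGS)` returns an empty list when every listed flag is present.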
## Transcription time

On a Raspberry Pi 5, the approximate transcription times are:

| model      | exec time |
| ---------- | --------: |
| tiny.bin   |      1.5s |
| base.bin   |        3s |
| small.bin  |      8.5s |
| medium.bin |       17s |

## Configuring the model

Before you can use this service, you need to configure a model.

You can download models from the sources provided by the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) author:

* https://huggingface.co/ggerganov/whisper.cpp
* https://ggml.ggerganov.com

Place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so the add-on can find it.

Remember to check that you have enough RAM to load the model; the estimated RAM consumption is listed on the Hugging Face link.

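The Model Name option accepts a name with or without the `ggml-` prefix and `.bin` extension, while the file on disk needs both. A minimal sketch of that naming rule follows; `ModelFileName` and its methods are hypothetical names used for illustration and are not the add-on's own code.

```java
import java.nio.file.Path;

/** Hypothetical helper illustrating the model naming rule described in this README. */
class ModelFileName {
    /** Expands a configured model name (e.g. "tiny.en") to the expected file name. */
    static String resolve(String modelName) {
        String name = modelName.trim();
        if (!name.startsWith("ggml-")) {
            name = "ggml-" + name;
        }
        if (!name.endsWith(".bin")) {
            name = name + ".bin";
        }
        return name;
    }

    /** Builds the expected location inside the "whisper" folder of the given userdata folder. */
    static Path modelPath(String userdataFolder, String modelName) {
        return Path.of(userdataFolder, "whisper", resolve(modelName));
    }
}
```

Under these assumptions, `ModelFileName.resolve("tiny.en")` yields `ggml-tiny.en.bin`, and a name that already carries the prefix and extension is returned unchanged.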
## Using alternative whisper.cpp library

It's possible to use your own build of the whisper.cpp shared library with this add-on.

On Linux/macOS, place `libwhisper.so`/`libwhisper.dylib` in `/usr/local/lib/`.

On Windows, the `whisper.dll` file needs to be placed in any directory listed in the `$env:PATH` variable, for example `X:\Windows\System32\`.

In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can find the flags required to enable different acceleration methods in the CMake build, along with other relevant information.

Note: You need to restart openHAB to reload the library.

## Grammar

The whisper.cpp library lets you define a grammar to alter the transcription results without fine-tuning the model.

Internally, whisper works by inferring a matrix of possible tokens from the audio and then resolving the final transcription from it using either the Greedy or the Beam Search algorithm.
The grammar feature lets you modify the probabilities of the inferred tokens by adding a penalty to the tokens outside the grammar, so that the transcription is resolved in a different way.

It's a way to get the smallest models to perform better over a limited grammar.

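To make the effect of the penalty concrete, here is a deliberately simplified sketch of a single greedy decoding step. It is an illustration only and does not reproduce whisper.cpp's actual implementation: the class `GrammarPenaltySketch`, the flat `logits` array and the `allowedByGrammar` mask are all assumptions made for the example.

```java
/** Simplified illustration of one grammar-penalized greedy step; not whisper.cpp's code. */
class GrammarPenaltySketch {
    /**
     * Picks the highest-scoring token after subtracting the penalty from every
     * token the grammar does not allow at this position.
     */
    static int pickToken(float[] logits, boolean[] allowedByGrammar, float penalty) {
        int best = -1;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < logits.length; i++) {
            float score = allowedByGrammar[i] ? logits[i] : logits[i] - penalty;
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```

With a penalty of 0 the raw best token wins; with a large penalty the best token inside the grammar wins, which is why a small model can behave better on a limited command set.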
The grammar should be defined using [BNF](https://en.wikipedia.org/wiki/Backus–Naur_form), and the root variable should resolve the full grammar.
It allows using regex and optional parts to make it more dynamic.

This is a basic grammar example:

```BNF
root ::= (light_switch | light_state | tv_channel) "."
light_switch ::= "turn the light " ("on" | "off")
light_state ::= "set light to " ("high" | "low")
tv_channel ::= ("set ")? "tv channel to " [0-9]+
```

You can provide the grammar and enable its usage using the add-on configuration.

## Configuration

Use your favorite configuration UI to edit the Whisper settings:

### Speech to Text Configuration

General options.

* **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
* **Preload Model** - Keep the whisper model loaded.
* **Single Utterance Mode** - When enabled, recognition stops listening after a single utterance.
* **Min Transcription Seconds** - Minimum audio duration passed to whisper, in seconds.
* **Max Transcription Seconds** - Maximum seconds before the transcription is forced, without waiting for silence to be detected.
* **Initial Silence Seconds** - Maximum seconds without any voice activity before the transcription is aborted.
* **Max Silence Seconds** - Maximum consecutive seconds of silence that trigger the transcription.
* **Remove Silence** - Remove leading and trailing silence from the audio to transcribe.

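One way to read how the timing options above interact is as a small decision function. The sketch below is an interpretation of the option descriptions, not the add-on's actual control flow; `TranscriptionTriggerSketch`, `Action` and `decide` are hypothetical names.

```java
/** Hypothetical reading of the timing options; the add-on's real logic may differ. */
class TranscriptionTriggerSketch {
    enum Action { CONTINUE, TRANSCRIBE, ABORT }

    /**
     * @param audioSeconds seconds of audio captured so far
     * @param voiceSeen whether any voice activity has been detected yet
     * @param trailingSilenceSeconds consecutive seconds of silence at the end of the audio
     */
    static Action decide(float audioSeconds, boolean voiceSeen, float trailingSilenceSeconds,
            float initSilenceSeconds, float maxSilenceSeconds, float maxSeconds) {
        if (!voiceSeen) {
            // Initial Silence Seconds: give up when nobody speaks.
            return audioSeconds >= initSilenceSeconds ? Action.ABORT : Action.CONTINUE;
        }
        if (audioSeconds >= maxSeconds) {
            // Max Transcription Seconds: force the transcription.
            return Action.TRANSCRIBE;
        }
        if (trailingSilenceSeconds >= maxSilenceSeconds) {
            // Max Silence Seconds: the speaker appears to have finished.
            return Action.TRANSCRIBE;
        }
        return Action.CONTINUE;
    }
}
```

The values 3, 0.5 and 10 used for `initSilenceSeconds`, `maxSilenceSeconds` and `maxSeconds` in a quick trial match the defaults declared in `WhisperSTTConfiguration`.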
### Voice Activity Detection Configuration

Configure VAD options.

* **Audio Step** - Audio processing step in seconds for the voice activity detection.
* **Voice Activity Detection Mode** - Selected VAD mode.
* **Voice Activity Detection Sensitivity** - Fraction (in the range 0-1) of voice activity in one second required to consider it voice.
* **Voice Activity Detection Step** - VAD detector internal step in ms (only 10, 20 or 30 are allowed). (Audio Step / Voice Activity Detection Step = number of VAD executions per audio step.)

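The relation between Audio Step, Voice Activity Detection Step and Sensitivity can be worked through with a little arithmetic. This is a sketch under assumptions: `VadStepSketch` is a hypothetical class, and treating sensitivity as the fraction of VAD frames flagged as voice is one interpretation of the option description.

```java
/** Hypothetical arithmetic helper for the VAD options; not the add-on's own code. */
class VadStepSketch {
    /** Number of VAD executions in one audio step (Audio Step / VAD Step). */
    static int vadRunsPerAudioStep(float stepSeconds, int vadStepMs) {
        if (vadStepMs != 10 && vadStepMs != 20 && vadStepMs != 30) {
            throw new IllegalArgumentException("VAD step must be 10, 20 or 30 ms");
        }
        return Math.round(stepSeconds * 1000f) / vadStepMs;
    }

    /** Treats the audio step as voice when enough of its VAD frames were flagged as voice. */
    static boolean isVoice(int voiceFrames, int totalFrames, float sensitivity) {
        return totalFrames > 0 && (float) voiceFrames / totalFrames >= sensitivity;
    }
}
```

For example, an Audio Step of 1 second with a 20 ms VAD step gives 50 VAD executions per step; with a sensitivity of 0.3, at least 15 of those 50 frames would need to be flagged as voice.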
### Whisper Configuration

Configure whisper options.

* **Threads** - Number of threads used by whisper. (0 to use host max threads)
* **Sampling Strategy** - Sampling strategy used.
* **Beam Size** - Beam Size configuration for sampling strategy Beam Search.
* **Greedy Best Of** - Best Of configuration for sampling strategy Greedy.
* **Speed Up** - Speed up audio by x2. (Reduced accuracy)
* **Audio Context** - Overwrite the audio context size. (0 to use whisper default context size)
* **Temperature** - Temperature threshold.
* **Initial Prompt** - Initial prompt for whisper.
* **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
* **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)

### Grammar Configuration

Configure the grammar options.

* **Grammar** - Grammar to use in GBNF format (whisper.cpp BNF variant).
* **Use Grammar** - Enable grammar usage.
* **Grammar penalty** - Penalty for non grammar tokens.

#### Grammar Example:

```gbnf
# Grammar should define a root expression that should end with a dot.
root ::= " " command "."
# Alternative command expression to expand into the root.
command ::= "Turn " onoff " " (connector)? thing |
            put " " thing " to " state |
            watch " " show " at bedroom" |
            "Start " timer " minutes timer"

# You can use as many expressions as you need.

thing ::= "light" | "bedroom light" | "living room light" | "tv"

put ::= "set" | "put"

onoff ::= "on" | "off"

watch ::= "watch" | "play"

connector ::= "the"

state ::= "low" | "high" | "normal"

show ::= [a-zA-Z]+

timer ::= [0-9]+

```

### Messages Configuration

* **No Results Message** - Message to be spoken when there are no results.
* **Error Message** - Message to be spoken when an exception occurs.

### Developer Configuration

* **Create WAV Record** - Create a WAV audio file on each whisper execution; also creates a '.prop' file containing the transcription.
* **Record Sample Format** - Change the record sample format. (allows i16 or f32)
* **Enable Whisper Log** - Emit whisper.cpp library logs as add-on debug logs.

You can find [here](https://github.com/givimad/whisper-finetune-oh) information on how to fine-tune a model using the generated records.

### Configuration via a text file

If you would like to set up the service via a text file, create a new file named `whisperstt.cfg` in `$OPENHAB_ROOT/conf/services`.

Its contents should look similar to:

```
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3
org.openhab.voice.whisperstt:vadStep=20
org.openhab.voice.whisperstt:singleUtteranceMode=true
org.openhab.voice.whisperstt:preloadModel=false
org.openhab.voice.whisperstt:vadMode=LOW_BITRATE
org.openhab.voice.whisperstt:vadSensitivity=0.1
org.openhab.voice.whisperstt:maxSilenceSeconds=2
org.openhab.voice.whisperstt:minSeconds=2
org.openhab.voice.whisperstt:maxSeconds=10
org.openhab.voice.whisperstt:threads=0
org.openhab.voice.whisperstt:audioContext=0
org.openhab.voice.whisperstt:samplingStrategy=GREEDY
org.openhab.voice.whisperstt:temperature=0
org.openhab.voice.whisperstt:noResultsMessage="Sorry, I didn't understand you"
org.openhab.voice.whisperstt:errorMessage="Sorry, something went wrong"
org.openhab.voice.whisperstt:createWAVRecord=false
org.openhab.voice.whisperstt:recordSampleFormat=i16
org.openhab.voice.whisperstt:speedUp=false
org.openhab.voice.whisperstt:beamSize=4
org.openhab.voice.whisperstt:enableWhisperLog=false
org.openhab.voice.whisperstt:greedyBestOf=4
org.openhab.voice.whisperstt:initialPrompt=
org.openhab.voice.whisperstt:openvinoDevice=""
org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=
```

### Default Speech-to-Text Configuration

You can select your preferred default Speech-to-Text in the UI:

* Go to **Settings**.
* Edit **System Services - Voice**.
* Set **Whisper** as **Speech-to-Text**.

If you would like to set up these settings via a text file, you can edit the file `runtime.cfg` in `$OPENHAB_ROOT/conf/services` and set the following entries:

```
org.openhab.voice:defaultSTT=whisperstt
```
29
bundles/org.openhab.voice.whisperstt/pom.xml
Normal file
@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.openhab.addons.bundles</groupId>
    <artifactId>org.openhab.addons.reactor.bundles</artifactId>
    <version>4.2.0-SNAPSHOT</version>
  </parent>

  <artifactId>org.openhab.voice.whisperstt</artifactId>

  <name>openHAB Add-ons :: Bundles :: Voice :: Whisper Speech-to-Text</name>
  <dependencies>
    <!--Deps -->
    <dependency>
      <groupId>io.github.givimad</groupId>
      <artifactId>whisper-jni</artifactId>
      <version>1.6.1</version>
    </dependency>
    <dependency>
      <groupId>io.github.givimad</groupId>
      <artifactId>libfvad-jni</artifactId>
      <version>1.0.0-0</version>
    </dependency>
  </dependencies>
</project>
@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<features name="org.openhab.voice.whisperstt-${project.version}" xmlns="http://karaf.apache.org/xmlns/features/v1.4.0">
  <repository>mvn:org.openhab.core.features.karaf/org.openhab.core.features.karaf.openhab-core/${ohc.version}/xml/features</repository>

  <feature name="openhab-voice-whisperstt" description="Whisper Speech-to-Text" version="${project.version}">
    <feature>openhab-runtime-base</feature>
    <bundle start-level="80">mvn:org.openhab.addons.bundles/org.openhab.voice.whisperstt/${project.version}</bundle>
  </feature>
</features>
@ -0,0 +1,77 @@
/**
 * Copyright (c) 2010-2024 Contributors to the openHAB project
 *
 * See the NOTICE file(s) distributed with this work for additional
 * information.
 *
 * This program and the accompanying materials are made available under the
 * terms of the Eclipse Public License 2.0 which is available at
 * http://www.eclipse.org/legal/epl-2.0
 *
 * SPDX-License-Identifier: EPL-2.0
 */
package org.openhab.voice.whisperstt.internal;

import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTService.WHISPER_FOLDER;

import java.net.URI;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable;
import org.openhab.core.config.core.ConfigOptionProvider;
import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.ParameterOption;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Component;

/**
 * The {@link WhisperConfigOptionProvider} class provides some dynamic configuration options
 *
 * @author Miguel Álvarez - Initial contribution
 */
@Component(service = ConfigOptionProvider.class, configurationPid = SERVICE_PID, property = Constants.SERVICE_PID + "="
        + SERVICE_PID)
@ConfigurableService(category = SERVICE_CATEGORY, label = SERVICE_NAME
        + " Speech-to-Text", description_uri = SERVICE_CATEGORY + ":" + SERVICE_ID)
@NonNullByDefault
public class WhisperConfigOptionProvider implements ConfigOptionProvider {
    @Override
    public @Nullable Collection<ParameterOption> getParameterOptions(URI uri, String param, @Nullable String context,
            @Nullable Locale locale) {
        if (context == null && (SERVICE_CATEGORY + ":" + SERVICE_ID).equals(uri.toString())) {
            if ("modelName".equals(param)) {
                return getAvailableModelOptions();
            }
        }
        return null;
    }

    private List<ParameterOption> getAvailableModelOptions() {
        var folderFile = WHISPER_FOLDER.toFile();
        var files = folderFile.listFiles();
        if (!folderFile.exists() || !folderFile.isDirectory() || files == null) {
            return List.of();
        }
        String modelExtension = ".bin";
        return Stream.of(files).filter(file -> !file.isDirectory() && file.getName().endsWith(modelExtension))
                .map(file -> {
                    // The option value is the full file name; the label drops the optional prefix and the extension.
                    String fileName = file.getName();
                    String optionName = fileName;
                    String optionalPrefix = "ggml-";
                    if (optionName.startsWith(optionalPrefix)) {
                        optionName = optionName.substring(optionalPrefix.length());
                    }
                    optionName = optionName.substring(0, optionName.length() - modelExtension.length());
                    return new ParameterOption(fileName, optionName);
                }).collect(Collectors.toList());
    }
}
@ -0,0 +1,149 @@
/**
 * Copyright (c) 2010-2024 Contributors to the openHAB project
 *
 * See the NOTICE file(s) distributed with this work for additional
 * information.
 *
 * This program and the accompanying materials are made available under the
 * terms of the Eclipse Public License 2.0 which is available at
 * http://www.eclipse.org/legal/epl-2.0
 *
 * SPDX-License-Identifier: EPL-2.0
 */
package org.openhab.voice.whisperstt.internal;

import java.util.List;

import org.eclipse.jdt.annotation.NonNullByDefault;

import io.github.givimad.libfvadjni.VoiceActivityDetector;

/**
 * The {@link WhisperSTTConfiguration} class contains fields mapping the service configuration parameters.
 *
 * @author Miguel Álvarez Díez - Initial contribution
 */
@NonNullByDefault
public class WhisperSTTConfiguration {

    /**
     * Model name without '.bin' extension.
     */
    public String modelName = "";
    /**
     * Keep model loaded.
     */
    public boolean preloadModel;
    /**
     * Defines the audio step.
     */
    public float stepSeconds = 1f;
    /**
     * Min audio seconds to call whisper with.
     */
    public float minSeconds = 2f;
    /**
     * Max seconds to wait before forcing the transcription.
     */
    public int maxSeconds = 10;
    /**
     * Voice activity detection mode.
     */
    public String vadMode = VoiceActivityDetector.Mode.VERY_AGGRESSIVE.toString();
    /**
     * Voice activity detection sensitivity.
     */
    public float vadSensitivity = 0.3f;
    /**
     * Voice activity detection step in ms (vad dependency only allows 10, 20 or 30 ms steps).
     */
    public int vadStep = 20;
    /**
     * Initial silence seconds after which the transcription is discarded.
     */
    public float initSilenceSeconds = 3;
    /**
     * Max silence seconds for triggering transcription.
     */
    public float maxSilenceSeconds = 0.5f;
    /**
     * Remove silence frames.
     */
    public boolean removeSilence = true;
    /**
     * Number of threads used by whisper. (0 to use host max threads)
     */
    public int threads;
    /**
     * Overwrite the audio context size. (0 to use whisper default context size).
     */
    public int audioContext;
    /**
     * Speed up audio by x2 (reduced accuracy).
     */
    public boolean speedUp;
    /**
     * Sampling strategy.
     */
    public String samplingStrategy = "BEAN_SEARCH";
    /**
     * Beam Size configuration for sampling strategy Beam Search.
     */
    public int beamSize = 2;
    /**
     * Best Of configuration for sampling strategy Greedy.
     */
    public int greedyBestOf = -1;
    /**
     * Temperature threshold.
     */
    public float temperature;
    /**
     * Initial whisper prompt
     */
    public String initialPrompt = "";
    /**
     * Grammar in GBNF format.
     */
    public List<String> grammarLines = List.of();
    /**
     * Enables grammar usage.
     */
    public boolean useGrammar = false;
    /**
     * Grammar penalty.
     */
    public float grammarPenalty = 100f;
    /**
     * Enables GPU usage. (built-in binaries do not support GPU usage)
     */
    public boolean useGPU = true;
    /**
     * OpenVINO device name
     */
    public String openvinoDevice = "CPU";
    /**
     * Single phrase mode.
     */
    public boolean singleUtteranceMode = true;
    /**
     * Message to be told when no results.
     */
    public String noResultsMessage = "Sorry, I didn't understand you";
    /**
     * Message to be told when an error has happened.
     */
    public String errorMessage = "Sorry, something went wrong";
    /**
     * Create wav audio record for each whisper invocation.
     */
    public boolean createWAVRecord;
    /**
     * Record sample format. Values: i16, f32.
     */
    public String recordSampleFormat = "i16";
    /**
     * Print whisper.cpp library logs as add-on debug logs.
     */
    public boolean enableWhisperLog;
}
@ -0,0 +1,44 @@
/**
 * Copyright (c) 2010-2024 Contributors to the openHAB project
 *
 * See the NOTICE file(s) distributed with this work for additional
 * information.
 *
 * This program and the accompanying materials are made available under the
 * terms of the Eclipse Public License 2.0 which is available at
 * http://www.eclipse.org/legal/epl-2.0
 *
 * SPDX-License-Identifier: EPL-2.0
 */
package org.openhab.voice.whisperstt.internal;

import org.eclipse.jdt.annotation.NonNullByDefault;

/**
 * The {@link WhisperSTTConstants} class defines common constants, which are
 * used across the whole binding.
 *
 * @author Miguel Álvarez Díez - Initial contribution
 */
@NonNullByDefault
public class WhisperSTTConstants {

    /**
     * Service name
     */
    public static final String SERVICE_NAME = "Whisper";
    /**
     * Service id
     */
    public static final String SERVICE_ID = "whisperstt";

    /**
     * Service category
     */
    public static final String SERVICE_CATEGORY = "voice";

    /**
     * Service pid
     */
    public static final String SERVICE_PID = "org.openhab." + SERVICE_CATEGORY + "." + SERVICE_ID;
}
@ -0,0 +1,657 @@
/**
 * Copyright (c) 2010-2024 Contributors to the openHAB project
 *
 * See the NOTICE file(s) distributed with this work for additional
 * information.
 *
 * This program and the accompanying materials are made available under the
 * terms of the Eclipse Public License 2.0 which is available at
 * http://www.eclipse.org/legal/epl-2.0
 *
 * SPDX-License-Identifier: EPL-2.0
 */
package org.openhab.voice.whisperstt.internal;

import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;

import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable;
import org.openhab.core.OpenHAB;
import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioStream;
import org.openhab.core.audio.utils.AudioWaveUtils;
import org.openhab.core.common.ThreadPoolManager;
import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.Configuration;
import org.openhab.core.io.rest.LocaleService;
import org.openhab.core.voice.RecognitionStartEvent;
import org.openhab.core.voice.RecognitionStopEvent;
import org.openhab.core.voice.STTException;
import org.openhab.core.voice.STTListener;
import org.openhab.core.voice.STTService;
import org.openhab.core.voice.STTServiceHandle;
import org.openhab.core.voice.SpeechRecognitionErrorEvent;
import org.openhab.core.voice.SpeechRecognitionEvent;
import org.openhab.voice.whisperstt.internal.utils.VAD;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;
import org.osgi.service.component.annotations.Modified;
import org.osgi.service.component.annotations.Reference;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import io.github.givimad.libfvadjni.VoiceActivityDetector;
import io.github.givimad.whisperjni.WhisperContext;
import io.github.givimad.whisperjni.WhisperContextParams;
import io.github.givimad.whisperjni.WhisperFullParams;
import io.github.givimad.whisperjni.WhisperGrammar;
import io.github.givimad.whisperjni.WhisperJNI;
import io.github.givimad.whisperjni.WhisperSamplingStrategy;
import io.github.givimad.whisperjni.WhisperState;

/**
 * The {@link WhisperSTTService} class is a service implementation to use whisper.cpp for Speech-to-Text.
 *
 * @author Miguel Álvarez - Initial contribution
 */
@NonNullByDefault
@Component(configurationPid = SERVICE_PID, property = Constants.SERVICE_PID + "=" + SERVICE_PID)
@ConfigurableService(category = SERVICE_CATEGORY, label = SERVICE_NAME
        + " Speech-to-Text", description_uri = SERVICE_CATEGORY + ":" + SERVICE_ID)
public class WhisperSTTService implements STTService {
    protected static final Path WHISPER_FOLDER = Path.of(OpenHAB.getUserDataFolder(), "whisper");
    private static final Path SAMPLES_FOLDER = Path.of(WHISPER_FOLDER.toString(), "samples");
    private static final int WHISPER_SAMPLE_RATE = 16000;
    private final Logger logger = LoggerFactory.getLogger(WhisperSTTService.class);
    private final ScheduledExecutorService executor = ThreadPoolManager.getScheduledPool("OH-voice-whisperstt");
    private final LocaleService localeService;
    private WhisperSTTConfiguration config = new WhisperSTTConfiguration();
    private @Nullable WhisperContext context;
    private @Nullable WhisperGrammar grammar;
    private @Nullable WhisperJNI whisper;

    @Activate
    public WhisperSTTService(@Reference LocaleService localeService) {
        this.localeService = localeService;
    }

    @Activate
    protected void activate(Map<String, Object> config) {
        try {
            if (!Files.exists(WHISPER_FOLDER)) {
                Files.createDirectory(WHISPER_FOLDER);
            }
            WhisperJNI.loadLibrary(getLoadOptions());
            VoiceActivityDetector.loadLibrary();
            whisper = new WhisperJNI();
        } catch (IOException | RuntimeException e) {
            logger.warn("Unable to register native library: {}", e.getMessage());
        }
        configChange(config);
    }

|
||||
private WhisperJNI.LoadOptions getLoadOptions() {
|
||||
Path libFolder = Paths.get("/usr/local/lib");
|
||||
Path libFolderWin = Paths.get("/Windows/System32");
|
||||
var options = new WhisperJNI.LoadOptions();
|
||||
// Override the bundled whisper-jni shared library when a system-wide copy exists
|
||||
Path whisperJNILinuxLibrary = libFolder.resolve("libwhisperjni.so");
|
||||
Path whisperJNIMacLibrary = libFolder.resolve("libwhisperjni.dylib");
|
||||
Path whisperJNIWinLibrary = libFolderWin.resolve("libwhisperjni.dll");
|
||||
if (Files.exists(whisperJNILinuxLibrary)) {
|
||||
options.whisperJNILib = whisperJNILinuxLibrary;
|
||||
} else if (Files.exists(whisperJNIMacLibrary)) {
|
||||
options.whisperJNILib = whisperJNIMacLibrary;
|
||||
} else if (Files.exists(whisperJNIWinLibrary)) {
|
||||
options.whisperJNILib = whisperJNIWinLibrary;
|
||||
}
|
||||
// Override the whisper shared library; Windows searches for the library in $env:PATH
|
||||
Path whisperLinuxLibrary = libFolder.resolve("libwhisper.so");
|
||||
Path whisperMacLibrary = libFolder.resolve("libwhisper.dylib");
|
||||
if (Files.exists(whisperLinuxLibrary)) {
|
||||
options.whisperLib = whisperLinuxLibrary;
|
||||
} else if (Files.exists(whisperMacLibrary)) {
|
||||
options.whisperLib = whisperMacLibrary;
|
||||
}
|
||||
// Log library registration
|
||||
options.logger = (msg) -> logger.debug("Library load: {}", msg);
|
||||
return options;
|
||||
}
|
||||
|
||||
@Modified
|
||||
protected void modified(Map<String, Object> config) {
|
||||
configChange(config);
|
||||
}
|
||||
|
||||
@Deactivate
|
||||
protected void deactivate(Map<String, Object> config) {
|
||||
try {
|
||||
WhisperGrammar grammar = this.grammar;
|
||||
if (grammar != null) {
|
||||
grammar.close();
|
||||
this.grammar = null;
|
||||
}
|
||||
unloadContext();
|
||||
} catch (IOException e) {
|
||||
logger.warn("IOException unloading model: {}", e.getMessage());
|
||||
}
|
||||
WhisperJNI.setLibraryLogger(null);
|
||||
}
|
||||
|
||||
private void configChange(Map<String, Object> config) {
|
||||
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
|
||||
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
|
||||
WhisperGrammar grammar = this.grammar;
|
||||
if (grammar != null) {
|
||||
grammar.close();
|
||||
this.grammar = null;
|
||||
}
|
||||
WhisperJNI whisper;
|
||||
try {
|
||||
whisper = getWhisper();
|
||||
} catch (IOException ignored) {
|
||||
logger.warn("library not loaded, the add-on will not work");
|
||||
return;
|
||||
}
|
||||
String grammarText = String.join("\n", this.config.grammarLines);
|
||||
if (this.config.useGrammar && isValidGrammar(grammarText)) {
|
||||
try {
|
||||
logger.debug("Parsing GBNF grammar...");
|
||||
this.grammar = whisper.parseGrammar(grammarText);
|
||||
} catch (IOException e) {
|
||||
logger.warn("Error parsing grammar: {}", e.getMessage());
|
||||
}
|
||||
}
|
||||
if (this.config.preloadModel) {
|
||||
try {
|
||||
loadContext();
|
||||
} catch (IOException e) {
|
||||
logger.warn("IOException loading model: {}", e.getMessage());
|
||||
} catch (UnsatisfiedLinkError e) {
|
||||
logger.warn("Missing native dependency: {}", e.getMessage());
|
||||
}
|
||||
} else {
|
||||
try {
|
||||
unloadContext();
|
||||
} catch (IOException e) {
|
||||
logger.warn("IOException unloading model: {}", e.getMessage());
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private boolean isValidGrammar(String grammarText) {
|
||||
try {
|
||||
WhisperGrammar.assertValidGrammar(grammarText);
|
||||
} catch (IllegalArgumentException | ParseException e) {
|
||||
logger.warn("Invalid grammar: {}", e.getMessage());
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getId() {
|
||||
return SERVICE_ID;
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getLabel(@Nullable Locale locale) {
|
||||
return SERVICE_NAME;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Set<Locale> getSupportedLocales() {
|
||||
// as it is not possible to determine the language of the model that was downloaded and set up by the user, it is
|
||||
// assumed that the language of the model matches the locale of the openHAB server
|
||||
return Set.of(localeService.getLocale(null));
|
||||
}
|
||||
|
||||
@Override
|
||||
public Set<AudioFormat> getSupportedFormats() {
|
||||
return Set.of(
|
||||
new AudioFormat(AudioFormat.CONTAINER_NONE, AudioFormat.CODEC_PCM_SIGNED, false, 16, null,
|
||||
(long) WHISPER_SAMPLE_RATE, 1),
|
||||
new AudioFormat(AudioFormat.CONTAINER_WAVE, AudioFormat.CODEC_PCM_SIGNED, false, 16, null,
|
||||
(long) WHISPER_SAMPLE_RATE, 1));
|
||||
}
|
||||
|
||||
@Override
|
||||
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
|
||||
throws STTException {
|
||||
AtomicBoolean aborted = new AtomicBoolean(false);
|
||||
WhisperContext ctx = null;
|
||||
WhisperState state = null;
|
||||
try {
|
||||
var whisper = getWhisper();
|
||||
ctx = getContext();
|
||||
logger.debug("Creating whisper state...");
|
||||
state = whisper.initState(ctx);
|
||||
logger.debug("Whisper state created");
|
||||
logger.debug("Creating VAD instance...");
|
||||
final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE);
|
||||
VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
|
||||
config.vadStep, config.vadSensitivity);
|
||||
logger.debug("VAD instance created");
|
||||
sttListener.sttEventReceived(new RecognitionStartEvent());
|
||||
backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted);
|
||||
} catch (IOException e) {
|
||||
if (ctx != null && !config.preloadModel) {
|
||||
ctx.close();
|
||||
}
|
||||
if (state != null) {
|
||||
state.close();
|
||||
}
|
||||
throw new STTException("Exception during initialization", e);
|
||||
}
|
||||
return () -> {
|
||||
aborted.set(true);
|
||||
};
|
||||
}
|
||||
|
||||
private WhisperJNI getWhisper() throws IOException {
|
||||
var whisper = this.whisper;
|
||||
if (whisper == null) {
|
||||
throw new IOException("Library not loaded");
|
||||
}
|
||||
return whisper;
|
||||
}
|
||||
|
||||
private WhisperContext getContext() throws IOException, UnsatisfiedLinkError {
|
||||
var context = this.context;
|
||||
if (context != null) {
|
||||
return context;
|
||||
}
|
||||
return loadContext();
|
||||
}
|
||||
|
||||
private synchronized WhisperContext loadContext() throws IOException {
|
||||
unloadContext();
|
||||
String modelFilename = this.config.modelName;
|
||||
if (modelFilename.isBlank()) {
|
||||
throw new IOException("The modelName configuration is missing");
|
||||
}
|
||||
String modelPrefix = "ggml-";
|
||||
String modelExtension = ".bin";
|
||||
if (!modelFilename.startsWith(modelPrefix)) {
|
||||
modelFilename = modelPrefix + modelFilename;
|
||||
}
|
||||
if (!modelFilename.endsWith(modelExtension)) {
|
||||
modelFilename = modelFilename + modelExtension;
|
||||
}
|
||||
Path modelPath = WHISPER_FOLDER.resolve(modelFilename);
|
||||
if (!Files.exists(modelPath) || Files.isDirectory(modelPath)) {
|
||||
throw new IOException("Missing model file: " + modelPath);
|
||||
}
|
||||
logger.debug("Loading whisper context...");
|
||||
WhisperJNI whisper = getWhisper();
|
||||
var context = whisper.initNoState(modelPath, getWhisperContextParams());
|
||||
logger.debug("Whisper context loaded");
|
||||
if (config.preloadModel) {
|
||||
this.context = context;
|
||||
}
|
||||
if (!config.openvinoDevice.isBlank()) {
|
||||
// has no effect if OpenVINO is not enabled in whisper.cpp library.
|
||||
logger.debug("Init OpenVINO device");
|
||||
whisper.initOpenVINO(context, config.openvinoDevice);
|
||||
}
|
||||
return context;
|
||||
}
|
||||
|
||||
private WhisperContextParams getWhisperContextParams() {
|
||||
var params = new WhisperContextParams();
|
||||
params.useGPU = config.useGPU;
|
||||
return params;
|
||||
}
|
||||
|
||||
private void unloadContext() throws IOException {
|
||||
var context = this.context;
|
||||
if (context != null) {
|
||||
logger.debug("Unloading model");
|
||||
context.close();
|
||||
this.context = null;
|
||||
}
|
||||
}
|
||||
|
||||
private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep,
|
||||
Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
|
||||
var releaseContext = !config.preloadModel;
|
||||
final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
|
||||
final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
|
||||
final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
|
||||
final int nMaxSilenceSamples = (int) (config.maxSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
|
||||
logger.debug("Samples per step {}", nSamplesStep);
|
||||
logger.debug("Min transcription samples {}", nSamplesMin);
|
||||
logger.debug("Max transcription samples {}", nSamplesMax);
|
||||
logger.debug("Max init silence samples {}", nInitSilenceSamples);
|
||||
logger.debug("Max silence samples {}", nMaxSilenceSamples);
|
||||
// stores the step samples in the format libfvad expects: 16-bit int
|
||||
final short[] stepAudioSamples = new short[nSamplesStep];
|
||||
// stores the full samples in the format whisper expects: 32-bit float
|
||||
final float[] audioSamples = new float[nSamplesMax];
|
||||
executor.submit(() -> {
|
||||
int audioSamplesOffset = 0;
|
||||
int silenceSamplesCounter = 0;
|
||||
int nProcessedSamples = 0;
|
||||
int numBytesRead;
|
||||
boolean voiceDetected = false;
|
||||
String transcription = "";
|
||||
String tempTranscription = "";
|
||||
VAD.@Nullable VADResult lastVADResult;
|
||||
VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
|
||||
try {
|
||||
try (state; //
|
||||
audioStream; //
|
||||
vad) {
|
||||
if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
|
||||
AudioWaveUtils.removeFMT(audioStream);
|
||||
}
|
||||
final ByteBuffer captureBuffer = ByteBuffer.allocate(nSamplesStep * 2)
|
||||
.order(ByteOrder.LITTLE_ENDIAN);
|
||||
// init remaining to full capacity
|
||||
int remaining = captureBuffer.capacity();
|
||||
WhisperFullParams params = getWhisperFullParams(ctx, locale);
|
||||
while (!aborted.get()) {
|
||||
// keep reading until the buffer is full so we get a complete step of samples
|
||||
numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
|
||||
remaining);
|
||||
if (aborted.get() || numBytesRead == -1) {
|
||||
break;
|
||||
}
|
||||
if (numBytesRead != remaining) {
|
||||
remaining = remaining - numBytesRead;
|
||||
continue;
|
||||
}
|
||||
// reset remaining to full capacity
|
||||
remaining = captureBuffer.capacity();
|
||||
// encode step samples and copy them to the audio buffers
|
||||
var shortBuffer = captureBuffer.asShortBuffer();
|
||||
while (shortBuffer.hasRemaining()) {
|
||||
var position = shortBuffer.position();
|
||||
short i16BitSample = shortBuffer.get();
|
||||
float f32BitSample = Float.min(1f,
|
||||
Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
|
||||
stepAudioSamples[position] = i16BitSample;
|
||||
audioSamples[audioSamplesOffset++] = f32BitSample;
|
||||
nProcessedSamples++;
|
||||
}
|
||||
// run vad
|
||||
if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
|
||||
logger.debug("VAD: Skipping, max length reached");
|
||||
} else {
|
||||
lastVADResult = vad.analyze(stepAudioSamples);
|
||||
if (lastVADResult.isVoice()) {
|
||||
voiceDetected = true;
|
||||
logger.debug("VAD: voice detected");
|
||||
silenceSamplesCounter = 0;
|
||||
firstConsecutiveSilenceVADResult = null;
|
||||
continue;
|
||||
} else {
|
||||
if (firstConsecutiveSilenceVADResult == null) {
|
||||
firstConsecutiveSilenceVADResult = lastVADResult;
|
||||
}
|
||||
silenceSamplesCounter += nSamplesStep;
|
||||
int maxSilenceSamples = voiceDetected ? nMaxSilenceSamples : nInitSilenceSamples;
|
||||
if (silenceSamplesCounter < maxSilenceSamples) {
|
||||
if (logger.isDebugEnabled()) {
|
||||
int totalSteps = maxSilenceSamples / nSamplesStep;
|
||||
int currentSteps = totalSteps
|
||||
- ((maxSilenceSamples - silenceSamplesCounter) / nSamplesStep);
|
||||
logger.debug("VAD: silence detected {}/{}", currentSteps, totalSteps);
|
||||
}
|
||||
if (!voiceDetected && config.removeSilence) {
|
||||
logger.debug("removing start silence");
|
||||
int samplesToKeep = lastVADResult.voiceSamplesInTail();
|
||||
if (samplesToKeep > 0) {
|
||||
for (int i = 0; i < samplesToKeep; i++) {
|
||||
audioSamples[i] = audioSamples[audioSamplesOffset
|
||||
- (samplesToKeep - i)];
|
||||
}
|
||||
audioSamplesOffset = samplesToKeep;
|
||||
logger.debug("some audio was kept");
|
||||
} else {
|
||||
audioSamplesOffset = 0;
|
||||
}
|
||||
}
|
||||
continue;
|
||||
} else {
|
||||
logger.debug("VAD: silence detected");
|
||||
if (audioSamplesOffset < nSamplesMin) {
|
||||
logger.debug("Not enough samples, continue");
|
||||
continue;
|
||||
}
|
||||
if (config.singleUtteranceMode) {
|
||||
// close the audio stream so we stop receiving audio we don't need
|
||||
try {
|
||||
audioStream.close();
|
||||
} catch (IOException ignored) {
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
if (config.removeSilence) {
|
||||
if (voiceDetected) {
|
||||
logger.debug("removing end silence");
|
||||
int samplesToKeep = firstConsecutiveSilenceVADResult.voiceSamplesInHead();
|
||||
if (samplesToKeep > 0) {
|
||||
logger.debug("some audio was kept");
|
||||
}
|
||||
var samplesToRemove = silenceSamplesCounter - samplesToKeep;
|
||||
if (audioSamplesOffset - samplesToRemove < nSamplesMin) {
|
||||
logger.debug("avoid removing under min audio seconds");
|
||||
samplesToRemove = audioSamplesOffset - nSamplesMin;
|
||||
}
|
||||
if (samplesToRemove > 0) {
|
||||
audioSamplesOffset -= samplesToRemove;
|
||||
}
|
||||
} else {
|
||||
audioSamplesOffset = 0;
|
||||
}
|
||||
}
|
||||
if (audioSamplesOffset == 0) {
|
||||
if (config.singleUtteranceMode) {
|
||||
logger.debug("no audio to transcribe, ending");
|
||||
break;
|
||||
} else {
|
||||
logger.debug("no audio to transcribe, continue listening");
|
||||
continue;
|
||||
}
|
||||
}
|
||||
}
|
||||
// run whisper
|
||||
logger.debug("running whisper with {} seconds of audio...",
|
||||
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
|
||||
long execStartTime = System.currentTimeMillis();
|
||||
var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset);
|
||||
logger.debug("whisper ended in {}ms with result code {}",
|
||||
System.currentTimeMillis() - execStartTime, result);
|
||||
// process result
|
||||
if (result != 0) {
|
||||
emitSpeechRecognitionError(sttListener);
|
||||
break;
|
||||
}
|
||||
int nSegments = whisper.fullNSegmentsFromState(state);
|
||||
logger.debug("Available transcription segments {}", nSegments);
|
||||
if (nSegments == 1) {
|
||||
tempTranscription = whisper.fullGetSegmentTextFromState(state, 0);
|
||||
if (config.createWAVRecord) {
|
||||
createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
|
||||
locale.getLanguage());
|
||||
}
|
||||
if (config.singleUtteranceMode) {
|
||||
logger.debug("single utterance mode, ending transcription");
|
||||
transcription = tempTranscription;
|
||||
break;
|
||||
} else {
|
||||
// start a new transcription segment
|
||||
transcription += tempTranscription;
|
||||
tempTranscription = "";
|
||||
}
|
||||
} else if (nSegments == 0 && config.singleUtteranceMode) {
|
||||
logger.debug("Single utterance mode and no results, ending transcription");
|
||||
break;
|
||||
} else if (nSegments > 1) {
|
||||
// unreachable: singleSegment is always enabled in getWhisperFullParams
|
||||
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
|
||||
break;
|
||||
}
|
||||
// reset state to start with next segment
|
||||
voiceDetected = false;
|
||||
silenceSamplesCounter = 0;
|
||||
audioSamplesOffset = 0;
|
||||
logger.debug("Partial transcription: {}", tempTranscription);
|
||||
logger.debug("Transcription: {}", transcription);
|
||||
}
|
||||
} finally {
|
||||
if (releaseContext) {
|
||||
ctx.close();
|
||||
}
|
||||
}
|
||||
// emit result
|
||||
if (!aborted.get()) {
|
||||
sttListener.sttEventReceived(new RecognitionStopEvent());
|
||||
logger.debug("Final transcription: '{}'", transcription);
|
||||
if (!transcription.isBlank()) {
|
||||
sttListener.sttEventReceived(new SpeechRecognitionEvent(transcription.trim(), 1));
|
||||
} else {
|
||||
emitSpeechRecognitionNoResultsError(sttListener);
|
||||
}
|
||||
}
|
||||
} catch (IOException e) {
|
||||
logger.warn("Error running speech to text: {}", e.getMessage());
|
||||
emitSpeechRecognitionError(sttListener);
|
||||
} catch (UnsatisfiedLinkError e) {
|
||||
logger.warn("Missing native dependency: {}", e.getMessage());
|
||||
emitSpeechRecognitionError(sttListener);
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException {
|
||||
WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
|
||||
var params = new WhisperFullParams(strategy);
|
||||
params.temperature = config.temperature;
|
||||
params.nThreads = config.threads;
|
||||
params.audioCtx = config.audioContext;
|
||||
params.speedUp = config.speedUp;
|
||||
params.beamSearchBeamSize = config.beamSize;
|
||||
params.greedyBestOf = config.greedyBestOf;
|
||||
if (!config.initialPrompt.isBlank()) {
|
||||
params.initialPrompt = config.initialPrompt;
|
||||
}
|
||||
if (grammar != null) {
|
||||
params.grammar = grammar;
|
||||
params.grammarPenalty = config.grammarPenalty;
|
||||
}
|
||||
// the only single-language models are the English ones
|
||||
params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en";
|
||||
// the implementation assumes these options
|
||||
params.translate = false;
|
||||
params.detectLanguage = false;
|
||||
params.printProgress = false;
|
||||
params.noTimestamps = true;
|
||||
params.printRealtime = false;
|
||||
params.printSpecial = false;
|
||||
params.printTimestamps = false;
|
||||
params.suppressBlank = true;
|
||||
params.suppressNonSpeechTokens = true;
|
||||
params.singleSegment = true;
|
||||
params.noContext = true;
|
||||
return params;
|
||||
}
|
||||
|
||||
private void emitSpeechRecognitionNoResultsError(STTListener sttListener) {
|
||||
sttListener.sttEventReceived(new SpeechRecognitionErrorEvent(config.noResultsMessage));
|
||||
}
|
||||
|
||||
private void emitSpeechRecognitionError(STTListener sttListener) {
|
||||
sttListener.sttEventReceived(new SpeechRecognitionErrorEvent(config.errorMessage));
|
||||
}
|
||||
|
||||
private void createSamplesDir() {
|
||||
if (!Files.exists(SAMPLES_FOLDER)) {
|
||||
try {
|
||||
Files.createDirectory(SAMPLES_FOLDER);
|
||||
logger.info("Whisper samples dir created {}", SAMPLES_FOLDER);
|
||||
} catch (IOException ignored) {
|
||||
logger.warn("Unable to create whisper samples dir {}", SAMPLES_FOLDER);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private void createAudioFile(float[] samples, int size, String transcription, String language) {
|
||||
createSamplesDir();
|
||||
javax.sound.sampled.AudioFormat jAudioFormat;
|
||||
ByteBuffer byteBuffer;
|
||||
if ("i16".equals(config.recordSampleFormat)) {
|
||||
logger.debug("Saving audio file with sample format i16");
|
||||
jAudioFormat = new javax.sound.sampled.AudioFormat(javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED,
|
||||
WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
|
||||
byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
|
||||
for (int i = 0; i < size; i++) {
|
||||
byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE));
|
||||
}
|
||||
} else {
|
||||
logger.debug("Saving audio file with sample format f32");
|
||||
jAudioFormat = new javax.sound.sampled.AudioFormat(javax.sound.sampled.AudioFormat.Encoding.PCM_FLOAT,
|
||||
WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
|
||||
byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
|
||||
for (int i = 0; i < size; i++) {
|
||||
byteBuffer.putFloat(samples[i]);
|
||||
}
|
||||
}
|
||||
AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
|
||||
jAudioFormat, size);
|
||||
try {
|
||||
var scapedTranscription = transcription.replaceAll("[^a-zA-ZÀ-ú0-9.-]", "_");
|
||||
if (scapedTranscription.length() > 60) {
|
||||
scapedTranscription = scapedTranscription.substring(0, 60);
|
||||
}
|
||||
String fileName = new SimpleDateFormat("yyyy-MM-dd.HH.mm.ss.SS").format(new Date()) + "("
|
||||
+ scapedTranscription + ")";
|
||||
Path audioPath = Path.of(SAMPLES_FOLDER.toString(), fileName + ".wav");
|
||||
Path propertiesPath = Path.of(SAMPLES_FOLDER.toString(), fileName + ".props");
|
||||
logger.debug("Saving audio file: {}", audioPath);
|
||||
FileOutputStream audioFileOutputStream = new FileOutputStream(audioPath.toFile());
|
||||
AudioSystem.write(audioInputStreamTemp, AudioFileFormat.Type.WAVE, audioFileOutputStream);
|
||||
audioFileOutputStream.close();
|
||||
String properties = "transcription=" + transcription + "\nlanguage=" + language + "\n";
|
||||
logger.debug("Saving properties file: {}", propertiesPath);
|
||||
FileOutputStream propertiesFileOutputStream = new FileOutputStream(propertiesPath.toFile());
|
||||
propertiesFileOutputStream.write(properties.getBytes(StandardCharsets.UTF_8));
|
||||
propertiesFileOutputStream.close();
|
||||
} catch (IOException e) {
|
||||
logger.warn("Unable to store sample.", e);
|
||||
}
|
||||
}
|
||||
|
||||
private void onWhisperLog(String text) {
|
||||
logger.debug("[whisper.cpp] {}", text);
|
||||
}
|
||||
}
|
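The sample conversion performed in the read loop of backgroundRecognize can be looked at in isolation. The following is a minimal sketch and not part of this commit: the class name PcmConversionSketch and the method toFloatSamples are hypothetical, and the code only mirrors the conversion from signed 16-bit little-endian PCM to 32-bit floats clamped to [-1, 1].

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;
import java.util.Arrays;

// Hypothetical helper for illustration; the add-on does this inline.
class PcmConversionSketch {

    /** Converts signed 16-bit little-endian PCM bytes to floats clamped to [-1, 1]. */
    static float[] toFloatSamples(byte[] pcmBytes) {
        ShortBuffer shorts = ByteBuffer.wrap(pcmBytes).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer();
        float[] out = new float[shorts.remaining()];
        for (int i = 0; i < out.length; i++) {
            // same scaling as the service: divide by Short.MAX_VALUE, then clamp
            float scaled = (float) shorts.get(i) / (float) Short.MAX_VALUE;
            out[i] = Float.min(1f, Float.max(scaled, -1f));
        }
        return out;
    }

    public static void main(String[] args) {
        // three samples: 0, Short.MAX_VALUE and Short.MIN_VALUE
        byte[] pcm = { 0x00, 0x00, (byte) 0xFF, 0x7F, 0x00, (byte) 0x80 };
        System.out.println(Arrays.toString(toFloatSamples(pcm)));
    }
}
```

The clamp matters for Short.MIN_VALUE, which would otherwise map slightly below -1 because the divisor is Short.MAX_VALUE.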
@ -0,0 +1,95 @@
|
||||
/**
|
||||
* Copyright (c) 2010-2024 Contributors to the openHAB project
|
||||
*
|
||||
* See the NOTICE file(s) distributed with this work for additional
|
||||
* information.
|
||||
*
|
||||
* This program and the accompanying materials are made available under the
|
||||
* terms of the Eclipse Public License 2.0 which is available at
|
||||
* http://www.eclipse.org/legal/epl-2.0
|
||||
*
|
||||
* SPDX-License-Identifier: EPL-2.0
|
||||
*/
|
||||
package org.openhab.voice.whisperstt.internal.utils;
|
||||
|
||||
import java.io.IOException;
|
||||
|
||||
import org.eclipse.jdt.annotation.NonNullByDefault;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import io.github.givimad.libfvadjni.VoiceActivityDetector;
|
||||
|
||||
/**
|
||||
* The {@link VAD} class is a voice activity detector implementation over libfvad-jni.
|
||||
*
|
||||
* @author Miguel Álvarez - Initial contribution
|
||||
*/
|
||||
@NonNullByDefault
|
||||
public class VAD implements AutoCloseable {
|
||||
private final Logger logger = LoggerFactory.getLogger(VAD.class);
|
||||
private final VoiceActivityDetector libfvad;
|
||||
private final short[] stepSamples;
|
||||
private final int totalPartialDetections;
|
||||
private final int detectionThreshold;
|
||||
|
||||
/**
|
||||
* Creates a voice activity detector backed by libfvad.
|
||||
* @param mode desired vad mode.
|
||||
* @param sampleRate audio sample rate.
|
||||
* @param frameSize detector input frame size.
|
||||
* @param stepMs detector partial step ms.
|
||||
* @param sensitivity detector sensitivity percent in range 0 - 1.
|
||||
* @throws IOException if the native detector cannot be created or configured
|
||||
*/
|
||||
public VAD(VoiceActivityDetector.Mode mode, int sampleRate, int frameSize, int stepMs, float sensitivity)
|
||||
throws IOException {
|
||||
this.libfvad = VoiceActivityDetector.newInstance();
|
||||
this.libfvad.setMode(mode);
|
||||
this.libfvad.setSampleRate(VoiceActivityDetector.SampleRate.fromValue(sampleRate));
|
||||
this.stepSamples = new short[sampleRate / 1000 * stepMs];
|
||||
this.totalPartialDetections = (frameSize / stepSamples.length);
|
||||
this.detectionThreshold = (int) ((((float) totalPartialDetections) / 100f) * (sensitivity * 100));
|
||||
}
|
||||
|
||||
public VADResult analyze(short[] samples) throws IOException {
|
||||
int voiceInHead = 0;
|
||||
int voiceInTail = 0;
|
||||
boolean silenceFound = false;
|
||||
int partialVADCounter = 0;
|
||||
for (int i = 0; i < totalPartialDetections; i++) {
|
||||
System.arraycopy(samples, i * stepSamples.length, stepSamples, 0, stepSamples.length);
|
||||
if (libfvad.process(stepSamples, stepSamples.length)) {
|
||||
partialVADCounter++;
|
||||
if (!silenceFound) {
|
||||
voiceInHead++;
|
||||
}
|
||||
voiceInTail++;
|
||||
} else {
|
||||
silenceFound = true;
|
||||
voiceInTail = 0;
|
||||
}
|
||||
}
|
||||
logger.debug("VAD: {}/{} - required: {}", partialVADCounter, totalPartialDetections, detectionThreshold);
|
||||
return new VADResult( //
|
||||
partialVADCounter >= detectionThreshold, //
|
||||
voiceInHead * stepSamples.length, //
|
||||
voiceInTail * stepSamples.length //
|
||||
);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void close() {
|
||||
libfvad.close();
|
||||
}
|
||||
|
||||
/**
|
||||
* Voice activity detection result.
|
||||
*
|
||||
* @param isVoice Whether the block contains enough voice
|
||||
* @param voiceSamplesInHead Number of samples consecutively reported as voice from the beginning of the chunk
|
||||
* @param voiceSamplesInTail Number of samples consecutively reported as voice from the end of the chunk
|
||||
*/
|
||||
public record VADResult(boolean isVoice, int voiceSamplesInHead, int voiceSamplesInTail) {
|
||||
}
|
||||
}
|
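The arithmetic in the VAD constructor decides how many sub-frames of an audio step must be reported as voice. The following is a minimal sketch and not part of this commit: the class VadThresholdSketch and its method names are hypothetical, and the formulas are copied from the constructor above without calling libfvad.

```java
// Hypothetical helper for illustration; VAD computes these values in its constructor.
class VadThresholdSketch {

    /** Number of samples in one sub-frame of stepMs milliseconds. */
    static int subFrameSamples(int sampleRate, int stepMs) {
        return sampleRate / 1000 * stepMs;
    }

    /** Number of sub-frames analyzed per audio step of frameSize samples. */
    static int partialDetections(int frameSize, int sampleRate, int stepMs) {
        return frameSize / subFrameSamples(sampleRate, stepMs);
    }

    /** Number of voiced sub-frames required for the step to count as voice. */
    static int detectionThreshold(int totalPartialDetections, float sensitivity) {
        return (int) ((((float) totalPartialDetections) / 100f) * (sensitivity * 100));
    }

    public static void main(String[] args) {
        int sampleRate = 16000;
        int frameSize = sampleRate; // stepSeconds = 1
        int total = partialDetections(frameSize, sampleRate, 20); // vadStep = 20
        int threshold = detectionThreshold(total, 0.3f); // vadSensitivity = 0.3
        System.out.println(total + " sub-frames, " + threshold + " must be voice");
    }
}
```

With the default configuration values (stepSeconds 1, vadStep 20, vadSensitivity 0.3) this gives 50 sub-frames of 320 samples each, of which 15 must be voiced.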
@ -0,0 +1,15 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<addon:addon id="whisperstt" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
||||
xmlns:addon="https://openhab.org/schemas/addon/v1.0.0"
|
||||
xsi:schemaLocation="https://openhab.org/schemas/addon/v1.0.0 https://openhab.org/schemas/addon-1.0.0.xsd">
|
||||
|
||||
<type>voice</type>
|
||||
<name>Whisper Speech-to-Text</name>
|
||||
<description>Whisper STT Service uses the whisper.cpp library to transcribe audio data to text.</description>
|
||||
<connection>none</connection>
|
||||
|
||||
<service-id>org.openhab.voice.whisperstt</service-id>
|
||||
|
||||
<config-description-ref uri="voice:whisperstt"/>
|
||||
|
||||
</addon:addon>
|
@ -0,0 +1,229 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<config-description:config-descriptions
|
||||
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
||||
xmlns:config-description="https://openhab.org/schemas/config-description/v1.0.0"
|
||||
xsi:schemaLocation="https://openhab.org/schemas/config-description/v1.0.0
|
||||
https://openhab.org/schemas/config-description-1.0.0.xsd">
|
||||
<config-description uri="voice:whisperstt">
|
||||
<parameter-group name="stt">
|
||||
<label>STT Configuration</label>
|
||||
<description>Configure Speech to Text.</description>
|
||||
</parameter-group>
|
||||
<parameter-group name="vad">
|
||||
<label>Voice Activity Detection</label>
|
||||
<description>Configure the VAD mechanism used to isolate single phrases to feed to whisper.</description>
|
||||
</parameter-group>
|
||||
<parameter-group name="whisper">
|
||||
<label>Whisper Options</label>
|
||||
<description>Configure the whisper.cpp transcription options.</description>
|
||||
</parameter-group>
|
||||
<parameter-group name="grammar">
|
||||
<label>Grammar</label>
|
||||
<description>Define a grammar to improve transcriptions.</description>
|
||||
</parameter-group>
|
||||
<parameter-group name="messages">
|
||||
<label>Info Messages</label>
|
||||
<description>Configure service information messages.</description>
|
||||
</parameter-group>
|
||||
<parameter-group name="developer">
|
||||
<label>Developer</label>
|
||||
<description>Options added for developers.</description>
|
||||
<advanced>true</advanced>
|
||||
</parameter-group>
|
||||
<parameter name="modelName" type="text" groupName="stt" required="true">
|
||||
<label>Model Name</label>
|
||||
<description>Model name without extension.</description>
|
||||
</parameter>
|
||||
<parameter name="preloadModel" type="boolean" groupName="stt">
|
||||
<label>Preload Model</label>
|
||||
<description>Keep the model loaded. If the parameter is set to true, the model will be reloaded only on
|
||||
configuration
|
||||
updates. If the model is not loaded when needed, the service will try to load it. If the parameter is
|
||||
set to false,
|
||||
the model will be loaded and unloaded on each run.
|
||||
</description>
|
||||
<default>false</default>
|
||||
</parameter>
|
||||
<parameter name="singleUtteranceMode" type="boolean" groupName="stt">
|
||||
<label>Single Utterance Mode</label>
|
||||
<description>When enabled, recognition stops listening after a single utterance.</description>
|
||||
<default>true</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="minSeconds" type="decimal" step="0.1" min="1" unit="s" groupName="stt">
|
||||
<label>Min Transcription Seconds</label>
|
||||
<description>Minimum seconds of audio passed to whisper for transcription.</description>
|
||||
<default>2</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="maxSeconds" type="integer" min="2" unit="s" groupName="stt">
|
||||
<label>Max Transcription Seconds</label>
|
||||
<description>Maximum seconds of audio; transcription is forced when reached, even if no silence was detected.</description>
|
||||
<default>10</default>
|
||||
</parameter>
|
||||
<parameter name="initSilenceSeconds" type="decimal" min="0.1" step="0.1" unit="s" groupName="stt">
|
||||
<label>Initial Silence Seconds</label>
|
||||
<description>Maximum seconds of initial silence before the transcription is discarded.</description>
|
||||
<default>3</default>
|
||||
</parameter>
|
||||
<parameter name="maxSilenceSeconds" type="decimal" min="0.1" step="0.1" unit="s" groupName="stt">
|
||||
<label>Max Silence Seconds</label>
|
||||
<description>Seconds of silence to trigger transcription.</description>
|
||||
<default>0.5</default>
|
||||
</parameter>
|
||||
<parameter name="removeSilence" type="boolean" groupName="stt">
|
||||
<label>Remove Silence</label>
|
||||
<description>Remove silence frames from the beginning and end of the audio.</description>
|
||||
<default>true</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="stepSeconds" type="decimal" groupName="vad">
|
||||
<label>Audio Step</label>
|
||||
<description>Audio step for the voice activity detection.</description>
|
||||
<default>1</default>
|
||||
<options>
|
||||
<option value="0.1">100ms</option>
|
||||
<option value="0.2">200ms</option>
|
||||
<option value="0.3">300ms</option>
|
||||
<option value="0.5">500ms</option>
|
||||
<option value="0.6">600ms</option>
|
||||
<option value="1">1s</option>
|
||||
</options>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="vadSensitivity" type="decimal" groupName="vad" min="0" max="1" step="0.01">
|
||||
<label>Voice Activity Detection Sensitivity</label>
|
||||
<description>Fraction (0-1) of each analyzed audio step that must contain voice activity for the step to be considered voice.</description>
|
||||
<default>0.3</default>
|
||||
</parameter>
|
||||
<parameter name="vadMode" type="text" groupName="vad">
|
||||
<label>Voice Activity Detection Mode</label>
|
||||
<description>Available VAD modes. Quality is the most likely to detect voice.</description>
|
||||
<default>VERY_AGGRESSIVE</default>
|
||||
<options>
|
||||
<option value="QUALITY">Quality</option>
|
||||
<option value="LOW_BITRATE">Low Bitrate</option>
|
||||
<option value="AGGRESSIVE">Aggressive</option>
|
||||
<option value="VERY_AGGRESSIVE">Very Aggressive</option>
|
||||
</options>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="vadStep" type="integer" groupName="vad">
|
||||
<label>Voice Activity Detector Step</label>
|
||||
<description>Audio milliseconds passed to the voice activity detector. Defines how many times the voice activity
|
||||
detector is executed per audio step.</description>
|
||||
<default>20</default>
|
||||
<options>
|
||||
<option value="10">10ms</option>
|
||||
<option value="20">20ms</option>
|
||||
<option value="30">30ms</option>
|
||||
</options>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="threads" type="integer" groupName="whisper">
|
||||
<label>Threads</label>
|
||||
<description>Number of threads used by whisper. (0 to use host max threads)</description>
|
||||
<default>0</default>
|
||||
</parameter>
|
||||
<parameter name="audioContext" type="integer" groupName="whisper" min="0">
|
||||
<label>Audio Context</label>
|
||||
<description>Overwrite the audio context size. (0 to use whisper default context size)</description>
|
||||
<default>0</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="samplingStrategy" type="text" groupName="whisper">
|
||||
<label>Sampling strategy</label>
|
||||
<description>Sampling strategy used.</description>
|
||||
<default>BEAN_SEARCH</default>
|
||||
<options>
|
||||
<option value="GREEDY">Greedy</option>
|
||||
<option value="BEAN_SEARCH">Beam Search</option>
|
||||
</options>
|
||||
</parameter>
|
||||
<parameter name="beamSize" type="integer" groupName="whisper" min="1">
|
||||
<label>Beam Size</label>
|
||||
<description>Beam Size configuration for sampling strategy Beam Search.</description>
|
||||
<default>2</default>
|
||||
</parameter>
|
||||
<parameter name="greedyBestOf" type="integer" groupName="whisper" min="-1">
|
||||
<label>Greedy Best Of</label>
|
||||
<description>Best Of configuration for sampling strategy Greedy. (-1 for unlimited)</description>
|
||||
<default>-1</default>
|
||||
</parameter>
|
||||
<parameter name="temperature" type="decimal" groupName="whisper">
|
||||
<label>Temperature</label>
|
||||
<description>Temperature threshold.</description>
|
||||
<default>0</default>
|
||||
</parameter>
|
||||
<parameter name="initialPrompt" type="text" groupName="whisper">
|
||||
<label>Initial Prompt</label>
|
||||
<description>Initial prompt to feed whisper with.</description>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="openvinoDevice" type="text" groupName="whisper">
|
||||
<label>OpenVINO Device</label>
|
||||
<description>Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)</description>
|
||||
<advanced>true</advanced>
|
||||
<default>CPU</default>
|
||||
</parameter>
|
||||
<parameter name="speedUp" type="boolean" groupName="whisper">
|
||||
<label>Speed Up</label>
|
||||
<description>Speed up audio by x2. (reduced accuracy)</description>
|
||||
<default>false</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="useGPU" type="boolean" groupName="whisper">
|
||||
<label>Use GPU</label>
|
||||
<description>Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)</description>
|
||||
<default>true</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="useGrammar" type="boolean" groupName="grammar">
|
||||
<label>Use Grammar</label>
|
||||
<description>Enables grammar usage.</description>
|
||||
<default>false</default>
|
||||
</parameter>
|
||||
<parameter name="grammarPenalty" type="decimal" groupName="grammar" min="0" max="100" step="0.1">
|
||||
<label>Grammar Penalty</label>
|
||||
<description>Penalty for non grammar tokens when using grammar.</description>
|
||||
<default>100</default>
|
||||
</parameter>
|
||||
<parameter name="grammarLines" type="text" groupName="grammar" multiple="true">
|
||||
<label>Grammar</label>
|
||||
<description>Grammar to use in GBNF format. (BNF variant used by whisper.cpp).</description>
|
||||
<default></default>
|
||||
</parameter>
|
||||
<parameter name="noResultsMessage" type="text" groupName="messages">
|
||||
<label>No Results Message</label>
|
||||
<description>Message to be told when no results. (Empty for disabled)</description>
|
||||
<default>Sorry, I didn't understand you</default>
|
||||
</parameter>
|
||||
<parameter name="errorMessage" type="text" groupName="messages">
|
||||
<label>Error Message</label>
|
||||
<description>Message to be told when an error has happened. (Empty for disabled)</description>
|
||||
<default>Sorry, something went wrong</default>
|
||||
</parameter>
|
||||
<parameter name="createWAVRecord" type="boolean" groupName="developer">
|
||||
<label>Create WAV Record</label>
|
||||
<description>Create WAV audio record on each whisper execution.</description>
|
||||
<default>false</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="recordSampleFormat" type="text" groupName="developer">
|
||||
<label>Record Sample Format</label>
|
||||
<description>Defines the sample type and bit-size used by the created WAV audio record.</description>
|
||||
<default>i16</default>
|
||||
<options>
|
||||
<option value="i16">Integer 16bit</option>
|
||||
<option value="f32">Float 32bit</option>
|
||||
</options>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
<parameter name="enableWhisperLog" type="boolean" groupName="developer">
|
||||
<label>Enable Whisper Log</label>
|
||||
<description>Emit whisper.cpp library logs as add-on debug logs.</description>
|
||||
<default>false</default>
|
||||
<advanced>true</advanced>
|
||||
</parameter>
|
||||
</config-description>
|
||||
</config-description:config-descriptions>
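The `stepSeconds`, `vadStep` and `vadSensitivity` descriptions above imply some simple arithmetic: each audio step is split into `vadStep`-millisecond chunks, the detector runs once per chunk, and the step counts as voice when the share of voiced chunks reaches `vadSensitivity`. The following is a minimal sketch of that reading, not the add-on's implementation. The class and method names are hypothetical, and the `>=` comparison is an assumption.

```java
// Hypothetical illustration only; names are not taken from the add-on source.
final class VadStepSketch {

    /** Number of detector runs per audio step, e.g. 1 s / 20 ms = 50. */
    static int detectionsPerStep(double stepSeconds, int vadStepMs) {
        int stepMs = (int) Math.round(stepSeconds * 1000);
        return stepMs / vadStepMs;
    }

    /**
     * Treats the step as voice when the fraction of voiced chunks reaches
     * the sensitivity (assumed to be an inclusive threshold).
     */
    static boolean isVoice(int voicedChunks, int totalChunks, double vadSensitivity) {
        if (totalChunks <= 0) {
            return false;
        }
        return (double) voicedChunks / totalChunks >= vadSensitivity;
    }

    public static void main(String[] args) {
        // Defaults from the config description: stepSeconds=1, vadStep=20, vadSensitivity=0.3
        int runs = detectionsPerStep(1.0, 20);
        System.out.println("detector runs per step: " + runs);
        System.out.println("20 voiced chunks -> voice? " + isVoice(20, runs, 0.3));
        System.out.println("10 voiced chunks -> voice? " + isVoice(10, runs, 0.3));
    }
}
```

With the defaults, a smaller `vadStep` means more detector runs per step, and a lower `vadSensitivity` means fewer voiced chunks are needed for the step to count as voice.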
|
@ -0,0 +1,94 @@
|
||||
# add-on
|
||||
|
||||
addon.whisperstt.name = Whisper Speech-to-Text
|
||||
addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcribe audio data to text.
|
||||
|
||||
voice.config.whisperstt.audioContext.label = Audio Context
|
||||
voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size)
|
||||
voice.config.whisperstt.beamSize.label = Beam Size
|
||||
voice.config.whisperstt.beamSize.description = Beam Size configuration for sampling strategy Beam Search.
|
||||
voice.config.whisperstt.createWAVRecord.label = Create WAV Record
|
||||
voice.config.whisperstt.createWAVRecord.description = Create WAV audio record on each whisper execution.
|
||||
voice.config.whisperstt.enableWhisperLog.label = Enable Whisper Log
|
||||
voice.config.whisperstt.enableWhisperLog.description = Emit whisper.cpp library logs as add-on debug logs.
|
||||
voice.config.whisperstt.noResultsMessage.label = No Results Message
|
||||
voice.config.whisperstt.noResultsMessage.description = Message to be told when no results. (Empty for disabled)
|
||||
voice.config.whisperstt.errorMessage.label = Error Message
|
||||
voice.config.whisperstt.errorMessage.description = Message to be told when an error has happened. (Empty for disabled)
|
||||
voice.config.whisperstt.grammarLines.label = Grammar
|
||||
voice.config.whisperstt.grammarLines.description = Grammar to use in GBNF format. (BNF variant used by whisper.cpp).
|
||||
voice.config.whisperstt.grammarPenalty.label = Grammar Penalty
|
||||
voice.config.whisperstt.grammarPenalty.description = Penalty for non grammar tokens when using grammar.
|
||||
voice.config.whisperstt.greedyBestOf.label = Greedy Best Of
|
||||
voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sampling strategy Greedy. (-1 for unlimited)
|
||||
voice.config.whisperstt.group.developer.label = Developer
|
||||
voice.config.whisperstt.group.developer.description = Options added for developers.
|
||||
voice.config.whisperstt.group.grammar.label = Grammar
|
||||
voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcriptions.
|
||||
voice.config.whisperstt.group.messages.label = Info Messages
|
||||
voice.config.whisperstt.group.messages.description = Configure service information messages.
|
||||
voice.config.whisperstt.group.stt.label = STT Configuration
|
||||
voice.config.whisperstt.group.stt.description = Configure Speech to Text.
|
||||
voice.config.whisperstt.group.vad.label = Voice Activity Detection
|
||||
voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with.
|
||||
voice.config.whisperstt.group.whisper.label = Whisper Options
|
||||
voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options.
|
||||
voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds
|
||||
voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription.
|
||||
voice.config.whisperstt.initialPrompt.label = Initial Prompt
|
||||
voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with.
|
||||
voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds
|
||||
voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection.
|
||||
voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds
|
||||
voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription.
|
||||
voice.config.whisperstt.minSeconds.label = Min Transcription Seconds
|
||||
voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper.
|
||||
voice.config.whisperstt.modelName.label = Model Name
|
||||
voice.config.whisperstt.modelName.description = Model name without extension.
|
||||
voice.config.whisperstt.openvinoDevice.label = OpenVINO Device
|
||||
voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
|
||||
voice.config.whisperstt.preloadModel.label = Preload Model
|
||||
voice.config.whisperstt.preloadModel.description = Keep the model loaded. If the parameter is set to true, the model will be reloaded only on configuration updates. If the model is not loaded when needed, the service will try to load it. If the parameter is set to false, the model will be loaded and unloaded on each run.
|
||||
voice.config.whisperstt.recordSampleFormat.label = Record Sample Format
|
||||
voice.config.whisperstt.recordSampleFormat.description = Defines the sample type and bit-size used by the created WAV audio record.
|
||||
voice.config.whisperstt.recordSampleFormat.option.i16 = Integer 16bit
|
||||
voice.config.whisperstt.recordSampleFormat.option.f32 = Float 32bit
|
||||
voice.config.whisperstt.removeSilence.label = Remove Silence
|
||||
voice.config.whisperstt.removeSilence.description = Remove silence frames from the beginning and end of the audio.
|
||||
voice.config.whisperstt.samplingStrategy.label = Sampling strategy
|
||||
voice.config.whisperstt.samplingStrategy.description = Sampling strategy used.
|
||||
voice.config.whisperstt.samplingStrategy.option.GREEDY = Greedy
|
||||
voice.config.whisperstt.samplingStrategy.option.BEAN_SEARCH = Beam Search
|
||||
voice.config.whisperstt.singleUtteranceMode.label = Single Utterance Mode
|
||||
voice.config.whisperstt.singleUtteranceMode.description = When enabled recognition stops listening after a single utterance.
|
||||
voice.config.whisperstt.speedUp.label = Speed Up
|
||||
voice.config.whisperstt.speedUp.description = Speed up audio by x2. (reduced accuracy)
|
||||
voice.config.whisperstt.stepSeconds.label = Audio Step
|
||||
voice.config.whisperstt.stepSeconds.description = Audio step for the voice activity detection.
|
||||
voice.config.whisperstt.stepSeconds.option.0.1 = 100ms
|
||||
voice.config.whisperstt.stepSeconds.option.0.2 = 200ms
|
||||
voice.config.whisperstt.stepSeconds.option.0.3 = 300ms
|
||||
voice.config.whisperstt.stepSeconds.option.0.5 = 500ms
|
||||
voice.config.whisperstt.stepSeconds.option.0.6 = 600ms
|
||||
voice.config.whisperstt.stepSeconds.option.1 = 1s
|
||||
voice.config.whisperstt.temperature.label = Temperature
|
||||
voice.config.whisperstt.temperature.description = Temperature threshold.
|
||||
voice.config.whisperstt.threads.label = Threads
|
||||
voice.config.whisperstt.threads.description = Number of threads used by whisper. (0 to use host max threads)
|
||||
voice.config.whisperstt.useGPU.label = Use GPU
|
||||
voice.config.whisperstt.useGPU.description = Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
|
||||
voice.config.whisperstt.useGrammar.label = Use Grammar
|
||||
voice.config.whisperstt.useGrammar.description = Enables grammar usage.
|
||||
voice.config.whisperstt.vadMode.label = Voice Activity Detection Mode
|
||||
voice.config.whisperstt.vadMode.description = Available VAD modes. Quality is the most likely to detect voice.
|
||||
voice.config.whisperstt.vadMode.option.QUALITY = Quality
|
||||
voice.config.whisperstt.vadMode.option.LOW_BITRATE = Low Bitrate
|
||||
voice.config.whisperstt.vadMode.option.AGGRESSIVE = Aggressive
|
||||
voice.config.whisperstt.vadMode.option.VERY_AGGRESSIVE = Very Aggressive
|
||||
voice.config.whisperstt.vadSensitivity.label = Voice Activity Detection Sensitivity
|
||||
voice.config.whisperstt.vadSensitivity.description = Percentage in range 0-1 of voice activity in each audio step analyzed to consider it as voice.
|
||||
voice.config.whisperstt.vadStep.label = Voice Activity Detector Step
|
||||
voice.config.whisperstt.vadStep.description = Audio milliseconds passed to the voice activity detector. Defines how many times the voice activity detector is executed per audio step.
|
||||
voice.config.whisperstt.vadStep.option.10 = 10ms
|
||||
voice.config.whisperstt.vadStep.option.20 = 20ms
|
||||
voice.config.whisperstt.vadStep.option.30 = 30ms
|
@ -472,6 +472,7 @@
|
||||
<module>org.openhab.voice.voicerss</module>
|
||||
<module>org.openhab.voice.voskstt</module>
|
||||
<module>org.openhab.voice.watsonstt</module>
|
||||
<module>org.openhab.voice.whisperstt</module>
|
||||
</modules>
|
||||
|
||||
<properties>
|
||||
|