[WhisperSTT] Initial contribution (#15166)

Signed-off-by: Miguel Álvarez <miguelwork92@gmail.com>
Signed-off-by: GiviMAD <GiviMAD@users.noreply.github.com>
GiviMAD 2024-06-23 13:31:05 +02:00 committed by GitHub
parent 56db6f8bce
commit bf822211d9
GPG Key ID: B5690EEEBB952194
15 changed files with 1688 additions and 0 deletions


@ -451,6 +451,7 @@
/bundles/org.openhab.voice.voicerss/ @lolodomo
/bundles/org.openhab.voice.voskstt/ @GiviMAD
/bundles/org.openhab.voice.watsonstt/ @GiviMAD
/bundles/org.openhab.voice.whisperstt/ @GiviMAD
/itests/org.openhab.automation.groovyscripting.tests/ @wborn
/itests/org.openhab.automation.jsscriptingnashorn.tests/ @wborn
/itests/org.openhab.binding.astro.tests/ @gerrieg


@ -2251,6 +2251,11 @@
<artifactId>org.openhab.voice.watsonstt</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.openhab.addons.bundles</groupId>
<artifactId>org.openhab.voice.whisperstt</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
</project>


@ -0,0 +1,35 @@
This content is produced and maintained by the openHAB project.
* Project home: https://www.openhab.org
== Declared Project Licenses
This program and the accompanying materials are made available under the terms
of the Eclipse Public License 2.0 which is available at
https://www.eclipse.org/legal/epl-2.0/.
== Source Code
https://github.com/openhab/openhab-addons
== Third-party Content
io.github.givimad: whisper-jni
* License: Apache 2.0 License
* Project: https://github.com/GiviMAD/whisper-jni
* Source: https://github.com/GiviMAD/whisper-jni/tree/main/src/
native dependency: whisper.cpp
* License: MIT License https://github.com/ggerganov/whisper.cpp/blob/master/LICENSE
* Project: https://github.com/ggerganov/whisper.cpp
* Source: https://github.com/ggerganov/whisper.cpp
io.github.givimad: libfvad-jni
* License: Apache 2.0 License https://github.com/GiviMAD/libfvad-jni/blob/main/LICENSE
* Project: https://github.com/GiviMAD/libfvad-jni
* Source: https://github.com/GiviMAD/libfvad-jni/tree/main/src/
native dependency: libfvad
* License: BSD License https://github.com/dpirch/libfvad/blob/master/LICENSE
* Project: https://github.com/dpirch/libfvad
* Source: https://github.com/dpirch/libfvad


@ -0,0 +1,248 @@
# Whisper Speech-to-Text
Whisper STT Service uses [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to perform offline speech-to-text in openHAB.
It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity detection to isolate a single command to transcribe, which speeds up execution.
[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that is easy to integrate into different platforms and applications.
Whisper enables speech recognition for multiple languages and dialects:
English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Polish, Catalan, Dutch, Arabic, Swedish,
Italian, Indonesian, Hindi, Finnish, Vietnamese, Hebrew, Ukrainian, Greek, Malay, Czech, Romanian, Danish, Hungarian, Tamil, Norwegian,
Thai, Urdu, Croatian, Bulgarian, Lithuanian, Latin, Maori, Malayalam, Welsh, Slovak, Telugu, Persian, Latvian, Bengali, Serbian, Azerbaijani,
Slovenian, Kannada, Estonian, Macedonian, Breton, Basque, Icelandic, Armenian, Nepali, Mongolian, Bosnian, Kazakh, Albanian, Swahili, Galician,
Marathi, Punjabi, Sinhala, Khmer, Shona, Yoruba, Somali, Afrikaans, Occitan, Georgian, Belarusian, Tajik, Sindhi, Gujarati, Amharic, Yiddish, Lao,
Uzbek, Faroese, Haitian, Pashto, Turkmen, Nynorsk, Maltese, Sanskrit, Luxembourgish, Myanmar, Tibetan, Tagalog, Malagasy, Assamese, Tatar, Lingala,
Hausa, Bashkir, Javanese and Sundanese.
## Supported platforms
This add-on relies on native binaries.
It uses the [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and the [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
The following platforms are supported:
* Windows 10 x86_64
* Debian GLIBC x86_64/arm64 (min GLIBC version 2.31, as shipped with Debian Bullseye or Ubuntu Focal)
* macOS x86_64/arm64 (min version v11.0)
The native binaries for those platforms are included in this add-on provided with the openHAB distribution.
## CPU compatibility
To use this add-on, a device at least as powerful as the Raspberry Pi 5 with a modern CPU is recommended.
Execution times on a Raspberry Pi 4 are roughly twice as long, so only the tiny model runs in under 5 seconds.
If you are going to use the add-on on an `x86_64` host, the CPU should support the flags `avx2`, `fma`, `f16c` and `avx`.
You can check those flags on Linux from the terminal with `lscpu`.
You can check those flags on Windows using a program like `CPU-Z`.
If you are going to use the add-on on an `arm64` host, the CPU should support the flag `fphp`.
You can check that flag on Linux from the terminal with `lscpu`.
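As a rough illustration of the check described above, the following Java sketch compares a whitespace-separated flags string (for example, the text of the `Flags:` line printed by `lscpu`) against the flags listed in this section. The class and method names are hypothetical and are not part of the add-on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the add-on: checks a whitespace-separated
// CPU flags string against the flags listed in this README.
final class CpuFlagCheck {
    static final List<String> X86_64_FLAGS = List.of("avx2", "fma", "f16c", "avx");
    static final List<String> ARM64_FLAGS = List.of("fphp");

    // Returns the required flags that are absent from the given flags string.
    static List<String> missingFlags(String cpuFlags, List<String> required) {
        String normalized = cpuFlags.trim().toLowerCase(Locale.ROOT);
        Set<String> present = new HashSet<>(Arrays.asList(normalized.split("\\s+")));
        List<String> missing = new ArrayList<>();
        for (String flag : required) {
            if (!present.contains(flag)) {
                missing.add(flag);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Pass the flags text as the first argument, e.g. copied from the lscpu output.
        String flags = args.length > 0 ? args[0] : "";
        System.out.println("Missing x86_64 flags: " + missingFlags(flags, X86_64_FLAGS));
    }
}
```

An empty result means every listed flag was found in the supplied text.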
## Transcription time
On a Raspberry PI 5, the approximate transcription times are:
| model | exec time |
| ---------- | --------: |
| tiny.bin | 1.5s |
| base.bin | 3s |
| small.bin | 8.5s |
| medium.bin | 17s |
## Configuring the model
Before you can use this service you need to configure a model.
You can download models from the sources provided by the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) author:
* https://huggingface.co/ggerganov/whisper.cpp
* https://ggml.ggerganov.com
Place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so the add-on can find it.
Remember to check that you have enough RAM to load the model; the estimated RAM consumption is listed on the Hugging Face link.
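The file name the add-on looks for in that folder is derived from the configured model name: the 'ggml-' prefix and the '.bin' extension are added when missing. The sketch below mirrors the name handling in the `loadContext` method included in this commit; the class name `ModelFileResolver` and the folder argument are illustrative only.

```java
import java.nio.file.Path;

// Illustrative sketch: maps a configured model name to the file looked up in
// the whisper folder. The folder argument stands in for '<openHAB userdata>/whisper/'.
final class ModelFileResolver {
    private static final String PREFIX = "ggml-";
    private static final String EXTENSION = ".bin";

    // Adds the 'ggml-' prefix and '.bin' extension when they are missing.
    static String toFileName(String modelName) {
        String fileName = modelName;
        if (!fileName.startsWith(PREFIX)) {
            fileName = PREFIX + fileName;
        }
        if (!fileName.endsWith(EXTENSION)) {
            fileName = fileName + EXTENSION;
        }
        return fileName;
    }

    // Resolves the model file inside the given whisper folder.
    static Path resolve(Path whisperFolder, String modelName) {
        return whisperFolder.resolve(toFileName(modelName));
    }
}
```

With this rule, both `tiny.en` and `ggml-tiny.en.bin` resolve to the same file, `ggml-tiny.en.bin`.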
## Using alternative whisper.cpp library
It's possible to use your own build of the whisper.cpp shared library with this add-on.
On Linux/macOS you need to place `libwhisper.so`/`libwhisper.dylib` in `/usr/local/lib/`.
On Windows the `whisper.dll` file needs to be placed in any directory listed in the `$env:PATH` variable, for example `X:\Windows\System32\`.
The [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README describes the CMake flags required to enable the different acceleration methods, along with other relevant build information.
Note: You need to restart openHAB to reload the library.
## Grammar
The whisper.cpp library allows you to define a grammar that alters the transcription results without fine-tuning the model.
Internally whisper works by inferring a matrix of possible tokens from the audio and then resolving the final transcription from it using either the Greedy or the Beam Search algorithm.
The grammar feature modifies the probabilities of the inferred tokens by adding a penalty to the tokens outside the grammar, so that the transcription is resolved differently.
This helps the smallest models perform better over a limited grammar.
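The effect of the penalty can be pictured with a toy example. The sketch below is a simplified illustration and not the whisper.cpp implementation: it assumes one score per candidate token, subtracts a penalty from every token the grammar does not allow, and then picks the highest score greedily. All names are hypothetical.

```java
import java.util.Set;

// Toy illustration of penalizing out-of-grammar tokens before a greedy pick.
// Assumption: each index is a candidate token and a higher score is more likely.
final class GrammarPenaltySketch {
    // Returns a copy of the scores with the penalty subtracted from every
    // token index that is not in the allowed set.
    static double[] applyPenalty(double[] scores, Set<Integer> allowedTokens, double penalty) {
        double[] adjusted = scores.clone();
        for (int i = 0; i < adjusted.length; i++) {
            if (!allowedTokens.contains(i)) {
                adjusted[i] -= penalty;
            }
        }
        return adjusted;
    }

    // Greedy choice: index of the highest score (the first one wins on ties).
    static int greedyPick(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

In this toy setup a token that would win on raw scores loses to an in-grammar token once the penalty is applied, which is how a small model can be steered towards the expected commands.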
The grammar should be defined using [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form), and the root variable should resolve the full grammar.
It allows regular expressions and optional parts to make the grammar more dynamic.
This is a basic grammar example:
```BNF
root ::= (light_switch | light_state | tv_channel) "."
light_switch ::= "turn the light " ("on" | "off")
light_state ::= "set light to " ("high" | "low")
tv_channel ::= ("set ")? "tv channel to " [0-9]+
```
You can provide the grammar and enable its usage in the add-on configuration.
## Configuration
Use your favorite configuration UI to edit the Whisper settings:
### Speech to Text Configuration
General options.
* **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
* **Preload Model** - Keep whisper model loaded.
* **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
* **Min Transcription Seconds** - Minimum duration, in seconds, of the audio passed to whisper.
* **Max Transcription Seconds** - Maximum seconds before the transcription is forced, without waiting for silence to be detected.
* **Initial Silence Seconds** - Max seconds without any voice activity to abort the transcription.
* **Max Silence Seconds** - Max consecutive silence seconds to trigger the transcription.
* **Remove Silence** - Remove start and end silence from the audio to transcribe.
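How the timing options above might interact can be sketched as a small decision function. This is an assumption-laden simplification: the add-on's real recognition loop is not reproduced here and may differ, **Min Transcription Seconds** is not modeled, and every name in the block is hypothetical.

```java
// Simplified sketch of how the timing options could interact; the add-on's
// real recognition loop is not reproduced here and may differ.
final class TranscriptionTrigger {
    enum Action {
        CONTINUE, TRANSCRIBE, ABORT
    }

    private final float initSilenceSeconds;
    private final float maxSilenceSeconds;
    private final float maxSeconds;

    TranscriptionTrigger(float initSilenceSeconds, float maxSilenceSeconds, float maxSeconds) {
        this.initSilenceSeconds = initSilenceSeconds;
        this.maxSilenceSeconds = maxSilenceSeconds;
        this.maxSeconds = maxSeconds;
    }

    // audioSeconds: audio captured so far; voiceDetected: whether any voice
    // was heard yet; trailingSilenceSeconds: consecutive silence at the end.
    Action decide(float audioSeconds, boolean voiceDetected, float trailingSilenceSeconds) {
        if (!voiceDetected) {
            // Initial Silence Seconds: give up if nobody started speaking.
            return audioSeconds >= initSilenceSeconds ? Action.ABORT : Action.CONTINUE;
        }
        // Max Transcription Seconds or Max Silence Seconds: run the transcription.
        if (audioSeconds >= maxSeconds || trailingSilenceSeconds >= maxSilenceSeconds) {
            return Action.TRANSCRIBE;
        }
        return Action.CONTINUE;
    }
}
```

For instance, with an initial silence of 3 seconds, a max silence of 0.5 seconds and a max duration of 10 seconds, half a second of silence after speech is enough to trigger the transcription in this sketch.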
### Voice Activity Detection Configuration
Configure VAD options.
* **Audio Step** - Audio processing step in seconds for the voice activity detection.
* **Voice Activity Detection Mode** - Selected VAD Mode.
* **Voice Activity Detection Sensitivity** - Percentage in range 0-1 of voice activity in one second to consider it as voice.
* **Voice Activity Detection Step** - VAD detector internal step in ms (only allows 10, 20 or 30). (Audio Step / Voice Activity Detection Step = number of vad executions per audio step).
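The relation between the audio step and the VAD step can be worked out with a few lines of arithmetic. The sketch below assumes the 16000 Hz sample rate used by the add-on (`WHISPER_SAMPLE_RATE`); the class and method names are hypothetical, and the `isVoice` rule is only one plausible reading of the sensitivity option.

```java
// Worked arithmetic for the VAD options; names are illustrative only.
final class VadStepMath {
    static final int SAMPLE_RATE = 16000; // same value as WHISPER_SAMPLE_RATE

    // Number of audio samples contained in one audio step.
    static int samplesPerAudioStep(float stepSeconds) {
        return (int) (stepSeconds * SAMPLE_RATE);
    }

    // Audio Step / Voice Activity Detection Step = VAD executions per audio step.
    static int vadRunsPerAudioStep(float stepSeconds, int vadStepMs) {
        if (vadStepMs != 10 && vadStepMs != 20 && vadStepMs != 30) {
            throw new IllegalArgumentException("VAD step must be 10, 20 or 30 ms");
        }
        return Math.round(stepSeconds * 1000f) / vadStepMs;
    }

    // Assumed interpretation of the sensitivity: the audio counts as voice when
    // the share of VAD executions that detected voice reaches the threshold.
    static boolean isVoice(int voicedRuns, int totalRuns, float sensitivity) {
        return totalRuns > 0 && (float) voicedRuns / totalRuns >= sensitivity;
    }
}
```

With an audio step of 1 second and a VAD step of 20 ms, the detector runs 50 times per audio step over 16000 samples.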
### Whisper Configuration
Configure whisper options.
* **Threads** - Number of threads used by whisper. (0 to use host max threads)
* **Sampling Strategy** - Sampling strategy used.
* **Beam Size** - Beam Size configuration for sampling strategy Beam Search.
* **Greedy Best Of** - Best Of configuration for sampling strategy Greedy.
* **Speed Up** - Speed up audio by x2. (Reduced accuracy)
* **Audio Context** - Overwrite the audio context size. (0 to use whisper default context size)
* **Temperature** - Temperature threshold.
* **Initial Prompt** - Initial prompt for whisper.
* **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
* **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
### Grammar Configuration
Configure the grammar options.
* **Grammar** - Grammar to use in GBNF format (whisper.cpp BNF variant).
* **Use Grammar** - Enable grammar usage.
* **Grammar penalty** - Penalty for non-grammar tokens.
#### Grammar Example:
```gbnf
# Grammar should define a root expression that should end with a dot.
root ::= " " command "."
# Alternative command expression to expand into the root.
command ::= "Turn " onoff " " (connector)? thing |
put " " thing " to " state |
watch " " show " at bedroom" |
"Start " timer " minutes timer"
# You can use as many expressions as you need.
thing ::= "light" | "bedroom light" | "living room light" | "tv"
put ::= "set" | "put"
onoff ::= "on" | "off"
watch ::= "watch" | "play"
connector ::= "the"
state ::= "low" | "high" | "normal"
show ::= [a-zA-Z]+
timer ::= [0-9]+
```
### Messages Configuration
* **No Results Message** - Message spoken when there are no results.
* **Error Message** - Message spoken when an exception occurs.
### Developer Configuration
* **Create WAV Record** - Create wav audio file on each whisper execution, also creates a '.prop' file containing the transcription.
* **Record Sample Format** - Change the record sample format. (allows i16 or f32)
* **Enable Whisper Log** - Emit whisper.cpp library logs as add-on debug logs.
Information on how to fine-tune a model using the generated records is available [here](https://github.com/givimad/whisper-finetune-oh).
### Configuration via a text file
If you would like to set up the service via a text file, create a new file named `whisperstt.cfg` in `$OPENHAB_ROOT/conf/services`.
Its contents should look similar to:
```
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3
org.openhab.voice.whisperstt:vadStep=20
org.openhab.voice.whisperstt:singleUtteranceMode=true
org.openhab.voice.whisperstt:preloadModel=false
org.openhab.voice.whisperstt:vadMode=LOW_BITRATE
org.openhab.voice.whisperstt:vadSensitivity=0.1
org.openhab.voice.whisperstt:maxSilenceSeconds=2
org.openhab.voice.whisperstt:minSeconds=2
org.openhab.voice.whisperstt:maxSeconds=10
org.openhab.voice.whisperstt:threads=0
org.openhab.voice.whisperstt:audioContext=0
org.openhab.voice.whisperstt:samplingStrategy=GREEDY
org.openhab.voice.whisperstt:temperature=0
org.openhab.voice.whisperstt:noResultsMessage="Sorry, I didn't understand you"
org.openhab.voice.whisperstt:errorMessage="Sorry, something went wrong"
org.openhab.voice.whisperstt:createWAVRecord=false
org.openhab.voice.whisperstt:recordSampleFormat=i16
org.openhab.voice.whisperstt:speedUp=false
org.openhab.voice.whisperstt:beamSize=4
org.openhab.voice.whisperstt:enableWhisperLog=false
org.openhab.voice.whisperstt:greedyBestOf=4
org.openhab.voice.whisperstt:initialPrompt=
org.openhab.voice.whisperstt:openvinoDevice=""
org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=
```
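Each line in the file above follows a `<service pid>:<parameter>=<value>` shape. The sketch below splits one such line into its three parts to make that structure explicit. It is not openHAB's configuration parser, and the class name and quote handling are assumptions made for illustration.

```java
// Illustrative only: splits a '<pid>:<key>=<value>' line into its parts.
// This is not the parser openHAB uses for .cfg files.
final class ServiceCfgLine {
    final String pid;
    final String key;
    final String value;

    private ServiceCfgLine(String pid, String key, String value) {
        this.pid = pid;
        this.key = key;
        this.value = value;
    }

    static ServiceCfgLine parse(String line) {
        int colon = line.indexOf(':');
        int equals = line.indexOf('=', colon + 1);
        if (colon <= 0 || equals < 0) {
            throw new IllegalArgumentException("Expected <pid>:<key>=<value> but got: " + line);
        }
        String value = line.substring(equals + 1).trim();
        // Assumption: surrounding double quotes only delimit the value.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new ServiceCfgLine(line.substring(0, colon).trim(), line.substring(colon + 1, equals).trim(), value);
    }
}
```

Parsing `org.openhab.voice.whisperstt:modelName=tiny` this way yields the service PID `org.openhab.voice.whisperstt`, the parameter `modelName` and the value `tiny`.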
### Default Speech-to-Text Configuration
You can select your preferred default Speech-to-Text in the UI:
* Go to **Settings**.
* Edit **System Services - Voice**.
* Set **Whisper** as **Speech-to-Text**.
If you would like to configure this via a text file, edit the file `runtime.cfg` in `$OPENHAB_ROOT/conf/services` and set the following entry:
```
org.openhab.voice:defaultSTT=whisperstt
```


@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.openhab.addons.bundles</groupId>
<artifactId>org.openhab.addons.reactor.bundles</artifactId>
<version>4.2.0-SNAPSHOT</version>
</parent>
<artifactId>org.openhab.voice.whisperstt</artifactId>
<name>openHAB Add-ons :: Bundles :: Voice :: Whisper Speech-to-Text</name>
<dependencies>
<!--Deps -->
<dependency>
<groupId>io.github.givimad</groupId>
<artifactId>whisper-jni</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>io.github.givimad</groupId>
<artifactId>libfvad-jni</artifactId>
<version>1.0.0-0</version>
</dependency>
</dependencies>
</project>


@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<features name="org.openhab.voice.whisperstt-${project.version}" xmlns="http://karaf.apache.org/xmlns/features/v1.4.0">
<repository>mvn:org.openhab.core.features.karaf/org.openhab.core.features.karaf.openhab-core/${ohc.version}/xml/features</repository>
<feature name="openhab-voice-whisperstt" description="Whisper Speech-to-Text" version="${project.version}">
<feature>openhab-runtime-base</feature>
<bundle start-level="80">mvn:org.openhab.addons.bundles/org.openhab.voice.whisperstt/${project.version}</bundle>
</feature>
</features>


@ -0,0 +1,77 @@
/**
* Copyright (c) 2010-2024 Contributors to the openHAB project
*
* See the NOTICE file(s) distributed with this work for additional
* information.
*
* This program and the accompanying materials are made available under the
* terms of the Eclipse Public License 2.0 which is available at
* http://www.eclipse.org/legal/epl-2.0
*
* SPDX-License-Identifier: EPL-2.0
*/
package org.openhab.voice.whisperstt.internal;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTService.WHISPER_FOLDER;
import java.net.URI;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable;
import org.openhab.core.config.core.ConfigOptionProvider;
import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.ParameterOption;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Component;
/**
* The {@link WhisperConfigOptionProvider} class provides some dynamic configuration options
*
* @author Miguel Álvarez - Initial contribution
*/
@Component(service = ConfigOptionProvider.class, configurationPid = SERVICE_PID, property = Constants.SERVICE_PID + "="
+ SERVICE_PID)
@ConfigurableService(category = SERVICE_CATEGORY, label = SERVICE_NAME
+ " Speech-to-Text", description_uri = SERVICE_CATEGORY + ":" + SERVICE_ID)
@NonNullByDefault
public class WhisperConfigOptionProvider implements ConfigOptionProvider {
@Override
public @Nullable Collection<ParameterOption> getParameterOptions(URI uri, String param, @Nullable String context,
@Nullable Locale locale) {
if (context == null && (SERVICE_CATEGORY + ":" + SERVICE_ID).equals(uri.toString())) {
if ("modelName".equals(param)) {
return getAvailableModelOptions();
}
}
return null;
}
private List<ParameterOption> getAvailableModelOptions() {
var folderFile = WHISPER_FOLDER.toFile();
var files = folderFile.listFiles();
if (!folderFile.exists() || !folderFile.isDirectory() || files == null) {
return List.of();
}
String modelExtension = ".bin";
return Stream.of(files).filter(file -> !file.isDirectory() && file.getName().endsWith(modelExtension))
.map(file -> {
String fileName = file.getName();
String optionName = file.getName();
String optionalPrefix = "ggml-";
if (optionName.startsWith(optionalPrefix)) {
optionName = optionName.substring(optionalPrefix.length());
}
optionName = optionName.substring(0, optionName.length() - modelExtension.length());
return new ParameterOption(fileName, optionName);
}).collect(Collectors.toList());
}
}


@ -0,0 +1,149 @@
/**
* Copyright (c) 2010-2024 Contributors to the openHAB project
*
* See the NOTICE file(s) distributed with this work for additional
* information.
*
* This program and the accompanying materials are made available under the
* terms of the Eclipse Public License 2.0 which is available at
* http://www.eclipse.org/legal/epl-2.0
*
* SPDX-License-Identifier: EPL-2.0
*/
package org.openhab.voice.whisperstt.internal;
import java.util.List;
import org.eclipse.jdt.annotation.NonNullByDefault;
import io.github.givimad.libfvadjni.VoiceActivityDetector;
/**
* The {@link WhisperSTTConfiguration} class contains fields mapping the service configuration parameters.
*
* @author Miguel Álvarez Díez - Initial contribution
*/
@NonNullByDefault
public class WhisperSTTConfiguration {
/**
* Model name without '.bin' extension.
*/
public String modelName = "";
/**
* Keep model loaded.
*/
public boolean preloadModel;
/**
* Defines the audio step.
*/
public float stepSeconds = 1f;
/**
* Min audio seconds to call whisper with.
*/
public float minSeconds = 2f;
/**
* Max seconds to wait to force stop the transcription.
*/
public int maxSeconds = 10;
/**
* Voice activity detection mode.
*/
public String vadMode = VoiceActivityDetector.Mode.VERY_AGGRESSIVE.toString();
/**
* Voice activity detection sensitivity.
*/
public float vadSensitivity = 0.3f;
/**
* Voice activity detection step in ms (vad dependency only allows 10, 20 or 30 ms steps).
*/
public int vadStep = 20;
/**
* Initial silence seconds after which the transcription is discarded when no voice is detected.
*/
public float initSilenceSeconds = 3;
/**
* Max silence seconds for triggering transcription.
*/
public float maxSilenceSeconds = 0.5f;
/**
* Remove silence frames.
*/
public boolean removeSilence = true;
/**
* Number of threads used by whisper. (0 to use host max threads)
*/
public int threads;
/**
* Overwrite the audio context size. (0 to use whisper default context size).
*/
public int audioContext;
/**
* Speed up audio by x2 (reduced accuracy).
*/
public boolean speedUp;
/**
* Sampling strategy.
*/
public String samplingStrategy = "BEAN_SEARCH";
/**
* Beam Size configuration for sampling strategy Beam Search.
*/
public int beamSize = 2;
/**
* Best Of configuration for sampling strategy Greedy.
*/
public int greedyBestOf = -1;
/**
* Temperature threshold.
*/
public float temperature;
/**
* Initial whisper prompt
*/
public String initialPrompt = "";
/**
* Grammar in GBNF format.
*/
public List<String> grammarLines = List.of();
/**
* Enables grammar usage.
*/
public boolean useGrammar = false;
/**
* Grammar penalty.
*/
public float grammarPenalty = 100f;
/**
* Enables GPU usage. (built-in binaries do not support GPU usage)
*/
public boolean useGPU = true;
/**
* OpenVINO device name
*/
public String openvinoDevice = "CPU";
/**
* Single phrase mode.
*/
public boolean singleUtteranceMode = true;
/**
* Message to be told when no results.
*/
public String noResultsMessage = "Sorry, I didn't understand you";
/**
* Message to be told when an error has happened.
*/
public String errorMessage = "Sorry, something went wrong";
/**
* Create wav audio record for each whisper invocation.
*/
public boolean createWAVRecord;
/**
* Record sample format. Values: i16, f32.
*/
public String recordSampleFormat = "i16";
/**
* Print whisper.cpp library logs as add-on debug logs.
*/
public boolean enableWhisperLog;
}


@ -0,0 +1,44 @@
/**
* Copyright (c) 2010-2024 Contributors to the openHAB project
*
* See the NOTICE file(s) distributed with this work for additional
* information.
*
* This program and the accompanying materials are made available under the
* terms of the Eclipse Public License 2.0 which is available at
* http://www.eclipse.org/legal/epl-2.0
*
* SPDX-License-Identifier: EPL-2.0
*/
package org.openhab.voice.whisperstt.internal;
import org.eclipse.jdt.annotation.NonNullByDefault;
/**
* The {@link WhisperSTTConstants} class defines common constants, which are
* used across the whole add-on.
*
* @author Miguel Álvarez Díez - Initial contribution
*/
@NonNullByDefault
public class WhisperSTTConstants {
/**
* Service name
*/
public static final String SERVICE_NAME = "Whisper";
/**
* Service id
*/
public static final String SERVICE_ID = "whisperstt";
/**
* Service category
*/
public static final String SERVICE_CATEGORY = "voice";
/**
* Service pid
*/
public static final String SERVICE_PID = "org.openhab." + SERVICE_CATEGORY + "." + SERVICE_ID;
}


@ -0,0 +1,657 @@
/**
* Copyright (c) 2010-2024 Contributors to the openHAB project
*
* See the NOTICE file(s) distributed with this work for additional
* information.
*
* This program and the accompanying materials are made available under the
* terms of the Eclipse Public License 2.0 which is available at
* http://www.eclipse.org/legal/epl-2.0
*
* SPDX-License-Identifier: EPL-2.0
*/
package org.openhab.voice.whisperstt.internal;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable;
import org.openhab.core.OpenHAB;
import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioStream;
import org.openhab.core.audio.utils.AudioWaveUtils;
import org.openhab.core.common.ThreadPoolManager;
import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.Configuration;
import org.openhab.core.io.rest.LocaleService;
import org.openhab.core.voice.RecognitionStartEvent;
import org.openhab.core.voice.RecognitionStopEvent;
import org.openhab.core.voice.STTException;
import org.openhab.core.voice.STTListener;
import org.openhab.core.voice.STTService;
import org.openhab.core.voice.STTServiceHandle;
import org.openhab.core.voice.SpeechRecognitionErrorEvent;
import org.openhab.core.voice.SpeechRecognitionEvent;
import org.openhab.voice.whisperstt.internal.utils.VAD;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;
import org.osgi.service.component.annotations.Modified;
import org.osgi.service.component.annotations.Reference;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import io.github.givimad.libfvadjni.VoiceActivityDetector;
import io.github.givimad.whisperjni.WhisperContext;
import io.github.givimad.whisperjni.WhisperContextParams;
import io.github.givimad.whisperjni.WhisperFullParams;
import io.github.givimad.whisperjni.WhisperGrammar;
import io.github.givimad.whisperjni.WhisperJNI;
import io.github.givimad.whisperjni.WhisperSamplingStrategy;
import io.github.givimad.whisperjni.WhisperState;
/**
* The {@link WhisperSTTService} class is a service implementation to use whisper.cpp for Speech-to-Text.
*
* @author Miguel Álvarez - Initial contribution
*/
@NonNullByDefault
@Component(configurationPid = SERVICE_PID, property = Constants.SERVICE_PID + "=" + SERVICE_PID)
@ConfigurableService(category = SERVICE_CATEGORY, label = SERVICE_NAME
+ " Speech-to-Text", description_uri = SERVICE_CATEGORY + ":" + SERVICE_ID)
public class WhisperSTTService implements STTService {
protected static final Path WHISPER_FOLDER = Path.of(OpenHAB.getUserDataFolder(), "whisper");
private static final Path SAMPLES_FOLDER = Path.of(WHISPER_FOLDER.toString(), "samples");
private static final int WHISPER_SAMPLE_RATE = 16000;
private final Logger logger = LoggerFactory.getLogger(WhisperSTTService.class);
private final ScheduledExecutorService executor = ThreadPoolManager.getScheduledPool("OH-voice-whisperstt");
private final LocaleService localeService;
private WhisperSTTConfiguration config = new WhisperSTTConfiguration();
private @Nullable WhisperContext context;
private @Nullable WhisperGrammar grammar;
private @Nullable WhisperJNI whisper;
@Activate
public WhisperSTTService(@Reference LocaleService localeService) {
this.localeService = localeService;
}
@Activate
protected void activate(Map<String, Object> config) {
try {
if (!Files.exists(WHISPER_FOLDER)) {
Files.createDirectory(WHISPER_FOLDER);
}
WhisperJNI.loadLibrary(getLoadOptions());
VoiceActivityDetector.loadLibrary();
whisper = new WhisperJNI();
} catch (IOException | RuntimeException e) {
logger.warn("Unable to register native library: {}", e.getMessage());
}
configChange(config);
}
private WhisperJNI.LoadOptions getLoadOptions() {
Path libFolder = Paths.get("/usr/local/lib");
Path libFolderWin = Paths.get("/Windows/System32");
var options = new WhisperJNI.LoadOptions();
// Overwrite whisper jni shared library
Path whisperJNILinuxLibrary = libFolder.resolve("libwhisperjni.so");
Path whisperJNIMacLibrary = libFolder.resolve("libwhisperjni.dylib");
Path whisperJNIWinLibrary = libFolderWin.resolve("libwhisperjni.dll");
if (Files.exists(whisperJNILinuxLibrary)) {
options.whisperJNILib = whisperJNILinuxLibrary;
} else if (Files.exists(whisperJNIMacLibrary)) {
options.whisperJNILib = whisperJNIMacLibrary;
} else if (Files.exists(whisperJNIWinLibrary)) {
options.whisperJNILib = whisperJNIWinLibrary;
}
// Overwrite whisper shared library, Windows searches library in $env:PATH
Path whisperLinuxLibrary = libFolder.resolve("libwhisper.so");
Path whisperMacLibrary = libFolder.resolve("libwhisper.dylib");
if (Files.exists(whisperLinuxLibrary)) {
options.whisperLib = whisperLinuxLibrary;
} else if (Files.exists(whisperMacLibrary)) {
options.whisperLib = whisperMacLibrary;
}
// Log library registration
options.logger = (msg) -> logger.debug("Library load: {}", msg);
return options;
}
@Modified
protected void modified(Map<String, Object> config) {
configChange(config);
}
@Deactivate
protected void deactivate(Map<String, Object> config) {
try {
WhisperGrammar grammar = this.grammar;
if (grammar != null) {
grammar.close();
this.grammar = null;
}
unloadContext();
} catch (IOException e) {
logger.warn("IOException unloading model: {}", e.getMessage());
}
WhisperJNI.setLibraryLogger(null);
}
private void configChange(Map<String, Object> config) {
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
WhisperGrammar grammar = this.grammar;
if (grammar != null) {
grammar.close();
this.grammar = null;
}
WhisperJNI whisper;
try {
whisper = getWhisper();
} catch (IOException ignored) {
logger.warn("library not loaded, the add-on will not work");
return;
}
String grammarText = String.join("\n", this.config.grammarLines);
if (this.config.useGrammar && isValidGrammar(grammarText)) {
try {
logger.debug("Parsing GBNF grammar...");
this.grammar = whisper.parseGrammar(grammarText);
} catch (IOException e) {
logger.warn("Error parsing grammar: {}", e.getMessage());
}
}
if (this.config.preloadModel) {
try {
loadContext();
} catch (IOException e) {
logger.warn("IOException loading model: {}", e.getMessage());
} catch (UnsatisfiedLinkError e) {
logger.warn("Missing native dependency: {}", e.getMessage());
}
} else {
try {
unloadContext();
} catch (IOException e) {
logger.warn("IOException unloading model: {}", e.getMessage());
}
}
}
private boolean isValidGrammar(String grammarText) {
try {
WhisperGrammar.assertValidGrammar(grammarText);
} catch (IllegalArgumentException | ParseException e) {
logger.warn("Invalid grammar: {}", e.getMessage());
return false;
}
return true;
}
@Override
public String getId() {
return SERVICE_ID;
}
@Override
public String getLabel(@Nullable Locale locale) {
return SERVICE_NAME;
}
@Override
public Set<Locale> getSupportedLocales() {
// as it is not possible to determine the language of the model that was downloaded and setup by the user, it is
// assumed the language of the model is matching the locale of the openHAB server
return Set.of(localeService.getLocale(null));
}
@Override
public Set<AudioFormat> getSupportedFormats() {
return Set.of(
new AudioFormat(AudioFormat.CONTAINER_NONE, AudioFormat.CODEC_PCM_SIGNED, false, 16, null,
(long) WHISPER_SAMPLE_RATE, 1),
new AudioFormat(AudioFormat.CONTAINER_WAVE, AudioFormat.CODEC_PCM_SIGNED, false, 16, null,
(long) WHISPER_SAMPLE_RATE, 1));
}
@Override
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
throws STTException {
AtomicBoolean aborted = new AtomicBoolean(false);
WhisperContext ctx = null;
WhisperState state = null;
try {
var whisper = getWhisper();
ctx = getContext();
logger.debug("Creating whisper state...");
state = whisper.initState(ctx);
logger.debug("Whisper state created");
logger.debug("Creating VAD instance...");
final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE);
VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
config.vadStep, config.vadSensitivity);
logger.debug("VAD instance created");
sttListener.sttEventReceived(new RecognitionStartEvent());
backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted);
} catch (IOException e) {
if (ctx != null && !config.preloadModel) {
ctx.close();
}
if (state != null) {
state.close();
}
throw new STTException("Exception during initialization", e);
}
return () -> {
aborted.set(true);
};
}
private WhisperJNI getWhisper() throws IOException {
var whisper = this.whisper;
if (whisper == null) {
throw new IOException("Library not loaded");
}
return whisper;
}
private WhisperContext getContext() throws IOException, UnsatisfiedLinkError {
var context = this.context;
if (context != null) {
return context;
}
return loadContext();
}
private synchronized WhisperContext loadContext() throws IOException {
unloadContext();
String modelFilename = this.config.modelName;
if (modelFilename.isBlank()) {
throw new IOException("The modelName configuration is missing");
}
String modelPrefix = "ggml-";
String modelExtension = ".bin";
if (!modelFilename.startsWith(modelPrefix)) {
modelFilename = modelPrefix + modelFilename;
}
if (!modelFilename.endsWith(modelExtension)) {
modelFilename = modelFilename + modelExtension;
}
Path modelPath = WHISPER_FOLDER.resolve(modelFilename);
if (!Files.exists(modelPath) || Files.isDirectory(modelPath)) {
throw new IOException("Missing model file: " + modelPath);
}
logger.debug("Loading whisper context...");
WhisperJNI whisper = getWhisper();
var context = whisper.initNoState(modelPath, getWhisperContextParams());
logger.debug("Whisper context loaded");
if (config.preloadModel) {
this.context = context;
}
if (!config.openvinoDevice.isBlank()) {
// has no effect if OpenVINO is not enabled in whisper.cpp library.
logger.debug("Init OpenVINO device");
whisper.initOpenVINO(context, config.openvinoDevice);
}
return context;
}
private WhisperContextParams getWhisperContextParams() {
var params = new WhisperContextParams();
params.useGPU = config.useGPU;
return params;
}
private void unloadContext() throws IOException {
var context = this.context;
if (context != null) {
logger.debug("Unloading model");
context.close();
this.context = null;
}
}
private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep,
Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
var releaseContext = !config.preloadModel;
final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
final int nMaxSilenceSamples = (int) (config.maxSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
logger.debug("Samples per step {}", nSamplesStep);
logger.debug("Min transcription samples {}", nSamplesMin);
logger.debug("Max transcription samples {}", nSamplesMax);
logger.debug("Max init silence samples {}", nInitSilenceSamples);
logger.debug("Max silence samples {}", nMaxSilenceSamples);
// stores the step samples in the 16-bit int format expected by libfvad
final short[] stepAudioSamples = new short[nSamplesStep];
// stores the full samples in the 32-bit float format expected by whisper
final float[] audioSamples = new float[nSamplesMax];
executor.submit(() -> {
int audioSamplesOffset = 0;
int silenceSamplesCounter = 0;
int nProcessedSamples = 0;
int numBytesRead;
boolean voiceDetected = false;
String transcription = "";
String tempTranscription = "";
VAD.@Nullable VADResult lastVADResult;
VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
try {
try (state; //
audioStream; //
vad) {
if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
AudioWaveUtils.removeFMT(audioStream);
}
final ByteBuffer captureBuffer = ByteBuffer.allocate(nSamplesStep * 2)
.order(ByteOrder.LITTLE_ENDIAN);
// init remaining to full capacity
int remaining = captureBuffer.capacity();
WhisperFullParams params = getWhisperFullParams(ctx, locale);
while (!aborted.get()) {
// read until no remaining so we get the complete step samples
numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
remaining);
if (aborted.get() || numBytesRead == -1) {
break;
}
if (numBytesRead != remaining) {
remaining = remaining - numBytesRead;
continue;
}
// reset remaining to full capacity
remaining = captureBuffer.capacity();
// encode step samples and copy them to the audio buffers
var shortBuffer = captureBuffer.asShortBuffer();
while (shortBuffer.hasRemaining()) {
var position = shortBuffer.position();
short i16BitSample = shortBuffer.get();
float f32BitSample = Float.min(1f,
Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
stepAudioSamples[position] = i16BitSample;
audioSamples[audioSamplesOffset++] = f32BitSample;
nProcessedSamples++;
}
// run vad
if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
logger.debug("VAD: Skipping, max length reached");
} else {
lastVADResult = vad.analyze(stepAudioSamples);
if (lastVADResult.isVoice()) {
voiceDetected = true;
logger.debug("VAD: voice detected");
silenceSamplesCounter = 0;
firstConsecutiveSilenceVADResult = null;
continue;
} else {
if (firstConsecutiveSilenceVADResult == null) {
firstConsecutiveSilenceVADResult = lastVADResult;
}
silenceSamplesCounter += nSamplesStep;
int maxSilenceSamples = voiceDetected ? nMaxSilenceSamples : nInitSilenceSamples;
if (silenceSamplesCounter < maxSilenceSamples) {
if (logger.isDebugEnabled()) {
int totalSteps = maxSilenceSamples / nSamplesStep;
int currentSteps = totalSteps
- ((maxSilenceSamples - silenceSamplesCounter) / nSamplesStep);
logger.debug("VAD: silence detected {}/{}", currentSteps, totalSteps);
}
if (!voiceDetected && config.removeSilence) {
logger.debug("removing start silence");
int samplesToKeep = lastVADResult.voiceSamplesInTail();
if (samplesToKeep > 0) {
for (int i = 0; i < samplesToKeep; i++) {
audioSamples[i] = audioSamples[audioSamplesOffset
- (samplesToKeep - i)];
}
audioSamplesOffset = samplesToKeep;
logger.debug("some audio was kept");
} else {
audioSamplesOffset = 0;
}
}
continue;
} else {
logger.debug("VAD: silence detected");
if (audioSamplesOffset < nSamplesMin) {
logger.debug("Not enough samples, continue");
continue;
}
if (config.singleUtteranceMode) {
// close the audio stream so we stop receiving audio we don't need
try {
audioStream.close();
} catch (IOException ignored) {
}
}
}
}
if (config.removeSilence) {
if (voiceDetected) {
logger.debug("removing end silence");
int samplesToKeep = firstConsecutiveSilenceVADResult.voiceSamplesInHead();
if (samplesToKeep > 0) {
logger.debug("some audio was kept");
}
var samplesToRemove = silenceSamplesCounter - samplesToKeep;
if (audioSamplesOffset - samplesToRemove < nSamplesMin) {
logger.debug("avoid removing under min audio seconds");
samplesToRemove = audioSamplesOffset - nSamplesMin;
}
if (samplesToRemove > 0) {
audioSamplesOffset -= samplesToRemove;
}
} else {
audioSamplesOffset = 0;
}
}
if (audioSamplesOffset == 0) {
if (config.singleUtteranceMode) {
logger.debug("no audio to transcribe, ending");
break;
} else {
logger.debug("no audio to transcribe, continue listening");
continue;
}
}
}
// run whisper
logger.debug("running whisper with {} seconds of audio...",
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
long execStartTime = System.currentTimeMillis();
var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset);
logger.debug("whisper ended in {}ms with result code {}",
System.currentTimeMillis() - execStartTime, result);
// process result
if (result != 0) {
emitSpeechRecognitionError(sttListener);
break;
}
int nSegments = whisper.fullNSegmentsFromState(state);
logger.debug("Available transcription segments {}", nSegments);
if (nSegments == 1) {
tempTranscription = whisper.fullGetSegmentTextFromState(state, 0);
if (config.createWAVRecord) {
createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
locale.getLanguage());
}
if (config.singleUtteranceMode) {
logger.debug("single utterance mode, ending transcription");
transcription = tempTranscription;
break;
} else {
// start a new transcription segment
transcription += tempTranscription;
tempTranscription = "";
}
} else if (nSegments == 0 && config.singleUtteranceMode) {
logger.debug("Single utterance mode and no results, ending transcription");
break;
} else if (nSegments > 1) {
// should be unreachable: singleSegment is enabled in getWhisperFullParams
logger.warn("Whisper should be configured in single segment mode, got {} segments", nSegments);
break;
}
// reset state to start with next segment
voiceDetected = false;
silenceSamplesCounter = 0;
audioSamplesOffset = 0;
logger.debug("Partial transcription: {}", tempTranscription);
logger.debug("Transcription: {}", transcription);
}
} finally {
if (releaseContext) {
ctx.close();
}
}
// emit result
if (!aborted.get()) {
sttListener.sttEventReceived(new RecognitionStopEvent());
logger.debug("Final transcription: '{}'", transcription);
if (!transcription.isBlank()) {
sttListener.sttEventReceived(new SpeechRecognitionEvent(transcription.trim(), 1));
} else {
emitSpeechRecognitionNoResultsError(sttListener);
}
}
} catch (IOException e) {
logger.warn("Error running speech to text: {}", e.getMessage());
emitSpeechRecognitionError(sttListener);
} catch (UnsatisfiedLinkError e) {
logger.warn("Missing native dependency: {}", e.getMessage());
emitSpeechRecognitionError(sttListener);
}
});
}
private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException {
WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
var params = new WhisperFullParams(strategy);
params.temperature = config.temperature;
params.nThreads = config.threads;
params.audioCtx = config.audioContext;
params.speedUp = config.speedUp;
params.beamSearchBeamSize = config.beamSize;
params.greedyBestOf = config.greedyBestOf;
if (!config.initialPrompt.isBlank()) {
params.initialPrompt = config.initialPrompt;
}
if (grammar != null) {
params.grammar = grammar;
params.grammarPenalty = config.grammarPenalty;
}
// the only single-language models are the English ones
params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en";
// the implementation assumes these options
params.translate = false;
params.detectLanguage = false;
params.printProgress = false;
params.noTimestamps = true;
params.printRealtime = false;
params.printSpecial = false;
params.printTimestamps = false;
params.suppressBlank = true;
params.suppressNonSpeechTokens = true;
params.singleSegment = true;
params.noContext = true;
return params;
}
private void emitSpeechRecognitionNoResultsError(STTListener sttListener) {
sttListener.sttEventReceived(new SpeechRecognitionErrorEvent(config.noResultsMessage));
}
private void emitSpeechRecognitionError(STTListener sttListener) {
sttListener.sttEventReceived(new SpeechRecognitionErrorEvent(config.errorMessage));
}
private void createSamplesDir() {
if (!Files.exists(SAMPLES_FOLDER)) {
try {
Files.createDirectory(SAMPLES_FOLDER);
logger.info("Whisper samples dir created {}", SAMPLES_FOLDER);
} catch (IOException ignored) {
logger.warn("Unable to create whisper samples dir {}", SAMPLES_FOLDER);
}
}
}
private void createAudioFile(float[] samples, int size, String transcription, String language) {
createSamplesDir();
javax.sound.sampled.AudioFormat jAudioFormat;
ByteBuffer byteBuffer;
if ("i16".equals(config.recordSampleFormat)) {
logger.debug("Saving audio file with sample format i16");
jAudioFormat = new javax.sound.sampled.AudioFormat(javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED,
WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < size; i++) {
byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE));
}
} else {
logger.debug("Saving audio file with sample format f32");
jAudioFormat = new javax.sound.sampled.AudioFormat(javax.sound.sampled.AudioFormat.Encoding.PCM_FLOAT,
WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < size; i++) {
byteBuffer.putFloat(samples[i]);
}
}
// the stream length is expressed in sample frames; only the first `size` samples were written to the buffer
AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
jAudioFormat, size);
try {
var escapedTranscription = transcription.replaceAll("[^a-zA-ZÀ-ú0-9.-]", "_");
if (escapedTranscription.length() > 60) {
escapedTranscription = escapedTranscription.substring(0, 60);
}
String fileName = new SimpleDateFormat("yyyy-MM-dd.HH.mm.ss.SS").format(new Date()) + "("
+ escapedTranscription + ")";
Path audioPath = Path.of(SAMPLES_FOLDER.toString(), fileName + ".wav");
Path propertiesPath = Path.of(SAMPLES_FOLDER.toString(), fileName + ".props");
logger.debug("Saving audio file: {}", audioPath);
// try-with-resources closes each stream even if writing fails
try (FileOutputStream audioFileOutputStream = new FileOutputStream(audioPath.toFile())) {
AudioSystem.write(audioInputStreamTemp, AudioFileFormat.Type.WAVE, audioFileOutputStream);
}
String properties = "transcription=" + transcription + "\nlanguage=" + language + "\n";
logger.debug("Saving properties file: {}", propertiesPath);
try (FileOutputStream propertiesFileOutputStream = new FileOutputStream(propertiesPath.toFile())) {
propertiesFileOutputStream.write(properties.getBytes(StandardCharsets.UTF_8));
}
} catch (IOException e) {
logger.warn("Unable to store sample.", e);
}
}
private void onWhisperLog(String text) {
logger.debug("[whisper.cpp] {}", text);
}
}
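
The recognition loop in the class above converts each 16-bit little-endian PCM sample to a float clamped to [-1, 1] before passing the buffer to whisper, and `createAudioFile` scales the floats back to 16-bit when the `i16` record format is selected. The following is a minimal standalone sketch of that conversion, not code from the add-on; the class and method names (`PcmConversionSketch`, `toFloatSamples`, `toPcm16`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, for illustration only; it is not part of the add-on.
final class PcmConversionSketch {

    /** Converts 16-bit little-endian PCM bytes to floats clamped to [-1, 1]. */
    static float[] toFloatSamples(byte[] pcm) {
        var shorts = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer();
        float[] out = new float[shorts.remaining()];
        for (int i = 0; i < out.length; i++) {
            // same scaling as the recognition loop: divide by Short.MAX_VALUE, then clamp
            float scaled = (float) shorts.get(i) / (float) Short.MAX_VALUE;
            out[i] = Float.min(1f, Float.max(scaled, -1f));
        }
        return out;
    }

    /** Converts floats in [-1, 1] back to 16-bit little-endian PCM bytes. */
    static byte[] toPcm16(float[] samples) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (float sample : samples) {
            buffer.putShort((short) (sample * (float) Short.MAX_VALUE));
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        byte[] pcm = ByteBuffer.allocate(6).order(ByteOrder.LITTLE_ENDIAN) //
                .putShort((short) 0).putShort(Short.MAX_VALUE).putShort(Short.MIN_VALUE).array();
        System.out.println(Arrays.toString(toFloatSamples(pcm)));
    }
}
```

Because the divisor is `Short.MAX_VALUE`, `Short.MIN_VALUE` lands slightly below -1 and is clamped, so converting it back yields -32767 rather than -32768.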

@ -0,0 +1,95 @@
/**
* Copyright (c) 2010-2024 Contributors to the openHAB project
*
* See the NOTICE file(s) distributed with this work for additional
* information.
*
* This program and the accompanying materials are made available under the
* terms of the Eclipse Public License 2.0 which is available at
* http://www.eclipse.org/legal/epl-2.0
*
* SPDX-License-Identifier: EPL-2.0
*/
package org.openhab.voice.whisperstt.internal.utils;
import java.io.IOException;
import org.eclipse.jdt.annotation.NonNullByDefault;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import io.github.givimad.libfvadjni.VoiceActivityDetector;
/**
* The {@link VAD} class is a voice activity detector implementation over libfvad-jni.
*
* @author Miguel Álvarez - Initial contribution
*/
@NonNullByDefault
public class VAD implements AutoCloseable {
private final Logger logger = LoggerFactory.getLogger(VAD.class);
private final VoiceActivityDetector libfvad;
private final short[] stepSamples;
private final int totalPartialDetections;
private final int detectionThreshold;
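/**
* Creates a voice activity detector that splits each input frame into partial steps.
*
* @param mode desired VAD mode.
* @param sampleRate audio sample rate in Hz.
* @param frameSize detector input frame size in samples.
* @param stepMs duration in milliseconds of each partial step.
* @param sensitivity fraction in range 0 - 1 of partial steps that must contain voice for a frame to count as voice.
* @throws IOException if the underlying libfvad detector cannot be created or configured.
*/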
public VAD(VoiceActivityDetector.Mode mode, int sampleRate, int frameSize, int stepMs, float sensitivity)
throws IOException {
this.libfvad = VoiceActivityDetector.newInstance();
this.libfvad.setMode(mode);
this.libfvad.setSampleRate(VoiceActivityDetector.SampleRate.fromValue(sampleRate));
this.stepSamples = new short[sampleRate / 1000 * stepMs];
this.totalPartialDetections = (frameSize / stepSamples.length);
this.detectionThreshold = (int) ((((float) totalPartialDetections) / 100f) * (sensitivity * 100));
}
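/**
* Runs the detector over each partial step of the given frame and counts the steps reported as voice.
*
* @param samples one frame of 16-bit samples, at least as long as the configured frame size.
* @return the detection result for the frame.
* @throws IOException if the underlying detector fails.
*/
public VADResult analyze(short[] samples) throws IOException {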
int voiceInHead = 0;
int voiceInTail = 0;
boolean silenceFound = false;
int partialVADCounter = 0;
for (int i = 0; i < totalPartialDetections; i++) {
System.arraycopy(samples, i * stepSamples.length, stepSamples, 0, stepSamples.length);
if (libfvad.process(stepSamples, stepSamples.length)) {
partialVADCounter++;
if (!silenceFound) {
voiceInHead++;
}
voiceInTail++;
} else {
silenceFound = true;
voiceInTail = 0;
}
}
logger.debug("VAD: {}/{} - required: {}", partialVADCounter, totalPartialDetections, detectionThreshold);
return new VADResult( //
partialVADCounter >= detectionThreshold, //
voiceInHead * stepSamples.length, //
voiceInTail * stepSamples.length //
);
}
@Override
public void close() {
libfvad.close();
}
/**
* Voice activity detection result.
*
* @param isVoice Does the block contain enough voice
* @param voiceSamplesInHead Number of samples consecutively reported as voice from the beginning of the chunk
* @param voiceSamplesInTail Number of samples consecutively reported as voice from the end of the chunk
*/
public record VADResult(boolean isVoice, int voiceSamplesInHead, int voiceSamplesInTail) {
}
}
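
The `VAD` constructor above derives three numbers from its arguments: the samples per partial step, the number of partial detections per frame, and how many of those must report voice. The sketch below reproduces that arithmetic on its own so the defaults can be checked by hand. It assumes a 16000 Hz sample rate (the value of `WHISPER_SAMPLE_RATE` is not shown in this section) and the configuration defaults `stepSeconds=1`, `vadStep=20` and `vadSensitivity=0.3`; the class name `VadThresholdSketch` is hypothetical.

```java
// Hypothetical helper reproducing the arithmetic of the VAD constructor, for illustration only.
final class VadThresholdSketch {

    /** Samples in one partial step, e.g. 16000 Hz and 20 ms give 320 samples. */
    static int stepSamples(int sampleRate, int stepMs) {
        return sampleRate / 1000 * stepMs;
    }

    /** Number of partial detections that fit in one frame (integer division). */
    static int partialDetections(int frameSize, int stepSamples) {
        return frameSize / stepSamples;
    }

    /** Minimum partial detections reporting voice for the frame to count as voice. */
    static int detectionThreshold(int totalPartialDetections, float sensitivity) {
        return (int) ((((float) totalPartialDetections) / 100f) * (sensitivity * 100));
    }

    public static void main(String[] args) {
        int sampleRate = 16000; // assumed sample rate
        int frameSize = sampleRate; // stepSeconds = 1
        int step = stepSamples(sampleRate, 20); // vadStep = 20 ms
        int total = partialDetections(frameSize, step);
        int threshold = detectionThreshold(total, 0.3f); // vadSensitivity = 0.3
        System.out.println(step + " samples per step, " + total + " detections, threshold " + threshold);
    }
}
```

Under these assumptions a one-second frame is checked in 50 partial steps of 320 samples, and at least 15 of them must contain voice. A sensitivity of 0 yields a threshold of 0, in which case `partialVADCounter >= detectionThreshold` is always true.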

@ -0,0 +1,15 @@
<?xml version="1.0" encoding="UTF-8"?>
<addon:addon id="whisperstt" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:addon="https://openhab.org/schemas/addon/v1.0.0"
xsi:schemaLocation="https://openhab.org/schemas/addon/v1.0.0 https://openhab.org/schemas/addon-1.0.0.xsd">
<type>voice</type>
<name>Whisper Speech-to-Text</name>
<description>Whisper STT Service uses the whisper.cpp library to transcribe audio data to text.</description>
<connection>none</connection>
<service-id>org.openhab.voice.whisperstt</service-id>
<config-description-ref uri="voice:whisperstt"/>
</addon:addon>

@ -0,0 +1,229 @@
<?xml version="1.0" encoding="UTF-8"?>
<config-description:config-descriptions
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:config-description="https://openhab.org/schemas/config-description/v1.0.0"
xsi:schemaLocation="https://openhab.org/schemas/config-description/v1.0.0
https://openhab.org/schemas/config-description-1.0.0.xsd">
<config-description uri="voice:whisperstt">
<parameter-group name="stt">
<label>STT Configuration</label>
<description>Configure Speech to Text.</description>
</parameter-group>
<parameter-group name="vad">
<label>Voice Activity Detection</label>
<description>Configure the VAD mechanism used to isolate single phrases to feed whisper with.</description>
</parameter-group>
<parameter-group name="whisper">
<label>Whisper Options</label>
<description>Configure the whisper.cpp transcription options.</description>
</parameter-group>
<parameter-group name="grammar">
<label>Grammar</label>
<description>Define a grammar to improve transcriptions.</description>
</parameter-group>
<parameter-group name="messages">
<label>Info Messages</label>
<description>Configure service information messages.</description>
</parameter-group>
<parameter-group name="developer">
<label>Developer</label>
<description>Options added for developers.</description>
<advanced>true</advanced>
</parameter-group>
<parameter name="modelName" type="text" groupName="stt" required="true">
<label>Model Name</label>
<description>Model name without extension.</description>
</parameter>
<parameter name="preloadModel" type="boolean" groupName="stt">
<label>Preload Model</label>
<description>Keep the model loaded. If the parameter is set to true, the model will be reloaded only on
configuration updates. If the model is not loaded when needed, the service will try to load it. If the parameter
is set to false, the model will be loaded and unloaded on each run.</description>
<default>false</default>
</parameter>
<parameter name="singleUtteranceMode" type="boolean" groupName="stt">
<label>Single Utterance Mode</label>
<description>When enabled, recognition stops listening after a single utterance.</description>
<default>true</default>
<advanced>true</advanced>
</parameter>
<parameter name="minSeconds" type="decimal" step="0.1" min="1" unit="s" groupName="stt">
<label>Min Transcription Seconds</label>
<description>Min transcription seconds passed to whisper.</description>
<default>2</default>
<advanced>true</advanced>
</parameter>
<parameter name="maxSeconds" type="integer" min="2" unit="s" groupName="stt">
<label>Max Transcription Seconds</label>
<description>Seconds to force transcription before silence detection.</description>
<default>10</default>
</parameter>
<parameter name="initSilenceSeconds" type="decimal" min="0.1" step="0.1" unit="s" groupName="stt">
<label>Initial Silence Seconds</label>
<description>Max initial seconds of silence to discard transcription.</description>
<default>3</default>
</parameter>
<parameter name="maxSilenceSeconds" type="decimal" min="0.1" step="0.1" unit="s" groupName="stt">
<label>Max Silence Seconds</label>
<description>Seconds of silence to trigger transcription.</description>
<default>0.5</default>
</parameter>
<parameter name="removeSilence" type="boolean" groupName="stt">
<label>Remove Silence</label>
<description>Remove silence frames from the beginning and end of the audio.</description>
<default>true</default>
<advanced>true</advanced>
</parameter>
<parameter name="stepSeconds" type="decimal" groupName="vad">
<label>Audio Step</label>
<description>Audio step for the voice activity detection.</description>
<default>1</default>
<options>
<option value="0.1">100ms</option>
<option value="0.2">200ms</option>
<option value="0.3">300ms</option>
<option value="0.5">500ms</option>
<option value="0.6">600ms</option>
<option value="1">1s</option>
</options>
<advanced>true</advanced>
</parameter>
<parameter name="vadSensitivity" type="decimal" groupName="vad" min="0" max="1" step="0.01">
<label>Voice Activity Detection Sensitivity</label>
<description>Percentage in range 0-1 of voice activity in each audio step analyzed to consider it as voice.</description>
<default>0.3</default>
</parameter>
<parameter name="vadMode" type="text" groupName="vad">
<label>Voice Activity Detection Mode</label>
<description>Available VAD modes. Quality is the most likely to detect voice.</description>
<default>VERY_AGGRESSIVE</default>
<options>
<option value="QUALITY">Quality</option>
<option value="LOW_BITRATE">Low Bitrate</option>
<option value="AGGRESSIVE">Aggressive</option>
<option value="VERY_AGGRESSIVE">Very Aggressive</option>
</options>
<advanced>true</advanced>
</parameter>
<parameter name="vadStep" type="integer" groupName="vad">
<label>Voice Activity Detector Step</label>
<description>Audio milliseconds passed to the voice activity detector. Defines how many times the voice activity
detector is executed per audio step.</description>
<default>20</default>
<options>
<option value="10">10ms</option>
<option value="20">20ms</option>
<option value="30">30ms</option>
</options>
<advanced>true</advanced>
</parameter>
<parameter name="threads" type="integer" groupName="whisper">
<label>Threads</label>
<description>Number of threads used by whisper. (0 to use host max threads)</description>
<default>0</default>
</parameter>
<parameter name="audioContext" type="integer" groupName="whisper" min="0">
<label>Audio Context</label>
<description>Overwrite the audio context size. (0 to use whisper default context size)</description>
<default>0</default>
<advanced>true</advanced>
</parameter>
<parameter name="samplingStrategy" type="text" groupName="whisper">
<label>Sampling strategy</label>
<description>Sampling strategy used.</description>
<default>BEAN_SEARCH</default>
<options>
<option value="GREEDY">Greedy</option>
<option value="BEAN_SEARCH">Beam Search</option>
</options>
</parameter>
<parameter name="beamSize" type="integer" groupName="whisper" min="1">
<label>Beam Size</label>
<description>Beam Size configuration for sampling strategy Beam Search.</description>
<default>2</default>
</parameter>
<parameter name="greedyBestOf" type="integer" groupName="whisper" min="-1">
<label>Greedy Best Of</label>
<description>Best Of configuration for sampling strategy Greedy. (-1 for unlimited)</description>
<default>-1</default>
</parameter>
<parameter name="temperature" type="decimal" groupName="whisper">
<label>Temperature</label>
<description>Temperature threshold.</description>
<default>0</default>
</parameter>
<parameter name="initialPrompt" type="text" groupName="whisper">
<label>Initial Prompt</label>
<description>Initial prompt to feed whisper with.</description>
<advanced>true</advanced>
</parameter>
<parameter name="openvinoDevice" type="text" groupName="whisper">
<label>OpenVINO Device</label>
<description>Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)</description>
<advanced>true</advanced>
<default>CPU</default>
</parameter>
<parameter name="speedUp" type="boolean" groupName="whisper">
<label>Speed Up</label>
<description>Speed up audio by x2. (reduced accuracy)</description>
<default>false</default>
<advanced>true</advanced>
</parameter>
<parameter name="useGPU" type="boolean" groupName="whisper">
<label>Use GPU</label>
<description>Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)</description>
<default>true</default>
<advanced>true</advanced>
</parameter>
<parameter name="useGrammar" type="boolean" groupName="grammar">
<label>Use Grammar</label>
<description>Enables grammar usage.</description>
<default>false</default>
</parameter>
<parameter name="grammarPenalty" type="decimal" groupName="grammar" min="0" max="100" step="0.1">
<label>Grammar Penalty</label>
<description>Penalty for non grammar tokens when using grammar.</description>
<default>100</default>
</parameter>
<parameter name="grammarLines" type="text" groupName="grammar" multiple="true">
<label>Grammar</label>
<description>Grammar to use in GBNF format. (BNF variant used by whisper.cpp).</description>
<default></default>
</parameter>
<parameter name="noResultsMessage" type="text" groupName="messages">
<label>No Results Message</label>
<description>Message spoken when there are no results. (Leave empty to disable)</description>
<default>Sorry, I didn't understand you</default>
</parameter>
<parameter name="errorMessage" type="text" groupName="messages">
<label>Error Message</label>
<description>Message spoken when an error has happened. (Leave empty to disable)</description>
<default>Sorry, something went wrong</default>
</parameter>
<parameter name="createWAVRecord" type="boolean" groupName="developer">
<label>Create WAV Record</label>
<description>Create WAV audio record on each whisper execution.</description>
<default>false</default>
<advanced>true</advanced>
</parameter>
<parameter name="recordSampleFormat" type="text" groupName="developer">
<label>Record Sample Format</label>
<description>Defines the sample type and bit-size used by the created WAV audio record.</description>
<default>i16</default>
<options>
<option value="i16">Integer 16bit</option>
<option value="f32">Float 32bit</option>
</options>
<advanced>true</advanced>
</parameter>
<parameter name="enableWhisperLog" type="boolean" groupName="developer">
<label>Enable Whisper Log</label>
<description>Emit whisper.cpp library logs as add-on debug logs.</description>
<default>false</default>
<advanced>true</advanced>
</parameter>
</config-description>
</config-description:config-descriptions>
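
The `modelName` parameter above is described as "Model name without extension". In the service class of this change, `loadContext` also prepends `ggml-` and appends `.bin` when they are missing before resolving the file in the whisper folder. A minimal sketch of that normalization follows. The class name `ModelFileNameSketch` and the method `normalize` are hypothetical, and the sketch throws `IllegalArgumentException` for a blank name where the add-on throws `IOException`.

```java
// Hypothetical helper mirroring the file name handling in loadContext, for illustration only.
final class ModelFileNameSketch {
    private static final String MODEL_PREFIX = "ggml-";
    private static final String MODEL_EXTENSION = ".bin";

    /** Returns the model file name with the ggml- prefix and .bin extension added when missing. */
    static String normalize(String modelName) {
        if (modelName.isBlank()) {
            // the add-on reports this case as an IOException; an unchecked exception keeps the sketch short
            throw new IllegalArgumentException("The modelName configuration is missing");
        }
        String fileName = modelName;
        if (!fileName.startsWith(MODEL_PREFIX)) {
            fileName = MODEL_PREFIX + fileName;
        }
        if (!fileName.endsWith(MODEL_EXTENSION)) {
            fileName = fileName + MODEL_EXTENSION;
        }
        return fileName;
    }

    public static void main(String[] args) {
        System.out.println(normalize("tiny"));
        System.out.println(normalize("ggml-tiny.bin"));
    }
}
```

With this handling, a short name such as `tiny` and the full file name `ggml-tiny.bin` resolve to the same file.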

@ -0,0 +1,94 @@
# add-on
addon.whisperstt.name = Whisper Speech-to-Text
addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcribe audio data to text.
voice.config.whisperstt.audioContext.label = Audio Context
voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size)
voice.config.whisperstt.beamSize.label = Beam Size
voice.config.whisperstt.beamSize.description = Beam Size configuration for sampling strategy Beam Search.
voice.config.whisperstt.createWAVRecord.label = Create WAV Record
voice.config.whisperstt.createWAVRecord.description = Create WAV audio record on each whisper execution.
voice.config.whisperstt.enableWhisperLog.label = Enable Whisper Log
voice.config.whisperstt.enableWhisperLog.description = Emit whisper.cpp library logs as add-on debug logs.
voice.config.whisperstt.noResultsMessage.label = No Results Message
voice.config.whisperstt.noResultsMessage.description = Message spoken when there are no results. (Leave empty to disable)
voice.config.whisperstt.errorMessage.label = Error Message
voice.config.whisperstt.errorMessage.description = Message spoken when an error has happened. (Leave empty to disable)
voice.config.whisperstt.grammarLines.label = Grammar
voice.config.whisperstt.grammarLines.description = Grammar to use in GBNF format. (BNF variant used by whisper.cpp).
voice.config.whisperstt.grammarPenalty.label = Grammar Penalty
voice.config.whisperstt.grammarPenalty.description = Penalty for non grammar tokens when using grammar.
voice.config.whisperstt.greedyBestOf.label = Greedy Best Of
voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sampling strategy Greedy. (-1 for unlimited)
voice.config.whisperstt.group.developer.label = Developer
voice.config.whisperstt.group.developer.description = Options added for developers.
voice.config.whisperstt.group.grammar.label = Grammar
voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcriptions.
voice.config.whisperstt.group.messages.label = Info Messages
voice.config.whisperstt.group.messages.description = Configure service information messages.
voice.config.whisperstt.group.stt.label = STT Configuration
voice.config.whisperstt.group.stt.description = Configure Speech to Text.
voice.config.whisperstt.group.vad.label = Voice Activity Detection
voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with.
voice.config.whisperstt.group.whisper.label = Whisper Options
voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options.
voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds
voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription.
voice.config.whisperstt.initialPrompt.label = Initial Prompt
voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with.
voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds
voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection.
voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds
voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription.
voice.config.whisperstt.minSeconds.label = Min Transcription Seconds
voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper.
voice.config.whisperstt.modelName.label = Model Name
voice.config.whisperstt.modelName.description = Model name without extension.
voice.config.whisperstt.openvinoDevice.label = OpenVINO Device
voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
voice.config.whisperstt.preloadModel.label = Preload Model
voice.config.whisperstt.preloadModel.description = Keep the model loaded. If the parameter is set to true, the model will be reloaded only on configuration updates. If the model is not loaded when needed, the service will try to load it. If the parameter is set to false, the model will be loaded and unloaded on each run.
voice.config.whisperstt.recordSampleFormat.label = Record Sample Format
voice.config.whisperstt.recordSampleFormat.description = Defines the sample type and bit-size used by the created WAV audio record.
voice.config.whisperstt.recordSampleFormat.option.i16 = Integer 16bit
voice.config.whisperstt.recordSampleFormat.option.f32 = Float 32bit
voice.config.whisperstt.removeSilence.label = Remove Silence
voice.config.whisperstt.removeSilence.description = Remove silence frames from the beginning and end of the audio.
voice.config.whisperstt.samplingStrategy.label = Sampling strategy
voice.config.whisperstt.samplingStrategy.description = Sampling strategy used.
voice.config.whisperstt.samplingStrategy.option.GREEDY = Greedy
voice.config.whisperstt.samplingStrategy.option.BEAN_SEARCH = Beam Search
voice.config.whisperstt.singleUtteranceMode.label = Single Utterance Mode
voice.config.whisperstt.singleUtteranceMode.description = When enabled, recognition stops listening after a single utterance.
voice.config.whisperstt.speedUp.label = Speed Up
voice.config.whisperstt.speedUp.description = Speed up audio by 2x. (Reduces accuracy.)
voice.config.whisperstt.stepSeconds.label = Audio Step
voice.config.whisperstt.stepSeconds.description = Audio step for the voice activity detection.
voice.config.whisperstt.stepSeconds.option.0.1 = 100ms
voice.config.whisperstt.stepSeconds.option.0.2 = 200ms
voice.config.whisperstt.stepSeconds.option.0.3 = 300ms
voice.config.whisperstt.stepSeconds.option.0.5 = 500ms
voice.config.whisperstt.stepSeconds.option.0.6 = 600ms
voice.config.whisperstt.stepSeconds.option.1 = 1s
voice.config.whisperstt.temperature.label = Temperature
voice.config.whisperstt.temperature.description = Temperature threshold.
voice.config.whisperstt.threads.label = Threads
voice.config.whisperstt.threads.description = Number of threads used by whisper. (0 to use host max threads)
voice.config.whisperstt.useGPU.label = Use GPU
voice.config.whisperstt.useGPU.description = Enables GPU usage. (The built-in binaries do not support GPU usage, so this has no effect.)
voice.config.whisperstt.useGrammar.label = Use Grammar
voice.config.whisperstt.useGrammar.description = Enables grammar usage.
voice.config.whisperstt.vadMode.label = Voice Activity Detection Mode
voice.config.whisperstt.vadMode.description = Available VAD modes. Quality is the most likely to detect voice.
voice.config.whisperstt.vadMode.option.QUALITY = Quality
voice.config.whisperstt.vadMode.option.LOW_BITRATE = Low Bitrate
voice.config.whisperstt.vadMode.option.AGGRESSIVE = Aggressive
voice.config.whisperstt.vadMode.option.VERY_AGGRESSIVE = Very Aggressive
voice.config.whisperstt.vadSensitivity.label = Voice Activity Detection Sensitivity
voice.config.whisperstt.vadSensitivity.description = Fraction, in the range 0-1, of voice activity that must be detected in each analyzed audio step to consider it as voice.
voice.config.whisperstt.vadStep.label = Voice Activity Detector Step
voice.config.whisperstt.vadStep.description = Audio milliseconds passed to the voice activity detector. Defines how many times the voice activity detector is executed per audio step.
voice.config.whisperstt.vadStep.option.10 = 10ms
voice.config.whisperstt.vadStep.option.20 = 20ms
voice.config.whisperstt.vadStep.option.30 = 30ms

View File

@ -471,6 +471,7 @@
<module>org.openhab.voice.voicerss</module> <module>org.openhab.voice.voicerss</module>
<module>org.openhab.voice.voskstt</module> <module>org.openhab.voice.voskstt</module>
<module>org.openhab.voice.watsonstt</module> <module>org.openhab.voice.watsonstt</module>
<module>org.openhab.voice.whisperstt</module>
</modules> </modules>
<properties> <properties>