diff --git a/CODEOWNERS b/CODEOWNERS index 0496f7b8dff..292a42062bf 100755 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -451,6 +451,7 @@ /bundles/org.openhab.voice.voicerss/ @lolodomo /bundles/org.openhab.voice.voskstt/ @GiviMAD /bundles/org.openhab.voice.watsonstt/ @GiviMAD +/bundles/org.openhab.voice.whisperstt/ @GiviMAD /itests/org.openhab.automation.groovyscripting.tests/ @wborn /itests/org.openhab.automation.jsscriptingnashorn.tests/ @wborn /itests/org.openhab.binding.astro.tests/ @gerrieg diff --git a/bom/openhab-addons/pom.xml b/bom/openhab-addons/pom.xml index 95da35babef..1bcffbae283 100644 --- a/bom/openhab-addons/pom.xml +++ b/bom/openhab-addons/pom.xml @@ -2251,6 +2251,11 @@ org.openhab.voice.watsonstt ${project.version} + + org.openhab.addons.bundles + org.openhab.voice.whisperstt + ${project.version} + diff --git a/bundles/org.openhab.voice.whisperstt/NOTICE b/bundles/org.openhab.voice.whisperstt/NOTICE new file mode 100644 index 00000000000..f34c018407e --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/NOTICE @@ -0,0 +1,35 @@ +This content is produced and maintained by the openHAB project. + +* Project home: https://www.openhab.org + +== Declared Project Licenses + +This program and the accompanying materials are made available under the terms +of the Eclipse Public License 2.0 which is available at +https://www.eclipse.org/legal/epl-2.0/. + +== Source Code + +https://github.com/openhab/openhab-addons + +== Third-party Content + +io.github.givimad: whisper-jni +* License: Apache 2.0 License +* Project: https://github.com/GiviMAD/whisper-jni +* Source: https://github.com/GiviMAD/whisper-jni/tree/main/src/ + +native dependency: whisper.cpp +* License: MIT License https://github.com/ggerganov/whisper.cpp/blob/master/LICENSE +* Project: https://github.com/ggerganov/whisper.cpp +* Source: https://github.com/ggerganov/whisper.cpp + +io.github.givimad: libfvad-jni +* License: Apache 2.0 License https://github.com/GiviMAD/libfvad-jni/blob/main/LICENSE +* Project: https://github.com/GiviMAD/libfvad-jni +* Source: https://github.com/GiviMAD/libfvad-jni/tree/main/src/ + +native dependency: libfvad +* License: BSD License https://github.com/dpirch/libfvad/blob/master/LICENSE +* Project: https://github.com/dpirch/libfvad +* Source: https://github.com/dpirch/libfvad diff --git a/bundles/org.openhab.voice.whisperstt/README.md b/bundles/org.openhab.voice.whisperstt/README.md new file mode 100644 index 00000000000..03ea4b6849e --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/README.md @@ -0,0 +1,248 @@ +# Whisper Speech-to-Text + +Whisper STT Service uses [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to perform offline speech-to-text in openHAB. +It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity detection to isolate a single command to transcribe, speeding up the execution. + +[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that makes it easy to integrate into different platforms and applications.
+ +Whisper enables speech recognition for multiple languages and dialects: + +english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish, +italian, indonesian, hindi, finnish, vietnamese, hebrew, ukrainian, greek, malay, czech, romanian, danish, hungarian, tamil, norwegian, +thai, urdu, croatian, bulgarian, lithuanian, latin, maori, malayalam, welsh, slovak, telugu, persian, latvian, bengali, serbian, azerbaijani, +slovenian, kannada, estonian, macedonian, breton, basque, icelandic, armenian, nepali, mongolian, bosnian, kazakh, albanian, swahili, galician, +marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, georgian, belarusian, tajik, sindhi, gujarati, amharic, yiddish, lao, +uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala, +hausa, bashkir, javanese and sundanese. + +## Supported platforms + +This add-on uses some native binaries to work. +It relies on the [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and the [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni). + +The following platforms are supported: + +* Windows 10 x86_64 +* Debian GLIBC x86_64/arm64 (min GLIBC version 2.31, e.g. Debian Bullseye or Ubuntu Focal) +* macOS x86_64/arm64 (min version v11.0) + +The native binaries for those platforms are included in the add-on provided with the openHAB distribution. + +## CPU compatibility + +To use this binding, it's recommended to use a device at least as powerful as the Raspberry Pi 5 with a modern CPU. +Execution times on a Raspberry Pi 4 are roughly twice as long, so only the tiny model runs in under 5 seconds there. + +If you are going to use the binding on an `x86_64` host, the CPU should support the flags `avx2`, `fma`, `f16c` and `avx`. +You can check those flags on Linux using the terminal with `lscpu`. +You can check those flags on Windows using a program like `CPU-Z`. + +If you are going to use the binding on an `arm64` host, the CPU should support the flag `fphp`. +You can check that flag on Linux using the terminal with `lscpu`. + +## Transcription time + +On a Raspberry Pi 5, the approximate transcription times are: + +| model | exec time | | ---------- | --------: | | tiny.bin | 1.5s | | base.bin | 3s | | small.bin | 8.5s | | medium.bin | 17s | + + +## Configuring the model + +Before you can use this service, you need to configure your model. + +You can download the models from the sources provided by the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) author: + +* https://huggingface.co/ggerganov/whisper.cpp +* https://ggml.ggerganov.com + +You should place the downloaded .bin model in the '<openHAB userdata>/whisper/' directory so the add-on can find it. + +Remember to check that you have enough RAM to load the model; the estimated RAM consumption can be checked on the Hugging Face link above. + +## Using alternative whisper.cpp library + +It's possible to use your own build of the whisper.cpp shared library with this add-on. + +On `Linux/macOS` you need to place the `libwhisper.so/libwhisper.dylib` file at `/usr/local/lib/`. + +On `Windows` the `whisper.dll` file needs to be placed in any directory listed in the `$env:PATH` variable, for example `X:\\Windows\System32\`. + +In the [whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can find information about the CMake flags required to enable the different acceleration methods and other relevant build options.
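+
+For example, a whisper.cpp shared library build on Linux might look similar to the following sketch (the commands, flags and output paths are illustrative only and depend on the whisper.cpp version and the acceleration you want to enable, so check its README first):
+
+```shell
+# Illustrative only: build whisper.cpp as a shared library and install it
+git clone https://github.com/ggerganov/whisper.cpp
+cd whisper.cpp
+cmake -B build -DBUILD_SHARED_LIBS=ON
+cmake --build build --config Release
+# The library location inside "build" may differ between whisper.cpp versions
+sudo cp build/libwhisper.so /usr/local/lib/
+```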
+ +Note: You need to restart openHAB to reload the library. + +## Grammar + +The whisper.cpp library allows you to define a grammar to alter the transcription results without fine-tuning the model. + +Internally, whisper works by inferring a matrix of possible tokens from the audio and then resolving the final transcription from it using either the Greedy or Beam Search algorithm. +The grammar feature allows you to modify the probabilities of the inferred tokens by adding a penalty to the tokens outside the grammar, so that the transcription gets resolved in a different way. + +It's a way to get the smallest models to perform better over a limited grammar. + +The grammar should be defined using [BNF](https://en.wikipedia.org/wiki/Backus–Naur_form), and the root variable should resolve the full grammar. +It allows using regex and optional parts to make it more dynamic. + +This is a basic grammar example: + +```BNF +root ::= (light_switch | light_state | tv_channel) "." +light_switch ::= "turn the light " ("on" | "off") +light_state ::= "set light to " ("high" | "low") +tv_channel ::= ("set ")? "tv channel to " [0-9]+ +``` + +You can provide the grammar and enable its usage using the binding configuration. + +## Configuration + +Use your favorite configuration UI to edit the Whisper settings: + +### Speech to Text Configuration + +General options. + +* **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin) +* **Preload Model** - Keep the whisper model loaded. +* **Single Utterance Mode** - When enabled, recognition stops listening after a single utterance. +* **Min Transcription Seconds** - Forces a minimum audio duration passed to whisper, in seconds. +* **Max Transcription Seconds** - Max seconds before forcing the transcription, without waiting for silence detection. +* **Initial Silence Seconds** - Max seconds without any voice activity to abort the transcription. +* **Max Silence Seconds** - Max consecutive silence seconds to trigger the transcription. +* **Remove Silence** - Remove start and end silence from the audio to transcribe. + +### Voice Activity Detection Configuration + +Configure VAD options. + +* **Audio Step** - Audio processing step in seconds for the voice activity detection. +* **Voice Activity Detection Mode** - Selected VAD Mode. +* **Voice Activity Detection Sensitivity** - Percentage in range 0-1 of voice activity in one second to consider it as voice. +* **Voice Activity Detection Step** - VAD detector internal step in ms (only allows 10, 20 or 30). (Audio Step / Voice Activity Detection Step = number of VAD executions per audio step). + +### Whisper Configuration + +Configure whisper options. + +* **Threads** - Number of threads used by whisper. (0 to use host max threads) +* **Sampling Strategy** - Sampling strategy used. +* **Beam Size** - Beam Size configuration for sampling strategy Beam Search. +* **Greedy Best Of** - Best Of configuration for sampling strategy Greedy. +* **Speed Up** - Speed up audio by x2. (Reduced accuracy) +* **Audio Context** - Overwrite the audio context size. (0 to use whisper default context size) +* **Temperature** - Temperature threshold. +* **Initial Prompt** - Initial prompt for whisper. +* **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect) +* **Use GPU** - Enables GPU usage.
(built-in binaries do not support GPU usage, this has no effect) + +### Grammar Configuration + +Configure the grammar options. + +* **Grammar** - Grammar to use in GBNF format (whisper.cpp BNF variant). +* **Use Grammar** - Enable grammar usage. +* **Grammar Penalty** - Penalty for non-grammar tokens. + +#### Grammar Example: + + +```gbnf +# Grammar should define a root expression that should end with a dot. +root ::= " " command "." +# Alternative command expression to expand into the root. +command ::= "Turn " onoff " " (connector)? thing | + put " " thing " to " state | + watch " " show " at bedroom" | + "Start " timer " minutes timer" + +# You can use as many expressions as you need. + +thing ::= "light" | "bedroom light" | "living room light" | "tv" + +put ::= "set" | "put" + +onoff ::= "on" | "off" + +watch ::= "watch" | "play" + +connector ::= "the" + +state ::= "low" | "high" | "normal" + +show ::= [a-zA-Z]+ + +timer ::= [0-9]+ + +``` + +### Messages Configuration + +* **No Results Message** - Message to be told on no results. +* **Error Message** - Message to be told on exception. + +### Developer Configuration + +* **Create WAV Record** - Create a WAV audio file on each whisper execution; it also creates a '.props' file containing the transcription. +* **Record Sample Format** - Change the record sample format. (allows i16 or f32) +* **Enable Whisper Log** - Emit whisper.cpp library logs as add-on debug logs. + +You can find [here](https://github.com/givimad/whisper-finetune-oh) information on how to fine-tune a model using the generated records. + +### Configuration via a text file + +In case you would like to set up the service via a text file, create a new file in `$OPENHAB_ROOT/conf/services` named `whisperstt.cfg`. + +Its contents should look similar to: + +``` +org.openhab.voice.whisperstt:modelName=tiny +org.openhab.voice.whisperstt:initSilenceSeconds=0.3 +org.openhab.voice.whisperstt:removeSilence=true +org.openhab.voice.whisperstt:stepSeconds=0.3 +org.openhab.voice.whisperstt:vadStep=20 +org.openhab.voice.whisperstt:singleUtteranceMode=true +org.openhab.voice.whisperstt:preloadModel=false +org.openhab.voice.whisperstt:vadMode=LOW_BITRATE +org.openhab.voice.whisperstt:vadSensitivity=0.1 +org.openhab.voice.whisperstt:maxSilenceSeconds=2 +org.openhab.voice.whisperstt:minSeconds=2 +org.openhab.voice.whisperstt:maxSeconds=10 +org.openhab.voice.whisperstt:threads=0 +org.openhab.voice.whisperstt:audioContext=0 +org.openhab.voice.whisperstt:samplingStrategy=GREEDY +org.openhab.voice.whisperstt:temperature=0 +org.openhab.voice.whisperstt:noResultsMessage="Sorry, I didn't understand you" +org.openhab.voice.whisperstt:errorMessage="Sorry, something went wrong" +org.openhab.voice.whisperstt:createWAVRecord=false +org.openhab.voice.whisperstt:recordSampleFormat=i16 +org.openhab.voice.whisperstt:speedUp=false +org.openhab.voice.whisperstt:beamSize=4 +org.openhab.voice.whisperstt:enableWhisperLog=false +org.openhab.voice.whisperstt:greedyBestOf=4 +org.openhab.voice.whisperstt:initialPrompt= +org.openhab.voice.whisperstt:openvinoDevice="" +org.openhab.voice.whisperstt:useGPU=false +org.openhab.voice.whisperstt:useGrammar=false +org.openhab.voice.whisperstt:grammarPenalty=80.0 +org.openhab.voice.whisperstt:grammarLines= +``` + +### Default Speech-to-Text Configuration + +You can select your preferred default Speech-to-Text in the UI: + +* Go to **Settings**. +* Edit **System Services - Voice**. +* Set **Whisper** as **Speech-to-Text**.
+ +In case you would like to set up these settings via a text file, you can edit the file `runtime.cfg` in `$OPENHAB_ROOT/conf/services` and set the following entries: + +``` +org.openhab.voice:defaultSTT=whisperstt +``` diff --git a/bundles/org.openhab.voice.whisperstt/pom.xml b/bundles/org.openhab.voice.whisperstt/pom.xml new file mode 100644 index 00000000000..03143d76dc5 --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/pom.xml @@ -0,0 +1,29 @@ + + + + 4.0.0 + + + org.openhab.addons.bundles + org.openhab.addons.reactor.bundles + 4.2.0-SNAPSHOT + + + org.openhab.voice.whisperstt + + openHAB Add-ons :: Bundles :: Voice :: Whisper Speech-to-Text + + + + io.github.givimad + whisper-jni + 1.6.1 + + + io.github.givimad + libfvad-jni + 1.0.0-0 + + + diff --git a/bundles/org.openhab.voice.whisperstt/src/main/feature/feature.xml b/bundles/org.openhab.voice.whisperstt/src/main/feature/feature.xml new file mode 100644 index 00000000000..d034a0863d9 --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/feature/feature.xml @@ -0,0 +1,9 @@ + + + mvn:org.openhab.core.features.karaf/org.openhab.core.features.karaf.openhab-core/${ohc.version}/xml/features + + + openhab-runtime-base + mvn:org.openhab.addons.bundles/org.openhab.voice.whisperstt/${project.version} + + diff --git a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperConfigOptionProvider.java b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperConfigOptionProvider.java new file mode 100644 index 00000000000..eaff91436cd --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperConfigOptionProvider.java @@ -0,0 +1,77 @@ +/** + * Copyright (c) 2010-2024 Contributors to the openHAB project + * + * See the NOTICE file(s) distributed with this work for additional + * information. 
+ * + * This program and the accompanying materials are made available under the + * terms of the Eclipse Public License 2.0 which is available at + * http://www.eclipse.org/legal/epl-2.0 + * + * SPDX-License-Identifier: EPL-2.0 + */ +package org.openhab.voice.whisperstt.internal; + +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY; +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID; +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME; +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID; +import static org.openhab.voice.whisperstt.internal.WhisperSTTService.WHISPER_FOLDER; + +import java.net.URI; +import java.util.Collection; +import java.util.List; +import java.util.Locale; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import org.eclipse.jdt.annotation.NonNullByDefault; +import org.eclipse.jdt.annotation.Nullable; +import org.openhab.core.config.core.ConfigOptionProvider; +import org.openhab.core.config.core.ConfigurableService; +import org.openhab.core.config.core.ParameterOption; +import org.osgi.framework.Constants; +import org.osgi.service.component.annotations.Component; + +/** + * The {@link WhisperConfigOptionProvider} class provides some dynamic configuration options + * + * @author Miguel Álvarez - Initial contribution + */ +@Component(service = ConfigOptionProvider.class, configurationPid = SERVICE_PID, property = Constants.SERVICE_PID + "=" + + SERVICE_PID) +@ConfigurableService(category = SERVICE_CATEGORY, label = SERVICE_NAME + + " Speech-to-Text", description_uri = SERVICE_CATEGORY + ":" + SERVICE_ID) +@NonNullByDefault +public class WhisperConfigOptionProvider implements ConfigOptionProvider { + @Override + public @Nullable Collection getParameterOptions(URI uri, String param, @Nullable String context, + @Nullable Locale locale) { + if (context == null && (SERVICE_CATEGORY + ":" + SERVICE_ID).equals(uri.toString())) { + if ("modelName".equals(param)) { + return getAvailableModelOptions(); + } + } + return null; + } + + private List getAvailableModelOptions() { + var folderFile = WHISPER_FOLDER.toFile(); + var files = folderFile.listFiles(); + if (!folderFile.exists() || !folderFile.isDirectory() || files == null) { + return List.of(); + } + String modelExtension = ".bin"; + return Stream.of(files).filter(file -> !file.isDirectory() && file.getName().endsWith(modelExtension)) + .map(file -> { + String fileName = file.getName(); + String optionName = file.getName(); + String optionalPrefix = "ggml-"; + if (optionName.startsWith(optionalPrefix)) { + optionName = optionName.substring(optionalPrefix.length()); + } + optionName = optionName.substring(0, optionName.length() - modelExtension.length()); + return new ParameterOption(fileName, optionName); + }).collect(Collectors.toList()); + } +} diff --git a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConfiguration.java b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConfiguration.java new file mode 100644 index 00000000000..0eed735113b --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConfiguration.java @@ -0,0 +1,149 @@ +/** + * Copyright (c) 2010-2024 Contributors to the openHAB project + * + * See the NOTICE file(s) distributed with this work for additional + * information. 
+ * + * This program and the accompanying materials are made available under the + * terms of the Eclipse Public License 2.0 which is available at + * http://www.eclipse.org/legal/epl-2.0 + * + * SPDX-License-Identifier: EPL-2.0 + */ +package org.openhab.voice.whisperstt.internal; + +import java.util.List; + +import org.eclipse.jdt.annotation.NonNullByDefault; + +import io.github.givimad.libfvadjni.VoiceActivityDetector; + +/** + * The {@link WhisperSTTConfiguration} class contains fields mapping thing configuration parameters. + * + * @author Miguel Álvarez Díez - Initial contribution + */ +@NonNullByDefault +public class WhisperSTTConfiguration { + + /** + * Model name without '.bin' extension. + */ + public String modelName = ""; + /** + * Keep model loaded. + */ + public boolean preloadModel; + /** + * Defines the audio step. + */ + public float stepSeconds = 1f; + /** + * Min audio seconds to call whisper with. + */ + public float minSeconds = 2f; + /** + * Max seconds to wait to force stop the transcription. + */ + public int maxSeconds = 10; + /** + * Voice activity detection mode. + */ + public String vadMode = VoiceActivityDetector.Mode.VERY_AGGRESSIVE.toString(); + /** + * Voice activity detection sensitivity. + */ + public float vadSensitivity = 0.3f; + /** + * Voice activity detection step in ms (vad dependency only allows 10, 20 or 30 ms steps). + */ + public int vadStep = 20; + /** + * Initial silence seconds for discard transcription. + */ + public float initSilenceSeconds = 3; + /** + * Max silence seconds for triggering transcription. + */ + public float maxSilenceSeconds = 0.5f; + /** + * Remove silence frames. + */ + public boolean removeSilence = true; + /** + * Number of threads used by whisper. (0 to use host max threads) + */ + public int threads; + /** + * Overwrite the audio context size. (0 to use whisper default context size). + */ + public int audioContext; + /** + * Speed up audio by x2 (reduced accuracy). + */ + public boolean speedUp; + /** + * Sampling strategy. + */ + public String samplingStrategy = "BEAN_SEARCH"; + /** + * Beam Size configuration for sampling strategy Bean Search. + */ + public int beamSize = 2; + /** + * Best Of configuration for sampling strategy Greedy. + */ + public int greedyBestOf = -1; + /** + * Temperature threshold. + */ + public float temperature; + /** + * Initial whisper prompt + */ + public String initialPrompt = ""; + /** + * Grammar in GBNF format. + */ + public List grammarLines = List.of(); + /** + * Enables grammar usage. + */ + public boolean useGrammar = false; + /** + * Grammar penalty. + */ + public float grammarPenalty = 100f; + /** + * Enables GPU usage. (built-in binaries do not support GPU usage) + */ + public boolean useGPU = true; + /** + * OpenVINO device name + */ + public String openvinoDevice = "CPU"; + /** + * Single phrase mode. + */ + public boolean singleUtteranceMode = true; + /** + * Message to be told when no results. + */ + public String noResultsMessage = "Sorry, I didn't understand you"; + /** + * Message to be told when an error has happened. + */ + public String errorMessage = "Sorry, something went wrong"; + /** + * Create wav audio record for each whisper invocation. + */ + public boolean createWAVRecord; + /** + * Record sample format. Values: i16, f32. + */ + public String recordSampleFormat = "i16"; + /** + * Print whisper.cpp library logs as binding debug logs. 
+ */ + public boolean enableWhisperLog; +} diff --git a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConstants.java b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConstants.java new file mode 100644 index 00000000000..505e14c684f --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTConstants.java @@ -0,0 +1,44 @@ +/** + * Copyright (c) 2010-2024 Contributors to the openHAB project + * + * See the NOTICE file(s) distributed with this work for additional + * information. + * + * This program and the accompanying materials are made available under the + * terms of the Eclipse Public License 2.0 which is available at + * http://www.eclipse.org/legal/epl-2.0 + * + * SPDX-License-Identifier: EPL-2.0 + */ +package org.openhab.voice.whisperstt.internal; + +import org.eclipse.jdt.annotation.NonNullByDefault; + +/** + * The {@link WhisperSTTConstants} class defines common constants, which are + * used across the whole binding. + * + * @author Miguel Álvarez Díez - Initial contribution + */ +@NonNullByDefault +public class WhisperSTTConstants { + + /** + * Service name + */ + public static final String SERVICE_NAME = "Whisper"; + /** + * Service id + */ + public static final String SERVICE_ID = "whisperstt"; + + /** + * Service category + */ + public static final String SERVICE_CATEGORY = "voice"; + + /** + * Service pid + */ + public static final String SERVICE_PID = "org.openhab." + SERVICE_CATEGORY + "." + SERVICE_ID; +} diff --git a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java new file mode 100644 index 00000000000..00d55590d9f --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java @@ -0,0 +1,657 @@ +/** + * Copyright (c) 2010-2024 Contributors to the openHAB project + * + * See the NOTICE file(s) distributed with this work for additional + * information. 
+ * + * This program and the accompanying materials are made available under the + * terms of the Eclipse Public License 2.0 which is available at + * http://www.eclipse.org/legal/epl-2.0 + * + * SPDX-License-Identifier: EPL-2.0 + */ +package org.openhab.voice.whisperstt.internal; + +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY; +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID; +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME; +import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID; + +import java.io.ByteArrayInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.text.ParseException; +import java.text.SimpleDateFormat; +import java.util.Date; +import java.util.Locale; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.atomic.AtomicBoolean; + +import javax.sound.sampled.AudioFileFormat; +import javax.sound.sampled.AudioInputStream; +import javax.sound.sampled.AudioSystem; + +import org.eclipse.jdt.annotation.NonNullByDefault; +import org.eclipse.jdt.annotation.Nullable; +import org.openhab.core.OpenHAB; +import org.openhab.core.audio.AudioFormat; +import org.openhab.core.audio.AudioStream; +import org.openhab.core.audio.utils.AudioWaveUtils; +import org.openhab.core.common.ThreadPoolManager; +import org.openhab.core.config.core.ConfigurableService; +import org.openhab.core.config.core.Configuration; +import org.openhab.core.io.rest.LocaleService; +import org.openhab.core.voice.RecognitionStartEvent; +import org.openhab.core.voice.RecognitionStopEvent; +import org.openhab.core.voice.STTException; +import org.openhab.core.voice.STTListener; +import org.openhab.core.voice.STTService; +import org.openhab.core.voice.STTServiceHandle; +import org.openhab.core.voice.SpeechRecognitionErrorEvent; +import org.openhab.core.voice.SpeechRecognitionEvent; +import org.openhab.voice.whisperstt.internal.utils.VAD; +import org.osgi.framework.Constants; +import org.osgi.service.component.annotations.Activate; +import org.osgi.service.component.annotations.Component; +import org.osgi.service.component.annotations.Deactivate; +import org.osgi.service.component.annotations.Modified; +import org.osgi.service.component.annotations.Reference; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.github.givimad.libfvadjni.VoiceActivityDetector; +import io.github.givimad.whisperjni.WhisperContext; +import io.github.givimad.whisperjni.WhisperContextParams; +import io.github.givimad.whisperjni.WhisperFullParams; +import io.github.givimad.whisperjni.WhisperGrammar; +import io.github.givimad.whisperjni.WhisperJNI; +import io.github.givimad.whisperjni.WhisperSamplingStrategy; +import io.github.givimad.whisperjni.WhisperState; + +/** + * The {@link WhisperSTTService} class is a service implementation to use whisper.cpp for Speech-to-Text. 
+ * + * @author Miguel Álvarez - Initial contribution + */ +@NonNullByDefault +@Component(configurationPid = SERVICE_PID, property = Constants.SERVICE_PID + "=" + SERVICE_PID) +@ConfigurableService(category = SERVICE_CATEGORY, label = SERVICE_NAME + + " Speech-to-Text", description_uri = SERVICE_CATEGORY + ":" + SERVICE_ID) +public class WhisperSTTService implements STTService { + protected static final Path WHISPER_FOLDER = Path.of(OpenHAB.getUserDataFolder(), "whisper"); + private static final Path SAMPLES_FOLDER = Path.of(WHISPER_FOLDER.toString(), "samples"); + private static final int WHISPER_SAMPLE_RATE = 16000; + private final Logger logger = LoggerFactory.getLogger(WhisperSTTService.class); + private final ScheduledExecutorService executor = ThreadPoolManager.getScheduledPool("OH-voice-whisperstt"); + private final LocaleService localeService; + private WhisperSTTConfiguration config = new WhisperSTTConfiguration(); + private @Nullable WhisperContext context; + private @Nullable WhisperGrammar grammar; + private @Nullable WhisperJNI whisper; + + @Activate + public WhisperSTTService(@Reference LocaleService localeService) { + this.localeService = localeService; + } + + @Activate + protected void activate(Map config) { + try { + if (!Files.exists(WHISPER_FOLDER)) { + Files.createDirectory(WHISPER_FOLDER); + } + WhisperJNI.loadLibrary(getLoadOptions()); + VoiceActivityDetector.loadLibrary(); + whisper = new WhisperJNI(); + } catch (IOException | RuntimeException e) { + logger.warn("Unable to register native library: {}", e.getMessage()); + } + configChange(config); + } + + private WhisperJNI.LoadOptions getLoadOptions() { + Path libFolder = Paths.get("/usr/local/lib"); + Path libFolderWin = Paths.get("/Windows/System32"); + var options = new WhisperJNI.LoadOptions(); + // Overwrite whisper jni shared library + Path whisperJNILinuxLibrary = libFolder.resolve("libwhisperjni.so"); + Path whisperJNIMacLibrary = libFolder.resolve("libwhisperjni.dylib"); + Path whisperJNIWinLibrary = libFolderWin.resolve("libwhisperjni.dll"); + if (Files.exists(whisperJNILinuxLibrary)) { + options.whisperJNILib = whisperJNILinuxLibrary; + } else if (Files.exists(whisperJNIMacLibrary)) { + options.whisperJNILib = whisperJNIMacLibrary; + } else if (Files.exists(whisperJNIWinLibrary)) { + options.whisperJNILib = whisperJNIWinLibrary; + } + // Overwrite whisper shared library, Windows searches library in $env:PATH + Path whisperLinuxLibrary = libFolder.resolve("libwhisper.so"); + Path whisperMacLibrary = libFolder.resolve("libwhisper.dylib"); + if (Files.exists(whisperLinuxLibrary)) { + options.whisperLib = whisperLinuxLibrary; + } else if (Files.exists(whisperMacLibrary)) { + options.whisperLib = whisperMacLibrary; + } + // Log library registration + options.logger = (msg) -> logger.debug("Library load: {}", msg); + return options; + } + + @Modified + protected void modified(Map config) { + configChange(config); + } + + @Deactivate + protected void deactivate(Map config) { + try { + WhisperGrammar grammar = this.grammar; + if (grammar != null) { + grammar.close(); + this.grammar = null; + } + unloadContext(); + } catch (IOException e) { + logger.warn("IOException unloading model: {}", e.getMessage()); + } + WhisperJNI.setLibraryLogger(null); + } + + private void configChange(Map config) { + this.config = new Configuration(config).as(WhisperSTTConfiguration.class); + WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? 
this::onWhisperLog : null); + WhisperGrammar grammar = this.grammar; + if (grammar != null) { + grammar.close(); + this.grammar = null; + } + WhisperJNI whisper; + try { + whisper = getWhisper(); + } catch (IOException ignored) { + logger.warn("library not loaded, the add-on will not work"); + return; + } + String grammarText = String.join("\n", this.config.grammarLines); + if (this.config.useGrammar && isValidGrammar(grammarText)) { + try { + logger.debug("Parsing GBNF grammar..."); + this.grammar = whisper.parseGrammar(grammarText); + } catch (IOException e) { + logger.warn("Error parsing grammar: {}", e.getMessage()); + } + } + if (this.config.preloadModel) { + try { + loadContext(); + } catch (IOException e) { + logger.warn("IOException loading model: {}", e.getMessage()); + } catch (UnsatisfiedLinkError e) { + logger.warn("Missing native dependency: {}", e.getMessage()); + } + } else { + try { + unloadContext(); + } catch (IOException e) { + logger.warn("IOException unloading model: {}", e.getMessage()); + } + } + } + + private boolean isValidGrammar(String grammarText) { + try { + WhisperGrammar.assertValidGrammar(grammarText); + } catch (IllegalArgumentException | ParseException e) { + logger.warn("Invalid grammar: {}", e.getMessage()); + return false; + } + return true; + } + + @Override + public String getId() { + return SERVICE_ID; + } + + @Override + public String getLabel(@Nullable Locale locale) { + return SERVICE_NAME; + } + + @Override + public Set getSupportedLocales() { + // as it is not possible to determine the language of the model that was downloaded and setup by the user, it is + // assumed the language of the model is matching the locale of the openHAB server + return Set.of(localeService.getLocale(null)); + } + + @Override + public Set getSupportedFormats() { + return Set.of( + new AudioFormat(AudioFormat.CONTAINER_NONE, AudioFormat.CODEC_PCM_SIGNED, false, 16, null, + (long) WHISPER_SAMPLE_RATE, 1), + new AudioFormat(AudioFormat.CONTAINER_WAVE, AudioFormat.CODEC_PCM_SIGNED, false, 16, null, + (long) WHISPER_SAMPLE_RATE, 1)); + } + + @Override + public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set set) + throws STTException { + AtomicBoolean aborted = new AtomicBoolean(false); + WhisperContext ctx = null; + WhisperState state = null; + try { + var whisper = getWhisper(); + ctx = getContext(); + logger.debug("Creating whisper state..."); + state = whisper.initState(ctx); + logger.debug("Whisper state created"); + logger.debug("Creating VAD instance..."); + final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE); + VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep, + config.vadStep, config.vadSensitivity); + logger.debug("VAD instance created"); + sttListener.sttEventReceived(new RecognitionStartEvent()); + backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted); + } catch (IOException e) { + if (ctx != null && !config.preloadModel) { + ctx.close(); + } + if (state != null) { + state.close(); + } + throw new STTException("Exception during initialization", e); + } + return () -> { + aborted.set(true); + }; + } + + private WhisperJNI getWhisper() throws IOException { + var whisper = this.whisper; + if (whisper == null) { + throw new IOException("Library not loaded"); + } + return whisper; + } + + private WhisperContext getContext() throws IOException, UnsatisfiedLinkError { + var context = 
this.context; + if (context != null) { + return context; + } + return loadContext(); + } + + private synchronized WhisperContext loadContext() throws IOException { + unloadContext(); + String modelFilename = this.config.modelName; + if (modelFilename.isBlank()) { + throw new IOException("The modelName configuration is missing"); + } + String modelPrefix = "ggml-"; + String modelExtension = ".bin"; + if (!modelFilename.startsWith(modelPrefix)) { + modelFilename = modelPrefix + modelFilename; + } + if (!modelFilename.endsWith(modelExtension)) { + modelFilename = modelFilename + modelExtension; + } + Path modelPath = WHISPER_FOLDER.resolve(modelFilename); + if (!Files.exists(modelPath) || Files.isDirectory(modelPath)) { + throw new IOException("Missing model file: " + modelPath); + } + logger.debug("Loading whisper context..."); + WhisperJNI whisper = getWhisper(); + var context = whisper.initNoState(modelPath, getWhisperContextParams()); + logger.debug("Whisper context loaded"); + if (config.preloadModel) { + this.context = context; + } + if (!config.openvinoDevice.isBlank()) { + // has no effect if OpenVINO is not enabled in whisper.cpp library. + logger.debug("Init OpenVINO device"); + whisper.initOpenVINO(context, config.openvinoDevice); + } + return context; + } + + private WhisperContextParams getWhisperContextParams() { + var params = new WhisperContextParams(); + params.useGPU = config.useGPU; + return params; + } + + private void unloadContext() throws IOException { + var context = this.context; + if (context != null) { + logger.debug("Unloading model"); + context.close(); + this.context = null; + } + } + + private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep, + Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) { + var releaseContext = !config.preloadModel; + final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE; + final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE); + final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE); + final int nMaxSilenceSamples = (int) (config.maxSilenceSeconds * (float) WHISPER_SAMPLE_RATE); + logger.debug("Samples per step {}", nSamplesStep); + logger.debug("Min transcription samples {}", nSamplesMin); + logger.debug("Max transcription samples {}", nSamplesMax); + logger.debug("Max init silence samples {}", nInitSilenceSamples); + logger.debug("Max silence samples {}", nMaxSilenceSamples); + // used to store the step samples in libfvad wanted format 16-bit int + final short[] stepAudioSamples = new short[nSamplesStep]; + // used to store the full samples in whisper wanted format 32-bit float + final float[] audioSamples = new float[nSamplesMax]; + executor.submit(() -> { + int audioSamplesOffset = 0; + int silenceSamplesCounter = 0; + int nProcessedSamples = 0; + int numBytesRead; + boolean voiceDetected = false; + String transcription = ""; + String tempTranscription = ""; + VAD.@Nullable VADResult lastVADResult; + VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null; + try { + try (state; // + audioStream; // + vad) { + if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) { + AudioWaveUtils.removeFMT(audioStream); + } + final ByteBuffer captureBuffer = ByteBuffer.allocate(nSamplesStep * 2) + .order(ByteOrder.LITTLE_ENDIAN); + // init remaining to full capacity + int remaining = captureBuffer.capacity(); + WhisperFullParams params = 
getWhisperFullParams(ctx, locale); + while (!aborted.get()) { + // read until no remaining so we get the complete step samples + numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining, + remaining); + if (aborted.get() || numBytesRead == -1) { + break; + } + if (numBytesRead != remaining) { + remaining = remaining - numBytesRead; + continue; + } + // reset remaining to full capacity + remaining = captureBuffer.capacity(); + // encode step samples and copy them to the audio buffers + var shortBuffer = captureBuffer.asShortBuffer(); + while (shortBuffer.hasRemaining()) { + var position = shortBuffer.position(); + short i16BitSample = shortBuffer.get(); + float f32BitSample = Float.min(1f, + Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f)); + stepAudioSamples[position] = i16BitSample; + audioSamples[audioSamplesOffset++] = f32BitSample; + nProcessedSamples++; + } + // run vad + if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) { + logger.debug("VAD: Skipping, max length reached"); + } else { + lastVADResult = vad.analyze(stepAudioSamples); + if (lastVADResult.isVoice()) { + voiceDetected = true; + logger.debug("VAD: voice detected"); + silenceSamplesCounter = 0; + firstConsecutiveSilenceVADResult = null; + continue; + } else { + if (firstConsecutiveSilenceVADResult == null) { + firstConsecutiveSilenceVADResult = lastVADResult; + } + silenceSamplesCounter += nSamplesStep; + int maxSilenceSamples = voiceDetected ? nMaxSilenceSamples : nInitSilenceSamples; + if (silenceSamplesCounter < maxSilenceSamples) { + if (logger.isDebugEnabled()) { + int totalSteps = maxSilenceSamples / nSamplesStep; + int currentSteps = totalSteps + - ((maxSilenceSamples - silenceSamplesCounter) / nSamplesStep); + logger.debug("VAD: silence detected {}/{}", currentSteps, totalSteps); + } + if (!voiceDetected && config.removeSilence) { + logger.debug("removing start silence"); + int samplesToKeep = lastVADResult.voiceSamplesInTail(); + if (samplesToKeep > 0) { + for (int i = 0; i < samplesToKeep; i++) { + audioSamples[i] = audioSamples[audioSamplesOffset + - (samplesToKeep - i)]; + } + audioSamplesOffset = samplesToKeep; + logger.debug("some audio was kept"); + } else { + audioSamplesOffset = 0; + } + } + continue; + } else { + logger.debug("VAD: silence detected"); + if (audioSamplesOffset < nSamplesMin) { + logger.debug("Not enough samples, continue"); + continue; + } + if (config.singleUtteranceMode) { + // close the audio stream to avoid keep getting audio we don't need + try { + audioStream.close(); + } catch (IOException ignored) { + } + } + } + } + if (config.removeSilence) { + if (voiceDetected) { + logger.debug("removing end silence"); + int samplesToKeep = firstConsecutiveSilenceVADResult.voiceSamplesInHead(); + if (samplesToKeep > 0) { + logger.debug("some audio was kept"); + } + var samplesToRemove = silenceSamplesCounter - samplesToKeep; + if (audioSamplesOffset - samplesToRemove < nSamplesMin) { + logger.debug("avoid removing under min audio seconds"); + samplesToRemove = audioSamplesOffset - nSamplesMin; + } + if (samplesToRemove > 0) { + audioSamplesOffset -= samplesToRemove; + } + } else { + audioSamplesOffset = 0; + } + } + if (audioSamplesOffset == 0) { + if (config.singleUtteranceMode) { + logger.debug("no audio to transcribe, ending"); + break; + } else { + logger.debug("no audio to transcribe, continue listening"); + continue; + } + } + } + // run whisper + logger.debug("running whisper with {} seconds of audio...", + 
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f); + long execStartTime = System.currentTimeMillis(); + var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset); + logger.debug("whisper ended in {}ms with result code {}", + System.currentTimeMillis() - execStartTime, result); + // process result + if (result != 0) { + emitSpeechRecognitionError(sttListener); + break; + } + int nSegments = whisper.fullNSegmentsFromState(state); + logger.debug("Available transcription segments {}", nSegments); + if (nSegments == 1) { + tempTranscription = whisper.fullGetSegmentTextFromState(state, 0); + if (config.createWAVRecord) { + createAudioFile(audioSamples, audioSamplesOffset, tempTranscription, + locale.getLanguage()); + } + if (config.singleUtteranceMode) { + logger.debug("single utterance mode, ending transcription"); + transcription = tempTranscription; + break; + } else { + // start a new transcription segment + transcription += tempTranscription; + tempTranscription = ""; + } + } else if (nSegments == 0 && config.singleUtteranceMode) { + logger.debug("Single utterance mode and no results, ending transcription"); + break; + } else if (nSegments > 1) { + // non reachable + logger.warn("Whisper should be configured in single segment mode {}", nSegments); + break; + } + // reset state to start with next segment + voiceDetected = false; + silenceSamplesCounter = 0; + audioSamplesOffset = 0; + logger.debug("Partial transcription: {}", tempTranscription); + logger.debug("Transcription: {}", transcription); + } + } finally { + if (releaseContext) { + ctx.close(); + } + } + // emit result + if (!aborted.get()) { + sttListener.sttEventReceived(new RecognitionStopEvent()); + logger.debug("Final transcription: '{}'", transcription); + if (!transcription.isBlank()) { + sttListener.sttEventReceived(new SpeechRecognitionEvent(transcription.trim(), 1)); + } else { + emitSpeechRecognitionNoResultsError(sttListener); + } + } + } catch (IOException e) { + logger.warn("Error running speech to text: {}", e.getMessage()); + emitSpeechRecognitionError(sttListener); + } catch (UnsatisfiedLinkError e) { + logger.warn("Missing native dependency: {}", e.getMessage()); + emitSpeechRecognitionError(sttListener); + } + }); + } + + private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException { + WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy); + var params = new WhisperFullParams(strategy); + params.temperature = config.temperature; + params.nThreads = config.threads; + params.audioCtx = config.audioContext; + params.speedUp = config.speedUp; + params.beamSearchBeamSize = config.beamSize; + params.greedyBestOf = config.greedyBestOf; + if (!config.initialPrompt.isBlank()) { + params.initialPrompt = config.initialPrompt; + } + if (grammar != null) { + params.grammar = grammar; + params.grammarPenalty = config.grammarPenalty; + } + // there is no single language models other than the english ones + params.language = getWhisper().isMultilingual(context) ? 
locale.getLanguage() : "en"; + // implementation assumes this options + params.translate = false; + params.detectLanguage = false; + params.printProgress = false; + params.noTimestamps = true; + params.printRealtime = false; + params.printSpecial = false; + params.printTimestamps = false; + params.suppressBlank = true; + params.suppressNonSpeechTokens = true; + params.singleSegment = true; + params.noContext = true; + return params; + } + + private void emitSpeechRecognitionNoResultsError(STTListener sttListener) { + sttListener.sttEventReceived(new SpeechRecognitionErrorEvent(config.noResultsMessage)); + } + + private void emitSpeechRecognitionError(STTListener sttListener) { + sttListener.sttEventReceived(new SpeechRecognitionErrorEvent(config.errorMessage)); + } + + private void createSamplesDir() { + if (!Files.exists(SAMPLES_FOLDER)) { + try { + Files.createDirectory(SAMPLES_FOLDER); + logger.info("Whisper samples dir created {}", SAMPLES_FOLDER); + } catch (IOException ignored) { + logger.warn("Unable to create whisper samples dir {}", SAMPLES_FOLDER); + } + } + } + + private void createAudioFile(float[] samples, int size, String transcription, String language) { + createSamplesDir(); + javax.sound.sampled.AudioFormat jAudioFormat; + ByteBuffer byteBuffer; + if ("i16".equals(config.recordSampleFormat)) { + logger.debug("Saving audio file with sample format i16"); + jAudioFormat = new javax.sound.sampled.AudioFormat(javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, + WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false); + byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN); + for (int i = 0; i < size; i++) { + byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE)); + } + } else { + logger.debug("Saving audio file with sample format f32"); + jAudioFormat = new javax.sound.sampled.AudioFormat(javax.sound.sampled.AudioFormat.Encoding.PCM_FLOAT, + WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false); + byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN); + for (int i = 0; i < size; i++) { + byteBuffer.putFloat(samples[i]); + } + } + AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()), + jAudioFormat, samples.length); + try { + var scapedTranscription = transcription.replaceAll("[^a-zA-ZÀ-ú0-9.-]", "_"); + if (scapedTranscription.length() > 60) { + scapedTranscription = scapedTranscription.substring(0, 60); + } + String fileName = new SimpleDateFormat("yyyy-MM-dd.HH.mm.ss.SS").format(new Date()) + "(" + + scapedTranscription + ")"; + Path audioPath = Path.of(SAMPLES_FOLDER.toString(), fileName + ".wav"); + Path propertiesPath = Path.of(SAMPLES_FOLDER.toString(), fileName + ".props"); + logger.debug("Saving audio file: {}", audioPath); + FileOutputStream audioFileOutputStream = new FileOutputStream(audioPath.toFile()); + AudioSystem.write(audioInputStreamTemp, AudioFileFormat.Type.WAVE, audioFileOutputStream); + audioFileOutputStream.close(); + String properties = "transcription=" + transcription + "\nlanguage=" + language + "\n"; + logger.debug("Saving properties file: {}", propertiesPath); + FileOutputStream propertiesFileOutputStream = new FileOutputStream(propertiesPath.toFile()); + propertiesFileOutputStream.write(properties.getBytes(StandardCharsets.UTF_8)); + propertiesFileOutputStream.close(); + } catch (IOException e) { + logger.warn("Unable to store sample.", e); + } + } + + private void onWhisperLog(String text) { + logger.debug("[whisper.cpp] {}", text); 
+ } +} diff --git a/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/utils/VAD.java b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/utils/VAD.java new file mode 100644 index 00000000000..cf505c39205 --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/utils/VAD.java @@ -0,0 +1,95 @@ +/** + * Copyright (c) 2010-2024 Contributors to the openHAB project + * + * See the NOTICE file(s) distributed with this work for additional + * information. + * + * This program and the accompanying materials are made available under the + * terms of the Eclipse Public License 2.0 which is available at + * http://www.eclipse.org/legal/epl-2.0 + * + * SPDX-License-Identifier: EPL-2.0 + */ +package org.openhab.voice.whisperstt.internal.utils; + +import java.io.IOException; + +import org.eclipse.jdt.annotation.NonNullByDefault; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.github.givimad.libfvadjni.VoiceActivityDetector; + +/** + * The {@link VAD} class is a voice activity detector implementation over libfvad-jni. + * + * @author Miguel Álvarez - Initial contribution + */ +@NonNullByDefault +public class VAD implements AutoCloseable { + private final Logger logger = LoggerFactory.getLogger(VAD.class); + private final VoiceActivityDetector libfvad; + private final short[] stepSamples; + private final int totalPartialDetections; + private final int detectionThreshold; + + /** + * + * @param mode desired vad mode. + * @param sampleRate audio sample rate. + * @param frameSize detector input frame size. + * @param stepMs detector partial step ms. + * @param sensitivity detector sensitivity percent in range 0 - 1. + * @throws IOException + */ + public VAD(VoiceActivityDetector.Mode mode, int sampleRate, int frameSize, int stepMs, float sensitivity) + throws IOException { + this.libfvad = VoiceActivityDetector.newInstance(); + this.libfvad.setMode(mode); + this.libfvad.setSampleRate(VoiceActivityDetector.SampleRate.fromValue(sampleRate)); + this.stepSamples = new short[sampleRate / 1000 * stepMs]; + this.totalPartialDetections = (frameSize / stepSamples.length); + this.detectionThreshold = (int) ((((float) totalPartialDetections) / 100f) * (sensitivity * 100)); + } + + public VADResult analyze(short[] samples) throws IOException { + int voiceInHead = 0; + int voiceInTail = 0; + boolean silenceFound = false; + int partialVADCounter = 0; + for (int i = 0; i < totalPartialDetections; i++) { + System.arraycopy(samples, i * stepSamples.length, stepSamples, 0, stepSamples.length); + if (libfvad.process(stepSamples, stepSamples.length)) { + partialVADCounter++; + if (!silenceFound) { + voiceInHead++; + } + voiceInTail++; + } else { + silenceFound = true; + voiceInTail = 0; + } + } + logger.debug("VAD: {}/{} - required: {}", partialVADCounter, totalPartialDetections, detectionThreshold); + return new VADResult( // + partialVADCounter >= detectionThreshold, // + voiceInHead * stepSamples.length, // + voiceInTail * stepSamples.length // + ); + } + + @Override + public void close() { + libfvad.close(); + } + + /** + * Voice activity detection result. 
+ * + * @param isVoice Does the block contain enough voice + * @param voiceSamplesInHead Number of samples consecutively reported as voice from the beginning of the chunk + * @param voiceSamplesInTail Number of samples consecutively reported as voice from the end of the chunk + */ + public record VADResult(boolean isVoice, int voiceSamplesInHead, int voiceSamplesInTail) { + } +} diff --git a/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/addon/addon.xml b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/addon/addon.xml new file mode 100644 index 00000000000..aded2d3872c --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/addon/addon.xml @@ -0,0 +1,15 @@ + + + + voice + Whisper Speech-to-Text + Whisper STT Service uses the whisper.cpp library to transcript audio data to text. + none + + org.openhab.voice.whisperstt + + + + diff --git a/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/config/config.xml b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/config/config.xml new file mode 100644 index 00000000000..c1f08cfb15c --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/config/config.xml @@ -0,0 +1,229 @@ + + + + + + Configure Speech to Text. + + + + Configure the VAD mechanisim used to isolate single phrases to feed whisper with. + + + + Configure the whisper.cpp transcription options. + + + + Define a grammar to improve transcrptions. + + + + Configure service information messages. + + + + Options added for developers. + true + + + + Model name without extension. + + + + Keep the model loaded. If the parameter is set to true, the model will be reloaded only on + configuration + updates. If the model is not loaded when needed, the service will try to load it. If the parameter is + set to false, + the model will be loaded and unloaded on each run. + + false + + + + When enabled recognition stops listening after a single utterance. + true + true + + + + Min transcription seconds passed to whisper. + 2 + true + + + + Seconds to force transcription before silence detection. + 10 + + + + Max initial seconds of silence to discard transcription. + 3 + + + + Seconds of silence to trigger transcription. + 0.5 + + + + Remove silence frames from the beginning and end of the audio. + true + true + + + + Audio step for the voice activity detection. + 1 + + + + + + + + + true + + + + Percentage in range 0-1 of voice activity in each audio step analyzed to consider it as voice. + 0.3 + + + + Available VAD modes. Quality is the most likely to detect voice. + VERY_AGGRESSIVE + + + + + + + true + + + + Audio milliseconds passed to the voice activity detector. Defines how much times the voice activity + detector is executed per audio step. + 20 + + + + + + true + + + + Number of threads used by whisper. (0 to use host max threads) + 0 + + + + Overwrite the audio context size. (0 to use whisper default context size) + 0 + true + + + + Sampling strategy used. + BEAN_SEARCH + + + + + + + + Beam Size configuration for sampling strategy Bean Search. + 2 + + + + Best Of configuration for sampling strategy Greedy. (-1 for unlimited) + -1 + + + + Temperature threshold. + 0 + + + + Initial prompt to feed whisper with. + true + + + + Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect) + true + CPU + + + + Speed up audio by x2. (reduced accuracy) + false + true + + + + Enables GPU usage. 
(built-in binaries do not support GPU usage, this has no effect) + true + true + + + + Enables grammar usage. + false + + + + Penalty for non grammar tokens when using grammar. + 100 + + + + Grammar to use in GBNF format. (BNF variant used by whisper.cpp). + + + + + Message to be told when no results. (Empty for disabled) + Sorry, I didn't understand you + + + + Message to be told when an error has happened. (Empty for disabled) + Sorry, something went wrong + + + + Create WAV audio record on each whisper execution. + false + true + + + + Defines the sample type and bit-size used by the created WAV audio record. + i16 + + + + + true + + + + Emit whisper.cpp library logs as add-on debug logs. + false + true + + + diff --git a/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/i18n/whisperstt.properties b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/i18n/whisperstt.properties new file mode 100644 index 00000000000..0780316715b --- /dev/null +++ b/bundles/org.openhab.voice.whisperstt/src/main/resources/OH-INF/i18n/whisperstt.properties @@ -0,0 +1,94 @@ +# add-on + +addon.whisperstt.name = Whisper Speech-to-Text +addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcript audio data to text. + +voice.config.whisperstt.audioContext.label = Audio Context +voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size) +voice.config.whisperstt.beamSize.label = Beam Size +voice.config.whisperstt.beamSize.description = Beam Size configuration for sampling strategy Bean Search. +voice.config.whisperstt.createWAVRecord.label = Create WAV Record +voice.config.whisperstt.createWAVRecord.description = Create WAV audio record on each whisper execution. +voice.config.whisperstt.enableWhisperLog.label = Enable Whisper Log +voice.config.whisperstt.enableWhisperLog.description = Emit whisper.cpp library logs as add-on debug logs. +voice.config.whisperstt.noResultsMessage.label = No Results Message +voice.config.whisperstt.noResultsMessage.description = Message to be told when no results. (Empty for disabled) +voice.config.whisperstt.errorMessage.label = Error Message +voice.config.whisperstt.errorMessage.description = Message to be told when an error has happened. (Empty for disabled) +voice.config.whisperstt.grammarLines.label = Grammar +voice.config.whisperstt.grammarLines.description = Grammar to use in GBNF format. (BNF variant used by whisper.cpp). +voice.config.whisperstt.grammarPenalty.label = Grammar Penalty +voice.config.whisperstt.grammarPenalty.description = Penalty for non grammar tokens when using grammar. +voice.config.whisperstt.greedyBestOf.label = Greedy Best Of +voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sampling strategy Greedy. (-1 for unlimited) +voice.config.whisperstt.group.developer.label = Developer +voice.config.whisperstt.group.developer.description = Options added for developers. +voice.config.whisperstt.group.grammar.label = Grammar +voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcrptions. +voice.config.whisperstt.group.messages.label = Info Messages +voice.config.whisperstt.group.messages.description = Configure service information messages. +voice.config.whisperstt.group.stt.label = STT Configuration +voice.config.whisperstt.group.stt.description = Configure Speech to Text. 
+voice.config.whisperstt.group.vad.label = Voice Activity Detection +voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with. +voice.config.whisperstt.group.whisper.label = Whisper Options +voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options. +voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds +voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription. +voice.config.whisperstt.initialPrompt.label = Initial Prompt +voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with. +voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds +voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection. +voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds +voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription. +voice.config.whisperstt.minSeconds.label = Min Transcription Seconds +voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper. +voice.config.whisperstt.modelName.label = Model Name +voice.config.whisperstt.modelName.description = Model name without extension. +voice.config.whisperstt.openvinoDevice.label = OpenVINO Device +voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect) +voice.config.whisperstt.preloadModel.label = Preload Model +voice.config.whisperstt.preloadModel.description = Keep the model loaded. If the parameter is set to true, the model will be reloaded only on configuration updates. If the model is not loaded when needed, the service will try to load it. If the parameter is set to false, the model will be loaded and unloaded on each run. +voice.config.whisperstt.recordSampleFormat.label = Record Sample Format +voice.config.whisperstt.recordSampleFormat.description = Defines the sample type and bit-size used by the created WAV audio record. +voice.config.whisperstt.recordSampleFormat.option.i16 = Integer 16bit +voice.config.whisperstt.recordSampleFormat.option.f32 = Float 32bit +voice.config.whisperstt.removeSilence.label = Remove Silence +voice.config.whisperstt.removeSilence.description = Remove silence frames from the beginning and end of the audio. +voice.config.whisperstt.samplingStrategy.label = Sampling Strategy +voice.config.whisperstt.samplingStrategy.description = Sampling strategy used. +voice.config.whisperstt.samplingStrategy.option.GREEDY = Greedy +voice.config.whisperstt.samplingStrategy.option.BEAN_SEARCH = Beam Search +voice.config.whisperstt.singleUtteranceMode.label = Single Utterance Mode +voice.config.whisperstt.singleUtteranceMode.description = When enabled, recognition stops listening after a single utterance. +voice.config.whisperstt.speedUp.label = Speed Up +voice.config.whisperstt.speedUp.description = Speed up audio by x2. (reduced accuracy) +voice.config.whisperstt.stepSeconds.label = Audio Step +voice.config.whisperstt.stepSeconds.description = Audio step for the voice activity detection.
+voice.config.whisperstt.stepSeconds.option.0.1 = 100ms +voice.config.whisperstt.stepSeconds.option.0.2 = 200ms +voice.config.whisperstt.stepSeconds.option.0.3 = 300ms +voice.config.whisperstt.stepSeconds.option.0.5 = 500ms +voice.config.whisperstt.stepSeconds.option.0.6 = 600ms +voice.config.whisperstt.stepSeconds.option.1 = 1s +voice.config.whisperstt.temperature.label = Temperature +voice.config.whisperstt.temperature.description = Temperature threshold. +voice.config.whisperstt.threads.label = Threads +voice.config.whisperstt.threads.description = Number of threads used by whisper. (0 to use host max threads) +voice.config.whisperstt.useGPU.label = Use GPU +voice.config.whisperstt.useGPU.description = Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect) +voice.config.whisperstt.useGrammar.label = Use Grammar +voice.config.whisperstt.useGrammar.description = Enables grammar usage. +voice.config.whisperstt.vadMode.label = Voice Activity Detection Mode +voice.config.whisperstt.vadMode.description = Available VAD modes. Quality is the mode most likely to detect voice. +voice.config.whisperstt.vadMode.option.QUALITY = Quality +voice.config.whisperstt.vadMode.option.LOW_BITRATE = Low Bitrate +voice.config.whisperstt.vadMode.option.AGGRESSIVE = Aggressive +voice.config.whisperstt.vadMode.option.VERY_AGGRESSIVE = Very Aggressive +voice.config.whisperstt.vadSensitivity.label = Voice Activity Detection Sensitivity +voice.config.whisperstt.vadSensitivity.description = Percentage in range 0-1 of voice activity required in each analyzed audio step to consider it voice. +voice.config.whisperstt.vadStep.label = Voice Activity Detector Step +voice.config.whisperstt.vadStep.description = Audio milliseconds passed to the voice activity detector. Defines how many times the voice activity detector is executed per audio step. +voice.config.whisperstt.vadStep.option.10 = 10ms +voice.config.whisperstt.vadStep.option.20 = 20ms +voice.config.whisperstt.vadStep.option.30 = 30ms diff --git a/bundles/pom.xml b/bundles/pom.xml index b0c818cc844..8eb9e1456cd 100644 --- a/bundles/pom.xml +++ b/bundles/pom.xml @@ -471,6 +471,7 @@ org.openhab.voice.voicerss org.openhab.voice.voskstt org.openhab.voice.watsonstt + org.openhab.voice.whisperstt
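For readers skimming the diff: the `VADResult` record added above reports, per analyzed audio chunk, whether it contains enough voice and how many samples at the head and tail were consecutively flagged as voice, which is the information the "Remove Silence" option builds on. The snippet below is a minimal, self-contained sketch of how such head/tail counts could be used to trim boundary silence; the class name, method names, and the exact interpretation of the counts are illustrative assumptions, not the add-on's actual implementation.

```java
import java.util.Arrays;

// Illustrative only: not part of the add-on's source code.
final class SilenceTrimExample {

    // Hypothetical mirror of the add-on's VADResult record.
    record VADResult(boolean isVoice, int voiceSamplesInHead, int voiceSamplesInTail) {
    }

    // Keep only the trailing voiced samples of the first chunk of an utterance,
    // dropping the leading silence.
    static short[] trimLeadingSilence(short[] chunk, VADResult vad) {
        int keep = Math.min(vad.voiceSamplesInTail(), chunk.length);
        return Arrays.copyOfRange(chunk, chunk.length - keep, chunk.length);
    }

    // Keep only the leading voiced samples of the last chunk of an utterance,
    // dropping the trailing silence.
    static short[] trimTrailingSilence(short[] chunk, VADResult vad) {
        int keep = Math.min(vad.voiceSamplesInHead(), chunk.length);
        return Arrays.copyOfRange(chunk, 0, keep);
    }

    public static void main(String[] args) {
        short[] chunk = new short[16000]; // 1 s of 16 kHz audio: silence followed by voice
        VADResult vad = new VADResult(true, 0, 4000); // last 4000 samples reported as voice
        short[] trimmed = trimLeadingSilence(chunk, vad);
        System.out.println("kept " + trimmed.length + " of " + chunk.length + " samples");
    }
}
```

In a pipeline following this sketch, `trimLeadingSilence` would be applied to the first voiced chunk and `trimTrailingSilence` to the last one before the samples are handed to whisper, so only the boundary chunks pay the trimming cost.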