Mirror of https://github.com/openhab/openhab-addons.git (synced 2025-01-25 14:55:55 +01:00)
Merge 5487ef17bc into adacdebb9f
Commit fe8f06aa09
@ -5,6 +5,8 @@ It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity det
|
|||||||
|
|
||||||
[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a high-optimized lightweight c++ implementation of [whisper](https://github.com/openai/whisper) that allows to easily integrate it in different platforms and applications.
|
[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a high-optimized lightweight c++ implementation of [whisper](https://github.com/openai/whisper) that allows to easily integrate it in different platforms and applications.
|
||||||
|
|
||||||
|
Alternatively, if you do not want to perform speech-to-text on the computer hosting openHAB, this add-on can consume an OpenAI/Whisper compatible transcription API.
|
||||||
|
|
||||||
Whisper enables speech recognition for multiple languages and dialects:
|
Whisper enables speech recognition for multiple languages and dialects:
|
||||||
|
|
||||||
english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
|
english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
|
||||||
@ -15,9 +17,11 @@ marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, geo
|
|||||||
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
|
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
|
||||||
hausa, bashkir, javanese and sundanese.
|
hausa, bashkir, javanese and sundanese.
|
||||||
|
|
||||||
## Supported platforms
|
## Local mode (offline)
|
||||||
|
|
||||||
This add-on uses some native binaries to work.
|
### Supported platforms
|
||||||
|
|
||||||
|
This add-on uses some native binaries to work when performing offline recognition.
|
||||||
You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
|
You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
|
||||||
|
|
||||||
The following platforms are supported:
|
The following platforms are supported:
|
||||||
@ -28,7 +32,7 @@ The following platforms are supported:
|
|||||||
|
|
||||||
The native binaries for those platforms are included in this add-on provided with the openHAB distribution.
|
The native binaries for those platforms are included in this add-on provided with the openHAB distribution.
|
||||||
|
|
||||||
## CPU compatibility
|
### CPU compatibility
|
||||||
|
|
||||||
To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU.
|
To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU.
|
||||||
The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds.
|
The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds.
|
||||||
@ -40,18 +44,18 @@ You can check those flags on Windows using a program like `CPU-Z`.
|
|||||||
If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`.
|
If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`.
|
||||||
You can check those flags on linux using the terminal with `lscpu`.
|
You can check those flags on linux using the terminal with `lscpu`.
|
||||||
|
|
||||||
## Transcription time
|
### Transcription time
|
||||||
|
|
||||||
On a Raspberry PI 5, the approximate transcription times are:
|
On a Raspberry PI 5, the approximate transcription times are:
|
||||||
|
|
||||||
| model | exec time |
|
| model | exec time |
|
||||||
| ---------- | --------: |
|
|------------|----------:|
|
||||||
| tiny.bin | 1.5s |
|
| tiny.bin | 1.5s |
|
||||||
| base.bin | 3s |
|
| base.bin | 3s |
|
||||||
| small.bin | 8.5s |
|
| small.bin | 8.5s |
|
||||||
| medium.bin | 17s |
|
| medium.bin | 17s |
|
||||||
|
|
||||||
## Configuring the model
|
### Configuring the model
|
||||||
|
|
||||||
Before you can use this service you should configure your model.
|
Before you can use this service you should configure your model.
|
||||||
|
|
||||||
@ -64,7 +68,7 @@ You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so
|
|||||||
|
|
||||||
Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link.
|
Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link.
|
||||||
|
|
||||||
## Using alternative whisper.cpp library
|
### Using alternative whisper.cpp library
|
||||||
|
|
||||||
It's possible to use your own build of the whisper.cpp shared library with this add-on.
|
It's possible to use your own build of the whisper.cpp shared library with this add-on.
|
||||||
|
|
||||||
@ -76,7 +80,7 @@ In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can fi
|
|||||||
|
|
||||||
Note: You need to restart openHAB to reload the library.
|
Note: You need to restart openHAB to reload the library.
|
||||||
|
|
||||||
## Grammar
|
### Grammar
|
||||||
|
|
||||||
The whisper.cpp library allows to define a grammar to alter the transcription results without fine-tuning the model.
|
The whisper.cpp library allows to define a grammar to alter the transcription results without fine-tuning the model.
|
||||||
|
|
||||||
@ -99,6 +103,14 @@ tv_channel ::= ("set ")? "tv channel to " [0-9]+
|
|||||||
|
|
||||||
You can provide the grammar and enable its usage using the binding configuration.
|
You can provide the grammar and enable its usage using the binding configuration.
|
||||||
|
|
||||||
|
## API mode
|
||||||
|
|
||||||
|
You can also use this add-on with a remote API that is compatible with the 'transcription' API from OpenAI. Online services exposing such an API may require an API key (paid services, such as OpenAI).
|
||||||
|
|
||||||
|
You can host your own compatible service elsewhere on your network, with third-party software such as faster-whisper-server.
|
||||||
|
|
||||||
|
Please note that API mode also uses libfvad for voice activity detection, and that grammar parameters are not available.
|
||||||
|
|
||||||
## Configuration
|
## Configuration
|
||||||
|
|
||||||
Use your favorite configuration UI to edit the Whisper settings:
|
Use your favorite configuration UI to edit the Whisper settings:
|
||||||
@ -107,6 +119,7 @@ Use your favorite configuration UI to edit the Whisper settings:
|
|||||||
|
|
||||||
General options.
|
General options.
|
||||||
|
|
||||||
|
- **Mode** - LOCAL or API. Choose either local computation or a remote API.
|
||||||
- **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
|
- **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
|
||||||
- **Preload Model** - Keep whisper model loaded.
|
- **Preload Model** - Keep whisper model loaded.
|
||||||
- **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
|
- **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
|
||||||
@ -139,6 +152,13 @@ Configure whisper options.
|
|||||||
- **Initial Prompt** - Initial prompt for whisper.
|
- **Initial Prompt** - Initial prompt for whisper.
|
||||||
- **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
|
- **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
|
||||||
- **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
|
- **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
|
||||||
|
- **Language** - If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.
|
||||||
|
|
||||||
|
### API Configuration
|
||||||
|
|
||||||
|
- **API key** - Optional API key, for online services that require one.
|
||||||
|
- **API url** - You may use your own service and define its URL here. Defaults to the OpenAI transcription API.
|
||||||
|
- **API model name** - Your hosted service may offer other models. Defaults to OpenAI's only model, 'whisper-1'.
|
||||||
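The API options above map onto a multipart form upload. The following is a minimal sketch of such a request body, not the add-on's implementation (the imports in this commit suggest it relies on Jetty's `MultiPartContentProvider` instead). The class `TranscriptionRequestSketch` and its methods are hypothetical names; the form fields `file`, `model`, `language` and `response_format` are those of the OpenAI transcription endpoint, and a self-hosted service is assumed to accept the same ones.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the add-on: builds a multipart/form-data body. */
final class TranscriptionRequestSketch {

    /** Text fields of the request; the language is only sent when one is configured. */
    static Map<String, String> fields(String apiModelName, String language) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("model", apiModelName);
        fields.put("response_format", "text");
        if (!language.isBlank()) {
            fields.put("language", language);
        }
        return fields;
    }

    /** Encodes the text fields followed by one WAV file part, using the given boundary. */
    static byte[] buildBody(String boundary, Map<String, String> fields, String fileName, byte[] wavBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            write(out, "--" + boundary + "\r\n");
            write(out, "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n");
            write(out, field.getValue() + "\r\n");
        }
        write(out, "--" + boundary + "\r\n");
        write(out, "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n");
        write(out, "Content-Type: audio/wav\r\n\r\n");
        out.write(wavBytes, 0, wavBytes.length);
        // closing delimiter ends the multipart body
        write(out, "\r\n--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }
}
```

The resulting bytes would be sent as a POST to the configured API url with a `Content-Type: multipart/form-data; boundary=...` header and, when an API key is set, an `Authorization: Bearer <key>` header.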
|
|
||||||
### Grammar Configuration
|
### Grammar Configuration
|
||||||
|
|
||||||
@ -199,7 +219,9 @@ In case you would like to set up the service via a text file, create a new file
|
|||||||
Its contents should look similar to:
|
Its contents should look similar to:
|
||||||
|
|
||||||
```ini
|
```ini
|
||||||
|
org.openhab.voice.whisperstt:mode=LOCAL
|
||||||
org.openhab.voice.whisperstt:modelName=tiny
|
org.openhab.voice.whisperstt:modelName=tiny
|
||||||
|
org.openhab.voice.whisperstt:language=en
|
||||||
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
|
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
|
||||||
org.openhab.voice.whisperstt:removeSilence=true
|
org.openhab.voice.whisperstt:removeSilence=true
|
||||||
org.openhab.voice.whisperstt:stepSeconds=0.3
|
org.openhab.voice.whisperstt:stepSeconds=0.3
|
||||||
@ -229,6 +251,9 @@ org.openhab.voice.whisperstt:useGPU=false
|
|||||||
org.openhab.voice.whisperstt:useGrammar=false
|
org.openhab.voice.whisperstt:useGrammar=false
|
||||||
org.openhab.voice.whisperstt:grammarPenalty=80.0
|
org.openhab.voice.whisperstt:grammarPenalty=80.0
|
||||||
org.openhab.voice.whisperstt:grammarLines=
|
org.openhab.voice.whisperstt:grammarLines=
|
||||||
|
org.openhab.voice.whisperstt:apiKey=mykeyaaaa
|
||||||
|
org.openhab.voice.whisperstt:apiUrl=https://api.openai.com/v1/audio/transcriptions
|
||||||
|
org.openhab.voice.whisperstt:apiModelName=whisper-1
|
||||||
```
|
```
|
||||||
|
|
||||||
### Default Speech-to-Text Configuration
|
### Default Speech-to-Text Configuration
|
||||||
|
@ -146,4 +146,29 @@ public class WhisperSTTConfiguration {
|
|||||||
* Print whisper.cpp library logs as binding debug logs.
|
* Print whisper.cpp library logs as binding debug logs.
|
||||||
*/
|
*/
|
||||||
public boolean enableWhisperLog;
|
public boolean enableWhisperLog;
|
||||||
|
/**
|
||||||
|
* LOCAL to use the embedded whisper.cpp library, API to use an external OpenAI-compatible API
|
||||||
|
*/
|
||||||
|
public Mode mode = Mode.LOCAL;
|
||||||
|
/**
|
||||||
|
* URL of the transcription endpoint, used when mode is set to API
|
||||||
|
*/
|
||||||
|
public String apiUrl = "https://api.openai.com/v1/audio/transcriptions";
|
||||||
|
/**
|
||||||
|
* API key used to access apiUrl when mode is set to API (may be empty)
|
||||||
|
*/
|
||||||
|
public String apiKey = "";
|
||||||
|
/**
|
||||||
|
* If specified, speed up recognition by avoiding auto-detection
|
||||||
|
*/
|
||||||
|
public String language = "";
|
||||||
|
/**
|
||||||
|
* Model name (API only)
|
||||||
|
*/
|
||||||
|
public String apiModelName = "whisper-1";
|
||||||
|
|
||||||
|
public static enum Mode {
|
||||||
|
LOCAL,
|
||||||
|
API;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
@ -12,12 +12,10 @@
|
|||||||
*/
|
*/
|
||||||
package org.openhab.voice.whisperstt.internal;
|
package org.openhab.voice.whisperstt.internal;
|
||||||
|
|
||||||
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
|
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.*;
|
||||||
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
|
|
||||||
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
|
|
||||||
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
|
|
||||||
|
|
||||||
import java.io.ByteArrayInputStream;
|
import java.io.ByteArrayInputStream;
|
||||||
|
import java.io.ByteArrayOutputStream;
|
||||||
import java.io.FileOutputStream;
|
import java.io.FileOutputStream;
|
||||||
import java.io.IOException;
|
import java.io.IOException;
|
||||||
import java.nio.ByteBuffer;
|
import java.nio.ByteBuffer;
|
||||||
@ -32,7 +30,9 @@ import java.util.Date;
|
|||||||
import java.util.Locale;
|
import java.util.Locale;
|
||||||
import java.util.Map;
|
import java.util.Map;
|
||||||
import java.util.Set;
|
import java.util.Set;
|
||||||
|
import java.util.concurrent.ExecutionException;
|
||||||
import java.util.concurrent.ScheduledExecutorService;
|
import java.util.concurrent.ScheduledExecutorService;
|
||||||
|
import java.util.concurrent.TimeoutException;
|
||||||
import java.util.concurrent.atomic.AtomicBoolean;
|
import java.util.concurrent.atomic.AtomicBoolean;
|
||||||
|
|
||||||
import javax.sound.sampled.AudioFileFormat;
|
import javax.sound.sampled.AudioFileFormat;
|
||||||
@ -41,6 +41,13 @@ import javax.sound.sampled.AudioSystem;
|
|||||||
|
|
||||||
import org.eclipse.jdt.annotation.NonNullByDefault;
|
import org.eclipse.jdt.annotation.NonNullByDefault;
|
||||||
import org.eclipse.jdt.annotation.Nullable;
|
import org.eclipse.jdt.annotation.Nullable;
|
||||||
|
import org.eclipse.jetty.client.HttpClient;
|
||||||
|
import org.eclipse.jetty.client.api.ContentResponse;
|
||||||
|
import org.eclipse.jetty.client.api.Request;
|
||||||
|
import org.eclipse.jetty.client.util.InputStreamContentProvider;
|
||||||
|
import org.eclipse.jetty.client.util.MultiPartContentProvider;
|
||||||
|
import org.eclipse.jetty.client.util.StringContentProvider;
|
||||||
|
import org.eclipse.jetty.http.HttpMethod;
|
||||||
import org.openhab.core.OpenHAB;
|
import org.openhab.core.OpenHAB;
|
||||||
import org.openhab.core.audio.AudioFormat;
|
import org.openhab.core.audio.AudioFormat;
|
||||||
import org.openhab.core.audio.AudioStream;
|
import org.openhab.core.audio.AudioStream;
|
||||||
@ -48,6 +55,7 @@ import org.openhab.core.audio.utils.AudioWaveUtils;
|
|||||||
import org.openhab.core.common.ThreadPoolManager;
|
import org.openhab.core.common.ThreadPoolManager;
|
||||||
import org.openhab.core.config.core.ConfigurableService;
|
import org.openhab.core.config.core.ConfigurableService;
|
||||||
import org.openhab.core.config.core.Configuration;
|
import org.openhab.core.config.core.Configuration;
|
||||||
|
import org.openhab.core.io.net.http.HttpClientFactory;
|
||||||
import org.openhab.core.io.rest.LocaleService;
|
import org.openhab.core.io.rest.LocaleService;
|
||||||
import org.openhab.core.voice.RecognitionStartEvent;
|
import org.openhab.core.voice.RecognitionStartEvent;
|
||||||
import org.openhab.core.voice.RecognitionStopEvent;
|
import org.openhab.core.voice.RecognitionStopEvent;
|
||||||
@ -57,6 +65,7 @@ import org.openhab.core.voice.STTService;
|
|||||||
import org.openhab.core.voice.STTServiceHandle;
|
import org.openhab.core.voice.STTServiceHandle;
|
||||||
import org.openhab.core.voice.SpeechRecognitionErrorEvent;
|
import org.openhab.core.voice.SpeechRecognitionErrorEvent;
|
||||||
import org.openhab.core.voice.SpeechRecognitionEvent;
|
import org.openhab.core.voice.SpeechRecognitionEvent;
|
||||||
|
import org.openhab.voice.whisperstt.internal.WhisperSTTConfiguration.Mode;
|
||||||
import org.openhab.voice.whisperstt.internal.utils.VAD;
|
import org.openhab.voice.whisperstt.internal.utils.VAD;
|
||||||
import org.osgi.framework.Constants;
|
import org.osgi.framework.Constants;
|
||||||
import org.osgi.service.component.annotations.Activate;
|
import org.osgi.service.component.annotations.Activate;
|
||||||
@ -96,10 +105,13 @@ public class WhisperSTTService implements STTService {
|
|||||||
private @Nullable WhisperContext context;
|
private @Nullable WhisperContext context;
|
||||||
private @Nullable WhisperGrammar grammar;
|
private @Nullable WhisperGrammar grammar;
|
||||||
private @Nullable WhisperJNI whisper;
|
private @Nullable WhisperJNI whisper;
|
||||||
|
private boolean isWhisperLibAlreadyLoaded = false;
|
||||||
|
private final HttpClientFactory httpClientFactory;
|
||||||
|
|
||||||
@Activate
|
@Activate
|
||||||
public WhisperSTTService(@Reference LocaleService localeService) {
|
public WhisperSTTService(@Reference LocaleService localeService, @Reference HttpClientFactory httpClientFactory) {
|
||||||
this.localeService = localeService;
|
this.localeService = localeService;
|
||||||
|
this.httpClientFactory = httpClientFactory;
|
||||||
}
|
}
|
||||||
|
|
||||||
@Activate
|
@Activate
|
||||||
@ -108,7 +120,8 @@ public class WhisperSTTService implements STTService {
|
|||||||
if (!Files.exists(WHISPER_FOLDER)) {
|
if (!Files.exists(WHISPER_FOLDER)) {
|
||||||
Files.createDirectory(WHISPER_FOLDER);
|
Files.createDirectory(WHISPER_FOLDER);
|
||||||
}
|
}
|
||||||
WhisperJNI.loadLibrary(getLoadOptions());
|
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
|
||||||
|
loadWhisperLibraryIfNeeded();
|
||||||
VoiceActivityDetector.loadLibrary();
|
VoiceActivityDetector.loadLibrary();
|
||||||
whisper = new WhisperJNI();
|
whisper = new WhisperJNI();
|
||||||
} catch (IOException | RuntimeException e) {
|
} catch (IOException | RuntimeException e) {
|
||||||
@ -117,6 +130,13 @@ public class WhisperSTTService implements STTService {
|
|||||||
configChange(config);
|
configChange(config);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
private void loadWhisperLibraryIfNeeded() throws IOException {
|
||||||
|
if (config.mode == Mode.LOCAL && !isWhisperLibAlreadyLoaded) {
|
||||||
|
WhisperJNI.loadLibrary(getLoadOptions());
|
||||||
|
isWhisperLibAlreadyLoaded = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
private WhisperJNI.LoadOptions getLoadOptions() {
|
private WhisperJNI.LoadOptions getLoadOptions() {
|
||||||
Path libFolder = Paths.get("/usr/local/lib");
|
Path libFolder = Paths.get("/usr/local/lib");
|
||||||
Path libFolderWin = Paths.get("/Windows/System32");
|
Path libFolderWin = Paths.get("/Windows/System32");
|
||||||
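The `loadWhisperLibraryIfNeeded` guard in the hunk above loads the native library at most once, and only in local mode. A minimal sketch of the same pattern follows, with the method synchronized on the assumption that activation and configuration updates may arrive on different threads. `OnceLoaderSketch` and `loadAction` are hypothetical names standing in for the service and for `WhisperJNI.loadLibrary(...)`.

```java
/** Hypothetical illustration of a load-once guard, not part of the add-on. */
final class OnceLoaderSketch {
    private final Runnable loadAction;
    private boolean loaded = false;

    OnceLoaderSketch(Runnable loadAction) {
        this.loadAction = loadAction;
    }

    /** Runs the load action the first time it is called in local mode, then never again. */
    synchronized void loadIfNeeded(boolean localMode) {
        if (localMode && !loaded) {
            loadAction.run();
            // only mark as loaded once the action returned without throwing
            loaded = true;
        }
    }

    synchronized boolean isLoaded() {
        return loaded;
    }
}
```

Setting the flag after the action returns means a failed load is retried on the next configuration change, which matches the ordering used in the diff.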
@ -167,14 +187,27 @@ public class WhisperSTTService implements STTService {
|
|||||||
|
|
||||||
private void configChange(Map<String, Object> config) {
|
private void configChange(Map<String, Object> config) {
|
||||||
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
|
this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
|
||||||
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
|
|
||||||
WhisperGrammar grammar = this.grammar;
|
WhisperGrammar grammar = this.grammar;
|
||||||
if (grammar != null) {
|
if (grammar != null) {
|
||||||
grammar.close();
|
grammar.close();
|
||||||
this.grammar = null;
|
this.grammar = null;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// API mode
|
||||||
|
if (this.config.mode == Mode.API) {
|
||||||
|
try {
|
||||||
|
unloadContext();
|
||||||
|
} catch (IOException e) {
|
||||||
|
logger.warn("IOException unloading model: {}", e.getMessage());
|
||||||
|
}
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Local mode
|
||||||
WhisperJNI whisper;
|
WhisperJNI whisper;
|
||||||
try {
|
try {
|
||||||
|
loadWhisperLibraryIfNeeded();
|
||||||
|
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
|
||||||
whisper = getWhisper();
|
whisper = getWhisper();
|
||||||
} catch (IOException ignored) {
|
} catch (IOException ignored) {
|
||||||
logger.warn("library not loaded, the add-on will not work");
|
logger.warn("library not loaded, the add-on will not work");
|
||||||
@ -228,9 +261,17 @@ public class WhisperSTTService implements STTService {
|
|||||||
|
|
||||||
@Override
|
@Override
|
||||||
public Set<Locale> getSupportedLocales() {
|
public Set<Locale> getSupportedLocales() {
|
||||||
// as it is not possible to determine the language of the model that was downloaded and setup by the user, it is
|
// Attempt to create a locale from the configured language
|
||||||
// assumed the language of the model is matching the locale of the openHAB server
|
String language = config.language;
|
||||||
return Set.of(localeService.getLocale(null));
|
Locale modelLocale = localeService.getLocale(null);
|
||||||
|
if (!language.isBlank()) {
|
||||||
|
try {
|
||||||
|
modelLocale = Locale.forLanguageTag(language);
|
||||||
|
} catch (IllegalArgumentException e) {
|
||||||
|
logger.warn("Invalid language '{}', defaulting to server locale", language);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return Set.of(modelLocale);
|
||||||
}
|
}
|
||||||
|
|
||||||
@Override
|
@Override
|
||||||
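One detail of the `getSupportedLocales` change is worth illustrating: as far as the JDK documentation describes, `Locale.forLanguageTag` does not throw `IllegalArgumentException` for an ill-formed tag. It ignores the ill-formed subtags and can return a locale with an empty language. The sketch below shows one way to detect that case and fall back to the server locale. `LanguageSketch.resolve` is a hypothetical helper, not code from this commit.

```java
import java.util.Locale;

/** Hypothetical helper, not part of the add-on. */
final class LanguageSketch {

    /** Returns the locale for the configured tag, or the fallback when the tag is blank or unusable. */
    static Locale resolve(String configured, Locale fallback) {
        if (configured.isBlank()) {
            return fallback;
        }
        Locale parsed = Locale.forLanguageTag(configured.trim());
        // an ill-formed tag yields an empty language instead of an exception
        return parsed.getLanguage().isEmpty() ? fallback : parsed;
    }
}
```

With this check, a value such as `de-DE` resolves to a German locale, while a value that is not a language tag at all falls back to the locale passed in.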
@ -246,33 +287,18 @@ public class WhisperSTTService implements STTService {
|
|||||||
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
|
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
|
||||||
throws STTException {
|
throws STTException {
|
||||||
AtomicBoolean aborted = new AtomicBoolean(false);
|
AtomicBoolean aborted = new AtomicBoolean(false);
|
||||||
WhisperContext ctx = null;
|
|
||||||
WhisperState state = null;
|
|
||||||
try {
|
try {
|
||||||
var whisper = getWhisper();
|
|
||||||
ctx = getContext();
|
|
||||||
logger.debug("Creating whisper state...");
|
|
||||||
state = whisper.initState(ctx);
|
|
||||||
logger.debug("Whisper state created");
|
|
||||||
logger.debug("Creating VAD instance...");
|
logger.debug("Creating VAD instance...");
|
||||||
final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE);
|
final int nSamplesStep = (int) (config.stepSeconds * WHISPER_SAMPLE_RATE);
|
||||||
VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
|
VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
|
||||||
config.vadStep, config.vadSensitivity);
|
config.vadStep, config.vadSensitivity);
|
||||||
logger.debug("VAD instance created");
|
logger.debug("VAD instance created");
|
||||||
sttListener.sttEventReceived(new RecognitionStartEvent());
|
sttListener.sttEventReceived(new RecognitionStartEvent());
|
||||||
backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted);
|
backgroundRecognize(nSamplesStep, locale, sttListener, audioStream, vad, aborted);
|
||||||
} catch (IOException e) {
|
} catch (IOException e) {
|
||||||
if (ctx != null && !config.preloadModel) {
|
|
||||||
ctx.close();
|
|
||||||
}
|
|
||||||
if (state != null) {
|
|
||||||
state.close();
|
|
||||||
}
|
|
||||||
throw new STTException("Exception during initialization", e);
|
throw new STTException("Exception during initialization", e);
|
||||||
}
|
}
|
||||||
return () -> {
|
return () -> aborted.set(true);
|
||||||
aborted.set(true);
|
|
||||||
};
|
|
||||||
}
|
}
|
||||||
|
|
||||||
private WhisperJNI getWhisper() throws IOException {
|
private WhisperJNI getWhisper() throws IOException {
|
||||||
@ -339,9 +365,8 @@ public class WhisperSTTService implements STTService {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep,
|
private void backgroundRecognize(final int nSamplesStep, Locale locale, STTListener sttListener,
|
||||||
Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
|
AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
|
||||||
var releaseContext = !config.preloadModel;
|
|
||||||
final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
|
final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
|
||||||
final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
|
final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
|
||||||
final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
|
final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
|
||||||
@ -353,21 +378,17 @@ public class WhisperSTTService implements STTService {
|
|||||||
logger.debug("Max silence samples {}", nMaxSilenceSamples);
|
logger.debug("Max silence samples {}", nMaxSilenceSamples);
|
||||||
// used to store the step samples in libfvad wanted format 16-bit int
|
// used to store the step samples in libfvad wanted format 16-bit int
|
||||||
final short[] stepAudioSamples = new short[nSamplesStep];
|
final short[] stepAudioSamples = new short[nSamplesStep];
|
||||||
// used to store the full samples in whisper wanted format 32-bit float
|
// used to store the full retained samples for whisper
|
||||||
final float[] audioSamples = new float[nSamplesMax];
|
final short[] audioSamples = new short[nSamplesMax];
|
||||||
executor.submit(() -> {
|
executor.submit(() -> {
|
||||||
int audioSamplesOffset = 0;
|
int audioSamplesOffset = 0;
|
||||||
int silenceSamplesCounter = 0;
|
int silenceSamplesCounter = 0;
|
||||||
int nProcessedSamples = 0;
|
int nProcessedSamples = 0;
|
||||||
int numBytesRead;
|
|
||||||
boolean voiceDetected = false;
|
boolean voiceDetected = false;
|
||||||
String transcription = "";
|
String transcription = "";
|
||||||
String tempTranscription = "";
|
|
||||||
VAD.@Nullable VADResult lastVADResult;
|
|
||||||
VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
|
VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
|
||||||
try {
|
try {
|
||||||
try (state; //
|
try (audioStream; //
|
||||||
audioStream; //
|
|
||||||
vad) {
|
vad) {
|
||||||
if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
|
if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
|
||||||
AudioWaveUtils.removeFMT(audioStream);
|
AudioWaveUtils.removeFMT(audioStream);
|
||||||
@ -376,10 +397,9 @@ public class WhisperSTTService implements STTService {
|
|||||||
.order(ByteOrder.LITTLE_ENDIAN);
|
.order(ByteOrder.LITTLE_ENDIAN);
|
||||||
// init remaining to full capacity
|
// init remaining to full capacity
|
||||||
int remaining = captureBuffer.capacity();
|
int remaining = captureBuffer.capacity();
|
||||||
WhisperFullParams params = getWhisperFullParams(ctx, locale);
|
|
||||||
while (!aborted.get()) {
|
while (!aborted.get()) {
|
||||||
// read until no remaining so we get the complete step samples
|
// read until no remaining so we get the complete step samples
|
||||||
numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
|
int numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
|
||||||
remaining);
|
remaining);
|
||||||
if (aborted.get() || numBytesRead == -1) {
|
if (aborted.get() || numBytesRead == -1) {
|
||||||
break;
|
break;
|
||||||
@ -395,17 +415,15 @@ public class WhisperSTTService implements STTService {
|
|||||||
while (shortBuffer.hasRemaining()) {
|
while (shortBuffer.hasRemaining()) {
|
||||||
var position = shortBuffer.position();
|
var position = shortBuffer.position();
|
||||||
short i16BitSample = shortBuffer.get();
|
short i16BitSample = shortBuffer.get();
|
||||||
float f32BitSample = Float.min(1f,
|
|
||||||
Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
|
|
||||||
stepAudioSamples[position] = i16BitSample;
|
stepAudioSamples[position] = i16BitSample;
|
||||||
audioSamples[audioSamplesOffset++] = f32BitSample;
|
audioSamples[audioSamplesOffset++] = i16BitSample;
|
||||||
nProcessedSamples++;
|
nProcessedSamples++;
|
||||||
}
|
}
|
||||||
// run vad
|
// run vad
|
||||||
if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
|
if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
|
||||||
logger.debug("VAD: Skipping, max length reached");
|
logger.debug("VAD: Skipping, max length reached");
|
||||||
} else {
|
} else {
|
||||||
lastVADResult = vad.analyze(stepAudioSamples);
|
VAD.@Nullable VADResult lastVADResult = vad.analyze(stepAudioSamples);
|
||||||
if (lastVADResult.isVoice()) {
|
if (lastVADResult.isVoice()) {
|
||||||
voiceDetected = true;
|
voiceDetected = true;
|
||||||
logger.debug("VAD: voice detected");
|
logger.debug("VAD: voice detected");
|
||||||
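The hunk above keeps the captured audio as 16-bit samples, so the conversion to the 32-bit float format expected by whisper.cpp now happens later and only in local mode. A minimal sketch of that conversion, assuming signed 16-bit PCM input, is shown below. `PcmSketch.toFloat` is a hypothetical helper name; it converts only the samples actually recorded rather than the whole buffer.

```java
/** Hypothetical helper, not part of the add-on. */
final class PcmSketch {

    /** Converts the first {@code count} signed 16-bit samples to floats clamped to [-1, 1]. */
    static float[] toFloat(short[] samples, int count) {
        float[] out = new float[count];
        for (int i = 0; i < count; i++) {
            float scaled = samples[i] / (float) Short.MAX_VALUE;
            // Short.MIN_VALUE scales slightly below -1, so clamp both ends
            out[i] = Math.max(-1f, Math.min(1f, scaled));
        }
        return out;
    }
}
```

Keeping the buffer as `short[]` also suits API mode, where the samples can be written out as 16-bit PCM without a float round trip.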
@ -484,43 +502,26 @@ public class WhisperSTTService implements STTService {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
// run whisper
|
// run whisper, either locally or by remote API
|
||||||
logger.debug("running whisper with {} seconds of audio...",
|
String tempTranscription = (switch (config.mode) {
|
||||||
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
|
case LOCAL -> recognizeLocal(audioSamplesOffset, audioSamples, locale.getLanguage());
|
||||||
long execStartTime = System.currentTimeMillis();
|
case API -> recognizeAPI(audioSamplesOffset, audioSamples, locale.getLanguage());
|
||||||
var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset);
|
});
|
||||||
logger.debug("whisper ended in {}ms with result code {}",
|
|
||||||
System.currentTimeMillis() - execStartTime, result);
|
if (tempTranscription != null && !tempTranscription.isBlank()) {
|
||||||
// process result
|
|
||||||
if (result != 0) {
|
|
||||||
emitSpeechRecognitionError(sttListener);
|
|
||||||
break;
|
|
||||||
}
|
|
||||||
int nSegments = whisper.fullNSegmentsFromState(state);
|
|
||||||
logger.debug("Available transcription segments {}", nSegments);
|
|
||||||
if (nSegments == 1) {
|
|
||||||
tempTranscription = whisper.fullGetSegmentTextFromState(state, 0);
|
|
||||||
if (config.createWAVRecord) {
|
if (config.createWAVRecord) {
|
||||||
createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
|
createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
|
||||||
locale.getLanguage());
|
locale.getLanguage());
|
||||||
}
|
}
|
||||||
|
transcription += tempTranscription;
|
||||||
if (config.singleUtteranceMode) {
|
if (config.singleUtteranceMode) {
|
||||||
logger.debug("single utterance mode, ending transcription");
|
logger.debug("single utterance mode, ending transcription");
|
||||||
transcription = tempTranscription;
|
|
||||||
break;
|
break;
|
||||||
} else {
|
|
||||||
// start a new transcription segment
|
|
||||||
transcription += tempTranscription;
|
|
||||||
tempTranscription = "";
|
|
||||||
}
|
}
|
||||||
} else if (nSegments == 0 && config.singleUtteranceMode) {
|
} else {
|
||||||
logger.debug("Single utterance mode and no results, ending transcription");
|
|
||||||
break;
|
|
||||||
} else if (nSegments > 1) {
|
|
||||||
// non reachable
|
|
||||||
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
|
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
||||||
// reset state to start with next segment
|
// reset state to start with next segment
|
||||||
voiceDetected = false;
|
voiceDetected = false;
|
||||||
silenceSamplesCounter = 0;
|
silenceSamplesCounter = 0;
|
||||||
@ -528,10 +529,6 @@ public class WhisperSTTService implements STTService {
|
|||||||
logger.debug("Partial transcription: {}", tempTranscription);
|
logger.debug("Partial transcription: {}", tempTranscription);
|
||||||
logger.debug("Transcription: {}", transcription);
|
logger.debug("Transcription: {}", transcription);
|
||||||
}
|
}
|
||||||
} finally {
|
|
||||||
if (releaseContext) {
|
|
||||||
ctx.close();
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
// emit result
|
// emit result
|
||||||
if (!aborted.get()) {
|
if (!aborted.get()) {
|
||||||
@ -543,7 +540,7 @@ public class WhisperSTTService implements STTService {
|
|||||||
emitSpeechRecognitionNoResultsError(sttListener);
|
emitSpeechRecognitionNoResultsError(sttListener);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
} catch (IOException e) {
|
} catch (STTException | IOException e) {
|
||||||
logger.warn("Error running speech to text: {}", e.getMessage());
|
logger.warn("Error running speech to text: {}", e.getMessage());
|
||||||
emitSpeechRecognitionError(sttListener);
|
emitSpeechRecognitionError(sttListener);
|
||||||
} catch (UnsatisfiedLinkError e) {
|
} catch (UnsatisfiedLinkError e) {
|
||||||
@ -553,7 +550,119 @@ public class WhisperSTTService implements STTService {
|
|||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException {
|
@Nullable
|
||||||
|
private String recognizeLocal(int audioSamplesOffset, short[] audioSamples, String language) throws STTException {
|
||||||
|
logger.debug("running whisper with {} seconds of audio...",
|
||||||
|
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
|
||||||
|
var releaseContext = !config.preloadModel;
|
||||||
|
|
||||||
|
WhisperJNI whisper = null;
|
||||||
|
WhisperContext ctx = null;
|
||||||
|
WhisperState state = null;
|
||||||
|
try {
|
||||||
|
whisper = getWhisper();
|
||||||
|
ctx = getContext();
|
||||||
|
logger.debug("Creating whisper state...");
|
||||||
|
state = whisper.initState(ctx);
|
||||||
|
logger.debug("Whisper state created");
|
||||||
|
WhisperFullParams params = getWhisperFullParams(ctx, language);
|
||||||
|
|
||||||
|
// convert to local whisper format (float)
|
||||||
|
float[] floatArray = new float[audioSamples.length];
|
||||||
|
for (int i = 0; i < audioSamples.length; i++) {
|
||||||
|
floatArray[i] = Float.min(1f, Float.max((float) audioSamples[i] / ((float) Short.MAX_VALUE), -1f));
|
||||||
|
}
|
||||||
|
|
||||||
|
long execStartTime = System.currentTimeMillis();
|
||||||
|
var result = whisper.fullWithState(ctx, state, params, floatArray, audioSamplesOffset);
|
||||||
|
logger.debug("whisper ended in {}ms with result code {}", System.currentTimeMillis() - execStartTime,
|
||||||
|
result);
|
||||||
|
// process result
|
||||||
|
if (result != 0) {
|
||||||
|
throw new STTException("Cannot use whisper locally, result code: " + result);
|
||||||
|
}
|
||||||
|
int nSegments = whisper.fullNSegmentsFromState(state);
|
||||||
|
logger.debug("Available transcription segments {}", nSegments);
|
||||||
|
if (nSegments == 1) {
|
||||||
|
return whisper.fullGetSegmentTextFromState(state, 0);
|
||||||
|
} else if (nSegments == 0 && config.singleUtteranceMode) {
|
||||||
|
logger.debug("Single utterance mode and no results, ending transcription");
|
||||||
|
return null;
|
||||||
|
} else {
|
||||||
|
// non reachable
|
||||||
|
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
} catch (IOException e) {
|
||||||
|
if (state != null) {
|
||||||
|
state.close();
|
||||||
|
}
|
||||||
|
throw new STTException("Cannot use whisper locally", e);
|
||||||
|
} finally {
|
||||||
|
if (releaseContext && ctx != null) {
|
||||||
|
ctx.close();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {
|
||||||
|
// convert to byte array, Each short has 2 bytes
|
||||||
|
int size = audioSamplesOffset * 2;
|
||||||
|
ByteBuffer byteArrayBuffer = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
|
||||||
|
for (int i = 0; i < audioSamplesOffset; i++) {
|
||||||
|
byteArrayBuffer.putShort(audioStream[i]);
|
||||||
|
}
|
||||||
|
javax.sound.sampled.AudioFormat jAudioFormat = new javax.sound.sampled.AudioFormat(
|
||||||
|
javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE,
|
||||||
|
false);
|
||||||
|
byte[] byteArray = byteArrayBuffer.array();
|
||||||
|
|
||||||
|
try {
|
||||||
|
AudioInputStream audioInputStream = new AudioInputStream(new ByteArrayInputStream(byteArray), jAudioFormat,
|
||||||
|
audioSamplesOffset);
|
||||||
|
|
||||||
|
// write stream as a WAV file, in a byte array stream :
|
||||||
|
ByteArrayInputStream byteArrayInputStream = null;
|
||||||
|
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
|
||||||
|
AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, baos);
|
||||||
|
byteArrayInputStream = new ByteArrayInputStream(baos.toByteArray());
|
||||||
|
}
|
||||||
|
|
||||||
|
// prepare HTTP request
|
||||||
|
HttpClient commonHttpClient = httpClientFactory.getCommonHttpClient();
|
||||||
|
MultiPartContentProvider multiPartContentProvider = new MultiPartContentProvider();
|
||||||
|
multiPartContentProvider.addFilePart("file", "audio.wav",
|
||||||
|
new InputStreamContentProvider(byteArrayInputStream), null);
|
||||||
|
multiPartContentProvider.addFieldPart("model", new StringContentProvider(this.config.apiModelName), null);
|
||||||
|
multiPartContentProvider.addFieldPart("response_format", new StringContentProvider("text"), null);
|
||||||
|
multiPartContentProvider.addFieldPart("temperature",
|
||||||
|
new StringContentProvider(Float.toString(this.config.temperature)), null);
|
||||||
|
if (!language.isBlank()) {
|
||||||
|
multiPartContentProvider.addFieldPart("language", new StringContentProvider(language), null);
|
||||||
|
}
|
||||||
|
Request request = commonHttpClient.newRequest(config.apiUrl).method(HttpMethod.POST)
|
||||||
|
.content(multiPartContentProvider);
|
||||||
|
if (!config.apiKey.isBlank()) {
|
||||||
|
request = request.header("Authorization", "Bearer " + config.apiKey);
|
||||||
|
}
|
||||||
|
// execute the request
|
||||||
|
ContentResponse response = request.send();
|
||||||
|
|
||||||
|
// check the HTTP status code from the response
|
||||||
|
int statusCode = response.getStatus();
|
||||||
|
if (statusCode < 200 || statusCode >= 300) {
|
||||||
|
logger.debug("HTTP error: Received status code {}, full error is {}", statusCode,
|
||||||
|
response.getContentAsString());
|
||||||
|
throw new STTException("Failed to retrieve transcription: HTTP status code " + statusCode);
|
||||||
|
}
|
||||||
|
return response.getContentAsString();
|
||||||
|
|
||||||
|
} catch (InterruptedException | TimeoutException | ExecutionException | IOException e) {
|
||||||
|
throw new STTException("Exception during attempt to get speech recognition result from api", e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private WhisperFullParams getWhisperFullParams(WhisperContext context, String language) throws IOException {
|
||||||
WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
|
WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
|
||||||
var params = new WhisperFullParams(strategy);
|
var params = new WhisperFullParams(strategy);
|
||||||
params.temperature = config.temperature;
|
params.temperature = config.temperature;
|
||||||
@ -570,7 +679,7 @@ public class WhisperSTTService implements STTService {
|
|||||||
params.grammarPenalty = config.grammarPenalty;
|
params.grammarPenalty = config.grammarPenalty;
|
||||||
}
|
}
|
||||||
// there is no single language models other than the english ones
|
// there is no single language models other than the english ones
|
||||||
params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en";
|
params.language = getWhisper().isMultilingual(context) ? language : "en";
|
||||||
// implementation assumes this options
|
// implementation assumes this options
|
||||||
params.translate = false;
|
params.translate = false;
|
||||||
params.detectLanguage = false;
|
params.detectLanguage = false;
|
||||||
@ -605,7 +714,7 @@ public class WhisperSTTService implements STTService {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
private void createAudioFile(float[] samples, int size, String transcription, String language) {
|
private void createAudioFile(short[] samples, int size, String transcription, String language) {
|
||||||
createSamplesDir();
|
createSamplesDir();
|
||||||
javax.sound.sampled.AudioFormat jAudioFormat;
|
javax.sound.sampled.AudioFormat jAudioFormat;
|
||||||
ByteBuffer byteBuffer;
|
ByteBuffer byteBuffer;
|
||||||
@ -615,7 +724,7 @@ public class WhisperSTTService implements STTService {
|
|||||||
WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
|
WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
|
||||||
byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
|
byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
|
||||||
for (int i = 0; i < size; i++) {
|
for (int i = 0; i < size; i++) {
|
||||||
byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE));
|
byteBuffer.putShort(samples[i]);
|
||||||
}
|
}
|
||||||
} else {
|
} else {
|
||||||
logger.debug("Saving audio file with sample format f32");
|
logger.debug("Saving audio file with sample format f32");
|
||||||
@ -623,7 +732,7 @@ public class WhisperSTTService implements STTService {
|
|||||||
WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
|
WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
|
||||||
byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
|
byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
|
||||||
for (int i = 0; i < size; i++) {
|
for (int i = 0; i < size; i++) {
|
||||||
byteBuffer.putFloat(samples[i]);
|
byteBuffer.putFloat(Float.min(1f, Float.max((float) samples[i] / ((float) Short.MAX_VALUE), -1f)));
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
|
AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
|
||||||
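The Java changes above switch the internal sample buffer from `float[]` to `short[]`, normalising to floats only for local whisper and wrapping the raw 16-bit PCM in a WAV container for the API request. A minimal, self-contained sketch of those two conversions follows. It is not the add-on's code: the class name `PcmSketch`, the method names, and the 16 kHz `SAMPLE_RATE` constant are assumptions made for illustration, and only standard `javax.sound.sampled` and `java.nio` calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper class, for illustration only.
final class PcmSketch {
    // Assumption: 16 kHz mono, standing in for WHISPER_SAMPLE_RATE in the diff.
    static final float SAMPLE_RATE = 16000f;

    // Normalise the first `length` signed 16-bit samples to floats clamped to [-1, 1].
    static float[] toFloat(short[] samples, int length) {
        float[] out = new float[length];
        for (int i = 0; i < length; i++) {
            float scaled = (float) samples[i] / (float) Short.MAX_VALUE;
            out[i] = Float.min(1f, Float.max(scaled, -1f));
        }
        return out;
    }

    // Wrap the first `length` samples in an in-memory WAV file (little-endian PCM).
    static byte[] toWav(short[] samples, int length) throws IOException {
        ByteBuffer pcm = ByteBuffer.allocate(length * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < length; i++) {
            pcm.putShort(samples[i]);
        }
        AudioFormat format = new AudioFormat(AudioFormat.Encoding.PCM_SIGNED,
                SAMPLE_RATE, 16, 1, 2, SAMPLE_RATE, false);
        // The frame count must be known so the WAV header can be written to a plain stream.
        try (AudioInputStream in = new AudioInputStream(
                new ByteArrayInputStream(pcm.array()), format, length);
                ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            AudioSystem.write(in, AudioFileFormat.Type.WAVE, out);
            return out.toByteArray();
        }
    }
}
```

Dividing by `Short.MAX_VALUE` maps `Short.MIN_VALUE` slightly below -1, which is why the clamp is needed. Keeping the buffer as `short[]` means the API path can write the samples unchanged, and only the local path pays for the float conversion.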
|
@ -11,7 +11,7 @@
|
|||||||
</parameter-group>
|
</parameter-group>
|
||||||
<parameter-group name="vad">
|
<parameter-group name="vad">
|
||||||
<label>Voice Activity Detection</label>
|
<label>Voice Activity Detection</label>
|
||||||
<description>Configure the VAD mechanisim used to isolate single phrases to feed whisper with.</description>
|
<description>Configure the VAD mechanism used to isolate single phrases to feed whisper with.</description>
|
||||||
</parameter-group>
|
</parameter-group>
|
||||||
<parameter-group name="whisper">
|
<parameter-group name="whisper">
|
||||||
<label>Whisper Options</label>
|
<label>Whisper Options</label>
|
||||||
@ -19,7 +19,7 @@
|
|||||||
</parameter-group>
|
</parameter-group>
|
||||||
<parameter-group name="grammar">
|
<parameter-group name="grammar">
|
||||||
<label>Grammar</label>
|
<label>Grammar</label>
|
||||||
<description>Define a grammar to improve transcrptions.</description>
|
<description>Define a grammar to improve transcriptions.</description>
|
||||||
</parameter-group>
|
</parameter-group>
|
||||||
<parameter-group name="messages">
|
<parameter-group name="messages">
|
||||||
<label>Info Messages</label>
|
<label>Info Messages</label>
|
||||||
@ -30,9 +30,27 @@
|
|||||||
<description>Options added for developers.</description>
|
<description>Options added for developers.</description>
|
||||||
<advanced>true</advanced>
|
<advanced>true</advanced>
|
||||||
</parameter-group>
|
</parameter-group>
|
||||||
|
<parameter-group name="openaiapi">
|
||||||
|
<label>API Configuration Options</label>
|
||||||
|
<description>Configure OpenAI compatible API, if you don't want to use the local model.</description>
|
||||||
|
</parameter-group>
|
||||||
|
<parameter name="mode" type="text" groupName="stt">
|
||||||
|
<label>Local Mode Or API</label>
|
||||||
|
<description>Use the local model or the OpenAI compatible API.</description>
|
||||||
|
<default>LOCAL</default>
|
||||||
|
<options>
|
||||||
|
<option value="LOCAL">Local</option>
|
||||||
|
<option value="API">OpenAI API</option>
|
||||||
|
</options>
|
||||||
|
</parameter>
|
||||||
<parameter name="modelName" type="text" groupName="stt" required="true">
|
<parameter name="modelName" type="text" groupName="stt" required="true">
|
||||||
<label>Model Name</label>
|
<label>Local Model Name</label>
|
||||||
<description>Model name without extension.</description>
|
<description>Model name without extension. Local mode only.</description>
|
||||||
|
</parameter>
|
||||||
|
<parameter name="language" type="text" groupName="whisper">
|
||||||
|
<label>Language</label>
|
||||||
|
<description>If specified, speed up recognition by avoiding auto-detection. Default to system locale.</description>
|
||||||
|
<default></default>
|
||||||
</parameter>
|
</parameter>
|
||||||
<parameter name="preloadModel" type="boolean" groupName="stt">
|
<parameter name="preloadModel" type="boolean" groupName="stt">
|
||||||
<label>Preload Model</label>
|
<label>Preload Model</label>
|
||||||
@ -225,5 +243,20 @@
|
|||||||
<default>false</default>
|
<default>false</default>
|
||||||
<advanced>true</advanced>
|
<advanced>true</advanced>
|
||||||
</parameter>
|
</parameter>
|
||||||
|
<parameter name="apiKey" type="text" groupName="openaiapi">
|
||||||
|
<label>API Key</label>
|
||||||
|
<description>Key to access the API</description>
|
||||||
|
<default></default>
|
||||||
|
</parameter>
|
||||||
|
<parameter name="apiUrl" type="text" groupName="openaiapi">
|
||||||
|
<label>API Url</label>
|
||||||
|
<description>OpenAI compatible API URL. Default to OpenAI transcription service.</description>
|
||||||
|
<default>https://api.openai.com/v1/audio/transcriptions</default>
|
||||||
|
</parameter>
|
||||||
|
<parameter name="apiModelName" type="text" groupName="openaiapi">
|
||||||
|
<label>API Model</label>
|
||||||
|
<description>Model name to use (API only). Default to OpenAI only available model (whisper-1).</description>
|
||||||
|
<default>whisper-1</default>
|
||||||
|
</parameter>
|
||||||
</config-description>
|
</config-description>
|
||||||
</config-description:config-descriptions>
|
</config-description:config-descriptions>
|
||||||
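The new `mode` parameter above offers two values, `LOCAL` (the default) and `API`. The hunks shown here add `recognizeLocal` and `recognizeAPI`, but not the code that chooses between them, so the sketch below is only one plausible way to wire that choice. The `Mode` enum, `parseMode`, `recognize`, and the fallback to `LOCAL` on an unknown value are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical dispatch helper, for illustration only.
final class ModeDispatchSketch {
    enum Mode {
        LOCAL, API
    }

    // Parse the configured mode; blank or unknown values fall back to LOCAL,
    // mirroring the <default>LOCAL</default> in the config description (assumed behaviour).
    static Mode parseMode(String raw) {
        if (raw == null || raw.isBlank()) {
            return Mode.LOCAL;
        }
        try {
            return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Mode.LOCAL;
        }
    }

    // Route the audio to whichever recogniser the mode selects.
    static String recognize(Mode mode, short[] audio,
            Function<short[], String> local, Function<short[], String> api) {
        return mode == Mode.API ? api.apply(audio) : local.apply(audio);
    }
}
```

A caller would pass the two recognisers as lambdas, for example `recognize(parseMode(configuredMode), samples, this::runLocal, this::runApi)`, where `configuredMode`, `runLocal`, and `runApi` are placeholder names.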
|
@ -3,6 +3,12 @@
|
|||||||
addon.whisperstt.name = Whisper Speech-to-Text
|
addon.whisperstt.name = Whisper Speech-to-Text
|
||||||
addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcript audio data to text.
|
addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcript audio data to text.
|
||||||
|
|
||||||
|
voice.config.whisperstt.apiKey.label = API Key
|
||||||
|
voice.config.whisperstt.apiKey.description = Key to access the API
|
||||||
|
voice.config.whisperstt.apiModelName.label = API Model
|
||||||
|
voice.config.whisperstt.apiModelName.description = Model name to use (API only). Default to OpenAI only available model (whisper-1).
|
||||||
|
voice.config.whisperstt.apiUrl.label = API Url
|
||||||
|
voice.config.whisperstt.apiUrl.description = OpenAI compatible API URL. Default to OpenAI transcription service.
|
||||||
voice.config.whisperstt.audioContext.label = Audio Context
|
voice.config.whisperstt.audioContext.label = Audio Context
|
||||||
voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size)
|
voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size)
|
||||||
voice.config.whisperstt.beamSize.label = Beam Size
|
voice.config.whisperstt.beamSize.label = Beam Size
|
||||||
@ -24,27 +30,35 @@ voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sam
|
|||||||
voice.config.whisperstt.group.developer.label = Developer
|
voice.config.whisperstt.group.developer.label = Developer
|
||||||
voice.config.whisperstt.group.developer.description = Options added for developers.
|
voice.config.whisperstt.group.developer.description = Options added for developers.
|
||||||
voice.config.whisperstt.group.grammar.label = Grammar
|
voice.config.whisperstt.group.grammar.label = Grammar
|
||||||
voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcrptions.
|
voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcriptions.
|
||||||
voice.config.whisperstt.group.messages.label = Info Messages
|
voice.config.whisperstt.group.messages.label = Info Messages
|
||||||
voice.config.whisperstt.group.messages.description = Configure service information messages.
|
voice.config.whisperstt.group.messages.description = Configure service information messages.
|
||||||
|
voice.config.whisperstt.group.openaiapi.label = API Configuration Options
|
||||||
|
voice.config.whisperstt.group.openaiapi.description = Configure OpenAI compatible API, if you don't want to use the local model.
|
||||||
voice.config.whisperstt.group.stt.label = STT Configuration
|
voice.config.whisperstt.group.stt.label = STT Configuration
|
||||||
voice.config.whisperstt.group.stt.description = Configure Speech to Text.
|
voice.config.whisperstt.group.stt.description = Configure Speech to Text.
|
||||||
voice.config.whisperstt.group.vad.label = Voice Activity Detection
|
voice.config.whisperstt.group.vad.label = Voice Activity Detection
|
||||||
voice.config.whisperstt.group.vad.description = Configure the VAD mechanisim used to isolate single phrases to feed whisper with.
|
voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with.
|
||||||
voice.config.whisperstt.group.whisper.label = Whisper Options
|
voice.config.whisperstt.group.whisper.label = Whisper Options
|
||||||
voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options.
|
voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options.
|
||||||
voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds
|
voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds
|
||||||
voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription.
|
voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription.
|
||||||
voice.config.whisperstt.initialPrompt.label = Initial Prompt
|
voice.config.whisperstt.initialPrompt.label = Initial Prompt
|
||||||
voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with.
|
voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with.
|
||||||
|
voice.config.whisperstt.language.label = Language
|
||||||
|
voice.config.whisperstt.language.description = If specified, speed up recognition by avoiding auto-detection. Default to system locale.
|
||||||
voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds
|
voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds
|
||||||
voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection.
|
voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection.
|
||||||
voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds
|
voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds
|
||||||
voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription.
|
voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription.
|
||||||
voice.config.whisperstt.minSeconds.label = Min Transcription Seconds
|
voice.config.whisperstt.minSeconds.label = Min Transcription Seconds
|
||||||
voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper.
|
voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper.
|
||||||
voice.config.whisperstt.modelName.label = Model Name
|
voice.config.whisperstt.mode.label = Local Mode Or API
|
||||||
voice.config.whisperstt.modelName.description = Model name without extension.
|
voice.config.whisperstt.mode.description = Use the local model or the OpenAI compatible API.
|
||||||
|
voice.config.whisperstt.mode.option.LOCAL = Local
|
||||||
|
voice.config.whisperstt.mode.option.API = OpenAI API
|
||||||
|
voice.config.whisperstt.modelName.label = Local Model Name
|
||||||
|
voice.config.whisperstt.modelName.description = Model name without extension. Local mode only.
|
||||||
voice.config.whisperstt.openvinoDevice.label = OpenVINO Device
|
voice.config.whisperstt.openvinoDevice.label = OpenVINO Device
|
||||||
voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
|
voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
|
||||||
voice.config.whisperstt.preloadModel.label = Preload Model
|
voice.config.whisperstt.preloadModel.label = Preload Model
|
||||||
|