Gwendal Roulleau 2025-01-09 12:57:13 +01:00 committed by GitHub
commit e5d06abd64
GPG Key ID: B5690EEEBB952194
5 changed files with 303 additions and 97 deletions


@ -5,6 +5,8 @@ It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity det
[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a high-optimized lightweight c++ implementation of [whisper](https://github.com/openai/whisper) that allows to easily integrate it in different platforms and applications. [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a high-optimized lightweight c++ implementation of [whisper](https://github.com/openai/whisper) that allows to easily integrate it in different platforms and applications.
Alternatively, if you do not want to perform speech-to-text on the computer hosting openHAB, this add-on can consume an OpenAI/Whisper compatible transcription API.
Whisper enables speech recognition for multiple languages and dialects: Whisper enables speech recognition for multiple languages and dialects:
english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish, english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
@ -15,9 +17,11 @@ marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, geo
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala, uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
hausa, bashkir, javanese and sundanese. hausa, bashkir, javanese and sundanese.
## Supported platforms ## Local mode (offline)
This add-on uses some native binaries to work. ### Supported platforms
This add-on uses some native binaries to work when performing offline recognition.
You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni). You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).
The following platforms are supported: The following platforms are supported:
@ -28,7 +32,7 @@ The following platforms are supported:
The native binaries for those platforms are included in this add-on provided with the openHAB distribution. The native binaries for those platforms are included in this add-on provided with the openHAB distribution.
## CPU compatibility ### CPU compatibility
To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU. To use this binding it's recommended to use a device at least as powerful as the RaspberryPI 5 with a modern CPU.
The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds. The execution times on Raspberry PI 4 are x2, so just the tiny model can be run on under 5 seconds.
@ -40,18 +44,18 @@ You can check those flags on Windows using a program like `CPU-Z`.
If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`. If you are going to use the binding in a `arm64` host the CPU should support the flags: `fphp`.
You can check those flags on linux using the terminal with `lscpu`. You can check those flags on linux using the terminal with `lscpu`.
## Transcription time ### Transcription time
On a Raspberry PI 5, the approximate transcription times are: On a Raspberry PI 5, the approximate transcription times are:
| model | exec time | | model | exec time |
| ---------- | --------: | |------------|----------:|
| tiny.bin | 1.5s | | tiny.bin | 1.5s |
| base.bin | 3s | | base.bin | 3s |
| small.bin | 8.5s | | small.bin | 8.5s |
| medium.bin | 17s | | medium.bin | 17s |
## Configuring the model ### Configuring the model
Before you can use this service you should configure your model. Before you can use this service you should configure your model.
@ -64,7 +68,7 @@ You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so
Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link. Remember to check that you have enough RAM to load the model, estimated RAM consumption can be checked on the huggingface link.
## Using alternative whisper.cpp library ### Using alternative whisper.cpp library
It's possible to use your own build of the whisper.cpp shared library with this add-on. It's possible to use your own build of the whisper.cpp shared library with this add-on.
@ -76,7 +80,7 @@ In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can fi
Note: You need to restart openHAB to reload the library. Note: You need to restart openHAB to reload the library.
## Grammar ### Grammar
The whisper.cpp library allows to define a grammar to alter the transcription results without fine-tuning the model. The whisper.cpp library allows to define a grammar to alter the transcription results without fine-tuning the model.
@ -99,6 +103,14 @@ tv_channel ::= ("set ")? "tv channel to " [0-9]+
You can provide the grammar and enable its usage using the binding configuration. You can provide the grammar and enable its usage using the binding configuration.
## API mode
You can also use this add-on with a remote API that is compatible with the 'transcription' API from OpenAI. Online services exposing such an API may require an API key (paid services, such as OpenAI).
You can host your own compatible service elsewhere on your network, with third-party software such as faster-whisper-server.
Please note that API mode also uses libfvad for voice activity detection, and that grammar parameters are not available.
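
In API mode the recorded 16-bit samples are sent to the remote service as a WAV file. The following is a minimal sketch of that conversion using only `javax.sound.sampled`, not the add-on's exact code. The class name `PcmToWavSketch` and the 16 kHz mono format are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper, not part of the add-on.
final class PcmToWavSketch {
    // Assumption: 16 kHz mono audio, the rate whisper models expect.
    static final int SAMPLE_RATE = 16000;

    /** Wraps the first {@code count} 16-bit samples in an in-memory WAV file. */
    static byte[] toWav(short[] samples, int count) {
        // Each short becomes 2 little-endian bytes.
        ByteBuffer pcm = ByteBuffer.allocate(count * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < count; i++) {
            pcm.putShort(samples[i]);
        }
        // Signed PCM, 16 bits, 1 channel, 2 bytes per frame, little-endian.
        AudioFormat format = new AudioFormat(AudioFormat.Encoding.PCM_SIGNED, SAMPLE_RATE, 16, 1, 2,
                SAMPLE_RATE, false);
        try (AudioInputStream in = new AudioInputStream(new ByteArrayInputStream(pcm.array()), format, count);
                ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            // Writes the WAV header followed by the PCM data.
            AudioSystem.write(in, AudioFileFormat.Type.WAVE, out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting byte array can be attached as the `file` part of the transcription request.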
## Configuration ## Configuration
Use your favorite configuration UI to edit the Whisper settings: Use your favorite configuration UI to edit the Whisper settings:
@ -107,6 +119,7 @@ Use your favorite configuration UI to edit the Whisper settings:
General options. General options.
- **Mode : LOCAL or API** - Choose either local computation or remote API use.
- **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin) - **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
- **Preload Model** - Keep whisper model loaded. - **Preload Model** - Keep whisper model loaded.
- **Single Utterance Mode** - When enabled recognition stops listening after a single utterance. - **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
@ -139,6 +152,13 @@ Configure whisper options.
- **Initial Prompt** - Initial prompt for whisper. - **Initial Prompt** - Initial prompt for whisper.
- **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect) - **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
- **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect) - **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
- **Language** - If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.
### API Configuration
- **API key** - Optional API key, for online services that require one.
- **API url** - URL of the transcription service; you may point it at your own service. Defaults to the OpenAI transcription API.
- **API model name** - Model to request; your hosted service may offer other models. Defaults to OpenAI's only model, 'whisper-1'.
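
To show how these three settings end up in a request, here is a hedged sketch of a multipart transcription request. It uses the JDK's `java.net.http` classes rather than the Jetty client the add-on relies on. The class name, the boundary string and the method signatures are hypothetical; the form field names (`file`, `model`, `response_format`, `temperature`, `language`) are the ones the add-on sends.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the add-on.
final class TranscriptionRequestSketch {
    // Hypothetical boundary; any token that does not occur in the payload works.
    static final String BOUNDARY = "----whisper-sketch-boundary";

    /** Encodes the text fields and the WAV file as a multipart/form-data body. */
    static byte[] multipartBody(byte[] wav, Map<String, String> fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            write(out, "--" + BOUNDARY + "\r\n" //
                    + "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n" //
                    + field.getValue() + "\r\n");
        }
        write(out, "--" + BOUNDARY + "\r\n" //
                + "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n" //
                + "Content-Type: audio/wav\r\n\r\n");
        out.writeBytes(wav);
        write(out, "\r\n--" + BOUNDARY + "--\r\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        out.writeBytes(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds (but does not send) the POST request described by the API settings. */
    static HttpRequest build(String apiUrl, String apiKey, String apiModelName, String language,
            float temperature, byte[] wav) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("model", apiModelName);
        fields.put("response_format", "text");
        fields.put("temperature", Float.toString(temperature));
        if (!language.isBlank()) {
            fields.put("language", language);
        }
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(apiUrl))
                .header("Content-Type", "multipart/form-data; boundary=" + BOUNDARY)
                .POST(HttpRequest.BodyPublishers.ofByteArray(multipartBody(wav, fields)));
        // The key is optional: self-hosted services may not need one.
        if (!apiKey.isBlank()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }
}
```

With `response_format` set to `text`, a compatible service is expected to answer with the plain transcription as the response body.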
### Grammar Configuration ### Grammar Configuration
@ -199,7 +219,9 @@ In case you would like to set up the service via a text file, create a new file
Its contents should look similar to: Its contents should look similar to:
```ini ```ini
org.openhab.voice.whisperstt:mode=LOCAL
org.openhab.voice.whisperstt:modelName=tiny org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:language=en
org.openhab.voice.whisperstt:initSilenceSeconds=0.3 org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3 org.openhab.voice.whisperstt:stepSeconds=0.3
@ -229,6 +251,9 @@ org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0 org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines= org.openhab.voice.whisperstt:grammarLines=
org.openhab.voice.whisperstt:apiKey=mykeyaaaa
org.openhab.voice.whisperstt:apiUrl=https://api.openai.com/v1/audio/transcriptions
org.openhab.voice.whisperstt:apiModelName=whisper-1
``` ```
### Default Speech-to-Text Configuration ### Default Speech-to-Text Configuration


@ -146,4 +146,29 @@ public class WhisperSTTConfiguration {
* Print whisper.cpp library logs as binding debug logs. * Print whisper.cpp library logs as binding debug logs.
*/ */
public boolean enableWhisperLog; public boolean enableWhisperLog;
/**
* LOCAL to use the embedded whisper, or API to use an external API
*/
public Mode mode = Mode.LOCAL;
/**
* If mode is set to API, use this URL
*/
public String apiUrl = "https://api.openai.com/v1/audio/transcriptions";
/**
* If mode is set to API, use this API key to access apiUrl
*/
public String apiKey = "";
/**
* If specified, speed up recognition by avoiding auto-detection
*/
public String language = "";
/**
* Model name (API only)
*/
public String apiModelName = "whisper-1";
public static enum Mode {
LOCAL,
API;
}
} }


@ -12,12 +12,10 @@
*/ */
package org.openhab.voice.whisperstt.internal; package org.openhab.voice.whisperstt.internal;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY; import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.*;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
import java.io.ByteArrayInputStream; import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream; import java.io.FileOutputStream;
import java.io.IOException; import java.io.IOException;
import java.nio.ByteBuffer; import java.nio.ByteBuffer;
@ -32,7 +30,9 @@ import java.util.Date;
import java.util.Locale; import java.util.Locale;
import java.util.Map; import java.util.Map;
import java.util.Set; import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ScheduledExecutorService; import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean; import java.util.concurrent.atomic.AtomicBoolean;
import javax.sound.sampled.AudioFileFormat; import javax.sound.sampled.AudioFileFormat;
@ -41,6 +41,13 @@ import javax.sound.sampled.AudioSystem;
import org.eclipse.jdt.annotation.NonNullByDefault; import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable; import org.eclipse.jdt.annotation.Nullable;
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;
import org.eclipse.jetty.client.api.Request;
import org.eclipse.jetty.client.util.InputStreamContentProvider;
import org.eclipse.jetty.client.util.MultiPartContentProvider;
import org.eclipse.jetty.client.util.StringContentProvider;
import org.eclipse.jetty.http.HttpMethod;
import org.openhab.core.OpenHAB; import org.openhab.core.OpenHAB;
import org.openhab.core.audio.AudioFormat; import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioStream; import org.openhab.core.audio.AudioStream;
@ -48,6 +55,7 @@ import org.openhab.core.audio.utils.AudioWaveUtils;
import org.openhab.core.common.ThreadPoolManager; import org.openhab.core.common.ThreadPoolManager;
import org.openhab.core.config.core.ConfigurableService; import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.Configuration; import org.openhab.core.config.core.Configuration;
import org.openhab.core.io.net.http.HttpClientFactory;
import org.openhab.core.io.rest.LocaleService; import org.openhab.core.io.rest.LocaleService;
import org.openhab.core.voice.RecognitionStartEvent; import org.openhab.core.voice.RecognitionStartEvent;
import org.openhab.core.voice.RecognitionStopEvent; import org.openhab.core.voice.RecognitionStopEvent;
@ -57,6 +65,7 @@ import org.openhab.core.voice.STTService;
import org.openhab.core.voice.STTServiceHandle; import org.openhab.core.voice.STTServiceHandle;
import org.openhab.core.voice.SpeechRecognitionErrorEvent; import org.openhab.core.voice.SpeechRecognitionErrorEvent;
import org.openhab.core.voice.SpeechRecognitionEvent; import org.openhab.core.voice.SpeechRecognitionEvent;
import org.openhab.voice.whisperstt.internal.WhisperSTTConfiguration.Mode;
import org.openhab.voice.whisperstt.internal.utils.VAD; import org.openhab.voice.whisperstt.internal.utils.VAD;
import org.osgi.framework.Constants; import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Activate; import org.osgi.service.component.annotations.Activate;
@ -96,10 +105,13 @@ public class WhisperSTTService implements STTService {
private @Nullable WhisperContext context; private @Nullable WhisperContext context;
private @Nullable WhisperGrammar grammar; private @Nullable WhisperGrammar grammar;
private @Nullable WhisperJNI whisper; private @Nullable WhisperJNI whisper;
private boolean isWhisperLibAlreadyLoaded = false;
private final HttpClientFactory httpClientFactory;
@Activate @Activate
public WhisperSTTService(@Reference LocaleService localeService) { public WhisperSTTService(@Reference LocaleService localeService, @Reference HttpClientFactory httpClientFactory) {
this.localeService = localeService; this.localeService = localeService;
this.httpClientFactory = httpClientFactory;
} }
@Activate @Activate
@ -108,7 +120,8 @@ public class WhisperSTTService implements STTService {
if (!Files.exists(WHISPER_FOLDER)) { if (!Files.exists(WHISPER_FOLDER)) {
Files.createDirectory(WHISPER_FOLDER); Files.createDirectory(WHISPER_FOLDER);
} }
WhisperJNI.loadLibrary(getLoadOptions()); this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
loadWhisperLibraryIfNeeded();
VoiceActivityDetector.loadLibrary(); VoiceActivityDetector.loadLibrary();
whisper = new WhisperJNI(); whisper = new WhisperJNI();
} catch (IOException | RuntimeException e) { } catch (IOException | RuntimeException e) {
@ -117,6 +130,13 @@ public class WhisperSTTService implements STTService {
configChange(config); configChange(config);
} }
private void loadWhisperLibraryIfNeeded() throws IOException {
if (config.mode == Mode.LOCAL && !isWhisperLibAlreadyLoaded) {
WhisperJNI.loadLibrary(getLoadOptions());
isWhisperLibAlreadyLoaded = true;
}
}
private WhisperJNI.LoadOptions getLoadOptions() { private WhisperJNI.LoadOptions getLoadOptions() {
Path libFolder = Paths.get("/usr/local/lib"); Path libFolder = Paths.get("/usr/local/lib");
Path libFolderWin = Paths.get("/Windows/System32"); Path libFolderWin = Paths.get("/Windows/System32");
@ -167,14 +187,27 @@ public class WhisperSTTService implements STTService {
private void configChange(Map<String, Object> config) { private void configChange(Map<String, Object> config) {
this.config = new Configuration(config).as(WhisperSTTConfiguration.class); this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
WhisperGrammar grammar = this.grammar; WhisperGrammar grammar = this.grammar;
if (grammar != null) { if (grammar != null) {
grammar.close(); grammar.close();
this.grammar = null; this.grammar = null;
} }
// API mode
if (this.config.mode == Mode.API) {
try {
unloadContext();
} catch (IOException e) {
logger.warn("IOException unloading model: {}", e.getMessage());
}
return;
}
// Local mode
WhisperJNI whisper; WhisperJNI whisper;
try { try {
loadWhisperLibraryIfNeeded();
WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
whisper = getWhisper(); whisper = getWhisper();
} catch (IOException ignored) { } catch (IOException ignored) {
logger.warn("library not loaded, the add-on will not work"); logger.warn("library not loaded, the add-on will not work");
@ -228,9 +261,17 @@ public class WhisperSTTService implements STTService {
@Override @Override
public Set<Locale> getSupportedLocales() { public Set<Locale> getSupportedLocales() {
// as it is not possible to determine the language of the model that was downloaded and setup by the user, it is // Attempt to create a locale from the configured language
// assumed the language of the model is matching the locale of the openHAB server String language = config.language;
return Set.of(localeService.getLocale(null)); Locale modelLocale = localeService.getLocale(null);
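if (!language.isBlank()) {
// Locale.forLanguageTag does not throw on malformed input; it drops ill-formed subtags, so check the result
Locale configuredLocale = Locale.forLanguageTag(language);
if (configuredLocale.getLanguage().isEmpty()) {
logger.warn("Invalid language '{}', defaulting to server locale", language);
} else {
modelLocale = configuredLocale;
}
}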
return Set.of(modelLocale);
} }
@Override @Override
@ -246,33 +287,18 @@ public class WhisperSTTService implements STTService {
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set) public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
throws STTException { throws STTException {
AtomicBoolean aborted = new AtomicBoolean(false); AtomicBoolean aborted = new AtomicBoolean(false);
WhisperContext ctx = null;
WhisperState state = null;
try { try {
var whisper = getWhisper();
ctx = getContext();
logger.debug("Creating whisper state...");
state = whisper.initState(ctx);
logger.debug("Whisper state created");
logger.debug("Creating VAD instance..."); logger.debug("Creating VAD instance...");
final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE); final int nSamplesStep = (int) (config.stepSeconds * WHISPER_SAMPLE_RATE);
VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep, VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
config.vadStep, config.vadSensitivity); config.vadStep, config.vadSensitivity);
logger.debug("VAD instance created"); logger.debug("VAD instance created");
sttListener.sttEventReceived(new RecognitionStartEvent()); sttListener.sttEventReceived(new RecognitionStartEvent());
backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted); backgroundRecognize(nSamplesStep, locale, sttListener, audioStream, vad, aborted);
} catch (IOException e) { } catch (IOException e) {
if (ctx != null && !config.preloadModel) {
ctx.close();
}
if (state != null) {
state.close();
}
throw new STTException("Exception during initialization", e); throw new STTException("Exception during initialization", e);
} }
return () -> { return () -> aborted.set(true);
aborted.set(true);
};
} }
private WhisperJNI getWhisper() throws IOException { private WhisperJNI getWhisper() throws IOException {
@ -339,9 +365,8 @@ public class WhisperSTTService implements STTService {
} }
} }
private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep, private void backgroundRecognize(final int nSamplesStep, Locale locale, STTListener sttListener,
Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) { AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
var releaseContext = !config.preloadModel;
final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE; final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE); final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE); final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
@ -353,21 +378,17 @@ public class WhisperSTTService implements STTService {
logger.debug("Max silence samples {}", nMaxSilenceSamples); logger.debug("Max silence samples {}", nMaxSilenceSamples);
// used to store the step samples in libfvad wanted format 16-bit int // used to store the step samples in libfvad wanted format 16-bit int
final short[] stepAudioSamples = new short[nSamplesStep]; final short[] stepAudioSamples = new short[nSamplesStep];
// used to store the full samples in whisper wanted format 32-bit float // used to store the full retained samples for whisper
final float[] audioSamples = new float[nSamplesMax]; final short[] audioSamples = new short[nSamplesMax];
executor.submit(() -> { executor.submit(() -> {
int audioSamplesOffset = 0; int audioSamplesOffset = 0;
int silenceSamplesCounter = 0; int silenceSamplesCounter = 0;
int nProcessedSamples = 0; int nProcessedSamples = 0;
int numBytesRead;
boolean voiceDetected = false; boolean voiceDetected = false;
String transcription = ""; String transcription = "";
String tempTranscription = "";
VAD.@Nullable VADResult lastVADResult;
VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null; VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
try { try {
try (state; // try (audioStream; //
audioStream; //
vad) { vad) {
if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) { if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
AudioWaveUtils.removeFMT(audioStream); AudioWaveUtils.removeFMT(audioStream);
@ -376,10 +397,9 @@ public class WhisperSTTService implements STTService {
.order(ByteOrder.LITTLE_ENDIAN); .order(ByteOrder.LITTLE_ENDIAN);
// init remaining to full capacity // init remaining to full capacity
int remaining = captureBuffer.capacity(); int remaining = captureBuffer.capacity();
WhisperFullParams params = getWhisperFullParams(ctx, locale);
while (!aborted.get()) { while (!aborted.get()) {
// read until no remaining so we get the complete step samples // read until no remaining so we get the complete step samples
numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining, int numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
remaining); remaining);
if (aborted.get() || numBytesRead == -1) { if (aborted.get() || numBytesRead == -1) {
break; break;
@ -395,17 +415,15 @@ public class WhisperSTTService implements STTService {
while (shortBuffer.hasRemaining()) { while (shortBuffer.hasRemaining()) {
var position = shortBuffer.position(); var position = shortBuffer.position();
short i16BitSample = shortBuffer.get(); short i16BitSample = shortBuffer.get();
float f32BitSample = Float.min(1f,
Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
stepAudioSamples[position] = i16BitSample; stepAudioSamples[position] = i16BitSample;
audioSamples[audioSamplesOffset++] = f32BitSample; audioSamples[audioSamplesOffset++] = i16BitSample;
nProcessedSamples++; nProcessedSamples++;
} }
// run vad // run vad
if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) { if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
logger.debug("VAD: Skipping, max length reached"); logger.debug("VAD: Skipping, max length reached");
} else { } else {
lastVADResult = vad.analyze(stepAudioSamples); VAD.@Nullable VADResult lastVADResult = vad.analyze(stepAudioSamples);
if (lastVADResult.isVoice()) { if (lastVADResult.isVoice()) {
voiceDetected = true; voiceDetected = true;
logger.debug("VAD: voice detected"); logger.debug("VAD: voice detected");
@ -484,43 +502,26 @@ public class WhisperSTTService implements STTService {
} }
} }
} }
// run whisper // run whisper, either locally or by remote API
logger.debug("running whisper with {} seconds of audio...", String tempTranscription = (switch (config.mode) {
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f); case LOCAL -> recognizeLocal(audioSamplesOffset, audioSamples, locale.getLanguage());
long execStartTime = System.currentTimeMillis(); case API -> recognizeAPI(audioSamplesOffset, audioSamples, locale.getLanguage());
var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset); });
logger.debug("whisper ended in {}ms with result code {}",
System.currentTimeMillis() - execStartTime, result); if (tempTranscription != null && !tempTranscription.isBlank()) {
// process result
if (result != 0) {
emitSpeechRecognitionError(sttListener);
break;
}
int nSegments = whisper.fullNSegmentsFromState(state);
logger.debug("Available transcription segments {}", nSegments);
if (nSegments == 1) {
tempTranscription = whisper.fullGetSegmentTextFromState(state, 0);
if (config.createWAVRecord) { if (config.createWAVRecord) {
createAudioFile(audioSamples, audioSamplesOffset, tempTranscription, createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
locale.getLanguage()); locale.getLanguage());
} }
transcription += tempTranscription;
if (config.singleUtteranceMode) { if (config.singleUtteranceMode) {
logger.debug("single utterance mode, ending transcription"); logger.debug("single utterance mode, ending transcription");
transcription = tempTranscription;
break; break;
} else {
// start a new transcription segment
transcription += tempTranscription;
tempTranscription = "";
} }
} else if (nSegments == 0 && config.singleUtteranceMode) { } else {
logger.debug("Single utterance mode and no results, ending transcription");
break;
} else if (nSegments > 1) {
// non reachable
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
break; break;
} }
// reset state to start with next segment // reset state to start with next segment
voiceDetected = false; voiceDetected = false;
silenceSamplesCounter = 0; silenceSamplesCounter = 0;
@ -528,10 +529,6 @@ public class WhisperSTTService implements STTService {
logger.debug("Partial transcription: {}", tempTranscription); logger.debug("Partial transcription: {}", tempTranscription);
logger.debug("Transcription: {}", transcription); logger.debug("Transcription: {}", transcription);
} }
} finally {
if (releaseContext) {
ctx.close();
}
} }
// emit result // emit result
if (!aborted.get()) { if (!aborted.get()) {
@ -543,7 +540,7 @@ public class WhisperSTTService implements STTService {
emitSpeechRecognitionNoResultsError(sttListener); emitSpeechRecognitionNoResultsError(sttListener);
} }
} }
} catch (IOException e) { } catch (STTException | IOException e) {
logger.warn("Error running speech to text: {}", e.getMessage()); logger.warn("Error running speech to text: {}", e.getMessage());
emitSpeechRecognitionError(sttListener); emitSpeechRecognitionError(sttListener);
} catch (UnsatisfiedLinkError e) { } catch (UnsatisfiedLinkError e) {
@ -553,7 +550,119 @@ public class WhisperSTTService implements STTService {
}); });
} }
private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException { @Nullable
private String recognizeLocal(int audioSamplesOffset, short[] audioSamples, String language) throws STTException {
logger.debug("running whisper with {} seconds of audio...",
Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
var releaseContext = !config.preloadModel;
WhisperJNI whisper = null;
WhisperContext ctx = null;
WhisperState state = null;
try {
whisper = getWhisper();
ctx = getContext();
logger.debug("Creating whisper state...");
state = whisper.initState(ctx);
logger.debug("Whisper state created");
WhisperFullParams params = getWhisperFullParams(ctx, language);
// convert to local whisper format (float)
float[] floatArray = new float[audioSamples.length];
for (int i = 0; i < audioSamples.length; i++) {
floatArray[i] = Float.min(1f, Float.max((float) audioSamples[i] / ((float) Short.MAX_VALUE), -1f));
}
long execStartTime = System.currentTimeMillis();
var result = whisper.fullWithState(ctx, state, params, floatArray, audioSamplesOffset);
logger.debug("whisper ended in {}ms with result code {}", System.currentTimeMillis() - execStartTime,
result);
// process result
if (result != 0) {
throw new STTException("Cannot use whisper locally, result code: " + result);
}
int nSegments = whisper.fullNSegmentsFromState(state);
logger.debug("Available transcription segments {}", nSegments);
if (nSegments == 1) {
return whisper.fullGetSegmentTextFromState(state, 0);
} else if (nSegments == 0 && config.singleUtteranceMode) {
logger.debug("Single utterance mode and no results, ending transcription");
return null;
} else {
// non reachable
logger.warn("Whisper should be configured in single segment mode {}", nSegments);
return null;
}
} catch (IOException e) {
throw new STTException("Cannot use whisper locally", e);
} finally {
// always release the state, not only on failure
if (state != null) {
state.close();
}
if (releaseContext && ctx != null) {
ctx.close();
}
}
}
private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {
// convert to a byte array; each short takes 2 bytes
int size = audioSamplesOffset * 2;
ByteBuffer byteArrayBuffer = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < audioSamplesOffset; i++) {
byteArrayBuffer.putShort(audioStream[i]);
}
javax.sound.sampled.AudioFormat jAudioFormat = new javax.sound.sampled.AudioFormat(
javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE,
false);
byte[] byteArray = byteArrayBuffer.array();
try {
AudioInputStream audioInputStream = new AudioInputStream(new ByteArrayInputStream(byteArray), jAudioFormat,
audioSamplesOffset);
// write the stream as a WAV file into an in-memory byte array
ByteArrayInputStream byteArrayInputStream = null;
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, baos);
byteArrayInputStream = new ByteArrayInputStream(baos.toByteArray());
}
// prepare HTTP request
HttpClient commonHttpClient = httpClientFactory.getCommonHttpClient();
MultiPartContentProvider multiPartContentProvider = new MultiPartContentProvider();
multiPartContentProvider.addFilePart("file", "audio.wav",
new InputStreamContentProvider(byteArrayInputStream), null);
multiPartContentProvider.addFieldPart("model", new StringContentProvider(this.config.apiModelName), null);
multiPartContentProvider.addFieldPart("response_format", new StringContentProvider("text"), null);
multiPartContentProvider.addFieldPart("temperature",
new StringContentProvider(Float.toString(this.config.temperature)), null);
if (!language.isBlank()) {
multiPartContentProvider.addFieldPart("language", new StringContentProvider(language), null);
}
Request request = commonHttpClient.newRequest(config.apiUrl).method(HttpMethod.POST)
.content(multiPartContentProvider);
if (!config.apiKey.isBlank()) {
request = request.header("Authorization", "Bearer " + config.apiKey);
}
// execute the request
ContentResponse response = request.send();
// check the HTTP status code from the response
int statusCode = response.getStatus();
if (statusCode < 200 || statusCode >= 300) {
logger.debug("HTTP error: Received status code {}, full error is {}", statusCode,
response.getContentAsString());
throw new STTException("Failed to retrieve transcription: HTTP status code " + statusCode);
}
return response.getContentAsString();
} catch (InterruptedException | TimeoutException | ExecutionException | IOException e) {
throw new STTException("Exception during attempt to get speech recognition result from api", e);
}
}
private WhisperFullParams getWhisperFullParams(WhisperContext context, String language) throws IOException {
WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy); WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
var params = new WhisperFullParams(strategy); var params = new WhisperFullParams(strategy);
params.temperature = config.temperature; params.temperature = config.temperature;
@@ -570,7 +679,7 @@ public class WhisperSTTService implements STTService {
             params.grammarPenalty = config.grammarPenalty;
         }
         // there is no single language models other than the english ones
-        params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en";
+        params.language = getWhisper().isMultilingual(context) ? language : "en";
         // implementation assumes this options
         params.translate = false;
         params.detectLanguage = false;
@@ -605,7 +714,7 @@ public class WhisperSTTService implements STTService {
         }
     }
-    private void createAudioFile(float[] samples, int size, String transcription, String language) {
+    private void createAudioFile(short[] samples, int size, String transcription, String language) {
         createSamplesDir();
         javax.sound.sampled.AudioFormat jAudioFormat;
         ByteBuffer byteBuffer;
@@ -615,7 +724,7 @@ public class WhisperSTTService implements STTService {
                     WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
             byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
             for (int i = 0; i < size; i++) {
-                byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE));
+                byteBuffer.putShort(samples[i]);
             }
         } else {
             logger.debug("Saving audio file with sample format f32");
@@ -623,7 +732,7 @@ public class WhisperSTTService implements STTService {
                     WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
             byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
             for (int i = 0; i < size; i++) {
-                byteBuffer.putFloat(samples[i]);
+                byteBuffer.putFloat(Float.min(1f, Float.max((float) samples[i] / ((float) Short.MAX_VALUE), -1f)));
             }
         }
         AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
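
The hunks above switch the sample buffer from `float[]` to `short[]`, wrap the 16-bit samples in a WAV container for the API upload, and clamp to [-1, 1] when a 32-bit float file is requested. A minimal standalone sketch of those two conversions follows. The class name `PcmWavSketch`, its method names, and the 16 kHz value standing in for `WHISPER_SAMPLE_RATE` are assumptions for illustration, not code from this commit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

final class PcmWavSketch {
    // Assumed stand-in for WHISPER_SAMPLE_RATE; the real constant is not shown in this diff.
    static final float SAMPLE_RATE = 16000f;

    /** Wraps the first {@code count} signed 16-bit mono samples in a WAV container. */
    static byte[] toWav(short[] samples, int count) throws IOException {
        // 2 bytes per sample, little-endian, matching the AudioFormat below
        ByteBuffer pcm = ByteBuffer.allocate(count * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < count; i++) {
            pcm.putShort(samples[i]);
        }
        AudioFormat format = new AudioFormat(AudioFormat.Encoding.PCM_SIGNED, SAMPLE_RATE, 16, 1, 2,
                SAMPLE_RATE, false);
        // The frame length is passed explicitly so the WAVE writer knows the data size up front.
        try (AudioInputStream in = new AudioInputStream(new ByteArrayInputStream(pcm.array()), format, count);
                ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            AudioSystem.write(in, AudioFileFormat.Type.WAVE, out);
            return out.toByteArray();
        }
    }

    /** Scales a 16-bit sample to a float and clamps it to [-1, 1]. */
    static float toFloat(short sample) {
        return Math.min(1f, Math.max(sample / (float) Short.MAX_VALUE, -1f));
    }

    public static void main(String[] args) throws IOException {
        byte[] wav = toWav(new short[] { 0, 1000, -1000, Short.MAX_VALUE }, 4);
        System.out.println("WAV size in bytes: " + wav.length);
        System.out.println("Short.MIN_VALUE as float: " + toFloat(Short.MIN_VALUE));
    }
}
```

The clamp matters because `Short.MIN_VALUE` divided by `Short.MAX_VALUE` falls slightly below -1.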


@@ -11,7 +11,7 @@
         </parameter-group>
         <parameter-group name="vad">
             <label>Voice Activity Detection</label>
-            <description>Configure the VAD mechanisim used to isolate single phrases to feed whisper with.</description>
+            <description>Configure the VAD mechanism used to isolate single phrases to feed whisper with.</description>
         </parameter-group>
         <parameter-group name="whisper">
             <label>Whisper Options</label>
@@ -19,7 +19,7 @@
         </parameter-group>
         <parameter-group name="grammar">
             <label>Grammar</label>
-            <description>Define a grammar to improve transcrptions.</description>
+            <description>Define a grammar to improve transcriptions.</description>
         </parameter-group>
         <parameter-group name="messages">
             <label>Info Messages</label>
@@ -30,9 +30,27 @@
             <description>Options added for developers.</description>
             <advanced>true</advanced>
         </parameter-group>
+        <parameter-group name="openaiapi">
+            <label>API Configuration Options</label>
+            <description>Configure OpenAI compatible API, if you don't want to use the local model.</description>
+        </parameter-group>
+        <parameter name="mode" type="text" groupName="stt">
+            <label>Local Mode Or API</label>
+            <description>Use the local model or the OpenAI compatible API.</description>
+            <default>LOCAL</default>
+            <options>
+                <option value="LOCAL">Local</option>
+                <option value="API">OpenAI API</option>
+            </options>
+        </parameter>
         <parameter name="modelName" type="text" groupName="stt" required="true">
-            <label>Model Name</label>
-            <description>Model name without extension.</description>
+            <label>Local Model Name</label>
+            <description>Model name without extension. Local mode only.</description>
+        </parameter>
+        <parameter name="language" type="text" groupName="whisper">
+            <label>Language</label>
+            <description>If specified, speed up recognition by avoiding auto-detection. Default to system locale.</description>
+            <default></default>
         </parameter>
         <parameter name="preloadModel" type="boolean" groupName="stt">
             <label>Preload Model</label>
@@ -225,5 +243,20 @@
             <default>false</default>
             <advanced>true</advanced>
         </parameter>
+        <parameter name="apiKey" type="text" groupName="openaiapi">
+            <label>API Key</label>
+            <description>Key to access the API</description>
+            <default></default>
+        </parameter>
+        <parameter name="apiUrl" type="text" groupName="openaiapi">
+            <label>API Url</label>
+            <description>OpenAI compatible API URL. Default to OpenAI transcription service.</description>
+            <default>https://api.openai.com/v1/audio/transcriptions</default>
+        </parameter>
+        <parameter name="apiModelName" type="text" groupName="openaiapi">
+            <label>API Model</label>
+            <description>Model name to use (API only). Default to OpenAI only available model (whisper-1).</description>
+            <default>whisper-1</default>
+        </parameter>
     </config-description>
 </config-description:config-descriptions>
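
The `apiModelName` and `language` parameters declared above end up as form fields of the transcription request built in `WhisperSTTService`. A minimal sketch of that mapping is given here, using the field names the commit sends (`model`, `response_format`, `temperature`, `language`) and its 2xx status check. The class `TranscriptionFieldsSketch` and its method names are hypothetical, and the audio file part and the Jetty HTTP client are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TranscriptionFieldsSketch {

    /** Builds the text form fields; a blank language is omitted so the service can auto-detect. */
    static Map<String, String> formFields(String apiModelName, float temperature, String language) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("model", apiModelName);
        fields.put("response_format", "text");
        fields.put("temperature", Float.toString(temperature));
        if (!language.isBlank()) {
            fields.put("language", language);
        }
        return fields;
    }

    /** Mirrors the status check in the commit: anything outside 200..299 is treated as an error. */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    public static void main(String[] args) {
        // "whisper-1" is the default value of apiModelName in the config description
        System.out.println(formFields("whisper-1", 0f, ""));
        System.out.println(formFields("whisper-1", 0f, "fr"));
        System.out.println(isSuccess(404));
    }
}
```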


@@ -3,6 +3,12 @@
 addon.whisperstt.name = Whisper Speech-to-Text
 addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcript audio data to text.
+voice.config.whisperstt.apiKey.label = API Key
+voice.config.whisperstt.apiKey.description = Key to access the API
+voice.config.whisperstt.apiModelName.label = API Model
+voice.config.whisperstt.apiModelName.description = Model name to use (API only). Default to OpenAI only available model (whisper-1).
+voice.config.whisperstt.apiUrl.label = API Url
+voice.config.whisperstt.apiUrl.description = OpenAI compatible API URL. Default to OpenAI transcription service.
 voice.config.whisperstt.audioContext.label = Audio Context
 voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size)
 voice.config.whisperstt.beamSize.label = Beam Size
@@ -24,27 +30,35 @@ voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sam
 voice.config.whisperstt.group.developer.label = Developer
 voice.config.whisperstt.group.developer.description = Options added for developers.
 voice.config.whisperstt.group.grammar.label = Grammar
-voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcrptions.
+voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcriptions.
 voice.config.whisperstt.group.messages.label = Info Messages
 voice.config.whisperstt.group.messages.description = Configure service information messages.
+voice.config.whisperstt.group.openaiapi.label = API Configuration Options
+voice.config.whisperstt.group.openaiapi.description = Configure OpenAI compatible API, if you don't want to use the local model.
 voice.config.whisperstt.group.stt.label = STT Configuration
 voice.config.whisperstt.group.stt.description = Configure Speech to Text.
 voice.config.whisperstt.group.vad.label = Voice Activity Detection
-voice.config.whisperstt.group.vad.description = Configure the VAD mechanisim used to isolate single phrases to feed whisper with.
+voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with.
 voice.config.whisperstt.group.whisper.label = Whisper Options
 voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options.
 voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds
 voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription.
 voice.config.whisperstt.initialPrompt.label = Initial Prompt
 voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with.
+voice.config.whisperstt.language.label = Language
+voice.config.whisperstt.language.description = If specified, speed up recognition by avoiding auto-detection. Default to system locale.
 voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds
 voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection.
 voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds
 voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription.
 voice.config.whisperstt.minSeconds.label = Min Transcription Seconds
 voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper.
-voice.config.whisperstt.modelName.label = Model Name
-voice.config.whisperstt.modelName.description = Model name without extension.
+voice.config.whisperstt.mode.label = Local Mode Or API
+voice.config.whisperstt.mode.description = Use the local model or the OpenAI compatible API.
+voice.config.whisperstt.mode.option.LOCAL = Local
+voice.config.whisperstt.mode.option.API = OpenAI API
+voice.config.whisperstt.modelName.label = Local Model Name
+voice.config.whisperstt.modelName.description = Model name without extension. Local mode only.
 voice.config.whisperstt.openvinoDevice.label = OpenVINO Device
 voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
 voice.config.whisperstt.preloadModel.label = Preload Model