Mirror of https://github.com/openhab/openhab-addons.git, synced 2025-01-10 15:11:59 +01:00.
Merge 5487ef17bc into adacdebb9f (commit fe8f06aa09).
@ -5,6 +5,8 @@ It also uses [libfvad](https://github.com/dpirch/libfvad) for voice activity det

[Whisper.cpp](https://github.com/ggerganov/whisper.cpp) is a highly optimized, lightweight C++ implementation of [whisper](https://github.com/openai/whisper) that can be easily integrated into different platforms and applications.

Alternatively, if you do not want to perform speech-to-text on the computer hosting openHAB, this add-on can consume an OpenAI/Whisper compatible transcription API.

Whisper enables speech recognition for multiple languages and dialects:

english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish,
@ -15,9 +17,11 @@ marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, geo
uzbek, faroese, haitian, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, lingala,
hausa, bashkir, javanese and sundanese.

## Supported platforms
## Local mode (offline)

This add-on uses some native binaries to work.
### Supported platforms

This add-on uses some native binaries to work when performing offline recognition.
You can find here the used [whisper.cpp Java wrapper](https://github.com/GiviMAD/whisper-jni) and [libfvad Java wrapper](https://github.com/GiviMAD/libfvad-jni).

The following platforms are supported:
@ -28,7 +32,7 @@ The following platforms are supported:

The native binaries for those platforms are included in this add-on provided with the openHAB distribution.

## CPU compatibility
### CPU compatibility

To use this binding it's recommended to use a device at least as powerful as the Raspberry Pi 5 with a modern CPU.
Execution times on a Raspberry Pi 4 are roughly twice as long, so only the tiny model runs in under 5 seconds.
@ -40,18 +44,18 @@ You can check those flags on Windows using a program like `CPU-Z`.
If you are going to use the binding on an `arm64` host, the CPU should support the flag `fphp`.
You can check those flags on Linux using the terminal with `lscpu`.

## Transcription time
### Transcription time

On a Raspberry Pi 5, the approximate transcription times are:

| model      | exec time |
| ---------- | --------: |
|------------|----------:|
| tiny.bin   |      1.5s |
| base.bin   |        3s |
| small.bin  |      8.5s |
| medium.bin |       17s |

## Configuring the model
### Configuring the model

Before you can use this service you should configure your model.

@ -64,7 +68,7 @@ You should place the downloaded .bin model in '\<openHAB userdata\>/whisper/' so

Remember to check that you have enough RAM to load the model; estimated RAM consumption can be checked on the huggingface link.

## Using alternative whisper.cpp library
### Using alternative whisper.cpp library

It's possible to use your own build of the whisper.cpp shared library with this add-on.

@ -76,7 +80,7 @@ In the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) README you can fi

Note: You need to restart openHAB to reload the library.

## Grammar
### Grammar

The whisper.cpp library allows defining a grammar to alter the transcription results without fine-tuning the model.

@ -99,6 +103,14 @@ tv_channel ::= ("set ")? "tv channel to " [0-9]+

You can provide the grammar and enable its usage using the binding configuration.

## API mode

You can also use this add-on with a remote API that is compatible with the 'transcription' API from OpenAI. Online services exposing such an API may require an API key (paid services, such as OpenAI).

You can host your own compatible service elsewhere on your network, with third-party software such as faster-whisper-server.

Please note that API mode also uses libfvad for voice activity detection, and that grammar parameters are not available.
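
As a rough illustration of what such a transcription request can look like, the sketch below builds (without sending) a `multipart/form-data` POST using the JDK's `java.net.http` API. This is an assumption-laden example, not the add-on's implementation, which relies on openHAB's shared Jetty client; the class name `TranscriptionRequestSketch`, its method names and the boundary string are hypothetical. The field names `model`, `response_format`, `language` and `file` mirror the ones used in this change.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the add-on's code: builds (but does not send) a
// multipart request for an OpenAI-compatible transcription endpoint.
final class TranscriptionRequestSketch {

    private TranscriptionRequestSketch() {
    }

    // Encodes the text fields plus one WAV file part as multipart/form-data.
    static byte[] multipartBody(String boundary, Map<String, String> fields, byte[] wav) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String part = "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n"
                    + field.getValue() + "\r\n";
            append(out, part.getBytes(StandardCharsets.UTF_8));
        }
        String fileHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n"
                + "Content-Type: audio/wav\r\n\r\n";
        append(out, fileHeader.getBytes(StandardCharsets.UTF_8));
        append(out, wav);
        append(out, ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    private static void append(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    // The Authorization header is only added when an API key is configured.
    static HttpRequest build(String apiUrl, String apiKey, String model, String language, byte[] wav) {
        String boundary = "sketch-boundary-" + System.nanoTime();
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("model", model);
        fields.put("response_format", "text");
        if (!language.isBlank()) {
            fields.put("language", language);
        }
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(apiUrl))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(multipartBody(boundary, fields, wav)));
        if (!apiKey.isBlank()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }
}
```

A self-hosted service that needs no key would simply be called with an empty `apiKey`, in which case no `Authorization` header is set.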

## Configuration

Use your favorite configuration UI to edit the Whisper settings:
@ -107,6 +119,7 @@ Use your favorite configuration UI to edit the Whisper settings:

General options.

- **Mode : LOCAL or API** - Choose either local computation or remote API use.
- **Model Name** - Model name. The 'ggml-' prefix and '.bin' extension are optional here but required on the filename. (ex: tiny.en -> ggml-tiny.en.bin)
- **Preload Model** - Keep whisper model loaded.
- **Single Utterance Mode** - When enabled recognition stops listening after a single utterance.
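
The **Model Name** rule above (optional 'ggml-' prefix and '.bin' extension in the setting, both required on the file) can be pictured with a small sketch. The helper below is hypothetical and only illustrates the mapping described in the option text; it is not taken from the add-on's source.

```java
// Hypothetical helper, not part of the add-on: maps a configured model name
// to the file name expected in the whisper folder.
final class ModelFileNames {

    private ModelFileNames() {
    }

    // Adds the "ggml-" prefix and the ".bin" extension when they are missing.
    static String toFileName(String modelName) {
        String name = modelName.trim();
        if (!name.startsWith("ggml-")) {
            name = "ggml-" + name;
        }
        if (!name.endsWith(".bin")) {
            name = name + ".bin";
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(toFileName("tiny.en")); // prints ggml-tiny.en.bin
    }
}
```

Under this assumption, `tiny.en`, `ggml-tiny.en` and `ggml-tiny.en.bin` all resolve to the same file.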
@ -139,6 +152,13 @@ Configure whisper options.
- **Initial Prompt** - Initial prompt for whisper.
- **OpenVINO Device** - Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
- **Use GPU** - Enables GPU usage. (built-in binaries do not support GPU usage, this has no effect)
- **Language** - If specified, speeds up recognition by avoiding auto-detection. Defaults to the system locale.
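
The fallback described for the **Language** option can be sketched as follows. This is a minimal illustration under stated assumptions, not the add-on's code: the class and method names are hypothetical, and the server locale is passed in as a plain parameter. Note that `Locale.forLanguageTag` does not throw for ill-formed tags; it returns a locale with an empty language, which the sketch treats as "not configured".

```java
import java.util.Locale;

// Hypothetical sketch: resolve the recognition locale from the configured
// language, falling back to the server locale.
final class LanguageResolution {

    private LanguageResolution() {
    }

    static Locale resolve(String configuredLanguage, Locale serverLocale) {
        if (configuredLanguage == null || configuredLanguage.isBlank()) {
            return serverLocale;
        }
        Locale parsed = Locale.forLanguageTag(configuredLanguage.trim());
        // An ill-formed tag yields an empty language instead of an exception.
        return parsed.getLanguage().isEmpty() ? serverLocale : parsed;
    }
}
```

For example, `resolve("de", Locale.FRENCH)` yields a German locale, while `resolve("", Locale.FRENCH)` keeps the French one.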

### API Configuration

- **API key** - Optional API key, for online services that require one.
- **API url** - You may use your own service and define its URL here. Defaults to the OpenAI transcription API.
- **API model name** - Your hosted service may offer other models. Defaults to 'whisper-1', the only model OpenAI provides.

### Grammar Configuration

@ -199,7 +219,9 @@ In case you would like to set up the service via a text file, create a new file
Its contents should look similar to:

```ini
org.openhab.voice.whisperstt:mode=LOCAL
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:language=en
org.openhab.voice.whisperstt:initSilenceSeconds=0.3
org.openhab.voice.whisperstt:removeSilence=true
org.openhab.voice.whisperstt:stepSeconds=0.3
@ -229,6 +251,9 @@ org.openhab.voice.whisperstt:useGPU=false
org.openhab.voice.whisperstt:useGrammar=false
org.openhab.voice.whisperstt:grammarPenalty=80.0
org.openhab.voice.whisperstt:grammarLines=
org.openhab.voice.whisperstt:apiKey=mykeyaaaa
org.openhab.voice.whisperstt:apiUrl=https://api.openai.com/v1/audio/transcriptions
org.openhab.voice.whisperstt:apiModelName=whisper-1
```

### Default Speech-to-Text Configuration
@ -146,4 +146,29 @@ public class WhisperSTTConfiguration {
     * Print whisper.cpp library logs as binding debug logs.
     */
    public boolean enableWhisperLog;
    /**
     * LOCAL to use the embedded whisper library, or API to use an external OpenAI-compatible API
     */
    public Mode mode = Mode.LOCAL;
    /**
     * If mode is set to API, use this URL
     */
    public String apiUrl = "https://api.openai.com/v1/audio/transcriptions";
    /**
     * If mode is set to API, use this API key to access apiUrl
     */
    public String apiKey = "";
    /**
     * If specified, speed up recognition by avoiding auto-detection
     */
    public String language = "";
    /**
     * Model name (API only)
     */
    public String apiModelName = "whisper-1";

    public static enum Mode {
        LOCAL,
        API;
    }
}
@ -12,12 +12,10 @@
 */
package org.openhab.voice.whisperstt.internal;

import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_CATEGORY;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_ID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_NAME;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.SERVICE_PID;
import static org.openhab.voice.whisperstt.internal.WhisperSTTConstants.*;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
@ -32,7 +30,9 @@ import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

import javax.sound.sampled.AudioFileFormat;
@ -41,6 +41,13 @@ import javax.sound.sampled.AudioSystem;

import org.eclipse.jdt.annotation.NonNullByDefault;
import org.eclipse.jdt.annotation.Nullable;
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;
import org.eclipse.jetty.client.api.Request;
import org.eclipse.jetty.client.util.InputStreamContentProvider;
import org.eclipse.jetty.client.util.MultiPartContentProvider;
import org.eclipse.jetty.client.util.StringContentProvider;
import org.eclipse.jetty.http.HttpMethod;
import org.openhab.core.OpenHAB;
import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioStream;
@ -48,6 +55,7 @@ import org.openhab.core.audio.utils.AudioWaveUtils;
import org.openhab.core.common.ThreadPoolManager;
import org.openhab.core.config.core.ConfigurableService;
import org.openhab.core.config.core.Configuration;
import org.openhab.core.io.net.http.HttpClientFactory;
import org.openhab.core.io.rest.LocaleService;
import org.openhab.core.voice.RecognitionStartEvent;
import org.openhab.core.voice.RecognitionStopEvent;
@ -57,6 +65,7 @@ import org.openhab.core.voice.STTService;
import org.openhab.core.voice.STTServiceHandle;
import org.openhab.core.voice.SpeechRecognitionErrorEvent;
import org.openhab.core.voice.SpeechRecognitionEvent;
import org.openhab.voice.whisperstt.internal.WhisperSTTConfiguration.Mode;
import org.openhab.voice.whisperstt.internal.utils.VAD;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Activate;
@ -96,10 +105,13 @@ public class WhisperSTTService implements STTService {
    private @Nullable WhisperContext context;
    private @Nullable WhisperGrammar grammar;
    private @Nullable WhisperJNI whisper;
    private boolean isWhisperLibAlreadyLoaded = false;
    private final HttpClientFactory httpClientFactory;

    @Activate
    public WhisperSTTService(@Reference LocaleService localeService) {
    public WhisperSTTService(@Reference LocaleService localeService, @Reference HttpClientFactory httpClientFactory) {
        this.localeService = localeService;
        this.httpClientFactory = httpClientFactory;
    }

    @Activate
@ -108,7 +120,8 @@ public class WhisperSTTService implements STTService {
            if (!Files.exists(WHISPER_FOLDER)) {
                Files.createDirectory(WHISPER_FOLDER);
            }
            WhisperJNI.loadLibrary(getLoadOptions());
            this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
            loadWhisperLibraryIfNeeded();
            VoiceActivityDetector.loadLibrary();
            whisper = new WhisperJNI();
        } catch (IOException | RuntimeException e) {
@ -117,6 +130,13 @@ public class WhisperSTTService implements STTService {
        configChange(config);
    }

    private void loadWhisperLibraryIfNeeded() throws IOException {
        if (config.mode == Mode.LOCAL && !isWhisperLibAlreadyLoaded) {
            WhisperJNI.loadLibrary(getLoadOptions());
            isWhisperLibAlreadyLoaded = true;
        }
    }

    private WhisperJNI.LoadOptions getLoadOptions() {
        Path libFolder = Paths.get("/usr/local/lib");
        Path libFolderWin = Paths.get("/Windows/System32");
@ -167,14 +187,27 @@ public class WhisperSTTService implements STTService {

    private void configChange(Map<String, Object> config) {
        this.config = new Configuration(config).as(WhisperSTTConfiguration.class);
        WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
        WhisperGrammar grammar = this.grammar;
        if (grammar != null) {
            grammar.close();
            this.grammar = null;
        }

        // API mode
        if (this.config.mode == Mode.API) {
            try {
                unloadContext();
            } catch (IOException e) {
                logger.warn("IOException unloading model: {}", e.getMessage());
            }
            return;
        }

        // Local mode
        WhisperJNI whisper;
        try {
            loadWhisperLibraryIfNeeded();
            WhisperJNI.setLibraryLogger(this.config.enableWhisperLog ? this::onWhisperLog : null);
            whisper = getWhisper();
        } catch (IOException ignored) {
            logger.warn("library not loaded, the add-on will not work");
@ -228,9 +261,17 @@ public class WhisperSTTService implements STTService {

    @Override
    public Set<Locale> getSupportedLocales() {
        // as it is not possible to determine the language of the model that was downloaded and setup by the user, it is
        // assumed the language of the model is matching the locale of the openHAB server
        return Set.of(localeService.getLocale(null));
        // Attempt to create a locale from the configured language
        String language = config.language;
        Locale modelLocale = localeService.getLocale(null);
        if (!language.isBlank()) {
            try {
                modelLocale = Locale.forLanguageTag(language);
            } catch (IllegalArgumentException e) {
                logger.warn("Invalid language '{}', defaulting to server locale", language);
            }
        }
        return Set.of(modelLocale);
    }

    @Override
@ -246,33 +287,18 @@ public class WhisperSTTService implements STTService {
    public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
            throws STTException {
        AtomicBoolean aborted = new AtomicBoolean(false);
        WhisperContext ctx = null;
        WhisperState state = null;
        try {
            var whisper = getWhisper();
            ctx = getContext();
            logger.debug("Creating whisper state...");
            state = whisper.initState(ctx);
            logger.debug("Whisper state created");
            logger.debug("Creating VAD instance...");
            final int nSamplesStep = (int) (config.stepSeconds * (float) WHISPER_SAMPLE_RATE);
            final int nSamplesStep = (int) (config.stepSeconds * WHISPER_SAMPLE_RATE);
            VAD vad = new VAD(VoiceActivityDetector.Mode.valueOf(config.vadMode), WHISPER_SAMPLE_RATE, nSamplesStep,
                    config.vadStep, config.vadSensitivity);
            logger.debug("VAD instance created");
            sttListener.sttEventReceived(new RecognitionStartEvent());
            backgroundRecognize(whisper, ctx, state, nSamplesStep, locale, sttListener, audioStream, vad, aborted);
            backgroundRecognize(nSamplesStep, locale, sttListener, audioStream, vad, aborted);
        } catch (IOException e) {
            if (ctx != null && !config.preloadModel) {
                ctx.close();
            }
            if (state != null) {
                state.close();
            }
            throw new STTException("Exception during initialization", e);
        }
        return () -> {
            aborted.set(true);
        };
        return () -> aborted.set(true);
    }

    private WhisperJNI getWhisper() throws IOException {
@ -339,9 +365,8 @@ public class WhisperSTTService implements STTService {
        }
    }

    private void backgroundRecognize(WhisperJNI whisper, WhisperContext ctx, WhisperState state, final int nSamplesStep,
            Locale locale, STTListener sttListener, AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
        var releaseContext = !config.preloadModel;
    private void backgroundRecognize(final int nSamplesStep, Locale locale, STTListener sttListener,
            AudioStream audioStream, VAD vad, AtomicBoolean aborted) {
        final int nSamplesMax = config.maxSeconds * WHISPER_SAMPLE_RATE;
        final int nSamplesMin = (int) (config.minSeconds * (float) WHISPER_SAMPLE_RATE);
        final int nInitSilenceSamples = (int) (config.initSilenceSeconds * (float) WHISPER_SAMPLE_RATE);
@ -353,21 +378,17 @@ public class WhisperSTTService implements STTService {
        logger.debug("Max silence samples {}", nMaxSilenceSamples);
        // used to store the step samples in libfvad wanted format 16-bit int
        final short[] stepAudioSamples = new short[nSamplesStep];
        // used to store the full samples in whisper wanted format 32-bit float
        final float[] audioSamples = new float[nSamplesMax];
        // used to store the full retained samples for whisper
        final short[] audioSamples = new short[nSamplesMax];
        executor.submit(() -> {
            int audioSamplesOffset = 0;
            int silenceSamplesCounter = 0;
            int nProcessedSamples = 0;
            int numBytesRead;
            boolean voiceDetected = false;
            String transcription = "";
            String tempTranscription = "";
            VAD.@Nullable VADResult lastVADResult;
            VAD.@Nullable VADResult firstConsecutiveSilenceVADResult = null;
            try {
                try (state; //
                        audioStream; //
                try (audioStream; //
                        vad) {
                    if (AudioFormat.CONTAINER_WAVE.equals(audioStream.getFormat().getContainer())) {
                        AudioWaveUtils.removeFMT(audioStream);
@ -376,10 +397,9 @@ public class WhisperSTTService implements STTService {
                            .order(ByteOrder.LITTLE_ENDIAN);
                    // init remaining to full capacity
                    int remaining = captureBuffer.capacity();
                    WhisperFullParams params = getWhisperFullParams(ctx, locale);
                    while (!aborted.get()) {
                        // read until no remaining so we get the complete step samples
                        numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
                        int numBytesRead = audioStream.read(captureBuffer.array(), captureBuffer.capacity() - remaining,
                                remaining);
                        if (aborted.get() || numBytesRead == -1) {
                            break;
@ -395,17 +415,15 @@ public class WhisperSTTService implements STTService {
                        while (shortBuffer.hasRemaining()) {
                            var position = shortBuffer.position();
                            short i16BitSample = shortBuffer.get();
                            float f32BitSample = Float.min(1f,
                                    Float.max((float) i16BitSample / ((float) Short.MAX_VALUE), -1f));
                            stepAudioSamples[position] = i16BitSample;
                            audioSamples[audioSamplesOffset++] = f32BitSample;
                            audioSamples[audioSamplesOffset++] = i16BitSample;
                            nProcessedSamples++;
                        }
                        // run vad
                        if (nProcessedSamples + nSamplesStep > nSamplesMax - nSamplesStep) {
                            logger.debug("VAD: Skipping, max length reached");
                        } else {
                            lastVADResult = vad.analyze(stepAudioSamples);
                            VAD.@Nullable VADResult lastVADResult = vad.analyze(stepAudioSamples);
                            if (lastVADResult.isVoice()) {
                                voiceDetected = true;
                                logger.debug("VAD: voice detected");
@ -484,43 +502,26 @@ public class WhisperSTTService implements STTService {
                                }
                            }
                        }
                        // run whisper
                        logger.debug("running whisper with {} seconds of audio...",
                                Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
                        long execStartTime = System.currentTimeMillis();
                        var result = whisper.fullWithState(ctx, state, params, audioSamples, audioSamplesOffset);
                        logger.debug("whisper ended in {}ms with result code {}",
                                System.currentTimeMillis() - execStartTime, result);
                        // process result
                        if (result != 0) {
                            emitSpeechRecognitionError(sttListener);
                            break;
                        }
                        int nSegments = whisper.fullNSegmentsFromState(state);
                        logger.debug("Available transcription segments {}", nSegments);
                        if (nSegments == 1) {
                            tempTranscription = whisper.fullGetSegmentTextFromState(state, 0);
                        // run whisper, either locally or by remote API
                        String tempTranscription = (switch (config.mode) {
                            case LOCAL -> recognizeLocal(audioSamplesOffset, audioSamples, locale.getLanguage());
                            case API -> recognizeAPI(audioSamplesOffset, audioSamples, locale.getLanguage());
                        });

                        if (tempTranscription != null && !tempTranscription.isBlank()) {
                            if (config.createWAVRecord) {
                                createAudioFile(audioSamples, audioSamplesOffset, tempTranscription,
                                        locale.getLanguage());
                            }
                            transcription += tempTranscription;
                            if (config.singleUtteranceMode) {
                                logger.debug("single utterance mode, ending transcription");
                                transcription = tempTranscription;
                                break;
                            }
                            } else {
                                // start a new transcription segment
                                transcription += tempTranscription;
                                tempTranscription = "";
                            }
                        } else if (nSegments == 0 && config.singleUtteranceMode) {
                            logger.debug("Single utterance mode and no results, ending transcription");
                            break;
                        } else if (nSegments > 1) {
                            // non reachable
                            logger.warn("Whisper should be configured in single segment mode {}", nSegments);
                            break;
                        }

                        // reset state to start with next segment
                        voiceDetected = false;
                        silenceSamplesCounter = 0;
@ -528,10 +529,6 @@ public class WhisperSTTService implements STTService {
                        logger.debug("Partial transcription: {}", tempTranscription);
                        logger.debug("Transcription: {}", transcription);
                    }
                } finally {
                    if (releaseContext) {
                        ctx.close();
                    }
                }
                // emit result
                if (!aborted.get()) {
@ -543,7 +540,7 @@ public class WhisperSTTService implements STTService {
                        emitSpeechRecognitionNoResultsError(sttListener);
                    }
                }
            } catch (IOException e) {
            } catch (STTException | IOException e) {
                logger.warn("Error running speech to text: {}", e.getMessage());
                emitSpeechRecognitionError(sttListener);
            } catch (UnsatisfiedLinkError e) {
@ -553,7 +550,119 @@ public class WhisperSTTService implements STTService {
        });
    }

    private WhisperFullParams getWhisperFullParams(WhisperContext context, Locale locale) throws IOException {
    @Nullable
    private String recognizeLocal(int audioSamplesOffset, short[] audioSamples, String language) throws STTException {
        logger.debug("running whisper with {} seconds of audio...",
                Math.round((((float) audioSamplesOffset) / (float) WHISPER_SAMPLE_RATE) * 100f) / 100f);
        var releaseContext = !config.preloadModel;

        WhisperJNI whisper = null;
        WhisperContext ctx = null;
        WhisperState state = null;
        try {
            whisper = getWhisper();
            ctx = getContext();
            logger.debug("Creating whisper state...");
            state = whisper.initState(ctx);
            logger.debug("Whisper state created");
            WhisperFullParams params = getWhisperFullParams(ctx, language);

            // convert to local whisper format (float)
            float[] floatArray = new float[audioSamples.length];
            for (int i = 0; i < audioSamples.length; i++) {
                floatArray[i] = Float.min(1f, Float.max((float) audioSamples[i] / ((float) Short.MAX_VALUE), -1f));
            }

            long execStartTime = System.currentTimeMillis();
            var result = whisper.fullWithState(ctx, state, params, floatArray, audioSamplesOffset);
            logger.debug("whisper ended in {}ms with result code {}", System.currentTimeMillis() - execStartTime,
                    result);
            // process result
            if (result != 0) {
                throw new STTException("Cannot use whisper locally, result code: " + result);
            }
            int nSegments = whisper.fullNSegmentsFromState(state);
            logger.debug("Available transcription segments {}", nSegments);
            if (nSegments == 1) {
                return whisper.fullGetSegmentTextFromState(state, 0);
            } else if (nSegments == 0 && config.singleUtteranceMode) {
                logger.debug("Single utterance mode and no results, ending transcription");
                return null;
            } else {
                // non reachable
                logger.warn("Whisper should be configured in single segment mode {}", nSegments);
                return null;
            }
        } catch (IOException e) {
            if (state != null) {
                state.close();
            }
            throw new STTException("Cannot use whisper locally", e);
        } finally {
            if (releaseContext && ctx != null) {
                ctx.close();
            }
        }
    }

    private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {
        // convert to a byte array; each short takes 2 bytes
        int size = audioSamplesOffset * 2;
        ByteBuffer byteArrayBuffer = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < audioSamplesOffset; i++) {
            byteArrayBuffer.putShort(audioStream[i]);
        }
        javax.sound.sampled.AudioFormat jAudioFormat = new javax.sound.sampled.AudioFormat(
                javax.sound.sampled.AudioFormat.Encoding.PCM_SIGNED, WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE,
                false);
        byte[] byteArray = byteArrayBuffer.array();

        try {
            AudioInputStream audioInputStream = new AudioInputStream(new ByteArrayInputStream(byteArray), jAudioFormat,
                    audioSamplesOffset);

            // write the stream as a WAV file into an in-memory byte array stream
            ByteArrayInputStream byteArrayInputStream = null;
            try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
                AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, baos);
                byteArrayInputStream = new ByteArrayInputStream(baos.toByteArray());
            }

            // prepare HTTP request
            HttpClient commonHttpClient = httpClientFactory.getCommonHttpClient();
            MultiPartContentProvider multiPartContentProvider = new MultiPartContentProvider();
            multiPartContentProvider.addFilePart("file", "audio.wav",
                    new InputStreamContentProvider(byteArrayInputStream), null);
            multiPartContentProvider.addFieldPart("model", new StringContentProvider(this.config.apiModelName), null);
            multiPartContentProvider.addFieldPart("response_format", new StringContentProvider("text"), null);
            multiPartContentProvider.addFieldPart("temperature",
                    new StringContentProvider(Float.toString(this.config.temperature)), null);
            if (!language.isBlank()) {
                multiPartContentProvider.addFieldPart("language", new StringContentProvider(language), null);
            }
            Request request = commonHttpClient.newRequest(config.apiUrl).method(HttpMethod.POST)
                    .content(multiPartContentProvider);
            if (!config.apiKey.isBlank()) {
                request = request.header("Authorization", "Bearer " + config.apiKey);
            }
            // execute the request
            ContentResponse response = request.send();

            // check the HTTP status code from the response
            int statusCode = response.getStatus();
            if (statusCode < 200 || statusCode >= 300) {
                logger.debug("HTTP error: Received status code {}, full error is {}", statusCode,
                        response.getContentAsString());
                throw new STTException("Failed to retrieve transcription: HTTP status code " + statusCode);
            }
            return response.getContentAsString();

        } catch (InterruptedException | TimeoutException | ExecutionException | IOException e) {
            throw new STTException("Exception during attempt to get speech recognition result from api", e);
        }
    }

    private WhisperFullParams getWhisperFullParams(WhisperContext context, String language) throws IOException {
        WhisperSamplingStrategy strategy = WhisperSamplingStrategy.valueOf(config.samplingStrategy);
        var params = new WhisperFullParams(strategy);
        params.temperature = config.temperature;
@ -570,7 +679,7 @@ public class WhisperSTTService implements STTService {
            params.grammarPenalty = config.grammarPenalty;
        }
        // there is no single language models other than the english ones
        params.language = getWhisper().isMultilingual(context) ? locale.getLanguage() : "en";
        params.language = getWhisper().isMultilingual(context) ? language : "en";
        // implementation assumes this options
        params.translate = false;
        params.detectLanguage = false;
@ -605,7 +714,7 @@ public class WhisperSTTService implements STTService {
        }
    }

    private void createAudioFile(float[] samples, int size, String transcription, String language) {
    private void createAudioFile(short[] samples, int size, String transcription, String language) {
        createSamplesDir();
        javax.sound.sampled.AudioFormat jAudioFormat;
        ByteBuffer byteBuffer;
@ -615,7 +724,7 @@ public class WhisperSTTService implements STTService {
                    WHISPER_SAMPLE_RATE, 16, 1, 2, WHISPER_SAMPLE_RATE, false);
            byteBuffer = ByteBuffer.allocate(size * 2).order(ByteOrder.LITTLE_ENDIAN);
            for (int i = 0; i < size; i++) {
                byteBuffer.putShort((short) (samples[i] * (float) Short.MAX_VALUE));
                byteBuffer.putShort(samples[i]);
            }
        } else {
            logger.debug("Saving audio file with sample format f32");
@ -623,7 +732,7 @@ public class WhisperSTTService implements STTService {
                    WHISPER_SAMPLE_RATE, 32, 1, 4, WHISPER_SAMPLE_RATE, false);
            byteBuffer = ByteBuffer.allocate(size * 4).order(ByteOrder.LITTLE_ENDIAN);
            for (int i = 0; i < size; i++) {
                byteBuffer.putFloat(samples[i]);
                byteBuffer.putFloat(Float.min(1f, Float.max((float) samples[i] / ((float) Short.MAX_VALUE), -1f)));
            }
        }
        AudioInputStream audioInputStreamTemp = new AudioInputStream(new ByteArrayInputStream(byteBuffer.array()),
@ -11,7 +11,7 @@
		</parameter-group>
		<parameter-group name="vad">
			<label>Voice Activity Detection</label>
			<description>Configure the VAD mechanisim used to isolate single phrases to feed whisper with.</description>
			<description>Configure the VAD mechanism used to isolate single phrases to feed whisper with.</description>
		</parameter-group>
		<parameter-group name="whisper">
			<label>Whisper Options</label>
@ -19,7 +19,7 @@
		</parameter-group>
		<parameter-group name="grammar">
			<label>Grammar</label>
			<description>Define a grammar to improve transcrptions.</description>
			<description>Define a grammar to improve transcriptions.</description>
		</parameter-group>
		<parameter-group name="messages">
			<label>Info Messages</label>
@ -30,9 +30,27 @@
			<description>Options added for developers.</description>
			<advanced>true</advanced>
		</parameter-group>
		<parameter-group name="openaiapi">
			<label>API Configuration Options</label>
			<description>Configure OpenAI compatible API, if you don't want to use the local model.</description>
		</parameter-group>
		<parameter name="mode" type="text" groupName="stt">
			<label>Local Mode Or API</label>
			<description>Use the local model or the OpenAI compatible API.</description>
			<default>LOCAL</default>
			<options>
				<option value="LOCAL">Local</option>
				<option value="API">OpenAI API</option>
			</options>
		</parameter>
		<parameter name="modelName" type="text" groupName="stt" required="true">
			<label>Model Name</label>
			<description>Model name without extension.</description>
			<label>Local Model Name</label>
			<description>Model name without extension. Local mode only.</description>
		</parameter>
		<parameter name="language" type="text" groupName="whisper">
			<label>Language</label>
			<description>If specified, speed up recognition by avoiding auto-detection. Default to system locale.</description>
			<default></default>
		</parameter>
		<parameter name="preloadModel" type="boolean" groupName="stt">
			<label>Preload Model</label>
@@ -225,5 +243,20 @@
 <default>false</default>
 <advanced>true</advanced>
 </parameter>
+<parameter name="apiKey" type="text" groupName="openaiapi">
+<label>API Key</label>
+<description>Key to access the API</description>
+<default></default>
+</parameter>
+<parameter name="apiUrl" type="text" groupName="openaiapi">
+<label>API Url</label>
+<description>OpenAI compatible API URL. Default to OpenAI transcription service.</description>
+<default>https://api.openai.com/v1/audio/transcriptions</default>
+</parameter>
+<parameter name="apiModelName" type="text" groupName="openaiapi">
+<label>API Model</label>
+<description>Model name to use (API only). Default to OpenAI only available model (whisper-1).</description>
+<default>whisper-1</default>
+</parameter>
 </config-description>
 </config-description:config-descriptions>
@@ -3,6 +3,12 @@
 addon.whisperstt.name = Whisper Speech-to-Text
 addon.whisperstt.description = Whisper STT Service uses the whisper.cpp library to transcript audio data to text.
 
+voice.config.whisperstt.apiKey.label = API Key
+voice.config.whisperstt.apiKey.description = Key to access the API
+voice.config.whisperstt.apiModelName.label = API Model
+voice.config.whisperstt.apiModelName.description = Model name to use (API only). Default to OpenAI only available model (whisper-1).
+voice.config.whisperstt.apiUrl.label = API Url
+voice.config.whisperstt.apiUrl.description = OpenAI compatible API URL. Default to OpenAI transcription service.
 voice.config.whisperstt.audioContext.label = Audio Context
 voice.config.whisperstt.audioContext.description = Overwrite the audio context size. (0 to use whisper default context size)
 voice.config.whisperstt.beamSize.label = Beam Size
@@ -24,27 +30,35 @@ voice.config.whisperstt.greedyBestOf.description = Best Of configuration for sam
 voice.config.whisperstt.group.developer.label = Developer
 voice.config.whisperstt.group.developer.description = Options added for developers.
 voice.config.whisperstt.group.grammar.label = Grammar
-voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcrptions.
+voice.config.whisperstt.group.grammar.description = Define a grammar to improve transcriptions.
 voice.config.whisperstt.group.messages.label = Info Messages
 voice.config.whisperstt.group.messages.description = Configure service information messages.
+voice.config.whisperstt.group.openaiapi.label = API Configuration Options
+voice.config.whisperstt.group.openaiapi.description = Configure OpenAI compatible API, if you don't want to use the local model.
 voice.config.whisperstt.group.stt.label = STT Configuration
 voice.config.whisperstt.group.stt.description = Configure Speech to Text.
 voice.config.whisperstt.group.vad.label = Voice Activity Detection
-voice.config.whisperstt.group.vad.description = Configure the VAD mechanisim used to isolate single phrases to feed whisper with.
+voice.config.whisperstt.group.vad.description = Configure the VAD mechanism used to isolate single phrases to feed whisper with.
 voice.config.whisperstt.group.whisper.label = Whisper Options
 voice.config.whisperstt.group.whisper.description = Configure the whisper.cpp transcription options.
 voice.config.whisperstt.initSilenceSeconds.label = Initial Silence Seconds
 voice.config.whisperstt.initSilenceSeconds.description = Max initial seconds of silence to discard transcription.
 voice.config.whisperstt.initialPrompt.label = Initial Prompt
 voice.config.whisperstt.initialPrompt.description = Initial prompt to feed whisper with.
+voice.config.whisperstt.language.label = Language
+voice.config.whisperstt.language.description = If specified, speed up recognition by avoiding auto-detection. Default to system locale.
 voice.config.whisperstt.maxSeconds.label = Max Transcription Seconds
 voice.config.whisperstt.maxSeconds.description = Seconds to force transcription before silence detection.
 voice.config.whisperstt.maxSilenceSeconds.label = Max Silence Seconds
 voice.config.whisperstt.maxSilenceSeconds.description = Seconds of silence to trigger transcription.
 voice.config.whisperstt.minSeconds.label = Min Transcription Seconds
 voice.config.whisperstt.minSeconds.description = Min transcription seconds passed to whisper.
-voice.config.whisperstt.modelName.label = Model Name
-voice.config.whisperstt.modelName.description = Model name without extension.
+voice.config.whisperstt.mode.label = Local Mode Or API
+voice.config.whisperstt.mode.description = Use the local model or the OpenAI compatible API.
+voice.config.whisperstt.mode.option.LOCAL = Local
+voice.config.whisperstt.mode.option.API = OpenAI API
+voice.config.whisperstt.modelName.label = Local Model Name
+voice.config.whisperstt.modelName.description = Model name without extension. Local mode only.
 voice.config.whisperstt.openvinoDevice.label = OpenVINO Device
 voice.config.whisperstt.openvinoDevice.description = Initialize OpenVINO encoder. (built-in binaries do not support OpenVINO, this has no effect)
 voice.config.whisperstt.preloadModel.label = Preload Model